Skip to content

Fix big performance issue in string serialization #1848

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
May 13, 2025

Conversation

lovasoa
Copy link
Contributor

@lovasoa lovasoa commented May 11, 2025

The old code was handling string escaping character by character. For every literal string in the AST, it would push it to the underlying writer character by character, resulting in thousands of write calls for long strings.

The new code calls the write function only once, with the entire string, in most cases.

Only when the string is stored unescaped do we really need to call write multiple times; and even then, we don't need to call it more than the total number of characters to escape plus one.

Here are benchmark results for serializing the following sql statement: "SELECT 'xxx...(x 10000)' as long_string" to a string in memory:

sqlparser-rs parsing benchmark/format_long_string
                        time:   [10.544 µs 10.645 µs 10.743 µs]
                        change: [-84.871% -84.195% -83.553%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) high mild
  2 (2.00%) high severe

image

lovasoa added 3 commits May 12, 2025 00:36
The previous implementation wrote each character individually using write! macro,
which is inefficient for string formatting. The new implementation uses write_str
to write larger chunks of the string at once, significantly reducing the number
of write operations and formatting overhead.

This change maintains the same escaping behavior but improves performance by
avoiding character-by-character writes.
@alamb
Copy link
Contributor

alamb commented May 13, 2025

When you say "old" code do you know what PR introduced this regression?

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good to me -- thanks @lovasoa

@lovasoa
Copy link
Contributor Author

lovasoa commented May 13, 2025

@alamb : I meant that the code before this PR handled the string char by char. I don't think this was a regression.

@alamb alamb merged commit 178a351 into apache:main May 13, 2025
9 checks passed
@alamb
Copy link
Contributor

alamb commented May 13, 2025

Thanks @lovasoa and @jayzhan211 for the review

cc @iffyio

@lovasoa lovasoa deleted the string-literal-display-perf branch May 13, 2025 15:37
@lovasoa
Copy link
Contributor Author

lovasoa commented May 13, 2025

thanks for merging !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants