Skip to content

Improve performance of mbfl_name2encoding() by using perfect hashing #12707

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Nov 17, 2023

Conversation

nielsdos
Copy link
Member

mbfl_name2encoding() uses a linear loop through the encodings, comparing the name one by one, which is very slow. For the benchmark [1] just looking up the name takes about 50% of run-time.

By using perfect hashing instead, we no longer have to loop over the list, and the number of string comparisons is reduced to just a single one. The perfect hashing table is generated using GNU gperf and amended manually to fit in with mbstring and manually changed to reduce the cache size.

Previously the linked benchmark took 2.39s on my system, now it takes 1.50s. This is a nice improvement that can be felt when processing large amounts of data.

[1] #12684 (comment)

mbfl_name2encoding() uses a linear loop through the encodings, comparing
the name one by one, which is very slow. For the benchmark [1] just
looking up the name takes about 50% of run-time.

By using perfect hashing instead, we no longer have to loop over the
list, and the number of string comparisons is reduced to just a single
one. The perfect hashing table is generated using GNU gperf and amended
manually to fit in with mbstring and manually changed to  reduce the
cache size.

[1] php#12684 (comment)
@alexdowad
Copy link
Contributor

Nice job!!

I haven't looked at this code for a while, but I seem to recall that it cached the last looked-up encoding and checked if the next requested one was the same using something like strcmp. Or did it just use a pointer equality comparison on the string (so the optimization would only work if the encoding name was interned)? Don't remember. Or perhaps I am thinking about something completely different?

@nielsdos
Copy link
Member Author

Thanks for the review Alex!
The caching seems to be for calls to php_mb_get_encoding, and it only caches the last used encoding.
I'm not sure if the benchmark actually calls php_mb_get_encoding, but even if it does, it is alternating between different encodings so it doesn't benefit from the caching.

@nielsdos nielsdos merged commit 7658220 into php:master Nov 17, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants