You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Cache UTF-8-validity status of strings in GC flags
The PCRE extension is already doing this. The flag is set when a string
is determined to be valid UTF-8, and cleared in
zend_string_forget_hash_val.
We might as well make good use of it in mbstring as well.
This should result in a negligible slowdown for non-UTF-8 strings,
bad UTF-8 strings, and good UTF-8 strings which are checked only once.
However, when microbenchmarking this change using a variety of text
encodings and string lengths, I found that in most of these cases,
the 'new' code was a few percent faster. In a couple of cases, the 'old'
code was a few percent faster. This was not a result of sampling error,
since I could reproduce these test results repeatedly, and even when
running a large number of iterations. Both the new and old code
were compiled with -O3 -march=native. My (unproved) hypothesis is that
although the new code appears to only add one more conditional branch,
the compiler may emit slightly different code from before (perhaps
with different register allocation and so on), and this may cause some
cases to run slightly faster and others to run slightly slower. I have
not disassembled the old and new binaries to see if an examination of
the emitted assembly code would support this hypothesis.
For good UTF-8 strings which are checked repeatedly, the speedup is
about 40% even for strings 1-5 bytes in length. For ~100 byte strings,
it is ~700%, and for ~10000 byte strings, it is ~80000%.
I tried fuzzing MBString's php_mb_check_encoding function and
pcre2lib's valid_utf function to see if I could find any cases where
their output would be different. After running the fuzzer for a couple
of minutes, it had tried more than 1 million test cases without finding
any where the output was different. Therefore, it appears that
MBString's UTF-8 validation is compatible with PCRE's.
0 commit comments