Skip to content

More performance increases for mbstring functions on UTF-8 text #10313

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from

Conversation

alexdowad
Copy link
Contributor

@alexdowad
Copy link
Contributor Author

I must say I was hoping for better than a 3x speedup here... sigh.

@kamil-tekiela
Copy link
Member

Can you please fix CI errors? Does this need new unit tests or the existing ones cover it?

@cmb69
Copy link
Member

cmb69 commented Jan 14, 2023

I must say I was hoping for better than a 3x speedup here... sigh.

A 3x speedup looks rather considerable to me. :)

@alexdowad
Copy link
Contributor Author

Can you please fix CI errors? Does this need new unit tests or the existing ones cover it?

Let me do that.

This change is well covered by (extensive) existing unit tests.

@alexdowad
Copy link
Contributor Author

A 3x speedup looks rather considerable to me. :)

Yeah, it is... it would have been nice to get 10x or better, but it's still decent.

@cmb69 @nikic @Girgias @dstogov Do we have anything in PHP like what the Linux kernel does, whereby it does a CPUID on startup, checks what ISA extensions the host CPU supports, and patches in accelerated versions of some functions based on the result? If not, are you interested in doing something like that?

It makes it possible for packagers to create binaries with broad support for older CPUs, while still taking advantage of ISA extensions on newer CPUs.

@cmb69
Copy link
Member

cmb69 commented Jan 14, 2023

Do we have anything in PHP like what the Linux kernel does, whereby it does a CPUID on startup, checks what ISA extensions the host CPU supports, and patches in accelerated versions of some functions based on the result?

See https://github.com/php/php-src/blob/master/Zend/zend_cpuinfo.h. (The obvious drawback of using this is that it's unlikely that all variations are tested in CI.)

@Girgias
Copy link
Member

Girgias commented Jan 14, 2023

This is clearly out of my depth here as I have no idea what any of these SSE2 instruction do, so I will trust the test suite and everything works

@alexdowad
Copy link
Contributor Author

This is clearly out of my depth here as I have no idea what any of these SSE2 instruction do...

The intention here is that any experienced programmer should be able to understand what is going on by reading the code comments, without having to consult the Intel intrinsics documentation.

Are there some places where I could make the comments clearer?

@kamil-tekiela
Copy link
Member

I don't think that the concern is whether the comments are understandable. The nature of code comments is that they often lie. Not on purpose but because code comments describe what the author intended to do, not necessarily what the code actually does. It would be good if an SSE expert could verify that the functions you used actually do what the code comments say they should do.

@alexdowad
Copy link
Contributor Author

Do we have anything in PHP like what the Linux kernel does, whereby it does a CPUID on startup, checks what ISA extensions the host CPU supports, and patches in accelerated versions of some functions based on the result?

See https://github.com/php/php-src/blob/master/Zend/zend_cpuinfo.h. (The obvious drawback of using this is that it's unlikely that all variations are tested in CI.)

Cool, thanks!

It looks like there are just a couple users of this feature, including Base64 encoding/decoding, CRC32, and a couple of built-in string functions.

Just noticed something really interesting here...

php-src/ext/standard/string.c

Lines 3878 to 3931 in abc41c2

# elif defined(ZEND_INTRIN_SSE4_2_RESOLVER)
void php_stripslashes_sse42(zend_string *str)
# endif
{
const char *s = ZSTR_VAL(str);
char *t = ZSTR_VAL(str);
size_t l = ZSTR_LEN(str);
if (l > 15) {
const __m128i slash = _mm_set1_epi8('\\');
do {
__m128i in = _mm_loadu_si128((__m128i *)s);
__m128i any_slash = _mm_cmpeq_epi8(in, slash);
uint32_t res = _mm_movemask_epi8(any_slash);
if (res) {
int i, n = zend_ulong_ntz(res);
const char *e = s + 15;
l -= n;
for (i = 0; i < n; i++) {
*t++ = *s++;
}
for (; s < e; s++) {
if (*s == '\\') {
s++;
l--;
if (*s == '0') {
*t = '\0';
} else {
*t = *s;
}
} else {
*t = *s;
}
t++;
l--;
}
} else {
_mm_storeu_si128((__m128i *)t, in);
s += 16;
t += 16;
l -= 16;
}
} while (l > 15);
}
t = php_stripslashes_impl(s, t, l);
if (t != (ZSTR_VAL(str) + ZSTR_LEN(str))) {
ZSTR_LEN(str) = t - ZSTR_VAL(str);
ZSTR_VAL(str)[ZSTR_LEN(str)] = '\0';
}
}
#endif

Although this accelerated function is only used if the host CPU supports SSE4.2... there are no SSE4.2 instructions there! Everything is just SSE2, which (as @nielsdos kindly educated me the other day) is automatically supported by all x86-64 CPUs.

@kamil-tekiela
Copy link
Member

Off topic, but this is an interesting read: Convert a String In C++ To Upper Case

@cmb69
Copy link
Member

cmb69 commented Jan 14, 2023

It looks like there are just a couple users of this feature, including Base64 encoding/decoding, CRC32, and a couple of built-in string functions.

Yeah, unfortunately a lack of developer bandwidth.

Although this accelerated function is only used if the host CPU supports SSE4.2... there are no SSE4.2 instructions there! Everything is just SSE2, […]

I think you are right! Would be great if we could actually test that, but it may be very hard to find a machine which only supports SSE2, but no other SIMD extensions nowadays.

@nielsdos
Copy link
Member

nielsdos commented Jan 14, 2023

Although this accelerated function is only used if the host CPU supports SSE4.2... there are no SSE4.2 instructions there! Everything is just SSE2, […]

I think you are right! Would be great if we could actually test that, but it may be very hard to find a machine which only supports SSE2, but no other SIMD extensions nowadays.

You can test that statically on GCC at compile time by passing in -mno-sse3 -mno-sse4 -mno-sse4.1 -mno-sse4.2. If the code were to use the SSE4 intrinsics then it will fail to compile with an error.

@alexdowad
Copy link
Contributor Author

You can test that statically on GCC at compile time by passing in -mno-sse3 -mno-sse4 -mno-sse4.1 -mno-sse4.2. If the code were to use the SSE4 intrinsics then it will fail to compile with an error.

Nice suggestion, I may try that.

@nielsdos
Copy link
Member

I used both the compiler trick and Intel's Intrinsics Guide and can confirm that that stripslashes function indeed only relies on SSE2 and not SSE4.x

…UTF-8

The new SSE2-based implementation of mb_check_encoding for UTF-8 is
about 10% faster for 0-5 byte strings, more than 3 times faster for
~100-byte strings, and just under 4 times faster for ~10,000-byte
strings.

I believe it may be possible to make this function much faster again.
Some possible directions for further performance optimization include:

• If other ISA extensions like AVX or AVX-512 are available, use a
  similar algorithm, but process text in blocks of 32 or 64 bytes
  (instead of 16 bytes).
• If other SIMD ISA extensions are available, use the greater variety
  of available instructions to make some of the checks tighter.
• Even if only SSE/SSE2 are available, find clever ways to squeeze
  instructions out of the hot path. This would probably require a lot
  of perusing instruction mauals and thinking hard about which SIMD
  instructions could be used to perform the same checks with fewer
  instructions.
• Find a better algorithm, possibly one where more checks could be
  combined (just as the current algorithm combines the checks for
  certain overlong code units and reserved codepoints).
On longer strings, this gives a small speed boost of 10% or less.
@alexdowad
Copy link
Contributor Author

I just tried fuzzing this accelerated UTF-8 validation function to look for cases where its output differs from the unaccelerated one... and guess what, the fuzzer found a bug.

After fixing the bug, I was able to run the fuzzer for an extended period of time without finding any other bugs.

@alexdowad
Copy link
Contributor Author

I have added a regression test for the bug which was found by the fuzzer.

Since there are some code changes, I will leave this for a bit longer to see if anyone wants to review again, but if not, I will land on master soon.

@alexdowad
Copy link
Contributor Author

Looks like no more comments are forthcoming... Landed on master.

Many thanks to all who commented.

I'm not sure if I will clean up the issue with stripslashes which was identified, since I still have a lot of stuff to do on mbstring.

@alexdowad alexdowad closed this Jan 17, 2023
@alexdowad alexdowad deleted the checkutf8 branch January 17, 2023 08:10
nielsdos added a commit to nielsdos/php-src that referenced this pull request Jan 21, 2023
Alex Dowad noticed[1] that the SIMD stripslashes implementation actually
only depended on SSE2, and not on SSE4.2 instructions. Remove the
checking for SSE4.2 and only check for SSE2. This also greatly
simplifies the supporting code.

[1] php#10313 (comment)
nielsdos added a commit to nielsdos/php-src that referenced this pull request Jan 21, 2023
Alex Dowad noticed[1] that the SIMD stripslashes implementation actually
only depended on SSE2, and not on SSE4.2 instructions. Remove the
checking for SSE4.2 and only check for SSE2. This also greatly
simplifies the supporting code.

[1] php#10313 (comment)
Girgias pushed a commit that referenced this pull request Jan 23, 2023
Alex Dowad noticed[1] that the SIMD stripslashes implementation actually
only depended on SSE2, and not on SSE4.2 instructions. Remove the
checking for SSE4.2 and only check for SSE2. This also greatly
simplifies the supporting code.

[1] #10313 (comment)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants