More performance increases for mbstring functions on UTF-8 text #10313

alexdowad · 2023-01-14T12:17:42Z

FYA @cmb69 @Girgias @nikic @kamil-tekiela @youkidearitai

alexdowad · 2023-01-14T12:20:04Z

I must say I was hoping for better than a 3x speedup here... sigh.

kamil-tekiela · 2023-01-14T12:44:42Z

Can you please fix CI errors? Does this need new unit tests or the existing ones cover it?

ext/mbstring/mbstring.c

cmb69 · 2023-01-14T13:01:34Z

I must say I was hoping for better than a 3x speedup here... sigh.

A 3x speedup looks rather considerable to me. :)

alexdowad · 2023-01-14T13:34:25Z

Can you please fix CI errors? Does this need new unit tests or the existing ones cover it?

Let me do that.

This change is well covered by (extensive) existing unit tests.

alexdowad · 2023-01-14T13:45:44Z

A 3x speedup looks rather considerable to me. :)

Yeah, it is... it would have been nice to get 10x or better, but it's still decent.

@cmb69 @nikic @Girgias @dstogov Do we have anything in PHP like what the Linux kernel does, whereby it does a CPUID on startup, checks what ISA extensions the host CPU supports, and patches in accelerated versions of some functions based on the result? If not, are you interested in doing something like that?

It makes it possible for packagers to create binaries with broad support for older CPUs, while still taking advantage of ISA extensions on newer CPUs.

cmb69 · 2023-01-14T13:57:54Z

Do we have anything in PHP like what the Linux kernel does, whereby it does a CPUID on startup, checks what ISA extensions the host CPU supports, and patches in accelerated versions of some functions based on the result?

See https://github.com/php/php-src/blob/master/Zend/zend_cpuinfo.h. (The obvious drawback of using this is that it's unlikely that all variations are tested in CI.)

Girgias · 2023-01-14T14:05:19Z

This is clearly out of my depth here as I have no idea what any of these SSE2 instruction do, so I will trust the test suite and everything works

alexdowad · 2023-01-14T15:27:03Z

This is clearly out of my depth here as I have no idea what any of these SSE2 instruction do...

The intention here is that any experienced programmer should be able to understand what is going on by reading the code comments, without having to consult the Intel intrinsics documentation.

Are there some places where I could make the comments clearer?

kamil-tekiela · 2023-01-14T15:39:47Z

I don't think that the concern is whether the comments are understandable. The nature of code comments is that they often lie. Not on purpose but because code comments describe what the author intended to do, not necessarily what the code actually does. It would be good if an SSE expert could verify that the functions you used actually do what the code comments say they should do.

alexdowad · 2023-01-14T15:40:32Z

Do we have anything in PHP like what the Linux kernel does, whereby it does a CPUID on startup, checks what ISA extensions the host CPU supports, and patches in accelerated versions of some functions based on the result?

See https://github.com/php/php-src/blob/master/Zend/zend_cpuinfo.h. (The obvious drawback of using this is that it's unlikely that all variations are tested in CI.)

Cool, thanks!

It looks like there are just a couple users of this feature, including Base64 encoding/decoding, CRC32, and a couple of built-in string functions.

Just noticed something really interesting here...

php-src/ext/standard/string.c

Lines 3878 to 3931 in abc41c2

    
           # elif defined(ZEND_INTRIN_SSE4_2_RESOLVER) 
        
           void php_stripslashes_sse42(zend_string *str) 
        
           # endif 
        
           { 
        
           	const char *s = ZSTR_VAL(str); 
        
           	char *t = ZSTR_VAL(str); 
        
           	size_t l = ZSTR_LEN(str); 
        
           	if (l > 15) { 
        
           		const __m128i slash = _mm_set1_epi8('\\'); 
        
           		do { 
        
           			__m128i in = _mm_loadu_si128((__m128i *)s); 
        
           			__m128i any_slash = _mm_cmpeq_epi8(in, slash); 
        
           			uint32_t res = _mm_movemask_epi8(any_slash); 
        
           			if (res) { 
        
           				int i, n = zend_ulong_ntz(res); 
        
           				const char *e = s + 15; 
        
           				l -= n; 
        
           				for (i = 0; i < n; i++) { 
        
           					*t++ = *s++; 
        
           				} 
        
           				for (; s < e; s++) { 
        
           					if (*s == '\\') { 
        
           						s++; 
        
           						l--; 
        
           						if (*s == '0') { 
        
           							*t = '\0'; 
        
           						} else { 
        
           							*t = *s; 
        
           						} 
        
           					} else { 
        
           						*t = *s; 
        
           					} 
        
           					t++; 
        
           					l--; 
        
           				} 
        
           			} else { 
        
           				_mm_storeu_si128((__m128i *)t, in); 
        
           				s += 16; 
        
           				t += 16; 
        
           				l -= 16; 
        
           			} 
        
           		} while (l > 15); 
        
           	} 
        
           	t = php_stripslashes_impl(s, t, l); 
        
           	if (t != (ZSTR_VAL(str) + ZSTR_LEN(str))) { 
        
           		ZSTR_LEN(str) = t - ZSTR_VAL(str); 
        
           		ZSTR_VAL(str)[ZSTR_LEN(str)] = '\0'; 
        
           	} 
        
           } 
        
           #endif

Although this accelerated function is only used if the host CPU supports SSE4.2... there are no SSE4.2 instructions there! Everything is just SSE2, which (as @nielsdos kindly educated me the other day) is automatically supported by all x86-64 CPUs.

kamil-tekiela · 2023-01-14T15:57:38Z

Off topic, but this is an interesting read: Convert a String In C++ To Upper Case

cmb69 · 2023-01-14T16:18:44Z

It looks like there are just a couple users of this feature, including Base64 encoding/decoding, CRC32, and a couple of built-in string functions.

Yeah, unfortunately a lack of developer bandwidth.

Although this accelerated function is only used if the host CPU supports SSE4.2... there are no SSE4.2 instructions there! Everything is just SSE2, […]

I think you are right! Would be great if we could actually test that, but it may be very hard to find a machine which only supports SSE2, but no other SIMD extensions nowadays.

nielsdos · 2023-01-14T16:26:01Z

Although this accelerated function is only used if the host CPU supports SSE4.2... there are no SSE4.2 instructions there! Everything is just SSE2, […]

I think you are right! Would be great if we could actually test that, but it may be very hard to find a machine which only supports SSE2, but no other SIMD extensions nowadays.

You can test that statically on GCC at compile time by passing in -mno-sse3 -mno-sse4 -mno-sse4.1 -mno-sse4.2. If the code were to use the SSE4 intrinsics then it will fail to compile with an error.

alexdowad · 2023-01-14T16:46:59Z

You can test that statically on GCC at compile time by passing in -mno-sse3 -mno-sse4 -mno-sse4.1 -mno-sse4.2. If the code were to use the SSE4 intrinsics then it will fail to compile with an error.

Nice suggestion, I may try that.

nielsdos · 2023-01-14T17:52:31Z

I used both the compiler trick and Intel's Intrinsics Guide and can confirm that that stripslashes function indeed only relies on SSE2 and not SSE4.x

…UTF-8 The new SSE2-based implementation of mb_check_encoding for UTF-8 is about 10% faster for 0-5 byte strings, more than 3 times faster for ~100-byte strings, and just under 4 times faster for ~10,000-byte strings. I believe it may be possible to make this function much faster again. Some possible directions for further performance optimization include: • If other ISA extensions like AVX or AVX-512 are available, use a similar algorithm, but process text in blocks of 32 or 64 bytes (instead of 16 bytes). • If other SIMD ISA extensions are available, use the greater variety of available instructions to make some of the checks tighter. • Even if only SSE/SSE2 are available, find clever ways to squeeze instructions out of the hot path. This would probably require a lot of perusing instruction mauals and thinking hard about which SIMD instructions could be used to perform the same checks with fewer instructions. • Find a better algorithm, possibly one where more checks could be combined (just as the current algorithm combines the checks for certain overlong code units and reserved codepoints).

On longer strings, this gives a small speed boost of 10% or less.

alexdowad · 2023-01-15T12:46:43Z

I just tried fuzzing this accelerated UTF-8 validation function to look for cases where its output differs from the unaccelerated one... and guess what, the fuzzer found a bug.

After fixing the bug, I was able to run the fuzzer for an extended period of time without finding any other bugs.

alexdowad · 2023-01-15T13:03:28Z

I have added a regression test for the bug which was found by the fuzzer.

Since there are some code changes, I will leave this for a bit longer to see if anyone wants to review again, but if not, I will land on master soon.

alexdowad · 2023-01-17T08:09:52Z

Looks like no more comments are forthcoming... Landed on master.

Many thanks to all who commented.

I'm not sure if I will clean up the issue with stripslashes which was identified, since I still have a lot of stuff to do on mbstring.

Alex Dowad noticed[1] that the SIMD stripslashes implementation actually only depended on SSE2, and not on SSE4.2 instructions. Remove the checking for SSE4.2 and only check for SSE2. This also greatly simplifies the supporting code. [1] php#10313 (comment)

Alex Dowad noticed[1] that the SIMD stripslashes implementation actually only depended on SSE2, and not on SSE4.2 instructions. Remove the checking for SSE4.2 and only check for SSE2. This also greatly simplifies the supporting code. [1] #10313 (comment)

github-actions bot added the Extension: mbstring label Jan 14, 2023

alexdowad force-pushed the checkutf8 branch from 915ef9f to f68308d Compare January 14, 2023 12:18

cmb69 reviewed Jan 14, 2023

View reviewed changes

ext/mbstring/mbstring.c Outdated Show resolved Hide resolved

alexdowad force-pushed the checkutf8 branch from f68308d to 791e69d Compare January 14, 2023 13:35

alexdowad added 2 commits January 15, 2023 14:43

Tweaks for accelerated implementation of mb_strlen for UTF-8

e327bc2

On longer strings, this gives a small speed boost of 10% or less.

alexdowad force-pushed the checkutf8 branch from 791e69d to e327bc2 Compare January 15, 2023 12:44

alexdowad closed this Jan 17, 2023

alexdowad deleted the checkutf8 branch January 17, 2023 08:10

nielsdos mentioned this pull request Jan 21, 2023

Make stripslashes() only dependent on SSE2 configuration. #10408

Merged

More performance increases for mbstring functions on UTF-8 text #10313

More performance increases for mbstring functions on UTF-8 text #10313

Uh oh!

Conversation

alexdowad commented Jan 14, 2023

Uh oh!

alexdowad commented Jan 14, 2023

Uh oh!

kamil-tekiela commented Jan 14, 2023

Uh oh!

Uh oh!

cmb69 commented Jan 14, 2023

Uh oh!

alexdowad commented Jan 14, 2023

Uh oh!

alexdowad commented Jan 14, 2023

Uh oh!

cmb69 commented Jan 14, 2023

Uh oh!

Girgias commented Jan 14, 2023

Uh oh!

alexdowad commented Jan 14, 2023

Uh oh!

kamil-tekiela commented Jan 14, 2023

Uh oh!

alexdowad commented Jan 14, 2023

Uh oh!

kamil-tekiela commented Jan 14, 2023

Uh oh!

cmb69 commented Jan 14, 2023

Uh oh!

nielsdos commented Jan 14, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

alexdowad commented Jan 14, 2023

Uh oh!

nielsdos commented Jan 14, 2023

Uh oh!

alexdowad commented Jan 15, 2023

Uh oh!

alexdowad commented Jan 15, 2023

Uh oh!

alexdowad commented Jan 17, 2023

Uh oh!

Uh oh!

nielsdos commented Jan 14, 2023 •

edited

Loading