Remove 'fast path' using mblen_table from mb_get_strlen (it's actually a slow path)

alexdowad · alexdowad · commit cca4ca6d3dda · 2023-01-08T17:23:47.000+02:00
Various mbstring legacy text encodings have what is called an 'mblen_table';
a table which gives the length of a multi-byte character using a lookup on
the first byte value. Several mbstring functions have a 'fast path' which uses
this table when it is available.
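
As a rough illustration of what such a table-driven loop looks like, here is a
minimal sketch. The table contents and the names build_demo_mblen_table and
count_chars_with_mblen_table are made up for this example (a simplified
EUC-style layout), not taken from mbstring:

```c
#include <stddef.h>

/* Hypothetical length table for a simplified EUC-style encoding
 * (illustration only): lead bytes 0xA1..0xFE start a 2-byte character,
 * every other byte value is treated as a 1-byte character. */
static void build_demo_mblen_table(unsigned char table[256])
{
	for (int i = 0; i < 256; i++) {
		table[i] = (i >= 0xA1 && i <= 0xFE) ? 2 : 1;
	}
}

/* Count characters by hopping from lead byte to lead byte.
 * Each step needs the table lookup for the current byte before the
 * position of the next character is known.
 * Assumes the input does not end in a truncated multi-byte character. */
static size_t count_chars_with_mblen_table(const unsigned char *str, size_t len,
                                           const unsigned char *table)
{
	const unsigned char *p = str, *e = str + len;
	size_t count = 0;

	while (p < e) {
		p += table[*p];
		count++;
	}
	return count;
}
```

With this demo table, the 6-byte input { 'a', 0xB0, 0xA1, 'b', 0xC7, 0xD1 }
counts as 4 characters.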

However, it turns out that iterating through a string using the mblen_table
is surprisingly slow. I found that by deleting this 'fast path' from mb_strlen,
while mb_strlen becomes a few percent slower on very small strings (0-5 bytes),
very large performance gains can be achieved on medium to long input strings.

Part of the reason for this is that our text decoding filters are so much
faster now.
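
For comparison, counting by decoding in fixed-size chunks can be sketched as
follows. demo_to_wchar is a toy stand-in with the same calling shape as the
to_wchar callback used by mb_get_strlen; its decoding rules (and the values it
writes) are assumptions made for this example, not a real mbstring filter:

```c
#include <stddef.h>
#include <stdint.h>

/* Toy decoder (hypothetical): consumes bytes from *in, writes up to out_len
 * values to out, and returns how many it wrote. A byte in 0xA1..0xFE that is
 * followed by another byte forms one 2-byte character; any other byte is one
 * character on its own. */
static size_t demo_to_wchar(unsigned char **in, size_t *in_len, uint32_t *out,
                            size_t out_len, unsigned int *state)
{
	unsigned char *p = *in, *e = p + *in_len;
	uint32_t *o = out, *limit = out + out_len;
	(void)state; /* this toy encoding is stateless */

	while (p < e && o < limit) {
		unsigned char c = *p++;
		if (c >= 0xA1 && c <= 0xFE && p < e) {
			/* placeholder value, not a real Unicode mapping */
			*o++ = ((uint32_t)c << 8) | *p++;
		} else {
			*o++ = c;
		}
	}

	*in_len = (size_t)(e - p);
	*in = p;
	return (size_t)(o - out);
}

/* Count characters by decoding into a small buffer and summing the number
 * of values produced per call; the decoded values themselves are discarded. */
static size_t count_chars_by_decoding(const unsigned char *str, size_t len)
{
	uint32_t wchar_buf[128];
	unsigned char *in = (unsigned char*)str;
	size_t in_len = len, count = 0;
	unsigned int state = 0;

	while (in_len) {
		count += demo_to_wchar(&in, &in_len, wchar_buf, 128, &state);
	}
	return count;
}
```

Inputs longer than 128 characters simply take several passes through the
buffer, so the buffer size does not limit the string length.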

Here are some benchmarks:

    EUC-KR, short (0-5 chars)        - master faster by 11.90% (0.0000 vs 0.0000)
    EUC-JP, short (0-5 chars)        - master faster by 10.88% (0.0000 vs 0.0000)
    BIG-5, short (0-5 chars)         - master faster by 10.66% (0.0000 vs 0.0000)
    UTF-8, short (0-5 chars)         - master faster by 8.91% (0.0000 vs 0.0000)
    CP936, short (0-5 chars)         - master faster by 6.27% (0.0000 vs 0.0000)
    UHC, short (0-5 chars)           - master faster by 5.38% (0.0000 vs 0.0000)
    SJIS, short (0-5 chars)          - master faster by 5.20% (0.0000 vs 0.0000)

    UTF-8, medium (~100 chars)       - new faster by 127.51% (0.0004 vs 0.0002)
    UTF-8, long (~10000 chars)       - new faster by 87.94% (0.0319 vs 0.0170)
    UTF-8, very long (~100000 chars) - new faster by 88.25% (0.3199 vs 0.1699)

    SJIS, medium (~100 chars)        - new faster by 208.89% (0.0004 vs 0.0001)
    SJIS, long (~10000 chars)        - new faster by 253.57% (0.0319 vs 0.0090)

    CP936, medium (~100 chars)       - new faster by 126.08% (0.0004 vs 0.0002)
    CP936, long (~10000 chars)       - new faster by 200.48% (0.0319 vs 0.0106)

    EUC-KR, medium (~100 chars)      - new faster by 146.71% (0.0004 vs 0.0002)
    EUC-KR, long (~10000 chars)      - new faster by 212.05% (0.0319 vs 0.0102)

    EUC-JP, medium (~100 chars)      - new faster by 186.68% (0.0004 vs 0.0001)
    EUC-JP, long (~10000 chars)      - new faster by 295.37% (0.0320 vs 0.0081)

    BIG-5, medium (~100 chars)       - new faster by 173.07% (0.0004 vs 0.0001)
    BIG-5, long (~10000 chars)       - new faster by 269.19% (0.0319 vs 0.0086)

    UHC, medium (~100 chars)         - new faster by 196.99% (0.0004 vs 0.0001)
    UHC, long (~10000 chars)         - new faster by 256.39% (0.0323 vs 0.0091)

This does raise the question: is using the 'mblen_table' worthwhile for
other mbstring functions, such as mb_str_split? The answer is yes.
mb_strlen only needs to decode the input string, not re-encode it, but
mb_str_split, when implemented using the conversion filters, must both
decode the string and then re-encode it. There is therefore more
performance to gain from the 'mblen_table' there. Benchmarking shows
that in a few cases mb_str_split becomes faster when the 'mblen_table
fast path' is deleted, but in the majority of cases it becomes slower.
diff --git a/ext/mbstring/mbstring.c b/ext/mbstring/mbstring.c
@@ -1717,30 +1717,19 @@ PHP_FUNCTION(mb_str_split)
 
 static size_t mb_get_strlen(zend_string *string, const mbfl_encoding *encoding)
 {
-	size_t len = 0;
+	unsigned int char_len = encoding->flag & (MBFL_ENCTYPE_SBCS | MBFL_ENCTYPE_WCS2 | MBFL_ENCTYPE_WCS4);
+	if (char_len) {
+		return ZSTR_LEN(string) / char_len;
+	}
 
-	if (encoding->flag & MBFL_ENCTYPE_SBCS) {
-		return ZSTR_LEN(string);
-	} else if (encoding->flag & MBFL_ENCTYPE_WCS2) {
-		return ZSTR_LEN(string) / 2;
-	} else if (encoding->flag & MBFL_ENCTYPE_WCS4) {
-		return ZSTR_LEN(string) / 4;
-	} else if (encoding->mblen_table) {
-		const unsigned char *mbtab = encoding->mblen_table;
-		unsigned char *p = (unsigned char*)ZSTR_VAL(string), *e = p + ZSTR_LEN(string);
-		while (p < e) {
-			p += mbtab[*p];
-			len++;
-		}
-	} else {
-		uint32_t wchar_buf[128];
-		unsigned char *in = (unsigned char*)ZSTR_VAL(string);
-		size_t in_len = ZSTR_LEN(string);
-		unsigned int state = 0;
+	uint32_t wchar_buf[128];
+	unsigned char *in = (unsigned char*)ZSTR_VAL(string);
+	size_t in_len = ZSTR_LEN(string);
+	unsigned int state = 0;
+	size_t len = 0;
 
-		while (in_len) {
-			len += encoding->to_wchar(&in, &in_len, wchar_buf, 128, &state);
-		}
+	while (in_len) {
+		len += encoding->to_wchar(&in, &in_len, wchar_buf, 128, &state);
 	}
 
 	return len;
diff --git a/ext/mbstring/tests/mb_strlen.phpt b/ext/mbstring/tests/mb_strlen.phpt
@@ -93,7 +93,7 @@ try {
 6
 == MacJapanese ==
 2
-6
+7
 == SJIS-2004 ==
 2
 6
