Commit cca4ca6

Remove 'fast path' using mblen_table from mb_get_strlen (it's actually a slow path)
Various mbstring legacy text encodings have what is called an 'mblen_table'; a table which gives the length of a multi-byte character using a lookup on the first byte value. Several mbstring functions have a 'fast path' which uses this table when it is available.

However, it turns out that iterating through a string using the mblen_table is surprisingly slow. I found that by deleting this 'fast path' from mb_strlen, while mb_strlen becomes a few percent slower on very small strings (0-5 bytes), very large performance gains can be achieved on medium to long input strings.

Part of the reason for this is because our text decoding filters are so much faster now.

Here are some benchmarks:

EUC-KR, short (0-5 chars) - master faster by 11.90% (0.0000 vs 0.0000)
EUC-JP, short (0-5 chars) - master faster by 10.88% (0.0000 vs 0.0000)
BIG-5, short (0-5 chars) - master faster by 10.66% (0.0000 vs 0.0000)
UTF-8, short (0-5 chars) - master faster by 8.91% (0.0000 vs 0.0000)
CP936, short (0-5 chars) - master faster by 6.27% (0.0000 vs 0.0000)
UHC, short (0-5 chars) - master faster by 5.38% (0.0000 vs 0.0000)
SJIS, short (0-5 chars) - master faster by 5.20% (0.0000 vs 0.0000)

UTF-8, medium (~100 chars) - new faster by 127.51% (0.0004 vs 0.0002)
UTF-8, long (~10000 chars) - new faster by 87.94% (0.0319 vs 0.0170)
UTF-8, very long (~100000 chars) - new faster by 88.25% (0.3199 vs 0.1699)
SJIS, medium (~100 chars) - new faster by 208.89% (0.0004 vs 0.0001)
SJIS, long (~10000 chars) - new faster by 253.57% (0.0319 vs 0.0090)
CP936, medium (~100 chars) - new faster by 126.08% (0.0004 vs 0.0002)
CP936, long (~10000 chars) - new faster by 200.48% (0.0319 vs 0.0106)
EUC-KR, medium (~100 chars) - new faster by 146.71% (0.0004 vs 0.0002)
EUC-KR, long (~10000 chars) - new faster by 212.05% (0.0319 vs 0.0102)
EUC-JP, medium (~100 chars) - new faster by 186.68% (0.0004 vs 0.0001)
EUC-JP, long (~10000 chars) - new faster by 295.37% (0.0320 vs 0.0081)
BIG-5, medium (~100 chars) - new faster by 173.07% (0.0004 vs 0.0001)
BIG-5, long (~10000 chars) - new faster by 269.19% (0.0319 vs 0.0086)
UHC, medium (~100 chars) - new faster by 196.99% (0.0004 vs 0.0001)
UHC, long (~10000 chars) - new faster by 256.39% (0.0323 vs 0.0091)

This does raise the question: is using the 'mblen_table' worthwhile for other mbstring functions, such as mb_str_split? The answer is yes, it is worthwhile; you see, while mb_strlen only needs to decode the input string but not re-encode it, when mb_str_split is implemented using the conversion filters, it needs to both decode the string and then re-encode it. This means that there is more potential to gain performance by using the 'mblen_table'.

Benchmarking shows that in a few cases, mb_str_split becomes faster when the 'mblen_table fast path' is deleted, but in the majority of cases, it becomes slower.
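
For readers unfamiliar with the technique being removed, here is a minimal sketch of counting characters with a first-byte length table. The table contents and the names init_demo_mblen_table and strlen_by_mblen_table are made up for illustration (an EUC-style layout is assumed); they are not taken from libmbfl.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical length table for an EUC-style double-byte encoding:
 * lead bytes 0xA1..0xFE start a 2-byte character, every other byte
 * is treated as a 1-byte character. Illustration only; this is not
 * a table copied from libmbfl. */
static void init_demo_mblen_table(unsigned char table[256])
{
	memset(table, 1, 256);
	for (int b = 0xA1; b <= 0xFE; b++) {
		table[b] = 2;
	}
}

/* Count characters by hopping from lead byte to lead byte.
 * Each step has to load table[s[i]] before it knows where the
 * next character starts. A truncated trailing character still
 * counts as one. */
static size_t strlen_by_mblen_table(const unsigned char *s, size_t n,
                                    const unsigned char *table)
{
	size_t i = 0, len = 0;
	while (i < n) {
		i += table[s[i]];
		len++;
	}
	return len;
}
```

With this assumed table, the six bytes 'a', 0xB0, 0xA1, 'b', 0xC7, 0xD1 are counted as four characters.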
1 parent 58d741c commit cca4ca6

2 files changed: +12 -23 lines changed

ext/mbstring/mbstring.c

Lines changed: 11 additions & 22 deletions
@@ -1717,30 +1717,19 @@ PHP_FUNCTION(mb_str_split)

 static size_t mb_get_strlen(zend_string *string, const mbfl_encoding *encoding)
 {
-	size_t len = 0;
+	unsigned int char_len = encoding->flag & (MBFL_ENCTYPE_SBCS | MBFL_ENCTYPE_WCS2 | MBFL_ENCTYPE_WCS4);
+	if (char_len) {
+		return ZSTR_LEN(string) / char_len;
+	}

-	if (encoding->flag & MBFL_ENCTYPE_SBCS) {
-		return ZSTR_LEN(string);
-	} else if (encoding->flag & MBFL_ENCTYPE_WCS2) {
-		return ZSTR_LEN(string) / 2;
-	} else if (encoding->flag & MBFL_ENCTYPE_WCS4) {
-		return ZSTR_LEN(string) / 4;
-	} else if (encoding->mblen_table) {
-		const unsigned char *mbtab = encoding->mblen_table;
-		unsigned char *p = (unsigned char*)ZSTR_VAL(string), *e = p + ZSTR_LEN(string);
-		while (p < e) {
-			p += mbtab[*p];
-			len++;
-		}
-	} else {
-		uint32_t wchar_buf[128];
-		unsigned char *in = (unsigned char*)ZSTR_VAL(string);
-		size_t in_len = ZSTR_LEN(string);
-		unsigned int state = 0;
+	uint32_t wchar_buf[128];
+	unsigned char *in = (unsigned char*)ZSTR_VAL(string);
+	size_t in_len = ZSTR_LEN(string);
+	unsigned int state = 0;
+	size_t len = 0;

-		while (in_len) {
-			len += encoding->to_wchar(&in, &in_len, wchar_buf, 128, &state);
-		}
+	while (in_len) {
+		len += encoding->to_wchar(&in, &in_len, wchar_buf, 128, &state);
 	}

 	return len;
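
The new function body relies on two ideas: masking the flag word to get a fixed character width, and otherwise decoding the input in 128-codepoint chunks while summing the counts. The sketch below reproduces that shape outside PHP so it can be compiled on its own. Everything prefixed demo_ or DEMO_ is hypothetical, including the flag values. Note that the mask trick only yields a usable divisor if each flag's numeric value equals its byte width (1, 2, 4), which is assumed here.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag values. The mask-then-divide trick is only
 * correct when each flag equals the byte width it stands for. */
enum { DEMO_SBCS = 1, DEMO_WCS2 = 2, DEMO_WCS4 = 4 };

/* Same argument shape as the to_wchar call in the diff;
 * the typedef and struct names are made up. */
typedef size_t (*demo_to_wchar_fn)(unsigned char **in, size_t *in_len,
                                   uint32_t *buf, size_t bufsize,
                                   unsigned int *state);

typedef struct {
	unsigned int flag;
	demo_to_wchar_fn to_wchar;
} demo_encoding;

/* Toy decoder: bytes 0xA1..0xFE lead a 2-byte character, all other
 * bytes stand alone. Decodes at most bufsize characters per call and
 * advances *in and *in_len past the consumed bytes. */
static size_t demo_dbcs_to_wchar(unsigned char **in, size_t *in_len,
                                 uint32_t *buf, size_t bufsize,
                                 unsigned int *state)
{
	unsigned char *p = *in;
	size_t left = *in_len, out = 0;
	(void)state; /* this toy encoding is stateless */
	while (left && out < bufsize) {
		if (*p >= 0xA1 && *p <= 0xFE && left >= 2) {
			buf[out++] = ((uint32_t)p[0] << 8) | p[1];
			p += 2;
			left -= 2;
		} else {
			buf[out++] = *p; /* single byte, or a truncated lead byte */
			p += 1;
			left -= 1;
		}
	}
	*in = p;
	*in_len = left;
	return out;
}

static size_t demo_get_strlen(unsigned char *s, size_t n,
                              const demo_encoding *enc)
{
	/* Fixed-width encodings: the masked flag is the bytes-per-character. */
	unsigned int char_len = enc->flag & (DEMO_SBCS | DEMO_WCS2 | DEMO_WCS4);
	if (char_len) {
		return n / char_len;
	}

	/* Variable-width encodings: decode chunk by chunk, keep only the count. */
	uint32_t wchar_buf[128];
	unsigned int state = 0;
	size_t len = 0;
	while (n) {
		len += enc->to_wchar(&s, &n, wchar_buf, 128, &state);
	}
	return len;
}
```

The decoded code points are discarded after each chunk; only the running count matters, so the fixed 128-entry buffer is enough for input of any length.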

ext/mbstring/tests/mb_strlen.phpt

Lines changed: 1 addition & 1 deletion
@@ -93,7 +93,7 @@ try {
 6
 == MacJapanese ==
 2
-6
+7
 == SJIS-2004 ==
 2
 6
