
Commit 0e7160b

Implement mb_detect_encoding using fast text conversion filters
Regarding the optional 3rd `strict` argument to mb_detect_encoding, the
documentation states:

    Controls the behaviour when string is not valid in any of the listed
    encodings. If strict is set to false, the closest matching encoding
    will be returned; if strict is set to true, false will be returned.

(Ref: https://www.php.net/manual/en/function.mb-detect-encoding.php)

Because of bugs in the implementation, mb_detect_encoding did not always
behave according to this description when `strict` was false.

For example:

    <?php
    echo var_export(mb_detect_encoding("\xc0\x00", "UTF-8", false));
    // Before this commit, prints: false
    // After this commit, prints: 'UTF-8'

Because `strict` is false in the above example, mb_detect_encoding should
return the 'closest matching encoding', which is UTF-8, since that is the
only candidate encoding. (Incidentally, this example shows that using
mb_detect_encoding with a single candidate encoding in non-strict mode is
useless.)

The new implementation fixes this bug. It also fixes another problem with
the old implementation as regards non-strict detection mode:

The old implementation would stop processing of the input string using a
particular candidate encoding as soon as it saw an error in that encoding,
even in non-strict mode. This means that it could not really detect the
'closest matching encoding'; rather, what it would return in non-strict
mode was 'the encoding in which the first decoding error is furthest from
the beginning of the input string'.

In non-strict mode, the new implementation continues trying to process the
input string to its end even after seeing an error. This makes it possible
to determine in which candidate encoding the string has the smallest
number of errors, i.e. the 'closest matching encoding'.

Rejecting candidate encodings as soon as it saw an error gave the old
implementation a marked performance advantage in non-strict mode; however,
the new implementation still beats it in most cases.
Here are a few sample microbenchmark results:

    UTF-8, ~100 codepoints, strict mode
    Old: 0.080s (100,000 calls)
    New: 0.026s (100,000 calls)

    UTF-8, ~100 codepoints, non-strict mode
    Old: 0.079s (100,000 calls)
    New: 0.033s (100,000 calls)

    UTF-8, ~10000 codepoints, strict mode
    Old: 6.708s (60,000 calls)
    New: 1.383s (60,000 calls)

    UTF-8, ~10000 codepoints, non-strict mode
    Old: 6.705s (60,000 calls)
    New: 3.044s (60,000 calls)

Notice that the old implementation had almost identical performance
between strict and non-strict mode, while the new suffers a significant
performance penalty for non-strict detection. This is the cost of
implementing the behavior specified in the documentation.

A couple more sample results:

    SJIS, ~10000 codepoints, strict mode
    Old: 4.563s
    New: 1.084s

    SJIS, ~10000 codepoints, non-strict mode
    Old: 4.569s
    New: 2.863s

This is the only case I found where the new implementation loses:

    UTF-16LE, ~10000 codepoints, non-strict mode
    Old: 1.514s
    New: 2.813s

The reason is that the test strings happened to be invalid right from the
first few bytes for all the candidate encodings except for UTF-16LE; so
the old implementation would immediately reject all those encodings and
only process the entire string in UTF-16LE.

I believe mb_detect_encoding could be made much faster if we identified
good criteria for when to reject candidate encodings before reaching the
end of the input string.
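
The difference between the two non-strict rules described in the commit message can be illustrated with a minimal C sketch. This is not the mbstring code: `scan_utf8` and `scan_cp1252_like` are simplified, hypothetical validators (the UTF-8 one checks only lead and continuation bytes; the other treats just the five bytes left unassigned in Windows-1252 as invalid), and the sketch counts errors only, whereas the commit also scores the decoded codepoints.

```c
#include <stddef.h>

/* Result of scanning a byte string under one candidate encoding. */
struct scan_result {
	size_t errors;      /* number of invalid byte sequences found */
	size_t first_error; /* offset of the first error, or len if none */
};

/* Simplified UTF-8 scan: checks lead and continuation bytes only
 * (assumption: no checks for overlong 3/4-byte forms or surrogates). */
struct scan_result scan_utf8(const unsigned char *s, size_t len)
{
	struct scan_result r = { 0, len };
	size_t i = 0;
	while (i < len) {
		unsigned char c = s[i];
		size_t need;
		int ok = 1;
		if (c < 0x80) need = 0;
		else if (c >= 0xC2 && c <= 0xDF) need = 1;
		else if (c >= 0xE0 && c <= 0xEF) need = 2;
		else if (c >= 0xF0 && c <= 0xF4) need = 3;
		else { need = 0; ok = 0; }          /* invalid lead byte */
		if (ok && need >= len - i) ok = 0;  /* truncated sequence */
		for (size_t k = 1; ok && k <= need; k++) {
			ok = (s[i + k] & 0xC0) == 0x80;
		}
		if (ok) {
			i += need + 1;
		} else {
			if (r.errors++ == 0) r.first_error = i;
			i++; /* keep going after an error, as the new non-strict mode does */
		}
	}
	return r;
}

/* Toy single-byte scan: only 0x81, 0x8D, 0x8F, 0x90, 0x9D are invalid. */
struct scan_result scan_cp1252_like(const unsigned char *s, size_t len)
{
	struct scan_result r = { 0, len };
	for (size_t i = 0; i < len; i++) {
		unsigned char c = s[i];
		if (c == 0x81 || c == 0x8D || c == 0x8F || c == 0x90 || c == 0x9D) {
			if (r.errors++ == 0) r.first_error = i;
		}
	}
	return r;
}

/* Old rule as described above: first error furthest from the start wins. */
size_t pick_furthest_first_error(const struct scan_result *r, size_t n)
{
	size_t best = 0;
	for (size_t i = 1; i < n; i++) {
		if (r[i].first_error > r[best].first_error) best = i;
	}
	return best;
}

/* New rule as described above: fewest errors over the whole string wins. */
size_t pick_fewest_errors(const struct scan_result *r, size_t n)
{
	size_t best = 0;
	for (size_t i = 1; i < n; i++) {
		if (r[i].errors < r[best].errors) best = i;
	}
	return best;
}
```

For the bytes `FF C4 81 C4 8D C4 8F`, the UTF-8 scan finds a single error at offset 0, while the CP1252-like scan finds three errors starting at offset 2. With the candidates ordered UTF-8 then CP1252-like, the "furthest first error" rule therefore selects index 1 and the "fewest errors" rule selects index 0.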
1 parent 9538646 commit 0e7160b

6 files changed, +206 -65 lines changed


NEWS

Lines changed: 3 additions & 0 deletions
@@ -42,6 +42,9 @@ PHP NEWS
 
 - MBString:
   . mb_detect_encoding is better able to identify the correct encoding for Turkish text. (Alex Dowad)
+  . mb_detect_encoding's "non-strict" mode now behaves as described in the
+    documentation. Previously, it would return false if the very first byte
+    of the input string was invalid in all candidate encodings. (Alex Dowad)
 
 - Opcache:
   . Added start, restart and force restart time to opcache's

ext/mbstring/libmbfl/mbfl/mbfilter.c

Lines changed: 0 additions & 15 deletions
@@ -290,21 +290,6 @@ mbfl_convert_encoding(
 	return mbfl_memory_device_result(&device, result);
 }
 
-/*
- * identify encoding
- */
-const mbfl_encoding *mbfl_identify_encoding(mbfl_string *string, const mbfl_encoding **elist, int elistsz, int strict)
-{
-	if (!elistsz) {
-		return NULL;
-	}
-	mbfl_encoding_detector *identd = mbfl_encoding_detector_new(elist, elistsz, strict);
-	mbfl_encoding_detector_feed(identd, string);
-	const mbfl_encoding *enc = mbfl_encoding_detector_judge(identd);
-	mbfl_encoding_detector_delete(identd);
-	return enc;
-}
-
 /*
  * strcut
  */

ext/mbstring/libmbfl/mbfl/mbfilter.h

Lines changed: 0 additions & 6 deletions
@@ -155,12 +155,6 @@ MBFLAPI extern mbfl_string *
 mbfl_convert_encoding(mbfl_string *string, mbfl_string *result, const mbfl_encoding *toenc);
 
 
-/*
- * identify encoding
- */
-MBFLAPI extern const mbfl_encoding *
-mbfl_identify_encoding(mbfl_string *string, const mbfl_encoding **elist, int elistsz, int strict);
-
 /* Lengths -1 through -16 are reserved for error return values */
 static inline int mbfl_is_error(size_t len) {
 	return len >= (size_t) -16;

ext/mbstring/mbstring.c

Lines changed: 169 additions & 34 deletions
@@ -62,6 +62,9 @@
 
 #include "zend_multibyte.h"
 #include "mbstring_arginfo.h"
+
+#include "rare_cp_bitvec.h"
+
 /* }}} */
 
 /* {{{ prototypes */
@@ -84,6 +87,8 @@ static inline bool php_mb_is_no_encoding_utf8(enum mbfl_no_encoding no_enc);
 
 static bool mb_check_str_encoding(zend_string *str, const mbfl_encoding *encoding);
 
+static const mbfl_encoding* mb_guess_encoding(unsigned char *in, size_t in_len, const mbfl_encoding **elist, unsigned int elist_size, bool strict);
+
 /* See mbfilter_cp5022x.c */
 uint32_t mb_convert_kana_codepoint(uint32_t c, uint32_t next, bool *consumed, uint32_t *second, int mode);
 /* }}} */
@@ -437,17 +442,12 @@ static bool php_mb_zend_encoding_lexer_compatibility_checker(const zend_encoding
 
 static const zend_encoding *php_mb_zend_encoding_detector(const unsigned char *arg_string, size_t arg_length, const zend_encoding **list, size_t list_size)
 {
-	mbfl_string string;
-
 	if (!list) {
-		list = (const zend_encoding **)MBSTRG(current_detect_order_list);
+		list = (const zend_encoding**)MBSTRG(current_detect_order_list);
 		list_size = MBSTRG(current_detect_order_list_size);
 	}
 
-	mbfl_string_init(&string);
-	string.val = (unsigned char *)arg_string;
-	string.len = arg_length;
-	return (const zend_encoding *) mbfl_identify_encoding(&string, (const mbfl_encoding **)list, list_size, 0);
+	return (const zend_encoding*)mb_guess_encoding((unsigned char*)arg_string, arg_length, (const mbfl_encoding **)list, list_size, false);
 }
 
 static size_t php_mb_zend_encoding_converter(unsigned char **to, size_t *to_length, const unsigned char *from, size_t from_length, const zend_encoding *encoding_to, const zend_encoding *encoding_from)
@@ -2602,12 +2602,7 @@ MBSTRING_API zend_string* php_mb_convert_encoding(const char *input, size_t leng
 		from_encoding = *from_encodings;
 	} else {
 		/* auto detect */
-		mbfl_string string;
-		mbfl_string_init(&string);
-		string.val = (unsigned char *)input;
-		string.len = length;
-		from_encoding = mbfl_identify_encoding(
-			&string, from_encodings, num_from_encodings, MBSTRG(strict_detection));
+		from_encoding = mb_guess_encoding((unsigned char*)input, length, from_encodings, num_from_encodings, MBSTRG(strict_detection));
 		if (!from_encoding) {
 			php_error_docref(NULL, E_WARNING, "Unable to detect character encoding");
 			return NULL;
@@ -2712,7 +2707,7 @@ PHP_FUNCTION(mb_convert_encoding)
 	HashTable *input_ht, *from_encodings_ht = NULL;
 	const mbfl_encoding **from_encodings;
 	size_t num_from_encodings;
-	bool free_from_encodings;
+	bool free_from_encodings = false;
 
 	ZEND_PARSE_PARAMETERS_START(2, 3)
 		Z_PARAM_ARRAY_HT_OR_STR(input_ht, input_str)
@@ -2730,18 +2725,17 @@ PHP_FUNCTION(mb_convert_encoding)
 		if (php_mb_parse_encoding_array(from_encodings_ht, &from_encodings, &num_from_encodings, 3) == FAILURE) {
 			RETURN_THROWS();
 		}
-		free_from_encodings = 1;
+		free_from_encodings = true;
 	} else if (from_encodings_str) {
 		if (php_mb_parse_encoding_list(ZSTR_VAL(from_encodings_str), ZSTR_LEN(from_encodings_str),
 				&from_encodings, &num_from_encodings,
 				/* persistent */ 0, /* arg_num */ 3, /* allow_pass_encoding */ 0) == FAILURE) {
 			RETURN_THROWS();
 		}
-		free_from_encodings = 1;
+		free_from_encodings = true;
 	} else {
 		from_encodings = &MBSTRG(current_internal_encoding);
 		num_from_encodings = 1;
-		free_from_encodings = 0;
 	}
 
 	if (num_from_encodings > 1) {
@@ -2847,16 +2841,163 @@ static const mbfl_encoding **duplicate_elist(const mbfl_encoding **elist, size_t
 	return new_elist;
 }
 
+static unsigned int mb_estimate_encoding_demerits(uint32_t w)
+{
+	/* Receive wchars decoded from input string using candidate encoding.
+	 * Give the candidate many 'demerits' for each 'rare' codepoint found,
+	 * a smaller number for each ASCII punctuation character, and 1 for
+	 * all other codepoints.
+	 *
+	 * The 'common' codepoints should cover the vast majority of
+	 * codepoints we are likely to see in practice, while only covering
+	 * a small minority of the entire Unicode encoding space. Why?
+	 * Well, if the test string happens to be valid in an incorrect
+	 * candidate encoding, the bogus codepoints which it decodes to will
+	 * be more or less random. By treating the majority of codepoints as
+	 * 'rare', we ensure that in almost all such cases, the bogus
+	 * codepoints will include plenty of 'rares', thus giving the
+	 * incorrect candidate encoding lots of demerits. See
+	 * common_codepoints.txt for the actual list used.
+	 *
+	 * So, why give extra demerits for ASCII punctuation characters? It's
+	 * because there are some text encodings, like UTF-7, HZ, and ISO-2022,
+	 * which deliberately only use bytes in the ASCII range. When
+	 * misinterpreted as ASCII/UTF-8, strings in these encodings will
+	 * have an unusually high number of ASCII punctuation characters.
+	 * So giving extra demerits for such characters will improve
+	 * detection accuracy for UTF-7 and similar encodings.
+	 *
+	 * Finally, why 1 demerit for all other characters? That penalizes
+	 * long strings, meaning we will tend to choose a candidate encoding
+	 * in which the test string decodes to a smaller number of
+	 * codepoints. That prevents single-byte encodings in which almost
+	 * every possible input byte decodes to a 'common' codepoint from
+	 * being favored too much. */
+	if (w > 0xFFFF) {
+		return 40;
+	} else if (w >= 0x21 && w <= 0x2F) {
+		return 6;
+	} else if ((rare_codepoint_bitvec[w >> 5] >> (w & 0x1F)) & 1) {
+		return 30;
+	} else {
+		return 1;
+	}
+	return 0;
+}
+
+/* When doing 'strict' detection, any string which is invalid in the candidate encoding
+ * is rejected. With non-strict detection, we just continue, but apply demerits for
+ * each invalid byte sequence */
+static const mbfl_encoding* mb_guess_encoding(unsigned char *in, size_t in_len, const mbfl_encoding **elist, unsigned int elist_size, bool strict)
+{
+	if (elist_size == 0) {
+		return NULL;
+	}
+	if (elist_size == 1) {
+		if (strict) {
+			return php_mb_check_encoding((const char*)in, in_len, *elist) ? *elist : NULL;
+		} else {
+			return *elist;
+		}
+	}
+	if (in_len == 0) {
+		return *elist;
+	}
+
+	uint32_t wchar_buf[128];
+	struct conversion_data {
+		const mbfl_encoding *enc;
+		unsigned char *in;
+		size_t in_len;
+		uint64_t demerits; /* Wide bit size to prevent overflow */
+		unsigned int state;
+	};
+	/* Allocate on stack; when we return, this array is automatically freed */
+	struct conversion_data *data = alloca(elist_size * sizeof(struct conversion_data));
+
+	for (unsigned int i = 0; i < elist_size; i++) {
+		data[i].enc = elist[i];
+		data[i].in = in;
+		data[i].in_len = in_len;
+		data[i].state = 0;
+		data[i].demerits = 0;
+	}
+
+	unsigned int finished = 0; /* For how many candidate encodings have we processed all the input? */
+	while (elist_size > 1 && finished < elist_size) {
+		unsigned int i = 0;
+try_next_encoding:
+		while (i < elist_size) {
+			/* Do we still have more input to process for this candidate encoding? */
+			if (data[i].in_len) {
+				const mbfl_encoding *enc = data[i].enc;
+				size_t out_len = enc->to_wchar(&data[i].in, &data[i].in_len, wchar_buf, 128, &data[i].state);
+				ZEND_ASSERT(out_len <= 128);
+				/* Check this batch of decoded codepoints; are there any error markers?
+				 * Also sum up the number of demerits */
+				while (out_len) {
+					uint32_t w = wchar_buf[--out_len];
+					if (w == MBFL_BAD_INPUT) {
+						if (strict) {
+							/* This candidate encoding is not valid, eliminate it from consideration */
+							elist_size--;
+							memmove(&data[i], &data[i+1], (elist_size - i) * sizeof(struct conversion_data));
+							goto try_next_encoding;
+						} else {
+							data[i].demerits += 1000;
+						}
+					} else {
+						data[i].demerits += mb_estimate_encoding_demerits(w);
+					}
+				}
+				if (data[i].in_len == 0) {
+					finished++;
+				}
+			}
+			i++;
+		}
+	}
+
+	if (strict) {
+		if (elist_size == 0) {
+			/* All candidates were eliminated */
+			return NULL;
+		}
+		/* The above loop might have broken because there was only 1 candidate encoding left
+		 * If in strict mode, we still need to process any remaining input for that candidate */
+		if (elist_size == 1 && data[0].in_len) {
+			const mbfl_encoding *enc = data[0].enc;
+			unsigned char *in = data[0].in;
+			size_t in_len = data[0].in_len;
+			unsigned int state = data[0].state;
+			while (in_len) {
+				size_t out_len = enc->to_wchar(&in, &in_len, wchar_buf, 128, &state);
+				while (out_len) {
+					if (wchar_buf[--out_len] == MBFL_BAD_INPUT) {
+						return NULL;
+					}
+				}
+			}
+		}
+	}
+
+	/* See which remaining candidate encoding has the least demerits */
+	unsigned int best = 0;
+	for (unsigned int i = 1; i < elist_size; i++) {
+		if (data[i].demerits < data[best].demerits) {
+			best = i;
+		}
+	}
+	return data[best].enc;
+}
+
 /* {{{ Encodings of the given string is returned (as a string) */
 PHP_FUNCTION(mb_detect_encoding)
 {
 	zend_string *str, *encoding_str = NULL;
 	HashTable *encoding_ht = NULL;
 	bool strict = false;
-
-	mbfl_string string;
-	const mbfl_encoding *ret;
-	const mbfl_encoding **elist;
+	const mbfl_encoding *ret, **elist;
 	size_t size;
 
 	ZEND_PARSE_PARAMETERS_START(1, 3)
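
The scoring in `mb_estimate_encoding_demerits` relies on a bit vector with one bit per codepoint in the range 0 to 0xFFFF: `w >> 5` selects a 32-bit word and `w & 0x1F` selects the bit inside it. The standalone C sketch below mirrors that lookup and the demerit weights from the hunk above. Everything prefixed `toy_` or `TOY_` is hypothetical: the real table lives in rare_cp_bitvec.h and is not reproduced here, and `TOY_BAD_INPUT` is only a stand-in for the `MBFL_BAD_INPUT` marker.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in error marker; assumed value, chosen only for this sketch. */
#define TOY_BAD_INPUT 0xFFFFFFFFu

/* One bit per codepoint in 0..0xFFFF: 65536 / 32 = 2048 words.
 * A set bit means 'rare'. Starts out all zero (nothing is rare). */
uint32_t toy_rare_bitvec[0x10000 / 32];

/* Mark a BMP codepoint as rare. Caller must pass w <= 0xFFFF. */
void toy_mark_rare(uint32_t w)
{
	toy_rare_bitvec[w >> 5] |= (uint32_t)1 << (w & 0x1F);
}

/* Same lookup expression as in the diff. Caller must pass w <= 0xFFFF. */
int toy_is_rare(uint32_t w)
{
	return (toy_rare_bitvec[w >> 5] >> (w & 0x1F)) & 1;
}

/* Same weights as in the diff: 40 above the BMP, 6 for ASCII
 * punctuation 0x21..0x2F, 30 for 'rare', 1 for everything else. */
unsigned int toy_demerits(uint32_t w)
{
	if (w > 0xFFFF) {
		return 40; /* checked first, so the table is never indexed out of range */
	} else if (w >= 0x21 && w <= 0x2F) {
		return 6;
	} else if (toy_is_rare(w)) {
		return 30;
	}
	return 1;
}

/* Total demerits for a batch of decoded codepoints, adding 1000 per
 * error marker as the non-strict branch of mb_guess_encoding does. */
uint64_t toy_score(const uint32_t *cps, size_t n)
{
	uint64_t total = 0;
	for (size_t i = 0; i < n; i++) {
		total += (cps[i] == TOY_BAD_INPUT) ? 1000 : toy_demerits(cps[i]);
	}
	return total;
}
```

With U+4869 marked rare in the toy table, the codepoints of "Hi!" (0x48, 0x69, 0x21) score 1 + 1 + 6 = 8, while a bogus decoding such as U+4869, U+1F600 and one error marker scores 30 + 40 + 1000 = 1070. The lower total marks the more plausible candidate.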
@@ -2896,14 +3037,10 @@ PHP_FUNCTION(mb_detect_encoding)
 		strict = MBSTRG(strict_detection);
 	}
 
-	if (strict && size == 1) {
-		/* If there is only a single candidate encoding, mb_check_encoding is faster */
-		ret = (mb_check_str_encoding(str, *elist)) ? *elist : NULL;
+	if (size == 1 && *elist == &mbfl_encoding_utf8 && (GC_FLAGS(str) & IS_STR_VALID_UTF8)) {
+		ret = &mbfl_encoding_utf8;
 	} else {
-		mbfl_string_init(&string);
-		string.val = (unsigned char*)ZSTR_VAL(str);
-		string.len = ZSTR_LEN(str);
-		ret = mbfl_identify_encoding(&string, elist, size, strict);
+		ret = mb_guess_encoding((unsigned char*)ZSTR_VAL(str), ZSTR_LEN(str), elist, size, strict);
 	}
 
 	efree(ZEND_VOIDP(elist));
@@ -4086,9 +4223,8 @@ PHP_FUNCTION(mb_send_mail)
 		orig_str.val = (unsigned char *)subject;
 		orig_str.len = subject_len;
 		orig_str.encoding = MBSTRG(current_internal_encoding);
-		if (orig_str.encoding->no_encoding == mbfl_no_encoding_invalid
-				|| orig_str.encoding->no_encoding == mbfl_no_encoding_pass) {
-			orig_str.encoding = mbfl_identify_encoding(&orig_str, MBSTRG(current_detect_order_list), MBSTRG(current_detect_order_list_size), MBSTRG(strict_detection));
+		if (orig_str.encoding->no_encoding == mbfl_no_encoding_invalid || orig_str.encoding->no_encoding == mbfl_no_encoding_pass) {
+			orig_str.encoding = mb_guess_encoding((unsigned char*)subject, subject_len, MBSTRG(current_detect_order_list), MBSTRG(current_detect_order_list_size), MBSTRG(strict_detection));
 		}
 		pstr = mbfl_mime_header_encode(&orig_str, &conv_str, tran_cs, head_enc, CRLF, sizeof("Subject: [PHP-jp nnnnnnnn]" CRLF) - 1);
 		if (pstr != NULL) {
@@ -4100,9 +4236,8 @@ PHP_FUNCTION(mb_send_mail)
 		orig_str.len = message_len;
 		orig_str.encoding = MBSTRG(current_internal_encoding);
 
-		if (orig_str.encoding->no_encoding == mbfl_no_encoding_invalid
-				|| orig_str.encoding->no_encoding == mbfl_no_encoding_pass) {
-			orig_str.encoding = mbfl_identify_encoding(&orig_str, MBSTRG(current_detect_order_list), MBSTRG(current_detect_order_list_size), MBSTRG(strict_detection));
+		if (orig_str.encoding->no_encoding == mbfl_no_encoding_invalid || orig_str.encoding->no_encoding == mbfl_no_encoding_pass) {
+			orig_str.encoding = mb_guess_encoding((unsigned char*)message, message_len, MBSTRG(current_detect_order_list), MBSTRG(current_detect_order_list_size), MBSTRG(strict_detection));
 		}
 
 		pstr = NULL;
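
In strict mode, `mb_guess_encoding` in this file drops a failed candidate by shifting the rest of the array down with `memmove` and then re-examining the same index, and at the end it returns the survivor with the fewest demerits. The C sketch below isolates those two steps under simplifying assumptions: `struct toy_candidate` is a hypothetical reduction of `struct conversion_data` to a name and a score, and no decoding is performed.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, reduced version of the per-candidate bookkeeping. */
struct toy_candidate {
	const char *name;
	uint64_t demerits;
};

/* Remove entry i (requires i < n) by shifting later entries down one slot.
 * Returns the new count. The caller must not advance i afterwards,
 * because a different candidate now occupies index i. */
size_t toy_eliminate(struct toy_candidate *c, size_t n, size_t i)
{
	n--;
	memmove(&c[i], &c[i + 1], (n - i) * sizeof(*c));
	return n;
}

/* Return the name of the candidate with the fewest demerits, or NULL
 * if every candidate was eliminated. On a tie, the earlier entry wins,
 * because only a strictly smaller score replaces the current best. */
const char *toy_pick_best(const struct toy_candidate *c, size_t n)
{
	if (n == 0) {
		return NULL;
	}
	size_t best = 0;
	for (size_t i = 1; i < n; i++) {
		if (c[i].demerits < c[best].demerits) {
			best = i;
		}
	}
	return c[best].name;
}
```

One consequence of the strict `<` comparison is that the order of the candidate list acts as a tie-breaker, so listing the most likely encoding first matters when scores are equal.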

ext/mbstring/tests/bug49536.phpt

Lines changed: 7 additions & 0 deletions
@@ -12,9 +12,16 @@ var_dump(mb_detect_encoding("A\x81", "SJIS", true));
 var_dump(mb_detect_encoding("\xc0\x00", "UTF-8", false));
 // strict mode
 var_dump(mb_detect_encoding("\xc0\x00", "UTF-8", true));
+
+// Strict mode with multiple candidate encodings
+// This input string is invalid in ALL the candidate encodings:
+echo "== INVALID STRING - UTF-8 and SJIS ==\n";
+var_dump(mb_detect_encoding("\xFF\xFF", ['SJIS', 'UTF-8'], true));
 ?>
 --EXPECT--
 string(4) "SJIS"
 bool(false)
+string(5) "UTF-8"
 bool(false)
+== INVALID STRING - UTF-8 and SJIS ==
 bool(false)
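
The new `string(5) "UTF-8"` expectation above follows from the single-candidate shortcut at the top of `mb_guess_encoding`: with one candidate, strict mode returns it only if the input validates, and non-strict mode returns it unconditionally. A minimal C sketch of that decision, assuming a hypothetical `struct toy_encoding` and a simplified UTF-8 validator in place of `php_mb_check_encoding`:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an encoding descriptor. */
struct toy_encoding {
	const char *name;
	bool (*is_valid)(const unsigned char *s, size_t len);
};

/* Simplified UTF-8 validity check (assumption: lead and continuation
 * bytes only, no overlong 3/4-byte or surrogate checks). */
bool toy_utf8_valid(const unsigned char *s, size_t len)
{
	size_t i = 0;
	while (i < len) {
		unsigned char c = s[i];
		size_t need;
		if (c < 0x80) need = 0;
		else if (c >= 0xC2 && c <= 0xDF) need = 1;
		else if (c >= 0xE0 && c <= 0xEF) need = 2;
		else if (c >= 0xF0 && c <= 0xF4) need = 3;
		else return false;                 /* e.g. 0xC0: never a valid lead byte */
		if (need >= len - i) return false; /* truncated sequence */
		for (size_t k = 1; k <= need; k++) {
			if ((s[i + k] & 0xC0) != 0x80) return false;
		}
		i += need + 1;
	}
	return true;
}

/* Single-candidate case: strict validates, non-strict always accepts. */
const struct toy_encoding *toy_guess_single(const unsigned char *s, size_t len,
                                            const struct toy_encoding *only, bool strict)
{
	if (strict) {
		return only->is_valid(s, len) ? only : NULL;
	}
	return only;
}
```

For the two bytes `C0 00` from the test, the non-strict call returns the UTF-8 descriptor and the strict call returns NULL, matching the `"UTF-8"` and `false` lines in the expected output.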
