You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Implement mb_detect_encoding using fast text conversion filters
Regarding the optional 3rd `strict` argument to mb_detect_encoding,
the documentation states:
Controls the behaviour when string is not valid in any of the listed encodings.
If strict is set to false, the closest matching encoding will be returned;
if strict is set to true, false will be returned.
(Ref: https://www.php.net/manual/en/function.mb-detect-encoding.php)
Because of bugs in the implementation, mb_detect_encoding did not always
behave according to this description when `strict` was false.
For example:
<?php
echo var_export(mb_detect_encoding("\xc0\x00", "UTF-8", false));
// Before this commit, prints: false
// After this commit, prints: 'UTF-8'
Because `strict` is false in the above example, mb_detect_encoding
should return the 'closest matching encoding', which is UTF-8, since
that is the only candidate encoding. (Incidentally, this example shows
that using mb_detect_encoding with a single candidate encoding in
non-strict mode is useless.)
The new implementation fixes this bug. It also fixes another problem
with the old implementation as regards non-strict detection mode:
The old implementation would stop processing of the input string using
a particular candidate encoding as soon as it saw an error in that
encoding, even in non-strict mode. This means that it could not really
detect the 'closest matching encoding'; rather, what it would return
in non-strict mode was 'the encoding in which the first decoding error
is furthest from the beginning of the input string'.
In non-strict mode, the new implementation continues trying to process
the input string to its end even after seeing an error. This makes it
possible to determine in which candidate encoding the string has the
smallest number of errors, i.e. the 'closest matching encoding'.
Rejecting candidate encodings as soon as it saw an error gave the old
implementation a marked performance advantage in non-strict mode;
however, the new implementation still beats it in most cases. Here are
a few sample microbenchmark results:
UTF-8, ~100 codepoints, strict mode
Old: 0.080s (100,000 calls)
New: 0.026s (" " )
UTF-8, ~100 codepoints, non-strict mode
Old: 0.079s (100,000 calls)
New: 0.033s (" " )
UTF-8, ~10000 codepoints, strict mode
Old: 6.708s (60,000 calls)
New: 1.383s (" " )
UTF-8, ~10000 codepoints, non-strict mode
Old: 6.705s (60,000 calls)
New: 3.044s (" " )
Notice that the old implementation had almost identical performance
between strict and non-strict mode, while the new suffers a significant
performance penalty for non-strict detection. This is the cost of
implementing the behavior specified in the documentation.
A couple more sample results:
SJIS, ~10000 codepoints, strict mode
Old: 4.563s
New: 1.084s
SJIS, ~10000 codepoints, non-strict mode
Old: 4.569s
New: 2.863s
This is the only case I found where the new implementation loses:
UTF-16LE, ~10000 codepoints, non-strict mode
Old: 1.514s
New: 2.813s
The reason is because the test strings happened to be invalid right from
the first few bytes for all the candidate encodings except for UTF-16LE;
so the old implementation would immediately reject all those encodings
and only process the entire string in UTF-16LE.
I believe mb_detect_encoding could be made much faster if we identified
good criteria for when to reject candidate encodings before reaching
the end of the input string.
0 commit comments