grapheme_extract should pass over invalid surrogate halves
Many systems incorrectly encode the two surrogate halves of a UTF-16
pair into UTF-8 as two three-byte sequences instead of the proper single
four-byte sequence. These sequences are invalid in UTF-8 and should be
skipped when decoding with `grapheme_extract`, but it does not currently
handle them properly.
The documentation for the `offset` parameter describes similar skipping:
> If offset does not point to the first byte of a UTF-8 character,
> the start position is moved to the next character boundary.
For example, U+1F170 (d83c dd70) should encode as F0 9F 85 B0, but
when the UTF-8 encoder is wrongly applied to d83c on its own, the output
is ED A0 BC (likewise, d83d yields ED A0 BD, the bytes used below).
The entire three-byte span is invalid UTF-8.
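The byte values can be checked by hand. The following is a minimal sketch, not code from this change: `naive_three_byte()` and `encode_pair()` are hypothetical helpers written only to show the arithmetic, and validity is checked with `preg_match()` and the `/u` modifier, which returns `false` for a subject that is not valid UTF-8.

```php
<?php
// Hypothetical helper: applies the three-byte UTF-8 layout to a 16-bit
// code unit without rejecting surrogates. This reproduces the mistake.
function naive_three_byte( int $unit ): string {
    return chr( 0xE0 | ( $unit >> 12 ) )
         . chr( 0x80 | ( ( $unit >> 6 ) & 0x3F ) )
         . chr( 0x80 | ( $unit & 0x3F ) );
}

// Hypothetical helper: combines a surrogate pair into one code point
// and encodes it as the proper four-byte sequence.
function encode_pair( int $high, int $low ): string {
    $cp = 0x10000 + ( ( $high - 0xD800 ) << 10 ) + ( $low - 0xDC00 );
    return chr( 0xF0 | ( $cp >> 18 ) )
         . chr( 0x80 | ( ( $cp >> 12 ) & 0x3F ) )
         . chr( 0x80 | ( ( $cp >> 6 ) & 0x3F ) )
         . chr( 0x80 | ( $cp & 0x3F ) );
}

$wrong = naive_three_byte( 0xD83C ) . naive_three_byte( 0xDD70 );
$right = encode_pair( 0xD83C, 0xDD70 );

echo bin2hex( $wrong ), "\n"; // eda0bcedb5b0
echo bin2hex( $right ), "\n"; // f09f85b0

// preg_match() with /u returns false when the subject is not valid UTF-8.
var_dump( false !== preg_match( '//u', $wrong ) ); // bool(false)
var_dump( false !== preg_match( '//u', $right ) ); // bool(true)
```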
```php
grapheme_extract( "\xED\xA0\xBDa", 1, GRAPHEME_EXTR_COUNT, 0, $next );
// returns "\xED", an invalid UTF-8 byte sequence
// $next === 1, pointing into the middle of the invalid sequence
```
Instead, it should return “a” and point `$next` to the end of the string.
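The expected skipping can be sketched in userland. This is only an illustration of the intended result, assuming a brute-force scan is acceptable; `next_valid_offset()` is a hypothetical helper and does not reflect how the intl extension is implemented.

```php
<?php
// Hypothetical helper, not part of the intl extension: returns the byte
// offset of the first well-formed UTF-8 character at or after $offset,
// or strlen( $text ) if none remains.
function next_valid_offset( string $text, int $offset ): int {
    $end = strlen( $text );
    for ( $at = $offset; $at < $end; $at++ ) {
        // A UTF-8 character is one to four bytes long.
        for ( $len = 1; $len <= 4 && $at + $len <= $end; $len++ ) {
            if ( 1 === preg_match( '//u', substr( $text, $at, $len ) ) ) {
                return $at;
            }
        }
    }
    return $end;
}

$text  = "\xED\xA0\xBDa";
$start = next_valid_offset( $text, 0 );

echo $start, "\n";                  // 3
echo substr( $text, $start ), "\n"; // a
```

Starting extraction from offset 3 yields “a”, and the position after it is 4, the end of the string.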