-
Notifications
You must be signed in to change notification settings - Fork 7.9k
Add support for GB18030-2022 text encoding to mbstring #13047
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…ison When developing mbstring, and a unit test fails, this will make it easier and quicker to identify the cause of the test failure.
e916273
to
8ee073a
Compare
@nielsdos Updating the new perfect hash of encoding names is sure a pain. It would be nice if it was scripted. Do you foresee any problems with this line (changed by me)?
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't check the algorithm
The previous version of the GB-18030 standard was published in 2005. This commit adds support for the updated (2022) version of this text encoding. The existing GB18030 implementation has been left unchanged for backwards compatibility; users who want to use the new standard must explicitly indicate the desired text encoding is 'GB18030-2022'. The document which defines GB18030-2022, published by the government of the People's Republic of China, defines three levels of standards compliance. This implementation is intended to achieve Implementation Level 3, which is the highest level of compliance. Experts in the GB18030 standard are requested to assess this implementation and report any deviation from the standard.
You would have to be mad to do that 😆 |
That should be I'll write a script to automate it, hang on ;) |
@alexdowad I have pushed a script to your branch to automate the generation of the perfect hash table. And I've used that script to generate a more optimal one (the difference is that I used the argument |
I have a question. Probably |
You are sharp! I didn't include The lines in that file can all be generated using a simple algorithm, so I figured that it is not necessary to test each and every line. We could increase the number of sampled lines to 1000 or 10,000... it wouldn't make a big difference to the test file size. Does that sound better? |
Thanks for asking!
Indeed. There is no problem if it was intended. I think enough to 100 lines. |
@nielsdos Thanks very much for adding the script, that is great! Thanks also for catching my mistake with |
Landed on master. Thanks very much, everyone! |
@nielsdos FYI, I squashed your last commit (the one which disabled the debug check) with the one before it. |
Sounds good! |
@alexdowad It looks like the test gb18030_2022_encoding.phpt is broken on 32-bit. |
// We may be on a 32-bit machine and testing a text encoding with 4-byte codes | ||
// (which can't be represented in a PHP integer) | ||
$char = ""; | ||
for ($i = 2; $i < strlen($line); $i += 2) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I bet the 32-bit failure is caused by this line, the offset 2
doesn't make sense, I think it should be $i = strpos($line, "\t") + 1
Will check tonight.
Also this test would need ctype
extension as dependency on 32-bit platforms.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That totally makes sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See #13069
The relevant standards authorities in the PRC defined a new version of the GB18030 text encoding in July 2022. Starting from August 2023, support for the new version of GB18030 is mandated for all software products sold in China.
Fortunately, GB18030-2022 is very similar to GB18030-2005, so just a few tweaks were needed to the existing code (like changing some entries in various tables) to comply with the new standard.
The existing support for GB18030 is unchanged; users who want to follow the new standard must explicitly specify
GB18030-2022
as their desired text encoding.Chinese PHP developers will likely have a special interest in this feature; any who are willing to do so are requested to test the new GB18030-2022 support and report any bugs via GitHub.
FYA @Girgias @cmb69 @youkidearitai @nielsdos @kamil-tekiela @iluuu1994