Skip to content

Add support for GB18030-2022 text encoding to mbstring #13047

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 5 commits into from

Conversation

alexdowad
Copy link
Contributor

The relevant standards authorities in the PRC defined a new version of the GB18030 text encoding in July 2022. Starting from August 2023, support for the new version of GB18030 is mandated for all software products sold in China.

Fortunately, GB18030-2022 is very similar to GB18030-2005, so just a few tweaks were needed to the existing code (like changing some entries in various tables) to comply with the new standard.

The existing support for GB18030 is unchanged; users who want to follow the new standard must explicitly specify GB18030-2022 as their desired text encoding.

Chinese PHP developers will likely have a special interest in this feature; any who are willing to do so are requested to test the new GB18030-2022 support and report any bugs via GitHub.

FYA @Girgias @cmb69 @youkidearitai @nielsdos @kamil-tekiela @iluuu1994

…ison

When developing mbstring, and a unit test fails, this will make it
easier and quicker to identify the cause of the test failure.
@alexdowad alexdowad force-pushed the gb18030 branch 2 times, most recently from e916273 to 8ee073a Compare December 29, 2023 18:17
@alexdowad
Copy link
Contributor Author

@nielsdos Updating the new perfect hash of encoding names is sure a pain. It would be nice if it was scripted.

Do you foresee any problems with this line (changed by me)?

if (key <= sizeof(mbfl_encoding_ptr_list_after_hashing)) {

Copy link
Member

@Girgias Girgias left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't check the algorithm

The previous version of the GB-18030 standard was published in 2005.
This commit adds support for the updated (2022) version of this text
encoding. The existing GB18030 implementation has been left unchanged
for backwards compatibility; users who want to use the new standard
must explicitly indicate the desired text encoding is 'GB18030-2022'.

The document which defines GB18030-2022, published by the government
of the People's Republic of China, defines three levels of standards
compliance. This implementation is intended to achieve Implementation
Level 3, which is the highest level of compliance.

Experts in the GB18030 standard are requested to assess this
implementation and report any deviation from the standard.
@alexdowad
Copy link
Contributor Author

I didn't check the algorithm

You would have to be mad to do that 😆

@nielsdos
Copy link
Member

@nielsdos Updating the new perfect hash of encoding names is sure a pain. It would be nice if it was scripted.

Do you foresee any problems with this line (changed by me)?

if (key <= sizeof(mbfl_encoding_ptr_list_after_hashing)) {

That should be < when using sizeof. It was not a problem with <= 186 because the array had size 187. But now that you use sizeof you should use <.

I'll write a script to automate it, hang on ;)

@nielsdos
Copy link
Member

nielsdos commented Dec 30, 2023

@alexdowad I have pushed a script to your branch to automate the generation of the perfect hash table. And I've used that script to generate a more optimal one (the difference is that I used the argument -m 1000 in the script to perform a random search for more optimal values). I've also fixed the if (key <= ... line.

@youkidearitai
Copy link
Contributor

I have a question. Probably GB18030-2022MappingTableBMP.txt is get from http://www.nits.org.cn/index/article/4034 , That page include GB18030-2022MappingTableSMP.txt. Is okay not include GB18030-2022MappingTableSMP.txt ?

@alexdowad
Copy link
Contributor Author

I have a question. Probably GB18030-2022MappingTableBMP.txt is get from http://www.nits.org.cn/index/article/4034 , That page include GB18030-2022MappingTableSMP.txt. Is okay not include GB18030-2022MappingTableSMP.txt ?

You are sharp!

I didn't include GB18030-2022MappingTableSMP.txt because it's 15MB and I didn't want to bloat the repo. Instead, I randomly sampled 100 lines from the file and included them directly in gb18020_2022_encoding.phpt.

The lines in that file can all be generated using a simple algorithm, so I figured that it is not necessary to test each and every line.

We could increase the number of sampled lines to 1000 or 10,000... it wouldn't make a big difference to the test file size. Does that sound better?

@youkidearitai
Copy link
Contributor

Thanks for asking!

I didn't include GB18030-2022MappingTableSMP.txt because it's 15MB and I didn't want to bloat the repo. Instead, I randomly sampled 100 lines from the file and included them directly in gb18020_2022_encoding.phpt.

Indeed. There is no problem if it was intended. I think enough to 100 lines.

@alexdowad
Copy link
Contributor Author

@nielsdos Thanks very much for adding the script, that is great!

Thanks also for catching my mistake with < vs <=.

@alexdowad alexdowad closed this Dec 30, 2023
@alexdowad alexdowad deleted the gb18030 branch December 30, 2023 16:30
@alexdowad
Copy link
Contributor Author

Landed on master.

Thanks very much, everyone!

@alexdowad
Copy link
Contributor Author

@nielsdos FYI, I squashed your last commit (the one which disabled the debug check) with the one before it.

@nielsdos
Copy link
Member

@nielsdos FYI, I squashed your last commit (the one which disabled the debug check) with the one before it.

Sounds good!

@iluuu1994
Copy link
Member

@alexdowad It looks like the test gb18030_2022_encoding.phpt is broken on 32-bit.

// We may be on a 32-bit machine and testing a text encoding with 4-byte codes
// (which can't be represented in a PHP integer)
$char = "";
for ($i = 2; $i < strlen($line); $i += 2) {
Copy link
Member

@nielsdos nielsdos Jan 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I bet the 32-bit failure is caused by this line, the offset 2 doesn't make sense, I think it should be $i = strpos($line, "\t") + 1
Will check tonight.

Also this test would need ctype extension as dependency on 32-bit platforms.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That totally makes sense.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #13069

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants