Skip to content

ext/mbstring: Do some clean-up refactoring #12716

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Apr 24, 2024

Conversation

Girgias
Copy link
Member

@Girgias Girgias commented Nov 18, 2023

To rebase after #12714 for the _php_mb_ini_mbstring_http_output_set() to make sense.

@Girgias Girgias force-pushed the mbstring-strlen-pass branch from a6f2619 to 4453b09 Compare November 19, 2023 07:14
@Girgias Girgias marked this pull request as ready for review November 19, 2023 07:14
@Girgias Girgias requested a review from alexdowad as a code owner November 19, 2023 07:14
@Girgias Girgias requested a review from nielsdos November 19, 2023 07:14
Copy link
Member

@nielsdos nielsdos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks OK.
I made a remark in a previous PR that there could be a subtly behaviour change because of strings containing \0, and I think those should be blocked to make sure we can always use ZSTR_LEN.
Leaving the final decision to Alex however as it's his extension.

@@ -2537,7 +2537,7 @@ static zend_string* mb_trim_string(zend_string *input, zend_string *marker, cons
if (out_len <= to_skip) {
to_skip -= out_len;
} else {
for (int i = to_skip; i < out_len; i++) {
for (unsigned int i = to_skip; i < out_len; i++) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strictly speaking, there is an inconsistency here with how you did the previous changes. Previously you changed this to size_t because out_len was size_t. Here you changed it to unsigned int.
unsigned int, or even int, suffices in all cases because the length is limited to 128 anyway. Performance-wise it's hard to predict what these would do. I would expect no difference because it would just use a larger register , albeit with a larger stack spill.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strictly speaking, there is an inconsistency here with how you did the previous changes. Previously you changed this to size_t because out_len was size_t. Here you changed it to unsigned int. unsigned int, or even int, suffices in all cases because the length is limited to 128 anyway. Performance-wise it's hard to predict what these would do. I would expect no difference because it would just use a larger register , albeit with a larger stack spill.

Right, I did kinda blindly do those changes. But if those are limited to 128, might as well use an unsigned char then?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay I remember how I got to this inconsistency, I used the type of the from parameter which is unsigned int here. But I wonder if this might not cause issues and that should be of type size_t in the first place?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The from variable should probably be of type size_t, otherwise >4GiB strings (okay unlikely, but still) won't work correctly I think.

@alexdowad
Copy link
Contributor

This looks good to me, but I need to investigate a bit more to fully understand the issue with embedded NUL bytes mentioned by @nielsdos.

@Girgias
Copy link
Member Author

Girgias commented Nov 20, 2023

This looks good to me, but I need to investigate a bit more to fully understand the issue with embedded NUL bytes mentioned by @nielsdos.

Basically it boils down to a string "utf-8\0blah" would be interpreted as "utf-8" as the null byte terminates a C string.

@alexdowad
Copy link
Contributor

Basically it boils down to a string "utf-8\0blah" would be interpreted as "utf-8" as the null byte terminates a C string.

Sorry for my mental slowness, but does this mean that "utf-8\0blah" would be interpreted as "utf-8" before, or now?

@Girgias
Copy link
Member Author

Girgias commented Nov 20, 2023

Basically it boils down to a string "utf-8\0blah" would be interpreted as "utf-8" as the null byte terminates a C string.

Sorry for my mental slowness, but does this mean that "utf-8\0blah" would be interpreted as "utf-8" before, or now?

Before :) as the lengths wasn't passed to the function. Now with the patch it is getting interpreted as "utf-8\0blah" as it will compare the 10 bytes.

@alexdowad
Copy link
Contributor

I suggest a note can be added in UPGRADING about this. Personally I think the BC break is not so big.

@alexdowad
Copy link
Contributor

I suggest a note can be added in UPGRADING about this. Personally I think the BC break is not so big.

Ah, we should also add a test case for this.

@nielsdos
Copy link
Member

It's actually quite subtle.
For example for name2encoding.
The function uses hashing and strncasecmp. If we pass ZSTR_LEN to the hashing function then the \0 byte will also be hashed.

Before this PR "utf8\0foo" would be interpreted as "utf8" because it internally used strlen. Then it uses strncasecmp (stops at \0) to compare the string. However now the length and string matching don't agree anymore, which causes the name to never match.
The mime types and aliases all use strncasecmp without hashing so those will match "utf8\0foo" as "utf8".
I would block NUL bytes at the parameter parsing side and avoid these subtle details.

@Girgias Girgias force-pushed the mbstring-strlen-pass branch from 4453b09 to 36c90c3 Compare November 22, 2023 05:26
@alexdowad
Copy link
Contributor

@Girgias How would it be if you merge some of the non-controversial commits in this patch series?

@Girgias
Copy link
Member Author

Girgias commented Dec 4, 2023

@Girgias How would it be if you merge some of the non-controversial commits in this patch series?

This was my plan, I got distracted with other stuff :/

@Girgias Girgias force-pushed the mbstring-strlen-pass branch from 36c90c3 to 6b88b01 Compare December 5, 2023 05:21
@alexdowad
Copy link
Contributor

I think 39960e25a6fe0756c4ea5449568210b3f44eedca looks good to land, as long as the tests pass.

@alexdowad
Copy link
Contributor

48bff2b625ee2d5ba06bfd53b8f3bfe54bdcc5f6 also looks good.

@Girgias Girgias force-pushed the mbstring-strlen-pass branch from 6b88b01 to c995fc9 Compare December 6, 2023 23:48
@Girgias
Copy link
Member Author

Girgias commented Dec 6, 2023

I have merged the two commits @alexdowad

@alexdowad
Copy link
Contributor

I think 8fb651d61764b37c7cb8bdeca6d6f4ad616c74af is good and can be merged. It's just unfortunate that the update of UPGRADING is in a different commit.

@Girgias
Copy link
Member Author

Girgias commented Dec 7, 2023

I think 8fb651d61764b37c7cb8bdeca6d6f4ad616c74af is good and can be merged. It's just unfortunate that the update of UPGRADING is in a different commit.

When I merge I will update UPGRADING to be part of the commit :)

@Girgias Girgias force-pushed the mbstring-strlen-pass branch from c995fc9 to 871b151 Compare December 7, 2023 17:28
@alexdowad
Copy link
Contributor

I think 0048a3623469eb51b9ade358678c7dedc15bfd3d is good now.

Hopefully the unit tests we have are robust enough to catch any bugs. I added a number of tests for mb_strimwidth, though it is always nice to have more...

@alexdowad
Copy link
Contributor

I think 1e94d7941685295a1a1815c3a452ea84b6db50b5 is also good to merge.

@Girgias Girgias force-pushed the mbstring-strlen-pass branch from 871b151 to ba4a66a Compare December 18, 2023 00:32
Girgias added 4 commits April 23, 2024 23:43
We have access to this information, so propagate it instead of calling strlen().
This also removes the newly introduced _ex() variant.
@Girgias Girgias force-pushed the mbstring-strlen-pass branch from ba4a66a to 1016b00 Compare April 23, 2024 22:48
@Girgias Girgias requested a review from alexdowad April 23, 2024 22:48
@alexdowad
Copy link
Contributor

@Girgias Thanks for working on these refactorings. I wonder how much it is worth to fix up mb_http_output... are we not planning to remove that in a future PHP release? But anyways, if you feel it is worth it, no objection. Again, thanks very much.

@Girgias
Copy link
Member Author

Girgias commented Apr 24, 2024

@Girgias Thanks for working on these refactorings. I wonder how much it is worth to fix up mb_http_output... are we not planning to remove that in a future PHP release? But anyways, if you feel it is worth it, no objection. Again, thanks very much.

Is it deprecated? Or is that only the INI setting?

Anyway, I was just trying to finish one of my outstanding PRs :)

@Girgias Girgias merged commit f68d725 into php:master Apr 24, 2024
@Girgias Girgias deleted the mbstring-strlen-pass branch April 24, 2024 14:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants