-
Notifications
You must be signed in to change notification settings - Fork 7.9k
ext/mbstring: Do some clean-up refactoring #12716
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
a6f2619
to
4453b09
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks OK.
I made a remark in a previous PR that there could be a subtly behaviour change because of strings containing \0, and I think those should be blocked to make sure we can always use ZSTR_LEN.
Leaving the final decision to Alex however as it's his extension.
ext/mbstring/mbstring.c
Outdated
@@ -2537,7 +2537,7 @@ static zend_string* mb_trim_string(zend_string *input, zend_string *marker, cons | |||
if (out_len <= to_skip) { | |||
to_skip -= out_len; | |||
} else { | |||
for (int i = to_skip; i < out_len; i++) { | |||
for (unsigned int i = to_skip; i < out_len; i++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Strictly speaking, there is an inconsistency here with how you did the previous changes. Previously you changed this to size_t because out_len was size_t. Here you changed it to unsigned int.
unsigned int, or even int, suffices in all cases because the length is limited to 128 anyway. Performance-wise it's hard to predict what these would do. I would expect no difference because it would just use a larger register , albeit with a larger stack spill.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Strictly speaking, there is an inconsistency here with how you did the previous changes. Previously you changed this to size_t because out_len was size_t. Here you changed it to unsigned int. unsigned int, or even int, suffices in all cases because the length is limited to 128 anyway. Performance-wise it's hard to predict what these would do. I would expect no difference because it would just use a larger register , albeit with a larger stack spill.
Right, I did kinda blindly do those changes. But if those are limited to 128, might as well use an unsigned char then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay I remember how I got to this inconsistency, I used the type of the from
parameter which is unsigned int
here. But I wonder if this might not cause issues and that should be of type size_t
in the first place?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The from variable should probably be of type size_t, otherwise >4GiB strings (okay unlikely, but still) won't work correctly I think.
This looks good to me, but I need to investigate a bit more to fully understand the issue with embedded NUL bytes mentioned by @nielsdos. |
Basically it boils down to a string |
Sorry for my mental slowness, but does this mean that |
Before :) as the lengths wasn't passed to the function. Now with the patch it is getting interpreted as |
I suggest a note can be added in UPGRADING about this. Personally I think the BC break is not so big. |
Ah, we should also add a test case for this. |
It's actually quite subtle. Before this PR "utf8\0foo" would be interpreted as "utf8" because it internally used strlen. Then it uses strncasecmp (stops at \0) to compare the string. However now the length and string matching don't agree anymore, which causes the name to never match. |
4453b09
to
36c90c3
Compare
@Girgias How would it be if you merge some of the non-controversial commits in this patch series? |
This was my plan, I got distracted with other stuff :/ |
36c90c3
to
6b88b01
Compare
I think |
|
6b88b01
to
c995fc9
Compare
I have merged the two commits @alexdowad |
I think |
When I merge I will update UPGRADING to be part of the commit :) |
c995fc9
to
871b151
Compare
I think Hopefully the unit tests we have are robust enough to catch any bugs. I added a number of tests for |
I think |
871b151
to
ba4a66a
Compare
We have access to this information, so propagate it instead of calling strlen(). This also removes the newly introduced _ex() variant.
ba4a66a
to
1016b00
Compare
@Girgias Thanks for working on these refactorings. I wonder how much it is worth to fix up |
Is it deprecated? Or is that only the INI setting? Anyway, I was just trying to finish one of my outstanding PRs :) |
To rebase after #12714 for the
_php_mb_ini_mbstring_http_output_set()
to make sense.