-
Notifications
You must be signed in to change notification settings - Fork 7.9k
[RFC] Add trim for multibyte function #12459
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
memo (JST is 0:00 am, I'll fix later):
|
Have you considered also adding U+FEFF / ZERO WIDTH NO-BREAK / "Byte order mark" as mentioned in #9216 (comment) ? |
Maybe it would be best to define whitespace similarly to Go This would make it somewhat future-proof from a maintenance point whenever unicode changes what whitespace means |
@kbergha Thank you very much. Please let me consider it. I was thinking only about regex |
@sztanpet Thank you for your comment. I was thinking about use case of |
@kbergha |
I just want to this function is only "delete space". If delete other character, please use second parameter of |
@kbergha Use case of "trim BOM", I think would work. I wrote RFC Introduction section.
|
mb_trim, mb_ltrim and mb_rtrim is accepted (https://wiki.php.net/rfc/mb_trim). Thanks. |
@youkidearitai Aside from the points I mentioned above, I think the other important thing is to fuzz this code. Maybe add a bit of code to Even if you don't add any assertions, this will still help to catch memory leaks, bad pointer accesses, etc. |
@youkidearitai One more thing... it would be good to have a few test cases for what happens when invalid strings (bad encoding) are passed to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some comments and questions :)
ext/mbstring/mbstring.c
Outdated
static bool is_trim_wchar(uint32_t w, uint32_t *characters, size_t characters_len) | ||
{ | ||
for (size_t i = 0; i < characters_len; i++) { | ||
if (w == characters[i]) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This loop could get very slow for every character, I wonder if it makes sense to build a HashTable if the nr of characters is > 1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The number of characters would have to be a lot bigger than 1 to make building a hashtable worthwhile. I would guess at least 20, but of course, a benchmark would be the best way to find out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Uum.. I think this problem, but I couldn't find a way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps the distinction between 1 character and >1 character shouldn't be made.
Anyway, I still think a HashTable would be a good idea, combined with avoiding decoding the what
string repeatedly. This would improve performance.
To do so:
- Create a HashTable variable on the stack
- Use
zend_hash_init
to initialize that HashTable; andzend_hash_destroy
to destroy it. - Fill the HashTable with the unicode codepoints of
what
as keys and an empty value as value usingzend_hash_index_add_new
. Do this once before the loop. - Pass the HashTable instead of a string to
trim_each_wchar
- In
is_trim_wchar
you can usezend_hash_index_exists
to check if the unicode codepoint is in the HashTable. This takes constant time so it's much faster than a loop. - In
trim_each_wchar
it's not necessary anymore to repeatedly decodewhat
, because the HashTable now takes that role.
Note that it's then no longer necessary too to encode the trim_default_chars
array in mb_trim_default_chars
to a zend_string. Instead you can fill in the HashTable directly with a for loop.
Let me know if you need help 🙂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you very much!😊 This is make sense.
I was replace to HashTable
. If any mistake, feel free comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, test failed when enable JIT. What's happened?
@alexdowad Thank you for your advice. I'll try put from |
@youkidearitai I think you don't have any test case where |
@alexdowad Thanks, I added empty string when |
I good idea would be to copy all relevant test cases for trim-family functions from https://github.com/php/php-src/tree/master/ext/standard/tests/strings |
@alecpl Thank you! Copied from trim.phpt to mb_trim.phpt. |
I've been looking at this back and forth, and I think this implementation can be simplified and speed up a lot. I would internally always work on unicode codepoints such that you don't have to encode in "mb_trim_default_chars". |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another round of reviewing.
Looks pretty good now, thanks!
I have some minor suggestions for improvements, but overall it looks pretty good.
Can you also add the testcases I wrote in my earlier review to the .phpt file? That makes sure we don't accidentally re-introduce those issues from before when refactoring or optimizing.
@nielsdos Thank you very much for deeply review. Added crashed test case. |
Looks good, thanks. The only remaining improvement I can see is to merge the two for loops to avoid duplicating the codepoint looking in what_ht. Something like this: diff --git a/ext/mbstring/mbstring.c b/ext/mbstring/mbstring.c
index bb82c4b268..7400f3d460 100644
--- a/ext/mbstring/mbstring.c
+++ b/ext/mbstring/mbstring.c
@@ -2968,24 +2968,18 @@ static zend_string* trim_each_wchar(zend_string *str, const HashTable *what_ht,
ZEND_ASSERT(out_len <= 128);
total_len += out_len;
- if (mode & MB_LTRIM) {
- for (size_t i = 0; i < out_len; i++) {
- uint32_t w = wchar_buf[i];
- if (is_trim_wchar(w, what_ht)) {
+ for (size_t i = 0; i < out_len; i++) {
+ uint32_t w = wchar_buf[i];
+ if (is_trim_wchar(w, what_ht)) {
+ if (mode & MB_LTRIM) {
left += 1;
- } else {
- mode ^= MB_LTRIM;
- break;
}
- }
- }
-
- if (mode & MB_RTRIM) {
- for (size_t j = 0; j < out_len; j++) {
- uint32_t w = wchar_buf[j];
- if (is_trim_wchar(w, what_ht)) {
+ if (mode & MB_RTRIM) {
right += 1;
- } else {
+ }
+ } else {
+ mode &= ~MB_LTRIM;
+ if (mode & MB_RTRIM) {
right = 0;
}
}
This makes a minor (8-10%) performance improvement in my test. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, thanks.
@youkidearitai Can you squash this into one commit? |
3ad5121
to
94ac067
Compare
@youkidearitai Thanks very much! |
@alexdowad Thank you very much!
Thanks, again. |
The code looks good, all tests are passing. I would just like to fuzz it a bit. |
Discovered an interesting test case when fuzzing... I thought that |
OK, well, my fuzzer hasn't found any bugs. |
ext/mbstring/mbstring.c
Outdated
}; | ||
size_t trim_default_chars_length = sizeof(trim_default_chars) / sizeof(uint32_t); | ||
mb_convert_buf buf; | ||
mb_convert_buf_init(&buf, trim_default_chars_length, MBSTRG(current_filter_illegal_substchar), MBSTRG(current_filter_illegal_mode)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@youkidearitai Congratulations on submitting and implementing your first RFC!!
I don't think it is necessary to convert trim_default_chars
from an array of wchars to a string, because in trim_each_wchar
, you convert the string back to wchars again.
I think it would be better if trim_each_wchar
received an array of wchars (with length parameter). Then in php_do_mb_strim
, you would need to convert the received string to an array of wchars inside the if (what) {
clause.
Of course, the problem with this is that you can't control how big of a string the caller will pass. It could be huge. So it may or may not fit into a static array (allocated on the stack like what you are doing with what_wchar_buf
). I think you would need to:
- Have a static, stack-allocated array like
what_wchar_buf
- If the
what
string fits completely inwhat_wchar_buf
when converted to wchars, just pass that totrim_each_wchar
. - If the
what
string does not completely fit inwhat_wchar_buf
after conversion, then dynamically allocate a buffer for it and fill that dynamic buffer usingwhat_wchar_buf
. Free the dynamic buffer before returning.
It's unfortunate that you would need to allocate memory when the what
string is big, but it shouldn't usually be big.
If you really don't want to allocate memory when the what
string is big, you could make two versions of trim_each_wchar
... one which receives an array of wchars, and one which receives a string.
I just feel that converting trim_default_chars
to a string and then immediately converting it back to wchars is very wasteful and will make this function unnecessarily slow when the default list of whitespace chars is used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alexdowad It seems that this comment was made for older code. He no longer does the double conversion in the current version.
Z_PARAM_STR(str) | ||
Z_PARAM_OPTIONAL | ||
Z_PARAM_STR_OR_NULL(what) | ||
Z_PARAM_STR_OR_NULL(encoding) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, just thinking about this. Is it true that most mbstring functions take the encoding last as an optional parameter? If so, then this is good. 👍🏻
It just means that if a user wants to pass an explicit encoding, they have to pass the list of whitespace chars as well. If the order was (str, encoding, what)
, then the user could pass an explicit encoding, but use the default value of what
.
What do you think is better?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(If this was already discussed at the RFC stage, then please disregard my comment.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just my 2c: The mbstring functions take encoding as last. If you don't want to pass what
then you can use named arguments:
mb_trim("my string", encoding: "GB18030")
for example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As @nielsdos said, I am assuming to use mb_trim("my string", encoding: "GB18030");
ext/mbstring/mbstring.c
Outdated
zend_string_release(str); | ||
zend_string_release(what); | ||
|
||
return mb_get_substr(str, left, total_len + (right - left), enc); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Important performance optimization: if left == 0 && right == 0
, then just return zend_string_copy(str)
. (zend_string_copy
just increments the refcount of a string.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(I checked the source code for mb_get_substr
to see if it performs the same optimization, and it looks like it doesn't.)
ext/mbstring/mbstring.c
Outdated
static bool is_trim_wchar(uint32_t w, uint32_t *characters, size_t characters_len) | ||
{ | ||
for (size_t i = 0; i < characters_len; i++) { | ||
if (w == characters[i]) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The number of characters would have to be a lot bigger than 1 to make building a hashtable worthwhile. I would guess at least 20, but of course, a benchmark would be the best way to find out.
@youkidearitai I just remembered, something should be added to NEWS for this change. Then I guess I can go ahead and merge into |
94ac067
to
9dbe715
Compare
@alexdowad Thanks, Added |
It is not NEWS that needs to be updated but UPGRADING. |
Co-authored-by: Niels Dossche <7771979+nielsdos@users.noreply.github.com> Co-authored-by: Gina Peter Banyard <girgias@php.net>
9dbe715
to
60db3fb
Compare
@Girgias Thanks. Added |
OK, well, I really think this is ready to merge now... |
Landed on Thanks very, very much to @youkidearitai and all the others! |
Thanks very, very much for all reviewers! |
Co-authored-by: Niels Dossche <7771979+nielsdos@users.noreply.github.com> Co-authored-by: Gina Peter Banyard <girgias@php.net>
Related: #9216
I make a propose a multibyte version of the trim function.
On the other hand, The ".." notation for
$characters
that was in the trim function was not supported. ex:\u{0000}..\u{FFFF}
Because the reason is below:
At later, I'll writing to RFC. Just a moment.RFC: https://wiki.php.net/rfc/mb_trim