Skip to content

[RFC] Add trim for multibyte function #12459

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

youkidearitai
Copy link
Contributor

@youkidearitai youkidearitai commented Oct 17, 2023

Related: #9216

I make a propose a multibyte version of the trim function.

function mb_trim(string $string, string $characters = " \f\n\r\t\v\x00\u{00A0}\u{1680}\u{2000}\u{2001}\u{2002}\u{2003}\u{2004}\u{2005}\u{2006}\u{2007}\u{2008}\u{2009}\u{200A}\u{2028}\u{2029}\u{202F}\u{205F}\u{3000}\u{0085}\u{180E}", ?string $encoding = null): string {}
function mb_ltrim(string $string, string $characters = " \f\n\r\t\v\x00\u{00A0}\u{1680}\u{2000}\u{2001}\u{2002}\u{2003}\u{2004}\u{2005}\u{2006}\u{2007}\u{2008}\u{2009}\u{200A}\u{2028}\u{2029}\u{202F}\u{205F}\u{3000}\u{0085}\u{180E}", ?string $encoding = null): string {}
function mb_rtrim(string $string, string $characters = " \f\n\r\t\v\x00\u{00A0}\u{1680}\u{2000}\u{2001}\u{2002}\u{2003}\u{2004}\u{2005}\u{2006}\u{2007}\u{2008}\u{2009}\u{200A}\u{2028}\u{2029}\u{202F}\u{205F}\u{3000}\u{0085}\u{180E}", ?string $encoding = null): string {}

On the other hand, The ".." notation for $characters that was in the trim function was not supported. ex: \u{0000}..\u{FFFF}
Because the reason is below:

  • Unicode character is very wide
    • Difficult to search
    • Difficult to store in memory

At later, I'll writing to RFC. Just a moment.
RFC: https://wiki.php.net/rfc/mb_trim

@youkidearitai
Copy link
Contributor Author

memo (JST is 0:00 am, I'll fix later):

  • Fix test
  • Fix warning

@kbergha
Copy link

kbergha commented Oct 19, 2023

Have you considered also adding U+FEFF / ZERO WIDTH NO-BREAK / "Byte order mark" as mentioned in #9216 (comment) ?

@sztanpet
Copy link

Maybe it would be best to define whitespace similarly to Go
so basically defining it as the Z Pattern_White_Space property PLUS the ascii '\t', '\n', '\v', '\f', '\r', ' ', U+0085 (NEL), U+00A0 (NBSP)

This would make it somewhat future-proof from a maintenance point whenever unicode changes what whitespace means

@youkidearitai
Copy link
Contributor Author

@kbergha Thank you very much. Please let me consider it. I was thinking only about regex \s.

@youkidearitai
Copy link
Contributor Author

@sztanpet Thank you for your comment. I was thinking about use case of preg_replace(“/^\s+|\s+$/u”, '', $string);. Therefore, I feel like Go language's IsSpace is not enough.

@youkidearitai
Copy link
Contributor Author

Have you considered also adding U+FEFF / ZERO WIDTH NO-BREAK / "Byte order mark" as mentioned in #9216 (comment) ?

@kbergha U+FEFF is "ZERO WIDTH NO-BREAK" as well as "BYTE ORDER MARK" at old Unicode. Given this historical background, I think to recommend to use mb_ltrim in each of them.

@youkidearitai
Copy link
Contributor Author

I just want to this function is only "delete space". If delete other character, please use second parameter of $character.

@youkidearitai
Copy link
Contributor Author

@kbergha Use case of "trim BOM", I think would work. I wrote RFC Introduction section.

One of use case is “trim Byte Order Mark”. I think mb_ltrim would be work:
mb_ltrim($string, "\u{FEFF}\u{FFFE}");

@youkidearitai youkidearitai marked this pull request as ready for review November 18, 2023 00:52
@youkidearitai
Copy link
Contributor Author

mb_trim, mb_ltrim and mb_rtrim is accepted (https://wiki.php.net/rfc/mb_trim).
Change from Draft Pull Request to Normal Pull Request.

Thanks.

@alexdowad
Copy link
Contributor

@youkidearitai Aside from the points I mentioned above, I think the other important thing is to fuzz this code. Maybe add a bit of code to sapi/fuzzer/fuzzer-mbstring.c so that aside from testing conversion, it also tries passing its test string to mb_trim.

Even if you don't add any assertions, this will still help to catch memory leaks, bad pointer accesses, etc.

@alexdowad
Copy link
Contributor

Any other comments from @Girgias? I see that @nielsdos and @mvorisek have already reviewed.

@alexdowad
Copy link
Contributor

@youkidearitai One more thing... it would be good to have a few test cases for what happens when invalid strings (bad encoding) are passed to mb_trim.

Copy link
Member

@nielsdos nielsdos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments and questions :)

static bool is_trim_wchar(uint32_t w, uint32_t *characters, size_t characters_len)
{
for (size_t i = 0; i < characters_len; i++) {
if (w == characters[i]) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This loop could get very slow for every character, I wonder if it makes sense to build a HashTable if the nr of characters is > 1

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The number of characters would have to be a lot bigger than 1 to make building a hashtable worthwhile. I would guess at least 20, but of course, a benchmark would be the best way to find out.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uum.. I think this problem, but I couldn't find a way.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps the distinction between 1 character and >1 character shouldn't be made.

Anyway, I still think a HashTable would be a good idea, combined with avoiding decoding the what string repeatedly. This would improve performance.

To do so:

  • Create a HashTable variable on the stack
  • Use zend_hash_init to initialize that HashTable; and zend_hash_destroy to destroy it.
  • Fill the HashTable with the unicode codepoints of what as keys and an empty value as value using zend_hash_index_add_new. Do this once before the loop.
  • Pass the HashTable instead of a string to trim_each_wchar
  • In is_trim_wchar you can use zend_hash_index_exists to check if the unicode codepoint is in the HashTable. This takes constant time so it's much faster than a loop.
  • In trim_each_wchar it's not necessary anymore to repeatedly decode what, because the HashTable now takes that role.

Note that it's then no longer necessary too to encode the trim_default_chars array in mb_trim_default_chars to a zend_string. Instead you can fill in the HashTable directly with a for loop.

Let me know if you need help 🙂

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much!😊 This is make sense.
I was replace to HashTable. If any mistake, feel free comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, test failed when enable JIT. What's happened?

@TimWolla TimWolla added the RFC label Nov 18, 2023
@youkidearitai
Copy link
Contributor Author

@youkidearitai Aside from the points I mentioned above, I think the other important thing is to fuzz this code. Maybe add a bit of code to sapi/fuzzer/fuzzer-mbstring.c so that aside from testing conversion, it also tries passing its test string to mb_trim.

@alexdowad Thank you for your advice. I'll try put from trim_each_wchar(and mb_trim_default_chars) code to sapi/fuzzer/fuzzer-mbstring.c.
And put some test code.

@alexdowad
Copy link
Contributor

@youkidearitai I think you don't have any test case where str is an empty string; that is a basic test which should be included for any string-processing function.

@youkidearitai
Copy link
Contributor Author

@alexdowad Thanks, I added empty string when $string is empty.

@alecpl
Copy link

alecpl commented Nov 19, 2023

I good idea would be to copy all relevant test cases for trim-family functions from https://github.com/php/php-src/tree/master/ext/standard/tests/strings

@youkidearitai
Copy link
Contributor Author

@alecpl Thank you! Copied from trim.phpt to mb_trim.phpt.

@nielsdos
Copy link
Member

I've been looking at this back and forth, and I think this implementation can be simplified and speed up a lot.
You currently repeatedly decode the "what" string into unicode codepoints, and you encode the codepoints into a string in "mb_trim_default_chars".

I would internally always work on unicode codepoints such that you don't have to encode in "mb_trim_default_chars".
I would then decode once upfront the "what" string in php_do_mb_trim in the if (what) { case.

Copy link
Member

@nielsdos nielsdos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another round of reviewing.
Looks pretty good now, thanks!

I have some minor suggestions for improvements, but overall it looks pretty good.
Can you also add the testcases I wrote in my earlier review to the .phpt file? That makes sure we don't accidentally re-introduce those issues from before when refactoring or optimizing.

@youkidearitai
Copy link
Contributor Author

@nielsdos Thank you very much for deeply review. Added crashed test case.

@nielsdos
Copy link
Member

Looks good, thanks. The only remaining improvement I can see is to merge the two for loops to avoid duplicating the codepoint looking in what_ht.

Something like this:

diff --git a/ext/mbstring/mbstring.c b/ext/mbstring/mbstring.c
index bb82c4b268..7400f3d460 100644
--- a/ext/mbstring/mbstring.c
+++ b/ext/mbstring/mbstring.c
@@ -2968,24 +2968,18 @@ static zend_string* trim_each_wchar(zend_string *str, const HashTable *what_ht,
 		ZEND_ASSERT(out_len <= 128);
 		total_len += out_len;
 
-		if (mode & MB_LTRIM) {
-			for (size_t i = 0; i < out_len; i++) {
-				uint32_t w = wchar_buf[i];
-				if (is_trim_wchar(w, what_ht)) {
+		for (size_t i = 0; i < out_len; i++) {
+			uint32_t w = wchar_buf[i];
+			if (is_trim_wchar(w, what_ht)) {
+				if (mode & MB_LTRIM) {
 					left += 1;
-				} else {
-					mode ^= MB_LTRIM;
-					break;
 				}
-			}
-		}
-
-		if (mode & MB_RTRIM) {
-			for (size_t j = 0; j < out_len; j++) {
-				uint32_t w = wchar_buf[j];
-				if (is_trim_wchar(w, what_ht)) {
+				if (mode & MB_RTRIM) {
 					right += 1;
-				} else {
+				}
+			} else {
+				mode &= ~MB_LTRIM;
+				if (mode & MB_RTRIM) {
 					right = 0;
 				}
 			}

This makes a minor (8-10%) performance improvement in my test.

Copy link
Member

@nielsdos nielsdos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, thanks.

@alexdowad
Copy link
Contributor

@youkidearitai Can you squash this into one commit?

@alexdowad
Copy link
Contributor

@youkidearitai Thanks very much!
I will try to pull this very soon, build, and run the fuzzer.

@youkidearitai
Copy link
Contributor Author

@alexdowad Thank you very much!

I will try to pull this very soon, build, and run the fuzzer.

Thanks, again.

@alexdowad
Copy link
Contributor

The code looks good, all tests are passing.

I would just like to fuzz it a bit.

@alexdowad
Copy link
Contributor

Discovered an interesting test case when fuzzing... mb_trim("\x0E\x2D") in encoding JIS.

I thought that mb_trim would always make strings shorter (in byte length), but this one comes out longer after "trimming", because while the original string uses 'shift in' to encode a JISX 0201 character, the output string uses an "ESC ( I" sequence.

@alexdowad
Copy link
Contributor

OK, well, my fuzzer hasn't found any bugs.

};
size_t trim_default_chars_length = sizeof(trim_default_chars) / sizeof(uint32_t);
mb_convert_buf buf;
mb_convert_buf_init(&buf, trim_default_chars_length, MBSTRG(current_filter_illegal_substchar), MBSTRG(current_filter_illegal_mode));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@youkidearitai Congratulations on submitting and implementing your first RFC!!

I don't think it is necessary to convert trim_default_chars from an array of wchars to a string, because in trim_each_wchar, you convert the string back to wchars again.

I think it would be better if trim_each_wchar received an array of wchars (with length parameter). Then in php_do_mb_strim, you would need to convert the received string to an array of wchars inside the if (what) { clause.

Of course, the problem with this is that you can't control how big of a string the caller will pass. It could be huge. So it may or may not fit into a static array (allocated on the stack like what you are doing with what_wchar_buf). I think you would need to:

  • Have a static, stack-allocated array like what_wchar_buf
  • If the what string fits completely in what_wchar_buf when converted to wchars, just pass that to trim_each_wchar.
  • If the what string does not completely fit in what_wchar_buf after conversion, then dynamically allocate a buffer for it and fill that dynamic buffer using what_wchar_buf. Free the dynamic buffer before returning.

It's unfortunate that you would need to allocate memory when the what string is big, but it shouldn't usually be big.

If you really don't want to allocate memory when the what string is big, you could make two versions of trim_each_wchar... one which receives an array of wchars, and one which receives a string.

I just feel that converting trim_default_chars to a string and then immediately converting it back to wchars is very wasteful and will make this function unnecessarily slow when the default list of whitespace chars is used.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alexdowad It seems that this comment was made for older code. He no longer does the double conversion in the current version.

Z_PARAM_STR(str)
Z_PARAM_OPTIONAL
Z_PARAM_STR_OR_NULL(what)
Z_PARAM_STR_OR_NULL(encoding)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, just thinking about this. Is it true that most mbstring functions take the encoding last as an optional parameter? If so, then this is good. 👍🏻

It just means that if a user wants to pass an explicit encoding, they have to pass the list of whitespace chars as well. If the order was (str, encoding, what), then the user could pass an explicit encoding, but use the default value of what.

What do you think is better?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(If this was already discussed at the RFC stage, then please disregard my comment.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just my 2c: The mbstring functions take encoding as last. If you don't want to pass what then you can use named arguments:
mb_trim("my string", encoding: "GB18030") for example.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As @nielsdos said, I am assuming to use mb_trim("my string", encoding: "GB18030");

zend_string_release(str);
zend_string_release(what);

return mb_get_substr(str, left, total_len + (right - left), enc);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Important performance optimization: if left == 0 && right == 0, then just return zend_string_copy(str). (zend_string_copy just increments the refcount of a string.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I checked the source code for mb_get_substr to see if it performs the same optimization, and it looks like it doesn't.)

static bool is_trim_wchar(uint32_t w, uint32_t *characters, size_t characters_len)
{
for (size_t i = 0; i < characters_len; i++) {
if (w == characters[i]) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The number of characters would have to be a lot bigger than 1 to make building a hashtable worthwhile. I would guess at least 20, but of course, a benchmark would be the best way to find out.

@alexdowad
Copy link
Contributor

@youkidearitai I just remembered, something should be added to NEWS for this change.

Then I guess I can go ahead and merge into master?

@youkidearitai
Copy link
Contributor Author

@alexdowad Thanks, Added NEWS about mb_trim.

@Girgias
Copy link
Member

Girgias commented Nov 24, 2023

It is not NEWS that needs to be updated but UPGRADING.

Co-authored-by: Niels Dossche <7771979+nielsdos@users.noreply.github.com>
Co-authored-by: Gina Peter Banyard <girgias@php.net>
@youkidearitai
Copy link
Contributor Author

@Girgias Thanks. Added UPGRADING

@alexdowad
Copy link
Contributor

OK, well, I really think this is ready to merge now...

@alexdowad
Copy link
Contributor

Landed on master. If anyone notices any other improvements which can be made, please open another PR.

Thanks very, very much to @youkidearitai and all the others!

@alexdowad alexdowad closed this Nov 24, 2023
@youkidearitai
Copy link
Contributor Author

Thanks very, very much for all reviewers!

@youkidearitai youkidearitai deleted the mb_trim branch November 24, 2023 07:51
youkidearitai referenced this pull request Nov 24, 2023
Co-authored-by: Niels Dossche <7771979+nielsdos@users.noreply.github.com>
Co-authored-by: Gina Peter Banyard <girgias@php.net>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants