[RFC] Add trim for multibyte function #12459

youkidearitai · 2023-10-17T15:01:15Z

Related: #9216

I make a propose a multibyte version of the trim function.

function mb_trim(string $string, string $characters = " \f\n\r\t\v\x00\u{00A0}\u{1680}\u{2000}\u{2001}\u{2002}\u{2003}\u{2004}\u{2005}\u{2006}\u{2007}\u{2008}\u{2009}\u{200A}\u{2028}\u{2029}\u{202F}\u{205F}\u{3000}\u{0085}\u{180E}", ?string $encoding = null): string {}
function mb_ltrim(string $string, string $characters = " \f\n\r\t\v\x00\u{00A0}\u{1680}\u{2000}\u{2001}\u{2002}\u{2003}\u{2004}\u{2005}\u{2006}\u{2007}\u{2008}\u{2009}\u{200A}\u{2028}\u{2029}\u{202F}\u{205F}\u{3000}\u{0085}\u{180E}", ?string $encoding = null): string {}
function mb_rtrim(string $string, string $characters = " \f\n\r\t\v\x00\u{00A0}\u{1680}\u{2000}\u{2001}\u{2002}\u{2003}\u{2004}\u{2005}\u{2006}\u{2007}\u{2008}\u{2009}\u{200A}\u{2028}\u{2029}\u{202F}\u{205F}\u{3000}\u{0085}\u{180E}", ?string $encoding = null): string {}

On the other hand, The ".." notation for $characters that was in the trim function was not supported. ex: \u{0000}..\u{FFFF}
Because the reason is below:

Unicode character is very wide
- Difficult to search
- Difficult to store in memory

~~At later, I'll writing to RFC. Just a moment.~~
RFC: https://wiki.php.net/rfc/mb_trim

youkidearitai · 2023-10-17T15:34:40Z

memo (JST is 0:00 am, I'll fix later):

Fix test
Fix warning

ext/mbstring/mbstring.c

kbergha · 2023-10-19T13:41:59Z

Have you considered also adding U+FEFF / ZERO WIDTH NO-BREAK / "Byte order mark" as mentioned in #9216 (comment) ?

sztanpet · 2023-10-19T14:13:38Z

Maybe it would be best to define whitespace similarly to Go
so basically defining it as the Z Pattern_White_Space property PLUS the ascii '\t', '\n', '\v', '\f', '\r', ' ', U+0085 (NEL), U+00A0 (NBSP)

This would make it somewhat future-proof from a maintenance point whenever unicode changes what whitespace means

youkidearitai · 2023-10-19T14:50:44Z

@kbergha Thank you very much. Please let me consider it. I was thinking only about regex \s.

youkidearitai · 2023-10-19T14:54:00Z

@sztanpet Thank you for your comment. I was thinking about use case of preg_replace(“/^\s+|\s+$/u”, '', $string);. Therefore, I feel like Go language's IsSpace is not enough.

youkidearitai · 2023-10-20T08:09:23Z

Have you considered also adding U+FEFF / ZERO WIDTH NO-BREAK / "Byte order mark" as mentioned in #9216 (comment) ?

@kbergha U+FEFF is "ZERO WIDTH NO-BREAK" as well as "BYTE ORDER MARK" at old Unicode. Given this historical background, I think to recommend to use mb_ltrim in each of them.

youkidearitai · 2023-10-20T08:18:57Z

I just want to this function is only "delete space". If delete other character, please use second parameter of $character.

ext/mbstring/mbstring_arginfo.h

youkidearitai · 2023-10-27T08:36:11Z

@kbergha Use case of "trim BOM", I think would work. I wrote RFC Introduction section.

One of use case is “trim Byte Order Mark”. I think mb_ltrim would be work:
mb_ltrim($string, "\u{FEFF}\u{FFFE}");

youkidearitai · 2023-11-18T00:54:34Z

mb_trim, mb_ltrim and mb_rtrim is accepted (https://wiki.php.net/rfc/mb_trim).
Change from Draft Pull Request to Normal Pull Request.

Thanks.

alexdowad · 2023-11-18T11:29:59Z

@youkidearitai Aside from the points I mentioned above, I think the other important thing is to fuzz this code. Maybe add a bit of code to sapi/fuzzer/fuzzer-mbstring.c so that aside from testing conversion, it also tries passing its test string to mb_trim.

Even if you don't add any assertions, this will still help to catch memory leaks, bad pointer accesses, etc.

alexdowad · 2023-11-18T11:31:11Z

Any other comments from @Girgias? I see that @nielsdos and @mvorisek have already reviewed.

alexdowad · 2023-11-18T11:32:56Z

@youkidearitai One more thing... it would be good to have a few test cases for what happens when invalid strings (bad encoding) are passed to mb_trim.

nielsdos

Some comments and questions :)

ext/mbstring/mbstring.c

nielsdos · 2023-11-18T12:20:38Z

ext/mbstring/mbstring.c

+static bool is_trim_wchar(uint32_t w, uint32_t *characters, size_t characters_len)
+{
+	for (size_t i = 0; i < characters_len; i++) {
+		if (w == characters[i]) {


This loop could get very slow for every character, I wonder if it makes sense to build a HashTable if the nr of characters is > 1

The number of characters would have to be a lot bigger than 1 to make building a hashtable worthwhile. I would guess at least 20, but of course, a benchmark would be the best way to find out.

Uum.. I think this problem, but I couldn't find a way.

Perhaps the distinction between 1 character and >1 character shouldn't be made.

Anyway, I still think a HashTable would be a good idea, combined with avoiding decoding the what string repeatedly. This would improve performance.

To do so:

Create a HashTable variable on the stack

Use zend_hash_init to initialize that HashTable; and zend_hash_destroy to destroy it.

Fill the HashTable with the unicode codepoints of what as keys and an empty value as value using zend_hash_index_add_new. Do this once before the loop.

Pass the HashTable instead of a string to trim_each_wchar

In is_trim_wchar you can use zend_hash_index_exists to check if the unicode codepoint is in the HashTable. This takes constant time so it's much faster than a loop.

In trim_each_wchar it's not necessary anymore to repeatedly decode what, because the HashTable now takes that role.

Note that it's then no longer necessary too to encode the trim_default_chars array in mb_trim_default_chars to a zend_string. Instead you can fill in the HashTable directly with a for loop.

Let me know if you need help 🙂

Thank you very much!😊 This is make sense.
I was replace to HashTable. If any mistake, feel free comment.

Oh, test failed when enable JIT. What's happened?

youkidearitai · 2023-11-18T13:49:07Z

@youkidearitai Aside from the points I mentioned above, I think the other important thing is to fuzz this code. Maybe add a bit of code to sapi/fuzzer/fuzzer-mbstring.c so that aside from testing conversion, it also tries passing its test string to mb_trim.

@alexdowad Thank you for your advice. I'll try put from trim_each_wchar(and mb_trim_default_chars) code to sapi/fuzzer/fuzzer-mbstring.c.
And put some test code.

ext/mbstring/mbstring.c

ext/mbstring/tests/mb_trim.phpt

alexdowad · 2023-11-19T12:06:23Z

@youkidearitai I think you don't have any test case where str is an empty string; that is a basic test which should be included for any string-processing function.

youkidearitai · 2023-11-19T12:19:11Z

@alexdowad Thanks, I added empty string when $string is empty.

alecpl · 2023-11-19T12:40:20Z

I good idea would be to copy all relevant test cases for trim-family functions from https://github.com/php/php-src/tree/master/ext/standard/tests/strings

youkidearitai · 2023-11-19T14:30:39Z

@alecpl Thank you! Copied from trim.phpt to mb_trim.phpt.

ext/mbstring/mbstring.c

nielsdos · 2023-11-19T16:00:48Z

I've been looking at this back and forth, and I think this implementation can be simplified and speed up a lot.
You currently repeatedly decode the "what" string into unicode codepoints, and you encode the codepoints into a string in "mb_trim_default_chars".

I would internally always work on unicode codepoints such that you don't have to encode in "mb_trim_default_chars".
I would then decode once upfront the "what" string in php_do_mb_trim in the if (what) { case.

ext/mbstring/mbstring.c

nielsdos

Another round of reviewing.
Looks pretty good now, thanks!

I have some minor suggestions for improvements, but overall it looks pretty good.
Can you also add the testcases I wrote in my earlier review to the .phpt file? That makes sure we don't accidentally re-introduce those issues from before when refactoring or optimizing.

ext/mbstring/mbstring.c

youkidearitai · 2023-11-22T08:51:13Z

@nielsdos Thank you very much for deeply review. Added crashed test case.

nielsdos · 2023-11-22T16:21:19Z

Looks good, thanks. The only remaining improvement I can see is to merge the two for loops to avoid duplicating the codepoint looking in what_ht.

Something like this:

diff --git a/ext/mbstring/mbstring.c b/ext/mbstring/mbstring.c
index bb82c4b268..7400f3d460 100644
--- a/ext/mbstring/mbstring.c
+++ b/ext/mbstring/mbstring.c
@@ -2968,24 +2968,18 @@ static zend_string* trim_each_wchar(zend_string *str, const HashTable *what_ht,
 		ZEND_ASSERT(out_len <= 128);
 		total_len += out_len;
 
-		if (mode & MB_LTRIM) {
-			for (size_t i = 0; i < out_len; i++) {
-				uint32_t w = wchar_buf[i];
-				if (is_trim_wchar(w, what_ht)) {
+		for (size_t i = 0; i < out_len; i++) {
+			uint32_t w = wchar_buf[i];
+			if (is_trim_wchar(w, what_ht)) {
+				if (mode & MB_LTRIM) {
 					left += 1;
-				} else {
-					mode ^= MB_LTRIM;
-					break;
 				}
-			}
-		}
-
-		if (mode & MB_RTRIM) {
-			for (size_t j = 0; j < out_len; j++) {
-				uint32_t w = wchar_buf[j];
-				if (is_trim_wchar(w, what_ht)) {
+				if (mode & MB_RTRIM) {
 					right += 1;
-				} else {
+				}
+			} else {
+				mode &= ~MB_LTRIM;
+				if (mode & MB_RTRIM) {
 					right = 0;
 				}
 			}

This makes a minor (8-10%) performance improvement in my test.

nielsdos

Looks good to me, thanks.

ext/mbstring/mbstring.c

alexdowad · 2023-11-23T07:34:02Z

@youkidearitai Can you squash this into one commit?

alexdowad · 2023-11-23T08:49:00Z

@youkidearitai Thanks very much!
I will try to pull this very soon, build, and run the fuzzer.

youkidearitai · 2023-11-23T08:53:11Z

@alexdowad Thank you very much!

I will try to pull this very soon, build, and run the fuzzer.

Thanks, again.

alexdowad · 2023-11-23T17:36:33Z

The code looks good, all tests are passing.

I would just like to fuzz it a bit.

alexdowad · 2023-11-23T18:30:41Z

Discovered an interesting test case when fuzzing... mb_trim("\x0E\x2D") in encoding JIS.

I thought that mb_trim would always make strings shorter (in byte length), but this one comes out longer after "trimming", because while the original string uses 'shift in' to encode a JISX 0201 character, the output string uses an "ESC ( I" sequence.

alexdowad · 2023-11-23T18:35:52Z

OK, well, my fuzzer hasn't found any bugs.

alexdowad · 2023-11-18T11:02:04Z

ext/mbstring/mbstring.c

+	};
+	size_t trim_default_chars_length = sizeof(trim_default_chars) / sizeof(uint32_t);
+	mb_convert_buf buf;
+	mb_convert_buf_init(&buf, trim_default_chars_length, MBSTRG(current_filter_illegal_substchar), MBSTRG(current_filter_illegal_mode));


@youkidearitai Congratulations on submitting and implementing your first RFC!!

I don't think it is necessary to convert trim_default_chars from an array of wchars to a string, because in trim_each_wchar, you convert the string back to wchars again.

I think it would be better if trim_each_wchar received an array of wchars (with length parameter). Then in php_do_mb_strim, you would need to convert the received string to an array of wchars inside the if (what) { clause.

Of course, the problem with this is that you can't control how big of a string the caller will pass. It could be huge. So it may or may not fit into a static array (allocated on the stack like what you are doing with what_wchar_buf). I think you would need to:

Have a static, stack-allocated array like what_wchar_buf

If the what string fits completely in what_wchar_buf when converted to wchars, just pass that to trim_each_wchar.

If the what string does not completely fit in what_wchar_buf after conversion, then dynamically allocate a buffer for it and fill that dynamic buffer using what_wchar_buf. Free the dynamic buffer before returning.

It's unfortunate that you would need to allocate memory when the what string is big, but it shouldn't usually be big.

If you really don't want to allocate memory when the what string is big, you could make two versions of trim_each_wchar... one which receives an array of wchars, and one which receives a string.

I just feel that converting trim_default_chars to a string and then immediately converting it back to wchars is very wasteful and will make this function unnecessarily slow when the default list of whitespace chars is used.

@alexdowad It seems that this comment was made for older code. He no longer does the double conversion in the current version.

alexdowad · 2023-11-18T11:04:47Z

ext/mbstring/mbstring.c

+		Z_PARAM_STR(str)
+		Z_PARAM_OPTIONAL
+		Z_PARAM_STR_OR_NULL(what)
+		Z_PARAM_STR_OR_NULL(encoding)


Hmm, just thinking about this. Is it true that most mbstring functions take the encoding last as an optional parameter? If so, then this is good. 👍🏻

It just means that if a user wants to pass an explicit encoding, they have to pass the list of whitespace chars as well. If the order was (str, encoding, what), then the user could pass an explicit encoding, but use the default value of what.

What do you think is better?

(If this was already discussed at the RFC stage, then please disregard my comment.)

Just my 2c: The mbstring functions take encoding as last. If you don't want to pass what then you can use named arguments:
mb_trim("my string", encoding: "GB18030") for example.

As @nielsdos said, I am assuming to use mb_trim("my string", encoding: "GB18030");

alexdowad · 2023-11-18T11:23:23Z

ext/mbstring/mbstring.c

+	zend_string_release(str);
+	zend_string_release(what);
+
+	return mb_get_substr(str, left, total_len + (right - left), enc);


Important performance optimization: if left == 0 && right == 0, then just return zend_string_copy(str). (zend_string_copy just increments the refcount of a string.)

(I checked the source code for mb_get_substr to see if it performs the same optimization, and it looks like it doesn't.)

ext/mbstring/mbstring.c

alexdowad · 2023-11-18T12:43:03Z

ext/mbstring/mbstring.c

+static bool is_trim_wchar(uint32_t w, uint32_t *characters, size_t characters_len)
+{
+	for (size_t i = 0; i < characters_len; i++) {
+		if (w == characters[i]) {


The number of characters would have to be a lot bigger than 1 to make building a hashtable worthwhile. I would guess at least 20, but of course, a benchmark would be the best way to find out.

ext/mbstring/mbstring.c

alexdowad · 2023-11-23T18:41:22Z

@youkidearitai I just remembered, something should be added to NEWS for this change.

Then I guess I can go ahead and merge into master?

youkidearitai · 2023-11-24T00:10:27Z

@alexdowad Thanks, Added NEWS about mb_trim.

Girgias · 2023-11-24T01:17:37Z

It is not NEWS that needs to be updated but UPGRADING.

Co-authored-by: Niels Dossche <7771979+nielsdos@users.noreply.github.com> Co-authored-by: Gina Peter Banyard <girgias@php.net>

youkidearitai · 2023-11-24T05:08:59Z

@Girgias Thanks. Added UPGRADING

alexdowad · 2023-11-24T06:03:54Z

OK, well, I really think this is ready to merge now...

alexdowad · 2023-11-24T06:06:15Z

Landed on master. If anyone notices any other improvements which can be made, please open another PR.

Thanks very, very much to @youkidearitai and all the others!

youkidearitai · 2023-11-24T06:10:07Z

Thanks very, very much for all reviewers!

Co-authored-by: Niels Dossche <7771979+nielsdos@users.noreply.github.com> Co-authored-by: Gina Peter Banyard <girgias@php.net>

github-actions bot added the Extension: mbstring label Oct 17, 2023

youkidearitai mentioned this pull request Oct 19, 2023

trim support for multibyte spaces / mb_trim() #9216

Closed

nielsdos reviewed Oct 19, 2023

View reviewed changes

ext/mbstring/mbstring.c Outdated Show resolved Hide resolved

mvorisek reviewed Oct 23, 2023

View reviewed changes

ext/mbstring/mbstring_arginfo.h Show resolved Hide resolved

youkidearitai marked this pull request as ready for review November 18, 2023 00:52

youkidearitai requested a review from alexdowad as a code owner November 18, 2023 00:52

nielsdos requested changes Nov 18, 2023

View reviewed changes

ext/mbstring/mbstring.c Outdated Show resolved Hide resolved

ext/mbstring/mbstring.c Outdated Show resolved Hide resolved

ext/mbstring/mbstring.c Outdated Show resolved Hide resolved

ext/mbstring/mbstring.c Outdated Show resolved Hide resolved

nielsdos reviewed Nov 18, 2023

View reviewed changes

TimWolla added the RFC label Nov 18, 2023

alecpl reviewed Nov 19, 2023

View reviewed changes

ext/mbstring/mbstring.c Outdated Show resolved Hide resolved

alecpl reviewed Nov 19, 2023

View reviewed changes

ext/mbstring/tests/mb_trim.phpt Show resolved Hide resolved

nielsdos reviewed Nov 19, 2023

View reviewed changes

ext/mbstring/mbstring.c Outdated Show resolved Hide resolved

nielsdos reviewed Nov 20, 2023

View reviewed changes

ext/mbstring/mbstring.c Outdated Show resolved Hide resolved

nielsdos reviewed Nov 21, 2023

View reviewed changes

Girgias reviewed Nov 22, 2023

View reviewed changes

ext/mbstring/mbstring.c Outdated Show resolved Hide resolved

nielsdos approved these changes Nov 23, 2023

View reviewed changes

ext/mbstring/mbstring.c Outdated Show resolved Hide resolved

youkidearitai force-pushed the mb_trim branch from 3ad5121 to 94ac067 Compare November 23, 2023 08:29

alexdowad approved these changes Nov 23, 2023

View reviewed changes

youkidearitai force-pushed the mb_trim branch from 94ac067 to 9dbe715 Compare November 24, 2023 00:08

Add mb_trim function

60db3fb

Co-authored-by: Niels Dossche <7771979+nielsdos@users.noreply.github.com> Co-authored-by: Gina Peter Banyard <girgias@php.net>

youkidearitai force-pushed the mb_trim branch from 9dbe715 to 60db3fb Compare November 24, 2023 05:08

alexdowad closed this Nov 24, 2023

youkidearitai deleted the mb_trim branch November 24, 2023 07:51

youkidearitai referenced this pull request Nov 24, 2023

Add mb_trim function

a80b6d7

Co-authored-by: Niels Dossche <7771979+nielsdos@users.noreply.github.com> Co-authored-by: Gina Peter Banyard <girgias@php.net>

gnutix mentioned this pull request Jan 5, 2024

Multibyte for ucfirst function #13075

Closed

damianwadley mentioned this pull request Mar 27, 2024

mb_trim() inaccurate $characters default value #13815

Closed

jrfnl mentioned this pull request Sep 15, 2024

PHP 8.4 | NewFunctions: detect use of new mb_*trim() functions (RFC) PHPCompatibility/PHPCompatibility#1766

Merged

[RFC] Add trim for multibyte function #12459

[RFC] Add trim for multibyte function #12459

Uh oh!

Conversation

youkidearitai commented Oct 17, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

youkidearitai commented Oct 17, 2023

Uh oh!

Uh oh!

kbergha commented Oct 19, 2023

Uh oh!

sztanpet commented Oct 19, 2023

Uh oh!

youkidearitai commented Oct 19, 2023

Uh oh!

youkidearitai commented Oct 19, 2023

Uh oh!

youkidearitai commented Oct 20, 2023

Uh oh!

youkidearitai commented Oct 20, 2023

Uh oh!

Uh oh!

youkidearitai commented Oct 27, 2023

Uh oh!

youkidearitai commented Nov 18, 2023

Uh oh!

alexdowad commented Nov 18, 2023

Uh oh!

alexdowad commented Nov 18, 2023

Uh oh!

alexdowad commented Nov 18, 2023

Uh oh!

nielsdos left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

youkidearitai commented Nov 18, 2023

Uh oh!

Uh oh!

Uh oh!

alexdowad commented Nov 19, 2023

Uh oh!

youkidearitai commented Nov 19, 2023

Uh oh!

alecpl commented Nov 19, 2023

Uh oh!

youkidearitai commented Nov 19, 2023

Uh oh!

Uh oh!

nielsdos commented Nov 19, 2023

Uh oh!

Uh oh!

nielsdos left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

youkidearitai commented Nov 22, 2023

youkidearitai commented Oct 17, 2023 •

edited

Loading