Mark certain strings as valid utf8 based on operations #10463

Girgias · 2023-01-27T16:57:51Z

This marks integers or floats cast to string, and strings that are the result of concatenating two UTF-8 strings as valid UTF-8.

A very crude (and terrible) benchmark seems to show an order of magnitude improved performance in validating the string is UTF-8:

<?php
const COPY_TIMES = 10_000;
$string = "a";

$string_padded = str_repeat($string, COPY_TIMES);

$string_concat = $string;
for ($i = 1; $i < COPY_TIMES; $i++) {
    $string_concat .= $string;
}

$tp = hrtime(true);
mb_check_encoding($string_padded , "UTF-8");
$tp_total = hrtime(true) - $tp;
var_dump($tp_total);

$tc = hrtime(true);
mb_check_encoding($string_concat , "UTF-8");
$tc_total = hrtime(true) - $tc;
var_dump($tc_total);

It probably makes some sense to performance test this more via micro benchmarks, but just want to get opinions if marking such strings makes sense in the first place.

mvorisek · 2023-01-27T17:05:05Z

Zend/zend_operators.c

@@ -1984,6 +1984,9 @@ ZEND_API zend_result ZEND_FASTCALL concat_function(zval *result, zval *op1, zval

 		memcpy(ZSTR_VAL(result_str) + op1_len, Z_STRVAL_P(op2), op2_len);
 		ZSTR_VAL(result_str)[result_len] = '\0';
+		if (ZSTR_IS_VALID_UTF8(Z_STR_P(op1)) && ZSTR_IS_VALID_UTF8(Z_STR_P(op2))) {


is this method called also from functions like implode?

also it would be great to test the flags by a test, if the flag is not available in userland, it should be possible to expose it in zend_test

No, implode() uses a custom algorithm, similar to str_pad() adding those flags to these functions shouldn't be too difficult. Having a function in zend_test to check the flag is set is a good idea.

I agree having a function in zend_test to test the flag is a good idea.

alexdowad · 2023-01-27T17:24:32Z

This is great, I like it!

alexdowad

Thanks very much!

cmb69 · 2023-01-27T17:45:07Z

Zend/zend_operators.c

@@ -3218,7 +3221,9 @@ ZEND_API zend_string* ZEND_FASTCALL zend_long_to_str(zend_long num) /* {{{ */
 	} else {
 		char buf[MAX_LENGTH_OF_LONG + 1];
 		char *res = zend_print_long_to_buf(buf + sizeof(buf) - 1, num);
-		return zend_string_init(res, buf + sizeof(buf) - 1 - res, 0);
+		zend_string *str =  zend_string_init(res, buf + sizeof(buf) - 1 - res, 0);
+		GC_ADD_FLAGS(str, IS_STR_VALID_UTF8);


Is this assumption valid even for files with arbitrary charsets/encodings, and arbitrary settings of zend.multibyte? Genuinely asking, since I've never used anything but UTF-8.

I'm not sure if I totally understand the question.

AFAIK, numbers in UTF-16 would be encoded over two bytes. But looking at the implementation of zend_print_long_to_buf() it seems to just create an ASCII string and doesn't care about zend_multibyte.

However, my understanding of what zend_multibyte does is to convert the current script file into an internal encoding to process the file, which is most likely going to be UTF-8. But that's a good question I don't know the answer. But it seems like we assume anyway that the encoding is going to be ASCII compatible

Yeah, it looks like these functions always return ASCII strings.

I don't think this is related at all to files with arbitrary charsets; yes, string literals in such files might use some other text encoding, but this is not about string literals. There is no reason why PHP would use zend_long_to_str to create string objects for string literals embedded in a source file.

staabm · 2023-01-28T07:54:41Z

Similar but different idea: could strings returned from json_decode also be marked as valid utf8, since json itself by definition is utf8?

idea2: should strings returned from often used mb functions, like mb_substr also be marked as utf8, when the input string is utf8 and no $encoding param is used?

alexdowad · 2023-01-28T11:17:49Z

idea2: should strings returned from often used mb functions, like mb_substr also be marked as utf8, when the input string is utf8 and no $encoding param is used?

This has already been implemented. It's actually more than what you have suggested; basically any time when mbstring is generating UTF-8 strings, it marks them as 'valid UTF-8'. For example, mb_convert_encoding($str, 'UTF-8', 'SJIS') and so on.

ext/zend_test/tests/strings_marked_as_utf8.phpt

Girgias · 2023-01-30T15:45:45Z

@nielsdos do you think the flag also needs to be propagated in the zend_set_str_gc_flags() macro in ext/opcache/zend_persist.c?

nielsdos · 2023-01-30T16:42:02Z

@nielsdos do you think the flag also needs to be propagated in the zend_set_str_gc_flags() macro in ext/opcache/zend_persist.c?

I think so because otherwise persisted strings by opcache would not get the utf8 flag back it they are loaded. So you may need to add an additional argument to that macro so you can pass in whether the string was utf8 or not.
I only had a brief look at that code though.

youkidearitai · 2023-01-30T17:20:50Z

Hi, I have a question. I create 7-bit character encoding (e.g. ISO-2022-JP(JIS)), can valid UTF-8 strings. Please see below.
This gist (phpt) file output "あ" ISO-2022-JP. Is this right behavior?
https://gist.github.com/youkidearitai/0965634060e714c1b2f6439f2ec08fe1

alexdowad · 2023-01-30T17:27:53Z

Hi, I have a question. I create 7-bit character encoding (e.g. ISO-2022-JP(JIS)), can valid UTF-8 strings. Please see below. This gist (phpt) file output "あ" ISO-2022-JP. Is this right behavior? https://gist.github.com/youkidearitai/0965634060e714c1b2f6439f2ec08fe1

Thanks for asking.

All bytes from 0x00-7F are valid single-byte characters in UTF-8. So yes, "\x1B\x24\x42" is a valid UTF-8 string. So is "\x1B\x24\x22".

Girgias · 2023-01-31T03:54:29Z

Current test failures with the tracing JIT are:

Concatenation in loop (compound assignment):
036+ bool(false)
036- bool(true)

Hopefully this was just an issue with the persist flag, as I can't reproduce this locally (well I've been using the function JIT which worked well before as it was an Opcache but if it's not that I might try to run the tracing JIT actually...)

~~The issue seems to be in the definitions of zend_jit_concat() in zend_jit_x86.dasc and zend_jit_arm64.dasc~~

The issue was that in zend_jit_fast_concat_tmp_helper() and zend_jit_fast_assign_concat_helper() the UTF-8 flag was added just before the do-while() loop ended, which means when it was exited early the flag was not propagated.

Also add test to check what strings are marked as having the flag

The UTF-8 valid flag needs to be copied upon interning, otherwise strings that are concatenated at compile time lose this information. However, if previously this string was interned without the flag it is not added E.g. in the case the string is an existing class name. Co-authored-by: Niels Dossche <7771979+nielsdos@users.noreply.github.com>

Girgias · 2023-01-31T18:31:36Z

Hopefully all issues are resolved now and I've squashed and rebased the PR so that the commits are independent and handle all stuff together.

staabm · 2023-02-01T08:46:25Z

This has already been implemented. It's actually more than what you have suggested; basically any time when mbstring is generating UTF-8 strings, it marks them as 'valid UTF-8'. For example, mb_convert_encoding($str, 'UTF-8', 'SJIS') and so on.

thanks for the details @alexdowad . is it also done for non-mb functions like substr, str_pad, str_repeat and friends?

alexdowad · 2023-02-01T08:51:41Z

thanks for the details @alexdowad . is it also done for non-mb functions like substr, str_pad, str_repeat and friends?

No, it is not.

str_repeat is a very good candidate for adding this feature. So is str_pad. For substr, you would need to check that the input string is UTF-8 and that the cut points don't fall in the middle of a multi-byte character.

alexdowad · 2023-02-01T08:55:04Z

Thanks for the hard work, @Girgias.

alexdowad · 2023-02-01T08:57:40Z

str_repeat is a very good candidate for adding this feature. So is str_pad. For substr, you would need to check that the input string is UTF-8 and that the cut points don't fall in the middle of a multi-byte character.

I think this is a very inexpensive check (for substr)... if we know that the input string is valid UTF-8, just check the byte immediately following each cut point and make sure that (c & 0xC0) != 0x80, or equivalently, ((int8_t)c) >= -64.

alexdowad · 2023-02-02T12:24:00Z

Thanks! I will need to update my code now to use the new macro.

alexdowad · 2023-02-03T09:20:45Z

str_repeat is a very good candidate for adding this feature. So is str_pad. For substr, you would need to check that the input string is UTF-8 and that the cut points don't fall in the middle of a multi-byte character.

Hmm, maybe adding this feature to str_pad isn't such a good idea after all.

str_pad does not necessarily add the entire pad string, but just as many bytes as it needs to pad out the string to the requested number of bytes.

That could easily cause it to cut a multi-byte char in the middle, resulting in invalid UTF-8.

Girgias requested a review from alexdowad January 27, 2023 16:57

Girgias added the Category: Strings label Jan 27, 2023

github-actions bot added the Category: Engine label Jan 27, 2023

mvorisek reviewed Jan 27, 2023

View reviewed changes

alexdowad approved these changes Jan 27, 2023

View reviewed changes

cmb69 reviewed Jan 27, 2023

View reviewed changes

Girgias force-pushed the mark-string-casts-utf8 branch from 5d983b2 to c76a25d Compare January 27, 2023 22:24

github-actions bot added the Extension: zend_test label Jan 27, 2023

github-actions bot added the Extension: opcache label Jan 28, 2023

Girgias force-pushed the mark-string-casts-utf8 branch from 47c4de3 to 2dcafb9 Compare January 29, 2023 17:01

mvorisek reviewed Jan 30, 2023

View reviewed changes

ext/zend_test/tests/strings_marked_as_utf8.phpt Show resolved Hide resolved

mvorisek reviewed Jan 30, 2023

View reviewed changes

ext/zend_test/tests/strings_marked_as_utf8.phpt Show resolved Hide resolved

Girgias force-pushed the mark-string-casts-utf8 branch from 2dcafb9 to fbd6799 Compare January 30, 2023 15:44

Add macro to check zend_string is marked as valid UTF-8

b611d6d

Girgias force-pushed the mark-string-casts-utf8 branch from fbd6799 to 7ca082f Compare January 31, 2023 03:51

Girgias and others added 3 commits January 31, 2023 18:28

Add function in zend_test to check UTF8 flag is added

e0ac69d

Also add test to check what strings are marked as having the flag

Mark numeric strings as valid UTF-8

6ac7656

Girgias force-pushed the mark-string-casts-utf8 branch from 7ca082f to 9f6134a Compare January 31, 2023 18:28

Girgias merged commit 64127b6 into php:master Feb 2, 2023

Girgias deleted the mark-string-casts-utf8 branch February 2, 2023 12:02

mvorisek mentioned this pull request Mar 4, 2023

Imply UTF8 validity in implode function #10780

Merged

Girgias mentioned this pull request Mar 26, 2023

Propagate UTF-8 flag during Rope operations #10915

Merged

Mark certain strings as valid utf8 based on operations #10463

Mark certain strings as valid utf8 based on operations #10463

Uh oh!

Conversation

Girgias commented Jan 27, 2023

Uh oh!

mvorisek Jan 27, 2023

Choose a reason for hiding this comment

Uh oh!

Girgias Jan 27, 2023

Choose a reason for hiding this comment

Uh oh!

alexdowad Jan 27, 2023

Choose a reason for hiding this comment

Uh oh!

alexdowad commented Jan 27, 2023

Uh oh!

alexdowad left a comment

Choose a reason for hiding this comment

Uh oh!

cmb69 Jan 27, 2023

Choose a reason for hiding this comment

Uh oh!

Girgias Jan 27, 2023

Choose a reason for hiding this comment

Uh oh!

alexdowad Jan 27, 2023

Choose a reason for hiding this comment

Uh oh!

staabm commented Jan 28, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

alexdowad commented Jan 28, 2023

Uh oh!

Uh oh!

Uh oh!

Girgias commented Jan 30, 2023

Uh oh!

nielsdos commented Jan 30, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

youkidearitai commented Jan 30, 2023

Uh oh!

alexdowad commented Jan 30, 2023

Uh oh!

Girgias commented Jan 31, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Girgias commented Jan 31, 2023

Uh oh!

staabm commented Feb 1, 2023

Uh oh!

alexdowad commented Feb 1, 2023

Uh oh!

alexdowad commented Feb 1, 2023

Uh oh!

alexdowad commented Feb 1, 2023

Uh oh!

alexdowad commented Feb 2, 2023

Uh oh!

alexdowad commented Feb 3, 2023

Uh oh!

Uh oh!

staabm commented Jan 28, 2023 •

edited

Loading

nielsdos commented Jan 30, 2023 •

edited

Loading

Girgias commented Jan 31, 2023 •

edited

Loading