-
Notifications
You must be signed in to change notification settings - Fork 7.9k
Mark certain strings as valid utf8 based on operations #10463
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Zend/zend_operators.c
Outdated
@@ -1984,6 +1984,9 @@ ZEND_API zend_result ZEND_FASTCALL concat_function(zval *result, zval *op1, zval | |||
|
|||
memcpy(ZSTR_VAL(result_str) + op1_len, Z_STRVAL_P(op2), op2_len); | |||
ZSTR_VAL(result_str)[result_len] = '\0'; | |||
if (ZSTR_IS_VALID_UTF8(Z_STR_P(op1)) && ZSTR_IS_VALID_UTF8(Z_STR_P(op2))) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this method called also from functions like implode
?
also it would be great to test the flags by a test, if the flag is not available in userland, it should be possible to expose it in zend_test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, implode()
uses a custom algorithm, similar to str_pad()
adding those flags to these functions shouldn't be too difficult. Having a function in zend_test
to check the flag is set is a good idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree having a function in zend_test
to test the flag is a good idea.
This is great, I like it! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks very much!
@@ -3218,7 +3221,9 @@ ZEND_API zend_string* ZEND_FASTCALL zend_long_to_str(zend_long num) /* {{{ */ | |||
} else { | |||
char buf[MAX_LENGTH_OF_LONG + 1]; | |||
char *res = zend_print_long_to_buf(buf + sizeof(buf) - 1, num); | |||
return zend_string_init(res, buf + sizeof(buf) - 1 - res, 0); | |||
zend_string *str = zend_string_init(res, buf + sizeof(buf) - 1 - res, 0); | |||
GC_ADD_FLAGS(str, IS_STR_VALID_UTF8); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this assumption valid even for files with arbitrary charsets/encodings, and arbitrary settings of zend.multibyte
? Genuinely asking, since I've never used anything but UTF-8.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if I totally understand the question.
AFAIK, numbers in UTF-16 would be encoded over two bytes. But looking at the implementation of zend_print_long_to_buf()
it seems to just create an ASCII string and doesn't care about zend_multibyte.
However, my understanding of what zend_multibyte does is to convert the current script file into an internal encoding to process the file, which is most likely going to be UTF-8. But that's a good question I don't know the answer. But it seems like we assume anyway that the encoding is going to be ASCII compatible
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it looks like these functions always return ASCII strings.
I don't think this is related at all to files with arbitrary charsets; yes, string literals in such files might use some other text encoding, but this is not about string literals. There is no reason why PHP would use zend_long_to_str
to create string objects for string literals embedded in a source file.
5d983b2
to
c76a25d
Compare
Similar but different idea: could strings returned from idea2: should strings returned from often used mb functions, like mb_substr also be marked as utf8, when the input string is utf8 and no $encoding param is used? |
This has already been implemented. It's actually more than what you have suggested; basically any time when mbstring is generating UTF-8 strings, it marks them as 'valid UTF-8'. For example, |
47c4de3
to
2dcafb9
Compare
2dcafb9
to
fbd6799
Compare
@nielsdos do you think the flag also needs to be propagated in the |
I think so because otherwise persisted strings by opcache would not get the utf8 flag back it they are loaded. So you may need to add an additional argument to that macro so you can pass in whether the string was utf8 or not. |
Hi, I have a question. I create 7-bit character encoding (e.g. ISO-2022-JP(JIS)), can valid UTF-8 strings. Please see below. |
Thanks for asking. All bytes from 0x00-7F are valid single-byte characters in UTF-8. So yes, |
fbd6799
to
7ca082f
Compare
Current test failures with the tracing JIT are: Concatenation in loop (compound assignment):
036+ bool(false)
036- bool(true)
The issue was that in |
Also add test to check what strings are marked as having the flag
The UTF-8 valid flag needs to be copied upon interning, otherwise strings that are concatenated at compile time lose this information. However, if previously this string was interned without the flag it is not added E.g. in the case the string is an existing class name. Co-authored-by: Niels Dossche <7771979+nielsdos@users.noreply.github.com>
7ca082f
to
9f6134a
Compare
Hopefully all issues are resolved now and I've squashed and rebased the PR so that the commits are independent and handle all stuff together. |
thanks for the details @alexdowad . is it also done for non-mb functions like |
No, it is not.
|
Thanks for the hard work, @Girgias. |
I think this is a very inexpensive check (for |
Thanks! I will need to update my code now to use the new macro. |
Hmm, maybe adding this feature to
That could easily cause it to cut a multi-byte char in the middle, resulting in invalid UTF-8. |
This marks integers or floats cast to string, and strings that are the result of concatenating two UTF-8 strings as valid UTF-8.
A very crude (and terrible) benchmark seems to show an order of magnitude improved performance in validating the string is UTF-8:
It probably makes some sense to performance test this more via micro benchmarks, but just want to get opinions if marking such strings makes sense in the first place.