Skip to content

Mark certain strings as valid utf8 based on operations #10463

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Feb 2, 2023

Conversation

Girgias
Copy link
Member

@Girgias Girgias commented Jan 27, 2023

This marks integers or floats cast to string, and strings that are the result of concatenating two UTF-8 strings as valid UTF-8.

A very crude (and terrible) benchmark seems to show an order of magnitude improved performance in validating the string is UTF-8:

<?php
const COPY_TIMES = 10_000;
$string = "a";

$string_padded = str_repeat($string, COPY_TIMES);

$string_concat = $string;
for ($i = 1; $i < COPY_TIMES; $i++) {
    $string_concat .= $string;
}

$tp = hrtime(true);
mb_check_encoding($string_padded , "UTF-8");
$tp_total = hrtime(true) - $tp;
var_dump($tp_total);

$tc = hrtime(true);
mb_check_encoding($string_concat , "UTF-8");
$tc_total = hrtime(true) - $tc;
var_dump($tc_total);

It probably makes some sense to performance test this more via micro benchmarks, but just want to get opinions if marking such strings makes sense in the first place.

@@ -1984,6 +1984,9 @@ ZEND_API zend_result ZEND_FASTCALL concat_function(zval *result, zval *op1, zval

memcpy(ZSTR_VAL(result_str) + op1_len, Z_STRVAL_P(op2), op2_len);
ZSTR_VAL(result_str)[result_len] = '\0';
if (ZSTR_IS_VALID_UTF8(Z_STR_P(op1)) && ZSTR_IS_VALID_UTF8(Z_STR_P(op2))) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this method called also from functions like implode?

also it would be great to test the flags by a test, if the flag is not available in userland, it should be possible to expose it in zend_test

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, implode() uses a custom algorithm, similar to str_pad() adding those flags to these functions shouldn't be too difficult. Having a function in zend_test to check the flag is set is a good idea.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree having a function in zend_test to test the flag is a good idea.

@alexdowad
Copy link
Contributor

This is great, I like it!

Copy link
Contributor

@alexdowad alexdowad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks very much!

@@ -3218,7 +3221,9 @@ ZEND_API zend_string* ZEND_FASTCALL zend_long_to_str(zend_long num) /* {{{ */
} else {
char buf[MAX_LENGTH_OF_LONG + 1];
char *res = zend_print_long_to_buf(buf + sizeof(buf) - 1, num);
return zend_string_init(res, buf + sizeof(buf) - 1 - res, 0);
zend_string *str = zend_string_init(res, buf + sizeof(buf) - 1 - res, 0);
GC_ADD_FLAGS(str, IS_STR_VALID_UTF8);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this assumption valid even for files with arbitrary charsets/encodings, and arbitrary settings of zend.multibyte? Genuinely asking, since I've never used anything but UTF-8.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if I totally understand the question.

AFAIK, numbers in UTF-16 would be encoded over two bytes. But looking at the implementation of zend_print_long_to_buf() it seems to just create an ASCII string and doesn't care about zend_multibyte.

However, my understanding of what zend_multibyte does is to convert the current script file into an internal encoding to process the file, which is most likely going to be UTF-8. But that's a good question I don't know the answer. But it seems like we assume anyway that the encoding is going to be ASCII compatible

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it looks like these functions always return ASCII strings.

I don't think this is related at all to files with arbitrary charsets; yes, string literals in such files might use some other text encoding, but this is not about string literals. There is no reason why PHP would use zend_long_to_str to create string objects for string literals embedded in a source file.

@staabm
Copy link
Contributor

staabm commented Jan 28, 2023

Similar but different idea: could strings returned from json_decode also be marked as valid utf8, since json itself by definition is utf8?

idea2: should strings returned from often used mb functions, like mb_substr also be marked as utf8, when the input string is utf8 and no $encoding param is used?

@alexdowad
Copy link
Contributor

idea2: should strings returned from often used mb functions, like mb_substr also be marked as utf8, when the input string is utf8 and no $encoding param is used?

This has already been implemented. It's actually more than what you have suggested; basically any time when mbstring is generating UTF-8 strings, it marks them as 'valid UTF-8'. For example, mb_convert_encoding($str, 'UTF-8', 'SJIS') and so on.

@Girgias Girgias force-pushed the mark-string-casts-utf8 branch from 2dcafb9 to fbd6799 Compare January 30, 2023 15:44
@Girgias
Copy link
Member Author

Girgias commented Jan 30, 2023

@nielsdos do you think the flag also needs to be propagated in the zend_set_str_gc_flags() macro in ext/opcache/zend_persist.c?

@nielsdos
Copy link
Member

nielsdos commented Jan 30, 2023

@nielsdos do you think the flag also needs to be propagated in the zend_set_str_gc_flags() macro in ext/opcache/zend_persist.c?

I think so because otherwise persisted strings by opcache would not get the utf8 flag back it they are loaded. So you may need to add an additional argument to that macro so you can pass in whether the string was utf8 or not.
I only had a brief look at that code though.

@youkidearitai
Copy link
Contributor

Hi, I have a question. I create 7-bit character encoding (e.g. ISO-2022-JP(JIS)), can valid UTF-8 strings. Please see below.
This gist (phpt) file output "あ" ISO-2022-JP. Is this right behavior?
https://gist.github.com/youkidearitai/0965634060e714c1b2f6439f2ec08fe1

@alexdowad
Copy link
Contributor

Hi, I have a question. I create 7-bit character encoding (e.g. ISO-2022-JP(JIS)), can valid UTF-8 strings. Please see below. This gist (phpt) file output "あ" ISO-2022-JP. Is this right behavior? https://gist.github.com/youkidearitai/0965634060e714c1b2f6439f2ec08fe1

Thanks for asking.

All bytes from 0x00-7F are valid single-byte characters in UTF-8. So yes, "\x1B\x24\x42" is a valid UTF-8 string. So is "\x1B\x24\x22".

@Girgias Girgias force-pushed the mark-string-casts-utf8 branch from fbd6799 to 7ca082f Compare January 31, 2023 03:51
@Girgias
Copy link
Member Author

Girgias commented Jan 31, 2023

Current test failures with the tracing JIT are:

Concatenation in loop (compound assignment):
036+ bool(false)
036- bool(true)

Hopefully this was just an issue with the persist flag, as I can't reproduce this locally (well I've been using the function JIT which worked well before as it was an Opcache but if it's not that I might try to run the tracing JIT actually...)

The issue seems to be in the definitions of zend_jit_concat() in zend_jit_x86.dasc and zend_jit_arm64.dasc

The issue was that in zend_jit_fast_concat_tmp_helper() and zend_jit_fast_assign_concat_helper() the UTF-8 flag was added just before the do-while() loop ended, which means when it was exited early the flag was not propagated.

Girgias and others added 3 commits January 31, 2023 18:28
Also add test to check what strings are marked as having the flag
The UTF-8 valid flag needs to be copied upon interning,
otherwise strings that are concatenated at compile time lose this information.

However, if previously this string was interned without the flag it is not added
E.g. in the case the string is an existing class name.

Co-authored-by: Niels Dossche <7771979+nielsdos@users.noreply.github.com>
@Girgias Girgias force-pushed the mark-string-casts-utf8 branch from 7ca082f to 9f6134a Compare January 31, 2023 18:28
@Girgias
Copy link
Member Author

Girgias commented Jan 31, 2023

Hopefully all issues are resolved now and I've squashed and rebased the PR so that the commits are independent and handle all stuff together.

@staabm
Copy link
Contributor

staabm commented Feb 1, 2023

This has already been implemented. It's actually more than what you have suggested; basically any time when mbstring is generating UTF-8 strings, it marks them as 'valid UTF-8'. For example, mb_convert_encoding($str, 'UTF-8', 'SJIS') and so on.

thanks for the details @alexdowad . is it also done for non-mb functions like substr, str_pad, str_repeat and friends?

@alexdowad
Copy link
Contributor

thanks for the details @alexdowad . is it also done for non-mb functions like substr, str_pad, str_repeat and friends?

No, it is not.

str_repeat is a very good candidate for adding this feature. So is str_pad. For substr, you would need to check that the input string is UTF-8 and that the cut points don't fall in the middle of a multi-byte character.

@alexdowad
Copy link
Contributor

Thanks for the hard work, @Girgias.

@alexdowad
Copy link
Contributor

str_repeat is a very good candidate for adding this feature. So is str_pad. For substr, you would need to check that the input string is UTF-8 and that the cut points don't fall in the middle of a multi-byte character.

I think this is a very inexpensive check (for substr)... if we know that the input string is valid UTF-8, just check the byte immediately following each cut point and make sure that (c & 0xC0) != 0x80, or equivalently, ((int8_t)c) >= -64.

@Girgias Girgias merged commit 64127b6 into php:master Feb 2, 2023
@Girgias Girgias deleted the mark-string-casts-utf8 branch February 2, 2023 12:02
@alexdowad
Copy link
Contributor

Thanks! I will need to update my code now to use the new macro.

@alexdowad
Copy link
Contributor

str_repeat is a very good candidate for adding this feature. So is str_pad. For substr, you would need to check that the input string is UTF-8 and that the cut points don't fall in the middle of a multi-byte character.

Hmm, maybe adding this feature to str_pad isn't such a good idea after all.

str_pad does not necessarily add the entire pad string, but just as many bytes as it needs to pad out the string to the requested number of bytes.

That could easily cause it to cut a multi-byte char in the middle, resulting in invalid UTF-8.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants