Skip to content

Add ini_parse_quantity to convert ini directive "byte" shorthand to int #8454

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Jul 10, 2022

Conversation

dmsnell
Copy link
Contributor

@dmsnell dmsnell commented Apr 28, 2022

With WordPress we have seen a variety of bugs that stem from the fact that we
don't reliably parse the memory size values from php.ini directives. We want
to know those values in order to communicate to site users what the file
upload limits are.

In this patch I'm adding a new function ini_bytes( $shorthand ) to return
whatever PHP evaluates that value to be internally, thus exposing the actual
byte size to the outside world. This eliminates the need for WordPress and
other projects to port the config value parser since they could directly
call it. Additionally it will reliably reproduce any quirks due to the
platform build and overflow behaviors when handling the g, m, and k
suffixes on post_max_size and upload_max_filesize (and others).

Related

  • Add memory_get_limit() function #8380
    Adding a new function to get the memory size when I think we could use
    ini_bytes() to read the existing value reported for it instead.

  • zend_atoi/zend_atol cleanup #4132
    Highlighting some of the nuances of parsing values set for byte sizes
    in the config. Some other non-byte-size configs also work through atol
    and can be similarly surprising when parsed.

Alternative approaches

This function is a shallow wrapper around zend_atol() and could become
out of sync with the actual ini directive parsing. I considered a variation
of this function which would take a directive's name and return the stored
internal value directly, keeping it in sync.

switch ( param_name ) {
	case 'post_max_size':
		RETVAL_LONG(SG(post_max_size));
}
$post_max_size = ini_get_value( 'post_max_size' );

After looking briefly at this option I dismissed it for being a bit too
coupled to those options, too demanding to add all the possible config
options to a big switch statement, and too complicated for returning all
of the possible return types stored in the SAPI globals.

This is my first attempt at patching PHP and I don't know what I'm doing :)
I'm not thrilled at the function name but I needed something.
If this idea is remotely valuable to people and the approach is fine
then I'll attempt to add all the necessary tests and cleanups according
to the contributing guidelines.

Expected uses

The expecation is to use it in situations roughly like this:

function max_upload_size() {
	$u_bytes = ini_bytes( ini_get( 'upload_max_filesize' ) || 0 );
	$p_bytes = ini_bytes( ini_get( 'post_max_size' ) || 0 );

	return min( $u_bytes, $p_bytes );
}

This patch also makes it easier to perform mathematic comparisons and
operations on the config values, which again previously required duplicating
the parsing outside of PHP and likely will fail for a variety of config values.

function validate_upload_size() {
	$u_bytes = ini_bytes( ini_get( 'upload_max_filesize' ) || 0 );
	$p_bytes = ini_bytes( ini_get( 'post_max_size' ) || 0 );

	return $u_bytes <= $p_bytes
		? [ true ]
		: [ false, 'upload_max_filesize must be smaller than post_max_size' ];
}

A strong benefit to this approach is that it does avoid having to answer
questions about the scope of the change, about handling details of the
internal representation of these config values, about predicting how a
different shorthand value will be interpreted, etc... In essence we leave
it to the PHP developer to choose to parse the values as they wish.

Something not in this patch which might be worth adding if the approach
and idea are valuable to the maintainers is to update the PHP docs to
link to this function where it mentions the shorthand but doesn't explain
how to interpret it or figure out what it actually is.

@cmb69
Copy link
Member

cmb69 commented Apr 29, 2022

Thank you for the PR! To fix CI, please re-generate basic_functions_arginfo.h again, so that it gets a new stub hash (see beginning of the file).

It seems to me this is the most pragmatic approach for #7944 and related issues. While I usually would suggest to write this function in PHP, given that zend_atol()'s behavior might change over time (I certainly hope so), exposing that function to userland is fine for me.

This should still be discussed on the internals mailing list. Are you willing to write a message to that list?

@dmsnell
Copy link
Contributor Author

dmsnell commented Apr 29, 2022

Much obliged for the review @cmb69. I'll definitely get this updated and regenerate the stub hash (sorry I missed that when editing basic_functions_arginfo.h).

Will try and write a message on the mailing list today or early next week. 🙇‍♂️

@cmb69
Copy link
Member

cmb69 commented Apr 29, 2022

sorry I missed that when editing basic_functions_arginfo.h

There is no need to edit this file manually. You can just do:

php build/gen_stub.php ext/standard/basic_functions.stub.php

@javiereguiluz
Copy link
Contributor

javiereguiluz commented May 1, 2022

@dmsnell thanks for this proposal. I like this function and I wish I had it in the past when I need it for some projects.

My minor comment is about the proposed function name: ini_bytes(). I'm not sure it fully conveys its purpose. Here are some possible alternatives for your consideration:

config_size_to_bytes()
config_unit_to_bytes()
config_value_to_bytes()
ini_size_bytes()
ini_size_to_bytes()
ini_unit_to_bytes()
ini_value_to_bytes()
php_size_to_bytes()
php_unit_to_bytes()
php_value_to_bytes()

@dmsnell
Copy link
Contributor Author

dmsnell commented May 2, 2022

@javiereguiluz I'm not the biggest fan of it either. I considered more verbose names as you suggest, though I found the brevity of ini_bytes() to hold a nice symmetry with its related functions: ini_get, ini_set, ini_restore, etc...

another one I thought of decoupled it from the "bytes" language because it appears like it's really not related to bytes: ini_size, ini_numeric_value, ini_number_value, ini_as_int, or ini_int if we want to make people dizzy when reading it 😉

decoupled because multiple non-byte values use this same parsing, anything running OnUpdateLong or OnUpdateLongGEZero:

  • the recently-added hard_timeout (but not the longstanding max_execution_time).
  • realpath_cache_ttl
  • user_ini.cache_ttl
  • max_input_nesting_level (must be > 0)
  • max_input_vars (must be > 0)

I'm getting ready to post on the mailing list; I'll ask for feedback on the name because I think it's not entirely clear to me what we should be calling this since the existing naming around these values (byte sizes) doesn't fully overlap how they are used.

Copy link
Contributor

@brzuchal brzuchal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the internal function name has nothing to do with INI parsing why call it ini_get? The purpose ofc is to parse string representation of values written with suffixes, but IMO this could also be called atol(string $value): int or something different w/o ini_ in name. Any thoughts?

@@ -493,6 +493,8 @@ function ini_alter(string $option, string|int|float|bool|null $value): string|fa

function ini_restore(string $option): void {}

function ini_bytes(string $option): int {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO name of the parameter should be something like $value instead of $option as according to your examples in the PR description what you pass here is a string representation of a value.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the good catch. That was an artifact of starting by trying to return the internal value for a php.ini directive before I gave up and realized that was too broad of a scope.

Renamed to $shorthand in 7f626f1

Z_PARAM_STRING(shorthand, shorthand_len)
ZEND_PARSE_PARAMETERS_END();

RETVAL_LONG(zend_atol(shorthand, shorthand_len));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can it fail? What happens if it fails? I see no test for valid and non-valid values.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't believe that the function can fail and there are no truly invalid values. That is, for values that aren't valid from a use standpoint they will return 0.

For certain parameters like post_max_size that means if we don't have a legitimate configuration value we won't apply any limit, whereas a legitimate value is any value that returns > 0 from this function, but negative values and outright ridiculous values are still allowed.

@Girgias
Copy link
Member

Girgias commented May 4, 2022

I would highly prefer #7951 over this and expose the new zend_ini parser function to userland in ext/std instead of having just testing it in zend_test then have this function which relies on zend_atol()

@dmsnell
Copy link
Contributor Author

dmsnell commented May 4, 2022

Since the internal function name has nothing to do with INI parsing why call it ini_get? The purpose ofc is to parse string representation of values written with suffixes, but IMO this could also be called atol(string $value): int or something different w/o ini_ in name. Any thoughts?

@brzuchal this does have everything to do with INI parsing and that's why I picked a related name. That is, I wouldn't want to expose some atol to encourage people to dive into PHP's internals from PHP itself, but without a way of knowing what the php.ini directives' value internally are, we need some mechanism to communicate that outwardly.

That brings up what @Girgias mentioned in #7951

expose the new zend_ini parser function

Actually I'm all in favor of that too, though I would think there's still some reason to abstract it behind a function related to the existing ini_ functions; essentially the point would be that if we replaced internal calls from zend_atol() to zend_ini_parse_quantity() then the ini_bytes() function described here would return that as well instead of atol(). The point of this proposed function isn't to return atol() but to return "what numeric value does PHP internally interpret the given string value to be?"

On that note, I do like the ring of ini_parse_quantity()

@dmsnell
Copy link
Contributor Author

dmsnell commented May 4, 2022

Renamed the function to ini_parse_quantity() in d9aa196 following the work in #7951.

@dmsnell dmsnell changed the title Add ini_bytes to translate ini directive byte shorthand to int value Add ini_parse_quantity to translate ini directive "byte" shorthand to int value May 4, 2022
@dmsnell dmsnell changed the title Add ini_parse_quantity to translate ini directive "byte" shorthand to int value Add ini_parse_quantity to convert ini directive "byte" shorthand to int May 4, 2022
@ramsey
Copy link
Member

ramsey commented May 22, 2022

Mailing list thread, for reference:

@ramsey ramsey added the Feature label May 22, 2022
@ramsey
Copy link
Member

ramsey commented May 22, 2022

There wasn't much feedback on the mailing list, and the feedback you have received has been positive, so I'm adding this to the 8.2. milestone.

Please write tests for the functionality.

@ramsey ramsey added this to the PHP 8.2 milestone May 22, 2022
@dmsnell
Copy link
Contributor Author

dmsnell commented May 26, 2022

Rebased from 9d29465 to a26290e to incorporate updates to arginfo file

@dmsnell
Copy link
Contributor Author

dmsnell commented May 26, 2022

@ramsey thanks for your help. I've added a test file, and read through the documentation on adding tests, but I'm not very confident I made the right choices, and I didn't see a way to test locally so I'm not sure - maybe I'm missing some major things here.

Would you mind giving me at least a high-level review of the test file and let me know generally if I should be doing this differently? (If the approach is sound I would like to add more test cases that do a better job covering the expected behaviors, but don't want to do all that up-front if the general approach is wrong).

1048576
1073741824
14
0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test looks good, but it'll fail. 😄

To run tests locally, use the run-tests.php script in the root of php-src. After you build PHP with your changes (using make), you can run the test like this:

php run-tests.php -p sapi/cli/php tests/basic/ini_parse_quantity_basic.phpt

Where -p sapi/cli/php is pointing to the new PHP executable that you just built that includes your changes.

There are many options you can pass to run-tests.php. You can check them out by entering php run-tests.php -h.

When you have a failing test, run-tests.php will create a bunch of additional files in the same directory alongside your test file, which you can use for debugging.

For example, when tests/basic/ini_parse_quantity_basic.php fails, you'll see the following new files created:

tests/basic/ini_parse_quantity_basic.diff
tests/basic/ini_parse_quantity_basic.expm
tests/basic/ini_parse_quantity_basic.log
tests/basic/ini_parse_quantity_basic.out
tests/basic/ini_parse_quantity_basic.php
tests/basic/ini_parse_quantity_basic.sh

Take a look at them to see what all they contain. *.out will tell you what the text actually printed out, while *.diff will show a diff of what was expected vs. what was actually printed.

For more information, check out the following:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is all great, and I'm quite happily surprised at how easy the testing setup is. Nothing wrong with test that fail before they succeed either 😄

These run locally for me, but I couldn't build them when I rebased against master (I think because of changes introduced in 8.1? Could not build PHP with the following error on make

make: *** No rule to make target `/Users/dmsnell/code/php-src/main/php_stdint.h', needed by `ext/opcache/ZendAccelerator.lo'.  Stop.

I'll try rebasing again in a work day or two as I presume this will be fixed upstream quickly (it's there for me in master anyway).

Maybe I'll hold off on a more exhaustive test suite until the interplay with #7951 is clearer. The zend_test_ini_parse_quantity functions there I think could be renamed as ini_parse_quantity and replace this entire patch 🙃

@iluuu1994
Copy link
Member

Note that a new API is in the works, deprecating zend_atol: #7951. The new one can fail (due to warning to exception promotion).

@dmsnell
Copy link
Contributor Author

dmsnell commented Jun 9, 2022

Thank you @iluuu1994 - I've been following that patch and have referenced it a number of times in this PR. Based on what I've seen it's not incompatible with this function, or possibly someone will want to incorporate this public function into that patch and close my work (which would be a bummer, since I was hoping I'd get at least a mention for this contribution, but I understand).

The question I have is whether these are both slated to appear in the same release, assuming I can update all the tests. I've been absent for a couple weeks due to a WordPress conference that took my attention. If so, maybe we should merge the two PRs so that there's never a time when we have to update this code to account for those changes. If not, maybe this is still okay to push to master since it solves the problem today with an interface that shouldn't have to change whenever #7951 lands?

Concerning the change of behavior introduced in #7951 I think we're just going to have to deal with it and that's fine too. It's just a breaking change. But, I still don't see that being any different with or without this function; just a matter of whether we expose this API sooner or later. Am I missing something?

I'm more curious about parse_ini_quantity() and parse_ini_uquantity() and whether that does introduce more confusion that this PR was supposed to address. As far as I know there's only "shorthand notation" and not "signed shorthand notation" or "unsigned shorthand notation" which will I think will parse differently from each other for the same shorthand.

@ramsey I'm going to try and fix the tests here regardless, but maybe this work should be joined with #7951? Can you share your thoughts on that? Also, I'd still like to update the documentation to indicate what this is and how it's possible to read the parsed value; I don't see that in this repository - can you (or anyone) point me in the direction to find those?

@dmsnell
Copy link
Contributor Author

dmsnell commented Jun 21, 2022

@cmb69 for the life of me I haven't been able to build this in recent master pulls. I keep getting this error and I'm not sure why. Worked before without trouble. Do you happen to know off-hand if there's something I should be checking or setting? I'm through a few rounds of scouring the web for possible answers but still stuck.

Undefined symbols for architecture x86_64:
  "_php_safe_bcmp", referenced from:
      _zif_hash_equals in hash.o
      _php_password_bcrypt_verify in password.o
ld: symbol(s) not found for architecture x86_64

Would like to rebase this and update it with the changes from #7951 and I guess I can just rely on the tests and CI if I need to. Would be nice to run and test locally 😄

@iluuu1994
Copy link
Member

@dmsnell Have you tried make distclean and a complete rebuild from the top?

@cmb69
Copy link
Member

cmb69 commented Jun 21, 2022

And if make distclean doesn't work either, do a git clean -fdx and try again. One of the problems is that changes in header files are usually not caught (i.e. don't trigger recompilation), so sometimes you need to clean manually.

@dmsnell
Copy link
Contributor Author

dmsnell commented Jun 21, 2022

yes thank you @cmb69! I should have known as I've been burnt by hiding files before. git status --ignored showed header files, many header files…

@dmsnell
Copy link
Contributor Author

dmsnell commented Jun 21, 2022

y'all have been super helpful. I've updated against #7951 and added a new test just to make sure we throw warnings when given some invalid/inappropriate values.

for now I think what this needs is a final look over and any instruction you want to give me as for what you want to see in the tests.

there was that final thought that maybe we should just rename the function to ini_parse_shorthand( $shorthand ) since that's more directly what we're doing…

@Girgias
Copy link
Member

Girgias commented Jun 21, 2022

And if make distclean doesn't work either, do a git clean -fdx and try again. One of the problems is that changes in header files are usually not caught (i.e. don't trigger recompilation), so sometimes you need to clean manually.

Headers are tracked now and trigger recompilation, the issue is when build dependencies are changed (i.e. new files added or removed) then ./buildconf needs to be rerun to regenerate the configure file to actually compile the new files (or stop referencing to non existing ones)

@dmsnell
Copy link
Contributor Author

dmsnell commented Jun 22, 2022

Not sure why the Travis build is failing. Seems unrelated but I don't appear to be able to restart the job to check.

@iluuu1994
Copy link
Member

@dmsnell Travis frequently fails for no reason. You don't need to restart. I just triggered the GitHub tests, let's see if those pass.

@arnaud-lb
Copy link
Member

I'm more curious about parse_ini_quantity() and parse_ini_uquantity() and whether that does introduce more confusion that this PR was supposed to address. As far as I know there's only "shorthand notation" and not "signed shorthand notation" or "unsigned shorthand notation" which will I think will parse differently from each other for the same shorthand.

There is indeed a single notation, but different accepted ranges.

Most ini settings accept values in the range [-2^63, 2^63-1] (on 64bit systems), but memory_limit accepts values in the range [0, 2^64]. The functions zend_ini_parse_quantity() and zend_ini_parse_uquantity() use different range checks, but otherwise behave the same. Shorthand notations outside of the range trigger a warning and the return value is unspecified.

Note that this doesn't change the behavior that existed before #7951, appart from the new warnings.

Unfortunately, using only zend_ini_parse_quantity() in this PR will be unsuitable for parsing memory_limit values.

It's fine with me if others agree.

Some solutions which I suggested in #7951 would be to add comparison functions similar to version_compare() for ini quantities.

@dmsnell
Copy link
Contributor Author

dmsnell commented Jun 27, 2022

Thanks for the feedback @arnaud-lb

Some solutions which I suggested in #7951 (comment) would be to add comparison functions similar to version_compare() for ini quantities.

as noted over in that other PR this is a solution for a different problem, but related to knowing the limits. even with those comparison functions though I guess won't we still have to know if PHP is using signed or unsigned parsing? that is, if we provide a shorthand_compare function we'd still need to know if we're comparing memory_size or some other PHP ini directive…

for now I'm tempted to leave it hanging as a mystery given what you said, "doing the wrong thing would work most of the time." that is, maybe you can check me on this:

  • on a 32-bit system we're correct with _quantity up to 4 GB of memory, then wrong if higher. but 32-bit systems don't support more than 4 GB of memory right? or at least, processed compiled for 32-bit systems can't address more than that anyway? so the only cases where this fails (because we should have used _uquantity) is where the memory limit doesn't make sense for the platform and we have a warning thrown
  • on a 64-bit system we're correct with _quantity up to 9e18, which is more memory than any system should have. again, we might get meaningless numbers if parsed, but there's not a direct correspondence to the reality of the limit anyway since no computer today can have that much memory

mainly this catches negative values for memory_limit other than -1, which is treated special. if that's the case I guess it doesn't matter and we can still only expose _quantity without worrying too much.

why did we not use uquantity then as well for other values which can only have non-negative numbers as meaningful limits? for instance, post_max_size for example?


In summary, my take from #7951 is that we should continue to leave it up to the application developer to realize that memory_limit is special and they should either provide their own verification on that or know it could be wrong. It's unfortunate to have to expose that internal detail, but this is still significantly better than it was before, plus now we get warnings for invalid ini settings thanks to your PR.

This approach is the scope-limited alternative to passing in the ini directive's name and having PHP decide how to interpret it based on which setting is in use.

dmsnell added 8 commits June 27, 2022 19:21
With WordPress we have seen a variety of bugs that stem from the fact that we
don't reliably parse the memory size values from php.ini directives. We want
to know those values in order to communicate to site users what the file
upload limits are.

In this patch I'm adding a new function `ini_bytes( $shorthand )` to return
whatever PHP evaluates that value to be internally, thus exposing the actual
byte size to the outside world. This eliminates the need for WordPress and
other projects to port the config value parser since they could directly
call it. Additionally it will reliably reproduce any quirks due to the
platform build and overflow behaviors when handling the `g`, `m`, and `k`
suffixes on `post_max_size` and `upload_max_filesize` (and others).

This function is a shallow wrapper around `zend_atoi()` and could become
out of sync with the actual ini directive parsing. I considered a variation
of this function which would take a directive's name and return the stored
internal value directly, keeping it in sync.

```php
switch ( param_name ) {
	case 'post_max_size':
		RETVAL_LONG(SG(post_max_size));
}
```

After looking briefly at this option I dismissed it for being a bit too
coupled to those options, too demanding to add all the possible config
options to a big switch statement, and too complicated for returning all
of the possible return types stored in the SAPI globals.

This is my first attempt at patching PHP and I don't know what I'm doing :)
I'm not thrilled at the function name but I needed something.
I'm not aware of what I need to write for testing, documentation, etc...
but I'm happy to add whatever is required.

The expecation is to use it in situations roughly like this:

```php
function max_upload_size() {
	$u_bytes = ini_bytes( ini_get( 'upload_max_filesize' ) || 0 );
	$p_bytes = ini_bytes( ini_get( 'post_max_size' ) || 0 );

	return min( $u_bytes, $p_bytes );
}
```

This patch also makes it easier to perform mathematic comparisons and
operations on the config values, which again previously required duplicating
the parsing outside of PHP and likely will fail for a variety of config values.

```php
function validate_upload_size() {
	$u_bytes = ini_bytes( ini_get( 'upload_max_filesize' ) || 0 );
	$p_bytes = ini_bytes( ini_get( 'post_max_size' ) || 0 );

	return $u_bytes <= $p_bytes
		? [ true ]
		: [ false, 'upload_max_filesize must be smaller than post_max_size' ];
}
```
Of all the naming suggestions so far this one seems
to fit best with the purpose of the function.
@arnaud-lb
Copy link
Member

maybe you can check me on this:

  • on a 32-bit system we're correct with _quantity up to 4 GB of memory, then wrong if higher. but 32-bit systems don't support more than 4 GB of memory right? or at least, processed compiled for 32-bit systems can't address more than that anyway? so the only cases where this fails (because we should have used _uquantity) is where the memory limit doesn't make sense for the platform and we have a warning thrown

ini_parse_quantity() allows values up to 2GiB in 32bit systems, and ini_parse_uquantity() allows values up to 4GiB

So for example a memory_limit value of 3G, which is valid, will be considered invalid by ini_parse_quantity().

I think that it's ok to add only ini_parse_quantity() for now and specify on the documentation that memory_limit values can not be parsed with this function.

@dmsnell
Copy link
Contributor Author

dmsnell commented Jul 1, 2022

ini_parse_quantity() allows values up to 2GiB in 32bit systems, and ini_parse_uquantity() allows values up to 4GiB

right! I thought I had to have something wrong there. well, I don't like having to leak the internal implementation but I think this is the best we can get for the compromise.

what needs to happen for this PR before merge?

@arnaud-lb
Copy link
Member

The implementation looks good to me, and the API is good for all ini settings but memory_limit. We will need other functions to handle the full range of memory_limit values, but this is unlikely to be needed on 64bit systems.

@ramsey should we merge ?

@bukka
Copy link
Member

bukka commented Jul 9, 2022

@arnaud-lb I think you can just merge it. If there are no objections, there is no need to wait for RM to confirm pre feature freeze.

@arnaud-lb arnaud-lb merged commit 492af9f into php:master Jul 10, 2022
@arnaud-lb
Copy link
Member

Merged. Thank you @dmsnell !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants