Fix parsing of semi-reserved tokens at offset > 4GB #6616

TysonAndre · 2021-01-18T15:58:56Z

Possibly related to GH-5668
Also affects php 8.0 but this changes the
datastructures in public(?) header files.
(binary ABI change)
PHP 7.4 wouldn't even parse a class with the below example.

Fix parsing semi-reserved tokens such as namespace
with an offset that's larger than 2**32
Fix parsing some tokens when their length is larger than 2**32

Offset is relative to yy_start(the start of the file?)

Low priority because 4GB php files and the configured memory limits to support them are extremely rare - this was only noticed when looking into edge cases in the parser - this test case is artificial

Instead of parsing the token 'namespace', the character range from 4GB before is used due to truncation to 32 bits

php > ini_set('memory_limit', '12G');
php > eval(str_repeat(' ' , 2**32) . 'class MyClass{ public static function namespace() {}}');
php > MyClass::namespace();

Warning: Uncaught Error: Call to undefined method MyClass::namespace() in php shell code:1
Stack trace:
  thrown in php shell code on line 1
php > var_export((new ReflectionClass('MyClass'))->getMethods());
array (
  0 =>
  ReflectionMethod::__set_state(array(
     'name' => '         ',
     'class' => 'MyClass',
  )),
)

nikic · 2021-01-18T16:10:10Z

This used 32-bit integers do avoid doubling the size of the parser stack elements...

Related to phpGH-5668 Also affects php 8.0 but this changes the datastructures in public header files. (binary ABI change) PHP 7.4 wouldn't even parse a class with the below example. 1. Fix parsing semi-reserved tokens such as `namespace` with an offset that's larger than 2**32 2. Fix parsing some tokens when their length is larger than 2**32 Offset is relative to yy_start(the start of the file?) Low priority because 4GB php files are extremely rare ``` php > ini_set('memory_limit', '12G'); php > eval(str_repeat(' ' , 2**32) . 'class MyClass{ public static function namespace() {}}'); php > MyClass::namespace(); Warning: Uncaught Error: Call to undefined method MyClass::namespace() in php shell code:1 Stack trace: thrown in php shell code on line 1 php > var_export((new ReflectionClass('MyClass'))->getMethods()); array ( 0 => ReflectionMethod::__set_state(array( 'name' => ' ', 'class' => 'MyClass', )), ) ```

TysonAndre · 2021-01-18T16:51:48Z

This can probably use 64-bit integers just for the offset and continue using 32-bit integers for the length?

Is it possible to reuse the memory for zend_string * elem->str (and possibly num) instead from stack elements?
For example, make it a union of zend_string * and size_t offset, and make the token type determine whether it's used as an ident or a zend_string? (EDIT: missed that this was already a union: typedef union _zend_parser_stack_elem {)

emit_token_with_ident:
	if (PARSER_MODE()) {
		elem->ident.offset = SCNG(yy_text) - SCNG(yy_start);
		elem->ident.len = SCNG(yy_leng);
	}
	if (SCNG(on_event)) {
		SCNG(on_event)(ON_TOKEN, token, start_line, yytext, yyleng, SCNG(on_event_context));
	}
	return token;

nikic · 2021-01-19T09:39:31Z

This can probably use 64-bit integers just for the offset and continue using 32-bit integers for the length?

That makes no difference, due to padding.

A way to preserve the current size would be to store a char* to the start of the keyword, and then determine the length in zend_lex_tstring. After all, keywords are [a-zA-Z_], so it's not hard to determine the length from that.

TBH I have no idea how important it is to preserve the size, it's just why I did this at the time.

nikic · 2021-01-25T13:38:14Z

I committed a fix using the approach I mentioned in my previous comment in cc3e03c. Please tell me if you see any further issues with this.

TysonAndre force-pushed the parse-4GB-file branch from caf2e47 to de57850 Compare January 18, 2021 16:34

nikic closed this Jan 25, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix parsing of semi-reserved tokens at offset > 4GB #6616

Fix parsing of semi-reserved tokens at offset > 4GB #6616

Uh oh!

TysonAndre commented Jan 18, 2021

Uh oh!

nikic commented Jan 18, 2021

Uh oh!

TysonAndre commented Jan 18, 2021 •

edited

Loading

Uh oh!

nikic commented Jan 19, 2021

Uh oh!

nikic commented Jan 25, 2021

Uh oh!

Uh oh!

Fix parsing of semi-reserved tokens at offset > 4GB #6616

Fix parsing of semi-reserved tokens at offset > 4GB #6616

Uh oh!

Conversation

TysonAndre commented Jan 18, 2021

Uh oh!

nikic commented Jan 18, 2021

Uh oh!

TysonAndre commented Jan 18, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

nikic commented Jan 19, 2021

Uh oh!

nikic commented Jan 25, 2021

Uh oh!

Uh oh!

TysonAndre commented Jan 18, 2021 •

edited

Loading