Syntax errors caused by unclosed {, [, ( mention specific location #5364

alexdowad · 2020-04-09T11:40:10Z

New contributor here. Pointers on how to improve will be much appreciated! Now starting with the commit message:

Aside from a few very specific syntax errors for which detailed exceptions are
thrown, generally PHP just emits the default error messages generated by bison on syntax
error. These messages are very uninformative; they just say "Unexpected ... at line ...".

This is most problematic with constructs which can span an arbitrary number of lines, such
as blocks of code delimited by { }, 'if' conditions delimited by ( ), and so on. If a closing
delimiter is missed, the block will run for the entire remainder of the source file (which
could be thousands of lines), and then at the end, a parse error will be thrown with the
dreaded words: "Unexpected end of file".

Therefore, track the positions of opening and closing delimiters and ensure that they match
up correctly. If any mismatch or missing delimiter is detected, immediately throw a parse
error which points the user to the offending line. This is best done in the lexer and not
in the parser.

This was done after seeing Bug 79368: https://bugs.php.net/bug.php?id=79368.

I tried to see if there is a separate repo for the Tokenizer extension, but couldn't find one. Hopefully this is the right place for changes to Tokenizer? There are just a couple of lines adjusted there so that it doesn't break.

Performance impact: Time taken to run a source file with 1,157,900 curly brackets (and nothing else) goes up from 320ms to 380ms on my Linux Mint machine (Intel Core 2 Duo, 1500MHz). It looks like slighly less than 1 millisecond of overhead is added for every 10,000 pairs of brackets. Note this overhead occurs only when the source file is being parsed; once it is compiled and running, there is no overhead.

If the PHP dev team feels that is too much, I can try to optimize it down, but I feel that in a more realistic situation, the effect will be completely drowned out by the time taken to run the code. Even just looking at the time used for parsing, brackets do not usually take up the majority of a source file.

Zend/zend_language_scanner.l

alexdowad · 2020-04-09T11:45:44Z

Zend/zend_language_scanner.l

+	}
+
+	if (closing) { 	/* 'closing' will be 0 if at end of file */
+		used += snprintf(buf + used, sizeof(buf) - used, " does not match '%c'", closing);


used += is not needed, but I still do it in case someone comes and adds more code below. Don't want any smashed stacks.

Zend/zend_language_scanner.l

alexdowad · 2020-04-09T11:53:10Z

ext/tokenizer/tokenizer.c

@@ -392,6 +392,8 @@ static zend_bool tokenize(zval *return_value, zend_string *source, zend_class_en
 	array_init(return_value);

 	while ((token_type = lex_scan(&token, NULL))) {
+		if (token_type == T_ERROR) break;


Does Tokenizer really want to return T_ERROR tokens to the caller? I doubt it. But if so, this can be fixed up in a different way.

The contract here is that the lexer should only return T_ERROR if PARSER_MODE() is enabled. (We should probably ZEND_ASSERT(token_type != T_ERROR) here though.)

Wow, thanks for that information!! Just looked at PARSER_MODE() and it seems that it is true if the tokens are being used to build an AST. Hope that is right?

I thought that I had to return T_ERROR because of zend_compile.c:1810, which asserts that the returned token must be T_ERROR if an exception is thrown.

Can I try to fix that?

alexdowad · 2020-04-09T11:55:06Z

ext/tokenizer/tokenizer.c

@@ -408,7 +410,7 @@ static zend_bool tokenize(zval *return_value, zend_string *source, zend_class_en
 				&& --need_tokens == 0
 			) {
 				/* fetch the rest into a T_INLINE_HTML */
-				if (zendcursor != zendlimit) {


This breaks if it receives a T_ERROR token at end of file (which was not possible before). I'm not sure why, but in that case, zendcursor is actually more than zendlimit. Changing the test to zendcursor < zendlimit makes it a bit more robust.

The added guard clause above actually takes care of this problem, but I figure changing the test to zendcursor < zendlimit is still a good idea.

alexdowad · 2020-04-09T11:55:56Z

ext/tokenizer/tests/token_get_all_variation17.phpt

@@ -40,7 +41,7 @@ echo "Done"
 ?>
 --EXPECTF--
 *** Testing token_get_all() : with exception keywords ***
-array(81) {
+array(83) {


This test had bad syntax before in the code passed to token_get_all (it was missing a curly bracket). Fixed that up.

alexdowad · 2020-04-09T11:56:50Z

See the detailed code-level comments in the "Files Changed" tab.

Zend/tests/bug60099.phpt

Girgias · 2020-04-09T12:42:06Z

I can't really give comments about the implementation as I'm not familiar with this area, but in the same vein there are unterminated/run-away strings which could also receive a similar treatment to this.

alexdowad · 2020-04-09T12:42:59Z

I can't really give comments about the implementation as I'm not familiar with this area, but in the same vein there are unterminated/run-away strings which could also receive a similar treatment to this.

I'd be happy to implement better syntax errors for those as well if the PHP developers would like a patch.

alexdowad · 2020-04-09T13:20:35Z

@nikic, after your helpful comments I just fixed up the code so it only returns T_ERROR from the lexer when in PARSER_MODE(). Amended and force-pushed.

alexdowad · 2020-04-09T15:00:33Z

This is what failed on AppVeyor:

========DIFF========
001+ proc_terminate: Undefined constant 'SIGTERM'
========DONE========
FAIL Check that all internal parameter defaults evaluate without error [C:\projects\php-src\ext\reflection\tests\internal_parameter_default_value\check_all.phpt]

carusogabriel · 2020-04-09T15:10:47Z

@alexdowad This was fixed via #5365. Please, rebase with master.

alexdowad · 2020-04-09T15:14:21Z

@alexdowad This was fixed via #5365. Please, rebase with master.

Thanks so much. Just rebased and force-pushed.

nikic

This looks good to me.

Zend/zend_language_scanner.l

Zend/zend_language_scanner.h

Zend/zend_language_scanner.l

nikic · 2020-04-12T08:39:49Z

Zend/zend_language_scanner.l

@@ -2569,6 +2663,7 @@ skip_escape_conversion:
 <ST_DOUBLE_QUOTES,ST_BACKQUOTE,ST_HEREDOC>"{$" {
 	yy_push_state(ST_IN_SCRIPTING);
 	yyless(1);
+	enter_nesting(yytext);


This will refer to an opening { rather than {$. That's fine, but makes me wonder if we need the special-casing around ${? Shouldn't we just report that as { as well?

I think we still need the special-casing for ${. The error messages will be more helpful. The case where ${ will be reported is the one on line 1878 (different from this one).

This makes the error reporting inconsistent though. It means that ${ outside a string will report as an unclosed {, while a ${ inside a string will report and unclosed ${. This seems somewhat odd to me. The fact that we treat $ and { as separate tokens outside strings and as a single token inside strings shouldn't leak through in this way.

Maybe my intuition is off, but when I see "... ${<expression>} ...", intuitively I think of the ${ as matching the }. In my mind, those are the delimiters. But when I see something like $foo->{$bar}, I think of { as matching } and $bar as being the "contents" of the braces. To my eyes, {$ doesn't appear to be a delimiter, but rather just {.

(Please note that above you refer to "${ outside a string", which is not what we are dealing with here. It's {$ outside a string.)

Anyways, please just tell me what you want. Would you like PHP to report {$ as not matching a different (or missing) closing delimiter? If so, that's fine. I can do that.

(Please note that above you refer to "${ outside a string", which is not what we are dealing with here. It's {$ outside a string.)

To be clear, I'm talking about something like ${$foo}. This is valid both inside and outside strings, and I think those should receive the same treatment, either by referring to ${ in both cases, or { in both cases. As writing $ {$foo} with additional whitespace is also legal (outside strings), I'd go with just reporting the { in both cases.

OK. That makes sense now.

One hot, fresh little force-push coming right up!!!

alexdowad · 2020-04-12T14:58:42Z

Incorporated helpful suggestions from @nikic and force-pushed.

Aside from a few very specific syntax errors for which detailed exceptions are thrown, generally PHP just emits the default error messages generated by bison on syntax error. These messages are very uninformative; they just say "Unexpected ... at line ...". This is most problematic with constructs which can span an arbitrary number of lines, such as blocks of code delimited by { }, 'if' conditions delimited by ( ), and so on. If a closing delimiter is missed, the block will run for the entire remainder of the source file (which could be thousands of lines), and then at the end, a parse error will be thrown with the dreaded words: "Unexpected end of file". Therefore, track the positions of opening and closing delimiters and ensure that they match up correctly. If any mismatch or missing delimiter is detected, immediately throw a parse error which points the user to the offending line. This is best done in the *lexer* and not in the parser. Thanks to Nikita Popov and George Peter Banyard for suggesting improvements.

nikic · 2020-04-13T20:09:44Z

Zend/zend_language_scanner.l

+
+	zend_nest_location *nest_loc = zend_stack_top(&SCNG(nest_location_stack));
+	char opening = nest_loc->text;
+	char closing = *location;


This leftover is currently making the build fail.

Thanks. I actually saw that already and am just running make and make test.

Sorry for my haste to push out the last update!

By the way, there is some funniness with PHP's Makefiles. I find that after editing certain source files, it is necessary to make clean. Just make doesn't rebuild everything which needs to be rebuilt, and then some tests fail with SIGTERM, SIGSEGV, etc.

The makefiles don't track header dependencies, so this can happen when data structures in headers are changed.

It would be nice to implement this, but everyone is afraid to touch the build system :)

alexdowad commented Apr 9, 2020

View reviewed changes

Zend/zend_language_scanner.l Show resolved Hide resolved

alexdowad commented Apr 9, 2020

View reviewed changes

Zend/zend_language_scanner.l Show resolved Hide resolved

alexdowad commented Apr 9, 2020

View reviewed changes

Girgias reviewed Apr 9, 2020

View reviewed changes

Zend/tests/bug60099.phpt Outdated Show resolved Hide resolved

alexdowad force-pushed the better-syntax-errors branch 2 times, most recently from 4919bee to c94ac2c Compare April 9, 2020 13:19

carusogabriel added the Feature label Apr 9, 2020

alexdowad force-pushed the better-syntax-errors branch from c94ac2c to b2f9cdb Compare April 9, 2020 15:14

nikic reviewed Apr 12, 2020

View reviewed changes

alexdowad force-pushed the better-syntax-errors branch from b2f9cdb to 20e45c4 Compare April 12, 2020 14:58

alexdowad force-pushed the better-syntax-errors branch from 20e45c4 to 0272b68 Compare April 13, 2020 19:57

nikic reviewed Apr 13, 2020

View reviewed changes

alexdowad force-pushed the better-syntax-errors branch from 0272b68 to e8d69c6 Compare April 13, 2020 20:34

nikic approved these changes Apr 14, 2020

View reviewed changes

php-pulls closed this in 80598f1 Apr 14, 2020

carusogabriel added the Bug label Apr 15, 2020

nikic mentioned this pull request Apr 20, 2020

RFC: modernize the generation of syntax errors #5416

Open

carusogabriel added this to the PHP 8.0 milestone Jul 27, 2020

Syntax errors caused by unclosed {, [, ( mention specific location #5364

Syntax errors caused by unclosed {, [, ( mention specific location #5364

Uh oh!

Conversation

alexdowad commented Apr 9, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

alexdowad Apr 9, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

alexdowad commented Apr 9, 2020

Uh oh!

Uh oh!

Girgias commented Apr 9, 2020

Uh oh!

alexdowad commented Apr 9, 2020

Uh oh!

alexdowad commented Apr 9, 2020

Uh oh!

alexdowad commented Apr 9, 2020

Uh oh!

carusogabriel commented Apr 9, 2020

Uh oh!

alexdowad commented Apr 9, 2020

Uh oh!

nikic left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

alexdowad commented Apr 12, 2020

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

alexdowad commented Apr 9, 2020 •

edited

Loading

alexdowad Apr 9, 2020 •

edited

Loading