Skip to content

Syntax errors caused by unclosed {, [, ( mention specific location #5364

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

alexdowad
Copy link
Contributor

@alexdowad alexdowad commented Apr 9, 2020

New contributor here. Pointers on how to improve will be much appreciated! Now starting with the commit message:


Aside from a few very specific syntax errors for which detailed exceptions are
thrown, generally PHP just emits the default error messages generated by bison on syntax
error. These messages are very uninformative; they just say "Unexpected ... at line ...".

This is most problematic with constructs which can span an arbitrary number of lines, such
as blocks of code delimited by { }, 'if' conditions delimited by ( ), and so on. If a closing
delimiter is missed, the block will run for the entire remainder of the source file (which
could be thousands of lines), and then at the end, a parse error will be thrown with the
dreaded words: "Unexpected end of file".

Therefore, track the positions of opening and closing delimiters and ensure that they match
up correctly. If any mismatch or missing delimiter is detected, immediately throw a parse
error which points the user to the offending line. This is best done in the lexer and not
in the parser.


This was done after seeing Bug 79368: https://bugs.php.net/bug.php?id=79368.

I tried to see if there is a separate repo for the Tokenizer extension, but couldn't find one. Hopefully this is the right place for changes to Tokenizer? There are just a couple of lines adjusted there so that it doesn't break.

Performance impact: Time taken to run a source file with 1,157,900 curly brackets (and nothing else) goes up from 320ms to 380ms on my Linux Mint machine (Intel Core 2 Duo, 1500MHz). It looks like slighly less than 1 millisecond of overhead is added for every 10,000 pairs of brackets. Note this overhead occurs only when the source file is being parsed; once it is compiled and running, there is no overhead.

If the PHP dev team feels that is too much, I can try to optimize it down, but I feel that in a more realistic situation, the effect will be completely drowned out by the time taken to run the code. Even just looking at the time used for parsing, brackets do not usually take up the majority of a source file.

}

if (closing) { /* 'closing' will be 0 if at end of file */
used += snprintf(buf + used, sizeof(buf) - used, " does not match '%c'", closing);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

used += is not needed, but I still do it in case someone comes and adds more code below. Don't want any smashed stacks.

@@ -392,6 +392,8 @@ static zend_bool tokenize(zval *return_value, zend_string *source, zend_class_en
array_init(return_value);

while ((token_type = lex_scan(&token, NULL))) {
if (token_type == T_ERROR) break;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does Tokenizer really want to return T_ERROR tokens to the caller? I doubt it. But if so, this can be fixed up in a different way.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The contract here is that the lexer should only return T_ERROR if PARSER_MODE() is enabled. (We should probably ZEND_ASSERT(token_type != T_ERROR) here though.)

Copy link
Contributor Author

@alexdowad alexdowad Apr 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, thanks for that information!! Just looked at PARSER_MODE() and it seems that it is true if the tokens are being used to build an AST. Hope that is right?

I thought that I had to return T_ERROR because of zend_compile.c:1810, which asserts that the returned token must be T_ERROR if an exception is thrown.

Can I try to fix that?

@@ -408,7 +410,7 @@ static zend_bool tokenize(zval *return_value, zend_string *source, zend_class_en
&& --need_tokens == 0
) {
/* fetch the rest into a T_INLINE_HTML */
if (zendcursor != zendlimit) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This breaks if it receives a T_ERROR token at end of file (which was not possible before). I'm not sure why, but in that case, zendcursor is actually more than zendlimit. Changing the test to zendcursor < zendlimit makes it a bit more robust.

The added guard clause above actually takes care of this problem, but I figure changing the test to zendcursor < zendlimit is still a good idea.

@@ -40,7 +41,7 @@ echo "Done"
?>
--EXPECTF--
*** Testing token_get_all() : with exception keywords ***
array(81) {
array(83) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test had bad syntax before in the code passed to token_get_all (it was missing a curly bracket). Fixed that up.

@alexdowad
Copy link
Contributor Author

See the detailed code-level comments in the "Files Changed" tab.

@Girgias
Copy link
Member

Girgias commented Apr 9, 2020

I can't really give comments about the implementation as I'm not familiar with this area, but in the same vein there are unterminated/run-away strings which could also receive a similar treatment to this.

@alexdowad
Copy link
Contributor Author

I can't really give comments about the implementation as I'm not familiar with this area, but in the same vein there are unterminated/run-away strings which could also receive a similar treatment to this.

I'd be happy to implement better syntax errors for those as well if the PHP developers would like a patch.

@alexdowad alexdowad force-pushed the better-syntax-errors branch 2 times, most recently from 4919bee to c94ac2c Compare April 9, 2020 13:19
@alexdowad
Copy link
Contributor Author

@nikic, after your helpful comments I just fixed up the code so it only returns T_ERROR from the lexer when in PARSER_MODE(). Amended and force-pushed.

@alexdowad
Copy link
Contributor Author

This is what failed on AppVeyor:

========DIFF========
001+ proc_terminate: Undefined constant 'SIGTERM'
========DONE========
FAIL Check that all internal parameter defaults evaluate without error [C:\projects\php-src\ext\reflection\tests\internal_parameter_default_value\check_all.phpt] 

@carusogabriel
Copy link
Contributor

@alexdowad This was fixed via #5365. Please, rebase with master.

@alexdowad alexdowad force-pushed the better-syntax-errors branch from c94ac2c to b2f9cdb Compare April 9, 2020 15:14
@alexdowad
Copy link
Contributor Author

@alexdowad This was fixed via #5365. Please, rebase with master.

Thanks so much. Just rebased and force-pushed.

Copy link
Member

@nikic nikic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me.

@@ -2569,6 +2663,7 @@ skip_escape_conversion:
<ST_DOUBLE_QUOTES,ST_BACKQUOTE,ST_HEREDOC>"{$" {
yy_push_state(ST_IN_SCRIPTING);
yyless(1);
enter_nesting(yytext);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will refer to an opening { rather than {$. That's fine, but makes me wonder if we need the special-casing around ${? Shouldn't we just report that as { as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we still need the special-casing for ${. The error messages will be more helpful. The case where ${ will be reported is the one on line 1878 (different from this one).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes the error reporting inconsistent though. It means that ${ outside a string will report as an unclosed {, while a ${ inside a string will report and unclosed ${. This seems somewhat odd to me. The fact that we treat $ and { as separate tokens outside strings and as a single token inside strings shouldn't leak through in this way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe my intuition is off, but when I see "... ${<expression>} ...", intuitively I think of the ${ as matching the }. In my mind, those are the delimiters. But when I see something like $foo->{$bar}, I think of { as matching } and $bar as being the "contents" of the braces. To my eyes, {$ doesn't appear to be a delimiter, but rather just {.

(Please note that above you refer to "${ outside a string", which is not what we are dealing with here. It's {$ outside a string.)

Anyways, please just tell me what you want. Would you like PHP to report {$ as not matching a different (or missing) closing delimiter? If so, that's fine. I can do that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Please note that above you refer to "${ outside a string", which is not what we are dealing with here. It's {$ outside a string.)

To be clear, I'm talking about something like ${$foo}. This is valid both inside and outside strings, and I think those should receive the same treatment, either by referring to ${ in both cases, or { in both cases. As writing $ {$foo} with additional whitespace is also legal (outside strings), I'd go with just reporting the { in both cases.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. That makes sense now.

One hot, fresh little force-push coming right up!!!

@alexdowad alexdowad force-pushed the better-syntax-errors branch from b2f9cdb to 20e45c4 Compare April 12, 2020 14:58
@alexdowad
Copy link
Contributor Author

Incorporated helpful suggestions from @nikic and force-pushed.

@alexdowad alexdowad force-pushed the better-syntax-errors branch from 20e45c4 to 0272b68 Compare April 13, 2020 19:57
Aside from a few very specific syntax errors for which detailed exceptions are
thrown, generally PHP just emits the default error messages generated by bison on syntax
error. These messages are very uninformative; they just say "Unexpected ... at line ...".

This is most problematic with constructs which can span an arbitrary number of lines, such
as blocks of code delimited by { }, 'if' conditions delimited by ( ), and so on. If a closing
delimiter is missed, the block will run for the entire remainder of the source file (which
could be thousands of lines), and then at the end, a parse error will be thrown with the
dreaded words: "Unexpected end of file".

Therefore, track the positions of opening and closing delimiters and ensure that they match
up correctly. If any mismatch or missing delimiter is detected, immediately throw a parse
error which points the user to the offending line. This is best done in the *lexer* and not
in the parser.

Thanks to Nikita Popov and George Peter Banyard for suggesting improvements.

zend_nest_location *nest_loc = zend_stack_top(&SCNG(nest_location_stack));
char opening = nest_loc->text;
char closing = *location;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This leftover is currently making the build fail.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I actually saw that already and am just running make and make test.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for my haste to push out the last update!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the way, there is some funniness with PHP's Makefiles. I find that after editing certain source files, it is necessary to make clean. Just make doesn't rebuild everything which needs to be rebuilt, and then some tests fail with SIGTERM, SIGSEGV, etc.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The makefiles don't track header dependencies, so this can happen when data structures in headers are changed.

It would be nice to implement this, but everyone is afraid to touch the build system :)

@alexdowad alexdowad force-pushed the better-syntax-errors branch from 0272b68 to e8d69c6 Compare April 13, 2020 20:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants