-
Notifications
You must be signed in to change notification settings - Fork 7.9k
Syntax errors caused by unclosed {, [, ( mention specific location #5364
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
} | ||
|
||
if (closing) { /* 'closing' will be 0 if at end of file */ | ||
used += snprintf(buf + used, sizeof(buf) - used, " does not match '%c'", closing); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
used +=
is not needed, but I still do it in case someone comes and adds more code below. Don't want any smashed stacks.
ext/tokenizer/tokenizer.c
Outdated
@@ -392,6 +392,8 @@ static zend_bool tokenize(zval *return_value, zend_string *source, zend_class_en | |||
array_init(return_value); | |||
|
|||
while ((token_type = lex_scan(&token, NULL))) { | |||
if (token_type == T_ERROR) break; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does Tokenizer really want to return T_ERROR
tokens to the caller? I doubt it. But if so, this can be fixed up in a different way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The contract here is that the lexer should only return T_ERROR
if PARSER_MODE()
is enabled. (We should probably ZEND_ASSERT(token_type != T_ERROR)
here though.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow, thanks for that information!! Just looked at PARSER_MODE()
and it seems that it is true if the tokens are being used to build an AST. Hope that is right?
I thought that I had to return T_ERROR
because of zend_compile.c:1810
, which asserts that the returned token must be T_ERROR
if an exception is thrown.
Can I try to fix that?
@@ -408,7 +410,7 @@ static zend_bool tokenize(zval *return_value, zend_string *source, zend_class_en | |||
&& --need_tokens == 0 | |||
) { | |||
/* fetch the rest into a T_INLINE_HTML */ | |||
if (zendcursor != zendlimit) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This breaks if it receives a T_ERROR
token at end of file (which was not possible before). I'm not sure why, but in that case, zendcursor
is actually more than zendlimit
. Changing the test to zendcursor < zendlimit
makes it a bit more robust.
The added guard clause above actually takes care of this problem, but I figure changing the test to zendcursor < zendlimit
is still a good idea.
@@ -40,7 +41,7 @@ echo "Done" | |||
?> | |||
--EXPECTF-- | |||
*** Testing token_get_all() : with exception keywords *** | |||
array(81) { | |||
array(83) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test had bad syntax before in the code passed to token_get_all
(it was missing a curly bracket). Fixed that up.
See the detailed code-level comments in the "Files Changed" tab. |
I can't really give comments about the implementation as I'm not familiar with this area, but in the same vein there are unterminated/run-away strings which could also receive a similar treatment to this. |
I'd be happy to implement better syntax errors for those as well if the PHP developers would like a patch. |
4919bee
to
c94ac2c
Compare
@nikic, after your helpful comments I just fixed up the code so it only returns |
This is what failed on AppVeyor:
|
@alexdowad This was fixed via #5365. Please, rebase with |
c94ac2c
to
b2f9cdb
Compare
Thanks so much. Just rebased and force-pushed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me.
Zend/zend_language_scanner.l
Outdated
@@ -2569,6 +2663,7 @@ skip_escape_conversion: | |||
<ST_DOUBLE_QUOTES,ST_BACKQUOTE,ST_HEREDOC>"{$" { | |||
yy_push_state(ST_IN_SCRIPTING); | |||
yyless(1); | |||
enter_nesting(yytext); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will refer to an opening {
rather than {$
. That's fine, but makes me wonder if we need the special-casing around ${
? Shouldn't we just report that as {
as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we still need the special-casing for ${
. The error messages will be more helpful. The case where ${
will be reported is the one on line 1878 (different from this one).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes the error reporting inconsistent though. It means that ${
outside a string will report as an unclosed {
, while a ${
inside a string will report and unclosed ${
. This seems somewhat odd to me. The fact that we treat $
and {
as separate tokens outside strings and as a single token inside strings shouldn't leak through in this way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe my intuition is off, but when I see "... ${<expression>} ..."
, intuitively I think of the ${
as matching the }
. In my mind, those are the delimiters. But when I see something like $foo->{$bar}
, I think of {
as matching }
and $bar
as being the "contents" of the braces. To my eyes, {$
doesn't appear to be a delimiter, but rather just {
.
(Please note that above you refer to "${
outside a string", which is not what we are dealing with here. It's {$
outside a string.)
Anyways, please just tell me what you want. Would you like PHP to report {$
as not matching a different (or missing) closing delimiter? If so, that's fine. I can do that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Please note that above you refer to "
${
outside a string", which is not what we are dealing with here. It's{$
outside a string.)
To be clear, I'm talking about something like ${$foo}
. This is valid both inside and outside strings, and I think those should receive the same treatment, either by referring to ${
in both cases, or {
in both cases. As writing $ {$foo}
with additional whitespace is also legal (outside strings), I'd go with just reporting the {
in both cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. That makes sense now.
One hot, fresh little force-push coming right up!!!
b2f9cdb
to
20e45c4
Compare
Incorporated helpful suggestions from @nikic and force-pushed. |
20e45c4
to
0272b68
Compare
Aside from a few very specific syntax errors for which detailed exceptions are thrown, generally PHP just emits the default error messages generated by bison on syntax error. These messages are very uninformative; they just say "Unexpected ... at line ...". This is most problematic with constructs which can span an arbitrary number of lines, such as blocks of code delimited by { }, 'if' conditions delimited by ( ), and so on. If a closing delimiter is missed, the block will run for the entire remainder of the source file (which could be thousands of lines), and then at the end, a parse error will be thrown with the dreaded words: "Unexpected end of file". Therefore, track the positions of opening and closing delimiters and ensure that they match up correctly. If any mismatch or missing delimiter is detected, immediately throw a parse error which points the user to the offending line. This is best done in the *lexer* and not in the parser. Thanks to Nikita Popov and George Peter Banyard for suggesting improvements.
Zend/zend_language_scanner.l
Outdated
|
||
zend_nest_location *nest_loc = zend_stack_top(&SCNG(nest_location_stack)); | ||
char opening = nest_loc->text; | ||
char closing = *location; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This leftover is currently making the build fail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. I actually saw that already and am just running make
and make test
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for my haste to push out the last update!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By the way, there is some funniness with PHP's Makefiles. I find that after editing certain source files, it is necessary to make clean
. Just make
doesn't rebuild everything which needs to be rebuilt, and then some tests fail with SIGTERM, SIGSEGV, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The makefiles don't track header dependencies, so this can happen when data structures in headers are changed.
It would be nice to implement this, but everyone is afraid to touch the build system :)
0272b68
to
e8d69c6
Compare
New contributor here. Pointers on how to improve will be much appreciated! Now starting with the commit message:
Aside from a few very specific syntax errors for which detailed exceptions are
thrown, generally PHP just emits the default error messages generated by bison on syntax
error. These messages are very uninformative; they just say "Unexpected ... at line ...".
This is most problematic with constructs which can span an arbitrary number of lines, such
as blocks of code delimited by { }, 'if' conditions delimited by ( ), and so on. If a closing
delimiter is missed, the block will run for the entire remainder of the source file (which
could be thousands of lines), and then at the end, a parse error will be thrown with the
dreaded words: "Unexpected end of file".
Therefore, track the positions of opening and closing delimiters and ensure that they match
up correctly. If any mismatch or missing delimiter is detected, immediately throw a parse
error which points the user to the offending line. This is best done in the lexer and not
in the parser.
This was done after seeing Bug 79368: https://bugs.php.net/bug.php?id=79368.
I tried to see if there is a separate repo for the Tokenizer extension, but couldn't find one. Hopefully this is the right place for changes to Tokenizer? There are just a couple of lines adjusted there so that it doesn't break.
Performance impact: Time taken to run a source file with 1,157,900 curly brackets (and nothing else) goes up from 320ms to 380ms on my Linux Mint machine (Intel Core 2 Duo, 1500MHz). It looks like slighly less than 1 millisecond of overhead is added for every 10,000 pairs of brackets. Note this overhead occurs only when the source file is being parsed; once it is compiled and running, there is no overhead.
If the PHP dev team feels that is too much, I can try to optimize it down, but I feel that in a more realistic situation, the effect will be completely drowned out by the time taken to run the code. Even just looking at the time used for parsing, brackets do not usually take up the majority of a source file.