bpo-43892: Make match patterns explicit in AST #25585
Conversation
* Separates "pattern" and "expr" nodes in AST
* AST node names are derived from the AST proposed in PEP 642, except that MatchValue always matches by equality and a separate MatchConstant node is defined for matching by identity
* Grammar definition has been updated to emit the newly defined AST nodes
* TODO: update code generator to consume new nodes
* TODO: update AST validator to consume new nodes
* TODO: update AST unparser to consume new nodes
@pablogsal, do you want to see buildbots on this before merging?
Either that, or run a bunch of the related tests manually yourself with -R : and paste the output here :)
This good?
Yup yup, go ahead! 🚀
Python/pythonrun.c (outdated)

    @@ -1366,7 +1366,7 @@ Py_CompileStringObject(const char *str, PyObject *filename, int start,
             return NULL;

         mod = _PyParser_ASTFromString(str, filename, start, flags, arena);
    -    if (mod == NULL) {
    +    if (mod == NULL || !_PyAST_Validate(mod)) {
Wait, why is this change here? This is going to slow down the Py_CompileStringObject code path quite a lot. Notice that the parser already runs this check only in debug mode, precisely because of performance concerns.
The design principle is that the AST that comes from the parser must always be correct, and I think we should certainly preserve that.
Let's remove it for now, then. I understood it as being motivated by the points in this comment.
Regarding this comment, there is an important fact to highlight:
> compiling from strings didn't attempt to validate them because it never attempted to validate anything
The actual reason is that, by design, the parser code must return a valid AST (there is a debug-mode validation step to ensure that this contract holds).
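To make the division of labour concrete, here is a minimal Python-level sketch (not part of this PR) of the two compilation paths being discussed: compiling from a string, where the parser itself is trusted to produce a valid tree, and compiling a hand-built ast object, which is the path that actually needs the AST validator.

    import ast

    # Path 1: compiling from source text. The parser guarantees a valid AST,
    # so no extra validation pass is needed on this path.
    compile("x = 1", "<string>", "exec")

    # Path 2: compiling a manually constructed AST. This is the path where
    # validation matters, because nothing else guarantees the tree is sane.
    tree = ast.Module(
        body=[ast.Assign(targets=[ast.Name(id="x", ctx=ast.Store())],
                         value=ast.Constant(value=1))],
        type_ignores=[],
    )
    ast.fix_missing_locations(tree)
    compile(tree, "<ast>", "exec")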
Hm, looks like the new code might have been relying on that pass to give us a SyntaxError in test_patma_240. I'll try to figure out what's going on.
The parser can't verify that a complex literal is actually a complex literal, because the tokeniser just emits NUMBER for both real and imaginary numbers. As a result, 0+0 makes it through the parsing rules for pattern matching.
The compiler can't reject the syntax because it never sees it: the value has turned into the legal constant 0 by the time the pattern gets to it.
The original pattern matching implementation worked around that limitation by putting a validation step in the AST constant folding pass: instead of using the standard constant folding for expressions, it reimplemented the subset required to fold the literals that pattern matching supported and emitted a syntax error for binary operations that weren't complex numbers.
This PR switches over to using the standard expression folding functions in the optimisation pass, so forcing validation was needed to keep "0+0" illegal in value patterns and mapping pattern keys.
I'll try to replace it with a more targeted fix that adds a helper function to pegen that ensures the AST coming out of the parser is valid, rather than relying on the AST validator or the AST constant folding to reject it.
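For context, a hedged illustration (not code from this PR): in the AST as exposed by the ast module in released Pythons, a complex value pattern is kept as a BinOp over two constants, and only gets folded into a single complex constant later, during compilation, which is why the check has to live somewhere on that path.

    import ast

    src = """
    match x:
        case 1 + 2j:
            pass
    """
    pattern = ast.parse(src).body[0].cases[0].pattern
    # The pattern value is still an unfolded BinOp of two Constant nodes here;
    # folding it into a single complex constant happens later, inside compile().
    print(ast.dump(pattern))
    # MatchValue(value=BinOp(left=Constant(value=1), op=Add(), right=Constant(value=2j)))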
> The compiler can't reject the syntax because it never sees it: the value has turned into the legal constant 0 by the time the pattern gets to it.
We probably shouldn't fold the AST for patterns in that case, so the compiler can handle it. Or am I missing something?
I just pushed a change that restores the invariant of the parser always producing a valid AST, without having to weaken the checks on the AST nodes for value patterns and mapping pattern keys.
The trick is a new _PyAST_EnsureImaginary helper function that accepts and returns expr_ty.
It looks like I didn't get the reversion/merge with Brandt's changes quite right though, so I'll post an example after cleaning that up.
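Roughly, the check being described can be sketched at the Python level like this (an illustration only: the ensure_imaginary function below is hypothetical, not the actual C helper, and the error message is just the one quoted later in this thread):

    import ast

    def ensure_imaginary(node):
        # Accept the node only if it is a literal whose value is complex,
        # i.e. the source spelled it with a trailing "j"; otherwise reject it.
        if not (isinstance(node, ast.Constant) and isinstance(node.value, complex)):
            raise SyntaxError("Imaginary number required in complex literal")
        return node

    # A genuine imaginary literal passes through unchanged...
    ensure_imaginary(ast.parse("2j", mode="eval").body)
    # ...while a plain integer literal (as in "0+0") would be rejected.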
I had reverted the validator and compiler changes, but the AST optimiser changes needed to be reverted as well.
This is how compilation from a string behaves with the parser enhanced to ensure that complex literals actually are complex numbers:
    >>> match x:
    ...     case 0+0:
    SyntaxError: Imaginary number required in complex literal
So with that everything is handling the responsibilities it should be handling:
- parser rejects "0+0" when compiling from a string rather than producing an invalid AST
- validator rejects "0+0" when compiling from a prebuilt AST tree (see the sketch just after this list)
- optimiser and compiler have enough checks to protect themselves from segfaults, but are otherwise pretty tolerant
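As a hedged illustration of the second bullet (not code from this PR): a prebuilt tree that encodes case 0+0: can only arrive via the ast module, and compile() is then expected to reject it during AST validation rather than letting it reach the optimiser or compiler.

    import ast

    # Start from a well-formed match statement, then swap in the kind of value
    # pattern the parser itself would now refuse to produce for "case 0+0:".
    tree = ast.parse("match x:\n    case 1:\n        pass")
    bad_value = ast.BinOp(left=ast.Constant(0), op=ast.Add(), right=ast.Constant(0))
    tree.body[0].cases[0].pattern = ast.MatchValue(value=bad_value)
    ast.fix_missing_locations(tree)

    try:
        compile(tree, "<ast>", "exec")
    except (ValueError, SyntaxError) as exc:
        # The prebuilt tree is rejected while being validated, before it can
        # reach the compiler.
        print("rejected:", exc)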
After the first iteration, I realised the imaginary number checking would work better as a pegen helper rather than as an AST helper:
    >>> match x:
    ...     case 0+0:
      File "<stdin>", line 2
        case 0+0:
             ^
    SyntaxError: Imaginary number required in complex literal
    positional_patterns[asdl_pattern_seq*]:
        | args[asdl_pattern_seq*]=','.pattern+ { args }
    keyword_patterns[asdl_seq*]:
        | keywords[asdl_seq*]=','.keyword_pattern+ { keywords }
You can probably change this to just ','.keyword_pattern+. Same for similar stuff around this, like | items=','.key_value_pattern+ { items }. But we can clean this up afterwards, for sure.
Yeah, this and PEP 642 were my first real ventures into working with the new parser. It's really nice, but it isn't always obvious when it will need help with typecasting and when it will be able to figure it out on its own (e.g. I think this node used to be a more specific type than asdl_seq, so it would have needed the typecast).
> Yeah, this and PEP 642 were my first real ventures into working with the new parser. It's really nice, but it isn't always obvious when it will need help with typecasting and when it will be able to figure it out on its own (e.g. I think this node used to be a more specific type than asdl_seq, so it would have needed the typecast).
I'm preparing new documentation for working with the parser :)
How has your experience been so far? Any feedback (good or bad) is good feedback ;)
Okay, I disabled some overly-aggressive optimizations for value patterns and mapping patterns in the AST optimizer and moved validation of numeric literals back into the compiler. Reran the refleak tests locally and everything looks good.
@ncoghlan, it's almost time for me to go to sleep and for you to wake up. I'm happy with this now, so go ahead and merge if you agree!
@brandtbucher The parser potentially producing an illegal AST tree never sat well with me, and it finally occurred to me today how to fix it: filter the required-to-be-imaginary parse nodes through an AST helper function that raises a syntax error if they're not complex numbers. Between the tokenizer ensuring that the nodes are literals, and the helper function ensuring that they're complex numbers, we get a combined assurance that the value is an imaginary number. So I reworked things to frontload the validation again, while keeping your other changes.
Updated refleak hunting results (still clean, although it appears my laptop is slightly slower than Brandt's machine).
    signed_number[expr_ty]:
        | NUMBER
        | '-' number=NUMBER { _PyAST_UnaryOp(USub, number, EXTRA) }

    capture_pattern[expr_ty]:
    imaginary_number[expr_ty]:
        | imag=NUMBER { _PyPegen_ensure_imaginary(p, imag) }
@brandtbucher @pablogsal This is the new helper that ensures the parser won't accept "0+0" and similar constructs as "complex literals".
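For illustration (a hedged sketch, not a snippet from this PR): with the check in place, value patterns written as a real part plus an imaginary part are still meant to work as before.

    # A genuinely complex value pattern: real literal plus imaginary literal.
    match 3 + 4j:
        case 3 + 4j:
            print("matched")   # this branch runs
        case _:
            print("no match")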
Hm, this still allows some illegal value expressions:

    >>> match 1j:
    ...     case 1j+0j:
    ...         print(":(")
    ...
    :(

(We should probably add a test for this.)
I personally still prefer checking these in the compiler (as I left the branch last night), since it keeps us from having to put the same logic for rejecting illegal value patterns in two places (the parser and the validator). Note that there are already lots of illegal trees that make it as far as the compiler.
I'm almost ready to drop complex literal value patterns entirely and wait until somebody asks for them. And then I can just tell them to use a dotted name instead. 😉
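The dotted-name alternative mentioned above would look something like this (an illustrative sketch; the constants namespace is made up for the example):

    import types

    # Put the complex value behind a name and match on the attribute lookup
    # instead of spelling a complex literal directly in the pattern.
    constants = types.SimpleNamespace(ORIGIN=0 + 0j)

    match 0j:
        case constants.ORIGIN:
            print("matched the origin")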
The key point is that the AST check exists to validate manually crafted AST objects. Otherwise, the compiler and the parser are deliberately coupled, so the compiler knows what it needs to check and what it doesn't, which is how this is normally done and keeps it efficient. The slightly sad thing that I want to improve is moving more syntax errors into the grammar, so the grammar represents Python and not "almost Python".
You have until this Sunday to decide 😉
Looping in @gvanrossum. Are complex literal value patterns worth the complexity right now? I think we just added them because we could. If so, then we can throw out
So, continue to allow (
I find it important that we can spell all values for these types. So I think we should keep it. (But I do not think that checking for all invalid cases needs to be implemented before beta 1.)
Okay, that makes sense. Unfortunately, I'm not sure how to regenerate the parser on somebody else's PR. Since there's already a lot going on here, I am going to just land this as-is and open a smaller follow-up PR for the cleanup.
That approach makes sense to me, too. Thanks for the reviews @brandtbucher, @pablogsal!
https://bugs.python.org/issue43892