Skip to content

bpo-43892: Make match patterns explicit in AST #25585

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

ncoghlan
Copy link
Contributor

@ncoghlan ncoghlan commented Apr 25, 2021

  • Separates "pattern" and "expr" nodes in AST
  • Adds a new "Pattern matching" section in the ast docs,
    separate from the "Control flow" section
  • AST node names are inspired by the AST proposed
    in PEP 642, but adjusted to better match the semantics
    of PEP 634 and to be more consistent with the existing AST
  • Grammar definition has been updated to emit the
    newly defined AST nodes
  • Parser now emits SyntaxError for ill-formed complex literals
    (so the AST optimiser no longer has to do it)
  • Code generator consumes the new nodes
  • AST validator consumes the new nodes
  • AST unparser consumes new nodes
  • test_unparse always checks the AST roundtrip for test_patma.py
  • Add some clarifying comments to the unparsing code

https://bugs.python.org/issue43892

@ncoghlan
Copy link
Contributor Author

ncoghlan commented Apr 26, 2021

  • Merged in master, adjusting for the recursion tracking in the AST validator and optimiser
  • Updated the AST documentation (the MatchAs and MatchOr doctests failed, highlighting that the new nodes also needed documentation)

@brandtbucher
Copy link
Member

@pablogsal, do you want to see buildbots on this before merging?

@pablogsal
Copy link
Member

@pablogsal, do you want to see buildbots on this before merging?

Either that or run manually a bunch of related tests yourself with -R : and paste the output here :)

@brandtbucher
Copy link
Member

This good?

$ ./python -m test -R : test_ast test_symtable test_peg_generator test_patma test_grammar test_unparse test_compile test_syntax
0:00:00 load avg: 2.04 Run tests sequentially
0:00:00 load avg: 2.04 [1/8] test_ast
beginning 9 repetitions
123456789
.........
0:00:21 load avg: 1.82 [2/8] test_symtable
beginning 9 repetitions
123456789
.........
0:00:21 load avg: 1.82 [3/8] test_peg_generator
beginning 9 repetitions
123456789
/home/bucher/src/cpython/Lib/test/test_peg_generator/test_c_parser.py:4: DeprecationWarning: The distutils package is deprecated and slated for removal in Python 3.12. Use setuptools or check PEP 632 for potential alternatives
  from distutils.tests.support import TempdirManager
/home/bucher/src/cpython/Lib/test/support/__init__.py:1671: DeprecationWarning: The distutils.sysconfig module is deprecated, use sysconfig instead
  from distutils import ccompiler, sysconfig, spawn, errors
.........
0:08:10 load avg: 1.41 [4/8] test_patma -- test_peg_generator passed in 7 min 48 sec
beginning 9 repetitions
123456789
.........
0:08:10 load avg: 1.41 [5/8] test_grammar
beginning 9 repetitions
123456789
.........
0:08:10 load avg: 1.41 [6/8] test_unparse
beginning 9 repetitions
123456789
.........
0:08:36 load avg: 1.40 [7/8] test_compile
beginning 9 repetitions
123456789
.........
0:09:14 load avg: 1.27 [8/8] test_syntax -- test_compile passed in 38.1 sec
beginning 9 repetitions
123456789
.........

== Tests result: SUCCESS ==

All 8 tests OK.

Total duration: 9 min 15 sec
Tests result: SUCCESS

@pablogsal
Copy link
Member

enerator test_patma test_grammar test_unparse test_compile test_syntax
0:00:00 load avg: 2.04 Run tests sequentially
0:00:00 load avg: 2.04 [1/8] test_ast

Yup yup, go ahead! 🚀

@@ -1366,7 +1366,7 @@ Py_CompileStringObject(const char *str, PyObject *filename, int start,
return NULL;

mod = _PyParser_ASTFromString(str, filename, start, flags, arena);
if (mod == NULL) {
if (mod == NULL || !_PyAST_Validate(mod)) {
Copy link
Member

@pablogsal pablogsal Apr 27, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait, why is this change here? THis is going to slow down quite a lot Py_CompileStringObject code path. Notice that the parser already calls this on input only on debug mode precisely because of performance concerns.

The design principle is that the ast that comes from the parser must be always correct and I think we should certainly preserve that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's remove it for now, then. I understood it as being motivated by the points in this comment.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regading this comment there is an important fact to highlight:

compiling from strings didn't attempt to validate them because it never attempted to validate anything

The actual reason is because by design the parser code must return a valid AST (there is a debug-mode validation step to ensure that this contract is true).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, looks like the new code might have been relying on that pass to give us a SyntaxError in test_patma_240. I'll try to figure out what's going on.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The parser can't verify that a complex literal is actually a complex literal because the tokeniser just emits NUMBER for both real and imaginary numbers. As a result, 0+0 makes it through the parsing rules for pattern matching.

The compiler can't reject the syntax because it never sees it - the value has turned into the legal constant 0 by the time the pattern gets to it.

The original pattern matching implementation worked around that limitation by putting a validation step in the AST constant folding pass: instead of using the standard constant folding for expressions, it reimplemented the subset required to fold the literals that pattern matching supported and emitted a syntax error for binary operations that weren't complex numbers.

This PR switches over to using the standard expression folding functions in the optimisation pass, so forcing validation was needed to keep "0+0" illegal in value patterns and mapping pattern keys.

I'll try replace it with a more targeted fix that adds a helper function to pegen that ensures the AST coming out of the parser is valid rather than relying on the AST validator or the AST constant folding to reject it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The compiler can't reject the syntax because it never sees it - the value has turned into the legal constant 0 by the time the pattern gets to it.

We probably shouldn't fold the AST for patterns in that case so the compiler can handle it. Or I am missing something?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just pushed a change that restores the invariant of the parser always producing a valid AST without having to weaken the checks on the AST nodes for value patterns and mapping patterns keys.

The trick is a new _PyAST_EnsureImaginary helper function that accepts and returns expr_ty.

It looks like I didn't get the reversion/merge with Brandt's changes quite right though, so I'll post an example after cleaning that up.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had reverted the validator and compiler changes, but the AST optimiser changes needed to be reverted as well.
This is how compilation from a string behaves with the parser enhanced to ensure that complex literals actually are complex numbers:

>>> match x:
...     case 0+0:
SyntaxError: Imaginary number required in complex literal

So with that everything is handling the responsibilities it should be handling:

  • parser rejects "0+0" when compiling from a string rather than producing an invalid AST
  • validator rejects "0+0" when compiling from a prebuilt AST tree
  • optimiser and compiler have enough checks to protect themselves from segfaults, but are otherwise pretty tolerant

Copy link
Contributor Author

@ncoghlan ncoghlan Apr 28, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After the first iteration, I realised the imaginary number checking would work better as a pegen helper rather than as an AST helper:

>>> match x:
...     case 0+0:
  File "<stdin>", line 2
    case 0+0:
           ^
SyntaxError: Imaginary number required in complex literal

positional_patterns[asdl_pattern_seq*]:
| args[asdl_pattern_seq*]=','.pattern+ { args }
keyword_patterns[asdl_seq*]:
| keywords[asdl_seq*]=','.keyword_pattern+ { keywords }
Copy link
Member

@pablogsal pablogsal Apr 27, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can probably change this to just ','.keyword_pattern+. Same for similar stuff around this like | items=','.key_value_pattern+ { items }. But we can clean this afterwards, for sure.

Copy link
Contributor Author

@ncoghlan ncoghlan Apr 28, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this and PEP 642 were my first real ventures into working with the new parser. It's really nice, but it isn't always obvious when it will need help with typecasting and when it will be able to figure it on its own (e.g. I think this node used to be a more specific type than adsl_seq, so it would have needed the typecast)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this and PEP 642 were my first real ventures into working with the new parser. It's really nice, but it isn't always obvious when it will need help with typecasting and when it will be able to figure it on its own (e.g. I think this node used to be a more specific type than adsl_seq, so it would have needed the typecast)

I'm preparing a new documentation for working with the parser :)

How has been your experience so far? Any feedback (good and bad) is good feedback ;)

@brandtbucher
Copy link
Member

Okay, I disabled some overly-aggressive optimizations for value patterns and mapping patterns in the AST optimizer and moved validation of numeric literals back into the compiler. Reran the refleak tests locally and everything looks good.

@brandtbucher
Copy link
Member

@ncoghlan, it's almost time for me to go to sleep and for you to wake up. I'm happy with this now, so go ahead and merge if you agree!

@ncoghlan
Copy link
Contributor Author

@brandtbucher The parser potentially producing an illegal AST tree never sat well with me, and it finally occurred to me today how to fix it: filter the required-to-be-imaginary parse nodes through an AST helper function that raises a syntax error if they're not complex numbers. Between the tokenizer ensuring that the nodes are literals, and the helper function ensuring that they're complex number, we get a combined assurance that the value is an imaginary number.

So I reworked things to frontload the validation again, while keeping your other changes.

@ncoghlan ncoghlan requested a review from brandtbucher April 28, 2021 12:49
@ncoghlan
Copy link
Contributor Author

Updated refleak hunting results (still clean, although it appears my laptop is slightly slower than Brandt's machine):

$ ./python -m test -R : test_ast test_symtable test_peg_generator test_patma test_grammar test_unparse test_compile test_syntax
0:00:00 load avg: 0.68 Run tests sequentially
0:00:00 load avg: 0.68 [1/8] test_ast
beginning 9 repetitions
123456789
.........
0:00:20 load avg: 0.79 [2/8] test_symtable
beginning 9 repetitions
123456789
.........
0:00:20 load avg: 0.79 [3/8] test_peg_generator
beginning 9 repetitions
123456789
/home/ncoghlan/devel/cpython/Lib/test/test_peg_generator/test_c_parser.py:4: DeprecationWarning: The distutils package is deprecated and slated for removal in Python 3.12. Use setuptools or check PEP 632 for potential alternatives
  from distutils.tests.support import TempdirManager
/home/ncoghlan/devel/cpython/Lib/test/support/__init__.py:1671: DeprecationWarning: The distutils.sysconfig module is deprecated, use sysconfig instead
  from distutils import ccompiler, sysconfig, spawn, errors
.........
0:10:38 load avg: 1.39 [4/8] test_patma -- test_peg_generator passed in 10 min 17 sec
beginning 9 repetitions
123456789
.........
0:10:38 load avg: 1.39 [5/8] test_grammar
beginning 9 repetitions
123456789
.........
0:10:39 load avg: 1.39 [6/8] test_unparse
beginning 9 repetitions
123456789
.........
0:10:58 load avg: 1.28 [7/8] test_compile
beginning 9 repetitions
123456789
.........
0:11:31 load avg: 1.15 [8/8] test_syntax -- test_compile passed in 33.4 sec
beginning 9 repetitions
123456789
.........

== Tests result: SUCCESS ==

All 8 tests OK.

Total duration: 11 min 32 sec
Tests result: SUCCESS

signed_number[expr_ty]:
| NUMBER
| '-' number=NUMBER { _PyAST_UnaryOp(USub, number, EXTRA) }

capture_pattern[expr_ty]:
imaginary_number[expr_ty]:
| imag=NUMBER { _PyPegen_ensure_imaginary(p, imag) }
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@brandtbucher @pablogsal This is the new helper that ensures the parser won't accept "0+0" and similar constructs as "complex literals".

@brandtbucher
Copy link
Member

brandtbucher commented Apr 28, 2021

Hm, this still allows some illegal value expressions:

>>> match 1j:
...     case 1j+0j:
...         print(":(")
... 
:(

(We should probably add a test for this.)

I personally still prefer checking these in the compiler (as I left the branch last night), since it keeps us from having to put the same logic for rejecting illegal value patterns in two places (the parser and the validator).

Note that there are already lots of illegal trees that make it as far as the compiler: yield/async/await/return outside of functions, break/continue outside of loops, illegal * expressions, etc. I see the job of the validator as being "make sure this tree can't crash the compiler", not "make sure this tree is valid Python".

I'm almost ready to drop complex literal value patterns entirely and wait until somebody asks for them. And then I can just tell them to use a dotted name instead. 😉

@pablogsal
Copy link
Member

Hm, this still allows some illegal value expressions:

>>> match 1j:
...     case 1j+0j:
...         print(":(")
... 
:(

(We should probably add a test for this.)

I personally still prefer checking these in the compiler (as I left the branch last night), since it keeps us from having to put the same logic for rejecting illegal value patterns in two places (the parser and the validator).

Note that there are already lots of illegal trees that make it as far as the compiler: yield/async/await/return outside of functions, break/continue outside of loops, illegal * expressions, etc. I see the job of the validator as being "make sure this tree can't crash the compiler", not "make sure this tree is valid Python".

The key is that the AST check is to check manually crafted AST objects. Otherwise the compiler and the parser are obviously coupled, so the compiler knows what needs to check and what not, which is how is normally done and keeps it efficient.

The slightly sad thing that I want to improve is more Syntax errors to the grammar so the grammar represents Python and not "almost Python".

I'm almost ready to drop complex literal value patterns entirely and wait until somebody asks for them. And then I can just tell them to use a dotted name instead. 😉

You have until this Sunday to decide 😉

@brandtbucher
Copy link
Member

Looping in @gvanrossum. Are complex literal value patterns worth the complexity right now? I think we just added them because we could.

If so, then we can throw out BinOp values entirely, and only accept:

  • strings
  • byte-strings
  • floats
  • negated floats
  • ints
  • negated ints
  • imaginary numbers
  • negated imaginary numbers

So, continue to allow (4.2j, and -4.2j, but not -4.2+4.2j).

@gvanrossum
Copy link
Member

I find it important that we can spell all values for these types. So I think we should keep it. (But I do not think that checking for all invalid cases needs to be implemented before beta 1.)

@brandtbucher
Copy link
Member

brandtbucher commented Apr 29, 2021

Okay, that makes sense.

Unfortunately, I'm not sure how to regenerate the parser on somebody else's PR. Since there's already a lot going on here, I am going to just land this as-is and open a smaller follow-up PR for the cleanup.

@brandtbucher brandtbucher merged commit 1e7b858 into python:master Apr 29, 2021
@ncoghlan
Copy link
Contributor Author

That approach makes sense to me, too. Thanks for the reviews @brandtbucher, @pablogsal!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants