-
Notifications
You must be signed in to change notification settings - Fork 50
Regex Syntax Pitch #140
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Regex Syntax Pitch #140
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good!
This exhaustive treatment is so, so helpful. Thanks for continuing to push on it and I think it's coming along well. |
5d49e82
to
f3d7301
Compare
a592889
to
3ebb55e
Compare
3ebb55e
to
1beb1f0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Intro, motivation, and proposed solution.
Co-authored-by: Michael Ilseman <michael.ilseman@gmail.com>
(I want to share my progress quickly and won't be able to formally commit to this PR yet, so I apologize for using the comment stream for this) Reworking the start of Detailed Design a bit, trying to crunch it down a little bit. With this we can delete the Trivia and Comment sections too. Detailed DesignWe propose the following syntax for use inside Swift regex literals. TODO: Disclosure triangle explaining the grammar conventions? Top-level regular expression
A regex literal may be prefixed with a sequence of global matching options(TODO: intra-doc link). A literal's contents can be empty or a sequence of alternatives separated by Alternatives are a series of expressions concatenated together. The concatentation ends with either a Alternation has a lower precedence than concatenation or other operations, so e.g Concatenated subexpressions
Each component of a concatenation may be "trivia" (comments and non-semantic whitespace, if applicable), a quoted run of literal content, or a potentially-quantified subexpression. In-line comments, similarly to C, are lexical and are not recursively nested like normal groups are. A closing For example, TODO: Do we want to warn for a Quantified subexpressions(... rest of document) |
That's currently an error.
Yeah I think it probably makes sense to have a general warning for escaped |
I pushed a commit that trimmed prose a little bit and reorganized some of the detailed design. |
I finished the intro, motivation, and proposed solution sections. I added 3 alternatives considered. @hamishknight how close are we to having this ready? I made a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
edit: Whoops, when doing line comments you need to click the green button to save it but not publish the review. But not so for review comments! Ignore me :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ugh, had this in-progress review that got jumbled up. Sorry about that.
@@ -427,7 +443,7 @@ We support all the matching options accepted by PCRE, ICU, and Oniguruma. In add | |||
- `n`: Disables the capturing behavior of `(...)` groups. Named capture groups must be used instead. | |||
- `s`: Changes `.` to match any character, including newlines. | |||
- `U`: Changes quantifiers to be reluctant by default, with the `?` specifier changing to mean greedy. | |||
- `x`, `xx`: Enables extended syntax mode, which allows non-semantic whitespace and end-of-line comments. See the *trivia* section for more info. | |||
- `x`, `xx`: Enables extended syntax mode, which allows non-semantic whitespace and end-of-line comments. See [Extended Syntax Modes](#extended-syntax-modes) for more info. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know where, but if it's helpful, Perl's x
came historically before xx
. Unifying on x
is more about avoiding an accident of history than diverging from the spirit of Perl.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the extended syntax section to also include a mention of Perl, is that sufficient? Or do we want to go into more detail?
OnigurumaCalloutDirection -> 'X' | '<' | '>' | ||
``` | ||
|
||
A callout is a feature that allows a user-supplied function to be called when matching reaches that point in the pattern. We supported parsing both the PCRE and Oniguruma callout syntax. The PCRE syntax accepts a string or numeric argument that is passed to the function. The Oniguruma syntax is more involved, and may accept a tag, argument list, or even an arbitrary program in the 'callout of contents' syntax. | ||
A callout is a feature that allows a user-supplied function to be called when matching reaches that point in the pattern. We supported parsing both the PCRE and Oniguruma callout syntax. The PCRE syntax accepts a string or numeric argument that is passed to the function. The Oniguruma syntax is more involved, and may accept an identifier with an optional tag and argument list. It may also accept an arbitrary program in the 'callout of contents' syntax. This is an expanded version of Perl's interpolation syntax, and allows an arbitrary nesting of delimiters in addition to an optional tag and direction. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Worth an additional paragraph perhaps:
While we propose parsing these for the purposes of issuing helpful diagnostics, we are deferring full support for interpolation for the future.
We may want to remove the "Oniguruma" from the title of some of the productions, as this may end up being the Swift interpolation syntax. We just need to figure out what types of things go inside there (analogous to string interpolations) and if that's protocol driven. Clearly future work, though it would bring literals more up to parity with result builders' ability to interweave library calls.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does this look?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome! 💯
My only general comment is that the last paragraph in the proposal should come very early on, so that this syntactical/parsing proposal isn't misinterpreted as promising the feature set of Swift regular expressions.
UnicodeScalar -> '\u{' HexDigit{1...} '}' | ||
| '\u' HexDigit{4} | ||
| '\x{' HexDigit{1...} '}' | ||
| '\x' HexDigit{0...2} | ||
| '\U' HexDigit{8} | ||
| '\o{' OctalDigit{1...} '}' | ||
| '\0' OctalDigit{0...3} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In earlier discussions, we had talked about only supporting the '\u{' HexDigit{1...} '}'
style for unicode scalar literals, to match strings, and offering fixits for all the others. Are we planning to support all these varied styles instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe we want to support at least parsing them, but we will likely want to warn and tell users to use \u{...}
instead
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Support" should be maximally permissive (balanced with being clear and unambiguous) as that's the main value of going with existing syntax instead of a novel syntax. But, we can also negotiate what is warned on and what has fixits.
I think the "Swift canonical" scalar syntax is clearly superior and should be encouraged, but there's a choice of mechanism here. We certainly should accept the other forms and compile them to the same thing in the end, it's a question of whether and what kind of diagnostics we emit.
I am a little concerned with tons of warnings, especially since Swift doesn't support fine-grained warning control (err, it does internally because I added it way back when when I added global warning suppression and warnings-as-errors, but I don't think it's been exposed). A bad developer experience is warnings that are unavoidable or for whom addressing is worse than living with them.
For a function taking a run-time string containing regex syntax, we'd support all the ways to spell it rather than throw a cannot-compile error. A different API would be whether it's in Swift canonical form or to put it into Swift canonical form. If we warn too heavily, that could even encourage people to use raw string literals instead of regex literals, which would be awkward.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move the sentence to the intro: f79fb92
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, I disagree about accepting the other forms. This list is a superset; AFAIK there's no language that supports the full laundry list. That means that whenever you're moving regexes from language to language there's some necessary massaging of things like this. If we can make that as easy as clicking a fix-it or pressing ⌃⌥⌘F, that's more support than you typically get.
Moved the note about syntactic support not meaning run-time support to intro
Co-authored-by: Nate Cook <natecook@apple.com>
@swift-ci please test Linux |
No description provided.