Skip to content

Regex Syntax Pitch #140

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 30 commits into from
Mar 3, 2022
Merged

Regex Syntax Pitch #140

merged 30 commits into from
Mar 3, 2022

Conversation

hamishknight
Copy link
Contributor

No description provided.

Copy link
Member

@milseman milseman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good!

@milseman
Copy link
Member

This exhaustive treatment is so, so helpful. Thanks for continuing to push on it and I think it's coming along well.

@hamishknight hamishknight force-pushed the regex-syntax branch 2 times, most recently from 5d49e82 to f3d7301 Compare February 16, 2022 13:49
@hamishknight hamishknight force-pushed the regex-syntax branch 2 times, most recently from a592889 to 3ebb55e Compare February 18, 2022 13:53
Copy link
Member

@milseman milseman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Intro, motivation, and proposed solution.

@milseman
Copy link
Member

(I want to share my progress quickly and won't be able to formally commit to this PR yet, so I apologize for using the comment stream for this)

Reworking the start of Detailed Design a bit, trying to crunch it down a little bit. With this we can delete the Trivia and Comment sections too.

Detailed Design

We propose the following syntax for use inside Swift regex literals.

TODO: Disclosure triangle explaining the grammar conventions?

Top-level regular expression

Regex     -> GlobalMatchingOptionSequence? RegexNode
RegexNode -> '' | Alternation
Alternation -> Concatenation ('|' Concatenation)*
Concatenation   -> (!'|' !')' ConcatComponent)*

A regex literal may be prefixed with a sequence of global matching options(TODO: intra-doc link). A literal's contents can be empty or a sequence of alternatives separated by |.

Alternatives are a series of expressions concatenated together. The concatentation ends with either a | denoting the end of the alternative or a ) denoting the end of a recursively parsed group.

Alternation has a lower precedence than concatenation or other operations, so e.g abc|def matches against abc or def..

Concatenated subexpressions

ConcatComponent -> Trivia | Quote | Quantification

Trivia  -> Comment | NonSemanticWhitespace
Comment -> '(?#' (!')')* ')' | EndOfLineComment

(extended syntax only) EndOfLineComment      -> '#' .*$
(extended syntax only) NonSemanticWhitespace -> \s+

Quote -> '\Q' (!'\E' .)* '\E'

Each component of a concatenation may be "trivia" (comments and non-semantic whitespace, if applicable), a quoted run of literal content, or a potentially-quantified subexpression.

In-line comments, similarly to C, are lexical and are not recursively nested like normal groups are. A closing ) cannot be escaped. Quotes are similarly lexical, non-nested, and the \ before a \E cannot be escaped.

For example, \Q^[xy]+$\E, is treated as the literal characters ^[xy]+$ rather than an anchored quantified character class. \Q\\E is a literal \.

TODO: Do we want to warn for a \E appearing without a preceding \Q? Do we want to warn for a ) without anything opening it?

Quantified subexpressions

(... rest of document)

@hamishknight
Copy link
Contributor Author

Do we want to warn for a ) without anything opening it?

That's currently an error.

Do we want to warn for a \E appearing without a preceding \Q?

Yeah I think it probably makes sense to have a general warning for escaped [A-Za-z] characters that are being treated as literal, and maybe \E could have a custom version of that which mentions \Q.

@milseman
Copy link
Member

I pushed a commit that trimmed prose a little bit and reorganized some of the detailed design.

@milseman
Copy link
Member

I finished the intro, motivation, and proposed solution sections. I added 3 alternatives considered.

@hamishknight how close are we to having this ready? I made a TODO: Hamish not in the Swift canonical syntax part, basically it's not formally proposed by useful to discuss a preferred spelling, so I don't think we need anything as heavily structured as the normal proposal parts.

Copy link
Member

@milseman milseman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

edit: Whoops, when doing line comments you need to click the green button to save it but not publish the review. But not so for review comments! Ignore me :-)

Copy link
Member

@milseman milseman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ugh, had this in-progress review that got jumbled up. Sorry about that.

@@ -427,7 +443,7 @@ We support all the matching options accepted by PCRE, ICU, and Oniguruma. In add
- `n`: Disables the capturing behavior of `(...)` groups. Named capture groups must be used instead.
- `s`: Changes `.` to match any character, including newlines.
- `U`: Changes quantifiers to be reluctant by default, with the `?` specifier changing to mean greedy.
- `x`, `xx`: Enables extended syntax mode, which allows non-semantic whitespace and end-of-line comments. See the *trivia* section for more info.
- `x`, `xx`: Enables extended syntax mode, which allows non-semantic whitespace and end-of-line comments. See [Extended Syntax Modes](#extended-syntax-modes) for more info.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know where, but if it's helpful, Perl's x came historically before xx. Unifying on x is more about avoiding an accident of history than diverging from the spirit of Perl.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the extended syntax section to also include a mention of Perl, is that sufficient? Or do we want to go into more detail?

OnigurumaCalloutDirection -> 'X' | '<' | '>'
```

A callout is a feature that allows a user-supplied function to be called when matching reaches that point in the pattern. We supported parsing both the PCRE and Oniguruma callout syntax. The PCRE syntax accepts a string or numeric argument that is passed to the function. The Oniguruma syntax is more involved, and may accept a tag, argument list, or even an arbitrary program in the 'callout of contents' syntax.
A callout is a feature that allows a user-supplied function to be called when matching reaches that point in the pattern. We supported parsing both the PCRE and Oniguruma callout syntax. The PCRE syntax accepts a string or numeric argument that is passed to the function. The Oniguruma syntax is more involved, and may accept an identifier with an optional tag and argument list. It may also accept an arbitrary program in the 'callout of contents' syntax. This is an expanded version of Perl's interpolation syntax, and allows an arbitrary nesting of delimiters in addition to an optional tag and direction.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth an additional paragraph perhaps:

While we propose parsing these for the purposes of issuing helpful diagnostics, we are deferring full support for interpolation for the future.

We may want to remove the "Oniguruma" from the title of some of the productions, as this may end up being the Swift interpolation syntax. We just need to figure out what types of things go inside there (analogous to string interpolations) and if that's protocol driven. Clearly future work, though it would bring literals more up to parity with result builders' ability to interweave library calls.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does this look?

@hamishknight hamishknight marked this pull request as ready for review March 2, 2022 11:55
Copy link
Member

@natecook1000 natecook1000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome! 💯

My only general comment is that the last paragraph in the proposal should come very early on, so that this syntactical/parsing proposal isn't misinterpreted as promising the feature set of Swift regular expressions.

Comment on lines +214 to +220
UnicodeScalar -> '\u{' HexDigit{1...} '}'
| '\u' HexDigit{4}
| '\x{' HexDigit{1...} '}'
| '\x' HexDigit{0...2}
| '\U' HexDigit{8}
| '\o{' OctalDigit{1...} '}'
| '\0' OctalDigit{0...3}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In earlier discussions, we had talked about only supporting the '\u{' HexDigit{1...} '}' style for unicode scalar literals, to match strings, and offering fixits for all the others. Are we planning to support all these varied styles instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe we want to support at least parsing them, but we will likely want to warn and tell users to use \u{...} instead

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Support" should be maximally permissive (balanced with being clear and unambiguous) as that's the main value of going with existing syntax instead of a novel syntax. But, we can also negotiate what is warned on and what has fixits.

I think the "Swift canonical" scalar syntax is clearly superior and should be encouraged, but there's a choice of mechanism here. We certainly should accept the other forms and compile them to the same thing in the end, it's a question of whether and what kind of diagnostics we emit.

I am a little concerned with tons of warnings, especially since Swift doesn't support fine-grained warning control (err, it does internally because I added it way back when when I added global warning suppression and warnings-as-errors, but I don't think it's been exposed). A bad developer experience is warnings that are unavoidable or for whom addressing is worse than living with them.

For a function taking a run-time string containing regex syntax, we'd support all the ways to spell it rather than throw a cannot-compile error. A different API would be whether it's in Swift canonical form or to put it into Swift canonical form. If we warn too heavily, that could even encourage people to use raw string literals instead of regex literals, which would be awkward.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move the sentence to the intro: f79fb92

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, I disagree about accepting the other forms. This list is a superset; AFAIK there's no language that supports the full laundry list. That means that whenever you're moving regexes from language to language there's some necessary massaging of things like this. If we can make that as easy as clicking a fix-it or pressing ⌃⌥⌘F, that's more support than you typically get.

hamishknight and others added 3 commits March 2, 2022 19:00
Moved the note about syntactic support not meaning run-time support to intro
Co-authored-by: Nate Cook <natecook@apple.com>
@hamishknight
Copy link
Contributor Author

@swift-ci please test Linux

@hamishknight hamishknight merged commit a4d89ec into swiftlang:main Mar 3, 2022
@hamishknight hamishknight deleted the regex-syntax branch March 3, 2022 20:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants