Update RegexSyntax.md

hamishknight · hamishknight · commit 79036096c19b · 2022-02-16T13:48:39.000Z
diff --git a/Documentation/Evolution/RegexSyntax.md b/Documentation/Evolution/RegexSyntax.md
@@ -6,7 +6,7 @@
 
 We aim to parse a superset of the syntax accepted by a variety of popular regular expression engines.
 
-**TODO: Elaborate**
+**TODO(Michael): Elaborate**
 
 ## Engines supported
 
@@ -20,13 +20,13 @@ We aim to implement a syntactic superset of:
 
 We also intend to achieve at least Level 1 (**TODO: do we want to promise Level 2?**) [UTS#18][uts18] conformance, which specifies regular expression matching semantics without mandating any particular syntax. However we can infer syntactic feature sets from its guidance.
 
-## Regex syntax supported
+**TODO(Michael): Rework and expand prose**
 
-### General syntax
+## Detailed Design
 
-The following syntax are supported by all the above engines.
+We're proposing the following regular expression syntactic superset for Swift.
 
-#### Alternation
+### Alternation
 
 ```
 Regex       -> '' | Alternation
@@ -35,7 +35,7 @@ Alternation -> Concatenation ('|' Concatenation)*
 
 This is the operator with the lowest precedence in a regular expression, and checks if any of its branches match the input.
 
-#### Concatenation
+### Concatenation
 
 ```
 Concatenation   -> (!'|' !')' ConcatComponent)*
@@ -44,7 +44,7 @@ ConcatComponent -> Trivia | Quote | Quantification
 
 Implicitly denoted by adjacent expressions, a concatenation matches against a sequence of regular expression patterns. This has a higher precedence than an alternation, so e.g. `abc|def` matches against `abc` or `def`. The `ConcatComponent` token varies across engines, but at least matches some form of trivia (e.g. comments), quoted sequences (e.g. `\Q...\E`), and a quantified expression.
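
As a rough illustration of the precedence rule above, the following Python sketch splits a pattern into its top-level alternation branches. `split_alternation` is a hypothetical helper, not part of the proposal, and it deliberately ignores custom character classes and quoted sequences, so it is only a simplification of the grammar.

```python
def split_alternation(pattern):
    """Split a pattern on '|' characters that are not nested in a group."""
    branches, current, depth = [], [], 0
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            # Keep an escaped character attached to its backslash.
            current.append(pattern[i:i + 2])
            i += 2
            continue
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "|" and depth == 0:
            # Alternation binds loosest: everything collected so far is one branch.
            branches.append("".join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    branches.append("".join(current))
    return branches
```

For `abc|def` this yields the branches `abc` and `def`, each of which is then a concatenation.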
 
-#### Quantification
+### Quantification
 
 ```
 Quantification -> QuantOperand Quantifier?
@@ -54,7 +54,9 @@ QuantKind      -> '?' | '+'
 
 Specifies that the operand may be matched against a certain number of times.
 
-#### Groups
+**TODO: Briefly mention each and what it means, noting that options can swap eager/reluctant. Might be a good time to introduce the eager/reluctant/possessive terminology**
+
+### Groups
 
 ```
 GroupStart    -> '(?' GroupKind | '('
@@ -65,28 +67,50 @@ NamedGroup    -> 'P<' GroupNameBody '>'
                | '<' GroupNameBody '>'
                | "'" GroupNameBody "'"
 
-GroupNameBody -> Identifier
+GroupNameBody -> Identifier | BalancingGroupBody
 ```
 
 Groups define a new scope within which a recursive regular expression pattern may occur. Groups have different semantics depending on how they are introduced, some may capture the nested match, some may match against the input without advancing, some may change the matching options set in the new scope, etc.
 
-#### Anchors
+**TODO: Something like "note that there are other things that may syntactically appear similar to groups, but are their own constructs. See .... in-line options, backreferences, ..."**
+
+#### Balancing groups
+
+```
+BalancingGroupBody -> Identifier? '-' Identifier
+```
+
+Introduced by .NET, balancing groups extend the `GroupNameBody` syntax to support the ability to refer to a prior group. Upon matching, the prior group is deleted, and any intermediate matched input becomes the capture of the current group.
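
A minimal Python sketch of how the `GroupNameBody` production could be split into its parts. `parse_group_name_body` is a hypothetical helper, and the ASCII identifier pattern is an assumption, since `Identifier` is not defined in this section.

```python
import re

# Assumption: identifiers are ASCII word-like names.
_IDENT = r"[A-Za-z_][A-Za-z0-9_]*"
_BODY = re.compile(
    rf"(?:(?P<name>{_IDENT})?-(?P<prior>{_IDENT})|(?P<plain>{_IDENT}))\Z"
)

def parse_group_name_body(body):
    """Return (name, prior); `prior` is None for a plain named group."""
    m = _BODY.match(body)
    if m is None:
        raise ValueError(f"invalid group name body: {body!r}")
    if m.group("plain") is not None:
        # GroupNameBody -> Identifier
        return m.group("plain"), None
    # BalancingGroupBody -> Identifier? '-' Identifier
    return m.group("name"), m.group("prior")
```

Here `close-open` names the current group `close` and refers to the prior group `open`, while `-open` refers to `open` without naming the current group.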
+
+
+### Anchors
 
 ```
-Anchor -> '^' | '$' | '\b'
+Anchor -> '^' | '$' | '\b' | '\B' | '\A' | '\G' | '\z' | '\Z'
 ```
 
 Anchors match against a certain position in the input rather than on a particular character of the input.
 
-#### Unicode scalars
+**TODO: List these out with a very brief description of what they mean.**
+
+### Unicode scalars
 
 
+### Escape sequences
+
+```
+EscapeSequence -> '\a' | '\b' | '\c' <Char> | '\e' | '\f' | '\n' | '\r' | '\t'
+```
 
-#### Builtin character classes
+Each of these escape sequences denotes a specific character. Note that `\b` denotes a character (backspace) only inside a custom character class; elsewhere it represents a word boundary.
 
+### Builtin character classes
 
+```
+BuiltinCharClass -> '\d' | '\D' | '\h' | '\H' | '\R' | '\s' | '\S' | '\v' | '\V' | '\w' | '\W' | '\X'
+```
 
-#### Custom character classes
+### Custom character classes
 
 ```
 CustomCharClass -> Start Set (SetOp Set)* ']'
@@ -96,46 +120,22 @@ Member          -> CustomCharClass | !']' !SetOp (Range | Atom)
 Range           -> Atom `-` Atom
 ```
 
-Custom characters classes introduce their own language, in which most regular expression metacharacters become literal
-
+Custom character classes introduce their own language, in which most regular expression metacharacters become literal.
 
-#### Character properties
 
-### PCRE-specific syntax
+### Character properties
 
-#### Callouts
 
-### Oniguruma-specific syntax
+### Callouts
 
-#### Custom reference syntax
 
-#### Callout syntax
-
-#### Absent functions
-
-### ICU-specific syntax
-
-
-
-### .NET-specific syntax
-
-#### Balancing groups
-
-```
-GroupNameBody -> Identifier | Identifier? '-' Identifier
-```
-
-.NET supports the ability for a group to reference a prior group, causing the prior group to be deleted, and any intermediate matched input to become the capture of the current group.
-
-#### Character class subtraction with `-`
+### Absent functions
 
 
 
 ## Syntactic differences between engines
 
-### Conflicting differences
-
-#### Character class set operations
+### Character class set operations
 
 In a custom character class, some engines allow for binary set operations that take two character class inputs, and produce a new character class output. However which set operations are supported and the spellings used differ by engine.
 
@@ -147,9 +147,11 @@ In a custom character class, some engines allow for binary set operations that t
 
 These differences are conflicting, as engines that don't support a particular operator treat them as literal, e.g `[x&&y]` in PCRE is the character class of `["x", "&", "y"]` rather than an intersection.
 
+Another conflict arises with .NET's use of the `-` character in a custom character class to denote both a range and a set subtraction. .NET disambiguates this by treating `-` as a subtraction only if the right-hand operand is a nested custom character class; otherwise it is a range. This conflicts with e.g. ICU, where the `-` in `[x-[y]]` is treated as literal.
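
The .NET disambiguation rule can be sketched in Python as follows. `classify_dash` is a hypothetical helper that applies only the rule stated above; it does not handle a `-` at the very start or end of a class, nor escaped characters.

```python
def classify_dash(char_class, index):
    """Classify the '-' at `index` in a custom character class (.NET rule)."""
    if char_class[index] != "-":
        raise ValueError("index does not point at '-'")
    # Subtraction only when the right-hand operand is a nested custom
    # character class, i.e. the next character opens one with '['.
    following = char_class[index + 1:index + 2]
    return "subtraction" if following == "[" else "range"
```

In `[a-z-[aeiou]]`, the first `-` is classified as a range and the second as a subtraction.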
+
 We intend to support the operators `&&`, `--`, `-`, and `~~`. This means that any regex literal that contains these sequences in a custom character class, but was written for an engine that does not support the operation, will have a different semantic meaning in our engine. This ought not to be a common occurrence, as specifying a character multiple times in a custom character class is redundant. Nonetheless, we intend to provide a strict compatibility mode that may be used to emulate the behavior of a particular engine (**TODO: all engines, or just PCRE?**).
 
-#### Nested custom character classes
+### Nested custom character classes
 
 This allows e.g `[[a]b[c]]`, which is interpreted the same as `[abc]`.
 
@@ -163,35 +165,35 @@ PCRE does not support this feature, and as such treats `]` as the closing charac
 
 We aim to support nested custom character classes, with a strict PCRE mode for emulating the PCRE behavior if desired.
 
-#### `\U`
+### `\U`
 
 In PCRE, if `PCRE2_ALT_BSUX` or `PCRE2_EXTRA_ALT_BSUX` are specified, `\U` matches literal `U`. However in ICU, `\Uhhhhhhhh` matches a hex sequence.
 
-#### `{,n}`
+### `{,n}`
 
 This quantifier is supported by Oniguruma, but in PCRE it matches the literal chars. 
 
-#### \0DDD
+### \0DDD
 
 In ICU, `DDD` are interpreted as an octal code. In PCRE, only the first two digits are interpreted as octal, the last is literal.
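
To make the difference concrete, here is a small Python sketch that decodes the three digits following `\0` under each reading described above. `decode_octal_escape` and its `engine` argument are hypothetical; the sketch models only the behavior stated in this section, not either engine's full escape handling.

```python
def decode_octal_escape(digits, engine):
    """Decode the three digits that follow a backslash-zero escape."""
    if len(digits) != 3 or any(d not in "01234567" for d in digits):
        raise ValueError("expected exactly three octal digits")
    if engine == "icu":
        # All three digits form a single octal code.
        return chr(int(digits, 8))
    if engine == "pcre":
        # Only the first two digits are octal; the third is a literal character.
        return chr(int(digits[:2], 8)) + digits[2]
    raise ValueError(f"unknown engine: {engine!r}")
```

Under the ICU reading the digits `101` decode to the single character `A`, while under the PCRE reading they decode to the character with octal code `10` followed by a literal `1`.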
 
-#### `\x`
+### `\x`
 
 In PCRE, a bare `\x` denotes the NUL character (`U+00`). In Oniguruma, it denotes literal `x`.
 
-#### Whitespace in ranges
+### Whitespace in ranges
 
 In PCRE, `x{2,4}` is a range quantifier meaning that `x` can be matched from 2 to 4 times. However if whitespace is introduced in the range, it becomes invalid and is then treated as the literal characters. We find this behavior to be unintuitive, and therefore intend to parse any intermixed whitespace in the range, but will emit a warning telling users that we're doing so (**TODO: how would they silence? move to modern syntax?**).
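
One way the intended whitespace-tolerant parsing could look, as a Python sketch. `parse_range_quantifier` is a hypothetical helper; it handles only `{n}`, `{n,}` and `{n,m}`, and reports whether a warning would be warranted rather than emitting one.

```python
import re

_RANGE = re.compile(r"\{\s*(\d+)\s*(?:(,)\s*(\d*)\s*)?\}")

def parse_range_quantifier(text):
    """Parse a range quantifier, tolerating intermixed whitespace.

    Returns (lower, upper, warn); `upper` is None when unbounded.
    """
    m = _RANGE.fullmatch(text)
    if m is None:
        raise ValueError(f"not a range quantifier: {text!r}")
    lower = int(m.group(1))
    if m.group(2) is None:
        upper = lower              # `{n}` means exactly n
    elif m.group(3):
        upper = int(m.group(3))    # `{n,m}`
    else:
        upper = None               # `{n,}` has no upper bound
    # Whitespace is accepted, but flagged so the caller can warn.
    warn = any(ch.isspace() for ch in text)
    return lower, upper, warn
```

With this sketch, `{2,4}` and `{ 2 , 4 }` both yield the bounds 2 and 4, and only the second sets the warning flag.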
 
-#### Implicitly-scoped matching option scopes
+### Implicitly-scoped matching option scopes
 
 PCRE and Oniguruma both support changing the active matching options through the `(?i)` expression. However, they have differing semantics when it comes to their scoping. In Oniguruma, it is treated as an implicit new scope that wraps everything until the end of the current group. In PCRE, it is treated as changing the matching option for all the following expressions until the end of the group.
 
 These sound similar, but have different semantics around alternations, e.g for `a(?i)b|c|d`, in Oniguruma this becomes `a(?i:b|c|d)`, where `a` is no longer part of the alternation. However in PCRE it becomes `a(?i:b)|(?i:c)|(?i:d)`, where `a` remains a child of the alternation.
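
The two interpretations can be sketched in Python as a rewrite into explicitly scoped groups. `make_scope_explicit` is a hypothetical helper that assumes a flat pattern (no nested groups, character classes or escapes) containing at most one option-setting expression.

```python
def make_scope_explicit(pattern, option="(?i)"):
    """Return (oniguruma_style, pcre_style) rewrites of a flat pattern."""
    before, sep, after = pattern.partition(option)
    if not sep:
        # No option-setting expression: nothing to rewrite.
        return pattern, pattern
    flags = option[2:-1]  # "(?i)" -> "i"
    # Oniguruma: a single implicit scope wraps everything up to the end
    # of the current group, including the remaining alternation.
    oniguruma = f"{before}(?{flags}:{after})"
    # PCRE: the option applies to each following branch, and the
    # alternation structure itself is left unchanged.
    branches = after.split("|")
    pcre = before + "|".join(f"(?{flags}:{b})" for b in branches)
    return oniguruma, pcre
```

For `a(?i)b|c|d` this produces `a(?i:b|c|d)` for the Oniguruma reading and `a(?i:b)|(?i:c)|(?i:d)` for the PCRE reading.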
 
 We aim to support the Oniguruma behavior by default, with a strict-PCRE mode that emulates the PCRE behavior. **TODO: The PCRE behavior is more complex for the parser, but seems less surprising, maybe that should become the default?**
 
-#### Backreference condition kinds
+### Backreference condition kinds
 
 PCRE and .NET allow for conditional patterns to reference a group by its name, e.g:
 
@@ -203,15 +205,13 @@ where `y` will only be matched if `(?<group1>x)` was matched. PCRE will always t
 
 We intend to always parse such conditions as an arbitrary regular expression condition, and will emit a warning asking users to explicitly use the syntax `(?('group1')y)` if they want a backreference condition. This more explicit syntax is supported by PCRE.
 
-### Non-conflicting differences
-
-#### `\N`
+### `\N`
 
 - PCRE supports `\N` meaning "not a newline"
 - PCRE also supports `\N{U+hhhh}`
 - ICU supports `\N{UNICODE CHAR NAME}` only
 
-#### Extended character property syntax
+### Extended character property syntax
 
 **TODO: Can this be conflicting?**