Update RegexSyntax.md

hamishknight · hamishknight · commit 4a692caa8709 · 2022-02-16T13:48:39.000Z
diff --git a/Documentation/Evolution/RegexSyntax.md b/Documentation/Evolution/RegexSyntax.md
@@ -33,7 +33,7 @@ Regex     -> GlobalMatchingOptionSequence? RegexNode
 RegexNode -> '' | Alternation
 ```
 
-A top-level regular expression may consist of a sequence of global matching options followed by a `RegexNode`, which is the recursive part of the grammar that may be nested within e.g a group.
+A top-level regular expression may consist of a sequence of global matching options followed by a `RegexNode`, which is the recursive part of the grammar that may be nested within e.g. a group. A regex node may be empty; this is the null pattern, which always matches but does not advance the input.
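
As a rough illustration of the null pattern, the sketch below uses Python's `re` module as a stand-in engine (an assumption for illustration; it is not the engine this document describes). The empty pattern matches at the start of the input with a zero-length match.

```python
import re

# Python's `re` is only a stand-in engine here.
m = re.match(r"", "abc")

null_span = m.span()    # start and end offsets of the match
null_text = m.group(0)  # the matched text

print(null_span, repr(null_text))  # (0, 0) ''
```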
 
 ### Alternation
 
@@ -76,23 +76,36 @@ The quantifiers supported are:
 - `{,m}`: Up to `m` matches
 - `{n}`: Exactly `n` matches
 
-A quantifier may optionally followed by `?` or `+`, which apply certain semantics to the quantification. If neither are specified, by default the quantification happens eagerly, meaning that it will try to maximize the number of matches made. However, if `?` is specified, the number of matches will instead be minimized. If `+` is specified, eager matching occurs, but with the additional semantic that it may not be backtracked into to try a different number of matches.
+A quantifier may optionally be followed by `?` or `+`, which adjust its semantics. If neither is specified, the quantification happens *eagerly* by default, meaning that it tries to maximize the number of matches made. If `?` is specified, quantification happens *reluctantly*, meaning that the number of matches is instead minimized. If `+` is specified, *possessive* matching occurs, which is eager matching with the additional constraint that it may not be backtracked into to try a different number of matches.
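
The difference between eager and reluctant quantification can be sketched with Python's `re` module, used here purely as a stand-in engine (an assumption, not the engine this document describes). Possessive quantifiers are left out because not every version of `re` accepts them.

```python
import re

text = "<a><b>"

# Eager (default): `.+` takes as much as it can, then backtracks to find '>'.
eager = re.match(r"<.+>", text).group(0)

# Reluctant: `.+?` takes as little as it can before trying '>'.
reluctant = re.match(r"<.+?>", text).group(0)

print(eager)      # <a><b>
print(reluctant)  # <a>
```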
 
 ### Atom
 
 ```
-Atom -> Anchor | EscapeSequence | BuiltinCharClass
+Atom -> Anchor | EscapeSequence | BuiltinCharClass | Backreference | Subpattern
 ```
 
-Atoms are the smallest unit of regular expression syntax that cannot be split into smaller syntactic expressions.
+Atoms are the smallest units of regular expression syntax; they cannot be split into smaller syntactic expressions. They mainly include escape sequences, e.g. `\b` and `\d`, as well as meta-characters such as `.` and `$`.
 
 ### Groups
 
 ```
 Group      -> GroupStart RegexNode ')'
-GroupStart -> '(?' GroupKind | '('
-GroupKind  -> ':' | '|' | '>' | '=' | '!' | '*' | '<=' | '<!' | '<*'
-            | NamedGroup | MatchingOptionSeq (':' | ')')
+GroupStart -> '(' GroupKind | '('
+GroupKind  -> '' | '?' BasicGroupKind | '*' PCRE2GroupKind ':'
+
+BasicGroupKind -> ':' | '|' | '>' | '=' | '!' | '*' | '<=' | '<!' | '<*'
+                | NamedGroup 
+                | MatchingOptionSeq (':' | ')')
+                
+PCRE2GroupKind -> 'atomic' 
+                | 'pla' | 'positive_lookahead'
+                | 'nla' | 'negative_lookahead'
+                | 'plb' | 'positive_lookbehind'
+                | 'nlb' | 'negative_lookbehind'
+                | 'napla' | 'non_atomic_positive_lookahead'
+                | 'naplb' | 'non_atomic_positive_lookbehind'
+                | 'sr' | 'script_run'
+                | 'asr' | 'atomic_script_run'
 
 NamedGroup -> 'P<' GroupNameBody '>'
             | '<' GroupNameBody '>'
@@ -105,12 +118,30 @@ Identifier -> [\w--\d] \w*
 
 Groups define a new scope within which a recursive regular expression pattern may occur. Groups have different semantics depending on how they are introduced, some may capture the nested match, some may match against the input without advancing, some may change the matching options set in the new scope, etc.
 
-**TODO: Something like "note that there are other things that may syntactically appear similarly to groups, but are their own constructs. See .... in-line options, backreferences, ... **
+Groups may be named. A name may contain letters, digits, and `_`, but must not start with a digit. This restriction follows the behavior of other regex engines and avoids ambiguity between named and numeric group references.
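
Python's `re` module applies a similar naming rule, so it can serve as a stand-in to sketch the restriction (an assumption for illustration; the engine described here may report the error differently). The `(?P<name>...)` spelling is one of the `NamedGroup` forms in the grammar above.

```python
import re

# A name made of letters, digits and `_` is accepted.
m = re.match(r"(?P<year_1>\d{4})", "2022")
year = m.group("year_1")

# A name that starts with a digit is rejected when the pattern is compiled.
try:
    re.compile(r"(?P<1year>\d{4})")
    digit_first_ok = True
except re.error:
    digit_first_ok = False

print(year, digit_first_ok)
```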
+
+Groups may be used to change the matching options present within their scope, see the *Matching options* section.
+
+Note there are additional constructs that may syntactically appear similar to groups, but are distinct. See the *group-like atoms* section.
+
 
 #### Lookahead and lookbehind
 
+- `(?=` specifies a lookahead that attempts to match against the group body, but does not advance.
+- `(?!` specifies a negative lookahead that ensures the group body does not match, and does not advance.
+- `(?<=` specifies a lookbehind that attempts to match the group body against the input before the current position, and does not advance.
+- `(?<!` specifies a negative lookbehind that ensures the group body does not match the input before the current position, and does not advance.
+
+PCRE2 defines explicitly spelled-out versions of the syntax, e.g. `(*negative_lookbehind:)`.
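
The four lookaround forms can be sketched with Python's `re` module as a stand-in engine (an assumption for illustration; the PCRE2 spelled-out names are not accepted by `re` and are not shown). In each case the lookaround constrains the match without becoming part of it.

```python
import re

text = "price: $42, cost: 17"

# Lookahead: a word followed by ':' (the ':' is not part of the match).
labels = re.findall(r"\w+(?=:)", text)

# Negative lookahead: 'foo' not followed by 'bar'.
not_bar = re.findall(r"foo(?!bar)\w*", "foobar foobaz")

# Lookbehind: digits preceded by '$'.
dollars = re.findall(r"(?<=\$)\d+", text)

# Negative lookbehind: digits not preceded by '$' or by another digit.
plain = re.findall(r"(?<![$\d])\d+", text)

print(labels, not_bar, dollars, plain)
```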
+
+#### Atomic groups
+
+**TODO: Add description**
+
 #### Script runs
 
+**TODO: Add description**
+
 #### Balancing groups
 
 ```
@@ -119,6 +150,50 @@ BalancingGroupBody -> Identifier? '-' Identifier
 
 Introduced by .NET, balancing groups extend the `GroupNameBody` syntax to support the ability to refer to a prior group. Upon matching, the prior group is deleted, and any intermediate matched input becomes the capture of the current group.
 
+### Matching options
+
+```
+MatchingOptionSeq -> '^' MatchingOption* 
+                   | MatchingOption+ 
+                   | MatchingOption* '-' MatchingOption*
+
+MatchingOption -> 'i' | 'J' | 'm' | 'n' | 's' | 'U' | 'x' | 'xx' | 'w' | 'D' | 'P' | 'S' | 'W' | 'y{' ('g' | 'w') '}'
+```
+
+A matching option sequence may be used as a group specifier, and denotes a change in matching options for the scope of that group. For example, `(?x:a b c)` enables extended syntax for `a b c`. A matching option sequence may also be part of an "isolated group", which has an implicit scope that wraps the remaining elements of the current group. For example, `(?x)a b c` also enables extended syntax for `a b c`.
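
A minimal sketch of scoped and isolated options, using Python's `re` module as a stand-in engine (an assumption for illustration). Note that `re` is stricter than the grammar here: it expects the isolated form at the start of the pattern, so only that position is shown.

```python
import re

# Scoped option: case-insensitivity applies only inside the group.
scoped = re.compile(r"(?i:abc)def")
scoped_hits = [bool(scoped.fullmatch(s)) for s in ("ABCdef", "ABCDEF")]

# Extended syntax as a group specifier: whitespace in `a b c` is ignored.
extended_group = bool(re.fullmatch(r"(?x:a b c)", "abc"))

# Isolated form at the start of the pattern.
extended_isolated = bool(re.fullmatch(r"(?x)a b c", "abc"))

print(scoped_hits, extended_group, extended_isolated)
```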
+
+We support all the matching options accepted by PCRE, ICU, and Oniguruma. In addition, we accept some matching options unique to our matching engine.
+
+#### PCRE options
+
+- `i`: Case insensitive matching
+- `J`: Allows multiple groups to share the same name, which is otherwise forbidden
+- `m`: Enables `^` and `$` to match against the start and end of a line rather than only the start and end of the entire string
+- `n`: Disables capturing of `(...)` groups. Named capture groups must be used instead.
+- `s`: Changes `.` to match any character, including newlines.
+- `U`: Changes quantifiers to be reluctant by default, with the `?` specifier changing to mean eager.
+- `x`, `xx`: Enables extended syntax mode, which allows non-semantic whitespace and end-of-line comments. See the *trivia* section for more info.
+
+#### ICU options
+
+- `w`: Enables the Unicode interpretation of word boundaries `\b`. **TODO: Should this be the default?**
+
+#### Oniguruma options
+
+- `D`: Enables ASCII-only digit matching for `\d`, `\p{Digit}`, `[:digit:]`
+- `S`: Enables ASCII-only space matching for `\s`, `\p{Space}`, `[:space:]`
+- `W`: Enables ASCII-only word matching for `\w`, `\p{Word}`, `[:word:]`, and `\b`
+- `P`: Enables ASCII-only for all POSIX properties (including `digit`, `space`, and `word`)
+- `y{g}`, `y{w}`: Changes the meaning of `\X`, `\y`, `\Y`. These are mutually exclusive options, with `y{g}` specifying extended grapheme cluster mode, and `y{w}` specifying word mode.
+
+#### Swift options
+
+These options are specific to the Swift regex matching engine and control the semantic level at which matching takes place.
+
+- `X`: Grapheme cluster matching
+- `u`: Unicode scalar matching
+- `b`: Byte matching
+
 
 ### Anchors
 
@@ -165,12 +240,29 @@ EscapeSequence -> '\a' | '\b' | '\c' <Char> | '\e' | '\f' | '\n' | '\r' | '\t'
 
 These escape sequences denote a specific character. Note that `\b` may only be used in a custom character class, otherwise it represents a word boundary.
 
+**TODO: List these out with a very brief description of what they mean.**
+
 ### Builtin character classes
 
 ```
 BuiltinCharClass -> '.' | '\d' | '\D' | '\h' | '\H' | '\O' | '\R' | '\s' | '\S' | '\v' | '\V' | '\w' | '\W' | '\X'
 ```
 
+- `.`: Any character excluding newlines
+- `\d`: Digit character
+- `\D`: Non-digit character
+- `\h`: Horizontal space character
+- `\H`: Non-horizontal-space character
+- `\O`: Any character (including newlines). This is syntax from Oniguruma.
+- `\R`: Newline sequence
+- `\s`: Whitespace character
+- `\S`: Non-whitespace character
+- `\v`: Vertical space character
+- `\V`: Non-vertical-space character
+- `\w`: Word character
+- `\W`: Non-word character
+- `\X`: Any extended grapheme cluster
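
A few of these classes can be tried out with Python's `re` module as a stand-in engine (an assumption for illustration). Only classes that `re` also accepts are shown; `\h`, `\O`, `\R` and `\X` are not available there.

```python
import re

sample = "a1 _\n"

digits = re.findall(r"\d", sample)       # digit characters
non_digits = re.findall(r"\D", sample)   # everything that is not a digit
spaces = re.findall(r"\s", sample)       # whitespace, including the newline
word_chars = re.findall(r"\w", sample)   # letters, digits and '_'
non_newline = re.findall(r".", sample)   # any character except the newline

print(digits, non_digits, spaces, word_chars, non_newline)
```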
+
 ### Custom character classes
 
 ```
@@ -179,25 +271,43 @@ Start           -> '[' '^'?
 Set             -> Member+
 Member          -> CustomCharClass | !']' !SetOp (Range | Atom)
 Range           -> Atom `-` Atom
+SetOp           -> '&&' | '--' | '~~'
 ```
 
-Custom characters classes introduce their own language, in which most regular expression metacharacters become literal.
+Custom character classes introduce their own language, in which most regular expression metacharacters become literal. The basic element in a custom character class is an `Atom`, though not all atoms are valid, e.g. a backreference is not permitted.
+
+Ranges of characters may be specified with the `-` character, e.g. `[a-z]` matches the letters from `a` to `z`. If `-` does not appear between two characters, it is interpreted as a literal, e.g. `[-a]` is the character class of `-` and `a`. **TODO: .NET's use of it for subtraction**
+
+Custom character classes may be nested within each other, and may be used with set operations. The supported set operations are intersection `&&`, subtraction `--`, and symmetric difference `~~`.
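
The sketch below illustrates ranges and the literal `-` with Python's `re` module as a stand-in engine (an assumption for illustration). Since `re` has no `&&`, `--` or `~~` operators, the three set operations are emulated with plain Python sets over two small, hypothetical classes to show the intended meaning only.

```python
import re

# A range, and a literal '-' that is not between two characters.
lower = re.findall(r"[a-z]", "aZ-b")
dash_or_a = re.findall(r"[-a]", "aZ-b")

# Two character classes modelled as sets of characters.
a_to_f = set("abcdef")
d_to_h = set("defgh")

intersection = sorted(a_to_f & d_to_h)  # meaning of `&&`
subtraction = sorted(a_to_f - d_to_h)   # meaning of `--`
symmetric = sorted(a_to_f ^ d_to_h)     # meaning of `~~`

print(lower, dash_or_a, intersection, subtraction, symmetric)
```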
 
 
 ### Character properties
 
 ```
-CharacterProperty -> ('p{' | 'P{') PropertyName ('=' PropertyName)? '}'
-PropertyName -> [\s\w-]+
+CharacterProperty      -> '\' ('p' | 'P') '{' PropertyContents '}'
+POSIXCharacterProperty -> '[:' PropertyContents ':]'
+
+PropertyContents -> PropertyName ('=' PropertyName)?
+PropertyName     -> [\s\w-]+
 ```
 
+A character property specifies a particular Unicode or POSIX property to match against. In general, a property consists of both a key and a value, e.g. `General_Category=Space_Separator`; however, some keys and values may appear on their own (with the other name being inferred). **TODO: Clarify the exact cases this happens**.
+
+**TODO: Spell out the properties we recognize while parsing vs. those we just parse as String?**
+
+Note that the internal `PropertyContents` syntax is shared by both the `\p{...}` and POSIX-style `[:...:]` syntax, allowing e.g. `[:script=Latin:]`.
+
 ### Named characters
 
 ```
 NamedCharacter -> '\N{' CharName '}'
 CharName -> 'U+' HexDigit{1...8} | [\s\w-]+
 ```
 
+Allows a specific Unicode scalar to be specified by name or code point.
+
+**TODO: Should this be called "named scalar" or similar?**
+
 ### Trivia
 
 ```
@@ -208,15 +318,7 @@ Whitespace -> \s+
 
 Trivia is consumed by the regular expression parser, but has no semantic meaning. Non-semantic whitespace may only occur when either of the extended syntax matching options `(?x)`, `(?xx)` is enabled.
 
-### Matching options
-
-```
-MatchingOptionSeq -> '^' MatchingOption* 
-                   | MatchingOption+ 
-                   | MatchingOption* '-' MatchingOption*
-
-MatchingOption -> 'i' | 'J' | 'm' | 'n' | 's' | 'U' | 'x' | 'xx' | 'w' | 'D' | 'P' | 'S' | 'W' | 'y{' ('g' | 'w') '}'
-```
+**TODO: Differences between PCRE extended syntax and our syntax**
 
 ### References
 
@@ -226,6 +328,8 @@ NumberRef      -> ('+' | '-')? <Decimal Number> RecursionLevel?
 RecursionLevel -> '+' <Int> | '-' <Int>
 ```
 
+A reference is an abstract identifier for a particular capturing group in a regular expression. It can either be named or numbered, and in the latter case may be specified relative to the current group. For example, `-2` refers to the capture group `N - 2`, where `N` is the number of the next capture group. References may refer to groups ahead of the current position, e.g. `+3`, or the name of a future group. These may be useful in recursive cases where the group being referenced has been matched in a prior iteration.
+
 #### Backreferences
 
 ```
@@ -238,6 +342,8 @@ Backreference -> '\g{' NameOrNumberRef '}'
                | '(?P=' Identifier ')'
 ```
 
+A backreference evaluates to the value last captured by a given capturing group.
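
As a small sketch, Python's `re` module (a stand-in engine, assumed here only for illustration) accepts the numbered `\1` form and the `(?P=name)` form from the grammar above. Both require the input to repeat whatever the referenced group last captured.

```python
import re

# Numbered backreference: `\1` must repeat what group 1 captured.
# `findall` returns the capture of group 1 for each match.
doubled = re.findall(r"(\w)\1", "aabbcd")

# Named backreference: the closing quote must equal the opening quote.
quoted = re.compile(r"(?P<q>['\"]).*?(?P=q)")
matched = quoted.search("say 'hi\" there' now").group(0)

print(doubled, matched)
```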
+
 #### Subpatterns
 
 ```
@@ -251,6 +357,9 @@ GroupLikeSubpatternBody -> 'P>' <String>
                          | NumberRef
 ```
 
+A subpattern causes a particular group to be re-evaluated at the current position. The syntax `(?R)` is equivalent to `(?0)`, and causes the entire pattern to be recursed.
+
+
 ### Conditionals
 
 ```
@@ -274,13 +383,38 @@ PCREVersionCheck  -> '>'? '=' PCREVersionNumber
 PCREVersionNumber -> <Int> '.' <Int>
 ```
 
+A conditional evaluates a particular condition, and chooses a branch to match against accordingly. One or two branches may be specified. If one branch is specified, e.g. `(?(...)x)`, it is treated as the true branch. Note this includes an empty true branch, e.g. `(?(...))`, which is the null pattern as described in *top-level regular expression*. If two branches are specified, e.g. `(?(...)x|y)`, the first is treated as the true branch and the second as the false branch.
+
+A condition may be:
+
+- A reference to a capture group, which checks whether the group matched successfully.
+- A recursion check on either a particular group or the entire regex. In the former case, this checks to see if the last recursive call is through that group. In the latter case, it checks if the match is currently taking place in any kind of recursive call.
+- An arbitrary recursive regular expression, which is matched against, and evaluates to true if the match is successful. (**TODO: Clarify whether it introduces captures**)
+- A PCRE version check.
+
+The `DEFINE` keyword is not used as a condition, but rather as a way to define a group that is not evaluated, but may be referenced by a subpattern.
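
The one-branch and two-branch forms can be sketched with Python's `re` module as a stand-in engine (an assumption for illustration). `re` supports only the first kind of condition listed above, a reference to a capture group; recursion checks, regex conditions, version checks and `DEFINE` are not shown.

```python
import re

# One branch: if group 1 (an opening '<') matched, require a closing '>'.
one_branch = re.compile(r"(<)?\w+(?(1)>)")
one_branch_results = [
    bool(one_branch.fullmatch(s)) for s in ("<tag>", "tag", "<tag", "tag>")
]

# Two branches: '>' if group 1 matched, otherwise ';'.
two_branch = re.compile(r"(<)?\w+(?(1)>|;)")
two_branch_results = [
    bool(two_branch.fullmatch(s)) for s in ("<a>", "a;", "a>", "<a;")
]

print(one_branch_results, two_branch_results)
```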
+
 ### PCRE backtracking directives
 
 ```
 BacktrackingDirective     -> '(*' BacktrackingDirectiveKind (':' <String>)? ')'
 BacktrackingDirectiveKind -> 'ACCEPT' | 'FAIL' | 'F' | 'MARK' | '' | 'COMMIT' | 'PRUNE' | 'SKIP' | 'THEN'
 ```
 
+This is syntax specific to PCRE, and is used to control backtracking behavior. Any of the directives may include an optional tag; however, `MARK` must have a tag. The empty directive is treated as `MARK`. Only the `ACCEPT` directive may be quantified, as it can use the backtracking behavior of the engine to be evaluated only if needed by a reluctant quantification.
+
+- `ACCEPT`: Causes matching to terminate immediately as a successful match. If used within a subpattern, only that level of recursion is terminated.
+- `FAIL`, `F`: Causes matching to fail, forcing backtracking to occur if possible.
+- `MARK`: Assigns a label to the current matching path, which is passed back to the caller on success. Subsequent `MARK` directives overwrite the label assigned, so only the last is passed back.
+- `COMMIT`: Prevents backtracking from reaching any point prior to this directive.
+
+
+**TODO:**
+
+- `PRUNE`: 
+- `SKIP`:
+- `THEN`:
+
 ### PCRE global matching options
 
 ```
@@ -298,6 +432,23 @@ NewlineKind         -> 'CRLF' | 'CR' | 'ANYCRLF' | 'ANY' | 'LF' | 'NUL'
 NewlineSequenceKind -> 'BSR_ANYCRLF' | 'BSR_UNICODE'
 ```
 
+This is syntax specific to PCRE, and allows a set of global options to appear at the start of a regular expression. They may not appear at any other position.
+
+- `LIMIT_DEPTH`, `LIMIT_HEAP`, `LIMIT_MATCH`: These place certain limits on the resources the matching engine may consume and the matches it may make.
+- `CRLF`, `CR`, `ANYCRLF`, `ANY`, `LF`, `NUL`: These control the definition of a newline character, which is used when matching e.g. the `.` character class, and when evaluating where a line ends in multi-line mode.
+- `BSR_ANYCRLF`, `BSR_UNICODE`: These change the definition of `\R`.
+
+**TODO:**
+
+- `NOTEMPTY_ATSTART`:
+- `NOTEMPTY`:
+- `NO_AUTO_POSSESS`:
+- `NO_DOTSTAR_ANCHOR`:
+- `NO_JIT`:
+- `NO_START_OPT`:
+- `UTF`:
+- `UCP`:
+
 ### Callouts
 
 ```
@@ -327,6 +478,8 @@ OnigurumaCalloutContents   -> <String>
 OnigurumaCalloutDirection  -> 'X' | '<' | '>'
 ```
 
+A callout is a feature that allows a user-supplied function to be called when matching reaches that point in the pattern. We support parsing both the PCRE and Oniguruma callout syntax. The PCRE syntax accepts a string or numeric argument that is passed to the function. The Oniguruma syntax is more involved, and may accept a tag, argument list, or even an arbitrary program in the 'callout of contents' syntax.
+
 ### Absent functions
 
 ```
@@ -336,8 +489,17 @@ AbsentFunction -> '(?~' RegexNode ')'
                 | '(?~|)'
 ```
 
+An absent function is an Oniguruma feature that allows for the easy inversion of a given pattern. There are four variants of the syntax:
+
+- `(?~|absent|expr)`: Absent expression, which attempts to match against `expr`, but is limited by the range that is not matched by `absent`.
+- `(?~absent)`: Absent repeater, which matches against any input not matched by `absent`. Equivalent to `(?~|absent|\O*)`.
+- `(?~|absent)`: Absent stopper, which limits any subsequent matching to not include `absent`.
+- `(?~|)`: Absent clearer, which undoes the effects of the absent stopper.
+
 ## Syntactic differences between engines
 
+**TODO: Intro**
+
 ### Character class set operations
 
 In a custom character class, some engines allow for binary set operations that take two character class inputs, and produce a new character class output. However which set operations are supported and the spellings used differ by engine.
@@ -390,7 +552,7 @@ In PCRE, `x{2,4}` is a range quantifier meaning that `x` can be matched from 2 t
 
 ### Implicitly-scoped matching option scopes
 
-PCRE and Oniguruma both support changing the active matching options through the `(?i)` expression. However, they have differing semantics when it comes to their scoping. In Oniguruma, it is treated as an implicit new scope that wraps everything until the end of the current group. In PCRE, it is treated as changing the matching option for all the following expressions until the end of the group.
+PCRE and Oniguruma both support changing the active matching options through an isolated group, e.g. `(?i)`. However, they have differing semantics when it comes to their scoping. In Oniguruma, it is treated as an implicit new scope that wraps everything until the end of the current group. In PCRE, it is treated as changing the matching option for all the following expressions until the end of the group.
 
 These sound similar, but have different semantics around alternations, e.g for `a(?i)b|c|d`, in Oniguruma this becomes `a(?i:b|c|d)`, where `a` is no longer part of the alternation. However in PCRE it becomes `a(?i:b)|(?i:c)|(?i:d)`, where `a` remains a child of the alternation.