Update RegexSyntax.md

hamishknight · hamishknight · commit 5d49e82fbe07 · 2022-02-16T13:48:40.000Z
diff --git a/Documentation/Evolution/RegexSyntax.md b/Documentation/Evolution/RegexSyntax.md
@@ -81,10 +81,20 @@ A quantifier may optionally be followed by `?` or `+`, which adjust its semantic
 ### Atom
 
 ```
-Atom -> Anchor | EscapeSequence | BuiltinCharClass | Backreference | Subpattern
+Atom -> Anchor
+      | Backreference
+      | BacktrackingDirective
+      | BuiltinCharClass
+      | Callout
+      | CharacterProperty
+      | EscapeSequence
+      | NamedCharacter
+      | Subpattern
+      | UniScalar
+      | '\'? <Character>
 ```
 
-Atoms are the smallest unit of regular expression syntax that cannot be split into smaller syntactic expressions. They mainly include escape sequences e.g `\b`, `\d`, as well as meta-characters such as `.` and `$`.
+Atoms are the smallest units of regular expression syntax. They include escape sequences e.g `\b`, `\d`, as well as meta-characters such as `.` and `$`. They also include some larger syntactic constructs such as backreferences and callouts. The most basic form of atom is a literal character. A meta-character may be treated as literal by preceding it with a backslash. Other characters may also be preceded by a backslash, but this has no effect, e.g `\I` is literal `I`.
 
 ### Groups
 
@@ -229,19 +239,30 @@ HexDigit   -> [0-9a-zA-Z]
 OctalDigit -> [0-7]
 ```
 
-These sequences define a unicode scalar value to be matched against.
+These sequences define a Unicode scalar value to be matched against. Syntax is provided for specifying the scalar value in both hex and octal notation.
 
-**TODO: Some discussion of the fun `\DDD` syntax**
+The `\DDD` syntax that accepts up to 3 octal digits is syntactically ambiguous with backreference syntax. The ambiguity is resolved in the same way as PCRE. If the first digit is `0`, that is always an octal sequence (including `\0` for the NUL character). Otherwise, if any of the following hold, it is treated as a backreference:
+
+- Its value `n` satisfies `0 < n < 10`.
+- Its first digit is `8` or `9`.
+- Its value corresponds to a valid prior group number.
 
 ### Escape sequences
 
 ```
 EscapeSequence -> '\a' | '\b' | '\c' <Char> | '\e' | '\f' | '\n' | '\r' | '\t'
 ```
 
-These escape sequences denote a specific character. Note that `\b` may only be used in a custom character class, otherwise it represents a word boundary.
+These escape sequences denote a specific character.
 
-**TODO: List these out with a very brief description of what they mean.**
+- `\a`: The alert (bell) character `U+0007`.
+- `\b`: The backspace character `U+0008`. Note that this may only be used in a custom character class; elsewhere it represents a word boundary.
+- `\c <Char>`: A control character sequence (`U+0000` - `U+007F`).
+- `\e`: The escape character `U+001B`.
+- `\f`: The form-feed character `U+000C`.
+- `\n`: The newline character `U+000A`.
+- `\r`: The carriage return character `U+000D`.
+- `\t`: The tab character `U+0009`.
 
 ### Builtin character classes
 
@@ -270,17 +291,27 @@ BuiltinCharClass -> '.' | '\d' | '\D' | '\h' | '\H' | '\O' | '\R' | '\s' | '\S'
 CustomCharClass -> Start Set (SetOp Set)* ']'
 Start           -> '[' '^'?
 Set             -> Member+
-Member          -> CustomCharClass | !']' !SetOp (Range | Atom)
+Member          -> CustomCharClass | Quote | Range | Atom
 Range           -> Atom `-` Atom
 SetOp           -> '&&' | '--' | '~~'
 ```
 
-Custom characters classes introduce their own language, in which most regular expression metacharacters become literal. The basic element in a custom character class is an `Atom`, though not all atoms are valid, e.g a backreference cannot be made.
+Custom character classes introduce their own language, in which most regular expression metacharacters become literal. The basic element in a custom character class is an `Atom`, though only a few atoms are considered valid:
+
+- Builtin character classes, except `.`, `\O`, and `\X`
+- Escape sequences, including `\b` which becomes the backspace character (rather than a word boundary)
+- Unicode scalars
+- Named characters
+- Character properties
+- Plain literal characters
+
+Ranges of characters may be specified with `-`, e.g `[a-z]` matches against the letters from `a` to `z`. Only Unicode scalars and literal characters are valid range operands. If `-` does not appear in a valid position, it is interpreted as literal, e.g `[-a]` is the character class of `-` and `a`. **TODO: .NET's use of it for subtraction**
 
-Ranges of characters may be specified with the `-` character, e.g `[a-z]` matches against the letters from `a` to `z`. If `-` does not appear between two characters, it is interpreted as literal, e.g `[-a]` is the character class of `-` and `a`. **TODO: .NET's use of it for subtraction**
+**TODO: Different grammar for range?**
 
 Custom character classes may be nested within each other, and may be used with set operations. The supported set operations are intersection `&&`, subtraction `--`, and symmetric difference `~~`.
 
+Quoted sequences may appear within custom character classes, e.g `[\Q]\E]`, and escape the contained characters.
 
 ### Character properties
 
@@ -292,11 +323,30 @@ PropertyContents -> PropertyName ('=' PropertyName)?
 PropertyName     -> [\s\w-]+
 ```
 
-A character property specifies a particular Unicode or POSIX property to match against. In general, a property consists of both a key and a value, e.g `General_Category=Whitespace`, however some keys and values may appear on their own (with the other name being inferred). **TODO: Clarify the exact cases this happens**.
+A character property specifies a particular Unicode or POSIX property to match against. Fuzzy matching is used when parsing the property name, following the rules set out by [UAX44-LM3]. This means that the following property names are considered equivalent:
+
+- `whitespace`
+- `isWhitespace`
+- `is-White_Space`
+- `iSwHiTeSpaCe`
+- `i s w h i t e s p a c e`
+
+Unicode properties consist of both a key and a value, e.g `General_Category=Letter`. However, there are some properties where the key or value may be inferred. These include:
+
+- General category properties e.g `\p{Letter}` is inferred as `\p{General_Category=Letter}`.
+- Script properties e.g `\p{Greek}` is inferred as `\p{Script=Greek}`.
+- Boolean properties that are inferred to have a `True` value, e.g `\p{Lowercase}` is inferred as `\p{Lowercase=True}`.
+
+Other Unicode properties, however, must specify both a key and value.
+
+For non-Unicode properties, only a value is required. These include:
+
+- The special properties `any`, `assigned`, `ascii`.
+- The POSIX compatibility properties `alnum`, `blank`, `graph`, `print`, `word`, `xdigit`. The remaining POSIX properties are already covered by boolean Unicode property spellings. 
 
 **TODO: Spell out the properties we recognize while parsing vs. those we just parse as String?**
 
-Note that the internal `PropertyContents` syntax is shared by both the `\p{...}` and POSIX-style `[:...:]` syntax, allowing e.g `[:script=Latin:]`.
+Note that the internal `PropertyContents` syntax is shared by both the `\p{...}` and POSIX-style `[:...:]` syntax, allowing e.g `[:script=Latin:]` as well as `\p{alnum}`.
 
 ### Named characters
 
@@ -325,6 +375,18 @@ Trivia is consumed by the regular expression parser, but has no semantic meaning
 
 **TODO: Differences between PCRE extended syntax and our syntax**
 
+### Quotes
+
+```
+Quote -> '\Q' (!'\E' .)* '\E'
+```
+
+A quoted sequence is delimited by `\Q...\E`, and allows the escaping of metacharacters such that they are interpreted literally. For example, `\Q^[xy]+$\E` is treated as the literal characters `^[xy]+$` rather than an anchored quantified character class.
+
+The backslash character is also treated as literal within a quoted sequence, and may not be used to escape the closing delimiter, e.g `\Q\\E` is a literal `\`.
+
+`\E` may appear without a preceding `\Q`, in which case it is a literal `E`.
+
 ### References
 
 ```
@@ -335,6 +397,8 @@ RecursionLevel -> '+' <Int> | '-' <Int>
 
 A reference is an abstract identifier for a particular capturing group in a regular expression. It can either be named or numbered, and in the latter case may be specified relative to the current group. For example `-2` refers to the capture group `N - 2` where `N` is the number of the next capture group. References may refer to groups ahead of the current position e.g `+3`, or the name of a future group. These may be useful in recursive cases where the group being referenced has been matched in a prior iteration.
 
+**TODO: Describe how capture groups are numbered? Including nesting & resets?**
+
 #### Backreferences
 
 ```
@@ -347,7 +411,7 @@ Backreference -> '\g{' NameOrNumberRef '}'
                | '(?P=' Identifier ')'
 ```
 
-A backreference evaluates to the value last captured by a given capturing group.
+A backreference evaluates to the value last captured by the referenced capturing group. Note that the form consisting of `\` followed by digits is syntactically ambiguous with octal syntax; see the *Unicode scalars* section for how this ambiguity is resolved.
 
 #### Subpatterns
 
@@ -362,8 +426,7 @@ GroupLikeSubpatternBody -> 'P>' <String>
                          | NumberRef
 ```
 
-A subpattern causes a particular group to be re-evaluated at the current position. The syntax `(?R)` is equivalent to `(?0)`, and causes the entire pattern to be recursed.
-
+A subpattern causes the referenced group to be re-evaluated at the current position. The syntax `(?R)` is equivalent to `(?0)`, and causes the entire pattern to be recursed.
 
 ### Conditionals
 
@@ -606,3 +669,4 @@ The least intuitive spelling being `'\' [1-9] [0-9]+`, as it can be a backrefere
 [icu-syntax]: https://unicode-org.github.io/icu/userguide/strings/regexp.html
 [uts18]: https://www.unicode.org/reports/tr18/
 [.net-syntax]: https://docs.microsoft.com/en-us/dotnet/standard/base-types/regular-expressions
+[UAX44-LM3]: https://www.unicode.org/reports/tr44/#UAX44-LM3
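
The fuzzy property-name matching referenced by the [UAX44-LM3] link above can be illustrated with a minimal Python sketch. It assumes the rule amounts to ignoring case, whitespace, underscores, hyphens, and an initial `is` prefix. The helper name `loose_key` is hypothetical and is not part of the parser this document describes.

```python
def loose_key(name: str) -> str:
    """Normalize a property name for loose comparison (hypothetical helper)."""
    # Ignore case, whitespace, underscores, and hyphens.
    folded = "".join(
        ch for ch in name.lower() if not (ch.isspace() or ch in "_-")
    )
    # Ignore an initial "is" prefix. This runs after separators are dropped,
    # so a spelling such as "i s w h ..." is handled as well.
    if folded.startswith("is"):
        folded = folded[2:]
    return folded


# The five spellings listed in the character properties section.
spellings = [
    "whitespace",
    "isWhitespace",
    "is-White_Space",
    "iSwHiTeSpaCe",
    "i s w h i t e s p a c e",
]
keys = {loose_key(s) for s in spellings}
```

Under these assumptions all five spellings collapse to the same key, so a lookup table keyed on `loose_key` output would treat them as one property name. Applying the same normalization to both the table entries and the user's input keeps the comparison symmetric.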