Commit 4a692ca

Update RegexSyntax.md
1 parent 442c53f commit 4a692ca

1 file changed

+183
-21
lines changed

Documentation/Evolution/RegexSyntax.md

Lines changed: 183 additions & 21 deletions
@@ -33,7 +33,7 @@ Regex -> GlobalMatchingOptionSequence? RegexNode
3333
RegexNode -> '' | Alternation
3434
```
3535

36-
A top-level regular expression may consist of a sequence of global matching options followed by a `RegexNode`, which is the recursive part of the grammar that may be nested within e.g a group.
36+
A top-level regular expression may consist of a sequence of global matching options followed by a `RegexNode`, which is the recursive part of the grammar that may be nested within e.g a group. A regex node may be empty; this is the null pattern, which always matches but does not advance the input.
3737

3838
### Alternation
3939

@@ -76,23 +76,36 @@ The quantifiers supported are:
7676
- `{,m}`: Up to `m` matches
7777
- `{n}`: Exactly `n` matches
7878

79-
A quantifier may optionally followed by `?` or `+`, which apply certain semantics to the quantification. If neither are specified, by default the quantification happens eagerly, meaning that it will try to maximize the number of matches made. However, if `?` is specified, the number of matches will instead be minimized. If `+` is specified, eager matching occurs, but with the additional semantic that it may not be backtracked into to try a different number of matches.
79+
A quantifier may optionally be followed by `?` or `+`, which adjust its semantics. If neither is specified, the quantification happens *eagerly* by default, meaning that it will try to maximize the number of matches made. However, if `?` is specified, quantification happens *reluctantly*, meaning that the number of matches will instead be minimized. If `+` is specified, *possessive* matching occurs, which is eager matching with the additional constraint that the quantification may not be backtracked into to try a different number of matches.
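
As a rough illustration of the three modes, the sketch below uses Python's `re` module. That choice is an assumption made purely for demonstration: it is a different engine from the one this document specifies, and its possessive quantifiers require Python 3.11 or later, hence the version guard.

```python
import re
import sys

text = "aaaa"

# Eager (default): take as many repetitions as the bounds allow.
eager = re.match(r"a{1,3}", text).group()

# Reluctant (`?` suffix): take as few repetitions as possible.
reluctant = re.match(r"a{1,3}?", text).group()

# Eager quantifiers can be backtracked into: `a+` first consumes all four
# characters, then gives one back so that the trailing `a` can match.
backtracked = re.match(r"a+a", text).group()

# Possessive (`+` suffix): eager, but never gives anything back, so the
# trailing `a` has nothing left to match and the whole match fails.
possessive = re.match(r"a++a", text) if sys.version_info >= (3, 11) else None

print(eager, reluctant, backtracked, possessive)
```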
8080

8181
### Atom
8282

8383
```
84-
Atom -> Anchor | EscapeSequence | BuiltinCharClass
84+
Atom -> Anchor | EscapeSequence | BuiltinCharClass | Backreference | Subpattern
8585
```
8686

87-
Atoms are the smallest unit of regular expression syntax that cannot be split into smaller syntactic expressions.
87+
Atoms are the smallest units of regular expression syntax, and cannot be split into smaller syntactic expressions. They mainly include escape sequences e.g `\b`, `\d`, as well as meta-characters such as `.` and `$`.
8888

8989
### Groups
9090

9191
```
9292
Group -> GroupStart RegexNode ')'
93-
GroupStart -> '(?' GroupKind | '('
94-
GroupKind -> ':' | '|' | '>' | '=' | '!' | '*' | '<=' | '<!' | '<*'
95-
| NamedGroup | MatchingOptionSeq (':' | ')')
93+
GroupStart -> '(' GroupKind | '('
94+
GroupKind -> '' | '?' BasicGroupKind | '*' PCRE2GroupKind ':'
95+
96+
BasicGroupKind -> ':' | '|' | '>' | '=' | '!' | '*' | '<=' | '<!' | '<*'
97+
| NamedGroup
98+
| MatchingOptionSeq (':' | ')')
99+
100+
PCRE2GroupKind -> 'atomic'
101+
| 'pla' | 'positive_lookahead'
102+
| 'nla' | 'negative_lookahead'
103+
| 'plb' | 'positive_lookbehind'
104+
| 'nlb' | 'negative_lookbehind'
105+
| 'napla' | 'non_atomic_positive_lookahead'
106+
| 'naplb' | 'non_atomic_positive_lookbehind'
107+
| 'sr' | 'script_run'
108+
| 'asr' | 'atomic_script_run'
96109
97110
NamedGroup -> 'P<' GroupNameBody '>'
98111
| '<' GroupNameBody '>'
@@ -105,12 +118,30 @@ Identifier -> [\w--\d] \w*
105118

106119
Groups define a new scope within which a recursive regular expression pattern may occur. Groups have different semantics depending on how they are introduced, some may capture the nested match, some may match against the input without advancing, some may change the matching options set in the new scope, etc.
107120

108-
**TODO: Something like "note that there are other things that may syntactically appear similarly to groups, but are their own constructs. See .... in-line options, backreferences, ... **
121+
Groups may be named. A group name may contain letters, numbers, and `_`, but must not start with a number. This restriction follows the behavior of other regex engines and avoids ambiguities between named and numeric group references.
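
The naming rule can be sketched as a small validator. This is only an illustration in Python: `is_valid_group_name` is a hypothetical helper, and restricting names to ASCII word characters is an assumption, since the grammar rule `[\w--\d] \w*` does not say which definition of `\w` applies.

```python
import re

# `[^\W\d]` reads as "a word character that is not a digit", which mirrors
# the `[\w--\d]` subtraction in the Identifier rule. ASCII-only is assumed.
GROUP_NAME = re.compile(r"[^\W\d]\w*", re.ASCII)


def is_valid_group_name(name: str) -> bool:
    """Return True if `name` is a letter or `_` followed by word characters."""
    return GROUP_NAME.fullmatch(name) is not None


print(is_valid_group_name("year"), is_valid_group_name("1st"))
```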
122+
123+
Groups may be used to change the matching options present within their scope, see the *Matching options* section.
124+
125+
Note there are additional constructs that may syntactically appear similar to groups, but are distinct. See the *group-like atoms* section.
126+
109127

110128
#### Lookahead and lookbehind
111129

130+
- `(?=` specifies a lookahead that attempts to match against the group body, but does not advance.
131+
- `(?!` specifies a negative lookahead that ensures the group body does not match, and does not advance.
132+
- `(?<=` specifies a lookbehind that attempts to match the group body against the input before the current position. Does not advance the input.
133+
- `(?<!` specifies a negative lookbehind that ensures the group body does not match the input before the current position. Does not advance the input.
134+
135+
PCRE2 defines explicitly spelled out versions of the syntax, e.g `(*negative_lookbehind:)`.
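
The four lookaround forms can be tried out with Python's `re` module, which happens to share the `(?=`, `(?!`, `(?<=` and `(?<!` spellings. Using Python here is an assumption for illustration only; the sample text is made up, and Python does not accept the spelled-out PCRE2 forms.

```python
import re

text = "cost $5 and 7, total 100 USD or 200 EUR"

# Lookahead: whole numbers followed by " USD"; the suffix is not consumed.
ahead = re.findall(r"\b\d+\b(?= USD)", text)

# Negative lookahead: whole numbers that are not followed by " USD".
not_ahead = re.findall(r"\b\d+\b(?! USD)", text)

# Lookbehind: digits immediately preceded by "$".
behind = re.findall(r"(?<=\$)\d+", text)

# Negative lookbehind: numbers that are not immediately preceded by "$".
not_behind = re.findall(r"(?<!\$)\b\d+", text)

# A lookaround on its own matches without advancing the input.
empty_span = re.match(r"(?=cost)", text).span()

print(ahead, not_ahead, behind, not_behind, empty_span)
```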
136+
137+
#### Atomic groups
138+
139+
**TODO: Add description**
140+
112141
#### Script runs
113142

143+
**TODO: Add description**
144+
114145
#### Balancing groups
115146

116147
```
@@ -119,6 +150,50 @@ BalancingGroupBody -> Identifier? '-' Identifier
119150

120151
Introduced by .NET, balancing groups extend the `GroupNameBody` syntax to support the ability to refer to a prior group. Upon matching, the prior group is deleted, and any intermediate matched input becomes the capture of the current group.
121152

153+
### Matching options
154+
155+
```
156+
MatchingOptionSeq -> '^' MatchingOption*
157+
| MatchingOption+
158+
| MatchingOption* '-' MatchingOption*
159+
160+
MatchingOption -> 'i' | 'J' | 'm' | 'n' | 's' | 'U' | 'x' | 'xx' | 'w' | 'D' | 'P' | 'S' | 'W' | 'y{' ('g' | 'w') '}'
161+
```
162+
163+
A matching option sequence may be used as a group specifier, and denotes a change in matching options for the scope of that group. For example `(?x:a b c)` enables extended syntax for `a b c`. A matching option sequence may be part of an "isolated group" which has an implicit scope that wraps the remaining elements of the current group. For example, `(?x)a b c` also enables extended syntax for `a b c`.
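
The difference between the scoped and isolated forms can be sketched with Python's `re` module, used here only as a convenient stand-in engine (an assumption; note that Python accepts the isolated form only at the very start of a pattern).

```python
import re

# Scoped form: the option applies only inside the group.
scoped = re.compile(r"(?i:a)b")

# Isolated form: the option applies to everything that follows it.
isolated = re.compile(r"(?i)ab")

# Extended syntax makes whitespace in the pattern non-semantic.
extended_scoped = re.compile(r"(?x:a b c)")
extended_isolated = re.compile(r"(?x)a b c")

print(bool(scoped.fullmatch("Ab")), bool(scoped.fullmatch("AB")))
print(bool(isolated.fullmatch("AB")))
print(bool(extended_scoped.fullmatch("abc")), bool(extended_isolated.fullmatch("abc")))
```

Without the `x` option, the spaces in `a b c` are literal, so the same pattern would no longer match `abc`.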
164+
165+
We support all the matching options accepted by PCRE, ICU, and Oniguruma. In addition, we accept some matching options unique to our matching engine.
166+
167+
#### PCRE options
168+
169+
- `i`: Case insensitive matching
170+
- `J`: Allows multiple groups to share the same name, which is otherwise forbidden
171+
- `m`: Enables `^` and `$` to match against the start and end of a line rather than only the start and end of the entire string
172+
- `n`: Disables capturing of `(...)` groups. Named capture groups must be used instead.
173+
- `s`: Changes `.` to match any character, including newlines.
174+
- `U`: Changes quantifiers to be reluctant by default, with the `?` specifier changing to mean eager.
175+
- `x`, `xx`: Enables extended syntax mode, which allows non-semantic whitespace and end-of-line comments. See the *trivia* section for more info.
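
The effect of the `m` and `s` options can be sketched with Python's `re` module, which uses the same letters for these two options. This is an illustration under that assumption; Python does not implement `J`, `n`, `U` or `xx`.

```python
import re

text = "one\ntwo"

# Without `m`, `^` and `$` only anchor to the whole string.
whole_string = re.findall(r"^\w+$", text)

# With `m`, `^` and `$` also match at the start and end of each line.
per_line = re.findall(r"(?m)^\w+$", text)

# Without `s`, `.` refuses to match the newline between the two words.
dot_default = re.findall(r"one.two", text)

# With `s`, `.` matches any character, including the newline.
dot_all = re.findall(r"(?s)one.two", text)

print(whole_string, per_line, dot_default, dot_all)
```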
176+
177+
#### ICU options
178+
179+
- `w`: Enables the Unicode interpretation of word boundaries `\b`. **TODO: Should this be the default?**
180+
181+
#### Oniguruma options
182+
183+
- `D`: Enables ASCII-only digit matching for `\d`, `\p{Digit}`, `[:digit:]`
184+
- `S`: Enables ASCII-only space matching for `\s`, `\p{Space}`, `[:space:]`
185+
- `W`: Enables ASCII-only word matching for `\w`, `\p{Word}`, `[:word:]`, and `\b`
186+
- `P`: Enables ASCII-only for all POSIX properties (including `digit`, `space`, and `word`)
187+
- `y{g}`, `y{w}`: Changes the meaning of `\X`, `\y`, `\Y`. These are mutually exclusive options, with `y{g}` specifying extended grapheme cluster mode, and `y{w}` specifying word mode.
188+
189+
#### Swift options
190+
191+
These options are specific to the Swift regex matching engine and control the semantic level at which matching takes place.
192+
193+
- `X`: Grapheme cluster matching
194+
- `u`: Unicode scalar matching
195+
- `b`: Byte matching
196+
122197

123198
### Anchors
124199

@@ -165,12 +240,29 @@ EscapeSequence -> '\a' | '\b' | '\c' <Char> | '\e' | '\f' | '\n' | '\r' | '\t'
165240

166241
These escape sequences denote a specific character. Note that `\b` may only be used in a custom character class, otherwise it represents a word boundary.
167242

243+
**TODO: List these out with a very brief description of what they mean.**
244+
168245
### Builtin character classes
169246

170247
```
171248
BuiltinCharClass -> '.' | '\d' | '\D' | '\h' | '\H' | '\O' | '\R' | '\s' | '\S' | '\v' | '\V' | '\w' | '\W' | '\X'
172249
```
173250

251+
- `.`: Any character excluding newlines
252+
- `\d`: Digit character
253+
- `\D`: Non-digit character
254+
- `\h`: Horizontal space character
255+
- `\H`: Non-horizontal-space character
256+
- `\O`: Any character (including newlines). This is syntax from Oniguruma.
257+
- `\R`: Newline sequence
258+
- `\s`: Whitespace character
259+
- `\S`: Non-whitespace character
260+
- `\v`: Vertical space character
261+
- `\V`: Non-vertical-space character
262+
- `\w`: Word character
263+
- `\W`: Non-word character
264+
- `\X`: Any extended grapheme cluster
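
A few of these classes can be exercised with Python's `re` module as a quick sanity check. This is an assumption-laden sketch: Python only implements `.`, `\d`, `\D`, `\s`, `\S`, `\w` and `\W` from the list above, so `\h`, `\O`, `\R`, `\v` (in this sense) and `\X` are not shown.

```python
import re

sample = "a1 _\n"

digits = re.findall(r"\d", sample)        # digit characters
non_digits = re.findall(r"\D", sample)    # everything that is not a digit
words = re.findall(r"\w", sample)         # letters, digits and `_`
spaces = re.findall(r"\s", sample)        # whitespace, including the newline
any_char = re.findall(r".", sample)       # any character except the newline

print(digits, non_digits, words, spaces, any_char)
```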
265+
174266
### Custom character classes
175267

176268
```
@@ -179,25 +271,43 @@ Start -> '[' '^'?
179271
Set -> Member+
180272
Member -> CustomCharClass | !']' !SetOp (Range | Atom)
181273
Range -> Atom `-` Atom
274+
SetOp -> '&&' | '--' | '~~'
182275
```
183276

184-
Custom characters classes introduce their own language, in which most regular expression metacharacters become literal.
277+
Custom character classes introduce their own language, in which most regular expression metacharacters become literal. The basic element in a custom character class is an `Atom`, though not all atoms are valid, e.g a backreference cannot appear in one.
278+
279+
Ranges of characters may be specified with the `-` character, e.g `[a-z]` matches against the letters from `a` to `z`. If `-` does not appear between two characters, it is interpreted as literal, e.g `[-a]` is the character class of `-` and `a`. **TODO: .NET's use of it for subtraction**
280+
281+
Custom character classes may be nested within each other, and may be used with set operations. The supported set operations are intersection `&&`, subtraction `--`, and symmetric difference `~~`.
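
The range and literal `-` rules can be checked with Python's `re` module. The set operators are not available there, so in this sketch they are modelled with Python's built-in `set` type instead; both choices are assumptions made for illustration, and the bracketed patterns in the comments show the spelling described above.

```python
import re
import string

sample = "a-Z z"

# `-` between two characters forms a range.
in_range = re.findall(r"[a-z]", sample)

# `-` at the start of the class is a literal character.
dash_or_a = re.findall(r"[-a]", sample)

letters = set(string.ascii_lowercase)   # [a-z]
vowels = set("aeiou")                   # [aeiou]

intersection = letters & vowels                  # [[a-z]&&[aeiou]]
subtraction = letters - vowels                   # [[a-z]--[aeiou]]
symmetric_difference = set("abc") ^ set("bcd")   # [[abc]~~[bcd]]

print(in_range, dash_or_a, sorted(intersection), sorted(symmetric_difference))
```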
185282

186283

187284
### Character properties
188285

189286
```
190-
CharacterProperty -> ('p{' | 'P{') PropertyName ('=' PropertyName)? '}'
191-
PropertyName -> [\s\w-]+
287+
CharacterProperty -> '\' ('p' | 'P') '{' PropertyContents '}'
288+
POSIXCharacterProperty -> '[:' PropertyContents ':]'
289+
290+
PropertyContents -> PropertyName ('=' PropertyName)?
291+
PropertyName -> [\s\w-]+
192292
```
193293

294+
A character property specifies a particular Unicode or POSIX property to match against. In general, a property consists of both a key and a value, e.g `General_Category=Space_Separator`; however, some keys and values may appear on their own (with the other name being inferred). **TODO: Clarify the exact cases this happens**.
295+
296+
**TODO: Spell out the properties we recognize while parsing vs. those we just parse as String?**
297+
298+
Note that the internal `PropertyContents` syntax is shared by both the `\p{...}` and POSIX-style `[:...:]` syntax, allowing e.g `[:script=Latin:]`.
299+
194300
### Named characters
195301

196302
```
197303
NamedCharacter -> '\N{' CharName '}'
198304
CharName -> 'U+' HexDigit{1...8} | [\s\w-]+
199305
```
200306

307+
Allows a specific Unicode scalar to be specified by name or code point.
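
One way to picture the two `CharName` forms is a resolver that maps the contents of `\N{...}` to a scalar. The sketch below is hypothetical: `resolve_char_name` is not part of any real API, and using Python's `unicodedata` database for name lookup is an assumption about how names would be resolved.

```python
import re
import unicodedata

# Mirrors `CharName -> 'U+' HexDigit{1...8} | [\s\w-]+`.
CHAR_NAME = re.compile(r"U\+([0-9A-Fa-f]{1,8})|([\s\w-]+)")


def resolve_char_name(char_name: str) -> str:
    """Resolve the body of `\\N{...}` to a single Unicode scalar."""
    match = CHAR_NAME.fullmatch(char_name)
    if match is None:
        raise ValueError(f"not a valid character name: {char_name!r}")
    if match.group(1) is not None:
        # Code point form, e.g. `U+41`.
        return chr(int(match.group(1), 16))
    # Name form, e.g. `LATIN SMALL LETTER A`; raises KeyError if unknown.
    return unicodedata.lookup(match.group(2))


print(resolve_char_name("U+41"), resolve_char_name("LATIN SMALL LETTER A"))
```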
308+
309+
**TODO: Should this be called "named scalar" or similar?**
310+
201311
### Trivia
202312

203313
```
@@ -208,15 +318,7 @@ Whitespace -> \s+
208318

209319
Trivia is consumed by the regular expression parser, but has no semantic meaning. Non-semantic whitespace may only occur when either of the extended syntax matching options `(?x)` or `(?xx)` is enabled.
210320

211-
### Matching options
212-
213-
```
214-
MatchingOptionSeq -> '^' MatchingOption*
215-
| MatchingOption+
216-
| MatchingOption* '-' MatchingOption*
217-
218-
MatchingOption -> 'i' | 'J' | 'm' | 'n' | 's' | 'U' | 'x' | 'xx' | 'w' | 'D' | 'P' | 'S' | 'W' | 'y{' ('g' | 'w') '}'
219-
```
321+
**TODO: Differences between PCRE extended syntax and our syntax**
220322

221323
### References
222324

@@ -226,6 +328,8 @@ NumberRef -> ('+' | '-')? <Decimal Number> RecursionLevel?
226328
RecursionLevel -> '+' <Int> | '-' <Int>
227329
```
228330

331+
A reference is an abstract identifier for a particular capturing group in a regular expression. It can either be named or numbered, and in the latter case may be specified relative to the current group. For example `-2` refers to the capture group `N - 2` where `N` is the number of the next capture group. References may refer to groups ahead of the current position e.g `+3`, or the name of a future group. These may be useful in recursive cases where the group being referenced has been matched in a prior iteration.
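
The arithmetic for relative references can be sketched as follows. `resolve_number_ref` is a hypothetical helper, not part of any parser API. The `-` case follows the rule stated above; the `+` case assumes the PCRE convention that `+1` names the next group to be opened, which the text above does not spell out. The optional `RecursionLevel` suffix is ignored.

```python
def resolve_number_ref(ref: str, next_group: int) -> int:
    """Resolve '2', '-2' or '+3' to an absolute capture group number.

    `next_group` is N, the number the next capture group will receive.
    """
    if ref.startswith("-"):
        # Relative backward reference: N - k.
        target = next_group - int(ref[1:])
    elif ref.startswith("+"):
        # Assumed PCRE convention: `+1` is group N, so `+k` is N + k - 1.
        target = next_group + int(ref[1:]) - 1
    else:
        # Absolute reference.
        return int(ref)
    if target < 1:
        raise ValueError(f"relative reference {ref!r} points before group 1")
    return target


print(resolve_number_ref("-2", 5), resolve_number_ref("+3", 5))
```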
332+
229333
#### Backreferences
230334

231335
```
@@ -238,6 +342,8 @@ Backreference -> '\g{' NameOrNumberRef '}'
238342
| '(?P=' Identifier ')'
239343
```
240344

345+
A backreference evaluates to the value last captured by a given capturing group.
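
For instance, a backreference can detect a repeated word. The sketch uses Python's `re` module as a stand-in engine (an assumption for illustration); of the spellings in the grammar above it supports the numeric `\1` form and the named `(?P=name)` form.

```python
import re

# Numbered backreference: `\1` must match the same text group 1 captured.
doubled = re.compile(r"\b(\w+) \1\b")
repeated_word = doubled.search("it was the the best").group(1)

# Named backreference using the `(?P=...)` spelling.
named = re.compile(r"(?P<word>\w+) (?P=word)")
same = named.fullmatch("hey hey")
different = named.fullmatch("hey you")

print(repeated_word, same is not None, different is not None)
```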
346+
241347
#### Subpatterns
242348

243349
```
@@ -251,6 +357,9 @@ GroupLikeSubpatternBody -> 'P>' <String>
251357
| NumberRef
252358
```
253359

360+
A subpattern causes a particular group to be re-evaluated at the current position. The syntax `(?R)` is equivalent to `(?0)`, and causes the entire pattern to be recursed.
361+
362+
254363
### Conditionals
255364

256365
```
@@ -274,13 +383,38 @@ PCREVersionCheck -> '>'? '=' PCREVersionNumber
274383
PCREVersionNumber -> <Int> '.' <Int>
275384
```
276385

386+
A conditional evaluates a particular condition and chooses a branch to match against accordingly. One or two branches may be specified. If one branch is specified, e.g `(?(...)x)`, it is treated as the true branch. Note this includes an empty true branch, e.g `(?(...))`, which is the null pattern as described in *top-level regular expression*. If two branches are specified, e.g `(?(...)x|y)`, the first is treated as the true branch and the second as the false branch.
387+
388+
A condition may be:
389+
390+
- A reference to a capture group, which checks whether the group matched successfully.
391+
- A recursion check on either a particular group or the entire regex. In the former case, this checks to see if the last recursive call is through that group. In the latter case, it checks if the match is currently taking place in any kind of recursive call.
392+
- An arbitrary recursive regular expression, which is matched against, and evaluates to true if the match is successful. (**TODO: Clarify whether it introduces captures**)
393+
- A PCRE version check.
394+
395+
The `DEFINE` keyword is not used as a condition, but rather as a way to define a group that is not evaluated in place, but may be referenced by a subpattern.
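
The group-reference condition, with one and with two branches, can be tried in Python's `re` module, which supports only that kind of condition. It is used here as an assumed stand-in; recursion checks, version checks and `DEFINE` are not available there.

```python
import re

# One branch: require a closing parenthesis only if group 1 (the opening
# parenthesis) took part in the match.
one_branch = re.compile(r"(\()?\d+(?(1)\))")

# Two branches: after an optional `a`, expect `b` if it matched, else `c`.
two_branch = re.compile(r"(a)?(?(1)b|c)")

for candidate in ["(42)", "42", "(42", "42)"]:
    print(candidate, one_branch.fullmatch(candidate) is not None)

for candidate in ["ab", "c", "ac", "b"]:
    print(candidate, two_branch.fullmatch(candidate) is not None)
```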
396+
277397
### PCRE backtracking directives
278398

279399
```
280400
BacktrackingDirective -> '(*' BacktrackingDirectiveKind (':' <String>)? ')'
281401
BacktrackingDirectiveKind -> 'ACCEPT' | 'FAIL' | 'F' | 'MARK' | '' | 'COMMIT' | 'PRUNE' | 'SKIP' | 'THEN'
282402
```
283403

404+
This is syntax specific to PCRE, and is used to control backtracking behavior. Any of the directives may include an optional tag, however `MARK` must have a tag. The empty directive is treated as `MARK`. Only the `ACCEPT` directive may be quantified, as it can use the backtracking behavior of the engine to be evaluated only if needed by a reluctant quantification.
405+
406+
- `ACCEPT`: Causes matching to terminate immediately as a successful match. If used within a subpattern, only that level of recursion is terminated.
407+
- `FAIL`, `F`: Causes matching to fail, forcing backtracking to occur if possible.
408+
- `MARK`: Assigns a label to the current matching path, which is passed back to the caller on success. Subsequent `MARK` directives overwrite the label assigned, so only the last is passed back.
409+
- `COMMIT`: Prevents backtracking from reaching any point prior to this directive.
410+
411+
412+
**TODO:**
413+
414+
- `PRUNE`:
415+
- `SKIP`:
416+
- `THEN`:
417+
284418
### PCRE global matching options
285419

286420
```
@@ -298,6 +432,23 @@ NewlineKind -> 'CRLF' | 'CR' | 'ANYCRLF' | 'ANY' | 'LF' | 'NUL'
298432
NewlineSequenceKind -> 'BSR_ANYCRLF' | 'BSR_UNICODE'
299433
```
300434

435+
This is syntax specific to PCRE, and allows a set of global options to appear at the start of a regular expression. They may not appear at any other position.
436+
437+
- `LIMIT_DEPTH`, `LIMIT_HEAP`, `LIMIT_MATCH`: These place certain limits on the resources the matching engine may consume, and matches it may make.
438+
- `CRLF`, `CR`, `ANYCRLF`, `ANY`, `LF`, `NUL`: These control the definition of a newline character, which is used when matching e.g the `.` character class, and evaluating where a line ends in multi-line mode.
439+
- `BSR_ANYCRLF`, `BSR_UNICODE`: These change the definition of `\R`.
440+
441+
**TODO:**
442+
443+
- `NOTEMPTY_ATSTART`:
444+
- `NOTEMPTY`:
445+
- `NO_AUTO_POSSESS`:
446+
- `NO_DOTSTAR_ANCHOR`:
447+
- `NO_JIT`:
448+
- `NO_START_OPT`:
449+
- `UTF`:
450+
- `UCP`:
451+
301452
### Callouts
302453

303454
```
@@ -327,6 +478,8 @@ OnigurumaCalloutContents -> <String>
327478
OnigurumaCalloutDirection -> 'X' | '<' | '>'
328479
```
329480

481+
A callout is a feature that allows a user-supplied function to be called when matching reaches that point in the pattern. We support parsing both the PCRE and Oniguruma callout syntax. The PCRE syntax accepts a string or numeric argument that is passed to the function. The Oniguruma syntax is more involved, and may accept a tag, argument list, or even an arbitrary program in the 'callout of contents' syntax.
482+
330483
### Absent functions
331484

332485
```
@@ -336,8 +489,17 @@ AbsentFunction -> '(?~' RegexNode ')'
336489
| '(?~|)'
337490
```
338491

492+
An absent function is an Oniguruma feature that allows for the easy inversion of a given pattern. There are 4 variants of the syntax:
493+
494+
- `(?~|absent|expr)`: Absent expression, which attempts to match against `expr`, but is limited by the range that is not matched by `absent`.
495+
- `(?~absent)`: Absent repeater, which matches against any input not matched by `absent`. Equivalent to `(?~|absent|\O*)`.
496+
- `(?~|absent)`: Absent stopper, which limits any subsequent matching to not include `absent`.
497+
- `(?~|)`: Absent clearer, which undoes the effects of the absent stopper.
498+
339499
## Syntactic differences between engines
340500

501+
**TODO: Intro**
502+
341503
### Character class set operations
342504

343505
In a custom character class, some engines allow for binary set operations that take two character class inputs, and produce a new character class output. However, which set operations are supported, and the spellings used, differ by engine.
@@ -390,7 +552,7 @@ In PCRE, `x{2,4}` is a range quantifier meaning that `x` can be matched from 2 t
390552

391553
### Implicitly-scoped matching option scopes
392554

393-
PCRE and Oniguruma both support changing the active matching options through the `(?i)` expression. However, they have differing semantics when it comes to their scoping. In Oniguruma, it is treated as an implicit new scope that wraps everything until the end of the current group. In PCRE, it is treated as changing the matching option for all the following expressions until the end of the group.
555+
PCRE and Oniguruma both support changing the active matching options through an isolated group e.g `(?i)`. However, they have differing semantics when it comes to their scoping. In Oniguruma, it is treated as an implicit new scope that wraps everything until the end of the current group. In PCRE, it is treated as changing the matching option for all the following expressions until the end of the group.
394556

395557
These sound similar, but have different semantics around alternations, e.g for `a(?i)b|c|d`, in Oniguruma this becomes `a(?i:b|c|d)`, where `a` is no longer part of the alternation. However in PCRE it becomes `a(?i:b)|(?i:c)|(?i:d)`, where `a` remains a child of the alternation.
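
The practical difference shows up when the two desugared forms are run against the same input. The sketch below compiles each explicit form with Python's `re` module; this is an assumption for illustration, since Python itself does not accept `(?i)` in the middle of a pattern and so cannot demonstrate either implicit behavior directly.

```python
import re

# Oniguruma reading of `a(?i)b|c|d`: `a` followed by a case-insensitive
# alternation.
oniguruma_style = re.compile(r"a(?i:b|c|d)")

# PCRE reading: `a` stays inside the first branch of the alternation.
pcre_style = re.compile(r"a(?i:b)|(?i:c)|(?i:d)")

for candidate in ["aB", "aC", "C"]:
    print(
        candidate,
        oniguruma_style.fullmatch(candidate) is not None,
        pcre_style.fullmatch(candidate) is not None,
    )
```

Both forms accept `aB`, but only the Oniguruma reading accepts `aC`, and only the PCRE reading accepts a bare `C`.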
396558
