A top-level regular expression may consist of a sequence of global matching options followed by a `RegexNode`, which is the recursive part of the grammar that may be nested within e.g a group.
36
+
A top-level regular expression may consist of a sequence of global matching options followed by a `RegexNode`, which is the recursive part of the grammar that may be nested within e.g. a group. A regex node may be empty; this is the null pattern, which always matches but does not advance the input.
37
37
38
38
### Alternation
39
39
@@ -76,23 +76,36 @@ The quantifiers supported are:
76
76
-`{,m}`: Up to `m` matches
77
77
-`{n}`: Exactly `n` matches
78
78
79
-
A quantifier may optionally followed by `?` or `+`, which apply certain semantics to the quantification. If neither are specified, by default the quantification happens eagerly, meaning that it will try to maximize the number of matches made. However, if `?` is specified, the number of matches will instead be minimized. If `+` is specified, eager matching occurs, but with the additional semantic that it may not be backtracked into to try a different number of matches.
79
+
A quantifier may optionally be followed by `?` or `+`, which adjust its semantics. If neither is specified, quantification happens *eagerly* by default, meaning that it tries to maximize the number of matches made. If `?` is specified, quantification happens *reluctantly*, meaning that the number of matches is minimized instead. If `+` is specified, *possessive* matching occurs: this is eager matching with the additional constraint that the quantification may not be backtracked into to try a different number of matches.
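As a rough illustration of eager versus reluctant quantification, here is a minimal sketch using Python's `re` module as a stand-in engine (an assumption for illustration; it is not the engine this document describes). The possessive `+` suffix is left out because it is only available in Python 3.11 and later.

```python
import re

text = "<a><b>"

# Eager (default): `.+` takes as much input as it can, then gives
# characters back until the trailing `>` can match.
eager = re.match(r"<.+>", text).group()

# Reluctant: `.+?` takes as little input as it can, growing only
# when the rest of the pattern fails to match.
reluctant = re.match(r"<.+?>", text).group()

print(eager)      # <a><b>
print(reluctant)  # <a>
```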
80
80
81
81
### Atom
82
82
83
83
```
84
-
Atom -> Anchor | EscapeSequence | BuiltinCharClass
Atoms are the smallest unit of regular expression syntax that cannot be split into smaller syntactic expressions.
87
+
Atoms are the smallest units of regular expression syntax; they cannot be split into smaller syntactic expressions. They mainly include escape sequences such as `\b` and `\d`, as well as metacharacters such as `.` and `$`.
Groups define a new scope within which a recursive regular expression pattern may occur. Groups have different semantics depending on how they are introduced: some capture the nested match, some match against the input without advancing, some change the matching options set in the new scope, etc.
107
120
108
-
**TODO: Something like "note that there are other things that may syntactically appear similarly to groups, but are their own constructs. See .... in-line options, backreferences, ... **
121
+
Groups may be named. A name may consist of letters, numbers, and `_`, but must not start with a number. This restriction follows the behavior of other regex engines and avoids ambiguity between named and numeric group references.
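The naming rule can be sketched as a small check. The function name `is_valid_group_name` is hypothetical, and which Unicode characters count as letters or numbers is an assumption here (Python's `str.isalpha` and `str.isalnum` are used); the real parser may draw the line differently.

```python
def is_valid_group_name(name: str) -> bool:
    """Hypothetical check mirroring the naming rule described above."""
    if not name:
        return False
    # The first character must be a letter or `_`, never a number.
    first = name[0]
    if not (first.isalpha() or first == "_"):
        return False
    # Remaining characters may be letters, numbers, or `_`.
    return all(ch.isalnum() or ch == "_" for ch in name)


print(is_valid_group_name("year"))   # True
print(is_valid_group_name("_tmp1"))  # True
print(is_valid_group_name("1st"))    # False
```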
122
+
123
+
Groups may be used to change the matching options present within their scope, see the *Matching options* section.
124
+
125
+
Note there are additional constructs that may syntactically appear similar to groups, but are distinct. See the *group-like atoms* section.
126
+
109
127
110
128
#### Lookahead and lookbehind
111
129
130
+
-`(?=` specifies a lookahead that attempts to match against the group body, but does not advance.
131
+
-`(?!` specifies a negative lookahead that ensures the group body does not match, and does not advance.
132
+
-`(?<=` specifies a lookbehind that attempts to match the group body against the input before the current position. Does not advance the input.
133
+
-`(?<!` specifies a negative lookbehind that ensures the group body does not match the input before the current position. Does not advance the input.
134
+
135
+
PCRE2 defines explicitly spelled out versions of the syntax, e.g `(*negative_lookbehind:)`.
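The four lookaround forms can be tried out with Python's `re` module, used here only as a convenient stand-in engine (an assumption; Python additionally requires lookbehind bodies to have a fixed width). The sample strings are made up for illustration.

```python
import re

# Lookahead: match "foo" only when "bar" follows; "bar" is not consumed.
ahead = re.search(r"foo(?=bar)", "foobar")

# Negative lookahead: match "foo" only when "bar" does not follow.
neg_ahead = re.search(r"foo(?!bar)", "foobar foobaz")

# Lookbehind: match digits only when "$" comes immediately before them.
behind = re.search(r"(?<=\$)\d+", "cost: $42")

# Negative lookbehind: match a number that is not preceded by "$".
neg_behind = re.search(r"(?<!\$)\b\d+", "cost: $42 for 7 items")

print(ahead.group(), ahead.end())  # foo 3
print(neg_ahead.start())           # 7
print(behind.group())              # 42
print(neg_behind.group())          # 7
```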
Introduced by .NET, balancing groups extend the `GroupNameBody` syntax to support the ability to refer to a prior group. Upon matching, the prior group is deleted, and any intermediate matched input becomes the capture of the current group.
A matching option sequence may be used as a group specifier, and denotes a change in matching options for the scope of that group. For example `(?x:a b c)` enables extended syntax for `a b c`. A matching option sequence may be part of an "isolated group" which has an implicit scope that wraps the remaining elements of the current group. For example, `(?x)a b c` also enables extended syntax for `a b c`.
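The difference between a scoped option group and an isolated one can be sketched with Python's `re` module, which is an assumed stand-in rather than the engine described here. Note that recent Python versions only accept the isolated form at the very start of a pattern, which is stricter than the behavior described above.

```python
import re

# Scoped form: only `b` is matched case-insensitively.
scoped = re.compile(r"a(?i:b)c")

# Isolated form: the option applies to everything that follows it.
isolated = re.compile(r"(?i)abc")

print(bool(scoped.fullmatch("aBc")))    # True
print(bool(scoped.fullmatch("ABc")))    # False, `a` is still case-sensitive
print(bool(isolated.fullmatch("ABC")))  # True
```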
164
+
165
+
We support all the matching options accepted by PCRE, ICU, and Oniguruma. In addition, we accept some matching options unique to our matching engine.
166
+
167
+
#### PCRE options
168
+
169
+
-`i`: Case insensitive matching
170
+
-`J`: Allows multiple groups to share the same name, which is otherwise forbidden
171
+
-`m`: Enables `^` and `$` to match against the start and end of a line rather than only the start and end of the entire string
172
+
-`n`: Disables capturing of `(...)` groups. Named capture groups must be used instead.
173
+
-`s`: Changes `.` to match any character, including newlines.
174
+
-`U`: Changes quantifiers to be reluctant by default, with the `?` specifier changing to mean eager.
175
+
-`x`, `xx`: Enables extended syntax mode, which allows non-semantic whitespace and end-of-line comments. See the *trivia* section for more info.
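A few of these options behave the same way in Python's `re` module, so a short sketch with that module (an assumed stand-in engine) can show the effect of `m`, `s`, and `i`. Options such as `J`, `n`, `U`, and `xx` have no inline equivalent in Python and are not shown.

```python
import re

text = "ab\ncd"

# Without `m`, `^` and `$` only anchor at the ends of the whole string.
without_m = re.findall(r"^\w+$", text)

# With `m`, they also anchor at the start and end of each line.
with_m = re.findall(r"(?m)^\w+$", text)

# Without `s`, `.` does not match a newline; with `s`, it does.
dot_default = re.fullmatch(r"a.b", "a\nb")
dot_s = re.fullmatch(r"(?s)a.b", "a\nb")

# `i` makes literal characters match regardless of case.
case_insensitive = re.fullmatch(r"(?i)swift", "SWIFT")

print(without_m)  # []
print(with_m)     # ['ab', 'cd']
```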
176
+
177
+
#### ICU options
178
+
179
+
-`w`: Enables the Unicode interpretation of word boundaries `\b`. **TODO: Should this be the default?**
180
+
181
+
#### Oniguruma options
182
+
183
+
-`D`: Enables ASCII-only digit matching for `\d`, `\p{Digit}`, `[:digit:]`
184
+
-`S`: Enables ASCII-only space matching for `\s`, `\p{Space}`, `[:space:]`
185
+
-`W`: Enables ASCII-only word matching for `\w`, `\p{Word}`, `[:word:]`, and `\b`
186
+
-`P`: Enables ASCII-only for all POSIX properties (including `digit`, `space`, and `word`)
187
+
-`y{g}`, `y{w}`: Changes the meaning of `\X`, `\y`, `\Y`. These are mutually exclusive options, with `y{g}` specifying extended grapheme cluster mode, and `y{w}` specifying word mode.
188
+
189
+
#### Swift options
190
+
191
+
These options are specific to the Swift regex matching engine and control the semantic level at which matching takes place.
These escape sequences denote a specific character. Note that `\b` may only be used in a custom character class, otherwise it represents a word boundary.
167
242
243
+
**TODO: List these out with a very brief description of what they mean.**
-`\O`: Any character (including newlines). This is syntax from Oniguruma.
257
+
-`\R`: Newline sequence
258
+
-`\s`: Whitespace character
259
+
-`\S`: Non-whitespace character
260
+
-`\v`: Vertical space character
261
+
-`\V`: Non-vertical-space character
262
+
-`\w`: Word character
263
+
-`\W`: Non-word character
264
+
-`\X`: Any extended grapheme cluster
265
+
174
266
### Custom character classes
175
267
176
268
```
@@ -179,25 +271,43 @@ Start -> '[' '^'?
179
271
Set -> Member+
180
272
Member -> CustomCharClass | !']' !SetOp (Range | Atom)
181
273
Range -> Atom `-` Atom
274
+
SetOp -> '&&' | '--' | '~~'
182
275
```
183
276
184
-
Custom characters classes introduce their own language, in which most regular expression metacharacters become literal.
277
+
Custom character classes introduce their own language, in which most regular expression metacharacters become literal. The basic element in a custom character class is an `Atom`, though not all atoms are valid; for example, a backreference cannot appear in one.
278
+
279
+
Ranges of characters may be specified with the `-` character, e.g `[a-z]` matches against the letters from `a` to `z`. If `-` does not appear between two characters, it is interpreted as literal, e.g `[-a]` is the character class of `-` and `a`. **TODO: .NET's use of it for subtraction**
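The two readings of `-` can be checked with Python's `re` module, which treats these particular cases the same way (Python is an assumed stand-in engine here, not the one this document specifies).

```python
import re

# `-` between two characters forms a range.
lower = re.compile(r"[a-z]")

# `-` at the start of the class is a literal character.
dash_or_a = re.compile(r"[-a]")

print(bool(lower.fullmatch("m")))      # True
print(bool(lower.fullmatch("-")))      # False
print(bool(dash_or_a.fullmatch("-")))  # True
print(bool(dash_or_a.fullmatch("b")))  # False
```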
280
+
281
+
Custom character classes may be nested within each other, and may be used with set operations. The supported set operations are intersection `&&`, subtraction `--`, and symmetric difference `~~`.
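Python's `re` module does not support these operators, so the sketch below models character classes as plain Python sets of characters to show what each operation is expected to produce. The regex spellings in the comments are illustrative assumptions about how such classes might be written.

```python
import string

letters = set(string.ascii_lowercase)  # models [a-z]
vowels = set("aeiou")                  # models [aeiou]
hex_letters = set("abcdef")            # models [a-f]

# Intersection, e.g. [[a-z]&&[a-f]]: characters in both classes.
intersection = letters & hex_letters

# Subtraction, e.g. [[a-z]--[aeiou]]: characters in the first but not the second.
subtraction = letters - vowels

# Symmetric difference, e.g. [[aeiou]~~[a-f]]: characters in exactly one class.
symmetric = vowels ^ hex_letters

print("".join(sorted(intersection)))  # abcdef
print("".join(sorted(symmetric)))     # bcdfiou
```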
A character property specifies a particular Unicode or POSIX property to match against. In general, a property consists of both a key and a value, e.g `General_Category=Whitespace`, however some keys and values may appear on their own (with the other name being inferred). **TODO: Clarify the exact cases this happens**.
295
+
296
+
**TODO: Spell out the properties we recognize while parsing vs. those we just parse as String?**
297
+
298
+
Note that the internal `PropertyContents` syntax is shared by both the `\p{...}` and POSIX-style `[:...:]` syntax, allowing e.g `[:script=Latin:]`.
299
+
194
300
### Named characters
195
301
196
302
```
197
303
NamedCharacter -> '\N{' CharName '}'
198
304
CharName -> 'U+' HexDigit{1...8} | [\s\w-]+
199
305
```
200
306
307
+
Allows a specific Unicode scalar to be specified by name or code point.
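For the by-name form, Python's `re` module (3.8 and later) accepts a similar `\N{...}` escape, so it can serve as a rough illustration; it is an assumed stand-in engine and does not accept the `U+` code point form described by the grammar above.

```python
import re
import unicodedata

# Look up the scalar by its Unicode name outside of any regex.
alpha = unicodedata.lookup("GREEK SMALL LETTER ALPHA")

# The same name used inside a pattern matches exactly that scalar.
by_name = re.fullmatch(r"\N{GREEK SMALL LETTER ALPHA}", "\u03b1")

print(alpha == "\u03b1")  # True
print(bool(by_name))      # True
```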
308
+
309
+
**TODO: Should this be called "named scalar" or similar?**
310
+
201
311
### Trivia
202
312
203
313
```
@@ -208,15 +318,7 @@ Whitespace -> \s+
208
318
209
319
Trivia is consumed by the regular expression parser, but has no semantic meaning. Non-semantic whitespace may only occur when either of the extended syntax matching options `(?x)` or `(?xx)` is enabled.
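Python's `re` module has a comparable `(?x)` mode, so a small sketch with it (an assumed stand-in engine; comment and whitespace rules may differ in detail from the grammar above) shows whitespace and end-of-line comments being ignored.

```python
import re

# With `(?x)`, whitespace and `#` comments inside the pattern are trivia.
date = re.compile(r"""(?x)
    \d{4}   # year
    -
    \d{2}   # month
""")

# Without `(?x)`, the space is a literal character that must be matched.
literal_space = re.fullmatch(r"a b", "ab")
ignored_space = re.fullmatch(r"(?x)a b", "ab")

print(bool(date.fullmatch("2024-05")))  # True
print(literal_space)                    # None
print(bool(ignored_space))              # True
```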
A reference is an abstract identifier for a particular capturing group in a regular expression. It can either be named or numbered, and in the latter case may be specified relative to the current group. For example `-2` refers to the capture group `N - 2` where `N` is the number of the next capture group. References may refer to groups ahead of the current position e.g `+3`, or the name of a future group. These may be useful in recursive cases where the group being referenced has been matched in a prior iteration.
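The arithmetic for backward relative references can be sketched as follows. The helper `resolve_relative` is hypothetical and only handles negative offsets, since the text above does not spell out how forward offsets such as `+3` are numbered.

```python
import re


def resolve_relative(offset: int, next_group: int) -> int:
    """Resolve a backward relative reference (e.g. -2) to a group number.

    `next_group` is N, the number the next capture group would receive.
    """
    if offset >= 0:
        raise ValueError("this sketch only handles negative offsets")
    target = next_group + offset
    if target < 1:
        raise ValueError("reference points before the first capture group")
    return target


# After `(a)(b)(c)` the next capture group would be number 4,
# so a `-2` reference resolves to group 2.
group = resolve_relative(-2, next_group=4)

# Python only has absolute references, so the resolved number is used directly.
match = re.fullmatch(rf"(a)(b)(c)\{group}", "abcb")

print(group)        # 2
print(bool(match))  # True
```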
A subpattern causes a particular group to be re-evaluated at the current position. The syntax `(?R)` is equivalent to `(?0)`, and causes the entire pattern to be recursed.
A conditional evaluates a particular condition, and chooses a branch to match against accordingly. 1 or 2 branches may be specified. If 1 branch is specified e.g `(?(...)x)`, it is treated as the true branch. Note this includes an empty true branch, e.g `(?(...))` which is the null pattern as described in *top-level regular expression*. If 2 branches are specified, e.g `(?(...)x|y)`, the first is treated as the true branch, the second being the false branch.
387
+
388
+
A condition may be:
389
+
390
+
- A reference to a capture group, which checks whether the group matched successfully.
391
+
- A recursion check on either a particular group or the entire regex. In the former case, this checks to see if the last recursive call is through that group. In the latter case, it checks if the match is currently taking place in any kind of recursive call.
392
+
- An arbitrary recursive regular expression, which is matched against, and evaluates to true if the match is successful. (**TODO: Clarify whether it introduces captures**)
393
+
- A PCRE version check.
394
+
395
+
The `DEFINE` keyword is not used as a condition, but rather a way in which to define a group which is not evaluated, but may be referenced by a subpattern.
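Of the condition kinds listed above, Python's `re` module supports only the capture-group reference, which is enough for a small sketch of one-branch and two-branch conditionals. Python is an assumed stand-in engine here, and the patterns are made up for illustration.

```python
import re

# One branch: require a closing `>` only if the opening `<` was captured.
one_branch = re.compile(r"(<)?\w+(?(1)>)")

# Two branches: `>` if group 1 matched, otherwise `!`.
two_branch = re.compile(r"(<)?\w+(?(1)>|!)")

print(bool(one_branch.fullmatch("<a>")))  # True
print(bool(one_branch.fullmatch("a")))    # True
print(bool(one_branch.fullmatch("<a")))   # False
print(bool(two_branch.fullmatch("a!")))   # True
print(bool(two_branch.fullmatch("a")))    # False
```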
This is syntax specific to PCRE, and is used to control backtracking behavior. Any of the directives may include an optional tag, however `MARK` must have a tag. The empty directive is treated as `MARK`. Only the `ACCEPT` directive may be quantified, as it can use the backtracking behavior of the engine to be evaluated only if needed by a reluctant quantification.
405
+
406
+
-`ACCEPT`: Causes matching to terminate immediately as a successful match. If used within a subpattern, only that level of recursion is terminated.
407
+
-`FAIL`, `F`: Causes matching to fail, forcing backtracking to occur if possible.
408
+
-`MARK`: Assigns a label to the current matching path, which is passed back to the caller on success. Subsequent `MARK` directives overwrite the label assigned, so only the last is passed back.
409
+
-`COMMIT`: Prevents backtracking from reaching any point prior to this directive.
This is syntax specific to PCRE, and allows a set of global options to appear at the start of a regular expression. They may not appear at any other position.
436
+
437
+
-`LIMIT_DEPTH`, `LIMIT_HEAP`, `LIMIT_MATCH`: These place certain limits on the resources the matching engine may consume, and matches it may make.
438
+
-`CRLF`, `CR`, `ANYCRLF`, `ANY`, `LF`, `NUL`: These control the definition of a newline character, which is used when matching e.g the `.` character class, and evaluating where a line ends in multi-line mode.
439
+
-`BSR_ANYCRLF`, `BSR_UNICODE`: These change the definition of `\R`.
A callout is a feature that allows a user-supplied function to be called when matching reaches that point in the pattern. We support parsing both the PCRE and Oniguruma callout syntax. The PCRE syntax accepts a string or numeric argument that is passed to the function. The Oniguruma syntax is more involved, and may accept a tag, argument list, or even an arbitrary program in the 'callout of contents' syntax.
An absent function is an Oniguruma feature that allows for the easy inversion of a given pattern. There are 4 variants of the syntax:
493
+
494
+
-`(?~|absent|expr)`: Absent expression, which attempts to match against `expr`, but is limited by the range that is not matched by `absent`.
495
+
-`(?~absent)`: Absent repeater, which matches against any input not matched by `absent`. Equivalent to `(?~|absent|\O*)`.
496
+
-`(?~|absent)`: Absent stopper, which limits any subsequent matching to not include `absent`.
497
+
-`(?~|)`: Absent clearer, which undoes the effects of the absent stopper.
498
+
339
499
## Syntactic differences between engines
340
500
501
+
**TODO: Intro**
502
+
341
503
### Character class set operations
342
504
343
505
In a custom character class, some engines allow for binary set operations that take two character class inputs, and produce a new character class output. However which set operations are supported and the spellings used differ by engine.
@@ -390,7 +552,7 @@ In PCRE, `x{2,4}` is a range quantifier meaning that `x` can be matched from 2 t
390
552
391
553
### Implicitly-scoped matching option scopes
392
554
393
-
PCRE and Oniguruma both support changing the active matching options through the `(?i)` expression. However, they have differing semantics when it comes to their scoping. In Oniguruma, it is treated as an implicit new scope that wraps everything until the end of the current group. In PCRE, it is treated as changing the matching option for all the following expressions until the end of the group.
555
+
PCRE and Oniguruma both support changing the active matching options through an isolated group e.g `(?i)`. However, they have differing semantics when it comes to their scoping. In Oniguruma, it is treated as an implicit new scope that wraps everything until the end of the current group. In PCRE, it is treated as changing the matching option for all the following expressions until the end of the group.
394
556
395
557
These sound similar, but have different semantics around alternations, e.g for `a(?i)b|c|d`, in Oniguruma this becomes `a(?i:b|c|d)`, where `a` is no longer part of the alternation. However in PCRE it becomes `a(?i:b)|(?i:c)|(?i:d)`, where `a` remains a child of the alternation.
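The two desugared forms quoted above can be compared directly. The sketch below uses Python's `re` module (an assumed stand-in engine) with explicitly scoped groups, so it shows the consequence of each reading rather than how either engine parses `a(?i)b|c|d` itself.

```python
import re

# Oniguruma reading: `a` followed by a case-insensitive alternation.
oniguruma_reading = re.compile(r"a(?i:b|c|d)")

# PCRE reading: a three-way alternation in which only the first branch has `a`.
pcre_reading = re.compile(r"a(?i:b)|(?i:c)|(?i:d)")

for candidate in ["aB", "aC", "C"]:
    print(
        candidate,
        bool(oniguruma_reading.fullmatch(candidate)),
        bool(pcre_reading.fullmatch(candidate)),
    )
# aB True True
# aC True False
# C False True
```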