Documentation/Evolution/RegexSyntax.md
Lines changed: 33 additions & 35 deletions
@@ -136,11 +136,11 @@ PCRE2 defines explicitly spelled out versions of the syntax, e.g `(*negative_loo
136
136
137
137
#### Atomic groups
138
138
139
-
**TODO: Add description**
139
+
An atomic group e.g `(?>...)` specifies that its contents should not be re-evaluated for backtracking. This has the same semantics as a possessive quantifier, but applies more generally to any regex pattern.
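
The effect of refusing to backtrack into a group can be sketched in Python. This is an illustration only, not part of the proposed syntax: Python's `re` module accepts `(?>...)` directly only from version 3.11, so the sketch assumes an older interpreter and emulates the atomic group with the common lookahead-plus-backreference idiom `(?=(...))\1`. The pattern and subject string are made up for the example.

```python
import re

# Ordinary greedy group: `a+` can give back one "a" so that the trailing "ab" matches.
backtracking = re.compile(r"(?:a+)ab")

# Emulated atomic group: the lookahead captures the longest run of "a" once,
# and the backreference consumes exactly that run, with nothing left to give back.
atomic_like = re.compile(r"(?=(a+))\1ab")

subject = "aaab"
backtracking_match = backtracking.search(subject)
atomic_match = atomic_like.search(subject)

print(backtracking_match.group(0) if backtracking_match else None)
print(atomic_match)
```

The first pattern succeeds because the quantifier releases a character; the emulated atomic version finds no match at any starting position, since the run of `a` characters is always consumed whole.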
140
140
141
141
#### Script runs
142
142
143
-
**TODO: Add description**
143
+
A script run e.g `(*script_run:...)` specifies that the contents must match against a sequence of characters from the same Unicode script, e.g Latin or Greek.
Anchors match against a certain position in the input rather than on a particular character of the input.
205
205
206
-
#### Start and end of line
207
-
208
-
`^` matches against the start of a line of input, `$` matches against the end of a line.
209
-
210
-
#### Word boundaries
211
-
212
-
`\b` matches a word boundary, which is [...]
213
-
214
-
**TODO: List these out with a very brief description of what they mean.**
206
+
-`^`: Matches at the start of a line.
207
+
-`$`: Matches at the end of a line.
208
+
-`\A`: Matches at the very start of the input string.
209
+
-`\Z`: Matches at the very end of the input string, and also immediately before a newline at the very end of the input string.
210
+
-`\z`: Like `\Z`, but only matches at the very end of the input string.
211
+
-`\G`: Like `\A`, but also matches against the start position of where matching resumes in global matching mode (e.g `\Gab` matches twice in `abab`, `\Aab` would only match once).
212
+
-`\b`: Matches a boundary between a word character and a non-word character. The definitions of these vary depending on the matching engine.
213
+
-`\B`: Matches a non-word-boundary.
214
+
-`\y`: Matches a text segment boundary, the definition of which varies based on the `y{w}` and `y{g}` matching options.
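
As a rough illustration of several of the anchors listed above, here is a sketch using Python's `re` module. Python is only a stand-in engine here, and it differs from the list in two ways that the comments call out: `^` and `$` act as line anchors only under `re.MULTILINE`, and Python's `\Z` behaves like the `\z` described above. `\G` and `\y` are not supported by `re`, so they are left out. The sample strings are made up.

```python
import re

text = "one\ntwo\n"

# `^` and `$` are line anchors in Python only when re.MULTILINE is set.
line_initial = re.findall(r"^\w+", text, re.MULTILINE)
line_final = re.findall(r"\w+$", text, re.MULTILINE)

# `\A` ignores line structure and matches only at the very start of the input.
input_initial = re.findall(r"\A\w+", text, re.MULTILINE)

# Python's `\Z` matches only at the very end of the input (like `\z` above),
# so the trailing newline prevents a match here.
very_end = re.findall(r"\w+\Z", text)

# Python's `$` without re.MULTILINE also matches before a final newline,
# which corresponds to `\Z` as described above.
before_final_newline = re.findall(r"\w+$", text)

# `\b` requires a word/non-word transition; `\B` requires its absence.
whole_words = re.findall(r"\bcat\b", "cat concat cat.")
inside_word = [m.start() for m in re.finditer(r"\Bcat", "cat concat")]

print(line_initial, line_final, input_initial)
print(very_end, before_final_newline)
print(whole_words, inside_word)
```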
These sequences define a unicode scalar value to be matched against.
@@ -311,12 +312,16 @@ Allows a specific Unicode scalar to be specified by name or code point.
311
312
### Trivia
312
313
313
314
```
314
-
Trivia -> Comment | Whitespace
315
-
Comment -> '(?#' (!')')* ')'
315
+
Trivia -> Comment | Whitespace
316
+
Comment -> InlineComment | EndOfLineComment
317
+
318
+
InlineComment -> '(?#' (!')')* ')'
319
+
EndOfLineComment -> '#' .*$
320
+
316
321
Whitespace -> \s+
317
322
```
318
323
319
-
Trivia is consumed by the regular expression parser, but has no semantic meaning. Non-semantic whitespace may only occur when the either of the extended syntax matching options `(?x)`, `(?xx)` are enabled.
324
+
Trivia is consumed by the regular expression parser, but has no semantic meaning. This includes inline PCRE-style comments e.g `(?#comment)`. It also includes non-semantic whitespace and end-of-line comments, which may only occur when either of the extended syntax matching options `(?x)` or `(?xx)` is enabled.
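
The distinction between the two kinds of comment can be seen in an engine that already supports both. The sketch below uses Python's `re` module as an assumed stand-in, with `re.VERBOSE` playing the role of the `(?x)` option; the patterns and inputs are illustrative only and say nothing about how this proposal's parser is implemented.

```python
import re

# Inline comment: accepted regardless of extended syntax.
inline = re.compile(r"\d+(?#digits)-\d+")

# End-of-line comments and non-semantic whitespace: extended syntax only.
extended = re.compile(
    r"""
    \d+   # leading digits
    -     # separator
    \d+   # trailing digits
    """,
    re.VERBOSE,
)

# The same pattern text means different things with and without extended syntax.
literal_space = re.compile(r"a b")
ignored_space = re.compile(r"a b", re.VERBOSE)

print(bool(inline.fullmatch("555-0100")), bool(extended.fullmatch("555-0100")))
print(bool(literal_space.fullmatch("a b")), bool(ignored_space.fullmatch("ab")))
```

Without extended syntax the space in `a b` is a literal that must be matched; with it, the space is trivia and the pattern matches `ab` instead.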
320
325
321
326
**TODO: Differences between PCRE extended syntax and our syntax**
322
327
@@ -406,14 +411,10 @@ This is syntax specific to PCRE, and is used to control backtracking behavior. A
406
411
-`ACCEPT`: Causes matching to terminate immediately as a successful match. If used within a subpattern, only that level of recursion is terminated.
407
412
-`FAIL`, `F`: Causes matching to fail, forcing backtracking to occur if possible.
408
413
-`MARK`: Assigns a label to the current matching path, which is passed back to the caller on success. Subsequent `MARK` directives overwrite the label assigned, so only the last is passed back.
409
-
-`COMMIT`: Prevents backtracking from reaching any point prior to this directive.
410
-
411
-
412
-
**TODO:**
413
-
414
-
-`PRUNE`:
415
-
-`SKIP`:
416
-
-`THEN`:
414
+
-`COMMIT`: Prevents backtracking from reaching any point prior to this directive. If matching would need to backtrack past it, the match fails outright, without advancing the input to try a different starting match position.
415
+
-`PRUNE`: Similar to `COMMIT`, but allows advancing the input to try to find a different starting match position.
416
+
-`SKIP`: Similar to `PRUNE`, but the next attempt starts at the input position where `SKIP` was encountered, rather than at the next character.
417
+
-`THEN`: Similar to `PRUNE`, but when used inside an alternation will try to match in the subsequent branch before attempting to advance the input to find a different starting position.
417
418
418
419
### PCRE global matching options
419
420
@@ -437,17 +438,14 @@ This is syntax specific to PCRE, and allows a set of global options to appear at
437
438
-`LIMIT_DEPTH`, `LIMIT_HEAP`, `LIMIT_MATCH`: These place certain limits on the resources the matching engine may consume, and matches it may make.
438
439
-`CRLF`, `CR`, `ANYCRLF`, `ANY`, `LF`, `NUL`: These control the definition of a newline character, which is used when matching e.g the `.` character class, and evaluating where a line ends in multi-line mode.
439
440
-`BSR_ANYCRLF`, `BSR_UNICODE`: These change the definition of `\R`.
440
-
441
-
**TODO:**
442
-
443
-
-`NOTEMPTY_ATSTART`:
444
-
-`NOTEMPTY`:
445
-
-`NO_AUTO_POSSESS`:
446
-
-`NO_DOTSTAR_ANCHOR`:
447
-
-`NO_JIT`:
448
-
-`NO_START_OPT`:
449
-
-`UTF`:
450
-
-`UCP`:
441
+
-`NOTEMPTY`: Does not consider the empty string to be a valid match.
442
+
-`NOTEMPTY_ATSTART`: Like `NOTEMPTY`, but only applies to the first matching position in the input.
443
+
-`NO_AUTO_POSSESS`: Disables an optimization that treats a quantifier as possessive if the following construct clearly cannot be part of the match. In other words, disables the short-circuiting of backtracks in cases where the engine knows it will not produce a match. This is useful for debugging, or for ensuring a callout gets invoked.
444
+
-`NO_DOTSTAR_ANCHOR`: Disables an optimization that tries to automatically anchor `.*` at the start of a regex. Like `NO_AUTO_POSSESS`, this is mainly used for debugging or ensuring a callout gets invoked.
445
+
-`NO_JIT`: Disables JIT compilation.
446
+
-`NO_START_OPT`: Disables various optimizations performed at the start of matching. Like `NO_DOTSTAR_ANCHOR`, this is mainly used for debugging or ensuring a callout gets invoked.