Commit 06981c7

Update RegexSyntax.md
1 parent 4a692ca commit 06981c7

1 file changed

+33
-35
lines changed

Documentation/Evolution/RegexSyntax.md

Lines changed: 33 additions & 35 deletions
@@ -136,11 +136,11 @@ PCRE2 defines explicitly spelled out versions of the syntax, e.g `(*negative_loo
136136

137137
#### Atomic groups
138138

139-
**TODO: Add description**
139+
An atomic group, e.g `(?>...)`, specifies that once its contents have matched, the engine will not backtrack into the group to try other ways of matching them. This has the same semantics as a possessive quantifier, but applies more generally to any regex pattern.
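
To illustrate the difference, here is a minimal sketch using Python's `re` module rather than the engine this document describes. It assumes Python 3.11 or later for the native `(?>...)` syntax; on older versions it falls back to a lookahead-plus-backreference emulation. The names `PLAIN`, `ATOMIC`, and `matches` are purely illustrative.

```python
import re
import sys

# Ordinary non-capturing group: `a+` may give back characters on backtracking.
PLAIN = r"(?:a+)ab"

# Atomic group: once `a+` has matched, it never gives characters back.
# Native syntax needs Python 3.11+; older versions use an emulation that
# relies on lookaheads being atomic in `re`.
ATOMIC = r"(?>a+)ab" if sys.version_info >= (3, 11) else r"(?=(a+))\1ab"


def matches(pattern: str, text: str) -> bool:
    """Return True if `pattern` matches anywhere in `text`."""
    return re.search(pattern, text) is not None


# With backtracking, `a+` shrinks from "aaa" to "aa" so that "ab" can match.
plain_result = matches(PLAIN, "aaab")

# Without backtracking into the group, `a+` keeps every "a" it consumed,
# leaving no "a" for the trailing "ab", so the match fails.
atomic_result = matches(ATOMIC, "aaab")

print(plain_result, atomic_result)
```

The same input therefore matches the plain group but not the atomic one, which is the behavior a possessive quantifier such as `a++` would give for this pattern.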
140140

141141
#### Script runs
142142

143-
**TODO: Add description**
143+
A script run, e.g `(*script_run:...)`, specifies that the contents must match a sequence of characters that all belong to the same Unicode script, e.g Latin or Greek.
144144

145145
#### Balancing groups
146146

@@ -203,15 +203,16 @@ Anchor -> '^' | '$' | '\A' | '\b' | '\B' | '\G' | '\y' | '\Y' | '\z' | '\Z'
203203

204204
Anchors match against a certain position in the input rather than on a particular character of the input.
205205

206-
#### Start and end of line
207-
208-
`^` matches against the start of a line of input, `$` matches against the end of a line.
209-
210-
#### Word boundaries
211-
212-
`\b` matches a word boundary, which is [...]
213-
214-
**TODO: List these out with a very brief description of what they mean.**
206+
- `^`: Matches at the start of a line.
207+
- `$`: Matches at the end of a line.
208+
- `\A`: Matches at the very start of the input string.
209+
- `\Z`: Matches at the very end of the input string, and also just before a newline at the very end of the input string.
210+
- `\z`: Like `\Z`, but only matches at the very end of the input string.
211+
- `\G`: Like `\A`, but also matches at the position where matching resumes in global matching mode (e.g `\Gab` matches twice in `abab`, whereas `\Aab` would only match once).
212+
- `\b`: Matches at a boundary between a word character and a non-word character. The definition of a word character varies depending on the matching engine.
213+
- `\B`: Matches at a position that is not a word boundary.
214+
- `\y`: Matches at a text segment boundary, the definition of which varies based on the `y{w}` and `y{g}` matching options.
215+
- `\Y`: Matches at a position that is not a text segment boundary.
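
Several of the anchors listed above can be tried out with Python's `re` module. This is only a rough sketch and not the engine described here: Python does not support `\G`, `\y`, or `\Y`, its `\Z` matches only at the very end of the input (like `\z` above), and its `$` outside multi-line mode behaves like `\Z` above. The sample strings and variable names are made up for illustration.

```python
import re

text = "one two\nthree four\n"

# `^` and `$` match at line starts and ends when multi-line mode is on.
line_starts = re.findall(r"^\w+", text, re.MULTILINE)
line_ends = re.findall(r"\w+$", text, re.MULTILINE)

# `\A` matches only at the very start of the input, even in multi-line mode.
input_start = re.findall(r"\A\w+", text, re.MULTILINE)

# Outside multi-line mode, Python's `$` also matches just before a final
# newline, so it still finds the last word.
before_final_newline = re.findall(r"\w+$", text)

# Python's `\Z` matches only at the very end, which here follows a newline,
# so no word can end there.
absolute_end = re.findall(r"\w+\Z", text)

words = "cat concat cats cat."

# `\b` requires a word boundary on both sides of "cat".
whole_word_starts = [m.start() for m in re.finditer(r"\bcat\b", words)]

# `\B` requires that "cat" does NOT start at a word boundary.
inner_starts = [m.start() for m in re.finditer(r"\Bcat", words)]

print(line_starts, line_ends, input_start)
print(before_final_newline, absolute_end)
print(whole_word_starts, inner_starts)
```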
215216

216217
### Unicode scalars
217218

@@ -224,8 +225,8 @@ UniScalar -> '\u{' HexDigit{1...} '}'
224225
| '\o{' OctalDigit{1...} '}'
225226
| '\' OctalDigit{1...3}
226227
227-
HexDigit -> [0-9a-zA-Z]
228-
OctalDigit -> [0-7]
228+
HexDigit   -> [0-9a-fA-F]
229+
OctalDigit -> [0-7]
229230
```
230231

231232
These sequences define a Unicode scalar value to be matched against.
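
As a rough sketch of how two of these forms map to a scalar value, the following Python snippet decodes `\u{...}` and `\o{...}` escapes. It is not the parser this document describes: the function name `parse_uni_scalar` is hypothetical, hexadecimal digits are assumed to be `0-9`, `a-f`, and `A-F`, and the other `UniScalar` forms are not handled.

```python
import re

# One or more hex digits inside \u{...}, or one or more octal digits
# inside \o{...}. Only these two forms are covered by this sketch.
_UNI_SCALAR = re.compile(r"\\u\{([0-9a-fA-F]+)\}|\\o\{([0-7]+)\}")


def parse_uni_scalar(escape: str) -> str:
    """Decode a `\\u{...}` or `\\o{...}` escape into a one-character string."""
    m = _UNI_SCALAR.fullmatch(escape)
    if m is None:
        raise ValueError(f"not a supported scalar escape: {escape!r}")
    hex_digits, octal_digits = m.groups()
    if hex_digits is not None:
        value = int(hex_digits, 16)
    else:
        value = int(octal_digits, 8)
    # A Unicode scalar value excludes surrogates and is at most U+10FFFF.
    if value > 0x10FFFF or 0xD800 <= value <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {value:#x}")
    return chr(value)


emoji = parse_uni_scalar(r"\u{1F600}")
letter = parse_uni_scalar(r"\o{101}")
print(emoji, letter)
```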
@@ -311,12 +312,16 @@ Allows a specific Unicode scalar to be specified by name or code point.
311312
### Trivia
312313

313314
```
314-
Trivia -> Comment | Whitespace
315-
Comment -> '(?#' (!')')* ')'
315+
Trivia -> Comment | Whitespace
316+
Comment -> InlineComment | EndOfLineComment
317+
318+
InlineComment -> '(?#' (!')')* ')'
319+
EndOfLineComment -> '#' .*$
320+
316321
Whitespace -> \s+
317322
```
318323

319-
Trivia is consumed by the regular expression parser, but has no semantic meaning. Non-semantic whitespace may only occur when the either of the extended syntax matching options `(?x)`, `(?xx)` are enabled.
324+
Trivia is consumed by the regular expression parser, but has no semantic meaning. This includes inline PCRE-style comments, e.g `(?#comment)`. It also includes non-semantic whitespace and end-of-line comments, which may only occur when one of the extended syntax matching options `(?x)` or `(?xx)` is enabled.
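
The effect of trivia can be sketched with Python's `re` module, which happens to support both `(?#...)` comments and an extended mode enabled by `(?x)`. This is an illustration only, not the parser described here, and Python has no `(?xx)` option; the date-like pattern and variable names are made up.

```python
import re

# The same pattern written three ways; the trivia should not change meaning.
compact = re.compile(r"(\d{4})-(\d{2})")

# Inline comments are allowed without any extended syntax option.
inline_commented = re.compile(r"(\d{4})(?#year)-(\d{2})(?#month)")

# With (?x), whitespace and end-of-line `#` comments become non-semantic.
extended = re.compile(r"""(?x)
    (\d{4})   # year
    -
    (\d{2})   # month
""")

sample = "2024-05"
results = [p.fullmatch(sample).groups()
           for p in (compact, inline_commented, extended)]

# Without (?x), the spaces are semantic and must appear in the input.
spaced_without_x = re.fullmatch(r"\d{4} - \d{2}", sample)

print(results, spaced_without_x)
```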
320325

321326
**TODO: Differences between PCRE extended syntax and our syntax**
322327

@@ -406,14 +411,10 @@ This is syntax specific to PCRE, and is used to control backtracking behavior. A
406411
- `ACCEPT`: Causes matching to terminate immediately as a successful match. If used within a subpattern, only that level of recursion is terminated.
407412
- `FAIL`, `F`: Causes matching to fail, forcing backtracking to occur if possible.
408413
- `MARK`: Assigns a label to the current matching path, which is passed back to the caller on success. Subsequent `MARK` directives overwrite the label assigned, so only the last is passed back.
409-
- `COMMIT`: Prevents backtracking from reaching any point prior to this directive.
410-
411-
412-
**TODO:**
413-
414-
- `PRUNE`:
415-
- `SKIP`:
416-
- `THEN`:
414+
- `COMMIT`: Prevents backtracking from reaching any point prior to this directive, causing the match to fail. This does not allow advancing the input to try a different starting match position.
415+
- `PRUNE`: Similar to `COMMIT`, but allows advancing the input to try to find a different starting match position.
416+
- `SKIP`: Similar to `PRUNE`, but skips ahead to the position of `SKIP` to try again as the starting position.
417+
- `THEN`: Similar to `PRUNE`, but when used inside an alternation will try to match in the subsequent branch before attempting to advance the input to find a different starting position.
417418

418419
### PCRE global matching options
419420

@@ -437,17 +438,14 @@ This is syntax specific to PCRE, and allows a set of global options to appear at
437438
- `LIMIT_DEPTH`, `LIMIT_HEAP`, `LIMIT_MATCH`: These place certain limits on the resources the matching engine may consume, and matches it may make.
438439
- `CRLF`, `CR`, `ANYCRLF`, `ANY`, `LF`, `NUL`: These control the definition of a newline character, which is used when matching e.g the `.` character class, and evaluating where a line ends in multi-line mode.
439440
- `BSR_ANYCRLF`, `BSR_UNICODE`: These change the definition of `\R`.
440-
441-
**TODO:**
442-
443-
- `NOTEMPTY_ATSTART`:
444-
- `NOTEMPTY`:
445-
- `NO_AUTO_POSSESS`:
446-
- `NO_DOTSTAR_ANCHOR`:
447-
- `NO_JIT`:
448-
- `NO_START_OPT`:
449-
- `UTF`:
450-
- `UCP`:
441+
- `NOTEMPTY`: Does not consider the empty string to be a valid match.
442+
- `NOTEMPTY_ATSTART`: Like `NOTEMPTY`, but only applies to the first matching position in the input.
443+
- `NO_AUTO_POSSESS`: Disables an optimization that treats a quantifier as possessive if the following construct clearly cannot be part of the match. In other words, disables the short-circuiting of backtracks in cases where the engine knows it will not produce a match. This is useful for debugging, or for ensuring a callout gets invoked.
444+
- `NO_DOTSTAR_ANCHOR`: Disables an optimization that tries to automatically anchor `.*` at the start of a regex. Like `NO_AUTO_POSSESS`, this is mainly used for debugging or ensuring a callout gets invoked.
445+
- `NO_JIT`: Disables JIT compilation.
446+
- `NO_START_OPT`: Disables various optimizations performed at the start of matching. Like `NO_DOTSTAR_ANCHOR`, this is mainly used for debugging or ensuring a callout gets invoked.
447+
- `UTF`: Enables UTF pattern support.
448+
- `UCP`: Enables Unicode property support.
451449

452450
### Callouts
453451
