Commit 5d49e82

Update RegexSyntax.md
1 parent 06981c7 commit 5d49e82

1 file changed

+78
-14
lines changed

Documentation/Evolution/RegexSyntax.md

Lines changed: 78 additions & 14 deletions
@@ -81,10 +81,20 @@ A quantifier may optionally be followed by `?` or `+`, which adjust its semantic
8181
### Atom
8282

8383
```
84-
Atom -> Anchor | EscapeSequence | BuiltinCharClass | Backreference | Subpattern
84+
Atom -> Anchor
85+
| Backreference
86+
| BacktrackingDirective
87+
| BuiltinCharClass
88+
| Callout
89+
| CharacterProperty
90+
| EscapeSequence
91+
| NamedCharacter
92+
| Subpattern
93+
| UniScalar
94+
| '\'? <Character>
8595
```
8696

87-
Atoms are the smallest unit of regular expression syntax that cannot be split into smaller syntactic expressions. They mainly include escape sequences e.g `\b`, `\d`, as well as meta-characters such as `.` and `$`.
97+
Atoms are the smallest units of regular expression syntax. They include escape sequences, e.g. `\b` and `\d`, as well as meta-characters such as `.` and `$`. They also include some larger syntactic constructs such as backreferences and callouts. The most basic form of atom is a literal character. A meta-character may be treated as literal by preceding it with a backslash. Other characters may also be preceded by a backslash, but this has no effect, e.g. `\I` is a literal `I`.
8898

8999
### Groups
90100

@@ -229,19 +239,30 @@ HexDigit -> [0-9a-zA-Z]
229239
OctalDigit -> [0-7]
230240
```
231241

232-
These sequences define a unicode scalar value to be matched against.
242+
These sequences define a Unicode scalar value to be matched against. Syntax is provided for specifying the scalar value in both hex and octal notation.
233243

234-
**TODO: Some discussion of the fun `\DDD` syntax**
244+
The `\DDD` syntax, which accepts up to 3 octal digits, is syntactically ambiguous with backreference syntax. The ambiguity is resolved in the same way as in PCRE. If the first digit is `0`, the sequence is always octal (including `\0` for the NUL character). Otherwise, it is treated as a backreference if any of the following hold:
245+
246+
- Its value `n` satisfies `0 < n < 10`.
247+
- Its first digit is `8` or `9`.
248+
- Its value corresponds to a valid prior group number.
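
The resolution rules above can be sketched in Python. This is only an illustration: the function name `classify_numeric_escape` and the `prior_group_count` parameter are hypothetical, and the final fallback to octal when no backreference rule applies is an assumption that the text does not state explicitly.

```python
def classify_numeric_escape(digits: str, prior_group_count: int) -> str:
    """Classify the digits after a backslash as 'octal' or 'backreference'.

    `digits` is the run of decimal digits following the backslash, and
    `prior_group_count` is the number of capture groups opened so far.
    Both names are hypothetical, chosen for this sketch.
    """
    if not (digits and digits.isascii() and digits.isdigit()):
        raise ValueError("expected one or more ASCII decimal digits")

    # A leading 0 always means an octal sequence (including \0 for NUL).
    if digits[0] == "0":
        return "octal"

    n = int(digits)
    if 0 < n < 10:
        return "backreference"
    if digits[0] in "89":
        return "backreference"
    if n <= prior_group_count:
        return "backreference"

    # Assumption: if no backreference rule applies, read it as octal.
    return "octal"
```

For example, `classify_numeric_escape("12", 12)` yields `"backreference"`, while the same digits with only 3 prior groups yield `"octal"`.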
235249

236250
### Escape sequences
237251

238252
```
239253
EscapeSequence -> '\a' | '\b' | '\c' <Char> | '\e' | '\f' | '\n' | '\r' | '\t'
240254
```
241255

242-
These escape sequences denote a specific character. Note that `\b` may only be used in a custom character class, otherwise it represents a word boundary.
256+
These escape sequences denote a specific character.
243257

244-
**TODO: List these out with a very brief description of what they mean.**
258+
- `\a`: The alert (bell) character `U+7`.
259+
- `\b`: The backspace character `U+8`. Note this may only be used in a custom character class, otherwise it represents a word boundary.
260+
- `\c <Char>`: A control character sequence (`U+00` - `U+7F`).
261+
- `\e`: The escape character `U+1B`.
262+
- `\f`: The form-feed character `U+C`.
263+
- `\n`: The newline character `U+A`.
264+
- `\r`: The carriage return character `U+D`.
265+
- `\t`: The tab character `U+9`.
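
As a quick cross-check of the list above, the single-letter escapes can be written as a lookup table in Python. The names `SIMPLE_ESCAPES` and `escape_to_char` are hypothetical, and `\c <Char>` is left out because it takes an operand rather than naming one fixed character.

```python
# Code points for the single-letter escapes listed above.
SIMPLE_ESCAPES = {
    "a": 0x07,  # alert (bell)
    "b": 0x08,  # backspace (only inside a custom character class)
    "e": 0x1B,  # escape
    "f": 0x0C,  # form feed
    "n": 0x0A,  # newline
    "r": 0x0D,  # carriage return
    "t": 0x09,  # tab
}


def escape_to_char(letter: str) -> str:
    """Return the character denoted by a backslash followed by `letter`."""
    return chr(SIMPLE_ESCAPES[letter])
```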
245266

246267
### Builtin character classes
247268

@@ -270,17 +291,27 @@ BuiltinCharClass -> '.' | '\d' | '\D' | '\h' | '\H' | '\O' | '\R' | '\s' | '\S'
270291
CustomCharClass -> Start Set (SetOp Set)* ']'
271292
Start -> '[' '^'?
272293
Set -> Member+
273-
Member -> CustomCharClass | !']' !SetOp (Range | Atom)
294+
Member -> CustomCharClass | Quote | Range | Atom
274295
Range -> Atom `-` Atom
275296
SetOp -> '&&' | '--' | '~~'
276297
```
277298

278-
Custom characters classes introduce their own language, in which most regular expression metacharacters become literal. The basic element in a custom character class is an `Atom`, though not all atoms are valid, e.g a backreference cannot be made.
299+
Custom character classes introduce their own language, in which most regular expression metacharacters become literal. The basic element in a custom character class is an `Atom`, though only a few atoms are considered valid:
300+
301+
- Builtin character classes, except `.`, `\O`, and `\X`
302+
- Escape sequences, including `\b` which becomes the backspace character (rather than a word boundary)
303+
- Unicode scalars
304+
- Named characters
305+
- Character properties
306+
- Plain literal characters
307+
308+
Ranges of characters may be specified with `-`, e.g. `[a-z]` matches the letters from `a` to `z`. Only Unicode scalars and literal characters are valid range operands. If `-` does not appear in a valid position, it is interpreted as literal, e.g. `[-a]` is the character class of `-` and `a`. **TODO: .NET's use of it for subtraction**
279309

280-
Ranges of characters may be specified with the `-` character, e.g `[a-z]` matches against the letters from `a` to `z`. If `-` does not appear between two characters, it is interpreted as literal, e.g `[-a]` is the character class of `-` and `a`. **TODO: .NET's use of it for subtraction**
310+
**TODO: Different grammar for range?**
281311

282312
Custom character classes may be nested within each other, and may be used with set operations. The supported set operations are intersection `&&`, subtraction `--`, and symmetric difference `~~`.
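
The three set operations behave like ordinary set algebra over the characters each class matches. The Python sketch below models a character class as a plain set of characters, which is a simplification for illustration only; `char_range` is a hypothetical helper, not part of any regex API.

```python
def char_range(first: str, last: str) -> set:
    """Model a range such as [a-z] as the set of characters it covers."""
    return {chr(cp) for cp in range(ord(first), ord(last) + 1)}


lower = char_range("a", "z")
vowels = set("aeiou")

# [[a-z]&&[aeiou]]: intersection keeps characters in both classes.
intersection = lower & vowels

# [[a-z]--[aeiou]]: subtraction removes the right-hand members.
subtraction = lower - vowels

# [[a-f]~~[d-i]]: symmetric difference keeps characters in exactly one class.
symmetric = char_range("a", "f") ^ char_range("d", "i")
```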
283313

314+
Quoted sequences may appear within custom character classes, e.g. `[\Q]\E]`, and escape the contained characters.
284315

285316
### Character properties
286317

@@ -292,11 +323,30 @@ PropertyContents -> PropertyName ('=' PropertyName)?
292323
PropertyName -> [\s\w-]+
293324
```
294325

295-
A character property specifies a particular Unicode or POSIX property to match against. In general, a property consists of both a key and a value, e.g `General_Category=Whitespace`, however some keys and values may appear on their own (with the other name being inferred). **TODO: Clarify the exact cases this happens**.
326+
A character property specifies a particular Unicode or POSIX property to match against. Property names are matched loosely, following the rules set out by [UAX44-LM3]: case, whitespace, hyphens, and underscores are ignored, as is an initial `is` prefix. This means that the following property names are considered equivalent:
327+
328+
- `whitespace`
329+
- `isWhitespace`
330+
- `is-White_Space`
331+
- `iSwHiTeSpaCe`
332+
- `i s w h i t e s p a c e`
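
A simplified Python sketch of this loose matching is shown below. It folds a name to a comparison key by ignoring case, whitespace, hyphens, and underscores, then dropping an initial `is`. The function name `loose_key` is hypothetical, and the sketch ignores the special cases that UAX44-LM3 carves out, so treat it as an approximation rather than a conforming implementation. Both sides of a comparison would need to be folded the same way.

```python
def loose_key(name: str) -> str:
    """Fold a property name to a key for loose comparison."""
    # Ignore case, whitespace, hyphens, and underscores.
    folded = "".join(
        ch for ch in name.lower() if not (ch.isspace() or ch in "_-")
    )
    # Ignore an initial "is" prefix.
    if folded.startswith("is"):
        folded = folded[2:]
    return folded


names = [
    "whitespace",
    "isWhitespace",
    "is-White_Space",
    "iSwHiTeSpaCe",
    "i s w h i t e s p a c e",
]
keys = {loose_key(name) for name in names}
```

All five spellings fold to the same key, so `keys` contains a single element.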
333+
334+
Unicode properties consist of both a key and a value, e.g. `General_Category=Letter`. However, there are some properties where the key or value may be inferred. These include:
335+
336+
- General category properties, e.g. `\p{Letter}` is inferred as `\p{General_Category=Letter}`.
337+
- Script properties e.g `\p{Greek}` is inferred as `\p{Script=Greek}`.
338+
- Boolean properties that are inferred to have a `True` value, e.g `\p{Lowercase}` is inferred as `\p{Lowercase=True}`.
339+
340+
Other Unicode properties, however, must specify both a key and a value.
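
The inference for Unicode properties can be sketched as a lookup over known names. Everything in the Python example below is hypothetical: the tables are tiny illustrative subsets, the function name `infer_property` is invented, and the lookup order (general category, then script, then boolean) is an assumption rather than something the text specifies. Non-Unicode properties are not covered.

```python
# Tiny illustrative subsets; a real parser would know many more names.
GENERAL_CATEGORIES = {"letter", "number", "punctuation"}
SCRIPTS = {"greek", "latin"}
BOOLEAN_PROPERTIES = {"lowercase", "uppercase", "alphabetic"}


def infer_property(contents: str) -> tuple:
    """Return a (key, value) pair for the contents of \\p{...}."""
    if "=" in contents:
        # Both key and value were written out explicitly.
        key, value = contents.split("=", 1)
        return (key.strip(), value.strip())

    name = contents.strip()
    folded = name.lower()
    if folded in GENERAL_CATEGORIES:
        return ("General_Category", name)
    if folded in SCRIPTS:
        return ("Script", name)
    if folded in BOOLEAN_PROPERTIES:
        return (name, "True")
    raise ValueError(f"property {name!r} needs both a key and a value")
```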
341+
342+
For non-Unicode properties, only a value is required. These include:
343+
344+
- The special properties `any`, `assigned`, `ascii`.
345+
- The POSIX compatibility properties `alnum`, `blank`, `graph`, `print`, `word`, `xdigit`. The remaining POSIX properties are already covered by boolean Unicode property spellings.
296346

297347
**TODO: Spell out the properties we recognize while parsing vs. those we just parse as String?**
298348

299-
Note that the internal `PropertyContents` syntax is shared by both the `\p{...}` and POSIX-style `[:...:]` syntax, allowing e.g `[:script=Latin:]`.
349+
Note that the internal `PropertyContents` syntax is shared by both the `\p{...}` and POSIX-style `[:...:]` syntax, allowing e.g `[:script=Latin:]` as well as `\p{alnum}`.
300350

301351
### Named characters
302352

@@ -325,6 +375,18 @@ Trivia is consumed by the regular expression parser, but has no semantic meaning
325375

326376
**TODO: Differences between PCRE extended syntax and our syntax**
327377

378+
### Quotes
379+
380+
```
381+
Quote -> '\Q' (!'\E' .)* '\E'
382+
```
383+
384+
A quoted sequence is delimited by `\Q...\E`, and allows the escaping of metacharacters such that they are interpreted literally. For example, `\Q^[xy]+$\E` is treated as the literal characters `^[xy]+$` rather than an anchored quantified character class.
385+
386+
The backslash character is also treated as literal within a quoted sequence, and may not be used to escape the closing delimiter, e.g `\Q\\E` is a literal `\`.
387+
388+
`\E` may appear without a preceding `\Q`, in which case it is a literal `E`.
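
The quoting rules above can be illustrated with a small Python sketch. Python's `re` module has no `\Q...\E` syntax, so the hypothetical helper `expand_quotes` rewrites each quoted span into escaped literals using `re.escape`. Rejecting an unterminated `\Q` is an assumption based on the grammar requiring a closing `\E`.

```python
import re


def expand_quotes(pattern: str) -> str:
    """Rewrite \\Q...\\E spans as escaped literals usable with Python's re."""
    out = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("\\Q", i):
            # Everything up to the next \E is literal, backslashes included.
            end = pattern.find("\\E", i + 2)
            if end == -1:
                raise ValueError("unterminated \\Q (assumed to be an error)")
            out.append(re.escape(pattern[i + 2:end]))
            i = end + 2
        elif pattern.startswith("\\E", i):
            # \E without a preceding \Q is a literal E.
            out.append("E")
            i += 2
        elif pattern[i] == "\\" and i + 1 < len(pattern):
            # Copy any other escape whole so that \\Q is not read as \Q.
            out.append(pattern[i:i + 2])
            i += 2
        else:
            out.append(pattern[i])
            i += 1
    return "".join(out)
```

With this helper, `re.fullmatch(expand_quotes(r"\Q^[xy]+$\E"), "^[xy]+$")` succeeds, since every metacharacter inside the quote is matched literally.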
389+
328390
### References
329391

330392
```
@@ -335,6 +397,8 @@ RecursionLevel -> '+' <Int> | '-' <Int>
335397

336398
A reference is an abstract identifier for a particular capturing group in a regular expression. It can either be named or numbered, and in the latter case may be specified relative to the current group. For example `-2` refers to the capture group `N - 2` where `N` is the number of the next capture group. References may refer to groups ahead of the current position e.g `+3`, or the name of a future group. These may be useful in recursive cases where the group being referenced has been matched in a prior iteration.
337399

400+
**TODO: Describe how capture groups are numbered? Including nesting & resets?**
401+
338402
#### Backreferences
339403

340404
```
@@ -347,7 +411,7 @@ Backreference -> '\g{' NameOrNumberRef '}'
347411
| '(?P=' Identifier ')'
348412
```
349413

350-
A backreference evaluates to the value last captured by a given capturing group.
414+
A backreference evaluates to the value last captured by the referenced capturing group. Note that the `\DDD` form of this syntax is syntactically ambiguous with octal syntax; see the *Unicode scalars* section for how this ambiguity is resolved.
351415

352416
#### Subpatterns
353417

@@ -362,8 +426,7 @@ GroupLikeSubpatternBody -> 'P>' <String>
362426
| NumberRef
363427
```
364428

365-
A subpattern causes a particular group to be re-evaluated at the current position. The syntax `(?R)` is equivalent to `(?0)`, and causes the entire pattern to be recursed.
366-
429+
A subpattern causes the referenced group to be re-evaluated at the current position. The syntax `(?R)` is equivalent to `(?0)`, and causes the entire pattern to be recursed.
367430

368431
### Conditionals
369432

@@ -606,3 +669,4 @@ The least intuitive spelling being `'\' [1-9] [0-9]+`, as it can be a backrefere
606669
[icu-syntax]: https://unicode-org.github.io/icu/userguide/strings/regexp.html
607670
[uts18]: https://www.unicode.org/reports/tr18/
608671
[.net-syntax]: https://docs.microsoft.com/en-us/dotnet/standard/base-types/regular-expressions
672+
[UAX44-LM3]: https://www.unicode.org/reports/tr44/#UAX44-LM3
