Atoms are the smallest unit of regular expression syntax that cannot be split into smaller syntactic expressions. They mainly include escape sequences e.g `\b`, `\d`, as well as meta-characters such as `.` and `$`.
+
Atoms are the smallest units of regular expression syntax. They include escape sequences, e.g. `\b` and `\d`, as well as meta-characters such as `.` and `$`. They also include some larger syntactic constructs such as backreferences and callouts. The most basic form of atom is a literal character. A meta-character may be treated as literal by preceding it with a backslash. Other characters may also be preceded by a backslash, but it has no effect, e.g. `\I` is a literal `I`.
### Groups
@@ -229,19 +239,30 @@ HexDigit -> [0-9a-zA-Z]
OctalDigit -> [0-7]
```
-
These sequences define a unicode scalar value to be matched against.
+
These sequences define a Unicode scalar value to be matched against. The scalar value may be specified in either hex or octal notation.
-
**TODO: Some discussion of the fun `\DDD` syntax**
+
The `\DDD` syntax, which accepts up to 3 octal digits, is syntactically ambiguous with backreference syntax. The ambiguity is resolved in the same way as PCRE. If the first digit is `0`, the sequence is always octal (including `\0` for the NUL character). Otherwise, the sequence is treated as a backreference if any of the following hold:
+
+
- Its value `n` satisfies `0 < n < 10`.
+
- Its first digit is `8` or `9`.
+
- Its value corresponds to a valid prior group number.
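The resolution rule above can be sketched as a small Python function. This is only an illustration, not the parser's actual implementation: the function name `classify_numeric_escape` and the `prior_group_count` parameter are hypothetical, and the final fallback to octal is inferred from the text, which lists only the backreference cases. The sketch does not decide how many digits an octal sequence consumes.

```python
def classify_numeric_escape(digits: str, prior_group_count: int) -> str:
    """Classify the digits following a backslash as 'octal' or 'backreference'.

    `digits` is the run of decimal digits after the backslash, and
    `prior_group_count` is the number of capture groups opened so far
    (both names are assumptions made for this sketch).
    """
    if not digits or not (digits.isascii() and digits.isdigit()):
        raise ValueError("expected a non-empty run of ASCII digits")

    # A leading 0 always means octal, including \0 for NUL.
    if digits[0] == "0":
        return "octal"

    n = int(digits)
    if 0 < n < 10:
        return "backreference"  # single-digit references are always backreferences
    if digits[0] in "89":
        return "backreference"  # 8 and 9 are not octal digits
    if n <= prior_group_count:
        return "backreference"  # refers to a group that already exists

    # Inferred fallback: none of the backreference conditions held.
    return "octal"
```

For example, with three prior groups, `classify_numeric_escape("12", 3)` yields `"octal"`, while with twelve prior groups the same digits yield `"backreference"`.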
These escape sequences denote a specific character. Note that `\b` may only be used in a custom character class, otherwise it represents a word boundary.
+
These escape sequences denote a specific character.
-
**TODO: List these out with a very brief description of what they mean.**
+
- `\a`: The alert (bell) character `U+7`.
+
- `\b`: The backspace character `U+8`. Note that this may only be used in a custom character class; otherwise it represents a word boundary.
+
- `\c <Char>`: A control character sequence (`U+00` - `U+7F`).
Member -> CustomCharClass | !']' !SetOp (Range | Atom)
+
Member -> CustomCharClass | Quote | Range | Atom
Range -> Atom `-` Atom
SetOp -> '&&' | '--' | '~~'
```
-
Custom characters classes introduce their own language, in which most regular expression metacharacters become literal. The basic element in a custom character class is an `Atom`, though not all atoms are valid, e.g a backreference cannot be made.
+
Custom character classes introduce their own language, in which most regular expression metacharacters become literal. The basic element in a custom character class is an `Atom`, though only a few atoms are considered valid:
+
+
- Builtin character classes, except `.`, `\O`, and `\X`
+
- Escape sequences, including `\b`, which becomes the backspace character (rather than a word boundary)
+
- Unicode scalars
+
- Named characters
+
- Character properties
+
- Plain literal characters
+
+
Ranges of characters may be specified with `-`, e.g. `[a-z]` matches the letters from `a` to `z`. Only Unicode scalars and literal characters are valid range operands. If `-` does not appear in a valid position, it is interpreted as literal, e.g. `[-a]` is the character class of `-` and `a`. **TODO: .NET's use of it for subtraction**
-
Ranges of characters may be specified with the `-` character, e.g `[a-z]` matches against the letters from `a` to `z`. If `-` does not appear between two characters, it is interpreted as literal, e.g `[-a]` is the character class of `-` and `a`. **TODO: .NET's use of it for subtraction**
+
**TODO: Different grammar for range?**
Custom character classes may be nested within each other, and may be used with set operations. The supported set operations are intersection `&&`, subtraction `--`, and symmetric difference `~~`.
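One way to picture the three operators is to model each operand as the set of characters it matches. The Python sketch below does that; the helper name `apply_set_op` is hypothetical, and real character classes cover far more than the handful of ASCII letters used here.

```python
def apply_set_op(op: str, lhs: set, rhs: set) -> set:
    """Combine two character sets using a custom character class set operator."""
    if op == "&&":
        return lhs & rhs  # intersection: characters in both operands
    if op == "--":
        return lhs - rhs  # subtraction: characters in lhs but not in rhs
    if op == "~~":
        return lhs ^ rhs  # symmetric difference: characters in exactly one operand
    raise ValueError(f"unknown set operator: {op!r}")


# Operands standing in for the classes matching a-d and c-f.
a_to_d = set("abcd")
c_to_f = set("cdef")

both = apply_set_op("&&", a_to_d, c_to_f)
only_left = apply_set_op("--", a_to_d, c_to_f)
exactly_one = apply_set_op("~~", a_to_d, c_to_f)
```

Here `both` contains `c` and `d`, `only_left` contains `a` and `b`, and `exactly_one` contains `a`, `b`, `e`, and `f`.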
+
Quoted sequences may appear within custom character classes, e.g. `[\Q]\E]`, and escape the contained characters.
A character property specifies a particular Unicode or POSIX property to match against. In general, a property consists of both a key and a value, e.g `General_Category=Whitespace`, however some keys and values may appear on their own (with the other name being inferred). **TODO: Clarify the exact cases this happens**.
+
A character property specifies a particular Unicode or POSIX property to match against. Fuzzy matching is used when parsing the property name, following the rules set out by [UAX44-LM3]. This means that the following property names are considered equivalent:
+
+
- `whitespace`
+
- `isWhitespace`
+
- `is-White_Space`
+
- `iSwHiTeSpaCe`
+
- `i s w h i t e s p a c e`
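The equivalence of these spellings can be illustrated with a short normalization routine. The Python sketch below follows the UAX44-LM3 rule of ignoring case, whitespace, underscores, hyphens, and an initial `is` prefix; the function name `loose_property_key` is hypothetical and this is not the parser's actual code.

```python
def loose_property_key(name: str) -> str:
    """Reduce a property name to a comparison key in the style of UAX44-LM3."""
    # Ignore case, whitespace, underscores, and hyphens.
    folded = "".join(
        ch for ch in name.lower() if not ch.isspace() and ch not in "_-"
    )
    # Ignore an initial "is" prefix.
    if folded.startswith("is"):
        folded = folded[2:]
    return folded


spellings = [
    "whitespace",
    "isWhitespace",
    "is-White_Space",
    "iSwHiTeSpaCe",
    "i s w h i t e s p a c e",
]
keys = {loose_property_key(s) for s in spellings}
```

All five spellings reduce to the single key `whitespace`. Because the prefix is stripped unconditionally, the same normalization should be applied to both the user's input and the table of known names before comparing them.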
+
+
Unicode properties consist of both a key and a value, e.g. `General_Category=Whitespace`. However, there are some properties where the key or value may be inferred. These include:
+
+
- General category properties, e.g. `\p{Whitespace}` is inferred as `\p{General_Category=Whitespace}`.
+
- Script properties, e.g. `\p{Greek}` is inferred as `\p{Script=Greek}`.
+
- Boolean properties that are inferred to have a `True` value, e.g. `\p{Lowercase}` is inferred as `\p{Lowercase=True}`.
+
+
Other Unicode properties, however, must specify both a key and a value.
+
+
For non-Unicode properties, only a value is required. These include:
+
+
- The special properties `any`, `assigned`, `ascii`.
+
- The POSIX compatibility properties `alnum`, `blank`, `graph`, `print`, `word`, `xdigit`. The remaining POSIX properties are already covered by boolean Unicode property spellings.
**TODO: Spell out the properties we recognize while parsing vs. those we just parse as String?**
-
Note that the internal `PropertyContents` syntax is shared by both the `\p{...}` and POSIX-style `[:...:]` syntax, allowing e.g `[:script=Latin:]`.
+
Note that the internal `PropertyContents` syntax is shared by both the `\p{...}` and POSIX-style `[:...:]` syntax, allowing e.g. `[:script=Latin:]` as well as `\p{alnum}`.
### Named characters
@@ -325,6 +375,18 @@ Trivia is consumed by the regular expression parser, but has no semantic meaning
**TODO: Differences between PCRE extended syntax and our syntax**
+
### Quotes
+
+
```
+
Quote -> '\Q' (!'\E' .)* '\E'
+
```
+
+
A quoted sequence is delimited by `\Q...\E`, and allows metacharacters to be escaped such that they are interpreted literally. For example, `\Q^[xy]+$\E` is treated as the literal characters `^[xy]+$` rather than an anchored quantified character class.
+
+
The backslash character is also treated as literal within a quoted sequence, and may not be used to escape the closing delimiter, e.g. `\Q\\E` is a literal `\`.
+
+
`\E` may appear without a preceding `\Q`, in which case it is a literal `E`.
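The `Quote` production can be sketched as a scan for the first `\E` after the opening `\Q`. The Python function below is a minimal illustration with a hypothetical name, `parse_quote`; raising an error for a missing `\E` is an assumption based on the grammar requiring the closing delimiter, and the real parser may behave differently.

```python
def parse_quote(pattern: str, start: int = 0):
    """Parse a \\Q...\\E sequence beginning at `start`.

    Returns (literal_text, index_after_quote), or None if no quote starts here.
    """
    if not pattern.startswith("\\Q", start):
        return None
    body_start = start + 2
    # The body runs up to the first \E. Backslashes inside the body are
    # literal, so there is no way to escape the closing delimiter.
    end = pattern.find("\\E", body_start)
    if end == -1:
        # Assumption: the grammar requires a closing \E.
        raise ValueError("quoted sequence is missing its closing \\E")
    return pattern[body_start:end], end + 2


anchored = parse_quote(r"\Q^[xy]+$\E")
backslash = parse_quote(r"\Q\\E")
```

Here `anchored[0]` is the literal text `^[xy]+$`, and `backslash[0]` is a single backslash, matching the two examples in the text.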
A reference is an abstract identifier for a particular capturing group in a regular expression. It can either be named or numbered, and in the latter case may be specified relative to the current group. For example `-2` refers to the capture group `N - 2` where `N` is the number of the next capture group. References may refer to groups ahead of the current position e.g `+3`, or the name of a future group. These may be useful in recursive cases where the group being referenced has been matched in a prior iteration.
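The numbering arithmetic for relative references can be sketched as follows. The function name `resolve_reference` and the `next_group` parameter are hypothetical. The `-k` case follows the `N - 2` example in the text; the `+k` case is an assumption borrowed from PCRE's convention, where `+1` denotes the next group to be opened, and is not confirmed by this document. Named references are out of scope for the sketch.

```python
def resolve_reference(ref: str, next_group: int) -> int:
    """Resolve a numbered reference to an absolute capture group number.

    `ref` is an absolute number such as "4" or a relative one such as "-2"
    or "+3"; `next_group` is N, the number of the next capture group.
    """
    if ref[:1] in ("+", "-"):
        offset = int(ref[1:])
        if offset == 0:
            raise ValueError("relative offset must be non-zero")
        if ref[0] == "-":
            target = next_group - offset      # e.g. -2 refers to group N - 2
        else:
            # Assumption (PCRE convention): +1 is the next group, i.e. group N.
            target = next_group + offset - 1
        if target < 1:
            raise ValueError("relative reference points before the first group")
        return target
    return int(ref)
```

With `next_group` equal to 5, `"-2"` resolves to group 3 and `"4"` stays group 4; under the assumed PCRE convention, `"+3"` resolves to group 7.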
+
**TODO: Describe how capture groups are numbered? Including nesting & resets?**
A backreference evaluates to the value last captured by a given capturing group.
+
A backreference evaluates to the value last captured by the referenced capturing group. Note that the `\D` form of this syntax is syntactically ambiguous with octal syntax; see the *unicode scalars* section for how this ambiguity is resolved.
A subpattern causes a particular group to be re-evaluated at the current position. The syntax `(?R)` is equivalent to `(?0)`, and causes the entire pattern to be recursed.
-
+
A subpattern causes the referenced group to be re-evaluated at the current position. The syntax `(?R)` is equivalent to `(?0)`, and causes the entire pattern to be recursed.
### Conditionals
@@ -606,3 +669,4 @@ The least intuitive spelling being `'\' [1-9] [0-9]+`, as it can be a backrefere