Commit 442c53f

Update RegexSyntax.md
1 parent 7903609 commit 442c53f

1 file changed

+220
-17
lines changed

Documentation/Evolution/RegexSyntax.md

Lines changed: 220 additions & 17 deletions
@@ -26,14 +26,22 @@ We also intend to achieve at least Level 1 (**TODO: do we want to promise Level
2626

2727
We're proposing the following regular expression syntactic superset for Swift.
2828

29+
### Top-level regular expression
30+
31+
```
32+
Regex -> GlobalMatchingOptionSequence? RegexNode
33+
RegexNode -> '' | Alternation
34+
```
35+
36+
A top-level regular expression may consist of a sequence of global matching options followed by a `RegexNode`, which is the recursive part of the grammar that may be nested within, e.g., a group.
37+
2938
### Alternation
3039

3140
```
32-
Regex -> '' | Alternation
3341
Alternation -> Concatenation ('|' Concatenation)*
3442
```
3543

36-
This is the operator with the lowest precedence in a regular expression, and checks if any of its branches match the input.
44+
The `|` operator denotes what is formally called an alternation, or a choice between alternatives. Any number of alternatives may appear, including empty alternatives. This operator has the lowest precedence of all operators in a regex literal.
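The precedence rule and the empty-alternative case can be sketched with Python's `re` module. Python is used here only as a convenient stand-in engine (an assumption for illustration, not the Swift implementation), and the variable names are hypothetical.

```python
import re

# `|` binds loosest: `abc|def` parses as (abc)|(def), not ab(c|d)ef.
branches = re.compile(r"abc|def")
matches_abc = branches.fullmatch("abc") is not None
matches_def = branches.fullmatch("def") is not None
# This input would only match if the pattern meant ab(c|d)ef.
matches_abdef = branches.fullmatch("abdef") is not None

# An empty alternative is permitted and matches the empty string.
optional_a = re.compile(r"a|")
matches_a = optional_a.fullmatch("a") is not None
matches_empty = optional_a.fullmatch("") is not None

print(matches_abc, matches_def, matches_abdef, matches_a, matches_empty)
```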
3745

3846
### Concatenation
3947

@@ -42,38 +50,67 @@ Concatenation -> (!'|' !')' ConcatComponent)*
4250
ConcatComponent -> Trivia | Quote | Quantification
4351
```
4452

45-
Implicitly denoted by adjacent expressions, a concatenation matches against a sequence of regular expression patterns. This has a higher precedence than an alternation, so e.g `abc|def` matches against `abc` or `def`. The `ConcatComponent` token varies across engine, but at least matches some form of trivia, e.g comments, quoted sequences e.g `\Q...\E`, and a quantified expression.
53+
Implicitly denoted by adjacent expressions, a concatenation matches against a sequence of regular expression patterns. This has a higher precedence than an alternation, so e.g. `abc|def` matches against `abc` or `def`. The `ConcatComponent` token varies across engines, but at least matches some form of trivia (e.g. comments), quoted sequences (e.g. `\Q...\E`), and a potentially quantified expression.
4654

4755
### Quantification
4856

4957
```
5058
Quantification -> QuantOperand Quantifier?
51-
Quantifier -> ('*' | '+' | '?' | '{' Range '}') QuantKind?
59+
Quantifier -> QuantAmount QuantKind?
60+
QuantAmount -> '?' | '*' | '+' | '{' Range '}'
5261
QuantKind -> '?' | '+'
62+
Range -> ',' <Int> | <Int> ',' <Int>? | <Int>
63+
64+
QuantOperand -> AbsentFunction | Atom | Conditional | CustomCharClass | Group
5365
```
5466

55-
Specifies that the operand may be matched against a certain number of times.
67+
A quantification consists of an operand optionally followed by a quantifier that specifies how many times it may be matched. An operand without a quantifier is matched once.
68+
69+
The quantifiers supported are:
70+
71+
- `?`: 0 or 1 matches
72+
- `*`: 0 or more matches
73+
- `+`: 1 or more matches
74+
- `{n,m}`: Between `n` and `m` (inclusive) matches
75+
- `{n,}`: `n` or more matches
76+
- `{,m}`: Up to `m` matches
77+
- `{n}`: Exactly `n` matches
78+
79+
A quantifier may optionally be followed by `?` or `+`, which change the semantics of the quantification. If neither is specified, the quantification is eager by default, meaning that it tries to maximize the number of matches made. If `?` is specified, the quantification is reluctant: the number of matches is instead minimized. If `+` is specified, the quantification is possessive: eager matching occurs, but with the additional semantic that it may not be backtracked into to try a different number of matches.
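The three behaviours can be sketched with Python's `re` module, used here only as a stand-in engine (an assumption for illustration). Because the possessive `+` suffix is only available in newer Python versions, the last pattern emulates it with a lookahead plus a backreference; that emulation is a swapped-in technique, not the syntax proposed here.

```python
import re

text = "aaaa"

# Eager (default): take as many repetitions as the range allows.
eager = re.match(r"a{1,3}", text).group()
# Reluctant (`?` suffix): take as few repetitions as the range allows.
reluctant = re.match(r"a{1,3}?", text).group()

# An eager quantifier can be backtracked into: `a+` gives back one
# character so that the trailing `a` can still match.
eager_backtracks = re.fullmatch(r"a+a", text) is not None

# Emulated possessive quantifier: the lookahead captures the whole run
# and the backreference consumes it, with no way to give characters back,
# so the trailing `a` can never match.
possessive_matches = re.fullmatch(r"(?=(a+))\1a", text) is not None

print(eager, reluctant, eager_backtracks, possessive_matches)
```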
80+
81+
### Atom
5682

57-
**TODO: Briefly mention each and what it means, noting that options can swap eager/reluctant. Might be a good time to introduce the eager/reluctant/possessive terminology**
83+
```
84+
Atom -> Anchor | EscapeSequence | BuiltinCharClass
85+
```
86+
87+
Atoms are the smallest unit of regular expression syntax that cannot be split into smaller syntactic expressions.
5888

5989
### Groups
6090

6191
```
62-
GroupStart -> '(?' GroupKind | '('
63-
GroupKind -> ':' | '|' | '>' | '=' | '!' | '*' | '<=' | '<!' | '<*'
64-
| NamedGroup | MatchingOptionSeq (':' | ')')
65-
66-
NamedGroup -> 'P<' GroupNameBody '>'
67-
| '<' GroupNameBody '>'
68-
| "'" GroupNameBody "'"
92+
Group -> GroupStart RegexNode ')'
93+
GroupStart -> '(?' GroupKind | '('
94+
GroupKind -> ':' | '|' | '>' | '=' | '!' | '*' | '<=' | '<!' | '<*'
95+
| NamedGroup | MatchingOptionSeq (':' | ')')
96+
97+
NamedGroup -> 'P<' GroupNameBody '>'
98+
| '<' GroupNameBody '>'
99+
| "'" GroupNameBody "'"
69100
70101
GroupNameBody -> Identifier | BalancingGroupBody
102+
103+
Identifier -> [\w--\d] \w*
71104
```
72105

73106
Groups define a new scope within which a recursive regular expression pattern may occur. Groups have different semantics depending on how they are introduced: some may capture the nested match, some may match against the input without advancing, some may change the matching options set in the new scope, etc.
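A few of these group kinds can be sketched with Python's `re` module, which happens to accept the `(`, `(?:`, `(?P<name>`, `(?=`, and `(?i:` spellings from the grammar above. Python is only a stand-in engine here (an assumption for illustration), and the sample strings are hypothetical.

```python
import re

# Capturing vs. non-capturing: only the first group records its match.
captures = re.fullmatch(r"(\d+)-(?:\d+)", "2024-05").groups()

# Named group, using the 'P<' GroupNameBody '>' spelling.
year = re.fullmatch(r"(?P<year>\d+)-\d+", "2024-05").group("year")

# Lookahead: checks for "foo" without advancing, so `\w+` still
# starts matching at the same position.
after_lookahead = re.match(r"(?=foo)\w+", "foobar").group()

# Matching options scoped to a group: only `a` is case-insensitive.
scoped = re.compile(r"(?i:a)b")
scoped_Ab = scoped.fullmatch("Ab") is not None
scoped_AB = scoped.fullmatch("AB") is not None

print(captures, year, after_lookahead, scoped_Ab, scoped_AB)
```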
74107

75108
**TODO: Something like "note that there are other things that may syntactically appear similarly to groups, but are their own constructs. See .... in-line options, backreferences, ..."**
76109

110+
#### Lookahead and lookbehind
111+
112+
#### Script runs
113+
77114
#### Balancing groups
78115

79116
```
@@ -86,15 +123,39 @@ Introduced by .NET, balancing groups extend the `GroupNameBody` syntax to suppor
86123
### Anchors
87124

88125
```
89-
Anchor -> '^' | '$' | '\b' | '\B' | '\A' | '\G' | '\z' | '\Z'
126+
Anchor -> '^' | '$' | '\A' | '\b' | '\B' | '\G' | '\y' | '\Y' | '\z' | '\Z'
90127
```
91128

92129
Anchors match against a certain position in the input rather than on a particular character of the input.
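That anchors match positions rather than characters can be sketched as follows, using Python's `re` module as a stand-in engine (an assumption for illustration). Line-oriented `^` and `$` are enabled here with Python's `(?m)` option; the sample text is hypothetical.

```python
import re

text = "cat concat\ncat."

# `\b` matches at word boundaries, so the "cat" inside "concat" is skipped.
whole_words = [m.start() for m in re.finditer(r"\bcat\b", text)]

# With the multi-line option, `^` and `$` match at line starts and ends.
line_starts = [m.start() for m in re.finditer(r"(?m)^", text)]
line_ends = [m.start() for m in re.finditer(r"(?m)$", text)]

# Every anchor match is empty: it identifies a position, not a character.
all_empty = all(m.start() == m.end() for m in re.finditer(r"(?m)^|$", text))

print(whole_words, line_starts, line_ends, all_empty)
```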
93130

131+
#### Start and end of line
132+
133+
`^` matches against the start of a line of input, and `$` matches against the end of a line.
134+
135+
#### Word boundaries
136+
137+
`\b` matches a word boundary, which is [...]
138+
94139
**TODO: List these out with a very brief description of what they mean.**
95140

96141
### Unicode scalars
97142

143+
```
144+
UniScalar -> '\u{' HexDigit{1...} '}'
145+
| '\u' HexDigit{4}
146+
| '\x{' HexDigit{1...} '}'
147+
| '\x' HexDigit{0...2}
148+
| '\U' HexDigit{8}
149+
| '\o{' OctalDigit{1...} '}'
150+
| '\' OctalDigit{1...3}
151+
152+
HexDigit   -> [0-9a-fA-F]
153+
OctalDigit -> [0-7]
154+
```
155+
156+
These sequences define a Unicode scalar value to be matched against.
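Some of these spellings can be tried with Python's `re` module, which accepts `\u` with four hex digits, `\x` with two, `\U` with eight, and octal escapes, but not the braced forms. Python is only a stand-in engine here (an assumption for illustration); the last two lines show how, in that engine, a backslash followed by digits is either an octal scalar or a backreference depending on the digit count.

```python
import re

# Each pattern denotes a single scalar value.
u4 = re.fullmatch(r"\u00E9", "\u00e9") is not None          # U+00E9
x2 = re.fullmatch(r"\x41", "A") is not None                 # U+0041
u8 = re.fullmatch(r"\U0001F600", "\U0001F600") is not None  # U+1F600

# In Python, three octal digits form an octal escape (0o101 == 65 == "A") ...
octal = re.fullmatch(r"\101", "A") is not None
# ... while a single non-zero digit refers back to a capture group.
backref = re.fullmatch(r"(a)\1", "aa") is not None

print(u4, x2, u8, octal, backref)
```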
157+
158+
**TODO: Some discussion of the fun `\DDD` syntax**
98159

99160
### Escape sequences
100161

@@ -107,7 +168,7 @@ These escape sequences denote a specific character. Note that `\b` may only be u
107168
### Builtin character classes
108169

109170
```
110-
BuiltinCharClass -> '\d' | '\D' | '\h' | '\H' | '\R' | '\s' | '\S' | '\v' | '\V' | '\w' | '\W' | '\X'
171+
BuiltinCharClass -> '.' | '\d' | '\D' | '\h' | '\H' | '\O' | '\R' | '\s' | '\S' | '\v' | '\V' | '\w' | '\W' | '\X'
111172
```
112173

113174
### Custom character classes
@@ -125,13 +186,155 @@ Custom characters classes introduce their own language, in which most regular ex
125186

126187
### Character properties
127188

189+
```
190+
CharacterProperty -> ('p{' | 'P{') PropertyName ('=' PropertyName)? '}'
191+
PropertyName -> [\s\w-]+
192+
```
193+
194+
### Named characters
195+
196+
```
197+
NamedCharacter -> '\N{' CharName '}'
198+
CharName -> 'U+' HexDigit{1...8} | [\s\w-]+
199+
```
200+
201+
### Trivia
202+
203+
```
204+
Trivia -> Comment | Whitespace
205+
Comment -> '(?#' (!')')* ')'
206+
Whitespace -> \s+
207+
```
208+
209+
Trivia is consumed by the regular expression parser, but has no semantic meaning. Non-semantic whitespace may only occur when either of the extended syntax matching options `(?x)` or `(?xx)` is enabled.
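Both kinds of trivia can be sketched with Python's `re` module, used only as a stand-in engine (an assumption for illustration). Python supports `(?#...)` comments and the `(?x)` option, but not `(?xx)`.

```python
import re

# A `(?#...)` comment is consumed by the parser and matches nothing.
with_comment = re.fullmatch(r"a(?#the first letter)b", "ab") is not None

# Without extended syntax, whitespace is a literal to be matched.
literal_space = re.fullmatch(r"a b", "ab") is not None

# With `(?x)`, whitespace in the pattern is non-semantic trivia.
extended = re.fullmatch(r"(?x) a  b   c ", "abc") is not None

print(with_comment, literal_space, extended)
```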
210+
211+
### Matching options
212+
213+
```
214+
MatchingOptionSeq -> '^' MatchingOption*
215+
| MatchingOption+
216+
| MatchingOption* '-' MatchingOption*
217+
218+
MatchingOption -> 'i' | 'J' | 'm' | 'n' | 's' | 'U' | 'x' | 'xx' | 'w' | 'D' | 'P' | 'S' | 'W' | 'y{' ('g' | 'w') '}'
219+
```
220+
221+
### References
222+
223+
```
224+
NamedRef -> Identifier
225+
NumberRef -> ('+' | '-')? <Decimal Number> RecursionLevel?
226+
RecursionLevel -> '+' <Int> | '-' <Int>
227+
```
228+
229+
#### Backreferences
230+
231+
```
232+
Backreference -> '\g{' NameOrNumberRef '}'
233+
| '\g' NumberRef
234+
| '\k<' Identifier '>'
235+
| "\k'" Identifier "'"
236+
| '\k{' Identifier '}'
237+
| '\' [1-9] [0-9]+
238+
| '(?P=' Identifier ')'
239+
```
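Two of the backreference spellings above, `'\' [1-9]` style numbered references and `'(?P=' Identifier ')'`, are accepted by Python's `re` module, so they can be sketched there. Python is only a stand-in engine (an assumption for illustration); it does not accept the `\g{...}` or `\k<...>` forms inside a pattern.

```python
import re

# Numbered backreference: `\1` must repeat whatever group 1 matched.
doubled = re.search(r"(\w)\1", "hello").group()

# Named backreference: the closing quote must equal the opening quote.
quoted = re.compile(r'''(?P<quote>['"]).*?(?P=quote)''')
single = quoted.search("say 'hi' now").group()
mismatched = quoted.search("say 'hi\" now")

print(doubled, single, mismatched)
```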
240+
241+
#### Subpatterns
242+
243+
```
244+
Subpattern -> '\g<' NameOrNumberRef '>'
245+
| "\g'" NameOrNumberRef "'"
246+
| '(?' GroupLikeSubpatternBody ')'
247+
248+
GroupLikeSubpatternBody -> 'P>' <String>
249+
| '&' <String>
250+
| 'R'
251+
| NumberRef
252+
```
253+
254+
### Conditionals
255+
256+
```
257+
Conditional -> ConditionalStart Concatenation ('|' Concatenation)? ')'
258+
ConditionalStart -> KnownConditionalStart | GroupConditionalStart
259+
260+
KnownConditionalStart -> '(?(' KnownCondition ')'
261+
GroupConditionalStart -> '(?' GroupStart
262+
263+
KnownCondition -> 'R'
264+
| 'R' NumberRef
265+
| 'R&' <String> !')'
266+
| '<' NameRef '>'
267+
| "'" NameRef "'"
268+
| 'DEFINE'
269+
               | 'VERSION' PCREVersionCheck
270+
| NumberRef
271+
| NameRef
272+
273+
PCREVersionCheck -> '>'? '=' PCREVersionNumber
274+
PCREVersionNumber -> <Int> '.' <Int>
275+
```
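The `KnownCondition -> NumberRef` case can be sketched with Python's `re` module, which supports `(?(1)yes|no)` conditionals; most other conditions in the grammar above (`R`, `DEFINE`, `VERSION`, group conditions) are not available there. Python is only a stand-in engine (an assumption for illustration), and the pattern is a hypothetical example.

```python
import re

# If group 1 (an opening "<") participated in the match, require ">".
# The "no" branch is omitted, so nothing is required otherwise.
tag = re.compile(r"(<)?\w+(?(1)>)")

bracketed = tag.fullmatch("<tag>") is not None
bare = tag.fullmatch("tag") is not None
unclosed = tag.fullmatch("<tag") is not None
unopened = tag.fullmatch("tag>") is not None

print(bracketed, bare, unclosed, unopened)
```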
276+
277+
### PCRE backtracking directives
278+
279+
```
280+
BacktrackingDirective -> '(*' BacktrackingDirectiveKind (':' <String>)? ')'
281+
BacktrackingDirectiveKind -> 'ACCEPT' | 'FAIL' | 'F' | 'MARK' | '' | 'COMMIT' | 'PRUNE' | 'SKIP' | 'THEN'
282+
```
283+
284+
### PCRE global matching options
285+
286+
```
287+
GlobalMatchingOptionSequence -> GlobalMatchingOption+
288+
GlobalMatchingOption -> '(*' GlobalMatchingOptionKind ')'
289+
290+
GlobalMatchingOptionKind -> LimitOptionKind '=' <Int>
291+
| NewlineKind | NewlineSequenceKind
292+
| 'NOTEMPTY_ATSTART' | 'NOTEMPTY'
293+
| 'NO_AUTO_POSSESS' | 'NO_DOTSTAR_ANCHOR'
294+
| 'NO_JIT' | 'NO_START_OPT' | 'UTF' | 'UCP'
295+
296+
LimitOptionKind -> 'LIMIT_DEPTH' | 'LIMIT_HEAP' | 'LIMIT_MATCH'
297+
NewlineKind -> 'CRLF' | 'CR' | 'ANYCRLF' | 'ANY' | 'LF' | 'NUL'
298+
NewlineSequenceKind -> 'BSR_ANYCRLF' | 'BSR_UNICODE'
299+
```
128300

129301
### Callouts
130302

303+
```
304+
Callout -> PCRECallout | OnigurumaCallout
305+
306+
PCRECallout     -> '(?C' PCRECalloutBody ')'
307+
PCRECalloutBody -> '' | <Number>
308+
| '`' <String> '`'
309+
| "'" <String> "'"
310+
| '"' <String> '"'
311+
| '^' <String> '^'
312+
| '%' <String> '%'
313+
| '#' <String> '#'
314+
| '$' <String> '$'
315+
| '{' <String> '}'
316+
317+
OnigurumaCallout -> OnigurumaNamedCallout | OnigurumaCalloutOfContents
318+
319+
OnigurumaNamedCallout -> '(*' Identifier OnigurumaTag? OnigurumaCalloutArgs? ')'
320+
OnigurumaCalloutArgs -> '{' OnigurumaCalloutArgList '}'
321+
OnigurumaCalloutArgList -> OnigurumaCalloutArg (',' OnigurumaCalloutArgList)*
322+
OnigurumaCalloutArg -> [^,}]+
323+
OnigurumaTag -> '[' Identifier ']'
324+
325+
OnigurumaCalloutOfContents -> '(?' '{'+ OnigurumaCalloutContents '}'+ OnigurumaTag? OnigurumaCalloutDirection? ')'
326+
OnigurumaCalloutContents -> <String>
327+
OnigurumaCalloutDirection -> 'X' | '<' | '>'
328+
```
131329

132330
### Absent functions
133331

134-
332+
```
333+
AbsentFunction -> '(?~' RegexNode ')'
334+
| '(?~|' Concatenation '|' Concatenation ')'
335+
| '(?~|' Concatenation ')'
336+
| '(?~|)'
337+
```
135338

136339
## Syntactic differences between engines
137340

@@ -141,7 +344,7 @@ In a custom character class, some engines allow for binary set operations that t
141344

142345
| PCRE | ICU | UTS#18 | Oniguruma | .NET | Java |
143346
|------|-----|--------|-----------|------|------|
144-
|| Intersection `&&`, Subtraction `--` | Intersection & Subtraction | Intersection `&&` | Subtraction via `-` | Intersection `&&` |
347+
|| Intersection `&&`, Subtraction `--` | Intersection, Subtraction | Intersection `&&` | Subtraction via `-` | Intersection `&&` |
145348

146349
[UTS#18][uts18] requires intersection and subtraction, and uses the operation spellings `&&` and `--` in its examples, though it doesn't mandate a particular spelling. In particular, conforming implementations could spell the subtraction `[[x]--[y]]` as `[[x]&&[^y]]`. UTS#18 also suggests a symmetric difference operator `~~`, and uses an explicit `||` operator in examples, though it doesn't require either operation.
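The equivalence between `[[x]--[y]]` and `[[x]&&[^y]]` can be checked by brute force. Python's `re` module has neither operator, so this sketch emulates both with lookaheads (a swapped-in technique, labelled as such) and compares the accepted characters over the ASCII range; the class contents `[a-z]` and `[aeiou]` are hypothetical.

```python
import re
import string

# Emulated subtraction [[a-z]--[aeiou]]: in [a-z], and not in [aeiou].
subtraction = re.compile(r"(?![aeiou])[a-z]")
# Emulated intersection [[a-z]&&[^aeiou]]: in [a-z], and in [^aeiou].
intersection = re.compile(r"(?=[^aeiou])[a-z]")

universe = [chr(code) for code in range(128)]
accepted_by_subtraction = {ch for ch in universe if subtraction.fullmatch(ch)}
accepted_by_intersection = {ch for ch in universe if intersection.fullmatch(ch)}

# Both spellings should accept exactly the lowercase ASCII consonants.
consonants = set(string.ascii_lowercase) - set("aeiou")
print(accepted_by_subtraction == accepted_by_intersection == consonants)
```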
147350
