Commit 7903609

Update RegexSyntax.md
1 parent a34ffb3 commit 7903609

1 file changed

+55
-55
lines changed

Documentation/Evolution/RegexSyntax.md

Lines changed: 55 additions & 55 deletions
@@ -6,7 +6,7 @@
66

77
We aim to parse a superset of the syntax accepted by a variety of popular regular expression engines.
88

9-
**TODO: Elaborate**
9+
**TODO(Michael): Elaborate**
1010

1111
## Engines supported
1212

@@ -20,13 +20,13 @@ We aim to implement a syntactic superset of:
2020

2121
We also intend to achieve at least Level 1 (**TODO: do we want to promise Level 2?**) [UTS#18][uts18] conformance, which specifies regular expression matching semantics without mandating any particular syntax. However, we can infer syntactic feature sets from its guidance.
2222

23-
## Regex syntax supported
23+
**TODO(Michael): Rework and expand prose**
2424

25-
### General syntax
25+
## Detailed Design
2626

27-
The following syntax are supported by all the above engines.
27+
We're proposing the following regular expression syntactic superset for Swift.
2828

29-
#### Alternation
29+
### Alternation
3030

3131
```
3232
Regex -> '' | Alternation
@@ -35,7 +35,7 @@ Alternation -> Concatenation ('|' Concatenation)*
3535

3636
This is the operator with the lowest precedence in a regular expression, and checks if any of its branches match the input.
3737

38-
#### Concatenation
38+
### Concatenation
3939

4040
```
4141
Concatenation -> (!'|' !')' ConcatComponent)*
@@ -44,7 +44,7 @@ ConcatComponent -> Trivia | Quote | Quantification
4444

4545
Implicitly denoted by adjacent expressions, a concatenation matches against a sequence of regular expression patterns. It has a higher precedence than an alternation, so e.g. `abc|def` matches against `abc` or `def`. The `ConcatComponent` token varies across engines, but at least matches some form of trivia (e.g. comments), quoted sequences (e.g. `\Q...\E`), and quantified expressions.
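
The precedence rule can be checked against any engine that follows the conventional rules. The following is a minimal sketch that uses Python's `re` module purely as a stand-in engine (an assumption for illustration; it is not one of the engines listed in this document):

```python
import re

# Concatenation binds tighter than alternation, so `abc|def` has the two
# branches `abc` and `def`, not `ab`, then (`c` or `d`), then `ef`.
two_branches = re.compile(r"abc|def")

# An explicit group is needed to alternate over a single character.
grouped = re.compile(r"ab(?:c|d)ef")

matches_abc = two_branches.fullmatch("abc") is not None
matches_def = two_branches.fullmatch("def") is not None
matches_abdef = two_branches.fullmatch("abdef") is not None
grouped_matches_abdef = grouped.fullmatch("abdef") is not None
```

Here `matches_abc` and `matches_def` are true, `matches_abdef` is false, and `grouped_matches_abdef` is true.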
4646

47-
#### Quantification
47+
### Quantification
4848

4949
```
5050
Quantification -> QuantOperand Quantifier?
@@ -54,7 +54,9 @@ QuantKind -> '?' | '+'
5454

5555
Specifies that the operand may be matched against a certain number of times.
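
As a rough illustration of how a `QuantKind` suffix changes matching, the sketch below compares an eager quantifier with its reluctant (`?`) counterpart. Python's `re` module is used only as a stand-in engine, which is an assumption of this example; the possessive (`+`) kind is left out because not every Python version accepts it.

```python
import re

text = "<a><b>"

# Eager: `.+` consumes as much input as it can while still allowing a match.
eager = re.match(r"<.+>", text).group()

# Reluctant: `.+?` consumes as little input as it can.
reluctant = re.match(r"<.+?>", text).group()
```

With this input, `eager` is the whole string `<a><b>`, while `reluctant` stops at the first `>` and yields `<a>`.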
5656

57-
#### Groups
57+
**TODO: Briefly mention each and what it means, noting that options can swap eager/reluctant. Might be a good time to introduce the eager/reluctant/possessive terminology**
58+
59+
### Groups
5860

5961
```
6062
GroupStart -> '(?' GroupKind | '('
@@ -65,28 +67,50 @@ NamedGroup -> 'P<' GroupNameBody '>'
6567
| '<' GroupNameBody '>'
6668
| "'" GroupNameBody "'"
6769
68-
GroupNameBody -> Identifier
70+
GroupNameBody -> Identifier | BalancingGroupBody
6971
```
7072

7173
Groups define a new scope within which a recursive regular expression pattern may occur. Groups have different semantics depending on how they are introduced: some capture the nested match, some match against the input without advancing, some change the matching options set in the new scope, etc.
7274

73-
#### Anchors
75+
**TODO: Something like "note that there are other things that may syntactically appear similarly to groups, but are their own constructs. See .... in-line options, backreferences, ..."**
76+
77+
#### Balancing groups
78+
79+
```
80+
BalancingGroupBody -> Identifier? '-' Identifier
81+
```
82+
83+
Introduced by .NET, balancing groups extend the `GroupNameBody` syntax to support the ability to refer to a prior group. Upon matching, the prior group is deleted, and any intermediate matched input becomes the capture of the current group.
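
A minimal sketch of how the `GroupNameBody` and `BalancingGroupBody` rules might be parsed is given below. The `GroupName` type, the `parse_group_name_body` function, and the definition of an identifier are all hypothetical, since the grammar shown here does not define `Identifier`.

```python
import re
from typing import NamedTuple, Optional


class GroupName(NamedTuple):
    name: Optional[str]   # name of the current group, if given
    prior: Optional[str]  # prior group referenced by a balancing group


# Assumption: an identifier is a letter or underscore followed by word
# characters. The real `Identifier` rule may differ.
_IDENT = r"[A-Za-z_]\w*"
_BODY = re.compile(rf"(?P<name>{_IDENT})?(?:-(?P<prior>{_IDENT}))?")


def parse_group_name_body(body: str) -> GroupName:
    """Parse `Identifier | Identifier? '-' Identifier`."""
    m = _BODY.fullmatch(body)
    if m is None or (m["name"] is None and m["prior"] is None):
        raise ValueError(f"invalid group name body: {body!r}")
    return GroupName(m["name"], m["prior"])
```

For example, `parse_group_name_body("close-open")` yields a current name of `close` and a prior group of `open`, while `parse_group_name_body("-open")` has no current name.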
84+
85+
86+
### Anchors
7487

7588
```
76-
Anchor -> '^' | '$' | '\b'
89+
Anchor -> '^' | '$' | '\b' | '\B' | '\A' | '\G' | '\z' | '\Z'
7790
```
7891

7992
Anchors match against a certain position in the input rather than on a particular character of the input.
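
The zero-width nature of anchors can be seen in the following sketch. It uses Python's `re` module as a stand-in engine (an assumption of this example) and only the anchors `^`, `$`, `\b`, and `\B`, whose behavior on this input is the same in the engines discussed here.

```python
import re

text = "concat cat"

# `\b` requires a word boundary, so the `cat` inside `concat` is skipped.
at_boundary = re.search(r"\bcat", text).start()

# `\B` requires the absence of a word boundary.
not_at_boundary = re.search(r"\Bcat", text).start()

# `^` and `$` constrain the match to the start and end of the input.
at_start = re.search(r"^cat", text)
at_end = re.search(r"cat$", text).start()

# An anchor on its own matches a position and consumes no input.
empty = re.search(r"\b", text)
```

Here `at_boundary` and `at_end` are both 7, `not_at_boundary` is 3, `at_start` is `None`, and `empty` is an empty match at position 0.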
8093

81-
#### Unicode scalars
94+
**TODO: List these out with a very brief description of what they mean.**
95+
96+
### Unicode scalars
8297

8398

99+
### Escape sequences
100+
101+
```
102+
EscapeSequence -> '\a' | '\b' | '\c' <Char> | '\e' | '\f' | '\n' | '\r' | '\t'
103+
```
84104

85-
#### Builtin character classes
105+
These escape sequences denote a specific character. Note that `\b` denotes a character (backspace) only within a custom character class; elsewhere it represents a word boundary.
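
The dual role of `\b` can be illustrated with Python's `re` module, which happens to follow the same convention (using it as a stand-in engine is an assumption of this sketch):

```python
import re

backspace = "\x08"

# Inside a custom character class, `\b` is the backspace character.
in_class = re.fullmatch(r"[\b]", backspace)

# Outside a class, `\b` is a word boundary. A lone backspace contains no
# word characters, so there is no boundary to find.
as_boundary = re.search(r"\b", backspace)
```

`in_class` is a successful match, whereas `as_boundary` is `None`.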
86106

107+
### Builtin character classes
87108

109+
```
110+
BuiltinCharClass -> '\d' | '\D' | '\h' | '\H' | '\R' | '\s' | '\S' | '\v' | '\V' | '\w' | '\W' | '\X'
111+
```
88112

89-
#### Custom character classes
113+
### Custom character classes
90114

91115
```
92116
CustomCharClass -> Start Set (SetOp Set)* ']'
@@ -96,46 +120,22 @@ Member -> CustomCharClass | !']' !SetOp (Range | Atom)
96120
Range -> Atom `-` Atom
97121
```
98122

99-
Custom characters classes introduce their own language, in which most regular expression metacharacters become literal
100-
123+
Custom character classes introduce their own language, in which most regular expression metacharacters become literal.
101124

102-
#### Character properties
103125

104-
### PCRE-specific syntax
126+
### Character properties
105127

106-
#### Callouts
107128

108-
### Oniguruma-specific syntax
129+
### Callouts
109130

110-
#### Custom reference syntax
111131

112-
#### Callout syntax
113-
114-
#### Absent functions
115-
116-
### ICU-specific syntax
117-
118-
119-
120-
### .NET-specific syntax
121-
122-
#### Balancing groups
123-
124-
```
125-
GroupNameBody -> Identifier | Identifier? '-' Identifier
126-
```
127-
128-
.NET supports the ability for a group to reference a prior group, causing the prior group to be deleted, and any intermediate matched input to become the capture of the current group.
129-
130-
#### Character class subtraction with `-`
132+
### Absent functions
131133

132134

133135

134136
## Syntactic differences between engines
135137

136-
### Conflicting differences
137-
138-
#### Character class set operations
138+
### Character class set operations
139139

140140
In a custom character class, some engines allow for binary set operations that take two character class inputs, and produce a new character class output. However, which set operations are supported, and the spellings used, differ by engine.
141141

@@ -147,9 +147,11 @@ In a custom character class, some engines allow for binary set operations that t
147147

148148
These differences are conflicting, as engines that don't support a particular operator treat it as literal, e.g. `[x&&y]` in PCRE is the character class of `["x", "&", "y"]` rather than an intersection.
149149

150+
Another conflict arises with .NET's support of using the `-` character in a custom character class to denote both a range and a set subtraction. .NET disambiguates this by only permitting its use as a subtraction if the right-hand operand is a nested custom character class; otherwise it is a range. This conflicts with e.g. ICU, where the `-` in `[x-[y]]` is treated as literal.
151+
150152
We intend to support the operators `&&`, `--`, `-`, and `~~`. This means that any regex literal containing these sequences in a custom character class, while being written for an engine not supporting that operation, will have a different semantic meaning in our engine. This ought not to be a common occurrence, as specifying a character multiple times in a custom character class is redundant. Nevertheless, we intend to provide a strict compatibility mode that may be used to emulate the behavior of a particular engine (**TODO: all engines, or just PCRE?**).
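
To make the difference in meaning concrete, the sketch below models character classes as plain sets of characters. The names `SET_OPS`, `apply_set_op`, and `literal_members` are hypothetical. The text confirms that `&&` is an intersection and `-` a subtraction; reading `--` as subtraction and `~~` as symmetric difference is an assumption of this example.

```python
# Assumed meaning of each operator spelling, modelled on frozensets.
SET_OPS = {
    "&&": frozenset.intersection,
    "--": frozenset.difference,            # assumption
    "-": frozenset.difference,
    "~~": frozenset.symmetric_difference,  # assumption
}


def apply_set_op(lhs: str, op: str, rhs: str) -> frozenset:
    """Combine two classes, each given as a string of member characters."""
    return SET_OPS[op](frozenset(lhs), frozenset(rhs))


def literal_members(class_body: str) -> frozenset:
    """Model an engine without set operators: every character is literal."""
    return frozenset(class_body)


with_operator = apply_set_op("xyz", "&&", "yz!")
as_literal = literal_members("x&&y")
```

With set operator support, `with_operator` contains only `y` and `z`. An engine that treats `&&` literally would instead see the members `x`, `&`, and `y`, as in `as_literal`.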
151153

152-
#### Nested custom character classes
154+
### Nested custom character classes
153155

154156
This allows e.g. `[[a]b[c]]`, which is interpreted the same as `[abc]`.
155157

@@ -163,35 +165,35 @@ PCRE does not support this feature, and as such treats `]` as the closing charac
163165

164166
We aim to support nested custom character classes, with a strict PCRE mode for emulating the PCRE behavior if desired.
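
A minimal sketch of the nested interpretation is shown below. The `parse_class` function is hypothetical and deliberately ignores ranges, escapes, and set operators; it only demonstrates that nested classes flatten into the members of the enclosing class.

```python
from typing import Set, Tuple


def parse_class(text: str, start: int = 0) -> Tuple[Set[str], int]:
    """Parse a custom character class beginning at `text[start]`.

    Returns the set of member characters and the index just past the
    closing `]`. Nested classes contribute their members to the parent.
    """
    if start >= len(text) or text[start] != "[":
        raise ValueError("expected '['")
    i = start + 1
    members: Set[str] = set()
    while i < len(text) and text[i] != "]":
        if text[i] == "[":
            nested, i = parse_class(text, i)  # recurse into the nested class
            members |= nested
        else:
            members.add(text[i])
            i += 1
    if i >= len(text):
        raise ValueError("unterminated character class")
    return members, i + 1


nested_members, _ = parse_class("[[a]b[c]]")
flat_members, _ = parse_class("[abc]")
```

Both `nested_members` and `flat_members` are the set of `a`, `b`, and `c`.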
165167

166-
#### `\U`
168+
### `\U`
167169

168170
In PCRE, if `PCRE2_ALT_BSUX` or `PCRE2_EXTRA_ALT_BSUX` is specified, `\U` matches literal `U`. However, in ICU, `\Uhhhhhhhh` matches the character with the given eight-digit hex value.
169171

170-
#### `{,n}`
172+
### `{,n}`
171173

172174
This quantifier is supported by Oniguruma, but in PCRE it matches the literal characters.
173175

174-
#### \0DDD
176+
### `\0DDD`
175177

176178
In ICU, `DDD` are interpreted as an octal code. In PCRE, only the first two digits are interpreted as octal, the last is literal.
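
The two readings can be sketched with one hypothetical helper, `decode_zero_escape`, parameterized by how many digits after `\0` are consumed as octal. The helper and the constants `ICU_DIGITS` and `PCRE_DIGITS` are illustrative names only, and no range validation is attempted.

```python
OCTAL_DIGITS = "01234567"
ICU_DIGITS = 3   # all of `DDD` is octal
PCRE_DIGITS = 2  # only the first two digits are octal


def decode_zero_escape(escape: str, max_digits: int) -> str:
    """Decode a `\\0DDD` escape, consuming at most `max_digits` octal digits.

    Any text left over after the consumed digits is returned as literal
    characters following the decoded character.
    """
    if not escape.startswith("\\0"):
        raise ValueError("expected an escape starting with \\0")
    rest = escape[2:]
    n = 0
    while n < max_digits and n < len(rest) and rest[n] in OCTAL_DIGITS:
        n += 1
    code = int(rest[:n] or "0", 8)
    return chr(code) + rest[n:]


icu_result = decode_zero_escape(r"\0101", ICU_DIGITS)
pcre_result = decode_zero_escape(r"\0101", PCRE_DIGITS)
```

For `\0101`, the three-digit reading gives octal 101, which is `A`. The two-digit reading gives octal 10 (backspace) followed by a literal `1`.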
177179

178-
#### `\x`
180+
### `\x`
179181

180182
In PCRE, a bare `\x` denotes the NUL character (`U+0000`). In Oniguruma, it denotes literal `x`.
181183

182-
#### Whitespace in ranges
184+
### Whitespace in ranges
183185

184186
In PCRE, `x{2,4}` is a range quantifier meaning that `x` can be matched from 2 to 4 times. However if whitespace is introduced in the range, it becomes invalid and is then treated as the literal characters. We find this behavior to be unintuitive, and therefore intend to parse any intermixed whitespace in the range, but will emit a warning telling users that we're doing so (**TODO: how would they silence? move to modern syntax?**).
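
A lenient range parser along these lines might look like the following sketch. The `parse_range` function and the warning text are hypothetical; the sketch only covers the `{n}`, `{n,}`, and `{n,m}` forms.

```python
import re
import warnings
from typing import Optional, Tuple

# Whitespace is tolerated around each number and around the comma.
_RANGE = re.compile(r"\{\s*(\d+)\s*(?:(,)\s*(\d*)\s*)?\}")


def parse_range(text: str) -> Tuple[int, Optional[int]]:
    """Parse a range quantifier, returning (lower, upper).

    `upper` is None for an open-ended range such as `{2,}`.
    """
    m = _RANGE.fullmatch(text)
    if m is None:
        raise ValueError(f"not a range quantifier: {text!r}")
    if any(ch.isspace() for ch in text):
        warnings.warn("whitespace in range quantifier is ignored")
    lower = int(m[1])
    if m[2] is None:      # `{n}`: exactly n times
        return lower, lower
    if m[3] == "":        # `{n,}`: n or more times
        return lower, None
    return lower, int(m[3])
```

Both `parse_range("{2,4}")` and `parse_range("{ 2 , 4 }")` return the bounds 2 and 4, but only the second emits a warning.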
185187

186-
#### Implicitly-scoped matching option scopes
188+
### Implicitly-scoped matching option scopes
187189

188190
PCRE and Oniguruma both support changing the active matching options through the `(?i)` expression. However, they have differing semantics when it comes to their scoping. In Oniguruma, it is treated as an implicit new scope that wraps everything until the end of the current group. In PCRE, it is treated as changing the matching option for all the following expressions until the end of the group.
189191

190192
These sound similar, but have different semantics around alternations. For example, in Oniguruma `a(?i)b|c|d` becomes `a(?i:b|c|d)`, where `a` is no longer part of the alternation. However, in PCRE it becomes `a(?i:b)|(?i:c)|(?i:d)`, where `a` remains a child of the alternation.
191193

192194
We aim to support the Oniguruma behavior by default, with a strict-PCRE mode that emulates the PCRE behavior. **TODO: The PCRE behavior is more complex for the parser, but seems less surprising, maybe that should become the default?**
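
The two scoping rules can be sketched as source-to-source rewrites. The functions `scope_oniguruma` and `scope_pcre` are hypothetical and only handle a flat pattern with a single option-setting expression and no nested groups; a real parser would work on the syntax tree instead.

```python
def scope_oniguruma(pattern: str, option: str = "(?i)") -> str:
    """Wrap everything after `option` in a single explicitly scoped group."""
    head, found, tail = pattern.partition(option)
    if not found:
        return pattern
    flags = option[2:-1]  # e.g. "i" from "(?i)"
    return f"{head}(?{flags}:{tail})"


def scope_pcre(pattern: str, option: str = "(?i)") -> str:
    """Apply the option to each following alternation branch separately."""
    head, found, tail = pattern.partition(option)
    if not found:
        return pattern
    flags = option[2:-1]
    branches = tail.split("|")  # flat pattern assumed: no nested groups
    return head + "|".join(f"(?{flags}:{branch})" for branch in branches)


oniguruma_form = scope_oniguruma("a(?i)b|c|d")
pcre_form = scope_pcre("a(?i)b|c|d")
```

For `a(?i)b|c|d`, `oniguruma_form` is `a(?i:b|c|d)` and `pcre_form` is `a(?i:b)|(?i:c)|(?i:d)`.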
193195

194-
#### Backreference condition kinds
196+
### Backreference condition kinds
195197

196198
PCRE and .NET allow for conditional patterns to reference a group by its name, e.g:
197199

@@ -203,15 +205,13 @@ where `y` will only be matched if `(?<group1>x)` was matched. PCRE will always t
203205

204206
We intend to always parse such conditions as an arbitrary regular expression condition, and will emit a warning asking users to explicitly use the syntax `(?('group1')y)` if they want a backreference condition. This more explicit syntax is supported by PCRE.
205207

206-
### Non-conflicting differences
207-
208-
#### `\N`
208+
### `\N`
209209

210210
- PCRE supports `\N` meaning "not a newline"
211211
- PCRE also supports `\N{U+hhhh}`
212212
- ICU supports `\N{UNICODE CHAR NAME}` only
213213

214-
#### Extended character property syntax
214+
### Extended character property syntax
215215

216216
**TODO: Can this be conflicting?**
217217
