Documentation/Evolution/RegexSyntax.md
+55 -55 (55 additions & 55 deletions)
@@ -6,7 +6,7 @@
6
6
7
7
We aim to parse a superset of the syntax accepted by a variety of popular regular expression engines.
8
8
9
-
**TODO: Elaborate**
9
+
**TODO(Michael): Elaborate**
10
10
11
11
## Engines supported
12
12
@@ -20,13 +20,13 @@ We aim to implement a syntactic superset of:
20
20
21
21
We also intend to achieve at least Level 1 (**TODO: do we want to promise Level 2?**) [UTS#18][uts18] conformance, which specifies regular expression matching semantics without mandating any particular syntax. However we can infer syntactic feature sets from its guidance.
22
22
23
-
## Regex syntax supported
23
+
**TODO(Michael): Rework and expand prose**
24
24
25
-
### General syntax
25
+
## Detailed Design
26
26
27
-
The following syntax are supported by all the above engines.
27
+
We're proposing the following regular expression syntactic superset for Swift.
Implicitly denoted by adjacent expressions, a concatenation matches against a sequence of regular expression patterns. This has a higher precedence than an alternation, so e.g. `abc|def` matches against `abc` or `def`. The `ConcatComponent` token varies across engines, but at least matches some form of trivia (e.g. comments), quoted sequences (e.g. `\Q...\E`), and a quantified expression.
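The precedence rule can be checked with a small sketch. It uses Python's `re` module purely as a convenient stand-in; Python is not one of the engines this document targets, and the helper name `matches` is hypothetical. Python follows the same concatenation-over-alternation precedence described above.

```python
import re

# Concatenation binds tighter than alternation, so this pattern
# reads as (abc)|(def), not as ab(c|d)ef.
pattern = re.compile(r"abc|def")


def matches(text: str) -> bool:
    """Return True when the whole of `text` matches the pattern."""
    return pattern.fullmatch(text) is not None


# "abdef" and "abcef" would only match under the ab(c|d)ef reading.
results = {text: matches(text) for text in ["abc", "def", "abdef", "abcef"]}
print(results)
```

Only `abc` and `def` match in full, which is what the higher precedence of concatenation predicts.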
46
46
47
-
####Quantification
47
+
### Quantification
48
48
49
49
```
50
50
Quantification -> QuantOperand Quantifier?
@@ -54,7 +54,9 @@ QuantKind -> '?' | '+'
54
54
55
55
Specifies that the operand may be matched against a certain number of times.
56
56
57
-
#### Groups
57
+
**TODO: Briefly mention each and what it means, noting that options can swap eager/reluctant. Might be a good time to introduce the eager/reluctant/possessive terminology**
Groups define a new scope within which a recursive regular expression pattern may occur. Groups have different semantics depending on how they are introduced: some may capture the nested match, some may match against the input without advancing, some may change the matching options set in the new scope, etc.
72
74
73
-
#### Anchors
75
+
**TODO: Something like "note that there are other things that may syntactically appear similarly to groups, but are their own constructs. See .... in-line options, backreferences, ..."**
76
+
77
+
#### Balancing groups
78
+
79
+
```
80
+
BalancingGroupBody -> Identifier? '-' Identifier
81
+
```
82
+
83
+
Introduced by .NET, balancing groups extend the `GroupNameBody` syntax to support the ability to refer to a prior group. Upon matching, the prior group is deleted, and any intermediate matched input becomes the capture of the current group.
These escape sequences denote a specific character. Note that `\b` may only be used in a custom character class, otherwise it represents a word boundary.
.NET supports the ability for a group to reference a prior group, causing the prior group to be deleted, and any intermediate matched input to become the capture of the current group.
129
-
130
-
#### Character class subtraction with `-`
132
+
### Absent functions
131
133
132
134
133
135
134
136
## Syntactic differences between engines
135
137
136
-
### Conflicting differences
137
-
138
-
#### Character class set operations
138
+
### Character class set operations
139
139
140
140
In a custom character class, some engines allow for binary set operations that take two character class inputs, and produce a new character class output. However which set operations are supported and the spellings used differ by engine.
141
141
@@ -147,9 +147,11 @@ In a custom character class, some engines allow for binary set operations that t
147
147
148
148
These differences are conflicting, as engines that don't support a particular operator treat it as literal, e.g. `[x&&y]` in PCRE is the character class of `["x", "&", "y"]` rather than an intersection.
149
149
150
+
Another conflict arises with .NET's support of using the `-` character in a custom character class to denote both a range and a set subtraction. .NET disambiguates this by only permitting its use as a subtraction if the right-hand operand is a nested custom character class; otherwise it is a range. This conflicts with e.g. ICU, where the `-` in `[x-[y]]` is treated as literal.
151
+
150
152
We intend to support the operators `&&`, `--`, `-`, and `~~`. This means that any regex literal that contains these sequences in a custom character class, but was written for an engine that does not support that operation, will have a different semantic meaning in our engine. This ought not to be a common occurrence, as specifying a character multiple times in a custom character class is redundant. Nevertheless, we intend to provide a strict compatibility mode that may be used to emulate the behavior of a particular engine (**TODO: all engines, or just PCRE?**).
151
153
152
-
####Nested custom character classes
154
+
### Nested custom character classes
153
155
154
156
This allows e.g `[[a]b[c]]`, which is interpreted the same as `[abc]`.
155
157
@@ -163,35 +165,35 @@ PCRE does not support this feature, and as such treats `]` as the closing charac
163
165
164
166
We aim to support nested custom character classes, with a strict PCRE mode for emulating the PCRE behavior if desired.
165
167
166
-
####`\U`
168
+
### `\U`
167
169
168
170
In PCRE, if `PCRE2_ALT_BSUX` or `PCRE2_EXTRA_ALT_BSUX` is specified, `\U` matches literal `U`. However, in ICU, `\Uhhhhhhhh` matches a hex sequence.
169
171
170
-
####`{,n}`
172
+
### `{,n}`
171
173
172
174
This quantifier is supported by Oniguruma, but in PCRE it matches the literal chars.
173
175
174
-
####\0DDD
176
+
### `\0DDD`
175
177
176
178
In ICU, `DDD` are interpreted as an octal code. In PCRE, only the first two digits are interpreted as octal, the last is literal.
177
179
178
-
####`\x`
180
+
### `\x`
179
181
180
182
In PCRE, a bare `\x` denotes the NUL character (`U+0000`). In Oniguruma, it denotes literal `x`.
181
183
182
-
####Whitespace in ranges
184
+
### Whitespace in ranges
183
185
184
186
In PCRE, `x{2,4}` is a range quantifier meaning that `x` can be matched from 2 to 4 times. However if whitespace is introduced in the range, it becomes invalid and is then treated as the literal characters. We find this behavior to be unintuitive, and therefore intend to parse any intermixed whitespace in the range, but will emit a warning telling users that we're doing so (**TODO: how would they silence? move to modern syntax?**).
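The literal fallback described above can be reproduced in a short sketch. It uses Python's `re` module as a stand-in, on the assumption that seeing the behavior in any engine helps. Python's `re` happens to fall back to literal characters in the same way, but it is not PCRE, and this is not the parser proposed here.

```python
import re

# Well-formed range quantifier: x repeated 2 to 4 times.
strict = re.compile(r"x{2,4}")

# Whitespace inside the braces: Python's `re` no longer sees a
# quantifier and treats every character after `x` as a literal.
spaced = re.compile(r"x{2, 4}")

strict_hits = [s for s in ["x", "xx", "xxxx", "xxxxx"] if strict.fullmatch(s)]
spaced_literal = spaced.fullmatch("x{2, 4}") is not None
spaced_as_range = spaced.fullmatch("xxx") is not None

print(strict_hits, spaced_literal, spaced_as_range)
```

The spaced pattern matches only the literal text `x{2, 4}`, not `xxx`, which is the unintuitive result the warning is meant to flag.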
185
187
186
-
####Implicitly-scoped matching option scopes
188
+
### Implicitly-scoped matching option scopes
187
189
188
190
PCRE and Oniguruma both support changing the active matching options through the `(?i)` expression. However, they have differing semantics when it comes to their scoping. In Oniguruma, it is treated as an implicit new scope that wraps everything until the end of the current group. In PCRE, it is treated as changing the matching option for all the following expressions until the end of the group.
189
191
190
192
These sound similar, but have different semantics around alternations, e.g. for `a(?i)b|c|d`, in Oniguruma this becomes `a(?i:b|c|d)`, where `a` is no longer part of the alternation. However, in PCRE it becomes `a(?i:b)|(?i:c)|(?i:d)`, where `a` remains a child of the alternation.
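The two readings can be compared by running the explicitly scoped forms quoted above. The sketch below uses Python's `re` module as a neutral stand-in. It does not run `a(?i)b|c|d` itself, because Python does not accept that form, and the names `oniguruma_style` and `pcre_style` are illustrative labels only.

```python
import re

# The two explicit rewrites of `a(?i)b|c|d` given in the text.
oniguruma_style = re.compile(r"a(?i:b|c|d)")            # `a` precedes the whole alternation
pcre_style = re.compile(r"a(?i:b)|(?i:c)|(?i:d)")       # `a` belongs to the first branch only

inputs = ["aB", "aC", "C", "Ab"]
onig_hits = [s for s in inputs if oniguruma_style.fullmatch(s)]
pcre_hits = [s for s in inputs if pcre_style.fullmatch(s)]

print(onig_hits)
print(pcre_hits)
```

The inputs `aC` and `C` separate the two interpretations: the first matches only under the Oniguruma-style scoping, the second only under the PCRE-style scoping. `Ab` matches under neither, since `a` stays case-sensitive in both.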
191
193
192
194
We aim to support the Oniguruma behavior by default, with a strict-PCRE mode that emulates the PCRE behavior. **TODO: The PCRE behavior is more complex for the parser, but seems less surprising, maybe that should become the default?**
193
195
194
-
####Backreference condition kinds
196
+
### Backreference condition kinds
195
197
196
198
PCRE and .NET allow for conditional patterns to reference a group by its name, e.g:
197
199
@@ -203,15 +205,13 @@ where `y` will only be matched if `(?<group1>x)` was matched. PCRE will always t
203
205
204
206
We intend to always parse such conditions as an arbitrary regular expression condition, and will emit a warning asking users to explicitly use the syntax `(?('group1')y)` if they want a backreference condition. This more explicit syntax is supported by PCRE.
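For reference, this is how a backreference condition behaves at match time. The sketch uses Python's `re` module as a stand-in, which supports only the backreference form `(?(name)...)` and spells named groups `(?P<name>...)` rather than `(?<name>...)`. The pattern is an assumption modeled on the `group1` example above, not taken from any of the engines discussed.

```python
import re

# Optional named group, then a condition on whether that group matched:
# `y` is required only when `group1` matched `x`.
pattern = re.compile(r"(?P<group1>x)?(?(group1)y)")

outcomes = {text: pattern.fullmatch(text) is not None for text in ["xy", "x", "y", ""]}
print(outcomes)
```

Here `xy` and the empty string match in full, while `x` alone fails because the condition then requires `y`, and `y` alone fails because the condition contributes nothing when `group1` did not match.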