Description
For the regex literal syntax, we're looking at supporting a syntactic superset of:
- PCRE2, an "industry standard" of sorts, and a rough superset of Perl, Python, etc.
- Oniguruma, an internationalization-oriented engine with some modern features
- ICU, used by NSRegularExpression, a Unicode-focused engine
- Our interpretation of UTS#18's guidance, which is about semantics, but we can infer syntactic feature sets.
- TODO: .NET, which has delimiter-balancing and some interesting minor details on conditional patterns
These aren't all strictly compatible (e.g. a set operator such as && inside a custom character class would, in PCRE2, just be a redundant statement of a set member). We can explore adding strict compatibility modes, but in general the syntactic superset is fairly straightforward.
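The incompatibility can be seen in any engine that has no set operators. The sketch below uses Python's `re` module purely as a stand-in for such an engine (it is not one of the engines listed above), and `matches_whole` and `LITERAL_SET` are hypothetical names. Under an intersection reading, `[a-c&&b]` would match only `b`; without set operators, the two `&` characters are simply members of the set.

```python
import re
import warnings


def matches_whole(pattern: str, text: str) -> bool:
    """Return True if `pattern` matches all of `text`."""
    with warnings.catch_warnings():
        # Python warns that `&&` inside a set may become an operator later;
        # today it is still read as two literal '&' members.
        warnings.simplefilter("ignore", FutureWarning)
        return re.fullmatch(pattern, text) is not None


# Without set operators this is the set {a, b, c, &}, with '&' stated twice.
LITERAL_SET = r"[a-c&&b]"

for candidate in ["a", "b", "c", "&"]:
    print(candidate, matches_whole(LITERAL_SET, candidate))
```

A strict compatibility mode would have to pick one of the two readings per engine, which is why the superset alone cannot be fully faithful to every engine.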
Status
The following are (roughly) implemented. There may be bugs, but each has some support and some test coverage:
- Alternations `a|b`
- Capture groups e.g. `(x)`, `(?:x)`, `(?<name>x)`
- Escaped character sequences e.g. `\n`, `\a`
- Unicode scalars e.g. `\u{...}`, `\x{...}`, `\uHHHH`
- Builtin character classes e.g. `.`, `\d`, `\w`, `\s`
- Custom character classes `[...]`, including binary operators `&&`, `~~`, `--`
- Quantifiers `x?`, `x+`, `x*`, `x{n,m}`
- Anchors e.g. `\b`, `^`, `$`
- Quoted sequences `\Q ... \E`
- Comments `(?#comment)`
- Character properties `\p{...}`, `[:...:]`
- Named characters `\N{...}`, `\N{U+hh}`
- Lookahead and lookbehind e.g. `(?=)`, `(?!)`, `(*pla:)`, `(?*...)`, `(?<*...)`, `(*napla:...)`
- Script runs e.g. `(*script_run:...)`, `(*sr:...)`, `(*atomic_script_run:...)`, `(*asr:...)`
- Octal sequences `\ddd`, `\o{...}`
- Backreferences e.g. `\1`, `\g2`, `\g{2}`, `\k<name>`, `\k'name'`, `\g{name}`, `\k{name}`, `(?P=name)`
- Matching options e.g. `(?m)`, `(?-i)`, `(?:si)`, `(?^m)`
- Sub-patterns e.g. `\g<n>`, `\g'n'`, `(?R)`, `(?1)`, `(?&name)`, `(?P>name)`
- Conditional patterns e.g. `(?(R)...)`, `(?(n)...)`, `(?(<n>)...)`, `(?('n')...)`, `(?(condition)then|else)`
- PCRE callouts e.g. `(?C2)`, `(?C"text")`
- PCRE backtracking directives e.g. `(*ACCEPT)`, `(*SKIP:NAME)`
- [.NET] Balancing group definitions `(?<name1-name2>...)`
- [Oniguruma] Recursion level for backreferences e.g. `\k<n+level>`, `(?(n+level))`
- [Oniguruma] Extended callout syntax e.g. `(?{...})`, `(*name)`
  - NOTE: In Perl, `(?{...})` has in-line code in it; we could consider the same (for now, we just parse an arbitrary string)
- [Oniguruma] Absent functions e.g. `(?~absent)`
- PCRE global matching options e.g. `(*LIMIT_MATCH=d)`, `(*LF)`
- Extended-mode `(?x)`/`(?xx)` syntax allowing for non-semantic whitespace and end-of-line comments `abc # comment`
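As a rough illustration of a few of the constructs listed above (named captures with a named backreference, lookahead, and extended mode), here is a small sketch. It uses Python's `re` module only as a convenient stand-in engine, so it says nothing about this project's parser or matching behavior. Note that Python spells a named capture `(?P<name>x)` rather than `(?<name>x)`. The pattern names are hypothetical.

```python
import re

# Named capture plus the (?P=name) backreference form: finds a doubled word.
DOUBLED_WORD = re.compile(r"\b(?P<word>\w+)\s+(?P=word)\b")

# Lookahead (?=...): digits count only when "px" follows, and "px" itself
# is not part of the match.
PIXELS = re.compile(r"\d+(?=px)")

# Extended mode (?x): whitespace is non-semantic and '#' starts an
# end-of-line comment.
VERSION = re.compile(r"""(?x)
    (\d+) \.   # major component, then a literal dot
    (\d+)      # minor component
""")

print(DOUBLED_WORD.search("it was the the best").group("word"))
print(PIXELS.findall("12px 7em 30px"))
print(VERSION.fullmatch("5.7").groups())
```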
Experimental syntax
Additionally, we have (even more experimental) support for some syntactic conveniences, if specified. Note that each of these (except perhaps ranges) may introduce a syntactic incompatibility with existing traditional-syntax regexes. Thus, they are mostly illustrative, showing what happens and where we go as we slide down this "slippery slope".
- Non-semantic whitespace: `/a b c/ === /abc/`
- Modern quotes: `/"a.b"/ === /\Qa.b\E/`
- Swift style ranges: `/a{2..<10} b{...3}/ === /a{2,9}b{0,3}/`
- Non-captures: `/a (_: b) c/ === /a(?:b)c/`
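The range equivalence above can be sketched as a textual rewrite from Swift-style ranges to traditional `{n,m}` quantifiers. This is only an illustration, not how the parser works: `desugar_ranges` is a hypothetical helper, it ignores escapes and character classes, it does not strip non-semantic whitespace, and the open-ended `{n...}` form is an assumption that is not stated in the list above.

```python
import re

# {lower..<upper}, {lower...upper}, {...upper}, {..<upper}, {lower...}
_RANGE = re.compile(r"\{(\d*)(\.\.\.|\.\.<)(\d*)\}")


def desugar_ranges(pattern: str) -> str:
    """Rewrite Swift-style range quantifiers into {n,m} form."""

    def rewrite(match: "re.Match[str]") -> str:
        lower, operator, upper = match.groups()
        lower = lower or "0"  # a missing lower bound means zero
        if upper and operator == "..<":
            # Half-open range: the upper bound itself is excluded.
            upper = str(int(upper) - 1)
        return "{" + lower + "," + upper + "}"

    return _RANGE.sub(rewrite, pattern)


print(desugar_ranges("a{2..<10}b{...3}"))
```

The rewritten pattern can then be handed to any traditional engine, which is a quick way to check that the half-open bound was translated as intended.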
TBD:
- Modern named captures: `/a (name: b) c/ === /a(?<name>b)c/`
- Modern comments using `/* comment */` or `// comment` instead of `(?#. comment)`
- Multi-line expressions
  - Line-terminating comments as `// comment`
- Full Swift-lexed comments, string literals as quotes (includes raw and interpolation), etc.
  - Makes sense to add as we suck actual literal lexing through our wormhole in the compiler
Swift's syntactic additions
- Options for selecting a semantic level
  - `X`: grapheme cluster semantics
  - `O`: Unicode scalar semantics
  - `b`: byte semantics
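To make the difference between the levels concrete, consider `e` followed by U+0301 (combining acute accent): one grapheme cluster, two Unicode scalars, three UTF-8 bytes. The sketch below counts how many `.` are needed at the scalar and byte levels, using Python's `re` as a stand-in (Python strings are sequences of code points, which coincide with scalars for this input). Python's standard library has no grapheme-cluster mode, so that level is described in a comment only. The helper names are hypothetical.

```python
import re

ACCENTED_E = "e\u0301"  # 'e' + U+0301 COMBINING ACUTE ACCENT


def matches_as_scalars(text: str, dots: int) -> bool:
    """Does a run of `dots` any-character matchers cover the scalars of text?"""
    return re.fullmatch("." * dots, text, re.DOTALL) is not None


def matches_as_bytes(text: str, dots: int) -> bool:
    """Same question, asked of the UTF-8 encoding one byte at a time."""
    return re.fullmatch(b"." * dots, text.encode("utf-8"), re.DOTALL) is not None


# Grapheme cluster semantics would need one `.` for this input.
print(matches_as_scalars(ACCENTED_E, 1))  # scalar level: one `.` is too few
print(matches_as_scalars(ACCENTED_E, 2))
print(matches_as_bytes(ACCENTED_E, 3))
```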
Source location tracking
Implemented:
- Location of `|` in alternation
- Location of `-` in `[a-f]`
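As a minimal sketch of what tracking these two locations involves, the function below scans a pattern and records the offset of each alternation `|` and each range `-` inside a custom character class. It is a simplified illustration under stated assumptions, not the project's parser: `operator_offsets` is a hypothetical name, offsets are plain character indices, and nested classes, quoted sequences, and extended-mode comments are not handled.

```python
from typing import List, Tuple


def operator_offsets(pattern: str) -> Tuple[List[int], List[int]]:
    """Return (offsets of alternation '|', offsets of range '-' in classes)."""
    bars: List[int] = []
    dashes: List[int] = []
    first_member = -1  # index of the open class's first member, or -1
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":
            i += 2  # skip the escape and the character it escapes
            continue
        if first_member >= 0:  # currently inside [...]
            if ch == "]" and i > first_member:
                first_member = -1  # a leading ']' is a literal member
            elif (ch == "-" and i > first_member
                  and pattern[i + 1:i + 2] not in ("]", "")):
                dashes.append(i)  # '-' between two members is a range
        elif ch == "[":
            first_member = i + 1
            if pattern[i + 1:i + 2] == "^":
                first_member += 1  # negation marker is not a member
        elif ch == "|":
            bars.append(i)
        i += 1
    return bars, dashes


print(operator_offsets(r"(a|b)|[|]|\|c"))
print(operator_offsets("[a-f]"))
```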
TBD:
- TODO: @hamishknight, can you start tracking some of this here?
Integration with the Swift compiler
Initial parser support landed in swiftlang/swift#40595, using the delimiters `'/.../'`, which are lexed in-package.