Syntax Status and Roadmap #63

Closed
@milseman

For the regex literal syntax, we're looking at supporting a syntactic superset of:

  • PCRE2, an "industry standard" of sorts, and a rough superset of Perl, Python, etc.

  • Oniguruma, an internationalization-oriented engine with some modern features

  • ICU, used by NSRegularExpression, a Unicode-focused engine

  • Our interpretation of UTS#18's guidance, which is about semantics, but from which we can infer syntactic feature sets.

  • TODO: .NET, which has delimiter-balancing and some interesting minor details on conditional patterns

These aren't all strictly compatible (e.g. a set operator in PCRE2 would just be a redundant statement of a set member). We can explore adding strict compatibility modes, but in general the syntactic superset is fairly straightforward.
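
To make the set-operator incompatibility concrete, here is a minimal sketch using Python's `re` module. Python is only a stand-in for an engine without set operators; it is not one of the engines listed above, and the helper name `traditional_members` is made up for illustration. In such an engine, `&&` inside a custom character class is just the literal `&` stated twice, so `[a-c&&b]` matches `a`, `b`, `c` and `&`, whereas an intersection reading would leave only `b`.

```python
import re
import warnings

def traditional_members(pattern: str, candidates: str) -> str:
    """Return the candidate characters that `pattern` matches in Python's `re`."""
    with warnings.catch_warnings():
        # Python warns that `&&` might become a set operator in a future version.
        warnings.simplefilter("ignore", FutureWarning)
        compiled = re.compile(pattern)
    return "".join(ch for ch in candidates if compiled.fullmatch(ch))

# Without set operators, the class is the union of a-c, '&', '&' and 'b'.
matched = traditional_members("[a-c&&b]", "abc&d")
print(matched)
```

A strict compatibility mode would have to pick one of the two readings for such a pattern.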

Status

The features below are (roughly) implemented. There may be bugs, but we have some support and some testing coverage:

  • Alternations a|b
  • Capture groups e.g. (x), (?:x), (?<name>x)
  • Escaped character sequences e.g. \n, \a
  • Unicode scalars e.g. \u{...}, \x{...}, \uHHHH
  • Built-in character classes e.g. ., \d, \w, \s
  • Custom character classes [...], including binary operators &&, ~~, --
  • Quantifiers x?, x+, x*, x{n,m}
  • Anchors e.g. \b, ^, $
  • Quoted sequences \Q ... \E
  • Comments (?#comment)
  • Character properties \p{...}, [:...:]
  • Named characters \N{...}, \N{U+hh}
  • Lookahead and lookbehind e.g. (?=), (?!), (*pla:...), (?*...), (?<*...), (*napla:...)
  • Script runs e.g. (*script_run:...), (*sr:...), (*atomic_script_run:...), (*asr:...)
  • Octal sequences \ddd, \o{...}
  • Backreferences e.g. \1, \g2, \g{2}, \k<name>, \k'name', \g{name}, \k{name}, (?P=name)
  • Matching options e.g. (?m), (?-i), (?:si), (?^m)
  • Sub-patterns e.g. \g<n>, \g'n', (?R), (?1), (?&name), (?P>name)
  • Conditional patterns e.g. (?(R)...), (?(n)...), (?(<n>)...), (?('n')...), (?(condition)then|else)
  • PCRE callouts e.g. (?C2), (?C"text")
  • PCRE backtracking directives e.g. (*ACCEPT), (*SKIP:NAME)
  • [.NET] Balancing group definitions (?<name1-name2>...)
  • [Oniguruma] Recursion level for backreferences e.g. \k<n+level>, (?(n+level))
  • [Oniguruma] Extended callout syntax e.g. (?{...}), (*name)
    • NOTE: In Perl, (?{...}) contains in-line code; we could consider the same (for now, we just parse an arbitrary string)
  • [Oniguruma] Absent functions e.g. (?~absent)
  • PCRE global matching options e.g. (*LIMIT_MATCH=d), (*LF)
  • Extended-mode (?x)/(?xx) syntax allowing for non-semantic whitespace and end-of-line comments abc # comment
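
As a rough illustration of what a few of the entries above mean in practice, the sketch below runs them through Python's `re` module. Python stands in for a traditional engine here and is not the parser tracked in this issue. It only accepts the `(?P<name>...)` spelling of named captures, so that spelling is used together with the `(?P=name)` backreference from the list. The variable names are arbitrary.

```python
import re

# Named capture plus a named backreference, in the (?P<name>...) / (?P=name) spelling.
doubled = re.compile(r"(?P<word>\w+) (?P=word)")

# Extended mode (?x): whitespace is non-semantic and `#` starts an end-of-line comment.
extended = re.compile(r"""(?x)
    a b c   # matches the same input as the pattern abc
""")

# Lookahead (?=...) followed by an inline comment group (?#...).
before_digit = re.compile(r"[a-z]+(?=\d)(?#letters followed by a digit)")

word = doubled.fullmatch("hello hello").group("word")
is_abc = extended.fullmatch("abc") is not None
letters = before_digit.search("abc1").group(0)
print(word, is_abc, letters)
```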

Experimental syntax

Additionally, we have (even more experimental) support for some syntactic conveniences, if specified. Note that each of these (except perhaps ranges) may introduce a syntactic incompatibility with existing traditional-syntax regexes. Thus, they are mostly illustrative, showing what happens and where we go as we slide down this "slippery slope".

  • Non-semantic whitespace: /a b c/ === /abc/
  • Modern quotes: /"a.b"/ === /\Qa.b\E/
  • Swift-style ranges: /a{2..<10} b{...3}/ === /a{2,9}b{0,3}/
  • Non-captures: /a (_: b) c/ === /a(?:b)c/
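
The equivalences above can be read as a desugaring into traditional syntax. The sketch below is a deliberately naive Python version of that idea, not how the parser in this issue works: `desugar`, `_rewrite_range` and `_SWIFT_RANGE` are hypothetical names, escapes and custom character classes are ignored, and modern quotes are left out.

```python
import re

# Hypothetical desugaring helpers; names and behavior are illustrative assumptions.
_SWIFT_RANGE = re.compile(r"\{(\d*)(\.\.\.|\.\.<)(\d+)\}")

def _rewrite_range(match):
    lower, operator, upper = match.groups()
    low = int(lower) if lower else 0          # {...3} has an implicit lower bound of 0
    high = int(upper) - 1 if operator == "..<" else int(upper)
    return "{%d,%d}" % (low, high)

def desugar(pattern: str) -> str:
    """Naively rewrite the experimental conveniences into traditional syntax."""
    pattern = re.sub(r"\s+", "", pattern)     # non-semantic whitespace
    pattern = pattern.replace("(_:", "(?:")   # non-capturing group
    return _SWIFT_RANGE.sub(_rewrite_range, pattern)

ranges = desugar("a{2..<10} b{...3}")
groups = desugar("a (_: b) c")
print(ranges, groups)
```

The blanket whitespace removal also hints at the incompatibility risk: a traditional-syntax regex that relies on a literal space would change meaning under this rewrite.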

TBD:

  • Modern named captures: /a (name: b) c/ === /a(?<name>b)c/
  • Modern comments using /* comment */ or // comment instead of (?#comment)
  • Multi-line expressions
    • Line-terminating comments as // comment
  • Full Swift-lexed comments, string literals as quotes (includes raw and interpolation), etc.
    • Makes sense to add as we suck actual literal lexing through our wormhole in the compiler

Swift's syntactic additions

  • Options for selecting a semantic level
    • X: grapheme cluster semantics
    • O: Unicode scalar semantics
    • b: byte semantics
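
To show why the semantic level matters, here is a small sketch that counts what `.` matches in the string "e" followed by a combining acute accent. It uses Python's `re` as an approximation: `str` patterns behave per code point (comparable to Unicode scalar semantics) and `bytes` patterns behave per byte over UTF-8. The standard library has no grapheme-cluster mode, so that level is only described in a comment.

```python
import re

# "e" followed by U+0301 COMBINING ACUTE ACCENT: one user-perceived character.
text = "e\u0301"

# str patterns match per code point, comparable to Unicode scalar semantics.
scalar_count = len(re.findall(r".", text))

# bytes patterns match per byte, comparable to byte semantics over UTF-8.
byte_count = len(re.findall(rb".", text.encode("utf-8"), flags=re.DOTALL))

# Under grapheme cluster semantics, `.` would be expected to match the whole
# cluster once; Python's stdlib `re` cannot express that, so it is not computed.
print(scalar_count, byte_count)
```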

Source location tracking

Implemented:

  • Location of | in alternation
  • Location of - in [a-f]

TBD:

Integration with the Swift compiler

Initial parser support landed in swiftlang/swift#40595, using the delimiters '/.../', which are lexed in-package.
