From 381a3c7ed397b9e5b19d5b1511625a173140b4de Mon Sep 17 00:00:00 2001 From: Hamish Knight Date: Fri, 18 Mar 2022 17:57:28 +0000 Subject: [PATCH 01/36] Add DelimiterSyntax.md --- Documentation/Evolution/DelimiterSyntax.md | 164 +++++++++++++++++++++ 1 file changed, 164 insertions(+) create mode 100644 Documentation/Evolution/DelimiterSyntax.md diff --git a/Documentation/Evolution/DelimiterSyntax.md b/Documentation/Evolution/DelimiterSyntax.md new file mode 100644 index 000000000..436c27a84 --- /dev/null +++ b/Documentation/Evolution/DelimiterSyntax.md @@ -0,0 +1,164 @@ +# Regular Expression Literal Delimiters + +- Authors: Hamish Knight, Michael Ilseman + +## Introduction + +**TODO** + +**TODO: Motivation for regex literals in the first place? Or is that a given?** + +**TODO: Overview of regex literals in other languages?** + +## Detailed Design + +A regular expression literal will be introduced using `re'...'` delimiters, within which the compiler will parse a regular expression (the details of which are outlined in [the Regex Syntax pitch][internal-syntax]): + +``` +// Matches "<identifier> = <hexadecimal value>", extracting the identifier and hex number +let regex = re'([[:alpha:]]\w*) = ([0-9A-F]+)' +``` + +The use of a two letter prefix allows for easy future extensibility of such literals, by allowing different prefixes to indicate different types of literal. **TODO: examples** + +### Parsing ambiguities + +The use of a single quote delimiter has a minor conflict with a couple of items of regex grammar, mainly around named groups. This includes `(?'name')`, `(?('name'))`, `\g'name'`, `\k'name'`, and `(?C'arg')`. Fortunately, alternative syntax exists for all of these constructs, e.g `(?<name>)`, `\k<name>`. However we still aim to parse the single quote variants of the syntax to achieve the syntactic superset of regex grammar. + +To do this, a heuristic will be used when lexing a regex literal, and will check for the ending sequences `(?`, `(?(`, `\g`, `\k` and `(?C`. 
On encountering these, the lexer will attempt to scan ahead to the next `'` character, and then to the `'` that closes the literal. It should be noted that these are not valid regex endings, and as such this cannot break valid code. + +**TODO: Or do we want to insist on the user using raw `re#'...'#` syntax?** + +## Future Directions + +### Raw literals + +The `re'...'` syntax could be naturally extended to supporting "raw text" through allowing additional `#` characters to surround the quote characters e.g `re#'...'#`. Such literals would follow the same rules as the string literals introduced in [SE-0200]. + +In particular: + +- `\` and `'` characters would become literal, e.g `re#''\n''#` expresses a regular expression pattern that literally matches against the characters `'\n'` (including the quotes). +- Any number of `#` characters may surround the literal. +- Escape sequences would require the same number of `#` characters as in the delimiter to be treated specially. For example, `re##'\##n'##` would be required for a newline character sequence. + +### Multi-line literals + +A natural extension to the `re'...'` syntax to support multi-line regex literals would be to allow triple quote syntax: + +``` +re''' + abc + def + ''' +``` + +This would follow the precedent set by [SE-0168] for multi-line string literals, and obey the same rules, in particular with the stripping of any leading whitespace prior to the position of the closing delimiter. + +## Alternatives Considered + +### Double quoted `re"...."` + +We could choose to use double quotes instead of single quotes. This would be similar in appearance to string literals, however it could be argued that regex literals are distinct from string literals in that they introduce their own specific language to parse. As such, regex literals are more like "program literals" than "data literals", and the use of single quote instead of double quote could express this difference. 
+ +### Single letter `r'...'` + +We could choose to shorten the literal prefix to just `r`. However this could potentially be confused to mean "raw", especially as Python uses this syntax for raw strings. The syntax `re'...'` could also set the precedent for a 2 letter namespace for future literals. + +### Forward slashes `/.../` + +Forward slashes are a regex term of art, and are used as the delimiters for regex literals in Perl, JavaScript and Ruby (though Perl and Ruby also provide alternative choices). However, they would be an awkward fit in Swift's language grammar, and would not provide a path for extensibility. + +#### Parsing ambiguities + +The primary parsing ambiguity with `/.../` delimiters is with comment syntax. + +An empty regex literal would conflict with line comment syntax `//`. While this isn't a particularly useful thing to express, it may lead to an awkward user typing experience. In particular, as you begin to type a regex literal, a comment could be formed before you start typing the contents. This could however be mitigated by source tooling. + +Line comment syntax additionally means that a potential multi-line version of a regular expression literal would not be able to use `///` delimiters, in accordance with the precedent set by multi-line string literals `"""`. + +There is also a conflict with block comment syntax, when surrounding a regex literal ending with `*`, for example: + +```swift +/* +let regex = /x*/ +*/ +``` + +In this case, the block comment would prematurely end on the second line, rather than extending all the way to the third line as the user would expect. This is already an issue today with `*/` in a string literal, however it is much more likely to occur in a regular expression given the prevalence of the `*` quantifier. + +Block comment syntax also means that a regex literal would not be able to start with the `*` character, however this is less of a concern as it would not be valid regex syntax. 
+ +Finally, there would be a minor ambiguity with infix operators used with regex literals. When used without whitespace, e.g `x+/y/`, the expression will be treated as using an infix operator `+/`. Whitespace is therefore required `x + /y/` for regex literal interpretation. + +#### Regex limitations + +Another ambiguity with `/.../` arises when it is used to start a new line. This is particularly problematic for result builders, where we expect it to be frequently used, for example: + +```swift +Builder { + 1 + / 2 / + 3 +} +``` + +This is parsed as a single operator chain, however it is likely the user is expecting a regex literal. To resolve this ambiguity, a regex literal may not start with a space or tab character. This takes advantage of the fact that infix operators require consistent spacing. + +If a space or tab is needed as the first character, it must be escaped, e.g: + +```swift +Builder { + 1 + /\ 2 / + 3 +} +``` + +**TODO: Regex starting with `)`** + +#### Language changes required + +In addition to ambiguities listed above, there are also some parsing ambiguities that would require the following language changes: + +- Deprecation of prefix operators containing the `/` character. +- Potentially parsing `/,` as the start of a regex literal rather than an unapplied operator in an argument list e.g `fn(/, 5) + fn(/, 3)`. + +
Rationale + +##### Prefix operators starting with `/` + +We'd need to ban prefix operators starting with `/`, to avoid ambiguity with cases such as: + +```swift +let x = /0; let y = 1/ +let z = /^x^/ +``` + +Postfix `/` operators would be okay, as they'd only be treated as regex literal delimiters if we were already trying to lex as a regex literal. + +##### Prefix operators containing `/` + +Prefix operators *containing* `/` (not just at the start) would likely need banning too, in order to allow prefix operators to be used with regex literals in an unambiguous way, e.g: + +```swift +let x = !/y / .foo() +``` + +Otherwise it would be interpreted as the prefix operator `!/` by default, and require parens `!(/y /)` for regex parsing. + +##### Comma as the starting character of a regex literal + +**TODO: Or do we want to ban it as the starting character?** + +### Pound slash `#/.../#` + +This would be less syntactically ambiguous than `/.../`, while retaining some of the term-of-art familiarity. It would also provide a natural path through which to introduce `/.../` in a new language mode, as users could drop the `#` characters once they upgrade. + +However this option would also have the same block comment issue as `/.../` where e.g `#/x*/#` nested inside a block comment would prematurely end. Similarly, it's not clear how a multi-line version of the literal would be spelled. + +Additionally, introducing this syntax would introduce an inconsistency with raw string literal syntax, as `#/.../#` on its own would not treat backslashes as literal, unlike `#"..."#`. If raw regex syntax were implemented, it would start at `##/.../##`. With raw strings, escape sequences must use the same number of `#`s as the delimiter, e.g `#"\#n"#` for a newline. However for raw regex literals it would be one fewer `#` than the delimiter e.g `##/\#n/##`. 
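For comparison, the existing raw string rule from [SE-0200] that the paragraph above refers to can be checked in today's Swift. The following is a minimal sketch using ordinary string literals only; it does not use any of the pitched regex syntax, and the variable names are purely illustrative.

```swift
// Standard Swift string literals (SE-0200). No regex literal syntax is involved.
let cooked = "a\nb"           // "\n" is a newline escape
let raw = #"a\nb"#            // backslash and "n" are two literal characters
let rawEscaped = #"a\#nb"#    // "\#n" is the newline escape in a single-# raw string
let doubleRaw = ##"a\#nb"##   // with two #s in the delimiter, "\#n" is literal text again

print(cooked.count)           // 3
print(raw.count)              // 4
print(rawEscaped == cooked)   // true
print(doubleRaw.count)        // 5
```

In each raw string the escape needs exactly as many `#`s as the delimiter, which is the rule a `#/.../#` regex literal would diverge from.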
+ + +[SE-0168]: https://github.com/apple/swift-evolution/blob/main/proposals/0168-multi-line-string-literals.md +[SE-0200]: https://github.com/apple/swift-evolution/blob/main/proposals/0200-raw-string-escaping.md +[internal-syntax]: https://forums.swift.org/t/pitch-regex-syntax/55711 From c79f457b738c74f5a85e22f1f2ce8d38e905b360 Mon Sep 17 00:00:00 2001 From: David Ewing Date: Sat, 19 Mar 2022 23:34:06 -0600 Subject: [PATCH 02/36] Expand on parsing issues with `/` as delimited. Add a note about editor support. --- Documentation/Evolution/DelimiterSyntax.md | 38 ++++++++++++++++------ 1 file changed, 28 insertions(+), 10 deletions(-) diff --git a/Documentation/Evolution/DelimiterSyntax.md b/Documentation/Evolution/DelimiterSyntax.md index 436c27a84..c8ba8b0ed 100644 --- a/Documentation/Evolution/DelimiterSyntax.md +++ b/Documentation/Evolution/DelimiterSyntax.md @@ -1,6 +1,6 @@ # Regular Expression Literal Delimiters -- Authors: Hamish Knight, Michael Ilseman +- Authors: Hamish Knight, Michael Ilseman, David Ewing ## Introduction @@ -29,6 +29,7 @@ To do this, a heuristic will be used when lexing a regex literal, and will check **TODO: Or do we want to insist on the user using raw `re#'...'#` syntax?** + ## Future Directions ### Raw literals @@ -66,17 +67,17 @@ We could choose to shorten the literal prefix to just `r`. However this could po ### Forward slashes `/.../` -Forward slashes are a regex term of art, and are used as the delimiters for regex literals in Perl, JavaScript and Ruby (though Perl and Ruby also provide alternative choices). However, they would be an awkward fit in Swift's language grammar, and would not provide a path for extensibility. +Forward slashes are a regex term of art, and are used as the delimiters for regex literals in Perl, JavaScript and Ruby (though Perl and Ruby also provide alternative choices). However, they would be an awkward fit in Swift's language grammar, and would not provide a path for extensibility. 
Here we give an extensive list of drawbacks to the choice. While no individual issue is terribly bad and each could be overcome, the list of issues is quite long. #### Parsing ambiguities -The primary parsing ambiguity with `/.../` delimiters is with comment syntax. +The obvious parsing ambiguity with `/.../` delimiters is with comment syntaxes. -An empty regex literal would conflict with line comment syntax `//`. While this isn't a particularly useful thing to express, it may lead to an awkward user typing experience. In particular, as you begin to type a regex literal, a comment could be formed before you start typing the contents. This could however be mitigated by source tooling. +- An empty regex literal would conflict with line comment syntax `//`. But this isn't a particularly useful thing to express, and could be disallowed. -Line comment syntax additionally means that a potential multi-line version of a regular expression literal would not be able to use `///` delimiters, in accordance with the precedent set by multi-line string literals `"""`. +- The obvious choice for a multi-line regular expression literal would be to use `///` delimiters, in accordance with the precedent set by multi-line string literals `"""`. A different multi-line delimiter would be needed, with no obvious choice. -There is also a conflict with block comment syntax, when surrounding a regex literal ending with `*`, for example: +- There is also a conflict with block comment syntax, when surrounding a regex literal ending with `*`, for example: ```swift /* @@ -84,11 +85,11 @@ let regex = /x*/ */ ``` -In this case, the block comment would prematurely end on the second line, rather than extending all the way to the third line as the user would expect. This is already an issue today with `*/` in a string literal, however it is much more likely to occur in a regular expression given the prevalence of the `*` quantifier. 
+ In this case, the block comment would prematurely end on the second line, rather than extending all the way to the third line as the user would expect. This is already an issue today with `*/` in a string literal, however it is much more likely to occur in a regular expression given the prevalence of the `*` quantifier. -Block comment syntax also means that a regex literal would not be able to start with the `*` character, however this is less of a concern as it would not be valid regex syntax. +- Block comment syntax also means that a regex literal would not be able to start with the `*` character, however this is less of a concern as it would not be valid regex syntax. -Finally, there would be a minor ambiguity with infix operators used with regex literals. When used without whitespace, e.g `x+/y/`, the expression will be treated as using an infix operator `+/`. Whitespace is therefore required `x + /y/` for regex literal interpretation. +- Finally, there would be a minor ambiguity with infix operators used with regex literals. When used without whitespace, e.g `x+/y/`, the expression will be treated as using an infix operator `+/`. Whitespace is therefore required `x + /y/` for regex literal interpretation. #### Regex limitations @@ -145,11 +146,28 @@ let x = !/y / .foo() ``` Otherwise it would be interpreted as the prefix operator `!/` by default, and require parens `!(/y /)` for regex parsing. + + + +**TODO: More cases from slack discussion ** + +```swift +func foo(_ x: (Int, Int) -> Int, _ y: (Int, Int) -> Int) {} +foo(/, /) +``` + +`foo(/, "(") / 2` !!! + + ##### Comma as the starting character of a regex literal **TODO: Or do we want to ban it as the starting character?** - + +#### Editor Considerations + +Many source editors in use today do rather simplistic syntax coloring of programming languages. And there's a long history of complaints about syntax coloring of regular expressions in Perl, JavaScript and Ruby to be found on the internet. 
While the most popular editors do a very good job recognizing the most common incantations of a regular expression in each language, most still don't get it 100% right. There's just a lot of work involved in doing that. If parsing Swift regular expressions is as difficult as these other languages because of the choice of delimiter, it becomes a barrier to entry for support by those editors. + ### Pound slash `#/.../#` This would be less syntactically ambiguous than `/.../`, while retaining some of the term-of-art familiarity. It would also provide a natural path through which to introduce `/.../` in a new language mode, as users could drop the `#` characters once they upgrade. From 731292e2ad3bffbc261edc9abe90ffc875c3c89e Mon Sep 17 00:00:00 2001 From: David Ewing Date: Sun, 20 Mar 2022 21:54:34 -0600 Subject: [PATCH 03/36] Rewrite that Editor Considerations paragraph. --- Documentation/Evolution/DelimiterSyntax.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Documentation/Evolution/DelimiterSyntax.md b/Documentation/Evolution/DelimiterSyntax.md index c8ba8b0ed..b870f521e 100644 --- a/Documentation/Evolution/DelimiterSyntax.md +++ b/Documentation/Evolution/DelimiterSyntax.md @@ -166,7 +166,7 @@ foo(/, /) #### Editor Considerations -Many source editors in use today do rather simplistic syntax coloring of programming languages. And there's a long history of complaints about syntax coloring of regular expressions in Perl, JavaScript and Ruby to be found on the internet. While the most popular editors do a very good job recognizing the most common incantations of a regular expression in each language, most still don't get it 100% right. There's just a lot of work involved in doing that. If parsing Swift regular expressions is as difficult as these other languages because of the choice of delimiter, it becomes a barrier to entry for support by those editors. 
+As described above, there would be a lot involved in handling the parsing ambiguities with `/.../` delimiters. It's one thing to do this in the compiler. But the language also has to be understood by a plethora of source code editors. Those editors either need to encode all those ambiguities, or they need to provide a "best effort" at handling the most common cases. It's all too common for editors to take the "best effort" route. There's a long history of complaints with editors that don't completely support a language's features. And indeed, there's plenty of history of editors that don't correctly support regular expression literals in other languages. By choosing a literal that is easily parsed, we should avoid seeing those complaints regarding Swift. ### Pound slash `#/.../#` From 104927631a479c573a384bc55edbb474fa5d169c Mon Sep 17 00:00:00 2001 From: Hamish Knight Date: Mon, 21 Mar 2022 14:03:36 +0000 Subject: [PATCH 04/36] Change single quote constructs to be invalid --- Documentation/Evolution/DelimiterSyntax.md | 9 +++------ 1 file changed, 3 insertions(+), 6 deletions(-) diff --git a/Documentation/Evolution/DelimiterSyntax.md b/Documentation/Evolution/DelimiterSyntax.md index b870f521e..45af08118 100644 --- a/Documentation/Evolution/DelimiterSyntax.md +++ b/Documentation/Evolution/DelimiterSyntax.md @@ -21,14 +21,11 @@ let regex = re'([[:alpha:]]\w*) = ([0-9A-F]+)' The use of a two letter prefix allows for easy future extensibility of such literals, by allowing different prefixes to indicate different types of literal. **TODO: examples** -### Parsing ambiguities +### Regex limitations -The use of a single quote delimiter has a minor conflict with a couple of items of regex grammar, mainly around named groups. This includes `(?'name')`, `(?('name'))`, `\g'name'`, `\k'name'`, and `(?C'arg')`. Fortunately, alternative syntax exists for all of these constructs, e.g `(?<name>)`, `\k<name>`. 
However we still aim to parse the single quote variants of the syntax to achieve the syntactic superset of regex grammar. - -To do this, a heuristic will be used when lexing a regex literal, and will check for the ending sequences `(?`, `(?(`, `\g`, `\k` and `(?C`. On encountering these, the lexer will attempt to scan ahead to the next `'` character, and then to the `'` that closes the literal. It should be noted that these are not valid regex endings, and as such this cannot break valid code. - -**TODO: Or do we want to insist on the user using raw `re#'...'#` syntax?** +There are a few items of regex grammar that use the single quote character as a metacharacter. These include named group definitions and references such as `(?'name')`, `(?('name'))`, `\g'name'`, `\k'name'`, as well as callout syntax `(?C'arg')`. The use of a single quote conflicts with the `re'...'` delimiter as it will be considered the end of the literal. Fortunately, alternative syntax exists for all of these constructs, e.g `(?<name>)`, `\k<name>`, and `(?C"arg")`. +As such, the single quote variants of the syntax will be considered invalid in a `re'...'` literal, and users must use the alternative syntax. If a raw variant of the syntax, `re#'...'#`, is later added, that may also be used. In order to improve diagnostic behavior, the compiler will attempt to scan ahead when encountering the ending sequences `(?`, `(?(`, `\g`, `\k` and `(?C`. This will enable a more accurate error to be emitted that suggests the alternative syntax. 
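To illustrate that the angle-bracket spelling of named groups is a workable alternative to the single quote spelling, here is a minimal sketch using Foundation's existing `NSRegularExpression` rather than the pitched `re'...'` literal. The group names `identifier` and `hex` and the sample input are assumptions made for the example; the pattern otherwise mirrors the one shown under Detailed Design.

```swift
import Foundation

// ICU pattern using the (?<name>...) spelling instead of (?'name'...).
let pattern = #"(?<identifier>[[:alpha:]]\w*) = (?<hex>[0-9A-F]+)"#
let regex = try NSRegularExpression(pattern: pattern)

let input = "count = 1F"
let wholeInput = NSRange(input.startIndex..., in: input)

var captures: [String: String] = [:]
if let match = regex.firstMatch(in: input, range: wholeInput) {
    for name in ["identifier", "hex"] {
        // range(withName:) looks up a capture by its group name.
        if let range = Range(match.range(withName: name), in: input) {
            captures[name] = String(input[range])
        }
    }
}

print(captures["identifier"] ?? "none", captures["hex"] ?? "none")  // count 1F
```

Since none of these constructs needs a `'` character, the same pattern text would not collide with a single quote delimiter.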
## Future Directions From c7d556cb6b25531b7f455b23127b641f5cf11c14 Mon Sep 17 00:00:00 2001 From: Hamish Knight Date: Mon, 21 Mar 2022 14:40:25 +0000 Subject: [PATCH 05/36] Elaborate on starting character limitations --- Documentation/Evolution/DelimiterSyntax.md | 50 ++++++++++++++++------ 1 file changed, 38 insertions(+), 12 deletions(-) diff --git a/Documentation/Evolution/DelimiterSyntax.md b/Documentation/Evolution/DelimiterSyntax.md index 45af08118..8beb0b113 100644 --- a/Documentation/Evolution/DelimiterSyntax.md +++ b/Documentation/Evolution/DelimiterSyntax.md @@ -25,7 +25,7 @@ The use of a two letter prefix allows for easy future extensibility of such lite There are a few items of regex grammar that use the single quote character as a metacharacter. These include named group definitions and references such as `(?'name')`, `(?('name'))`, `\g'name'`, `\k'name'`, as well as callout syntax `(?C'arg')`. The use of a single quote conflicts with the `re'...'` delimiter as it will be considered the end of the literal. Fortunately, alternative syntax exists for all of these constructs, e.g `(?<name>)`, `\k<name>`, and `(?C"arg")`. -As such, the single quote variants of the syntax will be considered invalid in a `re'...'` literal, and users must use the alternative syntax. If a raw variant of the syntax, `re#'...'#`, is later added, that may also be used. In order to improve diagnostic behavior, the compiler will attempt to scan ahead when encountering the ending sequences `(?`, `(?(`, `\g`, `\k` and `(?C`. This will enable a more accurate error to be emitted that suggests the alternative syntax. +As such, the single quote variants of the syntax will be considered invalid in a `re'...'` literal, and users must use the alternative syntax instead. If a raw variant of the syntax, `re#'...'#`, is later added, that may also be used. 
In order to improve diagnostic behavior, the compiler will attempt to scan ahead when encountering the ending sequences `(?`, `(?(`, `\g`, `\k` and `(?C`. This will enable a more accurate error to be emitted that suggests the alternative syntax. ## Future Directions @@ -35,7 +35,7 @@ The `re'...'` syntax could be naturally extended to supporting "raw text" throug In particular: -- `\` and `'` characters would become literal, e.g `re#''\n''#` expresses a regular expression pattern that literally matches against the characters `'\n'` (including the quotes). +- `\` and `'` characters would become literal, e.g `re#''\n''#` expresses a regular expression pattern that literally matches against the characters `'\n'` (including the quotes). **TODO: Do we really want to treat backslash as literal? Seems consistent, but escape sequences are frequently used in regex.** - Any number of `#` characters may surround the literal. - Escape sequences would require the same number of `#` characters as in the delimiter to be treated specially. For example, `re##'\##n'##` would be required for a newline character sequence. @@ -70,17 +70,17 @@ Forward slashes are a regex term of art, and are used as the delimiters for rege The obvious parsing ambiguity with `/.../` delimiters is with comment syntaxes. -- An empty regex literal would conflict with line comment syntax `//`. But this isn't a particularly useful thing to express, and could be disallowed. +- An empty regex literal would conflict with line comment syntax `//`. But this isn't a particularly useful thing to express, and can therefore be disallowed without significant impact. - The obvious choice for a multi-line regular expression literal would be to use `///` delimiters, in accordance with the precedent set by multi-line string literals `"""`. A different multi-line delimiter would be needed, with no obvious choice. 
-- There is also a conflict with block comment syntax, when surrounding a regex literal ending with `*`, for example: +- There is a conflict with block comment syntax, when surrounding a regex literal ending with `*`, for example: -```swift -/* -let regex = /x*/ -*/ -``` + ```swift + /* + let regex = /x*/ + */ + ``` In this case, the block comment would prematurely end on the second line, rather than extending all the way to the third line as the user would expect. This is already an issue today with `*/` in a string literal, however it is much more likely to occur in a regular expression given the prevalence of the `*` quantifier. @@ -90,7 +90,11 @@ let regex = /x*/ #### Regex limitations -Another ambiguity with `/.../` arises when it is used to start a new line. This is particularly problematic for result builders, where we expect it to be frequently used, for example: +In order to help avoid parsing ambiguities, a regex literal will not be parsed if it starts with a space, tab, or `)` character. Though the latter is already invalid regex syntax. + +
Rationale + +This is due to 2 main ambiguities. The first of which arises when a `/.../` regex literal is used to start a new line. This is particularly problematic for result builders, where we expect it to be frequently used, for example: ```swift Builder { @@ -100,7 +104,7 @@ Builder { } ``` -This is parsed as a single operator chain, however it is likely the user is expecting a regex literal. To resolve this ambiguity, a regex literal may not start with a space or tab character. This takes advantage of the fact that infix operators require consistent spacing. +This is parsed as a single operator chain, however it is likely the user is expecting a regex literal. To resolve this ambiguity, a regex literal may not start with a space or tab character. This takes advantage of the fact that infix operators require consistent spacing on either side. If a space or tab is needed as the first character, it must be escaped, e.g: @@ -112,7 +116,27 @@ Builder { } ``` -**TODO: Regex starting with `)`** +The second ambiguity arises with Swift's ability to pass an unapplied operator reference as an argument to a function, for example: + +```swift +let arr: [Double] = [2, 3, 4] +let x = arr.reduce(1, /) / 5 +``` + +The `/` in the call to `reduce` is in a valid expression context, and as such could be passed as a regular expression literal. To help mitigate this ambiguity, a regex literal will not be parsed if the first character is `)`. Note this would not be valid regex syntax anyway. + +This is also applicable to unapplied operator references in parentheses and tuples. + +It should be noted that this only mitigates the issue, as another ambiguity arises if the next character is a comma: + +```swift +func foo(_ x: (Int, Int) -> Int, _ y: (Int, Int) -> Int) {} +foo(/, /) +``` + +However we feel that starting a regex with a comma is likely to be a common case, and as such we intend to change the parser such that the above becomes a regex literal. + +
#### Language changes required @@ -161,6 +185,8 @@ foo(/, /) **TODO: Or do we want to ban it as the starting character?** +
+ #### Editor Considerations As described above, there would be a lot involved in handling the parsing ambiguities with `/.../` delimiters. It's one thing to do this in the compiler. But the language also has to be understood by a plethora of source code editors. Those editors either need to encode all those ambiguities, or they need to provide a "best effort" at handling the most common cases. It's all too common for editors to take the "best effort" route. There's a long history of complaints with editors that don't completely support a language's features. And indeed, there's plenty of history of editors that don't correctly support regular expression literals in other languages. By choosing a literal that is easily parsed, we should avoid seeing those complaints regarding Swift. From 7074cfb08cd0ec58eeb106274fcfd0d520254119 Mon Sep 17 00:00:00 2001 From: Hamish Knight Date: Mon, 21 Mar 2022 15:10:21 +0000 Subject: [PATCH 06/36] Elaborate on comma case --- Documentation/Evolution/DelimiterSyntax.md | 33 +++++++++++++--------- 1 file changed, 20 insertions(+), 13 deletions(-) diff --git a/Documentation/Evolution/DelimiterSyntax.md b/Documentation/Evolution/DelimiterSyntax.md index 8beb0b113..6c0eb28ab 100644 --- a/Documentation/Evolution/DelimiterSyntax.md +++ b/Documentation/Evolution/DelimiterSyntax.md @@ -90,11 +90,11 @@ The obvious parsing ambiguity with `/.../` delimiters is with comment syntaxes. #### Regex limitations -In order to help avoid parsing ambiguities, a regex literal will not be parsed if it starts with a space, tab, or `)` character. Though the latter is already invalid regex syntax. +In order to help avoid further parsing ambiguities, a regex literal will not be parsed if it starts with a space, tab, or `)` character. Though the latter is already invalid regex syntax. 
Rationale -This is due to 2 main ambiguities. The first of which arises when a `/.../` regex literal is used to start a new line. This is particularly problematic for result builders, where we expect it to be frequently used, for example: +This is due to 2 main ambiguities. The first of which arises when a `/.../` regex literal starts a new line. This is particularly problematic for result builders, where we expect it to be frequently used, for example: ```swift Builder { @@ -116,16 +116,14 @@ Builder { } ``` -The second ambiguity arises with Swift's ability to pass an unapplied operator reference as an argument to a function, for example: +The second ambiguity arises with Swift's ability to pass an unapplied operator reference as an argument to a function for example: ```swift let arr: [Double] = [2, 3, 4] let x = arr.reduce(1, /) / 5 ``` -The `/` in the call to `reduce` is in a valid expression context, and as such could be passed as a regular expression literal. To help mitigate this ambiguity, a regex literal will not be parsed if the first character is `)`. Note this would not be valid regex syntax anyway. - -This is also applicable to unapplied operator references in parentheses and tuples. +The `/` in the call to `reduce` is in a valid expression context, and as such could be parsed as a regex literal. This is also applicable to operators in tuples and parentheses. To help mitigate this ambiguity, a regex literal will not be parsed if the first character is `)`. This should have minimal impact, as this would not be valid regex syntax anyway. It should be noted that this only mitigates the issue, as another ambiguity arises if the next character is a comma: @@ -143,7 +141,7 @@ However we feel that starting a regex with a comma is likely to be a common case In addition to ambiguities listed above, there are also some parsing ambiguities that would require the following language changes: - Deprecation of prefix operators containing the `/` character. 
-- Potentially parsing `/,` as the start of a regex literal rather than an unapplied operator in an argument list e.g `fn(/, 5) + fn(/, 3)`. +- Potentially parsing `/,` as the start of a regex literal rather than an unapplied operator in an argument list. For example, `fn(/, /)` becomes a regex literal rather than 2 unapplied operator arguments. **TODO: Or do we want to ban it as the starting character? Seems like a common regex case**
Rationale @@ -167,23 +165,32 @@ let x = !/y / .foo() ``` Otherwise it would be interpreted as the prefix operator `!/` by default, and require parens `!(/y /)` for regex parsing. + +##### Comma as the starting character of a regex literal +As stated previously, there is a parsing ambiguity with unapplied operators in argument lists, tuples, and parentheses. Some of these cases can be mitigated by not parsing a regex literal if the starting character is `)`. However it does not solve the issue when the next character is `,`, i.e `/` is used in an argument list before another argument. - -**TODO: More cases from slack discussion ** +For example: ```swift func foo(_ x: (Int, Int) -> Int, _ y: (Int, Int) -> Int) {} foo(/, /) ``` -`foo(/, "(") / 2` !!! +This is currently parsed as 2 unapplied operator arguments. However, given the fact that a regex starting with a comma is not an uncommon case, this will become a regex literal. +The above case seems uncommon, however note this may also occur when the closing `/` appears outside of the argument list, e.g: - -##### Comma as the starting character of a regex literal +```swift +foo(/, 2) + foo(/, 3) +``` + +This would also become a regex literal, i.e it would be parsed as the argument `/, 2) + foo(/`. + +**TODO: More cases from slack discussion ** + +`foo(/, "(") / 2` !!! -**TODO: Or do we want to ban it as the starting character?**
From 945219eec7b8b27ee5c6709d73ce473e4e0fb8f9 Mon Sep 17 00:00:00 2001 From: Hamish Knight Date: Mon, 21 Mar 2022 15:11:26 +0000 Subject: [PATCH 07/36] grammar --- Documentation/Evolution/DelimiterSyntax.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Documentation/Evolution/DelimiterSyntax.md b/Documentation/Evolution/DelimiterSyntax.md index 6c0eb28ab..c866619c2 100644 --- a/Documentation/Evolution/DelimiterSyntax.md +++ b/Documentation/Evolution/DelimiterSyntax.md @@ -116,7 +116,7 @@ Builder { } ``` -The second ambiguity arises with Swift's ability to pass an unapplied operator reference as an argument to a function for example: +The second ambiguity arises with Swift's ability to pass an unapplied operator reference as an argument to a function, for example: ```swift let arr: [Double] = [2, 3, 4] From d1d0d5710fc5720319f15140a5256cb7fa1c2429 Mon Sep 17 00:00:00 2001 From: Hamish Knight Date: Mon, 21 Mar 2022 15:24:57 +0000 Subject: [PATCH 08/36] Add comma disambiguation --- Documentation/Evolution/DelimiterSyntax.md | 15 ++++++++++----- 1 file changed, 10 insertions(+), 5 deletions(-) diff --git a/Documentation/Evolution/DelimiterSyntax.md b/Documentation/Evolution/DelimiterSyntax.md index c866619c2..ed8b06756 100644 --- a/Documentation/Evolution/DelimiterSyntax.md +++ b/Documentation/Evolution/DelimiterSyntax.md @@ -21,7 +21,7 @@ let regex = re'([[:alpha:]]\w*) = ([0-9A-F]+)' The use of a two letter prefix allows for easy future extensibility of such literals, by allowing different prefixes to indicate different types of literal. **TODO: examples** -### Regex limitations +### Regex syntax limitations There are a few items of regex grammar that use the single quote character as a metacharacter. These include named group definitions and references such as `(?'name')`, `(?('name'))`, `\g'name'`, `\k'name'`, as well as callout syntax `(?C'arg')`. 
The use of a single quote conflicts with the `re'...'` delimiter as it will be considered the end of the literal. Fortunately, alternative syntax exists for all of these constructs, e.g `(?<name>)`, `\k<name>`, and `(?C"arg")`. @@ -88,7 +88,7 @@ The obvious parsing ambiguity with `/.../` delimiters is with comment syntaxes. - Finally, there would be a minor ambiguity with infix operators used with regex literals. When used without whitespace, e.g `x+/y/`, the expression will be treated as using an infix operator `+/`. Whitespace is therefore required `x + /y/` for regex literal interpretation. -#### Regex limitations +#### Regex syntax limitations In order to help avoid further parsing ambiguities, a regex literal will not be parsed if it starts with a space, tab, or `)` character. Though the latter is already invalid regex syntax. @@ -141,7 +141,7 @@ However we feel that starting a regex with a comma is likely to be a common case In addition to ambiguities listed above, there are also some parsing ambiguities that would require the following language changes: - Deprecation of prefix operators containing the `/` character. -- Potentially parsing `/,` as the start of a regex literal rather than an unapplied operator in an argument list. For example, `fn(/, /)` becomes a regex literal rather than 2 unapplied operator arguments. **TODO: Or do we want to ban it as the starting character? Seems like a common regex case** +- Parsing `/,` as the start of a regex literal if a closing `/` is found, rather than an unapplied operator in an argument list. For example, `fn(/, /)` becomes a regex literal rather than 2 unapplied operator arguments. **TODO: Or do we want to ban it as the starting character? Seems like a common regex case**
Rationale @@ -185,13 +185,18 @@ The above case seems uncommon, however note this may also occur when the closing foo(/, 2) + foo(/, 3) ``` -This would also become a regex literal, i.e it would be parsed as the argument `/, 2) + foo(/`. +This would also become a regex literal, i.e it would be parsed as the argument `/, 2) + foo(/`. If users wish to disambiguate, they will need to surround at least the opening `/` with parentheses, e.g: + +```swift +foo((/), 2) + foo(/, 3) +``` + +This takes advantage of the fact that a regex literal will not be parsed if the first character is `)`. **TODO: More cases from slack discussion ** `foo(/, "(") / 2` !!! -
#### Editor Considerations From bcebfc63f7f3a77a05eb6b137b0b611e832c3044 Mon Sep 17 00:00:00 2001 From: Hamish Knight Date: Mon, 21 Mar 2022 16:37:15 +0000 Subject: [PATCH 09/36] Update comma disambiguation --- Documentation/Evolution/DelimiterSyntax.md | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/Documentation/Evolution/DelimiterSyntax.md b/Documentation/Evolution/DelimiterSyntax.md index ed8b06756..e267d2274 100644 --- a/Documentation/Evolution/DelimiterSyntax.md +++ b/Documentation/Evolution/DelimiterSyntax.md @@ -185,6 +185,8 @@ The above case seems uncommon, however note this may also occur when the closing foo(/, 2) + foo(/, 3) ``` +**TODO: Do we want to talk about a heuristic that looks for unbalanced parens? I'm kind of hesitant to implement that, as it would have edge cases and might screw with regex errors that should be diagnosed as invalid regex, rather than some cryptic Swift syntactic error. Which would also make it harder to explain to users.** + This would also become a regex literal, i.e it would be parsed as the argument `/, 2) + foo(/`. If users wish to disambiguate, they will need to surround at least the opening `/` with parentheses, e.g: ```swift @@ -193,10 +195,6 @@ foo((/), 2) + foo(/, 3) This takes advantage of the fact that a regex literal will not be parsed if the first character is `)`. -**TODO: More cases from slack discussion ** - -`foo(/, "(") / 2` !!! -
#### Editor Considerations From 1c2b7ad00d2d99fd67f764b3f53b52069e347929 Mon Sep 17 00:00:00 2001 From: Hamish Knight Date: Tue, 22 Mar 2022 19:20:36 +0000 Subject: [PATCH 10/36] Flip pitch to `/.../` as the main syntax A quick pass to flip `/.../` out of the alternatives and into the main syntax. Still needs a bunch of work. Also add some commentary on a regex with `]` as the starting character. --- Documentation/Evolution/DelimiterSyntax.md | 153 +++++++++++---------- 1 file changed, 81 insertions(+), 72 deletions(-) diff --git a/Documentation/Evolution/DelimiterSyntax.md b/Documentation/Evolution/DelimiterSyntax.md index e267d2274..c2c6287b8 100644 --- a/Documentation/Evolution/DelimiterSyntax.md +++ b/Documentation/Evolution/DelimiterSyntax.md @@ -12,61 +12,22 @@ ## Detailed Design -A regular expression literal will be introduced using `re'...'` delimiters, within which the compiler will parse a regular expression (the details of which are outlined in [the Regex Syntax pitch][internal-syntax]): +**TODO: Say that this is Swift 6 syntax only, `#/.../#` would be 5.7 syntax** + +A regular expression literal will be introduced using `/.../` delimiters, within which the compiler will parse a regular expression (the details of which are outlined in [the Regex Syntax pitch][internal-syntax]): ``` // Matches " = ", extracting the identifier and hex number -let regex = re'([[:alpha:]]\w*) = ([0-9A-F]+)' +let regex = /([[:alpha:]]\w*) = ([0-9A-F]+)/ ``` -The use of a two letter prefix allows for easy future extensibility of such literals, by allowing different prefixes to indicate different types of literal. **TODO: examples** - -### Regex syntax limitations - -There are a few items of regex grammar that use the single quote character as a metacharacter. These include named group definitions and references such as `(?'name')`, `(?('name'))`, `\g'name'`, `\k'name'`, as well as callout syntax `(?C'arg')`. 
The use of a single quote conflicts with the `re'...'` delimiter as it will be considered the end of the literal. Fortunately, alternative syntax exists for all of these constructs, e.g `(?<name>)`, `\k<name>`, and `(?C"arg")`. - -As such, the single quote variants of the syntax will be considered invalid in a `re'...'` literal, and users must use the alternative syntax instead. If a raw variant of the syntax `re#'...'#` of the syntax is later added, that may also be used. In order to improve diagnostic behavior, the compiler will attempt to scan ahead when encountering the ending sequences `(?`, `(?(`, `\g`, `\k` and `(?C`. This will enable a more accurate error to be emitted that suggests the alternative syntax. - -## Future Directions - -### Raw literals +Forward slashes are a regex term of art, and are used as the delimiters for regex literals in Perl, JavaScript and Ruby (though Perl and Ruby also provide alternative choices). Due to its existing use in comment syntax and operators, there are some syntactic ambiguities to consider. While there are quite a few cases to consider, we do not feel that the impact of any individual case is particularly high. -The `re'...'` syntax could be naturally extended to supporting "raw text" through allowing additional `#` characters to surround the quote characters e.g `re#'...'#`. Such literals would follow the same rules as the string literals introduced in [SE-0200]. +**TODO: Do we want to present a stronger argument for `/.../`?** -In particular: +**TODO: Anything else we want to say here before segueing into the massive list?** -- `\` and `'` characters would become literal, e.g `re#''\n''#` expresses a regular expression pattern that literally matches against the characters `'\n'` (including the quotes). **TODO: Do we really want to treat backslash as literal? Seems consistent, but escape sequences are frequently used in regex.** -- Any number of `#` characters may surround the literal.
-- Escape sequences would require the same number of `#` characters as in the delimiter to be treated specially. For example, `re##'\##n'##` would be required for a newline character sequence. - -### Multi-line literals - -A natural extension to the `re'...'` syntax to support multi-line regex literals would be to allow triple quote syntax: - -``` -re''' - abc - def - ''' -``` - -This would follow the precedent set by [SE-0168] for multi-line string literals, and obey the same rules, in particular with the stripping of any leading whitespace prior to the position of the closing delimiter. - -## Alternatives Considered - -### Double quoted `re"...."` - -We could choose to use double quotes instead of single quotes. This would be similar in appearance to string literals, however it could be argued that regex literals are distinct from string literals in that they introduce their own specific language to parse. As such, regex literals are more like "program literals" than "data literals", and the use of single quote instead of double quote could express this difference. - -### Single letter `r'...'` - -We could choose to shorten the literal prefix to just `r`. However this could potentially be confused to mean "raw", especially as Python uses this syntax for raw strings. The syntax `re'...'` could also set the precedent for a 2 letter namespace for future literals. - -### Forward slashes `/.../` - -Forward slashes are a regex term of art, and are used as the delimiters for regex literals in Perl, JavaScript and Ruby (though Perl and Ruby also provide alternative choices). However, they would be an awkward fit in Swift's language grammar, and would not provide a path for extensibility. Here we give an extensive list of drawbacks to the choice. While no individual issue is terribly bad and each could be overcome, the list of issues is quite long. 
- -#### Parsing ambiguities +### Parsing ambiguities The obvious parsing ambiguity with `/.../` delimiters is with comment syntaxes. @@ -88,7 +49,7 @@ The obvious parsing ambiguity with `/.../` delimiters is with comment syntaxes. - Finally, there would be a minor ambiguity with infix operators used with regex literals. When used without whitespace, e.g `x+/y/`, the expression will be treated as using an infix operator `+/`. Whitespace is therefore required `x + /y/` for regex literal interpretation. -#### Regex syntax limitations +### Regex syntax limitations In order to help avoid further parsing ambiguities, a regex literal will not be parsed if it starts with a space, tab, or `)` character. Though the latter is already invalid regex syntax. @@ -125,27 +86,20 @@ let x = arr.reduce(1, /) / 5 The `/` in the call to `reduce` is in a valid expression context, and as such could be parsed as a regex literal. This is also applicable to operators in tuples and parentheses. To help mitigate this ambiguity, a regex literal will not be parsed if the first character is `)`. This should have minimal impact, as this would not be valid regex syntax anyway. -It should be noted that this only mitigates the issue, as another ambiguity arises if the next character is a comma: - -```swift -func foo(_ x: (Int, Int) -> Int, _ y: (Int, Int) -> Int) {} -foo(/, /) -``` - -However we feel that starting a regex with a comma is likely to be a common case, and as such we intend to change the parser such that the above becomes a regex literal. +It should be noted that this only mitigates the issue, as it does not handle the case where the next character is a comma or right square bracket. These cases are explored further in the following section. 
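As a rough illustration of the `)` rule described in this hunk (a sketch only, not part of the patch): because the character after `/` in `reduce(1, /)` is `)`, the slash is expected to stay an unapplied operator reference, so the bare spelling and the explicitly parenthesized `(/)` spelling should agree. The names `viaBare` and `viaParens` are made up for this example.

```swift
let arr: [Double] = [2, 3, 4]

// The `/` passed to `reduce` is immediately followed by `)`, so under the
// rule above it is not treated as the start of a regex literal.
let viaBare = arr.reduce(1, /) / 5

// Parenthesizing the operator reference spells the same thing explicitly.
let viaParens = arr.reduce(1, (/)) / 5

print(viaBare == viaParens)
```

The parenthesized form is the more defensive spelling, since it does not depend on what follows the operator in the argument list.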
-#### Language changes required +### Language changes required -In addition to ambiguities listed above, there are also some parsing ambiguities that would require the following language changes: +In addition to ambiguities listed above, there are also some parsing ambiguities that would require the following language changes in Swift 6 mode: - Deprecation of prefix operators containing the `/` character. -- Parsing `/,` as the start of a regex literal if a closing `/` is found, rather than an unapplied operator in an argument list. For example, `fn(/, /)` becomes a regex literal rather than 2 unapplied operator arguments. **TODO: Or do we want to ban it as the starting character? Seems like a common regex case** +- Parsing `/,` and `/]` as the start of a regex literal if a closing `/` is found, rather than an unapplied operator in an argument list. For example, `fn(/, /)` becomes a regex literal rather than 2 unapplied operator arguments.
Rationale -##### Prefix operators starting with `/` +#### Prefix operators starting with `/` We'd need to ban prefix operators starting with `/`, to avoid ambiguity with cases such as: @@ -156,7 +110,7 @@ let z = /^x^/ Postfix `/` operators would be okay, as they'd only be treated as regex literal delimiters if we were already trying to lex as a regex literal. -##### Prefix operators containing `/` +#### Prefix operators containing `/` Prefix operators *containing* `/` (not just at the start) would likely need banning too, in order to allow prefix operators to be used with regex literals in an unambiguous way, e.g: @@ -166,49 +120,104 @@ let x = !/y / .foo() Otherwise it would be interpreted as the prefix operator `!/` by default, and require parens `!(/y /)` for regex parsing. -##### Comma as the starting character of a regex literal +#### `/,` and `/]` as regex literal openings -As stated previously, there is a parsing ambiguity with unapplied operators in argument lists, tuples, and parentheses. Some of these cases can be mitigated by not parsing a regex literal if the starting character is `)`. However it does not solve the issue when the next character is `,`, i.e `/` is used in an argument list before another argument. +As stated previously, there is a parsing ambiguity with unapplied operators in argument lists, tuples, and parentheses. Some of these cases can be mitigated by not parsing a regex literal if the starting character is `)`. However it does not solve the issue when the next character is `,` or `]`. Both of these are valid regex starting characters, and comma in particular may be a fairly common case for a regex. For example: ```swift +// Ambiguity with comma: func foo(_ x: (Int, Int) -> Int, _ y: (Int, Int) -> Int) {} foo(/, /) -``` - -This is currently parsed as 2 unapplied operator arguments. However, given the fact that a regex starting with a comma is not an uncommon case, this will become a regex literal. 
-The above case seems uncommon, however note this may also occur when the closing `/` appears outside of the argument list, e.g: +// Also affects cases where the closing '/' is outside the argument list. +func bar(_ fn: (Int, Int) -> Int, _ x: Int) -> Int { 0 } +bar(/, 2) + bar(/, 3) -```swift -foo(/, 2) + foo(/, 3) +// Ambiguity with right square bracket: +struct S { + subscript(_ fn: (Int, Int) -> Int) -> Int { 0 } +} +func baz(_ x: S) -> Int { + x[/] + x[/] +} ``` +`foo(/, /)` is currently parsed as 2 unapplied operator arguments. `bar(/, 2) + bar(/, 3)` is currently parsed as two independent calls that each take an unapplied `/` operator reference. Both of these would become regex literals arguments, `/, /` and `/, 2) + bar(/` respectively (though the latter would produce a regex error). + **TODO: Do we want to talk about a heuristic that looks for unbalanced parens? I'm kind of hesitant to implement that, as it would have edge cases and might screw with regex errors that should be diagnosed as invalid regex, rather than some cryptic Swift syntactic error. Which would also make it harder to explain to users.** -This would also become a regex literal, i.e it would be parsed as the argument `/, 2) + foo(/`. If users wish to disambiguate, they will need to surround at least the opening `/` with parentheses, e.g: +To disambiguate these cases, users will need to surround at least the opening `/` with parentheses, e.g: ```swift -foo((/), 2) + foo(/, 3) +foo((/), /) +bar((/), 2) + bar(/, 3) + +func baz(_ x: S) -> Int { + x[(/)] + x[/] +} ``` This takes advantage of the fact that a regex literal will not be parsed if the first character is `)`.
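The disambiguation above can be sketched as a small self-contained program. This is an illustration under assumptions rather than code from the patch: `foo` and `bar` reuse the names from the example, but their bodies and return values are invented here so the result can be observed. Every operator reference is parenthesized, which is slightly more than the text requires but avoids relying on how any later `/` is lexed.

```swift
// Hypothetical helpers; only the parameter shapes follow the text above.
func foo(_ x: (Int, Int) -> Int, _ y: (Int, Int) -> Int) -> Int {
    x(12, 3) + y(20, 5)
}
func bar(_ fn: (Int, Int) -> Int, _ x: Int) -> Int {
    fn(12, x)
}

// In `(/)` the character after `/` is `)`, so per the rule above it cannot
// open a regex literal and is read as the division operator.
let a = foo((/), (/))              // 12 / 3 + 20 / 5
let b = bar((/), 2) + bar((/), 3)  // 12 / 2 + 12 / 3

print(a, b)
```

Wrapping only the opening `/`, as the text suggests, should be sufficient; wrapping each one is a stylistic choice that keeps the call sites uniform.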
-#### Editor Considerations +### Editor Considerations + +**TODO: Rewrite now that `/.../` is the syntax being pitched?** As described above, there would be a lot involved in handling the parsing ambiguities with `/.../` delimiters. It's one thing to do this in the compiler. But the language also has to be understood by a plethora of source code editors. Those editors either need encode all those ambiguities, or they need to provide a "best effort" at handling the most common cases. It's all too common for editors to take the "best effort" route. There's a long history of complaints with editors that don't completely support a language's features. And indeed, there's plenty of history of editors that don't correctly support regular expression literals in other languages. By choosing a literal that is easily parsed, we should avoid seeing those complaints regarding Swift. + ### Pound slash `#/.../#` +**TODO: This needs to be rewritten to say that it's a transition syntax** + This would be less syntactically ambiguous than `/.../`, while retaining some of the term-of-art familiarity. It would also provide a natural path through which to introduce `/.../` in a new language mode, as users could drop the `#` characters once they upgrade. However this option would also have the same block comment issue as `/.../` where e.g `#/x*/#` nested inside a block comment would prematurely end. Similarly, it's not clear how a multi-line version of the literal would be spelled. Additionally, introducing this syntax would introduce an inconsistency with raw string literal syntax, as `#/.../#` on its own would not treat backslashes as literal, unlike `#"..."#`. If raw regex syntax were implemented, it would start at `##/.../##`. With raw strings, escape sequences must use the same number of `#`s as the delimiter, e.g `#"\#n"#` for a newline. However for raw regex literals it would be one fewer `#` than the delimiter e.g `##/\#n/##`. 
+## Future Directions + +**TODO: What do we want to say here?** + +## Alternatives Considered + +### Prefixed quote `re'...'` + +**TODO: Do a pass over this to make sure it sounds correct now that it's an alternative** + +We could choose to use `re'...'` delimiters, for example: + +``` +// Matches " = ", extracting the identifier and hex number +let regex = re'([[:alpha:]]\w*) = ([0-9A-F]+)' +``` + +**TODO: Fill in reasons why not to pick this** + +**TODO: Mention that it nicely extends to raw and multiline?** + +#### Regex syntax limitations + +There are a few items of regex grammar that use the single quote character as a metacharacter. These include named group definitions and references such as `(?'name')`, `(?('name'))`, `\g'name'`, `\k'name'`, as well as callout syntax `(?C'arg')`. The use of a single quote conflicts with the `re'...'` delimiter as it will be considered the end of the literal. Fortunately, alternative syntax exists for all of these constructs, e.g `(?<name>)`, `\k<name>`, and `(?C"arg")`. + +As such, the single quote variants of the syntax would be considered invalid in a `re'...'` literal, and users must use the alternative syntax instead. If a raw variant of the syntax `re#'...'#` of the syntax is later added, that may also be used. In order to improve diagnostic behavior, the compiler would attempt to scan ahead when encountering the ending sequences `(?`, `(?(`, `\g`, `\k` and `(?C`. This would enable a more accurate error to be emitted that suggests the alternative syntax. + +**TODO: Do we actually want to include the below? They're less relevant if `re'...'` is itself the alternative** + +### Double quoted `re"...."` + +We could choose to use double quotes instead of single quotes. This would be similar in appearance to string literals, however it could be argued that regex literals are distinct from string literals in that they introduce their own specific language to parse.
As such, regex literals are more like "program literals" than "data literals", and the use of single quote instead of double quote could express this difference. + +### Single letter `r'...'` + +We could choose to shorten the literal prefix to just `r`. However this could potentially be confused to mean "raw", especially as Python uses this syntax for raw strings. The syntax `re'...'` could also set the precedent for a 2 letter namespace for future literals. + +**TODO: Add the other alternatives e.g `#regex(...)`** [SE-0168]: https://github.com/apple/swift-evolution/blob/main/proposals/0168-multi-line-string-literals.md [SE-0200]: https://github.com/apple/swift-evolution/blob/main/proposals/0200-raw-string-escaping.md From e8411847c80233a4745e38cd7fd35fccc7fed462 Mon Sep 17 00:00:00 2001 From: Hamish Knight Date: Wed, 23 Mar 2022 17:30:11 +0000 Subject: [PATCH 11/36] Update alternatives considered --- Documentation/Evolution/DelimiterSyntax.md | 52 ++++++++++++++++++---- 1 file changed, 44 insertions(+), 8 deletions(-) diff --git a/Documentation/Evolution/DelimiterSyntax.md b/Documentation/Evolution/DelimiterSyntax.md index c2c6287b8..b45872337 100644 --- a/Documentation/Evolution/DelimiterSyntax.md +++ b/Documentation/Evolution/DelimiterSyntax.md @@ -188,8 +188,6 @@ Additionally, introducing this syntax would introduce an inconsistency with raw ### Prefixed quote `re'...'` -**TODO: Do a pass over this to make sure it sounds correct now that it's an alternative** - We could choose to use `re'...'` delimiters, for example: ``` @@ -197,6 +195,8 @@ We could choose to use `re'...'` delimiters, for example: let regex = re'([[:alpha:]]\w*) = ([0-9A-F]+)' ``` +The use of two letter prefix could potentially be used as a namespace for future literal types. However, it is unusual for a Swift literal to be prefixed in this way. 
+ **TODO: Fill in reasons why not to pick this** **TODO: Mention that it nicely extends to raw and multiline?** @@ -207,17 +207,53 @@ There are a few items of regex grammar that use the single quote character as a As such, the single quote variants of the syntax would be considered invalid in a `re'...'` literal, and users must use the alternative syntax instead. If a raw variant of the syntax `re#'...'#` of the syntax is later added, that may also be used. In order to improve diagnostic behavior, the compiler would attempt to scan ahead when encountering the ending sequences `(?`, `(?(`, `\g`, `\k` and `(?C`. This would enable a more accurate error to be emitted that suggests the alternative syntax. -**TODO: Do we actually want to include the below? They're less relevant if `re'...'` is itself the alternative** +### Prefixed double quote `re"...."` + +This would be a double quoted version of `re'...'`, more similar to string literal syntax. This has the advantage that single quote regex syntax e.g `(?'name')` would continue to work without requiring the use of the alternative syntax or "raw syntax" delimiters. However it could be argued that regex literals are distinct from string literals in that they introduce their own specific language to parse. As such, regex literals are more like "program literals" than "data literals", and the use of single quote instead of double quote may be useful in expressing this difference. + +### Single letter prefixed quote `r'...'` + +This would be a slightly shorter version of `re'...'`. While it's more concise, it could potentially be confused to mean "raw", especially as Python uses this syntax for raw strings. + +#### Single quotes `'...'` + +This would be an even more concise version of `re'...'` that drops the prefix entirely. 
However, given how close it is to string literal syntax, it may not be entirely clear to users that `'...'` denotes a regular expression as opposed to some different form of string literal (e.g some form of character literal, or a string literal with different escaping rules). + +We could help distinguish it from a string literal by requiring e.g `'/.../'`, though it may not be clear that the `/` characters are part of the delimiters rather than part of the literal. Additionally, this would potentially rule out the use of `'...'` as a future literal kind. + +#### Magic literal `#regex(...)` + +We could opt for for a more explicitly spelled out literal syntax such as `#regex(...)`. This is an even more heavyweight option, similar to `#selector(...)`. As such, it may be considered syntactically noisy as e.g a function argument `str.match(#regex([abc]+))` vs `str.match(/[abc]+/)`. -### Double quoted `re"...."` +Such a syntax would require the containing regex to correctly balance capture group parentheses, otherwise the rest of the line might be incorrectly considered a regex. This could place additional cognitive burden on the user, and may lead to an awkward typing experience. For example, if the user is editing a previously written regex, the syntax highlighting for the rest of the line may change, and unhelpful spurious errors may be reported. With a different delimiter, the compiler would be able to detect and better diagnose unbalanced parentheses in the regex. + +We could avoid the parenthesis balancing issue by requiring an additional internal delimiter such as `#regex(/.../)`. However it is even more heavyweight, and it may be unclear that `/` is part of the delimiter rather than part of the literal. Alternatively, we could replace the internal delimiter with another character such as ```#regex`...` ```, `#regex{...}`, or `#regex/.../`. 
However those would be inconsistent with the existing `#literal(...)` syntax and the first two would overload the existing meanings for the ``` `` ``` and `{}` delimiters. + +It should also be noted that `#regex(...)` would introduce a syntactic inconsistency where the argument of a `#literal(...)` is no longer necessarily valid Swift syntax, despite being written in the form of an argument. + +#### Shortened magic literal `#(...)` + +We could reduce the visual weight of `#regex(...)` by only requiring `#(...)`. This would retain the same advantages e.g not requiring to escape `/`. However it would also still retain the same issues, such as still looking potentially visually noisy as an argument, and having suboptimal behavior for parenthesis balancing. It is also not clear why regex literals would deserve such privileged syntax. + +### Reusing string literal syntax + +Instead of supporting a first-class literal kind for regular expressions, we could instead allow users to write a regular expression in a string literal, and parse, diagnose, and generate the appropriate code when it's coerced to an `ExpressibleByRegexLiteral` conforming type. + +```swift +let regex: Regex = "([[:alpha:]]\w*) = ([0-9A-F]+)" +``` -We could choose to use double quotes instead of single quotes. This would be similar in appearance to string literals, however it could be argued that regex literals are distinct from string literals in that they introduce their own specific language to parse. As such, regex literals are more like "program literals" than "data literals", and the use of single quote instead of double quote could express this difference. +However we decided against this because: -### Single letter `r'...'` +- We would not be able to easily apply custom syntax highlighting for the regex syntax. +- It would require an `ExpressibleByRegexLiteral` contextual type to be treated as a regex, otherwise it would be defaulted to `String`, which may be undesired. 
+- In an overloaded context it may be ambiguous or unclear whether a string literal is meant to be interpreted as a literal string or regex. +- Regex escape sequences aren't currently compatible with string literal escape sequence rules, e.g `\w` is currently illegal in a string literal. +- It wouldn't be compatible with other string literal features such as interpolations. -We could choose to shorten the literal prefix to just `r`. However this could potentially be confused to mean "raw", especially as Python uses this syntax for raw strings. The syntax `re'...'` could also set the precedent for a 2 letter namespace for future literals. +### No custom literal -**TODO: Add the other alternatives e.g `#regex(...)`** +Instead of adding a custom regex literal, we could require users to explicitly write `Regex(compiling: "[abc]+")`. This would however lose all the benefits of parsing the literal at compile time, meaning that parse errors will instead be diagnosed at runtime, and no source tooling support (e.g syntax highlighting, refactoring actions) would be available. [SE-0168]: https://github.com/apple/swift-evolution/blob/main/proposals/0168-multi-line-string-literals.md [SE-0200]: https://github.com/apple/swift-evolution/blob/main/proposals/0200-raw-string-escaping.md From be7a8022dd54990e984550a59699fe4021ac05db Mon Sep 17 00:00:00 2001 From: Hamish Knight Date: Wed, 23 Mar 2022 17:31:19 +0000 Subject: [PATCH 12/36] Fix headings --- Documentation/Evolution/DelimiterSyntax.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/Documentation/Evolution/DelimiterSyntax.md b/Documentation/Evolution/DelimiterSyntax.md index b45872337..fd66d11cf 100644 --- a/Documentation/Evolution/DelimiterSyntax.md +++ b/Documentation/Evolution/DelimiterSyntax.md @@ -215,13 +215,13 @@ This would be a double quoted version of `re'...'`, more similar to string liter This would be a slightly shorter version of `re'...'`. 
While it's more concise, it could potentially be confused to mean "raw", especially as Python uses this syntax for raw strings. -#### Single quotes `'...'` +### Single quotes `'...'` This would be an even more concise version of `re'...'` that drops the prefix entirely. However, given how close it is to string literal syntax, it may not be entirely clear to users that `'...'` denotes a regular expression as opposed to some different form of string literal (e.g some form of character literal, or a string literal with different escaping rules). We could help distinguish it from a string literal by requiring e.g `'/.../'`, though it may not be clear that the `/` characters are part of the delimiters rather than part of the literal. Additionally, this would potentially rule out the use of `'...'` as a future literal kind. -#### Magic literal `#regex(...)` +### Magic literal `#regex(...)` We could opt for for a more explicitly spelled out literal syntax such as `#regex(...)`. This is an even more heavyweight option, similar to `#selector(...)`. As such, it may be considered syntactically noisy as e.g a function argument `str.match(#regex([abc]+))` vs `str.match(/[abc]+/)`. @@ -231,7 +231,7 @@ We could avoid the parenthesis balancing issue by requiring an additional intern It should also be noted that `#regex(...)` would introduce a syntactic inconsistency where the argument of a `#literal(...)` is no longer necessarily valid Swift syntax, despite being written in the form of an argument. -#### Shortened magic literal `#(...)` +### Shortened magic literal `#(...)` We could reduce the visual weight of `#regex(...)` by only requiring `#(...)`. This would retain the same advantages e.g not requiring to escape `/`. However it would also still retain the same issues, such as still looking potentially visually noisy as an argument, and having suboptimal behavior for parenthesis balancing. It is also not clear why regex literals would deserve such privileged syntax. 
From 06c2b28f53eeb2b13eccb89d2f3635d363f4fc90 Mon Sep 17 00:00:00 2001 From: Hamish Knight Date: Wed, 23 Mar 2022 17:38:13 +0000 Subject: [PATCH 13/36] Tweak phrasing --- Documentation/Evolution/DelimiterSyntax.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/Documentation/Evolution/DelimiterSyntax.md b/Documentation/Evolution/DelimiterSyntax.md index fd66d11cf..9455dd0d0 100644 --- a/Documentation/Evolution/DelimiterSyntax.md +++ b/Documentation/Evolution/DelimiterSyntax.md @@ -223,7 +223,7 @@ We could help distinguish it from a string literal by requiring e.g `'/.../'`, t ### Magic literal `#regex(...)` -We could opt for for a more explicitly spelled out literal syntax such as `#regex(...)`. This is an even more heavyweight option, similar to `#selector(...)`. As such, it may be considered syntactically noisy as e.g a function argument `str.match(#regex([abc]+))` vs `str.match(/[abc]+/)`. +We could opt for for a more explicitly spelled out literal syntax such as `#regex(...)`. This is a more heavyweight option, similar to `#selector(...)`. As such, it may be considered syntactically noisy as e.g a function argument `str.match(#regex([abc]+))` vs `str.match(/[abc]+/)`. Such a syntax would require the containing regex to correctly balance capture group parentheses, otherwise the rest of the line might be incorrectly considered a regex. This could place additional cognitive burden on the user, and may lead to an awkward typing experience. For example, if the user is editing a previously written regex, the syntax highlighting for the rest of the line may change, and unhelpful spurious errors may be reported. With a different delimiter, the compiler would be able to detect and better diagnose unbalanced parentheses in the regex. 
@@ -233,14 +233,14 @@ It should also be noted that `#regex(...)` would introduce a syntactic inconsist ### Shortened magic literal `#(...)` -We could reduce the visual weight of `#regex(...)` by only requiring `#(...)`. This would retain the same advantages e.g not requiring to escape `/`. However it would also still retain the same issues, such as still looking potentially visually noisy as an argument, and having suboptimal behavior for parenthesis balancing. It is also not clear why regex literals would deserve such privileged syntax. +We could reduce the visual weight of `#regex(...)` by only requiring `#(...)`. However it would still retain the same issues, such as still looking potentially visually noisy as an argument, and having suboptimal behavior for parenthesis balancing. It is also not clear why regex literals would deserve such privileged syntax. ### Reusing string literal syntax Instead of supporting a first-class literal kind for regular expressions, we could instead allow users to write a regular expression in a string literal, and parse, diagnose, and generate the appropriate code when it's coerced to an `ExpressibleByRegexLiteral` conforming type. ```swift -let regex: Regex = "([[:alpha:]]\w*) = ([0-9A-F]+)" +let regex: Regex = #"([[:alpha:]]\w*) = ([0-9A-F]+)"# ``` However we decided against this because: @@ -248,7 +248,7 @@ However we decided against this because: - We would not be able to easily apply custom syntax highlighting for the regex syntax. - It would require an `ExpressibleByRegexLiteral` contextual type to be treated as a regex, otherwise it would be defaulted to `String`, which may be undesired. - In an overloaded context it may be ambiguous or unclear whether a string literal is meant to be interpreted as a literal string or regex. -- Regex escape sequences aren't currently compatible with string literal escape sequence rules, e.g `\w` is currently illegal in a string literal. 
+- Regex-specific escape sequences such as `\w` would likely require the use of raw string syntax `#"..."#`, as they are otherwise invalid in a string literal. - It wouldn't be compatible with other string literal features such as interpolations. ### No custom literal From 91a93a89ecc1d7cb80db5c626a7648c71752ebb7 Mon Sep 17 00:00:00 2001 From: Hamish Knight Date: Wed, 23 Mar 2022 20:22:17 +0000 Subject: [PATCH 14/36] Small tweaks --- Documentation/Evolution/DelimiterSyntax.md | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/Documentation/Evolution/DelimiterSyntax.md b/Documentation/Evolution/DelimiterSyntax.md index 9455dd0d0..e1d3fc65a 100644 --- a/Documentation/Evolution/DelimiterSyntax.md +++ b/Documentation/Evolution/DelimiterSyntax.md @@ -16,14 +16,14 @@ A regular expression literal will be introduced using `/.../` delimiters, within which the compiler will parse a regular expression (the details of which are outlined in [the Regex Syntax pitch][internal-syntax]): -``` +```swift // Matches " = ", extracting the identifier and hex number let regex = /([[:alpha:]]\w*) = ([0-9A-F]+)/ ``` -Forward slashes are a regex term of art, and are used as the delimiters for regex literals in Perl, JavaScript and Ruby (though Perl and Ruby also provide alternative choices). Due to its existing use in comment syntax and operators, there are some syntactic ambiguities to consider. While there are quite a few cases to consider, we do not feel that the impact of any individual case is particularly high. +Forward slashes are a regex term of art, and are used as the delimiters for regex literals in Perl, JavaScript and Ruby (though Perl and Ruby also provide alternatives). Their ubiquity and familiarity makes them a compelling choice for Swift. -**TODO: Do we want to present a stronger argument for `/.../`?** +Due to the existing use of `/` in comment syntax and operators, there are some syntactic ambiguities to consider. 
While there are quite a few cases to consider, we do not feel that the impact of any individual case is sufficient to disqualify the syntax. **TODO: Anything else we want to say here before segueing into the massive list?** @@ -182,7 +182,7 @@ Additionally, introducing this syntax would introduce an inconsistency with raw ## Future Directions -**TODO: What do we want to say here?** +**TODO: What do we want to say here? Talk about raw and multiline? Don't really have a good option for the latter tho** ## Alternatives Considered @@ -190,14 +190,14 @@ Additionally, introducing this syntax would introduce an inconsistency with raw We could choose to use `re'...'` delimiters, for example: -``` +```swift // Matches " = ", extracting the identifier and hex number let regex = re'([[:alpha:]]\w*) = ([0-9A-F]+)' ``` The use of two letter prefix could potentially be used as a namespace for future literal types. However, it is unusual for a Swift literal to be prefixed in this way. -**TODO: Fill in reasons why not to pick this** +**TODO: Any other reasons why not to pick this?** **TODO: Mention that it nicely extends to raw and multiline?** @@ -225,7 +225,7 @@ We could help distinguish it from a string literal by requiring e.g `'/.../'`, t We could opt for for a more explicitly spelled out literal syntax such as `#regex(...)`. This is a more heavyweight option, similar to `#selector(...)`. As such, it may be considered syntactically noisy as e.g a function argument `str.match(#regex([abc]+))` vs `str.match(/[abc]+/)`. -Such a syntax would require the containing regex to correctly balance capture group parentheses, otherwise the rest of the line might be incorrectly considered a regex. This could place additional cognitive burden on the user, and may lead to an awkward typing experience. For example, if the user is editing a previously written regex, the syntax highlighting for the rest of the line may change, and unhelpful spurious errors may be reported. 
With a different delimiter, the compiler would be able to detect and better diagnose unbalanced parentheses in the regex. +Such a syntax would require the containing regex to correctly balance parentheses for groups, otherwise the rest of the line might be incorrectly considered a regex. This could place additional cognitive burden on the user, and may lead to an awkward typing experience. For example, if the user is editing a previously written regex, the syntax highlighting for the rest of the line may change, and unhelpful spurious errors may be reported. With a different delimiter, the compiler would be able to detect and better diagnose unbalanced parentheses in the regex. We could avoid the parenthesis balancing issue by requiring an additional internal delimiter such as `#regex(/.../)`. However it is even more heavyweight, and it may be unclear that `/` is part of the delimiter rather than part of the literal. Alternatively, we could replace the internal delimiter with another character such as ```#regex`...` ```, `#regex{...}`, or `#regex/.../`. However those would be inconsistent with the existing `#literal(...)` syntax and the first two would overload the existing meanings for the ``` `` ``` and `{}` delimiters. 
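The parenthesis-balancing concern raised for `#regex(...)` in the hunk above can be made concrete with a small sketch. It assumes a deliberately naive scanner: `scanMagicLiteral` is a hypothetical helper written for this note, not code from the proposal or the Swift compiler, and it ignores details such as character classes (`[(]`) that a real lexer would need to handle.

```swift
// Hypothetical helper (an assumption for illustration only): finds the body of
// a `#regex(...)` literal by counting parentheses, skipping backslash escapes.
func scanMagicLiteral(_ source: String) -> String? {
    let prefix = "#regex("
    guard source.hasPrefix(prefix) else { return nil }

    var depth = 1        // the opening `(` of `#regex(` is already consumed
    var escaped = false  // true when the previous character was a backslash
    var body = ""

    for ch in source.dropFirst(prefix.count) {
        if escaped {
            escaped = false          // an escaped character never affects depth
        } else if ch == "\\" {
            escaped = true
        } else if ch == "(" {
            depth += 1
        } else if ch == ")" {
            depth -= 1
            if depth == 0 { return body }  // found the closing delimiter
        }
        body.append(ch)
    }
    return nil  // ran off the end of the input: unbalanced parentheses
}

// Balanced groups: the literal ends where the user expects.
print(scanMagicLiteral("#regex((a|b)+), other)") ?? "unterminated")  // (a|b)+

// A group left open mid-edit: the scan swallows the following argument too.
print(scanMagicLiteral("#regex((a|b+), other)") ?? "unterminated")   // (a|b+), other

// No closing parenthesis at all: the rest of the line is consumed.
print(scanMagicLiteral("#regex((a|b+") ?? "unterminated")            // unterminated
```

In the second and third calls the delimiter scan, rather than the regex parser, decides where the literal ends, which is why the diagnostics and highlighting for the rest of the line would suffer. With a delimiter that does not also appear as regex group syntax, the literal's extent is known first and the unbalanced `(` can be reported as a regex error.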
From c0e3befca172b1d1d07dc0b357d86cadef02bdef Mon Sep 17 00:00:00 2001 From: Hamish Knight Date: Wed, 23 Mar 2022 20:25:53 +0000 Subject: [PATCH 15/36] Tweak --- Documentation/Evolution/DelimiterSyntax.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Documentation/Evolution/DelimiterSyntax.md b/Documentation/Evolution/DelimiterSyntax.md index e1d3fc65a..e2d53f280 100644 --- a/Documentation/Evolution/DelimiterSyntax.md +++ b/Documentation/Evolution/DelimiterSyntax.md @@ -112,7 +112,7 @@ Postfix `/` operators would be okay, as they'd only be treated as regex literal #### Prefix operators containing `/` -Prefix operators *containing* `/` (not just at the start) would likely need banning too, in order to allow prefix operators to be used with regex literals in an unambiguous way, e.g: +Prefix operators *containing* `/` (not just at the start) need banning too, in order to allow prefix operators to be used with regex literals in an unambiguous way, e.g: ```swift let x = !/y / .foo() From 298092c5bd3c86870998bdca436cf174b031d4c5 Mon Sep 17 00:00:00 2001 From: David Ewing Date: Wed, 23 Mar 2022 22:21:31 -0600 Subject: [PATCH 16/36] Flesh things out a bit more. Initial bits for the Intro and Motivation. Split out Proposed solution from Detailed design. Parallelize the structure a bit better. --- Documentation/Evolution/DelimiterSyntax.md | 75 ++++++++++++---------- 1 file changed, 42 insertions(+), 33 deletions(-) diff --git a/Documentation/Evolution/DelimiterSyntax.md b/Documentation/Evolution/DelimiterSyntax.md index e2d53f280..8feed983e 100644 --- a/Documentation/Evolution/DelimiterSyntax.md +++ b/Documentation/Evolution/DelimiterSyntax.md @@ -1,20 +1,27 @@ -# Regular Expression Literal Delimiters +# Regex Literal Delimiters - Authors: Hamish Knight, Michael Ilseman, David Ewing ## Introduction -**TODO** +This proposal introduces regex literals to Swift source code. 
The proposed syntax mirrors literals in other programming languages such as Perl, JavaScript and Ruby. As in those languages, literals are delimited with the `/` character: -**TODO: Motivation for regex literals in the first place? Or is that a given?** +```swift +let re = /[0-9]+/ +``` + +## Motivation -**TODO: Overview of regex literals in other languages?** +This proposal helps complete the story told in [Regex Type and Overview][regex-type] and [elsewhere][pitch-status]. Literals are compiled directly, allowing errors to be found at compile time, rather than at run time. Using a literal also allows editors to support features such as syntax coloring inside the literal, highlighting sub-structure of the regex, and conversion of the literal to an equivalent result builder DSL (see [Regex builder DSL][regex-dsl]). It would be difficult to support all of this if regexes could only be defined inside a string. -## Detailed Design + +## Proposed solution **TODO: Say that this is Swift 6 syntax only, `#/.../#` would be 5.7 syntax** -A regular expression literal will be introduced using `/.../` delimiters, within which the compiler will parse a regular expression (the details of which are outlined in [the Regex Syntax pitch][internal-syntax]): +**TODO: But is it?** + +A regex literal will be introduced using `/.../` delimiters, within which the compiler will parse a regular expression (the details of which are outlined in [the Regex Syntax pitch][internal-syntax]): ```swift // Matches " = ", extracting the identifier and hex number @@ -25,21 +32,21 @@ Forward slashes are a regex term of art, and are used as the delimiters for rege Due to the existing use of `/` in comment syntax and operators, there are some syntactic ambiguities to consider. While there are quite a few cases to consider, we do not feel that the impact of any individual case is sufficient to disqualify the syntax.
-**TODO: Anything else we want to say here before segueing into the massive list?** +## Detailed design -### Parsing ambiguities +Choice of `/` as the regex literal delimiter requires a number of ambiguities to be resolved. And it requires some existing features of the language to be disallowed. -The obvious parsing ambiguity with `/.../` delimiters is with comment syntaxes. +### Ambiguities with comment syntax -- An empty regex literal would conflict with line comment syntax `//`. But this isn't a particularly useful thing to express, and can therefore be disallowed without significant impact. +Perhaps the most obvious parsing ambiguity with `/.../` delimiters is with comment syntax. -- The obvious choice for a multi-line regular expression literal would be to use `///` delimiters, in accordance with the precedent set by multi-line string literals `"""`. A different multi-line delimiter would be needed, with no obvious choice. +- An empty regex literal would conflict with line comment syntax `//`. But an empty regex isn't a particularly useful thing to express, and can be disallowed without significant impact. - There is a conflict with block comment syntax, when surrounding a regex literal ending with `*`, for example: ```swift /* - let regex = /x*/ + let regex = /[0-9]*/ */ ``` @@ -47,7 +54,10 @@ The obvious parsing ambiguity with `/.../` delimiters is with comment syntaxes. - Block comment syntax also means that a regex literal would not be able to start with the `*` character, however this is less of a concern as it would not be valid regex syntax. -- Finally, there would be a minor ambiguity with infix operators used with regex literals. When used without whitespace, e.g `x+/y/`, the expression will be treated as using an infix operator `+/`. Whitespace is therefore required `x + /y/` for regex literal interpretation. + +### Ambiguity with infix operators + +There would be a minor ambiguity with infix operators used with regex literals. 
When used without whitespace, e.g `x+/y/`, the expression will be treated as using an infix operator `+/`. Whitespace is therefore required `x + /y/` for regex literal interpretation. ### Regex syntax limitations @@ -163,12 +173,22 @@ This takes advantage of the fact that a regex literal will not be parsed if the -### Editor Considerations -**TODO: Rewrite now that `/.../` is the syntax being pitched?** +## Future Directions + +### Raw literals + +The obvious choice here would follow string literals and use `#/.../#`. + +### Multi-line literals + +The obvious choice for a multi-line regex literal would be to use `///` delimiters, in accordance with the precedent set by multi-line string literals `"""`. But this signifies a (documentation) comment, so a different multi-line delimiter would be needed, with no obvious choice. However, it's not clear that we need multi-line regex literals. The existing literals can be used inside a regex builder DSL. -As described above, there would be a lot involved in handling the parsing ambiguities with `/.../` delimiters. It's one thing to do this in the compiler. But the language also has to be understood by a plethora of source code editors. Those editors either need encode all those ambiguities, or they need to provide a "best effort" at handling the most common cases. It's all too common for editors to take the "best effort" route. There's a long history of complaints with editors that don't completely support a language's features. And indeed, there's plenty of history of editors that don't correctly support regular expression literals in other languages. By choosing a literal that is easily parsed, we should avoid seeing those complaints regarding Swift. +### Regex extended syntax +Allowing non-semantic whitespace and other features of the extended syntax would be highly desired, with no obvious choice for a literal. Perhaps the need is also lessened by the ability to use regex literals inside the regex builder DSL. 
+ +## Alternatives Considered ### Pound slash `#/.../#` @@ -180,12 +200,6 @@ However this option would also have the same block comment issue as `/.../` wher Additionally, introducing this syntax would introduce an inconsistency with raw string literal syntax, as `#/.../#` on its own would not treat backslashes as literal, unlike `#"..."#`. If raw regex syntax were implemented, it would start at `##/.../##`. With raw strings, escape sequences must use the same number of `#`s as the delimiter, e.g `#"\#n"#` for a newline. However for raw regex literals it would be one fewer `#` than the delimiter e.g `##/\#n/##`. -## Future Directions - -**TODO: What do we want to say here? Talk about raw and multiline? Don't really have a good option for the latter tho** - -## Alternatives Considered - ### Prefixed quote `re'...'` We could choose to use `re'...'` delimiters, for example: @@ -195,17 +209,9 @@ We could choose to use `re'...'` delimiters, for example: let regex = re'([[:alpha:]]\w*) = ([0-9A-F]+)' ``` -The use of two letter prefix could potentially be used as a namespace for future literal types. However, it is unusual for a Swift literal to be prefixed in this way. - -**TODO: Any other reasons why not to pick this?** - -**TODO: Mention that it nicely extends to raw and multiline?** - -#### Regex syntax limitations - -There are a few items of regex grammar that use the single quote character as a metacharacter. These include named group definitions and references such as `(?'name')`, `(?('name'))`, `\g'name'`, `\k'name'`, as well as callout syntax `(?C'arg')`. The use of a single quote conflicts with the `re'...'` delimiter as it will be considered the end of the literal. Fortunately, alternative syntax exists for all of these constructs, e.g `(?<name>)`, `\k<name>`, and `(?C"arg")`. +The use of two letter prefix could potentially be used as a namespace for future literal types.
It would also have obvious extensions to raw and multi-line literals using `re#'...'#` and `re'''...'''` respectively. However, it is unusual for a Swift literal to be prefixed in this way. We also feel that its similarity to a string literal might have users confuse it with a raw string literal. -As such, the single quote variants of the syntax would be considered invalid in a `re'...'` literal, and users must use the alternative syntax instead. If a raw variant of the syntax `re#'...'#` of the syntax is later added, that may also be used. In order to improve diagnostic behavior, the compiler would attempt to scan ahead when encountering the ending sequences `(?`, `(?(`, `\g`, `\k` and `(?C`. This would enable a more accurate error to be emitted that suggests the alternative syntax. +Also, there are a few items of regex grammar that use the single quote character as a metacharacter. These include named group definitions and references such as `(?'name')`, `(?('name'))`, `\g'name'`, `\k'name'`, as well as callout syntax `(?C'arg')`. The use of a single quote conflicts with the `re'...'` delimiter as it will be considered the end of the literal. However, alternative syntax exists for all of these constructs, e.g `(?<name>)`, `\k<name>`, and `(?C"arg")`. Those could be required instead. If a raw regex literal were later added, the single quote syntax could also be used. ### Prefixed double quote `re"...."` @@ -245,7 +251,7 @@ let regex: Regex = #"([[:alpha:]]\w*) = ([0-9A-F]+)"# However we decided against this because: -- We would not be able to easily apply custom syntax highlighting for the regex syntax. +- We would not be able to easily apply custom syntax highlighting and other editor features for the regex syntax. - It would require an `ExpressibleByRegexLiteral` contextual type to be treated as a regex, otherwise it would be defaulted to `String`, which may be undesired.
- In an overloaded context it may be ambiguous or unclear whether a string literal is meant to be interpreted as a literal string or regex. - Regex-specific escape sequences such as `\w` would likely require the use of raw string syntax `#"..."#`, as they are otherwise invalid in a string literal. @@ -258,3 +264,6 @@ Instead of adding a custom regex literal, we could require users to explicitly w [SE-0168]: https://github.com/apple/swift-evolution/blob/main/proposals/0168-multi-line-string-literals.md [SE-0200]: https://github.com/apple/swift-evolution/blob/main/proposals/0200-raw-string-escaping.md [internal-syntax]: https://forums.swift.org/t/pitch-regex-syntax/55711 +[regex-type]: https://forums.swift.org/t/pitch-regex-type-and-overview/56029 +[pitch-status]: https://github.com/apple/swift-experimental-string-processing/issues/107 +[regex-dsl]: https://forums.swift.org/t/pitch-regex-builder-dsl/56007 \ No newline at end of file From 1ebefa1b4a4dd304948e54e66131603a8ac54eb2 Mon Sep 17 00:00:00 2001 From: Hamish Knight Date: Thu, 24 Mar 2022 14:56:45 +0000 Subject: [PATCH 17/36] Expand out disclosure triangles, and other tweaks --- Documentation/Evolution/DelimiterSyntax.md | 52 ++++++++++------------ 1 file changed, 23 insertions(+), 29 deletions(-) diff --git a/Documentation/Evolution/DelimiterSyntax.md b/Documentation/Evolution/DelimiterSyntax.md index 8feed983e..da889082b 100644 --- a/Documentation/Evolution/DelimiterSyntax.md +++ b/Documentation/Evolution/DelimiterSyntax.md @@ -21,7 +21,7 @@ This proposal helps complete the story told in [Regex Type and Overview][regex-t **TODO: But is it?** -A regex literal will be introduced using `/.../` delimiters, within which the compiler will parse a regular expression (the details of which are outlined in [the Regex Syntax pitch][internal-syntax]): +A regex literal will be introduced using `/.../` delimiters, within which the compiler will parse a regex (the details of which are outlined in [the Regex Syntax 
pitch][internal-syntax]): ```swift // Matches " = ", extracting the identifier and hex number @@ -34,7 +34,7 @@ Due to the existing use of `/` in comment syntax and operators, there are some s ## Detailed design -Choice of `/` as the regex literal delimiter requires a number of ambiguities to be resolved. And it requires some existing features of the language to be disallowed. +Choosing `/` as the regex literal delimiter requires a number of ambiguities to be resolved. It also requires a couple of source breaking language changes to be introduced in a new language mode. ### Ambiguities with comment syntax @@ -50,7 +50,7 @@ Perhaps the most obvious parsing ambiguity with `/.../` delimiters is with comme */ ``` - In this case, the block comment would prematurely end on the second line, rather than extending all the way to the third line as the user would expect. This is already an issue today with `*/` in a string literal, however it is much more likely to occur in a regular expression given the prevalence of the `*` quantifier. + In this case, the block comment would prematurely end on the second line, rather than extending all the way to the third line as the user would expect. This is already an issue today with `*/` in a string literal, though it is more likely to occur in a regex given the prevalence of the `*` quantifier. This issue can be avoided in many cases by using line comment syntax `//` instead, which it should be noted is the syntax that Xcode uses when commenting out multiple lines. - Block comment syntax also means that a regex literal would not be able to start with the `*` character, however this is less of a concern as it would not be valid regex syntax. @@ -63,7 +63,7 @@ There would be a minor ambiguity with infix operators used with regex literals. In order to help avoid further parsing ambiguities, a regex literal will not be parsed if it starts with a space, tab, or `)` character. Though the latter is already invalid regex syntax. -
<details><summary>Rationale</summary> +#### Rationale This is due to 2 main ambiguities. The first of which arises when a `/.../` regex literal starts a new line. This is particularly problematic for result builders, where we expect it to be frequently used, for example: @@ -75,7 +75,7 @@ Builder { } ``` -This is parsed as a single operator chain, however it is likely the user is expecting a regex literal. To resolve this ambiguity, a regex literal may not start with a space or tab character. This takes advantage of the fact that infix operators require consistent spacing on either side. +This is parsed as a single operator chain, however it is likely the user is expecting a regex literal. To resolve this ambiguity, a regex literal may not start with a space or tab character. The above therefore remains an operator chain. This takes advantage of the fact that infix operators require consistent spacing on either side. If a space or tab is needed as the first character, it must be escaped, e.g: @@ -87,7 +87,7 @@ Builder { } ``` -The second ambiguity arises with Swift's ability to pass an unapplied operator reference as an argument to a function, for example: +The second ambiguity arises with Swift's ability to pass an unapplied operator reference as an argument to a function or subscript, for example: ```swift let arr: [Double] = [2, 3, 4] @@ -98,37 +98,31 @@ The `/` in the call to `reduce` is in a valid expression context, and as such co It should be noted that this only mitigates the issue, as it does not handle the case where the next character is a comma or right square bracket. These cases are explored further in the following section. -</details>
- ### Language changes required In addition to ambiguities listed above, there are also some parsing ambiguities that would require the following language changes in Swift 6 mode: - Deprecation of prefix operators containing the `/` character. - Parsing `/,` and `/]` as the start of a regex literal if a closing `/` is found, rather than an unapplied operator in an argument list. For example, `fn(/, /)` becomes a regex literal rather than 2 unapplied operator arguments. - -
<details><summary>Rationale</summary> -#### Prefix operators starting with `/` +#### Prefix operators containing `/` -We'd need to ban prefix operators starting with `/`, to avoid ambiguity with cases such as: +We need to ban prefix operators starting with `/`, to avoid ambiguity with cases such as: ```swift let x = /0; let y = 1/ let z = /^x^/ ``` - -Postfix `/` operators would be okay, as they'd only be treated as regex literal delimiters if we were already trying to lex as a regex literal. -#### Prefix operators containing `/` - -Prefix operators *containing* `/` (not just at the start) need banning too, in order to allow prefix operators to be used with regex literals in an unambiguous way, e.g: +Prefix operators containing `/` more generally also need banning, in order to allow prefix operators to be used with regex literals in an unambiguous way, e.g: ```swift let x = !/y / .foo() ``` - -Otherwise it would be interpreted as the prefix operator `!/` by default, and require parens `!(/y /)` for regex parsing. + +Today, this is interpreted as the prefix operator `!/` on `y`. With the banning of prefix operators containing `/`, it becomes prefix `!` on a regex literal, with a member access `.foo`. + +Postfix `/` operators do not require banning, as they'd only be treated as regex literal delimiters if we are already trying to lex as a regex literal. #### `/,` and `/]` as regex literal openings @@ -156,8 +150,6 @@ func baz(_ x: S) -> Int { `foo(/, /)` is currently parsed as 2 unapplied operator arguments. `bar(/, 2) + bar(/, 3)` is currently parsed as two independent calls that each take an unapplied `/` operator reference. Both of these would become regex literals arguments, `/, /` and `/, 2) + bar(/` respectively (though the latter would produce a regex error). -**TODO: Do we want to talk about a heuristic that looks for unbalanced parens?
I'm kind of hesitant to implement that, as it would have edge cases and might screw with regex errors that should be diagnosed as invalid regex, rather than some cryptic Swift syntactic error. Which would also make it harder to explain to users.** - To disambiguate these cases, users will need to surround at least the opening `/` with parentheses, e.g: ```swift @@ -180,6 +172,8 @@ This takes advantage of the fact that a regex literal will not be parsed if the The obvious choice here would follow string literals and use `#/.../#`. +**TODO: What backslash rules do we want?** + ### Multi-line literals The obvious choice for a multi-line regex literal would be to use `///` delimiters, in accordance with the precedent set by multi-line string literals `"""`. But this signifies a (documentation) comment, so a different multi-line delimiter would be needed, with no obvious choice. However, it's not clear that we need multi-line regex literals. The existing literals can be used inside a regex builder DSL. @@ -192,7 +186,7 @@ Allowing non-semantic whitespace and other features of the extended syntax would ### Pound slash `#/.../#` -**TODO: This needs to be rewritten to say that it's a transition syntax** +**TODO: This needs to be rewritten to say that it's a potential transition syntax** This would be less syntactically ambiguous than `/.../`, while retaining some of the term-of-art familiarity. It would also provide a natural path through which to introduce `/.../` in a new language mode, as users could drop the `#` characters once they upgrade. @@ -211,11 +205,11 @@ let regex = re'([[:alpha:]]\w*) = ([0-9A-F]+)' The use of two letter prefix could potentially be used as a namespace for future literal types. It would also have obvious extensions to raw and multi-line literals using `re#'...'#` and `re'''...'''` respectively. However, it is unusual for a Swift literal to be prefixed in this way. 
We also feel that its similarity to a string literal might have users confuse it with a raw string literal. -Also, there are a few items of regex grammar that use the single quote character as a metacharacter. These include named group definitions and references such as `(?'name')`, `(?('name'))`, `\g'name'`, `\k'name'`, as well as callout syntax `(?C'arg')`. The use of a single quote conflicts with the `re'...'` delimiter as it will be considered the end of the literal. However, alternative syntax exists for all of these constructs, e.g `(?<name>)`, `\k<name>`, and `(?C"arg")`. Those could be required instead. If a raw regex literal were later added, the single quote syntax could also be used. +Also, there are a few items of regex grammar that use the single quote character as a metacharacter. These include named group definitions and references such as `(?'name')`, `(?('name'))`, `\g'name'`, `\k'name'`, as well as callout syntax `(?C'arg')`. The use of a single quote conflicts with the `re'...'` delimiter as it will be considered the end of the literal. However, alternative syntax exists for all of these constructs, e.g `(?<name>)`, `\k<name>`, and `(?C"arg")`. Those could be required instead. A raw regex literal syntax e.g `re#'...'#` would also avoid this issue. ### Prefixed double quote `re"...."` -This would be a double quoted version of `re'...'`, more similar to string literal syntax. This has the advantage that single quote regex syntax e.g `(?'name')` would continue to work without requiring the use of the alternative syntax or "raw syntax" delimiters. However it could be argued that regex literals are distinct from string literals in that they introduce their own specific language to parse. As such, regex literals are more like "program literals" than "data literals", and the use of single quote instead of double quote may be useful in expressing this difference. +This would be a double quoted version of `re'...'`, more similar to string literal syntax.
This has the advantage that single quote regex syntax e.g `(?'name')` would continue to work without requiring the use of the alternative syntax or raw literal syntax. However it could be argued that regex literals are distinct from string literals in that they introduce their own specific language to parse. As such, regex literals are more like "program literals" than "data literals", and the use of single quote instead of double quote may be useful in expressing this difference. ### Single letter prefixed quote `r'...'` @@ -223,7 +217,7 @@ This would be a slightly shorter version of `re'...'`. While it's more concise, ### Single quotes `'...'` -This would be an even more concise version of `re'...'` that drops the prefix entirely. However, given how close it is to string literal syntax, it may not be entirely clear to users that `'...'` denotes a regular expression as opposed to some different form of string literal (e.g some form of character literal, or a string literal with different escaping rules). +This would be an even more concise version of `re'...'` that drops the prefix entirely. However, given how close it is to string literal syntax, it may not be entirely clear to users that `'...'` denotes a regex as opposed to some different form of string literal (e.g some form of character literal, or a string literal with different escaping rules). We could help distinguish it from a string literal by requiring e.g `'/.../'`, though it may not be clear that the `/` characters are part of the delimiters rather than part of the literal. Additionally, this would potentially rule out the use of `'...'` as a future literal kind. @@ -233,7 +227,7 @@ We could opt for for a more explicitly spelled out literal syntax such as `#rege Such a syntax would require the containing regex to correctly balance parentheses for groups, otherwise the rest of the line might be incorrectly considered a regex. 
This could place additional cognitive burden on the user, and may lead to an awkward typing experience. For example, if the user is editing a previously written regex, the syntax highlighting for the rest of the line may change, and unhelpful spurious errors may be reported. With a different delimiter, the compiler would be able to detect and better diagnose unbalanced parentheses in the regex. -We could avoid the parenthesis balancing issue by requiring an additional internal delimiter such as `#regex(/.../)`. However it is even more heavyweight, and it may be unclear that `/` is part of the delimiter rather than part of the literal. Alternatively, we could replace the internal delimiter with another character such as ```#regex`...` ```, `#regex{...}`, or `#regex/.../`. However those would be inconsistent with the existing `#literal(...)` syntax and the first two would overload the existing meanings for the ``` `` ``` and `{}` delimiters. +We could avoid the parenthesis balancing issue by requiring an additional internal delimiter such as `#regex(/.../)`. However this is even more heavyweight, and it may be unclear that `/` is part of the delimiter rather than part of an argument. Alternatively, we could replace the internal delimiter with another character such as ```#regex`...` ```, `#regex{...}`, or `#regex/.../`. However those would be inconsistent with the existing `#literal(...)` syntax and the first two would overload the existing meanings for the ``` `` ``` and `{}` delimiters. It should also be noted that `#regex(...)` would introduce a syntactic inconsistency where the argument of a `#literal(...)` is no longer necessarily valid Swift syntax, despite being written in the form of an argument. @@ -243,7 +237,7 @@ We could reduce the visual weight of `#regex(...)` by only requiring `#(...)`. 
H ### Reusing string literal syntax -Instead of supporting a first-class literal kind for regular expressions, we could instead allow users to write a regular expression in a string literal, and parse, diagnose, and generate the appropriate code when it's coerced to an `ExpressibleByRegexLiteral` conforming type. +Instead of supporting a first-class literal kind for regex, we could instead allow users to write a regex in a string literal, and parse, diagnose, and generate the appropriate code when it's coerced to the `Regex` type. ```swift let regex: Regex = #"([[:alpha:]]\w*) = ([0-9A-F]+)"# @@ -252,7 +246,7 @@ let regex: Regex = #"([[:alpha:]]\w*) = ([0-9A-F]+)"# However we decided against this because: - We would not be able to easily apply custom syntax highlighting and other editor features for the regex syntax. -- It would require an `ExpressibleByRegexLiteral` contextual type to be treated as a regex, otherwise it would be defaulted to `String`, which may be undesired. +- It would require a `Regex` contextual type to be treated as a regex, otherwise it would be defaulted to `String`, which may be undesired. - In an overloaded context it may be ambiguous or unclear whether a string literal is meant to be interpreted as a literal string or regex. - Regex-specific escape sequences such as `\w` would likely require the use of raw string syntax `#"..."#`, as they are otherwise invalid in a string literal. - It wouldn't be compatible with other string literal features such as interpolations. 
@@ -266,4 +260,4 @@ Instead of adding a custom regex literal, we could require users to explicitly w [internal-syntax]: https://forums.swift.org/t/pitch-regex-syntax/55711 [regex-type]: https://forums.swift.org/t/pitch-regex-type-and-overview/56029 [pitch-status]: https://github.com/apple/swift-experimental-string-processing/issues/107 -[regex-dsl]: https://forums.swift.org/t/pitch-regex-builder-dsl/56007 \ No newline at end of file +[regex-dsl]: https://forums.swift.org/t/pitch-regex-builder-dsl/56007 From 811bfcb7255c10410bd93393e32c8006c7b25535 Mon Sep 17 00:00:00 2001 From: Hamish Knight Date: Thu, 24 Mar 2022 15:44:54 +0000 Subject: [PATCH 18/36] Generalize discussion on language mode --- Documentation/Evolution/DelimiterSyntax.md | 14 ++++++-------- 1 file changed, 6 insertions(+), 8 deletions(-) diff --git a/Documentation/Evolution/DelimiterSyntax.md b/Documentation/Evolution/DelimiterSyntax.md index da889082b..7c9cb2a6a 100644 --- a/Documentation/Evolution/DelimiterSyntax.md +++ b/Documentation/Evolution/DelimiterSyntax.md @@ -17,10 +17,6 @@ This proposal helps complete the story told in [Regex Type and Overview][regex-t ## Proposed solution -**TODO: Say that this is Swift 6 syntax only, `#/.../#` would be 5.7 syntax** - -**TODO: But is it?** - A regex literal will be introduced using `/.../` delimiters, within which the compiler will parse a regex (the details of which are outlined in [the Regex Syntax pitch][internal-syntax]): ```swift @@ -32,6 +28,8 @@ Forward slashes are a regex term of art, and are used as the delimiters for rege Due to the existing use of `/` in comment syntax and operators, there are some syntactic ambiguities to consider. While there are quite a few cases to consider, we do not feel that the impact of any individual case is sufficient to disqualify the syntax. +Some of these ambiguities require a couple of source breaking language changes, and as such the `/.../` syntax will require upgrading to a new language mode in order to use. 
+ ## Detailed design Choosing `/` as the regex literal delimiter requires a number of ambiguities to be resolved. It also requires a couple of source breaking language changes to be introduced in a new language mode. @@ -186,13 +184,13 @@ Allowing non-semantic whitespace and other features of the extended syntax would ### Pound slash `#/.../#` -**TODO: This needs to be rewritten to say that it's a potential transition syntax** +This is a less syntactically ambiguous version of `/.../` that retains some of the term-of-art familiarity. It could potentially provide a natural path through which to introduce `/.../` in a new language mode, as users could drop the `#` characters once they upgrade. -This would be less syntactically ambiguous than `/.../`, while retaining some of the term-of-art familiarity. It would also provide a natural path through which to introduce `/.../` in a new language mode, as users could drop the `#` characters once they upgrade. +However, introducing this as non-raw regex literal syntax would introduce an inconsistency with raw string literal syntax, as `#/.../#` on its own would not treat backslashes as literal, unlike `#"..."#`. If raw regex syntax were added, they would likely start at `##/.../##`. With raw strings, escape sequences must use the same number of `#`s as the delimiter, e.g `#"\#n"#` for a newline. However for raw regex literals it would be one fewer `#` than the delimiter e.g `##/\#n/##`. -However this option would also have the same block comment issue as `/.../` where e.g `#/x*/#` nested inside a block comment would prematurely end. Similarly, it's not clear how a multi-line version of the literal would be spelled. +**TODO: What backslash rules do we want?** -Additionally, introducing this syntax would introduce an inconsistency with raw string literal syntax, as `#/.../#` on its own would not treat backslashes as literal, unlike `#"..."#`. If raw regex syntax were implemented, it would start at `##/.../##`. 
With raw strings, escape sequences must use the same number of `#`s as the delimiter, e.g `#"\#n"#` for a newline. However for raw regex literals it would be one fewer `#` than the delimiter e.g `##/\#n/##`. +It should also be noted that this option has the same block comment issue as `/.../` where e.g `#/[0-9]*/#` nested inside a block comment would prematurely end. Similarly, it's not clear how a multi-line version of the literal would be spelled. ### Prefixed quote `re'...'` From 70be0064db5fde6327954b9e13738a9fef7112a1 Mon Sep 17 00:00:00 2001 From: Hamish Knight Date: Fri, 25 Mar 2022 12:50:16 +0000 Subject: [PATCH 19/36] Expand some prose --- Documentation/Evolution/DelimiterSyntax.md | 21 +++++++++++++++------ 1 file changed, 15 insertions(+), 6 deletions(-) diff --git a/Documentation/Evolution/DelimiterSyntax.md b/Documentation/Evolution/DelimiterSyntax.md index 7c9cb2a6a..4afd3af96 100644 --- a/Documentation/Evolution/DelimiterSyntax.md +++ b/Documentation/Evolution/DelimiterSyntax.md @@ -1,4 +1,4 @@ -# Regex Literal Delimiters +# Regex Literals - Authors: Hamish Knight, Michael Ilseman, David Ewing @@ -24,11 +24,11 @@ A regex literal will be introduced using `/.../` delimiters, within which the co let regex = /([[:alpha:]]\w*) = ([0-9A-F]+)/ ``` -Forward slashes are a regex term of art, and are used as the delimiters for regex literals in Perl, JavaScript and Ruby (though Perl and Ruby also provide alternatives). Their ubiquity and familiarity makes them a compelling choice for Swift. +The above regex literal will be inferred to be [the regex type][regex-type] `Regex<(Substring, Substring, Substring)>`, where the capture types have been automatically inferred. Errors in the regex will be diagnosed by the compiler. -Due to the existing use of `/` in comment syntax and operators, there are some syntactic ambiguities to consider. 
While there are quite a few cases to consider, we do not feel that the impact of any individual case is sufficient to disqualify the syntax. +Forward slashes are a regex term of art, and are used as the delimiters for regex literals in Perl, JavaScript and Ruby (though Perl and Ruby also provide alternatives). Their ubiquity and familiarity makes them a compelling choice for Swift. -Some of these ambiguities require a couple of source breaking language changes, and as such the `/.../` syntax will require upgrading to a new language mode in order to use. +Due to the existing use of `/` in comment syntax and operators, there are some syntactic ambiguities to consider. While there are quite a few cases to consider, we do not feel that the impact of any individual case is sufficient to disqualify the syntax. Some of these ambiguities require a couple of source breaking language changes, and as such the `/.../` syntax will require upgrading to a new language mode in order to use. ## Detailed design @@ -182,11 +182,13 @@ Allowing non-semantic whitespace and other features of the extended syntax would ## Alternatives Considered +Given the fact that `/` is an existing term of art for regular expressions, we feel it should be the preferred delimiter syntax. While it has some syntactic ambiguities, we do not feel that they are sufficient to disqualify the syntax. To evaluate this trade-off, below is a list of alternative delimiters that would not have the same ambiguities. + ### Pound slash `#/.../#` This is a less syntactically ambiguous version of `/.../` that retains some of the term-of-art familiarity. It could potentially provide a natural path through which to introduce `/.../` in a new language mode, as users could drop the `#` characters once they upgrade. -However, introducing this as non-raw regex literal syntax would introduce an inconsistency with raw string literal syntax, as `#/.../#` on its own would not treat backslashes as literal, unlike `#"..."#`. 
If raw regex syntax were added, they would likely start at `##/.../##`. With raw strings, escape sequences must use the same number of `#`s as the delimiter, e.g `#"\#n"#` for a newline. However for raw regex literals it would be one fewer `#` than the delimiter e.g `##/\#n/##`. +However, introducing this as non-raw regex literal syntax would introduce an inconsistency with raw string literal syntax, as `#/.../#` on its own would not treat backslashes as literal, unlike `#"..."#`. If raw regex syntax was added, it would likely start at `##/.../##`. With raw strings, escape sequences must use the same number of `#`s as the delimiter, e.g `#"\#n"#` for a newline. However for raw regex literals it would be one fewer `#` than the delimiter e.g `##/\#n/##`. **TODO: What backslash rules do we want?** @@ -251,7 +253,14 @@ However we decided against this because: ### No custom literal -Instead of adding a custom regex literal, we could require users to explicitly write `Regex(compiling: "[abc]+")`. This would however lose all the benefits of parsing the literal at compile time, meaning that parse errors will instead be diagnosed at runtime, and no source tooling support (e.g syntax highlighting, refactoring actions) would be available. +Instead of adding a custom regex literal, we could require users to explicitly write `try! Regex(compiling: "[abc]+")`. This would be similar to `NSRegularExpression`, and loses all the benefits of parsing the literal at compile time. This would mean: + +- No source tooling support (e.g syntax highlighting, refactoring actions) would be available. +- Parse errors would be diagnosed at run time rather than at compile time. +- We would lose the type safety of typed captures. +- More verbose syntax is required. + +We therefore feel this would be a much less compelling feature without first class literal support. 
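As a rough illustration of the run-time diagnosis point above, the following sketch uses `NSRegularExpression` (which the text compares this alternative to) rather than the pitched `Regex(compiling:)` initializer. The pattern strings, the input, and the constant names are made up for the example.

```swift
import Foundation

// The pattern is an ordinary string, so the compiler cannot check it.
let malformedPattern = "[abc"   // unterminated character class

var failedAtRunTime = false
do {
    _ = try NSRegularExpression(pattern: malformedPattern, options: [])
} catch {
    // The mistake only surfaces when this line executes.
    failedAtRunTime = true
    print("Invalid pattern, diagnosed at run time: \(error)")
}

// A well-formed pattern compiles, but its captures are untyped:
// each one is an NSRange that has to be converted back by hand.
let valid = try! NSRegularExpression(pattern: "([a-z]+)=([0-9]+)", options: [])
let input = "width=42"
let fullRange = NSRange(input.startIndex..., in: input)

var value: String? = nil
if let match = valid.firstMatch(in: input, options: [], range: fullRange),
   let valueRange = Range(match.range(at: 2), in: input) {
    value = String(input[valueRange])
}
print(value ?? "no match")
```

Neither the malformed pattern nor the shape of the captures is visible to the compiler here, which is the gap a first-class literal is meant to close.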
[SE-0168]: https://github.com/apple/swift-evolution/blob/main/proposals/0168-multi-line-string-literals.md [SE-0200]: https://github.com/apple/swift-evolution/blob/main/proposals/0200-raw-string-escaping.md From 4bb25b3f60aa16d19ec1b193d1eeb9992d5e880e Mon Sep 17 00:00:00 2001 From: Hamish Knight Date: Mon, 28 Mar 2022 13:31:16 +0100 Subject: [PATCH 20/36] Update to also pitch `#/.../#` --- Documentation/Evolution/DelimiterSyntax.md | 80 ++++++++++++++-------- 1 file changed, 53 insertions(+), 27 deletions(-) diff --git a/Documentation/Evolution/DelimiterSyntax.md b/Documentation/Evolution/DelimiterSyntax.md index 4afd3af96..237be3b1f 100644 --- a/Documentation/Evolution/DelimiterSyntax.md +++ b/Documentation/Evolution/DelimiterSyntax.md @@ -17,7 +17,7 @@ This proposal helps complete the story told in [Regex Type and Overview][regex-t ## Proposed solution -A regex literal will be introduced using `/.../` delimiters, within which the compiler will parse a regex (the details of which are outlined in [the Regex Syntax pitch][internal-syntax]): +A regex literal will be introduced in Swift 5.7 mode using `/.../` delimiters, within which the compiler will parse a regex (the details of which are outlined in [the Regex Syntax pitch][internal-syntax]): ```swift // Matches " = ", extracting the identifier and hex number @@ -28,11 +28,42 @@ The above regex literal will be inferred to be [the regex type][regex-type] `Reg Forward slashes are a regex term of art, and are used as the delimiters for regex literals in Perl, JavaScript and Ruby (though Perl and Ruby also provide alternatives). Their ubiquity and familiarity makes them a compelling choice for Swift. -Due to the existing use of `/` in comment syntax and operators, there are some syntactic ambiguities to consider. While there are quite a few cases to consider, we do not feel that the impact of any individual case is sufficient to disqualify the syntax. 
Some of these ambiguities require a couple of source breaking language changes, and as such the `/.../` syntax requires upgrading to a new language mode in order to use. + +A regex literal may also be spelled using an extended syntax `#/.../#`, which allows the placement of an arbitrary number of balanced `#` characters around a regex literal. This syntax allows regex literals to contain unescaped forward slashes, and provides a delimiter option which does not require a new language mode to use. ## Detailed design -Choosing `/` as the regex literal delimiter requires a number of ambiguities to be resolved. It also requires a couple of source breaking language changes to be introduced in a new language mode. +### Extended delimiters `#/.../#`, `##/.../##` + +A regex literal may be surrounded by an arbitrary number of balanced pound characters. This is similar to the raw string literal syntax introduced by [SE-0200], and allows a regex literal to use forward slashes without the need to escape them, e.g: + +```swift +let regex = #//usr/lib/modules/([^/]+)/vmlinuz/# +``` + +Additionally, this syntax provides a way to write a regex literal without needing to upgrade to Swift 5.7 mode. + +#### Escaping of backslashes + +This syntax differs from raw string literals `#"..."#` in that it does not treat backslashes as literal within the regex. A string literal `#"\n"#` represents the literal characters `\n`. However a regex literal `#/\n/#` remains a newline escape sequence. 
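The string-literal half of this comparison can be checked with plain Swift strings. This is a minimal sketch with illustrative constant names; it involves no regex literal, since the `#/.../#` behavior described here is still only proposed.

```swift
// Ordinary string literal: the compiler turns `\n` into a single newline character.
let escaped = "\n"

// Raw string literal (SE-0200): the backslash is literal, so this holds two characters.
let raw = #"\n"#

// Inside a raw string, an escape needs as many `#`s as the delimiter.
let rawEscaped = #"\#n"#

print(escaped.count)          // 1
print(raw.count)              // 2
print(raw == "\\n")           // true
print(rawEscaped == escaped)  // true
```

Under the rule pitched above, `#/\n/#` would behave like `escaped` rather than like `raw`: the added `#` characters change only how the delimiters are matched, not how backslashes are read.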
+ +One of the primary motivations behind this escaping behavior in raw string literals is that it allows the contents to be easily transportable to/from e.g external files where escaping is unnecessary. For string literals, this suggests that backslashes be treated as literal by default. For regex literals, it instead suggests that backslashes should retain their semantic meaning, as it enables interoperability with regexes taken from outside your code without having to adjust escape sequences to match the delimiters used. + +With string literals, escaping can be tricky without the use of raw syntax, as backslashes may have semantic meaning to the consumer, rather than the compiler. For example: + +```swift +// Matches '\' <word char> <whitespace>* '=' <whitespace>* <digit>+ +let regex = try NSRegularExpression(pattern: "\\\\\\w\\s*=\\s*\\d+", options: []) +``` + +In this case, the intent is not for the compiler to recognize any of these sequences as string literal escapes, it is instead for `NSRegularExpression` to interpret them as regex escape sequences. However this is not an issue for regex literals, as the regex parser is the only possible consumer of such escape sequences. Such a regex would instead be spelled as: + +```swift +let regex = /\\\w\s*=\s*\d+/ +``` + +Backslashes still require escaping to be treated as literal, however we don't expect this to be as common of an occurrence as needing to write a regex escape sequence such as `\s`, `\w`, or `\p{...}`, within a regex literal with extended delimiters `#/.../#`. ### Ambiguities with comment syntax @@ -55,11 +86,11 @@ Perhaps the most obvious parsing ambiguity with `/.../` delimiters is with comme ### Ambiguity with infix operators -There would be a minor ambiguity with infix operators used with regex literals. When used without whitespace, e.g `x+/y/`, the expression will be treated as using an infix operator `+/`. Whitespace is therefore required `x + /y/` for regex literal interpretation. 
+There would be a minor ambiguity with infix operators used with regex literals. When used without whitespace, e.g `x+/y/`, the expression will be treated as using an infix operator `+/`. Whitespace is therefore required `x + /y/` for regex literal interpretation. Alternatively, extended syntax may be used, e.g `x+#/y/#`. ### Regex syntax limitations -In order to help avoid further parsing ambiguities, a regex literal will not be parsed if it starts with a space, tab, or `)` character. Though the latter is already invalid regex syntax. +In order to help avoid further parsing ambiguities, a `/.../` regex literal will not be parsed if it starts with a space, tab, or `)` character. Though the latter is already invalid regex syntax. This restriction may be avoided by using extended `#/.../#` syntax. #### Rationale @@ -75,7 +106,7 @@ Builder { This is parsed as a single operator chain, however it is likely the user is expecting a regex literal. To resolve this ambiguity, a regex literal may not start with a space or tab character. The above therefore remains an operator chain. This takes advantage of the fact that infix operators require consistent spacing on either side. 
-If a space or tab is needed as the first character, it must be escaped, e.g: +If a space or tab is needed as the first character, it must be either escaped, e.g: ```swift Builder { @@ -85,6 +116,16 @@ Builder { } ``` +or extended syntax must be used, e.g: + +```swift +Builder { + 1 + #/ 2 /# + 3 +} +``` + The second ambiguity arises with Swift's ability to pass an unapplied operator reference as an argument to a function or subscript, for example: ```swift @@ -98,7 +139,7 @@ It should be noted that this only mitigates the issue, as it does not handle the ### Language changes required -In addition to ambiguities listed above, there are also some parsing ambiguities that would require the following language changes in Swift 6 mode: +In addition to ambiguities listed above, there are also some parsing ambiguities that would require the following language changes in Swift 5.7 mode: - Deprecation of prefix operators containing the `/` character. - Parsing `/,` and `/]` as the start of a regex literal if a closing `/` is found, rather than an unapplied operator in an argument list. For example, `fn(/, /)` becomes a regex literal rather than 2 unapplied operator arguments. @@ -124,6 +165,8 @@ Postfix `/` operators do not require banning, as they'd only be treated as regex #### `/,` and `/]` as regex literal openings +**TODO: Do we still want to break source here given we're also proposing `#/.../#`?** + As stated previously, there is a parsing ambiguity with unapplied operators in argument lists, tuples, and parentheses. Some of these cases can be mitigated by not parsing a regex literal if the starting character is `)`. However it does not solve the issue when the next character is `,` or `]`. Both of these are valid regex starting characters, and comma in particular may be a fairly common case for a regex. For example: @@ -163,15 +206,8 @@ This takes advantage of the fact that a regex literal will not be parsed if the
- ## Future Directions -### Raw literals - -The obvious choice here would follow string literals and use `#/.../#`. - -**TODO: What backslash rules do we want?** - ### Multi-line literals The obvious choice for a multi-line regex literal would be to use `///` delimiters, in accordance with the precedent set by multi-line string literals `"""`. But this signifies a (documentation) comment, so a different multi-line delimiter would be needed, with no obvious choice. However, it's not clear that we need multi-line regex literals. The existing literals can be used inside a regex builder DSL. @@ -184,16 +220,6 @@ Allowing non-semantic whitespace and other features of the extended syntax would Given the fact that `/` is an existing term of art for regular expressions, we feel it should be the preferred delimiter syntax. While it has some syntactic ambiguities, we do not feel that they are sufficient to disqualify the syntax. To evaluate this trade-off, below is a list of alternative delimiters that would not have the same ambiguities. -### Pound slash `#/.../#` - -This is a less syntactically ambiguous version of `/.../` that retains some of the term-of-art familiarity. It could potentially provide a natural path through which to introduce `/.../` in a new language mode, as users could drop the `#` characters once they upgrade. - -However, introducing this as non-raw regex literal syntax would introduce an inconsistency with raw string literal syntax, as `#/.../#` on its own would not treat backslashes as literal, unlike `#"..."#`. If raw regex syntax was added, it would likely start at `##/.../##`. With raw strings, escape sequences must use the same number of `#`s as the delimiter, e.g `#"\#n"#` for a newline. However for raw regex literals it would be one fewer `#` than the delimiter e.g `##/\#n/##`. 
- -**TODO: What backslash rules do we want?** - -It should also be noted that this option has the same block comment issue as `/.../` where e.g `#/[0-9]*/#` nested inside a block comment would prematurely end. Similarly, it's not clear how a multi-line version of the literal would be spelled. - ### Prefixed quote `re'...'` We could choose to use `re'...'` delimiters, for example: @@ -203,13 +229,13 @@ We could choose to use `re'...'` delimiters, for example: let regex = re'([[:alpha:]]\w*) = ([0-9A-F]+)' ``` -The use of two letter prefix could potentially be used as a namespace for future literal types. It would also have obvious extensions to raw and multi-line literals using `re#'...'#` and `re'''...'''` respectively. However, it is unusual for a Swift literal to be prefixed in this way. We also feel that its similarity to a string literal might have users confuse it with a raw string literal. +The use of two letter prefix could potentially be used as a namespace for future literal types. It would also have obvious extensions to extended and multi-line literals using `re#'...'#` and `re'''...'''` respectively. However, it is unusual for a Swift literal to be prefixed in this way. We also feel that its similarity to a string literal might have users confuse it with a raw string literal. -Also, there are a few items of regex grammar that use the single quote character as a metacharacter. These include named group definitions and references such as `(?'name')`, `(?('name'))`, `\g'name'`, `\k'name'`, as well as callout syntax `(?C'arg')`. The use of a single quote conflicts with the `re'...'` delimiter as it will be considered the end of the literal. However, alternative syntax exists for all of these constructs, e.g `(?<name>)`, `\k<name>`, and `(?C"arg")`. Those could be required instead. A raw regex literal syntax e.g `re#'...'#` would also avoid this issue. +Also, there are a few items of regex grammar that use the single quote character as a metacharacter. 
These include named group definitions and references such as `(?'name')`, `(?('name'))`, `\g'name'`, `\k'name'`, as well as callout syntax `(?C'arg')`. The use of a single quote conflicts with the `re'...'` delimiter as it will be considered the end of the literal. However, alternative syntax exists for all of these constructs, e.g `(?<name>)`, `\k<name>`, and `(?C"arg")`. Those could be required instead. An extended regex literal syntax e.g `re#'...'#` would also avoid this issue. ### Prefixed double quote `re"...."` -This would be a double quoted version of `re'...'`, more similar to string literal syntax. This has the advantage that single quote regex syntax e.g `(?'name')` would continue to work without requiring the use of the alternative syntax or raw literal syntax. However it could be argued that regex literals are distinct from string literals in that they introduce their own specific language to parse. As such, regex literals are more like "program literals" than "data literals", and the use of single quote instead of double quote may be useful in expressing this difference. +This would be a double quoted version of `re'...'`, more similar to string literal syntax. This has the advantage that single quote regex syntax e.g `(?'name')` would continue to work without requiring the use of the alternative syntax or extended literal syntax. However it could be argued that regex literals are distinct from string literals in that they introduce their own specific language to parse. As such, regex literals are more like "program literals" than "data literals", and the use of single quote instead of double quote may be useful in expressing this difference. 
### Single letter prefixed quote `r'...'` From 35d9132089d60596d558e4fac381c11d5b1c57f8 Mon Sep 17 00:00:00 2001 From: Hamish Knight Date: Mon, 28 Mar 2022 21:18:21 +0100 Subject: [PATCH 21/36] Add DSL example --- Documentation/Evolution/DelimiterSyntax.md | 49 +++++++++++++++------- 1 file changed, 35 insertions(+), 14 deletions(-) diff --git a/Documentation/Evolution/DelimiterSyntax.md b/Documentation/Evolution/DelimiterSyntax.md index 237be3b1f..e421e7960 100644 --- a/Documentation/Evolution/DelimiterSyntax.md +++ b/Documentation/Evolution/DelimiterSyntax.md @@ -14,6 +14,21 @@ let re = /[0-9]+/ This proposal helps complete the story told in [Regex Type and Overview][regex-type] and [elsewhere][pitch-status]. Literals are compiled directly, allowing errors to be found at compile time, rather than at run time. Using a literal also allows editors to support features such as syntax coloring inside the literal, highlighting sub-structure of the regex, and conversion of the literal to an equivalent result builder DSL (see [Regex builder DSL][regex-dsl]). It would be difficult to support all of this if regexes could only be defined inside a string. +A regex literal also allows for seamless composition with the Regex DSL, enabling the intermixing of a regex syntax with other elements of the builder: + +```swift +// A regex literal for parsing an amount of currency in dollars or pounds. +let regex = Regex { + /([$£])/ + TryCapture { + OneOrMore(.digit) + "." + Repeat(.digit, count: 2) + } transform: { Amount(twoDecimalPlaces: $0) } +} +``` + +This flexibility allows for terse matching syntax to be used when it's suitable, and more explicit syntax where clarity and strong types are required. ## Proposed solution @@ -94,35 +109,41 @@ In order to help avoid further parsing ambiguities, a `/.../` regex literal will #### Rationale -This is due to 2 main ambiguities. The first of which arises when a `/.../` regex literal starts a new line. 
This is particularly problematic for result builders, where we expect it to be frequently used, for example: +This is due to 2 main parsing ambiguities. The first of which arises when a `/.../` regex literal starts a new line. This is particularly problematic for result builders, where we expect it to be frequently used, in particular within a `Regex` builder: ```swift -Builder { - 1 - / 2 / - 3 +let digit = Regex { + TryCapture(OneOrMore(.digit)) { Int($0) } +} +// Matches against <digit>+ (' + ' | ' - ') <digit>+ +let regex = Regex { + digit + / [+-] / + digit } ``` -This is parsed as a single operator chain, however it is likely the user is expecting a regex literal. To resolve this ambiguity, a regex literal may not start with a space or tab character. The above therefore remains an operator chain. This takes advantage of the fact that infix operators require consistent spacing on either side. +Instead of being parsed as 3 result builder elements, the second of which being a regex literal, this is instead parsed as a single operator chain with the operands `digit`, `[+-]`, and `digit`. This will therefore be diagnosed as semantically invalid. + +To avoid this issue, a regex literal may not start with a space or tab character. This takes advantage of the fact that infix operators require consistent spacing on either side. 
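The operator-chain reading can be seen with ordinary integer operands standing in for the builder elements. This is an assumption made purely for illustration; no regex literal or result builder is involved.

```swift
// With whitespace on both sides, each `/` is lexed as an infix operator,
// and an infix operator may begin a continuation line. The three lines
// below therefore form the single expression `8 / 2 / 2`.
let chained = 8
    / 2 /
    2

print(chained) // 2
```

Because a line beginning with `/ ` is already valid as a continuation of the previous expression, treating it as the start of a regex literal would change the meaning of existing code, which is why the rule keeps such lines as operator chains.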
If a space or tab is needed as the first character, it must be either escaped, e.g: ```swift -Builder { - 1 - /\ 2 / - 3 +let regex = Regex { + digit + /\ [+-] / + digit } ``` or extended syntax must be used, e.g: ```swift -Builder { - 1 - #/ 2 /# - 3 +let regex = Regex { + digit + #/ [+-] /# + digit } ``` From f99dadb2b42bf5e859e8b944105216cc061597de Mon Sep 17 00:00:00 2001 From: Hamish Knight Date: Tue, 29 Mar 2022 12:32:05 +0100 Subject: [PATCH 22/36] Rejig motivation/solution --- Documentation/Evolution/DelimiterSyntax.md | 53 ++++++++++++++++------ 1 file changed, 38 insertions(+), 15 deletions(-) diff --git a/Documentation/Evolution/DelimiterSyntax.md b/Documentation/Evolution/DelimiterSyntax.md index e421e7960..03046a7c3 100644 --- a/Documentation/Evolution/DelimiterSyntax.md +++ b/Documentation/Evolution/DelimiterSyntax.md @@ -4,7 +4,7 @@ ## Introduction -This proposal introduces regex literals to Swift source code. The proposed syntax mirrors literals in other programing languages such as Perl, JavaScript and Ruby. As in those languages, literals are delimited with the `/` character: +This proposal helps complete the story told in [Regex Type and Overview][regex-type] and [elsewhere][pitch-status]. We propose the introduction of regex literals to Swift source code. The proposed syntax mirrors literals in other programing languages such as Perl, JavaScript and Ruby. As in those languages, literals are delimited with the `/` character: ```swift let re = /[0-9]+/ @@ -12,9 +12,38 @@ let re = /[0-9]+/ ## Motivation -This proposal helps complete the story told in [Regex Type and Overview][regex-type] and [elsewhere][pitch-status]. Literals are compiled directly, allowing errors to be found at compile time, rather than at run time. 
Using a literal also allows editors to support features such as syntax coloring inside the literal, highlighting sub-structure of the regex, and conversion of the literal to an equivalent result builder DSL (see [Regex builder DSL][regex-dsl]). It would be difficult to support all of this if regexes could only be defined inside a string. +In [Regex Type and Overview][regex-type] we introduced the `Regex` type, which is able to dynamically compile a regex pattern: -A regex literal also allows for seamless composition with the Regex DSL, enabling the intermixing of a regex syntax with other elements of the builder: +```swift +let pattern = #"(\w+)\s\s+(\S+)\s\s+((?:(?!\s\s).)*)\s\s+(.*)"# +let regex = try! Regex(compiling: pattern) +// regex: Regex +``` + +The ability to compile regex patterns at runtime is useful for cases where it is e.g provided as user input, however it is suboptimal when the pattern is statically known for a number of reasons: + +- Regex syntax errors aren't detected until runtime, and explicit error handling (e.g `try!`) is required to deal with these errors. +- No special source tooling support, such as syntactic highlighting, code completion, and refactoring support, is available. +- Capture types aren't known until runtime, and as such a dynamic `AnyRegexOutput` capture type must be used. +- The syntax is overly verbose, especially for e.g an argument to a matching function. + +## Proposed solution + +We propose introducing a new kind of literal for a regex. In Swift 5.7 mode, a regex literal may be written using `/.../` delimiters: + +```swift +// Matches " = ", extracting the identifier and hex number +let regex = /(?<identifier>[[:alpha:]]\w*) = (?<hex>[0-9A-F]+)/ +// regex: Regex<(Substring, identifier: Substring, hex: Substring)> +``` + +Forward slashes are a regex term of art, and are used as the delimiters for regex literals in Perl, JavaScript and Ruby (though Perl and Ruby also provide alternatives). 
Their ubiquity and familiarity makes them a compelling choice for Swift. + +A regex literal may also be spelled using an extended syntax `#/.../#`, which allows the placement of an arbitrary number of balanced `#` characters around a regex literal. This syntax allows regex literals to contain unescaped forward slashes, and may be used without needing to upgrade to Swift 5.7 mode. + +Within a regex literal, the compiler will parse the regex syntax outlined in in [the Regex Syntax pitch][internal-syntax], and diagnose any errors at compile time. The capture types are automatically inferred based on the capture groups present in the regex. Using a literal allows editors to support features such as syntax coloring inside the literal, highlighting sub-structure of the regex, and conversion of the literal to an equivalent result builder DSL (see [Regex builder DSL][regex-dsl]). + +A regex literal also allows for seamless composition with the Regex DSL, enabling lightweight intermixing of a regex syntax with other elements of the builder: ```swift // A regex literal for parsing an amount of currency in dollars or pounds. @@ -30,24 +59,18 @@ let regex = Regex { This flexibility allows for terse matching syntax to be used when it's suitable, and more explicit syntax where clarity and strong types are required. -## Proposed solution - -A regex literal will be introduced in Swift 5.7 mode using `/.../` delimiters, within which the compiler will parse a regex (the details of which are outlined in [the Regex Syntax pitch][internal-syntax]): +Due to the existing use of `/` in comment syntax and operators, there are some syntactic ambiguities to consider. While there are quite a few cases to consider, we do not feel that the impact of any individual case is sufficient to disqualify the syntax. Some of these ambiguities require a couple of source breaking language changes, and as such the `/.../` syntax requires upgrading to a new language mode in order to use. 
-```swift -// Matches " = ", extracting the identifier and hex number -let regex = /([[:alpha:]]\w*) = ([0-9A-F]+)/ -``` +## Detailed design -The above regex literal will be inferred to be [the regex type][regex-type] `Regex<(Substring, Substring, Substring)>`, where the capture types have been automatically inferred. Errors in the regex will be diagnosed by the compiler. +### Named typed captures -Forward slashes are a regex term of art, and are used as the delimiters for regex literals in Perl, JavaScript and Ruby (though Perl and Ruby also provide alternatives). Their ubiquity and familiarity makes them a compelling choice for Swift. +Regex literals have their capture types statically determined by the capture groups present. Each capture group adds an additional capture to the match tuple, with named capture groups receiving a corresponding tuple label. -Due to the existing use of `/` in comment syntax and operators, there are some syntactic ambiguities to consider. While there are quite a few cases to consider, we do not feel that the impact of any individual case is sufficient to disqualify the syntax. Some of these ambiguities require a couple of source breaking language changes, and as such the `/.../` syntax requires upgrading to a new language mode in order to use. +**TODO: Example** -A regex literal may also be spelled using an extended syntax `#/.../#`, which allows the placement of an arbitrary number of balanced `#` characters around a regex literal. This syntax allows regex literals to contain unescaped forward slashes, and provides a delimiter option which does not require a new language mode to use. -## Detailed design +**TODO: Should we cover more general typed capture behavior here? e.g Quantifier types. 
It overlaps with the typed capture behavior of the DSL tho** ### Extended delimiters `#/.../#`, `##/.../##` From eed1b24d482edb9110edce1b3f3839ffadbbe50e Mon Sep 17 00:00:00 2001 From: Hamish Knight Date: Tue, 29 Mar 2022 13:35:34 +0100 Subject: [PATCH 23/36] Expand on typed captures --- Documentation/Evolution/DelimiterSyntax.md | 26 +++++++++++++++++----- 1 file changed, 20 insertions(+), 6 deletions(-) diff --git a/Documentation/Evolution/DelimiterSyntax.md b/Documentation/Evolution/DelimiterSyntax.md index 03046a7c3..e34aba37a 100644 --- a/Documentation/Evolution/DelimiterSyntax.md +++ b/Documentation/Evolution/DelimiterSyntax.md @@ -41,7 +41,7 @@ Forward slashes are a regex term of art, and are used as the delimiters for rege A regex literal may also be spelled using an extended syntax `#/.../#`, which allows the placement of an arbitrary number of balanced `#` characters around a regex literal. This syntax allows regex literals to contain unescaped forward slashes, and may be used without needing to upgrade to Swift 5.7 mode. -Within a regex literal, the compiler will parse the regex syntax outlined in in [the Regex Syntax pitch][internal-syntax], and diagnose any errors at compile time. The capture types are automatically inferred based on the capture groups present in the regex. Using a literal allows editors to support features such as syntax coloring inside the literal, highlighting sub-structure of the regex, and conversion of the literal to an equivalent result builder DSL (see [Regex builder DSL][regex-dsl]). +Within a regex literal, the compiler will parse the regex syntax outlined in in [the Regex Syntax pitch][internal-syntax], and diagnose any errors at compile time. The capture types and labels are automatically inferred based on the capture groups present in the regex. 
Using a literal allows editors to support features such as syntax coloring inside the literal, highlighting sub-structure of the regex, and conversion of the literal to an equivalent result builder DSL (see [Regex builder DSL][regex-dsl]). A regex literal also allows for seamless composition with the Regex DSL, enabling lightweight intermixing of a regex syntax with other elements of the builder: @@ -63,21 +63,34 @@ Due to the existing use of `/` in comment syntax and operators, there are some s ## Detailed design -### Named typed captures +### Typed captures -Regex literals have their capture types statically determined by the capture groups present. Each capture group adds an additional capture to the match tuple, with named capture groups receiving a corresponding tuple label. +Regex literals have their capture types statically determined by the capture groups present. A initial `Substring` is always present for the entire match, and each capture group adds an additional capture to the match tuple, with named capture groups receiving a corresponding tuple label. Once matched, such captures may later be referenced: -**TODO: Example** +```swift +func matchHexAssignment(_ input: String) -> (String, Int)? { + let regex = /(?<identifier>[[:alpha:]]\w*) = (?<hex>[0-9A-F]+)/ + // regex: Regex<(Substring, identifier: Substring, hex: Substring)> + + guard let match = regex.matchWhole(input), + let hex = Int(match.hex, radix: 16) + else { return nil } + + return (match.identifier, hex) +} +``` +Unnamed capture groups produce unlabeled tuple elements and must be referenced by their position, e.g `match.1`, `match.2`. -**TODO: Should we cover more general typed capture behavior here? e.g Quantifier types. It overlaps with the typed capture behavior of the DSL tho** +**TODO: Should we cover more general typed capture behavior from [StronglyTypedCaptures.md](https://github.com/apple/swift-experimental-string-processing/blob/main/Documentation/Evolution/StronglyTypedCaptures.md) here? 
There is some overlap with the typed capture behavior of the DSL tho, labels are the main thing that are literal specific** ### Extended delimiters `#/.../#`, `##/.../##` A regex literal may be surrounded by an arbitrary number of balanced pound characters. This is a similar to raw string literal syntax introduced by [SE-0200], and allows a regex literal to use forward slashes without the need to escape them, e.g: ```swift -let regex = #//usr/lib/modules/([^/]+)/vmlinuz/# +let regex = #/usr/lib/modules/([^/]+)/vmlinuz/# +// regex: Regex<(Substring, Substring)> ``` Additionally, this syntax provides a way to write a regex literal without needing to upgrade to Swift 5.7 mode. @@ -99,6 +112,7 @@ In this case, the intent is not for the compiler to recognize any of these seque ```swift let regex = /\\\w\s*=\s*\d+/ +// regex: Regex ``` Backslashes still require escaping to be treated as literal, however we don't expect this to be as common of an occurrence as needing to write a regex escape sequence such as `\s`, `\w`, or `\p{...}`, within a regex literal with extended delimiters `#/.../#`. From 7ad403791325b3dde44abc275e73026080e734dc Mon Sep 17 00:00:00 2001 From: Hamish Knight Date: Tue, 29 Mar 2022 20:31:49 +0100 Subject: [PATCH 24/36] Generalize language mode --- Documentation/Evolution/DelimiterSyntax.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/Documentation/Evolution/DelimiterSyntax.md b/Documentation/Evolution/DelimiterSyntax.md index e34aba37a..8e50dec93 100644 --- a/Documentation/Evolution/DelimiterSyntax.md +++ b/Documentation/Evolution/DelimiterSyntax.md @@ -29,7 +29,7 @@ The ability to compile regex patterns at runtime is useful for cases where it is ## Proposed solution -We propose introducing a new kind of literal for a regex. In Swift 5.7 mode, a regex literal may be written using `/.../` delimiters: +We propose introducing a new kind of literal for a regex. 
In a new language mode, a regex literal may be written using `/.../` delimiters: ```swift // Matches " = ", extracting the identifier and hex number @@ -39,7 +39,7 @@ let regex = /(?<identifier>[[:alpha:]]\w*) = (?<hex>[0-9A-F]+)/ Forward slashes are a regex term of art, and are used as the delimiters for regex literals in Perl, JavaScript and Ruby (though Perl and Ruby also provide alternatives). Their ubiquity and familiarity makes them a compelling choice for Swift. -A regex literal may also be spelled using an extended syntax `#/.../#`, which allows the placement of an arbitrary number of balanced `#` characters around a regex literal. This syntax allows regex literals to contain unescaped forward slashes, and may be used without needing to upgrade to Swift 5.7 mode. +A regex literal may also be spelled using an extended syntax `#/.../#`, which allows the placement of an arbitrary number of balanced `#` characters around a regex literal. This syntax allows regex literals to contain unescaped forward slashes, and may be used without needing to upgrade to a new language mode. Within a regex literal, the compiler will parse the regex syntax outlined in in [the Regex Syntax pitch][internal-syntax], and diagnose any errors at compile time. The capture types and labels are automatically inferred based on the capture groups present in the regex. Using a literal allows editors to support features such as syntax coloring inside the literal, highlighting sub-structure of the regex, and conversion of the literal to an equivalent result builder DSL (see [Regex builder DSL][regex-dsl]). @@ -93,13 +93,13 @@ let regex = #/usr/lib/modules/([^/]+)/vmlinuz/# // regex: Regex<(Substring, Substring)> ``` -Additionally, this syntax provides a way to write a regex literal without needing to upgrade to Swift 5.7 mode. +Additionally, this syntax provides a way to write a regex literal without needing to upgrade to a new language mode. 
#### Escaping of backslashes This syntax differs from raw string literals `#"..."#` in that it does not treat backslashes as literal within the regex. A string literal `#"\n"#` represents the literal characters `\n`. However a regex literal `#/\n/#` remains a newline escape sequence. -One of the primary motivations behind this escaping behavior in raw string literals is that it allows the contents to be easily transportable to/from e.g external files where escaping is unnecessary. For string literals, this suggests that backslashes be treated as literal by default. For regex literals, it instead suggests that backslashes should retain their semantic meaning, as it enables interoperability with regexes taken from outside your code without having to adjust escape sequences to match the delimiters used. +One of the primary motivations behind this escaping behavior in raw string literals is that it allows the contents to be easily transportable to/from e.g external files where escaping is unnecessary. For string literals, this suggests that backslashes be treated as literal by default. For regex literals however, it instead suggests that backslashes should retain their semantic meaning, as it enables interoperability with regexes taken from outside your code without having to adjust escape sequences to match the delimiters used. With string literals, escaping can be tricky without the use of raw syntax, as backslashes may have semantic meaning to the consumer, rather than the compiler. 
For example: @@ -197,7 +197,7 @@ It should be noted that this only mitigates the issue, as it does not handle the ### Language changes required -In addition to ambiguities listed above, there are also some parsing ambiguities that would require the following language changes in Swift 5.7 mode: +In addition to ambiguities listed above, there are also some parsing ambiguities that would require the following language changes in a new language mode: - Deprecation of prefix operators containing the `/` character. - Parsing `/,` and `/]` as the start of a regex literal if a closing `/` is found, rather than an unapplied operator in an argument list. For example, `fn(/, /)` becomes a regex literal rather than 2 unapplied operator arguments. From f4ef0c2cf0c113e95d281c4a16220e52b03eb2fc Mon Sep 17 00:00:00 2001 From: Hamish Knight Date: Thu, 31 Mar 2022 13:03:53 +0100 Subject: [PATCH 25/36] Add multi-line mode --- Documentation/Evolution/DelimiterSyntax.md | 43 +++++++++++++++------- 1 file changed, 30 insertions(+), 13 deletions(-) diff --git a/Documentation/Evolution/DelimiterSyntax.md b/Documentation/Evolution/DelimiterSyntax.md index 8e50dec93..e06789a57 100644 --- a/Documentation/Evolution/DelimiterSyntax.md +++ b/Documentation/Evolution/DelimiterSyntax.md @@ -20,11 +20,11 @@ let regex = try! Regex(compiling: pattern) // regex: Regex ``` -The ability to compile regex patterns at runtime is useful for cases where it is e.g provided as user input, however it is suboptimal when the pattern is statically known for a number of reasons: +The ability to compile regex patterns at run time is useful for cases where it is e.g provided as user input, however it is suboptimal when the pattern is statically known for a number of reasons: -- Regex syntax errors aren't detected until runtime, and explicit error handling (e.g `try!`) is required to deal with these errors. 
+- Regex syntax errors aren't detected until run time, and explicit error handling (e.g `try!`) is required to deal with these errors. - No special source tooling support, such as syntactic highlighting, code completion, and refactoring support, is available. -- Capture types aren't known until runtime, and as such a dynamic `AnyRegexOutput` capture type must be used. +- Capture types aren't known until run time, and as such a dynamic `AnyRegexOutput` capture type must be used. - The syntax is overly verbose, especially for e.g an argument to a matching function. ## Proposed solution @@ -39,7 +39,7 @@ let regex = /(?<identifier>[[:alpha:]]\w*) = (?<hex>[0-9A-F]+)/ Forward slashes are a regex term of art, and are used as the delimiters for regex literals in Perl, JavaScript and Ruby (though Perl and Ruby also provide alternatives). Their ubiquity and familiarity makes them a compelling choice for Swift. -A regex literal may also be spelled using an extended syntax `#/.../#`, which allows the placement of an arbitrary number of balanced `#` characters around a regex literal. This syntax allows regex literals to contain unescaped forward slashes, and may be used without needing to upgrade to a new language mode. +A regex literal may also be spelled using an extended syntax `#/.../#`, which allows the placement of an arbitrary number of balanced `#` characters around a regex literal. This syntax allows regex literals to contain unescaped forward slashes, and may be used without needing to upgrade to a new language mode. This syntax further allows a multi-line mode when the opening delimiter is followed by a new line. Within a regex literal, the compiler will parse the regex syntax outlined in in [the Regex Syntax pitch][internal-syntax], and diagnose any errors at compile time. The capture types and labels are automatically inferred based on the capture groups present in the regex. 
Using a literal allows editors to support features such as syntax coloring inside the literal, highlighting sub-structure of the regex, and conversion of the literal to an equivalent result builder DSL (see [Regex builder DSL][regex-dsl]). @@ -86,7 +86,7 @@ Unnamed capture groups produce unlabeled tuple elements and must be referenced b ### Extended delimiters `#/.../#`, `##/.../##` -A regex literal may be surrounded by an arbitrary number of balanced pound characters. This is a similar to raw string literal syntax introduced by [SE-0200], and allows a regex literal to use forward slashes without the need to escape them, e.g: +A regex literal may be surrounded by an arbitrary number of balanced pound characters. This is a somewhat similar to the raw string literal syntax introduced by [SE-0200], and allows a regex literal to use forward slashes without the need to escape them, e.g: ```swift let regex = #/usr/lib/modules/([^/]+)/vmlinuz/# @@ -99,7 +99,7 @@ Additionally, this syntax provides a way to write a regex literal without needin This syntax differs from raw string literals `#"..."#` in that it does not treat backslashes as literal within the regex. A string literal `#"\n"#` represents the literal characters `\n`. However a regex literal `#/\n/#` remains a newline escape sequence. -One of the primary motivations behind this escaping behavior in raw string literals is that it allows the contents to be easily transportable to/from e.g external files where escaping is unnecessary. For string literals, this suggests that backslashes be treated as literal by default. For regex literals however, it instead suggests that backslashes should retain their semantic meaning, as it enables interoperability with regexes taken from outside your code without having to adjust escape sequences to match the delimiters used. 
+One of the primary motivations behind this escaping behavior in raw string literals is that it allows the contents to be easily transportable to/from e.g external files where escaping is unnecessary. For string literals, this suggests that backslashes be treated as literal by default. For regex literals however, it instead suggests that backslashes should retain their semantic meaning. This enables interoperability with regexes taken from outside your code without having to adjust escape sequences to match the delimiters used. With string literals, escaping can be tricky without the use of raw syntax, as backslashes may have semantic meaning to the consumer, rather than the compiler. For example: @@ -117,6 +117,24 @@ let regex = /\\\w\s*=\s*\d+/ Backslashes still require escaping to be treated as literal, however we don't expect this to be as common of an occurrence as needing to write a regex escape sequence such as `\s`, `\w`, or `\p{...}`, within a regex literal with extended delimiters `#/.../#`. +#### Multi-line mode + +Extended regex delimiters additionally support a multi-line mode when the opening delimiter is followed by a new line. For example: + +```swift +let regex = #/ + # Match a line of the format e.g "DEBIT 03/03/2022 Totally Legit Shell Corp $2,000,000.00" + (? \w+) \s\s+ + (? \S+) \s\s+ + (? (?: (?!\s\s) . )+) \s\s+ # Note that account names may contain spaces. + (? .*) + /# +``` + +In this mode, [extended regex syntax][extended-regex-syntax] `(?x)` is enabled by default. This means that whitespace becomes non-semantic, and end-of-line comments are supported with `# comment` syntax. + +This mode is supported with any (non-zero) number of pound characters in the delimiter. Similar to multi-line strings introduced by [SE-0168], the closing delimiter must appear on a new line. To avoid parsing confusion, such a literal will not be parsed if a closing delimiter is not present. 
This avoids inadvertently treating the rest of the file as regex if you only type the opening. + ### Ambiguities with comment syntax Perhaps the most obvious parsing ambiguity with `/.../` delimiters is with comment syntax. @@ -266,13 +284,7 @@ This takes advantage of the fact that a regex literal will not be parsed if the ## Future Directions -### Multi-line literals - -The obvious choice for a multi-line regex literal would be to use `///` delimiters, in accordance with the precedent set by multi-line string literals `"""`. But this signifies a (documentation) comment, so a different multi-line delimiter would be needed, with no obvious choice. However, it's not clear that we need multi-line regex literals. The existing literals can be used inside a regex builder DSL. - -### Regex extended syntax - -Allowing non-semantic whitespace and other features of the extended syntax would be highly desired, with no obvious choice for a literal. Perhaps the need is also lessened by the ability to use regex literals inside the regex builder DSL. +**TODO: Do we have any other future directions now that extended multi-line syntax has been subsumed?** ## Alternatives Considered @@ -319,6 +331,10 @@ It should also be noted that `#regex(...)` would introduce a syntactic inconsist We could reduce the visual weight of `#regex(...)` by only requiring `#(...)`. However it would still retain the same issues, such as still looking potentially visually noisy as an argument, and having suboptimal behavior for parenthesis balancing. It is also not clear why regex literals would deserve such privileged syntax. +### Using a different delimiter for multi-line + +Instead of re-using the extended delimiter syntax `#/.../#` for multi-line regex literals, we could choose a different delimiter for it. Unfortunately, the obvious choice for a multi-line regex literal would be to use `///` delimiters, in accordance with the precedent set by multi-line string literals `"""`. 
This signifies a (documentation) comment, and as such would not be viable. + ### Reusing string literal syntax Instead of supporting a first-class literal kind for regex, we could instead allow users to write a regex in a string literal, and parse, diagnose, and generate the appropriate code when it's coerced to the `Regex` type. @@ -352,3 +368,4 @@ We therefore feel this would be a much less compelling feature without first cla [regex-type]: https://forums.swift.org/t/pitch-regex-type-and-overview/56029 [pitch-status]: https://github.com/apple/swift-experimental-string-processing/issues/107 [regex-dsl]: https://forums.swift.org/t/pitch-regex-builder-dsl/56007 +[extended-regex-syntax]: https://github.com/apple/swift-experimental-string-processing/blob/main/Documentation/Evolution/RegexSyntax.md#extended-syntax-modes From addfbfdd9b26fde35e2e708959bd5562342c5658 Mon Sep 17 00:00:00 2001 From: Hamish Knight Date: Fri, 1 Apr 2022 22:45:35 +0100 Subject: [PATCH 26/36] Update pitch --- Documentation/Evolution/DelimiterSyntax.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/Documentation/Evolution/DelimiterSyntax.md b/Documentation/Evolution/DelimiterSyntax.md index e06789a57..e1376ca7d 100644 --- a/Documentation/Evolution/DelimiterSyntax.md +++ b/Documentation/Evolution/DelimiterSyntax.md @@ -80,7 +80,7 @@ func matchHexAssignment(_ input: String) -> (String, Int)? { } ``` -Unnamed capture groups produce unlabeled tuple elements and must be referenced by their position, e.g `match.1`, `match.2`. +Unnamed capture groups produce unlabeled tuple elements and must be referenced by their position, e.g `match.1`, `match.2`. See [StronglyTypedCaptures.md](https://github.com/apple/swift-experimental-string-processing/blob/main/Documentation/Evolution/StronglyTypedCaptures.md) for more info. 
**TODO: Should we cover more general typed capture behavior from [StronglyTypedCaptures.md](https://github.com/apple/swift-experimental-string-processing/blob/main/Documentation/Evolution/StronglyTypedCaptures.md) here? There is some overlap with the typed capture behavior of the DSL tho, labels are the main thing that are literal specific** @@ -156,7 +156,7 @@ Perhaps the most obvious parsing ambiguity with `/.../` delimiters is with comme ### Ambiguity with infix operators -There would be a minor ambiguity with infix operators used with regex literals. When used without whitespace, e.g `x+/y/`, the expression will be treated as using an infix operator `+/`. Whitespace is therefore required `x + /y/` for regex literal interpretation. Alternatively, extended syntax may be used, e.g `x+#/y/#`. +There would be a minor ambiguity with infix operators used with regex literals. When used without whitespace, e.g `x+/y/`, the expression will be treated as using an infix operator `+/`. Whitespace is therefore required for regex literal interpretation, e.g `x + /y/`. Alternatively, extended syntax may be used, e.g `x+#/y/#`. ### Regex syntax limitations @@ -241,8 +241,6 @@ Postfix `/` operators do not require banning, as they'd only be treated as regex #### `/,` and `/]` as regex literal openings -**TODO: Do we still want to break source here given we're also proposing `#/.../#`?** - As stated previously, there is a parsing ambiguity with unapplied operators in argument lists, tuples, and parentheses. Some of these cases can be mitigated by not parsing a regex literal if the starting character is `)`. However it does not solve the issue when the next character is `,` or `]`. Both of these are valid regex starting characters, and comma in particular may be a fairly common case for a regex. 
For example: @@ -284,7 +282,9 @@ This takes advantage of the fact that a regex literal will not be parsed if the ## Future Directions -**TODO: Do we have any other future directions now that extended multi-line syntax has been subsumed?** +### Modern literal syntax + +We could support a more modern Swift-like syntax in regex literals. For example, comments could be done with `//` and `/* ... */`, and quoted sequences could be done with `"..."`. This would however be incompatible with the syntactic superset of regex syntax we intend to parse, and as such may need to be introduced using a new literal kind, with no obvious choice of delimiter. However, it's possible that the ability to use regex literals in the DSL lessens the benefit that this syntax would bring. ## Alternatives Considered From bb819f6f2657bbd425a5816b7a4971067e0052ea Mon Sep 17 00:00:00 2001 From: Hamish Knight Date: Fri, 1 Apr 2022 23:13:55 +0100 Subject: [PATCH 27/36] Clarify upgrade path --- Documentation/Evolution/DelimiterSyntax.md | 12 ++++++++---- 1 file changed, 8 insertions(+), 4 deletions(-) diff --git a/Documentation/Evolution/DelimiterSyntax.md b/Documentation/Evolution/DelimiterSyntax.md index e1376ca7d..0a86e170e 100644 --- a/Documentation/Evolution/DelimiterSyntax.md +++ b/Documentation/Evolution/DelimiterSyntax.md @@ -29,7 +29,7 @@ The ability to compile regex patterns at run time is useful for cases where it i ## Proposed solution -We propose introducing a new kind of literal for a regex. In a new language mode, a regex literal may be written using `/.../` delimiters: +A regex literal may be written using `/.../` delimiters: ```swift // Matches " = ", extracting the identifier and hex number @@ -39,7 +39,7 @@ let regex = /(?<identifier>[[:alpha:]]\w*) = (?<hex>[0-9A-F]+)/ Forward slashes are a regex term of art, and are used as the delimiters for regex literals in Perl, JavaScript and Ruby (though Perl and Ruby also provide alternatives). 
Their ubiquity and familiarity makes them a compelling choice for Swift. -A regex literal may also be spelled using an extended syntax `#/.../#`, which allows the placement of an arbitrary number of balanced `#` characters around a regex literal. This syntax allows regex literals to contain unescaped forward slashes, and may be used without needing to upgrade to a new language mode. This syntax further allows a multi-line mode when the opening delimiter is followed by a new line. +A regex literal may also be spelled using an extended syntax `#/.../#`, which allows the placement of an arbitrary number of balanced `#` characters around the literal. This syntax may be used to avoid needing to escape forward slashes within the regex. Additionally, it allows for a multi-line mode when the opening delimiter is followed by a new line. Within a regex literal, the compiler will parse the regex syntax outlined in in [the Regex Syntax pitch][internal-syntax], and diagnose any errors at compile time. The capture types and labels are automatically inferred based on the capture groups present in the regex. Using a literal allows editors to support features such as syntax coloring inside the literal, highlighting sub-structure of the regex, and conversion of the literal to an equivalent result builder DSL (see [Regex builder DSL][regex-dsl]). @@ -63,6 +63,10 @@ Due to the existing use of `/` in comment syntax and operators, there are some s ## Detailed design +### Upgrade path + +Due to the source breaking changes needed for the `/.../` syntax, it will be introduced in Swift 6 mode. However, projects will be able to adopt it earlier by using the compiler flag `-enable-regex-literals`. Note this does not affect the extended syntax `#/.../#`, which will be usable immediately. + ### Typed captures Regex literals have their capture types statically determined by the capture groups present. 
A initial `Substring` is always present for the entire match, and each capture group adds an additional capture to the match tuple, with named capture groups receiving a corresponding tuple label. Once matched, such captures may later be referenced: @@ -93,7 +97,7 @@ let regex = #/usr/lib/modules/([^/]+)/vmlinuz/# // regex: Regex<(Substring, Substring)> ``` -Additionally, this syntax provides a way to write a regex literal without needing to upgrade to a new language mode. +Additionally, it allows for a multi-line mode when the opening delimiter is followed by a new line. #### Escaping of backslashes @@ -215,7 +219,7 @@ It should be noted that this only mitigates the issue, as it does not handle the ### Language changes required -In addition to ambiguities listed above, there are also some parsing ambiguities that would require the following language changes in a new language mode: +In addition to ambiguities listed above, there are also some parsing ambiguities that require the following language changes in a new language mode: - Deprecation of prefix operators containing the `/` character. - Parsing `/,` and `/]` as the start of a regex literal if a closing `/` is found, rather than an unapplied operator in an argument list. For example, `fn(/, /)` becomes a regex literal rather than 2 unapplied operator arguments. 
From 0a9447a00f24e44daa0fd6f5e8c537e066cd462f Mon Sep 17 00:00:00 2001 From: Hamish Knight Date: Wed, 6 Apr 2022 11:41:26 +0100 Subject: [PATCH 28/36] Update typed captures section + other tweaks --- Documentation/Evolution/DelimiterSyntax.md | 32 +++++++++++++--------- 1 file changed, 19 insertions(+), 13 deletions(-) diff --git a/Documentation/Evolution/DelimiterSyntax.md b/Documentation/Evolution/DelimiterSyntax.md index 0a86e170e..39cf87551 100644 --- a/Documentation/Evolution/DelimiterSyntax.md +++ b/Documentation/Evolution/DelimiterSyntax.md @@ -4,7 +4,7 @@ ## Introduction -This proposal helps complete the story told in [Regex Type and Overview][regex-type] and [elsewhere][pitch-status]. We propose the introduction of regex literals to Swift source code. The proposed syntax mirrors literals in other programing languages such as Perl, JavaScript and Ruby. As in those languages, literals are delimited with the `/` character: +This proposal helps complete the story told in *[Regex Type and Overview][regex-type]* and [elsewhere][pitch-status]. We propose the introduction of regex literals to Swift source code. The proposed syntax mirrors literals in other programing languages such as Perl, JavaScript and Ruby. As in those languages, literals are delimited with the `/` character: ```swift let re = /[0-9]+/ @@ -12,7 +12,7 @@ let re = /[0-9]+/ ## Motivation -In [Regex Type and Overview][regex-type] we introduced the `Regex` type, which is able to dynamically compile a regex pattern: +In *[Regex Type and Overview][regex-type]* we introduced the `Regex` type, which is able to dynamically compile a regex pattern: ```swift let pattern = #"(\w+)\s\s+(\S+)\s\s+((?:(?!\s\s).)*)\s\s+(.*)"# @@ -41,7 +41,7 @@ Forward slashes are a regex term of art, and are used as the delimiters for rege A regex literal may also be spelled using an extended syntax `#/.../#`, which allows the placement of an arbitrary number of balanced `#` characters around the literal. 
This syntax may be used to avoid needing to escape forward slashes within the regex. Additionally, it allows for a multi-line mode when the opening delimiter is followed by a new line. -Within a regex literal, the compiler will parse the regex syntax outlined in in [the Regex Syntax pitch][internal-syntax], and diagnose any errors at compile time. The capture types and labels are automatically inferred based on the capture groups present in the regex. Using a literal allows editors to support features such as syntax coloring inside the literal, highlighting sub-structure of the regex, and conversion of the literal to an equivalent result builder DSL (see [Regex builder DSL][regex-dsl]). +Within a regex literal, the compiler will parse the regex syntax outlined in *[Regex Construction][internal-syntax]*, and diagnose any errors at compile time. The capture types and labels are automatically inferred based on the capture groups present in the regex. Using a literal allows editors to support features such as syntax coloring inside the literal, highlighting sub-structure of the regex, and conversion of the literal to an equivalent result builder DSL (see *[Regex builder DSL][regex-dsl]*). A regex literal also allows for seamless composition with the Regex DSL, enabling lightweight intermixing of a regex syntax with other elements of the builder: @@ -67,9 +67,9 @@ Due to the existing use of `/` in comment syntax and operators, there are some s Due to the source breaking changes needed for the `/.../` syntax, it will be introduced in Swift 6 mode. However, projects will be able to adopt it earlier by using the compiler flag `-enable-regex-literals`. Note this does not affect the extended syntax `#/.../#`, which will be usable immediately. -### Typed captures +### Named typed captures -Regex literals have their capture types statically determined by the capture groups present. 
A initial `Substring` is always present for the entire match, and each capture group adds an additional capture to the match tuple, with named capture groups receiving a corresponding tuple label. Once matched, such captures may later be referenced: +Regex literals have their capture types statically determined by the capture groups present. This follows the same inference behavior as [the DSL][regex-dsl], and is explored in more detail in *[Strongly Typed Captures][strongly-typed-captures]*. One aspect of this that is currently unique to the literal is the ability to infer labeled tuple elements for named capture groups. For example: ```swift func matchHexAssignment(_ input: String) -> (String, Int)? { @@ -84,9 +84,7 @@ func matchHexAssignment(_ input: String) -> (String, Int)? { } ``` -Unnamed capture groups produce unlabeled tuple elements and must be referenced by their position, e.g `match.1`, `match.2`. See [StronglyTypedCaptures.md](https://github.com/apple/swift-experimental-string-processing/blob/main/Documentation/Evolution/StronglyTypedCaptures.md) for more info. - -**TODO: Should we cover more general typed capture behavior from [StronglyTypedCaptures.md](https://github.com/apple/swift-experimental-string-processing/blob/main/Documentation/Evolution/StronglyTypedCaptures.md) here? There is some overlap with the typed capture behavior of the DSL tho, labels are the main thing that are literal specific** +This allows the captures to be referenced as `match.identifier` and `match.hex` instead of `match.1` and `match.2`, which would be the behavior for unnamed capture groups. This label inference behavior is not available in the DSL, however users are able to [bind captures to named variables instead][dsl-captures]. ### Extended delimiters `#/.../#`, `##/.../##` @@ -288,7 +286,9 @@ This takes advantage of the fact that a regex literal will not be parsed if the ### Modern literal syntax -We could support a more modern Swift-like syntax in regex literals. 
For example, comments could be done with `//` and `/* ... */`, and quoted sequences could be done with `"..."`. This would however be incompatible with the syntactic superset of regex syntax we intend to parse, and as such may need to be introduced using a new literal kind, with no obvious choice of delimiter. However, it's possible that the ability to use regex literals in the DSL lessens the benefit that this syntax would bring. +We could support a more modern Swift-like syntax in regex literals. For example, comments could be done with `//` and `/* ... */`, and quoted sequences could be done with `"..."`. This would however be incompatible with the syntactic superset of regex syntax we intend to parse, and as such may need to be introduced using a new literal kind, with no obvious choice of delimiter. + +However, such a syntax would lose out on the familiarity benefits of standard regex, and as such may lead to an "uncanny valley" effect. It's also possible that the ability to use regex literals in the DSL lessens the benefit that this syntax would bring. 
## Alternatives Considered @@ -368,8 +368,14 @@ We therefore feel this would be a much less compelling feature without first cla [SE-0168]: https://github.com/apple/swift-evolution/blob/main/proposals/0168-multi-line-string-literals.md [SE-0200]: https://github.com/apple/swift-evolution/blob/main/proposals/0200-raw-string-escaping.md -[internal-syntax]: https://forums.swift.org/t/pitch-regex-syntax/55711 -[regex-type]: https://forums.swift.org/t/pitch-regex-type-and-overview/56029 -[pitch-status]: https://github.com/apple/swift-experimental-string-processing/issues/107 -[regex-dsl]: https://forums.swift.org/t/pitch-regex-builder-dsl/56007 + +[pitch-status]: https://github.com/apple/swift-experimental-string-processing/blob/main/Documentation/Evolution/ProposalOverview.md +[regex-type]: https://github.com/apple/swift-evolution/blob/main/proposals/0350-regex-type-overview.md +[strongly-typed-captures]: https://github.com/apple/swift-experimental-string-processing/blob/main/Documentation/Evolution/StronglyTypedCaptures.md + +[internal-syntax]: https://github.com/apple/swift-experimental-string-processing/blob/main/Documentation/Evolution/RegexSyntax.md [extended-regex-syntax]: https://github.com/apple/swift-experimental-string-processing/blob/main/Documentation/Evolution/RegexSyntax.md#extended-syntax-modes + +[regex-dsl]: https://github.com/apple/swift-evolution/blob/main/proposals/0351-regex-builder.md +[dsl-captures]: https://github.com/apple/swift-evolution/blob/main/proposals/0351-regex-builder.md#capture-and-reference + From b41edbd9a2288d160b8e9e1f0aaa709a60ce161d Mon Sep 17 00:00:00 2001 From: Hamish Knight Date: Fri, 8 Apr 2022 12:17:11 +0100 Subject: [PATCH 29/36] Author links --- Documentation/Evolution/DelimiterSyntax.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Documentation/Evolution/DelimiterSyntax.md b/Documentation/Evolution/DelimiterSyntax.md index 39cf87551..ddf1603c4 100644 --- a/Documentation/Evolution/DelimiterSyntax.md 
+++ b/Documentation/Evolution/DelimiterSyntax.md @@ -1,6 +1,6 @@ # Regex Literals -- Authors: Hamish Knight, Michael Ilseman, David Ewing +- Authors: [Hamish Knight](https://github.com/hamishknight), [Michael Ilseman](https://github.com/milseman), [David Ewing](https://github.com/DaveEwing) ## Introduction From 8da68d38bf6b23a39bd0b75ec37047e97a8b8d80 Mon Sep 17 00:00:00 2001 From: Hamish Knight Date: Fri, 8 Apr 2022 17:22:38 +0100 Subject: [PATCH 30/36] Updated extended delimiter section --- Documentation/Evolution/DelimiterSyntax.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/Documentation/Evolution/DelimiterSyntax.md b/Documentation/Evolution/DelimiterSyntax.md index ddf1603c4..82f6d496b 100644 --- a/Documentation/Evolution/DelimiterSyntax.md +++ b/Documentation/Evolution/DelimiterSyntax.md @@ -88,14 +88,14 @@ This allows the captures to be referenced as `match.identifier` and `match.hex` ### Extended delimiters `#/.../#`, `##/.../##` -A regex literal may be surrounded by an arbitrary number of balanced pound characters. This is a somewhat similar to the raw string literal syntax introduced by [SE-0200], and allows a regex literal to use forward slashes without the need to escape them, e.g: +Backslashes may be used to write forward slashes within the regex literal, e.g `/foo\/bar/`. However, this can be quite syntactically noisy and confusing. To avoid this, a regex literal may be surrounded by an arbitrary number of balanced pound characters. This changes the delimiter of the literal, and therefore allows the use of forward slashes without escaping. For example: ```swift let regex = #/usr/lib/modules/([^/]+)/vmlinuz/# // regex: Regex<(Substring, Substring)> ``` -Additionally, it allows for a multi-line mode when the opening delimiter is followed by a new line. +The number of pounds may be further increased to allow the use of e.g `/#` within the literal. 
This is similar in style to the raw string literal syntax introduced by [SE-0200], however it has a couple of key differences. The escaping rules for backslashes do not change, and a multi-line mode is entered when the opening delimiter is followed by a newline. #### Escaping of backslashes From 9d0cf04be2e095abbe5c242c446198ef8ce4158e Mon Sep 17 00:00:00 2001 From: Michael Ilseman Date: Fri, 8 Apr 2022 19:42:57 -0600 Subject: [PATCH 31/36] Update delimiter proposal More details and word smithing. --- Documentation/Evolution/DelimiterSyntax.md | 49 ++++++++++++---------- 1 file changed, 28 insertions(+), 21 deletions(-) diff --git a/Documentation/Evolution/DelimiterSyntax.md b/Documentation/Evolution/DelimiterSyntax.md index 82f6d496b..90943ebb2 100644 --- a/Documentation/Evolution/DelimiterSyntax.md +++ b/Documentation/Evolution/DelimiterSyntax.md @@ -4,11 +4,7 @@ ## Introduction -This proposal helps complete the story told in *[Regex Type and Overview][regex-type]* and [elsewhere][pitch-status]. We propose the introduction of regex literals to Swift source code. The proposed syntax mirrors literals in other programing languages such as Perl, JavaScript and Ruby. As in those languages, literals are delimited with the `/` character: - -```swift -let re = /[0-9]+/ -``` +This proposal helps complete the story told in *[Regex Type and Overview][regex-type]* and [elsewhere][pitch-status]. We propose the introduction of regex literals to Swift source code, providing compile-time checks and typed-capture inference. ## Motivation @@ -37,23 +33,26 @@ let regex = /(?<identifier>[[:alpha:]]\w*) = (?<hex>[0-9A-F]+)/ // regex: Regex<(Substring, identifier: Substring, hex: Substring)> ``` -Forward slashes are a regex term of art, and are used as the delimiters for regex literals in Perl, JavaScript and Ruby (though Perl and Ruby also provide alternatives). Their ubiquity and familiarity makes them a compelling choice for Swift. +Forward slashes are a regex term of art.
They are used as the delimiters for regex literals in, e.g., Perl, JavaScript and Ruby. Perl and Ruby additionally allow for [user-selected delimiters](https://perldoc.perl.org/perlop#Quote-and-Quote-like-Operators) to avoid having to escape any slashes inside a regex. For that purpose, we propose the extended literal `#/.../#`. -A regex literal may also be spelled using an extended syntax `#/.../#`, which allows the placement of an arbitrary number of balanced `#` characters around the literal. This syntax may be used to avoid needing to escape forward slashes within the regex. Additionally, it allows for a multi-line mode when the opening delimiter is followed by a new line. +An extended literal, `#/.../#`, avoids the need to escape forward slashes within the regex. It allows an arbitrary number of balanced `#` characters around the literal and escape. When the opening delimiter is followed by a new line, it supports a multi-line literal where whitespace is non-semantic and line-ending comments are ignored. -Within a regex literal, the compiler will parse the regex syntax outlined in *[Regex Construction][internal-syntax]*, and diagnose any errors at compile time. The capture types and labels are automatically inferred based on the capture groups present in the regex. Using a literal allows editors to support features such as syntax coloring inside the literal, highlighting sub-structure of the regex, and conversion of the literal to an equivalent result builder DSL (see *[Regex builder DSL][regex-dsl]*). +The compiler will parse the contents of a regex literal using regex syntax outlined in *[Regex Construction][internal-syntax]*, diagnosing any errors at compile time. The capture types and labels are automatically inferred based on the capture groups present in the regex. 
Regex literals allow editors and source tools to support features such as syntax coloring inside the literal, highlighting sub-structure of the regex, and conversion of the literal to an equivalent result builder DSL (see *[Regex builder DSL][regex-dsl]*). A regex literal also allows for seamless composition with the Regex DSL, enabling lightweight intermixing of a regex syntax with other elements of the builder: ```swift -// A regex literal for parsing an amount of currency in dollars or pounds. +// A regex for extracting a currency (dollars or pounds) and amount from input +// with precisely the form /[$£]\d+\.\d{2}/ let regex = Regex { - /([$£])/ + Capture { /[$£]/ } TryCapture { - OneOrMore(.digit) + /\d+/ "." - Repeat(.digit, count: 2) - } transform: { Amount(twoDecimalPlaces: $0) } + /\d{2}/ + } transform: { + Amount(twoDecimalPlaces: $0) + } } ``` @@ -65,7 +64,7 @@ Due to the existing use of `/` in comment syntax and operators, there are some s ### Upgrade path -Due to the source breaking changes needed for the `/.../` syntax, it will be introduced in Swift 6 mode. However, projects will be able to adopt it earlier by using the compiler flag `-enable-regex-literals`. Note this does not affect the extended syntax `#/.../#`, which will be usable immediately. +Due to the source breaking changes needed for the `/.../` syntax, it will be introduced in Swift 6 mode. However, projects will be able to adopt it earlier by using the compiler flag `-enable-regex-literals`. Note this does not affect the extended literal `#/.../#`, which will be usable immediately. ### Named typed captures @@ -84,18 +83,27 @@ func matchHexAssignment(_ input: String) -> (String, Int)? { } ``` -This allows the captures to be referenced as `match.identifier` and `match.hex` instead of `match.1` and `match.2`, which would be the behavior for unnamed capture groups.
This label inference behavior is not available in the DSL, however users are able to [bind captures to named variables instead][dsl-captures]. +This allows the captures to be referenced as `match.identifier` and `match.hex`, in addition to numerically (like unnamed capture groups) as `match.1` and `match.2`. This label inference behavior is not available in the DSL, however users are able to [bind captures to named variables instead][dsl-captures]. ### Extended delimiters `#/.../#`, `##/.../##` -Backslashes may be used to write forward slashes within the regex literal, e.g `/foo\/bar/`. However, this can be quite syntactically noisy and confusing. To avoid this, a regex literal may be surrounded by an arbitrary number of balanced pound characters. This changes the delimiter of the literal, and therefore allows the use of forward slashes without escaping. For example: +Backslashes may be used to write forward slashes within the regex literal, e.g `/foo\/bar/`. However, this can be quite syntactically noisy and confusing. To avoid this, a regex literal may be surrounded by an arbitrary number of balanced octothorpes. This changes the delimiter of the literal, and therefore allows the use of forward slashes without escaping. For example: ```swift let regex = #/usr/lib/modules/([^/]+)/vmlinuz/# // regex: Regex<(Substring, Substring)> ``` -The number of pounds may be further increased to allow the use of e.g `/#` within the literal. This is similar in style to the raw string literal syntax introduced by [SE-0200], however it has a couple of key differences. The escaping rules for backslashes do not change, and a multi-line mode is entered when the opening delimiter is followed by a newline. +The number of pounds may be further increased to allow the use of e.g `/#` within the literal. This is similar in style to the raw string literal syntax introduced by [SE-0200], however it has a couple of key differences. The escaping rules for backslashes do not change. 
Additionally, a multi-line mode, where whitespace and line-ending comments are ignored, is entered when the opening delimiter is followed by a newline. + +```swift +let regex = #/ + /usr/lib/modules/ # Prefix + (?<subpath> [^/]+) + /vmlinuz # The kernel +/# +// regex: Regex<(Substring, subpath: Substring)> +``` #### Escaping of backslashes @@ -158,11 +166,11 @@ Perhaps the most obvious parsing ambiguity with `/.../` delimiters is with comme ### Ambiguity with infix operators -There would be a minor ambiguity with infix operators used with regex literals. When used without whitespace, e.g `x+/y/`, the expression will be treated as using an infix operator `+/`. Whitespace is therefore required for regex literal interpretation, e.g `x + /y/`. Alternatively, extended syntax may be used, e.g `x+#/y/#`. +There would be a minor ambiguity with infix operators used with regex literals. When used without whitespace, e.g `x+/y/`, the expression will be treated as using an infix operator `+/`. Whitespace is therefore required for regex literal interpretation, e.g `x + /y/`. Alternatively, extended literals may be used, e.g `x+#/y/#`. ### Regex syntax limitations -In order to help avoid further parsing ambiguities, a `/.../` regex literal will not be parsed if it starts with a space, tab, or `)` character. Though the latter is already invalid regex syntax. This restriction may be avoided by using extended `#/.../#` syntax. +In order to help avoid further parsing ambiguities, a `/.../` regex literal will not be parsed if it starts with a space, tab, or `)` character. Though the latter is already invalid regex syntax. This restriction may be avoided by using the extended `#/.../#` literal.
#### Rationale @@ -194,7 +202,7 @@ let regex = Regex { } ``` -or extended syntax must be used, e.g: +or an extended literal must be used, e.g: ```swift let regex = Regex { @@ -378,4 +386,3 @@ We therefore feel this would be a much less compelling feature without first cla [regex-dsl]: https://github.com/apple/swift-evolution/blob/main/proposals/0351-regex-builder.md [dsl-captures]: https://github.com/apple/swift-evolution/blob/main/proposals/0351-regex-builder.md#capture-and-reference - From 720c10c5630989d9a02d2b906656caebf8ed88f2 Mon Sep 17 00:00:00 2001 From: Hamish Knight Date: Mon, 11 Apr 2022 10:43:03 +0100 Subject: [PATCH 32/36] Update Documentation/Evolution/DelimiterSyntax.md Co-authored-by: Michael Ilseman --- Documentation/Evolution/DelimiterSyntax.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Documentation/Evolution/DelimiterSyntax.md b/Documentation/Evolution/DelimiterSyntax.md index 90943ebb2..a1514c9a9 100644 --- a/Documentation/Evolution/DelimiterSyntax.md +++ b/Documentation/Evolution/DelimiterSyntax.md @@ -4,7 +4,7 @@ ## Introduction -This proposal helps complete the story told in *[Regex Type and Overview][regex-type]* and [elsewhere][pitch-status]. We propose the introduction of regex literals to Swift source code, providing compile-time checks and typed-capture inference. +We propose the introduction of regex literals to Swift source code, providing compile-time checks and typed-capture inference. Regex literals help complete the story told in *[Regex Type and Overview][regex-type]*.
## Motivation From c045e21f2d07e552154af96fb0180e19bfafea08 Mon Sep 17 00:00:00 2001 From: Hamish Knight Date: Mon, 11 Apr 2022 10:54:50 +0100 Subject: [PATCH 33/36] Clarify backslash rule --- Documentation/Evolution/DelimiterSyntax.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Documentation/Evolution/DelimiterSyntax.md b/Documentation/Evolution/DelimiterSyntax.md index a1514c9a9..f5e8bba87 100644 --- a/Documentation/Evolution/DelimiterSyntax.md +++ b/Documentation/Evolution/DelimiterSyntax.md @@ -94,7 +94,7 @@ let regex = #/usr/lib/modules/([^/]+)/vmlinuz/# // regex: Regex<(Substring, Substring)> ``` -The number of pounds may be further increased to allow the use of e.g `/#` within the literal. This is similar in style to the raw string literal syntax introduced by [SE-0200], however it has a couple of key differences. The escaping rules for backslashes do not change. Additionally, a multi-line mode, where whitespace and line-ending comments are ignored, is entered when the opening delimiter is followed by a newline. +The number of pounds may be further increased to allow the use of e.g `/#` within the literal. This is similar in style to the raw string literal syntax introduced by [SE-0200], however it has a couple of key differences. Backslashes do not become literal characters. Additionally, a multi-line mode, where whitespace and line-ending comments are ignored, is entered when the opening delimiter is followed by a newline. ```swift let regex = #/ From e31262d681a7dc1bda407d67c35d6f2c0e9cecd5 Mon Sep 17 00:00:00 2001 From: Hamish Knight Date: Mon, 11 Apr 2022 18:17:39 +0100 Subject: [PATCH 34/36] Minor tweaks Standardize on "number signs" for mentions of `#` (though a couple of them read better as just the character). Also change the multi-line example to not include a `/` at the start, which matches the single-line version. 
--- Documentation/Evolution/DelimiterSyntax.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/Documentation/Evolution/DelimiterSyntax.md b/Documentation/Evolution/DelimiterSyntax.md index f5e8bba87..c775aa12e 100644 --- a/Documentation/Evolution/DelimiterSyntax.md +++ b/Documentation/Evolution/DelimiterSyntax.md @@ -87,18 +87,18 @@ This allows the captures to be referenced as `match.identifier` and `match.hex`, ### Extended delimiters `#/.../#`, `##/.../##` -Backslashes may be used to write forward slashes within the regex literal, e.g `/foo\/bar/`. However, this can be quite syntactically noisy and confusing. To avoid this, a regex literal may be surrounded by an arbitrary number of balanced octothorpes. This changes the delimiter of the literal, and therefore allows the use of forward slashes without escaping. For example: +Backslashes may be used to write forward slashes within the regex literal, e.g `/foo\/bar/`. However, this can be quite syntactically noisy and confusing. To avoid this, a regex literal may be surrounded by an arbitrary number of balanced number signs. This changes the delimiter of the literal, and therefore allows the use of forward slashes without escaping. For example: ```swift let regex = #/usr/lib/modules/([^/]+)/vmlinuz/# // regex: Regex<(Substring, Substring)> ``` -The number of pounds may be further increased to allow the use of e.g `/#` within the literal. This is similar in style to the raw string literal syntax introduced by [SE-0200], however it has a couple of key differences. Backslashes do not become literal characters. Additionally, a multi-line mode, where whitespace and line-ending comments are ignored, is entered when the opening delimiter is followed by a newline. +The number of `#` characters may be further increased to allow the use of e.g `/#` within the literal. This is similar in style to the raw string literal syntax introduced by [SE-0200], however it has a couple of key differences. 
Backslashes do not become literal characters. Additionally, a multi-line mode, where whitespace and line-ending comments are ignored, is entered when the opening delimiter is followed by a newline. ```swift let regex = #/ - /usr/lib/modules/ # Prefix + usr/lib/modules/ # Prefix (?<subpath> [^/]+) /vmlinuz # The kernel /# @@ -143,7 +143,7 @@ let regex = #/ In this mode, [extended regex syntax][extended-regex-syntax] `(?x)` is enabled by default. This means that whitespace becomes non-semantic, and end-of-line comments are supported with `# comment` syntax. -This mode is supported with any (non-zero) number of pound characters in the delimiter. Similar to multi-line strings introduced by [SE-0168], the closing delimiter must appear on a new line. To avoid parsing confusion, such a literal will not be parsed if a closing delimiter is not present. This avoids inadvertently treating the rest of the file as regex if you only type the opening. +This mode is supported with any (non-zero) number of `#` characters in the delimiter. Similar to multi-line strings introduced by [SE-0168], the closing delimiter must appear on a new line. To avoid parsing confusion, such a literal will not be parsed if a closing delimiter is not present. This avoids inadvertently treating the rest of the file as regex if you only type the opening.
### Ambiguities with comment syntax From bf7702fbf950a474c5ad7df104472c5afaeb6b5e Mon Sep 17 00:00:00 2001 From: Hamish Knight Date: Tue, 12 Apr 2022 21:15:10 +0100 Subject: [PATCH 35/36] Update pitch - Add Source Compatibility section - Condense comment syntax ambiguity section - Mention `/.../` being less popular in some communities --- Documentation/Evolution/DelimiterSyntax.md | 29 +++++++++++----------- 1 file changed, 15 insertions(+), 14 deletions(-) diff --git a/Documentation/Evolution/DelimiterSyntax.md b/Documentation/Evolution/DelimiterSyntax.md index c775aa12e..3a9d02a8b 100644 --- a/Documentation/Evolution/DelimiterSyntax.md +++ b/Documentation/Evolution/DelimiterSyntax.md @@ -62,10 +62,6 @@ Due to the existing use of `/` in comment syntax and operators, there are some s ## Detailed design -### Upgrade path - -Due to the source breaking changes needed for the `/.../` syntax, it will be introduced in Swift 6 mode. However, projects will be able to adopt it earlier by using the compiler flag `-enable-regex-literals`. Note this does not affect the extended literal `#/.../#`, which will be usable immediately. - ### Named typed captures Regex literals have their capture types statically determined by the capture groups present. This follows the same inference behavior as [the DSL][regex-dsl], and is explored in more detail in *[Strongly Typed Captures][strongly-typed-captures]*. One aspect of this that is currently unique to the literal is the ability to infer labeled tuple elements for named capture groups. For example: @@ -147,11 +143,9 @@ This mode is supported with any (non-zero) number of `#` characters in the delim ### Ambiguities with comment syntax -Perhaps the most obvious parsing ambiguity with `/.../` delimiters is with comment syntax. - -- An empty regex literal would conflict with line comment syntax `//`. But an empty regex isn't a particularly useful thing to express, and can be disallowed without significant impact. 
+Line comment syntax `//` and block comment syntax `/*` will continue to be parsed as comments. An empty regex literal is not a particularly useful thing to express, but can be written as `#//#` if desired. `*` would be an invalid starting character of a regex, and therefore does not pose an issue. -- There is a conflict with block comment syntax, when surrounding a regex literal ending with `*`, for example: +A parsing conflict does however arise when a block comment surrounds a regex literal ending with `*`, for example: ```swift /* @@ -159,14 +153,12 @@ Perhaps the most obvious parsing ambiguity with `/.../` delimiters is with comme */ ``` - In this case, the block comment would prematurely end on the second line, rather than extending all the way to the third line as the user would expect. This is already an issue today with `*/` in a string literal, though it is more likely to occur in a regex given the prevalence of the `*` quantifier. This issue can be avoided in many cases by using line comment syntax `//` instead, which it should be noted is the syntax that Xcode uses when commenting out multiple lines. - -- Block comment syntax also means that a regex literal would not be able to start with the `*` character, however this is less of a concern as it would not be valid regex syntax. +In this case, the block comment prematurely ends on the second line, rather than extending all the way to the third line as the user would expect. This is already an issue today with `*/` in a string literal, though it is more likely to occur in a regex given the prevalence of the `*` quantifier. This issue can be avoided in many cases by using line comment syntax `//` instead, which it should be noted is the syntax that Xcode uses when commenting out multiple lines. ### Ambiguity with infix operators -There would be a minor ambiguity with infix operators used with regex literals. 
When used without whitespace, e.g `x+/y/`, the expression will be treated as using an infix operator `+/`. Whitespace is therefore required for regex literal interpretation, e.g `x + /y/`. Alternatively, extended literals may be used, e.g `x+#/y/#`. +There is a minor ambiguity when infix operators are used with regex literals. When used without whitespace, e.g `x+/y/`, the expression will be treated as using an infix operator `+/`. Whitespace is therefore required for regex literal interpretation, e.g `x + /y/`. Alternatively, extended literals may be used, e.g `x+#/y/#`. ### Regex syntax limitations @@ -273,7 +265,7 @@ func baz(_ x: S) -> Int { } ``` -`foo(/, /)` is currently parsed as 2 unapplied operator arguments. `bar(/, 2) + bar(/, 3)` is currently parsed as two independent calls that each take an unapplied `/` operator reference. Both of these would become regex literals arguments, `/, /` and `/, 2) + bar(/` respectively (though the latter would produce a regex error). +`foo(/, /)` is currently parsed as 2 unapplied operator arguments. `bar(/, 2) + bar(/, 3)` is currently parsed as two independent calls that each take an unapplied `/` operator reference. Both of these will become regex literal arguments, `/, /` and `/, 2) + bar(/` respectively (though the latter will produce a regex error). To disambiguate these cases, users will need to surround at least the opening `/` with parentheses, e.g: @@ -290,6 +282,15 @@ This takes advantage of the fact that a regex literal will not be parsed if the +## Source Compatibility + +As explored above, two source breaking changes are needed for `/.../` syntax: + +- Deprecation of prefix operators containing the `/` character. +- Parsing `/,` and `/]` as the start of a regex literal if a closing `/` is found, rather than an unapplied operator in an argument list. For example, `fn(/, /)` becomes a regex literal rather than two unapplied operator arguments.
+ +As such, both these changes and the `/.../` syntax will be introduced in Swift 6 mode. However, projects will be able to adopt the syntax earlier by passing the compiler flag `-enable-bare-regex-syntax`. Note this does not affect the extended delimiter syntax `#/.../#`, which will be usable immediately. + ## Future Directions ### Modern literal syntax @@ -300,7 +301,7 @@ However, such a syntax would lose out on the familiarity benefits of standard re ## Alternatives Considered -Given the fact that `/` is an existing term of art for regular expressions, we feel it should be the preferred delimiter syntax. While it has some syntactic ambiguities, we do not feel that they are sufficient to disqualify the syntax. To evaluate this trade-off, below is a list of alternative delimiters that would not have the same ambiguities. +Given the fact that `/.../` is an existing term of art for regular expressions, we feel it should be the preferred delimiter syntax. It should be noted that the syntax has become less popular in some communities such as Perl, however we still feel that it is a compelling choice, especially with extended delimiters `#/.../#`. Additionally, while it has some syntactic ambiguities, we do not feel that they are sufficient to disqualify the syntax. To evaluate this trade-off, below is a list of alternative delimiters that would not have the same ambiguities. ### Prefixed quote `re'...'` From 9c8a1160a044b4d3defff8e3bacff3cd47e5d57b Mon Sep 17 00:00:00 2001 From: Hamish Knight Date: Tue, 12 Apr 2022 21:15:11 +0100 Subject: [PATCH 36/36] Rename DelimiterSyntax.md -> RegexLiterals.md And remove the old version of the pitch.
--- Documentation/Evolution/RegexLiteralPitch.md | 292 ------------------ .../{DelimiterSyntax.md => RegexLiterals.md} | 0 2 files changed, 292 deletions(-) delete mode 100644 Documentation/Evolution/RegexLiteralPitch.md rename Documentation/Evolution/{DelimiterSyntax.md => RegexLiterals.md} (100%) diff --git a/Documentation/Evolution/RegexLiteralPitch.md b/Documentation/Evolution/RegexLiteralPitch.md deleted file mode 100644 index bf2a5dad3..000000000 --- a/Documentation/Evolution/RegexLiteralPitch.md +++ /dev/null @@ -1,292 +0,0 @@ -# Regular Expression Literals - -- Authors: Hamish Knight, Michael Ilseman - -## Introduction - -We propose to introduce a first-class regular expression literal into the language that can take advantage of library support to offer extensible, powerful, and familiar textual pattern matching. - -This is a component of a larger string processing picture. We would like to start a focused discussion surrounding our approach to the literal itself, while acknowledging that evaluating the utility of the literal will ultimately depend on the whole picture (e.g. supporting API). To aid this focused discussion, details such as the representation of captures in the type system, semantic details, extensions to lexing/parsing, additional API, etc., are out of scope of this pitch and thread. Feel free to continue discussion of anything related in the [overview thread][overview]. - -## Motivation - -Regular expressions are a ubiquitous, familiar, and concise syntax for matching and extracting text that satisfies a particular pattern. Syntactically, a regex literal in Swift should: - -- Support a syntax familiar to developers who have learned to use regular expressions in other tools and languages -- Allow reuse of many regular expressions not specifically designed for Swift (e.g. 
from Stack Overflow or popular programming books) -- Allow libraries to define custom types that can be constructed with regex literals, much like string literals -- Diagnose at compile time if a regex literal uses capabilities that aren't allowed by the type's regex dialect - -Further motivation, examples, and discussion can be found in the [overview thread][overview]. - -## Proposed Solution - -We propose the introduction of a regular expression literal that supports [the PCRE syntax][PCRE], in addition to new standard library protocols `ExpressibleByRegexLiteral` and `RegexLiteralProtocol` that allow for the customization of how the regex literal is interpreted (similar to [string interpolation][stringinterpolation]). The compiler will parse the PCRE syntax within a regex literal, and synthesize calls to corresponding builder methods. Types conforming to `ExpressibleByRegexLiteral` will be able to provide a builder type that opts into supporting various regex constructs through the use of normal function declarations and `@available`. - -_Note: This pitch concerns language syntax and compiler changes alone, it isn't stating what features the stdlib should support in the initial version or in future versions._ - -## Detailed Design - -A regular expression literal will be introduced using `/` delimiters, within which the compiler will parse [PCRE regex syntax][PCRE]: - -```swift -// Matches " = ", extracting the identifier and hex number -let regex = /([[:alpha:]]\w*) = ([0-9A-F]+)/ -``` - -The above regex literal will be inferred to be the default regex literal type `Regex`. Errors in the regex will be diagnosed by the compiler. - -_`Regex` here is a stand-in type, further details about the type such as if or how this will scale to strongly typed captures is still under investigation._ - -_How best to diagnose grapheme-semantic concerns is still under investigation and probably best discussed in their corresponding threads. 
For example, `Range` is not [countable][countable] and [ordering is not linguistically meaningful][ordering], so validating character class ranges may involve restricting to a semantically-meaningful range (e.g. ASCII). This is best discussed in the (upcoming) character class pitch/thread._ - -The compiler will then transform the literal into a set of builder calls that may be customized by adopting the `ExpressibleByRegexLiteral` protocol. Below is a straw-person transformation of this example: - -```swift -// let regex = /([[:alpha:]]\w*) = ([0-9A-F]+)/ -let regex = { - var builder = T.RegexLiteral() - - // __A4 = /([[:alpha:]]\w*)/ - let __A1 = builder.buildCharacterClass_POSIX_alpha() - let __A2 = builder.buildCharacterClass_w() - let __A3 = builder.buildConcatenate(__A1, __A2) - let __A4 = builder.buildCaptureGroup(__A3) - - // __B1 = / = / - let __B1 = builder.buildLiteral(" = ") - - // __C3 = /([0-9A-F]+)/ - let __C1 = builder.buildCustomCharacterClass(["0"..."9", "A"..."F"]) - let __C2 = builder.buildOneOrMore(__C1) - let __C3 = builder.buildCaptureGroup(__C2) - - let __D1 = builder.buildConcatenate(__A4, __B1, __C3) - return T(regexLiteral: builder.finalize(__D1)) -}() -``` - -In this formulation, the compiler fully parses the regex literal, calling mutating methods on a builder which constructs an AST. Here, the compiler recognizes syntax such as ranges and classifies metacharacters (`buildCharacterClass_w()`). Alternate formulations could involve less reasoning (`buildMetacharacter_w`), or more (`builderCharacterClass_word`). We'd like community feedback on this approach. - -Additionally, it may make sense for the stdlib to provide a `RegexLiteral` conformer that just constructs a string to pass off to a string-based library. Such a type might assume all features are supported unless communicated otherwise, and we'd like community feedback on mechanisms to communicate this (e.g. availability). 
- -### The `ExpressibleByRegexLiteral` and `RegexLiteralProtocol` protocols - -New `ExpressibleByRegexLiteral` and `RegexLiteralProtocol` protocols will be introduced to the standard library, and will serve a similar purpose to the existing literal protocols `ExpressibleByStringInterpolation` and `StringInterpolationProtocol`. - -```swift -public protocol ExpressibleByRegexLiteral { - associatedtype RegexLiteral : RegexLiteralProtocol = DefaultRegexLiteral - init(regexLiteral: RegexLiteral) -} - -public protocol RegexLiteralProtocol { - init() - - // Informal builder requirements for building a regex literal - // will be specified here. -} -``` - -Types conforming to `ExpressibleByRegexLiteral` will be able to provide a custom type that conforms to `RegexLiteralProtocol`, which will be used to build the resulting regex value. A default conforming type will be provided by the standard library (`DefaultRegexLiteral` here). - -Libraries can extend regex handling logic for their domains. For example, a higher-level library could provide linguistically richer regular expressions by incorporating locale, collation, language dictionaries, and fuzzier matching. Similarly, libraries wrapping different regex engines (e.g. `NSRegularExpression`) can support custom regex literals. - -### Opting into certain regex features - -We intend for the compiler to completely parse [the PCRE syntax][PCRE]. However, types conforming to `RegexLiteralProtocol` might not be able to handle the full feature set. The compiler will look for corresponding function declarations inside `RegexLiteralProtocol` and will emit a compilation error if missing. Conforming types can use `@available` on these function declarations to communicate versioning and add more support in the future. - -This approach of lookup combined with availability allows the stdlib to support more features over time. 
- -### Impact of using `/` as the delimiter - -#### On comment syntax - -Single line comments use the syntax `//`, which would conflict with the spelling for an empty regex literal. As such, an empty regex literal would be forbidden. - -While not conflicting with the syntax proposed in this pitch, it's also worth noting that the `//` comment syntax (in particular documentation comments that use `///`) would likely preclude the ability to use `///` as a delimiter if we ever wanted to support multi-line regex literals. It's possible though that future multi-line support could be provided through raw regex literals. Alternatively, it could be inferred from the regex options provided. For example, a regex that uses the multi-line option `/(?m)/` could be allowed to span multiple lines. - -Multi-line comments use the `/*` delimiter. As such, a regex literal starting with `*` wouldn't be parsed. This however isn't a major issue as an unqualified `*` is already invalid regex syntax. An escaped `/\*/` regex literal wouldn't be impacted. - -#### On custom infix operators using the `/` character - -Choosing `/` as the delimiter means there will be a conflict for infix operators containing `/` in cases where whitespace isn't used, for example: - -```swift -x+/y/+z -``` - -Should the operators be parsed as `+/` and `/+` respectively, or should this be parsed as `x + /y/ + z`? - -In this case, things can be disambiguated by the user inserting additional whitespace. We therefore could continue to parse `x+/y/+z` as a binary operator chain, and require additional whitespace to interpret `/y/` as a regex literal. - -#### On custom prefix and postfix operators using the `/` character - -There will also be parsing ambiguity with any user-defined prefix and postfix operators containing the `/` character. 
For example, code such as the following poses an issue: - -```swift -let x = /0; let y = 1/ -``` - -Should this be considered to be two `let` bindings, with each initialization expression using prefix and postfix `/` operators, or is it a single regex literal? - -This also extends more generally to prefix and postfix operators containing the `/` character, e.g: - -```swift -let x = Int { 0 } -} - -let x = 0 -/ 1 / .foo() -``` - -Today, this is parsed as a single binary operator chain `0 / 1 / .foo()`, with `.foo()` becoming an argument to the `/` operator. This is because while Swift does have some parser behavior that is affected by newlines, generally newlines are treated as whitespace, and expressions therefore may span multiple lines. However the user may well be expecting the second line to be parsed as a regex literal. - -This is also potentially an issue for result builders, for example: - -```swift -SomeBuilder { - x - / y / - z -} -``` - -Today this is parsed as `SomeBuilder { x / y / z }`, however it's likely the user was expecting this to become a result builder with 3 elements, the second of which being a regex literal. - -There is currently no source compatibility impact as both cases will continue to parse as binary operations. The user may insert a `;` on the prior line to get the desired regex literal parsing. However this may not be sufficient we may need to change parsing rules (under a version check) to favor parsing regex literals in these cases. We'd like to discuss this further with the community. - -It's worth noting that this is similar to an ambiguity that already exists today with trailing closures, for example: - -```swift -SomeBuilder { - SomeType() - { print("hello") } - AnotherType() -} -``` - -`{ print("hello") }` will be parsed as a trailing closure to `SomeType()` rather than as a separate element to the result builder. 
- -It can also currently arise with leading dot syntax in a result builder, e.g: - -```swift -SomeBuilder { - SomeType() - .member -} -``` - -`.member` will be parsed as a member access on `SomeType()` rather than as a separate element that may have its base type inferred by the parameter of a `buildExpression` method on the result builder. - - -## Future Directions - -### Typed captures - -Typed captures would statically represent how many captures and of what kind are present in a regex literals. They could produce a `Substring` for a regular capture, `Substring?` for a zero-or-one capture, and `Array` (or a lazy collection) for a zero(or one)-or-more capture. These are worth exploring, especially in the context of the [start of variadic generics][variadics] support, but we'd like to keep this pitch and discussion focused to the details presented. - -### Other regex literals - -Multi-line extensions to regex literals is considered future work. Generally, we'd like to encourage refactoring into `Pattern` when the regex gets to that degree of complexity. - -User-specified [choice of quote delimiters][perlquotes] is considered future work. A related approach to this could be a "raw" regex literal analogous to [raw strings][rawstrings]. For example (total strawperson), an approach where `n` `#`s before the opening delimiter would requires `n` `#` at the end of the trailing delimiter as well as requiring `n-1` `#`s to access metacharacters. - -```txt -// All of the below are trying to match a path like "/tmp/foo/bar/File.app/file.txt" - -/\/tmp\/.*\/File\.app\/file\.txt/ -#//tmp/.*/File\.app/file\.txt/# -##//tmp/#.#*/File.app/file.txt/## -``` - -"Swiftier" literals, such as with non-semantic whitespace (e.g. [Raku's][rakuregex]), is future work. We'd want to strongly consider using a different backing technology for Swifty matching literals, such as PEGs. 
- -Fully-custom literal support, that is literals whose bodies are not parsed and there is no default type available, is orthogonal to this work. It would require support for compilation-time Swift libraries in addition to Swift APIs for the compiler and type system. - - -### Further extension to Swift language constructs - -Other language constructs, such as raw-valued enums, might benefit from further regex enhancements. - -```swift -enum CalculatorToken: Regex { - case wholeNumber = /\d+/ - case identifier = /\w+/ - case symbol = /\p{Math}/ - ... -} -``` - -As mentioned in the overview, general purpose extensions to Swift (syntactic) pattern matching could benefit regex - -```swift -func parseField(_ field: String) -> ParsedField { - switch field { - case let text <- /#\s?(.*)/: - return .comment(text) - case let (l, u) <- /([0-9A-F]+)(?:\.\.([0-9A-F]+))?/: - return .scalars(Unicode.Scalar(hex: l) ... Unicode.Scalar(hex: u ?? l)) - case let prop <- GraphemeBreakProperty.init: - return .property(prop) - } -} -``` - -### Other semantic details - -Further details about the semantics of regex literals, such as what definition we give to character classes, the initial supported feature set, and how to switch between grapheme-semantic and scalar-semantic usage, is still under investigation and outside the scope of this discussion. 
- -## Alternatives considered - -### Using a different delimiter to `/` - -As explored above, using `/` as the delimiter has the potential to conflict with existing operators using that character, and may necessitate: - -- Changing of parsing rules around chained `/` over multiple lines -- Deprecating prefix and postfix operators containing the `/` character -- Requiring additional whitespace to disambiguate from infix operators containing `/` -- Requiring a new language version mode to parse the literal with `/` delimiters - -However one of the main goals of this pitch is to introduce a familiar syntax for regular expression literals, which has been the motivation behind choices such as using the PCRE regex syntax. Given the fact that `/` is an existing term of art for regular expressions, we feel that if the aforementioned parsing issues can be solved in a satisfactory manner, we should prefer it as the delimiter. - - -### Reusing string literal syntax - -Instead of supporting a first-class literal kind for regular expressions, we could instead allow users to write a regular expression in a string literal, and parse, diagnose, and generate the appropriate code when it's coerced to an `ExpressibleByRegexLiteral` conforming type. 
- -```swift -let regex: Regex = "([[:alpha:]]\w*) = ([0-9A-F]+)" -``` - -However we decided against this because: - -- We would not be able to easily apply custom syntax highlighting for the regex syntax -- It would require an `ExpressibleByRegexLiteral` contextual type to be treated as a regex, otherwise it would be defaulted to `String`, which may be undesired -- In an overloaded context it may be ambiguous whether a string literal is meant to be interpreted as a literal string or regex -- Regex escape sequences aren't currently compatible with string literal escape sequence rules, e.g `\w` is currently illegal in a string literal -- It wouldn't be compatible with other string literal features such as interpolations - -[PCRE]: http://pcre.org/current/doc/html/pcre2syntax.html -[overview]: https://forums.swift.org/t/declarative-string-processing-overview/52459 -[variadics]: https://forums.swift.org/t/pitching-the-start-of-variadic-generics/51467 -[stringinterpolation]: https://github.com/apple/swift-evolution/blob/master/proposals/0228-fix-expressiblebystringinterpolation.md -[countable]: https://en.wikipedia.org/wiki/Countable_set -[ordering]: https://forums.swift.org/t/min-function-doesnt-work-on-values-greater-than-9-999-any-idea-why/52004/16 -[perlquotes]: https://perldoc.perl.org/perlop#Quote-and-Quote-like-Operators -[rawstrings]: https://github.com/apple/swift-evolution/blob/main/proposals/0200-raw-string-escaping.md -[rakuregex]: https://docs.raku.org/language/regexes diff --git a/Documentation/Evolution/DelimiterSyntax.md b/Documentation/Evolution/RegexLiterals.md similarity index 100% rename from Documentation/Evolution/DelimiterSyntax.md rename to Documentation/Evolution/RegexLiterals.md