|
| 1 | +# Appendix: Macro Follow-Set Ambiguity Formal Specification |
| 2 | + |
| 3 | +This page documents the formal specification of the follow rules for [Macros |
| 4 | +By Example]. They were originally specified in [RFC 550], from which the bulk |
| 5 | +of this text is copied, and expanded upon in subsequent RFCs. |
| 6 | + |
| 7 | +## Definitions & Conventions |
| 8 | + |
| 9 | + - `macro`: anything invokable as `foo!(...)` in source code. |
| 10 | + - `MBE`: macro-by-example, a macro defined by `macro_rules`. |
| 11 | + - `matcher`: the left-hand-side of a rule in a `macro_rules` invocation, or a |
| 12 | + subportion thereof. |
| 13 | + - `macro parser`: the bit of code in the Rust parser that will parse the |
| 14 | + input using a grammar derived from all of the matchers. |
| 15 | + - `fragment`: The class of Rust syntax that a given matcher will accept (or |
| 16 | + "match"). |
| 17 | + - `repetition` : a fragment that follows a regular repeating pattern |
| 18 | + - `NT`: non-terminal, the various "meta-variables" or repetition matchers |
| 19 | + that can appear in a matcher, specified in MBE syntax with a leading `$` |
| 20 | + character. |
| 21 | + - `simple NT`: a "meta-variable" non-terminal (further discussion below). |
| 22 | + - `complex NT`: a repetition matching non-terminal, specified via repetition |
| 23 | + operators (`\*`, `+`, `?`). |
| 24 | + - `token`: an atomic element of a matcher; i.e. identifiers, operators, |
| 25 | + open/close delimiters, *and* simple NT's. |
| 26 | + - `token tree`: a tree structure formed from tokens (the leaves), complex |
| 27 | + NT's, and finite sequences of token trees. |
| 28 | + - `delimiter token`: a token that is meant to divide the end of one fragment |
| 29 | + and the start of the next fragment. |
| 30 | + - `separator token`: an optional delimiter token in an complex NT that |
| 31 | + separates each pair of elements in the matched repetition. |
| 32 | + - `separated complex NT`: a complex NT that has its own separator token. |
| 33 | + - `delimited sequence`: a sequence of token trees with appropriate open- and |
| 34 | + close-delimiters at the start and end of the sequence. |
| 35 | + - `empty fragment`: The class of invisible Rust syntax that separates tokens, |
| 36 | + i.e. whitespace, or (in some lexical contexts), the empty token sequence. |
| 37 | + - `fragment specifier`: The identifier in a simple NT that specifies which |
| 38 | + fragment the NT accepts. |
| 39 | + - `language`: a context-free language. |
| 40 | + |
| 41 | +Example: |
| 42 | + |
| 43 | +```rust,compile_fail |
| 44 | +macro_rules! i_am_an_mbe { |
| 45 | + (start $foo:expr $($i:ident),* end) => ($foo) |
| 46 | +} |
| 47 | +``` |
| 48 | + |
| 49 | +`(start $foo:expr $($i:ident),\* end)` is a matcher. The whole matcher is a |
| 50 | +delimited sequence (with open- and close-delimiters `(` and `)`), and `$foo` |
| 51 | +and `$i` are simple NT's with `expr` and `ident` as their respective fragment |
| 52 | +specifiers. |
| 53 | + |
| 54 | +`$(i:ident),\*` is *also* an NT; it is a complex NT that matches a |
| 55 | +comma-separated repetition of identifiers. The `,` is the separator token for |
| 56 | +the complex NT; it occurs in between each pair of elements (if any) of the |
| 57 | +matched fragment. |
| 58 | + |
| 59 | +Another example of a complex NT is `$(hi $e:expr ;)+`, which matches any |
| 60 | +fragment of the form `hi <expr>; hi <expr>; ...` where `hi <expr>;` occurs at |
| 61 | +least once. Note that this complex NT does not have a dedicated separator |
| 62 | +token. |
| 63 | + |
| 64 | +(Note that Rust's parser ensures that delimited sequences always occur with |
| 65 | +proper nesting of token tree structure and correct matching of open- and |
| 66 | +close-delimiters.) |
| 67 | + |
| 68 | +We will tend to use the variable "M" to stand for a matcher, variables "t" and |
| 69 | +"u" for arbitrary individual tokens, and the variables "tt" and "uu" for |
| 70 | +arbitrary token trees. (The use of "tt" does present potential ambiguity with |
| 71 | +its additional role as a fragment specifier; but it will be clear from context |
| 72 | +which interpretation is meant.) |
| 73 | + |
| 74 | +"SEP" will range over separator tokens, "OP" over the repetition operators |
| 75 | +`\*`, `+`, and `?`, "OPEN"/"CLOSE" over matching token pairs surrounding a |
| 76 | +delimited sequence (e.g. `[` and `]`). |
| 77 | + |
| 78 | +Greek letters "α" "β" "γ" "δ" stand for potentially empty token-tree sequences. |
| 79 | +(However, the Greek letter "ε" (epsilon) has a special role in the presentation |
| 80 | +and does not stand for a token-tree sequence.) |
| 81 | + |
| 82 | + * This Greek letter convention is usually just employed when the presence of |
| 83 | + a sequence is a technical detail; in particular, when we wish to *emphasize* |
| 84 | + that we are operating on a sequence of token-trees, we will use the notation |
| 85 | + "tt ..." for the sequence, not a Greek letter. |
| 86 | + |
| 87 | +Note that a matcher is merely a token tree. A "simple NT", as mentioned above, |
| 88 | +is an meta-variable NT; thus it is a non-repetition. For example, `$foo:ty` is |
| 89 | +a simple NT but `$($foo:ty)+` is a complex NT. |
| 90 | + |
| 91 | +Note also that in the context of this formalism, the term "token" generally |
| 92 | +*includes* simple NTs. |
| 93 | + |
| 94 | +Finally, it is useful for the reader to keep in mind that according to the |
| 95 | +definitions of this formalism, no simple NT matches the empty fragment, and |
| 96 | +likewise no token matches the empty fragment of Rust syntax. (Thus, the *only* |
| 97 | +NT that can match the empty fragment is a complex NT.) This is not actually |
| 98 | +true, because the `vis` matcher can match an empty fragment. Thus, for the |
| 99 | +purposes of the formalism, we will treat `$v:vis` as actually being |
| 100 | +`$($v:vis)?`, with a requirement that the matcher match an empty fragment. |
| 101 | + |
| 102 | +### The Matcher Invariants |
| 103 | + |
| 104 | +To be valid, a matcher must meet the following three invariants. The definitions |
| 105 | +of FIRST and FOLLOW are described later. |
| 106 | + |
| 107 | +1. For any two successive token tree sequences in a matcher `M` (i.e. `M = ... |
| 108 | + tt uu ...`) with `uu ...` nonempty, we must have FOLLOW(`... tt`) ∪ {ε} ⊇ |
| 109 | + FIRST(`uu ...`). |
| 110 | +1. For any separated complex NT in a matcher, `M = ... $(tt ...) SEP OP ...`, |
| 111 | + we must have `SEP` ∈ FOLLOW(`tt ...`). |
| 112 | +1. For an unseparated complex NT in a matcher, `M = ... $(tt ...) OP ...`, if |
| 113 | + OP = `\*` or `+`, we must have FOLLOW(`tt ...`) ⊇ FIRST(`tt ...`). |
| 114 | + |
| 115 | +The first invariant says that whatever actual token that comes after a matcher, |
| 116 | +if any, must be somewhere in the predetermined follow set. This ensures that a |
| 117 | +legal macro definition will continue to assign the same determination as to |
| 118 | +where `... tt` ends and `uu ...` begins, even as new syntactic forms are added |
| 119 | +to the language. |
| 120 | + |
| 121 | +The second invariant says that a separated complex NT must use a seperator token |
| 122 | +that is part of the predetermined follow set for the internal contents of the |
| 123 | +NT. This ensures that a legal macro definition will continue to parse an input |
| 124 | +fragment into the same delimited sequence of `tt ...`'s, even as new syntactic |
| 125 | +forms are added to the language. |
| 126 | + |
| 127 | +The third invariant says that when we have a complex NT that can match two or |
| 128 | +more copies of the same thing with no separation in between, it must be |
| 129 | +permissible for them to be placed next to each other as per the first invariant. |
| 130 | +This invariant also requires they be nonempty, which eliminates a possible |
| 131 | +ambiguity. |
| 132 | + |
| 133 | +**NOTE: The third invariant is currently unenforced due to historical oversight |
| 134 | +and significant reliance on the behaviour. It is currently undecided what to do |
| 135 | +about this going forward. Macros that do not respect the behaviour may become |
| 136 | +invalid in a future edition of Rust. See the [tracking issue].** |
| 137 | + |
| 138 | +### FIRST and FOLLOW, informally |
| 139 | + |
| 140 | +A given matcher M maps to three sets: FIRST(M), LAST(M) and FOLLOW(M). |
| 141 | + |
| 142 | +Each of the three sets is made up of tokens. FIRST(M) and LAST(M) may also |
| 143 | +contain a distinguished non-token element ε ("epsilon"), which indicates that M |
| 144 | +can match the empty fragment. (But FOLLOW(M) is always just a set of tokens.) |
| 145 | + |
| 146 | +Informally: |
| 147 | + |
| 148 | + * FIRST(M): collects the tokens potentially used first when matching a |
| 149 | + fragment to M. |
| 150 | + |
| 151 | + * LAST(M): collects the tokens potentially used last when matching a fragment |
| 152 | + to M. |
| 153 | + |
| 154 | + * FOLLOW(M): the set of tokens allowed to follow immediately after some |
| 155 | + fragment matched by M. |
| 156 | + |
| 157 | + In other words: t ∈ FOLLOW(M) if and only if there exists (potentially |
| 158 | + empty) token sequences α, β, γ, δ where: |
| 159 | + |
| 160 | + * M matches β, |
| 161 | + |
| 162 | + * t matches γ, and |
| 163 | + |
| 164 | + * The concatenation α β γ δ is a parseable Rust program. |
| 165 | + |
| 166 | +We use the shorthand ANYTOKEN to denote the set of all tokens (including simple |
| 167 | +NTs). For example, if any token is legal after a matcher M, then FOLLOW(M) = |
| 168 | +ANYTOKEN. |
| 169 | + |
| 170 | +(To review one's understanding of the above informal descriptions, the reader |
| 171 | +at this point may want to jump ahead to the [examples of |
| 172 | +FIRST/LAST][#examples-of-first-and-last] before reading their formal |
| 173 | +definitions.) |
| 174 | + |
| 175 | +### FIRST, LAST |
| 176 | + |
| 177 | +Below are formal inductive definitions for FIRST and LAST. |
| 178 | + |
| 179 | +"A ∪ B" denotes set union, "A ∩ B" denotes set intersection, and "A \ B" |
| 180 | +denotes set difference (i.e. all elements of A that are not present in B). |
| 181 | + |
| 182 | +#### FIRST |
| 183 | + |
| 184 | +FIRST(M) is defined by case analysis on the sequence M and the structure of its |
| 185 | +first token-tree (if any): |
| 186 | + |
| 187 | + * if M is the empty sequence, then FIRST(M) = { ε }, |
| 188 | + |
| 189 | + * if M starts with a token t, then FIRST(M) = { t }, |
| 190 | + |
| 191 | + (Note: this covers the case where M starts with a delimited token-tree |
| 192 | + sequence, `M = OPEN tt ... CLOSE ...`, in which case `t = OPEN` and thus |
| 193 | + FIRST(M) = { `OPEN` }.) |
| 194 | + |
| 195 | + (Note: this critically relies on the property that no simple NT matches the |
| 196 | + empty fragment.) |
| 197 | + |
| 198 | + * Otherwise, M is a token-tree sequence starting with a complex NT: `M = $( tt |
| 199 | + ... ) OP α`, or `M = $( tt ... ) SEP OP α`, (where `α` is the (potentially |
| 200 | + empty) sequence of token trees for the rest of the matcher). |
| 201 | + |
| 202 | + * Let SEP\_SET(M) = { SEP } if SEP is present and ε ∈ FIRST(`tt ...`); |
| 203 | + otherwise SEP\_SET(M) = {}. |
| 204 | + |
| 205 | + * Let ALPHA\_SET(M) = FIRST(`α`) if OP = `\*` or `?` and ALPHA\_SET(M) = {} if |
| 206 | + OP = `+`. |
| 207 | + * FIRST(M) = (FIRST(`tt ...`) \\ {ε}) ∪ SEP\_SET(M) ∪ ALPHA\_SET(M). |
| 208 | + |
| 209 | +The definition for complex NTs deserves some justification. SEP\_SET(M) defines |
| 210 | +the possibility that the separator could be a valid first token for M, which |
| 211 | +happens when there is a separator defined and the repeated fragment could be |
| 212 | +empty. ALPHA\_SET(M) defines the possibility that the complex NT could be empty, |
| 213 | +meaning that M's valid first tokens are those of the following token-tree |
| 214 | +sequences `α`. This occurs when either `\*` or `?` is used, in which case there |
| 215 | +could be zero repetitions. In theory, this could also occur if `+` was used with |
| 216 | +a potentially-empty repeating fragment, but this is forbidden by the third |
| 217 | +invariant. |
| 218 | + |
| 219 | +From there, clearly FIRST(M) can include any token from SEP\_SET(M) or |
| 220 | +ALPHA\_SET(M), and if the complex NT match is nonempty, then any token starting |
| 221 | +FIRST(`tt ...`) could work too. The last piece to consider is ε. SEP\_SET(M) and |
| 222 | +FIRST(`tt ...`) \ {ε} cannot contain ε, but ALPHA\_SET(M) could. Hence, this |
| 223 | +definition allows M to accept ε if and only if ε ∈ ALPHA\_SET(M) does. This is |
| 224 | +correct because for M to accept ε in the complex NT case, both the complex NT |
| 225 | +and α must accept it. If OP = `+`, meaning that the complex NT cannot be empty, |
| 226 | +then by definition ε ∉ ALPHA\_SET(M). Otherwise, the complex NT can accept zero |
| 227 | +repetitions, and then ALPHA\_SET(M) = FOLLOW(`α`). So this definition is correct |
| 228 | +with respect to \varepsilon as well. |
| 229 | + |
| 230 | +#### LAST |
| 231 | + |
| 232 | +LAST(M), defined by case analysis on M itself (a sequence of token-trees): |
| 233 | + |
| 234 | + * if M is the empty sequence, then LAST(M) = { ε } |
| 235 | + |
| 236 | + * if M is a singleton token t, then LAST(M) = { t } |
| 237 | + |
| 238 | + * if M is the singleton complex NT repeating zero or more times, `M = $( tt |
| 239 | + ... ) *`, or `M = $( tt ... ) SEP *` |
| 240 | + |
| 241 | + * Let sep_set = { SEP } if SEP present; otherwise sep_set = {}. |
| 242 | + |
| 243 | + * if ε ∈ LAST(`tt ...`) then LAST(M) = LAST(`tt ...`) ∪ sep_set |
| 244 | + |
| 245 | + * otherwise, the sequence `tt ...` must be non-empty; LAST(M) = LAST(`tt |
| 246 | + ...`) ∪ {ε}. |
| 247 | + |
| 248 | + * if M is the singleton complex NT repeating one or more times, `M = $( tt ... |
| 249 | + ) +`, or `M = $( tt ... ) SEP +` |
| 250 | + |
| 251 | + * Let sep_set = { SEP } if SEP present; otherwise sep_set = {}. |
| 252 | + |
| 253 | + * if ε ∈ LAST(`tt ...`) then LAST(M) = LAST(`tt ...`) ∪ sep_set |
| 254 | + |
| 255 | + * otherwise, the sequence `tt ...` must be non-empty; LAST(M) = LAST(`tt |
| 256 | + ...`) |
| 257 | + |
| 258 | + * if M is the singleton complex NT repeating zero or one time, `M = $( tt ...) |
| 259 | + ?`, then LAST(M) = LAST(`tt ...`) ∪ {ε}. |
| 260 | + |
| 261 | + * if M is a delimited token-tree sequence `OPEN tt ... CLOSE`, then LAST(M) = |
| 262 | + { `CLOSE` }. |
| 263 | + |
| 264 | + * if M is a non-empty sequence of token-trees `tt uu ...`, |
| 265 | + |
| 266 | + * If ε ∈ LAST(`uu ...`), then LAST(M) = LAST(`tt`) ∪ (LAST(`uu ...`) \ { ε }). |
| 267 | + |
| 268 | + * Otherwise, the sequence `uu ...` must be non-empty; then LAST(M) = |
| 269 | + LAST(`uu ...`). |
| 270 | + |
| 271 | +### Examples of FIRST and LAST |
| 272 | +[examples-of-first-and-last]: #examples-of-first-and-last |
| 273 | + |
| 274 | +Below are some examples of FIRST and LAST. |
| 275 | +(Note in particular how the special ε element is introduced and |
| 276 | +eliminated based on the interation between the pieces of the input.) |
| 277 | + |
| 278 | +Our first example is presented in a tree structure to elaborate on how |
| 279 | +the analysis of the matcher composes. (Some of the simpler subtrees |
| 280 | +have been elided.) |
| 281 | + |
| 282 | +```text |
| 283 | +INPUT: $( $d:ident $e:expr );* $( $( h )* );* $( f ; )+ g |
| 284 | + ~~~~~~~~ ~~~~~~~ ~ |
| 285 | + | | | |
| 286 | +FIRST: { $d:ident } { $e:expr } { h } |
| 287 | +
|
| 288 | +
|
| 289 | +INPUT: $( $d:ident $e:expr );* $( $( h )* );* $( f ; )+ |
| 290 | + ~~~~~~~~~~~~~~~~~~ ~~~~~~~ ~~~ |
| 291 | + | | | |
| 292 | +FIRST: { $d:ident } { h, ε } { f } |
| 293 | +
|
| 294 | +INPUT: $( $d:ident $e:expr );* $( $( h )* );* $( f ; )+ g |
| 295 | + ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~ ~~~~~~~~~ ~ |
| 296 | + | | | | |
| 297 | +FIRST: { $d:ident, ε } { h, ε, ; } { f } { g } |
| 298 | +
|
| 299 | +
|
| 300 | +INPUT: $( $d:ident $e:expr );* $( $( h )* );* $( f ; )+ g |
| 301 | + ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 302 | + | |
| 303 | +FIRST: { $d:ident, h, ;, f } |
| 304 | +``` |
| 305 | + |
| 306 | +Thus: |
| 307 | + |
| 308 | + * FIRST(`$($d:ident $e:expr );* $( $(h)* );* $( f ;)+ g`) = { `$d:ident`, `h`, `;`, `f` } |
| 309 | + |
| 310 | +Note however that: |
| 311 | + |
| 312 | + * FIRST(`$($d:ident $e:expr );* $( $(h)* );* $($( f ;)+ g)*`) = { `$d:ident`, `h`, `;`, `f`, ε } |
| 313 | + |
| 314 | +Here are similar examples but now for LAST. |
| 315 | + |
| 316 | + * LAST(`$d:ident $e:expr`) = { `$e:expr` } |
| 317 | + * LAST(`$( $d:ident $e:expr );*`) = { `$e:expr`, ε } |
| 318 | + * LAST(`$( $d:ident $e:expr );* $(h)*`) = { `$e:expr`, ε, `h` } |
| 319 | + * LAST(`$( $d:ident $e:expr );* $(h)* $( f ;)+`) = { `;` } |
| 320 | + * LAST(`$( $d:ident $e:expr );* $(h)* $( f ;)+ g`) = { `g` } |
| 321 | + |
| 322 | +### FOLLOW(M) |
| 323 | + |
| 324 | +Finally, the definition for FOLLOW(M) is built up as follows. pat, expr, etc. |
| 325 | +represent simple nonterminals with the given fragment specifier. |
| 326 | + |
| 327 | + * FOLLOW(pat) = {`=>`, `,`, `=`, `|`, `if`, `in`}`. |
| 328 | + |
| 329 | + * FOLLOW(expr) = FOLLOW(stmt) = {`=>`, `,`, `;`}`. |
| 330 | + |
| 331 | + * FOLLOW(ty) = FOLLOW(path) = {`{`, `[`, `,`, `=>`, `:`, `=`, `>`, `>>`, `;`, |
| 332 | + `|`, `as`, `where`, block nonterminals}. |
| 333 | + |
| 334 | + * FOLLOW(vis) = {`,`l any keyword or identifier except a non-raw `priv`; any |
| 335 | + token that can begin a type; ident, ty, and path nonterminals}. |
| 336 | + |
| 337 | + * FOLLOW(t) = ANYTOKEN for any other simple token, including block, ident, |
| 338 | + tt, item, lifetime, literal and meta simple nonterminals, and all terminals. |
| 339 | + |
| 340 | + * FOLLOW(M), for any other M, is defined as the intersection, as t ranges over |
| 341 | + (LAST(M) \ {ε}), of FOLLOW(t). |
| 342 | + |
| 343 | +The tokens that can begin a type are, as of this writing, {`(`, `[`, `!`, `\*`, |
| 344 | +`&`, `&&`, `?`, lifetimes, `>`, `>>`, `::`, any non-keyword identifier, `super`, |
| 345 | +`self`, `Self`, `extern`, `crate`, `$crate`, `_`, `for`, `impl`, `fn`, `unsafe`, |
| 346 | +`typeof`, `dyn`}, although this list may not be complete because people won't |
| 347 | +always remember to update the appendix when new ones are added. |
| 348 | + |
| 349 | +Examples of FOLLOW for complex M: |
| 350 | + |
| 351 | + * FOLLOW(`$( $d:ident $e:expr )\*`) = FOLLOW(`$e:expr`) |
| 352 | + * FOLLOW(`$( $d:ident $e:expr )\* $(;)\*`) = FOLLOW(`$e:expr`) ∩ ANYTOKEN = FOLLOW(`$e:expr`) |
| 353 | + * FOLLOW(`$( $d:ident $e:expr )\* $(;)\* $( f |)+`) = ANYTOKEN |
| 354 | + |
| 355 | +### Examples of valid and invalid matchers |
| 356 | + |
| 357 | +With the above specification in hand, we can present arguments for |
| 358 | +why particular matchers are legal and others are not. |
| 359 | + |
| 360 | + * `($ty:ty < foo ,)` : illegal, because FIRST(`< foo ,`) = { `<` } ⊈ FOLLOW(`ty`) |
| 361 | + |
| 362 | + * `($ty:ty , foo <)` : legal, because FIRST(`, foo <`) = { `,` } is ⊆ FOLLOW(`ty`). |
| 363 | + |
| 364 | + * `($pa:pat $pb:pat $ty:ty ,)` : illegal, because FIRST(`$pb:pat $ty:ty ,`) = { `$pb:pat` } ⊈ FOLLOW(`pat`), and also FIRST(`$ty:ty ,`) = { `$ty:ty` } ⊈ FOLLOW(`pat`). |
| 365 | + |
| 366 | + * `( $($a:tt $b:tt)* ; )` : legal, because FIRST(`$b:tt`) = { `$b:tt` } is ⊆ FOLLOW(`tt`) = ANYTOKEN, as is FIRST(`;`) = { `;` }. |
| 367 | + |
| 368 | + * `( $($t:tt),* , $(t:tt),* )` : legal, (though any attempt to actually use this macro will signal a local ambguity error during expansion). |
| 369 | + |
| 370 | + * `($ty:ty $(; not sep)* -)` : illegal, because FIRST(`$(; not sep)* -`) = { `;`, `-` } is not in FOLLOW(`ty`). |
| 371 | + |
| 372 | + * `($($ty:ty)-+)` : illegal, because separator `-` is not in FOLLOW(`ty`). |
| 373 | + |
| 374 | + * `($($e:expr)*)` : illegal, because expr NTs are not in FOLLOW(expr NT). |
| 375 | + |
| 376 | +[Macros by Example]: macros-by-example.html |
| 377 | +[RFC 550]: https://github.com/rust-lang/rfcs/blob/master/text/0550-macro-future-proofing.html |
| 378 | +[tracking issue]: https://github.com/rust-lang/rust/issues/56575 |
0 commit comments