Commit fa56fdb

Lexical structure: move the description of CRLF normalization
We now say that CRLF normalization happens as a separate pass before tokenization.
1 parent a0b1195 commit fa56fdb

3 files changed, 54 additions and 26 deletions

src/comments.md

Lines changed: 4 additions & 3 deletions
@@ -30,7 +30,7 @@
 >    | INNER_BLOCK_DOC
 >
 > _IsolatedCR_ :\
->    _A `\r` not followed by a `\n`_
+>    \\r
 
 ## Non-doc comments

@@ -53,8 +53,9 @@ that follows. That is, they are equivalent to writing `#![doc="..."]` around
 the body of the comment. `//!` comments are usually used to document
 modules that occupy a source file.
 
-Isolated CRs (`\r`), i.e. not followed by LF (`\n`), are not allowed in doc
-comments.
+The character `U+000D` (CR) is not allowed in doc comments.
+
+> **Note**: The sequence `U+000D` (CR) immediately followed by `U+000A` (LF) would have been previously transformed into a single `U+000A` (LF).
 
 ## Examples

src/input-format.md

Lines changed: 20 additions & 1 deletion
@@ -1,3 +1,22 @@
 # Input format
 
-Rust input is interpreted as a sequence of Unicode code points encoded in UTF-8.
+This chapter describes how a source file is interpreted as a sequence of tokens.
+
+See [Crates and source files] for a description of how programs are organised into files.
+
+## Source encoding
+
+Each source file is interpreted as a sequence of Unicode characters encoded in UTF-8.
+It is an error if the file is not valid UTF-8.
+
+## CRLF normalization
+
+Each pair of characters `U+000D` (CR) immediately followed by `U+000A` (LF) is replaced by a single `U+000A` (LF).
+
+Other occurrences of the character `U+000D` (CR) are left in place (they are treated as [whitespace]).
+
+## Tokenization
+
+The resulting sequence of characters is then converted into tokens as described in the remainder of this chapter.
+
+[Crates and source files]: crates-and-source-files.md
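
The CRLF normalization rule added to `src/input-format.md` can be sketched as a standalone pass over the decoded source text. This is a minimal illustration only, not the compiler's implementation; the function name `normalize_crlf` is hypothetical and does not appear in the commit.

```rust
/// Hypothetical helper (not taken from rustc or from this commit).
/// Replaces each CR that is immediately followed by LF with a single LF
/// and leaves every other CR in place.
fn normalize_crlf(src: &str) -> String {
    let mut out = String::with_capacity(src.len());
    let mut chars = src.chars().peekable();
    while let Some(c) = chars.next() {
        if c == '\r' && chars.peek() == Some(&'\n') {
            // Drop this CR; the LF is pushed on the next iteration.
            continue;
        }
        out.push(c);
    }
    out
}
```

Under these assumptions, `normalize_crlf("a\r\nb")` yields `"a\nb"`, while an isolated CR, as in `"a\rb"`, passes through unchanged and is then subject to the CR restrictions in the `src/comments.md` and `src/tokens.md` changes.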

src/tokens.md

Lines changed: 30 additions & 22 deletions
@@ -37,6 +37,8 @@ Literals are tokens used in [literal expressions].
 
 [^nsets]: The number of `#`s on each side of the same literal must be equivalent.
 
+> **Note**: Character and string literal tokens never include the sequence of `U+000D` (CR) immediately followed by `U+000A` (LF): this pair would have been previously transformed into a single `U+000A` (LF).
+
 #### ASCII escapes
 
 | | Name |
@@ -156,13 +158,10 @@ A _string literal_ is a sequence of any Unicode characters enclosed within two
 `U+0022` (double-quote) characters, with the exception of `U+0022` itself,
 which must be _escaped_ by a preceding `U+005C` character (`\`).
 
-Line-breaks are allowed in string literals.
-A line-break is either a newline (`U+000A`) or a pair of carriage return and newline (`U+000D`, `U+000A`).
-Both byte sequences are translated to `U+000A`.
-
+Line-breaks, represented by the character `U+000A` (LF), are allowed in string literals.
 When an unescaped `U+005C` character (`\`) occurs immediately before a line break, the line break does not appear in the string represented by the token.
 See [String continuation escapes] for details.
-
+The character `U+000D` (CR) may not appear in a string literal other than as part of such a string continuation escape.
 
 #### Character escapes

@@ -198,10 +197,10 @@ following forms:
 
 Raw string literals do not process any escapes. They start with the character
 `U+0072` (`r`), followed by fewer than 256 of the character `U+0023` (`#`) and a
-`U+0022` (double-quote) character. The _raw string body_ can contain any sequence
-of Unicode characters and is terminated only by another `U+0022` (double-quote)
-character, followed by the same number of `U+0023` (`#`) characters that preceded
-the opening `U+0022` (double-quote) character.
+`U+0022` (double-quote) character.
+
+The _raw string body_ can contain any sequence of Unicode characters other than `U+000D` (CR).
+It is terminated only by another `U+0022` (double-quote) character, followed by the same number of `U+0023` (`#`) characters that preceded the opening `U+0022` (double-quote) character.
 
 All Unicode characters contained in the raw string body represent themselves,
 the characters `U+0022` (double-quote) (except when followed by at least as
@@ -259,6 +258,11 @@ the literal, it must be _escaped_ by a preceding `U+005C` (`\`) character.
 Alternatively, a byte string literal can be a _raw byte string literal_, defined
 below.
 
+Line-breaks, represented by the character `U+000A` (LF), are allowed in byte string literals.
+When an unescaped `U+005C` character (`\`) occurs immediately before a line break, the line break does not appear in the string represented by the token.
+See [String continuation escapes] for details.
+The character `U+000D` (CR) may not appear in a byte string literal other than as part of such a string continuation escape.
+
 Some additional _escapes_ are available in either byte or non-raw byte string
 literals. An escape starts with a `U+005C` (`\`) and continues with one of the
 following forms:
@@ -281,19 +285,19 @@ following forms:
 > &nbsp;&nbsp; `br` RAW_BYTE_STRING_CONTENT SUFFIX<sup>?</sup>
 >
 > RAW_BYTE_STRING_CONTENT :\
-> &nbsp;&nbsp; &nbsp;&nbsp; `"` ASCII<sup>* (non-greedy)</sup> `"`\
+> &nbsp;&nbsp; &nbsp;&nbsp; `"` ASCII_FOR_RAW<sup>* (non-greedy)</sup> `"`\
 > &nbsp;&nbsp; | `#` RAW_BYTE_STRING_CONTENT `#`
 >
-> ASCII :\
-> &nbsp;&nbsp; _any ASCII (i.e. 0x00 to 0x7F)_
+> ASCII_FOR_RAW :\
+> &nbsp;&nbsp; _any ASCII (i.e. 0x00 to 0x7F) except IsolatedCR_
 
 Raw byte string literals do not process any escapes. They start with the
 character `U+0062` (`b`), followed by `U+0072` (`r`), followed by fewer than 256
-of the character `U+0023` (`#`), and a `U+0022` (double-quote) character. The
-_raw string body_ can contain any sequence of ASCII characters and is terminated
-only by another `U+0022` (double-quote) character, followed by the same number of
-`U+0023` (`#`) characters that preceded the opening `U+0022` (double-quote)
-character. A raw byte string literal can not contain any non-ASCII byte.
+of the character `U+0023` (`#`), and a `U+0022` (double-quote) character.
+
+The _raw string body_ can contain any sequence of ASCII characters other than `U+000D` (CR).
+It is terminated only by another `U+0022` (double-quote) character, followed by the same number of `U+0023` (`#`) characters that preceded the opening `U+0022` (double-quote) character.
+A raw byte string literal can not contain any non-ASCII byte.
 
 All characters contained in the raw string body represent their ASCII encoding,
 the characters `U+0022` (double-quote) (except when followed by at least as
@@ -340,6 +344,11 @@ C strings are implicitly terminated by byte `0x00`, so the C string literal
 literal `b"\x00"`. Other than the implicit terminator, byte `0x00` is not
 permitted within a C string.
 
+Line-breaks, represented by the character `U+000A` (LF), are allowed in C string literals.
+When an unescaped `U+005C` character (`\`) occurs immediately before a line break, the line break does not appear in the string represented by the token.
+See [String continuation escapes] for details.
+The character `U+000D` (CR) may not appear in a C string literal other than as part of such a string continuation escape.
+
 Some additional _escapes_ are available in non-raw C string literals. An escape
 starts with a `U+005C` (`\`) and continues with one of the following forms:

@@ -382,11 +391,10 @@ c"\xC3\xA6";
 
 Raw C string literals do not process any escapes. They start with the
 character `U+0063` (`c`), followed by `U+0072` (`r`), followed by fewer than 256
-of the character `U+0023` (`#`), and a `U+0022` (double-quote) character. The
-_raw C string body_ can contain any sequence of Unicode characters (other than
-`U+0000`) and is terminated only by another `U+0022` (double-quote) character,
-followed by the same number of `U+0023` (`#`) characters that preceded the
-opening `U+0022` (double-quote) character.
+of the character `U+0023` (`#`), and a `U+0022` (double-quote) character.
+
+The _raw C string body_ can contain any sequence of Unicode characters other than `U+0000` (NUL) and `U+000D` (CR).
+It is terminated only by another `U+0022` (double-quote) character, followed by the same number of `U+0023` (`#`) characters that preceded the opening `U+0022` (double-quote) character.
 
 All characters contained in the raw C string body represent themselves in UTF-8
 encoding. The characters `U+0022` (double-quote) (except when followed by at
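
The CR rule that `src/tokens.md` now states for non-raw string, byte string, and C string literals can be sketched as a check over a literal body that has already been through CRLF normalization. This is a hypothetical helper, not rustc's code. It assumes that a string continuation escape is `\` immediately followed by LF, and that the escape also skips any following `U+0020`, `U+0009`, `U+000A`, and `U+000D` characters; that whitespace set is an assumption about the String continuation escapes section, which this diff does not show.

```rust
/// Hypothetical check (not from rustc or from this commit).
/// Returns the byte offset of the first CR in `body` that is not consumed
/// by a string continuation escape, or `None` if there is no such CR.
/// `body` is the text between the quotes, after CRLF normalization.
fn first_disallowed_cr(body: &str) -> Option<usize> {
    let mut chars = body.char_indices().peekable();
    while let Some((i, c)) = chars.next() {
        match c {
            // A CR outside any escape is not allowed.
            '\r' => return Some(i),
            '\\' => match chars.next() {
                // Assumed continuation escape: `\` then LF, then skip whitespace.
                Some((_, '\n')) => {
                    while let Some(&(_, w)) = chars.peek() {
                        if matches!(w, ' ' | '\t' | '\n' | '\r') {
                            chars.next();
                        } else {
                            break;
                        }
                    }
                }
                // `\` directly followed by CR is not a continuation escape.
                Some((j, '\r')) => return Some(j),
                // Any other escape: the character after `\` is consumed.
                _ => {}
            },
            _ => {}
        }
    }
    None
}
```

With this sketch, a body such as `"a\rb"` reports offset 1, whereas a CR that appears in the whitespace run after `\` and LF is accepted. Raw literals would need no escape handling at all: under the new wording, any CR in a raw body is disallowed.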
