Commit 5f51269

lexical structure: move the description of BOM-removal
This takes place at the same time as CRLF normalisation. It's better not to list it in a Lexer block, as it isn't a token that can be fed to a macro.
1 parent fa56fdb commit 5f51269

2 files changed: 7 additions, 11 deletions

src/crates-and-source-files.md

Lines changed: 2 additions & 11 deletions
@@ -2,13 +2,11 @@
 
 > **<sup>Syntax</sup>**\
 > _Crate_ :\
-> &nbsp;&nbsp; UTF8BOM<sup>?</sup>\
 > &nbsp;&nbsp; SHEBANG<sup>?</sup>\
 > &nbsp;&nbsp; [_InnerAttribute_]<sup>\*</sup>\
 > &nbsp;&nbsp; [_Item_]<sup>\*</sup>
 
 > **<sup>Lexer</sup>**\
-> UTF8BOM : `\uFEFF`\
 > SHEBANG : `#!` \~`\n`<sup>\+</sup>[](#shebang)
 
 

@@ -65,19 +63,13 @@ apply to the crate as a whole.
 #![warn(non_camel_case_types)]
 ```
 
-## Byte order mark
-
-The optional [_UTF8 byte order mark_] (UTF8BOM production) indicates that the
-file is encoded in UTF8. It can only occur at the beginning of the file and
-is ignored by the compiler.
-
 ## Shebang
 
 A source file can have a [_shebang_] (SHEBANG production), which indicates
 to the operating system what program to use to execute this file. It serves
 essentially to treat the source file as an executable script. The shebang
-can only occur at the beginning of the file (but after the optional
-_UTF8BOM_). It is ignored by the compiler. For example:
+can only occur at the beginning of the file.
+It is ignored by the compiler. For example:
 
 <!-- ignore: tests don't like shebang -->
 ```rust,ignore
@@ -162,7 +154,6 @@ or `_` (U+005F) characters.
 [_Item_]: items.md
 [_MetaNameValueStr_]: attributes.md#meta-item-attribute-syntax
 [_shebang_]: https://en.wikipedia.org/wiki/Shebang_(Unix)
-[_utf8 byte order mark_]: https://en.wikipedia.org/wiki/Byte_order_mark#UTF-8
 [`ExitCode`]: ../std/process/struct.ExitCode.html
 [`Infallible`]: ../std/convert/enum.Infallible.html
 [`Termination`]: ../std/process/trait.Termination.html

src/input-format.md

Lines changed: 5 additions & 0 deletions
@@ -9,6 +9,10 @@ See [Crates and source files] for a description of how programs are organised in
 Each source file is interpreted as a sequence of Unicode characters encoded in UTF-8.
 It is an error if the file is not valid UTF-8.
 
+## Byte order mark removal
+
+If the first character in the sequence is `U+FEFF` ([BYTE ORDER MARK]), it is removed.
+
 ## CRLF normalization
 
 Each pair of characters `U+000D` (CR) immediately followed by `U+000A` (LF) is replaced by a single `U+000A` (LF).
@@ -19,4 +23,5 @@ Other occurrences of the character `U+000D` (CR) are left in place (they are tre
 
 The resulting sequence of characters is then converted into tokens as described in the remainder of this chapter.
 
+[BYTE ORDER MARK]: https://en.wikipedia.org/wiki/Byte_order_mark#UTF-8
 [Crates and source files]: crates-and-source-files.md
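
For illustration only, the two preprocessing steps described in `src/input-format.md` (byte order mark removal, then CRLF normalization) can be sketched in Rust as follows. The function name `preprocess` is hypothetical and this is not how the compiler is implemented; it is a minimal sketch that assumes the input has already been validated as UTF-8 (which a `&str` guarantees) and only models the behaviour the text describes.

```rust
/// Hypothetical helper (not part of rustc or the Reference): applies the two
/// input-format steps described in the text to an already-decoded source file.
fn preprocess(source: &str) -> String {
    // Byte order mark removal: only a U+FEFF that is the very first
    // character is dropped; any later U+FEFF is left untouched.
    let without_bom = source.strip_prefix('\u{FEFF}').unwrap_or(source);

    // CRLF normalization: each CR immediately followed by LF becomes a
    // single LF. A CR that is not followed by LF is left in place.
    without_bom.replace("\r\n", "\n")
}
```

Under these assumptions, `preprocess("\u{FEFF}fn main() {}\r\n")` yields `"fn main() {}\n"`. Because the BOM is removed before tokenization, it never appears as a token, which matches the commit message's reason for dropping `UTF8BOM` from the Lexer block.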
