Commit c2d5a4c

Edit "What the compiler does to your code"
1 parent a9829ab commit c2d5a4c

1 file changed

+109
-93
lines changed

src/overview.md

Lines changed: 109 additions & 93 deletions
@@ -17,108 +17,122 @@ So first, let's look at what the compiler does to your code. For now, we will
1717
avoid mentioning how the compiler implements these steps except as needed;
1818
we'll talk about that later.
1919

20-
### Invokation
21-
22-
- The compile process begins when a user writes a Rust source program in text
23-
and invokes the `rustc` compiler on it. The work that the compiler needs to
24-
perform is defined by command-line options. For example, it is possible to
25-
enable nightly features (`-Z` flags), perform `check`-only builds, or emit
26-
LLVM-IR rather than executable machine code. The `rustc` executable call may
27-
be indirect through the use of `cargo`.
28-
- Command line argument parsing occurs in the [`rustc_driver`]. This crate
29-
defines the compile configuration that is requested by the user and passes it
30-
to the rest of the compilation process as a [`rustc_interface::Config`].
20+
### Invocation
21+
22+
Compilation begins when a user writes a Rust source program in text
23+
and invokes the `rustc` compiler on it. The work that the compiler needs to
24+
perform is defined by command-line options. For example, it is possible to
25+
enable nightly features (`-Z` flags), perform `check`-only builds, or emit
26+
LLVM-IR rather than executable machine code. The `rustc` executable call may
27+
be indirect through the use of `cargo`.
28+
29+
Command line argument parsing occurs in the [`rustc_driver`]. This crate
30+
defines the compile configuration that is requested by the user and passes it
31+
to the rest of the compilation process as a [`rustc_interface::Config`].
3132

3233
### Lexing and parsing
3334

34-
- The raw Rust source text is analyzed by a low-level lexer located in
35-
[`rustc_lexer`]. At this stage, the source text is turned into a stream of
36-
atomic source code units known as _tokens_. The lexer supports the
37-
Unicode character encoding.
38-
- The token stream passes through a higher-level lexer located in
39-
[`rustc_parse`] to prepare for the next stage of the compile process. The
40-
[`StringReader`] struct is used at this stage to perform a set of validations
41-
and turn strings into interned symbols (_interning_ is discussed later).
42-
[String interning] is a way of storing only one immutable
43-
copy of each distinct string value.
44-
45-
- The lexer has a small interface and doesn't depend directly on the
46-
diagnostic infrastructure in `rustc`. Instead it provides diagnostics as plain
47-
data which are emitted in `rustc_parse::lexer::mod` as real diagnostics.
48-
- The lexer preserves full fidelity information for both IDEs and proc macros.
49-
- The parser [translates the token stream from the lexer into an Abstract Syntax
50-
Tree (AST)][parser]. It uses a recursive descent (top-down) approach to syntax
51-
analysis. The crate entry points for the parser are the
52-
[`Parser::parse_crate_mod()`][parse_crate_mod] and [`Parser::parse_mod()`][parse_mod]
53-
methods found in [`rustc_parse::parser::Parser`]. The external module parsing
54-
entry point is [`rustc_expand::module::parse_external_mod`][parse_external_mod].
55-
And the macro parser entry point is [`Parser::parse_nonterminal()`][parse_nonterminal].
56-
- Parsing is performed with a set of `Parser` utility methods including `fn bump`,
57-
`fn check`, `fn eat`, `fn expect`, `fn look_ahead`.
58-
- Parsing is organized by the semantic construct that is being parsed. Separate
59-
`parse_*` methods can be found in [`rustc_parse` `parser`][rustc_parse_parser_dir]
60-
directory. The source file name follows the construct name. For example, the
61-
following files are found in the parser:
62-
- `expr.rs`
63-
- `pat.rs`
64-
- `ty.rs`
65-
- `stmt.rs`
66-
- This naming scheme is used across many compiler stages. You will find
67-
either a file or directory with the same name across the parsing, lowering,
68-
type checking, THIR lowering, and MIR building sources.
69-
- Macro expansion, AST validation, name resolution, and early linting takes place
70-
during this stage of the compile process.
71-
- The parser uses the standard `DiagnosticBuilder` API for error handling, but we
72-
try to recover, parsing a superset of Rust's grammar, while also emitting an error.
73-
- `rustc_ast::ast::{Crate, Mod, Expr, Pat, ...}` AST nodes are returned from the parser.
35+
The raw Rust source text is analyzed by a low-level *lexer* located in
36+
[`rustc_lexer`]. At this stage, the source text is turned into a stream of
37+
atomic source code units known as _tokens_. The lexer supports the
38+
Unicode character encoding.
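
To make the idea of a token stream concrete, here is a minimal sketch of a low-level lexer. It is not the `rustc_lexer` API: the names `TokenKind`, `Token`, `tokenize`, and `run_len` are hypothetical, and the token categories are far coarser than the real ones. Each token records only its kind and its length in bytes, so the lexer needs no diagnostics machinery.

```rust
/// Coarse token categories for this toy lexer (hypothetical, not rustc's).
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum TokenKind {
    Ident,
    Number,
    Whitespace,
    Punct,
}

/// A token is just a kind plus its length in bytes of the source text.
#[derive(Debug, PartialEq, Clone, Copy)]
pub struct Token {
    pub kind: TokenKind,
    pub len: usize,
}

/// Byte length of the longest prefix of `s` whose chars all satisfy `pred`.
fn run_len(s: &str, pred: impl Fn(char) -> bool) -> usize {
    s.chars().take_while(|&c| pred(c)).map(char::len_utf8).sum()
}

/// Splits `src` into tokens; works on `char`s, so non-ASCII identifiers are fine.
pub fn tokenize(src: &str) -> Vec<Token> {
    let mut tokens = Vec::new();
    let mut rest = src;
    while let Some(first) = rest.chars().next() {
        let (kind, len) = if first.is_alphabetic() || first == '_' {
            (TokenKind::Ident, run_len(rest, |c| c.is_alphanumeric() || c == '_'))
        } else if first.is_ascii_digit() {
            (TokenKind::Number, run_len(rest, |c| c.is_ascii_digit()))
        } else if first.is_whitespace() {
            (TokenKind::Whitespace, run_len(rest, char::is_whitespace))
        } else {
            // Any other character becomes a one-character punctuation token.
            (TokenKind::Punct, first.len_utf8())
        };
        tokens.push(Token { kind, len });
        rest = &rest[len..];
    }
    tokens
}
```

Because whitespace is kept as a token and every token carries its length, the original text can be reconstructed from the stream, which is one simple way to preserve full-fidelity information.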
39+
40+
The token stream passes through a higher-level lexer located in
41+
[`rustc_parse`] to prepare for the next stage of the compile process. The
42+
[`StringReader`] struct is used at this stage to perform a set of validations
43+
and turn strings into interned symbols (_interning_ is discussed later).
44+
[String interning] is a way of storing only one immutable
45+
copy of each distinct string value.
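
As an illustration of string interning in general, the following is a minimal sketch of an interner. It is not how `rustc` implements its symbols; `Sym` and `Interner` are hypothetical names, and the real compiler uses a more elaborate, arena-backed design. The point is only that equal strings map to the same small handle and that each distinct string is stored once.

```rust
use std::collections::HashMap;
use std::rc::Rc;

/// Hypothetical handle to an interned string (cheap to copy and compare).
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct Sym(u32);

/// Toy interner: the map and the vector share one allocation per string.
#[derive(Default)]
pub struct Interner {
    ids: HashMap<Rc<str>, Sym>,
    strings: Vec<Rc<str>>,
}

impl Interner {
    /// Returns the existing handle for `s`, or stores `s` and returns a new one.
    pub fn intern(&mut self, s: &str) -> Sym {
        if let Some(&sym) = self.ids.get(s) {
            return sym;
        }
        let sym = Sym(self.strings.len() as u32);
        let shared: Rc<str> = Rc::from(s);
        self.strings.push(Rc::clone(&shared));
        self.ids.insert(shared, sym);
        sym
    }

    /// Looks up the text behind a handle produced by this interner.
    pub fn resolve(&self, sym: Sym) -> &str {
        &self.strings[sym.0 as usize]
    }
}
```

Interning `"foo"` twice yields the same `Sym`, so later compiler stages can compare identifiers by comparing integers instead of string contents.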
46+
47+
The lexer has a small interface and doesn't depend directly on the
48+
diagnostic infrastructure in `rustc`. Instead it provides diagnostics as plain
49+
data which are emitted in `rustc_parse::lexer::mod` as real diagnostics.
50+
The lexer preserves full fidelity information for both IDEs and proc macros.
51+
52+
The *parser* [translates the token stream from the lexer into an Abstract Syntax
53+
Tree (AST)][parser]. It uses a recursive descent (top-down) approach to syntax
54+
analysis. The crate entry points for the parser are the
55+
[`Parser::parse_crate_mod()`][parse_crate_mod] and [`Parser::parse_mod()`][parse_mod]
56+
methods found in [`rustc_parse::parser::Parser`]. The external module parsing
57+
entry point is [`rustc_expand::module::parse_external_mod`][parse_external_mod].
58+
And the macro parser entry point is [`Parser::parse_nonterminal()`][parse_nonterminal].
59+
60+
Parsing is performed with a set of `Parser` utility methods including `bump`,
61+
`check`, `eat`, `expect`, and `look_ahead`.
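
The following sketch shows how helpers of this shape drive a recursive descent parser. Only the helper names are borrowed from the text above; the signatures, the `Tok` and `Expr` types, and the `ToyParser` struct are assumptions made for illustration and do not match the real `rustc_parse` `Parser`.

```rust
/// Toy token type; rustc's real tokens are far richer.
#[derive(Clone, Debug, PartialEq)]
pub enum Tok {
    Num(i64),
    Plus,
    LParen,
    RParen,
    Eof,
}

/// Toy AST produced by the parser.
#[derive(Debug, PartialEq)]
pub enum Expr {
    Num(i64),
    Add(Box<Expr>, Box<Expr>),
}

pub struct ToyParser {
    toks: Vec<Tok>,
    pos: usize,
}

impl ToyParser {
    /// Appends a trailing `Eof` so there is always a current token.
    pub fn new(mut toks: Vec<Tok>) -> Self {
        toks.push(Tok::Eof);
        ToyParser { toks, pos: 0 }
    }

    /// Peeks `n` tokens ahead without consuming; clamps to the final `Eof`.
    pub fn look_ahead(&self, n: usize) -> &Tok {
        let last = self.toks.len() - 1;
        &self.toks[(self.pos + n).min(last)]
    }

    /// Is the current token equal to `t`?
    pub fn check(&self, t: &Tok) -> bool {
        self.look_ahead(0) == t
    }

    /// Advances past the current token (never past the trailing `Eof`).
    pub fn bump(&mut self) {
        if self.pos + 1 < self.toks.len() {
            self.pos += 1;
        }
    }

    /// Consumes the current token if it equals `t`; reports whether it did.
    pub fn eat(&mut self, t: &Tok) -> bool {
        let found = self.check(t);
        if found {
            self.bump();
        }
        found
    }

    /// Like `eat`, but a mismatch is an error.
    pub fn expect(&mut self, t: &Tok) -> Result<(), String> {
        if self.eat(t) {
            Ok(())
        } else {
            Err(format!("expected {:?}, found {:?}", t, self.look_ahead(0)))
        }
    }

    /// expr := atom ('+' atom)*
    pub fn parse_expr(&mut self) -> Result<Expr, String> {
        let mut lhs = self.parse_atom()?;
        while self.eat(&Tok::Plus) {
            let rhs = self.parse_atom()?;
            lhs = Expr::Add(Box::new(lhs), Box::new(rhs));
        }
        Ok(lhs)
    }

    /// atom := NUM | '(' expr ')'
    fn parse_atom(&mut self) -> Result<Expr, String> {
        match self.look_ahead(0).clone() {
            Tok::Num(n) => {
                self.bump();
                Ok(Expr::Num(n))
            }
            Tok::LParen => {
                self.bump();
                let inner = self.parse_expr()?;
                self.expect(&Tok::RParen)?;
                Ok(inner)
            }
            other => Err(format!("unexpected token {:?}", other)),
        }
    }
}
```

Each grammar rule becomes one `parse_*` method, and the small helpers keep those methods short: `eat` handles optional tokens, while `expect` turns a missing token into an error.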
62+
63+
Parsing is organized by the semantic construct that is being parsed. Separate
64+
`parse_*` methods can be found in [`rustc_parse` `parser`][rustc_parse_parser_dir]
65+
directory. The source file name follows the construct name. For example, the
66+
following files are found in the parser:
67+
68+
- `expr.rs`
69+
- `pat.rs`
70+
- `ty.rs`
71+
- `stmt.rs`
72+
73+
This naming scheme is used across many compiler stages. You will find
74+
either a file or directory with the same name across the parsing, lowering,
75+
type checking, THIR lowering, and MIR building sources.
76+
77+
Macro expansion, AST validation, name resolution, and early linting also take place
78+
during this stage.
79+
80+
The parser uses the standard `DiagnosticBuilder` API for error handling, but we
81+
try to recover, parsing a superset of Rust's grammar, while also emitting an error.
82+
`rustc_ast::ast::{Crate, Mod, Expr, Pat, ...}` AST nodes are returned from the parser.
7483

7584
### HIR lowering
7685

77-
- We then take the AST and [convert it to High-Level Intermediate
78-
Representation (HIR)][hir]. This is a compiler-friendly representation of the
79-
AST. This involves a lot of desugaring of things like loops and `async fn`.
80-
- We use the HIR to do [type inference] (the process of automatic
81-
detection of the type of an expression), [trait solving] (the process
82-
of pairing up an impl with each reference to a trait), and [type
83-
checking] (the process of converting the types found in the HIR
84-
(`hir::Ty`), which represent the syntactic things that the user wrote,
85-
into the internal representation used by the compiler (`Ty<'tcx>`),
86-
and using that information to verify the type safety, correctness and
87-
coherence of the types used in the program).
86+
We next take the AST and convert it to [High-Level Intermediate
87+
Representation (HIR)][hir], a more compiler-friendly representation of the
88+
AST. This process is called "lowering". It involves a lot of desugaring of things
89+
like loops and `async fn`.
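
As a rough picture of what desugaring a loop means, the two functions below behave identically. The second is a hand-written approximation of the expansion of a `for` loop as described in the Rust Reference; the exact form the compiler produces during lowering differs in detail, and the function names here are hypothetical.

```rust
/// Sums a slice with an ordinary `for` loop.
pub fn sum_with_for(xs: &[i32]) -> i32 {
    let mut total = 0;
    for x in xs {
        total += *x;
    }
    total
}

/// The same loop written without `for`: an explicit iterator, `loop`, and `match`.
pub fn sum_desugared(xs: &[i32]) -> i32 {
    let mut total = 0;
    match IntoIterator::into_iter(xs) {
        mut iter => loop {
            match Iterator::next(&mut iter) {
                Some(x) => {
                    total += *x;
                }
                None => break,
            }
        },
    }
    total
}
```

After this kind of rewriting, later stages only need to understand `loop`, `match`, and function calls, not every surface-level loop form.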
90+
91+
We then use the HIR to do [*type inference*] (the process of automatic
92+
detection of the type of an expression), [*trait solving*] (the process
93+
of pairing up an impl with each reference to a trait), and [*type
94+
checking*]. Type checking is the process of converting the types found in the HIR
95+
([`hir::Ty`]), which represent what the user wrote,
96+
into the internal representation used by the compiler ([`Ty<'tcx>`]).
97+
That information is used to verify the type safety, correctness and
98+
coherence of the types used in the program.
8899

89100
### MIR lowering
90101

91-
- The HIR is then [lowered to Mid-Level Intermediate Representation (MIR)][mir].
92-
- Along the way, we construct the THIR, which is an even more desugared HIR.
93-
THIR is used for pattern and exhaustiveness checking. It is also more
94-
convenient to convert into MIR than HIR is.
95-
- The MIR is used for [borrow checking].
96-
- We (want to) do [many optimizations on the MIR][mir-opt] because it is still
97-
generic and that improves the code we generate later, improving compilation
98-
speed too.
99-
- MIR is a higher level (and generic) representation, so it is easier to do
100-
some optimizations at MIR level than at LLVM-IR level. For example LLVM
101-
doesn't seem to be able to optimize the pattern the [`simplify_try`] mir
102-
opt looks for.
103-
- Rust code is _monomorphized_, which means making copies of all the generic
104-
code with the type parameters replaced by concrete types. To do
105-
this, we need to collect a list of what concrete types to generate code for.
106-
This is called _monomorphization collection_.
102+
The HIR is then [lowered to Mid-level Intermediate Representation (MIR)][mir],
103+
which is used for [borrow checking].
104+
105+
Along the way, we also construct the THIR, which is an even more desugared HIR.
106+
THIR is used for pattern and exhaustiveness checking. It is also more
107+
convenient to convert into MIR than HIR is.
108+
109+
We do [many optimizations on the MIR][mir-opt] because it is still
110+
generic and that improves the code we generate later, improving compilation
111+
speed too.
112+
MIR is a higher-level (and generic) representation, so it is easier to do
113+
some optimizations at MIR level than at LLVM-IR level. For example LLVM
114+
doesn't seem to be able to optimize the pattern the [`simplify_try`] mir
115+
opt looks for.
116+
117+
Rust code is _monomorphized_, which means making copies of all the generic
118+
code with the type parameters replaced by concrete types. To do
119+
this, we need to collect a list of what concrete types to generate code for.
120+
This is called _monomorphization collection_ and it happens at the MIR level.
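
A small sketch may help picture monomorphization. The generic function `double` below is ordinary Rust; `double_i32` and `double_f64` are hypothetical, hand-written stand-ins for the specialized copies the compiler would generate if the program called `double` with `i32` and `f64`. The real generated copies have mangled names and exist only in the compiled output.

```rust
use std::ops::Add;

/// Generic source code: one definition, usable with many types.
pub fn double<T: Add<Output = T> + Copy>(x: T) -> T {
    x + x
}

/// Roughly what the `T = i32` copy looks like after monomorphization.
pub fn double_i32(x: i32) -> i32 {
    x + x
}

/// Roughly what the `T = f64` copy looks like after monomorphization.
pub fn double_f64(x: f64) -> f64 {
    x + x
}
```

In this picture, monomorphization collection is the step that discovers the set `{i32, f64}` by walking the calls that are actually reachable, so that no copies are generated for types the program never uses.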
107121

108122
### Code generation
109123

110-
- We then begin what is vaguely called _code generation_ or _codegen_.
111-
- The [code generation stage (codegen)][codegen] is when higher level
112-
representations of source are turned into an executable binary. `rustc`
113-
uses LLVM for code generation. The first step is to convert the MIR
114-
to LLVM Intermediate Representation (LLVM IR). This is where the MIR
115-
is actually monomorphized, according to the list we created in the
116-
previous step.
117-
- The LLVM IR is passed to LLVM, which does a lot more optimizations on it.
118-
It then emits machine code. It is basically assembly code with additional
119-
low-level types and annotations added. (e.g. an ELF object or wasm).
120-
- The different libraries/binaries are linked together to produce the final
121-
binary.
124+
We then begin what is vaguely called _code generation_ or _codegen_.
125+
The [code generation stage][codegen] is when higher-level
126+
representations of source are turned into an executable binary. `rustc`
127+
uses LLVM for code generation. The first step is to convert the MIR
128+
to LLVM Intermediate Representation (LLVM IR). This is where the MIR
129+
is actually monomorphized, according to the list we created in the
130+
previous step.
131+
The LLVM IR is passed to LLVM, which does a lot more optimizations on it.
132+
It then emits machine code. It is basically assembly code with additional
133+
low-level types and annotations added (e.g. an ELF object or WASM).
134+
The different libraries/binaries are then linked together to produce the final
135+
binary.
122136

123137
[String interning]: https://en.wikipedia.org/wiki/String_interning
124138
[`rustc_lexer`]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_lexer/index.html
@@ -129,9 +143,9 @@ we'll talk about that later.
129143
[`rustc_parse`]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_parse/index.html
130144
[parser]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_parse/index.html
131145
[hir]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_hir/index.html
132-
[type inference]: https://rustc-dev-guide.rust-lang.org/type-inference.html
133-
[trait solving]: https://rustc-dev-guide.rust-lang.org/traits/resolution.html
134-
[type checking]: https://rustc-dev-guide.rust-lang.org/type-checking.html
146+
[*type inference*]: https://rustc-dev-guide.rust-lang.org/type-inference.html
147+
[*trait solving*]: https://rustc-dev-guide.rust-lang.org/traits/resolution.html
148+
[*type checking*]: https://rustc-dev-guide.rust-lang.org/type-checking.html
135149
[mir]: https://rustc-dev-guide.rust-lang.org/mir/index.html
136150
[borrow checking]: https://rustc-dev-guide.rust-lang.org/borrow_check.html
137151
[mir-opt]: https://rustc-dev-guide.rust-lang.org/mir/optimizations.html
@@ -143,6 +157,8 @@ we'll talk about that later.
143157
[`rustc_parse::parser::Parser`]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_parse/parser/struct.Parser.html
144158
[parse_external_mod]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_expand/module/fn.parse_external_mod.html
145159
[rustc_parse_parser_dir]: https://github.com/rust-lang/rust/tree/master/compiler/rustc_parse/src/parser
160+
[`hir::Ty`]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_hir/hir/struct.Ty.html
161+
[`Ty<'tcx>`]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_middle/ty/struct.Ty.html
146162

147163
## How it does it
148164
