Commit f130e7d

revamp the Compiler Process section to be more up to date
1 parent 70db841 commit f130e7d

1 file changed

+74
-116
lines changed

src/librustc/README.md

Lines changed: 74 additions & 116 deletions
@@ -91,121 +91,69 @@ better than others).
9191
The compiler process
9292
====================
9393
94-
The Rust compiler is comprised of six main compilation phases.
95-
96-
1. Parsing input
97-
2. Configuration & expanding (cfg rules & syntax extension expansion)
98-
3. Running analysis passes
99-
4. Translation to LLVM
100-
5. LLVM passes
101-
6. Linking
102-
103-
Phase one is responsible for parsing & lexing the input to the compiler. The
104-
output of this phase is an abstract syntax tree (AST). The AST at this point
105-
includes all macro uses & attributes. This means code which will be later
106-
expanded and/or removed due to `cfg` attributes is still present in this
107-
version of the AST. Parsing abstracts away details about individual files which
108-
have been read into the AST.
109-
110-
Phase two handles configuration and macro expansion. You can think of this
111-
phase as a function acting on the AST from the previous phase. The input for
112-
this phase is the unexpanded AST from phase one, and the output is an expanded
113-
version of the same AST. This phase will expand all macros & syntax
114-
extensions and will evaluate all `cfg` attributes, potentially removing some
115-
code. The resulting AST will not contain any macros or `macro_use` statements.
116-
117-
The code for these first two phases is in [`libsyntax`][libsyntax].
118-
119-
After this phase, the compiler allocates IDs to each node in the AST
120-
(technically not every node, but most of them). If we are writing out
121-
dependencies, that happens now.
122-
123-
The third phase is analysis. This is the most complex phase in the compiler,
124-
and makes up much of the code. This phase included name resolution, type
125-
checking, borrow checking, type & lifetime inference, trait selection, method
126-
selection, linting and so on. Most of the error detection in the compiler comes
127-
from this phase (with the exception of parse errors which arise during
128-
parsing). The "output" of this phase is a set of side tables containing
129-
semantic information about the source program. The analysis code is in
130-
[`librustc`][rustc] and some other crates with the `librustc_` prefix.
131-
132-
The fourth phase is translation. This phase translates the AST (and the side
133-
tables from the previous phase) into LLVM IR (intermediate representation).
134-
This is achieved by calling into the LLVM libraries. The code for this is in
135-
[`librustc_trans`][trans].
136-
137-
Phase five runs the LLVM backend. This runs LLVM's optimization passes on the
138-
generated IR and generates machine code resulting in object files. This phase
139-
is not really part of the Rust compiler, as LLVM carries out all the work.
140-
The interface between LLVM and Rust is in [`librustc_llvm`][llvm].
141-
142-
The final phase, phase six, links the object files into an executable. This is
143-
again outsourced to other tools and not performed by the Rust compiler
144-
directly. The interface is in [`librustc_back`][back] (which also contains some
145-
things used primarily during translation).
146-
147-
A module called the driver coordinates all these phases. It handles all the
148-
highest level coordination of compilation from parsing command line arguments
149-
all the way to invoking the linker to produce an executable.
150-
151-
Modules in the librustc crate
152-
=============================
153-
154-
The librustc crate itself consists of the following submodules
155-
(mostly, but not entirely, in their own directories):
156-
157-
- session: options and data that pertain to the compilation session as
158-
a whole
159-
- middle: middle-end: name resolution, typechecking, LLVM code
160-
generation
161-
- metadata: encoder and decoder for data required by separate
162-
compilation
163-
- plugin: infrastructure for compiler plugins
164-
- lint: infrastructure for compiler warnings
165-
- util: ubiquitous types and helper functions
166-
- lib: bindings to LLVM
167-
168-
The entry-point for the compiler is main() in the [`librustc_driver`][driver]
169-
crate.
170-
171-
The 3 central data structures:
172-
------------------------------
173-
174-
1. `./../libsyntax/ast.rs` defines the AST. The AST is treated as
175-
immutable after parsing, but it depends on mutable context data
176-
structures (mainly hash maps) to give it meaning.
177-
178-
- Many – though not all – nodes within this data structure are
179-
wrapped in the type `spanned<T>`, meaning that the front-end has
180-
marked the input coordinates of that node. The member `node` is
181-
the data itself, the member `span` is the input location (file,
182-
line, column; both low and high).
183-
184-
- Many other nodes within this data structure carry a
185-
`def_id`. These nodes represent the 'target' of some name
186-
reference elsewhere in the tree. When the AST is resolved, by
187-
`middle/resolve.rs`, all names wind up acquiring a def that they
188-
point to. So anything that can be pointed-to by a name winds
189-
up with a `def_id`.
190-
191-
2. `middle/ty.rs` defines the datatype `sty`. This is the type that
192-
represents types after they have been resolved and normalized by
193-
the middle-end. The typeck phase converts every ast type to a
194-
`ty::sty`, and the latter is used to drive later phases of
195-
compilation. Most variants in the `ast::ty` tag have a
196-
corresponding variant in the `ty::sty` tag.
197-
198-
3. `./../librustc_llvm/lib.rs` defines the exported types
199-
`ValueRef`, `TypeRef`, `BasicBlockRef`, and several others.
200-
Each of these is an opaque pointer to an LLVM type,
201-
manipulated through the `lib::llvm` interface.
202-
203-
[libsyntax]: https://github.com/rust-lang/rust/tree/master/src/libsyntax/
204-
[trans]: https://github.com/rust-lang/rust/tree/master/src/librustc_trans/
205-
[llvm]: https://github.com/rust-lang/rust/tree/master/src/librustc_llvm/
206-
[back]: https://github.com/rust-lang/rust/tree/master/src/librustc_back/
207-
[rustc]: https://github.com/rust-lang/rust/tree/master/src/librustc/
208-
[driver]: https://github.com/rust-lang/rust/tree/master/src/librustc_driver
94+
The Rust compiler is in a bit of transition right now. It used to be a
95+
purely "pass-based" compiler, where we ran a number of passes over the
96+
entire program, and each did a particular check or transformation.
97+
98+
We are gradually replacing this pass-based code with an alternative
99+
setup based on on-demand **queries**. In the query-model, we work
100+
backwards, executing a *query* that expresses our ultimate goal (e.g.,
101+
"compile this crate"). This query in turn may make other queries
102+
(e.g., "get me a list of all modules in the crate"). Those queries
103+
make other queries that ultimately bottom out in the base operations,
104+
like parsing the input, running the type-checker, and so forth. This
105+
on-demand model permits us to do exciting things like only do the
106+
minimal amount of work needed to type-check a single function. It also
107+
helps with incremental compilation. (For details on defining queries,
108+
check out `src/librustc/ty/maps/README.md`.)
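
To make the on-demand idea concrete, here is a minimal sketch of a memoizing query engine. It is not the compiler's real query system: `Query`, `QueryEngine`, the hard-coded module list, and the string results are all hypothetical names and values chosen for illustration. It only shows how a top-level query can pull in other queries and how cached results avoid repeated work.

```rust
use std::collections::HashMap;

/// Hypothetical query keys; the real compiler defines far more queries.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
enum Query {
    CompileCrate,
    ModuleList,
    TypeCheck(String),
}

/// Hypothetical engine: caches results and counts how many providers ran.
#[derive(Default)]
struct QueryEngine {
    cache: HashMap<Query, String>,
    executions: usize,
}

impl QueryEngine {
    /// Return the cached result for `query`, or compute it on demand.
    fn get(&mut self, query: Query) -> String {
        if let Some(hit) = self.cache.get(&query) {
            return hit.clone();
        }
        self.executions += 1;
        let result = match &query {
            // Base operations (stand-ins for parsing, type-checking, ...).
            Query::ModuleList => "a,b".to_string(),
            Query::TypeCheck(module) => format!("{}: ok", module),
            // The "ultimate goal" query works backwards by issuing sub-queries.
            Query::CompileCrate => {
                let modules = self.get(Query::ModuleList);
                let names: Vec<String> = modules.split(',').map(str::to_string).collect();
                let mut reports = Vec::new();
                for name in names {
                    reports.push(self.get(Query::TypeCheck(name)));
                }
                reports.join("; ")
            }
        };
        self.cache.insert(query, result.clone());
        result
    }
}
```

In this sketch, asking only for `Query::TypeCheck("a".to_string())` runs a single provider, and a later `Query::CompileCrate` reuses that cached result instead of recomputing it.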
109+
110+
Regardless of the general setup, the basic operations that the
111+
compiler must perform are the same. The only thing that changes is
112+
whether these operations are invoked front-to-back, or on demand. In
113+
order to compile a Rust crate, these are the general steps that we
114+
take:
115+
116+
1. **Parsing input**
117+
- this processes the `.rs` files and produces the AST ("abstract syntax tree")
118+
- the AST is defined in `syntax/ast.rs`. It is intended to match the lexical
119+
syntax of the Rust language quite closely.
120+
2. **Name resolution, macro expansion, and configuration**
121+
- once parsing is complete, we process the AST recursively, resolving paths
122+
and expanding macros. This same process also processes `#[cfg]` nodes, and hence
123+
may strip things out of the AST as well.
124+
3. **Lowering to HIR**
125+
- Once name resolution completes, we convert the AST into the HIR,
126+
or "high-level IR". The HIR is defined in `src/librustc/hir/`; that module also includes
127+
the lowering code.
128+
- The HIR is a lightly desugared variant of the AST. It is more processed than the
129+
AST and more suitable for the analyses that follow. It is **not** required to match
130+
the syntax of the Rust language.
131+
- As a simple example, in the **AST**, we preserve the parentheses
132+
that the user wrote, so `((1 + 2) + 3)` and `1 + 2 + 3` parse
133+
into distinct trees, even though they are equivalent. In the
134+
HIR, however, parentheses nodes are removed, and those two
135+
expressions are represented in the same way.
136+
4. **Type-checking and subsequent analyses**
137+
- An important step in processing the HIR is to perform type
138+
checking. This process assigns a type to every HIR expression,
139+
for example, and is also responsible for resolving some
140+
"type-dependent" paths, such as field accesses (`x.f` -- we
141+
can't know what field `f` is being accessed until we know the
142+
type of `x`) and associated type references (`T::Item` -- we
143+
can't know what type `Item` is until we know what `T` is).
144+
- Type checking creates "side-tables" (`TypeckTables`) that include
145+
the types of expressions, the way to resolve methods, and so forth.
146+
- After type-checking, we can do other analyses, such as privacy checking.
147+
5. **Lowering to MIR and post-processing**
148+
- Once type-checking is done, we can lower the HIR into MIR ("middle IR"), which
149+
is a **very** desugared version of Rust, well suited to the borrowck but also
150+
to certain high-level optimizations.
151+
6. **Translation to LLVM and LLVM optimizations**
152+
- From MIR, we can produce LLVM IR.
153+
- LLVM then runs its various optimizations, which produces a number of `.o` files
154+
(one for each "codegen unit").
155+
7. **Linking**
156+
- Finally, those `.o` files are linked together.
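
The parentheses example from the "Lowering to HIR" step can be sketched with two toy expression types. `AstExpr`, `HirExpr`, and `lower` are hypothetical, heavily simplified stand-ins, not the definitions in `syntax/ast.rs` or `src/librustc/hir/`. The sketch only illustrates that lowering can drop a node kind the AST keeps.

```rust
/// Hypothetical toy AST: keeps the parentheses the user wrote.
#[derive(Debug, Clone, PartialEq)]
enum AstExpr {
    Lit(i64),
    Add(Box<AstExpr>, Box<AstExpr>),
    Paren(Box<AstExpr>),
}

/// Hypothetical toy HIR: has no parenthesis node at all.
#[derive(Debug, Clone, PartialEq)]
enum HirExpr {
    Lit(i64),
    Add(Box<HirExpr>, Box<HirExpr>),
}

/// Lower the toy AST into the toy HIR, discarding `Paren` wrappers.
fn lower(expr: &AstExpr) -> HirExpr {
    match expr {
        AstExpr::Lit(n) => HirExpr::Lit(*n),
        AstExpr::Add(lhs, rhs) => HirExpr::Add(Box::new(lower(lhs)), Box::new(lower(rhs))),
        // The grouping is already captured by the tree shape, so the
        // parenthesis node itself carries no extra information here.
        AstExpr::Paren(inner) => lower(inner),
    }
}
```

Under these assumptions, the toy trees for `((1 + 2) + 3)` and `1 + 2 + 3` compare as different `AstExpr` values but lower to equal `HirExpr` values.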
209157
210158
Glossary
211159
========
@@ -215,9 +163,15 @@ things. This glossary attempts to list them and give you a few
215163
pointers for understanding them better.
216164
217165
- AST -- the **abstract syntax tree** produced by the `syntax` crate; reflects user syntax
218-
very closely.
166+
very closely.
167+
- codegen unit -- when we produce LLVM IR, we group the Rust code into a number of codegen
168+
units. Each unit is processed by LLVM independently of the others,
169+
enabling parallelism. They are also the unit of incremental re-use.
219170
- cx -- we tend to use "cx" as an abbreviation for context. See also tcx, infcx, etc.
171+
- `DefId` -- an index identifying a **definition** (see `librustc/hir/def_id.rs`).
220172
- HIR -- the **High-level IR**, created by lowering and desugaring the AST. See `librustc/hir`.
173+
- `HirId` -- identifies a particular node in the HIR by combining a
174+
def-id with an "intra-definition offset".
221175
- `'gcx` -- the lifetime of the global arena (see `librustc/ty`).
222176
- generics -- the set of generic type parameters defined on a type or item
223177
- infcx -- the inference context (see `librustc/infer`)
@@ -226,9 +180,13 @@ pointers for understanding them better.
226180
found in `src/librustc_mir`.
227181
- obligation -- something that must be proven by the trait system; see `librustc/traits`.
228182
- local crate -- the crate currently being compiled.
183+
- node-id or `NodeId` -- an index identifying a particular node in the
184+
AST or HIR; gradually being phased out.
229185
- query -- a sub-computation performed on demand during compilation; see `librustc/maps`.
230186
- provider -- the function that executes a query; see `librustc/maps`.
231187
- sess -- the **compiler session**, which stores global data used throughout compilation
188+
- side tables -- because the AST and HIR are immutable once created, we often carry extra
189+
information about them in the form of hashtables, indexed by the id of a particular node.
232190
- substs -- the **substitutions** for a given generic type or item
233191
(e.g., the `i32, u32` in `HashMap<i32, u32>`)
234192
- tcx -- the "typing context", main data structure of the compiler (see `librustc/ty`).
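
As a rough illustration of the `HirId` and "side tables" entries above, the sketch below keeps type information in a hash table keyed by node id instead of storing it in the tree. The struct layouts, the `TypeTable` name, and the use of strings for types are assumptions made for this example; they are not the compiler's actual definitions.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a definition index.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct DefId(u32);

/// Hypothetical stand-in for a node id: a definition plus an offset within it.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct HirId {
    owner: DefId,
    local_id: u32,
}

/// A side table: extra information about immutable nodes, indexed by node id.
#[derive(Default)]
struct TypeTable {
    types: HashMap<HirId, String>,
}

impl TypeTable {
    /// Record the type computed for a node.
    fn record(&mut self, id: HirId, ty: &str) {
        self.types.insert(id, ty.to_string());
    }

    /// Look up the recorded type, if any analysis has stored one.
    fn type_of(&self, id: HirId) -> Option<&str> {
        self.types.get(&id).map(String::as_str)
    }
}
```

Because the tree is never mutated in this design, later analyses can add their own tables keyed by the same ids without touching earlier results.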
