= The Heart of a Language Server
:sectanchors:
:page-layout: post

In this post, I want to expand on one curious comment from the rust-analyzer code base.
You can find the comment https://github.com/rust-lang/rust-analyzer/blob/34cffbf1d75fb6b5cb6bc68a9854b20dc74f135d/crates/hir/src/semantics/source_to_def.rs#L3-L4[here].

It describes a curious recursive algorithm that is repeated across different language-server-shaped things:
I've seen it implemented for Kotlin and C#, and have implemented it myself for Rust.

Here's a seemingly random grab bag of IDE features:

- Go to definition
- Code completion
- Run test at the cursor
- Extract variable

What's common among them all?
All these features are relative to the _current position_ of the cursor!
The input is not only the state of the code at a given point in time, but also a specific location in the source of a project, like `src/main.rs:90:2`.

And the first thing a language server needs to do for any of the above features is to understand what is located at the given offset, semantically speaking.
Is it an operator, like `+`?
Is it a name, like `foo`?
If it is a name, in what context is the name used --- does it _define_ an entity named `foo` or does it _refer_ to a pre-existing entity?
If it is a reference, then _what_ entity is referenced?
What type is it?

The first step here is determining a node in the syntax tree which covers the offset.
This is relatively straightforward:

[source,rust]
----
fn node_at_offset(node: SyntaxNode, offset: u32) -> SyntaxNode {
    assert!(node.text_range().contains(offset));
    node.children()
        .find(|it| it.text_range().contains(offset))
        .map(|it| node_at_offset(it, offset))
        .unwrap_or(node)
}
----

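In practice, the cursor arrives as a line:column pair, so a small conversion step precedes the drill-down.
Here's a minimal sketch, assuming a hypothetical `offset_of` helper, zero-based positions, and plain `\n` line endings:

[source,rust]
----
// A sketch (not rust-analyzer's actual API): convert a position such as
// `src/main.rs:90:2` into a byte offset, then drill down the syntax tree.
fn offset_of(text: &str, line: u32, column: u32) -> u32 {
    let line_start: usize = text
        .lines()
        .take(line as usize)
        .map(|it| it.len() + 1) // +1 for the trailing newline
        .sum();
    (line_start + column as usize) as u32
}

fn node_at_position(file: SyntaxNode, text: &str, line: u32, column: u32) -> SyntaxNode {
    node_at_offset(file, offset_of(text, line, column))
}
----
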
But the syntax tree by itself doesn't contain enough information to drive IDE features.
Semantic analysis is required.

But the problem with semantic analysis is that it usually involves several layers of intermediate representations, which are only indirectly related to the syntax tree.
While the syntax tree is relatively uniform, and it is possible to implement a generic traversal like the one above,
semantic information is usually stored in a menagerie of ad-hoc data structures: trees, graphs, and plain old hash tables.

Traditional compilers attach source span information to semantic elements, which could look like this:

[source,rust]
----
struct Span {
    file: PathBuf,
    line: u32,
    column: u32,
}

struct LocalVariable {
    name: InternedString,
    mutability: Mutability,
    ty: Type,
    span: Span,
}
----

With line information in place, it _is_ possible for a language server to find an appropriate semantic element for a given cursor position:
just iterate over all the semantic elements there are, and find the one with the smallest span which still contains the cursor.

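In code, the naive lookup might look like the sketch below (the names are hypothetical: `all_semantic_elements` is assumed to enumerate every element in the compilation unit, and spans are assumed to know their extent):

[source,rust]
----
// A sketch of the span-based lookup, not any particular compiler's API.
// `all_semantic_elements`, `Span::contains`, and `Span::len` are assumed here.
fn semantics_at_position(file: &Path, line: u32, column: u32) -> Option<SemanticElement> {
    all_semantic_elements()
        .filter(|it| it.span().file.as_path() == file && it.span().contains(line, column))
        // The innermost element is the one with the smallest covering span.
        .min_by_key(|it| it.span().len())
}
----
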
This approach works, but has two drawbacks.

The _first_ drawback is that it's too slow.
To iterate over all semantic elements, an entire compilation unit must be analyzed, and that's too slow, even if done incrementally.
The core trick of a performant language server is that it avoids any analysis unless _absolutely_ necessary.
The server knows everything about the function currently on the screen, and knows almost nothing about other functions.

The _second_ drawback is more philosophical --- using text spans _erases_ information about the underlying syntax trees.
A `LocalVariable` didn't originate from a particular `span` of text, it was created using a specific node in the concrete syntax tree.
For features like "go to definition", which need to go from syntax to semantics, the approximation turns out to be good enough.
But for refactors, it is often convenient to go in the opposite direction --- from semantics to syntax.
To change a tuple enum to a record enum, a language server needs to find all usages of the enum in the semantic model, but then it needs to modify the syntax tree.
And going from a `Span` back to the `SyntaxNode` is not straightforward: different syntax nodes might have the same span!

For example, a `foo` is:

* a name token
* a reference
* a trivial path (a single-segment path, as opposed to `foo::bar`)
* and a path expression

[source]
----
PATH_EXPR@20..23
  PATH@20..23
    PATH_SEGMENT@20..23
      NAME_REF@20..23
        IDENT@20..23 "foo"
----

== Iterative Recursive Analysis

So, how can a language server map syntax nodes to corresponding semantic elements, so that the mapping is precise and can be computed lazily?

First, every semantic element gets a `source_syntax` method that returns the original syntax node:

[source,rust]
----
impl LocalVariable {
    pub fn source_syntax(&self) -> SyntaxNode
}
----

The method is implemented differently for different types.
Sometimes, storing a reference to a syntax node is appropriate:

[source,rust]
----
struct LocalVariable {
    source_syntax: SyntaxNodeId,
}

impl LocalVariable {
    pub fn source_syntax(&self) -> SyntaxNode {
        node_id_to_node(self.source_syntax)
    }
}
----

Alternatively, the syntax might be computed on demand.
For example, for local variables we might store a reference to the parent function, and the ordinal number of this local variable:

[source,rust]
----
struct LocalVariable {
    parent: Function,
    ordinal: usize,
}

impl LocalVariable {
    pub fn source_syntax(&self) -> SyntaxNode {
        let parent_function_syntax = self.parent.source_syntax();
        parent_function_syntax
            .descendants()
            .filter(|it| it.kind == SyntaxNodeKind::LocalVariable)
            .nth(self.ordinal)
            .unwrap()
    }
}
----

Yet another pattern is to get this information from a side table:

[source,rust]
----
type SyntaxMapping = HashMap<LocalVariable, SyntaxNode>;
----

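As an illustration (a sketch, not rust-analyzer's actual code), such a table could be filled in once while lowering syntax into the semantic model, turning the lookup into a plain hash-map access:

[source,rust]
----
use std::collections::HashMap;

// A sketch of the side-table pattern; `lower_local_variable` and
// `lookup_source_syntax` are hypothetical names.
// Assumes `LocalVariable` implements `Hash` and `Eq`, and `SyntaxNode` is `Clone`.
fn lower_local_variable(mapping: &mut SyntaxMapping, var: LocalVariable, syntax: SyntaxNode) {
    mapping.insert(var, syntax);
}

fn lookup_source_syntax(mapping: &SyntaxMapping, var: &LocalVariable) -> SyntaxNode {
    mapping[var].clone()
}
----
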
In rust-analyzer, all three approaches are used in various places.

This solves the problem of going from a semantic element to syntax, but what we started with is the opposite: from an offset like `main.rs:80:20` we go to a `SyntaxNode`, and then we need to discover the semantic element.
The trick is to use the same solution in _both_ directions.

To find a semantic element for a given piece of syntax:

1. Look at the _parent_ syntax node.
2. If there is no parent, then the current syntax node corresponds to an entire file, and the appropriate semantic element is the module.
3. Otherwise, _recursively_ look up semantics for the parent.
4. Among all the parent's children (our siblings), find the one whose source syntax is the node we started with.

Or, in pseudocode:

[source,rust]
----
fn semantics_for_syntax(node: SyntaxNode) -> SemanticElement {
    match node.parent() {
        None => module_for_file(node.source_file),
        Some(parent) => {

            // Recursive call
            let parent_semantics = semantics_for_syntax(parent);

            for sibling in parent_semantics.children() {
                if sibling.source_syntax() == node {
                    return sibling;
                }
            }

            // Syntactic nesting mirrors semantic nesting, so one of the
            // siblings must match (see the Limitations section below).
            unreachable!()
        }
    }
}
----

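Putting the pieces together, the offset-based entry point from the beginning of the post becomes a two-step lookup.
This is again a sketch; it glosses over the fact that the deepest node under the cursor (say, an identifier token) may need to be widened to the nearest ancestor that actually has a semantic counterpart:

[source,rust]
----
// A sketch combining the two traversals: drill down the syntax tree by offset,
// then map the resulting node into the semantic model.
fn semantics_at_offset(file_syntax: SyntaxNode, offset: u32) -> SemanticElement {
    let node = node_at_offset(file_syntax, offset);
    semantics_for_syntax(node)
}
----
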
In this formulation, a language server needs to do just enough analysis to drill down to a specific node.

Consider this example:

[source,rust]
----
struct RangeIter {
    lo: u32,
    hi: u32,
}

impl Iterator for RangeIter {
    type Item = u32;

    fn next(&mut self) -> Item {
                       // ^ Cursor here

    }
}

impl RangeIter {
    ...
}
----

Starting from the `Item` syntax node, the language server will consider:

- the return type of the function `next`,
- the function itself,
- the `impl Iterator` block,
- the entire file.

Just enough semantic analysis will be executed to learn that a file has a struct declaration and two impl blocks, but the _contents_ of the struct and the second impl block won't be inspected at all.
That is a huge win --- typically, source files are much wider than they are deep.

This recursion-and-loop structure is present in many language servers.
For rust-analyzer, see the https://github.com/rust-lang/rust-analyzer/blob/34cffbf1d75fb6b5cb6bc68a9854b20dc74f135d/crates/hir/src/semantics/source_to_def.rs#L3-L4[`source_to_def`] module,
with many functions that convert syntax (`ast::` types) to semantics (unqualified types).

[source,rust]
----
fn type_alias_to_def(
    &mut self,
    src: InFile<ast::TypeAlias>,
) -> Option<TypeAliasId> {
----

For Roslyn, one entry point to the machinery is the https://github.com/dotnet/roslyn/blob/36a0c338d6621cc5fe34b79d414074a95a6a489c/src/Compilers/CSharp/Portable/Compilation/SyntaxTreeSemanticModel.cs#L1403-L1429[`GetDeclaredType`] function.
`BaseTypeDeclarationSyntax` is, well, syntax, while the return type `NamedTypeSymbol` is the semantic info.
First, Roslyn looks up semantic info for the syntactic parent, using https://github.com/dotnet/roslyn/blob/36a0c338d6621cc5fe34b79d414074a95a6a489c/src/Compilers/CSharp/Portable/Compilation/SyntaxTreeSemanticModel.cs#L1423[`GetDeclaredTypeMemberContainer`].
Then, in https://github.com/dotnet/roslyn/blob/36a0c338d6621cc5fe34b79d414074a95a6a489c/src/Compilers/CSharp/Portable/Compilation/SyntaxTreeSemanticModel.cs#L1783[`GetDeclaredMember`], it iterates semantic siblings and finds the one with the matching text range.

For Kotlin, the entry point is https://github.com/JetBrains/kotlin/blob/a288b8b00e4754a1872b164999c6d3f3b8c8994a/idea/idea-frontend-fir/idea-fir-low-level-api/src/org/jetbrains/kotlin/idea/fir/low/level/api/FirModuleResolveStateImpl.kt#L93-L125[`findSourceFirDeclarationByExpression`].
This function starts with a syntax node (`KtExpression` is syntax, like all `Kt` nodes), and returns a declaration.
It uses `getNonLocalContainingOrThisDeclaration` to get the syntactic container for the current node.
Then, `findSourceNonLocalFirDeclaration` gets the `Fir` for this parent.
Finally, the `findElementIn` function traverses the `Fir` children to find the one with the same source we originally started with.

== Limitations
There are two properties of the underlying languages which make this approach work:

1. Syntactic nesting must match semantic nesting.
Looking among the parent's children makes sense only if the current element is actually expected to be one of them.
2. Getting the semantic element for an entire file is trivial.

The second one is actually less true in Rust than it is in Kotlin or C#!
In those languages, each file starts with a package declaration, which immediately mounts the file at the appropriate place in the semantic model.

For Rust, a file `foo.rs` only exists semantically if some parent file includes it via a `mod foo;` declaration!
And, in general, it's impossible to locate the parent file automatically.
_Usually_, for `src/bar/foo.rs` the parent would be `src/bar.rs`, but, due to `#[path]` attributes which override this default (see the sketch below), this might not be true.
So rust-analyzer has to be less lazy than ideal here --- on every change, it reconstructs the entire module tree for the crate, looking at every file, even if only a single file is currently visible.

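For instance, the following is perfectly legal Rust (the path is illustrative), and nothing inside `foo.rs` itself reveals which file its parent is:

[source,rust]
----
// In some parent file, e.g. src/bar.rs (illustrative): the `#[path]` attribute
// detaches the child module's location on disk from the default layout,
// so the parent cannot be recovered from file paths alone.
#[path = "../unrelated/place/foo.rs"]
mod foo;
----
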
Here's another interesting example:

[source,rust]
----
mod ast {
    generate_ast_from_grammar!("FooLang.grm");
}
----

Here, we have a hypothetical procedural macro, which reads a grammar definition from an external file and presumably generates a bunch of Rust types for the AST described by the grammar.
One could dream of an IDE where, without knowing anything specific about `.grm` files, it can still find usages of AST nodes defined therein, using the span information from the procedural macro.
This works in theory: when the macro creates Rust token trees, it can manufacture spans that point inside `FooLang.grm`, which connects the Rust source with the grammar.

Where this breaks down is laziness.
When a user invokes "find usages" inside `FooLang.grm`, the IDE has no way of knowing, up-front, that the `generate_ast_from_grammar!("FooLang.grm")` macro call needs to be expanded.
The only way this could work is if the IDE conservatively expands all macros all the time.