= The Heart of a Language Server
:sectanchors:
:page-layout: post

In this post, I want to expand on one curious comment from the rust-analyzer code base.
You can find the comment https://github.com/rust-lang/rust-analyzer/blob/34cffbf1d75fb6b5cb6bc68a9854b20dc74f135d/crates/hir/src/semantics/source_to_def.rs#L3-L4[here].

It describes a curious recursive algorithm that is repeated across different language-server-shaped things:
I've seen it implemented for Kotlin and C#, and implemented it myself for Rust.

Here's a seemingly random grab bag of IDE features:

- Go to definition
- Code completion
- Run test at the cursor
- Extract variable

What's common among them all?
All these features are relative to the _current position_ of the cursor!
The input is not only the state of the code at a given point in time, but also a specific location in the source of a project, like `src/main.rs:90:2`.

And the first thing a language server needs to do for any of the above features is to understand what is located at the given offset, semantically speaking.
Is it an operator, like `+`?
Is it a name, like `foo`?
If it is a name, in what context is the name used --- does it _define_ an entity named `foo`, or does it _refer_ to a pre-existing entity?
If it is a reference, then _what_ entity is referenced?
What type is it?

The first step here is determining the node in the syntax tree that covers the offset.
This is relatively straightforward:

[source,rust]
----
fn node_at_offset(node: SyntaxNode, offset: u32) -> SyntaxNode {
    assert!(node.text_range().contains(offset));
    // Descend into the child that covers the offset; if there is none,
    // the current node is the smallest one containing it.
    node.children()
        .find(|it| it.text_range().contains(offset))
        .map(|it| node_at_offset(it, offset))
        .unwrap_or(node)
}
----

But the syntax tree by itself doesn't contain enough information to drive IDE features.
Semantic analysis is required.

The problem with semantic analysis is that it usually involves several layers of intermediate representations, which are only indirectly related to the syntax tree.
While the syntax tree is relatively uniform, making a generic traversal like the one above possible,
semantic information is usually stored in a menagerie of ad-hoc data structures: trees, graphs, and plain old hash tables.

Traditional compilers attach source span information to semantic elements, which could look like this:

[source,rust]
----
struct Span {
    file: PathBuf,
    line: u32,
    column: u32,
}

struct LocalVariable {
    name: InternedString,
    mutability: Mutability,
    ty: Type,
    span: Span,
}
----

With span information in place, it _is_ possible for a language server to find the appropriate semantic element for a given cursor position:
just iterate over all the semantic elements there are, and find the one with the smallest span that still contains the cursor.

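To make that lookup concrete, here is a minimal sketch. All names here are invented for illustration, and the `TextSpan` carries a byte range rather than the line/column pair above:

[source,rust]
----
// Illustrative sketch only; none of these types come from a real language server.
struct TextSpan { file: std::path::PathBuf, start: u32, end: u32 }

trait SemanticElement {
    fn span(&self) -> &TextSpan;
}

/// Of all elements whose span covers the cursor, pick the narrowest one.
fn element_at_cursor<'a>(
    all_elements: &[&'a dyn SemanticElement],
    file: &std::path::Path,
    offset: u32,
) -> Option<&'a dyn SemanticElement> {
    all_elements
        .iter()
        .copied()
        .filter(|it| {
            let span = it.span();
            span.file.as_path() == file && (span.start..span.end).contains(&offset)
        })
        .min_by_key(|it| it.span().end - it.span().start)
}
----
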
This approach works, but it has two drawbacks.

The _first_ drawback is performance.
To iterate over all semantic elements, an entire compilation unit must be analyzed, and that is too slow, even if done incrementally.
The core trick of a performant language server is that it avoids any analysis unless _absolutely_ necessary.
The server knows everything about the function currently on the screen, and knows almost nothing about other functions.

The _second_ drawback is more philosophical --- using text spans _erases_ information about the underlying syntax trees.
A `LocalVariable` didn't originate from a particular `span` of text; it was created from a specific node in the concrete syntax tree.
For features like "go to definition", which need to go from syntax to semantics, the approximation turns out to be good enough.
But for refactors, it is often convenient to go in the opposite direction --- from semantics to syntax.
To change a tuple enum to a record enum, a language server needs to find all usages of the enum in the semantic model, but then it needs to modify the syntax tree.
And going from a `Span` back to the `SyntaxNode` is not straightforward: different syntax nodes might have the same span!

For example, a lone `foo` is simultaneously:

* a name token
* a reference
* a trivial path (a path with just one segment, unlike `foo::bar`)
* a path expression

[source]
----
PATH_EXPR@20..23
  PATH@20..23
    PATH_SEGMENT@20..23
      NAME_REF@20..23
        IDENT@20..23 "foo"
----

== Iterative Recursive Analysis

So, how can a language server map syntax nodes to corresponding semantic elements, so that the mapping is precise and can be computed lazily?

First, every semantic element gets a `source_syntax` method that returns the original syntax node:

[source,rust]
----
impl LocalVariable {
    pub fn source_syntax(&self) -> SyntaxNode;
}
----

The method is implemented differently for different types.
Sometimes, storing an ID of the syntax node directly is appropriate:

[source,rust]
----
struct LocalVariable {
    source_syntax: SyntaxNodeId,
}

impl LocalVariable {
    pub fn source_syntax(&self) -> SyntaxNode {
        node_id_to_node(self.source_syntax)
    }
}
----

Alternatively, the syntax might be computed on demand.
For example, for local variables we might store a reference to the parent function and the ordinal number of the variable within it:

[source,rust]
----
struct LocalVariable {
    parent: Function,
    ordinal: usize,
}

impl LocalVariable {
    pub fn source_syntax(&self) -> SyntaxNode {
        let parent_function_syntax = self.parent.source_syntax();
        // The n-th local variable node in the function corresponds to `self`.
        parent_function_syntax
            .descendants()
            .filter(|it| it.kind == SyntaxNodeKind::LocalVariable)
            .nth(self.ordinal)
            .unwrap()
    }
}
----

Yet another pattern is to get this information from a side table:

[source,rust]
----
type SyntaxMapping = HashMap<LocalVariable, SyntaxNode>;
----

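For completeness, here is a hedged sketch of the lookup side of that scheme. It assumes `LocalVariable` implements `Hash` and `Eq`, `SyntaxNode` is cheap to clone, and the table is filled in while the semantic model is built; the signature is invented for illustration:

[source,rust]
----
impl LocalVariable {
    // The side table has to be passed in (or be reachable through some analysis
    // context), since the element itself no longer knows its own syntax.
    pub fn source_syntax(&self, mapping: &SyntaxMapping) -> SyntaxNode {
        mapping[self].clone()
    }
}
----
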
In rust-analyzer, all three approaches are used in various places.

This solves the problem of going from a semantic element to syntax, but what we started with is the opposite: from an offset like `main.rs:80:20` we get to a `SyntaxNode`, and then we need to discover the corresponding semantic element.
The trick is to use the same solution in _both_ directions:

To find a semantic element for a given piece of syntax:

1. Look at the _parent_ syntax node.
2. If there is no parent, then the current syntax node corresponds to an entire file, and the appropriate semantic element is the module.
3. Otherwise, _recursively_ look up semantics for the parent.
4. Among all of the parent's children (our siblings), find the one whose source syntax is the node we started with.

Or, in pseudocode:

[source,rust]
----
fn semantics_for_syntax(node: SyntaxNode) -> SemanticElement {
    match node.parent() {
        // No parent: the node is an entire file, and the corresponding
        // semantic element is the module.
        None => module_for_file(node.source_file),
        Some(parent) => {
            // Recursive call
            let parent_semantics = semantics_for_syntax(parent);

            // Among the parent's semantic children, find the one whose
            // source syntax is the node we started with.
            for sibling in parent_semantics.children() {
                if sibling.source_syntax() == node {
                    return sibling;
                }
            }
            unreachable!("every syntax node is assumed to have a semantic counterpart")
        }
    }
}
----

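Putting the two halves together, a position-based request first drills down through the syntax tree and only then touches the semantic model. A sketch, reusing `node_at_offset` and `semantics_for_syntax` from above (the wrapper name is invented, and in practice one may have to walk up from the returned node to the nearest ancestor that actually has a semantic counterpart):

[source,rust]
----
fn semantic_element_at(file_syntax: SyntaxNode, offset: u32) -> SemanticElement {
    // Purely syntactic step: cheap, no semantic analysis involved.
    let node = node_at_offset(file_syntax, offset);
    // Semantic step: triggers just enough analysis to resolve this one node.
    semantics_for_syntax(node)
}
----
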
In this formulation, a language server needs to do just enough analysis to drill down to a specific node.

Consider this example:

[source,rust]
----
struct RangeIter {
    lo: u32,
    hi: u32,
}

impl Iterator for RangeIter {
    type Item = u32;

    fn next(&mut self) -> Option<Item> {
        //                       ^ Cursor here

    }
}

impl RangeIter {
    ...
}
----

Starting from the `Item` syntax node, the language server will consider:

- the return type of the function `next`,
- the function itself,
- the `impl Iterator` block,
- the entire file.

Just enough semantic analysis will be executed to learn that the file has a struct declaration and two impl blocks, but the _contents_ of the struct and of the second impl block won't be inspected at all.
That is a huge win --- typically, source files are much wider than they are deep.

This recursion-and-loop structure is present in many language servers.
For rust-analyzer, see the https://github.com/rust-lang/rust-analyzer/blob/34cffbf1d75fb6b5cb6bc68a9854b20dc74f135d/crates/hir/src/semantics/source_to_def.rs#L3-L4[`source_to_def`] module,
with its many functions that convert syntax (`ast::` types) to semantics (unqualified types):

[source,rust]
----
fn type_alias_to_def(
    &mut self,
    src: InFile<ast::TypeAlias>,
) -> Option<TypeAliasId> {
----

For Roslyn, one entry point into this machinery is the https://github.com/dotnet/roslyn/blob/36a0c338d6621cc5fe34b79d414074a95a6a489c/src/Compilers/CSharp/Portable/Compilation/SyntaxTreeSemanticModel.cs#L1403-L1429[`GetDeclaredType`] function.
`BaseTypeDeclarationSyntax` is, well, syntax, while the return type `NamedTypeSymbol` is the semantic info.
First, Roslyn looks up semantic info for the syntactic parent, using https://github.com/dotnet/roslyn/blob/36a0c338d6621cc5fe34b79d414074a95a6a489c/src/Compilers/CSharp/Portable/Compilation/SyntaxTreeSemanticModel.cs#L1423[`GetDeclaredTypeMemberContainer`].
Then, in https://github.com/dotnet/roslyn/blob/36a0c338d6621cc5fe34b79d414074a95a6a489c/src/Compilers/CSharp/Portable/Compilation/SyntaxTreeSemanticModel.cs#L1783[`GetDeclaredMember`], it iterates the semantic siblings and finds the one with the matching text range.

For Kotlin, the entry point is https://github.com/JetBrains/kotlin/blob/a288b8b00e4754a1872b164999c6d3f3b8c8994a/idea/idea-frontend-fir/idea-fir-low-level-api/src/org/jetbrains/kotlin/idea/fir/low/level/api/FirModuleResolveStateImpl.kt#L93-L125[`findSourceFirDeclarationByExpression`].
This function starts with a syntax node (`KtExpression` is syntax, like all `Kt` nodes) and returns a declaration.
It uses `getNonLocalContainingOrThisDeclaration` to get the syntactic container of the current node.
Then, `findSourceNonLocalFirDeclaration` gets the `Fir` element for this parent.
Finally, the `findElementIn` function traverses the `Fir` children to find the one with the same source we originally started with.

== Limitations

There are two properties of the underlying languages which make this approach work:

1. Syntactic nesting must match semantic nesting.
  Looking at the parent's children makes sense only if the current element is expected to be among those children.
2. Getting the semantic element for an entire file is trivial.

The second one is actually less true in Rust than it is in Kotlin or C#!
In those languages, each file starts with a package declaration, which immediately mounts the file at the appropriate place in the semantic model.

For Rust, a file `foo.rs` only exists semantically if some parent file includes it via a `mod foo;` declaration!
And, in general, it's impossible to locate the parent file automatically.
_Usually_, for `src/bar/foo.rs` the parent would be `src/bar.rs`, but, due to `#[path]` attributes which can override this default, that is not guaranteed.
So rust-analyzer has to be less lazy than ideal here --- on every change, it reconstructs the entire module tree for a crate, looking at every file, even if only a single file is currently visible.

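For example, nothing ties the file below into the module tree until some other file declares it, and, with `#[path]`, the declaring side can put it anywhere (a hypothetical layout):

[source,rust]
----
// src/bar.rs --- only this declaration makes the child module exist semantically,
// and `#[path]` lets the file live somewhere other than the default `src/bar/foo.rs`.
#[path = "unusual/location.rs"]
mod foo;
----
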
Here's another interesting example:

[source,rust]
----
mod ast {
    generate_ast_from_grammar!("FooLang.grm");
}
----

Here, we have a hypothetical procedural macro which reads a grammar definition from an external file and presumably generates a bunch of Rust types for the AST described by the grammar.
One could dream of an IDE which, without knowing anything specific about `.grm` files, can still find usages of the AST nodes defined therein, using the span information from the procedural macro.
This works in theory: when the macro creates Rust token trees, it can manufacture spans that point inside `FooLang.grm`, connecting the Rust source with the grammar.

Where this breaks down is laziness.
When a user invokes "find usages" inside `FooLang.grm`, the IDE has no way of knowing, up front, that the `generate_ast_from_grammar!("FooLang.grm")` macro call needs to be expanded.
The only way this could work is if the IDE conservatively expands all macros all the time.