## Description
I've observed that `rawDependencyInformation`, which is used to compute the module dependency graph, has poor performance and uses a lot of memory.
The reasons are twofold:
- Algorithmically:
  - The `GetLocatedImports` rule parses the `import` section of each file and, for each imported module, tries to find the associated file. This means that the same lookup is done multiple times (e.g. if `import Data.Functor` appears in 100 of the modules of your project, it will be looked up 100 times).
  - The module lookup is linear in the number of include paths. If you have 1000 include paths, it will try up to 1000 locations.

  Fixing this could improve the performance by a few orders of magnitude.
- Data structures:
  - All the filepath handling is done with `FilePath`, which is `String`. We store a lot of filepaths (at least as many as there are files in the project), and `String` is notorious for its poor performance, large memory footprint and heavy GC pressure when tracing / copying.
  - The dependency logic uses a lot of `IntSet`s: for each module, we store the `IntSet` of its dependencies and reverse dependencies. Similarly to `FilePath`, `IntSet` is not a memory- and GC-friendly structure.
Details later in this document.
Fixing this would only have a limited benefit; however, it can dramatically reduce the memory usage of HLS (and the GC pressure, hence improve performance). See also this recent note from @soulomoon, who is worried about the memory footprint of the dependency graph: #4594 (comment)
## The debugging session
I have been interested for months in #4443: HLS on our codebase is slow, takes minutes to start, and we observe moments where HLS is stuck for a few minutes after a change in the codebase before returning updated diagnostics. This is especially painful during large refactoring sessions. E.g. recently I updated our codebase to use `UnliftIO.Exception` instead of `Control.Exception.Safe`. I started with a very large search and replace, but some fine-tuning had to be done in each module. I started to get mad: HLS shows me an error (usually a missing module import) with an associated code action, I activate the code action, HLS recomputes for 2 or 3 minutes, and then I switch to the next error. I'm so happy to be paid to watch HLS processing files. In #4443 I was able to propose a small reproducer, and recently @soulomoon wrote #4594, which fixed the rebuild issues with my reproducer.
But HLS on my codebase still takes minutes to start and, even if it rebuilds less, it does rebuild (which I now think is expected); unfortunately, a rebuild usually takes minutes, so I started to investigate.
A note: it is usually valuable to have correct assumptions about the expected performance. I had assumed it was "normal" for HLS to be slower than ghci because it does way more work. In the Matrix chat, @fendor (whom I thank for all the questions he answered) told me that HLS should be as fast as, or even faster than, ghci. This completely changed my mental model of HLS, and I decided to investigate why it was taking so much time to start: it is easier to debug startup time than whatever happens during a rebuild. I also hoped that by working on the startup time, I would gather enough knowledge of the HLS architecture to be able to debug more things.
So yesterday I started debugging HLS in order to understand why it was starting so slowly.
## The debugging context
I'm working on our proprietary codebase at https://github.com/novainsilico. For the record, we use Haskell to simulate biological interactions in the human body, pathologies and treatments.
- We have a codebase of 1116 Haskell modules.
- The "bios" for HLS is manually written, because we have a custom build system based on nix to build everything.
- From the point of view of HLS, we have only one "unit" containing all the modules. However, our build system works with packages, 254 of them (we migrated from cabal to bazel and finally to nix as the build system). It also means that we have 254 include paths; this will be important later ;)
- The HLS bios is generated by nix AND our ghci session uses exactly the same settings, so comparing HLS with ghci is fair.
- We use GHC 9.12.2 with haskell-language-server built from the latest commits in this repo, on top of which I had rebased:
  - compute deps fingerprint only #4594: avoids too much rebuilding when we edit the module "header" (`import`s, language pragmas, ...)
  - Avoid unnecessary recompilation due to -haddock #4596: avoids unnecessary rebuilds due to haddock flag changes
## Initial observation
Starting from a clean `.cache/ghcide`, I started my editor on a random file in the project which sits roughly in the middle of the dependency graph. I'll discuss the timings (approximate, measured with a hand-held stopwatch) and what appears in my editor "status" regarding what HLS is doing.

- For 15s, HLS shows `Processing 1/2 Setting up jinko`: this is the bios evaluation time and is expected. It calls nix, which walks our complete project to build a dependency graph of all the Haskell files by manually parsing the `import`s. This is not required for our bios logic, but it was simpler. Now that I've realized this takes 15s of startup time, I will definitely work on improving this process. Note however that in 15s I can build, in nix, without parallelism, a complete module dependency graph. This is also important for later ;)
- Then HLS stays stuck at "Processing 58/1167 4%" for 3m21s. No files are created in `.cache/ghcide`, so I guessed that this was some sort of HLS preprocessing happening before the real typechecking work starts.
- Then, once it unblocks, `Processing` starts to increase and `Indexing` appears. Files start to be created inside `.cache/ghcide`. 26s later, the diagnostics appear in my file and `Processing` disappears: typechecking is done. `Indexing` continues for 1m26s.
The complete process took 5m14s, and I was able to access the first diagnostic after 3m48s.
Repeating the same process without cleaning `.cache/ghcide` led to the following differences:

- still 15s of `Setting up`: this is understood and not related to HLS
- 3m03s of `Processing 4%`: a bit less than in the first run, but I'm benchmarking on my laptop, so the difference is irrelevant
- 6s of `Processing` before the diagnostics appear, and no `Indexing`
So it seems that typechecking benefits from the cache, and indexing too. However, the "preprocessing" step does not seem to be cached in any way.
## The preprocessing phase
I changed my debugging strategy. Instead of running HLS from my editor, I ran it directly from the CLI on the same file as the one opened in the editor, with `haskell-language-server path/to/the/Haskell/Module.hs`. This approach has the benefit of allowing benchmarking (perf, eventlog, ...), and the process terminates when done.

I started with eventlog profiling, which does not require any rebuild of the hls binary: `haskell-language-server path/to/the/Haskell/Module.hs +RTS -hT -l`, then I ran `eventlog2html` on the generated `.eventlog` file.
We can see that:

- there are a lot of `IntSet`s in use
- there are a lot of list cons cells (`:`)
- there are a few thunks
There is nothing alarming in this memory profile, except that there are a lot of small objects, amounting to more than 1 GiB of RAM during the preprocessing. I noted that for future improvement: reducing this could improve performance (e.g. less GC pressure) and anyway, reducing memory usage is always great.
## Build with profiling enabled
I then built hls with `cabal build exe:haskell-language-server --enable-profiling -f-dynamic`. This was difficult to figure out, and I had a lot of surprising errors about missing `p_dyn` libraries from dependencies: that's because profiling needs to be built statically, but HLS is built with `dynamic` by default. I note that here so that a search may help a future dev interested in debugging. The errors looked like this:
```
exe/Main.hs:33:1: error: [GHC-88719]
    Could not load module ‘Language.LSP.Protocol.Message’.
    Perhaps you haven't installed the "p_dyn" libraries for package ‘lsp-types-2.3.0.1’?
    Use -v to see a list of the files searched for.
   |
33 | import Language.LSP.Protocol.Message as LSP
   | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

exe/Main.hs:34:1: error: [GHC-88719]
    Could not load module ‘Language.LSP.Server’.
    Perhaps you haven't installed the "p_dyn" libraries for package ‘lsp-2.7.0.1’?
    Use -v to see a list of the files searched for.
   |
34 | import Language.LSP.Server as LSP
   | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

exe/Main.hs:35:1: error: [GHC-88719]
    Could not load module ‘Prettyprinter’.
    Perhaps you haven't installed the "p_dyn" libraries for package ‘prettyprinter-1.7.1’?
    Use -v to see a list of the files searched for.
   |
35 | import Prettyprinter (Pretty (pretty), vcat, vsep)
   | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
```
I then ran haskell-language-server with `+RTS -p` (and another run with `-pj` to get the JSON output). The result in the `.prof` file is really interesting:
```
++                          GHC.Internal.Base            libraries/ghc-internal/src/GHC/Internal/Base.hs:1985:1-4          15.0  29.6
$wbreak                     GHC.Internal.List            libraries/ghc-internal/src/GHC/Internal/List.hs:1329:1-5          14.0  19.6
onBranchElement             StmHamt.Focuses              library/StmHamt/Focuses.hs:(14,1)-(36,23)                          6.9   4.0
joinDrive                   System.FilePath.Posix        libraries/filepath/System/FilePath/Internal.hs:553:1-9             4.4   1.7
newKey                      Development.IDE.Graph.Internal.Key src/Development/IDE/Graph/Internal/Key.hs:(75,1)-(83,108)    4.4   0.4
go3                         System.FilePath.Posix        <no location info>                                                 3.6   5.3
splitDirectories_f          System.FilePath.Posix        libraries/filepath/System/FilePath/Internal.hs:877:9               3.1   4.2
$wspan                      GHC.Internal.List            libraries/ghc-internal/src/GHC/Internal/List.hs:1282:1-4           2.5   2.7
escapeURIChar               Network.URI                  Network/URI.hs:(1000,1)-(1011,63)                                  2.3   1.8
bracket1                    GHC.Internal.IO              <no location info>                                                 2.3   7.3
$wouter                     Data.Text.Internal           <no location info>                                                 2.2   0.3
map                         GHC.Internal.Base            libraries/ghc-internal/src/GHC/Internal/Base.hs:1892:1-3           2.2   2.8
escapeURIString             Network.URI                  Network/URI.hs:1043:1-51                                           2.1   2.6
normalise                   System.FilePath.Posix        libraries/filepath/System/FilePath/Internal.hs:1028:1-9            1.8   0.0
firstJustM                  Control.Monad.Extra          src/Control/Monad/Extra.hs:(293,1)-(294,66)                        1.3   0.3
dropTrailingPathSeparator   System.FilePath.Posix        libraries/filepath/System/FilePath/Internal.hs:770:1-25            1.3   0.0
uses_                       Development.IDE.Core.Shake   src/Development/IDE/Core/Shake.hs:(1130,1)-(1134,27)               1.3   0.7
findWithIndex               PrimitiveExtras.SmallArray   library/PrimitiveExtras/SmallArray.hs:(160,3)-(169,15)             1.0   0.1
unsafeDupableInterleaveIO1  GHC.Internal.IO.Unsafe       <no location info>                                                 1.0   1.5
unpackAppendCString#        GHC.CString                  libraries/ghc-prim/GHC/CString.hs:192:1-20                         0.7   1.5
```
- `(++)` (list concatenation) is responsible for 30% of the allocations and 15% of the total runtime.
- A lot of the relevant cost centres are related to lists, chars and filepaths (which are actually lists of chars).
The process took 370s, close to the 5m I observed without profiling, so we can assume that we are not just observing profiling overhead.
At that point, I started to get worried: why the hell does HLS spend 3 minutes doing file handling?
When inspecting the prof file with `profiteur`, I had a few surprises:
- 95% of the time is spent in `getModuleGraphRule`. This is completely coherent with the idea of HLS doing a lot of "preprocessing": this rule is called once to build the complete dependency graph before the typechecking starts.
- 73% of the time is spent in `getLocatedImportsRule`: this rule returns the files (from the project) associated with all the modules imported by a given file of the project. E.g. if module `A.hs` imports `B` and `C`, this rule will look for the on-disk paths of `B.hs` and `C.hs`.
- `locateModule` is called 17390 times.
- `toNormalizedFilePath'` is called 6,830,022 times, for a total of 60% of the total runtime! `toNormalizedFilePath'` is indeed a "slow" function because it works on `String`, but calling it 6 million times is too much anyway ;)
- `getFileExists` is called 6,830,022 times too. This is an `IO` operation, kinda expensive (it takes 3% of the total CPU time, but it may be way more in wall-clock time because this thing is doing IO...)
## Discussion
It was clear that there was an issue with `getFileExists` being called too many times, and the number of calls to `locateModule` was also surprising. I had to check whether `locateModule` had some sort of caching, but anyway, why call it so many times? My codebase contains 1117 module files, and it actually depends on a bit more than two thousand modules:
```
$ pcregrep -h -o2 '^import( qualified)*([^(]*)' **/*.hs | sort | uniq | wc -l
2505
```
So why call it 17K times? And why is this function doing so many `getFileExists` calls?
Digging into its implementation, I found that it loops through all the include paths to look for a module. Recall that I have 254 "packages", hence 254 import paths: 254 * 17390 adds up to close to 5 million `getFileExists` calls.
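To make the shape of the problem concrete, here is a simplified sketch of what this lookup amounts to (hypothetical code, not the actual `locateModule` implementation, which also checks the package DB):

```haskell
import System.Directory (doesFileExist)
import System.FilePath ((<.>), (</>))

-- Simplified sketch: probe every include path in turn, for each import.
-- With I imports and P include paths this performs up to I * P
-- doesFileExist calls: on my codebase, 254 include paths for 17390
-- locateModule calls, i.e. millions of filesystem probes.
findModuleFile :: [FilePath] -> FilePath -> IO (Maybe FilePath)
findModuleFile includePaths moduleRelPath = go includePaths
  where
    go []           = pure Nothing
    go (dir : rest) = do
      let candidate = dir </> moduleRelPath <.> "hs"
      exists <- doesFileExist candidate
      if exists then pure (Just candidate) else go rest
```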
So here we have a first identified problem: the dependency graph build is linear in the number of include paths, and this can really explode. It also explains why few people have reported similar issues: there may not be many large codebases with so many include paths in the wild.
I did a quick test by copying all the modules of my project into a single hierarchy and indeed, this is way, way faster, and actually most of my problems are solved. But HLS should do better, because projects with more than a few include paths do exist, and there is a linear factor to drop here. Note also that ghci does NOT suffer from this problem.
Regarding why `locateModule` is called 17K times: in `rawDependencyInformation`, for each module, we extract the list of `import`s of that module and try to find the associated files. This is done in the `GetLocatedImports` rule, which calls `locateModule` for each import, without any caching. Hence the cost depends on the number of times an import appears across files, instead of on the number of distinct imports.
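To illustrate, here is a minimal memoisation sketch (hypothetical names, deliberately ignoring the real rule machinery): each module name is resolved once, and every later request is served from a shared table, so the cost depends only on the number of distinct imports:

```haskell
import Data.IORef (IORef, atomicModifyIORef', readIORef)
import qualified Data.Map.Strict as Map

-- Hypothetical stand-ins for the real ghcide types.
type ModuleName   = String
type LocateResult = Maybe FilePath

-- Wrap the expensive lookup with a shared memo table.
locateModuleCached
  :: IORef (Map.Map ModuleName LocateResult)  -- shared cache
  -> (ModuleName -> IO LocateResult)          -- the uncached lookup
  -> ModuleName
  -> IO LocateResult
locateModuleCached ref locate name = do
  cached <- Map.lookup name <$> readIORef ref
  case cached of
    Just res -> pure res                      -- hit: no filesystem access
    Nothing  -> do
      res <- locate name                      -- miss: one real lookup
      atomicModifyIORef' ref (\m -> (Map.insert name res m, ()))
      pure res
```

In ghcide itself this would presumably be a rule keyed by module name rather than a bare `IORef`, so that the build system can track it.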
The numbers add up pretty well:

```
$ git grep '^import' **/*.hs | wc -l
16058
```

(I do have some generated files which are stored in `/nix/store`, hence this does not exactly match the 17K, but it confirms the estimate.)
## Solutions
- Regarding `locateModule` being called as many times as there are `import`s in the codebase, we can introduce caching (along the lines of the memoisation sketch above). I'm not yet fully familiar with the internals of the rule build system, but a rule which takes a module name and returns the associated file path would do the caching. I'm however not sure how this cache could be invalidated (e.g. if a file is moved). Is this important to handle, i.e. is HLS expected to be robust to file moves? Actually, if the startup time were improved, restarting the LSP could be a good enough answer. Implementing this will immediately (on my codebase) yield an 8x performance boost: this value seems to be the mean number of modules imported per module. This would actually benefit ALL projects using HLS.
- Regarding the time being linear in the number of import paths, from my point of view it could be fixed by an initial walk of all the include path directories, extracting the module names from the found files and building a map (see the sketch below). The cost would then depend only on the number of files stored in the import path directories. The only risk is an import path containing a huge number of unrelated files. I think the risk of that is really low, and HLS could warn the user that one of its import paths contains millions of files and hurts performance. And anyway, even with millions of files, it would still be more efficient than the current situation (because walking directories once is cheaper than testing for files millions of times). This rule would also have the problem of being invalidated by file moves and such.
- Note that the `locateModule` logic checks both on disk AND in the package DB. There is also room for optimization here, considering that the package DB is already available in memory.
- I just realized that the bios can return either module names or filepaths. When it returns filepaths, HLS has nothing to do to locate them. However, when it returns module names, HLS first converts them to filepaths (I don't know where, I need to investigate) and THEN builds the dependency graph. So the logic is maybe duplicated, and maybe there is room for improvement here too. TODO: investigate that too.
Note that both problems could be solved at the same time by building this `Map ModuleName FilePath` once at startup.
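A minimal sketch of what I have in mind (hypothetical code, assuming sources follow the usual `Data/Functor.hs` layout under each include path): walk each include path once at startup and index every Haskell source by the module name derived from its path. Lookups then become map accesses instead of hundreds of filesystem probes per import:

```haskell
import Data.List (intercalate)
import qualified Data.Map.Strict as Map
import System.Directory (doesDirectoryExist, listDirectory)
import System.FilePath (splitExtension, (</>))

-- Walk every include path once. Map.unions is left-biased, so the first
-- include path wins on conflicts, mirroring the current linear search order.
buildModuleMap :: [FilePath] -> IO (Map.Map String FilePath)
buildModuleMap includePaths =
  Map.unions <$> mapM (\root -> walk root []) includePaths
  where
    walk dir prefix = do
      entries <- listDirectory dir
      Map.unions <$> mapM (visit dir prefix) entries
    visit dir prefix entry = do
      let path = dir </> entry
      isDir <- doesDirectoryExist path
      if isDir
        then walk path (prefix ++ [entry])
        else case splitExtension entry of
          -- "Data/Functor.hs" under a root becomes "Data.Functor"
          (name, ".hs") ->
            pure (Map.singleton (intercalate "." (prefix ++ [name])) path)
          _ -> pure Map.empty
```

The cost is one directory traversal per include path, independent of how many times each module is imported.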
Regarding the invalidation on file moves, most HLS clients can nowadays send events when files are moved; these could force the invalidation of the module-to-file cache.
Both of these changes have the potential to remove the graph build cost almost completely. In my project, it would turn 4 minutes of initialisation into 5s.
## Memory usage of `FilePath` and `IntSet`
Regarding `FilePath`, it is obvious that `[Char]` is less storage-efficient than `Text` or even `ByteString`, which are packed representations. It is also way more work for the GC to trace / copy linked lists. One fix could be to store paths as `Text` in the different DBs and convert them to `FilePath` on demand, each time this is required. The temporary `[Char]` would not put the GC under much pressure because it would be discarded quickly. Note that it may even be possible to use file API primitives which work with `Text` or `ByteString` to avoid the conversion entirely.
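As a sketch of the storage side (a hypothetical wrapper, not an existing HLS type), paths could be kept packed and only unpacked at the API boundary:

```haskell
import qualified Data.Text as T

-- A packed path: one contiguous buffer instead of a linked list of
-- Char cells, so it is much cheaper for the GC to trace and copy.
newtype PackedPath = PackedPath T.Text
  deriving (Eq, Ord, Show)

fromFilePath :: FilePath -> PackedPath
fromFilePath = PackedPath . T.pack

-- The transient [Char] produced here dies young, which is cheap for a
-- generational GC, unlike a long-lived String kept in the graph.
toFilePath :: PackedPath -> FilePath
toFilePath (PackedPath t) = T.unpack t
```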
Regarding `IntSet`, it is super efficient when we want to store a few values out of a large possible domain (e.g. if the set of possible values is all of `Int`). In our use case, the domain is small (the number of module files in the project). For example, to store 1000 modules, a compact structure such as a bitset will use a fixed 128 bytes (+ a small overhead) and will have virtually no impact on the GC. On the other hand, an `IntSet` uses 8 bytes for each stored `Int`, so it is already less efficient than such a bitset once we want to store more than 16 `Int`s, and that does not take into account the overhead of the tree itself. Without going into the details (and I may be wrong by a factor of 2), each `Int` is itself packed into a `Tip` constructor and the tree nodes contain multiple pointers: a tree of 3 `Int`s takes 152 bytes, which is already more than a packed bitset able to store up to 1000 modules.
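For intuition, here is a tiny bitset sketch over the dense module-ID domain, using `Integer` as an arbitrary-width bit mask for simplicity (a real implementation would likely use an unboxed array of `Word64`):

```haskell
import Data.Bits (popCount, setBit, testBit)

-- Modules are numbered 0..n-1, so one bit per module suffices:
-- 1000 modules fit in 1000 bits, i.e. 125 bytes of payload.
newtype ModuleSet = ModuleSet Integer

emptySet :: ModuleSet
emptySet = ModuleSet 0

insert :: Int -> ModuleSet -> ModuleSet
insert i (ModuleSet bits) = ModuleSet (setBit bits i)

member :: Int -> ModuleSet -> Bool
member i (ModuleSet bits) = testBit bits i

size :: ModuleSet -> Int
size (ModuleSet bits) = popCount bits
```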
## Conclusion
I would like to work on the implementation of the different solutions discussed here. I may not be fast because of my time constraints and my limited understanding of the details of the codebase, but this is a good entry issue and I would like to be more involved in HLS development. Also, these changes will have a huge impact at work.

I will try to provide an MR soon fixing the massive number of file lookups and later, once benchmarks have shown that building the graph is negligible, work on the memory usage of the graph.