## Description
I've observed that `rawDependencyInformation`, which is used to compute the module dependency graph, has poor performance and uses a lot of memory.
The reasons are twofold:
- Algorithmically:
  - The `GetLocatedImports` rule parses the `import` section of each file and, for each imported module, tries to find the associated file. This means that the same lookup is done multiple times (e.g. if `import Data.Functor` appears in 100 of the modules of your project, it will be looked up 100 times).
  - The module lookup is linear in the number of include paths. If you have 1000 include paths, it will try up to 1000 locations.

  Fixing this could improve the performance by a few orders of magnitude.
- Data structures:
  - All the filepath handling is done with `FilePath`, which is `String`. We store a lot of filepaths (at least as many as there are files in the project), and `String` is notorious for its poor performance, large memory footprint and heavy GC pressure when tracing / copying.
  - The dependency logic uses a lot of `IntSet`s: for each module, we store the `IntSet` of its dependencies and reverse dependencies. Similarly to `FilePath`, `IntSet` is not a memory- and GC-friendly structure.
Details later in this document.
Fixing this would only have a limited benefit; however, it can dramatically reduce the memory usage of HLS (and the GC pressure, hence improve performance). See also this recent note from @soulomoon, who is worried about the memory footprint of the dependency graph: #4594 (comment)
## The debugging session
I have been interested for months in #4443: HLS on our codebase is slow, takes minutes to start, and we observe moments where HLS is stuck for a few minutes after a change in the codebase before returning updated diagnostics. This is especially painful during large refactoring sessions. E.g. recently I updated our codebase to use `UnliftIO.Exception` instead of `Control.Exception.Safe`. I started with a very large search and replace, but some fine-tuning had to be done in each module. I started to get mad: HLS shows me an error (usually a missing module import) with an associated code action, I activate the code action, HLS recomputes for 2 or 3 minutes, and then I switch to the next error. I'm so happy to be paid to watch HLS processing files. In #4443 I was able to propose a small reproducer, and recently @soulomoon wrote #4594, which fixed the rebuild issues with my reproducer.
But HLS on my codebase still takes minutes to start and, even if it rebuilds less, it does rebuild (which I now think is expected); unfortunately, a rebuild usually takes minutes, so I started to investigate.
A note: it is usually valuable to have correct assumptions about the expected performance. I had assumed it was "normal" for HLS to be slower than ghci because it does way more work. In the Matrix chat, @fendor (whom I thank for all the questions he answered) told me that HLS should be as fast as, or even faster than, ghci. This completely changed my mental model of HLS, and I decided to investigate why it was taking so much time to start: it is easier to debug startup time than whatever happens during a rebuild. I also hoped that by working on the startup time, I would gather enough knowledge of the HLS architecture to be able to debug more things.
So yesterday I started debugging HLS in order to understand why it was starting so slowly.
## The debugging context
I'm working on our proprietary codebase at https://github.com/novainsilico. For the record, we use Haskell to simulate biological interactions in the human body, pathologies and treatments.
- We have a codebase of 1116 Haskell modules.
- The "bios" for HLS is manually written, because we have a custom build system based on nix to build everything.
- From the point of view of HLS, we have only one "unit" containing all the modules. However, our build system works with packages, 254 of them (we migrated from cabal to bazel and finally to nix as the build system). It also means that we have 254 include paths; this will be important later ;)
- The HLS bios is generated by nix AND our ghci session uses exactly the same settings, so comparing HLS with ghci is fair.
- We use GHC 9.12.2 with haskell-language-server built from the latest commits in this repo, on top of which I had rebased:
  - compute deps fingerprint only #4594: avoids too much rebuilding when we edit the module "header" (`import`s, language pragmas, ...)
  - Avoid unnecessary recompilation due to -haddock #4596: avoids unnecessary rebuilds due to haddock flag changes
## Initial observation
Starting from a clean `.cache/ghcide`, I started my editor on a random file in the project which sits roughly in the middle of the dependency graph. I'll discuss the timings (approximate, measured with a hand-held stopwatch) and what appears in my editor "status" regarding what HLS is doing.

- For 15s, HLS shows `Processing 1/2 Setting up jinko`: this is the bios evaluation time and is expected. It calls nix, which walks our complete project to build a dependency graph of all the Haskell files by manually parsing the `import`s. This is not required for our bios logic, but it was simpler. Now that I've realized this takes 15s of startup time, I will definitely work on improving this process. Note however that in 15s I can build, in nix, without parallelism, a complete module dependency graph. This is also important for later ;)
- Then HLS stays stuck at "Processing 58/1167 4%" for 3m21s. No files are created in `.cache/ghcide`, so I guessed that this was some sort of HLS preprocessing happening before the real typechecking work starts.
- Then, once it unblocks, `Processing` starts to increase and `Indexing` appears. Files start to be created inside `.cache/ghcide`. 26s later, the diagnostics appear in my file and `Processing` disappears: typechecking is done. `Indexing` continues for 1m26s.
The complete process took 5m14s, and I was able to access the first diagnostic after 3m48s.
Repeating the same process without cleaning `.cache/ghcide` led to the following differences:

- still 15s of `Setting up`: this is understood and not related to HLS
- 3m03s of `Processing 4%`: a bit less than in the first run, but I'm benchmarking on my laptop, so the difference is irrelevant
- 6s of `Processing` before the diagnostics appear, and no `Indexing`
So it seems that typechecking benefits from the cache, and indexing too. However, the "preprocessing" step does not seem to be cached in any way.
## The preprocessing phase
I changed my debugging strategy. Instead of running HLS from my editor, I ran it directly from the CLI on the same file as the one opened in the editor, with `haskell-language-server path/to/the/Haskell/Module.hs`. This approach has the benefit of allowing benchmarking (perf, eventlog, ...), and the process terminates when done.

I started with eventlog profiling, which does not require any rebuild of the hls binary: `haskell-language-server path/to/the/Haskell/Module.hs +RTS -hT -l`, then I ran `eventlog2html` on the generated `.eventlog` file.
We can see that:

- there are a lot of `IntSet`s in use
- there are a lot of list cons cells (`:`)
- there are a few thunks
There is nothing alarming in this memory profile, except that there are a lot of small objects, amounting to more than 1 GiB of RAM during the preprocessing. I noted that for future improvement: reducing this could improve performance (e.g. less GC pressure) and anyway, reducing memory usage is always great.
## Build with profiling enabled
I then built hls with `cabal build exe:haskell-language-server --enable-profiling -f-dynamic`. This was difficult to figure out, and I had a lot of surprising errors about missing `p_dyn` libraries from dependencies: that's because profiling needs to be built statically, but HLS is built with `dynamic` by default. I note that here so that a search may help a future dev interested in debugging. The errors looked like this:
```
exe/Main.hs:33:1: error: [GHC-88719]
    Could not load module ‘Language.LSP.Protocol.Message’.
    Perhaps you haven't installed the "p_dyn" libraries for package ‘lsp-types-2.3.0.1’?
    Use -v to see a list of the files searched for.
   |
33 | import Language.LSP.Protocol.Message as LSP
   | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

exe/Main.hs:34:1: error: [GHC-88719]
    Could not load module ‘Language.LSP.Server’.
    Perhaps you haven't installed the "p_dyn" libraries for package ‘lsp-2.7.0.1’?
    Use -v to see a list of the files searched for.
   |
34 | import Language.LSP.Server as LSP
   | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

exe/Main.hs:35:1: error: [GHC-88719]
    Could not load module ‘Prettyprinter’.
    Perhaps you haven't installed the "p_dyn" libraries for package ‘prettyprinter-1.7.1’?
    Use -v to see a list of the files searched for.
   |
35 | import Prettyprinter (Pretty (pretty), vcat, vsep)
   | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
```
I then ran haskell-language-server with `+RTS -p` (and another run with `-pj` to get the JSON output). The result in the `.prof` file is really interesting:
```
++                          GHC.Internal.Base            libraries/ghc-internal/src/GHC/Internal/Base.hs:1985:1-4          15.0  29.6
$wbreak                     GHC.Internal.List            libraries/ghc-internal/src/GHC/Internal/List.hs:1329:1-5          14.0  19.6
onBranchElement             StmHamt.Focuses              library/StmHamt/Focuses.hs:(14,1)-(36,23)                          6.9   4.0
joinDrive                   System.FilePath.Posix        libraries/filepath/System/FilePath/Internal.hs:553:1-9             4.4   1.7
newKey                      Development.IDE.Graph.Internal.Key src/Development/IDE/Graph/Internal/Key.hs:(75,1)-(83,108)    4.4   0.4
go3                         System.FilePath.Posix        <no location info>                                                 3.6   5.3
splitDirectories_f          System.FilePath.Posix        libraries/filepath/System/FilePath/Internal.hs:877:9               3.1   4.2
$wspan                      GHC.Internal.List            libraries/ghc-internal/src/GHC/Internal/List.hs:1282:1-4           2.5   2.7
escapeURIChar               Network.URI                  Network/URI.hs:(1000,1)-(1011,63)                                  2.3   1.8
bracket1                    GHC.Internal.IO              <no location info>                                                 2.3   7.3
$wouter                     Data.Text.Internal           <no location info>                                                 2.2   0.3
map                         GHC.Internal.Base            libraries/ghc-internal/src/GHC/Internal/Base.hs:1892:1-3           2.2   2.8
escapeURIString             Network.URI                  Network/URI.hs:1043:1-51                                           2.1   2.6
normalise                   System.FilePath.Posix        libraries/filepath/System/FilePath/Internal.hs:1028:1-9            1.8   0.0
firstJustM                  Control.Monad.Extra          src/Control/Monad/Extra.hs:(293,1)-(294,66)                        1.3   0.3
dropTrailingPathSeparator   System.FilePath.Posix        libraries/filepath/System/FilePath/Internal.hs:770:1-25            1.3   0.0
uses_                       Development.IDE.Core.Shake   src/Development/IDE/Core/Shake.hs:(1130,1)-(1134,27)               1.3   0.7
findWithIndex               PrimitiveExtras.SmallArray   library/PrimitiveExtras/SmallArray.hs:(160,3)-(169,15)             1.0   0.1
unsafeDupableInterleaveIO1  GHC.Internal.IO.Unsafe       <no location info>                                                 1.0   1.5
unpackAppendCString#        GHC.CString                  libraries/ghc-prim/GHC/CString.hs:192:1-20                         0.7   1.5
```
- `(++)` (list concatenation) is responsible for 30% of the allocations and 15% of the total runtime.
- A lot of the relevant cost centres are related to lists, chars and filepaths (which are actually lists of chars).
The process took 370s, close to the 5m I observed without profiling, so we can assume that we are not just observing profiling overhead.
At that point, I started to get worried: why the hell does HLS spend 3 minutes doing file handling?
When inspecting the prof file with `profiteur`, I had a few surprises:
- 95% of the time is spent in `getModuleGraphRule`. This is completely coherent with the idea of HLS doing a lot of "preprocessing": this rule is called once to build the complete dependency graph before the typechecking starts.
- 73% of the time is spent in `getLocatedImportsRule`: this rule returns the files (from the project) associated with all the modules imported by a given file of the project. E.g. if module `A.hs` imports `B` and `C`, this rule will look for the on-disk paths of `B.hs` and `C.hs`.
- `locateModule` is called 17390 times.
- `toNormalizedFilePath'` is called 6,830,022 times, for a total of 60% of the total runtime! `toNormalizedFilePath'` is indeed a "slow" function because it works on `String`, but calling it 6 million times is too much anyway ;)
- `getFileExists` is called 6,830,022 times too. This is an `IO` operation, kinda expensive (it takes 3% of the total CPU time, but it may be way more in wall-clock time because this thing is doing IO...)
## Discussion
It was clear that there was an issue with `getFileExists` being called too many times, and the number of calls to `locateModule` was also surprising. I had to check whether `locateModule` had some sort of caching, but anyway, why call it so many times? My codebase contains 1117 module files, and it actually depends on a bit more than two thousand modules:
```
$ pcregrep -h -o2 '^import( qualified)*([^(]*)' **/*.hs | sort | uniq | wc -l
2505
```
So why call it 17K times? And why is this function doing so many `getFileExists` calls?
Digging into its implementation, I found that it loops through all the include paths to look for a module. Recall that I have 254 "packages", hence 254 import paths: 254 * 17390 adds up to close to 5 million `getFileExists` calls.
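To make the shape of the problem concrete, here is a simplified sketch of what this lookup amounts to (hypothetical code, not the actual `locateModule` implementation, which also checks the package DB):

```haskell
import System.Directory (doesFileExist)
import System.FilePath ((<.>), (</>))

-- Simplified sketch: probe every include path in turn, for each import.
-- With I imports and P include paths this performs up to I * P
-- doesFileExist calls: on my codebase, 254 include paths for 17390
-- locateModule calls, i.e. millions of filesystem probes.
findModuleFile :: [FilePath] -> FilePath -> IO (Maybe FilePath)
findModuleFile includePaths moduleRelPath = go includePaths
  where
    go []           = pure Nothing
    go (dir : rest) = do
      let candidate = dir </> moduleRelPath <.> "hs"
      exists <- doesFileExist candidate
      if exists then pure (Just candidate) else go rest
```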
So here we have a first identified problem: the dependency graph build is linear in the number of include paths, and this can really explode. It also explains why few people have reported similar issues: there may not be many large codebases with so many include paths in the wild.
I did a quick test by copying all the modules of my project into a single hierarchy and indeed, this is way, way faster, and actually most of my problems are solved. But HLS should do better, because projects with more than a few include paths do exist, and there is a linear factor to drop here. Note also that ghci does NOT suffer from this problem.
Regarding why `locateModule` is called 17K times: in `rawDependencyInformation`, for each module, we extract the list of `import`s of that module and try to find the associated files. This is done in the `GetLocatedImports` rule, which calls `locateModule` for each import, without any caching. Hence the cost depends on the number of times an import appears across files, instead of on the number of distinct imports.
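To illustrate, here is a minimal memoisation sketch (hypothetical names, deliberately ignoring the real rule machinery): each module name is resolved once, and every later request is served from a shared table, so the cost depends only on the number of distinct imports:

```haskell
import Data.IORef (IORef, atomicModifyIORef', readIORef)
import qualified Data.Map.Strict as Map

-- Hypothetical stand-ins for the real ghcide types.
type ModuleName   = String
type LocateResult = Maybe FilePath

-- Wrap the expensive lookup with a shared memo table.
locateModuleCached
  :: IORef (Map.Map ModuleName LocateResult)  -- shared cache
  -> (ModuleName -> IO LocateResult)          -- the uncached lookup
  -> ModuleName
  -> IO LocateResult
locateModuleCached ref locate name = do
  cached <- Map.lookup name <$> readIORef ref
  case cached of
    Just res -> pure res                      -- hit: no filesystem access
    Nothing  -> do
      res <- locate name                      -- miss: one real lookup
      atomicModifyIORef' ref (\m -> (Map.insert name res m, ()))
      pure res
```

In ghcide itself this would presumably be a rule keyed by module name rather than a bare `IORef`, so that the build system can track it.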
The numbers add up pretty well:

```
$ git grep '^import' **/*.hs | wc -l
16058
```

(I do have some generated files which are stored in `/nix/store`, hence this does not exactly match the 17K, but it confirms the estimate.)
## Solutions
- Regarding `locateModule` being called as many times as there are `import`s in the codebase, we can introduce caching (along the lines of the memoisation sketch above). I'm not yet fully familiar with the internals of the rule build system, but a rule which takes a module name and returns the associated file path would do the caching. I'm however not sure how this cache could be invalidated (e.g. if a file is moved). Is this important to handle, i.e. is HLS expected to be robust to file moves? Actually, if the startup time were improved, restarting the LSP could be a good enough answer. Implementing this will immediately (on my codebase) yield an 8x performance boost: this value seems to be the mean number of modules imported per module. This would actually benefit ALL projects using HLS.
- Regarding the time being linear in the number of import paths, from my point of view it could be fixed by an initial walk of all the include path directories, extracting the module names from the found files and building a map (see the sketch below). The cost would then depend only on the number of files stored in the import path directories. The only risk is an import path containing a huge number of unrelated files. I think the risk of that is really low, and HLS could warn the user that one of its import paths contains millions of files and hurts performance. And anyway, even with millions of files, it would still be more efficient than the current situation (because walking directories once is cheaper than testing for files millions of times). This rule would also have the problem of being invalidated by file moves and such.
- Note that the `locateModule` logic checks both on disk AND in the package DB. There is also room for optimization here, considering that the package DB is already available in memory.
- I just realized that the bios can return either module names or filepaths. When it returns filepaths, HLS has nothing to do to locate them. However, when it returns module names, HLS first converts them to filepaths (I don't know where, I need to investigate) and THEN builds the dependency graph. So the logic is maybe duplicated, and maybe there is room for improvement here too. TODO: investigate that too.
Note that both problems could be solved at the same time by building this `Map ModuleName FilePath` once at startup.
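A minimal sketch of what I have in mind (hypothetical code, assuming sources follow the usual `Data/Functor.hs` layout under each include path): walk each include path once at startup and index every Haskell source by the module name derived from its path. Lookups then become map accesses instead of hundreds of filesystem probes per import:

```haskell
import Data.List (intercalate)
import qualified Data.Map.Strict as Map
import System.Directory (doesDirectoryExist, listDirectory)
import System.FilePath (splitExtension, (</>))

-- Walk every include path once. Map.unions is left-biased, so the first
-- include path wins on conflicts, mirroring the current linear search order.
buildModuleMap :: [FilePath] -> IO (Map.Map String FilePath)
buildModuleMap includePaths =
  Map.unions <$> mapM (\root -> walk root []) includePaths
  where
    walk dir prefix = do
      entries <- listDirectory dir
      Map.unions <$> mapM (visit dir prefix) entries
    visit dir prefix entry = do
      let path = dir </> entry
      isDir <- doesDirectoryExist path
      if isDir
        then walk path (prefix ++ [entry])
        else case splitExtension entry of
          -- "Data/Functor.hs" under a root becomes "Data.Functor"
          (name, ".hs") ->
            pure (Map.singleton (intercalate "." (prefix ++ [name])) path)
          _ -> pure Map.empty
```

The cost is one directory traversal per include path, independent of how many times each module is imported.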
Regarding the invalidation on file moves, most HLS clients can nowadays send events when files are moved; these could force the invalidation of the module-to-file cache.
Both of these changes have the potential to remove the graph build cost almost completely. In my project, it would turn 4 minutes of initialisation into 5s.
## Memory usage of `FilePath` and `IntSet`
Regarding `FilePath`, it is obvious that `[Char]` is less storage-efficient than `Text` or even `ByteString`, which are packed representations. It is also way more work for the GC to trace / copy linked lists. One fix could be to store paths as `Text` in the different DBs and convert them to `FilePath` on demand, each time this is required. The temporary `[Char]` would not put the GC under much pressure because it would be discarded quickly. Note that it may even be possible to use file API primitives which work with `Text` or `ByteString` to avoid the conversion entirely.
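As a sketch of the storage side (a hypothetical wrapper, not an existing HLS type), paths could be kept packed and only unpacked at the API boundary:

```haskell
import qualified Data.Text as T

-- A packed path: one contiguous buffer instead of a linked list of
-- Char cells, so it is much cheaper for the GC to trace and copy.
newtype PackedPath = PackedPath T.Text
  deriving (Eq, Ord, Show)

fromFilePath :: FilePath -> PackedPath
fromFilePath = PackedPath . T.pack

-- The transient [Char] produced here dies young, which is cheap for a
-- generational GC, unlike a long-lived String kept in the graph.
toFilePath :: PackedPath -> FilePath
toFilePath (PackedPath t) = T.unpack t
```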
Regarding `IntSet`, it is super efficient when we want to store a few values out of a large possible domain (e.g. if the set of possible values is all of `Int`). In our use case, the domain is small (the number of module files in the project). For example, to store 1000 modules, a compact structure such as a bitset will use a fixed 128 bytes (+ a small overhead) and will have virtually no impact on the GC. On the other hand, an `IntSet` uses 8 bytes for each stored `Int`, so it is already less efficient than such a bitset once we want to store more than 16 `Int`s, and that does not take into account the overhead of the tree itself. Without going into the details (and I may be wrong by a factor of 2), each `Int` is itself packed into a `Tip` constructor and the tree nodes contain multiple pointers: a tree of 3 `Int`s takes 152 bytes, which is already more than a packed bitset able to store up to 1000 modules.
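For intuition, here is a tiny bitset sketch over the dense module-ID domain, using `Integer` as an arbitrary-width bit mask for simplicity (a real implementation would likely use an unboxed array of `Word64`):

```haskell
import Data.Bits (popCount, setBit, testBit)

-- Modules are numbered 0..n-1, so one bit per module suffices:
-- 1000 modules fit in 1000 bits, i.e. 125 bytes of payload.
newtype ModuleSet = ModuleSet Integer

emptySet :: ModuleSet
emptySet = ModuleSet 0

insert :: Int -> ModuleSet -> ModuleSet
insert i (ModuleSet bits) = ModuleSet (setBit bits i)

member :: Int -> ModuleSet -> Bool
member i (ModuleSet bits) = testBit bits i

size :: ModuleSet -> Int
size (ModuleSet bits) = popCount bits
```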
## Conclusion
I would like to work on the implementation of the different solutions discussed here. I may not be fast because of my time constraints and my limited understanding of the details of the codebase, but this is a good entry issue and I would like to be more involved in HLS development. Also, these changes will have a huge impact at work.

I will try to provide an MR soon fixing the massive number of file lookups and later, once benchmarks have shown that building the graph is negligible, work on the memory usage of the graph.