
rawDependencyInformation bad performance and memory usage #4598

Open
@guibou

Description


I've observed that rawDependencyInformation, used to compute the module dependency graph, has poor performance and uses a lot of memory.

The reasons are twofold:

  1. Algorithmic complexity
  • The rule GetLocatedImports, for each file, parses the import section and, for each imported module, tries to find the associated file. This means the same lookup is done multiple times (e.g. if import Data.Functor appears in 100 modules of your project, it will be looked up 100 times).
  • The module lookup depends on the number of include paths. If you have 1000 include paths, it will try 1000 locations.

Fixing this could improve the performance by a few orders of magnitude.

  2. Data structures
  • All the file path handling is done with FilePath, i.e. String. We store a lot of file paths (at least as many as there are files in the project) and String is notorious for its poor performance, huge memory usage and heavy GC pressure when tracing / copying.
  • The dependency logic uses a lot of IntSet: for each module, we store the IntSet of its dependencies and reverse dependencies. Similarly to FilePath, IntSet is not a memory- and GC-friendly structure.

Details later in this document.

Fixing this would have a more limited benefit, but it can dramatically reduce the memory usage of HLS (and GC pressure, hence performance). See also this recent note from @soulomoon expressing concern about the memory footprint of the dependency graph: #4594 (comment)

The debugging session

I've been interested in #4443 for months: HLS on our codebase is slow, takes minutes to start, and we observe moments where HLS is stuck for a few minutes after some change in the codebase before returning updated diagnostics. This is especially painful during large refactoring sessions. E.g. recently I updated our codebase to use UnliftIO.Exception instead of Control.Exception.Safe. I started with a super large search and replace, but some fine tuning had to be done on each module. I started to get mad: HLS shows me an error (usually a missing module import) with an associated code action, I activate the code action, HLS recomputes for 2 or 3 minutes, and then I switch to the next error. I'm so happy to be paid to watch HLS processing files. In #4443 I was able to propose a small reproducer and recently @soulomoon wrote #4594 which fixed the rebuild issues with my reproducer.

But HLS on my codebase still takes minutes to start and, even if it rebuilds less, it does rebuild (now I think that's expected) and unfortunately a rebuild usually takes minutes, so I started to investigate.

A note: it is usually great to have correct assumptions about the expected performance. I had assumed it was "normal" that HLS was slower than ghci because it does way more stuff. In the matrix chat, @fendor (whom I thank for all the questions he answered) told me that HLS should be as fast as or even faster than ghci: this completely changed my mental model of HLS and I decided to investigate why it was taking so much time to start; it is easier to debug startup time than what is happening during a rebuild. I also hoped that by working on the startup time, I would gather enough knowledge of the HLS architecture to be able to debug more things.

So yesterday I started debugging HLS in order to understand why it was starting so slowly.

The debugging context

I'm working on our proprietary codebase at https://github.com/novainsilico. For the record, we use Haskell to simulate biological interactions in the human body, pathologies and treatments.

  • We have a codebase of 1116 Haskell modules.

  • The "bios" for HLS is manually written, because we have a custom build system based on nix to build everything.

  • From the point of view of HLS, we have only one "unit" containing all the modules. However our build system works with packages, 254 of them (we migrated from cabal to bazel and finally to nix as the build system). It also means that we have 254 include paths; this will be important later ;)

  • The HLS bios is generated by nix AND our ghci session uses exactly the same settings, so comparing HLS with ghci is fair.

  • We use ghc 9.12.2 with haskell-language-server built from the latest commits in this repo, on top of which I had rebased:

  • compute deps fingerprint only #4594: which avoids too much rebuilding when we edit the module "header" (imports, language pragmas, ...)

  • Avoid unnecessary recompilation due to -haddock #4596: which avoids unnecessary rebuild due to haddock flag changes

Initial observation

Starting from a clean .cache/ghcide, I started my editor on a random file in the project which is roughly in the middle of the dependency graph. I'll discuss timings (approximate, measured with a handheld stopwatch) and what appears in my editor "status" regarding what HLS is doing.

  • for 15s, HLS is Processing 1/2 Setting up jinko: this is the bios evaluation time and is expected; it calls nix which walks our complete project to build a dependency graph of all the Haskell files by manually parsing the imports. This is not required for our bios logic, but it was simpler. Now that I've realized that this takes 15s of startup time, I will definitely work on improving this process. Note however that in these 15s I can build, in nix, without parallelism, a complete module dependency graph. This is also important for later ;)
  • Then HLS stays stuck at "Processing 58/1167 4%" for 3m21s. No files are created in .cache/ghcide, so I guessed that it was some sort of HLS preprocessing before the real typechecking work starts.
  • Then, once it unblocks, Processing starts to increase and Indexing appears. Files start to be created inside .cache/ghcide.
  • 26s later, the diagnostics appear in my file and Processing disappears: typechecking is done.
  • Indexing continues for 1m26s.

The complete process took 5m14s and I was able to access the first diagnostic after 3m48s.

Repeating the same process without cleaning .cache/ghcide led to the following differences:

  • still 15s of Setting up: this is understood and not related to HLS
  • 3m03s of Processing 4%. This is a bit less than in the first run, but I'm benchmarking on my laptop, so the difference is irrelevant.
  • 6s of Processing before diagnostics appear, and no Indexing.

So it seems that typechecking benefits from the cache, and indexing too. However the "preprocessing" step does not seem to be cached in any way.

The preprocessing phase

I changed my debugging strategy. Instead of running HLS from my editor, I decided to run it directly from the CLI on the same file as the one opened in the editor, i.e. haskell-language-server path/to/the/Haskell/Module.hs. This approach has the benefit of allowing benchmarking (perf, eventlog, ...) and the process terminates when done.

I started with eventlog profiling, which does not require any rebuild of the hls binary: haskell-language-server path/to/the/Haskell/Module.hs +RTS -hT -l. Then I ran eventlog2html on the generated .eventlog file.

[eventlog2html heap profile charts]

We can see that:

  • There is a lot of IntSet usage
  • A lot of : (the list cons constructor)
  • A few thunks.

There is nothing alarming in this memory profile, except that there are a lot of small objects, taking more than 1 GiB of RAM during the preprocessing. I kept that in mind for a future improvement: reducing this could improve performance (e.g. less GC pressure) and anyway, reducing memory usage is always great.

Build with profiling enabled

I then built HLS with cabal build exe:haskell-language-server --enable-profiling -f-dynamic. This was difficult to figure out and I got a lot of surprising errors about missing p_dyn libraries from dependencies; that's because profiling libraries need to be linked statically but HLS is built dynamically by default. I note that here so a search may help a future dev interested in debugging. The errors looked like this:

exe/Main.hs:33:1: error: [GHC-88719]
    Could not load module ‘Language.LSP.Protocol.Message’.
    Perhaps you haven't installed the "p_dyn" libraries for package ‘lsp-types-2.3.0.1’?
    Use -v to see a list of the files searched for.
   |
33 | import           Language.LSP.Protocol.Message as LSP
   | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

exe/Main.hs:34:1: error: [GHC-88719]
    Could not load module ‘Language.LSP.Server’.
    Perhaps you haven't installed the "p_dyn" libraries for package ‘lsp-2.7.0.1’?
    Use -v to see a list of the files searched for.
   |
34 | import           Language.LSP.Server           as LSP
   | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

exe/Main.hs:35:1: error: [GHC-88719]
    Could not load module ‘Prettyprinter’.
    Perhaps you haven't installed the "p_dyn" libraries for package ‘prettyprinter-1.7.1’?
    Use -v to see a list of the files searched for.
   |
35 | import           Prettyprinter                 (Pretty (pretty), vcat, vsep)
   | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

I then ran haskell-language-server with +RTS -p (and another run with -pj to get the JSON result).

The result in the .prof file is really interesting:

COST CENTRE                MODULE                              SRC                                                         %time %alloc
++                         GHC.Internal.Base                  libraries/ghc-internal/src/GHC/Internal/Base.hs:1985:1-4    15.0   29.6
$wbreak                    GHC.Internal.List                  libraries/ghc-internal/src/GHC/Internal/List.hs:1329:1-5    14.0   19.6
onBranchElement            StmHamt.Focuses                    library/StmHamt/Focuses.hs:(14,1)-(36,23)                    6.9    4.0
joinDrive                  System.FilePath.Posix              libraries/filepath/System/FilePath/Internal.hs:553:1-9       4.4    1.7
newKey                     Development.IDE.Graph.Internal.Key src/Development/IDE/Graph/Internal/Key.hs:(75,1)-(83,108)    4.4    0.4
go3                        System.FilePath.Posix              <no location info>                                           3.6    5.3
splitDirectories_f         System.FilePath.Posix              libraries/filepath/System/FilePath/Internal.hs:877:9         3.1    4.2
$wspan                     GHC.Internal.List                  libraries/ghc-internal/src/GHC/Internal/List.hs:1282:1-4     2.5    2.7
escapeURIChar              Network.URI                        Network/URI.hs:(1000,1)-(1011,63)                            2.3    1.8
bracket1                   GHC.Internal.IO                    <no location info>                                           2.3    7.3
$wouter                    Data.Text.Internal                 <no location info>                                           2.2    0.3
map                        GHC.Internal.Base                  libraries/ghc-internal/src/GHC/Internal/Base.hs:1892:1-3     2.2    2.8
escapeURIString            Network.URI                        Network/URI.hs:1043:1-51                                     2.1    2.6
normalise                  System.FilePath.Posix              libraries/filepath/System/FilePath/Internal.hs:1028:1-9      1.8    0.0
firstJustM                 Control.Monad.Extra                src/Control/Monad/Extra.hs:(293,1)-(294,66)                  1.3    0.3
dropTrailingPathSeparator  System.FilePath.Posix              libraries/filepath/System/FilePath/Internal.hs:770:1-25      1.3    0.0
uses_                      Development.IDE.Core.Shake         src/Development/IDE/Core/Shake.hs:(1130,1)-(1134,27)         1.3    0.7
findWithIndex              PrimitiveExtras.SmallArray         library/PrimitiveExtras/SmallArray.hs:(160,3)-(169,15)       1.0    0.1
unsafeDupableInterleaveIO1 GHC.Internal.IO.Unsafe             <no location info>                                           1.0    1.5
unpackAppendCString#       GHC.CString                        libraries/ghc-prim/GHC/CString.hs:192:1-20                   0.7    1.5

  • (++) from list concatenation is responsible for 30% of the allocations and 15% of the total runtime
  • A lot of the relevant cost centres relate to lists, Chars and FilePaths (which are lists of Chars).

The process took 370s, close to the 5 minutes I observed without profiling, so we can assume we are not just measuring profiling overhead.

At that point, I started to be worried: why the hell does HLS spend 3 minutes doing file handling?

When observing the prof file with profiteur, I had a few surprises:

[profiteur view of the .prof file]

  • 95% of the time is spent in getModuleGraphRule. This is completely consistent with the idea of HLS doing a lot of "preprocessing". This rule is called once to build the complete dependency graph before typechecking starts.
  • 73% of the time is spent in getLocatedImportsRule: this rule returns the files (from the project) associated with all the modules imported by another file of the project, e.g. if module A.hs imports B and C, this rule will look for the paths on disk of B.hs and C.hs.
  • locateModule is called 17390 times.
  • toNormalizedFilePath' is called 6,830,022 times, for 60% of the total runtime! toNormalizedFilePath' is indeed a "slow" function because it works on String, but calling it 6 million times is too much anyway ;)
  • getFileExists is called 6,830,022 times too; this is an IO operation, kinda expensive (it takes 3% of the total CPU time, but it may be much more in wall-clock time because it does IO...)

Discussion

It was clear that there was an issue with getFileExists being called too many times, and the number of calls to locateModule is also surprising. I had to check whether locateModule had some sort of caching, but anyway, why call it so many times? My codebase contains 1117 modules as files, and it actually depends on a bit more than two thousand modules:

$ pcregrep -h -o2 '^import( qualified)*([^(]*)' **/*.hs | sort | uniq | wc -l
2505

So why call it 17K times? And what is this function doing with so many getFileExists calls?

Digging into its implementation, I found that it loops through all the include paths to look for a module. Recall that I have 254 "packages", hence 254 import paths; 254 * 17390 adds up to close to 5 million of the getFileExists calls.
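To make this concrete, the behaviour amounts to something like the following sketch (hypothetical names and simplified logic, not the actual locateModule, which also short-circuits on the first hit and consults the package DB):

```haskell
import Control.Monad (filterM)
import Data.Maybe (listToMaybe)
import System.Directory (doesFileExist)
import System.FilePath ((<.>), (</>))

type ModuleName = String  -- stand-in for the real ModuleName type

-- Locate one imported module by probing every include path:
-- one doesFileExist call per include path.
naiveLocate :: [FilePath] -> ModuleName -> IO (Maybe FilePath)
naiveLocate includePaths modName =
    listToMaybe <$> filterM doesFileExist candidates
  where
    relPath    = map (\c -> if c == '.' then '/' else c) modName <.> "hs"
    candidates = [dir </> relPath | dir <- includePaths]

-- Calling this once per import *occurrence* (not per distinct module)
-- gives roughly #importOccurrences * #includePaths filesystem probes:
-- 17390 * 254 ≈ 4.4 million here.
```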

So here we have a first identified problem: the dependency graph build is linear in the number of include paths and can really explode. It also explains why few people have reported a similar issue: there may not be many large codebases with so many include paths in the wild.

I did a quick test by copying all the modules of my project into a single hierarchy and indeed, this is way, way faster and most of my problems actually disappear. But HLS should do better, because projects with a few include paths exist, and there is a linear factor to drop here. Note also that ghci does NOT suffer from this problem.

Regarding why locateModule is called 17K times: in rawDependencyInformation, for each module, we extract the list of imports of this module and try to find the associated files. This is done in the GetLocatedImports rule, which calls locateModule for each import, without any caching. Hence the cost depends on the number of times an import appears across files, instead of on the number of distinct imports.

The numbers add up pretty well:

git grep '^import' **/*.hs | wc -l
16058

(I do have some generated files stored in /nix/store, hence it does not exactly match the 17K, but it confirms the estimate.)

Solutions

  • Regarding locateModule being called as many times as there are imports in the codebase, we can introduce caching. I'm not yet fully used to the internals of the rule build system, but a rule which takes a module name and returns the associated file path would do the caching (a rough sketch of such a cache is shown after this list). I'm however not sure how this could be invalidated (e.g. if a file is moved). Is this important to handle (i.e. is HLS supposed to be robust to file moves)? Actually, if startup time were improved, restarting the LSP could be a good enough answer. Implementing this would immediately (on my codebase) yield an 8x performance boost; this value seems to be the mean number of modules imported per module. It would actually benefit ALL projects using HLS.
  • Regarding the linear dependency on the number of import paths, from my point of view it could be fixed by an initial walk of all the include path directories, extracting the module names from the files found and building a map. The process would hence depend only on the number of files stored in the import path directories. The only risk is if an import path contains a huge number of other files. I think the risk of that is really low, and HLS could warn the user that one of its import paths contains millions of files and hurts performance. And anyway, even with millions of files, it would still be more efficient than the current situation (because walking directories is cheaper than testing for existence millions of times). This rule will also have the problem of being invalidated by file moves and such.
  • Note that the locateModule logic also does the check on disk AND the check in the package DB. There is also room for optimization here, considering that the package DB is already available in memory.
  • I just realized that the bios can return either module names or file paths. When it returns file paths, HLS has nothing to do to locate them. However when it returns module names, HLS first converts them to file paths (I don't know where, I need to investigate) and THEN builds the dependency graph. So the logic is maybe duplicated and there may be room for improvement here too. TODO: investigate that too.
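As referenced in the first bullet above, here is a rough sketch of that cache, using a plain IORef-backed map instead of the real hls-graph rule machinery (all names are hypothetical):

```haskell
import Data.IORef (IORef, newIORef, readIORef, atomicModifyIORef')
import qualified Data.Map.Strict as Map

type ModuleName = String  -- stand-in for the real ModuleName type

newtype LocateCache = LocateCache (IORef (Map.Map ModuleName (Maybe FilePath)))

newLocateCache :: IO LocateCache
newLocateCache = LocateCache <$> newIORef Map.empty

-- Wrap an uncached lookup so each module name is resolved at most once,
-- no matter how many modules of the project import it.
locateCached :: LocateCache
             -> (ModuleName -> IO (Maybe FilePath))  -- the expensive lookup
             -> ModuleName -> IO (Maybe FilePath)
locateCached (LocateCache ref) locate modName = do
    cached <- Map.lookup modName <$> readIORef ref
    case cached of
      Just hit -> pure hit            -- cache hit: no filesystem probes at all
      Nothing  -> do
        found <- locate modName       -- cache miss: probe the include paths once
        atomicModifyIORef' ref (\m -> (Map.insert modName found m, ()))
        pure found
```

In HLS proper this would rather be a rule keyed by the module name, so that hls-graph takes care of caching and invalidation, but the effect is the same: the lookup cost becomes proportional to the number of distinct imported modules (~2.5K here) instead of the number of import occurrences (~17K).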

Note that both problems could be solved at the same time by building this Map ModuleName FilePath at startup.
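A minimal sketch of that startup pass, assuming include directories mostly contain Haskell sources (all names here are hypothetical, not existing HLS functions):

```haskell
import Data.List (intercalate, isSuffixOf)
import qualified Data.Map.Strict as Map
import System.Directory (doesDirectoryExist, listDirectory)
import System.FilePath (dropExtension, makeRelative, splitDirectories, (</>))

type ModuleName = String  -- stand-in for the real ModuleName type

-- Walk every include path once and build the ModuleName -> FilePath map.
-- Cost: proportional to the number of files under the include paths,
-- independent of how many times each module is imported.
buildModuleMap :: [FilePath] -> IO (Map.Map ModuleName FilePath)
buildModuleMap includePaths =
    Map.unions <$> mapM (\root -> walk root root) includePaths
  where
    walk root dir = do
      entries <- listDirectory dir
      Map.unions <$> mapM (classify root . (dir </>)) entries
    classify root path = do
      isDir <- doesDirectoryExist path
      if isDir
        then walk root path
        else pure $
          if ".hs" `isSuffixOf` path
            then Map.singleton (toModuleName root path) path
            else Map.empty
    -- e.g. "<root>/Data/Functor.hs" -> "Data.Functor"
    toModuleName root path =
      intercalate "." (splitDirectories (dropExtension (makeRelative root path)))
```

Locating an import then becomes a pure Map lookup; the per-import on-disk probing disappears entirely (the package DB fallback for non-project modules would stay as it is).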

Regarding invalidation by file moves, most HLS clients can now send events when files are moved; these could force the invalidation of the module-to-file cache.

Both these changes have the potential to remove the graph build cost almost entirely. In my project, it would mean 4 minutes of initialisation turned into 5s.

Memory usage of FilePath and IntSet

Regarding FilePath, it is obvious that [Char] is less storage efficient than Text or even ByteString, which are packed data. It is also way more work for the GC to trace / copy them. One fix could be to store them as Text in the different DBs and convert them to FilePath on demand each time this is required. The temporary [Char] would not put the GC under much pressure because it would be discarded quickly. Note that it may even be possible to use file API primitives that work with Text or ByteString to avoid the conversion.
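As a sketch (a possible shape, not a concrete API proposal), the stored representation could be packed Text, with the [Char] only materialised at the boundary where a String-based API is actually called:

```haskell
import qualified Data.Text as T

-- Paths stored packed: one flat buffer per path instead of a linked list of
-- boxed Chars, which is both smaller and much cheaper for the GC to trace.
newtype StoredPath = StoredPath T.Text
  deriving (Eq, Ord, Show)

fromFilePath :: FilePath -> StoredPath
fromFilePath = StoredPath . T.pack

-- Materialise a transient [Char] only when a String-based API needs one;
-- it is short-lived, so it does not stress the GC much.
toFilePath :: StoredPath -> FilePath
toFilePath (StoredPath t) = T.unpack t
```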

Regarding IntSet, it is super efficient when we want to store a few values out of a large possible domain (e.g. if the set of possible values is all of Int).

In our use case, the domain is small (e.g. the number of module files in the project). For example, to store 1000 modules, a compact structure such as a bitset will use a fixed 128 bytes (plus a small overhead) and will have virtually no impact on the GC. On the other hand, an IntSet uses 8 bytes for each stored Int, so it is already less space efficient than a bitset once we want to store more than 16 Ints, and that's without taking into account the overhead of the tree itself. Without going into the details, and I may be off by a factor of 2, each Int is packed into a Tip constructor and the tree nodes contain multiple pointers: a tree of 3 Ints takes 152 bytes, which is already more than a packed bitset able to store up to 1000 modules.
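For illustration, a dense bitset over module IDs (assuming modules are numbered 0..n-1, which the Int keys suggest) could look like the sketch below; 1000 modules fit in 16 Word64, i.e. 128 bytes in one unboxed array:

```haskell
import Data.Bits (popCount, setBit, testBit)
import Data.Word (Word64)
import qualified Data.Vector.Unboxed as VU

-- One flat unboxed array: ceil(n / 64) words for n possible modules,
-- a single heap object that the GC does not traverse element by element.
newtype ModuleSet = ModuleSet (VU.Vector Word64)

emptySet :: Int -> ModuleSet          -- n = number of modules in the project
emptySet n = ModuleSet (VU.replicate ((n + 63) `div` 64) 0)

insert :: Int -> ModuleSet -> ModuleSet
insert i (ModuleSet v) = ModuleSet (v VU.// [(w, setBit (v VU.! w) b)])
  where (w, b) = i `divMod` 64

member :: Int -> ModuleSet -> Bool
member i (ModuleSet v) = testBit (v VU.! w) b
  where (w, b) = i `divMod` 64

size :: ModuleSet -> Int
size (ModuleSet v) = VU.sum (VU.map popCount v)
```

The 128-byte figure matches the estimate above. Insertion copies the array, so this shape fits sets that are built once and then mostly queried; whether that matches how the dependency and reverse dependency sets are actually used would need checking.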

Conclusion

I would like to work on the implementation of the different solutions discussed here. I may not be fast because of my time constraints and my limited understanding of the details of the codebase, but this is a good entry issue and I would like to be more involved in HLS development. Also, these changes will have a huge impact at work.

I will try to soon provide an MR fixing the massive number of file lookups and later, once benchmarks have shown that building the graph is negligible, work on the memory usage of the graph.
