= Measuring Memory Usage in Rust
@matklad
:sectanchors:
:experimental:
:page-layout: post

****
rust-analyzer is a new "IDE backend" for the https://www.rust-lang.org/[Rust] programming language.
Support rust-analyzer on https://opencollective.com/rust-analyzer/[Open Collective] or https://github.com/sponsors/rust-analyzer[GitHub Sponsors].
****

This post documents a couple of fun tricks we use in rust-analyzer for measuring memory consumption.

In general, there are two broad approaches to profiling the memory usage of a program.

_The first approach_ is based on "`heap parsing`".
At a particular point in time, the profiler looks at all the memory currently occupied by the program (the heap).
In its raw form, the memory is just a bag of bytes, `Vec<u8>`.
However, the profiler, with some help from the language's runtime, is able to re-interpret these bytes as collections of objects ("`parse the heap`").
It then traverses the graph of objects, counting how many instances of each type there are and how much memory they occupy.
The profiler also tracks ownership relations, to ferret out facts like "`90% of the strings in this program are owned by the ``Config`` struct`".
This is the approach I am familiar with from the JVM ecosystem.
Java's garbage collector needs to understand the heap to find unreachable objects, and the same information is used to analyze heap snapshots.

_The second approach_ is based on instrumenting the calls to allocation and deallocation routines.
The profiler captures backtraces when the program calls `malloc` and `free` and constructs a flamegraph displaying the "`hot`" functions which allocate a lot.
This is how, for example, https://github.com/KDE/heaptrack[heaptrack] works (see also https://github.com/cuviper/alloc_geiger[alloc geiger]).

The two approaches are complementary.
If the problem is that the application makes too many short-lived allocations (instead of re-using buffers), it would be invisible to the first approach, but very clear in the second one.
If the problem is that, in a steady state, the application uses too much memory, the first approach works better for pointing out which data structures need the most attention.

In rust-analyzer, we are generally interested in keeping the overall memory usage small, so we can make better use of the heap parsing approach.
Specifically, most of rust-analyzer's data is stored in incremental computation tables, and we want to know which table is the heaviest.

Unfortunately, Rust doesn't use garbage collection, so just parsing the heap bytes at runtime is impossible.
The best available alternative is instrumenting data structures for the purpose of measuring memory size.
That is, writing a proc-macro which adds a `fn total_size(&self) -> usize` method to annotated types, and calling that manually from the root of the data.
There is Servo's https://github.com/servo/servo/tree/2d3811c21bf1c02911d5002f9670349c5cf4f500/components/malloc_size_of[`malloc_size_of`] crate for doing that, but it is not published to crates.io.
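
In spirit, the instrumentation amounts to something like the following hand-written sketch.
The `TotalSize` trait, its impls, and the `Config` type are illustrative stand-ins, not the actual `malloc_size_of` API:

[source,rust]
----
use std::mem::size_of;

/// Deep size of a value: its inline size plus its owned heap allocations.
/// (A stand-in for what a `#[derive(TotalSize)]` proc-macro would generate.)
trait TotalSize {
    fn total_size(&self) -> usize;
}

impl TotalSize for String {
    fn total_size(&self) -> usize {
        size_of::<String>() + self.capacity()
    }
}

impl<T: TotalSize> TotalSize for Vec<T> {
    fn total_size(&self) -> usize {
        let elems: usize = self.iter().map(|it| it.total_size()).sum();
        // Spare capacity is allocated, but holds no elements yet.
        let spare = (self.capacity() - self.len()) * size_of::<T>();
        size_of::<Vec<T>>() + elems + spare
    }
}

// What the derive would expand to for an annotated type:
struct Config {
    name: String,
    args: Vec<String>,
}

impl TotalSize for Config {
    fn total_size(&self) -> usize {
        self.name.total_size() + self.args.total_size()
    }
}

fn main() {
    let config = Config {
        name: "rust-analyzer".to_string(),
        args: vec!["--log-file".to_string()],
    };
    // Called manually from the root of the data.
    println!("config occupies {} bytes", config.total_size());
}
----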

Another alternative is running the program under valgrind to gain runtime introspectability.
https://www.valgrind.org/docs/manual/ms-manual.html[Massif] and https://www.valgrind.org/docs/manual/dh-manual.html[DHAT] work that way.
Running under valgrind is pretty slow, and still doesn't give Java-level fidelity.

Instead, rust-analyzer mainly relies on a much simpler approach for figuring out which things are heavy.
This is the first trick of this article:

== Archimedes' Method

It's relatively easy to find out the total memory allocated at any given point in time.
For glibc, there's the https://man7.org/linux/man-pages/man3/mallinfo.3.html[mallinfo] function; a https://docs.rs/jemalloc-ctl/0.3.3/jemalloc_ctl/stats/struct.allocated.html[similar API] exists for jemalloc.
It's even possible to implement a https://doc.rust-lang.org/stable/std/alloc/trait.GlobalAlloc.html[`GlobalAlloc`] which tracks this number.

And, if you can measure total memory usage, you can measure the memory usage of any specific data structure by:

. noting the current memory usage
. dropping the data structure
. noting the current memory usage again

The difference between the two measurements is the size of the data structure.
And this is exactly what rust-analyzer does to find the largest caches: https://github.com/rust-analyzer/rust-analyzer/blob/b988c6f84e06bdc5562c70f28586b9eeaae3a39c/crates/ide_db/src/apply_change.rs#L104-L238[source].
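
A self-contained sketch of the whole trick, assuming single-threaded measurement (this is not rust-analyzer's actual implementation, just the idea boiled down):

[source,rust]
----
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

/// A wrapper around the system allocator that keeps a running
/// total of live heap bytes.
struct Counter;

static ALLOCATED: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for Counter {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        let ptr = System.alloc(layout);
        if !ptr.is_null() {
            ALLOCATED.fetch_add(layout.size(), Ordering::Relaxed);
        }
        ptr
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        System.dealloc(ptr, layout);
        ALLOCATED.fetch_sub(layout.size(), Ordering::Relaxed);
    }
}

#[global_allocator]
static GLOBAL: Counter = Counter;

/// Archimedes' method: note the usage, drop the value, note the usage again.
fn heap_size_of<T>(value: T) -> usize {
    let before = ALLOCATED.load(Ordering::Relaxed);
    drop(value);
    before - ALLOCATED.load(Ordering::Relaxed)
}

fn main() {
    let cache: Vec<u64> = (0..1_000).collect();
    // 1000 elements of 8 bytes each, plus any spare capacity.
    println!("cache occupied {} bytes", heap_size_of(cache));
}
----

Note that the measurement asks `ALLOCATED`, our allocator-level counter, rather than the OS, for exactly the reason discussed below.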

Two small notes about this method:

* It's important to ask the allocator about the available memory, and not the operating system.
  The OS can only tell how many pages the program consumes.
  Only the allocator knows which of those pages are free and which hold allocated objects.
* When measuring relative sizes, it's important to note the unaccounted-for amount at the end, so that the total adds up to 100%.
  It might be the case that the bottleneck lies in the dark matter outside of the explicit measurements!

== Amdahl's Estimator

The second trick is related to https://en.wikipedia.org/wiki/Amdahl's_law[Amdahl's law].
When optimizing a specific component, it's important to note not only how much more efficient it becomes, but also the overall contribution of the component to the system.
Making an algorithm twice as fast can improve the overall performance only by 5%, if the algorithm is only 10% of the whole task.
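
The arithmetic behind that figure, as a quick sanity check:

[source,rust]
----
/// Amdahl's law: overall speedup when a fraction `p` of the work
/// is made `s` times faster.
fn amdahl(p: f64, s: f64) -> f64 {
    1.0 / ((1.0 - p) + p / s)
}

fn main() {
    // An algorithm that is 10% of the task, made twice as fast:
    let overall = amdahl(0.10, 2.0);
    // 1 / (0.9 + 0.05) ≈ 1.053, i.e. roughly a 5% improvement.
    println!("overall improvement: {:.1}%", (overall - 1.0) * 100.0);
}
----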

In rust-analyzer's case, the optimization we are considering is adding interning to `Name`.
At the moment, a ``Name`` is represented with a small-size-optimized string (24 bytes inline + maybe some heap storage):

[source,rust]
----
struct Name {
    text: SmolStr,
}
----

Instead, we can use an interned index (4 bytes):

[source,rust]
----
struct Name {
    idx: u32,
}
----
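
The difference is easy to confirm with `std::mem::size_of`; in this sketch, `String` stands in for `SmolStr`, since both occupy 24 bytes inline on a 64-bit target:

[source,rust]
----
use std::mem::size_of;

// `String` as a stand-in for `SmolStr`: 24 inline bytes on 64-bit targets.
#[allow(dead_code)]
struct BigName {
    text: String,
}

#[allow(dead_code)]
struct SmallName {
    idx: u32,
}

fn main() {
    assert_eq!(size_of::<BigName>(), 24);
    assert_eq!(size_of::<SmallName>(), 4);
    println!("{}x smaller", size_of::<BigName>() / size_of::<SmallName>());
}
----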

However, just trying out this optimization is not easy, as an interner is a thorny piece of global state.
Is it worth it?

If we look at `Name` itself, it's pretty clear that the optimization is valuable: it reduces memory usage by 6x!
But how important is it in the grand scheme of things?
How do we measure the impact of ``Name``s on overall memory usage?
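
For reference, the core of an interner is small; the thorny part is making it global and thread-safe. A minimal non-global sketch (names hypothetical):

[source,rust]
----
use std::collections::HashMap;

/// Maps each distinct string to a dense `u32` index, and back.
#[derive(Default)]
struct Interner {
    map: HashMap<String, u32>,
    vec: Vec<String>,
}

impl Interner {
    fn intern(&mut self, text: &str) -> u32 {
        if let Some(&idx) = self.map.get(text) {
            return idx;
        }
        let idx = self.vec.len() as u32;
        self.vec.push(text.to_string());
        self.map.insert(text.to_string(), idx);
        idx
    }

    fn lookup(&self, idx: u32) -> &str {
        &self.vec[idx as usize]
    }
}

fn main() {
    let mut interner = Interner::default();
    let a = interner.intern("foo");
    let b = interner.intern("foo");
    assert_eq!(a, b); // equal strings intern to the same index
    assert_eq!(interner.lookup(a), "foo");
}
----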

One approach is to just apply the optimization and measure the improvement after the fact.
But there's a lazier way: instead of making the `Name` smaller and measuring the improvement, we can make it *bigger* and measure the worsening.
Specifically, it's easy to change the `Name` to this:

[source,rust]
----
struct Name {
    text: SmolStr,
    // Copy of `text`
    _ballast: SmolStr,
}
----

Now, if the new `Name` increases the overall memory consumption by `N`, we can estimate the total size of the old ``Name``s as `N` as well, as they are half the size.

Sometimes, quick and simple hacks work better than the finest instruments :)