= Measuring Memory Usage in Rust
@matklad
:sectanchors:
:experimental:
:page-layout: post

****
rust-analyzer is a new "IDE backend" for the https://www.rust-lang.org/[Rust] programming language.
Support rust-analyzer on https://opencollective.com/rust-analyzer/[Open Collective] or https://github.com/sponsors/rust-analyzer[GitHub Sponsors].
****

This post documents a couple of fun tricks we use in rust-analyzer for measuring memory consumption.

In general, there are two broad approaches to profiling the memory usage of a program.

_The first approach_ is based on "`heap parsing`".
At a particular point in time, the profiler looks at all the memory currently occupied by the program (the heap).
In its raw form, the memory is just a bag of bytes, `Vec<u8>`.
However, the profiler, with some help from the language's runtime, is able to re-interpret these bytes as collections of objects ("`parse the heap`").
It then traverses the graph of objects, counting how many instances of each type there are and how much memory they occupy.
The profiler also tracks ownership relations, to ferret out facts like "`90% of the strings in this program are owned by the ``Config`` struct`".
This is the approach I am familiar with from the JVM ecosystem.
Java's garbage collector needs to understand the heap to find unreachable objects, and the same information is used to analyze heap snapshots.

_The second approach_ is based on instrumenting the calls to allocation and deallocation routines.
The profiler captures backtraces when the program calls `malloc` and `free` and constructs a flamegraph displaying the "`hot`" functions which allocate a lot.
This is how, for example, https://github.com/KDE/heaptrack[heaptrack] works (see also https://github.com/cuviper/alloc_geiger[alloc geiger]).

The two approaches are complementary.
If the problem is that the application makes too many short-lived allocations (instead of re-using buffers), it would be invisible to the first approach, but very clear in the second one.
If the problem is that, in a steady state, the application uses too much memory, the first approach works better for pointing out which data structures need the most attention.

In rust-analyzer, we are generally interested in keeping the overall memory usage small, so we can make better use of the heap parsing approach.
Specifically, most of rust-analyzer's data is stored in incremental computation tables, and we want to know which table is the heaviest.

Unfortunately, Rust doesn't use garbage collection, so just parsing the heap bytes at runtime is impossible.
The best available alternative is instrumenting data structures for the purpose of measuring memory size.
That is, writing a proc-macro which adds a `fn total_size(&self) -> usize` method to annotated types, and calling that manually from the root of the data.
There is Servo's https://github.com/servo/servo/tree/2d3811c21bf1c02911d5002f9670349c5cf4f500/components/malloc_size_of[`malloc_size_of`] crate for doing that, but it is not published to crates.io.
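
In spirit, the instrumentation amounts to something like the following hand-written sketch.
The `TotalSize` trait, its impls, and the `Config` type are illustrative stand-ins, not the actual `malloc_size_of` API:

[source,rust]
----
use std::mem::size_of;

/// Deep size of a value: its inline size plus its owned heap allocations.
/// (A stand-in for what a `#[derive(TotalSize)]` proc-macro would generate.)
trait TotalSize {
    fn total_size(&self) -> usize;
}

impl TotalSize for String {
    fn total_size(&self) -> usize {
        size_of::<String>() + self.capacity()
    }
}

impl<T: TotalSize> TotalSize for Vec<T> {
    fn total_size(&self) -> usize {
        let elems: usize = self.iter().map(|it| it.total_size()).sum();
        // Spare capacity is allocated, but holds no elements yet.
        let spare = (self.capacity() - self.len()) * size_of::<T>();
        size_of::<Vec<T>>() + elems + spare
    }
}

// What the derive would expand to for an annotated type:
struct Config {
    name: String,
    args: Vec<String>,
}

impl TotalSize for Config {
    fn total_size(&self) -> usize {
        self.name.total_size() + self.args.total_size()
    }
}

fn main() {
    let config = Config {
        name: "rust-analyzer".to_string(),
        args: vec!["--log-file".to_string()],
    };
    // Called manually from the root of the data.
    println!("config occupies {} bytes", config.total_size());
}
----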

Another alternative is running the program under valgrind to gain runtime introspectability.
https://www.valgrind.org/docs/manual/ms-manual.html[Massif] and https://www.valgrind.org/docs/manual/dh-manual.html[DHAT] work that way.
Running under valgrind is pretty slow, and still doesn't give Java-level fidelity.

Instead, rust-analyzer mainly relies on a much simpler approach for figuring out which things are heavy.
This is the first trick of this article:

== Archimedes' Method

It's relatively easy to find out the total memory allocated at any given point in time.
For glibc, there's the https://man7.org/linux/man-pages/man3/mallinfo.3.html[mallinfo] function; a https://docs.rs/jemalloc-ctl/0.3.3/jemalloc_ctl/stats/struct.allocated.html[similar API] exists for jemalloc.
It's even possible to implement a https://doc.rust-lang.org/stable/std/alloc/trait.GlobalAlloc.html[`GlobalAlloc`] which tracks this number.

And, if you can measure total memory usage, you can measure the memory usage of any specific data structure by:

. noting the current memory usage
. dropping the data structure
. noting the current memory usage again

The difference between the two measurements is the size of the data structure.
And this is exactly what rust-analyzer does to find the largest caches: https://github.com/rust-analyzer/rust-analyzer/blob/b988c6f84e06bdc5562c70f28586b9eeaae3a39c/crates/ide_db/src/apply_change.rs#L104-L238[source].
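
A self-contained sketch of the whole trick, assuming single-threaded measurement (this is not rust-analyzer's actual implementation, just the idea boiled down):

[source,rust]
----
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

/// A wrapper around the system allocator that keeps a running
/// total of live heap bytes.
struct Counter;

static ALLOCATED: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for Counter {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        let ptr = System.alloc(layout);
        if !ptr.is_null() {
            ALLOCATED.fetch_add(layout.size(), Ordering::Relaxed);
        }
        ptr
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        System.dealloc(ptr, layout);
        ALLOCATED.fetch_sub(layout.size(), Ordering::Relaxed);
    }
}

#[global_allocator]
static GLOBAL: Counter = Counter;

/// Archimedes' method: note the usage, drop the value, note the usage again.
fn heap_size_of<T>(value: T) -> usize {
    let before = ALLOCATED.load(Ordering::Relaxed);
    drop(value);
    before - ALLOCATED.load(Ordering::Relaxed)
}

fn main() {
    let cache: Vec<u64> = (0..1_000).collect();
    // 1000 elements of 8 bytes each, plus any spare capacity.
    println!("cache occupied {} bytes", heap_size_of(cache));
}
----

Note that the measurement asks `ALLOCATED`, our allocator-level counter, rather than the OS, for exactly the reason discussed below.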

Two small notes about this method:

* It's important to ask the allocator about the available memory, and not the operating system.
  The OS can only tell how many pages the program consumes.
  Only the allocator knows which of those pages are free and which hold allocated objects.
* When measuring relative sizes, it's important to note the unaccounted-for amount at the end, so that the total adds up to 100%.
  It might be the case that the bottleneck lies in the dark matter outside of the explicit measurements!

== Amdahl's Estimator

The second trick is related to https://en.wikipedia.org/wiki/Amdahl's_law[Amdahl's law].
When optimizing a specific component, it's important to note not only how much more efficient it becomes, but also the overall contribution of the component to the system.
Making an algorithm twice as fast can improve the overall performance only by 5%, if the algorithm is only 10% of the whole task.
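
The arithmetic behind that figure, as a quick sanity check:

[source,rust]
----
/// Amdahl's law: overall speedup when a fraction `p` of the work
/// is made `s` times faster.
fn amdahl(p: f64, s: f64) -> f64 {
    1.0 / ((1.0 - p) + p / s)
}

fn main() {
    // An algorithm that is 10% of the task, made twice as fast:
    let overall = amdahl(0.10, 2.0);
    // 1 / (0.9 + 0.05) ≈ 1.053, i.e. roughly a 5% improvement.
    println!("overall improvement: {:.1}%", (overall - 1.0) * 100.0);
}
----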

In rust-analyzer's case, the optimization we are considering is adding interning to `Name`.
At the moment, a ``Name`` is represented with a small-size-optimized string (24 bytes inline + maybe some heap storage):

[source,rust]
----
struct Name {
    text: SmolStr,
}
----

Instead, we can use an interned index (4 bytes):

[source,rust]
----
struct Name {
    idx: u32,
}
----
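
The difference is easy to confirm with `std::mem::size_of`; in this sketch, `String` stands in for `SmolStr`, since both occupy 24 bytes inline on a 64-bit target:

[source,rust]
----
use std::mem::size_of;

// `String` as a stand-in for `SmolStr`: 24 inline bytes on 64-bit targets.
#[allow(dead_code)]
struct BigName {
    text: String,
}

#[allow(dead_code)]
struct SmallName {
    idx: u32,
}

fn main() {
    assert_eq!(size_of::<BigName>(), 24);
    assert_eq!(size_of::<SmallName>(), 4);
    println!("{}x smaller", size_of::<BigName>() / size_of::<SmallName>());
}
----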

However, just trying out this optimization is not easy, as an interner is a thorny piece of global state.
Is it worth it?

If we look at `Name` itself, it's pretty clear that the optimization is valuable: it reduces memory usage by 6x!
But how important is it in the grand scheme of things?
How do we measure the impact of ``Name``s on overall memory usage?
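
For reference, the core of an interner is small; the thorny part is making it global and thread-safe. A minimal non-global sketch (names hypothetical):

[source,rust]
----
use std::collections::HashMap;

/// Maps each distinct string to a dense `u32` index, and back.
#[derive(Default)]
struct Interner {
    map: HashMap<String, u32>,
    vec: Vec<String>,
}

impl Interner {
    fn intern(&mut self, text: &str) -> u32 {
        if let Some(&idx) = self.map.get(text) {
            return idx;
        }
        let idx = self.vec.len() as u32;
        self.vec.push(text.to_string());
        self.map.insert(text.to_string(), idx);
        idx
    }

    fn lookup(&self, idx: u32) -> &str {
        &self.vec[idx as usize]
    }
}

fn main() {
    let mut interner = Interner::default();
    let a = interner.intern("foo");
    let b = interner.intern("foo");
    assert_eq!(a, b); // equal strings intern to the same index
    assert_eq!(interner.lookup(a), "foo");
}
----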

One approach is to just apply the optimization and measure the improvement after the fact.
But there's a lazier way: instead of making the `Name` smaller and measuring the improvement, we can make it *bigger* and measure the worsening.
Specifically, it's easy to change the `Name` to this:

[source,rust]
----
struct Name {
    text: SmolStr,
    // Copy of `text`
    _ballast: SmolStr,
}
----

Now, if the new `Name` increases the overall memory consumption by `N`, we can estimate the total size of the old ``Name``s as `N` as well, as they are half the size.

Sometimes, quick and simple hacks work better than the finest instruments :)