From ad25039452c750b03881e4dac895c2bec4a97205 Mon Sep 17 00:00:00 2001
From: Eric Huss
Date: Tue, 23 Jun 2020 22:53:04 -0700
Subject: [PATCH 1/5] Inside LTO improvements.

---
 .../2020-06-24-lto-improvements.md | 89 +++++++++++++++++++
 1 file changed, 89 insertions(+)
 create mode 100644 posts/inside-rust/2020-06-24-lto-improvements.md

diff --git a/posts/inside-rust/2020-06-24-lto-improvements.md b/posts/inside-rust/2020-06-24-lto-improvements.md
new file mode 100644
index 000000000..473fc958e
--- /dev/null
+++ b/posts/inside-rust/2020-06-24-lto-improvements.md
@@ -0,0 +1,89 @@
+---
+layout: post
+title: "Disk space and LTO improvements"
+author: Eric Huss
+description: "Disk space and LTO improvements"
+team: the Cargo team
+---
+
+Thanks to the work of [Nicholas Nethercote] and [Alex Crichton], there have been some recent improvements that reduce the size of compiled libraries and improve compile-time performance, particularly when using LTO. This post dives into some of the details of what changed, and gives an estimate of the benefits.
+
+These changes have been added incrementally over the past three months, with the latest changes landing just a few days ago on the nightly channel. The bulk of the improvements will be found in the 1.46 stable release. It would be great for any projects that use LTO to test it out on the nightly channel and report any issues that arise.
+
+[Nicholas Nethercote]: https://github.com/nnethercote
+[Alex Crichton]: https://github.com/alexcrichton/
+
+## Background
+
+When compiling a library, `rustc` saves the output in an `rlib` file which is an [archive file]. This has historically contained these two things (among others):
+
+* Object code, which is the result of code generation. This is used during regular linking.
+* [LLVM bitcode], which is a binary representation of LLVM's intermediate representation. This can be used for [Link Time Optimization] (LTO).
+
+LTO is an optimization technique that can perform whole-program analysis. It analyzes all of the bitcode from every library at once, and performs optimizations and code generation. `rustc` supports several forms of LTO:
+
+* Fat LTO. This performs "full" LTO, which can take a long time to complete and may require a significant amount of memory.
+* [Thin LTO]. This is a lightweight version of "fat" LTO that can achieve similar performance improvements while taking much less time to complete.
+* Thin-local LTO. By default, `rustc` will split a crate into multiple "codegen units" so that they can be processed in parallel by LLVM. But this prevents some optimizations as code is separated into different codegen units, and is handled independently. Thin-local LTO will perform thin LTO across the codegen units within a single crate, bringing back some optimizations that would otherwise be lost by the separation. This is `rustc`'s default behavior if opt-level is greater than 0.
+
+## What has changed
+
+Changes have been made to both `rustc` and Cargo to control which libraries should include object code and which should include bitcode based on the project's [profile] LTO settings. If the project is not using LTO, then Cargo will instruct `rustc` to not place bitcode in the rlib files, which should reduce the amount of disk space used. This may have a small improvement in performance since `rustc` no longer needs to compress and write out the bitcode.
+
+If the project is using LTO, then Cargo will instruct `rustc` to not place object code in the rlib files, avoiding the expensive code generation step. This should improve the build time when building from scratch, and reduce the amount of disk space used.
+
+Two `rustc` flags are now available to control how the rlib is constructed:
+
+* [`-C linker-plugin-lto`] causes `rustc` to only place bitcode in the `.o` files, and skips code generation. Cargo uses this when the rlib is only intended for use with LTO. This can also be used when doing cross-language LTO.
+* [`-C embed-bitcode=no`] causes `rustc` to avoid placing bitcode in the rlib altogether. Cargo uses this when LTO is not being used, which reduces some disk space usage.
+
+Additionally, the method in which bitcode is embedded in the rlib has changed. Previously, `rustc` would place compressed bitcode as a `.bc.z` file in the rlib archive. Now, the bitcode is placed as an uncompressed section within each `.o` [object file] in the rlib archive. This avoids a small performance hit for compressing the bitcode, and also matches the standard format used by clang.
+
+## Improvements
+
+The following is a summary of improvements observed on a small number of real-world projects of small and medium size. The improvements for any particular project will depend heavily on the code, optimization settings, operating system, environment, and hardware. These were recorded with the 2020-06-21 nightly release on Linux with parallel job settings between 2 and 32.
+
+The performance wins for debug builds were anywhere from 0% to 4.7% faster. Larger binary crates tended to fare better than smaller library crates.
+
+LTO builds were recorded anywhere from 4% to 20% faster. Thin LTO fared consistently better than fat LTO.
+
+The number of parallel jobs also had a large impact on the amount of improvement. Lower parallel job counts saw substantially more benefit than higher ones. A project built with `-j2` can be 20% faster, whereas the same project at `-j32` would only be 1% faster. Presumably this is because the code-generation phase benefits from higher concurrency, so it was taking a relatively smaller total percentage of time.
+
+The overall target directory size is typically 20% to 30% smaller for debug builds. LTO builds did not see as much of an improvement, ranging from 11% to 19% smaller.
+
+## More details
+
+Nicholas Nethercote wrote about the journey to implement these changes at . It took several PRs across `rustc` and Cargo to make this happen:
+
+- [#66598](https://github.com/rust-lang/rust/pull/66598) — The original approach, which was decided to be too simplistic.
+- [#66961](https://github.com/rust-lang/rust/issues/66961) — The issue outlining the strategy that was employed.
+- [#70289](https://github.com/rust-lang/rust/pull/70289)
+  [#70297](https://github.com/rust-lang/rust/pull/70297)
+  [#70345](https://github.com/rust-lang/rust/pull/70345)
+  [#70384](https://github.com/rust-lang/rust/pull/70384)
+  [#70644](https://github.com/rust-lang/rust/pull/70644)
+  [#70729](https://github.com/rust-lang/rust/pull/70729)
+  [#71374](https://github.com/rust-lang/rust/pull/71374)
+  [#71716](https://github.com/rust-lang/rust/pull/71716)
+  [#71754](https://github.com/rust-lang/rust/pull/71754) — A series of refactorings to prepare for the new behavior and do some cleanup.
+- [#71323](https://github.com/rust-lang/rust/pull/71323) — Introduced a new flag to control whether or not bitcode is embedded.
+- [#70458](https://github.com/rust-lang/rust/pull/70458) [#71528](https://github.com/rust-lang/rust/pull/71528) — Switched how LLVM bitcode is embedded.
+- [#8066](https://github.com/rust-lang/cargo/pull/8066)
+  [#8192](https://github.com/rust-lang/cargo/pull/8192)
+  [#8204](https://github.com/rust-lang/cargo/pull/8204)
+  [#8226](https://github.com/rust-lang/cargo/pull/8226)
+  [#8254](https://github.com/rust-lang/cargo/pull/8254)
+  [#8349](https://github.com/rust-lang/cargo/pull/8349) — The series of Cargo changes to implement the new functionality.
+
+## Conclusion
+
+Although this is a conceptually simple change (LTO=bitcode, non-LTO=object code), it took quite a bit of preparation and work to make it happen. There were many edge cases and platform-specific behaviors to consider, and testing to perform. And, of course, the obligatory bike-shedding over the names of new command-line flags. This resulted in quite a substantial improvement in performance, particularly for LTO builds, and a huge improvement in disk space usage. Thanks to all of those that helped to make this happen!
+
+[archive file]: https://en.wikipedia.org/wiki/Ar_(Unix)
+[LLVM bitcode]: https://llvm.org/docs/BitCodeFormat.html
+[Link Time Optimization]: https://llvm.org/docs/LinkTimeOptimization.html
+[Thin LTO]: http://blog.llvm.org/2016/06/thinlto-scalable-and-incremental-lto.html
+[profile]: https://doc.rust-lang.org/cargo/reference/profiles.html
+[object file]: https://en.wikipedia.org/wiki/Object_file
+[`-C linker-plugin-lto`]: https://doc.rust-lang.org/nightly/rustc/codegen-options/#linker-plugin-lto
+[`-C embed-bitcode=no`]: https://doc.rust-lang.org/nightly/rustc/codegen-options/#embed-bitcode
From 0d53d3ca0a14839d7360965039e019b6c3746dc9 Mon Sep 17 00:00:00 2001
From: Eric Huss
Date: Wed, 24 Jun 2020 12:00:43 -0700
Subject: [PATCH 2/5] Update LTO post based on feedback.

---
 posts/inside-rust/2020-06-24-lto-improvements.md | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/posts/inside-rust/2020-06-24-lto-improvements.md b/posts/inside-rust/2020-06-24-lto-improvements.md
index 473fc958e..94957998e 100644
--- a/posts/inside-rust/2020-06-24-lto-improvements.md
+++ b/posts/inside-rust/2020-06-24-lto-improvements.md
@@ -15,10 +15,11 @@ These changes have been added incrementally over the past three months, with the
 
 ## Background
 
-When compiling a library, `rustc` saves the output in an `rlib` file which is an [archive file]. This has historically contained these two things (among others):
+When compiling a library, `rustc` saves the output in an `rlib` file which is an [archive file]. This has historically contained the following:
 
 * Object code, which is the result of code generation. This is used during regular linking.
 * [LLVM bitcode], which is a binary representation of LLVM's intermediate representation. This can be used for [Link Time Optimization] (LTO).
+* Rust-specific metadata, which covers [a wide range of data][metadata] about the crate.
 
 LTO is an optimization technique that can perform whole-program analysis. It analyzes all of the bitcode from every library at once, and performs optimizations and code generation. `rustc` supports several forms of LTO:
 
@@ -37,7 +38,7 @@ Two `rustc` flags are now available to control how the rlib is constructed:
 * [`-C linker-plugin-lto`] causes `rustc` to only place bitcode in the `.o` files, and skips code generation. Cargo uses this when the rlib is only intended for use with LTO. This can also be used when doing cross-language LTO.
 * [`-C embed-bitcode=no`] causes `rustc` to avoid placing bitcode in the rlib altogether. Cargo uses this when LTO is not being used, which reduces some disk space usage.
 
-Additionally, the method in which bitcode is embedded in the rlib has changed. Previously, `rustc` would place compressed bitcode as a `.bc.z` file in the rlib archive. Now, the bitcode is placed as an uncompressed section within each `.o` [object file] in the rlib archive. This avoids a small performance hit for compressing the bitcode, and also matches the standard format used by clang.
+Additionally, the method in which bitcode is embedded in the rlib has changed. Previously, `rustc` would place compressed bitcode as a `.bc.z` file in the rlib archive. Now, the bitcode is placed as an uncompressed section within each `.o` [object file] in the rlib archive. This can sometimes be a small performance benefit, because it avoids the cost of compressing the bitcode, and sometimes can be slower due to needing to write more data to disk. This change helped simplify the implementation, and also matches the behavior of clang's `-fembed-bitcode` option (typically used with Apple's iOS-based operating systems).
 
 ## Improvements
 
@@ -87,3 +88,4 @@ Although this is a conceptually simple change (LTO=bitcode, non-LTO=object code)
 [object file]: https://en.wikipedia.org/wiki/Object_file
 [`-C linker-plugin-lto`]: https://doc.rust-lang.org/nightly/rustc/codegen-options/#linker-plugin-lto
 [`-C embed-bitcode=no`]: https://doc.rust-lang.org/nightly/rustc/codegen-options/#embed-bitcode
+[metadata]: https://github.com/rust-lang/rust/blob/0b66a89735305ebac93894461559576495ab920e/src/librustc_metadata/rmeta/mod.rs#L172-L214
From f430fe483680b3f2d7c8ffe1fcdeec47440b4993 Mon Sep 17 00:00:00 2001
From: Eric Huss
Date: Wed, 24 Jun 2020 15:06:26 -0700
Subject: [PATCH 3/5] Update LTO post with more review feedback.

---
 posts/inside-rust/2020-06-24-lto-improvements.md | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/posts/inside-rust/2020-06-24-lto-improvements.md b/posts/inside-rust/2020-06-24-lto-improvements.md
index 94957998e..0e5f0e45e 100644
--- a/posts/inside-rust/2020-06-24-lto-improvements.md
+++ b/posts/inside-rust/2020-06-24-lto-improvements.md
@@ -24,7 +24,7 @@ When compiling a library, `rustc` saves the output in an `rlib` file which is an
 LTO is an optimization technique that can perform whole-program analysis. It analyzes all of the bitcode from every library at once, and performs optimizations and code generation. `rustc` supports several forms of LTO:
 
 * Fat LTO. This performs "full" LTO, which can take a long time to complete and may require a significant amount of memory.
-* [Thin LTO]. This is a lightweight version of "fat" LTO that can achieve similar performance improvements while taking much less time to complete.
+* [Thin LTO]. This LTO variant supports much better parallelism than fat LTO. It can achieve similar performance improvements as fat LTO (sometimes even better!), while taking much less total time by taking advantage of more CPUs.
 * Thin-local LTO. By default, `rustc` will split a crate into multiple "codegen units" so that they can be processed in parallel by LLVM. But this prevents some optimizations as code is separated into different codegen units, and is handled independently. Thin-local LTO will perform thin LTO across the codegen units within a single crate, bringing back some optimizations that would otherwise be lost by the separation. This is `rustc`'s default behavior if opt-level is greater than 0.
 
 ## What has changed
@@ -35,7 +35,7 @@ If the project is using LTO, then Cargo will instruct `rustc` to not place objec
 
 Two `rustc` flags are now available to control how the rlib is constructed:
 
-* [`-C linker-plugin-lto`] causes `rustc` to only place bitcode in the `.o` files, and skips code generation. Cargo uses this when the rlib is only intended for use with LTO. This can also be used when doing cross-language LTO.
+* [`-C linker-plugin-lto`] causes `rustc` to only place bitcode in the `.o` files, and skips code generation. This flag was [originally added][linker-plugin-lto-track] to support cross-language LTO. Cargo now uses this when the rlib is only intended for use with LTO.
 * [`-C embed-bitcode=no`] causes `rustc` to avoid placing bitcode in the rlib altogether. Cargo uses this when LTO is not being used, which reduces some disk space usage.
 
 Additionally, the method in which bitcode is embedded in the rlib has changed. Previously, `rustc` would place compressed bitcode as a `.bc.z` file in the rlib archive. Now, the bitcode is placed as an uncompressed section within each `.o` [object file] in the rlib archive. This can sometimes be a small performance benefit, because it avoids the cost of compressing the bitcode, and sometimes can be slower due to needing to write more data to disk. This change helped simplify the implementation, and also matches the behavior of clang's `-fembed-bitcode` option (typically used with Apple's iOS-based operating systems).
@@ -89,3 +89,4 @@ Although this is a conceptually simple change (LTO=bitcode, non-LTO=object code)
 [`-C linker-plugin-lto`]: https://doc.rust-lang.org/nightly/rustc/codegen-options/#linker-plugin-lto
 [`-C embed-bitcode=no`]: https://doc.rust-lang.org/nightly/rustc/codegen-options/#embed-bitcode
 [metadata]: https://github.com/rust-lang/rust/blob/0b66a89735305ebac93894461559576495ab920e/src/librustc_metadata/rmeta/mod.rs#L172-L214
+[linker-plugin-lto-track]: https://github.com/rust-lang/rust/issues/49879
From dbb3634aac07e967ef0dc0d24ef3f9e5d7644758 Mon Sep 17 00:00:00 2001
From: Eric Huss
Date: Wed, 24 Jun 2020 15:14:43 -0700
Subject: [PATCH 4/5] Note initial release date.

---
 posts/inside-rust/2020-06-24-lto-improvements.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/posts/inside-rust/2020-06-24-lto-improvements.md b/posts/inside-rust/2020-06-24-lto-improvements.md
index 0e5f0e45e..f3f3de1bd 100644
--- a/posts/inside-rust/2020-06-24-lto-improvements.md
+++ b/posts/inside-rust/2020-06-24-lto-improvements.md
@@ -8,7 +8,7 @@ team: the Cargo team
Date: Sun, 28 Jun 2020 09:29:01 -0700
Subject: [PATCH 5/5] Update post date.

---
 ...0-06-24-lto-improvements.md => 2020-06-29-lto-improvements.md} | 0
 1 file changed, 0 insertions(+), 0 deletions(-)
 rename posts/inside-rust/{2020-06-24-lto-improvements.md => 2020-06-29-lto-improvements.md} (100%)

diff --git a/posts/inside-rust/2020-06-24-lto-improvements.md b/posts/inside-rust/2020-06-29-lto-improvements.md
similarity index 100%
rename from posts/inside-rust/2020-06-24-lto-improvements.md
rename to posts/inside-rust/2020-06-29-lto-improvements.md