Inside LTO improvements. #627

ehuss · 2020-06-24T05:56:13Z

This is just a random post about some performance and disk-space improvements that have recently been introduced. I kinda felt someone might find it interesting.

rust-highfive · 2020-06-24T05:56:15Z

r? @nikomatsakis

(rust_highfive has picked a reviewer for you, use r? to override)

ehuss · 2020-06-24T05:57:11Z

@alexcrichton or @nnethercote, if either of you could look this over for any major errors, I would appreciate it!

nnethercote

This is really good, thanks for writing it. I have a couple of minor suggestions below.

nnethercote · 2020-06-24T10:27:31Z

posts/inside-rust/2020-06-24-lto-improvements.md

+
+## Background
+
+When compiling a library, `rustc` saves the output in an `rlib` file which is an [archive file]. This has historically contained these two things (among others):


Not two but three things: there is metadata as well.

Maybe the metadata is covered by the "(among others)", but given that it's the only thing in "(among others)", it's probably just worth mentioning outright.

Yea, it seemed awkward when I wrote it. I was attempting to focus on just the things that changed. I went ahead and just added the metadata as the 3rd major part.

nnethercote · 2020-06-24T10:30:39Z

posts/inside-rust/2020-06-24-lto-improvements.md

+* [`-C linker-plugin-lto`] causes `rustc` to only place bitcode in the `.o` files, and skips code generation. Cargo uses this when the rlib is only intended for use with LTO. This can also be used when doing cross-language LTO.
+* [`-C embed-bitcode=no`] causes `rustc` to avoid placing bitcode in the rlib altogether. Cargo uses this when LTO is not being used, which reduces some disk space usage.
+
+Additionally, the method in which bitcode is embedded in the rlib has changed. Previously, `rustc` would place compressed bitcode as a `.bc.z` file in the rlib archive. Now, the bitcode is placed as an uncompressed section within each `.o` [object file] in the rlib archive. This avoids a small performance hit for compressing the bitcode, and also matches the standard format used by clang.


The performance effect here is a mixed bag. Sometimes it's a speedup because we avoid the compression cost. Sometimes it's a slowdown because more data must be written. The combined effect is hard to predict and depends on the program. Avoiding the compression certainly makes the compiler simpler, though.

It's also probably worth pointing out that "standard format" is a bit of a stretch: as we discovered, only a few linkers actually work with the default output, so we had to frob with some global inline asm to get the sections to work out on all platforms.

In reality this "standard format" was only created for iOS where they ship bitcode by default but wanted a mode of shipping both for debugging and such. We're kinda just piggy-backing on it at this point. In any case the words written down here are correct, so I'll leave it up to you if you'd like to add more context!

Thanks for the info! Updated with a little extra information.
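For readers skimming this thread, the choice between the two flags in the hunk above can be sketched roughly as follows. This is only an illustration: `RlibUse` and `rlib_flags` are hypothetical names, not Cargo's actual internals, and the "keep both" default is an assumption based on the post's statement that rlibs have historically contained both object code and bitcode.

```rust
/// How an rlib is expected to be consumed later in the build.
/// (Hypothetical type for illustration; not Cargo's real data model.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum RlibUse {
    /// Only ever linked through LTO, so bitcode alone is enough.
    LtoOnly,
    /// Never linked through LTO, so object code alone is enough.
    NoLto,
    /// Unknown or mixed use: keep both object code and bitcode (assumed default).
    Both,
}

/// Returns the extra `rustc` arguments for the given use.
fn rlib_flags(usage: RlibUse) -> Vec<&'static str> {
    match usage {
        // Bitcode only in the `.o` files; code generation is skipped.
        RlibUse::LtoOnly => vec!["-C", "linker-plugin-lto"],
        // No bitcode in the rlib at all, which saves disk space.
        RlibUse::NoLto => vec!["-C", "embed-bitcode=no"],
        // No extra flags: the rlib carries both.
        RlibUse::Both => vec![],
    }
}

fn main() {
    for &usage in &[RlibUse::LtoOnly, RlibUse::NoLto, RlibUse::Both] {
        println!("{:?}: {:?}", usage, rlib_flags(usage));
    }
}
```

The point of the sketch is only that each rlib gets whichever half it actually needs, instead of always paying for both.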

alexcrichton

Thanks so much for taking the time to write this up!

alexcrichton · 2020-06-24T13:57:04Z

posts/inside-rust/2020-06-24-lto-improvements.md

+LTO is an optimization technique that can perform whole-program analysis. It analyzes all of the bitcode from every library at once, and performs optimizations and code generation. `rustc` supports several forms of LTO:
+
+* Fat LTO. This performs "full" LTO, which can take a long time to complete and may require a significant amount of memory.
+* [Thin LTO]. This is a lightweight version of "fat" LTO that can achieve similar performance improvements while taking much less time to complete.


"much less time" isn't necessarily right but rather "it does stuff in parallel while fat LTO is single threaded". I'd suspect that thin LTO does more work in terms of CPU hours but no one building Rust has only one core nowadays!

@alexcrichton Because of the phrasing of this sentence, is there ever a time where fat LTO is better than thin LTO?

Heh I wondered the same thing when ThinLTO came along! Fat LTO, however, produces smaller binaries because it can extremely aggressively internalize functions. It can also produce slightly faster binaries since total knowledge in a compiler is typically better than fragmented knowledge. ThinLTO's primary purpose is to figure out what to inline across codegen units, while Fat LTO basically says "eh just do your normal thing but here's all the code in the world".

That makes sense. Is there a way we can rephrase this to make this point more clear? The way it stands now makes it seem like thin is always better. "Similar performance while taking much less time"

I've tweaked the wording a little to try to make it clearer.

I think as with any optimization settings, every project will need to experiment with different settings to see what works best for them.

alexcrichton · 2020-06-24T13:58:38Z

posts/inside-rust/2020-06-24-lto-improvements.md

+
+Two `rustc` flags are now available to control how the rlib is constructed:
+
+* [`-C linker-plugin-lto`] causes `rustc` to only place bitcode in the `.o` files, and skips code generation. Cargo uses this when the rlib is only intended for use with LTO. This can also be used when doing cross-language LTO.


It might be worth noting that this flag is being repurposed for Cargo's use: it was first added for what the name says on the tin, and it just so happens to work for Cargo's use case too!

alexcrichton · 2020-06-24T14:00:38Z

posts/inside-rust/2020-06-24-lto-improvements.md

+* [`-C linker-plugin-lto`] causes `rustc` to only place bitcode in the `.o` files, and skips code generation. Cargo uses this when the rlib is only intended for use with LTO. This can also be used when doing cross-language LTO.
+* [`-C embed-bitcode=no`] causes `rustc` to avoid placing bitcode in the rlib altogether. Cargo uses this when LTO is not being used, which reduces some disk space usage.
+
+Additionally, the method in which bitcode is embedded in the rlib has changed. Previously, `rustc` would place compressed bitcode as a `.bc.z` file in the rlib archive. Now, the bitcode is placed as an uncompressed section within each `.o` [object file] in the rlib archive. This avoids a small performance hit for compressing the bitcode, and also matches the standard format used by clang.


It's also probably worth pointing out that "standard format" is a bit of a stretch: as we discovered, only a few linkers actually work with the default output, so we had to frob with some global inline asm to get the sections to work out on all platforms.

In reality this "standard format" was only created for iOS where they ship bitcode by default but wanted a mode of shipping both for debugging and such. We're kinda just piggy-backing on it at this point. In any case the words written down here are correct, so I'll leave it up to you if you'd like to add more context!

alexcrichton · 2020-06-24T14:01:49Z

posts/inside-rust/2020-06-24-lto-improvements.md

+
+LTO builds were recorded anywhere from 4% to 20% faster. Thin LTO fared consistently better than fat LTO.
+
+The number of parallel jobs also had a large impact on the amount of improvement. Lower parallel job counts saw substantially more benefit than higher ones. A project built with `-j2` can be 20% faster, whereas the same project at `-j32` would only be 1% faster. Presumably this is because the code-generation phase benefits from higher concurrency, so it was taking a relatively smaller total percentage of time.


"Lower parallel job counts saw substantially more benefit than higher ones" -- oh man that's pipelining in action I think!
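The intuition in the quoted paragraph can be illustrated with a toy model. This is a sketch under stated assumptions, not how the numbers in the post were measured: wall time is modeled as a serial part plus code-generation work divided evenly across jobs, and the workload figures below are made up for illustration.

```rust
/// Toy model: wall time = serial part + parallelizable codegen work / jobs.
/// (An assumed simplification; real builds do not scale this cleanly.)
fn wall_time(serial: f64, codegen: f64, jobs: u32) -> f64 {
    serial + codegen / f64::from(jobs)
}

/// Fractional improvement when `skipped` units of codegen work are removed.
fn improvement(serial: f64, codegen: f64, skipped: f64, jobs: u32) -> f64 {
    let before = wall_time(serial, codegen, jobs);
    let after = wall_time(serial, codegen - skipped, jobs);
    (before - after) / before
}

fn main() {
    // Made-up workload units, not measurements from the post.
    let (serial, codegen, skipped) = (30.0, 100.0, 40.0);
    for &jobs in &[2u32, 8, 32] {
        let pct = 100.0 * improvement(serial, codegen, skipped, jobs);
        println!("-j{}: {:.1}% faster", jobs, pct);
    }
}
```

In this model the skipped code generation is a large share of wall time at low job counts and a small share at high ones, which matches the direction of the effect described above, though not its exact magnitude.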

nikomatsakis

No detailed comments about the contents, but I applaud you writing the post! 👏 =)

Mark-Simulacrum

One minor nit, but I don't think it needs resolving either

Mark-Simulacrum · 2020-06-24T21:51:30Z

posts/inside-rust/2020-06-24-lto-improvements.md

+
+Thanks to the work of [Nicholas Nethercote] and [Alex Crichton], there have been some recent improvements that reduce the size of compiled libraries and improve compile-time performance, particularly when using LTO. This post dives into some of the details of what changed, and an estimation of the benefits.
+
+These changes have been added incrementally over the past three months, with the latest changes landing just a few days ago on the nightly channel. The bulk of the improvements will be found in the 1.46 stable release. It would be great for any projects that use LTO to test it out on the nightly channel and report any issues that arise.


It would be great to include a concrete date (or even commit) from which support is fully available on nightly, just so that people testing can be certain they're in good shape. (In theory they can just `rustup update`, I guess, but if there are regressions or whatever, it would be useful.)

Sounds good, added the date for the most recent fixes.

nikomatsakis · 2020-06-26T20:26:19Z

@ehuss let me know when you want to post!

ehuss · 2020-06-27T16:04:24Z

This should be ready to go!

Mark-Simulacrum · 2020-06-28T12:49:45Z

Want to bump the date in the file to Monday and we can merge tomorrow?

Inside LTO improvements.

ad25039

rust-highfive assigned nikomatsakis Jun 24, 2020

nnethercote approved these changes Jun 24, 2020

alexcrichton approved these changes Jun 24, 2020

nikomatsakis approved these changes Jun 24, 2020

Update LTO post based on feedback.

0d53d3c

Mark-Simulacrum approved these changes Jun 24, 2020

ehuss added 2 commits June 24, 2020 15:06

Update LTO post with more review feedback.

f430fe4

Note initial release date.

dbb3634

ehuss force-pushed the inside-lto branch from e8eb625 to dbb3634 Compare June 24, 2020 22:20

Update post date.

76b5d57

nikomatsakis merged commit c65a377 into rust-lang:master Jun 29, 2020

