Inside LTO improvements. #627

Merged
merged 5 commits into from
Jun 29, 2020

@ehuss (Contributor) commented Jun 24, 2020

This is just a random post about some performance and disk-space improvements that have recently been introduced. I kinda felt someone might find it interesting.

@rust-highfive

r? @nikomatsakis

(rust_highfive has picked a reviewer for you, use r? to override)

@ehuss (Contributor Author) commented Jun 24, 2020

@alexcrichton or @nnethercote, if either of you could look this over for any major errors, I would appreciate it!

@nnethercote (Contributor) left a comment

This is really good, thanks for writing it. I have a couple of minor suggestions below.


## Background

When compiling a library, `rustc` saves the output in an `rlib` file which is an [archive file]. This has historically contained these two things (among others):

Contributor:

Not two but three things: there is metadata as well.

Contributor:

Maybe the metadata is covered by the "(among others)", but given that it's the only thing in "(among others)", it's probably just worth mentioning outright.

Contributor Author:

Yea, it seemed awkward when I wrote it. I was attempting to focus on just the things that changed. I went ahead and just added the metadata as the 3rd major part.

* [`-C linker-plugin-lto`] causes `rustc` to only place bitcode in the `.o` files, and skips code generation. Cargo uses this when the rlib is only intended for use with LTO. This can also be used when doing cross-language LTO.
* [`-C embed-bitcode=no`] causes `rustc` to avoid placing bitcode in the rlib altogether. Cargo uses this when LTO is not being used, which reduces some disk space usage.

Additionally, the method in which bitcode is embedded in the rlib has changed. Previously, `rustc` would place compressed bitcode as a `.bc.z` file in the rlib archive. Now, the bitcode is placed as an uncompressed section within each `.o` [object file] in the rlib archive. This avoids a small performance hit for compressing the bitcode, and also matches the standard format used by clang.

Contributor:

The performance effect here is a mixed bag. Sometimes it's a speedup because we avoid the compression cost. Sometimes it's a slowdown because more data must be written. The combined effect is hard to predict and depends on the program. Avoiding the compression certainly makes the compiler simpler, though.

Member:

It's also probably worth pointing out that "standard format" is a bit of a stretch. As we discovered, only a few linkers actually work with the default output, so we had to frob with some global inline asm to get the sections to work out on all platforms.

In reality this "standard format" was only created for iOS where they ship bitcode by default but wanted a mode of shipping both for debugging and such. We're kinda just piggy-backing on it at this point. In any case the words written down here are correct, so I'll leave it up to you if you'd like to add more context!

Contributor Author:

Thanks for the info! Updated with a little extra information.

@alexcrichton (Member) left a comment

Thanks so much for taking the time to write this up!

LTO is an optimization technique that can perform whole-program analysis. It analyzes all of the bitcode from every library at once, and performs optimizations and code generation. `rustc` supports several forms of LTO:

* Fat LTO. This performs "full" LTO, which can take a long time to complete and may require a significant amount of memory.
* [Thin LTO]. This is a lightweight version of "fat" LTO that can achieve similar performance improvements while taking much less time to complete.

Member:

"much less time" isn't necessarily right but rather "it does stuff in parallel while fat LTO is single threaded". I'd suspect that thin LTO does more work in terms of CPU hours but no one building Rust has only one core nowadays!

@alexcrichton Because of the phrasing of this sentence, is there ever a time where fat LTO is better than thin LTO?

Member:

Heh I wondered the same thing when ThinLTO came along! Fat LTO, however, produces smaller binaries because it can extremely aggressively internalize functions. It can also produce slightly faster binaries since total knowledge in a compiler is typically better than fragmented knowledge. ThinLTO's primary purpose is to figure out what to inline across codegen units, while Fat LTO basically says "eh just do your normal thing but here's all the code in the world".

That makes sense. Is there a way we can rephrase this to make this point more clear? The way it stands now makes it seem like thin is always better. "Similar performance while taking much less time"

Contributor Author:

I've tweaked the wording a little to try to make it clearer.

I think as with any optimization settings, every project will need to experiment with different settings to see what works best for them.
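
For anyone who wants to run that experiment, the two LTO modes discussed in this thread are selected through the `lto` key of a Cargo profile. A minimal sketch, assuming the release profile is the one being tuned; which setting wins is project-specific, so treat the value below as a starting point rather than a recommendation:

```toml
# Cargo.toml
[profile.release]
# "thin" spreads the LTO work across threads; "fat" looks at all the code at
# once and, per the discussion above, may give smaller or slightly faster binaries.
lto = "thin"   # also try "fat" and compare build time and binary size
```

Comparing wall-clock build time and final binary size for both values is usually enough to pick one.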


Two `rustc` flags are now available to control how the rlib is constructed:

* [`-C linker-plugin-lto`] causes `rustc` to only place bitcode in the `.o` files, and skips code generation. Cargo uses this when the rlib is only intended for use with LTO. This can also be used when doing cross-language LTO.

Member:

It might be worth noting that this flag is being repurposed for Cargo's use. It was first added for what the name says on the tin, and it just so happens to work for Cargo's use case too!
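
The flag choice described in the quoted text can be sketched as a small decision function. This is only an illustration, not Cargo's actual implementation: the `RlibUse` enum and `rlib_flags` function are hypothetical names, and the "pass no extra flag to keep both object code and bitcode" case is an assumption based on the historical rlib layout described in the post.

```rust
/// How an rlib is expected to be consumed (hypothetical type, for illustration only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum RlibUse {
    /// The rlib is only intended for use with LTO, so bitcode alone is enough.
    LtoOnly,
    /// LTO is not being used, so object code alone is enough.
    NoLto,
    /// Either is possible; assumed here to mean "keep both object code and bitcode".
    Either,
}

/// Returns the extra `rustc` flags for a given use, following the two bullets quoted above.
fn rlib_flags(usage: RlibUse) -> Vec<&'static str> {
    match usage {
        // Only bitcode in the `.o` files; code generation is skipped.
        RlibUse::LtoOnly => vec!["-C", "linker-plugin-lto"],
        // No bitcode in the rlib at all, which saves disk space.
        RlibUse::NoLto => vec!["-C", "embed-bitcode=no"],
        // Assumption: no extra flag is needed to get both.
        RlibUse::Either => vec![],
    }
}

fn main() {
    for usage in [RlibUse::LtoOnly, RlibUse::NoLto, RlibUse::Either].iter().copied() {
        println!("{:?}: {:?}", usage, rlib_flags(usage));
    }
}
```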


LTO builds were recorded anywhere from 4% to 20% faster. Thin LTO fared consistently better than fat LTO.

The number of parallel jobs also had a large impact on the amount of improvement. Lower parallel job counts saw substantially more benefit than higher ones. A project built with `-j2` can be 20% faster, whereas the same project at `-j32` would only be 1% faster. Presumably this is because the code-generation phase benefits from higher concurrency, so it was taking a relatively smaller total percentage of time.

Member:

"Lower parallel job counts saw substantially more benefit than higher ones" -- oh man that's pipelining in action I think!
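
The job-count effect in the quoted paragraph can be illustrated with a toy model. The sketch below assumes a build is one part that does not scale with `-j` plus one part that divides evenly across jobs, and that the improvement removes a fixed amount of the parallel part. The function names and every number are made up for illustration; none of them are measurements from the post.

```rust
/// Estimated wall time: a serial part plus parallel work split evenly across jobs.
fn wall_time(serial_secs: f64, parallel_core_secs: f64, jobs: u32) -> f64 {
    serial_secs + parallel_core_secs / f64::from(jobs)
}

/// Fraction of wall time saved when `avoided_core_secs` of the parallel work is skipped.
fn fraction_saved(
    serial_secs: f64,
    parallel_core_secs: f64,
    avoided_core_secs: f64,
    jobs: u32,
) -> f64 {
    let before = wall_time(serial_secs, parallel_core_secs, jobs);
    let after = wall_time(serial_secs, parallel_core_secs - avoided_core_secs, jobs);
    (before - after) / before
}

fn main() {
    // Hypothetical inputs: 100s that never parallelizes, 400 core-seconds of
    // parallel work, of which 100 core-seconds is code generation that is now skipped.
    let (serial, parallel, avoided) = (100.0, 400.0, 100.0);
    for jobs in [2u32, 8, 32].iter().copied() {
        let saved = fraction_saved(serial, parallel, avoided, jobs);
        println!("-j{}: {:.1}% faster", jobs, saved * 100.0);
    }
}
```

In this model the skipped work is divided by the job count while the serial part is not, so the relative saving shrinks as `-j` grows, which matches the direction of the reported results.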

@nikomatsakis (Contributor) left a comment

No detailed comments about the contents, but I applaud you writing the post! 👏 =)

@Mark-Simulacrum (Member) left a comment

One minor nit, but I don't think it needs resolving either


Thanks to the work of [Nicholas Nethercote] and [Alex Crichton], there have been some recent improvements that reduce the size of compiled libraries and improve compile-time performance, particularly when using LTO. This post dives into some of the details of what changed, and gives an estimate of the benefits.

These changes have been added incrementally over the past three months, with the latest changes landing just a few days ago on the nightly channel. The bulk of the improvements will be found in the 1.46 stable release. It would be great for any projects that use LTO to test it out on the nightly channel and report any issues that arise.

Member:

It would be great to include a concrete date (or even commit) from which support is fully available on nightly, just so that people testing can be certain they're in good shape. (In theory they can just `rustup update`, I guess, but a date would be useful if there are regressions or whatever.)

Contributor Author:

Sounds good, added the date for the most recent fixes.

@nikomatsakis (Contributor) commented:

@ehuss let me know when you want to post!

@ehuss (Contributor Author) commented Jun 27, 2020

This should be ready to go!

@Mark-Simulacrum (Member) commented:

Want to bump the date in the file to Monday and we can merge tomorrow?

@nikomatsakis nikomatsakis merged commit c65a377 into rust-lang:master Jun 29, 2020