Description
I have read in multiple places that it counts as a bug when rustc generates worse code than a C++ compiler would for an equivalent C++ program. So here we go:
Summary
Rustc fails to inline trivial functions that compile down to just a few instructions, to the point where calling convention overhead is much worse than the actual function itself.
Steps to reproduce
I tried to write a minimal example at play.rust-lang.org, but everything short that I can come up with does not suffer from this issue. So instead I am going to link the project that caused me to discover this issue:
git clone https://github.com/ruuda/claxon
cd claxon
git checkout 2b18a49
cargo build --release --example decode
objdump -Cd target/release/examples/decode | less
# Now search for rice_to_signed or shift_left.
Actual and expected behavior
I’ll outline some of the disassembly below:
# Code for `if shift >= 8 { 0 } else { x << shift }`.
000000000000d3a0 <claxon::input::shift_left::h07bb472717a335da>:
d3a0: 89 f1 mov %esi,%ecx
d3a2: 80 e1 07 and $0x7,%cl
d3a5: 40 d2 e7 shl %cl,%dil
d3a8: 48 83 fe 07 cmp $0x7,%rsi
d3ac: 76 02 jbe d3b0 <claxon::input::shift_left::h07bb472717a335da+0x10>
d3ae: 31 ff xor %edi,%edi
d3b0: 89 f8 mov %edi,%eax
d3b2: c3 retq
d3b3: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
d3ba: 00 00 00
d3bd: 0f 1f 00 nopl (%rax)
# Code for `if x & 1 == 1 { - 1 - x / 2 } else { x / 2 }`.
# It did optimize the branch into two shifts and an xor though!
000000000000d7b0 <claxon::subframe::rice_to_signed::haed2067302c41014>:
d7b0: 48 89 f8 mov %rdi,%rax
d7b3: 48 c1 e8 3f shr $0x3f,%rax
d7b7: 48 01 f8 add %rdi,%rax
d7ba: 48 d1 f8 sar %rax
d7bd: 48 c1 e7 3f shl $0x3f,%rdi
d7c1: 48 c1 ff 3f sar $0x3f,%rdi
d7c5: 48 31 c7 xor %rax,%rdi
d7c8: 48 89 f8 mov %rdi,%rax
d7cb: c3 retq
d7cc: 0f 1f 40 00 nopl 0x0(%rax)
Note that this is not dead code. There are calls to these functions in very hot loops:
79e8: e8 c3 5d 00 00 callq d7b0 <claxon::subframe::rice_to_signed::haed2067302c41014>
66a8: e8 f3 6c 00 00 callq d3a0 <claxon::input::shift_left::h07bb472717a335da>
I would expect that functions like these would be inlined automatically, but they were not. Note that all of this code is in the same crate.
I encountered about a dozen of these during profiling, where very small functions like the ones above were showing up as hotspots. I’ve been able to speed up my program by as much as 30% just by placing a few #[inline(always)] attributes.
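For reference, the workaround looks roughly like the sketch below. The function bodies are the expressions quoted in the disassembly comments above; the parameter and return types are assumptions read off the disassembly (an 8-bit register and a 64-bit unsigned compare for shift_left, 64-bit arithmetic shifts for rice_to_signed) and may not match the signatures in the repository exactly.

```rust
// Types are assumptions inferred from the disassembly, not copied from claxon.
#[inline(always)]
fn shift_left(x: u8, shift: usize) -> u8 {
    // Shifting a u8 by 8 or more would overflow, so return 0 in that case.
    if shift >= 8 { 0 } else { x << shift }
}

#[inline(always)]
fn rice_to_signed(val: i64) -> i64 {
    // Odd inputs map to negative numbers, even inputs to non-negative ones.
    if val & 1 == 1 { -1 - val / 2 } else { val / 2 }
}

fn main() {
    println!("{}", shift_left(0b0000_0011, 2));
    println!("{}", rice_to_signed(3));
}
```

With the attribute in place, the call instructions shown above should no longer appear at the call sites, which can be checked with the same objdump command.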
There are also simple getters like Block::len which are not inlined, but these are called from the example program, which is a different crate, so I think that is working as intended.
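As far as I understand, the usual way to make such a getter eligible for inlining across crates is to mark it #[inline], which makes its body available to downstream crates. A minimal sketch follows; the Block struct here is a hypothetical stand-in with an assumed field and return type, not the actual definition from the repository.

```rust
// Hypothetical stand-in for the library type; the field and the
// return type of `len` are assumptions made for illustration only.
pub struct Block {
    samples: Vec<i32>,
}

impl Block {
    pub fn new(samples: Vec<i32>) -> Block {
        Block { samples }
    }

    // #[inline] is a hint that also lets other crates inline this body.
    #[inline]
    pub fn len(&self) -> usize {
        self.samples.len()
    }
}

fn main() {
    let block = Block::new(vec![1, 2, 3]);
    println!("{}", block.len());
}
```

This is separate from the same-crate case reported here, where no attribute should be needed.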
Meta
rustc 1.14.0-nightly (3210fd5c2 2016-10-05)
binary: rustc
commit-hash: 3210fd5c20ffc6da420eb00e60bdc8704577fd3b
commit-date: 2016-10-05
host: x86_64-unknown-linux-gnu
release: 1.14.0-nightly