Description
When passing a &mut &[u8]
to a function that trims elements from the start/end of the slice, the code contains pointless length checks and stack traffic. Writing the function slightly differently, or inlining it, does produce the code I'd expect.
This is a minimal example (playground of all following snippets):
pub fn trim_in_place(a: &mut &[u8]) {
while a.first() == Some(&42) {
*a = &a[1..];
}
}
I expect this to compile to a tight loop that simply decrements the slice length and increments the data pointer in registers, until the length reaches 0 or a byte value of 42 is encountered.
Instead, the loop writes to the stack on every iteration and checks twice whether the length is zero, trying to panic for the second check (presumably comes from the a[1..]
slicing).
.LBB0_5:
movzx edx, byte ptr [rax - 1]
cmp edx, 42
jne .LBB0_8 # exit function
cmp rcx, -1
je .LBB0_9 # panic, never taken
mov qword ptr [rdi], rax
mov qword ptr [rdi + 8], rcx
dec rcx
inc rax
cmp rcx, -1
jne .LBB0_5
The .LBB0_9
branch cannot ever be taken, and is indeed optimized out when writing the function in differently. The stack traffic also disappears, though that probably has an entirely different cause. This variant compiles to better code:
pub fn trim_functionally(mut a: &[u8]) -> &[u8] {
while a.first() == Some(&42) {
a = &a[1..];
}
a
}
It does all the loop calculations in registers. Also, marking the first function as inline(always)
and defining the second function in terms of it ({ trim_in_place(&mut a); a }
) optimizes to the same code.
.LBB1_2:
movzx eax, byte ptr [rdi]
cmp eax, 42
jne .LBB1_3 # exit function
inc rdi
dec rsi
jne .LBB1_2
Both variants check if the slice is empty before entering the loop and return immediately in that case.
Background
I encountered this problem in the context of my dec2flt work, in a larger and more complicated function that trims two slices on both ends, and the slices are stored in a struct. Applying inline(always)
to that function gives a small but measurable performance improvement across the board, but increases code size by 10 KiB (Edit: I just didn't measure properly. Size is about the same.). I haven't checked the assembly code (far too much code) but I suspect this means this problem, and solution, carries over to real code.
Meta
Reproduces on nightly playground. Stable is worse for all variants, presumably because it doesn't have #26411 (yay, progress!). My local rustc
has the same problem. The exact commit is meaningless because it only exists in my rust.git
fork. It's based f9f580936d81ed527115ae86375f69bb77723b1c
.