Skip to content

Manipulating slice through &mut parameter not optimized well #27130

Closed
@hanna-kruppe

Description

@hanna-kruppe

When passing a &mut &[u8] to a function that trims elements from the start/end of the slice, the code contains pointless length checks and stack traffic. Writing the function slightly differently, or inlining it, does produce the code I'd expect.

This is a minimal example (playground of all following snippets):

pub fn trim_in_place(a: &mut &[u8]) {
    while a.first() == Some(&42) {
        *a = &a[1..];
    }
}

I expect this to compile to a tight loop that simply decrements the slice length and increments the data pointer in registers, until the length reaches 0 or a byte value of 42 is encountered.

Instead, the loop writes to the stack on every iteration and checks twice whether the length is zero, trying to panic for the second check (presumably comes from the a[1..] slicing).

.LBB0_5:
    movzx   edx, byte ptr [rax - 1]
    cmp edx, 42
    jne .LBB0_8 # exit function
    cmp rcx, -1
    je  .LBB0_9 # panic, never taken
    mov qword ptr [rdi], rax
    mov qword ptr [rdi + 8], rcx
    dec rcx
    inc rax
    cmp rcx, -1
    jne .LBB0_5

The .LBB0_9 branch cannot ever be taken, and is indeed optimized out when writing the function in differently. The stack traffic also disappears, though that probably has an entirely different cause. This variant compiles to better code:

pub fn trim_functionally(mut a: &[u8]) -> &[u8] {
    while a.first() == Some(&42) {
        a = &a[1..];
    }
    a
}

It does all the loop calculations in registers. Also, marking the first function as inline(always) and defining the second function in terms of it ({ trim_in_place(&mut a); a }) optimizes to the same code.

.LBB1_2:
    movzx   eax, byte ptr [rdi]
    cmp eax, 42
    jne .LBB1_3 # exit function
    inc rdi
    dec rsi
    jne .LBB1_2

Both variants check if the slice is empty before entering the loop and return immediately in that case.

Background

I encountered this problem in the context of my dec2flt work, in a larger and more complicated function that trims two slices on both ends, and the slices are stored in a struct. Applying inline(always) to that function gives a small but measurable performance improvement across the board, but increases code size by 10 KiB (Edit: I just didn't measure properly. Size is about the same.). I haven't checked the assembly code (far too much code) but I suspect this means this problem, and solution, carries over to real code.

Meta

Reproduces on nightly playground. Stable is worse for all variants, presumably because it doesn't have #26411 (yay, progress!). My local rustc has the same problem. The exact commit is meaningless because it only exists in my rust.git fork. It's based f9f580936d81ed527115ae86375f69bb77723b1c.

Metadata

Metadata

Assignees

Labels

A-codegenArea: Code generationC-enhancementCategory: An issue proposing an enhancement or a PR with one.E-needs-testCall for participation: An issue has been fixed and does not reproduce, but no test has been added.I-slowIssue: Problems and improvements with respect to performance of generated code.P-lowLow priorityT-compilerRelevant to the compiler team, which will review and decide on the PR/issue.

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions