Consider compiling SIMD-from-SIMD initialisation directly to a shuffle #18148

Closed
@huonw

#![crate_type = "lib"]
#![feature(tuple_indexing)]

use std::simd::f32x4;

pub fn foo(x: f32x4) -> f32x4 {
    f32x4(x.0, x.2, x.3, x.1)
}

becomes, with no optimisations:

define <4 x float> @_ZN3foo20h2254f602671f886ceaaE(<4 x float>) unnamed_addr #0 {
entry-block:
  %sret_slot = alloca <4 x float>
  %x = alloca <4 x float>
  store <4 x float> %0, <4 x float>* %x
  %1 = getelementptr inbounds <4 x float>* %sret_slot, i32 0, i32 0
  %2 = getelementptr inbounds <4 x float>* %x, i32 0, i32 0
  %3 = load float* %2
  store float %3, float* %1
  %4 = getelementptr inbounds <4 x float>* %sret_slot, i32 0, i32 1
  %5 = getelementptr inbounds <4 x float>* %x, i32 0, i32 2
  %6 = load float* %5
  store float %6, float* %4
  %7 = getelementptr inbounds <4 x float>* %sret_slot, i32 0, i32 2
  %8 = getelementptr inbounds <4 x float>* %x, i32 0, i32 3
  %9 = load float* %8
  store float %9, float* %7
  %10 = getelementptr inbounds <4 x float>* %sret_slot, i32 0, i32 3
  %11 = getelementptr inbounds <4 x float>* %x, i32 0, i32 1
  %12 = load float* %11
  store float %12, float* %10
  %13 = load <4 x float>* %sret_slot
  ret <4 x float> %13
}

with optimisations it becomes:

define <4 x float> @_ZN3foo20h2254f602671f886ceaaE(<4 x float>) unnamed_addr #0 {
entry-block:
  %sret_slot.12.vec.insert = shufflevector <4 x float> %0, <4 x float> undef, <4 x i32> <i32 0, i32 2, i32 3, i32 1>
  ret <4 x float> %sret_slot.12.vec.insert
}

We could detect when a SIMD vector is being constructed directly from elements of another SIMD vector (or a pair of them*) and compile it straight to the appropriate shuffle instruction. This would avoid the allocas, save LLVM some work, and probably guarantee the shuffle more reliably than relying on LLVM's optimiser does. It should even work for vectors of different lengths, as long as the element types are the same.
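The detection step for the single-vector case could look roughly like the sketch below. It is not rustc code: `Arg` and `single_source_shuffle` are hypothetical names invented for illustration, and the vector ids simply stand in for whatever the compiler uses to identify a SIMD value. It assumes each constructor argument has already been classified as either a lane read or some other expression.

```rust
/// Hypothetical description of one constructor argument (not a rustc type).
#[derive(Clone, Copy, Debug, PartialEq)]
enum Arg {
    /// A lane read such as `x.2`, where `vector` identifies `x`.
    Lane { vector: usize, lane: u32 },
    /// Any other expression, e.g. a literal or an arithmetic result.
    Other,
}

/// Returns the source vector id and the shuffle mask if every argument
/// reads a lane of the same vector; otherwise returns `None`.
fn single_source_shuffle(args: &[Arg]) -> Option<(usize, Vec<u32>)> {
    let mut source = None;
    let mut mask = Vec::with_capacity(args.len());
    for arg in args {
        match *arg {
            Arg::Lane { vector, lane } => {
                match source {
                    // First lane read fixes the source vector.
                    None => source = Some(vector),
                    // Later reads must come from that same vector.
                    Some(s) if s == vector => {}
                    Some(_) => return None,
                }
                mask.push(lane);
            }
            // A non-lane argument means this is not a pure shuffle.
            Arg::Other => return None,
        }
    }
    // An empty argument list has no source and is rejected.
    source.map(|s| (s, mask))
}
```

With `x` given id 0, the arguments of `f32x4(x.0, x.2, x.3, x.1)` yield the mask `[0, 2, 3, 1]`, which matches the indices in the optimised `shufflevector` shown above.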

(This is just a bug since it's an implementation detail.)

*shufflevector actually takes two operands, so f32x4(x.0, y.0, x.1, y.1) can also directly become a shuffle.
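For the two-operand case, the mask has to follow LLVM's numbering, where lanes of the first operand keep their index and lanes of the second operand are offset by the operand width. A minimal sketch of that mapping, assuming both inputs have the same number of lanes (as `shufflevector` requires) and using the hypothetical helper name `two_operand_mask`:

```rust
/// Builds a `shufflevector`-style mask from `(vector id, lane)` pairs.
/// At most two distinct vectors may appear, each with `width` lanes.
/// Returns the operands in order of first use, plus the mask.
fn two_operand_mask(args: &[(usize, u32)], width: u32) -> Option<(Vec<usize>, Vec<u32>)> {
    let mut operands: Vec<usize> = Vec::new();
    let mut mask = Vec::with_capacity(args.len());
    for &(vector, lane) in args {
        // A lane index outside the operand cannot be encoded.
        if lane >= width {
            return None;
        }
        let position = match operands.iter().position(|&v| v == vector) {
            Some(p) => p,
            // A new vector is accepted only while fewer than two are in use.
            None if operands.len() < 2 => {
                operands.push(vector);
                operands.len() - 1
            }
            // A third distinct vector does not fit in one shuffle.
            None => return None,
        };
        // First operand: 0..width. Second operand: width..2*width.
        mask.push(position as u32 * width + lane);
    }
    Some((operands, mask))
}
```

Interleaving `x.0, y.0, x.1, y.1` from two four-lane vectors (ids 0 and 1) gives the mask `[0, 4, 1, 5]`; a constructor that reads from a third vector returns `None` and would need the ordinary lowering.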

Labels: A-codegen (Area: Code generation), T-compiler (Relevant to the compiler team, which will review and decide on the PR/issue.)
