#![crate_type = "lib"]
#![feature(tuple_indexing)]

use std::simd::f32x4;

pub fn foo(x: f32x4) -> f32x4 {
    f32x4(x.0, x.2, x.3, x.1)
}
becomes, with no optimisations:
define <4 x float> @_ZN3foo20h2254f602671f886ceaaE(<4 x float>) unnamed_addr #0 {
entry-block:
  %sret_slot = alloca <4 x float>
  %x = alloca <4 x float>
  store <4 x float> %0, <4 x float>* %x
  %1 = getelementptr inbounds <4 x float>* %sret_slot, i32 0, i32 0
  %2 = getelementptr inbounds <4 x float>* %x, i32 0, i32 0
  %3 = load float* %2
  store float %3, float* %1
  %4 = getelementptr inbounds <4 x float>* %sret_slot, i32 0, i32 1
  %5 = getelementptr inbounds <4 x float>* %x, i32 0, i32 2
  %6 = load float* %5
  store float %6, float* %4
  %7 = getelementptr inbounds <4 x float>* %sret_slot, i32 0, i32 2
  %8 = getelementptr inbounds <4 x float>* %x, i32 0, i32 3
  %9 = load float* %8
  store float %9, float* %7
  %10 = getelementptr inbounds <4 x float>* %sret_slot, i32 0, i32 3
  %11 = getelementptr inbounds <4 x float>* %x, i32 0, i32 1
  %12 = load float* %11
  store float %12, float* %10
  %13 = load <4 x float>* %sret_slot
  ret <4 x float> %13
}
with optimisations it becomes:
define <4 x float> @_ZN3foo20h2254f602671f886ceaaE(<4 x float>) unnamed_addr #0 {
entry-block:
  %sret_slot.12.vec.insert = shufflevector <4 x float> %0, <4 x float> undef, <4 x i32> <i32 0, i32 2, i32 3, i32 1>
  ret <4 x float> %sret_slot.12.vec.insert
}
We could detect when a SIMD vector is being constructed directly from elements of another SIMD vector (or a pair of* SIMD vectors) and emit the appropriate shuffle instruction directly. This would avoid the allocas, save LLVM optimisation work, and guarantee the shuffle more reliably than leaving it to LLVM's optimisations currently does. It should even work for vectors of different lengths, as long as the element types are the same.
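For example, shufflevector's result length is determined by its mask, so extracting two lanes of an f32x4 into an f32x2 could lower to a single instruction. A minimal sketch of the IR such a case could emit (the function name is illustrative, not what rustc would generate):

define <2 x float> @half_swizzle(<4 x float> %x) {
entry-block:
  ; the <2 x i32> mask makes the result a <2 x float>,
  ; selecting lanes 0 and 2 of the <4 x float> input
  %r = shufflevector <4 x float> %x, <4 x float> undef, <2 x i32> <i32 0, i32 2>
  ret <2 x float> %r
}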
(This is just a bug, since it's purely an implementation detail.)
* shufflevector actually takes two vector operands, so f32x4(x.0, y.0, x.1, y.1) can also directly become a shuffle.
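A sketch of the IR that case could map to, assuming x and y are both f32x2 arguments (again, the function name is illustrative): mask indices 0-1 select lanes from the first operand and indices 2-3 from the second, so one instruction interleaves the two inputs.

define <4 x float> @interleave(<2 x float> %x, <2 x float> %y) {
entry-block:
  ; indices 0 and 1 pick x.0 and x.1; indices 2 and 3 pick y.0 and y.1,
  ; so the mask <0, 2, 1, 3> yields <x.0, y.0, x.1, y.1>
  %r = shufflevector <2 x float> %x, <2 x float> %y, <4 x i32> <i32 0, i32 2, i32 1, i32 3>
  ret <4 x float> %r
}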