#![crate_type = "lib"]
#![feature(tuple_indexing)]

use std::simd::f32x4;

pub fn foo(x: f32x4) -> f32x4 {
    f32x4(x.0, x.2, x.3, x.1)
}
becomes, with no optimisations:
define <4 x float> @_ZN3foo20h2254f602671f886ceaaE(<4 x float>) unnamed_addr #0 {
entry-block:
  %sret_slot = alloca <4 x float>
  %x = alloca <4 x float>
  store <4 x float> %0, <4 x float>* %x
  %1 = getelementptr inbounds <4 x float>* %sret_slot, i32 0, i32 0
  %2 = getelementptr inbounds <4 x float>* %x, i32 0, i32 0
  %3 = load float* %2
  store float %3, float* %1
  %4 = getelementptr inbounds <4 x float>* %sret_slot, i32 0, i32 1
  %5 = getelementptr inbounds <4 x float>* %x, i32 0, i32 2
  %6 = load float* %5
  store float %6, float* %4
  %7 = getelementptr inbounds <4 x float>* %sret_slot, i32 0, i32 2
  %8 = getelementptr inbounds <4 x float>* %x, i32 0, i32 3
  %9 = load float* %8
  store float %9, float* %7
  %10 = getelementptr inbounds <4 x float>* %sret_slot, i32 0, i32 3
  %11 = getelementptr inbounds <4 x float>* %x, i32 0, i32 1
  %12 = load float* %11
  store float %12, float* %10
  %13 = load <4 x float>* %sret_slot
  ret <4 x float> %13
}
with optimisations it becomes:
define <4 x float> @_ZN3foo20h2254f602671f886ceaaE(<4 x float>) unnamed_addr #0 {
entry-block:
  %sret_slot.12.vec.insert = shufflevector <4 x float> %0, <4 x float> undef, <4 x i32> <i32 0, i32 2, i32 3, i32 1>
  ret <4 x float> %sret_slot.12.vec.insert
}
We could detect when a SIMD vector is being constructed directly from elements of another SIMD vector (or a pair of* SIMD vectors) and emit the appropriate shuffle instruction directly. This would avoid the allocas, save LLVM optimisation work, and guarantee the shuffle more reliably than leaving it to LLVM's optimisations currently does. It should even work for vectors of different lengths, as long as the element types are the same.
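For example, shufflevector's result length is determined by its mask, so extracting two lanes of an f32x4 into an f32x2 could lower to a single instruction. A minimal sketch of the IR such a case could emit (the function name is illustrative, not what rustc would generate):

define <2 x float> @half_swizzle(<4 x float> %x) {
entry-block:
  ; the <2 x i32> mask makes the result a <2 x float>,
  ; selecting lanes 0 and 2 of the <4 x float> input
  %r = shufflevector <4 x float> %x, <4 x float> undef, <2 x i32> <i32 0, i32 2>
  ret <2 x float> %r
}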
(This is just a bug, since it's purely an implementation detail.)
* shufflevector actually takes two vector operands, so f32x4(x.0, y.0, x.1, y.1) can also directly become a shuffle.
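A sketch of the IR that case could map to, assuming x and y are both f32x2 arguments (again, the function name is illustrative): mask indices 0-1 select lanes from the first operand and indices 2-3 from the second, so one instruction interleaves the two inputs.

define <4 x float> @interleave(<2 x float> %x, <2 x float> %y) {
entry-block:
  ; indices 0 and 1 pick x.0 and x.1; indices 2 and 3 pick y.0 and y.1,
  ; so the mask <0, 2, 1, 3> yields <x.0, y.0, x.1, y.1>
  %r = shufflevector <2 x float> %x, <2 x float> %y, <4 x i32> <i32 0, i32 2, i32 1, i32 3>
  ret <4 x float> %r
}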