Special-case transmute for primitive, SIMD & pointer types. #19294
Conversation
This has several advantages:

- it is faster in the microbenchmark (after a patch to rustc), presumably because the compiler has full control over data flow and data types, and so can arrange the loads optimally and avoid unnecessarily placing things onto the stack. It also means the compiler can see the SIMD vectors are always in the integer domain and so use (on x86) MOVDQA rather than the floating-point MOVAPS, avoiding the cross-domain penalties on platforms which have them.
- the library is now usable on x86 platforms without SSE2, and also on non-x86 platforms, possibly at a performance penalty (but maybe not, e.g. it may work with ARM's NEON); this is better than not being usable at all.
- the danger and problems of `asm!` are completely removed, replaced with a pair of easy-to-verify `transmute`s (a sketch of this pattern follows after the listings below).

Before:

```
test bench_mt19937_1_000_000_rands ... bench: 1606892 ns/iter (+/- 57461)
```

After, with rust-lang/rust#19294:

```
test bench_mt19937_1_000_000_rands ... bench: 1539038 ns/iter (+/- 33623)
```

Without that patch it takes `2787449 ns/iter`.

The last of the three (essentially identical) inner loops of the benchmark are given below. They demonstrate the advantages listed above.

Before:

```asm
.LBB2_6:
    movaps  3056(%rax,%rdx), %xmm2
    movaps  1872(%rax,%rdx), %xmm3
    movaps  %xmm2, (%rsp)
    #APP
    psllq   $19, %xmm2
    pxor    %xmm3, %xmm2
    pshufd  $27, %xmm1, %xmm1
    pxor    %xmm2, %xmm1
    movaps  %xmm1, %xmm2
    movaps  %xmm1, %xmm3
    pand    %xmm0, %xmm2
    psrlq   $12, %xmm3
    pxor    (%rsp), %xmm3
    pxor    %xmm2, %xmm3
    #NO_APP
    movaps  %xmm3, 3056(%rax,%rdx)
    addq    $16, %rdx
    jne     .LBB2_6
```

After:

```asm
.LBB2_4:
    movdqa  3056(%rax,%rdx), %xmm2
    pshufd  $27, %xmm1, %xmm3
    movdqa  %xmm2, %xmm1
    psllq   $19, %xmm1
    pxor    1872(%rax,%rdx), %xmm1
    pxor    %xmm3, %xmm1
    movdqa  %xmm1, %xmm3
    psrlq   $12, %xmm3
    movdqa  %xmm1, %xmm4
    pand    %xmm0, %xmm4
    pxor    %xmm3, %xmm4
    pxor    %xmm2, %xmm4
    movdqa  %xmm4, 3056(%rax,%rdx)
    addq    $16, %rdx
    jne     .LBB2_4
```
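As a rough sketch of that transmute pattern (not the actual dsfmt-rs code; the function name is invented and the era's `std::simd` tuple-struct vector types are assumed), the lane reversal that `pshufd $27` performs in the loops above can be written as:

```rust
use std::mem;
use std::simd::{u32x4, u64x2};

// Reinterpret 2 x u64 as 4 x u32, reverse the lanes, and reinterpret
// back. The two transmutes only change the type of the same 128 bits,
// so the value can stay in a register with no asm! and no stack slot.
fn reverse_u32_lanes(v: u64x2) -> u64x2 {
    unsafe {
        let w = mem::transmute::<_, u32x4>(v);
        mem::transmute::<_, u64x2>(u32x4(w.3, w.2, w.1, w.0))
    }
}
```

This is essentially the same shape as the `reverse_u32s` example in the commit message further down.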
```rust
// Doing this special case makes conversions like `u32x4` ->
// `u64x2` much nicer for LLVM and so more efficient (these
// are done efficiently implicitly in C with the `__m128i`
// type and so this means Rust doesn't loose out there).
```
"lose", not "loose"
Whoops, thanks.
Grieverheart/dsfmt-rs#2 is a real-world example of something this helps.
Force-pushed from a518b54 to 59f7f38
```rust
// are done efficiently implicitly in C with the `__m128i`
// type and so this means Rust doesn't lose out there).
let DatumBlock { bcx: bcx2, datum } = expr::trans(bcx, &*arg_exprs[0]);
bcx = bcx2;
```
Equivalent to `let datum = unpack_datum!(bcx, expr::trans(bcx, &*arg_exprs[0]));`.
Fixed.
Force-pushed from 59f7f38 to 9fc66d9
What does this do for …? IMO the use of … This way we could also (easily) handle pointer -> pointer and pointer -> integer / integer -> pointer casts, even when the Rust type doesn't readily provide this information (…).
I think in principle this makes sense. @eddyb is probably right that examining the LLVM types may be a good idea.
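To make the pointer cases concrete, here is a hedged user-level illustration (the function is invented for this example; per the update below, only the pointer -> pointer case ends up detected):

```rust
use std::mem;

// pointer -> pointer: both sides are plain pointers at the LLVM level,
// so with a check on the LLVM types this transmute can lower to a
// single bitcast instead of a round trip through memory.
fn erase(r: &u32) -> *const u8 {
    unsafe { mem::transmute(r) }
}

fn main() {
    let x = 7u32;
    let p = erase(&x);
    println!("{:p}", p);
}
```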
Force-pushed from 9fc66d9 to 2c2041a
Updated to look at the LLVM type. This also now includes detecting the pointer -> pointer case for bitcast.
I expect this to work fine for types that implement Drop if …
Making this change:

```diff
--- a/src/librustc_trans/trans/intrinsic.rs
+++ b/src/librustc_trans/trans/intrinsic.rs
@@ -192,11 +192,8 @@ pub fn trans_intrinsic_call<'a, 'blk, 'tcx>(mut bcx: Block<'blk, 'tcx>,
         (nonpointer_nonaggregate(in_kind) && nonpointer_nonaggregate(ret_kind)) || {
             in_kind == TypeKind::Pointer && ret_kind == TypeKind::Pointer
         };
-    let primitive =
-        !ty::type_needs_drop(ccx.tcx(), in_type) &&
-        !ty::type_needs_drop(ccx.tcx(), ret_ty.unwrap());
-    let dest = if bitcast_compatible && primitive {
+    let dest = if bitcast_compatible {
         // if we're here, the type is scalar-like (a primitive or a
         // SIMD type), and so can be handled as a by-value ValueRef
         // and can also be directly bitcast to the target type.
@@ -205,7 +202,11 @@ pub fn trans_intrinsic_call<'a, 'blk, 'tcx>(mut bcx: Block<'blk, 'tcx>,
         // are done efficiently implicitly in C with the `__m128i`
         // type and so this means Rust doesn't lose out there).
         let datum = unpack_datum!(bcx, expr::trans(bcx, &*arg_exprs[0]));
-        let val = datum.to_llscalarish(bcx);
+        let val = if datum.kind.is_by_ref() {
+            load_ty(bcx, datum.val, datum.ty)
+        } else {
+            datum.val
+        };
         let cast_val = BitCast(bcx, val, llret_ty);
         match dest {
```

gives me aborts when compiling stage2 `core` (i.e. the stage1 compiler is codegened incorrectly):
I assume, if it exists, the destructor of the input (i.e. …
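To illustrate what the `type_needs_drop` guard in the diff above excludes, here is a hedged, user-level sketch (written in current Rust so it is self-contained; the scenario is invented for illustration):

```rust
use std::mem;

fn main() {
    let b: Box<u64> = Box::new(42);
    // Both sides are plain pointers at the LLVM level, so this transmute
    // passes the bitcast-compatibility check, but Box<u64> has a
    // destructor, so the type_needs_drop guard keeps it off the
    // by-value bitcast path; presumably that is what keeps trans's
    // cleanup bookkeeping for the input consistent (cf. the stage2
    // aborts reported above when the guard is removed).
    let p: *mut u64 = unsafe { mem::transmute(b) };
    unsafe {
        assert_eq!(*p, 42);
        // re-box the pointer so the allocation is still freed
        drop(Box::from_raw(p));
    }
}
```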
Force-pushed from 2c2041a to aca6f14
Force-pushed from aca6f14 to 1a62066
This detects (a subset of) the cases when `transmute::<T, U>(x)` can be lowered to a direct `bitcast T x to U` in LLVM. This assists with efficiently handling a SIMD vector as multiple different types, e.g. swapping bytes/words/double words around inside some larger vector type.

C compilers like GCC and Clang handle integer vector types as `__m128i` for all widths, and implicitly insert bitcasts as required. This patch allows Rust to express this, even if it takes a bit of `unsafe`, whereas previously it was impossible to do at all without inline assembly.

Example:

```rust
pub fn reverse_u32s(u: u64x2) -> u64x2 {
    unsafe {
        let tmp = mem::transmute::<_, u32x4>(u);
        let swapped = u32x4(tmp.3, tmp.2, tmp.1, tmp.0);
        mem::transmute::<_, u64x2>(swapped)
    }
}
```

Compiling with `--opt-level=3` gives:

Before

```llvm
define <2 x i64> @_ZN12reverse_u32s20hbdb206aba18a03d8tbaE(<2 x i64>) unnamed_addr #0 {
entry-block:
  %1 = bitcast <2 x i64> %0 to i128
  %u.0.extract.trunc = trunc i128 %1 to i32
  %u.4.extract.shift = lshr i128 %1, 32
  %u.4.extract.trunc = trunc i128 %u.4.extract.shift to i32
  %u.8.extract.shift = lshr i128 %1, 64
  %u.8.extract.trunc = trunc i128 %u.8.extract.shift to i32
  %u.12.extract.shift = lshr i128 %1, 96
  %u.12.extract.trunc = trunc i128 %u.12.extract.shift to i32
  %2 = insertelement <4 x i32> undef, i32 %u.12.extract.trunc, i64 0
  %3 = insertelement <4 x i32> %2, i32 %u.8.extract.trunc, i64 1
  %4 = insertelement <4 x i32> %3, i32 %u.4.extract.trunc, i64 2
  %5 = insertelement <4 x i32> %4, i32 %u.0.extract.trunc, i64 3
  %6 = bitcast <4 x i32> %5 to <2 x i64>
  ret <2 x i64> %6
}
```

```asm
_ZN12reverse_u32s20hbdb206aba18a03d8tbaE:
    .cfi_startproc
    movd    %xmm0, %rax
    punpckhqdq  %xmm0, %xmm0
    movd    %xmm0, %rcx
    movq    %rcx, %rdx
    shrq    $32, %rdx
    movq    %rax, %rsi
    shrq    $32, %rsi
    movd    %eax, %xmm0
    movd    %ecx, %xmm1
    punpckldq   %xmm0, %xmm1
    movd    %esi, %xmm2
    movd    %edx, %xmm0
    punpckldq   %xmm2, %xmm0
    punpckldq   %xmm1, %xmm0
    retq
```

After

```llvm
define <2 x i64> @_ZN12reverse_u32s20hbdb206aba18a03d8tbaE(<2 x i64>) unnamed_addr #0 {
entry-block:
  %1 = bitcast <2 x i64> %0 to <4 x i32>
  %2 = shufflevector <4 x i32> %1, <4 x i32> undef, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
  %3 = bitcast <4 x i32> %2 to <2 x i64>
  ret <2 x i64> %3
}
```

```asm
_ZN12reverse_u32s20hbdb206aba18a03d8tbaE:
    .cfi_startproc
    pshufd  $27, %xmm0, %xmm0
    retq
```
Looks like bors stalled out on this PR, but it passed on all platforms except the one slave that was lost, so I have merged this manually.