AMDGPU should not scalarize v2f16 / v2bf16 copysign

Currently, copysign on vectors of two half elements is scalarized, producing this ugly expansion:

```
; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 < %s

; s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; s_movk_i32 s4, 0x7fff
; v_bfi_b32 v2, s4, v0, v1
; v_lshrrev_b32_e32 v1, 16, v1
; v_lshrrev_b32_e32 v0, 16, v0
; v_bfi_b32 v0, s4, v0, v1
; s_mov_b32 s4, 0x5040100
; v_perm_b32 v0, v0, v2, s4
; s_setpc_b64 s[30:31]
define <2 x half> @copysign_v2f16(<2 x half> %a, <2 x half> %b) {
  %result = call <2 x half> @llvm.copysign.v2f16(<2 x half> %a, <2 x half> %b)
  ret <2 x half> %result
}

```

If I hack up the vector legalizer's logic, the default expansion finds a vector BFI:

With gfx803:

```
	s_mov_b32 s4, 0x7fff7fff
	v_bfi_b32 v0, s4, v0, v1
```

With gfx9+, it does worse:
```
	v_and_b32_e32 v1, 0x80008000, v1
	s_mov_b32 s4, 0x7fff7fff
	v_and_or_b32 v0, v0, s4, v1
```
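As a sanity check on the two outputs above, here is a minimal Python sketch that models the bit manipulation on a packed 32-bit register holding two f16 lanes. The helper names (`bfi`, `copysign_scalarized`, and so on) are hypothetical and exist only for this illustration. `bfi` assumes the usual bitfield-insert semantics of `v_bfi_b32`, `(mask & a) | (~mask & b)`, and the scalarized model is an abstraction of the per-lane expansion rather than a literal transcription of the gfx900 assembly.

```python
MASK32 = 0xFFFFFFFF
MAG_MASK_F16 = 0x7FFF          # magnitude bits of a single f16 lane
MAG_MASK_V2F16 = 0x7FFF7FFF    # the same mask replicated across both lanes
SIGN_MASK_V2F16 = 0x80008000   # sign bit of each lane


def bfi(mask, a, b):
    """Bitfield insert: bits of `a` where mask is 1, bits of `b` elsewhere."""
    return ((mask & a) | (~mask & b)) & MASK32


def copysign_scalarized(mag, sign):
    """Per-lane model: one 16-bit BFI per half, then repack the two lanes."""
    lo = bfi(MAG_MASK_F16, mag & 0xFFFF, sign & 0xFFFF) & 0xFFFF
    hi = bfi(MAG_MASK_F16, mag >> 16, sign >> 16) & 0xFFFF
    return (hi << 16) | lo


def copysign_packed_bfi(mag, sign):
    """gfx803-style form: a single 32-bit BFI with the replicated mask."""
    return bfi(MAG_MASK_V2F16, mag, sign)


def copysign_packed_and_or(mag, sign):
    """gfx9-style form: mask out the sign bits, then and-or them in."""
    return (mag & MAG_MASK_V2F16) | (sign & SIGN_MASK_V2F16)
```

For example, with `mag = 0x3C00C000` (lanes 1.0 and -2.0) and `sign = 0x80000000` (lanes -0.0 and +0.0), all three functions return `0xBC004000`, that is, lanes -1.0 and 2.0. The single-BFI form needs no shifts or repacking, which is why it is the preferable lowering.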


We can trivially extend the existing legal f16 copysign pattern to handle the 2-element case, as in the gfx8 output. Supporting the cases where the sign source is a different FP type takes a little more work.
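To illustrate why a different sign type needs extra work, here is a rough Python sketch for the case where the sign comes from two f32 values. It is only one possible bit-level formulation, not necessarily what the backend should emit, and the function name is hypothetical. The sign bit of each f32 sits at bit 31, so the high lane only needs masking, while the low lane's sign has to be shifted down to bit 15 before the packed merge.

```python
MAG_MASK_V2F16 = 0x7FFF7FFF  # magnitude bits of both f16 lanes


def copysign_v2f16_with_f32_signs(mag, sign_lo_f32, sign_hi_f32):
    """Packed v2f16 magnitude combined with signs taken from two f32 bit patterns.

    mag         -- 32-bit value holding two f16 lanes (low lane in bits 15:0)
    sign_lo_f32 -- f32 bit pattern supplying the sign of the low lane
    sign_hi_f32 -- f32 bit pattern supplying the sign of the high lane
    """
    # Low lane: move the f32 sign from bit 31 down to bit 15.
    lo_sign = (sign_lo_f32 >> 16) & 0x8000
    # High lane: the f32 sign is already at bit 31 of the packed value.
    hi_sign = sign_hi_f32 & 0x80000000
    return (mag & MAG_MASK_V2F16) | hi_sign | lo_sign
```

With `mag = 0x3C00C000` (lanes 1.0 and -2.0), `sign_lo_f32 = 0x3F800000` (+1.0f) and `sign_hi_f32 = 0xBF800000` (-1.0f), the result is `0xBC004000`, that is, lanes -1.0 and 2.0.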


AMDGPU should not scalarize v2f16 / v2bf16 copysign #141931
