[AMDGPU][SDAG] Initial support for ISD::PTRADD #141725
Conversation
Enable generation of PTRADD SelectionDAG nodes for pointer arithmetic for SI, for now behind an internal CLI option. Also add basic patterns to match these nodes. Optimizations will come in follow-up PRs. Basic tests for SDAG codegen with PTRADD are in test/CodeGen/AMDGPU/ptradd-sdag.ll.

Since GlobalISel also uses the PTRADD SDAG patterns via SelectionDAGCompat, this change affects GlobalISel tests:

- Uniform 32-bit address arithmetic is now lowered to s_add_i32 instead of s_add_u32, which is consistent with what SDAG does (and gives SIShrinkInstructions the chance to generate s_addk_i32).
- 64-bit address arithmetic uses the [sv]_add_u64 pseudos, which is consistent with SDAG and means that GISel now generates 64-bit adds for gfx12. The only drawback is that we could save 1-2 instructions if we didn't use 64-bit adds with a >32-bit immediate (two movs with 32-bit immediates, s_delay_alu, and a 64-bit add vs. two 32-bit adds with immediate), but that's a separate problem.
- The register class for the dead carry-out/sign-bit operand of V_ADD_CO_U32_e64 on architectures without carry-less additions is now sreg_64 instead of sreg_64_xexec. I'm not sure if that loses us something worth preserving; I haven't found an obvious way to avoid this.

Overall, the changes in the GlobalISel tests seem to be improvements.
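As a rough illustration (not taken from the PR's tests; the function and value names here are hypothetical), the kind of pointer arithmetic affected is a plain `getelementptr` on a flat pointer, which the SDAG builder can now represent as an ISD::PTRADD node instead of a generic ISD::ADD when the option is enabled:

```llvm
; Hypothetical sketch: with the new -amdgpu-use-sdag-ptradd option enabled,
; the address computation %gep is built as an ISD::PTRADD node, which the
; added GCNPats select to the [sv]_add_u64 pseudos (64-bit, divergent vs.
; uniform) or to s_add_i32 / v_add_u32 for 32-bit address spaces.
define amdgpu_kernel void @ptradd_example(ptr %base, i64 %offset) {
  %gep = getelementptr inbounds i8, ptr %base, i64 %offset
  store i32 42, ptr %gep
  ret void
}
```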
@llvm/pr-subscribers-llvm-selectiondag @llvm/pr-subscribers-backend-amdgpu

Author: Fabian Ritter (ritter-x2a)

Patch is 1.01 MiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/141725.diff

50 Files Affected:
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index ade88a16193b8..15e060a6e8e59 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -61,6 +61,13 @@ static cl::opt<bool> UseDivergentRegisterIndexing(
cl::desc("Use indirect register addressing for divergent indexes"),
cl::init(false));
+// TODO This option should be removed once we switch to always using PTRADD in
+// the SelectionDAG.
+static cl::opt<bool> UseSelectionDAGPTRADD(
+ "amdgpu-use-sdag-ptradd", cl::Hidden,
+ cl::desc("Generate ISD::PTRADD nodes in the SelectionDAG ISel"),
+ cl::init(false));
+
static bool denormalModeIsFlushAllF32(const MachineFunction &MF) {
const SIMachineFunctionInfo *Info = MF.getInfo<SIMachineFunctionInfo>();
return Info->getMode().FP32Denormals == DenormalMode::getPreserveSign();
@@ -10419,6 +10426,11 @@ SDValue SITargetLowering::LowerINTRINSIC_VOID(SDValue Op,
}
}
+bool SITargetLowering::shouldPreservePtrArith(const Function &F,
+ EVT PtrVT) const {
+ return UseSelectionDAGPTRADD;
+}
+
// The raw.(t)buffer and struct.(t)buffer intrinsics have two offset args:
// offset (the offset that is included in bounds checking and swizzling, to be
// split between the instruction's voffset and immoffset fields) and soffset
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.h b/llvm/lib/Target/AMDGPU/SIISelLowering.h
index c42366a1c04c8..bd9ec7cb8ec48 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.h
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.h
@@ -258,6 +258,8 @@ class SITargetLowering final : public AMDGPUTargetLowering {
bool shouldExpandVectorDynExt(SDNode *N) const;
+ bool shouldPreservePtrArith(const Function &F, EVT PtrVT) const override;
+
private:
// Analyze a combined offset from an amdgcn_s_buffer_load intrinsic and store
// the three offsets (voffset, soffset and instoffset) into the SDValue[3]
diff --git a/llvm/lib/Target/AMDGPU/SIInstructions.td b/llvm/lib/Target/AMDGPU/SIInstructions.td
index 2e2913d88cc54..3ded1393e2ce3 100644
--- a/llvm/lib/Target/AMDGPU/SIInstructions.td
+++ b/llvm/lib/Target/AMDGPU/SIInstructions.td
@@ -1376,6 +1376,37 @@ def : GCNPat <
(i32 (V_MOV_B32_e32 (i32 0))), sub1)
>;
+//===----------------------------------------------------------------------===//
+// PTRADD Patterns
+//===----------------------------------------------------------------------===//
+
+def : GCNPat<
+ (DivergentBinFrag<ptradd> i64:$src0, i64:$src1),
+ (V_ADD_U64_PSEUDO $src0, $src1)>;
+
+def : GCNPat<
+ (DivergentBinFrag<ptradd> i32:$src0, i32:$src1),
+ (V_ADD_U32_e64 $src0, $src1, 0)> {
+ let SubtargetPredicate = HasAddNoCarryInsts;
+}
+
+def : GCNPat<
+ (DivergentBinFrag<ptradd> i32:$src0, i32:$src1),
+ (V_ADD_CO_U32_e64 $src0, $src1)> {
+ let SubtargetPredicate = NotHasAddNoCarryInsts;
+}
+
+def : GCNPat<
+ (UniformBinFrag<ptradd> i64:$src0, i64:$src1),
+ (S_ADD_U64_PSEUDO $src0, $src1)>;
+
+// Whether we select S_ADD_I32 or S_ADD_U32 does not make much of a
+// difference. Most notably, S_ADD_I32 instructions can be transformed
+// to S_ADDK_I32, so we select that.
+def : GCNPat<
+ (UniformBinFrag<ptradd> i32:$src0, i32:$src1),
+ (S_ADD_I32 $src0, $src1)>;
+
/********** ============================================ **********/
/********** Extraction, Insertion, Building and Casting **********/
/********** ============================================ **********/
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/call-outgoing-stack-args.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/call-outgoing-stack-args.ll
index 7adaddf2fc8ba..5a3b36fc1ada2 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/call-outgoing-stack-args.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/call-outgoing-stack-args.ll
@@ -36,15 +36,15 @@ define amdgpu_kernel void @kernel_caller_stack() {
; FLATSCR-NEXT: s_mov_b32 s32, 0
; FLATSCR-NEXT: s_add_u32 flat_scratch_lo, s8, s13
; FLATSCR-NEXT: s_addc_u32 flat_scratch_hi, s9, 0
-; FLATSCR-NEXT: s_add_u32 s0, s32, 4
+; FLATSCR-NEXT: s_add_i32 s0, s32, 4
; FLATSCR-NEXT: v_mov_b32_e32 v0, 9
; FLATSCR-NEXT: scratch_store_dword off, v0, s0
-; FLATSCR-NEXT: s_add_u32 s0, s32, 8
+; FLATSCR-NEXT: s_add_i32 s0, s32, 8
; FLATSCR-NEXT: v_mov_b32_e32 v0, 10
; FLATSCR-NEXT: scratch_store_dword off, v0, s0
-; FLATSCR-NEXT: s_add_u32 s0, s32, 12
+; FLATSCR-NEXT: s_add_i32 s0, s32, 12
; FLATSCR-NEXT: v_mov_b32_e32 v0, 11
-; FLATSCR-NEXT: s_add_u32 s2, s32, 16
+; FLATSCR-NEXT: s_add_i32 s2, s32, 16
; FLATSCR-NEXT: scratch_store_dword off, v0, s0
; FLATSCR-NEXT: v_mov_b32_e32 v0, 12
; FLATSCR-NEXT: s_getpc_b64 s[0:1]
@@ -189,13 +189,13 @@ define amdgpu_kernel void @kernel_caller_byval() {
; FLATSCR-NEXT: s_getpc_b64 s[0:1]
; FLATSCR-NEXT: s_add_u32 s0, s0, external_void_func_byval@rel32@lo+4
; FLATSCR-NEXT: s_addc_u32 s1, s1, external_void_func_byval@rel32@hi+12
-; FLATSCR-NEXT: s_add_u32 s2, s32, 8
-; FLATSCR-NEXT: s_add_u32 s3, s32, 16
-; FLATSCR-NEXT: s_add_u32 s4, s32, 24
-; FLATSCR-NEXT: s_add_u32 s5, s32, 32
-; FLATSCR-NEXT: s_add_u32 s6, s32, 40
-; FLATSCR-NEXT: s_add_u32 s7, s32, 48
-; FLATSCR-NEXT: s_add_u32 s8, s32, 56
+; FLATSCR-NEXT: s_add_i32 s2, s32, 8
+; FLATSCR-NEXT: s_add_i32 s3, s32, 16
+; FLATSCR-NEXT: s_add_i32 s4, s32, 24
+; FLATSCR-NEXT: s_add_i32 s5, s32, 32
+; FLATSCR-NEXT: s_add_i32 s6, s32, 40
+; FLATSCR-NEXT: s_add_i32 s7, s32, 48
+; FLATSCR-NEXT: s_add_i32 s8, s32, 56
; FLATSCR-NEXT: s_waitcnt vmcnt(7)
; FLATSCR-NEXT: scratch_store_dwordx2 off, v[0:1], s32
; FLATSCR-NEXT: s_waitcnt vmcnt(7)
@@ -266,16 +266,16 @@ define void @func_caller_stack() {
; FLATSCR-NEXT: s_mov_b64 exec, s[2:3]
; FLATSCR-NEXT: s_add_i32 s32, s32, 16
; FLATSCR-NEXT: v_writelane_b32 v40, s0, 2
-; FLATSCR-NEXT: s_add_u32 s0, s32, 4
+; FLATSCR-NEXT: s_add_i32 s0, s32, 4
; FLATSCR-NEXT: v_mov_b32_e32 v0, 9
; FLATSCR-NEXT: scratch_store_dword off, v0, s0
-; FLATSCR-NEXT: s_add_u32 s0, s32, 8
+; FLATSCR-NEXT: s_add_i32 s0, s32, 8
; FLATSCR-NEXT: v_mov_b32_e32 v0, 10
; FLATSCR-NEXT: scratch_store_dword off, v0, s0
-; FLATSCR-NEXT: s_add_u32 s0, s32, 12
+; FLATSCR-NEXT: s_add_i32 s0, s32, 12
; FLATSCR-NEXT: v_mov_b32_e32 v0, 11
; FLATSCR-NEXT: scratch_store_dword off, v0, s0
-; FLATSCR-NEXT: s_add_u32 s0, s32, 16
+; FLATSCR-NEXT: s_add_i32 s0, s32, 16
; FLATSCR-NEXT: v_mov_b32_e32 v0, 12
; FLATSCR-NEXT: v_writelane_b32 v40, s30, 0
; FLATSCR-NEXT: scratch_store_dword off, v0, s0
@@ -393,8 +393,8 @@ define void @func_caller_byval(ptr addrspace(5) %argptr) {
; FLATSCR-NEXT: s_add_i32 s32, s32, 16
; FLATSCR-NEXT: v_add_u32_e32 v3, 8, v0
; FLATSCR-NEXT: v_writelane_b32 v40, s0, 2
-; FLATSCR-NEXT: s_add_u32 s0, s32, 8
-; FLATSCR-NEXT: s_add_u32 s2, s32, 56
+; FLATSCR-NEXT: s_add_i32 s0, s32, 8
+; FLATSCR-NEXT: s_add_i32 s2, s32, 56
; FLATSCR-NEXT: v_writelane_b32 v40, s30, 0
; FLATSCR-NEXT: v_writelane_b32 v40, s31, 1
; FLATSCR-NEXT: s_waitcnt vmcnt(0)
@@ -404,28 +404,28 @@ define void @func_caller_byval(ptr addrspace(5) %argptr) {
; FLATSCR-NEXT: s_waitcnt vmcnt(0)
; FLATSCR-NEXT: scratch_store_dwordx2 off, v[1:2], s0
; FLATSCR-NEXT: scratch_load_dwordx2 v[1:2], v3, off
-; FLATSCR-NEXT: s_add_u32 s0, s32, 16
+; FLATSCR-NEXT: s_add_i32 s0, s32, 16
; FLATSCR-NEXT: v_add_u32_e32 v3, 24, v0
; FLATSCR-NEXT: s_waitcnt vmcnt(0)
; FLATSCR-NEXT: scratch_store_dwordx2 off, v[1:2], s0
; FLATSCR-NEXT: scratch_load_dwordx2 v[1:2], v3, off
-; FLATSCR-NEXT: s_add_u32 s0, s32, 24
+; FLATSCR-NEXT: s_add_i32 s0, s32, 24
; FLATSCR-NEXT: v_add_u32_e32 v3, 32, v0
; FLATSCR-NEXT: s_waitcnt vmcnt(0)
; FLATSCR-NEXT: scratch_store_dwordx2 off, v[1:2], s0
; FLATSCR-NEXT: scratch_load_dwordx2 v[1:2], v3, off
-; FLATSCR-NEXT: s_add_u32 s0, s32, 32
+; FLATSCR-NEXT: s_add_i32 s0, s32, 32
; FLATSCR-NEXT: v_add_u32_e32 v3, 40, v0
; FLATSCR-NEXT: s_waitcnt vmcnt(0)
; FLATSCR-NEXT: scratch_store_dwordx2 off, v[1:2], s0
; FLATSCR-NEXT: scratch_load_dwordx2 v[1:2], v3, off
-; FLATSCR-NEXT: s_add_u32 s0, s32, 40
+; FLATSCR-NEXT: s_add_i32 s0, s32, 40
; FLATSCR-NEXT: v_add_u32_e32 v3, 48, v0
; FLATSCR-NEXT: v_add_u32_e32 v0, 56, v0
; FLATSCR-NEXT: s_waitcnt vmcnt(0)
; FLATSCR-NEXT: scratch_store_dwordx2 off, v[1:2], s0
; FLATSCR-NEXT: scratch_load_dwordx2 v[1:2], v3, off
-; FLATSCR-NEXT: s_add_u32 s0, s32, 48
+; FLATSCR-NEXT: s_add_i32 s0, s32, 48
; FLATSCR-NEXT: s_waitcnt vmcnt(0)
; FLATSCR-NEXT: scratch_store_dwordx2 off, v[1:2], s0
; FLATSCR-NEXT: scratch_load_dwordx2 v[0:1], v0, off
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/dynamic-alloca-uniform.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/dynamic-alloca-uniform.ll
index 6b767d9e754be..a1bb8b390847f 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/dynamic-alloca-uniform.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/dynamic-alloca-uniform.ll
@@ -20,7 +20,7 @@ define amdgpu_kernel void @kernel_dynamic_stackalloc_sgpr_align4(i32 %n) {
; GFX9-NEXT: v_mov_b32_e32 v1, s4
; GFX9-NEXT: s_lshl_b32 s5, s5, 6
; GFX9-NEXT: s_mov_b32 s33, 0
-; GFX9-NEXT: s_add_u32 s32, s4, s5
+; GFX9-NEXT: s_add_i32 s32, s4, s5
; GFX9-NEXT: buffer_store_dword v0, v1, s[0:3], 0 offen
; GFX9-NEXT: s_endpgm
;
@@ -39,7 +39,7 @@ define amdgpu_kernel void @kernel_dynamic_stackalloc_sgpr_align4(i32 %n) {
; GFX10-NEXT: s_lshl2_add_u32 s5, s5, 15
; GFX10-NEXT: s_and_b32 s5, s5, -16
; GFX10-NEXT: s_lshl_b32 s5, s5, 5
-; GFX10-NEXT: s_add_u32 s32, s4, s5
+; GFX10-NEXT: s_add_i32 s32, s4, s5
; GFX10-NEXT: s_endpgm
;
; GFX11-LABEL: kernel_dynamic_stackalloc_sgpr_align4:
@@ -56,7 +56,7 @@ define amdgpu_kernel void @kernel_dynamic_stackalloc_sgpr_align4(i32 %n) {
; GFX11-NEXT: s_and_b32 s1, s1, -16
; GFX11-NEXT: s_lshl_b32 s1, s1, 5
; GFX11-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
-; GFX11-NEXT: s_add_u32 s32, s0, s1
+; GFX11-NEXT: s_add_i32 s32, s0, s1
; GFX11-NEXT: s_endpgm
%alloca = alloca i32, i32 %n, align 4, addrspace(5)
store i32 0, ptr addrspace(5) %alloca
@@ -84,7 +84,7 @@ define void @func_dynamic_stackalloc_sgpr_align4() {
; GFX9-NEXT: s_lshl2_add_u32 s4, s4, 15
; GFX9-NEXT: s_and_b32 s4, s4, -16
; GFX9-NEXT: s_lshl_b32 s4, s4, 6
-; GFX9-NEXT: s_add_u32 s32, s6, s4
+; GFX9-NEXT: s_add_i32 s32, s6, s4
; GFX9-NEXT: s_mov_b32 s32, s33
; GFX9-NEXT: s_mov_b32 s33, s7
; GFX9-NEXT: s_waitcnt vmcnt(0)
@@ -110,7 +110,7 @@ define void @func_dynamic_stackalloc_sgpr_align4() {
; GFX10-NEXT: s_lshl2_add_u32 s4, s4, 15
; GFX10-NEXT: s_and_b32 s4, s4, -16
; GFX10-NEXT: s_lshl_b32 s4, s4, 5
-; GFX10-NEXT: s_add_u32 s32, s6, s4
+; GFX10-NEXT: s_add_i32 s32, s6, s4
; GFX10-NEXT: s_mov_b32 s32, s33
; GFX10-NEXT: s_mov_b32 s33, s7
; GFX10-NEXT: s_setpc_b64 s[30:31]
@@ -136,7 +136,7 @@ define void @func_dynamic_stackalloc_sgpr_align4() {
; GFX11-NEXT: s_and_b32 s0, s0, -16
; GFX11-NEXT: s_lshl_b32 s0, s0, 5
; GFX11-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
-; GFX11-NEXT: s_add_u32 s32, s2, s0
+; GFX11-NEXT: s_add_i32 s32, s2, s0
; GFX11-NEXT: s_mov_b32 s32, s33
; GFX11-NEXT: s_mov_b32 s33, s3
; GFX11-NEXT: s_setpc_b64 s[30:31]
@@ -161,7 +161,7 @@ define amdgpu_kernel void @kernel_dynamic_stackalloc_sgpr_align16(i32 %n) {
; GFX9-NEXT: v_mov_b32_e32 v1, s4
; GFX9-NEXT: s_lshl_b32 s5, s5, 6
; GFX9-NEXT: s_mov_b32 s33, 0
-; GFX9-NEXT: s_add_u32 s32, s4, s5
+; GFX9-NEXT: s_add_i32 s32, s4, s5
; GFX9-NEXT: buffer_store_dword v0, v1, s[0:3], 0 offen
; GFX9-NEXT: s_endpgm
;
@@ -180,7 +180,7 @@ define amdgpu_kernel void @kernel_dynamic_stackalloc_sgpr_align16(i32 %n) {
; GFX10-NEXT: s_lshl2_add_u32 s5, s5, 15
; GFX10-NEXT: s_and_b32 s5, s5, -16
; GFX10-NEXT: s_lshl_b32 s5, s5, 5
-; GFX10-NEXT: s_add_u32 s32, s4, s5
+; GFX10-NEXT: s_add_i32 s32, s4, s5
; GFX10-NEXT: s_endpgm
;
; GFX11-LABEL: kernel_dynamic_stackalloc_sgpr_align16:
@@ -197,7 +197,7 @@ define amdgpu_kernel void @kernel_dynamic_stackalloc_sgpr_align16(i32 %n) {
; GFX11-NEXT: s_and_b32 s1, s1, -16
; GFX11-NEXT: s_lshl_b32 s1, s1, 5
; GFX11-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
-; GFX11-NEXT: s_add_u32 s32, s0, s1
+; GFX11-NEXT: s_add_i32 s32, s0, s1
; GFX11-NEXT: s_endpgm
%alloca = alloca i32, i32 %n, align 16, addrspace(5)
store i32 0, ptr addrspace(5) %alloca
@@ -225,7 +225,7 @@ define void @func_dynamic_stackalloc_sgpr_align16() {
; GFX9-NEXT: s_lshl2_add_u32 s4, s4, 15
; GFX9-NEXT: s_and_b32 s4, s4, -16
; GFX9-NEXT: s_lshl_b32 s4, s4, 6
-; GFX9-NEXT: s_add_u32 s32, s6, s4
+; GFX9-NEXT: s_add_i32 s32, s6, s4
; GFX9-NEXT: s_mov_b32 s32, s33
; GFX9-NEXT: s_mov_b32 s33, s7
; GFX9-NEXT: s_waitcnt vmcnt(0)
@@ -251,7 +251,7 @@ define void @func_dynamic_stackalloc_sgpr_align16() {
; GFX10-NEXT: s_lshl2_add_u32 s4, s4, 15
; GFX10-NEXT: s_and_b32 s4, s4, -16
; GFX10-NEXT: s_lshl_b32 s4, s4, 5
-; GFX10-NEXT: s_add_u32 s32, s6, s4
+; GFX10-NEXT: s_add_i32 s32, s6, s4
; GFX10-NEXT: s_mov_b32 s32, s33
; GFX10-NEXT: s_mov_b32 s33, s7
; GFX10-NEXT: s_setpc_b64 s[30:31]
@@ -277,7 +277,7 @@ define void @func_dynamic_stackalloc_sgpr_align16() {
; GFX11-NEXT: s_and_b32 s0, s0, -16
; GFX11-NEXT: s_lshl_b32 s0, s0, 5
; GFX11-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
-; GFX11-NEXT: s_add_u32 s32, s2, s0
+; GFX11-NEXT: s_add_i32 s32, s2, s0
; GFX11-NEXT: s_mov_b32 s32, s33
; GFX11-NEXT: s_mov_b32 s33, s3
; GFX11-NEXT: s_setpc_b64 s[30:31]
@@ -294,7 +294,7 @@ define amdgpu_kernel void @kernel_dynamic_stackalloc_sgpr_align32(i32 %n) {
; GFX9-NEXT: s_movk_i32 s32, 0x800
; GFX9-NEXT: s_add_u32 s0, s0, s17
; GFX9-NEXT: s_addc_u32 s1, s1, 0
-; GFX9-NEXT: s_add_u32 s5, s32, 0x7ff
+; GFX9-NEXT: s_add_i32 s5, s32, 0x7ff
; GFX9-NEXT: s_waitcnt lgkmcnt(0)
; GFX9-NEXT: s_lshl2_add_u32 s4, s4, 15
; GFX9-NEXT: s_and_b32 s5, s5, 0xfffff800
@@ -303,7 +303,7 @@ define amdgpu_kernel void @kernel_dynamic_stackalloc_sgpr_align32(i32 %n) {
; GFX9-NEXT: v_mov_b32_e32 v1, s5
; GFX9-NEXT: s_lshl_b32 s4, s4, 6
; GFX9-NEXT: s_mov_b32 s33, 0
-; GFX9-NEXT: s_add_u32 s32, s5, s4
+; GFX9-NEXT: s_add_i32 s32, s5, s4
; GFX9-NEXT: buffer_store_dword v0, v1, s[0:3], 0 offen
; GFX9-NEXT: s_endpgm
;
@@ -313,7 +313,7 @@ define amdgpu_kernel void @kernel_dynamic_stackalloc_sgpr_align32(i32 %n) {
; GFX10-NEXT: s_movk_i32 s32, 0x400
; GFX10-NEXT: s_add_u32 s0, s0, s17
; GFX10-NEXT: s_addc_u32 s1, s1, 0
-; GFX10-NEXT: s_add_u32 s5, s32, 0x3ff
+; GFX10-NEXT: s_add_i32 s5, s32, 0x3ff
; GFX10-NEXT: v_mov_b32_e32 v0, 0
; GFX10-NEXT: s_and_b32 s5, s5, 0xfffffc00
; GFX10-NEXT: s_mov_b32 s33, 0
@@ -323,7 +323,7 @@ define amdgpu_kernel void @kernel_dynamic_stackalloc_sgpr_align32(i32 %n) {
; GFX10-NEXT: s_lshl2_add_u32 s4, s4, 15
; GFX10-NEXT: s_and_b32 s4, s4, -16
; GFX10-NEXT: s_lshl_b32 s4, s4, 5
-; GFX10-NEXT: s_add_u32 s32, s5, s4
+; GFX10-NEXT: s_add_i32 s32, s5, s4
; GFX10-NEXT: s_endpgm
;
; GFX11-LABEL: kernel_dynamic_stackalloc_sgpr_align32:
@@ -331,7 +331,7 @@ define amdgpu_kernel void @kernel_dynamic_stackalloc_sgpr_align32(i32 %n) {
; GFX11-NEXT: s_load_b32 s0, s[4:5], 0x0
; GFX11-NEXT: s_mov_b32 s32, 32
; GFX11-NEXT: v_mov_b32_e32 v0, 0
-; GFX11-NEXT: s_add_u32 s1, s32, 0x3ff
+; GFX11-NEXT: s_add_i32 s1, s32, 0x3ff
; GFX11-NEXT: s_mov_b32 s33, 0
; GFX11-NEXT: s_and_b32 s1, s1, 0xfffffc00
; GFX11-NEXT: scratch_store_b32 off, v0, s1
@@ -341,7 +341,7 @@ define amdgpu_kernel void @kernel_dynamic_stackalloc_sgpr_align32(i32 %n) {
; GFX11-NEXT: s_and_b32 s0, s0, -16
; GFX11-NEXT: s_lshl_b32 s0, s0, 5
; GFX11-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
-; GFX11-NEXT: s_add_u32 s32, s1, s0
+; GFX11-NEXT: s_add_i32 s32, s1, s0
; GFX11-NEXT: s_endpgm
%alloca = alloca i32, i32 %n, align 32, addrspace(5)
store i32 0, ptr addrspace(5) %alloca
@@ -366,7 +366,7 @@ define void @func_dynamic_stackalloc_sgpr_align32(ptr addrspace(1) %out) {
; GFX9-NEXT: s_mov_b32 s33, s6
; GFX9-NEXT: s_waitcnt lgkmcnt(0)
; GFX9-NEXT: s_load_dword s4, s[4:5], 0x0
-; GFX9-NEXT: s_add_u32 s5, s32, 0x7ff
+; GFX9-NEXT: s_add_i32 s5, s32, 0x7ff
; GFX9-NEXT: s_and_b32 s5, s5, 0xfffff800
; GFX9-NEXT: v_mov_b32_e32 v1, s5
; GFX9-NEXT: buffer_store_dword v0, v1, s[0:3], 0 offen
@@ -374,7 +374,7 @@ define void @func_dynamic_stackalloc_sgpr_align32(ptr addrspace(1) %out) {
; GFX9-NEXT: s_lshl2_add_u32 s4, s4, 15
; GFX9-NEXT: s_and_b32 s4, s4, -16
; GFX9-NEXT: s_lshl_b32 s4, s4, 6
-; GFX9-NEXT: s_add_u32 s32, s5, s4
+; GFX9-NEXT: s_add_i32 s32, s5, s4
; GFX9-NEXT: s_mov_b32 s32, s34
; GFX9-NEXT: s_mov_b32 s34, s7
; GFX9-NEXT: s_waitcnt vmcnt(0)
@@ -397,7 +397,7 @@ define void @func_dynamic_stackalloc_sgpr_align32(ptr addrspace(1) %out) {
; GFX10-NEXT: s_mov_b32 s33, s6
; GFX10-NEXT: s_waitcnt lgkmcnt(0)
; GFX10-NEXT: s_load_dword s4, s[4:5], 0x0
-; GFX10-NEXT: s_add_u32 s5, s32, 0x3ff
+; GFX10-NEXT: s_add_i32 s5, s32, 0x3ff
; GFX10-NEXT: s_and_b32 s5, s5, 0xfffffc00
; GFX10-NEXT: v_mov_b32_e32 v1, s5
; GFX10-NEXT: buffer_store_dword v0, v1, s[0:3], 0 offen
@@ -405,7 +405,7 @@ define void @func_dynamic_stackalloc_sgpr_align32(ptr addrspace(1) %out) {
; GFX10-NEXT: s_lshl2_add_u32 s4, s4, 15
; GFX10-NEXT: s_and_b32 s4, s4, -16
; GFX10-NEXT: s_lshl_b32 s4, s4, 5
-; GFX10-NEXT: s_add_u32 s32, s5, s4
+; GFX10-NEXT: s_add_i32 s32, s5, s4
; GFX10-NEXT: s_mov_b32 s32, s34
; GFX10-NEXT: s_mov_b32 s34, s7
; GFX10-NEXT: s_setpc_b64 s[30:31]
@@ -427,7 +427,7 @@ define void @func_dynamic_stackalloc_sgpr_align32(ptr addrspace(1) %out) {
; GFX11-NEXT: s_mov_b32 s33, s2
; GFX11-NEXT: s_waitcnt lgkmcnt(0)
; GFX11-NEXT: s_load_b32 s0, s[0:1], 0x0
-; GFX11-NEXT: s_add_u32 s1, s32, 0x3ff
+; GFX11-NEXT: s_add_i32 s1, s32, 0x3ff
; GFX11-NEXT: s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_3) | instid1(SALU_CYCLE_1)
; GFX11-NEXT: s_and_b32 s1, s1, 0xfffffc00
; GFX11-NEXT: scratch_store_b32 off, v0, s1
@@ -436,7 +436,7 @@ define void @func_dynamic_stackalloc_sgpr_align32(ptr addrspace(1) %out) {
; GFX11-NEXT: s_and_b32 s0, s0, -16
; GFX11-NEXT: s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(SALU_CYCLE_1)
; GFX11-NEXT: s_lshl_b32 s0, s0, 5
-; GFX11-NEXT: s_add_u32 s32, s1, s0
+; GFX11-NEXT: s_add_i32 s32, s1, s0
; GFX11-NEXT: s_mov_b32 s32, s34
; GFX11-NEXT: s_mov_b32 s34, s3
; GFX11-NEXT: s_setpc_b64 s[30:31]
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/flat-scratch.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/flat-scratch.ll
index 8a80afd4a768f..d1083588e8ac0 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/flat-scratch.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/flat-scratch.ll
@@ -855,7 +855,7 @@ define amdgpu_kernel void @store_load_vindex_small_offset_kernel(i32 %n) {
; GFX9-NEXT: s_lshl_b32 s0, s0, 7
; GFX9-NEXT: v_sub_u32_e32 v0, 0, v0
; GFX9-NEXT: v_add_u32_e32 v1, 0x100, v1
-; GFX9-NEXT: s_add_u32 s0, 0x100, s0
+; GFX9-NEXT: s_addk_i32 s0, 0x100
; GFX9-NEXT: v_mov_b32_e32 v2, 15
; GFX9-NEXT: v_lshlrev_b32_e32 v0, 2, v0
; GFX9-NEXT: scratch_store_dword v1, v2, off offset:128
@@ -883,7 +883,7 @@ define amdgpu_kernel void @store_load_vindex_small_offset_kernel(i32 %n) {
; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
; GFX10-NEXT: s_waitcnt lgkmcnt(0)
; GFX10-NEXT: s_lshl_b32 s0, s0, 7
-; GFX10-NEXT: s_add_u32 s0, 0x100, s0
+; GFX10-NEXT: s_addk_i32 s0, 0x100
; GFX10-NEXT: v_add_nc_u32_e32 v1, s0, v1
; GFX10-NEXT: scratch_load_dword v0, v1, off offset:124 glc dlc
; GFX10-NEXT: s_waitcnt vmcnt(0)
@@ -899,7 +899,7 @@ define amdgpu_kernel void @store_load_vindex_small_offset_kernel(i32 %n) {
; GFX942-NEXT: v_sub_u32_e32 v0, 0, v0
; GFX942-NEXT: s_waitcnt lgkmcnt(0)
; GFX942-NEXT: s_lshl_b32 s0, s0, 7
-; GFX942-NEXT: s_add_u32 s0, 0x100, s0
+; GFX942-NEXT: s_addk_i32 s0, 0x100
; GFX942-NEX...
[truncated]
GlobalISel should not select the pseudos. We should directly implement the legality rules for 64-bit pointer arithmetic.
I think the only plus is that it removes one useless entry from the allocation order, since exec is always reserved and will never be used for the assignment.
@@ -10419,6 +10426,11 @@ SDValue SITargetLowering::LowerINTRINSIC_VOID(SDValue Op,
  }
}

bool SITargetLowering::shouldPreservePtrArith(const Function &F,
                                              EVT PtrVT) const {
  return UseSelectionDAGPTRADD;
}
Since the only case you care about is the 64-bit flat case, you could split this and only handle 64-bit pointers initially
Done.
def : GCNPat<
  (DivergentBinFrag<ptradd> i64:$src0, i64:$src1),
  (V_ADD_U64_PSEUDO $src0, $src1)>;
Maybe should set GISelShouldIgnore on the 64-bit cases, the pseudos are mostly a SelectionDAG hack
Done.
// Whether we select S_ADD_I32 or S_ADD_U32 does not make much of a
// difference. Most notably, S_ADD_I32 instructions can be transformed
// to S_ADDK_I32, so we select that.
def : GCNPat<
  (UniformBinFrag<ptradd> i32:$src0, i32:$src1),
  (S_ADD_I32 $src0, $src1)>;
Is there now dead manual select code?
I don't think so; GlobalISel lowers that in AMDGPUInstructionSelector::selectG_ADD_SUB, with the same code that lowers normal ADDs.
There certainly isn't with the current version of the PR, with 32-bit handling in general and 64-bit handling for GlobalISel disabled.
// the SelectionDAG.
static cl::opt<bool> UseSelectionDAGPTRADD(
    "amdgpu-use-sdag-ptradd", cl::Hidden,
    cl::desc("Generate ISD::PTRADD nodes for 64-bit pointer arithmetic in the "
why only 32-bit pointer?
Enable generation of PTRADD SelectionDAG nodes for pointer arithmetic for SI,
for now behind an internal CLI option. Also add basic patterns to match these
nodes. Optimizations will come in follow-up PRs. Basic tests for SDAG codegen
with PTRADD are in test/CodeGen/AMDGPU/ptradd-sdag.ll
Only affects 64-bit address spaces for now, since the immediate use case only
affects the flat address space.
For SWDEV-516125.