
[AMDGPU][SDAG] Initial support for ISD::PTRADD #141725


Open · wants to merge 3 commits into main

Conversation

@ritter-x2a ritter-x2a commented May 28, 2025

Enable generation of PTRADD SelectionDAG nodes for pointer arithmetic for SI,
for now behind an internal CLI option. Also add basic patterns to match these
nodes. Optimizations will come in follow-up PRs. Basic tests for SDAG codegen
with PTRADD are in test/CodeGen/AMDGPU/ptradd-sdag.ll

Only affects 64-bit address spaces for now, since the immediate use case only
affects the flat address space.

For SWDEV-516125.
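For illustration, here is a minimal example of the kind of pointer arithmetic this targets (hypothetical; not taken from ptradd-sdag.ll, and the function name is made up). With the new flag enabled, the getelementptr in a 64-bit address space should be built as an ISD::PTRADD node rather than decomposed into a plain ISD::ADD during SelectionDAG construction:

```llvm
; Hypothetical flat-address-space (64-bit) pointer offset.
define void @flat_gep_example(ptr %p, i64 %idx) {
  %q = getelementptr inbounds i32, ptr %p, i64 %idx
  store i32 1, ptr %q
  ret void
}
```

Feeding this through llc with -amdgpu-use-sdag-ptradd would exercise the new patterns; the function name and exact invocation here are assumptions, not part of the patch.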

Since GlobalISel also uses the PTRADD SDAG patterns via SelectionDAGCompat,
this change affects GlobalISel tests:
- Uniform 32-bit address arithmetic is now lowered to s_add_i32 instead of
  s_add_u32, which is consistent with what SDAG does (and gives
  SIShrinkInstructions the chance to generate s_addk_i32).
- 64-bit address arithmetic uses the [sv]_add_u64 pseudos, which is consistent
  with SDAG and means that GISel now generates 64-bit adds for gfx12. The only
  drawback is that we could save 1-2 instructions by not using 64-bit adds
  with a >32-bit immediate (two movs with 32-bit immediates, an s_delay_alu,
  and a 64-bit add vs. two 32-bit adds with immediates), but that's a
  separate problem.
- The register class for the dead carry-out/sign-bit operand of
  V_ADD_CO_U32_e64 on architectures without carry-less additions is now
  sreg_64 instead of sreg_64_xexec. I'm not sure whether that loses us
  something worth preserving; I haven't found an obvious way to avoid it.

Overall, the changes in the GlobalISel tests seem to be improvements.


llvmbot commented May 28, 2025

@llvm/pr-subscribers-llvm-selectiondag
@llvm/pr-subscribers-llvm-globalisel

@llvm/pr-subscribers-backend-amdgpu

Author: Fabian Ritter (ritter-x2a)

Changes


Patch is 1.01 MiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/141725.diff

50 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/SIISelLowering.cpp (+12)
  • (modified) llvm/lib/Target/AMDGPU/SIISelLowering.h (+2)
  • (modified) llvm/lib/Target/AMDGPU/SIInstructions.td (+31)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/call-outgoing-stack-args.ll (+22-22)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/dynamic-alloca-uniform.ll (+24-24)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/flat-scratch.ll (+48-48)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/insertelement-stack-lower.ll (+3-3)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/inst-select-amdgpu-atomic-cmpxchg-flat.mir (+16-64)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/inst-select-amdgpu-atomic-cmpxchg-global.mir (+32-92)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/inst-select-atomic-cmpxchg-local.mir (+14-2)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/inst-select-atomic-cmpxchg-region.mir (+14-2)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/inst-select-atomicrmw-add-flat.mir (+48-192)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/inst-select-atomicrmw-add-global.mir (+44-176)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/inst-select-atomicrmw-xchg-local.mir (+6-2)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/inst-select-atomicrmw-xchg-region.mir (+6-2)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/inst-select-load-atomic-flat.mir (+12-48)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/inst-select-load-atomic-global.mir (+24-72)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/inst-select-load-atomic-local.mir (+19-1)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/inst-select-load-constant.mir (+20-80)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/inst-select-load-flat.mir (+152-608)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/inst-select-load-global-old-legalization.mir (+194-680)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/inst-select-load-global-saddr.mir (+72-240)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/inst-select-load-global.mir (+194-680)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/inst-select-load-local-128.mir (+25-1)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/inst-select-load-local.mir (+6-6)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/inst-select-load-private.mir (+17-17)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/inst-select-load-smrd.mir (+3-21)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/inst-select-pattern-add3.mir (+6-6)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/inst-select-ptr-add.mir (+84-234)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/inst-select-sextload-local.mir (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/inst-select-store-flat.mir (+50-200)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/inst-select-store-global.mir (+56-200)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/inst-select-store-local.mir (+52-1)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/inst-select-store-private.mir (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/inst-select-zextload-local.mir (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/lds-relocs.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/lds-zero-initializer.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/mubuf-global.ll (+30-23)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/non-entry-alloca.ll (+8-8)
  • (modified) llvm/test/CodeGen/AMDGPU/amdgpu-cs-chain-cc.ll (+15-15)
  • (modified) llvm/test/CodeGen/AMDGPU/atomics_cond_sub.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/dynamic_stackalloc.ll (+64-64)
  • (modified) llvm/test/CodeGen/AMDGPU/gfx12_scalar_subword_loads.ll (+44-90)
  • (modified) llvm/test/CodeGen/AMDGPU/global-saddr-load.ll (+134-238)
  • (modified) llvm/test/CodeGen/AMDGPU/isel-amdgpu-cs-chain-cc.ll (+262-262)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.prefetch.ll (+29-54)
  • (modified) llvm/test/CodeGen/AMDGPU/offset-split-flat.ll (+36-24)
  • (modified) llvm/test/CodeGen/AMDGPU/offset-split-global.ll (+36-24)
  • (added) llvm/test/CodeGen/AMDGPU/ptradd-sdag.ll (+1013)
  • (modified) llvm/test/CodeGen/AMDGPU/spill-vgpr-block.ll (+1-2)
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index ade88a16193b8..15e060a6e8e59 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -61,6 +61,13 @@ static cl::opt<bool> UseDivergentRegisterIndexing(
     cl::desc("Use indirect register addressing for divergent indexes"),
     cl::init(false));
 
+// TODO This option should be removed once we switch to always using PTRADD in
+// the SelectionDAG.
+static cl::opt<bool> UseSelectionDAGPTRADD(
+    "amdgpu-use-sdag-ptradd", cl::Hidden,
+    cl::desc("Generate ISD::PTRADD nodes in the SelectionDAG ISel"),
+    cl::init(false));
+
 static bool denormalModeIsFlushAllF32(const MachineFunction &MF) {
   const SIMachineFunctionInfo *Info = MF.getInfo<SIMachineFunctionInfo>();
   return Info->getMode().FP32Denormals == DenormalMode::getPreserveSign();
@@ -10419,6 +10426,11 @@ SDValue SITargetLowering::LowerINTRINSIC_VOID(SDValue Op,
   }
 }
 
+bool SITargetLowering::shouldPreservePtrArith(const Function &F,
+                                              EVT PtrVT) const {
+  return UseSelectionDAGPTRADD;
+}
+
 // The raw.(t)buffer and struct.(t)buffer intrinsics have two offset args:
 // offset (the offset that is included in bounds checking and swizzling, to be
 // split between the instruction's voffset and immoffset fields) and soffset
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.h b/llvm/lib/Target/AMDGPU/SIISelLowering.h
index c42366a1c04c8..bd9ec7cb8ec48 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.h
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.h
@@ -258,6 +258,8 @@ class SITargetLowering final : public AMDGPUTargetLowering {
 
   bool shouldExpandVectorDynExt(SDNode *N) const;
 
+  bool shouldPreservePtrArith(const Function &F, EVT PtrVT) const override;
+
 private:
   // Analyze a combined offset from an amdgcn_s_buffer_load intrinsic and store
   // the three offsets (voffset, soffset and instoffset) into the SDValue[3]
diff --git a/llvm/lib/Target/AMDGPU/SIInstructions.td b/llvm/lib/Target/AMDGPU/SIInstructions.td
index 2e2913d88cc54..3ded1393e2ce3 100644
--- a/llvm/lib/Target/AMDGPU/SIInstructions.td
+++ b/llvm/lib/Target/AMDGPU/SIInstructions.td
@@ -1376,6 +1376,37 @@ def : GCNPat <
       (i32 (V_MOV_B32_e32 (i32 0))), sub1)
 >;
 
+//===----------------------------------------------------------------------===//
+// PTRADD Patterns
+//===----------------------------------------------------------------------===//
+
+def : GCNPat<
+  (DivergentBinFrag<ptradd> i64:$src0, i64:$src1),
+  (V_ADD_U64_PSEUDO $src0, $src1)>;
+
+def : GCNPat<
+  (DivergentBinFrag<ptradd> i32:$src0, i32:$src1),
+  (V_ADD_U32_e64 $src0, $src1, 0)> {
+  let SubtargetPredicate = HasAddNoCarryInsts;
+}
+
+def : GCNPat<
+  (DivergentBinFrag<ptradd> i32:$src0, i32:$src1),
+  (V_ADD_CO_U32_e64 $src0, $src1)> {
+  let SubtargetPredicate = NotHasAddNoCarryInsts;
+}
+
+def : GCNPat<
+  (UniformBinFrag<ptradd> i64:$src0, i64:$src1),
+  (S_ADD_U64_PSEUDO $src0, $src1)>;
+
+// Whether we select S_ADD_I32 or S_ADD_U32 does not make much of a
+// difference. Most notably, S_ADD_I32 instructions can be transformed
+// to S_ADDK_I32, so we select that.
+def : GCNPat<
+  (UniformBinFrag<ptradd> i32:$src0, i32:$src1),
+  (S_ADD_I32 $src0, $src1)>;
+
 /********** ============================================ **********/
 /********** Extraction, Insertion, Building and Casting  **********/
 /********** ============================================ **********/
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/call-outgoing-stack-args.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/call-outgoing-stack-args.ll
index 7adaddf2fc8ba..5a3b36fc1ada2 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/call-outgoing-stack-args.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/call-outgoing-stack-args.ll
@@ -36,15 +36,15 @@ define amdgpu_kernel void @kernel_caller_stack() {
 ; FLATSCR-NEXT:    s_mov_b32 s32, 0
 ; FLATSCR-NEXT:    s_add_u32 flat_scratch_lo, s8, s13
 ; FLATSCR-NEXT:    s_addc_u32 flat_scratch_hi, s9, 0
-; FLATSCR-NEXT:    s_add_u32 s0, s32, 4
+; FLATSCR-NEXT:    s_add_i32 s0, s32, 4
 ; FLATSCR-NEXT:    v_mov_b32_e32 v0, 9
 ; FLATSCR-NEXT:    scratch_store_dword off, v0, s0
-; FLATSCR-NEXT:    s_add_u32 s0, s32, 8
+; FLATSCR-NEXT:    s_add_i32 s0, s32, 8
 ; FLATSCR-NEXT:    v_mov_b32_e32 v0, 10
 ; FLATSCR-NEXT:    scratch_store_dword off, v0, s0
-; FLATSCR-NEXT:    s_add_u32 s0, s32, 12
+; FLATSCR-NEXT:    s_add_i32 s0, s32, 12
 ; FLATSCR-NEXT:    v_mov_b32_e32 v0, 11
-; FLATSCR-NEXT:    s_add_u32 s2, s32, 16
+; FLATSCR-NEXT:    s_add_i32 s2, s32, 16
 ; FLATSCR-NEXT:    scratch_store_dword off, v0, s0
 ; FLATSCR-NEXT:    v_mov_b32_e32 v0, 12
 ; FLATSCR-NEXT:    s_getpc_b64 s[0:1]
@@ -189,13 +189,13 @@ define amdgpu_kernel void @kernel_caller_byval() {
 ; FLATSCR-NEXT:    s_getpc_b64 s[0:1]
 ; FLATSCR-NEXT:    s_add_u32 s0, s0, external_void_func_byval@rel32@lo+4
 ; FLATSCR-NEXT:    s_addc_u32 s1, s1, external_void_func_byval@rel32@hi+12
-; FLATSCR-NEXT:    s_add_u32 s2, s32, 8
-; FLATSCR-NEXT:    s_add_u32 s3, s32, 16
-; FLATSCR-NEXT:    s_add_u32 s4, s32, 24
-; FLATSCR-NEXT:    s_add_u32 s5, s32, 32
-; FLATSCR-NEXT:    s_add_u32 s6, s32, 40
-; FLATSCR-NEXT:    s_add_u32 s7, s32, 48
-; FLATSCR-NEXT:    s_add_u32 s8, s32, 56
+; FLATSCR-NEXT:    s_add_i32 s2, s32, 8
+; FLATSCR-NEXT:    s_add_i32 s3, s32, 16
+; FLATSCR-NEXT:    s_add_i32 s4, s32, 24
+; FLATSCR-NEXT:    s_add_i32 s5, s32, 32
+; FLATSCR-NEXT:    s_add_i32 s6, s32, 40
+; FLATSCR-NEXT:    s_add_i32 s7, s32, 48
+; FLATSCR-NEXT:    s_add_i32 s8, s32, 56
 ; FLATSCR-NEXT:    s_waitcnt vmcnt(7)
 ; FLATSCR-NEXT:    scratch_store_dwordx2 off, v[0:1], s32
 ; FLATSCR-NEXT:    s_waitcnt vmcnt(7)
@@ -266,16 +266,16 @@ define void @func_caller_stack() {
 ; FLATSCR-NEXT:    s_mov_b64 exec, s[2:3]
 ; FLATSCR-NEXT:    s_add_i32 s32, s32, 16
 ; FLATSCR-NEXT:    v_writelane_b32 v40, s0, 2
-; FLATSCR-NEXT:    s_add_u32 s0, s32, 4
+; FLATSCR-NEXT:    s_add_i32 s0, s32, 4
 ; FLATSCR-NEXT:    v_mov_b32_e32 v0, 9
 ; FLATSCR-NEXT:    scratch_store_dword off, v0, s0
-; FLATSCR-NEXT:    s_add_u32 s0, s32, 8
+; FLATSCR-NEXT:    s_add_i32 s0, s32, 8
 ; FLATSCR-NEXT:    v_mov_b32_e32 v0, 10
 ; FLATSCR-NEXT:    scratch_store_dword off, v0, s0
-; FLATSCR-NEXT:    s_add_u32 s0, s32, 12
+; FLATSCR-NEXT:    s_add_i32 s0, s32, 12
 ; FLATSCR-NEXT:    v_mov_b32_e32 v0, 11
 ; FLATSCR-NEXT:    scratch_store_dword off, v0, s0
-; FLATSCR-NEXT:    s_add_u32 s0, s32, 16
+; FLATSCR-NEXT:    s_add_i32 s0, s32, 16
 ; FLATSCR-NEXT:    v_mov_b32_e32 v0, 12
 ; FLATSCR-NEXT:    v_writelane_b32 v40, s30, 0
 ; FLATSCR-NEXT:    scratch_store_dword off, v0, s0
@@ -393,8 +393,8 @@ define void @func_caller_byval(ptr addrspace(5) %argptr) {
 ; FLATSCR-NEXT:    s_add_i32 s32, s32, 16
 ; FLATSCR-NEXT:    v_add_u32_e32 v3, 8, v0
 ; FLATSCR-NEXT:    v_writelane_b32 v40, s0, 2
-; FLATSCR-NEXT:    s_add_u32 s0, s32, 8
-; FLATSCR-NEXT:    s_add_u32 s2, s32, 56
+; FLATSCR-NEXT:    s_add_i32 s0, s32, 8
+; FLATSCR-NEXT:    s_add_i32 s2, s32, 56
 ; FLATSCR-NEXT:    v_writelane_b32 v40, s30, 0
 ; FLATSCR-NEXT:    v_writelane_b32 v40, s31, 1
 ; FLATSCR-NEXT:    s_waitcnt vmcnt(0)
@@ -404,28 +404,28 @@ define void @func_caller_byval(ptr addrspace(5) %argptr) {
 ; FLATSCR-NEXT:    s_waitcnt vmcnt(0)
 ; FLATSCR-NEXT:    scratch_store_dwordx2 off, v[1:2], s0
 ; FLATSCR-NEXT:    scratch_load_dwordx2 v[1:2], v3, off
-; FLATSCR-NEXT:    s_add_u32 s0, s32, 16
+; FLATSCR-NEXT:    s_add_i32 s0, s32, 16
 ; FLATSCR-NEXT:    v_add_u32_e32 v3, 24, v0
 ; FLATSCR-NEXT:    s_waitcnt vmcnt(0)
 ; FLATSCR-NEXT:    scratch_store_dwordx2 off, v[1:2], s0
 ; FLATSCR-NEXT:    scratch_load_dwordx2 v[1:2], v3, off
-; FLATSCR-NEXT:    s_add_u32 s0, s32, 24
+; FLATSCR-NEXT:    s_add_i32 s0, s32, 24
 ; FLATSCR-NEXT:    v_add_u32_e32 v3, 32, v0
 ; FLATSCR-NEXT:    s_waitcnt vmcnt(0)
 ; FLATSCR-NEXT:    scratch_store_dwordx2 off, v[1:2], s0
 ; FLATSCR-NEXT:    scratch_load_dwordx2 v[1:2], v3, off
-; FLATSCR-NEXT:    s_add_u32 s0, s32, 32
+; FLATSCR-NEXT:    s_add_i32 s0, s32, 32
 ; FLATSCR-NEXT:    v_add_u32_e32 v3, 40, v0
 ; FLATSCR-NEXT:    s_waitcnt vmcnt(0)
 ; FLATSCR-NEXT:    scratch_store_dwordx2 off, v[1:2], s0
 ; FLATSCR-NEXT:    scratch_load_dwordx2 v[1:2], v3, off
-; FLATSCR-NEXT:    s_add_u32 s0, s32, 40
+; FLATSCR-NEXT:    s_add_i32 s0, s32, 40
 ; FLATSCR-NEXT:    v_add_u32_e32 v3, 48, v0
 ; FLATSCR-NEXT:    v_add_u32_e32 v0, 56, v0
 ; FLATSCR-NEXT:    s_waitcnt vmcnt(0)
 ; FLATSCR-NEXT:    scratch_store_dwordx2 off, v[1:2], s0
 ; FLATSCR-NEXT:    scratch_load_dwordx2 v[1:2], v3, off
-; FLATSCR-NEXT:    s_add_u32 s0, s32, 48
+; FLATSCR-NEXT:    s_add_i32 s0, s32, 48
 ; FLATSCR-NEXT:    s_waitcnt vmcnt(0)
 ; FLATSCR-NEXT:    scratch_store_dwordx2 off, v[1:2], s0
 ; FLATSCR-NEXT:    scratch_load_dwordx2 v[0:1], v0, off
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/dynamic-alloca-uniform.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/dynamic-alloca-uniform.ll
index 6b767d9e754be..a1bb8b390847f 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/dynamic-alloca-uniform.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/dynamic-alloca-uniform.ll
@@ -20,7 +20,7 @@ define amdgpu_kernel void @kernel_dynamic_stackalloc_sgpr_align4(i32 %n) {
 ; GFX9-NEXT:    v_mov_b32_e32 v1, s4
 ; GFX9-NEXT:    s_lshl_b32 s5, s5, 6
 ; GFX9-NEXT:    s_mov_b32 s33, 0
-; GFX9-NEXT:    s_add_u32 s32, s4, s5
+; GFX9-NEXT:    s_add_i32 s32, s4, s5
 ; GFX9-NEXT:    buffer_store_dword v0, v1, s[0:3], 0 offen
 ; GFX9-NEXT:    s_endpgm
 ;
@@ -39,7 +39,7 @@ define amdgpu_kernel void @kernel_dynamic_stackalloc_sgpr_align4(i32 %n) {
 ; GFX10-NEXT:    s_lshl2_add_u32 s5, s5, 15
 ; GFX10-NEXT:    s_and_b32 s5, s5, -16
 ; GFX10-NEXT:    s_lshl_b32 s5, s5, 5
-; GFX10-NEXT:    s_add_u32 s32, s4, s5
+; GFX10-NEXT:    s_add_i32 s32, s4, s5
 ; GFX10-NEXT:    s_endpgm
 ;
 ; GFX11-LABEL: kernel_dynamic_stackalloc_sgpr_align4:
@@ -56,7 +56,7 @@ define amdgpu_kernel void @kernel_dynamic_stackalloc_sgpr_align4(i32 %n) {
 ; GFX11-NEXT:    s_and_b32 s1, s1, -16
 ; GFX11-NEXT:    s_lshl_b32 s1, s1, 5
 ; GFX11-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
-; GFX11-NEXT:    s_add_u32 s32, s0, s1
+; GFX11-NEXT:    s_add_i32 s32, s0, s1
 ; GFX11-NEXT:    s_endpgm
   %alloca = alloca i32, i32 %n, align 4, addrspace(5)
   store i32 0, ptr addrspace(5) %alloca
@@ -84,7 +84,7 @@ define void @func_dynamic_stackalloc_sgpr_align4() {
 ; GFX9-NEXT:    s_lshl2_add_u32 s4, s4, 15
 ; GFX9-NEXT:    s_and_b32 s4, s4, -16
 ; GFX9-NEXT:    s_lshl_b32 s4, s4, 6
-; GFX9-NEXT:    s_add_u32 s32, s6, s4
+; GFX9-NEXT:    s_add_i32 s32, s6, s4
 ; GFX9-NEXT:    s_mov_b32 s32, s33
 ; GFX9-NEXT:    s_mov_b32 s33, s7
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
@@ -110,7 +110,7 @@ define void @func_dynamic_stackalloc_sgpr_align4() {
 ; GFX10-NEXT:    s_lshl2_add_u32 s4, s4, 15
 ; GFX10-NEXT:    s_and_b32 s4, s4, -16
 ; GFX10-NEXT:    s_lshl_b32 s4, s4, 5
-; GFX10-NEXT:    s_add_u32 s32, s6, s4
+; GFX10-NEXT:    s_add_i32 s32, s6, s4
 ; GFX10-NEXT:    s_mov_b32 s32, s33
 ; GFX10-NEXT:    s_mov_b32 s33, s7
 ; GFX10-NEXT:    s_setpc_b64 s[30:31]
@@ -136,7 +136,7 @@ define void @func_dynamic_stackalloc_sgpr_align4() {
 ; GFX11-NEXT:    s_and_b32 s0, s0, -16
 ; GFX11-NEXT:    s_lshl_b32 s0, s0, 5
 ; GFX11-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
-; GFX11-NEXT:    s_add_u32 s32, s2, s0
+; GFX11-NEXT:    s_add_i32 s32, s2, s0
 ; GFX11-NEXT:    s_mov_b32 s32, s33
 ; GFX11-NEXT:    s_mov_b32 s33, s3
 ; GFX11-NEXT:    s_setpc_b64 s[30:31]
@@ -161,7 +161,7 @@ define amdgpu_kernel void @kernel_dynamic_stackalloc_sgpr_align16(i32 %n) {
 ; GFX9-NEXT:    v_mov_b32_e32 v1, s4
 ; GFX9-NEXT:    s_lshl_b32 s5, s5, 6
 ; GFX9-NEXT:    s_mov_b32 s33, 0
-; GFX9-NEXT:    s_add_u32 s32, s4, s5
+; GFX9-NEXT:    s_add_i32 s32, s4, s5
 ; GFX9-NEXT:    buffer_store_dword v0, v1, s[0:3], 0 offen
 ; GFX9-NEXT:    s_endpgm
 ;
@@ -180,7 +180,7 @@ define amdgpu_kernel void @kernel_dynamic_stackalloc_sgpr_align16(i32 %n) {
 ; GFX10-NEXT:    s_lshl2_add_u32 s5, s5, 15
 ; GFX10-NEXT:    s_and_b32 s5, s5, -16
 ; GFX10-NEXT:    s_lshl_b32 s5, s5, 5
-; GFX10-NEXT:    s_add_u32 s32, s4, s5
+; GFX10-NEXT:    s_add_i32 s32, s4, s5
 ; GFX10-NEXT:    s_endpgm
 ;
 ; GFX11-LABEL: kernel_dynamic_stackalloc_sgpr_align16:
@@ -197,7 +197,7 @@ define amdgpu_kernel void @kernel_dynamic_stackalloc_sgpr_align16(i32 %n) {
 ; GFX11-NEXT:    s_and_b32 s1, s1, -16
 ; GFX11-NEXT:    s_lshl_b32 s1, s1, 5
 ; GFX11-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
-; GFX11-NEXT:    s_add_u32 s32, s0, s1
+; GFX11-NEXT:    s_add_i32 s32, s0, s1
 ; GFX11-NEXT:    s_endpgm
   %alloca = alloca i32, i32 %n, align 16, addrspace(5)
   store i32 0, ptr addrspace(5) %alloca
@@ -225,7 +225,7 @@ define void @func_dynamic_stackalloc_sgpr_align16() {
 ; GFX9-NEXT:    s_lshl2_add_u32 s4, s4, 15
 ; GFX9-NEXT:    s_and_b32 s4, s4, -16
 ; GFX9-NEXT:    s_lshl_b32 s4, s4, 6
-; GFX9-NEXT:    s_add_u32 s32, s6, s4
+; GFX9-NEXT:    s_add_i32 s32, s6, s4
 ; GFX9-NEXT:    s_mov_b32 s32, s33
 ; GFX9-NEXT:    s_mov_b32 s33, s7
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
@@ -251,7 +251,7 @@ define void @func_dynamic_stackalloc_sgpr_align16() {
 ; GFX10-NEXT:    s_lshl2_add_u32 s4, s4, 15
 ; GFX10-NEXT:    s_and_b32 s4, s4, -16
 ; GFX10-NEXT:    s_lshl_b32 s4, s4, 5
-; GFX10-NEXT:    s_add_u32 s32, s6, s4
+; GFX10-NEXT:    s_add_i32 s32, s6, s4
 ; GFX10-NEXT:    s_mov_b32 s32, s33
 ; GFX10-NEXT:    s_mov_b32 s33, s7
 ; GFX10-NEXT:    s_setpc_b64 s[30:31]
@@ -277,7 +277,7 @@ define void @func_dynamic_stackalloc_sgpr_align16() {
 ; GFX11-NEXT:    s_and_b32 s0, s0, -16
 ; GFX11-NEXT:    s_lshl_b32 s0, s0, 5
 ; GFX11-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
-; GFX11-NEXT:    s_add_u32 s32, s2, s0
+; GFX11-NEXT:    s_add_i32 s32, s2, s0
 ; GFX11-NEXT:    s_mov_b32 s32, s33
 ; GFX11-NEXT:    s_mov_b32 s33, s3
 ; GFX11-NEXT:    s_setpc_b64 s[30:31]
@@ -294,7 +294,7 @@ define amdgpu_kernel void @kernel_dynamic_stackalloc_sgpr_align32(i32 %n) {
 ; GFX9-NEXT:    s_movk_i32 s32, 0x800
 ; GFX9-NEXT:    s_add_u32 s0, s0, s17
 ; GFX9-NEXT:    s_addc_u32 s1, s1, 0
-; GFX9-NEXT:    s_add_u32 s5, s32, 0x7ff
+; GFX9-NEXT:    s_add_i32 s5, s32, 0x7ff
 ; GFX9-NEXT:    s_waitcnt lgkmcnt(0)
 ; GFX9-NEXT:    s_lshl2_add_u32 s4, s4, 15
 ; GFX9-NEXT:    s_and_b32 s5, s5, 0xfffff800
@@ -303,7 +303,7 @@ define amdgpu_kernel void @kernel_dynamic_stackalloc_sgpr_align32(i32 %n) {
 ; GFX9-NEXT:    v_mov_b32_e32 v1, s5
 ; GFX9-NEXT:    s_lshl_b32 s4, s4, 6
 ; GFX9-NEXT:    s_mov_b32 s33, 0
-; GFX9-NEXT:    s_add_u32 s32, s5, s4
+; GFX9-NEXT:    s_add_i32 s32, s5, s4
 ; GFX9-NEXT:    buffer_store_dword v0, v1, s[0:3], 0 offen
 ; GFX9-NEXT:    s_endpgm
 ;
@@ -313,7 +313,7 @@ define amdgpu_kernel void @kernel_dynamic_stackalloc_sgpr_align32(i32 %n) {
 ; GFX10-NEXT:    s_movk_i32 s32, 0x400
 ; GFX10-NEXT:    s_add_u32 s0, s0, s17
 ; GFX10-NEXT:    s_addc_u32 s1, s1, 0
-; GFX10-NEXT:    s_add_u32 s5, s32, 0x3ff
+; GFX10-NEXT:    s_add_i32 s5, s32, 0x3ff
 ; GFX10-NEXT:    v_mov_b32_e32 v0, 0
 ; GFX10-NEXT:    s_and_b32 s5, s5, 0xfffffc00
 ; GFX10-NEXT:    s_mov_b32 s33, 0
@@ -323,7 +323,7 @@ define amdgpu_kernel void @kernel_dynamic_stackalloc_sgpr_align32(i32 %n) {
 ; GFX10-NEXT:    s_lshl2_add_u32 s4, s4, 15
 ; GFX10-NEXT:    s_and_b32 s4, s4, -16
 ; GFX10-NEXT:    s_lshl_b32 s4, s4, 5
-; GFX10-NEXT:    s_add_u32 s32, s5, s4
+; GFX10-NEXT:    s_add_i32 s32, s5, s4
 ; GFX10-NEXT:    s_endpgm
 ;
 ; GFX11-LABEL: kernel_dynamic_stackalloc_sgpr_align32:
@@ -331,7 +331,7 @@ define amdgpu_kernel void @kernel_dynamic_stackalloc_sgpr_align32(i32 %n) {
 ; GFX11-NEXT:    s_load_b32 s0, s[4:5], 0x0
 ; GFX11-NEXT:    s_mov_b32 s32, 32
 ; GFX11-NEXT:    v_mov_b32_e32 v0, 0
-; GFX11-NEXT:    s_add_u32 s1, s32, 0x3ff
+; GFX11-NEXT:    s_add_i32 s1, s32, 0x3ff
 ; GFX11-NEXT:    s_mov_b32 s33, 0
 ; GFX11-NEXT:    s_and_b32 s1, s1, 0xfffffc00
 ; GFX11-NEXT:    scratch_store_b32 off, v0, s1
@@ -341,7 +341,7 @@ define amdgpu_kernel void @kernel_dynamic_stackalloc_sgpr_align32(i32 %n) {
 ; GFX11-NEXT:    s_and_b32 s0, s0, -16
 ; GFX11-NEXT:    s_lshl_b32 s0, s0, 5
 ; GFX11-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
-; GFX11-NEXT:    s_add_u32 s32, s1, s0
+; GFX11-NEXT:    s_add_i32 s32, s1, s0
 ; GFX11-NEXT:    s_endpgm
   %alloca = alloca i32, i32 %n, align 32, addrspace(5)
   store i32 0, ptr addrspace(5) %alloca
@@ -366,7 +366,7 @@ define void @func_dynamic_stackalloc_sgpr_align32(ptr addrspace(1) %out) {
 ; GFX9-NEXT:    s_mov_b32 s33, s6
 ; GFX9-NEXT:    s_waitcnt lgkmcnt(0)
 ; GFX9-NEXT:    s_load_dword s4, s[4:5], 0x0
-; GFX9-NEXT:    s_add_u32 s5, s32, 0x7ff
+; GFX9-NEXT:    s_add_i32 s5, s32, 0x7ff
 ; GFX9-NEXT:    s_and_b32 s5, s5, 0xfffff800
 ; GFX9-NEXT:    v_mov_b32_e32 v1, s5
 ; GFX9-NEXT:    buffer_store_dword v0, v1, s[0:3], 0 offen
@@ -374,7 +374,7 @@ define void @func_dynamic_stackalloc_sgpr_align32(ptr addrspace(1) %out) {
 ; GFX9-NEXT:    s_lshl2_add_u32 s4, s4, 15
 ; GFX9-NEXT:    s_and_b32 s4, s4, -16
 ; GFX9-NEXT:    s_lshl_b32 s4, s4, 6
-; GFX9-NEXT:    s_add_u32 s32, s5, s4
+; GFX9-NEXT:    s_add_i32 s32, s5, s4
 ; GFX9-NEXT:    s_mov_b32 s32, s34
 ; GFX9-NEXT:    s_mov_b32 s34, s7
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
@@ -397,7 +397,7 @@ define void @func_dynamic_stackalloc_sgpr_align32(ptr addrspace(1) %out) {
 ; GFX10-NEXT:    s_mov_b32 s33, s6
 ; GFX10-NEXT:    s_waitcnt lgkmcnt(0)
 ; GFX10-NEXT:    s_load_dword s4, s[4:5], 0x0
-; GFX10-NEXT:    s_add_u32 s5, s32, 0x3ff
+; GFX10-NEXT:    s_add_i32 s5, s32, 0x3ff
 ; GFX10-NEXT:    s_and_b32 s5, s5, 0xfffffc00
 ; GFX10-NEXT:    v_mov_b32_e32 v1, s5
 ; GFX10-NEXT:    buffer_store_dword v0, v1, s[0:3], 0 offen
@@ -405,7 +405,7 @@ define void @func_dynamic_stackalloc_sgpr_align32(ptr addrspace(1) %out) {
 ; GFX10-NEXT:    s_lshl2_add_u32 s4, s4, 15
 ; GFX10-NEXT:    s_and_b32 s4, s4, -16
 ; GFX10-NEXT:    s_lshl_b32 s4, s4, 5
-; GFX10-NEXT:    s_add_u32 s32, s5, s4
+; GFX10-NEXT:    s_add_i32 s32, s5, s4
 ; GFX10-NEXT:    s_mov_b32 s32, s34
 ; GFX10-NEXT:    s_mov_b32 s34, s7
 ; GFX10-NEXT:    s_setpc_b64 s[30:31]
@@ -427,7 +427,7 @@ define void @func_dynamic_stackalloc_sgpr_align32(ptr addrspace(1) %out) {
 ; GFX11-NEXT:    s_mov_b32 s33, s2
 ; GFX11-NEXT:    s_waitcnt lgkmcnt(0)
 ; GFX11-NEXT:    s_load_b32 s0, s[0:1], 0x0
-; GFX11-NEXT:    s_add_u32 s1, s32, 0x3ff
+; GFX11-NEXT:    s_add_i32 s1, s32, 0x3ff
 ; GFX11-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_3) | instid1(SALU_CYCLE_1)
 ; GFX11-NEXT:    s_and_b32 s1, s1, 0xfffffc00
 ; GFX11-NEXT:    scratch_store_b32 off, v0, s1
@@ -436,7 +436,7 @@ define void @func_dynamic_stackalloc_sgpr_align32(ptr addrspace(1) %out) {
 ; GFX11-NEXT:    s_and_b32 s0, s0, -16
 ; GFX11-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(SALU_CYCLE_1)
 ; GFX11-NEXT:    s_lshl_b32 s0, s0, 5
-; GFX11-NEXT:    s_add_u32 s32, s1, s0
+; GFX11-NEXT:    s_add_i32 s32, s1, s0
 ; GFX11-NEXT:    s_mov_b32 s32, s34
 ; GFX11-NEXT:    s_mov_b32 s34, s3
 ; GFX11-NEXT:    s_setpc_b64 s[30:31]
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/flat-scratch.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/flat-scratch.ll
index 8a80afd4a768f..d1083588e8ac0 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/flat-scratch.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/flat-scratch.ll
@@ -855,7 +855,7 @@ define amdgpu_kernel void @store_load_vindex_small_offset_kernel(i32 %n) {
 ; GFX9-NEXT:    s_lshl_b32 s0, s0, 7
 ; GFX9-NEXT:    v_sub_u32_e32 v0, 0, v0
 ; GFX9-NEXT:    v_add_u32_e32 v1, 0x100, v1
-; GFX9-NEXT:    s_add_u32 s0, 0x100, s0
+; GFX9-NEXT:    s_addk_i32 s0, 0x100
 ; GFX9-NEXT:    v_mov_b32_e32 v2, 15
 ; GFX9-NEXT:    v_lshlrev_b32_e32 v0, 2, v0
 ; GFX9-NEXT:    scratch_store_dword v1, v2, off offset:128
@@ -883,7 +883,7 @@ define amdgpu_kernel void @store_load_vindex_small_offset_kernel(i32 %n) {
 ; GFX10-NEXT:    s_waitcnt_vscnt null, 0x0
 ; GFX10-NEXT:    s_waitcnt lgkmcnt(0)
 ; GFX10-NEXT:    s_lshl_b32 s0, s0, 7
-; GFX10-NEXT:    s_add_u32 s0, 0x100, s0
+; GFX10-NEXT:    s_addk_i32 s0, 0x100
 ; GFX10-NEXT:    v_add_nc_u32_e32 v1, s0, v1
 ; GFX10-NEXT:    scratch_load_dword v0, v1, off offset:124 glc dlc
 ; GFX10-NEXT:    s_waitcnt vmcnt(0)
@@ -899,7 +899,7 @@ define amdgpu_kernel void @store_load_vindex_small_offset_kernel(i32 %n) {
 ; GFX942-NEXT:    v_sub_u32_e32 v0, 0, v0
 ; GFX942-NEXT:    s_waitcnt lgkmcnt(0)
 ; GFX942-NEXT:    s_lshl_b32 s0, s0, 7
-; GFX942-NEXT:    s_add_u32 s0, 0x100, s0
+; GFX942-NEXT:    s_addk_i32 s0, 0x100
 ; GFX942-NEX...
[truncated]

@ritter-x2a ritter-x2a added the llvm:SelectionDAG SelectionDAGISel as well label May 28, 2025 — with Graphite App
@ritter-x2a ritter-x2a marked this pull request as ready for review May 28, 2025 08:13

arsenm commented May 28, 2025

64-bit address arithmetic uses the [sv]_add_u64 pseudos, which is consistent
with SDAG and means that GISel now generates 64-bit adds for gfx12.

GlobalISel should not select the pseudos. We should directly implement the legality rules for 64-bit pointer arithmetic.


arsenm commented May 28, 2025

now is sreg_64 instead of sreg_64_xexec. I'm not sure if that loses us something worth
preserving; I haven't found an obvious way to avoid this.

I think the only plus is it removes one useless entry from the allocation order, since exec is always reserved and will never be used for the assignment

@@ -10419,6 +10426,11 @@ SDValue SITargetLowering::LowerINTRINSIC_VOID(SDValue Op,
}
}

bool SITargetLowering::shouldPreservePtrArith(const Function &F,
EVT PtrVT) const {
return UseSelectionDAGPTRADD;
Contributor

Since the only case you care about is the 64-bit flat case, you could split this and only handle 64-bit pointers initially

Member Author

Done.

Comment on lines 1383 to 1386
def : GCNPat<
(DivergentBinFrag<ptradd> i64:$src0, i64:$src1),
(V_ADD_U64_PSEUDO $src0, $src1)>;

Contributor

Maybe should set GISelShouldIgnore on the 64-bit cases, the pseudos are mostly a SelectionDAG hack

Member Author

Done.
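For reference, the suggested change amounts to roughly the following sketch (hedged: it reuses the PR's own 64-bit pattern, and GISelShouldIgnore is the knob that tells the GlobalISel pattern importer to skip a SelectionDAG pattern; the exact placement in SIInstructions.td is an assumption):

```tablegen
// Sketch: keep the GlobalISel pattern importer from picking up the
// 64-bit PTRADD pattern, since V_ADD_U64_PSEUDO is mostly a
// SelectionDAG convenience.
let GISelShouldIgnore = 1 in
def : GCNPat<
  (DivergentBinFrag<ptradd> i64:$src0, i64:$src1),
  (V_ADD_U64_PSEUDO $src0, $src1)>;
```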

Comment on lines 1403 to 1408
// Whether we select S_ADD_I32 or S_ADD_U32 does not make much of a
// difference. Most notably, S_ADD_I32 instructions can be transformed
// to S_ADDK_I32, so we select that.
def : GCNPat<
(UniformBinFrag<ptradd> i32:$src0, i32:$src1),
(S_ADD_I32 $src0, $src1)>;
Contributor

Is there now dead manual select code?

Member Author

I don't think so; GlobalISel lowered that in AMDGPUInstructionSelector::selectG_ADD_SUB, with the same code that lowers normal ADDs.

Member Author

There certainly isn't with the current version of the PR, with 32-bit handling in general and 64-bit handling for GlobalISel disabled.

// the SelectionDAG.
static cl::opt<bool> UseSelectionDAGPTRADD(
"amdgpu-use-sdag-ptradd", cl::Hidden,
cl::desc("Generate ISD::PTRADD nodes for 64-bit pointer arithmetic in the "
Contributor

why only 32-bit pointer?

4 participants