X86: Add X86TTIImpl::isProfitableToSinkOperands hook for immediate operands. #141326

Open
wants to merge 2 commits into
base: users/pcc/spr/main.x86-add-x86targetloweringisprofitabletohoist-hook-for-immediate-operands

Conversation

pcc
Contributor

@pcc pcc commented May 24, 2025

Casts taking a constant expression (generally derived from a global
variable address) as an operand are not profitable to sink because they
appear as subexpressions in the instruction sequence generated by the
LowerTypeTests pass, which is expected to pattern match to the rotate
instruction's immediate operand.

Created using spr 1.3.6-beta.1
@llvmbot
Member

llvmbot commented May 24, 2025

@llvm/pr-subscribers-backend-x86

@llvm/pr-subscribers-llvm-transforms

Author: Peter Collingbourne (pcc)

Changes

Casts taking a constant expression (generally derived from a global
variable address) as an operand are not profitable to CSE because they
appear as subexpressions in the instruction sequence generated by the
LowerTypeTests pass which is expected to pattern match to the rotate
instruction. These casts are expected to be code generated into the
instruction's immediate operand, so there is no benefit to hoisting
them anyway.


Full diff: https://github.com/llvm/llvm-project/pull/141326.diff

4 Files Affected:

  • (modified) llvm/lib/Target/X86/X86ISelLowering.cpp (+13)
  • (modified) llvm/lib/Target/X86/X86ISelLowering.h (+2)
  • (modified) llvm/lib/Transforms/Scalar/GVN.cpp (+2-2)
  • (added) llvm/test/Other/inhibit-zext-constant-hoist.ll (+179)
diff --git a/llvm/lib/Target/X86/X86ISelLowering.cpp b/llvm/lib/Target/X86/X86ISelLowering.cpp
index 6b4b4beb97ca5..ec252bcd9577a 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.cpp
+++ b/llvm/lib/Target/X86/X86ISelLowering.cpp
@@ -62119,3 +62119,16 @@ Align X86TargetLowering::getPrefLoopAlignment(MachineLoop *ML) const {
     return Align(1ULL << ExperimentalPrefInnermostLoopAlignment);
   return TargetLowering::getPrefLoopAlignment();
 }
+
+bool X86TargetLowering::isProfitableToHoist(Instruction *I) const {
+  // Casts taking a constant expression (generally derived from a global
+  // variable address) as an operand are not profitable to CSE because they
+  // appear as subexpressions in the instruction sequence generated by the
+  // LowerTypeTests pass which is expected to pattern match to the rotate
+  // instruction. These casts are expected to be code generated into the
+  // instruction's immediate operand, so there is no benefit to hoisting them
+  // anyway.
+  if (isa<CastInst>(I) && isa<ConstantExpr>(I->getOperand(0)))
+    return false;
+  return true;
+}
diff --git a/llvm/lib/Target/X86/X86ISelLowering.h b/llvm/lib/Target/X86/X86ISelLowering.h
index 662552a972249..302121e3ced5a 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.h
+++ b/llvm/lib/Target/X86/X86ISelLowering.h
@@ -1669,6 +1669,8 @@ namespace llvm {
 
     Align getPrefLoopAlignment(MachineLoop *ML) const override;
 
+    bool isProfitableToHoist(Instruction *I) const override;
+
     EVT getTypeToTransformTo(LLVMContext &Context, EVT VT) const override {
       if (VT == MVT::f80)
         return EVT::getIntegerVT(Context, 96);
diff --git a/llvm/lib/Transforms/Scalar/GVN.cpp b/llvm/lib/Transforms/Scalar/GVN.cpp
index 0cb38bcfff2d4..1a72e3b48105d 100644
--- a/llvm/lib/Transforms/Scalar/GVN.cpp
+++ b/llvm/lib/Transforms/Scalar/GVN.cpp
@@ -2724,7 +2724,7 @@ bool GVNPass::processInstruction(Instruction *I) {
   // If this GVN would effectively hoist the instruction and that's known to be
   // unprofitable, don't do it. Remember the instruction so that it can be used
   // by other instructions in the same block.
-  if (auto *ReplI = dyn_cast<Instruction>(I)) {
+  if (auto *ReplI = dyn_cast<Instruction>(Repl)) {
     if (ReplI->getParent() != I->getParent() && !TTI->isProfitableToHoist(I)) {
       LeaderTable.insert(Num, I, I->getParent());
       return false;
@@ -2891,7 +2891,7 @@ bool GVNPass::performScalarPRE(Instruction *CurInst) {
   if (isa<AllocaInst>(CurInst) || CurInst->isTerminator() ||
       isa<PHINode>(CurInst) || CurInst->getType()->isVoidTy() ||
       CurInst->mayReadFromMemory() || CurInst->mayHaveSideEffects() ||
-      isa<DbgInfoIntrinsic>(CurInst))
+      isa<DbgInfoIntrinsic>(CurInst) || !TTI->isProfitableToHoist(CurInst))
     return false;
 
   // Don't do PRE on compares. The PHI would prevent CodeGenPrepare from
diff --git a/llvm/test/Other/inhibit-zext-constant-hoist.ll b/llvm/test/Other/inhibit-zext-constant-hoist.ll
new file mode 100644
index 0000000000000..eaad0b8a04e81
--- /dev/null
+++ b/llvm/test/Other/inhibit-zext-constant-hoist.ll
@@ -0,0 +1,179 @@
+; REQUIRES: x86-registered-target
+; Make sure that optimizations do not hoist zext(const).
+
+; This IR is normally generated by LowerTypeTests during ThinLTO importing
+; so it will go through the ThinLTO pass pipeline.
+; RUN: opt -passes='thinlto<O0>' -S < %s | FileCheck %s
+; RUN: opt -passes='thinlto<O1>' -S < %s | FileCheck %s
+; RUN: opt -passes='thinlto<O2>' -S < %s | FileCheck %s
+; RUN: opt -passes='thinlto<O3>' -S < %s | FileCheck %s
+
+; Also check the regular pipelines for completeness.
+; RUN: opt -O0 -S < %s | FileCheck %s
+; RUN: opt -O1 -S < %s | FileCheck %s
+; RUN: opt -O2 -S < %s | FileCheck %s
+; RUN: opt -O3 -S < %s | FileCheck %s
+; RUN: opt -Os -S < %s | FileCheck %s
+; RUN: opt -Oz -S < %s | FileCheck %s
+
+target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128"
+target triple = "x86_64-unknown-linux-gnu"
+
+@__typeid__ZTS1S_global_addr = external hidden global [0 x i8], code_model "small"
+@__typeid__ZTS1S_align = external hidden global [0 x i8], !absolute_symbol !0
+@__typeid__ZTS1S_size_m1 = external hidden global [0 x i8], !absolute_symbol !1
+
+; Check that we still have two pairs of zexts (non dominating case).
+
+; CHECK: define void @f1
+define void @f1(i1 noundef zeroext %0, ptr noundef %1, ptr noundef %2) {
+  br i1 %0, label %4, label %17
+
+4:
+  %5 = load ptr, ptr %1, align 8
+  %6 = ptrtoint ptr %5 to i64
+  %7 = sub i64 %6, ptrtoint (ptr @__typeid__ZTS1S_global_addr to i64)
+  ; CHECK: zext {{.*}} @__typeid__ZTS1S_align
+  %8 = zext i8 ptrtoint (ptr @__typeid__ZTS1S_align to i8) to i64
+  %9 = lshr i64 %7, %8
+  ; CHECK: zext {{.*}} @__typeid__ZTS1S_align
+  %10 = zext i8 sub (i8 64, i8 ptrtoint (ptr @__typeid__ZTS1S_align to i8)) to i64
+  %11 = shl i64 %7, %10
+  %12 = or i64 %9, %11
+  %13 = icmp ule i64 %12, ptrtoint (ptr @__typeid__ZTS1S_size_m1 to i64)
+  br i1 %13, label %15, label %14
+
+14:
+  call void @llvm.ubsantrap(i8 2)
+  unreachable
+
+15:
+  %16 = load ptr, ptr %5, align 8
+  call void %16(ptr noundef nonnull align 8 dereferenceable(8) %1)
+  br label %30
+
+17:
+  %18 = load ptr, ptr %2, align 8
+  %19 = ptrtoint ptr %18 to i64
+  %20 = sub i64 %19, ptrtoint (ptr @__typeid__ZTS1S_global_addr to i64)
+  ; CHECK: zext {{.*}} @__typeid__ZTS1S_align
+  %21 = zext i8 ptrtoint (ptr @__typeid__ZTS1S_align to i8) to i64
+  %22 = lshr i64 %20, %21
+  ; CHECK: zext {{.*}} @__typeid__ZTS1S_align
+  %23 = zext i8 sub (i8 64, i8 ptrtoint (ptr @__typeid__ZTS1S_align to i8)) to i64
+  %24 = shl i64 %20, %23
+  %25 = or i64 %22, %24
+  %26 = icmp ule i64 %25, ptrtoint (ptr @__typeid__ZTS1S_size_m1 to i64)
+  br i1 %26, label %28, label %27
+
+27:
+  call void @llvm.ubsantrap(i8 2) #6
+  unreachable
+
+28:
+  %29 = load ptr, ptr %18, align 8
+  call void %29(ptr noundef nonnull align 8 dereferenceable(8) %2)
+  br label %30
+
+30:
+  ret void
+}
+
+; Check that we still have two pairs of zexts (dominating case).
+
+; CHECK: define void @f2
+define void @f2(i1 noundef zeroext %0, ptr noundef %1, ptr noundef %2) {
+  %4 = load ptr, ptr %1, align 8
+  %5 = ptrtoint ptr %4 to i64
+  %6 = sub i64 %5, ptrtoint (ptr @__typeid__ZTS1S_global_addr to i64)
+  ; CHECK: zext {{.*}} @__typeid__ZTS1S_align
+  %7 = zext i8 ptrtoint (ptr @__typeid__ZTS1S_align to i8) to i64
+  %8 = lshr i64 %6, %7
+  ; CHECK: zext {{.*}} @__typeid__ZTS1S_align
+  %9 = zext i8 sub (i8 64, i8 ptrtoint (ptr @__typeid__ZTS1S_align to i8)) to i64
+  %10 = shl i64 %6, %9
+  %11 = or i64 %8, %10
+  %12 = icmp ule i64 %11, ptrtoint (ptr @__typeid__ZTS1S_size_m1 to i64)
+  br i1 %12, label %14, label %13
+
+13:                                               ; preds = %3
+  call void @llvm.ubsantrap(i8 2)
+  unreachable
+
+14:                                               ; preds = %3
+  %15 = load ptr, ptr %4, align 8
+  call void %15(ptr noundef nonnull align 8 dereferenceable(8) %1)
+  br i1 %0, label %16, label %29
+
+16:                                               ; preds = %14
+  %17 = load ptr, ptr %2, align 8
+  %18 = ptrtoint ptr %17 to i64
+  %19 = sub i64 %18, ptrtoint (ptr @__typeid__ZTS1S_global_addr to i64)
+  ; CHECK: zext {{.*}} @__typeid__ZTS1S_align
+  %20 = zext i8 ptrtoint (ptr @__typeid__ZTS1S_align to i8) to i64
+  %21 = lshr i64 %19, %20
+  ; CHECK: zext {{.*}} @__typeid__ZTS1S_align
+  %22 = zext i8 sub (i8 64, i8 ptrtoint (ptr @__typeid__ZTS1S_align to i8)) to i64
+  %23 = shl i64 %19, %22
+  %24 = or i64 %21, %23
+  %25 = icmp ule i64 %24, ptrtoint (ptr @__typeid__ZTS1S_size_m1 to i64)
+  br i1 %25, label %27, label %26
+
+26:                                               ; preds = %16
+  call void @llvm.ubsantrap(i8 2) #6
+  unreachable
+
+27:                                               ; preds = %16
+  %28 = load ptr, ptr %17, align 8
+  call void %28(ptr noundef nonnull align 8 dereferenceable(8) %2)
+  br label %29
+
+29:                                               ; preds = %27, %14
+  ret void
+}
+
+; Check that the zexts aren't moved to the preheader (or anywhere else)
+; and stay in the same basic block.
+
+; CHECK: define void @f3
+define void @f3(ptr noundef readonly captures(address) %0, ptr noundef readnone captures(address) %1) {
+  %3 = icmp eq ptr %0, %1
+  br i1 %3, label %21, label %4
+
+4:
+  ; CHECK: = phi
+  %5 = phi ptr [ %19, %17 ], [ %0, %2 ]
+  %6 = load ptr, ptr %5, align 8
+  %7 = load ptr, ptr %6, align 8
+  %8 = ptrtoint ptr %7 to i64
+  %9 = sub i64 %8, ptrtoint (ptr @__typeid__ZTS1S_global_addr to i64)
+  ; CHECK: zext {{.*}} @__typeid__ZTS1S_align
+  %10 = zext i8 ptrtoint (ptr @__typeid__ZTS1S_align to i8) to i64
+  %11 = lshr i64 %9, %10
+  ; CHECK: zext {{.*}} @__typeid__ZTS1S_align
+  %12 = zext i8 sub (i8 64, i8 ptrtoint (ptr @__typeid__ZTS1S_align to i8)) to i64
+  %13 = shl i64 %9, %12
+  %14 = or i64 %11, %13
+  %15 = icmp ule i64 %14, ptrtoint (ptr @__typeid__ZTS1S_size_m1 to i64)
+  br i1 %15, label %17, label %16
+
+16:
+  call void @llvm.ubsantrap(i8 2)
+  unreachable
+
+17:
+  %18 = load ptr, ptr %7, align 8
+  call void %18(ptr noundef nonnull align 8 dereferenceable(8) %6)
+  %19 = getelementptr inbounds nuw i8, ptr %5, i64 8
+  %20 = icmp eq ptr %19, %1
+  br i1 %20, label %21, label %4
+
+21:
+  ret void
+}
+
+declare i1 @llvm.type.test(ptr, metadata)
+declare void @llvm.ubsantrap(i8 immarg)
+
+!0 = !{i64 0, i64 256}
+!1 = !{i64 0, i64 128}

Contributor

@nikic nikic left a comment

I'm not super clear on what it is you're trying to fix here.

Probably LowerTypeTests should be directly emitting a fshl/fshr instead of a bit shift sequence to make matching to rotate more reliable. Would that help you or not?

@pcc
Contributor Author

pcc commented May 27, 2025

I'm fixing the code generation for the test cases that I'm adding (inhibit-zext-constant-hoist.ll), which were all extracted from a build of a large internal program built with CFI. Previously, f1 looked like this, with the zext of the align constant hoisted:

f1:                                     # @f1
	.cfi_startproc
# %bb.0:
	movl	$__typeid__ZTS1S_align, %eax
	movzbl	%al, %ecx
	movb	$64, %al
	subb	%cl, %al
	movzbl	%al, %eax
	testl	%edi, %edi
	je	.LBB0_3
# %bb.1:
	movq	(%rsi), %r8
	movl	$__typeid__ZTS1S_global_addr, %edx
	movq	%r8, %rdi
	subq	%rdx, %rdi
	movq	%rdi, %rdx
                                        # kill: def $cl killed $cl killed $rcx
	shrq	%cl, %rdx
	movl	%eax, %ecx
	shlq	%cl, %rdi
	orq	%rdx, %rdi
	cmpq	$__typeid__ZTS1S_size_m1@ABS8, %rdi
	jbe	.LBB0_4
.LBB0_2:
	ud1l	2(%eax), %eax
.LBB0_3:
	movq	(%rdx), %r8
	movl	$__typeid__ZTS1S_global_addr, %esi
	movq	%r8, %rdi
	subq	%rsi, %rdi
	movq	%rdi, %rsi
                                        # kill: def $cl killed $cl killed $rcx
	shrq	%cl, %rsi
	movl	%eax, %ecx
	shlq	%cl, %rdi
	orq	%rsi, %rdi
	cmpq	$__typeid__ZTS1S_size_m1@ABS8, %rdi
	movq	%rdx, %rsi
	ja	.LBB0_2
.LBB0_4:
	movq	%rsi, %rdi
	jmpq	*(%r8)                          # TAILCALL
.Lfunc_end0:
	.size	f1, .Lfunc_end0-f1
	.cfi_endproc

Now f1 looks like this:

f1:                                     # @f1
	.cfi_startproc
# %bb.0:
	testl	%edi, %edi
	je	.LBB0_3
# %bb.1:
	movq	(%rsi), %rax
	movl	$__typeid__ZTS1S_global_addr, %ecx
	movq	%rax, %rdx
	subq	%rcx, %rdx
	rorq	$__typeid__ZTS1S_align, %rdx
	cmpq	$__typeid__ZTS1S_size_m1@ABS8, %rdx
	jbe	.LBB0_4
.LBB0_2:
	ud1l	2(%eax), %eax
.LBB0_3:
	movq	(%rdx), %rax
	movl	$__typeid__ZTS1S_global_addr, %ecx
	movq	%rax, %rsi
	subq	%rcx, %rsi
	rorq	$__typeid__ZTS1S_align, %rsi
	cmpq	$__typeid__ZTS1S_size_m1@ABS8, %rsi
	movq	%rdx, %rsi
	ja	.LBB0_2
.LBB0_4:
	movq	%rsi, %rdi
	jmpq	*(%rax)                         # TAILCALL
.Lfunc_end0:
	.size	f1, .Lfunc_end0-f1
	.cfi_endproc

The other test cases looked similar before my change.

The poor codegen issue was introduced (I believe) by #71040, which removed the zext ConstantExprs that LowerTypeTests was using to keep everything in the same basic block for matching.

I think that, because it's the zext being hoisted and not the shifts, it wouldn't make a difference to use fshl/fshr, but I can check.

@nikic
Contributor

nikic commented May 27, 2025

@pcc I think using fsh should at least help to get a ror instead of the shr+shl+or.

Actually getting the value duplicated+sunk into each block is typically done in CGP, which has a bunch of related transforms. The most generic is probably tryToSinkFreeOperands driven by TTI.isProfitableToSinkOperands.

Created using spr 1.3.6-beta.1
@pcc pcc changed the title from "X86: Add X86TargetLowering::isProfitableToHoist hook for immediate operands." to "X86: Add X86TTIImpl::isProfitableToSinkOperands hook for immediate operands." on May 27, 2025
@pcc
Contributor Author

pcc commented May 27, 2025

@pcc I think using fsh should at least help to get a ror instead of the shr+shl+or.

Actually getting the value duplicated+sunk into each block is typically done in CGP, which has a bunch of related transforms. The most generic is probably tryToSinkFreeOperands driven by TTI.isProfitableToSinkOperands.

Thanks for the suggestion, that works a lot better. Updated the PR.
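
For context, a minimal sketch of what such a TTI-driven check could look like is below. This is illustrative only: the second commit's diff is not shown in this conversation, the exact signature and operand handling are assumptions, and in-tree the override would sit with the other X86TTIImpl hooks in llvm/lib/Target/X86/X86TargetTransformInfo.cpp.

	// Illustrative sketch, not the PR's exact code. Assumes the TTI hook
	// signature consumed by CodeGenPrepare's tryToSinkFreeOperands.
	bool X86TTIImpl::isProfitableToSinkOperands(Instruction *I,
	                                            SmallVectorImpl<Use *> &Ops) const {
	  // LowerTypeTests emits shifts whose amount is a zext of a constant
	  // expression (a ptrtoint of a global). Sinking that zext next to each
	  // shift keeps the pattern in one block so isel can fold it into a
	  // rotate immediate.
	  if (I->isShift()) {
	    Use &ShiftAmount = I->getOperandUse(1);
	    if (auto *ZExt = dyn_cast<ZExtInst>(ShiftAmount.get()))
	      if (isa<ConstantExpr>(ZExt->getOperand(0))) {
	        Ops.push_back(&ShiftAmount);
	        return true;
	      }
	  }
	  return false;
	}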

@pcc
Contributor Author

pcc commented May 28, 2025

This PR doesn't entirely fix the problem. Another case that I'm seeing (and which is more effectively prevented by the previous approach) is where GVN PRE moves the zext behind a phi, and after subsequent optimization passes it turns into a phi pointing to other phis which ultimately point to identical (in the isIdenticalTo sense) zext instructions. We could address this in CGP by trying harder to undo what GVN did: we can check that all instructions directly/indirectly referenced by a phi are identical and then replace the usages of the phi with a clone of one of the instructions.
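
A rough sketch of that idea (illustrative only; the real change is in #141716 and its details may differ, and the helper name here is made up):

	#include "llvm/ADT/STLExtras.h"
	#include "llvm/ADT/SmallPtrSet.h"
	#include "llvm/ADT/SmallVector.h"
	#include "llvm/IR/Constant.h"
	#include "llvm/IR/Instructions.h"

	using namespace llvm;

	// Walk a phi (and any phis feeding it). If every non-phi incoming value
	// is the same instruction in the isIdenticalTo sense and its operands
	// are constants (as in the zext(ptrtoint ...) pattern above), replace
	// non-phi uses of the phi with a fresh clone next to each user,
	// effectively undoing the PRE.
	static bool replacePhiOfIdenticalInstructions(PHINode *PN) {
	  SmallVector<PHINode *, 4> Worklist{PN};
	  SmallPtrSet<PHINode *, 4> Visited;
	  Instruction *Leader = nullptr;

	  while (!Worklist.empty()) {
	    PHINode *Cur = Worklist.pop_back_val();
	    if (!Visited.insert(Cur).second)
	      continue;
	    for (Value *Incoming : Cur->incoming_values()) {
	      if (auto *IncomingPhi = dyn_cast<PHINode>(Incoming)) {
	        Worklist.push_back(IncomingPhi);
	        continue;
	      }
	      auto *I = dyn_cast<Instruction>(Incoming);
	      if (!I || (Leader && !Leader->isIdenticalTo(I)))
	        return false;
	      Leader = I;
	    }
	  }
	  // Only safe to re-materialize anywhere if the instruction is
	  // side-effect free and all of its operands are constants.
	  if (!Leader || Leader->mayHaveSideEffects() ||
	      !all_of(Leader->operands(), [](Value *V) { return isa<Constant>(V); }))
	    return false;

	  bool Changed = false;
	  for (Use &U : make_early_inc_range(PN->uses())) {
	    auto *UserInst = dyn_cast<Instruction>(U.getUser());
	    // Uses in other phis would need the clone placed in the incoming
	    // block; keep the sketch simple and leave those alone.
	    if (!UserInst || isa<PHINode>(UserInst))
	      continue;
	    Instruction *Clone = Leader->clone();
	    Clone->insertBefore(UserInst);
	    U.set(Clone);
	    Changed = true;
	  }
	  return Changed;
	}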

@pcc
Contributor Author

pcc commented May 28, 2025

This PR doesn't entirely fix the problem. Another case that I'm seeing (and which is more effectively prevented by the previous approach) is where GVN PRE moves the zext behind a phi, and after subsequent optimization passes it turns into a phi pointing to other phis which ultimately point to identical (in the isIdenticalTo sense) zext instructions. We could address this in CGP by trying harder to undo what GVN did: we can check that all instructions directly/indirectly referenced by a phi are identical and then replace the usages of the phi with a clone of one of the instructions.

Done in #141716

Ops.push_back(ShiftAmount);
return true;
}
}
Contributor

This check needs to be more specific. Even the zext ptrtoint (ptr @g to i8) pattern would not become a relocatable ror immediate under normal circumstances. The fact that the global has !absolute_symbol with an appropriate range is load-bearing here. (Looking at selectRelocImm, it does specifically check for getAbsoluteSymbolRange.)
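
A rough sketch of the more specific check being asked for (illustrative; the helper name is made up and the final PR code may differ): only treat the zext as immediate material when the underlying global carries !absolute_symbol metadata with a bounded range, which is what selectRelocImm keys off via getAbsoluteSymbolRange. The sub (i8 64, ptrtoint ...) form in the test would additionally need to look through the outer constant expression, which this sketch does not do.

	#include "llvm/IR/ConstantRange.h"
	#include "llvm/IR/Constants.h"
	#include "llvm/IR/GlobalValue.h"
	#include "llvm/IR/Instructions.h"
	#include <optional>

	using namespace llvm;

	// Hypothetical helper: true only for zext(ptrtoint @g) where @g is
	// declared with !absolute_symbol and a bounded range, i.e. the cases
	// that can actually become a relocatable rotate immediate.
	static bool isAbsoluteSymbolZExt(const ZExtInst *ZExt) {
	  auto *CE = dyn_cast<ConstantExpr>(ZExt->getOperand(0));
	  if (!CE || CE->getOpcode() != Instruction::PtrToInt)
	    return false;
	  auto *GV = dyn_cast<GlobalValue>(CE->getOperand(0)->stripPointerCasts());
	  if (!GV)
	    return false;
	  std::optional<ConstantRange> Range = GV->getAbsoluteSymbolRange();
	  return Range && !Range->isFullSet();
	}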
