X86: Add X86TTIImpl::isProfitableToSinkOperands hook for immediate operands. #141326

Open
wants to merge 2 commits into
base: users/pcc/spr/main.x86-add-x86targetloweringisprofitabletohoist-hook-for-immediate-operands

Conversation

pcc
Contributor

@pcc pcc commented May 24, 2025

Casts taking a constant expression (generally derived from a global
variable address) as an operand are not profitable to sink because they
appear as subexpressions in the instruction sequence generated by the
LowerTypeTests pass, which is expected to pattern match to the rotate
instruction's immediate operand.

Created using spr 1.3.6-beta.1
@llvmbot
Member

llvmbot commented May 24, 2025

@llvm/pr-subscribers-backend-x86

@llvm/pr-subscribers-llvm-transforms

Author: Peter Collingbourne (pcc)

Changes

Casts taking a constant expression (generally derived from a global
variable address) as an operand are not profitable to CSE because they
appear as subexpressions in the instruction sequence generated by the
LowerTypeTests pass which is expected to pattern match to the rotate
instruction. These casts are expected to be code generated into the
instruction's immediate operand, so there is no benefit to hoisting
them anyway.


Full diff: https://github.com/llvm/llvm-project/pull/141326.diff

4 Files Affected:

  • (modified) llvm/lib/Target/X86/X86ISelLowering.cpp (+13)
  • (modified) llvm/lib/Target/X86/X86ISelLowering.h (+2)
  • (modified) llvm/lib/Transforms/Scalar/GVN.cpp (+2-2)
  • (added) llvm/test/Other/inhibit-zext-constant-hoist.ll (+179)
diff --git a/llvm/lib/Target/X86/X86ISelLowering.cpp b/llvm/lib/Target/X86/X86ISelLowering.cpp
index 6b4b4beb97ca5..ec252bcd9577a 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.cpp
+++ b/llvm/lib/Target/X86/X86ISelLowering.cpp
@@ -62119,3 +62119,16 @@ Align X86TargetLowering::getPrefLoopAlignment(MachineLoop *ML) const {
     return Align(1ULL << ExperimentalPrefInnermostLoopAlignment);
   return TargetLowering::getPrefLoopAlignment();
 }
+
+bool X86TargetLowering::isProfitableToHoist(Instruction *I) const {
+  // Casts taking a constant expression (generally derived from a global
+  // variable address) as an operand are not profitable to CSE because they
+  // appear as subexpressions in the instruction sequence generated by the
+  // LowerTypeTests pass which is expected to pattern match to the rotate
+  // instruction. These casts are expected to be code generated into the
+  // instruction's immediate operand, so there is no benefit to hoisting them
+  // anyway.
+  if (isa<CastInst>(I) && isa<ConstantExpr>(I->getOperand(0)))
+    return false;
+  return true;
+}
diff --git a/llvm/lib/Target/X86/X86ISelLowering.h b/llvm/lib/Target/X86/X86ISelLowering.h
index 662552a972249..302121e3ced5a 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.h
+++ b/llvm/lib/Target/X86/X86ISelLowering.h
@@ -1669,6 +1669,8 @@ namespace llvm {
 
     Align getPrefLoopAlignment(MachineLoop *ML) const override;
 
+    bool isProfitableToHoist(Instruction *I) const override;
+
     EVT getTypeToTransformTo(LLVMContext &Context, EVT VT) const override {
       if (VT == MVT::f80)
         return EVT::getIntegerVT(Context, 96);
diff --git a/llvm/lib/Transforms/Scalar/GVN.cpp b/llvm/lib/Transforms/Scalar/GVN.cpp
index 0cb38bcfff2d4..1a72e3b48105d 100644
--- a/llvm/lib/Transforms/Scalar/GVN.cpp
+++ b/llvm/lib/Transforms/Scalar/GVN.cpp
@@ -2724,7 +2724,7 @@ bool GVNPass::processInstruction(Instruction *I) {
   // If this GVN would effectively hoist the instruction and that's known to be
   // unprofitable, don't do it. Remember the instruction so that it can be used
   // by other instructions in the same block.
-  if (auto *ReplI = dyn_cast<Instruction>(I)) {
+  if (auto *ReplI = dyn_cast<Instruction>(Repl)) {
     if (ReplI->getParent() != I->getParent() && !TTI->isProfitableToHoist(I)) {
       LeaderTable.insert(Num, I, I->getParent());
       return false;
@@ -2891,7 +2891,7 @@ bool GVNPass::performScalarPRE(Instruction *CurInst) {
   if (isa<AllocaInst>(CurInst) || CurInst->isTerminator() ||
       isa<PHINode>(CurInst) || CurInst->getType()->isVoidTy() ||
       CurInst->mayReadFromMemory() || CurInst->mayHaveSideEffects() ||
-      isa<DbgInfoIntrinsic>(CurInst))
+      isa<DbgInfoIntrinsic>(CurInst) || !TTI->isProfitableToHoist(CurInst))
     return false;
 
   // Don't do PRE on compares. The PHI would prevent CodeGenPrepare from
diff --git a/llvm/test/Other/inhibit-zext-constant-hoist.ll b/llvm/test/Other/inhibit-zext-constant-hoist.ll
new file mode 100644
index 0000000000000..eaad0b8a04e81
--- /dev/null
+++ b/llvm/test/Other/inhibit-zext-constant-hoist.ll
@@ -0,0 +1,179 @@
+; REQUIRES: x86-registered-target
+; Make sure that optimizations do not hoist zext(const).
+
+; This IR is normally generated by LowerTypeTests during ThinLTO importing
+; so it will go through the ThinLTO pass pipeline.
+; RUN: opt -passes='thinlto<O0>' -S < %s | FileCheck %s
+; RUN: opt -passes='thinlto<O1>' -S < %s | FileCheck %s
+; RUN: opt -passes='thinlto<O2>' -S < %s | FileCheck %s
+; RUN: opt -passes='thinlto<O3>' -S < %s | FileCheck %s
+
+; Also check the regular pipelines for completeness.
+; RUN: opt -O0 -S < %s | FileCheck %s
+; RUN: opt -O1 -S < %s | FileCheck %s
+; RUN: opt -O2 -S < %s | FileCheck %s
+; RUN: opt -O3 -S < %s | FileCheck %s
+; RUN: opt -Os -S < %s | FileCheck %s
+; RUN: opt -Oz -S < %s | FileCheck %s
+
+target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128"
+target triple = "x86_64-unknown-linux-gnu"
+
+@__typeid__ZTS1S_global_addr = external hidden global [0 x i8], code_model "small"
+@__typeid__ZTS1S_align = external hidden global [0 x i8], !absolute_symbol !0
+@__typeid__ZTS1S_size_m1 = external hidden global [0 x i8], !absolute_symbol !1
+
+; Check that we still have two pairs of zexts (non dominating case).
+
+; CHECK: define void @f1
+define void @f1(i1 noundef zeroext %0, ptr noundef %1, ptr noundef %2) {
+  br i1 %0, label %4, label %17
+
+4:
+  %5 = load ptr, ptr %1, align 8
+  %6 = ptrtoint ptr %5 to i64
+  %7 = sub i64 %6, ptrtoint (ptr @__typeid__ZTS1S_global_addr to i64)
+  ; CHECK: zext {{.*}} @__typeid__ZTS1S_align
+  %8 = zext i8 ptrtoint (ptr @__typeid__ZTS1S_align to i8) to i64
+  %9 = lshr i64 %7, %8
+  ; CHECK: zext {{.*}} @__typeid__ZTS1S_align
+  %10 = zext i8 sub (i8 64, i8 ptrtoint (ptr @__typeid__ZTS1S_align to i8)) to i64
+  %11 = shl i64 %7, %10
+  %12 = or i64 %9, %11
+  %13 = icmp ule i64 %12, ptrtoint (ptr @__typeid__ZTS1S_size_m1 to i64)
+  br i1 %13, label %15, label %14
+
+14:
+  call void @llvm.ubsantrap(i8 2)
+  unreachable
+
+15:
+  %16 = load ptr, ptr %5, align 8
+  call void %16(ptr noundef nonnull align 8 dereferenceable(8) %1)
+  br label %30
+
+17:
+  %18 = load ptr, ptr %2, align 8
+  %19 = ptrtoint ptr %18 to i64
+  %20 = sub i64 %19, ptrtoint (ptr @__typeid__ZTS1S_global_addr to i64)
+  ; CHECK: zext {{.*}} @__typeid__ZTS1S_align
+  %21 = zext i8 ptrtoint (ptr @__typeid__ZTS1S_align to i8) to i64
+  %22 = lshr i64 %20, %21
+  ; CHECK: zext {{.*}} @__typeid__ZTS1S_align
+  %23 = zext i8 sub (i8 64, i8 ptrtoint (ptr @__typeid__ZTS1S_align to i8)) to i64
+  %24 = shl i64 %20, %23
+  %25 = or i64 %22, %24
+  %26 = icmp ule i64 %25, ptrtoint (ptr @__typeid__ZTS1S_size_m1 to i64)
+  br i1 %26, label %28, label %27
+
+27:
+  call void @llvm.ubsantrap(i8 2) #6
+  unreachable
+
+28:
+  %29 = load ptr, ptr %18, align 8
+  call void %29(ptr noundef nonnull align 8 dereferenceable(8) %2)
+  br label %30
+
+30:
+  ret void
+}
+
+; Check that we still have two pairs of zexts (dominating case).
+
+; CHECK: define void @f2
+define void @f2(i1 noundef zeroext %0, ptr noundef %1, ptr noundef %2) {
+  %4 = load ptr, ptr %1, align 8
+  %5 = ptrtoint ptr %4 to i64
+  %6 = sub i64 %5, ptrtoint (ptr @__typeid__ZTS1S_global_addr to i64)
+  ; CHECK: zext {{.*}} @__typeid__ZTS1S_align
+  %7 = zext i8 ptrtoint (ptr @__typeid__ZTS1S_align to i8) to i64
+  %8 = lshr i64 %6, %7
+  ; CHECK: zext {{.*}} @__typeid__ZTS1S_align
+  %9 = zext i8 sub (i8 64, i8 ptrtoint (ptr @__typeid__ZTS1S_align to i8)) to i64
+  %10 = shl i64 %6, %9
+  %11 = or i64 %8, %10
+  %12 = icmp ule i64 %11, ptrtoint (ptr @__typeid__ZTS1S_size_m1 to i64)
+  br i1 %12, label %14, label %13
+
+13:                                               ; preds = %3
+  call void @llvm.ubsantrap(i8 2)
+  unreachable
+
+14:                                               ; preds = %3
+  %15 = load ptr, ptr %4, align 8
+  call void %15(ptr noundef nonnull align 8 dereferenceable(8) %1)
+  br i1 %0, label %16, label %29
+
+16:                                               ; preds = %14
+  %17 = load ptr, ptr %2, align 8
+  %18 = ptrtoint ptr %17 to i64
+  %19 = sub i64 %18, ptrtoint (ptr @__typeid__ZTS1S_global_addr to i64)
+  ; CHECK: zext {{.*}} @__typeid__ZTS1S_align
+  %20 = zext i8 ptrtoint (ptr @__typeid__ZTS1S_align to i8) to i64
+  %21 = lshr i64 %19, %20
+  ; CHECK: zext {{.*}} @__typeid__ZTS1S_align
+  %22 = zext i8 sub (i8 64, i8 ptrtoint (ptr @__typeid__ZTS1S_align to i8)) to i64
+  %23 = shl i64 %19, %22
+  %24 = or i64 %21, %23
+  %25 = icmp ule i64 %24, ptrtoint (ptr @__typeid__ZTS1S_size_m1 to i64)
+  br i1 %25, label %27, label %26
+
+26:                                               ; preds = %16
+  call void @llvm.ubsantrap(i8 2) #6
+  unreachable
+
+27:                                               ; preds = %16
+  %28 = load ptr, ptr %17, align 8
+  call void %28(ptr noundef nonnull align 8 dereferenceable(8) %2)
+  br label %29
+
+29:                                               ; preds = %27, %14
+  ret void
+}
+
+; Check that the zexts aren't moved to the preheader (or anywhere else)
+; and stay in the same basic block.
+
+; CHECK: define void @f3
+define void @f3(ptr noundef readonly captures(address) %0, ptr noundef readnone captures(address) %1) {
+  %3 = icmp eq ptr %0, %1
+  br i1 %3, label %21, label %4
+
+4:
+  ; CHECK: = phi
+  %5 = phi ptr [ %19, %17 ], [ %0, %2 ]
+  %6 = load ptr, ptr %5, align 8
+  %7 = load ptr, ptr %6, align 8
+  %8 = ptrtoint ptr %7 to i64
+  %9 = sub i64 %8, ptrtoint (ptr @__typeid__ZTS1S_global_addr to i64)
+  ; CHECK: zext {{.*}} @__typeid__ZTS1S_align
+  %10 = zext i8 ptrtoint (ptr @__typeid__ZTS1S_align to i8) to i64
+  %11 = lshr i64 %9, %10
+  ; CHECK: zext {{.*}} @__typeid__ZTS1S_align
+  %12 = zext i8 sub (i8 64, i8 ptrtoint (ptr @__typeid__ZTS1S_align to i8)) to i64
+  %13 = shl i64 %9, %12
+  %14 = or i64 %11, %13
+  %15 = icmp ule i64 %14, ptrtoint (ptr @__typeid__ZTS1S_size_m1 to i64)
+  br i1 %15, label %17, label %16
+
+16:
+  call void @llvm.ubsantrap(i8 2)
+  unreachable
+
+17:
+  %18 = load ptr, ptr %7, align 8
+  call void %18(ptr noundef nonnull align 8 dereferenceable(8) %6)
+  %19 = getelementptr inbounds nuw i8, ptr %5, i64 8
+  %20 = icmp eq ptr %19, %1
+  br i1 %20, label %21, label %4
+
+21:
+  ret void
+}
+
+declare i1 @llvm.type.test(ptr, metadata)
+declare void @llvm.ubsantrap(i8 immarg)
+
+!0 = !{i64 0, i64 256}
+!1 = !{i64 0, i64 128}

Contributor

@nikic nikic left a comment

I'm not super clear on what it is you're trying to fix here.

Probably LowerTypeTests should be directly emitting a fshl/fshr instead of a bit shift sequence to make matching to rotate more reliable. Would that help you or not?

@pcc
Contributor Author

pcc commented May 27, 2025

I'm fixing the code generation for the test cases that I'm adding (inhibit-zext-constant-hoist.ll), which were all extracted from a build of a large internal program built with CFI. Previously, f1 looked like this, with the zext of the align constant hoisted:

f1:                                     # @f1
	.cfi_startproc
# %bb.0:
	movl	$__typeid__ZTS1S_align, %eax
	movzbl	%al, %ecx
	movb	$64, %al
	subb	%cl, %al
	movzbl	%al, %eax
	testl	%edi, %edi
	je	.LBB0_3
# %bb.1:
	movq	(%rsi), %r8
	movl	$__typeid__ZTS1S_global_addr, %edx
	movq	%r8, %rdi
	subq	%rdx, %rdi
	movq	%rdi, %rdx
                                        # kill: def $cl killed $cl killed $rcx
	shrq	%cl, %rdx
	movl	%eax, %ecx
	shlq	%cl, %rdi
	orq	%rdx, %rdi
	cmpq	$__typeid__ZTS1S_size_m1@ABS8, %rdi
	jbe	.LBB0_4
.LBB0_2:
	ud1l	2(%eax), %eax
.LBB0_3:
	movq	(%rdx), %r8
	movl	$__typeid__ZTS1S_global_addr, %esi
	movq	%r8, %rdi
	subq	%rsi, %rdi
	movq	%rdi, %rsi
                                        # kill: def $cl killed $cl killed $rcx
	shrq	%cl, %rsi
	movl	%eax, %ecx
	shlq	%cl, %rdi
	orq	%rsi, %rdi
	cmpq	$__typeid__ZTS1S_size_m1@ABS8, %rdi
	movq	%rdx, %rsi
	ja	.LBB0_2
.LBB0_4:
	movq	%rsi, %rdi
	jmpq	*(%r8)                          # TAILCALL
.Lfunc_end0:
	.size	f1, .Lfunc_end0-f1
	.cfi_endproc

Now f1 looks like this:

f1:                                     # @f1
	.cfi_startproc
# %bb.0:
	testl	%edi, %edi
	je	.LBB0_3
# %bb.1:
	movq	(%rsi), %rax
	movl	$__typeid__ZTS1S_global_addr, %ecx
	movq	%rax, %rdx
	subq	%rcx, %rdx
	rorq	$__typeid__ZTS1S_align, %rdx
	cmpq	$__typeid__ZTS1S_size_m1@ABS8, %rdx
	jbe	.LBB0_4
.LBB0_2:
	ud1l	2(%eax), %eax
.LBB0_3:
	movq	(%rdx), %rax
	movl	$__typeid__ZTS1S_global_addr, %ecx
	movq	%rax, %rsi
	subq	%rcx, %rsi
	rorq	$__typeid__ZTS1S_align, %rsi
	cmpq	$__typeid__ZTS1S_size_m1@ABS8, %rsi
	movq	%rdx, %rsi
	ja	.LBB0_2
.LBB0_4:
	movq	%rsi, %rdi
	jmpq	*(%rax)                         # TAILCALL
.Lfunc_end0:
	.size	f1, .Lfunc_end0-f1
	.cfi_endproc

The other test cases looked similar before my change.

The poor codegen issue was introduced (I believe) by #71040, which removed the zext ConstantExprs that LowerTypeTests was using to keep everything in the same basic block for matching.

I think that, because it's the zext being hoisted and not the shifts, it wouldn't make a difference to use fshl/fshr, but I can check.

@nikic
Contributor

nikic commented May 27, 2025

@pcc I think using fsh should at least help to get a ror instead of the shr+shl+or.

Actually getting the value duplicated+sunk into each block is typically done in CGP, which has a bunch of related transforms. The most generic is probably tryToSinkFreeOperands driven by TTI.isProfitableToSinkOperands.

Created using spr 1.3.6-beta.1
@pcc pcc changed the title from "X86: Add X86TargetLowering::isProfitableToHoist hook for immediate operands." to "X86: Add X86TTIImpl::isProfitableToSinkOperands hook for immediate operands." on May 27, 2025
@pcc
Contributor Author

pcc commented May 27, 2025

@pcc I think using fsh should at least help to get a ror instead of the shr+shl+or.

Actually getting the value duplicated+sunk into each block is typically done in CGP, which has a bunch of related transforms. The most generic is probably tryToSinkFreeOperands driven by TTI.isProfitableToSinkOperands.

Thanks for the suggestion, that works a lot better. Updated the PR.
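
For context, a minimal sketch of what such a TTI-driven check could look like is below. This is illustrative only: the second commit's diff is not shown in this conversation, the exact signature and operand handling are assumptions, and in-tree the override would sit with the other X86TTIImpl hooks in llvm/lib/Target/X86/X86TargetTransformInfo.cpp.

	// Illustrative sketch, not the PR's exact code. Assumes the TTI hook
	// signature consumed by CodeGenPrepare's tryToSinkFreeOperands.
	bool X86TTIImpl::isProfitableToSinkOperands(Instruction *I,
	                                            SmallVectorImpl<Use *> &Ops) const {
	  // LowerTypeTests emits shifts whose amount is a zext of a constant
	  // expression (a ptrtoint of a global). Sinking that zext next to each
	  // shift keeps the pattern in one block so isel can fold it into a
	  // rotate immediate.
	  if (I->isShift()) {
	    Use &ShiftAmount = I->getOperandUse(1);
	    if (auto *ZExt = dyn_cast<ZExtInst>(ShiftAmount.get()))
	      if (isa<ConstantExpr>(ZExt->getOperand(0))) {
	        Ops.push_back(&ShiftAmount);
	        return true;
	      }
	  }
	  return false;
	}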

@pcc
Contributor Author

pcc commented May 28, 2025

This PR doesn't entirely fix the problem. Another case that I'm seeing (and which is more effectively prevented by the previous approach) is where GVN PRE moves the zext behind a phi, and after subsequent optimization passes it turns into a phi pointing to other phis which ultimately point to identical (in the isIdenticalTo sense) zext instructions. We could address this in CGP by trying harder to undo what GVN did: we can check that all instructions directly/indirectly referenced by a phi are identical and then replace the usages of the phi with a clone of one of the instructions.
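
A rough sketch of that idea (illustrative only; the real change is in #141716 and its details may differ, and the helper name here is made up):

	#include "llvm/ADT/STLExtras.h"
	#include "llvm/ADT/SmallPtrSet.h"
	#include "llvm/ADT/SmallVector.h"
	#include "llvm/IR/Constant.h"
	#include "llvm/IR/Instructions.h"

	using namespace llvm;

	// Walk a phi (and any phis feeding it). If every non-phi incoming value
	// is the same instruction in the isIdenticalTo sense and its operands
	// are constants (as in the zext(ptrtoint ...) pattern above), replace
	// non-phi uses of the phi with a fresh clone next to each user,
	// effectively undoing the PRE.
	static bool replacePhiOfIdenticalInstructions(PHINode *PN) {
	  SmallVector<PHINode *, 4> Worklist{PN};
	  SmallPtrSet<PHINode *, 4> Visited;
	  Instruction *Leader = nullptr;

	  while (!Worklist.empty()) {
	    PHINode *Cur = Worklist.pop_back_val();
	    if (!Visited.insert(Cur).second)
	      continue;
	    for (Value *Incoming : Cur->incoming_values()) {
	      if (auto *IncomingPhi = dyn_cast<PHINode>(Incoming)) {
	        Worklist.push_back(IncomingPhi);
	        continue;
	      }
	      auto *I = dyn_cast<Instruction>(Incoming);
	      if (!I || (Leader && !Leader->isIdenticalTo(I)))
	        return false;
	      Leader = I;
	    }
	  }
	  // Only safe to re-materialize anywhere if the instruction is
	  // side-effect free and all of its operands are constants.
	  if (!Leader || Leader->mayHaveSideEffects() ||
	      !all_of(Leader->operands(), [](Value *V) { return isa<Constant>(V); }))
	    return false;

	  bool Changed = false;
	  for (Use &U : make_early_inc_range(PN->uses())) {
	    auto *UserInst = dyn_cast<Instruction>(U.getUser());
	    // Uses in other phis would need the clone placed in the incoming
	    // block; keep the sketch simple and leave those alone.
	    if (!UserInst || isa<PHINode>(UserInst))
	      continue;
	    Instruction *Clone = Leader->clone();
	    Clone->insertBefore(UserInst);
	    U.set(Clone);
	    Changed = true;
	  }
	  return Changed;
	}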

@pcc
Contributor Author

pcc commented May 28, 2025

This PR doesn't entirely fix the problem. Another case that I'm seeing (and which is more effectively prevented by the previous approach) is where GVN PRE moves the zext behind a phi, and after subsequent optimization passes it turns into a phi pointing to other phis which ultimately point to identical (in the isIdenticalTo sense) zext instructions. We could address this in CGP by trying harder to undo what GVN did: we can check that all instructions directly/indirectly referenced by a phi are identical and then replace the usages of the phi with a clone of one of the instructions.

Done in #141716

Ops.push_back(ShiftAmount);
return true;
}
}
Contributor

This check needs to be more specific. Even the zext ptrtoint (ptr @g to i8) pattern would not become a relocatable ror immediate under normal circumstances. The fact that the global has !absolute_symbol with an appropriate range is load-bearing here. (Looking at selectRelocImm, it does specifically check for getAbsoluteSymbolRange.)
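
A rough sketch of the more specific check being asked for (illustrative; the helper name is made up and the final PR code may differ): only treat the zext as immediate material when the underlying global carries !absolute_symbol metadata with a bounded range, which is what selectRelocImm keys off via getAbsoluteSymbolRange. The sub (i8 64, ptrtoint ...) form in the test would additionally need to look through the outer constant expression, which this sketch does not do.

	#include "llvm/IR/ConstantRange.h"
	#include "llvm/IR/Constants.h"
	#include "llvm/IR/GlobalValue.h"
	#include "llvm/IR/Instructions.h"
	#include <optional>

	using namespace llvm;

	// Hypothetical helper: true only for zext(ptrtoint @g) where @g is
	// declared with !absolute_symbol and a bounded range, i.e. the cases
	// that can actually become a relocatable rotate immediate.
	static bool isAbsoluteSymbolZExt(const ZExtInst *ZExt) {
	  auto *CE = dyn_cast<ConstantExpr>(ZExt->getOperand(0));
	  if (!CE || CE->getOpcode() != Instruction::PtrToInt)
	    return false;
	  auto *GV = dyn_cast<GlobalValue>(CE->getOperand(0)->stripPointerCasts());
	  if (!GV)
	    return false;
	  std::optional<ConstantRange> Range = GV->getAbsoluteSymbolRange();
	  return Range && !Range->isFullSet();
	}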
