[VectorCombine] Fold chain of (scalar load)->ext->ext to load->ext. #141109

Open: wants to merge 3 commits into base: main

fhahn (Contributor) commented May 22, 2025

Add a new combine that folds a chain of (scalar load)->ext->ext (with
shuffles/casts/inserts in between) to a single vector load and wide
extend.

This makes the IR simpler to analyze and process, while the backend
can still decide to split the wide operations up again. Such chains
typically come from code written with vector intrinsics. Some
real-world examples are in
https://github.com/ARM-software/astc-encoder/.
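
To make the claimed equivalence concrete, here is a minimal scalar model of the fold in C++. It is a sketch under stated assumptions, not code from the patch: it assumes a little-endian target (as in the AArch64 tests), models only the four live lanes of the zext/zext case, and the names `Bytes`, `Lanes`, `chained`, `folded` and `checkFoldModel` are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Four bytes as they sit in memory at the load address (hypothetical model).
using Bytes = std::array<std::uint8_t, 4>;
using Lanes = std::array<std::uint32_t, 4>;

// Original chain: scalar i32 load -> insert at lane 0 -> bitcast to i8 lanes
// -> zext to i16 -> shuffle keeping the low lanes -> zext to i32.
Lanes chained(const Bytes &mem) {
  // Scalar load on a little-endian target: byte 0 is least significant.
  std::uint32_t l = 0;
  for (int i = 0; i < 4; ++i)
    l |= std::uint32_t{mem[i]} << (8 * i);

  Lanes out{};
  for (int i = 0; i < 4; ++i) {
    auto lane = static_cast<std::uint8_t>(l >> (8 * i)); // bitcast lane i
    auto inner = static_cast<std::uint16_t>(lane);       // inner zext
    out[i] = static_cast<std::uint32_t>(inner);          // outer zext
  }
  return out;
}

// Folded form: one <4 x i8> load feeding a single zext to <4 x i32>.
Lanes folded(const Bytes &mem) {
  Lanes out{};
  for (int i = 0; i < 4; ++i)
    out[i] = mem[i]; // zext i8 -> i32
  return out;
}

// Small self-check comparing both forms on a few byte patterns.
void checkFoldModel() {
  const Bytes samples[] = {Bytes{0x01, 0x80, 0xff, 0x7f},
                           Bytes{0x00, 0x00, 0x00, 0x00},
                           Bytes{0xff, 0xff, 0xff, 0xff}};
  for (const Bytes &s : samples)
    assert(chained(s) == folded(s));
}
```

In this model both functions return the source bytes unchanged in lanes 0 through 3, which is the property the combine relies on when it swaps the scalar load for a vector load.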

Alive2 proof for preserving NNeg: https://alive2.llvm.org/ce/z/HyfyPo
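
The patch refuses to fold an inner sign-extend followed by an outer zero-extend, and otherwise emits a single extend of the inner kind. The following C++ sketch checks that rule exhaustively for i8 -> i16 -> i32. It is an illustration only, not part of the patch, and `Ext`, `ext8to16`, `ext16to32`, `ext8to32`, `foldsToInnerKind` and `checkExtendPairs` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

enum class Ext { Zero, Sign };

// Extend an 8-bit value to 16 bits, replicating the sign bit for Ext::Sign.
std::uint16_t ext8to16(std::uint8_t v, Ext k) {
  bool negative = (v & 0x80u) != 0;
  return (k == Ext::Sign && negative) ? static_cast<std::uint16_t>(0xff00u | v)
                                      : static_cast<std::uint16_t>(v);
}

// Extend a 16-bit value to 32 bits.
std::uint32_t ext16to32(std::uint16_t v, Ext k) {
  bool negative = (v & 0x8000u) != 0;
  return (k == Ext::Sign && negative) ? (0xffff0000u | v)
                                      : static_cast<std::uint32_t>(v);
}

// Extend an 8-bit value directly to 32 bits.
std::uint32_t ext8to32(std::uint8_t v, Ext k) {
  bool negative = (v & 0x80u) != 0;
  return (k == Ext::Sign && negative) ? (0xffffff00u | v)
                                      : static_cast<std::uint32_t>(v);
}

// True when outer(inner(v)) equals one direct extend of the inner kind
// for every possible 8-bit input.
bool foldsToInnerKind(Ext inner, Ext outer) {
  for (unsigned v = 0; v < 256; ++v) {
    auto b = static_cast<std::uint8_t>(v);
    if (ext16to32(ext8to16(b, inner), outer) != ext8to32(b, inner))
      return false;
  }
  return true;
}

// Self-check over all four inner/outer combinations.
void checkExtendPairs() {
  assert(foldsToInnerKind(Ext::Zero, Ext::Zero));
  assert(foldsToInnerKind(Ext::Zero, Ext::Sign)); // top bit is already 0
  assert(foldsToInnerKind(Ext::Sign, Ext::Sign));
  assert(!foldsToInnerKind(Ext::Sign, Ext::Zero)); // e.g. 0x80 -> 0x0000ff80
}
```

The sext-then-zext pair fails because the intermediate sign bits are kept while the upper half is zeroed, so no single extend reproduces the result.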

llvmbot (Member) commented May 22, 2025

@llvm/pr-subscribers-llvm-transforms

Author: Florian Hahn (fhahn)

Changes

Add a new combine that folds a chain of (scalar load)->ext->ext (with
shuffles/casts/inserts in between) to a single vector load and wide
extend.

This makes the IR simpler to analyze and to process, while the backend
can still decide to break them up. Code like that comes from code
written with vector intrinsics. Some examples of real-world use are in
https://github.com/ARM-software/astc-encoder/.


Patch is 26.44 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/141109.diff

2 Files Affected:

  • (modified) llvm/lib/Transforms/Vectorize/VectorCombine.cpp (+51)
  • (added) llvm/test/Transforms/VectorCombine/AArch64/combine-shuffle-ext.ll (+472)
diff --git a/llvm/lib/Transforms/Vectorize/VectorCombine.cpp b/llvm/lib/Transforms/Vectorize/VectorCombine.cpp
index fe1d930f295ce..3de60adcd4b2f 100644
--- a/llvm/lib/Transforms/Vectorize/VectorCombine.cpp
+++ b/llvm/lib/Transforms/Vectorize/VectorCombine.cpp
@@ -127,6 +127,7 @@ class VectorCombine {
   bool foldShuffleOfShuffles(Instruction &I);
   bool foldShuffleOfIntrinsics(Instruction &I);
   bool foldShuffleToIdentity(Instruction &I);
+  bool foldShuffleExtExtracts(Instruction &I);
   bool foldShuffleFromReductions(Instruction &I);
   bool foldCastFromReductions(Instruction &I);
   bool foldSelectShuffle(Instruction &I, bool FromReduction = false);
@@ -2777,6 +2778,55 @@ bool VectorCombine::foldShuffleToIdentity(Instruction &I) {
   return true;
 }
 
+bool VectorCombine::foldShuffleExtExtracts(Instruction &I) {
+  // Try to fold vector zero- and sign-extends split across multiple operations
+  // into a single extend, removing redundant inserts and shuffles.
+
+  // Check if we have an extended shuffle that selects the first vector, which
+  // itself is another extend fed by a load.
+  Instruction *L;
+  if (!match(
+          &I,
+          m_OneUse(m_Shuffle(
+              m_OneUse(m_ZExtOrSExt(m_OneUse(m_BitCast(m_OneUse(m_InsertElt(
+                  m_Value(), m_OneUse(m_Instruction(L)), m_SpecificInt(0))))))),
+              m_Value()))) ||
+      !cast<ShuffleVectorInst>(&I)->isIdentityWithExtract() ||
+      !isa<LoadInst>(L))
+    return false;
+  auto *InnerExt = cast<Instruction>(I.getOperand(0));
+  auto *OuterExt = dyn_cast<Instruction>(*I.user_begin());
+  if (!isa<SExtInst, ZExtInst>(OuterExt))
+    return false;
+
+  // If the inner extend is a sign extend and the outer one isn't (i.e. a
+  // zero-extend), don't fold. If the first one is zero-extend, it doesn't
+  // matter if the second one is a sign- or zero-extend.
+  if (isa<SExtInst>(InnerExt) && !isa<SExtInst>(OuterExt))
+    return false;
+
+  // Don't try to convert the load if it has an odd size.
+  if (!DL->typeSizeEqualsStoreSize(L->getType()))
+    return false;
+  auto *DstTy = cast<FixedVectorType>(OuterExt->getType());
+  auto *SrcTy =
+      FixedVectorType::get(InnerExt->getOperand(0)->getType()->getScalarType(),
+                           DstTy->getNumElements());
+  if (DL->getTypeStoreSize(SrcTy) != DL->getTypeStoreSize(L->getType()))
+    return false;
+
+  // Convert to a vector load feeding a single wide extend.
+  Builder.SetInsertPoint(*L->getInsertionPointAfterDef());
+  auto *NewLoad = cast<LoadInst>(
+      Builder.CreateLoad(SrcTy, L->getOperand(0), L->getName() + ".vec"));
+  auto *NewExt = isa<ZExtInst>(InnerExt) ? Builder.CreateZExt(NewLoad, DstTy)
+                                         : Builder.CreateSExt(NewLoad, DstTy);
+  OuterExt->replaceAllUsesWith(NewExt);
+  replaceValue(*OuterExt, *NewExt);
+  Worklist.pushValue(NewLoad);
+  return true;
+}
+
 /// Given a commutative reduction, the order of the input lanes does not alter
 /// the results. We can use this to remove certain shuffles feeding the
 /// reduction, removing the need to shuffle at all.
@@ -3551,6 +3601,7 @@ bool VectorCombine::run() {
         break;
       case Instruction::ShuffleVector:
         MadeChange |= widenSubvectorLoad(I);
+        MadeChange |= foldShuffleExtExtracts(I);
         break;
       default:
         break;
diff --git a/llvm/test/Transforms/VectorCombine/AArch64/combine-shuffle-ext.ll b/llvm/test/Transforms/VectorCombine/AArch64/combine-shuffle-ext.ll
new file mode 100644
index 0000000000000..6dca21471a0e9
--- /dev/null
+++ b/llvm/test/Transforms/VectorCombine/AArch64/combine-shuffle-ext.ll
@@ -0,0 +1,472 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
+; RUN: opt -p vector-combine -mtriple=arm64-apple-macosx -S %s | FileCheck %s
+
+declare void @use.i32(i32)
+declare void @use.v2i32(<2 x i32>)
+declare void @use.v8i8(<8 x i8>)
+declare void @use.v8i16(<8 x i16>)
+declare void @use.v4i16(<4 x i16>)
+
+define <4 x i32> @load_i32_zext_to_v4i32(ptr %di) {
+; CHECK-LABEL: define <4 x i32> @load_i32_zext_to_v4i32(
+; CHECK-SAME: ptr [[DI:%.*]]) {
+; CHECK-NEXT:  [[ENTRY:.*:]]
+; CHECK-NEXT:    [[L_VEC:%.*]] = load <4 x i8>, ptr [[DI]], align 4
+; CHECK-NEXT:    [[EXT_2:%.*]] = zext <4 x i8> [[L_VEC]] to <4 x i32>
+; CHECK-NEXT:    ret <4 x i32> [[EXT_2]]
+;
+entry:
+  %l = load i32, ptr %di
+  %vec.ins = insertelement <2 x i32> <i32 poison, i32 0>, i32 %l, i64 0
+  %vec.bc = bitcast <2 x i32> %vec.ins to <8 x i8>
+  %e.1 = zext <8 x i8> %vec.bc to <8 x i16>
+  %vec.shuffle = shufflevector <8 x i16> %e.1, <8 x i16> poison, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
+  %ext.2  = zext nneg <4 x i16> %vec.shuffle to <4 x i32>
+  ret <4 x i32> %ext.2
+}
+
+define <4 x i32> @load_i32_zext_to_v4i32_both_nneg(ptr %di) {
+; CHECK-LABEL: define <4 x i32> @load_i32_zext_to_v4i32_both_nneg(
+; CHECK-SAME: ptr [[DI:%.*]]) {
+; CHECK-NEXT:  [[ENTRY:.*:]]
+; CHECK-NEXT:    [[L_VEC:%.*]] = load <4 x i8>, ptr [[DI]], align 4
+; CHECK-NEXT:    [[EXT_2:%.*]] = zext <4 x i8> [[L_VEC]] to <4 x i32>
+; CHECK-NEXT:    ret <4 x i32> [[EXT_2]]
+;
+entry:
+  %l = load i32, ptr %di
+  %vec.ins = insertelement <2 x i32> <i32 poison, i32 0>, i32 %l, i64 0
+  %vec.bc = bitcast <2 x i32> %vec.ins to <8 x i8>
+  %e.1 = zext nneg <8 x i8> %vec.bc to <8 x i16>
+  %vec.shuffle = shufflevector <8 x i16> %e.1, <8 x i16> poison, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
+  %ext.2  = zext nneg <4 x i16> %vec.shuffle to <4 x i32>
+  ret <4 x i32> %ext.2
+}
+
+define <4 x i32> @load_i32_zext_to_v4i32_clobber_after_load(ptr %di) {
+; CHECK-LABEL: define <4 x i32> @load_i32_zext_to_v4i32_clobber_after_load(
+; CHECK-SAME: ptr [[DI:%.*]]) {
+; CHECK-NEXT:  [[ENTRY:.*:]]
+; CHECK-NEXT:    [[L_VEC:%.*]] = load <4 x i8>, ptr [[DI]], align 4
+; CHECK-NEXT:    [[EXT_2:%.*]] = zext <4 x i8> [[L_VEC]] to <4 x i32>
+; CHECK-NEXT:    call void @use.i32(i32 0)
+; CHECK-NEXT:    ret <4 x i32> [[EXT_2]]
+;
+entry:
+  %l = load i32, ptr %di
+  call void @use.i32(i32 0)
+  %vec.ins = insertelement <2 x i32> <i32 poison, i32 0>, i32 %l, i64 0
+  %vec.bc = bitcast <2 x i32> %vec.ins to <8 x i8>
+  %e.1 = zext <8 x i8> %vec.bc to <8 x i16>
+  %vec.shuffle = shufflevector <8 x i16> %e.1, <8 x i16> poison, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
+  %ext.2  = zext nneg <4 x i16> %vec.shuffle to <4 x i32>
+  ret <4 x i32> %ext.2
+}
+
+define <4 x i32> @load_i32_sext_zext_to_v4i32(ptr %di) {
+; CHECK-LABEL: define <4 x i32> @load_i32_sext_zext_to_v4i32(
+; CHECK-SAME: ptr [[DI:%.*]]) {
+; CHECK-NEXT:  [[ENTRY:.*:]]
+; CHECK-NEXT:    [[L:%.*]] = load i32, ptr [[DI]], align 4
+; CHECK-NEXT:    [[VEC_INS:%.*]] = insertelement <2 x i32> <i32 poison, i32 0>, i32 [[L]], i64 0
+; CHECK-NEXT:    [[VEC_BC:%.*]] = bitcast <2 x i32> [[VEC_INS]] to <8 x i8>
+; CHECK-NEXT:    [[E_1:%.*]] = sext <8 x i8> [[VEC_BC]] to <8 x i16>
+; CHECK-NEXT:    [[VEC_SHUFFLE:%.*]] = shufflevector <8 x i16> [[E_1]], <8 x i16> poison, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
+; CHECK-NEXT:    [[EXT_2:%.*]] = zext <4 x i16> [[VEC_SHUFFLE]] to <4 x i32>
+; CHECK-NEXT:    ret <4 x i32> [[EXT_2]]
+;
+entry:
+  %l = load i32, ptr %di
+  %vec.ins = insertelement <2 x i32> <i32 poison, i32 0>, i32 %l, i64 0
+  %vec.bc = bitcast <2 x i32> %vec.ins to <8 x i8>
+  %e.1 = sext <8 x i8> %vec.bc to <8 x i16>
+  %vec.shuffle = shufflevector <8 x i16> %e.1, <8 x i16> poison, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
+  %ext.2  = zext <4 x i16> %vec.shuffle to <4 x i32>
+  ret <4 x i32> %ext.2
+}
+
+define <4 x i32> @load_i32_zext_to_v4i32_load_other_users(ptr %di) {
+; CHECK-LABEL: define <4 x i32> @load_i32_zext_to_v4i32_load_other_users(
+; CHECK-SAME: ptr [[DI:%.*]]) {
+; CHECK-NEXT:  [[ENTRY:.*:]]
+; CHECK-NEXT:    [[L:%.*]] = load i32, ptr [[DI]], align 4
+; CHECK-NEXT:    [[VEC_INS:%.*]] = insertelement <2 x i32> <i32 poison, i32 0>, i32 [[L]], i64 0
+; CHECK-NEXT:    [[VEC_BC:%.*]] = bitcast <2 x i32> [[VEC_INS]] to <8 x i8>
+; CHECK-NEXT:    [[E_1:%.*]] = zext <8 x i8> [[VEC_BC]] to <8 x i16>
+; CHECK-NEXT:    [[VEC_SHUFFLE:%.*]] = shufflevector <8 x i16> [[E_1]], <8 x i16> poison, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
+; CHECK-NEXT:    [[EXT_2:%.*]] = zext nneg <4 x i16> [[VEC_SHUFFLE]] to <4 x i32>
+; CHECK-NEXT:    call void @use.i32(i32 [[L]])
+; CHECK-NEXT:    ret <4 x i32> [[EXT_2]]
+;
+entry:
+  %l = load i32, ptr %di
+  %vec.ins = insertelement <2 x i32> <i32 poison, i32 0>, i32 %l, i64 0
+  %vec.bc = bitcast <2 x i32> %vec.ins to <8 x i8>
+  %e.1 = zext <8 x i8> %vec.bc to <8 x i16>
+  %vec.shuffle = shufflevector <8 x i16> %e.1, <8 x i16> poison, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
+  %ext.2  = zext nneg <4 x i16> %vec.shuffle to <4 x i32>
+  call void @use.i32(i32 %l)
+  ret <4 x i32> %ext.2
+}
+
+define <4 x i32> @load_i32_zext_to_v4i32_ins_other_users(ptr %di) {
+; CHECK-LABEL: define <4 x i32> @load_i32_zext_to_v4i32_ins_other_users(
+; CHECK-SAME: ptr [[DI:%.*]]) {
+; CHECK-NEXT:  [[ENTRY:.*:]]
+; CHECK-NEXT:    [[L:%.*]] = load i32, ptr [[DI]], align 4
+; CHECK-NEXT:    [[VEC_INS:%.*]] = insertelement <2 x i32> <i32 poison, i32 0>, i32 [[L]], i64 0
+; CHECK-NEXT:    [[VEC_BC:%.*]] = bitcast <2 x i32> [[VEC_INS]] to <8 x i8>
+; CHECK-NEXT:    [[E_1:%.*]] = zext <8 x i8> [[VEC_BC]] to <8 x i16>
+; CHECK-NEXT:    [[VEC_SHUFFLE:%.*]] = shufflevector <8 x i16> [[E_1]], <8 x i16> poison, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
+; CHECK-NEXT:    [[EXT_2:%.*]] = zext nneg <4 x i16> [[VEC_SHUFFLE]] to <4 x i32>
+; CHECK-NEXT:    call void @use.v2i32(<2 x i32> [[VEC_INS]])
+; CHECK-NEXT:    ret <4 x i32> [[EXT_2]]
+;
+entry:
+  %l = load i32, ptr %di
+  %vec.ins = insertelement <2 x i32> <i32 poison, i32 0>, i32 %l, i64 0
+  %vec.bc = bitcast <2 x i32> %vec.ins to <8 x i8>
+  %e.1 = zext <8 x i8> %vec.bc to <8 x i16>
+  %vec.shuffle = shufflevector <8 x i16> %e.1, <8 x i16> poison, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
+  %ext.2  = zext nneg <4 x i16> %vec.shuffle to <4 x i32>
+  call void @use.v2i32(<2 x i32> %vec.ins)
+  ret <4 x i32> %ext.2
+}
+
+define <4 x i32> @load_i32_zext_to_v4i32_bc_other_users(ptr %di) {
+; CHECK-LABEL: define <4 x i32> @load_i32_zext_to_v4i32_bc_other_users(
+; CHECK-SAME: ptr [[DI:%.*]]) {
+; CHECK-NEXT:  [[ENTRY:.*:]]
+; CHECK-NEXT:    [[L:%.*]] = load i32, ptr [[DI]], align 4
+; CHECK-NEXT:    [[VEC_INS:%.*]] = insertelement <2 x i32> <i32 poison, i32 0>, i32 [[L]], i64 0
+; CHECK-NEXT:    [[VEC_BC:%.*]] = bitcast <2 x i32> [[VEC_INS]] to <8 x i8>
+; CHECK-NEXT:    [[E_1:%.*]] = zext <8 x i8> [[VEC_BC]] to <8 x i16>
+; CHECK-NEXT:    [[VEC_SHUFFLE:%.*]] = shufflevector <8 x i16> [[E_1]], <8 x i16> poison, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
+; CHECK-NEXT:    [[EXT_2:%.*]] = zext nneg <4 x i16> [[VEC_SHUFFLE]] to <4 x i32>
+; CHECK-NEXT:    call void @use.v8i8(<8 x i8> [[VEC_BC]])
+; CHECK-NEXT:    ret <4 x i32> [[EXT_2]]
+;
+entry:
+  %l = load i32, ptr %di
+  %vec.ins = insertelement <2 x i32> <i32 poison, i32 0>, i32 %l, i64 0
+  %vec.bc = bitcast <2 x i32> %vec.ins to <8 x i8>
+  %e.1 = zext <8 x i8> %vec.bc to <8 x i16>
+  %vec.shuffle = shufflevector <8 x i16> %e.1, <8 x i16> poison, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
+  %ext.2  = zext nneg <4 x i16> %vec.shuffle to <4 x i32>
+  call void @use.v8i8(<8 x i8> %vec.bc)
+  ret <4 x i32> %ext.2
+}
+
+define <4 x i32> @load_i32_zext_to_v4i32_ext_other_users(ptr %di) {
+; CHECK-LABEL: define <4 x i32> @load_i32_zext_to_v4i32_ext_other_users(
+; CHECK-SAME: ptr [[DI:%.*]]) {
+; CHECK-NEXT:  [[ENTRY:.*:]]
+; CHECK-NEXT:    [[L:%.*]] = load i32, ptr [[DI]], align 4
+; CHECK-NEXT:    [[VEC_INS:%.*]] = insertelement <2 x i32> <i32 poison, i32 0>, i32 [[L]], i64 0
+; CHECK-NEXT:    [[VEC_BC:%.*]] = bitcast <2 x i32> [[VEC_INS]] to <8 x i8>
+; CHECK-NEXT:    [[E_1:%.*]] = zext <8 x i8> [[VEC_BC]] to <8 x i16>
+; CHECK-NEXT:    [[VEC_SHUFFLE:%.*]] = shufflevector <8 x i16> [[E_1]], <8 x i16> poison, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
+; CHECK-NEXT:    [[EXT_2:%.*]] = zext nneg <4 x i16> [[VEC_SHUFFLE]] to <4 x i32>
+; CHECK-NEXT:    call void @use.v8i16(<8 x i16> [[E_1]])
+; CHECK-NEXT:    ret <4 x i32> [[EXT_2]]
+;
+entry:
+  %l = load i32, ptr %di
+  %vec.ins = insertelement <2 x i32> <i32 poison, i32 0>, i32 %l, i64 0
+  %vec.bc = bitcast <2 x i32> %vec.ins to <8 x i8>
+  %e.1 = zext <8 x i8> %vec.bc to <8 x i16>
+  %vec.shuffle = shufflevector <8 x i16> %e.1, <8 x i16> poison, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
+  %ext.2  = zext nneg <4 x i16> %vec.shuffle to <4 x i32>
+  call void @use.v8i16(<8 x i16> %e.1)
+  ret <4 x i32> %ext.2
+}
+
+define <4 x i32> @load_i32_zext_to_v4i32_shuffle_other_users(ptr %di) {
+; CHECK-LABEL: define <4 x i32> @load_i32_zext_to_v4i32_shuffle_other_users(
+; CHECK-SAME: ptr [[DI:%.*]]) {
+; CHECK-NEXT:  [[ENTRY:.*:]]
+; CHECK-NEXT:    [[L:%.*]] = load i32, ptr [[DI]], align 4
+; CHECK-NEXT:    [[VEC_INS:%.*]] = insertelement <2 x i32> <i32 poison, i32 0>, i32 [[L]], i64 0
+; CHECK-NEXT:    [[VEC_BC:%.*]] = bitcast <2 x i32> [[VEC_INS]] to <8 x i8>
+; CHECK-NEXT:    [[E_1:%.*]] = zext <8 x i8> [[VEC_BC]] to <8 x i16>
+; CHECK-NEXT:    [[VEC_SHUFFLE:%.*]] = shufflevector <8 x i16> [[E_1]], <8 x i16> poison, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
+; CHECK-NEXT:    [[EXT_2:%.*]] = zext nneg <4 x i16> [[VEC_SHUFFLE]] to <4 x i32>
+; CHECK-NEXT:    call void @use.v4i16(<4 x i16> [[VEC_SHUFFLE]])
+; CHECK-NEXT:    ret <4 x i32> [[EXT_2]]
+;
+entry:
+  %l = load i32, ptr %di
+  %vec.ins = insertelement <2 x i32> <i32 poison, i32 0>, i32 %l, i64 0
+  %vec.bc = bitcast <2 x i32> %vec.ins to <8 x i8>
+  %e.1 = zext <8 x i8> %vec.bc to <8 x i16>
+  %vec.shuffle = shufflevector <8 x i16> %e.1, <8 x i16> poison, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
+  %ext.2  = zext nneg <4 x i16> %vec.shuffle to <4 x i32>
+  call void @use.v4i16(<4 x i16> %vec.shuffle)
+  ret <4 x i32> %ext.2
+}
+
+define <8 x i32> @load_i64_zext_to_v8i32(ptr %di) {
+; CHECK-LABEL: define <8 x i32> @load_i64_zext_to_v8i32(
+; CHECK-SAME: ptr [[DI:%.*]]) {
+; CHECK-NEXT:  [[ENTRY:.*:]]
+; CHECK-NEXT:    [[L_VEC:%.*]] = load <8 x i8>, ptr [[DI]], align 8
+; CHECK-NEXT:    [[OUTER_EXT:%.*]] = zext <8 x i8> [[L_VEC]] to <8 x i32>
+; CHECK-NEXT:    ret <8 x i32> [[OUTER_EXT]]
+;
+entry:
+  %l = load i64, ptr %di
+  %vec.ins = insertelement <2 x i64> <i64 poison, i64 0>, i64 %l, i64 0
+  %vec.bc = bitcast <2 x i64> %vec.ins to <16 x i8>
+  %ext.1 = zext <16 x i8> %vec.bc to <16 x i16>
+  %vec.shuffle = shufflevector <16 x i16> %ext.1, <16 x i16> poison, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
+  %outer.ext = zext nneg <8 x i16> %vec.shuffle to <8 x i32>
+  ret <8 x i32> %outer.ext
+}
+
+define <3 x i32> @load_i24_zext_to_v3i32(ptr %di) {
+; CHECK-LABEL: define <3 x i32> @load_i24_zext_to_v3i32(
+; CHECK-SAME: ptr [[DI:%.*]]) {
+; CHECK-NEXT:  [[ENTRY:.*:]]
+; CHECK-NEXT:    [[L_VEC:%.*]] = load <3 x i8>, ptr [[DI]], align 4
+; CHECK-NEXT:    [[EXT_2:%.*]] = zext <3 x i8> [[L_VEC]] to <3 x i32>
+; CHECK-NEXT:    ret <3 x i32> [[EXT_2]]
+;
+entry:
+  %l = load i24, ptr %di
+  %vec.ins = insertelement <2 x i24> <i24 poison, i24 0>, i24 %l, i64 0
+  %vec.bc = bitcast <2 x i24> %vec.ins to <6 x i8>
+  %ext.1 = zext <6 x i8> %vec.bc to <6 x i16>
+  %vec.shuffle = shufflevector <6 x i16> %ext.1, <6 x i16> poison, <3 x i32> <i32 0, i32 1, i32 2>
+  %ext.2 = zext nneg <3 x i16> %vec.shuffle to <3 x i32>
+  ret <3 x i32> %ext.2
+}
+
+define <4 x i32> @load_i32_insert_idx_1_sext(ptr %di) {
+; CHECK-LABEL: define <4 x i32> @load_i32_insert_idx_1_sext(
+; CHECK-SAME: ptr [[DI:%.*]]) {
+; CHECK-NEXT:  [[ENTRY:.*:]]
+; CHECK-NEXT:    [[L:%.*]] = load i32, ptr [[DI]], align 4
+; CHECK-NEXT:    [[VEC_INS:%.*]] = insertelement <2 x i32> <i32 0, i32 poison>, i32 [[L]], i64 1
+; CHECK-NEXT:    [[VEC_BC:%.*]] = bitcast <2 x i32> [[VEC_INS]] to <8 x i8>
+; CHECK-NEXT:    [[EXT_1:%.*]] = zext <8 x i8> [[VEC_BC]] to <8 x i16>
+; CHECK-NEXT:    [[VEC_SHUFFLE:%.*]] = shufflevector <8 x i16> [[EXT_1]], <8 x i16> poison, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
+; CHECK-NEXT:    [[EXT_2:%.*]] = zext nneg <4 x i16> [[VEC_SHUFFLE]] to <4 x i32>
+; CHECK-NEXT:    ret <4 x i32> [[EXT_2]]
+;
+entry:
+  %l = load i32, ptr %di
+  %vec.ins = insertelement <2 x i32> <i32 0, i32 poison>, i32 %l, i64 1
+  %vec.bc = bitcast <2 x i32> %vec.ins to <8 x i8>
+  %ext.1 = zext <8 x i8> %vec.bc to <8 x i16>
+  %vec.shuffle = shufflevector <8 x i16> %ext.1, <8 x i16> poison, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
+  %ext.2 = zext nneg <4 x i16> %vec.shuffle to <4 x i32>
+  ret <4 x i32> %ext.2
+}
+
+define <4 x i32> @mask_extracts_not_all_elements_1_sext(ptr %di) {
+; CHECK-LABEL: define <4 x i32> @mask_extracts_not_all_elements_1_sext(
+; CHECK-SAME: ptr [[DI:%.*]]) {
+; CHECK-NEXT:  [[ENTRY:.*:]]
+; CHECK-NEXT:    [[L:%.*]] = load i32, ptr [[DI]], align 4
+; CHECK-NEXT:    [[VEC_INS:%.*]] = insertelement <2 x i32> <i32 poison, i32 0>, i32 [[L]], i64 0
+; CHECK-NEXT:    [[VEC_BC:%.*]] = bitcast <2 x i32> [[VEC_INS]] to <8 x i8>
+; CHECK-NEXT:    [[EXT_1:%.*]] = zext <8 x i8> [[VEC_BC]] to <8 x i16>
+; CHECK-NEXT:    [[VEC_SHUFFLE:%.*]] = shufflevector <8 x i16> [[EXT_1]], <8 x i16> poison, <4 x i32> <i32 0, i32 1, i32 2, i32 2>
+; CHECK-NEXT:    [[EXT_2:%.*]] = zext nneg <4 x i16> [[VEC_SHUFFLE]] to <4 x i32>
+; CHECK-NEXT:    ret <4 x i32> [[EXT_2]]
+;
+entry:
+  %l = load i32, ptr %di
+  %vec.ins = insertelement <2 x i32> <i32 poison, i32 0>, i32 %l, i64 0
+  %vec.bc = bitcast <2 x i32> %vec.ins to <8 x i8>
+  %ext.1 = zext <8 x i8> %vec.bc to <8 x i16>
+  %vec.shuffle = shufflevector <8 x i16> %ext.1, <8 x i16> poison, <4 x i32> <i32 0, i32 1, i32 2, i32 2>
+  %ext.2 = zext nneg <4 x i16> %vec.shuffle to <4 x i32>
+  ret <4 x i32> %ext.2
+}
+
+define <4 x i32> @mask_extracts_not_all_elements_2_sext(ptr %di) {
+; CHECK-LABEL: define <4 x i32> @mask_extracts_not_all_elements_2_sext(
+; CHECK-SAME: ptr [[DI:%.*]]) {
+; CHECK-NEXT:  [[ENTRY:.*:]]
+; CHECK-NEXT:    [[L:%.*]] = load i32, ptr [[DI]], align 4
+; CHECK-NEXT:    [[VEC_INS:%.*]] = insertelement <2 x i32> <i32 poison, i32 0>, i32 [[L]], i64 0
+; CHECK-NEXT:    [[VEC_BC:%.*]] = bitcast <2 x i32> [[VEC_INS]] to <8 x i8>
+; CHECK-NEXT:    [[EXT_1:%.*]] = zext <8 x i8> [[VEC_BC]] to <8 x i16>
+; CHECK-NEXT:    [[VEC_SHUFFLE:%.*]] = shufflevector <8 x i16> [[EXT_1]], <8 x i16> poison, <4 x i32> <i32 0, i32 1, i32 2, i32 4>
+; CHECK-NEXT:    [[EXT_2:%.*]] = zext nneg <4 x i16> [[VEC_SHUFFLE]] to <4 x i32>
+; CHECK-NEXT:    ret <4 x i32> [[EXT_2]]
+;
+entry:
+  %l = load i32, ptr %di
+  %vec.ins = insertelement <2 x i32> <i32 poison, i32 0>, i32 %l, i64 0
+  %vec.bc = bitcast <2 x i32> %vec.ins to <8 x i8>
+  %ext.1 = zext <8 x i8> %vec.bc to <8 x i16>
+  %vec.shuffle = shufflevector <8 x i16> %ext.1, <8 x i16> poison, <4 x i32> <i32 0, i32 1, i32 2, i32 4>
+  %ext.2 = zext nneg <4 x i16> %vec.shuffle to <4 x i32>
+  ret <4 x i32> %ext.2
+}
+
+define <4 x i32> @mask_extracts_second_vector_sext(ptr %di, <8 x i16> %other) {
+; CHECK-LABEL: define <4 x i32> @mask_extracts_second_vector_sext(
+; CHECK-SAME: ptr [[DI:%.*]], <8 x i16> [[OTHER:%.*]]) {
+; CHECK-NEXT:  [[ENTRY:.*:]]
+; CHECK-NEXT:    [[L:%.*]] = load i32, ptr [[DI]], align 4
+; CHECK-NEXT:    [[VEC_INS:%.*]] = insertelement <2 x i32> <i32 poison, i32 0>, i32 [[L]], i64 0
+; CHECK-NEXT:    [[VEC_BC:%.*]] = bitcast <2 x i32> [[VEC_INS]] to <8 x i8>
+; CHECK-NEXT:    [[EXT_1:%.*]] = zext <8 x i8> [[VEC_BC]] to <8 x i16>
+; CHECK-NEXT:    [[VEC_SHUFFLE:%.*]] = shufflevector <8 x i16> [[EXT_1]], <8 x i16> [[OTHER]], <4 x i32> <i32 4, i32 5, i32 6, i32 7>
+; CHECK-NEXT:    [[EXT_2:%.*]] = zext nneg <4 x i16> [[VEC_SHUFFLE]] to <4 x i32>
+; CHECK-NEXT:    ret <4 x i32> [[EXT_2]]
+;
+entry:
+  %l = load i32, ptr %di
+  %vec.ins = insertelement <2 x i32> <i32 poison, i32 0>, i32 %l, i64 0
+  %vec.bc = bitcast <2 x i32> %vec.ins to <8 x i8>
+  %ext.1 = zext <8 x i8> %vec.bc to <8 x i16>
+  %vec.shuffle = shufflevector <8 x i16> %ext.1, <8 x i16> %other, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
+  %ext.2 = zext nneg <4 x i16> %vec.shuffle to <4 x i32>
+  ret <4 x i32> %ext.2
+}
+
+define <4 x i32> @load_i32_sext_to_v4i32(ptr %di) {
+; CHECK-LABEL: define <4 x i32> @load_i32_sext_to_v4i32(
+; CHECK-SAME: ptr [[DI:%.*]]) {
+; CHECK-NEXT:  [[ENTRY:.*:]]
+; CHECK-NEXT:    [[L_VEC:%.*]] = load <4 x i8>, ptr [[DI]], align 4
+; CHECK-NEXT:    [[EXT_2:%.*]] = sext <4 x i8> [[L_VEC]] to <4 x i32>
+; CHEC...
[truncated]

llvmbot (Member) commented May 22, 2025

@llvm/pr-subscribers-vectorizers

Author: Florian Hahn (fhahn)

+; CHECK-NEXT:    ret <4 x i32> [[EXT_2]]
+;
+entry:
+  %l = load i32, ptr %di
+  %vec.ins = insertelement <2 x i32> <i32 poison, i32 0>, i32 %l, i64 0
+  %vec.bc = bitcast <2 x i32> %vec.ins to <8 x i8>
+  %e.1 = zext <8 x i8> %vec.bc to <8 x i16>
+  %vec.shuffle = shufflevector <8 x i16> %e.1, <8 x i16> poison, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
+  %ext.2  = zext nneg <4 x i16> %vec.shuffle to <4 x i32>
+  call void @use.v8i8(<8 x i8> %vec.bc)
+  ret <4 x i32> %ext.2
+}
+
+define <4 x i32> @load_i32_zext_to_v4i32_ext_other_users(ptr %di) {
+; CHECK-LABEL: define <4 x i32> @load_i32_zext_to_v4i32_ext_other_users(
+; CHECK-SAME: ptr [[DI:%.*]]) {
+; CHECK-NEXT:  [[ENTRY:.*:]]
+; CHECK-NEXT:    [[L:%.*]] = load i32, ptr [[DI]], align 4
+; CHECK-NEXT:    [[VEC_INS:%.*]] = insertelement <2 x i32> <i32 poison, i32 0>, i32 [[L]], i64 0
+; CHECK-NEXT:    [[VEC_BC:%.*]] = bitcast <2 x i32> [[VEC_INS]] to <8 x i8>
+; CHECK-NEXT:    [[E_1:%.*]] = zext <8 x i8> [[VEC_BC]] to <8 x i16>
+; CHECK-NEXT:    [[VEC_SHUFFLE:%.*]] = shufflevector <8 x i16> [[E_1]], <8 x i16> poison, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
+; CHECK-NEXT:    [[EXT_2:%.*]] = zext nneg <4 x i16> [[VEC_SHUFFLE]] to <4 x i32>
+; CHECK-NEXT:    call void @use.v8i16(<8 x i16> [[E_1]])
+; CHECK-NEXT:    ret <4 x i32> [[EXT_2]]
+;
+entry:
+  %l = load i32, ptr %di
+  %vec.ins = insertelement <2 x i32> <i32 poison, i32 0>, i32 %l, i64 0
+  %vec.bc = bitcast <2 x i32> %vec.ins to <8 x i8>
+  %e.1 = zext <8 x i8> %vec.bc to <8 x i16>
+  %vec.shuffle = shufflevector <8 x i16> %e.1, <8 x i16> poison, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
+  %ext.2  = zext nneg <4 x i16> %vec.shuffle to <4 x i32>
+  call void @use.v8i16(<8 x i16> %e.1)
+  ret <4 x i32> %ext.2
+}
+
+define <4 x i32> @load_i32_zext_to_v4i32_shuffle_other_users(ptr %di) {
+; CHECK-LABEL: define <4 x i32> @load_i32_zext_to_v4i32_shuffle_other_users(
+; CHECK-SAME: ptr [[DI:%.*]]) {
+; CHECK-NEXT:  [[ENTRY:.*:]]
+; CHECK-NEXT:    [[L:%.*]] = load i32, ptr [[DI]], align 4
+; CHECK-NEXT:    [[VEC_INS:%.*]] = insertelement <2 x i32> <i32 poison, i32 0>, i32 [[L]], i64 0
+; CHECK-NEXT:    [[VEC_BC:%.*]] = bitcast <2 x i32> [[VEC_INS]] to <8 x i8>
+; CHECK-NEXT:    [[E_1:%.*]] = zext <8 x i8> [[VEC_BC]] to <8 x i16>
+; CHECK-NEXT:    [[VEC_SHUFFLE:%.*]] = shufflevector <8 x i16> [[E_1]], <8 x i16> poison, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
+; CHECK-NEXT:    [[EXT_2:%.*]] = zext nneg <4 x i16> [[VEC_SHUFFLE]] to <4 x i32>
+; CHECK-NEXT:    call void @use.v8i16(<4 x i16> [[VEC_SHUFFLE]])
+; CHECK-NEXT:    ret <4 x i32> [[EXT_2]]
+;
+entry:
+  %l = load i32, ptr %di
+  %vec.ins = insertelement <2 x i32> <i32 poison, i32 0>, i32 %l, i64 0
+  %vec.bc = bitcast <2 x i32> %vec.ins to <8 x i8>
+  %e.1 = zext <8 x i8> %vec.bc to <8 x i16>
+  %vec.shuffle = shufflevector <8 x i16> %e.1, <8 x i16> poison, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
+  %ext.2  = zext nneg <4 x i16> %vec.shuffle to <4 x i32>
+  call void @use.v8i16(<4 x i16> %vec.shuffle)
+  ret <4 x i32> %ext.2
+}
+
+define <8 x i32> @load_i64_zext_to_v8i32(ptr %di) {
+; CHECK-LABEL: define <8 x i32> @load_i64_zext_to_v8i32(
+; CHECK-SAME: ptr [[DI:%.*]]) {
+; CHECK-NEXT:  [[ENTRY:.*:]]
+; CHECK-NEXT:    [[L_VEC:%.*]] = load <8 x i8>, ptr [[DI]], align 8
+; CHECK-NEXT:    [[OUTER_EXT:%.*]] = zext <8 x i8> [[L_VEC]] to <8 x i32>
+; CHECK-NEXT:    ret <8 x i32> [[OUTER_EXT]]
+;
+entry:
+  %l = load i64, ptr %di
+  %vec.ins = insertelement <2 x i64> <i64 poison, i64 0>, i64 %l, i64 0
+  %vec.bc = bitcast <2 x i64> %vec.ins to <16 x i8>
+  %ext.1 = zext <16 x i8> %vec.bc to <16 x i16>
+  %vec.shuffle = shufflevector <16 x i16> %ext.1, <16 x i16> poison, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
+  %outer.ext = zext nneg <8 x i16> %vec.shuffle to <8 x i32>
+  ret <8 x i32> %outer.ext
+}
+
+define <3 x i32> @load_i24_zext_to_v3i32(ptr %di) {
+; CHECK-LABEL: define <3 x i32> @load_i24_zext_to_v3i32(
+; CHECK-SAME: ptr [[DI:%.*]]) {
+; CHECK-NEXT:  [[ENTRY:.*:]]
+; CHECK-NEXT:    [[L_VEC:%.*]] = load <3 x i8>, ptr [[DI]], align 4
+; CHECK-NEXT:    [[EXT_2:%.*]] = zext <3 x i8> [[L_VEC]] to <3 x i32>
+; CHECK-NEXT:    ret <3 x i32> [[EXT_2]]
+;
+entry:
+  %l = load i24, ptr %di
+  %vec.ins = insertelement <2 x i24> <i24 poison, i24 0>, i24 %l, i64 0
+  %vec.bc = bitcast <2 x i24> %vec.ins to <6 x i8>
+  %ext.1 = zext <6 x i8> %vec.bc to <6 x i16>
+  %vec.shuffle = shufflevector <6 x i16> %ext.1, <6 x i16> poison, <3 x i32> <i32 0, i32 1, i32 2>
+  %ext.2 = zext nneg <3 x i16> %vec.shuffle to <3 x i32>
+  ret <3 x i32> %ext.2
+}
+
+define <4 x i32> @load_i32_insert_idx_1_sext(ptr %di) {
+; CHECK-LABEL: define <4 x i32> @load_i32_insert_idx_1_sext(
+; CHECK-SAME: ptr [[DI:%.*]]) {
+; CHECK-NEXT:  [[ENTRY:.*:]]
+; CHECK-NEXT:    [[L:%.*]] = load i32, ptr [[DI]], align 4
+; CHECK-NEXT:    [[VEC_INS:%.*]] = insertelement <2 x i32> <i32 0, i32 poison>, i32 [[L]], i64 1
+; CHECK-NEXT:    [[VEC_BC:%.*]] = bitcast <2 x i32> [[VEC_INS]] to <8 x i8>
+; CHECK-NEXT:    [[EXT_1:%.*]] = zext <8 x i8> [[VEC_BC]] to <8 x i16>
+; CHECK-NEXT:    [[VEC_SHUFFLE:%.*]] = shufflevector <8 x i16> [[EXT_1]], <8 x i16> poison, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
+; CHECK-NEXT:    [[EXT_2:%.*]] = zext nneg <4 x i16> [[VEC_SHUFFLE]] to <4 x i32>
+; CHECK-NEXT:    ret <4 x i32> [[EXT_2]]
+;
+entry:
+  %l = load i32, ptr %di
+  %vec.ins = insertelement <2 x i32> <i32 0, i32 poison>, i32 %l, i64 1
+  %vec.bc = bitcast <2 x i32> %vec.ins to <8 x i8>
+  %ext.1 = zext <8 x i8> %vec.bc to <8 x i16>
+  %vec.shuffle = shufflevector <8 x i16> %ext.1, <8 x i16> poison, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
+  %ext.2 = zext nneg <4 x i16> %vec.shuffle to <4 x i32>
+  ret <4 x i32> %ext.2
+}
+
+define <4 x i32> @mask_extracts_not_all_elements_1_sext(ptr %di) {
+; CHECK-LABEL: define <4 x i32> @mask_extracts_not_all_elements_1_sext(
+; CHECK-SAME: ptr [[DI:%.*]]) {
+; CHECK-NEXT:  [[ENTRY:.*:]]
+; CHECK-NEXT:    [[L:%.*]] = load i32, ptr [[DI]], align 4
+; CHECK-NEXT:    [[VEC_INS:%.*]] = insertelement <2 x i32> <i32 poison, i32 0>, i32 [[L]], i64 0
+; CHECK-NEXT:    [[VEC_BC:%.*]] = bitcast <2 x i32> [[VEC_INS]] to <8 x i8>
+; CHECK-NEXT:    [[EXT_1:%.*]] = zext <8 x i8> [[VEC_BC]] to <8 x i16>
+; CHECK-NEXT:    [[VEC_SHUFFLE:%.*]] = shufflevector <8 x i16> [[EXT_1]], <8 x i16> poison, <4 x i32> <i32 0, i32 1, i32 2, i32 2>
+; CHECK-NEXT:    [[EXT_2:%.*]] = zext nneg <4 x i16> [[VEC_SHUFFLE]] to <4 x i32>
+; CHECK-NEXT:    ret <4 x i32> [[EXT_2]]
+;
+entry:
+  %l = load i32, ptr %di
+  %vec.ins = insertelement <2 x i32> <i32 poison, i32 0>, i32 %l, i64 0
+  %vec.bc = bitcast <2 x i32> %vec.ins to <8 x i8>
+  %ext.1 = zext <8 x i8> %vec.bc to <8 x i16>
+  %vec.shuffle = shufflevector <8 x i16> %ext.1, <8 x i16> poison, <4 x i32> <i32 0, i32 1, i32 2, i32 2>
+  %ext.2 = zext nneg <4 x i16> %vec.shuffle to <4 x i32>
+  ret <4 x i32> %ext.2
+}
+
+define <4 x i32> @mask_extracts_not_all_elements_2_sext(ptr %di) {
+; CHECK-LABEL: define <4 x i32> @mask_extracts_not_all_elements_2_sext(
+; CHECK-SAME: ptr [[DI:%.*]]) {
+; CHECK-NEXT:  [[ENTRY:.*:]]
+; CHECK-NEXT:    [[L:%.*]] = load i32, ptr [[DI]], align 4
+; CHECK-NEXT:    [[VEC_INS:%.*]] = insertelement <2 x i32> <i32 poison, i32 0>, i32 [[L]], i64 0
+; CHECK-NEXT:    [[VEC_BC:%.*]] = bitcast <2 x i32> [[VEC_INS]] to <8 x i8>
+; CHECK-NEXT:    [[EXT_1:%.*]] = zext <8 x i8> [[VEC_BC]] to <8 x i16>
+; CHECK-NEXT:    [[VEC_SHUFFLE:%.*]] = shufflevector <8 x i16> [[EXT_1]], <8 x i16> poison, <4 x i32> <i32 0, i32 1, i32 2, i32 4>
+; CHECK-NEXT:    [[EXT_2:%.*]] = zext nneg <4 x i16> [[VEC_SHUFFLE]] to <4 x i32>
+; CHECK-NEXT:    ret <4 x i32> [[EXT_2]]
+;
+entry:
+  %l = load i32, ptr %di
+  %vec.ins = insertelement <2 x i32> <i32 poison, i32 0>, i32 %l, i64 0
+  %vec.bc = bitcast <2 x i32> %vec.ins to <8 x i8>
+  %ext.1 = zext <8 x i8> %vec.bc to <8 x i16>
+  %vec.shuffle = shufflevector <8 x i16> %ext.1, <8 x i16> poison, <4 x i32> <i32 0, i32 1, i32 2, i32 4>
+  %ext.2 = zext nneg <4 x i16> %vec.shuffle to <4 x i32>
+  ret <4 x i32> %ext.2
+}
+
+define <4 x i32> @mask_extracts_second_vector_sext(ptr %di, <8 x i16> %other) {
+; CHECK-LABEL: define <4 x i32> @mask_extracts_second_vector_sext(
+; CHECK-SAME: ptr [[DI:%.*]], <8 x i16> [[OTHER:%.*]]) {
+; CHECK-NEXT:  [[ENTRY:.*:]]
+; CHECK-NEXT:    [[L:%.*]] = load i32, ptr [[DI]], align 4
+; CHECK-NEXT:    [[VEC_INS:%.*]] = insertelement <2 x i32> <i32 poison, i32 0>, i32 [[L]], i64 0
+; CHECK-NEXT:    [[VEC_BC:%.*]] = bitcast <2 x i32> [[VEC_INS]] to <8 x i8>
+; CHECK-NEXT:    [[EXT_1:%.*]] = zext <8 x i8> [[VEC_BC]] to <8 x i16>
+; CHECK-NEXT:    [[VEC_SHUFFLE:%.*]] = shufflevector <8 x i16> [[EXT_1]], <8 x i16> [[OTHER]], <4 x i32> <i32 4, i32 5, i32 6, i32 7>
+; CHECK-NEXT:    [[EXT_2:%.*]] = zext nneg <4 x i16> [[VEC_SHUFFLE]] to <4 x i32>
+; CHECK-NEXT:    ret <4 x i32> [[EXT_2]]
+;
+entry:
+  %l = load i32, ptr %di
+  %vec.ins = insertelement <2 x i32> <i32 poison, i32 0>, i32 %l, i64 0
+  %vec.bc = bitcast <2 x i32> %vec.ins to <8 x i8>
+  %ext.1 = zext <8 x i8> %vec.bc to <8 x i16>
+  %vec.shuffle = shufflevector <8 x i16> %ext.1, <8 x i16> %other, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
+  %ext.2 = zext nneg <4 x i16> %vec.shuffle to <4 x i32>
+  ret <4 x i32> %ext.2
+}
+
+define <4 x i32> @load_i32_sext_to_v4i32(ptr %di) {
+; CHECK-LABEL: define <4 x i32> @load_i32_sext_to_v4i32(
+; CHECK-SAME: ptr [[DI:%.*]]) {
+; CHECK-NEXT:  [[ENTRY:.*:]]
+; CHECK-NEXT:    [[L_VEC:%.*]] = load <4 x i8>, ptr [[DI]], align 4
+; CHECK-NEXT:    [[EXT_2:%.*]] = sext <4 x i8> [[L_VEC]] to <4 x i32>
+; CHEC...
[truncated]

!isa<LoadInst>(L))
return false;
auto *InnerExt = cast<Instruction>(I.getOperand(0));
auto *OuterExt = dyn_cast<Instruction>(*I.user_begin());
Contributor

Suggested change
- auto *OuterExt = dyn_cast<Instruction>(*I.user_begin());
+ auto *OuterExt = cast<Instruction>(*I.user_begin());

Unchecked dyn_cast

Contributor Author

Fixed, thanks. I might be misremembering, but I thought there was a way to capture pointers to matched instructions. I couldn't find how to do it, so I had to resort to casts after the fact.

Builder.SetInsertPoint(*L->getInsertionPointAfterDef());
auto *NewLoad = cast<LoadInst>(
Builder.CreateLoad(SrcTy, L->getOperand(0), L->getName() + ".vec"));
auto *NewExt = isa<ZExtInst>(InnerExt) ? Builder.CreateZExt(NewLoad, DstTy)
Contributor

Loses nneg flag on the cast?

Contributor Author

Thanks, updated to preserve it. Alive2: https://alive2.llvm.org/ce/z/HyfyPo
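
For readers following the nneg discussion, here is a minimal standalone sketch of why carrying the flag over can be sound. It is not the Alive2 proof and not code from the patch. It assumes the flag in question sits on the inner `zext` (i8 to i16) and models poison as `std::nullopt`; all function names are hypothetical.

```cpp
#include <cstdint>
#include <optional>

// Model of `zext nneg`: std::nullopt stands in for poison, which the flag
// produces when the operand's sign bit is set.
std::optional<uint16_t> zextNNeg8To16(uint8_t V) {
  if (V & 0x80)
    return std::nullopt;
  return static_cast<uint16_t>(V);
}

std::optional<uint32_t> zextNNeg8To32(uint8_t V) {
  if (V & 0x80)
    return std::nullopt;
  return static_cast<uint32_t>(V);
}

std::optional<uint32_t> zextNNeg16To32(uint16_t V) {
  if (V & 0x8000)
    return std::nullopt;
  return static_cast<uint32_t>(V);
}

// Original chain under the stated assumption: inner `zext nneg` i8 -> i16
// followed by a plain outer zext i16 -> i32.
std::optional<uint32_t> chainInnerNNeg(uint8_t V) {
  std::optional<uint16_t> Inner = zextNNeg8To16(V);
  if (!Inner)
    return std::nullopt; // poison propagates through the outer zext
  return static_cast<uint32_t>(*Inner);
}

// Exhaustively compares the chain with a single `zext nneg` i8 -> i32.
bool nnegPreservationHolds() {
  for (unsigned I = 0; I < 256; ++I) {
    uint8_t V = static_cast<uint8_t>(I);
    // Same value, and poison for exactly the same inputs.
    if (chainInnerNNeg(V) != zextNNeg8To32(V))
      return false;
    // After a plain inner zext the i16 value is at most 255, so an outer
    // `zext nneg` can never produce poison.
    if (!zextNNeg16To32(static_cast<uint16_t>(V)))
      return false;
  }
  return true;
}
```

The second check in the loop also shows that `nneg` on the outer extend carries no information once the inner extend is a `zext`.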

fhahn added a commit that referenced this pull request May 23, 2025
@fhahn fhahn force-pushed the vector-combine-combine-load-ext-chain branch from e1d5787 to 5d5fe9b Compare May 23, 2025 13:09

github-actions bot commented May 23, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

llvm-sync bot pushed a commit to arm/arm-toolchain that referenced this pull request May 23, 2025
!isa<LoadInst>(L))
return false;
auto *InnerExt = cast<Instruction>(I.getOperand(0));
auto *OuterExt = cast<Instruction>(*I.user_begin());
Collaborator

Can OuterExt cast ever fail?

Contributor Author

I don't think so: it is a user of an instruction, which in turn should require it to be an instruction itself.

fhahn added 3 commits May 27, 2025 11:45
Add a new combine that folds a chain of (scalar load)->ext->ext (with
shuffles/casts/inserts in between) to a single vector load and wide
extend.

This makes the IR simpler to analyze and to process, while the backend
can still decide to break them up. Code like that comes from code
written with vector intrinsics. Some examples of real-world use are in
https://github.com/ARM-software/astc-encoder/.
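
The equivalence the fold relies on can be sketched outside LLVM. The following is a standalone model, not the patch's implementation: it assumes little-endian lane numbering for the `bitcast`, covers only the `zext` case with insert index 0 and a shuffle mask of `<0, 1, 2, 3>`, and uses hypothetical function names.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Models the original chain: load i32 -> insertelement -> bitcast -> zext ->
// shufflevector -> zext, assuming a little-endian target.
std::array<uint32_t, 4> chainOfExts(const unsigned char *P) {
  // %l = load i32, ptr %di
  uint32_t L = static_cast<uint32_t>(P[0]) |
               (static_cast<uint32_t>(P[1]) << 8) |
               (static_cast<uint32_t>(P[2]) << 16) |
               (static_cast<uint32_t>(P[3]) << 24);
  // insertelement <2 x i32> <i32 poison, i32 0>, i32 %l, i64 0
  std::array<uint32_t, 2> Ins = {L, 0};
  // bitcast <2 x i32> to <8 x i8>: byte lanes in memory order
  std::array<uint8_t, 8> Bc{};
  for (std::size_t Lane = 0; Lane < 8; ++Lane)
    Bc[Lane] = static_cast<uint8_t>(Ins[Lane / 4] >> (8 * (Lane % 4)));
  // zext <8 x i8> to <8 x i16>
  std::array<uint16_t, 8> Ext1{};
  for (std::size_t Lane = 0; Lane < 8; ++Lane)
    Ext1[Lane] = Bc[Lane];
  // shufflevector with mask <0, 1, 2, 3>, then zext <4 x i16> to <4 x i32>
  std::array<uint32_t, 4> Out{};
  for (std::size_t Lane = 0; Lane < 4; ++Lane)
    Out[Lane] = Ext1[Lane];
  return Out;
}

// Models the folded form: load <4 x i8>, then a single zext to <4 x i32>.
std::array<uint32_t, 4> foldedLoadExt(const unsigned char *P) {
  std::array<uint32_t, 4> Out{};
  for (std::size_t Lane = 0; Lane < 4; ++Lane)
    Out[Lane] = P[Lane];
  return Out;
}
```

Both functions return the same lanes for any four input bytes. The model also suggests why the tests keep the original IR when the shuffle mask is not exactly the low lanes or the scalar is inserted at index 1: the selected lanes then no longer correspond one to one to the loaded bytes.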
@fhahn fhahn force-pushed the vector-combine-combine-load-ext-chain branch from 23da6b2 to 4f1c76b Compare May 27, 2025 10:46
Collaborator

@RKSimon RKSimon left a comment

No objections; my only thought is why it's in VectorCombine and not InstCombine if it isn't cost/target driven?
