
[LV] Compute register usage for interleaving on VPlan. #126437


Merged
fhahn merged 5 commits into llvm:main from fhahn:vplan-reg-usage-for-interleaving on Apr 8, 2025

Conversation

fhahn
Contributor

@fhahn fhahn commented Feb 9, 2025

Add a version of calculateRegisterUsage that estimates register
usage for a VPlan. This mostly just ports the existing code, with some
updates to figure out which recipes will generate vectors vs. scalars.

There are a number of changes in the computed register usages, but they
should be more accurate w.r.t. the generated vector code.

The changes are:

  • Scalar usage increases by 1 in most cases, as we always create a
    scalar canonical IV, which is alive across the loop and is not
    considered by the legacy implementation.

  • Output is ordered by insertion; scalar registers are now added first
    due to the canonical IV phi.

  • Using the VPlan, we now also know more precisely whether an induction
    will be vectorized or scalarized.

Depends on #126415
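
For readers unfamiliar with the estimation scheme being ported, the following is a minimal standalone sketch of the interval scan described in the comment block of the new calculateRegisterUsage in the diff below. The names and containers here are illustrative only, not the LLVM implementation.

#include <algorithm>
#include <cstddef>
#include <map>
#include <set>
#include <vector>

// Each toy "instruction" records the indices of the instructions that
// define its operands. Instructions are assumed to be ordered so that
// defs precede uses (the patch uses RPO to guarantee this).
struct ToyInst {
  std::vector<std::size_t> Operands;
};

// Estimate peak register pressure as the maximum number of
// def-to-last-use intervals open at any single location.
std::size_t estimateMaxPressure(const std::vector<ToyInst> &Insts) {
  // Map each definition to the index of its last use; later uses
  // overwrite earlier ones.
  std::map<std::size_t, std::size_t> EndPoint;
  for (std::size_t Idx = 0; Idx < Insts.size(); ++Idx)
    for (std::size_t Op : Insts[Idx].Operands)
      EndPoint[Op] = Idx;

  // Transpose into lists of intervals that end at each location.
  std::map<std::size_t, std::vector<std::size_t>> TransposeEnds;
  for (const auto &[Def, End] : EndPoint)
    TransposeEnds[End].push_back(Def);

  std::set<std::size_t> Open;
  std::size_t Max = 0;
  for (std::size_t Idx = 0; Idx < Insts.size(); ++Idx) {
    for (std::size_t Def : TransposeEnds[Idx])
      Open.erase(Def);                // intervals ending here close first
    Max = std::max(Max, Open.size()); // pressure at this location
    if (EndPoint.count(Idx))          // only defs with users open one
      Open.insert(Idx);
  }
  return Max;
}

The real version additionally buckets each open interval by register class and by whether the value stays scalar or becomes a vector for the candidate VF — which is precisely the decision the VPlan recipes make easier.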

@llvmbot
Member

llvmbot commented Feb 9, 2025

@llvm/pr-subscribers-llvm-transforms
@llvm/pr-subscribers-vectorizers

@llvm/pr-subscribers-backend-risc-v

Author: Florian Hahn (fhahn)

Changes

Add a version of calculateRegisterUsage that estimates register
usage for a VPlan. This mostly just ports the existing code, with some
updates to figure out which recipes will generate vectors vs. scalars.

There are a number of changes in the computed register usages, but they
should be more accurate w.r.t. the generated vector code.

The changes are:

  • Scalar usage increases by 1 in most cases, as we always create a
    scalar canonical IV, which is alive across the loop and is not
    considered by the legacy implementation.

  • Output is ordered by insertion; scalar registers are now added first
    due to the canonical IV phi.

  • Using the VPlan, we now also know more precisely whether an induction
    will be vectorized or scalarized.

Depends on #126415


Patch is 52.47 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/126437.diff

14 Files Affected:

  • (modified) llvm/lib/Transforms/Vectorize/LoopVectorize.cpp (+230-5)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/i1-reg-usage.ll (+2-2)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/reg-usage.ll (+4-4)
  • (modified) llvm/test/Transforms/LoopVectorize/LoongArch/reg-usage.ll (+5-3)
  • (modified) llvm/test/Transforms/LoopVectorize/PowerPC/exit-branch-cost.ll (+48-12)
  • (modified) llvm/test/Transforms/LoopVectorize/PowerPC/large-loop-rdx.ll (-4)
  • (modified) llvm/test/Transforms/LoopVectorize/PowerPC/reg-usage.ll (+8-4)
  • (modified) llvm/test/Transforms/LoopVectorize/RISCV/reg-usage-bf16.ll (+1-1)
  • (modified) llvm/test/Transforms/LoopVectorize/RISCV/reg-usage-f16.ll (+2-2)
  • (modified) llvm/test/Transforms/LoopVectorize/RISCV/reg-usage.ll (+5-5)
  • (modified) llvm/test/Transforms/LoopVectorize/RISCV/riscv-vector-reverse.ll (+14-6)
  • (modified) llvm/test/Transforms/LoopVectorize/X86/i1-reg-usage.ll (+1-1)
  • (modified) llvm/test/Transforms/LoopVectorize/X86/pr47437.ll (+21-99)
  • (modified) llvm/test/Transforms/LoopVectorize/X86/reg-usage.ll (+3-1)
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index c2d347cf9b7e0c9..9727487a1dd1396 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -1020,7 +1020,8 @@ class LoopVectorizationCostModel {
   /// If interleave count has been specified by metadata it will be returned.
   /// Otherwise, the interleave count is computed and returned. VF and LoopCost
   /// are the selected vectorization factor and the cost of the selected VF.
-  unsigned selectInterleaveCount(ElementCount VF, InstructionCost LoopCost);
+  unsigned selectInterleaveCount(VPlan &Plan, ElementCount VF,
+                                 InstructionCost LoopCost);
 
   /// Memory access instruction may be vectorized in more than one way.
   /// Form of instruction after vectorization depends on cost.
@@ -4878,8 +4879,232 @@ void LoopVectorizationCostModel::collectElementTypesForWidening() {
   }
 }
 
+/// Estimate the register usage for \p Plan and vectorization factors in \p VFs.
+/// Returns the register usage for each VF in \p VFs.
+static SmallVector<LoopVectorizationCostModel::RegisterUsage, 8>
+calculateRegisterUsage(VPlan &Plan, ArrayRef<ElementCount> VFs,
+                       const TargetTransformInfo &TTI) {
+  // This function calculates the register usage by measuring the highest number
+  // of values that are alive at a single location. Obviously, this is a very
+  // rough estimation. We scan the loop in a topological order in order and
+  // assign a number to each recipe. We use RPO to ensure that defs are
+  // met before their users. We assume that each recipe that has in-loop
+  // users starts an interval. We record every time that an in-loop value is
+  // used, so we have a list of the first and last occurrences of each
+  // recipe. Next, we transpose this data structure into a multi map that
+  // holds the list of intervals that *end* at a specific location. This multi
+  // map allows us to perform a linear search. We scan the instructions linearly
+  // and record each time that a new interval starts, by placing it in a set.
+  // If we find this value in the multi-map then we remove it from the set.
+  // The max register usage is the maximum size of the set.
+  // We also search for instructions that are defined outside the loop, but are
+  // used inside the loop. We need this number separately from the max-interval
+  // usage number because when we unroll, loop-invariant values do not take
+  // more register.
+  LoopVectorizationCostModel::RegisterUsage RU;
+
+  // Each 'key' in the map opens a new interval. The values
+  // of the map are the index of the 'last seen' usage of the
+  // recipe that is the key.
+  using IntervalMap = SmallDenseMap<VPRecipeBase *, unsigned, 16>;
+
+  // Maps recipe to its index.
+  SmallVector<VPRecipeBase *, 64> IdxToRecipe;
+  // Marks the end of each interval.
+  IntervalMap EndPoint;
+  // Saves the list of recipe indices that are used in the loop.
+  SmallPtrSet<VPRecipeBase *, 8> Ends;
+  // Saves the list of values that are used in the loop but are defined outside
+  // the loop (not including non-recipe values such as arguments and
+  // constants).
+  SmallSetVector<VPValue *, 8> LoopInvariants;
+  LoopInvariants.insert(&Plan.getVectorTripCount());
+
+  ReversePostOrderTraversal<VPBlockDeepTraversalWrapper<VPBlockBase *>> RPOT(
+      Plan.getVectorLoopRegion());
+  for (VPBasicBlock *VPBB : VPBlockUtils::blocksOnly<VPBasicBlock>(RPOT)) {
+    if (!VPBB->getParent())
+      break;
+    for (VPRecipeBase &R : *VPBB) {
+      IdxToRecipe.push_back(&R);
+
+      // Save the end location of each USE.
+      for (VPValue *U : R.operands()) {
+        auto *DefR = U->getDefiningRecipe();
+
+        // Ignore non-recipe values such as arguments, constants, etc.
+        // FIXME: Might need some motivation why these values are ignored. If
+        // for example an argument is used inside the loop it will increase the
+        // register pressure (so shouldn't we add it to LoopInvariants).
+        if (!DefR && (!U->getLiveInIRValue() ||
+                      !isa<Instruction>(U->getLiveInIRValue())))
+          continue;
+
+        // If this recipe is outside the loop then record it and continue.
+        if (!DefR) {
+          LoopInvariants.insert(U);
+          continue;
+        }
+
+        // Overwrite previous end points.
+        EndPoint[DefR] = IdxToRecipe.size();
+        Ends.insert(DefR);
+      }
+    }
+    if (VPBB == Plan.getVectorLoopRegion()->getExiting()) {
+      // VPWidenIntOrFpInductionRecipes are used implicitly at the end of the
+      // exiting block, where their increment will get materialized eventually.
+      for (auto &R : Plan.getVectorLoopRegion()->getEntryBasicBlock()->phis()) {
+        if (isa<VPWidenIntOrFpInductionRecipe>(&R)) {
+          EndPoint[&R] = IdxToRecipe.size();
+          Ends.insert(&R);
+        }
+      }
+    }
+  }
+
+  // Saves the list of intervals that end with the index in 'key'.
+  using RecipeList = SmallVector<VPRecipeBase *, 2>;
+  SmallDenseMap<unsigned, RecipeList, 16> TransposeEnds;
+
+  // Transpose the EndPoints to a list of values that end at each index.
+  for (auto &Interval : EndPoint)
+    TransposeEnds[Interval.second].push_back(Interval.first);
+
+  SmallPtrSet<VPRecipeBase *, 8> OpenIntervals;
+  SmallVector<LoopVectorizationCostModel::RegisterUsage, 8> RUs(VFs.size());
+  SmallVector<SmallMapVector<unsigned, unsigned, 4>, 8> MaxUsages(VFs.size());
+
+  LLVM_DEBUG(dbgs() << "LV(REG): Calculating max register usage:\n");
+
+  VPTypeAnalysis TypeInfo(Plan.getCanonicalIV()->getScalarType());
+
+  const auto &TTICapture = TTI;
+  auto GetRegUsage = [&TTICapture](Type *Ty, ElementCount VF) -> unsigned {
+    if (Ty->isTokenTy() || !VectorType::isValidElementType(Ty) ||
+        (VF.isScalable() &&
+         !TTICapture.isElementTypeLegalForScalableVector(Ty)))
+      return 0;
+    return TTICapture.getRegUsageForType(VectorType::get(Ty, VF));
+  };
+
+  for (unsigned int Idx = 0, Sz = IdxToRecipe.size(); Idx < Sz; ++Idx) {
+    VPRecipeBase *R = IdxToRecipe[Idx];
+
+    //  Remove all of the recipes that end at this location.
+    RecipeList &List = TransposeEnds[Idx];
+    for (VPRecipeBase *ToRemove : List)
+      OpenIntervals.erase(ToRemove);
+
+    // Ignore recipes that are never used within the loop.
+    if (!Ends.count(R) && !R->mayHaveSideEffects())
+      continue;
+
+    // For each VF find the maximum usage of registers.
+    for (unsigned J = 0, E = VFs.size(); J < E; ++J) {
+      // Count the number of registers used, per register class, given all open
+      // intervals.
+      // Note that elements in this SmallMapVector will be default constructed
+      // as 0. So we can use "RegUsage[ClassID] += n" in the code below even if
+      // there is no previous entry for ClassID.
+      SmallMapVector<unsigned, unsigned, 4> RegUsage;
+
+      if (VFs[J].isScalar()) {
+        for (auto *Inst : OpenIntervals) {
+          for (VPValue *DefV : Inst->definedValues()) {
+            unsigned ClassID = TTI.getRegisterClassForType(
+                false, TypeInfo.inferScalarType(DefV));
+            // FIXME: The target might use more than one register for the type
+            // even in the scalar case.
+            RegUsage[ClassID] += 1;
+          }
+        }
+      } else {
+        for (auto *R : OpenIntervals) {
+          if (isa<VPVectorPointerRecipe, VPReverseVectorPointerRecipe>(R))
+            continue;
+          if (isa<VPCanonicalIVPHIRecipe, VPReplicateRecipe, VPDerivedIVRecipe,
+                  VPScalarIVStepsRecipe>(R) ||
+              (isa<VPInstruction>(R) &&
+               all_of(cast<VPSingleDefRecipe>(R)->users(), [&](VPUser *U) {
+                 return cast<VPRecipeBase>(U)->usesScalars(
+                     R->getVPSingleValue());
+               }))) {
+            unsigned ClassID = TTI.getRegisterClassForType(
+                false, TypeInfo.inferScalarType(R->getVPSingleValue()));
+            // FIXME: The target might use more than one register for the type
+            // even in the scalar case.
+            RegUsage[ClassID] += 1;
+          } else {
+            for (VPValue *DefV : R->definedValues()) {
+              Type *ScalarTy = TypeInfo.inferScalarType(DefV);
+              unsigned ClassID = TTI.getRegisterClassForType(true, ScalarTy);
+              RegUsage[ClassID] += GetRegUsage(ScalarTy, VFs[J]);
+            }
+          }
+        }
+      }
+
+      for (const auto &Pair : RegUsage) {
+        auto &Entry = MaxUsages[J][Pair.first];
+        Entry = std::max(Entry, Pair.second);
+      }
+    }
+
+    LLVM_DEBUG(dbgs() << "LV(REG): At #" << Idx << " Interval # "
+                      << OpenIntervals.size() << '\n');
+
+    // Add the current recipe to the list of open intervals.
+    OpenIntervals.insert(R);
+  }
+
+  for (unsigned Idx = 0, End = VFs.size(); Idx < End; ++Idx) {
+    // Note that elements in this SmallMapVector will be default constructed
+    // as 0. So we can use "Invariant[ClassID] += n" in the code below even if
+    // there is no previous entry for ClassID.
+    SmallMapVector<unsigned, unsigned, 4> Invariant;
+
+    for (auto *In : LoopInvariants) {
+      // FIXME: The target might use more than one register for the type
+      // even in the scalar case.
+      bool IsScalar = all_of(In->users(), [&](VPUser *U) {
+        return cast<VPRecipeBase>(U)->usesScalars(In);
+      });
+
+      ElementCount VF = IsScalar ? ElementCount::getFixed(1) : VFs[Idx];
+      unsigned ClassID = TTI.getRegisterClassForType(
+          VF.isVector(), TypeInfo.inferScalarType(In));
+      Invariant[ClassID] += GetRegUsage(TypeInfo.inferScalarType(In), VF);
+    }
+
+    LLVM_DEBUG({
+      dbgs() << "LV(REG): VF = " << VFs[Idx] << '\n';
+      dbgs() << "LV(REG): Found max usage: " << MaxUsages[Idx].size()
+             << " item\n";
+      for (const auto &pair : MaxUsages[Idx]) {
+        dbgs() << "LV(REG): RegisterClass: "
+               << TTI.getRegisterClassName(pair.first) << ", " << pair.second
+               << " registers\n";
+      }
+      dbgs() << "LV(REG): Found invariant usage: " << Invariant.size()
+             << " item\n";
+      for (const auto &pair : Invariant) {
+        dbgs() << "LV(REG): RegisterClass: "
+               << TTI.getRegisterClassName(pair.first) << ", " << pair.second
+               << " registers\n";
+      }
+    });
+
+    RU.LoopInvariantRegs = Invariant;
+    RU.MaxLocalUsers = MaxUsages[Idx];
+    RUs[Idx] = RU;
+  }
+
+  return RUs;
+}
+
 unsigned
-LoopVectorizationCostModel::selectInterleaveCount(ElementCount VF,
+LoopVectorizationCostModel::selectInterleaveCount(VPlan &Plan, ElementCount VF,
                                                   InstructionCost LoopCost) {
   // -- The interleave heuristics --
   // We interleave the loop in order to expose ILP and reduce the loop overhead.
@@ -4929,7 +5154,7 @@ LoopVectorizationCostModel::selectInterleaveCount(ElementCount VF,
       return 1;
   }
 
-  RegisterUsage R = calculateRegisterUsage({VF})[0];
+  RegisterUsage R = ::calculateRegisterUsage(Plan, {VF}, TTI)[0];
   // We divide by these constants so assume that we have at least one
   // instruction that uses at least one register.
   for (auto &Pair : R.MaxLocalUsers) {
@@ -5262,7 +5487,7 @@ LoopVectorizationCostModel::calculateRegisterUsage(ArrayRef<ElementCount> VFs) {
       OpenIntervals.erase(ToRemove);
 
     // Ignore instructions that are never used within the loop.
-    if (!Ends.count(I))
+    if (!Ends.count(I) && !I->mayHaveSideEffects())
       continue;
 
     // Skip ignored values.
@@ -10574,7 +10799,7 @@ bool LoopVectorizePass::processLoop(Loop *L) {
                            AddBranchWeights, CM.CostKind);
   if (LVP.hasPlanWithVF(VF.Width)) {
     // Select the interleave count.
-    IC = CM.selectInterleaveCount(VF.Width, VF.Cost);
+    IC = CM.selectInterleaveCount(LVP.getPlanFor(VF.Width), VF.Width, VF.Cost);
 
     unsigned SelectedIC = std::max(IC, UserIC);
     //  Optimistically generate runtime checks if they are needed. Drop them if
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/i1-reg-usage.ll b/llvm/test/Transforms/LoopVectorize/AArch64/i1-reg-usage.ll
index 69af51deea08ea9..0ec90b75002cd0f 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/i1-reg-usage.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/i1-reg-usage.ll
@@ -8,8 +8,8 @@ target triple = "aarch64"
 ; CHECK-LABEL: LV: Checking a loop in 'or_reduction_neon' from <stdin>
 ; CHECK: LV(REG): VF = 32
 ; CHECK-NEXT: LV(REG): Found max usage: 2 item
+; CHECK-NEXT: LV(REG): RegisterClass: Generic::ScalarRC, 2 registers
 ; CHECK-NEXT: LV(REG): RegisterClass: Generic::VectorRC, 72 registers
-; CHECK-NEXT: LV(REG): RegisterClass: Generic::ScalarRC, 1 registers
 
 define i1 @or_reduction_neon(i32 %arg, ptr %ptr) {
 entry:
@@ -31,8 +31,8 @@ loop:
 ; CHECK-LABEL: LV: Checking a loop in 'or_reduction_sve'
 ; CHECK: LV(REG): VF = 64
 ; CHECK-NEXT: LV(REG): Found max usage: 2 item
+; CHECK-NEXT: LV(REG): RegisterClass: Generic::ScalarRC, 2 registers
 ; CHECK-NEXT: LV(REG): RegisterClass: Generic::VectorRC, 136 registers
-; CHECK-NEXT: LV(REG): RegisterClass: Generic::ScalarRC, 1 registers
 
 define i1 @or_reduction_sve(i32 %arg, ptr %ptr) vscale_range(2,2) "target-features"="+sve" {
 entry:
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/reg-usage.ll b/llvm/test/Transforms/LoopVectorize/AArch64/reg-usage.ll
index 8239d32445c1054..feb0f03caa5033c 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/reg-usage.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/reg-usage.ll
@@ -14,11 +14,11 @@
 define void @get_invariant_reg_usage(ptr %z) {
 ; CHECK: LV: Checking a loop in 'get_invariant_reg_usage'
 ; CHECK: LV(REG): VF = vscale x 16
-; CHECK-NEXT: LV(REG): Found max usage: 1 item
-; CHECK-NEXT: LV(REG): RegisterClass: Generic::ScalarRC, 3 registers
-; CHECK-NEXT: LV(REG): Found invariant usage: 2 item
+; CHECK-NEXT: LV(REG): Found max usage: 2 item
 ; CHECK-NEXT: LV(REG): RegisterClass: Generic::ScalarRC, 2 registers
-; CHECK-NEXT: LV(REG): RegisterClass: Generic::VectorRC, 8 registers 
+; CHECK-NEXT: LV(REG): RegisterClass: Generic::VectorRC, 1 registers
+; CHECK-NEXT: LV(REG): Found invariant usage: 1 item
+; CHECK-NEXT: LV(REG): RegisterClass: Generic::ScalarRC, 3 registers
 
 L.entry:
   %0 = load i128, ptr %z, align 16
diff --git a/llvm/test/Transforms/LoopVectorize/LoongArch/reg-usage.ll b/llvm/test/Transforms/LoopVectorize/LoongArch/reg-usage.ll
index f45a2f0f5b7e8f3..5baf1e013a50f87 100644
--- a/llvm/test/Transforms/LoopVectorize/LoongArch/reg-usage.ll
+++ b/llvm/test/Transforms/LoopVectorize/LoongArch/reg-usage.ll
@@ -9,16 +9,18 @@
 define void @bar(ptr %A, i32 signext %n) {
 ; CHECK-LABEL: bar
 ; CHECK-SCALAR:      LV(REG): Found max usage: 2 item
-; CHECK-SCALAR-NEXT: LV(REG): RegisterClass: LoongArch::GPRRC, 2 registers
+; CHECK-SCALAR-NEXT: LV(REG): RegisterClass: LoongArch::GPRRC, 3 registers
 ; CHECK-SCALAR-NEXT: LV(REG): RegisterClass: LoongArch::FPRRC, 1 registers
 ; CHECK-SCALAR-NEXT: LV(REG): Found invariant usage: 1 item
 ; CHECK-SCALAR-NEXT: LV(REG): RegisterClass: LoongArch::GPRRC, 1 registers
 ; CHECK-SCALAR-NEXT: LV: The target has 30 registers of LoongArch::GPRRC register class
 ; CHECK-SCALAR-NEXT: LV: The target has 32 registers of LoongArch::FPRRC register class
-; CHECK-VECTOR:      LV(REG): Found max usage: 1 item
-; CHECK-VECTOR-NEXT: LV(REG): RegisterClass: LoongArch::VRRC, 3 registers
+; CHECK-VECTOR:      LV(REG): Found max usage: 2 item
+; CHECK-VECTOR-NEXT: LV(REG): RegisterClass: LoongArch::GPRRC, 2 registers
+; CHECK-VECTOR-NEXT: LV(REG): RegisterClass: LoongArch::VRRC, 2 registers
 ; CHECK-VECTOR-NEXT: LV(REG): Found invariant usage: 1 item
 ; CHECK-VECTOR-NEXT: LV(REG): RegisterClass: LoongArch::GPRRC, 1 registers
+; CHECK-VECTOR-NEXT: LV: The target has 30 registers of LoongArch::GPRRC register class
 ; CHECK-VECTOR-NEXT: LV: The target has 32 registers of LoongArch::VRRC register class
 
 entry:
diff --git a/llvm/test/Transforms/LoopVectorize/PowerPC/exit-branch-cost.ll b/llvm/test/Transforms/LoopVectorize/PowerPC/exit-branch-cost.ll
index e5717c4f1d91ab8..cc1e5f9a68593cc 100644
--- a/llvm/test/Transforms/LoopVectorize/PowerPC/exit-branch-cost.ll
+++ b/llvm/test/Transforms/LoopVectorize/PowerPC/exit-branch-cost.ll
@@ -18,10 +18,10 @@ define i1 @select_exit_cond(ptr %start, ptr %end, i64 %N) {
 ; CHECK-NEXT:    [[MIN_ITERS_CHECK1:%.*]] = icmp ult i64 [[TMP2]], 2
 ; CHECK-NEXT:    br i1 [[MIN_ITERS_CHECK1]], label %[[VEC_EPILOG_SCALAR_PH:.*]], label %[[VECTOR_MAIN_LOOP_ITER_CHECK:.*]]
 ; CHECK:       [[VECTOR_MAIN_LOOP_ITER_CHECK]]:
-; CHECK-NEXT:    [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[TMP2]], 16
-; CHECK-NEXT:    br i1 [[MIN_ITERS_CHECK]], label %[[VEC_EPILOG_PH:.*]], label %[[VECTOR_PH:.*]]
+; CHECK-NEXT:    [[MIN_ITERS_CHECK3:%.*]] = icmp ult i64 [[TMP2]], 24
+; CHECK-NEXT:    br i1 [[MIN_ITERS_CHECK3]], label %[[VEC_EPILOG_PH:.*]], label %[[VECTOR_PH:.*]]
 ; CHECK:       [[VECTOR_PH]]:
-; CHECK-NEXT:    [[N_MOD_VF:%.*]] = urem i64 [[TMP2]], 16
+; CHECK-NEXT:    [[N_MOD_VF:%.*]] = urem i64 [[TMP2]], 24
 ; CHECK-NEXT:    [[N_VEC:%.*]] = sub i64 [[TMP2]], [[N_MOD_VF]]
 ; CHECK-NEXT:    br label %[[VECTOR_BODY:.*]]
 ; CHECK:       [[VECTOR_BODY]]:
@@ -35,6 +35,10 @@ define i1 @select_exit_cond(ptr %start, ptr %end, i64 %N) {
 ; CHECK-NEXT:    [[VEC_PHI15:%.*]] = phi <2 x i64> [ zeroinitializer, %[[VECTOR_PH]] ], [ [[TMP48:%.*]], %[[VECTOR_BODY]] ]
 ; CHECK-NEXT:    [[VEC_PHI16:%.*]] = phi <2 x i64> [ zeroinitializer, %[[VECTOR_PH]] ], [ [[TMP49:%.*]], %[[VECTOR_BODY]] ]
 ; CHECK-NEXT:    [[VEC_PHI17:%.*]] = phi <2 x i64> [ zeroinitializer, %[[VECTOR_PH]] ], [ [[TMP50:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_PHI18:%.*]] = phi <2 x i64> [ zeroinitializer, %[[VECTOR_PH]] ], [ [[TMP64:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_PHI19:%.*]] = phi <2 x i64> [ zeroinitializer, %[[VECTOR_PH]] ], [ [[TMP65:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_PHI20:%.*]] = phi <2 x i64> [ zeroinitializer, %[[VECTOR_PH]] ], [ [[TMP66:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_PHI21:%.*]] = phi <2 x i64> [ zeroinitializer, %[[VECTOR_PH]] ], [ [[TMP67:%.*]], %[[VECTOR_BODY]] ]
 ; CHECK-NEXT:    [[STEP_ADD:%.*]] = add <2 x i64> [[VEC_IND]], splat (i64 2)
 ; CHECK-NEXT:    [[STEP_ADD_2:%.*]] = add <2 x i64> [[STEP_ADD]], splat (i64 2)
 ; CHECK-NEXT:    [[STEP_ADD_3:%.*]] = add <2 x i64> [[STEP_ADD_2]], splat (i64 2)
@@ -42,6 +46,10 @@ define i1 @select_exit_cond(ptr %start, ptr %end, i64 %N) {
 ; CHECK-NEXT:    [[STEP_ADD_5:%.*]] = add <2 x i64> [[STEP_ADD_4]], splat (i64 2)
 ; CHECK-NEXT:    [[STEP_ADD_6:%.*]] = add <2 x i64> [[STEP_ADD_5]], splat (i64 2)
 ; CHECK-NEXT:    [[STEP_ADD_7:%.*]] = add <2 x i64> [[STEP_ADD_6]], splat (i64 2)
+; CHECK-NEXT:    [[STEP_ADD_8:%.*]] = add <2 x i64> [[STEP_ADD_7]], splat (i64 2)
+; CHECK-NEXT:    [[STEP_ADD_9:%.*]] = add <2 x i64> [[STEP_ADD_8]], splat (i64 2)
+; CHECK-NEXT:    [[STEP_ADD_10:%.*]] = add <2 x i64> [[STEP_ADD_9]], splat (i64 2)
+; CHECK-NEXT:    [[STEP_ADD_11:%.*]] = add <2 x i64> [[STEP_ADD_10]], splat (i64 2)
 ; CHECK-NEXT:    [[TMP3:%.*]] = add i64 [[INDEX]], 0
 ; CHECK-NEXT:    [[NEXT_GEP:%.*]] = getelementptr i8, ptr [[START]], i64 [[TMP3]]
 ; CHECK-NEXT:    [[TMP11:%.*]] = getelementptr i8, ptr [[NEXT_GEP]], i32 0
@@ -52,6 +60,10 @@ define i1 @select_exit_cond(ptr %start, ptr %end, i64 %N) {
 ; CHECK-NEXT:    [[TMP16:%.*]] = getelementptr i8, ptr [[NEXT_GEP]], i32 10
 ; CHECK-NEXT:    [[TMP17:%.*]] = getelementptr i8, ptr [[NEXT_GEP]], i32 12
 ; CHECK-NEXT:    [[TMP18:%.*]] = getelementptr i8, ptr [[NEXT_GEP]], i32 14
+; CHECK-NEXT:    [[TMP68:%.*]] = getelementptr i8, ptr [[NEXT_GEP]], i32 16
+; CHECK-NEXT:    [[TMP69:%.*]] = getelementptr i8, ptr [[NEXT_GEP]], i32 18
+; CHECK-NEXT:    [[TMP70:%.*]] = getelementptr i8, ptr [[NEXT_GEP]], i32 20
+; CHECK-NEXT:    [[TMP71:%.*]] = getelementptr i8, ptr [[NEXT_GEP]], i32 22
 ; CHECK-NEXT:    [[WIDE_LOAD:%.*]] = load <2 x i8>, ptr [[TMP11]], align 1
 ; CHECK-NEXT:    [[WIDE_LOAD25:%.*]] = load <2 x i8>, ptr [[TMP12]], align 1
 ; CHECK-NEXT:    [[WIDE_LOAD26:%.*]] = load <2 x i8>, ptr [[TMP13]], align 1
@@ -60,6 +72,10 @@ define i1 @select_exit_cond(ptr %s...
[truncated]

@llvmbot
Member

llvmbot commented Feb 9, 2025

@llvm/pr-subscribers-backend-powerpc

@lukel97
Contributor

lukel97 commented Feb 10, 2025

Nice! I had also been looking into doing this, to address an issue where we end up with a lot of spilling because the VF is too high: https://godbolt.org/z/P3j7978Pq

In that example the current non-VPlan calculateRegisterUsage only sees the widest element type as i8. But because the load will end up getting vectorized as a gather, we generate vectors of i64 pointers, so it underestimates the register pressure by 8 times.

VPlan models the vector of i64 pointers, so doing calculateRegisterUsage on VPlan allows it to calculate this accurately.

From what I understand, calculateRegisterUsage is currently only used for selectInterleaveCount, or for getMaximizedVFForTarget when TTI.shouldMaximizeVectorBandwidth is true.

But I think with this change we could use it in more places for RISC-V to select a better MaxVF.
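
To make that failure mode concrete, a hypothetical loop with this shape (illustrative, not taken from the godbolt link above) would be:

// The widest *data* type here is i8, but the indexed load becomes a
// gather when vectorized, and each lane's address src + idx[i] is a
// 64-bit value. The per-iteration state therefore includes <VF x i64>
// pointer vectors occupying roughly 8x more register bits per lane than
// the i8 elements the old estimate was based on.
void gather_i8(char *dst, const char *src, const long *idx, long n) {
  for (long i = 0; i < n; ++i)
    dst[i] = src[idx[i]];
}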

// We also search for instructions that are defined outside the loop, but are
// used inside the loop. We need this number separately from the max-interval
// usage number because when we unroll, loop-invariant values do not take
// more register.
Collaborator

register -> registers.

Contributor Author

Done thanks

// recipe that is the key.
using IntervalMap = SmallDenseMap<VPRecipeBase *, unsigned, 16>;

// Maps recipe to its index.
Collaborator

It might be more correct to say it maps the index to the recipe.

Contributor Author

Updated thanks

Comment on lines +4935 to +4923
// Ignore non-recipe values such as arguments, constants, etc.
// FIXME: Might need some motivation why these values are ignored. If
// for example an argument is used inside the loop it will increase the
// register pressure (so shouldn't we add it to LoopInvariants).
if (!DefR && (!U->getLiveInIRValue() ||
!isa<Instruction>(U->getLiveInIRValue())))
continue;
Collaborator

@SamTebbs33 SamTebbs33 Feb 18, 2025

Does adding these values to the invariants list cause lots of extra work or issues, meaning it can't be done in this PR?

Contributor Author

This just ports the code + FIXME from the original implementation, to keep this as close to NFC as possible. Happy to remove the restriction as a follow-up.

@SamTebbs33
Collaborator

Have you had time to think about my review comments? I'm working on a patch on top of this so it would be good to get it landed soon.

@fhahn
Contributor Author

fhahn commented Mar 10, 2025

Have you had time to think about my review comments? I'm working on a patch on top of this so it would be good to get it landed soon.

Thanks, I'll get to them soon, probably later this week. I am at the moment catching up on a number of reviews.

@SamTebbs33
Collaborator

Have you had time to think about my review comments? I'm working on a patch on top of this so it would be good to get it landed soon.

Thanks, I'll get to them soon, probably later this week. I am at the moment catching up on a number of reviews.

Thank you, that sounds good to me. In the meantime, would you mind creating a user branch in this repo with this work in it, or alternatively would you mind if I create one under my username? It would be good to have one so I can create a stacked PR.

@fhahn fhahn force-pushed the vplan-reg-usage-for-interleaving branch from 72e5409 to 07b2a1e on March 19, 2025 at 20:33
SamTebbs33 added a commit to SamTebbs33/llvm-project that referenced this pull request Mar 20, 2025
Based on fhahn's work at llvm#126437.

This PR moves the register usage checking to after the plans are
created, so that any recipes that optimise register usage (such as
partial reductions) can be properly costed and not have their VF pruned
unnecessarily.

It involves changing some tests, notably removing one from
mve-known-tripcount.ll due to it not being vectorisable thanks to high
register pressure. tail-folding-reduces-vf.ll was modified to reduce its
register pressure but still test what was intended.
Collaborator

@SamTebbs33 SamTebbs33 left a comment

Looks good to me, thank you! I've pushed my PR that removes the old calculateRegisterUsage function here: #132190

Comment on lines +5007 to +4989
if (!Ends.count(R) && !R->mayHaveSideEffects())
continue;
Contributor Author

I think we are missing code here to also skip ephemeral values, which the legacy version handled properly. Still need to figure out the best way to fix this.

Contributor Author

Handled this for now by just checking the existing ValuesToIgnore, as there doesn't seem to be a nice alternative for matching the behavior by directly checking the info in VPlan without much extra work.
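
For illustration, the skip could take roughly the following shape inside the scan loop, assuming the cost model's existing ValuesToIgnore set is made available to calculateRegisterUsage; this is a sketch, not necessarily the merged code.

// Sketch only: skip recipes whose underlying IR instruction was already
// marked as ephemeral by the cost model (e.g. it only feeds llvm.assume),
// so they never open an interval and cannot inflate the pressure
// estimate. Assumes ValuesToIgnore (a set of IR values) is reachable
// here, which is not shown in the diff above.
if (auto *SDR = dyn_cast<VPSingleDefRecipe>(R))
  if (auto *UI = dyn_cast_or_null<Instruction>(SDR->getUnderlyingValue()))
    if (ValuesToIgnore.contains(UI))
      continue;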

SamTebbs33 added a commit to SamTebbs33/llvm-project that referenced this pull request Apr 2, 2025
Based on fhahn's work at llvm#126437.

This PR moves the register usage checking to after the plans are
created, so that any recipes that optimise register usage (such as
partial reductions) can be properly costed and not have their VF pruned
unnecessarily.

It involves changing some tests, notably removing one from
mve-known-tripcount.ll due to it not being vectorisable thanks to high
register pressure. tail-folding-reduces-vf.ll was modified to reduce its
register pressure but still test what was intended.
@fhahn fhahn force-pushed the vplan-reg-usage-for-interleaving branch from 07b2a1e to 732c916 on April 7, 2025 at 14:01
Contributor Author

@fhahn fhahn left a comment

Latest version includes handling for ephemeral users (like assume). I am planning to merge this once the CI is happy.


SamTebbs33 added a commit to SamTebbs33/llvm-project that referenced this pull request Apr 7, 2025
Based on fhahn's work at llvm#126437.

This PR moves the register usage checking to after the plans are
created, so that any recipes that optimise register usage (such as
partial reductions) can be properly costed and not have their VF pruned
unnecessarily.

It involves changing some tests, notably removing one from
mve-known-tripcount.ll due to it not being vectorisable thanks to high
register pressure. tail-folding-reduces-vf.ll was modified to reduce its
register pressure but still test what was intended.
Collaborator

@sdesmalen-arm sdesmalen-arm left a comment

Thanks for updating! This looks largely good to me; I just have a few minor comments.

fhahn added 3 commits April 7, 2025 21:05
Add a version of calculateRegisterUsage that estimates register
usage for a VPlan. This mostly just ports the existing code, with some
updates to figure out which recipes will generate vectors vs. scalars.

There are a number of changes in the computed register usages, but they
should be more accurate w.r.t. the generated vector code.

The changes are:

 * Scalar usage increases by 1 in most cases, as we always create a
   scalar canonical IV, which is alive across the loop and is not
   considered by the legacy implementation.

 * Output is ordered by insertion; scalar registers are now added first
   due to the canonical IV phi.

 * Using the VPlan, we now also know more precisely whether an induction
   will be vectorized or scalarized.
@fhahn fhahn force-pushed the vplan-reg-usage-for-interleaving branch from 732c916 to 0c2c8e1 on April 7, 2025 at 20:55
Collaborator

@sdesmalen-arm sdesmalen-arm left a comment

Thanks for addressing the comments, LGTM!

@fhahn fhahn merged commit 6f92339 into llvm:main Apr 8, 2025
11 checks passed
@fhahn fhahn deleted the vplan-reg-usage-for-interleaving branch April 8, 2025 19:52
llvm-sync bot pushed a commit to arm/arm-toolchain that referenced this pull request Apr 8, 2025
…26437)

Add a version of calculateRegisterUsage that estimates register
usage for a VPlan. This mostly just ports the existing code, with some
updates to figure out which recipes will generate vectors vs. scalars.

There are a number of changes in the computed register usages, but they
should be more accurate w.r.t. the generated vector code.

The changes are:

 * Scalar usage increases by 1 in most cases, as we always create a
   scalar canonical IV, which is alive across the loop and is not
   considered by the legacy implementation.

 * Output is ordered by insertion; scalar registers are now added first
   due to the canonical IV phi.

 * Using the VPlan, we now also know more precisely whether an induction
   will be vectorized or scalarized.

Depends on llvm/llvm-project#126415

PR: llvm/llvm-project#126437
@llvm-ci
Collaborator

llvm-ci commented Apr 8, 2025

LLVM Buildbot has detected a new failure on builder sanitizer-x86_64-linux-android running on sanitizer-buildbot-android while building llvm at step 2 "annotate".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/186/builds/8036

Here is the relevant piece of the build log, for reference:
Step 2 (annotate) failure: 'python ../sanitizer_buildbot/sanitizers/zorg/buildbot/builders/sanitizers/buildbot_selector.py' (failure)
...
[       OK ] AddressSanitizer.AtoiAndFriendsOOBTest (2252 ms)
[ RUN      ] AddressSanitizer.HasFeatureAddressSanitizerTest
[       OK ] AddressSanitizer.HasFeatureAddressSanitizerTest (0 ms)
[ RUN      ] AddressSanitizer.CallocReturnsZeroMem
[       OK ] AddressSanitizer.CallocReturnsZeroMem (11 ms)
[ DISABLED ] AddressSanitizer.DISABLED_TSDTest
[ RUN      ] AddressSanitizer.IgnoreTest
[       OK ] AddressSanitizer.IgnoreTest (0 ms)
[ RUN      ] AddressSanitizer.SignalTest
[       OK ] AddressSanitizer.SignalTest (207 ms)
[ RUN      ] AddressSanitizer.ReallocTest
[       OK ] AddressSanitizer.ReallocTest (31 ms)
[ RUN      ] AddressSanitizer.WrongFreeTest
[       OK ] AddressSanitizer.WrongFreeTest (138 ms)
[ RUN      ] AddressSanitizer.LongJmpTest
[       OK ] AddressSanitizer.LongJmpTest (0 ms)
[ RUN      ] AddressSanitizer.ThreadStackReuseTest
[       OK ] AddressSanitizer.ThreadStackReuseTest (1 ms)
[ DISABLED ] AddressSanitizer.DISABLED_MemIntrinsicUnalignedAccessTest
[ DISABLED ] AddressSanitizer.DISABLED_LargeFunctionSymbolizeTest
[ DISABLED ] AddressSanitizer.DISABLED_MallocFreeUnwindAndSymbolizeTest
[ RUN      ] AddressSanitizer.UseThenFreeThenUseTest
[       OK ] AddressSanitizer.UseThenFreeThenUseTest (112 ms)
[ RUN      ] AddressSanitizer.FileNameInGlobalReportTest
[       OK ] AddressSanitizer.FileNameInGlobalReportTest (118 ms)
[ DISABLED ] AddressSanitizer.DISABLED_StressStackReuseAndExceptionsTest
[ RUN      ] AddressSanitizer.MlockTest
[       OK ] AddressSanitizer.MlockTest (0 ms)
[ DISABLED ] AddressSanitizer.DISABLED_DemoThreadedTest
[ DISABLED ] AddressSanitizer.DISABLED_DemoStackTest
[ DISABLED ] AddressSanitizer.DISABLED_DemoThreadStackTest
[ DISABLED ] AddressSanitizer.DISABLED_DemoUAFLowIn
[ DISABLED ] AddressSanitizer.DISABLED_DemoUAFLowLeft
[ DISABLED ] AddressSanitizer.DISABLED_DemoUAFLowRight
[ DISABLED ] AddressSanitizer.DISABLED_DemoUAFHigh
[ DISABLED ] AddressSanitizer.DISABLED_DemoOOM
[ DISABLED ] AddressSanitizer.DISABLED_DemoDoubleFreeTest
[ DISABLED ] AddressSanitizer.DISABLED_DemoNullDerefTest
[ DISABLED ] AddressSanitizer.DISABLED_DemoFunctionStaticTest
[ DISABLED ] AddressSanitizer.DISABLED_DemoTooMuchMemoryTest
[ RUN      ] AddressSanitizer.LongDoubleNegativeTest
[       OK ] AddressSanitizer.LongDoubleNegativeTest (0 ms)
[----------] 19 tests from AddressSanitizer (28393 ms total)

[----------] Global test environment tear-down
[==========] 22 tests from 2 test suites ran. (28395 ms total)
[  PASSED  ] 22 tests.

  YOU HAVE 1 DISABLED TEST

Step 9 (run cmake) failure: run cmake (failure)
...
-- Performing Test HAVE_GNU_POSIX_REGEX -- failed to compile
-- Compiling and running to test HAVE_POSIX_REGEX
-- Performing Test HAVE_GNU_POSIX_REGEX -- failed to compile
-- Compiling and running to test HAVE_POSIX_REGEX
-- Performing Test HAVE_POSIX_REGEX -- compiled but failed to run
CMake Warning at /var/lib/buildbot/sanitizer-buildbot6/sanitizer-x86_64-linux-android/build/llvm-project/third-party/benchmark/CMakeLists.txt:319 (message):
  Using std::regex with exceptions disabled is not fully supported
-- Compiling and running to test HAVE_STEADY_CLOCK
-- Performing Test HAVE_STEADY_CLOCK -- compiled but failed to run
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test HAVE_POSIX_REGEX -- compiled but failed to run
CMake Warning at /var/lib/buildbot/sanitizer-buildbot6/sanitizer-x86_64-linux-android/build/llvm-project/third-party/benchmark/CMakeLists.txt:319 (message):
  Using std::regex with exceptions disabled is not fully supported
-- Compiling and running to test HAVE_STEADY_CLOCK
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Check if compiler accepts -pthread
-- Performing Test HAVE_STEADY_CLOCK -- compiled but failed to run
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test HAVE_POSIX_REGEX -- compiled but failed to run
CMake Warning at /var/lib/buildbot/sanitizer-buildbot6/sanitizer-x86_64-linux-android/build/llvm-project/third-party/benchmark/CMakeLists.txt:319 (message):
  Using std::regex with exceptions disabled is not fully supported
-- Compiling and running to test HAVE_STEADY_CLOCK
-- Check if compiler accepts -pthread - yes
-- Found Threads: TRUE  
-- Compiling and running to test HAVE_PTHREAD_AFFINITY
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Check if compiler accepts -pthread
-- Performing Test HAVE_PTHREAD_AFFINITY -- failed to compile
-- Configuring done (23.6s)
-- Performing Test HAVE_STEADY_CLOCK -- compiled but failed to run
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Check if compiler accepts -pthread - yes
-- Found Threads: TRUE  
-- Compiling and running to test HAVE_PTHREAD_AFFINITY
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Check if compiler accepts -pthread
-- Performing Test HAVE_PTHREAD_AFFINITY -- failed to compile
-- Configuring done (23.9s)
-- Check if compiler accepts -pthread - yes
-- Found Threads: TRUE  
-- Compiling and running to test HAVE_PTHREAD_AFFINITY
-- Performing Test HAVE_PTHREAD_AFFINITY -- failed to compile
-- Configuring done (24.1s)

-- Generating done (2.8s)
-- Generating done (2.8s)
-- Build files have been written to: /var/lib/buildbot/sanitizer-buildbot6/sanitizer-x86_64-linux-android/build/llvm_build_android_i686
-- Generating done (2.8s)
-- Build files have been written to: /var/lib/buildbot/sanitizer-buildbot6/sanitizer-x86_64-linux-android/build/llvm_build_android_aarch64
-- Build files have been written to: /var/lib/buildbot/sanitizer-buildbot6/sanitizer-x86_64-linux-android/build/llvm_build_android_arm
Step 11 (build android/arm) failure: build android/arm (failure)
...
[621/662] Building CXX object lib/TextAPI/CMakeFiles/LLVMTextAPI.dir/ArchitectureSet.cpp.o
[622/662] Building CXX object lib/TargetParser/CMakeFiles/LLVMTargetParser.dir/SubtargetFeature.cpp.o
[623/662] Building CXX object lib/TargetParser/CMakeFiles/LLVMTargetParser.dir/TargetParser.cpp.o
[624/662] Building CXX object lib/TextAPI/CMakeFiles/LLVMTextAPI.dir/Architecture.cpp.o
[625/662] Building CXX object lib/TargetParser/CMakeFiles/LLVMTargetParser.dir/X86TargetParser.cpp.o
[626/662] Building CXX object lib/TargetParser/CMakeFiles/LLVMTargetParser.dir/RISCVTargetParser.cpp.o
[627/662] Building CXX object lib/TextAPI/CMakeFiles/LLVMTextAPI.dir/PackedVersion.cpp.o
[628/662] Building CXX object lib/TargetParser/CMakeFiles/LLVMTargetParser.dir/Triple.cpp.o
[629/662] Building CXX object lib/TextAPI/CMakeFiles/LLVMTextAPI.dir/Platform.cpp.o
[630/662] Building CXX object lib/TextAPI/CMakeFiles/LLVMTextAPI.dir/TextAPIError.cpp.o
[631/662] Building CXX object lib/TextAPI/CMakeFiles/LLVMTextAPI.dir/RecordVisitor.cpp.o
[632/662] Building Opts.inc...
[633/662] Building CXX object lib/TargetParser/CMakeFiles/LLVMTargetParser.dir/RISCVISAInfo.cpp.o
[634/662] Building CXX object lib/TextAPI/CMakeFiles/LLVMTextAPI.dir/Symbol.cpp.o
[635/662] Linking CXX static library lib/libLLVMTargetParser.a
[636/662] Linking CXX static library lib/libLLVMBinaryFormat.a
[637/662] Building CXX object lib/TextAPI/CMakeFiles/LLVMTextAPI.dir/Target.cpp.o
[638/662] Linking CXX static library lib/libLLVMCore.a
[639/662] Linking CXX static library lib/libLLVMMC.a
[640/662] Linking CXX static library lib/libLLVMBitReader.a
[641/662] Building CXX object lib/TextAPI/CMakeFiles/LLVMTextAPI.dir/SymbolSet.cpp.o
[642/662] Linking CXX static library lib/libLLVMMCParser.a
[643/662] Building CXX object lib/DebugInfo/Symbolize/CMakeFiles/LLVMSymbolize.dir/Symbolize.cpp.o
[644/662] Building CXX object lib/AsmParser/CMakeFiles/LLVMAsmParser.dir/Parser.cpp.o
[645/662] Building CXX object tools/llvm-symbolizer/CMakeFiles/llvm-symbolizer.dir/llvm-symbolizer-driver.cpp.o
[646/662] Building CXX object lib/TextAPI/CMakeFiles/LLVMTextAPI.dir/InterfaceFile.cpp.o
[647/662] Building CXX object lib/TextAPI/CMakeFiles/LLVMTextAPI.dir/RecordsSlice.cpp.o
[648/662] Building CXX object lib/TextAPI/CMakeFiles/LLVMTextAPI.dir/Utils.cpp.o
[649/662] Building CXX object lib/TextAPI/CMakeFiles/LLVMTextAPI.dir/TextStubCommon.cpp.o
[650/662] Building CXX object lib/TextAPI/CMakeFiles/LLVMTextAPI.dir/TextStubV5.cpp.o
[651/662] Building CXX object tools/llvm-symbolizer/CMakeFiles/llvm-symbolizer.dir/llvm-symbolizer.cpp.o
[652/662] Building CXX object lib/TextAPI/CMakeFiles/LLVMTextAPI.dir/TextStub.cpp.o
[653/662] Linking CXX static library lib/libLLVMTextAPI.a
[654/662] Building CXX object lib/AsmParser/CMakeFiles/LLVMAsmParser.dir/LLParser.cpp.o
[655/662] Linking CXX static library lib/libLLVMAsmParser.a
[656/662] Linking CXX static library lib/libLLVMIRReader.a
[657/662] Linking CXX static library lib/libLLVMObject.a
[658/662] Linking CXX static library lib/libLLVMDebugInfoDWARF.a
[659/662] Linking CXX static library lib/libLLVMDebugInfoPDB.a
[660/662] Linking CXX static library lib/libLLVMSymbolize.a
[661/662] Linking CXX static library lib/libLLVMDebuginfod.a
[662/662] Linking CXX executable bin/llvm-symbolizer
ninja: Entering directory `compiler_rt_build_android_arm'
ninja: error: loading 'build.ninja': No such file or directory

How to reproduce locally: https://github.com/google/sanitizers/wiki/SanitizerBotReproduceBuild




Step 14 (run all tests) failure: run all tests (failure)
@@@BUILD_STEP run all tests@@@
skipping tests on arm

How to reproduce locally: https://github.com/google/sanitizers/wiki/SanitizerBotReproduceBuild




Serial 96061FFBA000GW
Step 19 (run instrumented asan tests [aarch64/aosp_coral-userdebug/AOSP.MASTER]) failure: run instrumented asan tests [aarch64/aosp_coral-userdebug/AOSP.MASTER] (failure)
...
[ RUN      ] AddressSanitizer.SignalTest
[       OK ] AddressSanitizer.SignalTest (320 ms)
[ RUN      ] AddressSanitizer.ReallocTest
[       OK ] AddressSanitizer.ReallocTest (22 ms)
[ RUN      ] AddressSanitizer.WrongFreeTest
[       OK ] AddressSanitizer.WrongFreeTest (254 ms)
[ RUN      ] AddressSanitizer.LongJmpTest
[       OK ] AddressSanitizer.LongJmpTest (0 ms)
[ RUN      ] AddressSanitizer.ThreadStackReuseTest
[       OK ] AddressSanitizer.ThreadStackReuseTest (7 ms)
[ DISABLED ] AddressSanitizer.DISABLED_MemIntrinsicUnalignedAccessTest
[ DISABLED ] AddressSanitizer.DISABLED_LargeFunctionSymbolizeTest
[ DISABLED ] AddressSanitizer.DISABLED_MallocFreeUnwindAndSymbolizeTest
[ RUN      ] AddressSanitizer.UseThenFreeThenUseTest
[       OK ] AddressSanitizer.UseThenFreeThenUseTest (320 ms)
[ RUN      ] AddressSanitizer.FileNameInGlobalReportTest
[       OK ] AddressSanitizer.FileNameInGlobalReportTest (309 ms)
[ DISABLED ] AddressSanitizer.DISABLED_StressStackReuseAndExceptionsTest
[ RUN      ] AddressSanitizer.MlockTest
[       OK ] AddressSanitizer.MlockTest (0 ms)
[ DISABLED ] AddressSanitizer.DISABLED_DemoThreadedTest
[ DISABLED ] AddressSanitizer.DISABLED_DemoStackTest
[ DISABLED ] AddressSanitizer.DISABLED_DemoThreadStackTest
[ DISABLED ] AddressSanitizer.DISABLED_DemoUAFLowIn
[ DISABLED ] AddressSanitizer.DISABLED_DemoUAFLowLeft
[ DISABLED ] AddressSanitizer.DISABLED_DemoUAFLowRight
[ DISABLED ] AddressSanitizer.DISABLED_DemoUAFHigh
[ DISABLED ] AddressSanitizer.DISABLED_DemoOOM
[ DISABLED ] AddressSanitizer.DISABLED_DemoDoubleFreeTest
[ DISABLED ] AddressSanitizer.DISABLED_DemoNullDerefTest
[ DISABLED ] AddressSanitizer.DISABLED_DemoFunctionStaticTest
[ DISABLED ] AddressSanitizer.DISABLED_DemoTooMuchMemoryTest
[ RUN      ] AddressSanitizer.LongDoubleNegativeTest
[       OK ] AddressSanitizer.LongDoubleNegativeTest (0 ms)
[----------] 19 tests from AddressSanitizer (71755 ms total)

[----------] Global test environment tear-down
[==========] 22 tests from 2 test suites ran. (71760 ms total)
[  PASSED  ] 22 tests.

  YOU HAVE 1 DISABLED TEST

skipping tests on arm

How to reproduce locally: https://github.com/google/sanitizers/wiki/SanitizerBotReproduceBuild




Serial 17031FQCB00176
Step 24 (run instrumented asan tests [aarch64/bluejay-userdebug/TQ3A.230805.001]) failure: run instrumented asan tests [aarch64/bluejay-userdebug/TQ3A.230805.001] (failure)
...
program finished with exit code 1
elapsedTime=1835.471419

SamTebbs33 added a commit to SamTebbs33/llvm-project that referenced this pull request Apr 10, 2025
Based on fhahn's work at llvm#126437.

This PR moves the register usage checking to after the plans are
created, so that any recipes that optimise register usage (such as
partial reductions) can be properly costed and not have their VF pruned
unnecessarily.

It involves changing some tests, notably removing one from
mve-known-tripcount.ll due to it not being vectorisable thanks to high
register pressure. tail-folding-reduces-vf.ll was modified to reduce its
register pressure but still test what was intended.
SamTebbs33 added a commit that referenced this pull request Apr 11, 2025
This PR accounts for scaled reductions in `calculateRegisterUsage` to
reflect the fact that the number of lanes in their output is smaller
than the VF.

Depends on #126437
llvm-sync bot pushed a commit to arm/arm-toolchain that referenced this pull request Apr 11, 2025
This PR accounts for scaled reductions in `calculateRegisterUsage` to
reflect the fact that the number of lanes in their output is smaller
than the VF.

Depends on llvm/llvm-project#126437
SamTebbs33 added a commit to SamTebbs33/llvm-project that referenced this pull request Apr 11, 2025
Based on fhahn's work at llvm#126437.

This PR moves the register usage checking to after the plans are
created, so that any recipes that optimise register usage (such as
partial reductions) can be properly costed and not have their VF pruned
unnecessarily.

It involves changing some tests, notably removing one from
mve-known-tripcount.ll due to it not being vectorisable thanks to high
register pressure. tail-folding-reduces-vf.ll was modified to reduce its
register pressure but still test what was intended.
SamTebbs33 added a commit to SamTebbs33/llvm-project that referenced this pull request Apr 14, 2025
Based on fhahn's work at llvm#126437.

This PR moves the register usage checking to after the plans are
created, so that any recipes that optimise register usage (such as
partial reductions) can be properly costed and not have their VF pruned
unnecessarily.

It involves changing some tests, notably removing one from
mve-known-tripcount.ll due to it not being vectorisable thanks to high
register pressure. tail-folding-reduces-vf.ll was modified to reduce its
register pressure but still test what was intended.
var-const pushed a commit to ldionne/llvm-project that referenced this pull request Apr 17, 2025
Add a version of calculateRegisterUsage that estimates register
usage for a VPlan. This mostly just ports the existing code, with some
updates to figure out what recipes will generate vectors vs scalars.

There are a number of changes in the computed register usages, but they
should be more accurate w.r.t. the generated vector code.

There are the following changes:

 * Scalar usage increases in most cases by 1, as we always create a
   scalar canonical IV, which is alive across the loop and is not
   considered by the legacy implementation

 * Output is ordered by insertion; now scalar registers are added first
   due to the canonical IV phi.

 * Using the VPlan, we now also more precisely know if an induction will
   be vectorized or scalarized.

Depends on llvm#126415

PR: llvm#126437
var-const pushed a commit to ldionne/llvm-project that referenced this pull request Apr 17, 2025
This PR accounts for scaled reductions in `calculateRegisterUsage` to
reflect the fact that the number of lanes in their output is smaller
than the VF.

Depends on llvm#126437
SamTebbs33 added a commit to SamTebbs33/llvm-project that referenced this pull request Apr 17, 2025
Based on fhahn's work at llvm#126437.

This PR moves the register usage checking to after the plans are
created, so that any recipes that optimise register usage (such as
partial reductions) can be properly costed and not have their VF pruned
unnecessarily.

It involves changing some tests, notably removing one from
mve-known-tripcount.ll due to it not being vectorisable thanks to high
register pressure. tail-folding-reduces-vf.ll was modified to reduce its
register pressure but still test what was intended.
SamTebbs33 pushed a commit to SamTebbs33/llvm-project that referenced this pull request May 1, 2025
Add a version of calculateRegisterUsage that estimates register
usage for a VPlan. This mostly just ports the existing code, with some
updates to figure out what recipes will generate vectors vs scalars.

There are a number of changes in the computed register usages, but they
should be more accurate w.r.t. the generated vector code.

There are the following changes:

 * Scalar usage increases in most cases by 1, as we always create a
   scalar canonical IV, which is alive across the loop and is not
   considered by the legacy implementation

 * Output is ordered by insertion; now scalar registers are added first
   due to the canonical IV phi.

 * Using the VPlan, we now also more precisely know if an induction will
   be vectorized or scalarized.

Depends on llvm#126415

PR: llvm#126437
@yashssh
Contributor

yashssh commented May 7, 2025

Hi @fhahn, we are seeing a perf regression (around 7%) in the TSVC benchmark s351 due to this change when compiled with -Ofast on a neoverse-v2 machine.

From the assembly I can see there is less interleaving in the hot loop after this change; maybe that's causing the slowdown? How do you reckon we can go about fixing this?

Thanks!

Contributor Author

@fhahn fhahn left a comment

Hi @fhahn, we are seeing a perf regression (around 7%) in the TSVC benchmark s351 due to this change when compiled with -Ofast on a neoverse-v2 machine.

From the assembly I can see there is less interleaving in the hot loop after this change; maybe that's causing the slowdown? How do you reckon we can go about fixing this?

Thanks!

@yashssh thanks for the heads-up. Could you share a full reproducer with the LLVM IR before loop-vectorize?

@yashssh
Contributor

yashssh commented May 9, 2025

Sure, here's the IR reproducer to be executed via opt -passes=loop-vectorize reproducer.ll -S -o -
reproducer.txt (changed the file extension to .txt to allow uploading).

The C++ function and the compiler flags used to generate the IR are here https://godbolt.org/z/ebv3nc4hY

@lukel97
Contributor

lukel97 commented May 9, 2025

Sure, here's the IR reproducer to be executed via opt -passes=loop-vectorize reproducer.ll -S -o - reproducer.txt (changed the file extension to .txt to allow uploading).

The C++ function and the compiler flags used to generate the IR are here https://godbolt.org/z/ebv3nc4hY

I took a quick look at this and my guess is that it's due to the interleave-group recipes hoisting the loads and extending their live ranges. With the legacy model, the loads are located beside their fmul/fadd uses:

  %indvars.iv = phi i64 [ 0, %.preheader ], [ %indvars.iv.next, %5 ]
  %.118 = phi float [ 0.000000e+00, %.preheader ], [ %39, %5 ]
  %6 = getelementptr inbounds nuw [32000 x float], ptr @a, i64 0, i64 %indvars.iv
  %7 = load float, ptr %6, align 4, !tbaa !8
  %8 = getelementptr inbounds nuw [32000 x float], ptr @b, i64 0, i64 %indvars.iv
  %9 = load float, ptr %8, align 4, !tbaa !8
  %10 = fmul fast float %9, %7
  %11 = fadd fast float %10, %.118
  %12 = add nuw nsw i64 %indvars.iv, 1
  %13 = getelementptr inbounds nuw [32000 x float], ptr @a, i64 0, i64 %12
  %14 = load float, ptr %13, align 4, !tbaa !8
  %15 = getelementptr inbounds nuw [32000 x float], ptr @b, i64 0, i64 %12
  %16 = load float, ptr %15, align 4, !tbaa !8
  %17 = fmul fast float %16, %14
  %18 = fadd fast float %11, %17
  %19 = add nuw nsw i64 %indvars.iv, 2
  %20 = getelementptr inbounds nuw [32000 x float], ptr @a, i64 0, i64 %19
  %21 = load float, ptr %20, align 4, !tbaa !8
  %22 = getelementptr inbounds nuw [32000 x float], ptr @b, i64 0, i64 %19
  %23 = load float, ptr %22, align 4, !tbaa !8
  %24 = fmul fast float %23, %21
  %25 = fadd fast float %18, %24
  %26 = add nuw nsw i64 %indvars.iv, 3
  %27 = getelementptr inbounds nuw [32000 x float], ptr @a, i64 0, i64 %26
  %28 = load float, ptr %27, align 4, !tbaa !8
  %29 = getelementptr inbounds nuw [32000 x float], ptr @b, i64 0, i64 %26
  %30 = load float, ptr %29, align 4, !tbaa !8
  %31 = fmul fast float %30, %28
  %32 = fadd fast float %25, %31
  %33 = add nuw nsw i64 %indvars.iv, 4
  %34 = getelementptr inbounds nuw [32000 x float], ptr @a, i64 0, i64 %33
  %35 = load float, ptr %34, align 4, !tbaa !8
  %36 = getelementptr inbounds nuw [32000 x float], ptr @b, i64 0, i64 %33
  %37 = load float, ptr %36, align 4, !tbaa !8
  %38 = fmul fast float %37, %35
  %39 = fadd fast float %32, %38
  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 5
  %40 = icmp samesign ult i64 %indvars.iv, 31995
  br i1 %40, label %5, label %3, !llvm.loop !12
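
For reference, the IR above corresponds roughly to a kernel of the following shape. This is a hedged reconstruction from the IR, not necessarily the exact TSVC source; the names are chosen for illustration.

  // Reconstructed from the IR above (illustrative): a reduction over two
  // 32000-element float arrays, manually unrolled by 5. The five
  // a[i+k]*b[i+k] pairs are what become the two factor-5 interleave groups
  // once vectorized.
  float a[32000], b[32000];

  float dot_unrolled_by_5() {
    float sum = 0.0f;
    for (long i = 0; i < 32000; i += 5)
      sum += a[i] * b[i] + a[i + 1] * b[i + 1] + a[i + 2] * b[i + 2] +
             a[i + 3] * b[i + 3] + a[i + 4] * b[i + 4];
    return sum;
  }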

In VPlan the loads are now all in one place, and there are now 10 loads alive at the same time:

  vector.body:
    EMIT vp<%4> = CANONICAL-INDUCTION ir<0>, vp<%index.next>
    ir<%indvars.iv> = WIDEN-INDUCTION  ir<0>, ir<5>, vp<%0>
    WIDEN-REDUCTION-PHI ir<%.118> = phi ir<0.000000e+00>, ir<%39>
    CLONE ir<%6> = getelementptr inbounds nuw ir<@a>, ir<0>, ir<%indvars.iv>
    INTERLEAVE-GROUP with factor 5 at %7, ir<%6>
      ir<%7> = load from index 0
      ir<%14> = load from index 1
      ir<%21> = load from index 2
      ir<%28> = load from index 3
      ir<%35> = load from index 4
    CLONE ir<%8> = getelementptr inbounds nuw ir<@b>, ir<0>, ir<%indvars.iv>
    INTERLEAVE-GROUP with factor 5 at %9, ir<%8>
      ir<%9> = load from index 0
      ir<%16> = load from index 1
      ir<%23> = load from index 2
      ir<%30> = load from index 3
      ir<%37> = load from index 4
    WIDEN ir<%10> = fmul fast ir<%9>, ir<%7>
    WIDEN ir<%11> = fadd fast ir<%10>, ir<%.118>
    CLONE ir<%12> = add nuw nsw ir<%indvars.iv>, ir<1>
    CLONE ir<%13> = getelementptr inbounds nuw ir<@a>, ir<0>, ir<%12>
    CLONE ir<%15> = getelementptr inbounds nuw ir<@b>, ir<0>, ir<%12>
    WIDEN ir<%17> = fmul fast ir<%16>, ir<%14>
    WIDEN ir<%18> = fadd fast ir<%11>, ir<%17>
    CLONE ir<%19> = add nuw nsw ir<%indvars.iv>, ir<2>
    CLONE ir<%20> = getelementptr inbounds nuw ir<@a>, ir<0>, ir<%19>
    CLONE ir<%22> = getelementptr inbounds nuw ir<@b>, ir<0>, ir<%19>
    WIDEN ir<%24> = fmul fast ir<%23>, ir<%21>
    WIDEN ir<%25> = fadd fast ir<%18>, ir<%24>
    CLONE ir<%26> = add nuw nsw ir<%indvars.iv>, ir<3>
    CLONE ir<%27> = getelementptr inbounds nuw ir<@a>, ir<0>, ir<%26>
    CLONE ir<%29> = getelementptr inbounds nuw ir<@b>, ir<0>, ir<%26>
    WIDEN ir<%31> = fmul fast ir<%30>, ir<%28>
    WIDEN ir<%32> = fadd fast ir<%25>, ir<%31>
    CLONE ir<%33> = add nuw nsw ir<%indvars.iv>, ir<4>
    CLONE ir<%34> = getelementptr inbounds nuw ir<@a>, ir<0>, ir<%33>
    CLONE ir<%36> = getelementptr inbounds nuw ir<@b>, ir<0>, ir<%33>
    WIDEN ir<%38> = fmul fast ir<%37>, ir<%35>
    WIDEN ir<%39> = fadd fast ir<%32>, ir<%38>
    CLONE ir<%indvars.iv.next> = add nuw nsw ir<%indvars.iv>, ir<5>
    CLONE ir<%40> = icmp ult ir<%indvars.iv>, ir<31995>

So whilst the legacy model computed a maximum of 3 vector registers used, the VPlan one now calculates 12 (presumably the ten interleaved loads plus the widened induction and the reduction phi):

# legacy
LV(REG): Found max usage: 2 item
LV(REG): RegisterClass: Generic::ScalarRC, 2 registers
LV(REG): RegisterClass: Generic::VectorRC, 3 registers
# vplan
LV(REG): Found max usage: 2 item
LV(REG): RegisterClass: Generic::ScalarRC, 3 registers
LV(REG): RegisterClass: Generic::VectorRC, 12 registers

I'm not sure what a good fix for this is.

@yashssh
Contributor

yashssh commented May 13, 2025

In VPlan the loads are now all in one place, and there are now 10 loads alive at the same time:

Can't we instruct VPlan to spread out loads near their uses like the legacy model?

@lukel97
Contributor

lukel97 commented May 13, 2025

In VPlan the loads are now all in one place, and there are now 10 loads alive at the same time:

Can't we instruct VPlan to spread out loads near their uses like the legacy model?

I think that would be quite difficult, because an interleave group is a single recipe with multiple values, if I understand correctly.

I think the VPlan model is actually being more accurate with regard to the IR that the loop vectorizer emits, i.e. with the loads bundled together. It's more of a coincidence that the legacy version happened to match more closely what the code looked like after scheduling, once the loads were sunk?
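
To make the live-range effect concrete, here is a toy model, purely illustrative and not tied to any LLVM API, comparing the peak number of simultaneously live values for the two schedules: loads placed next to their uses, as in the legacy layout, versus all ten loads bundled up front, as the interleave groups emit them.

  #include <cstdio>
  #include <string>

  // Toy model of the two schedules discussed above. 'D' opens a live value
  // (a vector load result); 'U' retires two of them (an fmul+fadd consuming
  // a pair of loads). The peak count of open values stands in for vector
  // register pressure.
  static int peakLive(const std::string &Schedule) {
    int Live = 0, Peak = 0;
    for (char Event : Schedule) {
      Live += (Event == 'D') ? 1 : -2;
      Peak = Live > Peak ? Live : Peak;
    }
    return Peak;
  }

  int main() {
    // Legacy layout: each pair of loads is consumed immediately.
    std::printf("loads next to uses: peak %d\n",
                peakLive("DDUDDUDDUDDUDDU")); // prints 2
    // Interleave-group layout: all ten loads first, the uses afterwards.
    std::printf("loads bundled:      peak %d\n",
                peakLive("DDDDDDDDDDUUUUU")); // prints 10
  }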

SamTebbs33 added a commit to SamTebbs33/llvm-project that referenced this pull request May 16, 2025
Based on fhahn's work at llvm#126437.

This PR moves the register usage checking to after the plans are
created, so that any recipes that optimise register usage (such as
partial reductions) can be properly costed and not have their VF pruned
unnecessarily.

It involves changing some tests, notably removing one from
mve-known-tripcount.ll due to it not being vectorisable thanks to high
register pressure. tail-folding-reduces-vf.ll was modified to reduce its
register pressure but still test what was intended.