[LoopVectorizer] Prune VFs based on plan register pressure #132190
Conversation
@llvm/pr-subscribers-backend-powerpc @llvm/pr-subscribers-backend-risc-v

Author: Sam Tebbs (SamTebbs33)

Changes: Based on fhahn's work at #126437. This PR moves the register usage checking to after the plans are […]. It involves changing some tests, notably removing one from […].

Patch is 186.33 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/132190.diff

24 Files Affected:
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 89337dc385350..3574aa31195f0 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -1018,7 +1018,8 @@ class LoopVectorizationCostModel {
/// If interleave count has been specified by metadata it will be returned.
/// Otherwise, the interleave count is computed and returned. VF and LoopCost
/// are the selected vectorization factor and the cost of the selected VF.
- unsigned selectInterleaveCount(ElementCount VF, InstructionCost LoopCost);
+ unsigned selectInterleaveCount(VPlan &Plan, ElementCount VF,
+ InstructionCost LoopCost);
/// Memory access instruction may be vectorized in more than one way.
/// Form of instruction after vectorization depends on cost.
@@ -1047,11 +1048,6 @@ class LoopVectorizationCostModel {
SmallMapVector<unsigned, unsigned, 4> MaxLocalUsers;
};
- /// \return Returns information about the register usages of the loop for the
- /// given vectorization factors.
- SmallVector<RegisterUsage, 8>
- calculateRegisterUsage(ArrayRef<ElementCount> VFs);
-
/// Collect values we want to ignore in the cost model.
void collectValuesToIgnore();
@@ -4239,27 +4235,12 @@ ElementCount LoopVectorizationCostModel::getMaximizedVFForTarget(
ComputeScalableMaxVF);
MaxVectorElementCountMaxBW = MinVF(MaxVectorElementCountMaxBW, MaxSafeVF);
- // Collect all viable vectorization factors larger than the default MaxVF
- // (i.e. MaxVectorElementCount).
- SmallVector<ElementCount, 8> VFs;
+ // Set the max VF to the largest viable vectorization factor less than or
+ // equal to the max vector element count.
for (ElementCount VS = MaxVectorElementCount * 2;
ElementCount::isKnownLE(VS, MaxVectorElementCountMaxBW); VS *= 2)
- VFs.push_back(VS);
-
- // For each VF calculate its register usage.
- auto RUs = calculateRegisterUsage(VFs);
+ MaxVF = VS;
- // Select the largest VF which doesn't require more registers than existing
- // ones.
- for (int I = RUs.size() - 1; I >= 0; --I) {
- const auto &MLU = RUs[I].MaxLocalUsers;
- if (all_of(MLU, [&](decltype(MLU.front()) &LU) {
- return LU.second <= TTI.getNumberOfRegisters(LU.first);
- })) {
- MaxVF = VFs[I];
- break;
- }
- }
if (ElementCount MinVF =
TTI.getMinimumVF(SmallestType, ComputeScalableMaxVF)) {
if (ElementCount::isKnownLT(MaxVF, MinVF)) {
@@ -4885,8 +4866,245 @@ void LoopVectorizationCostModel::collectElementTypesForWidening() {
}
}
+/// Estimate the register usage for \p Plan and vectorization factors in \p VFs.
+/// Returns the register usage for each VF in \p VFs.
+static SmallVector<LoopVectorizationCostModel::RegisterUsage, 8>
+calculateRegisterUsage(VPlan &Plan, ArrayRef<ElementCount> VFs,
+ const TargetTransformInfo &TTI) {
+ // This function calculates the register usage by measuring the highest number
+ // of values that are alive at a single location. Obviously, this is a very
+ // rough estimation. We scan the loop in a topological order in order and
+ // assign a number to each recipe. We use RPO to ensure that defs are
+ // met before their users. We assume that each recipe that has in-loop
+ // users starts an interval. We record every time that an in-loop value is
+ // used, so we have a list of the first and last occurrences of each
+ // recipe. Next, we transpose this data structure into a multi map that
+ // holds the list of intervals that *end* at a specific location. This multi
+ // map allows us to perform a linear search. We scan the instructions linearly
+ // and record each time that a new interval starts, by placing it in a set.
+ // If we find this value in the multi-map then we remove it from the set.
+ // The max register usage is the maximum size of the set.
+ // We also search for instructions that are defined outside the loop, but are
+ // used inside the loop. We need this number separately from the max-interval
+ // usage number because when we unroll, loop-invariant values do not take
+ // more register.
+ LoopVectorizationCostModel::RegisterUsage RU;
+
+ // Each 'key' in the map opens a new interval. The values
+ // of the map are the index of the 'last seen' usage of the
+ // recipe that is the key.
+ using IntervalMap = SmallDenseMap<VPRecipeBase *, unsigned, 16>;
+
+ // Maps indices to recipes.
+ SmallVector<VPRecipeBase *, 64> Idx2Recipe;
+ // Marks the end of each interval.
+ IntervalMap EndPoint;
+ // Saves the list of recipe indices that are used in the loop.
+ SmallPtrSet<VPRecipeBase *, 8> Ends;
+ // Saves the list of values that are used in the loop but are defined outside
+ // the loop (not including non-recipe values such as arguments and
+ // constants).
+ SmallSetVector<VPValue *, 8> LoopInvariants;
+ LoopInvariants.insert(&Plan.getVectorTripCount());
+
+ ReversePostOrderTraversal<VPBlockDeepTraversalWrapper<VPBlockBase *>> RPOT(
+ Plan.getVectorLoopRegion());
+ for (VPBasicBlock *VPBB : VPBlockUtils::blocksOnly<VPBasicBlock>(RPOT)) {
+ if (!VPBB->getParent())
+ break;
+ for (VPRecipeBase &R : *VPBB) {
+ Idx2Recipe.push_back(&R);
+
+ // Save the end location of each USE.
+ for (VPValue *U : R.operands()) {
+ auto *DefR = U->getDefiningRecipe();
+
+ // Ignore non-recipe values such as arguments, constants, etc.
+ // FIXME: Might need some motivation why these values are ignored. If
+ // for example an argument is used inside the loop it will increase the
+ // register pressure (so shouldn't we add it to LoopInvariants).
+ if (!DefR && (!U->getLiveInIRValue() ||
+ !isa<Instruction>(U->getLiveInIRValue())))
+ continue;
+
+ // If this recipe is outside the loop then record it and continue.
+ if (!DefR) {
+ LoopInvariants.insert(U);
+ continue;
+ }
+
+ // Overwrite previous end points.
+ EndPoint[DefR] = Idx2Recipe.size();
+ Ends.insert(DefR);
+ }
+ }
+ if (VPBB == Plan.getVectorLoopRegion()->getExiting()) {
+ // VPWidenIntOrFpInductionRecipes are used implicitly at the end of the
+ // exiting block, where their increment will get materialized eventually.
+ for (auto &R : Plan.getVectorLoopRegion()->getEntryBasicBlock()->phis()) {
+ if (isa<VPWidenIntOrFpInductionRecipe>(&R)) {
+ EndPoint[&R] = Idx2Recipe.size();
+ Ends.insert(&R);
+ }
+ }
+ }
+ }
+
+ // Saves the list of intervals that end with the index in 'key'.
+ using RecipeList = SmallVector<VPRecipeBase *, 2>;
+ SmallDenseMap<unsigned, RecipeList, 16> TransposeEnds;
+
+ // Transpose the EndPoints to a list of values that end at each index.
+ for (auto &Interval : EndPoint)
+ TransposeEnds[Interval.second].push_back(Interval.first);
+
+ SmallPtrSet<VPRecipeBase *, 8> OpenIntervals;
+ SmallVector<LoopVectorizationCostModel::RegisterUsage, 8> RUs(VFs.size());
+ SmallVector<SmallMapVector<unsigned, unsigned, 4>, 8> MaxUsages(VFs.size());
+
+ LLVM_DEBUG(dbgs() << "LV(REG): Calculating max register usage:\n");
+
+ VPTypeAnalysis TypeInfo(Plan.getCanonicalIV()->getScalarType());
+
+ const auto &TTICapture = TTI;
+ auto GetRegUsage = [&TTICapture](Type *Ty, ElementCount VF) -> unsigned {
+ if (Ty->isTokenTy() || !VectorType::isValidElementType(Ty) ||
+ (VF.isScalable() &&
+ !TTICapture.isElementTypeLegalForScalableVector(Ty)))
+ return 0;
+ return TTICapture.getRegUsageForType(VectorType::get(Ty, VF));
+ };
+
+ for (unsigned int Idx = 0, Sz = Idx2Recipe.size(); Idx < Sz; ++Idx) {
+ VPRecipeBase *R = Idx2Recipe[Idx];
+
+ // Remove all of the recipes that end at this location.
+ RecipeList &List = TransposeEnds[Idx];
+ for (VPRecipeBase *ToRemove : List)
+ OpenIntervals.erase(ToRemove);
+
+ // Ignore recipes that are never used within the loop.
+ if (!Ends.count(R) && !R->mayHaveSideEffects())
+ continue;
+
+ // For each VF find the maximum usage of registers.
+ for (unsigned J = 0, E = VFs.size(); J < E; ++J) {
+ // Count the number of registers used, per register class, given all open
+ // intervals.
+ // Note that elements in this SmallMapVector will be default constructed
+ // as 0. So we can use "RegUsage[ClassID] += n" in the code below even if
+ // there is no previous entry for ClassID.
+ SmallMapVector<unsigned, unsigned, 4> RegUsage;
+
+ if (VFs[J].isScalar()) {
+ for (auto *Inst : OpenIntervals) {
+ for (VPValue *DefV : Inst->definedValues()) {
+ unsigned ClassID = TTI.getRegisterClassForType(
+ false, TypeInfo.inferScalarType(DefV));
+ // FIXME: The target might use more than one register for the type
+ // even in the scalar case.
+ RegUsage[ClassID] += 1;
+ }
+ }
+ } else {
+ for (auto *R : OpenIntervals) {
+ if (isa<VPVectorPointerRecipe, VPVectorEndPointerRecipe>(R))
+ continue;
+ if (isa<VPCanonicalIVPHIRecipe, VPReplicateRecipe, VPDerivedIVRecipe,
+ VPScalarIVStepsRecipe>(R) ||
+ (isa<VPInstruction>(R) &&
+ all_of(cast<VPSingleDefRecipe>(R)->users(), [&](VPUser *U) {
+ return cast<VPRecipeBase>(U)->usesScalars(
+ R->getVPSingleValue());
+ }))) {
+ unsigned ClassID = TTI.getRegisterClassForType(
+ false, TypeInfo.inferScalarType(R->getVPSingleValue()));
+ // FIXME: The target might use more than one register for the type
+ // even in the scalar case.
+ RegUsage[ClassID] += 1;
+ } else {
+ // The output from scaled phis and scaled reductions actually have
+ // fewer lanes than the VF.
+ auto VF = VFs[J];
+ if (auto *ReductionR = dyn_cast<VPReductionPHIRecipe>(R))
+ VF = VF.divideCoefficientBy(ReductionR->getVFScaleFactor());
+ else if (auto *PartialReductionR =
+ dyn_cast<VPPartialReductionRecipe>(R))
+ VF = VF.divideCoefficientBy(PartialReductionR->getScaleFactor());
+ if (VF != VFs[J])
+ LLVM_DEBUG(dbgs() << "LV(REG): Scaled down VF from " << VFs[J]
+ << " to " << VF << " for ";
+ R->dump(););
+
+ for (VPValue *DefV : R->definedValues()) {
+ Type *ScalarTy = TypeInfo.inferScalarType(DefV);
+ unsigned ClassID = TTI.getRegisterClassForType(true, ScalarTy);
+ RegUsage[ClassID] += GetRegUsage(ScalarTy, VF);
+ }
+ }
+ }
+ }
+
+ for (const auto &Pair : RegUsage) {
+ auto &Entry = MaxUsages[J][Pair.first];
+ Entry = std::max(Entry, Pair.second);
+ }
+ }
+
+ LLVM_DEBUG(dbgs() << "LV(REG): At #" << Idx << " Interval # "
+ << OpenIntervals.size() << '\n');
+
+ // Add the current recipe to the list of open intervals.
+ OpenIntervals.insert(R);
+ }
+
+ for (unsigned Idx = 0, End = VFs.size(); Idx < End; ++Idx) {
+ // Note that elements in this SmallMapVector will be default constructed
+ // as 0. So we can use "Invariant[ClassID] += n" in the code below even if
+ // there is no previous entry for ClassID.
+ SmallMapVector<unsigned, unsigned, 4> Invariant;
+
+ for (auto *In : LoopInvariants) {
+ // FIXME: The target might use more than one register for the type
+ // even in the scalar case.
+ bool IsScalar = all_of(In->users(), [&](VPUser *U) {
+ return cast<VPRecipeBase>(U)->usesScalars(In);
+ });
+
+ ElementCount VF = IsScalar ? ElementCount::getFixed(1) : VFs[Idx];
+ unsigned ClassID = TTI.getRegisterClassForType(
+ VF.isVector(), TypeInfo.inferScalarType(In));
+ Invariant[ClassID] += GetRegUsage(TypeInfo.inferScalarType(In), VF);
+ }
+
+ LLVM_DEBUG({
+ dbgs() << "LV(REG): VF = " << VFs[Idx] << '\n';
+ dbgs() << "LV(REG): Found max usage: " << MaxUsages[Idx].size()
+ << " item\n";
+ for (const auto &pair : MaxUsages[Idx]) {
+ dbgs() << "LV(REG): RegisterClass: "
+ << TTI.getRegisterClassName(pair.first) << ", " << pair.second
+ << " registers\n";
+ }
+ dbgs() << "LV(REG): Found invariant usage: " << Invariant.size()
+ << " item\n";
+ for (const auto &pair : Invariant) {
+ dbgs() << "LV(REG): RegisterClass: "
+ << TTI.getRegisterClassName(pair.first) << ", " << pair.second
+ << " registers\n";
+ }
+ });
+
+ RU.LoopInvariantRegs = Invariant;
+ RU.MaxLocalUsers = MaxUsages[Idx];
+ RUs[Idx] = RU;
+ }
+
+ return RUs;
+}
+
unsigned
-LoopVectorizationCostModel::selectInterleaveCount(ElementCount VF,
+LoopVectorizationCostModel::selectInterleaveCount(VPlan &Plan, ElementCount VF,
InstructionCost LoopCost) {
// -- The interleave heuristics --
// We interleave the loop in order to expose ILP and reduce the loop overhead.
@@ -4936,7 +5154,7 @@ LoopVectorizationCostModel::selectInterleaveCount(ElementCount VF,
return 1;
}
- RegisterUsage R = calculateRegisterUsage({VF})[0];
+ RegisterUsage R = ::calculateRegisterUsage(Plan, {VF}, TTI)[0];
// We divide by these constants so assume that we have at least one
// instruction that uses at least one register.
for (auto &Pair : R.MaxLocalUsers) {
@@ -5169,213 +5387,6 @@ LoopVectorizationCostModel::selectInterleaveCount(ElementCount VF,
return 1;
}
-SmallVector<LoopVectorizationCostModel::RegisterUsage, 8>
-LoopVectorizationCostModel::calculateRegisterUsage(ArrayRef<ElementCount> VFs) {
- // This function calculates the register usage by measuring the highest number
- // of values that are alive at a single location. Obviously, this is a very
- // rough estimation. We scan the loop in a topological order in order and
- // assign a number to each instruction. We use RPO to ensure that defs are
- // met before their users. We assume that each instruction that has in-loop
- // users starts an interval. We record every time that an in-loop value is
- // used, so we have a list of the first and last occurrences of each
- // instruction. Next, we transpose this data structure into a multi map that
- // holds the list of intervals that *end* at a specific location. This multi
- // map allows us to perform a linear search. We scan the instructions linearly
- // and record each time that a new interval starts, by placing it in a set.
- // If we find this value in the multi-map then we remove it from the set.
- // The max register usage is the maximum size of the set.
- // We also search for instructions that are defined outside the loop, but are
- // used inside the loop. We need this number separately from the max-interval
- // usage number because when we unroll, loop-invariant values do not take
- // more register.
- LoopBlocksDFS DFS(TheLoop);
- DFS.perform(LI);
-
- RegisterUsage RU;
-
- // Each 'key' in the map opens a new interval. The values
- // of the map are the index of the 'last seen' usage of the
- // instruction that is the key.
- using IntervalMap = SmallDenseMap<Instruction *, unsigned, 16>;
-
- // Maps instruction to its index.
- SmallVector<Instruction *, 64> IdxToInstr;
- // Marks the end of each interval.
- IntervalMap EndPoint;
- // Saves the list of instruction indices that are used in the loop.
- SmallPtrSet<Instruction *, 8> Ends;
- // Saves the list of values that are used in the loop but are defined outside
- // the loop (not including non-instruction values such as arguments and
- // constants).
- SmallSetVector<Instruction *, 8> LoopInvariants;
-
- for (BasicBlock *BB : make_range(DFS.beginRPO(), DFS.endRPO())) {
- for (Instruction &I : BB->instructionsWithoutDebug()) {
- IdxToInstr.push_back(&I);
-
- // Save the end location of each USE.
- for (Value *U : I.operands()) {
- auto *Instr = dyn_cast<Instruction>(U);
-
- // Ignore non-instruction values such as arguments, constants, etc.
- // FIXME: Might need some motivation why these values are ignored. If
- // for example an argument is used inside the loop it will increase the
- // register pressure (so shouldn't we add it to LoopInvariants).
- if (!Instr)
- continue;
-
- // If this instruction is outside the loop then record it and continue.
- if (!TheLoop->contains(Instr)) {
- LoopInvariants.insert(Instr);
- continue;
- }
-
- // Overwrite previous end points.
- EndPoint[Instr] = IdxToInstr.size();
- Ends.insert(Instr);
- }
- }
- }
-
- // Saves the list of intervals that end with the index in 'key'.
- using InstrList = SmallVector<Instruction *, 2>;
- SmallDenseMap<unsigned, InstrList, 16> TransposeEnds;
-
- // Transpose the EndPoints to a list of values that end at each index.
- for (auto &Interval : EndPoint)
- TransposeEnds[Interval.second].push_back(Interval.first);
-
- SmallPtrSet<Instruction *, 8> OpenIntervals;
- SmallVector<RegisterUsage, 8> RUs(VFs.size());
- SmallVector<SmallMapVector<unsigned, unsigned, 4>, 8> MaxUsages(VFs.size());
-
- LLVM_DEBUG(dbgs() << "LV(REG): Calculating max register usage:\n");
-
- const auto &TTICapture = TTI;
- auto GetRegUsage = [&TTICapture](Type *Ty, ElementCount VF) -> unsigned {
- if (Ty->isTokenTy() || !VectorType::isValidElementType(Ty) ||
- (VF.isScalable() &&
- !TTICapture.isElementTypeLegalForScalableVector(Ty)))
- return 0;
- return TTICapture.getRegUsageForType(VectorType::get(Ty, VF));
- };
-
- collectInLoopReductions();
-
- for (unsigned int Idx = 0, Sz = IdxToInstr.size(); Idx < Sz; ++Idx) {
- Instruction *I = IdxToInstr[Idx];
-
- // Remove all of the instructions that end at this location.
- InstrList &List = TransposeEnds[Idx];
- for (Instruction *ToRemove : List)
- OpenIntervals.erase(ToRemove);
-
- // Ignore instructions that are never used within the loop and do not have
- // side-effects.
- if (!Ends.count(I) && !I->mayHaveSideEffects())
- continue;
-
- // Skip ignored values.
- if (ValuesToIgnore.count(I))
- continue;
-
- // For each VF find the maximum usage of registers.
- for (unsigned J = 0, E = VFs.size(); J < E; ++J) {
- // Count the number of registers used, per register class, given all open
- // intervals.
- // Note that elements in this SmallMapVector will be default constructed
- // as 0. So we can use "RegUsage[ClassID] += n" in the code below even if
- // there is no previous entry for ClassID.
- SmallMapVector<unsigned, unsigned, 4> RegUsage;
-
- if (VFs[J].isScalar()) {
- for (auto *Inst : OpenIntervals) {
- unsigned ClassID =
- TTI.getRegisterClassForType(false, Inst->getType());
- // FIXME: The target might use more than one register for the type
- // even in the scalar case.
- RegUsage[ClassID] += 1;
- }
- } else {
- collectUniformsAndScalars(VFs[J]);
- for (auto *Inst : OpenIntervals) {
- // Skip ignored values for VF > 1.
- if (VecValuesToIgnore.count(Inst))
- continue;
- if (isScalarAfterVectorization(Inst, VFs[J])) {
- unsigned ClassID =
- TTI.getRegisterClassForType(false, Inst->getType());
- // FIXME: The target might use more than one register for the type
- // even in the scalar case.
- RegUsage[ClassID] += 1;
- } else {
- unsigned ClassID =
- TTI.getRegisterClassForType(true, Inst->getType());
- RegUsage[ClassID] += GetRegUsage(Inst->getType(), VFs[J]);
- }
- }
- }
-
- for (const auto &Pair : RegUsage) {
- auto &Entry = MaxUsages[J][Pair.first];
- Entry = std::max(Entry, Pair.second);
- }
- }
-
- ...
[truncated]
✅ With the latest revision this PR passed the C/C++ code formatter.
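The live-interval scan that the patch documents in calculateRegisterUsage can be sketched in standalone form. This is a rough model under stated assumptions: instructions are plain indices in RPO order, there is a single register class, and loop invariants, side effects, and VF scaling are all omitted. Like the patch, the interval for a def closes one position after its last use, and pressure is sampled before the current def's interval opens:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <vector>

// Each "instruction" is its index in RPO order plus the indices of the
// defs whose values it uses (a stand-in for VPRecipeBase operands).
using Uses = std::vector<int>;

std::size_t maxLiveValues(const std::vector<Uses> &Insts) {
  // Last-use position of each def; later uses overwrite earlier ones.
  // I + 1 keeps a value open while the instruction reading it executes.
  std::map<int, int> EndPoint;
  for (int I = 0; I < static_cast<int>(Insts.size()); ++I)
    for (int Op : Insts[I])
      EndPoint[Op] = I + 1;

  // Transpose: for each location, the list of intervals that end there.
  std::map<int, std::vector<int>> TransposeEnds;
  for (const auto &KV : EndPoint)
    TransposeEnds[KV.second].push_back(KV.first);

  std::set<int> OpenIntervals;
  std::size_t Max = 0;
  for (int I = 0; I < static_cast<int>(Insts.size()); ++I) {
    for (int Dead : TransposeEnds[I]) // intervals ending here free a slot
      OpenIntervals.erase(Dead);
    if (!EndPoint.count(I))
      continue; // never used in-loop: opens no interval
    Max = std::max(Max, OpenIntervals.size()); // pressure at this point
    OpenIntervals.insert(I);
  }
  return Max;
}
```

maxLiveValues is a hypothetical name; in the patch the same walk feeds per-register-class RegUsage maps instead of a single set.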
; CHECK-NEXT: [[TMP6:%.*]] = tail call i32 @llvm.vector.reduce.add.v16i32(<16 x i32> [[TMP5]])
; CHECK-NEXT: [[TMP7:%.*]] = add i32 [[TMP6]], [[SUM_0_LCSSA]]
; CHECK-NEXT: br label [[WHILE_END5]]
; CHECK: vector.ph:
This looks like a possible regression to me. The previous code was optimal because it removed the loop entirely and hence there was no register pressure due to PHIs.
I've managed to fix this by ignoring in-loop reduction phis.
@@ -10,72 +10,29 @@ define void @QLA_F3_r_veq_norm2_V(ptr noalias %r, ptr noalias %a, i32 %n) {
; CHECK-SAME: ptr noalias [[R:%.*]], ptr noalias [[A:%.*]], i32 [[N:%.*]]) {
; CHECK-NEXT: [[ENTRY:.*]]:
; CHECK-NEXT: [[CMP24:%.*]] = icmp sgt i32 [[N]], 0
; CHECK-NEXT: br i1 [[CMP24]], label %[[ITER_CHECK:.*]], label %[[FOR_END13:.*]]
; CHECK: [[ITER_CHECK]]:
; CHECK-NEXT: br i1 [[CMP24]], label %[[FOR_COND1_PREHEADER_PREHEADER:.*]], label %[[FOR_END13:.*]]
From the comment above the function it looks like they are expecting it to be interleaved 8 times, perhaps for optimal performance?
This actually comes from fhahn's change here https://github.com/llvm/llvm-project/pull/126437/files#diff-67309aee6068e2dd0d01af33e35d15855352bb11009158c554228043cfeaa4ba
@@ -195,197 +195,6 @@ for.body: ; preds = %entry, %for.body
br i1 %cmp, label %for.body, label %for.cond.cleanup
}

; Trip count of 8 - does get vectorized
Why has this been deleted?
The register pressure was too high for it to be vectorised and I couldn't modify the test to reduce the pressure and still have it vectorise at the expected VF with the same trip count. I'm now trying to find a way to modify the pressure calculation to keep it and vectorise at VF 4.
I've managed to reduce the test so that it doesn't exhibit huge register pressure, do let me know what you think of it.
@@ -8,8 +8,8 @@
define void @vec_load(i64 %N, ptr nocapture %a, ptr nocapture readonly %b) {
; CHECK-LABEL: @vec_load
; CHECK: vector.body:
; CHECK: %[[LOAD:.*]] = load <vscale x 2 x double>, ptr
; CHECK: call <vscale x 2 x double> @foo_vec(<vscale x 2 x double> %[[LOAD]])
; CHECK: %[[WIDE_LOAD:.*]] = load <vscale x 2 x double>, ptr
Looks like these changes can be reverted?
Reverted now. I'm not sure why the test was failing before without this change!
// Set the max VF to the largest viable vectorization factor less than or
// equal to the max vector element count.
for (ElementCount VS = MaxVectorElementCount * 2;
     ElementCount::isKnownLE(VS, MaxVectorElementCountMaxBW); VS *= 2)
  VFs.push_back(VS);

// For each VF calculate its register usage.
auto RUs = calculateRegisterUsage(VFs);
MaxVF = VS;
Given these element counts are powers of 2 (it appears bit_floor is used above), I think this is just MaxVF = MaxVectorElementCountMaxBW; (assuming MaxVectorElementCount <= MaxVectorElementCountMaxBW).
I added an assertion checking if MaxVectorElementCount <= MaxVectorElementCountMaxBW, and an AArch64 test failed (Transforms/LoopVectorize/AArch64/smallest-and-widest-types.ll), so that condition doesn't always hold. So I've replaced the loop with a check and an assignment.
Hi @SamTebbs33, looks like it's failing some RISCV reg usage tests.
@@ -4885,8 +4872,248 @@ void LoopVectorizationCostModel::collectElementTypesForWidening() {
}
}

/// Estimate the register usage for \p Plan and vectorization factors in \p VFs.
Is it necessary to move this function? It makes it a lot more difficult to review. If it can't be avoided then perhaps it should be moved in a separate NFC commit within the same PR or done in a different PR?
I had to move it when I was using LoopVectorizationCostModel to check for in-loop reductions, but now that I'm using the isInLoop() function from the reduction recipe we don't need these changes. Reverted.
Hmm, the function has still moved so it's still hard to see what's changed. If it's easier you could always abandon the commit history and start again?
// fewer lanes than the VF.
auto VF = VFs[J];
if (auto *ReductionR = dyn_cast<VPReductionPHIRecipe>(R))
  VF = VF.divideCoefficientBy(ReductionR->getVFScaleFactor());
I wonder if it's worth first doing the transition over to computing register usage based on VPlan, followed by the scale factor changes? That way we can see individual effects of each.
Yeah that's a good idea. Done.
@@ -12,28 +12,28 @@ define i32 @mla_i32(ptr noalias nocapture readonly %A, ptr noalias nocapture rea
; CHECK: for.body.preheader:
; CHECK-NEXT: br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
; CHECK: vector.ph:
; CHECK-NEXT: [[N_RND_UP:%.*]] = add i32 [[N]], 15
; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i32 [[N_RND_UP]], 16
; CHECK-NEXT: [[N_RND_UP:%.*]] = add i32 [[N]], 7
This should not be changing, it was OK as it was before. https://godbolt.org/z/KnT1xaeTf
Is it changing because of the large reduction registers? Those wouldn't be expected to be in the final code. Perhaps #113903 will help to represent the register usage more correctly.
Thanks Dave, that patch does indeed help and the only change after cherry picking over Elvis's changes is that the first sext is moved next to the second one. Once that PR is merged I'll rebase on top of it to get rid of this regression here.
Ping
Thanks for working on this, this will improve things for RISC-V a lot. E.g. currently we spill a lot on this example due to needing a gather which requires i64 elements, which is only modelled in VPlan and not in the legacy model: https://godbolt.org/z/rrPssc5dx
With this patch we go from VF=vscale x 16 -> VF=vscale x 8, which fixes the spilling:
Before
.LBB0_11:
vfcvt.f.xu.v v8, v24
vfmul.vf v8, v8, fa0
vfcvt.rtz.x.f.v v8, v8
vsetivli zero, 16, e32, m8, ta, ma
vslidedown.vi v16, v8, 16
csrr s0, vlenb
slli s0, s0, 4
add s0, s0, sp
addi s0, s0, 16
vs8r.v v16, (s0)
vsetivli zero, 16, e64, m8, ta, ma
vmv.v.x v16, s6
csrr s0, vlenb
slli s0, s0, 3
sh1add s0, s0, s0
add s0, s0, sp
addi s0, s0, 16
vs8r.v v16, (s0)
csrr s0, vlenb
slli s0, s0, 3
sh1add s0, s0, s0
add s0, s0, sp
addi s0, s0, 16
vl8r.v v16, (s0)
vsetvli zero, zero, e32, m4, ta, ma
vwadd.wv v0, v16, v8
addi s0, sp, 16
vs8r.v v0, (s0)
csrr s0, vlenb
slli s0, s0, 3
sh1add s0, s0, s0
add s0, s0, sp
addi s0, s0, 16
vl8r.v v16, (s0)
csrr s0, vlenb
slli s0, s0, 4
add s0, s0, sp
addi s0, s0, 16
vl8r.v v8, (s0)
vwadd.wv v0, v16, v8
csrr s0, vlenb
sh3add s0, s0, sp
addi s0, s0, 16
vs8r.v v0, (s0)
addi s0, sp, 16
vl8r.v v8, (s0)
vsetvli zero, zero, e64, m8, ta, ma
vmul.vx v0, v8, t2
csrr s0, vlenb
sh3add s0, s0, sp
addi s0, s0, 16
vl8r.v v8, (s0)
vmul.vx v8, v8, t2
vsetvli zero, zero, e8, m1, ta, ma
vluxei64.v v16, (a4), v0
vsetvli zero, zero, e64, m8, ta, ma
vadd.vx v0, v0, a4
vsetvli zero, zero, e8, m1, ta, ma
vluxei64.v v18, (a4), v8
vsetvli zero, s5, e8, m2, ta, ma
vslideup.vi v16, v18, 16
vsetivli zero, 16, e8, m1, ta, ma
vluxei64.v v18, (t5), v0
vsetvli zero, zero, e64, m8, ta, ma
vadd.vx v8, v8, a4
vsetvli zero, zero, e8, m1, ta, ma
vluxei64.v v20, (t5), v8
vsetvli zero, s5, e8, m2, ta, ma
vslideup.vi v18, v20, 16
vsetivli zero, 16, e8, m1, ta, ma
vluxei64.v v20, (s4), v0
vluxei64.v v22, (s4), v8
vsetvli zero, s5, e8, m2, ta, ma
vslideup.vi v20, v22, 16
vsseg3e8.v v16, (s1)
vsetvli zero, zero, e32, m8, ta, ma
vadd.vx v24, v24, s5
addi a3, a3, -32
addi s1, s1, 96
bnez a3, .LBB0_11
mv s7, t3
beq a1, t3, .LBB0_3
j .LBB0_6
After
.LBB0_11: # %vector.body
# Parent Loop BB0_4 Depth=1
# => This Inner Loop Header: Depth=2
vsetvli zero, zero, e32, m4, ta, ma
vfcvt.f.xu.v v24, v20
vfmul.vf v24, v24, fa0
vfcvt.rtz.x.f.v v4, v24
vwadd.wv v24, v8, v4
vsetvli zero, zero, e64, m8, ta, ma
vmul.vx v24, v24, t2
vadd.vx v0, v24, a4
vsetvli zero, zero, e8, m1, ta, ma
vluxei64.v v16, (a4), v24
vluxei64.v v17, (t5), v0
vluxei64.v v18, (s3), v0
vsetvli zero, zero, e32, m4, ta, ma
vadd.vx v20, v20, t4
addi a3, a3, -16
vsetvli zero, zero, e8, m1, ta, ma
vsseg3e8.v v16, (s0)
addi s0, s0, 48
bnez a3, .LBB0_11
I collected some more data on RISC-V on SPEC CPU 2017, this improves code size by up to 7% on some benchmarks, and no regressions were found: https://lnt.lukelau.me/db_default/v4/nts/399?show_delta=yes&show_previous=yes&show_stddev=yes&show_mad=yes&show_all=yes&show_all_samples=yes&show_sample_counts=yes&show_small_diff=yes&num_comparison_runs=0&test_filter=&test_min_value_filter=&aggregation_fn=min&MW_confidence_lv=0.05&compare_to=401&submit=Update

There's also a significant decrease in vector spilling and reloading. It removes all the spilling entirely on one benchmark, so the geomean result is stuck at 100%.
Thanks for working on this too! We have a similar implementation downstream and it works well.
calculateRegisterUsage(getPlanFor(LegacyVF.Width), LegacyVFs, TTI);
auto RUs = calculateRegisterUsage(BestPlan, VFs, TTI);

auto GetMaxUsage = [](SmallMapVector<unsigned, unsigned, 4> MaxLocalUsers) {
MaxLocalUsers is passed by value here; it should be passed by reference.
unsigned Max = 0;
for (auto Pair : MaxLocalUsers)
  if (Pair.second > Max)
    Max = Pair.second;
return Max;
Although rather than creating a GetMaxUsage
, you could use max_element
for this instead, e.g.
auto CmpGT = [](std::pair<unsigned, unsigned> &A,
std::pair<unsigned, unsigned> &B) {
return A.second > B.second;
};
unsigned MaxLegacyRegUsage =
max_element(LegacyRUs[0].MaxLocalUsers, CmpGT)->second;
unsigned MaxRegUsage = max_element(RUs[0].MaxLocalUsers, CmpGT)->second;
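A small self-contained sketch of the `max_element` approach, using a `std::vector` of pairs in place of LLVM's `SmallMapVector`. One caveat worth noting: `std::max_element` interprets its comparator as less-than, so to get the entry with the highest register count the lambda should compare with `<` on `.second`.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Returns the highest per-class register usage. max_element expects a
// less-than comparator, so CmpLT compares .second with `<`; the iterator
// it returns points at the pair with the largest usage.
static unsigned
maxRegUsage(const std::vector<std::pair<unsigned, unsigned>> &Users) {
  auto CmpLT = [](const std::pair<unsigned, unsigned> &A,
                  const std::pair<unsigned, unsigned> &B) {
    return A.second < B.second;
  };
  return std::max_element(Users.begin(), Users.end(), CmpLT)->second;
}
```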
if (auto *Phi = dyn_cast<VPReductionPHIRecipe>(R);
    Phi && Phi->isInLoop())
  continue;
Why is this discarding the reduction PHI?
Also, if you rebase on main you'll see this FIXME, which I think can now be addressed by this PR:
https://github.com/llvm/llvm-project/pull/126437/files#diff-da321d454a7246f8ae276bf1db2782bf26b5210b8133cb59e4d7fd45d0905decR5009-R5011
};
unsigned MaxLegacyRegUsage = GetMaxUsage(LegacyRUs[0].MaxLocalUsers);
unsigned MaxRegUsage = GetMaxUsage(RUs[0].MaxLocalUsers);
RegUsageDeterminedVF = MaxRegUsage <= MaxLegacyRegUsage;
My understanding is that the cost model would make a different choice based on register pressure if and only if the legacy reg-pressure calculation for the chosen VF would exceed the number of available registers.
But because the original cost calculation code is not around anymore (this PR removes it), we still need to figure out why the VFs chosen by the legacy/Vplan cost-models don't match. This code attempts to do that by seeing if the LegacyVF would have a higher register usage than the chosen VF. If so, then I'm not sure this is calculating the same thing.
Should this instead add a flag to `calculateRegisterPressure` to have a 'legacy' mode, where it doesn't scale the register usage for a PHI/VPReduction node using the scaling factor? Then you can compare the register usage in the old/new mode, to give the conclusive answer.
I had a closer look into this, and ignore my suggestion above. Instead, I think it is sufficient to check that the LegacyVF that would otherwise have been chosen has a register usage that exceeds the maximum number of registers. Otherwise, the cost would have chosen the same VF.
That means, this can be simplified to:
RegUsageDeterminedVF = MaxRegUsage <= MaxLegacyRegUsage;
auto &MLU = LegacyRUs[0].MaxLocalUsers;
RegUsageDeterminedVF = any_of(MLU, [&](decltype(MLU.front()) &LU) {
  return LU.second > TTI.getNumberOfRegisters(LU.first);
});
I've managed to get rid of these changes by checking the register pressure as part of the legacy cost model's VF selection, which mirrors the current behaviour. This has had the knock-on effect of changing some debug output for lots of x86 tests since the plan is used for reg pressure checks, rather than the scalar loop.
bc57c23 to 4f64da1
Merging #137746 was enough to fix the mve-reduction-types.ll regression so I believe this PR is ready to land. I'll do so in a couple of days.
LGTM thanks
This appears to have caused some performance regressions. With
The
I think what's caused this is that previously, in `LoopVectorizationCostModel::getMaximizedVFForTarget`, only VFs greater than MaxVectorElementCount could be discarded for excess register usage, but now any VF can be discarded. |
Thanks for reporting that John. I'll investigate it. |
Did anything happen regarding this? We also see performance regressions for our out-of-tree target with this patch, it seems to be too restrictive and stop vectorizing in several cases where it's beneficial for us to vectorize. We've worked around it downstream now but given that there are regressions also for ARM it would be interesting to see how that is handled. |
I have a fix that restores previous behaviour, in that register usage is only checked when vector bandwidth isn't being maximised, but I'm looking into if there's something about the reproducer loop whose register usage isn't being costed properly. If I can't think of a cost adjustment soon then I'll push up my fix. |
I've raised a PR to revert to previous behaviour, where the register pressure was only considered if vector bandwidth was being maximised. I've had a look at the assembly output and the vplan for the reproducer, and noticed that the output has a lot of
This PR moves the register usage checking to after the plans are
created, so that any recipes that optimise register usage (such as
partial reductions) can be properly costed and not have their VF pruned
unnecessarily.
Depends on #113903
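To make the overall flow of the PR concrete, here is a hypothetical sketch of the restructured pipeline: build a plan per candidate VF first, compute register pressure on each plan, and only then prune VFs whose pressure exceeds the target's budget. All names here are illustrative stand-ins, not the actual LLVM API; in particular, a recipe such as a partial reduction can lower a plan's `MaxPressure` and thereby survive pruning, which is the point of moving the check after plan construction.

```cpp
#include <vector>

// Stand-in for a costed VPlan: the VF it was built for and the maximum
// number of simultaneously live registers its recipes require.
struct CandidatePlan {
  unsigned VF;          // vectorization factor
  unsigned MaxPressure; // max register pressure computed from the plan
};

// Keep only the VFs whose plan fits within the target's register file.
// Pruning happens after plan construction, so pressure-reducing recipes
// have already been accounted for.
static std::vector<unsigned>
pruneByRegisterPressure(const std::vector<CandidatePlan> &Plans,
                        unsigned NumRegisters) {
  std::vector<unsigned> Kept;
  for (const auto &P : Plans)
    if (P.MaxPressure <= NumRegisters)
      Kept.push_back(P.VF);
  return Kept;
}
```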