[LoopVectorizer] Prune VFs based on plan register pressure #132190


Merged (29 commits) on May 19, 2025

Conversation

@SamTebbs33 (Collaborator) commented Mar 20, 2025

This PR moves the register usage checking to after the plans are
created, so that any recipes that optimise register usage (such as
partial reductions) can be properly costed and not have their VF pruned
unnecessarily.

Depends on #113903

@llvmbot (Member) commented Mar 20, 2025

@llvm/pr-subscribers-backend-powerpc
@llvm/pr-subscribers-vectorizers
@llvm/pr-subscribers-llvm-transforms

@llvm/pr-subscribers-backend-risc-v

Author: Sam Tebbs (SamTebbs33)

Changes

Based on fhahn's work at #126437.

This PR moves the register usage checking to after the plans are
created, so that any recipes that optimise register usage (such as
partial reductions) can be properly costed and not have their VF pruned
unnecessarily.

It involves changing some tests, notably removing one from
mve-known-trip-count.ll because high register pressure means it is no
longer vectorisable. tail-folding-reduces-vf.ll was modified to reduce its
register pressure while still testing what was intended.


Patch is 186.33 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/132190.diff

24 Files Affected:

  • (modified) llvm/lib/Transforms/Vectorize/LoopVectorize.cpp (+293-242)
  • (modified) llvm/lib/Transforms/Vectorize/VPRecipeBuilder.h (+2-1)
  • (modified) llvm/lib/Transforms/Vectorize/VPlan.h (+10-4)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/i1-reg-usage.ll (+2-2)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/partial-reduce-dot-product-neon.ll (+50-10)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/partial-reduce-dot-product.ll (+414)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/reg-usage.ll (+6-7)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/scalable-call.ll (+2-2)
  • (modified) llvm/test/Transforms/LoopVectorize/ARM/mve-known-trip-count.ll (-191)
  • (modified) llvm/test/Transforms/LoopVectorize/ARM/mve-reduction-types.ll (+11-11)
  • (modified) llvm/test/Transforms/LoopVectorize/ARM/mve-reductions.ll (+37-37)
  • (modified) llvm/test/Transforms/LoopVectorize/ARM/tail-folding-reduces-vf.ll (+4-24)
  • (modified) llvm/test/Transforms/LoopVectorize/LoongArch/reg-usage.ll (+4-3)
  • (modified) llvm/test/Transforms/LoopVectorize/PowerPC/exit-branch-cost.ll (+48-12)
  • (modified) llvm/test/Transforms/LoopVectorize/PowerPC/large-loop-rdx.ll (+77-230)
  • (modified) llvm/test/Transforms/LoopVectorize/PowerPC/reg-usage.ll (+9-5)
  • (modified) llvm/test/Transforms/LoopVectorize/RISCV/reg-usage-bf16.ll (+1-1)
  • (modified) llvm/test/Transforms/LoopVectorize/RISCV/reg-usage-f16.ll (+2-2)
  • (modified) llvm/test/Transforms/LoopVectorize/RISCV/reg-usage.ll (+5-5)
  • (modified) llvm/test/Transforms/LoopVectorize/RISCV/riscv-vector-reverse.ll (+10-4)
  • (modified) llvm/test/Transforms/LoopVectorize/X86/i1-reg-usage.ll (+1-1)
  • (modified) llvm/test/Transforms/LoopVectorize/X86/pr47437.ll (+21-99)
  • (modified) llvm/test/Transforms/LoopVectorize/X86/reg-usage.ll (+23-9)
  • (modified) llvm/test/Transforms/PhaseOrdering/ARM/arm_mean_q7.ll (+18-9)
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 89337dc385350..3574aa31195f0 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -1018,7 +1018,8 @@ class LoopVectorizationCostModel {
   /// If interleave count has been specified by metadata it will be returned.
   /// Otherwise, the interleave count is computed and returned. VF and LoopCost
   /// are the selected vectorization factor and the cost of the selected VF.
-  unsigned selectInterleaveCount(ElementCount VF, InstructionCost LoopCost);
+  unsigned selectInterleaveCount(VPlan &Plan, ElementCount VF,
+                                 InstructionCost LoopCost);
 
   /// Memory access instruction may be vectorized in more than one way.
   /// Form of instruction after vectorization depends on cost.
@@ -1047,11 +1048,6 @@ class LoopVectorizationCostModel {
     SmallMapVector<unsigned, unsigned, 4> MaxLocalUsers;
   };
 
-  /// \return Returns information about the register usages of the loop for the
-  /// given vectorization factors.
-  SmallVector<RegisterUsage, 8>
-  calculateRegisterUsage(ArrayRef<ElementCount> VFs);
-
   /// Collect values we want to ignore in the cost model.
   void collectValuesToIgnore();
 
@@ -4239,27 +4235,12 @@ ElementCount LoopVectorizationCostModel::getMaximizedVFForTarget(
         ComputeScalableMaxVF);
     MaxVectorElementCountMaxBW = MinVF(MaxVectorElementCountMaxBW, MaxSafeVF);
 
-    // Collect all viable vectorization factors larger than the default MaxVF
-    // (i.e. MaxVectorElementCount).
-    SmallVector<ElementCount, 8> VFs;
+    // Set the max VF to the largest viable vectorization factor less than or
+    // equal to the max vector element count.
     for (ElementCount VS = MaxVectorElementCount * 2;
          ElementCount::isKnownLE(VS, MaxVectorElementCountMaxBW); VS *= 2)
-      VFs.push_back(VS);
-
-    // For each VF calculate its register usage.
-    auto RUs = calculateRegisterUsage(VFs);
+      MaxVF = VS;
 
-    // Select the largest VF which doesn't require more registers than existing
-    // ones.
-    for (int I = RUs.size() - 1; I >= 0; --I) {
-      const auto &MLU = RUs[I].MaxLocalUsers;
-      if (all_of(MLU, [&](decltype(MLU.front()) &LU) {
-            return LU.second <= TTI.getNumberOfRegisters(LU.first);
-          })) {
-        MaxVF = VFs[I];
-        break;
-      }
-    }
     if (ElementCount MinVF =
             TTI.getMinimumVF(SmallestType, ComputeScalableMaxVF)) {
       if (ElementCount::isKnownLT(MaxVF, MinVF)) {
@@ -4885,8 +4866,245 @@ void LoopVectorizationCostModel::collectElementTypesForWidening() {
   }
 }
 
+/// Estimate the register usage for \p Plan and vectorization factors in \p VFs.
+/// Returns the register usage for each VF in \p VFs.
+static SmallVector<LoopVectorizationCostModel::RegisterUsage, 8>
+calculateRegisterUsage(VPlan &Plan, ArrayRef<ElementCount> VFs,
+                       const TargetTransformInfo &TTI) {
+  // This function calculates the register usage by measuring the highest number
+  // of values that are alive at a single location. Obviously, this is a very
+  // rough estimation. We scan the loop in a topological order in order and
+  // assign a number to each recipe. We use RPO to ensure that defs are
+  // met before their users. We assume that each recipe that has in-loop
+  // users starts an interval. We record every time that an in-loop value is
+  // used, so we have a list of the first and last occurrences of each
+  // recipe. Next, we transpose this data structure into a multi map that
+  // holds the list of intervals that *end* at a specific location. This multi
+  // map allows us to perform a linear search. We scan the instructions linearly
+  // and record each time that a new interval starts, by placing it in a set.
+  // If we find this value in the multi-map then we remove it from the set.
+  // The max register usage is the maximum size of the set.
+  // We also search for instructions that are defined outside the loop, but are
+  // used inside the loop. We need this number separately from the max-interval
+  // usage number because when we unroll, loop-invariant values do not take
+  // more register.
+  LoopVectorizationCostModel::RegisterUsage RU;
+
+  // Each 'key' in the map opens a new interval. The values
+  // of the map are the index of the 'last seen' usage of the
+  // recipe that is the key.
+  using IntervalMap = SmallDenseMap<VPRecipeBase *, unsigned, 16>;
+
+  // Maps indices to recipes.
+  SmallVector<VPRecipeBase *, 64> Idx2Recipe;
+  // Marks the end of each interval.
+  IntervalMap EndPoint;
+  // Saves the list of recipe indices that are used in the loop.
+  SmallPtrSet<VPRecipeBase *, 8> Ends;
+  // Saves the list of values that are used in the loop but are defined outside
+  // the loop (not including non-recipe values such as arguments and
+  // constants).
+  SmallSetVector<VPValue *, 8> LoopInvariants;
+  LoopInvariants.insert(&Plan.getVectorTripCount());
+
+  ReversePostOrderTraversal<VPBlockDeepTraversalWrapper<VPBlockBase *>> RPOT(
+      Plan.getVectorLoopRegion());
+  for (VPBasicBlock *VPBB : VPBlockUtils::blocksOnly<VPBasicBlock>(RPOT)) {
+    if (!VPBB->getParent())
+      break;
+    for (VPRecipeBase &R : *VPBB) {
+      Idx2Recipe.push_back(&R);
+
+      // Save the end location of each USE.
+      for (VPValue *U : R.operands()) {
+        auto *DefR = U->getDefiningRecipe();
+
+        // Ignore non-recipe values such as arguments, constants, etc.
+        // FIXME: Might need some motivation why these values are ignored. If
+        // for example an argument is used inside the loop it will increase the
+        // register pressure (so shouldn't we add it to LoopInvariants).
+        if (!DefR && (!U->getLiveInIRValue() ||
+                      !isa<Instruction>(U->getLiveInIRValue())))
+          continue;
+
+        // If this recipe is outside the loop then record it and continue.
+        if (!DefR) {
+          LoopInvariants.insert(U);
+          continue;
+        }
+
+        // Overwrite previous end points.
+        EndPoint[DefR] = Idx2Recipe.size();
+        Ends.insert(DefR);
+      }
+    }
+    if (VPBB == Plan.getVectorLoopRegion()->getExiting()) {
+      // VPWidenIntOrFpInductionRecipes are used implicitly at the end of the
+      // exiting block, where their increment will get materialized eventually.
+      for (auto &R : Plan.getVectorLoopRegion()->getEntryBasicBlock()->phis()) {
+        if (isa<VPWidenIntOrFpInductionRecipe>(&R)) {
+          EndPoint[&R] = Idx2Recipe.size();
+          Ends.insert(&R);
+        }
+      }
+    }
+  }
+
+  // Saves the list of intervals that end with the index in 'key'.
+  using RecipeList = SmallVector<VPRecipeBase *, 2>;
+  SmallDenseMap<unsigned, RecipeList, 16> TransposeEnds;
+
+  // Transpose the EndPoints to a list of values that end at each index.
+  for (auto &Interval : EndPoint)
+    TransposeEnds[Interval.second].push_back(Interval.first);
+
+  SmallPtrSet<VPRecipeBase *, 8> OpenIntervals;
+  SmallVector<LoopVectorizationCostModel::RegisterUsage, 8> RUs(VFs.size());
+  SmallVector<SmallMapVector<unsigned, unsigned, 4>, 8> MaxUsages(VFs.size());
+
+  LLVM_DEBUG(dbgs() << "LV(REG): Calculating max register usage:\n");
+
+  VPTypeAnalysis TypeInfo(Plan.getCanonicalIV()->getScalarType());
+
+  const auto &TTICapture = TTI;
+  auto GetRegUsage = [&TTICapture](Type *Ty, ElementCount VF) -> unsigned {
+    if (Ty->isTokenTy() || !VectorType::isValidElementType(Ty) ||
+        (VF.isScalable() &&
+         !TTICapture.isElementTypeLegalForScalableVector(Ty)))
+      return 0;
+    return TTICapture.getRegUsageForType(VectorType::get(Ty, VF));
+  };
+
+  for (unsigned int Idx = 0, Sz = Idx2Recipe.size(); Idx < Sz; ++Idx) {
+    VPRecipeBase *R = Idx2Recipe[Idx];
+
+    //  Remove all of the recipes that end at this location.
+    RecipeList &List = TransposeEnds[Idx];
+    for (VPRecipeBase *ToRemove : List)
+      OpenIntervals.erase(ToRemove);
+
+    // Ignore recipes that are never used within the loop.
+    if (!Ends.count(R) && !R->mayHaveSideEffects())
+      continue;
+
+    // For each VF find the maximum usage of registers.
+    for (unsigned J = 0, E = VFs.size(); J < E; ++J) {
+      // Count the number of registers used, per register class, given all open
+      // intervals.
+      // Note that elements in this SmallMapVector will be default constructed
+      // as 0. So we can use "RegUsage[ClassID] += n" in the code below even if
+      // there is no previous entry for ClassID.
+      SmallMapVector<unsigned, unsigned, 4> RegUsage;
+
+      if (VFs[J].isScalar()) {
+        for (auto *Inst : OpenIntervals) {
+          for (VPValue *DefV : Inst->definedValues()) {
+            unsigned ClassID = TTI.getRegisterClassForType(
+                false, TypeInfo.inferScalarType(DefV));
+            // FIXME: The target might use more than one register for the type
+            // even in the scalar case.
+            RegUsage[ClassID] += 1;
+          }
+        }
+      } else {
+        for (auto *R : OpenIntervals) {
+          if (isa<VPVectorPointerRecipe, VPVectorEndPointerRecipe>(R))
+            continue;
+          if (isa<VPCanonicalIVPHIRecipe, VPReplicateRecipe, VPDerivedIVRecipe,
+                  VPScalarIVStepsRecipe>(R) ||
+              (isa<VPInstruction>(R) &&
+               all_of(cast<VPSingleDefRecipe>(R)->users(), [&](VPUser *U) {
+                 return cast<VPRecipeBase>(U)->usesScalars(
+                     R->getVPSingleValue());
+               }))) {
+            unsigned ClassID = TTI.getRegisterClassForType(
+                false, TypeInfo.inferScalarType(R->getVPSingleValue()));
+            // FIXME: The target might use more than one register for the type
+            // even in the scalar case.
+            RegUsage[ClassID] += 1;
+          } else {
+            // The output from scaled phis and scaled reductions actually have
+            // fewer lanes than the VF.
+            auto VF = VFs[J];
+            if (auto *ReductionR = dyn_cast<VPReductionPHIRecipe>(R))
+              VF = VF.divideCoefficientBy(ReductionR->getVFScaleFactor());
+            else if (auto *PartialReductionR =
+                         dyn_cast<VPPartialReductionRecipe>(R))
+              VF = VF.divideCoefficientBy(PartialReductionR->getScaleFactor());
+            if (VF != VFs[J])
+              LLVM_DEBUG(dbgs() << "LV(REG): Scaled down VF from " << VFs[J]
+                                << " to " << VF << " for ";
+                         R->dump(););
+
+            for (VPValue *DefV : R->definedValues()) {
+              Type *ScalarTy = TypeInfo.inferScalarType(DefV);
+              unsigned ClassID = TTI.getRegisterClassForType(true, ScalarTy);
+              RegUsage[ClassID] += GetRegUsage(ScalarTy, VF);
+            }
+          }
+        }
+      }
+
+      for (const auto &Pair : RegUsage) {
+        auto &Entry = MaxUsages[J][Pair.first];
+        Entry = std::max(Entry, Pair.second);
+      }
+    }
+
+    LLVM_DEBUG(dbgs() << "LV(REG): At #" << Idx << " Interval # "
+                      << OpenIntervals.size() << '\n');
+
+    // Add the current recipe to the list of open intervals.
+    OpenIntervals.insert(R);
+  }
+
+  for (unsigned Idx = 0, End = VFs.size(); Idx < End; ++Idx) {
+    // Note that elements in this SmallMapVector will be default constructed
+    // as 0. So we can use "Invariant[ClassID] += n" in the code below even if
+    // there is no previous entry for ClassID.
+    SmallMapVector<unsigned, unsigned, 4> Invariant;
+
+    for (auto *In : LoopInvariants) {
+      // FIXME: The target might use more than one register for the type
+      // even in the scalar case.
+      bool IsScalar = all_of(In->users(), [&](VPUser *U) {
+        return cast<VPRecipeBase>(U)->usesScalars(In);
+      });
+
+      ElementCount VF = IsScalar ? ElementCount::getFixed(1) : VFs[Idx];
+      unsigned ClassID = TTI.getRegisterClassForType(
+          VF.isVector(), TypeInfo.inferScalarType(In));
+      Invariant[ClassID] += GetRegUsage(TypeInfo.inferScalarType(In), VF);
+    }
+
+    LLVM_DEBUG({
+      dbgs() << "LV(REG): VF = " << VFs[Idx] << '\n';
+      dbgs() << "LV(REG): Found max usage: " << MaxUsages[Idx].size()
+             << " item\n";
+      for (const auto &pair : MaxUsages[Idx]) {
+        dbgs() << "LV(REG): RegisterClass: "
+               << TTI.getRegisterClassName(pair.first) << ", " << pair.second
+               << " registers\n";
+      }
+      dbgs() << "LV(REG): Found invariant usage: " << Invariant.size()
+             << " item\n";
+      for (const auto &pair : Invariant) {
+        dbgs() << "LV(REG): RegisterClass: "
+               << TTI.getRegisterClassName(pair.first) << ", " << pair.second
+               << " registers\n";
+      }
+    });
+
+    RU.LoopInvariantRegs = Invariant;
+    RU.MaxLocalUsers = MaxUsages[Idx];
+    RUs[Idx] = RU;
+  }
+
+  return RUs;
+}
+
 unsigned
-LoopVectorizationCostModel::selectInterleaveCount(ElementCount VF,
+LoopVectorizationCostModel::selectInterleaveCount(VPlan &Plan, ElementCount VF,
                                                   InstructionCost LoopCost) {
   // -- The interleave heuristics --
   // We interleave the loop in order to expose ILP and reduce the loop overhead.
@@ -4936,7 +5154,7 @@ LoopVectorizationCostModel::selectInterleaveCount(ElementCount VF,
       return 1;
   }
 
-  RegisterUsage R = calculateRegisterUsage({VF})[0];
+  RegisterUsage R = ::calculateRegisterUsage(Plan, {VF}, TTI)[0];
   // We divide by these constants so assume that we have at least one
   // instruction that uses at least one register.
   for (auto &Pair : R.MaxLocalUsers) {
@@ -5169,213 +5387,6 @@ LoopVectorizationCostModel::selectInterleaveCount(ElementCount VF,
   return 1;
 }
 
-SmallVector<LoopVectorizationCostModel::RegisterUsage, 8>
-LoopVectorizationCostModel::calculateRegisterUsage(ArrayRef<ElementCount> VFs) {
-  // This function calculates the register usage by measuring the highest number
-  // of values that are alive at a single location. Obviously, this is a very
-  // rough estimation. We scan the loop in a topological order in order and
-  // assign a number to each instruction. We use RPO to ensure that defs are
-  // met before their users. We assume that each instruction that has in-loop
-  // users starts an interval. We record every time that an in-loop value is
-  // used, so we have a list of the first and last occurrences of each
-  // instruction. Next, we transpose this data structure into a multi map that
-  // holds the list of intervals that *end* at a specific location. This multi
-  // map allows us to perform a linear search. We scan the instructions linearly
-  // and record each time that a new interval starts, by placing it in a set.
-  // If we find this value in the multi-map then we remove it from the set.
-  // The max register usage is the maximum size of the set.
-  // We also search for instructions that are defined outside the loop, but are
-  // used inside the loop. We need this number separately from the max-interval
-  // usage number because when we unroll, loop-invariant values do not take
-  // more register.
-  LoopBlocksDFS DFS(TheLoop);
-  DFS.perform(LI);
-
-  RegisterUsage RU;
-
-  // Each 'key' in the map opens a new interval. The values
-  // of the map are the index of the 'last seen' usage of the
-  // instruction that is the key.
-  using IntervalMap = SmallDenseMap<Instruction *, unsigned, 16>;
-
-  // Maps instruction to its index.
-  SmallVector<Instruction *, 64> IdxToInstr;
-  // Marks the end of each interval.
-  IntervalMap EndPoint;
-  // Saves the list of instruction indices that are used in the loop.
-  SmallPtrSet<Instruction *, 8> Ends;
-  // Saves the list of values that are used in the loop but are defined outside
-  // the loop (not including non-instruction values such as arguments and
-  // constants).
-  SmallSetVector<Instruction *, 8> LoopInvariants;
-
-  for (BasicBlock *BB : make_range(DFS.beginRPO(), DFS.endRPO())) {
-    for (Instruction &I : BB->instructionsWithoutDebug()) {
-      IdxToInstr.push_back(&I);
-
-      // Save the end location of each USE.
-      for (Value *U : I.operands()) {
-        auto *Instr = dyn_cast<Instruction>(U);
-
-        // Ignore non-instruction values such as arguments, constants, etc.
-        // FIXME: Might need some motivation why these values are ignored. If
-        // for example an argument is used inside the loop it will increase the
-        // register pressure (so shouldn't we add it to LoopInvariants).
-        if (!Instr)
-          continue;
-
-        // If this instruction is outside the loop then record it and continue.
-        if (!TheLoop->contains(Instr)) {
-          LoopInvariants.insert(Instr);
-          continue;
-        }
-
-        // Overwrite previous end points.
-        EndPoint[Instr] = IdxToInstr.size();
-        Ends.insert(Instr);
-      }
-    }
-  }
-
-  // Saves the list of intervals that end with the index in 'key'.
-  using InstrList = SmallVector<Instruction *, 2>;
-  SmallDenseMap<unsigned, InstrList, 16> TransposeEnds;
-
-  // Transpose the EndPoints to a list of values that end at each index.
-  for (auto &Interval : EndPoint)
-    TransposeEnds[Interval.second].push_back(Interval.first);
-
-  SmallPtrSet<Instruction *, 8> OpenIntervals;
-  SmallVector<RegisterUsage, 8> RUs(VFs.size());
-  SmallVector<SmallMapVector<unsigned, unsigned, 4>, 8> MaxUsages(VFs.size());
-
-  LLVM_DEBUG(dbgs() << "LV(REG): Calculating max register usage:\n");
-
-  const auto &TTICapture = TTI;
-  auto GetRegUsage = [&TTICapture](Type *Ty, ElementCount VF) -> unsigned {
-    if (Ty->isTokenTy() || !VectorType::isValidElementType(Ty) ||
-        (VF.isScalable() &&
-         !TTICapture.isElementTypeLegalForScalableVector(Ty)))
-      return 0;
-    return TTICapture.getRegUsageForType(VectorType::get(Ty, VF));
-  };
-
-  collectInLoopReductions();
-
-  for (unsigned int Idx = 0, Sz = IdxToInstr.size(); Idx < Sz; ++Idx) {
-    Instruction *I = IdxToInstr[Idx];
-
-    // Remove all of the instructions that end at this location.
-    InstrList &List = TransposeEnds[Idx];
-    for (Instruction *ToRemove : List)
-      OpenIntervals.erase(ToRemove);
-
-    // Ignore instructions that are never used within the loop and do not have
-    // side-effects.
-    if (!Ends.count(I) && !I->mayHaveSideEffects())
-      continue;
-
-    // Skip ignored values.
-    if (ValuesToIgnore.count(I))
-      continue;
-
-    // For each VF find the maximum usage of registers.
-    for (unsigned J = 0, E = VFs.size(); J < E; ++J) {
-      // Count the number of registers used, per register class, given all open
-      // intervals.
-      // Note that elements in this SmallMapVector will be default constructed
-      // as 0. So we can use "RegUsage[ClassID] += n" in the code below even if
-      // there is no previous entry for ClassID.
-      SmallMapVector<unsigned, unsigned, 4> RegUsage;
-
-      if (VFs[J].isScalar()) {
-        for (auto *Inst : OpenIntervals) {
-          unsigned ClassID =
-              TTI.getRegisterClassForType(false, Inst->getType());
-          // FIXME: The target might use more than one register for the type
-          // even in the scalar case.
-          RegUsage[ClassID] += 1;
-        }
-      } else {
-        collectUniformsAndScalars(VFs[J]);
-        for (auto *Inst : OpenIntervals) {
-          // Skip ignored values for VF > 1.
-          if (VecValuesToIgnore.count(Inst))
-            continue;
-          if (isScalarAfterVectorization(Inst, VFs[J])) {
-            unsigned ClassID =
-                TTI.getRegisterClassForType(false, Inst->getType());
-            // FIXME: The target might use more than one register for the type
-            // even in the scalar case.
-            RegUsage[ClassID] += 1;
-          } else {
-            unsigned ClassID =
-                TTI.getRegisterClassForType(true, Inst->getType());
-            RegUsage[ClassID] += GetRegUsage(Inst->getType(), VFs[J]);
-          }
-        }
-      }
-
-      for (const auto &Pair : RegUsage) {
-        auto &Entry = MaxUsages[J][Pair.first];
-        Entry = std::max(Entry, Pair.second);
-      }
-    }
-
-   ...
[truncated]

github-actions bot commented Mar 20, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

; CHECK-NEXT: [[TMP6:%.*]] = tail call i32 @llvm.vector.reduce.add.v16i32(<16 x i32> [[TMP5]])
; CHECK-NEXT: [[TMP7:%.*]] = add i32 [[TMP6]], [[SUM_0_LCSSA]]
; CHECK-NEXT: br label [[WHILE_END5]]
; CHECK: vector.ph:
Contributor

This looks like a possible regression to me. The previous code was optimal because it removed the loop entirely and hence there was no register pressure due to PHIs.

Collaborator Author

I've managed to fix this by ignoring in-loop reduction phis.

@@ -10,72 +10,29 @@ define void @QLA_F3_r_veq_norm2_V(ptr noalias %r, ptr noalias %a, i32 %n) {
; CHECK-SAME: ptr noalias [[R:%.*]], ptr noalias [[A:%.*]], i32 [[N:%.*]]) {
; CHECK-NEXT: [[ENTRY:.*]]:
; CHECK-NEXT: [[CMP24:%.*]] = icmp sgt i32 [[N]], 0
; CHECK-NEXT: br i1 [[CMP24]], label %[[ITER_CHECK:.*]], label %[[FOR_END13:.*]]
; CHECK: [[ITER_CHECK]]:
; CHECK-NEXT: br i1 [[CMP24]], label %[[FOR_COND1_PREHEADER_PREHEADER:.*]], label %[[FOR_END13:.*]]
Contributor

From the comment above the function it looks like they are expecting it to be interleaved 8 times, perhaps for optimal performance?

Collaborator Author

@@ -195,197 +195,6 @@ for.body: ; preds = %entry, %for.body
br i1 %cmp, label %for.body, label %for.cond.cleanup
}

; Trip count of 8 - does get vectorized
Contributor

Why has this been deleted?

Collaborator Author

The register pressure was too high for it to be vectorised and I couldn't modify the test to reduce the pressure and still have it vectorise at the expected VF with the same trip count. I'm now trying to find a way to modify the pressure calculation to keep it and vectorise at VF 4.

Collaborator Author

I've managed to reduce the test so that it doesn't exhibit huge register pressure, do let me know what you think of it.

@@ -8,8 +8,8 @@
define void @vec_load(i64 %N, ptr nocapture %a, ptr nocapture readonly %b) {
; CHECK-LABEL: @vec_load
; CHECK: vector.body:
; CHECK: %[[LOAD:.*]] = load <vscale x 2 x double>, ptr
; CHECK: call <vscale x 2 x double> @foo_vec(<vscale x 2 x double> %[[LOAD]])
; CHECK: %[[WIDE_LOAD:.*]] = load <vscale x 2 x double>, ptr
Contributor

Looks like these changes can be reverted?

Collaborator Author

Reverted now. I'm not sure why the test was failing before without this change!

Comment on lines 4238 to 4242
// Set the max VF to the largest viable vectorization factor less than or
// equal to the max vector element count.
for (ElementCount VS = MaxVectorElementCount * 2;
ElementCount::isKnownLE(VS, MaxVectorElementCountMaxBW); VS *= 2)
VFs.push_back(VS);

// For each VF calculate its register usage.
auto RUs = calculateRegisterUsage(VFs);
MaxVF = VS;
Member

Given these element counts are powers of 2 (it appears bit_floor is used above), I think this is just:

MaxVF = MaxVectorElementCountMaxBW;

(assuming MaxVectorElementCount <= MaxVectorElementCountMaxBW)

Collaborator Author

I added an assertion checking that MaxVectorElementCount <= MaxVectorElementCountMaxBW, and an AArch64 test (Transforms/LoopVectorize/AArch64/smallest-and-widest-types.ll) failed, so that condition doesn't always hold. I've therefore replaced the loop with a check and an assignment.

@david-arm (Contributor)

Hi @SamTebbs33, looks like it's failing some RISCV reg usage tests.

@@ -4885,8 +4872,248 @@ void LoopVectorizationCostModel::collectElementTypesForWidening() {
}
}

/// Estimate the register usage for \p Plan and vectorization factors in \p VFs.
Contributor

Is it necessary to move this function? It makes it a lot more difficult to review. If it can't be avoided then perhaps it should be moved in a separate NFC commit within the same PR or done in a different PR?

Collaborator Author

I had to move it when I was using LoopVectorizationCostModel to check for in-loop reductions, but now that I'm using the isInLoop() function from the reduction recipe we don't need these changes. Reverted.

Contributor

Hmm, the function has still moved so it's still hard to see what's changed. If it's easier you could always abandon the commit history and start again?

// fewer lanes than the VF.
auto VF = VFs[J];
if (auto *ReductionR = dyn_cast<VPReductionPHIRecipe>(R))
VF = VF.divideCoefficientBy(ReductionR->getVFScaleFactor());
Contributor

I wonder if it's worth first doing the transition over to computing register usage based on VPlan, followed by the scale factor changes? That way we can see individual effects of each.

Collaborator Author

Yeah that's a good idea. Done.

@@ -12,28 +12,28 @@ define i32 @mla_i32(ptr noalias nocapture readonly %A, ptr noalias nocapture rea
; CHECK: for.body.preheader:
; CHECK-NEXT: br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
; CHECK: vector.ph:
; CHECK-NEXT: [[N_RND_UP:%.*]] = add i32 [[N]], 15
; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i32 [[N_RND_UP]], 16
; CHECK-NEXT: [[N_RND_UP:%.*]] = add i32 [[N]], 7
Collaborator

This should not be changing, it was OK as it was before. https://godbolt.org/z/KnT1xaeTf

Is it changing because of the large reduction registers? Those wouldn't be expected to be in the final code. Perhaps #113903 will help to represent the register usage more correctly.

Collaborator Author

Thanks Dave, that patch does indeed help, and the only change after cherry-picking over Elvis's changes is that the first sext is moved next to the second one. Once that PR is merged I'll rebase on top of it to get rid of this regression here.

@SamTebbs33 (Collaborator Author)

Ping

Contributor

@lukel97 lukel97 left a comment

Thanks for working on this, this will improve things for RISC-V a lot. E.g. currently we spill a lot on this example due to needing a gather that requires i64 elements, which is modelled only in VPlan and not in the legacy model: https://godbolt.org/z/rrPssc5dx

With this patch we go from VF=vscale x 16 -> VF=vscale x 8, which fixes the spilling:

Before

.LBB0_11:
        vfcvt.f.xu.v    v8, v24
        vfmul.vf        v8, v8, fa0
        vfcvt.rtz.x.f.v v8, v8
        vsetivli        zero, 16, e32, m8, ta, ma
        vslidedown.vi   v16, v8, 16
        csrr    s0, vlenb
        slli    s0, s0, 4
        add     s0, s0, sp
        addi    s0, s0, 16
        vs8r.v  v16, (s0)
        vsetivli        zero, 16, e64, m8, ta, ma
        vmv.v.x v16, s6
        csrr    s0, vlenb
        slli    s0, s0, 3
        sh1add  s0, s0, s0
        add     s0, s0, sp
        addi    s0, s0, 16
        vs8r.v  v16, (s0)
        csrr    s0, vlenb
        slli    s0, s0, 3
        sh1add  s0, s0, s0
        add     s0, s0, sp
        addi    s0, s0, 16
        vl8r.v  v16, (s0)
        vsetvli zero, zero, e32, m4, ta, ma
        vwadd.wv        v0, v16, v8
        addi    s0, sp, 16
        vs8r.v  v0, (s0)
        csrr    s0, vlenb
        slli    s0, s0, 3
        sh1add  s0, s0, s0
        add     s0, s0, sp
        addi    s0, s0, 16
        vl8r.v  v16, (s0)
        csrr    s0, vlenb
        slli    s0, s0, 4
        add     s0, s0, sp
        addi    s0, s0, 16
        vl8r.v  v8, (s0)
        vwadd.wv        v0, v16, v8
        csrr    s0, vlenb
        sh3add  s0, s0, sp
        addi    s0, s0, 16
        vs8r.v  v0, (s0)
        addi    s0, sp, 16
        vl8r.v  v8, (s0)
        vsetvli zero, zero, e64, m8, ta, ma
        vmul.vx v0, v8, t2
        csrr    s0, vlenb
        sh3add  s0, s0, sp
        addi    s0, s0, 16
        vl8r.v  v8, (s0)
        vmul.vx v8, v8, t2
        vsetvli zero, zero, e8, m1, ta, ma
        vluxei64.v      v16, (a4), v0
        vsetvli zero, zero, e64, m8, ta, ma
        vadd.vx v0, v0, a4
        vsetvli zero, zero, e8, m1, ta, ma
        vluxei64.v      v18, (a4), v8
        vsetvli zero, s5, e8, m2, ta, ma
        vslideup.vi     v16, v18, 16
        vsetivli        zero, 16, e8, m1, ta, ma
        vluxei64.v      v18, (t5), v0
        vsetvli zero, zero, e64, m8, ta, ma
        vadd.vx v8, v8, a4
        vsetvli zero, zero, e8, m1, ta, ma
        vluxei64.v      v20, (t5), v8
        vsetvli zero, s5, e8, m2, ta, ma
        vslideup.vi     v18, v20, 16
        vsetivli        zero, 16, e8, m1, ta, ma
        vluxei64.v      v20, (s4), v0
        vluxei64.v      v22, (s4), v8
        vsetvli zero, s5, e8, m2, ta, ma
        vslideup.vi     v20, v22, 16
        vsseg3e8.v      v16, (s1)
        vsetvli zero, zero, e32, m8, ta, ma
        vadd.vx v24, v24, s5
        addi    a3, a3, -32
        addi    s1, s1, 96
        bnez    a3, .LBB0_11
        mv      s7, t3
        beq     a1, t3, .LBB0_3
        j       .LBB0_6

After

.LBB0_11:                               # %vector.body
                                        #   Parent Loop BB0_4 Depth=1
                                        # =>  This Inner Loop Header: Depth=2
	vsetvli	zero, zero, e32, m4, ta, ma
	vfcvt.f.xu.v	v24, v20
	vfmul.vf	v24, v24, fa0
	vfcvt.rtz.x.f.v	v4, v24
	vwadd.wv	v24, v8, v4
	vsetvli	zero, zero, e64, m8, ta, ma
	vmul.vx	v24, v24, t2
	vadd.vx	v0, v24, a4
	vsetvli	zero, zero, e8, m1, ta, ma
	vluxei64.v	v16, (a4), v24
	vluxei64.v	v17, (t5), v0
	vluxei64.v	v18, (s3), v0
	vsetvli	zero, zero, e32, m4, ta, ma
	vadd.vx	v20, v20, t4
	addi	a3, a3, -16
	vsetvli	zero, zero, e8, m1, ta, ma
	vsseg3e8.v	v16, (s0)
	addi	s0, s0, 48
	bnez	a3, .LBB0_11

@SamTebbs33 SamTebbs33 changed the base branch from main to users/SamTebbs33/fhahn-vplan-reg-pressure-interleaving April 7, 2025 13:55
@lukel97
Contributor

lukel97 commented Apr 8, 2025

I collected some more data on RISC-V on SPEC CPU 2017, this improves code size by up to 7% on some benchmarks, and no regressions were found: https://lnt.lukelau.me/db_default/v4/nts/399?show_delta=yes&show_previous=yes&show_stddev=yes&show_mad=yes&show_all=yes&show_all_samples=yes&show_sample_counts=yes&show_small_diff=yes&num_comparison_runs=0&test_filter=&test_min_value_filter=&aggregation_fn=min&MW_confidence_lv=0.05&compare_to=401&submit=Update

There's also a significant decrease in vector spilling and reloading. It removes all the spilling entirely on one benchmark so the geomean result is stuck at 100%:

Program                                       riscv-instr-info.NumVRegReloaded                 riscv-instr-info.NumVRegSpilled                
                                              lhs                              rhs     diff    lhs                             rhs     diff   
FP2017rate/508.namd_r/508.namd_r                 6.00                             6.00    0.0%    1.00                            1.00    0.0%
INT2017rat...00.perlbench_r/500.perlbench_r      8.00                             8.00    0.0%    4.00                            4.00    0.0%
INT2017speed/625.x264_s/625.x264_s              35.00                            35.00    0.0%   39.00                           39.00    0.0%
INT2017spe...23.xalancbmk_s/623.xalancbmk_s      6.00                             6.00    0.0%    6.00                            6.00    0.0%
INT2017spe...ed/620.omnetpp_s/620.omnetpp_s      5.00                             5.00    0.0%    4.00                            4.00    0.0%
INT2017speed/602.gcc_s/602.gcc_s                70.00                            70.00    0.0%   64.00                           64.00    0.0%
INT2017spe...00.perlbench_s/600.perlbench_s      8.00                             8.00    0.0%    4.00                            4.00    0.0%
INT2017rate/525.x264_r/525.x264_r               35.00                            35.00    0.0%   39.00                           39.00    0.0%
INT2017rat...23.xalancbmk_r/523.xalancbmk_r      6.00                             6.00    0.0%    6.00                            6.00    0.0%
INT2017rate/520.omnetpp_r/520.omnetpp_r          5.00                             5.00    0.0%    4.00                            4.00    0.0%
INT2017rate/502.gcc_r/502.gcc_r                 70.00                            70.00    0.0%   64.00                           64.00    0.0%
FP2017speed/644.nab_s/644.nab_s                 24.00                            24.00    0.0%   24.00                           24.00    0.0%
FP2017rate/544.nab_r/544.nab_r                  24.00                            24.00    0.0%   24.00                           24.00    0.0%
FP2017rate/511.povray_r/511.povray_r           131.00                           131.00    0.0%   74.00                           74.00    0.0%
FP2017rate/510.parest_r/510.parest_r          1490.00                          1484.00   -0.4% 1231.00                         1225.00   -0.5%
INT2017rat...31.deepsjeng_r/531.deepsjeng_r    248.00                           218.00  -12.1%  134.00                          102.00  -23.9%
INT2017spe...31.deepsjeng_s/631.deepsjeng_s    248.00                           218.00  -12.1%  134.00                          102.00  -23.9%
FP2017rate/526.blender_r/526.blender_r        1210.00                           703.00  -41.9% 1033.00                          654.00  -36.7%
FP2017speed/638.imagick_s/638.imagick_s       7524.00                          1486.00  -80.2% 4813.00                          925.00  -80.8%
FP2017rate/538.imagick_r/538.imagick_r        7524.00                          1486.00  -80.2% 4813.00                          925.00  -80.8%
FP2017speed/619.lbm_s/619.lbm_s                 42.00                             0.00 -100.0%   42.00                                 -100.0%
FP2017rate/519.lbm_r/519.lbm_r                  42.00                             0.00 -100.0%   42.00                                 -100.0%
FP2017rate...97.specrand_fr/997.specrand_fr      0.00                             0.00                                                        
FP2017spee...96.specrand_fs/996.specrand_fs      0.00                             0.00                                                        
INT2017rate/505.mcf_r/505.mcf_r                  0.00                             0.00                                                        
INT2017rate/541.leela_r/541.leela_r              0.00                             0.00                                                        
INT2017rate/557.xz_r/557.xz_r                    0.00                             0.00                                                        
INT2017rat...99.specrand_ir/999.specrand_ir      0.00                             0.00                                                        
INT2017speed/605.mcf_s/605.mcf_s                 0.00                             0.00                                                        
INT2017speed/641.leela_s/641.leela_s             0.00                             0.00                                                        
INT2017speed/657.xz_s/657.xz_s                   0.00                             0.00                                                        
INT2017spe...98.specrand_is/998.specrand_is      0.00                             0.00                                                        
                           Geomean difference                                          -100.0%                                          -19.4%

@ElvisWang123
Contributor

Thanks for working on this too!

We have a similar implementation downstream and it works well.
Would you mind if I stacked some further improvements on top of this patch?

calculateRegisterUsage(getPlanFor(LegacyVF.Width), LegacyVFs, TTI);
auto RUs = calculateRegisterUsage(BestPlan, VFs, TTI);

auto GetMaxUsage = [](SmallMapVector<unsigned, unsigned, 4> MaxLocalUsers) {
Collaborator

MaxLocalUsers is passed by value here, it should be passed by reference.

Comment on lines 7597 to 7595
unsigned Max = 0;
for (auto Pair : MaxLocalUsers)
if (Pair.second > Max)
Max = Pair.second;
return Max;
Collaborator

Rather than creating a GetMaxUsage lambda, though, you could use max_element for this instead, e.g.

auto CmpGT = [](std::pair<unsigned, unsigned> &A,
               std::pair<unsigned, unsigned> &B) {
  return A.second > B.second;
};
unsigned MaxLegacyRegUsage =
    max_element(LegacyRUs[0].MaxLocalUsers, CmpGT)->second;
unsigned MaxRegUsage = max_element(RUs[0].MaxLocalUsers, CmpGT)->second;
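A standalone sketch of that pattern, using `std::max_element` and a plain vector of pairs as a stand-in for `SmallMapVector` (note that `max_element` expects a less-than predicate, so the comparator below is the inverse of the `CmpGT` spelled in the suggestion above):

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Stand-in for SmallMapVector<unsigned, unsigned, 4>: pairs of
// (register class, number of live registers in that class).
using RegUsageMap = std::vector<std::pair<unsigned, unsigned>>;

// Largest per-class usage. llvm::max_element(Range, Cmp) is a range-based
// wrapper over std::max_element, so the standard algorithm shows the same
// shape; the comparator must be a strict less-than over the usage counts.
unsigned maxRegUsage(const RegUsageMap &MaxLocalUsers) {
  assert(!MaxLocalUsers.empty() && "no register classes recorded");
  auto CmpLT = [](const std::pair<unsigned, unsigned> &A,
                  const std::pair<unsigned, unsigned> &B) {
    return A.second < B.second;
  };
  return std::max_element(MaxLocalUsers.begin(), MaxLocalUsers.end(), CmpLT)
      ->second;
}
```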

Comment on lines 5009 to 5011
if (auto *Phi = dyn_cast<VPReductionPHIRecipe>(R);
Phi && Phi->isInLoop())
continue;
Collaborator

Why is this discarding the reduction PHI?

Also, if you rebase on main, you'll see this FIXME, that I think can now be addressed by this PR?
https://github.com/llvm/llvm-project/pull/126437/files#diff-da321d454a7246f8ae276bf1db2782bf26b5210b8133cb59e4d7fd45d0905decR5009-R5011

};
unsigned MaxLegacyRegUsage = GetMaxUsage(LegacyRUs[0].MaxLocalUsers);
unsigned MaxRegUsage = GetMaxUsage(RUs[0].MaxLocalUsers);
RegUsageDeterminedVF = MaxRegUsage <= MaxLegacyRegUsage;
Collaborator

My understanding is that the cost model would make a different choice based on register pressure if and only if the legacy reg-pressure calculation for the chosen VF would exceed the number of available registers.

But because the original cost calculation code is not around anymore (this PR removes it), we still need to figure out why the VFs chosen by the legacy/Vplan cost-models don't match. This code attempts to do that by seeing if the LegacyVF would have a higher register usage than the chosen VF. If so, then I'm not sure this is calculating the same thing.

Should this instead add a flag to calculateRegisterPressure to have a 'legacy' mode, where it doesn't scale the register usage for a PHI/VPReduction node using the scaling factor? And then you can compare the register usage in the old/new mode, to give the conclusive answer.

Collaborator

I had a closer look into this, and ignore my suggestion above. Instead, I think it is sufficient to check that the LegacyVF that would otherwise have been chosen has a register usage that exceeds the maximum number of registers. Otherwise, the cost would have chosen the same VF.

That means, this can be simplified to:

Suggested change
RegUsageDeterminedVF = MaxRegUsage <= MaxLegacyRegUsage;
auto &MLU = LegacyRUs[0].MaxLocalUsers;
RegUsageDeterminedVF = any_of(MLU, [&](decltype(MLU.front()) &LU) {
  return LU.second > TTI.getNumberOfRegisters(LU.first);
});
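A self-contained sketch of the suggested check, where a callback stands in for `TTI.getNumberOfRegisters` and the names are illustrative rather than the actual LoopVectorize code:

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// (register class, number of live registers in that class)
using RegUsageMap = std::vector<std::pair<unsigned, unsigned>>;

// Returns true when any register class's usage exceeds what the target
// provides -- i.e. the legacy VF would only have been rejected (and a
// different VF chosen) because of register pressure.
bool exceedsRegisterBudget(
    const RegUsageMap &MaxLocalUsers,
    const std::function<unsigned(unsigned)> &GetNumberOfRegisters) {
  return std::any_of(MaxLocalUsers.begin(), MaxLocalUsers.end(),
                     [&](const std::pair<unsigned, unsigned> &LU) {
                       return LU.second > GetNumberOfRegisters(LU.first);
                     });
}
```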

Collaborator Author

I've managed to get rid of these changes by checking the register pressure as part of the legacy cost model's VF selection, which mirrors the current behaviour. This has had the knock-on effect of changing some debug output for lots of x86 tests since the plan is used for reg pressure checks, rather than the scalar loop.

@SamTebbs33 SamTebbs33 force-pushed the users/SamTebbs33/fhahn-vplan-reg-pressure-interleaving branch from bc57c23 to 4f64da1 Compare April 10, 2025 10:23
@SamTebbs33 SamTebbs33 changed the base branch from users/SamTebbs33/fhahn-vplan-reg-pressure-interleaving to main April 10, 2025 10:30
@SamTebbs33
Collaborator Author

Merging #137746 was enough to fix the mve-reduction-types.ll regression so I believe this PR is ready to land. I'll do so in a couple of days.

Contributor

@fhahn fhahn left a comment

LGTM thanks

@SamTebbs33 SamTebbs33 merged commit 70501ed into llvm:main May 19, 2025
11 checks passed
@SamTebbs33 SamTebbs33 deleted the reg-pressure branch May 19, 2025 12:27
@john-brawn-arm
Collaborator

This appears to have caused some performance regressions. With clang --target=arm-none-eabi -mcpu=cortex-m55 -mfloat-abi=hard -O3 we are no longer vectorizing this function:

void fn(unsigned int n, char *in, char *out)
{
  for (unsigned int i = 0; i < n; i++) {
    char xr = *in++;
    char xg = *in++;
    char xb = *in++;

    char xy = ((((19595) * xr) + ((38470) * xg) + ((7471) * xb) +
                (1 << (16 - 1))) >> 16);
    char xi = ((((32767) * xr) + ((-15119) * xg) + ((-17648) * xb) +
                (1 << (16 - 1))) >> 16);
    char xq = ((((13282) * xr) + ((-32767) * xg) + ((19485) * xb) +
                (1 << (16 - 1))) >> 16);

    *out++ = xy;
    *out++ = xi;
    *out++ = xq;
  }
}

The -mllvm -debug output says LV(REG): Not considering vector loop of width 4 because it uses too many registers, which it didn't before.

@john-brawn-arm
Collaborator

I think what's caused this is that previously, in LoopVectorizationCostModel::getMaximizedVFForTarget, only VFs greater than MaxVectorElementCount could be discarded for excess register usage, but now any VF can be discarded.
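To make that behavioural difference concrete, a minimal sketch of the gating condition (illustrative names only, not the actual LoopVectorize code):

```cpp
#include <cassert>

// Illustrative only: whether a candidate VF may be discarded for excess
// register usage, before vs. after this patch.
bool mayPruneForRegPressure(unsigned CandidateVF,
                            unsigned MaxVectorElementCount,
                            bool AfterThisPatch) {
  if (AfterThisPatch)
    return true; // any candidate VF is subject to register-pressure pruning
  // Old behaviour: only VFs above the natural element count were pruned.
  return CandidateVF > MaxVectorElementCount;
}
```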

@SamTebbs33
Collaborator Author

I think what's caused this is previously in LoopVectorizationCostModel::getMaximizedVFForTarget only VFs greater than MaxVectorElementCount could be discarded for excess register usage, but now any VF can be discarded.

Thanks for reporting that John. I'll investigate it.

@mikaelholmen
Collaborator

I think what's caused this is previously in LoopVectorizationCostModel::getMaximizedVFForTarget only VFs greater than MaxVectorElementCount could be discarded for excess register usage, but now any VF can be discarded.

Thanks for reporting that John. I'll investigate it.

Did anything happen regarding this?

We also see performance regressions for our out-of-tree target with this patch, it seems to be too restrictive and stop vectorizing in several cases where it's beneficial for us to vectorize. We've worked around it downstream now but given that there are regressions also for ARM it would be interesting to see how that is handled.

@SamTebbs33
Collaborator Author

I think what's caused this is previously in LoopVectorizationCostModel::getMaximizedVFForTarget only VFs greater than MaxVectorElementCount could be discarded for excess register usage, but now any VF can be discarded.

Thanks for reporting that John. I'll investigate it.

Did anything happen regarding this?

We also see performance regressions for our out-of-tree target with this patch, it seems to be too restrictive and stop vectorizing in several cases where it's beneficial for us to vectorize. We've worked around it downstream now but given that there are regressions also for ARM it would be interesting to see how that is handled.

I have a fix that restores the previous behaviour, in that register usage is only checked when vector bandwidth isn't being maximised, but I'm looking into whether there's something about the reproducer loop whose register usage isn't being costed properly. If I can't think of a cost adjustment soon, I'll push up my fix.

@SamTebbs33
Collaborator Author

I've raised a PR to revert to the previous behaviour, where register pressure was only considered if vector bandwidth was being maximised. I've had a look at the assembly output and the VPlan for the reproducer and noticed that the output has a lot of vmlas, so I think a long-term solution would be to hide the register usage of the plan's multiplies so that it still gets vectorised when register usage is considered.
