[LV] Compute register usage for interleaving on VPlan. #126437
Conversation
@llvm/pr-subscribers-llvm-transforms @llvm/pr-subscribers-backend-risc-v @llvm/pr-subscribers-backend-powerpc
Author: Florian Hahn (fhahn)
Changes
Add a version of calculateRegisterUsage that estimates register usage for a VPlan. This mostly just ports the existing code, with some updates to figure out what recipes will generate vectors vs scalars.
There are a number of changes in the computed register usages, but they should be more accurate w.r.t. the generated vector code.
There are the following changes:
Scalar usage increases in most cases by 1, as we always create a scalar canonical IV, which is alive across the loop and is not considered by the legacy implementation.
Output is ordered by insertion; now scalar registers are added first due to the canonical IV phi.
Using the VPlan, we now also more precisely know if an induction will be vectorized or scalarized.
Depends on #126415 Patch is 52.47 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/126437.diff 14 Files Affected:
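For intuition, the interval-based estimate the patch implements (described in the comment block at the top of the new function below) can be sketched in a few lines of standalone C++. This is an illustrative sketch only, not code from the patch; maxOpenIntervals and its pair-based interval representation are invented for the example.

#include <algorithm>
#include <utility>
#include <vector>

// Each value occupies a register from its defining position Def to End, one
// past its last use. The pressure estimate is the maximum number of intervals
// open at any point, found with a single linear scan over the definitions.
static unsigned
maxOpenIntervals(std::vector<std::pair<unsigned, unsigned>> Intervals) {
  std::sort(Intervals.begin(), Intervals.end()); // defs in program order (RPO)
  std::vector<unsigned> OpenEnds; // end positions of currently live intervals
  unsigned MaxLive = 0;
  for (const auto &[Def, End] : Intervals) {
    // Close every interval whose last use is at or before this definition.
    OpenEnds.erase(std::remove_if(OpenEnds.begin(), OpenEnds.end(),
                                  [&](unsigned E) { return E <= Def; }),
                   OpenEnds.end());
    OpenEnds.push_back(End);
    MaxLive = std::max(MaxLive, (unsigned)OpenEnds.size());
  }
  return MaxLive;
}

The real function performs the same scan per register class and per candidate VF, weighting each open interval via TTI.getRegUsageForType for values that will be vectorized, and tracking loop-invariant values separately.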
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index c2d347cf9b7e0c9..9727487a1dd1396 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -1020,7 +1020,8 @@ class LoopVectorizationCostModel {
/// If interleave count has been specified by metadata it will be returned.
/// Otherwise, the interleave count is computed and returned. VF and LoopCost
/// are the selected vectorization factor and the cost of the selected VF.
- unsigned selectInterleaveCount(ElementCount VF, InstructionCost LoopCost);
+ unsigned selectInterleaveCount(VPlan &Plan, ElementCount VF,
+ InstructionCost LoopCost);
/// Memory access instruction may be vectorized in more than one way.
/// Form of instruction after vectorization depends on cost.
@@ -4878,8 +4879,232 @@ void LoopVectorizationCostModel::collectElementTypesForWidening() {
}
}
+/// Estimate the register usage for \p Plan and vectorization factors in \p VFs.
+/// Returns the register usage for each VF in \p VFs.
+static SmallVector<LoopVectorizationCostModel::RegisterUsage, 8>
+calculateRegisterUsage(VPlan &Plan, ArrayRef<ElementCount> VFs,
+ const TargetTransformInfo &TTI) {
+ // This function calculates the register usage by measuring the highest number
+ // of values that are alive at a single location. Obviously, this is a very
+ // rough estimation. We scan the loop in topological order and
+ // assign a number to each recipe. We use RPO to ensure that defs are
+ // met before their users. We assume that each recipe that has in-loop
+ // users starts an interval. We record every time that an in-loop value is
+ // used, so we have a list of the first and last occurrences of each
+ // recipe. Next, we transpose this data structure into a multi map that
+ // holds the list of intervals that *end* at a specific location. This multi
+ // map allows us to perform a linear search. We scan the instructions linearly
+ // and record each time that a new interval starts, by placing it in a set.
+ // If we find this value in the multi-map then we remove it from the set.
+ // The max register usage is the maximum size of the set.
+ // We also search for instructions that are defined outside the loop, but are
+ // used inside the loop. We need this number separately from the max-interval
+ // usage number because when we unroll, loop-invariant values do not take
+ // more register.
+ LoopVectorizationCostModel::RegisterUsage RU;
+
+ // Each 'key' in the map opens a new interval. The values
+ // of the map are the index of the 'last seen' usage of the
+ // recipe that is the key.
+ using IntervalMap = SmallDenseMap<VPRecipeBase *, unsigned, 16>;
+
+ // Maps recipe to its index.
+ SmallVector<VPRecipeBase *, 64> IdxToRecipe;
+ // Marks the end of each interval.
+ IntervalMap EndPoint;
+ // Saves the list of recipe indices that are used in the loop.
+ SmallPtrSet<VPRecipeBase *, 8> Ends;
+ // Saves the list of values that are used in the loop but are defined outside
+ // the loop (not including non-recipe values such as arguments and
+ // constants).
+ SmallSetVector<VPValue *, 8> LoopInvariants;
+ LoopInvariants.insert(&Plan.getVectorTripCount());
+
+ ReversePostOrderTraversal<VPBlockDeepTraversalWrapper<VPBlockBase *>> RPOT(
+ Plan.getVectorLoopRegion());
+ for (VPBasicBlock *VPBB : VPBlockUtils::blocksOnly<VPBasicBlock>(RPOT)) {
+ if (!VPBB->getParent())
+ break;
+ for (VPRecipeBase &R : *VPBB) {
+ IdxToRecipe.push_back(&R);
+
+ // Save the end location of each USE.
+ for (VPValue *U : R.operands()) {
+ auto *DefR = U->getDefiningRecipe();
+
+ // Ignore non-recipe values such as arguments, constants, etc.
+ // FIXME: Might need some motivation why these values are ignored. If
+ // for example an argument is used inside the loop it will increase the
+ // register pressure (so shouldn't we add it to LoopInvariants).
+ if (!DefR && (!U->getLiveInIRValue() ||
+ !isa<Instruction>(U->getLiveInIRValue())))
+ continue;
+
+ // If this recipe is outside the loop then record it and continue.
+ if (!DefR) {
+ LoopInvariants.insert(U);
+ continue;
+ }
+
+ // Overwrite previous end points.
+ EndPoint[DefR] = IdxToRecipe.size();
+ Ends.insert(DefR);
+ }
+ }
+ if (VPBB == Plan.getVectorLoopRegion()->getExiting()) {
+ // VPWidenIntOrFpInductionRecipes are used implicitly at the end of the
+ // exiting block, where their increment will get materialized eventually.
+ for (auto &R : Plan.getVectorLoopRegion()->getEntryBasicBlock()->phis()) {
+ if (isa<VPWidenIntOrFpInductionRecipe>(&R)) {
+ EndPoint[&R] = IdxToRecipe.size();
+ Ends.insert(&R);
+ }
+ }
+ }
+ }
+
+ // Saves the list of intervals that end with the index in 'key'.
+ using RecipeList = SmallVector<VPRecipeBase *, 2>;
+ SmallDenseMap<unsigned, RecipeList, 16> TransposeEnds;
+
+ // Transpose the EndPoints to a list of values that end at each index.
+ for (auto &Interval : EndPoint)
+ TransposeEnds[Interval.second].push_back(Interval.first);
+
+ SmallPtrSet<VPRecipeBase *, 8> OpenIntervals;
+ SmallVector<LoopVectorizationCostModel::RegisterUsage, 8> RUs(VFs.size());
+ SmallVector<SmallMapVector<unsigned, unsigned, 4>, 8> MaxUsages(VFs.size());
+
+ LLVM_DEBUG(dbgs() << "LV(REG): Calculating max register usage:\n");
+
+ VPTypeAnalysis TypeInfo(Plan.getCanonicalIV()->getScalarType());
+
+ const auto &TTICapture = TTI;
+ auto GetRegUsage = [&TTICapture](Type *Ty, ElementCount VF) -> unsigned {
+ if (Ty->isTokenTy() || !VectorType::isValidElementType(Ty) ||
+ (VF.isScalable() &&
+ !TTICapture.isElementTypeLegalForScalableVector(Ty)))
+ return 0;
+ return TTICapture.getRegUsageForType(VectorType::get(Ty, VF));
+ };
+
+ for (unsigned int Idx = 0, Sz = IdxToRecipe.size(); Idx < Sz; ++Idx) {
+ VPRecipeBase *R = IdxToRecipe[Idx];
+
+ // Remove all of the recipes that end at this location.
+ RecipeList &List = TransposeEnds[Idx];
+ for (VPRecipeBase *ToRemove : List)
+ OpenIntervals.erase(ToRemove);
+
+ // Ignore recipes that are never used within the loop.
+ if (!Ends.count(R) && !R->mayHaveSideEffects())
+ continue;
+
+ // For each VF find the maximum usage of registers.
+ for (unsigned J = 0, E = VFs.size(); J < E; ++J) {
+ // Count the number of registers used, per register class, given all open
+ // intervals.
+ // Note that elements in this SmallMapVector will be default constructed
+ // as 0. So we can use "RegUsage[ClassID] += n" in the code below even if
+ // there is no previous entry for ClassID.
+ SmallMapVector<unsigned, unsigned, 4> RegUsage;
+
+ if (VFs[J].isScalar()) {
+ for (auto *Inst : OpenIntervals) {
+ for (VPValue *DefV : Inst->definedValues()) {
+ unsigned ClassID = TTI.getRegisterClassForType(
+ false, TypeInfo.inferScalarType(DefV));
+ // FIXME: The target might use more than one register for the type
+ // even in the scalar case.
+ RegUsage[ClassID] += 1;
+ }
+ }
+ } else {
+ for (auto *R : OpenIntervals) {
+ if (isa<VPVectorPointerRecipe, VPReverseVectorPointerRecipe>(R))
+ continue;
+ if (isa<VPCanonicalIVPHIRecipe, VPReplicateRecipe, VPDerivedIVRecipe,
+ VPScalarIVStepsRecipe>(R) ||
+ (isa<VPInstruction>(R) &&
+ all_of(cast<VPSingleDefRecipe>(R)->users(), [&](VPUser *U) {
+ return cast<VPRecipeBase>(U)->usesScalars(
+ R->getVPSingleValue());
+ }))) {
+ unsigned ClassID = TTI.getRegisterClassForType(
+ false, TypeInfo.inferScalarType(R->getVPSingleValue()));
+ // FIXME: The target might use more than one register for the type
+ // even in the scalar case.
+ RegUsage[ClassID] += 1;
+ } else {
+ for (VPValue *DefV : R->definedValues()) {
+ Type *ScalarTy = TypeInfo.inferScalarType(DefV);
+ unsigned ClassID = TTI.getRegisterClassForType(true, ScalarTy);
+ RegUsage[ClassID] += GetRegUsage(ScalarTy, VFs[J]);
+ }
+ }
+ }
+ }
+
+ for (const auto &Pair : RegUsage) {
+ auto &Entry = MaxUsages[J][Pair.first];
+ Entry = std::max(Entry, Pair.second);
+ }
+ }
+
+ LLVM_DEBUG(dbgs() << "LV(REG): At #" << Idx << " Interval # "
+ << OpenIntervals.size() << '\n');
+
+ // Add the current recipe to the list of open intervals.
+ OpenIntervals.insert(R);
+ }
+
+ for (unsigned Idx = 0, End = VFs.size(); Idx < End; ++Idx) {
+ // Note that elements in this SmallMapVector will be default constructed
+ // as 0. So we can use "Invariant[ClassID] += n" in the code below even if
+ // there is no previous entry for ClassID.
+ SmallMapVector<unsigned, unsigned, 4> Invariant;
+
+ for (auto *In : LoopInvariants) {
+ // FIXME: The target might use more than one register for the type
+ // even in the scalar case.
+ bool IsScalar = all_of(In->users(), [&](VPUser *U) {
+ return cast<VPRecipeBase>(U)->usesScalars(In);
+ });
+
+ ElementCount VF = IsScalar ? ElementCount::getFixed(1) : VFs[Idx];
+ unsigned ClassID = TTI.getRegisterClassForType(
+ VF.isVector(), TypeInfo.inferScalarType(In));
+ Invariant[ClassID] += GetRegUsage(TypeInfo.inferScalarType(In), VF);
+ }
+
+ LLVM_DEBUG({
+ dbgs() << "LV(REG): VF = " << VFs[Idx] << '\n';
+ dbgs() << "LV(REG): Found max usage: " << MaxUsages[Idx].size()
+ << " item\n";
+ for (const auto &pair : MaxUsages[Idx]) {
+ dbgs() << "LV(REG): RegisterClass: "
+ << TTI.getRegisterClassName(pair.first) << ", " << pair.second
+ << " registers\n";
+ }
+ dbgs() << "LV(REG): Found invariant usage: " << Invariant.size()
+ << " item\n";
+ for (const auto &pair : Invariant) {
+ dbgs() << "LV(REG): RegisterClass: "
+ << TTI.getRegisterClassName(pair.first) << ", " << pair.second
+ << " registers\n";
+ }
+ });
+
+ RU.LoopInvariantRegs = Invariant;
+ RU.MaxLocalUsers = MaxUsages[Idx];
+ RUs[Idx] = RU;
+ }
+
+ return RUs;
+}
+
unsigned
-LoopVectorizationCostModel::selectInterleaveCount(ElementCount VF,
+LoopVectorizationCostModel::selectInterleaveCount(VPlan &Plan, ElementCount VF,
InstructionCost LoopCost) {
// -- The interleave heuristics --
// We interleave the loop in order to expose ILP and reduce the loop overhead.
@@ -4929,7 +5154,7 @@ LoopVectorizationCostModel::selectInterleaveCount(ElementCount VF,
return 1;
}
- RegisterUsage R = calculateRegisterUsage({VF})[0];
+ RegisterUsage R = ::calculateRegisterUsage(Plan, {VF}, TTI)[0];
// We divide by these constants so assume that we have at least one
// instruction that uses at least one register.
for (auto &Pair : R.MaxLocalUsers) {
@@ -5262,7 +5487,7 @@ LoopVectorizationCostModel::calculateRegisterUsage(ArrayRef<ElementCount> VFs) {
OpenIntervals.erase(ToRemove);
// Ignore instructions that are never used within the loop.
- if (!Ends.count(I))
+ if (!Ends.count(I) && !I->mayHaveSideEffects())
continue;
// Skip ignored values.
@@ -10574,7 +10799,7 @@ bool LoopVectorizePass::processLoop(Loop *L) {
AddBranchWeights, CM.CostKind);
if (LVP.hasPlanWithVF(VF.Width)) {
// Select the interleave count.
- IC = CM.selectInterleaveCount(VF.Width, VF.Cost);
+ IC = CM.selectInterleaveCount(LVP.getPlanFor(VF.Width), VF.Width, VF.Cost);
unsigned SelectedIC = std::max(IC, UserIC);
// Optimistically generate runtime checks if they are needed. Drop them if
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/i1-reg-usage.ll b/llvm/test/Transforms/LoopVectorize/AArch64/i1-reg-usage.ll
index 69af51deea08ea9..0ec90b75002cd0f 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/i1-reg-usage.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/i1-reg-usage.ll
@@ -8,8 +8,8 @@ target triple = "aarch64"
; CHECK-LABEL: LV: Checking a loop in 'or_reduction_neon' from <stdin>
; CHECK: LV(REG): VF = 32
; CHECK-NEXT: LV(REG): Found max usage: 2 item
+; CHECK-NEXT: LV(REG): RegisterClass: Generic::ScalarRC, 2 registers
; CHECK-NEXT: LV(REG): RegisterClass: Generic::VectorRC, 72 registers
-; CHECK-NEXT: LV(REG): RegisterClass: Generic::ScalarRC, 1 registers
define i1 @or_reduction_neon(i32 %arg, ptr %ptr) {
entry:
@@ -31,8 +31,8 @@ loop:
; CHECK-LABEL: LV: Checking a loop in 'or_reduction_sve'
; CHECK: LV(REG): VF = 64
; CHECK-NEXT: LV(REG): Found max usage: 2 item
+; CHECK-NEXT: LV(REG): RegisterClass: Generic::ScalarRC, 2 registers
; CHECK-NEXT: LV(REG): RegisterClass: Generic::VectorRC, 136 registers
-; CHECK-NEXT: LV(REG): RegisterClass: Generic::ScalarRC, 1 registers
define i1 @or_reduction_sve(i32 %arg, ptr %ptr) vscale_range(2,2) "target-features"="+sve" {
entry:
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/reg-usage.ll b/llvm/test/Transforms/LoopVectorize/AArch64/reg-usage.ll
index 8239d32445c1054..feb0f03caa5033c 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/reg-usage.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/reg-usage.ll
@@ -14,11 +14,11 @@
define void @get_invariant_reg_usage(ptr %z) {
; CHECK: LV: Checking a loop in 'get_invariant_reg_usage'
; CHECK: LV(REG): VF = vscale x 16
-; CHECK-NEXT: LV(REG): Found max usage: 1 item
-; CHECK-NEXT: LV(REG): RegisterClass: Generic::ScalarRC, 3 registers
-; CHECK-NEXT: LV(REG): Found invariant usage: 2 item
+; CHECK-NEXT: LV(REG): Found max usage: 2 item
; CHECK-NEXT: LV(REG): RegisterClass: Generic::ScalarRC, 2 registers
-; CHECK-NEXT: LV(REG): RegisterClass: Generic::VectorRC, 8 registers
+; CHECK-NEXT: LV(REG): RegisterClass: Generic::VectorRC, 1 registers
+; CHECK-NEXT: LV(REG): Found invariant usage: 1 item
+; CHECK-NEXT: LV(REG): RegisterClass: Generic::ScalarRC, 3 registers
L.entry:
%0 = load i128, ptr %z, align 16
diff --git a/llvm/test/Transforms/LoopVectorize/LoongArch/reg-usage.ll b/llvm/test/Transforms/LoopVectorize/LoongArch/reg-usage.ll
index f45a2f0f5b7e8f3..5baf1e013a50f87 100644
--- a/llvm/test/Transforms/LoopVectorize/LoongArch/reg-usage.ll
+++ b/llvm/test/Transforms/LoopVectorize/LoongArch/reg-usage.ll
@@ -9,16 +9,18 @@
define void @bar(ptr %A, i32 signext %n) {
; CHECK-LABEL: bar
; CHECK-SCALAR: LV(REG): Found max usage: 2 item
-; CHECK-SCALAR-NEXT: LV(REG): RegisterClass: LoongArch::GPRRC, 2 registers
+; CHECK-SCALAR-NEXT: LV(REG): RegisterClass: LoongArch::GPRRC, 3 registers
; CHECK-SCALAR-NEXT: LV(REG): RegisterClass: LoongArch::FPRRC, 1 registers
; CHECK-SCALAR-NEXT: LV(REG): Found invariant usage: 1 item
; CHECK-SCALAR-NEXT: LV(REG): RegisterClass: LoongArch::GPRRC, 1 registers
; CHECK-SCALAR-NEXT: LV: The target has 30 registers of LoongArch::GPRRC register class
; CHECK-SCALAR-NEXT: LV: The target has 32 registers of LoongArch::FPRRC register class
-; CHECK-VECTOR: LV(REG): Found max usage: 1 item
-; CHECK-VECTOR-NEXT: LV(REG): RegisterClass: LoongArch::VRRC, 3 registers
+; CHECK-VECTOR: LV(REG): Found max usage: 2 item
+; CHECK-VECTOR-NEXT: LV(REG): RegisterClass: LoongArch::GPRRC, 2 registers
+; CHECK-VECTOR-NEXT: LV(REG): RegisterClass: LoongArch::VRRC, 2 registers
; CHECK-VECTOR-NEXT: LV(REG): Found invariant usage: 1 item
; CHECK-VECTOR-NEXT: LV(REG): RegisterClass: LoongArch::GPRRC, 1 registers
+; CHECK-VECTOR-NEXT: LV: The target has 30 registers of LoongArch::GPRRC register class
; CHECK-VECTOR-NEXT: LV: The target has 32 registers of LoongArch::VRRC register class
entry:
diff --git a/llvm/test/Transforms/LoopVectorize/PowerPC/exit-branch-cost.ll b/llvm/test/Transforms/LoopVectorize/PowerPC/exit-branch-cost.ll
index e5717c4f1d91ab8..cc1e5f9a68593cc 100644
--- a/llvm/test/Transforms/LoopVectorize/PowerPC/exit-branch-cost.ll
+++ b/llvm/test/Transforms/LoopVectorize/PowerPC/exit-branch-cost.ll
@@ -18,10 +18,10 @@ define i1 @select_exit_cond(ptr %start, ptr %end, i64 %N) {
; CHECK-NEXT: [[MIN_ITERS_CHECK1:%.*]] = icmp ult i64 [[TMP2]], 2
; CHECK-NEXT: br i1 [[MIN_ITERS_CHECK1]], label %[[VEC_EPILOG_SCALAR_PH:.*]], label %[[VECTOR_MAIN_LOOP_ITER_CHECK:.*]]
; CHECK: [[VECTOR_MAIN_LOOP_ITER_CHECK]]:
-; CHECK-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[TMP2]], 16
-; CHECK-NEXT: br i1 [[MIN_ITERS_CHECK]], label %[[VEC_EPILOG_PH:.*]], label %[[VECTOR_PH:.*]]
+; CHECK-NEXT: [[MIN_ITERS_CHECK3:%.*]] = icmp ult i64 [[TMP2]], 24
+; CHECK-NEXT: br i1 [[MIN_ITERS_CHECK3]], label %[[VEC_EPILOG_PH:.*]], label %[[VECTOR_PH:.*]]
; CHECK: [[VECTOR_PH]]:
-; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[TMP2]], 16
+; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[TMP2]], 24
; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 [[TMP2]], [[N_MOD_VF]]
; CHECK-NEXT: br label %[[VECTOR_BODY:.*]]
; CHECK: [[VECTOR_BODY]]:
@@ -35,6 +35,10 @@ define i1 @select_exit_cond(ptr %start, ptr %end, i64 %N) {
; CHECK-NEXT: [[VEC_PHI15:%.*]] = phi <2 x i64> [ zeroinitializer, %[[VECTOR_PH]] ], [ [[TMP48:%.*]], %[[VECTOR_BODY]] ]
; CHECK-NEXT: [[VEC_PHI16:%.*]] = phi <2 x i64> [ zeroinitializer, %[[VECTOR_PH]] ], [ [[TMP49:%.*]], %[[VECTOR_BODY]] ]
; CHECK-NEXT: [[VEC_PHI17:%.*]] = phi <2 x i64> [ zeroinitializer, %[[VECTOR_PH]] ], [ [[TMP50:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT: [[VEC_PHI18:%.*]] = phi <2 x i64> [ zeroinitializer, %[[VECTOR_PH]] ], [ [[TMP64:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT: [[VEC_PHI19:%.*]] = phi <2 x i64> [ zeroinitializer, %[[VECTOR_PH]] ], [ [[TMP65:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT: [[VEC_PHI20:%.*]] = phi <2 x i64> [ zeroinitializer, %[[VECTOR_PH]] ], [ [[TMP66:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT: [[VEC_PHI21:%.*]] = phi <2 x i64> [ zeroinitializer, %[[VECTOR_PH]] ], [ [[TMP67:%.*]], %[[VECTOR_BODY]] ]
; CHECK-NEXT: [[STEP_ADD:%.*]] = add <2 x i64> [[VEC_IND]], splat (i64 2)
; CHECK-NEXT: [[STEP_ADD_2:%.*]] = add <2 x i64> [[STEP_ADD]], splat (i64 2)
; CHECK-NEXT: [[STEP_ADD_3:%.*]] = add <2 x i64> [[STEP_ADD_2]], splat (i64 2)
@@ -42,6 +46,10 @@ define i1 @select_exit_cond(ptr %start, ptr %end, i64 %N) {
; CHECK-NEXT: [[STEP_ADD_5:%.*]] = add <2 x i64> [[STEP_ADD_4]], splat (i64 2)
; CHECK-NEXT: [[STEP_ADD_6:%.*]] = add <2 x i64> [[STEP_ADD_5]], splat (i64 2)
; CHECK-NEXT: [[STEP_ADD_7:%.*]] = add <2 x i64> [[STEP_ADD_6]], splat (i64 2)
+; CHECK-NEXT: [[STEP_ADD_8:%.*]] = add <2 x i64> [[STEP_ADD_7]], splat (i64 2)
+; CHECK-NEXT: [[STEP_ADD_9:%.*]] = add <2 x i64> [[STEP_ADD_8]], splat (i64 2)
+; CHECK-NEXT: [[STEP_ADD_10:%.*]] = add <2 x i64> [[STEP_ADD_9]], splat (i64 2)
+; CHECK-NEXT: [[STEP_ADD_11:%.*]] = add <2 x i64> [[STEP_ADD_10]], splat (i64 2)
; CHECK-NEXT: [[TMP3:%.*]] = add i64 [[INDEX]], 0
; CHECK-NEXT: [[NEXT_GEP:%.*]] = getelementptr i8, ptr [[START]], i64 [[TMP3]]
; CHECK-NEXT: [[TMP11:%.*]] = getelementptr i8, ptr [[NEXT_GEP]], i32 0
@@ -52,6 +60,10 @@ define i1 @select_exit_cond(ptr %start, ptr %end, i64 %N) {
; CHECK-NEXT: [[TMP16:%.*]] = getelementptr i8, ptr [[NEXT_GEP]], i32 10
; CHECK-NEXT: [[TMP17:%.*]] = getelementptr i8, ptr [[NEXT_GEP]], i32 12
; CHECK-NEXT: [[TMP18:%.*]] = getelementptr i8, ptr [[NEXT_GEP]], i32 14
+; CHECK-NEXT: [[TMP68:%.*]] = getelementptr i8, ptr [[NEXT_GEP]], i32 16
+; CHECK-NEXT: [[TMP69:%.*]] = getelementptr i8, ptr [[NEXT_GEP]], i32 18
+; CHECK-NEXT: [[TMP70:%.*]] = getelementptr i8, ptr [[NEXT_GEP]], i32 20
+; CHECK-NEXT: [[TMP71:%.*]] = getelementptr i8, ptr [[NEXT_GEP]], i32 22
; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <2 x i8>, ptr [[TMP11]], align 1
; CHECK-NEXT: [[WIDE_LOAD25:%.*]] = load <2 x i8>, ptr [[TMP12]], align 1
; CHECK-NEXT: [[WIDE_LOAD26:%.*]] = load <2 x i8>, ptr [[TMP13]], align 1
@@ -60,6 +72,10 @@ define i1 @select_exit_cond(ptr %s...
[truncated]
|
Nice! I had also been looking into doing this as well, to address this issue where we end up with a lot of spilling due to the VF being too high: https://godbolt.org/z/P3j7978Pq I think with this change we could use it in more places for RISC-V to select a better MaxVF.
// We also search for instructions that are defined outside the loop, but are
// used inside the loop. We need this number separately from the max-interval
// usage number because when we unroll, loop-invariant values do not take
// more register. |
register -> registers.
Done thanks
// recipe that is the key.
using IntervalMap = SmallDenseMap<VPRecipeBase *, unsigned, 16>;

// Maps recipe to its index. |
It might be more correct to say it maps the index to the recipe.
Updated thanks
// Ignore non-recipe values such as arguments, constants, etc.
// FIXME: Might need some motivation why these values are ignored. If
// for example an argument is used inside the loop it will increase the
// register pressure (so shouldn't we add it to LoopInvariants).
if (!DefR && (!U->getLiveInIRValue() ||
!isa<Instruction>(U->getLiveInIRValue())))
continue; |
Does adding these values to the invariants list cause lots of extra work or issues that can't be done in this PR?
This just ports the code + FIXME from the original implementation to keep this as close to NFC as possible. Happy to remove the restriction as follow-up
Have you had time to think about my review comments? I'm working on a patch on top of this so it would be good to get it landed soon.
Thanks, I'll get to them soon, probably later this week. I am at the moment catching up on a number of reviews.
Thank you, that sounds good to me. In the meantime, would you mind creating a user branch in this repo with this work in it, or alternatively would you mind if I create one under my user name? It would be good to have one so I can create a stacked PR.
Force-pushed 72e5409 to 07b2a1e.
Based on fhahn's work at llvm#126437 . This PR moves the register usage checking to after the plans are created, so that any recipes that optimise register usage (such as partial reductions) can be properly costed and not have their VF pruned unnecessarily. It involves changing some tests, notably removing one from mve-known-tripcount.ll due to it not being vectorisable thanks to high register pressure. tail-folding-reduces-vf.ll was modified to reduce its register pressure but still test what was intended.
Looks good to me, thank you! I've pushed my PR that removes the old calculateRegisterUsage function here: #132190
if (!Ends.count(R) && !R->mayHaveSideEffects())
continue; |
I think we are missing code here to also skip ephemeral values, which the legacy version handled properly. Still need to figure out the best way to fix this.
Handled this for now by just checking the existing ValuesToIgnore, as there doesn't seem to be a nice alternative for matching the behavior by directly checking the info in VPlan without much extra work.
Force-pushed 07b2a1e to 732c916.
Latest version includes handling for ephemeral users (like assume). I am planning to merge this once the CI is happy.
Thanks for updating! This looks largely good to me, just have a few minor comments.
Force-pushed 732c916 to 0c2c8e1.
Thanks for addressing the comments, LGTM!
LLVM Buildbot has detected a new failure. Full details are available at: https://lab.llvm.org/buildbot/#/builders/186/builds/8036
This PR accounts for scaled reductions in `calculateRegisterUsage` to reflect the fact that the number of lanes in their output is smaller than the VF. Depends on #126437
Hi @fhahn, we are seeing a perf regression (around ~7%) in a TSVC benchmark, s351, due to this change when compiled with -Ofast on a neoverse-v2 machine. From the assembly I can see there is less interleaving in the hot loop after this change, maybe that's causing the slowdown? How do you reckon we can go about fixing this? Thanks!
@yashssh thanks for the heads-up. Could you share a full reproducer with the LLVM IR before loop-vectorize?
Sure, here's the IR reproducer. The C++ function and the compiler flags used to generate the IR are here: https://godbolt.org/z/ebv3nc4hY
I took a quick look at this and my guess is that it's due to the interleave group recipes hoisting the loads and extending their live ranges. With the old legacy model, the loads are located beside their fmul/fadd uses:
%indvars.iv = phi i64 [ 0, %.preheader ], [ %indvars.iv.next, %5 ]
%.118 = phi float [ 0.000000e+00, %.preheader ], [ %39, %5 ]
%6 = getelementptr inbounds nuw [32000 x float], ptr @a, i64 0, i64 %indvars.iv
%7 = load float, ptr %6, align 4, !tbaa !8
%8 = getelementptr inbounds nuw [32000 x float], ptr @b, i64 0, i64 %indvars.iv
%9 = load float, ptr %8, align 4, !tbaa !8
%10 = fmul fast float %9, %7
%11 = fadd fast float %10, %.118
%12 = add nuw nsw i64 %indvars.iv, 1
%13 = getelementptr inbounds nuw [32000 x float], ptr @a, i64 0, i64 %12
%14 = load float, ptr %13, align 4, !tbaa !8
%15 = getelementptr inbounds nuw [32000 x float], ptr @b, i64 0, i64 %12
%16 = load float, ptr %15, align 4, !tbaa !8
%17 = fmul fast float %16, %14
%18 = fadd fast float %11, %17
%19 = add nuw nsw i64 %indvars.iv, 2
%20 = getelementptr inbounds nuw [32000 x float], ptr @a, i64 0, i64 %19
%21 = load float, ptr %20, align 4, !tbaa !8
%22 = getelementptr inbounds nuw [32000 x float], ptr @b, i64 0, i64 %19
%23 = load float, ptr %22, align 4, !tbaa !8
%24 = fmul fast float %23, %21
%25 = fadd fast float %18, %24
%26 = add nuw nsw i64 %indvars.iv, 3
%27 = getelementptr inbounds nuw [32000 x float], ptr @a, i64 0, i64 %26
%28 = load float, ptr %27, align 4, !tbaa !8
%29 = getelementptr inbounds nuw [32000 x float], ptr @b, i64 0, i64 %26
%30 = load float, ptr %29, align 4, !tbaa !8
%31 = fmul fast float %30, %28
%32 = fadd fast float %25, %31
%33 = add nuw nsw i64 %indvars.iv, 4
%34 = getelementptr inbounds nuw [32000 x float], ptr @a, i64 0, i64 %33
%35 = load float, ptr %34, align 4, !tbaa !8
%36 = getelementptr inbounds nuw [32000 x float], ptr @b, i64 0, i64 %33
%37 = load float, ptr %36, align 4, !tbaa !8
%38 = fmul fast float %37, %35
%39 = fadd fast float %32, %38
%indvars.iv.next = add nuw nsw i64 %indvars.iv, 5
%40 = icmp samesign ult i64 %indvars.iv, 31995
br i1 %40, label %5, label %3, !llvm.loop !12
In VPlan the loads are all now in one place, and there are now 10 loads alive at the same time:
vector.body:
EMIT vp<%4> = CANONICAL-INDUCTION ir<0>, vp<%index.next>
ir<%indvars.iv> = WIDEN-INDUCTION ir<0>, ir<5>, vp<%0>
WIDEN-REDUCTION-PHI ir<%.118> = phi ir<0.000000e+00>, ir<%39>
CLONE ir<%6> = getelementptr inbounds nuw ir<@a>, ir<0>, ir<%indvars.iv>
INTERLEAVE-GROUP with factor 5 at %7, ir<%6>
ir<%7> = load from index 0
ir<%14> = load from index 1
ir<%21> = load from index 2
ir<%28> = load from index 3
ir<%35> = load from index 4
CLONE ir<%8> = getelementptr inbounds nuw ir<@b>, ir<0>, ir<%indvars.iv>
INTERLEAVE-GROUP with factor 5 at %9, ir<%8>
ir<%9> = load from index 0
ir<%16> = load from index 1
ir<%23> = load from index 2
ir<%30> = load from index 3
ir<%37> = load from index 4
WIDEN ir<%10> = fmul fast ir<%9>, ir<%7>
WIDEN ir<%11> = fadd fast ir<%10>, ir<%.118>
CLONE ir<%12> = add nuw nsw ir<%indvars.iv>, ir<1>
CLONE ir<%13> = getelementptr inbounds nuw ir<@a>, ir<0>, ir<%12>
CLONE ir<%15> = getelementptr inbounds nuw ir<@b>, ir<0>, ir<%12>
WIDEN ir<%17> = fmul fast ir<%16>, ir<%14>
WIDEN ir<%18> = fadd fast ir<%11>, ir<%17>
CLONE ir<%19> = add nuw nsw ir<%indvars.iv>, ir<2>
CLONE ir<%20> = getelementptr inbounds nuw ir<@a>, ir<0>, ir<%19>
CLONE ir<%22> = getelementptr inbounds nuw ir<@b>, ir<0>, ir<%19>
WIDEN ir<%24> = fmul fast ir<%23>, ir<%21>
WIDEN ir<%25> = fadd fast ir<%18>, ir<%24>
CLONE ir<%26> = add nuw nsw ir<%indvars.iv>, ir<3>
CLONE ir<%27> = getelementptr inbounds nuw ir<@a>, ir<0>, ir<%26>
CLONE ir<%29> = getelementptr inbounds nuw ir<@b>, ir<0>, ir<%26>
WIDEN ir<%31> = fmul fast ir<%30>, ir<%28>
WIDEN ir<%32> = fadd fast ir<%25>, ir<%31>
CLONE ir<%33> = add nuw nsw ir<%indvars.iv>, ir<4>
CLONE ir<%34> = getelementptr inbounds nuw ir<@a>, ir<0>, ir<%33>
CLONE ir<%36> = getelementptr inbounds nuw ir<@b>, ir<0>, ir<%33>
WIDEN ir<%38> = fmul fast ir<%37>, ir<%35>
WIDEN ir<%39> = fadd fast ir<%32>, ir<%38>
CLONE ir<%indvars.iv.next> = add nuw nsw ir<%indvars.iv>, ir<5>
CLONE ir<%40> = icmp ult ir<%indvars.iv>, ir<31995>
So whilst the legacy model computed a maximum of 3 vector registers used, the VPlan one now calculates 12.
I'm not sure what a good fix for this is.
Can't we instruct VPlan to spread out loads near their uses like the legacy model?
I think that would be quite difficult because an interleave group is a single recipe with multiple values, if I understand correctly. I think the VPlan model is actually being more accurate with regard to the IR that the loop vectorizer emits, i.e. with the loads bundled together. It's more of a coincidence that the legacy version happened to match more closely what the code looked like after scheduling, once the loads were sunk?
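To make the live-range effect concrete, here is a small illustrative sketch (plain C++, not LLVM code; the schedule positions are idealized from the dumps above, and the accumulators/IVs are ignored) that counts how many of the ten load results are live at once under each ordering:

#include <algorithm>
#include <cstdio>
#include <vector>

struct Interval { unsigned Def, End; }; // live at P iff Def <= P < End

static unsigned maxLive(const std::vector<Interval> &Ivs, unsigned Len) {
  unsigned Best = 0;
  for (unsigned Pos = 0; Pos < Len; ++Pos) {
    unsigned Live = 0;
    for (const Interval &I : Ivs)
      Live += (I.Def <= Pos && Pos < I.End);
    Best = std::max(Best, Live);
  }
  return Best;
}

int main() {
  // VPlan-style: the two interleave groups emit all 10 loads first (positions
  // 0-9); the fmul/fadd pair K consumes a-load K and b-load K at 10 + 2K.
  std::vector<Interval> Bundled;
  for (unsigned K = 0; K < 5; ++K) {
    Bundled.push_back({K, 11 + 2 * K});     // a-load K
    Bundled.push_back({5 + K, 11 + 2 * K}); // b-load K
  }
  // Legacy-style: pair K loads a and b right before its own fmul/fadd
  // (loads at positions 4K and 4K+1, consumed at 4K+2).
  std::vector<Interval> Interleaved;
  for (unsigned K = 0; K < 5; ++K) {
    Interleaved.push_back({4 * K, 4 * K + 3});
    Interleaved.push_back({4 * K + 1, 4 * K + 3});
  }
  std::printf("bundled:     %u loads live at once\n", maxLive(Bundled, 20));
  std::printf("interleaved: %u loads live at once\n", maxLive(Interleaved, 20));
  return 0;
}

This prints 10 live loads for the bundled schedule versus 2 for the interleaved one, which is roughly the jump from 3 to 12 registers the new model reports once the reduction phis and induction are added back in.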
Add a version of calculateRegisterUsage that estimates register
usage for a VPlan. This mostly just ports the existing code, with some
updates to figure out what recipes will generate vectors vs scalars.
There are a number of changes in the computed register usages, but they
should be more accurate w.r.t. the generated vector code.
There are the following changes:
Scalar usage increases in most cases by 1, as we always create a
scalar canonical IV, which is alive across the loop and is not
considered by the legacy implementation
Output is ordered by insertion; now scalar registers are added first
due to the canonical IV phi.
Using the VPlan, we now also more precisely know if an induction will
be vectorized or scalarized.
Depends on #126415