[LoopVectorizer] Prune VFs based on plan register pressure #132190
Conversation
@llvm/pr-subscribers-backend-powerpc @llvm/pr-subscribers-backend-risc-v

Author: Sam Tebbs (SamTebbs33)

Changes: Based on fhahn's work at #126437. This PR moves the register usage checking to after the plans are […]. It involves changing some tests, notably removing one from […].

Patch is 186.33 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/132190.diff

24 Files Affected:
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 89337dc385350..3574aa31195f0 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -1018,7 +1018,8 @@ class LoopVectorizationCostModel {
/// If interleave count has been specified by metadata it will be returned.
/// Otherwise, the interleave count is computed and returned. VF and LoopCost
/// are the selected vectorization factor and the cost of the selected VF.
- unsigned selectInterleaveCount(ElementCount VF, InstructionCost LoopCost);
+ unsigned selectInterleaveCount(VPlan &Plan, ElementCount VF,
+ InstructionCost LoopCost);
/// Memory access instruction may be vectorized in more than one way.
/// Form of instruction after vectorization depends on cost.
@@ -1047,11 +1048,6 @@ class LoopVectorizationCostModel {
SmallMapVector<unsigned, unsigned, 4> MaxLocalUsers;
};
- /// \return Returns information about the register usages of the loop for the
- /// given vectorization factors.
- SmallVector<RegisterUsage, 8>
- calculateRegisterUsage(ArrayRef<ElementCount> VFs);
-
/// Collect values we want to ignore in the cost model.
void collectValuesToIgnore();
@@ -4239,27 +4235,12 @@ ElementCount LoopVectorizationCostModel::getMaximizedVFForTarget(
ComputeScalableMaxVF);
MaxVectorElementCountMaxBW = MinVF(MaxVectorElementCountMaxBW, MaxSafeVF);
- // Collect all viable vectorization factors larger than the default MaxVF
- // (i.e. MaxVectorElementCount).
- SmallVector<ElementCount, 8> VFs;
+ // Set the max VF to the largest viable vectorization factor less than or
+ // equal to the max vector element count.
for (ElementCount VS = MaxVectorElementCount * 2;
ElementCount::isKnownLE(VS, MaxVectorElementCountMaxBW); VS *= 2)
- VFs.push_back(VS);
-
- // For each VF calculate its register usage.
- auto RUs = calculateRegisterUsage(VFs);
+ MaxVF = VS;
- // Select the largest VF which doesn't require more registers than existing
- // ones.
- for (int I = RUs.size() - 1; I >= 0; --I) {
- const auto &MLU = RUs[I].MaxLocalUsers;
- if (all_of(MLU, [&](decltype(MLU.front()) &LU) {
- return LU.second <= TTI.getNumberOfRegisters(LU.first);
- })) {
- MaxVF = VFs[I];
- break;
- }
- }
if (ElementCount MinVF =
TTI.getMinimumVF(SmallestType, ComputeScalableMaxVF)) {
if (ElementCount::isKnownLT(MaxVF, MinVF)) {
@@ -4885,8 +4866,245 @@ void LoopVectorizationCostModel::collectElementTypesForWidening() {
}
}
+/// Estimate the register usage for \p Plan and vectorization factors in \p VFs.
+/// Returns the register usage for each VF in \p VFs.
+static SmallVector<LoopVectorizationCostModel::RegisterUsage, 8>
+calculateRegisterUsage(VPlan &Plan, ArrayRef<ElementCount> VFs,
+ const TargetTransformInfo &TTI) {
+ // This function calculates the register usage by measuring the highest number
+ // of values that are alive at a single location. Obviously, this is a very
+ // rough estimation. We scan the loop in a topological order in order and
+ // assign a number to each recipe. We use RPO to ensure that defs are
+ // met before their users. We assume that each recipe that has in-loop
+ // users starts an interval. We record every time that an in-loop value is
+ // used, so we have a list of the first and last occurrences of each
+ // recipe. Next, we transpose this data structure into a multi map that
+ // holds the list of intervals that *end* at a specific location. This multi
+ // map allows us to perform a linear search. We scan the instructions linearly
+ // and record each time that a new interval starts, by placing it in a set.
+ // If we find this value in the multi-map then we remove it from the set.
+ // The max register usage is the maximum size of the set.
+ // We also search for instructions that are defined outside the loop, but are
+ // used inside the loop. We need this number separately from the max-interval
+ // usage number because when we unroll, loop-invariant values do not take
+ // more register.
+ LoopVectorizationCostModel::RegisterUsage RU;
+
+ // Each 'key' in the map opens a new interval. The values
+ // of the map are the index of the 'last seen' usage of the
+ // recipe that is the key.
+ using IntervalMap = SmallDenseMap<VPRecipeBase *, unsigned, 16>;
+
+ // Maps indices to recipes.
+ SmallVector<VPRecipeBase *, 64> Idx2Recipe;
+ // Marks the end of each interval.
+ IntervalMap EndPoint;
+ // Saves the list of recipe indices that are used in the loop.
+ SmallPtrSet<VPRecipeBase *, 8> Ends;
+ // Saves the list of values that are used in the loop but are defined outside
+ // the loop (not including non-recipe values such as arguments and
+ // constants).
+ SmallSetVector<VPValue *, 8> LoopInvariants;
+ LoopInvariants.insert(&Plan.getVectorTripCount());
+
+ ReversePostOrderTraversal<VPBlockDeepTraversalWrapper<VPBlockBase *>> RPOT(
+ Plan.getVectorLoopRegion());
+ for (VPBasicBlock *VPBB : VPBlockUtils::blocksOnly<VPBasicBlock>(RPOT)) {
+ if (!VPBB->getParent())
+ break;
+ for (VPRecipeBase &R : *VPBB) {
+ Idx2Recipe.push_back(&R);
+
+ // Save the end location of each USE.
+ for (VPValue *U : R.operands()) {
+ auto *DefR = U->getDefiningRecipe();
+
+ // Ignore non-recipe values such as arguments, constants, etc.
+ // FIXME: Might need some motivation why these values are ignored. If
+ // for example an argument is used inside the loop it will increase the
+ // register pressure (so shouldn't we add it to LoopInvariants).
+ if (!DefR && (!U->getLiveInIRValue() ||
+ !isa<Instruction>(U->getLiveInIRValue())))
+ continue;
+
+ // If this recipe is outside the loop then record it and continue.
+ if (!DefR) {
+ LoopInvariants.insert(U);
+ continue;
+ }
+
+ // Overwrite previous end points.
+ EndPoint[DefR] = Idx2Recipe.size();
+ Ends.insert(DefR);
+ }
+ }
+ if (VPBB == Plan.getVectorLoopRegion()->getExiting()) {
+ // VPWidenIntOrFpInductionRecipes are used implicitly at the end of the
+ // exiting block, where their increment will get materialized eventually.
+ for (auto &R : Plan.getVectorLoopRegion()->getEntryBasicBlock()->phis()) {
+ if (isa<VPWidenIntOrFpInductionRecipe>(&R)) {
+ EndPoint[&R] = Idx2Recipe.size();
+ Ends.insert(&R);
+ }
+ }
+ }
+ }
+
+ // Saves the list of intervals that end with the index in 'key'.
+ using RecipeList = SmallVector<VPRecipeBase *, 2>;
+ SmallDenseMap<unsigned, RecipeList, 16> TransposeEnds;
+
+ // Transpose the EndPoints to a list of values that end at each index.
+ for (auto &Interval : EndPoint)
+ TransposeEnds[Interval.second].push_back(Interval.first);
+
+ SmallPtrSet<VPRecipeBase *, 8> OpenIntervals;
+ SmallVector<LoopVectorizationCostModel::RegisterUsage, 8> RUs(VFs.size());
+ SmallVector<SmallMapVector<unsigned, unsigned, 4>, 8> MaxUsages(VFs.size());
+
+ LLVM_DEBUG(dbgs() << "LV(REG): Calculating max register usage:\n");
+
+ VPTypeAnalysis TypeInfo(Plan.getCanonicalIV()->getScalarType());
+
+ const auto &TTICapture = TTI;
+ auto GetRegUsage = [&TTICapture](Type *Ty, ElementCount VF) -> unsigned {
+ if (Ty->isTokenTy() || !VectorType::isValidElementType(Ty) ||
+ (VF.isScalable() &&
+ !TTICapture.isElementTypeLegalForScalableVector(Ty)))
+ return 0;
+ return TTICapture.getRegUsageForType(VectorType::get(Ty, VF));
+ };
+
+ for (unsigned int Idx = 0, Sz = Idx2Recipe.size(); Idx < Sz; ++Idx) {
+ VPRecipeBase *R = Idx2Recipe[Idx];
+
+ // Remove all of the recipes that end at this location.
+ RecipeList &List = TransposeEnds[Idx];
+ for (VPRecipeBase *ToRemove : List)
+ OpenIntervals.erase(ToRemove);
+
+ // Ignore recipes that are never used within the loop.
+ if (!Ends.count(R) && !R->mayHaveSideEffects())
+ continue;
+
+ // For each VF find the maximum usage of registers.
+ for (unsigned J = 0, E = VFs.size(); J < E; ++J) {
+ // Count the number of registers used, per register class, given all open
+ // intervals.
+ // Note that elements in this SmallMapVector will be default constructed
+ // as 0. So we can use "RegUsage[ClassID] += n" in the code below even if
+ // there is no previous entry for ClassID.
+ SmallMapVector<unsigned, unsigned, 4> RegUsage;
+
+ if (VFs[J].isScalar()) {
+ for (auto *Inst : OpenIntervals) {
+ for (VPValue *DefV : Inst->definedValues()) {
+ unsigned ClassID = TTI.getRegisterClassForType(
+ false, TypeInfo.inferScalarType(DefV));
+ // FIXME: The target might use more than one register for the type
+ // even in the scalar case.
+ RegUsage[ClassID] += 1;
+ }
+ }
+ } else {
+ for (auto *R : OpenIntervals) {
+ if (isa<VPVectorPointerRecipe, VPVectorEndPointerRecipe>(R))
+ continue;
+ if (isa<VPCanonicalIVPHIRecipe, VPReplicateRecipe, VPDerivedIVRecipe,
+ VPScalarIVStepsRecipe>(R) ||
+ (isa<VPInstruction>(R) &&
+ all_of(cast<VPSingleDefRecipe>(R)->users(), [&](VPUser *U) {
+ return cast<VPRecipeBase>(U)->usesScalars(
+ R->getVPSingleValue());
+ }))) {
+ unsigned ClassID = TTI.getRegisterClassForType(
+ false, TypeInfo.inferScalarType(R->getVPSingleValue()));
+ // FIXME: The target might use more than one register for the type
+ // even in the scalar case.
+ RegUsage[ClassID] += 1;
+ } else {
+ // The output from scaled phis and scaled reductions actually have
+ // fewer lanes than the VF.
+ auto VF = VFs[J];
+ if (auto *ReductionR = dyn_cast<VPReductionPHIRecipe>(R))
+ VF = VF.divideCoefficientBy(ReductionR->getVFScaleFactor());
+ else if (auto *PartialReductionR =
+ dyn_cast<VPPartialReductionRecipe>(R))
+ VF = VF.divideCoefficientBy(PartialReductionR->getScaleFactor());
+ if (VF != VFs[J])
+ LLVM_DEBUG(dbgs() << "LV(REG): Scaled down VF from " << VFs[J]
+ << " to " << VF << " for ";
+ R->dump(););
+
+ for (VPValue *DefV : R->definedValues()) {
+ Type *ScalarTy = TypeInfo.inferScalarType(DefV);
+ unsigned ClassID = TTI.getRegisterClassForType(true, ScalarTy);
+ RegUsage[ClassID] += GetRegUsage(ScalarTy, VF);
+ }
+ }
+ }
+ }
+
+ for (const auto &Pair : RegUsage) {
+ auto &Entry = MaxUsages[J][Pair.first];
+ Entry = std::max(Entry, Pair.second);
+ }
+ }
+
+ LLVM_DEBUG(dbgs() << "LV(REG): At #" << Idx << " Interval # "
+ << OpenIntervals.size() << '\n');
+
+ // Add the current recipe to the list of open intervals.
+ OpenIntervals.insert(R);
+ }
+
+ for (unsigned Idx = 0, End = VFs.size(); Idx < End; ++Idx) {
+ // Note that elements in this SmallMapVector will be default constructed
+ // as 0. So we can use "Invariant[ClassID] += n" in the code below even if
+ // there is no previous entry for ClassID.
+ SmallMapVector<unsigned, unsigned, 4> Invariant;
+
+ for (auto *In : LoopInvariants) {
+ // FIXME: The target might use more than one register for the type
+ // even in the scalar case.
+ bool IsScalar = all_of(In->users(), [&](VPUser *U) {
+ return cast<VPRecipeBase>(U)->usesScalars(In);
+ });
+
+ ElementCount VF = IsScalar ? ElementCount::getFixed(1) : VFs[Idx];
+ unsigned ClassID = TTI.getRegisterClassForType(
+ VF.isVector(), TypeInfo.inferScalarType(In));
+ Invariant[ClassID] += GetRegUsage(TypeInfo.inferScalarType(In), VF);
+ }
+
+ LLVM_DEBUG({
+ dbgs() << "LV(REG): VF = " << VFs[Idx] << '\n';
+ dbgs() << "LV(REG): Found max usage: " << MaxUsages[Idx].size()
+ << " item\n";
+ for (const auto &pair : MaxUsages[Idx]) {
+ dbgs() << "LV(REG): RegisterClass: "
+ << TTI.getRegisterClassName(pair.first) << ", " << pair.second
+ << " registers\n";
+ }
+ dbgs() << "LV(REG): Found invariant usage: " << Invariant.size()
+ << " item\n";
+ for (const auto &pair : Invariant) {
+ dbgs() << "LV(REG): RegisterClass: "
+ << TTI.getRegisterClassName(pair.first) << ", " << pair.second
+ << " registers\n";
+ }
+ });
+
+ RU.LoopInvariantRegs = Invariant;
+ RU.MaxLocalUsers = MaxUsages[Idx];
+ RUs[Idx] = RU;
+ }
+
+ return RUs;
+}
+
unsigned
-LoopVectorizationCostModel::selectInterleaveCount(ElementCount VF,
+LoopVectorizationCostModel::selectInterleaveCount(VPlan &Plan, ElementCount VF,
InstructionCost LoopCost) {
// -- The interleave heuristics --
// We interleave the loop in order to expose ILP and reduce the loop overhead.
@@ -4936,7 +5154,7 @@ LoopVectorizationCostModel::selectInterleaveCount(ElementCount VF,
return 1;
}
- RegisterUsage R = calculateRegisterUsage({VF})[0];
+ RegisterUsage R = ::calculateRegisterUsage(Plan, {VF}, TTI)[0];
// We divide by these constants so assume that we have at least one
// instruction that uses at least one register.
for (auto &Pair : R.MaxLocalUsers) {
@@ -5169,213 +5387,6 @@ LoopVectorizationCostModel::selectInterleaveCount(ElementCount VF,
return 1;
}
-SmallVector<LoopVectorizationCostModel::RegisterUsage, 8>
-LoopVectorizationCostModel::calculateRegisterUsage(ArrayRef<ElementCount> VFs) {
- // This function calculates the register usage by measuring the highest number
- // of values that are alive at a single location. Obviously, this is a very
- // rough estimation. We scan the loop in a topological order in order and
- // assign a number to each instruction. We use RPO to ensure that defs are
- // met before their users. We assume that each instruction that has in-loop
- // users starts an interval. We record every time that an in-loop value is
- // used, so we have a list of the first and last occurrences of each
- // instruction. Next, we transpose this data structure into a multi map that
- // holds the list of intervals that *end* at a specific location. This multi
- // map allows us to perform a linear search. We scan the instructions linearly
- // and record each time that a new interval starts, by placing it in a set.
- // If we find this value in the multi-map then we remove it from the set.
- // The max register usage is the maximum size of the set.
- // We also search for instructions that are defined outside the loop, but are
- // used inside the loop. We need this number separately from the max-interval
- // usage number because when we unroll, loop-invariant values do not take
- // more register.
- LoopBlocksDFS DFS(TheLoop);
- DFS.perform(LI);
-
- RegisterUsage RU;
-
- // Each 'key' in the map opens a new interval. The values
- // of the map are the index of the 'last seen' usage of the
- // instruction that is the key.
- using IntervalMap = SmallDenseMap<Instruction *, unsigned, 16>;
-
- // Maps instruction to its index.
- SmallVector<Instruction *, 64> IdxToInstr;
- // Marks the end of each interval.
- IntervalMap EndPoint;
- // Saves the list of instruction indices that are used in the loop.
- SmallPtrSet<Instruction *, 8> Ends;
- // Saves the list of values that are used in the loop but are defined outside
- // the loop (not including non-instruction values such as arguments and
- // constants).
- SmallSetVector<Instruction *, 8> LoopInvariants;
-
- for (BasicBlock *BB : make_range(DFS.beginRPO(), DFS.endRPO())) {
- for (Instruction &I : BB->instructionsWithoutDebug()) {
- IdxToInstr.push_back(&I);
-
- // Save the end location of each USE.
- for (Value *U : I.operands()) {
- auto *Instr = dyn_cast<Instruction>(U);
-
- // Ignore non-instruction values such as arguments, constants, etc.
- // FIXME: Might need some motivation why these values are ignored. If
- // for example an argument is used inside the loop it will increase the
- // register pressure (so shouldn't we add it to LoopInvariants).
- if (!Instr)
- continue;
-
- // If this instruction is outside the loop then record it and continue.
- if (!TheLoop->contains(Instr)) {
- LoopInvariants.insert(Instr);
- continue;
- }
-
- // Overwrite previous end points.
- EndPoint[Instr] = IdxToInstr.size();
- Ends.insert(Instr);
- }
- }
- }
-
- // Saves the list of intervals that end with the index in 'key'.
- using InstrList = SmallVector<Instruction *, 2>;
- SmallDenseMap<unsigned, InstrList, 16> TransposeEnds;
-
- // Transpose the EndPoints to a list of values that end at each index.
- for (auto &Interval : EndPoint)
- TransposeEnds[Interval.second].push_back(Interval.first);
-
- SmallPtrSet<Instruction *, 8> OpenIntervals;
- SmallVector<RegisterUsage, 8> RUs(VFs.size());
- SmallVector<SmallMapVector<unsigned, unsigned, 4>, 8> MaxUsages(VFs.size());
-
- LLVM_DEBUG(dbgs() << "LV(REG): Calculating max register usage:\n");
-
- const auto &TTICapture = TTI;
- auto GetRegUsage = [&TTICapture](Type *Ty, ElementCount VF) -> unsigned {
- if (Ty->isTokenTy() || !VectorType::isValidElementType(Ty) ||
- (VF.isScalable() &&
- !TTICapture.isElementTypeLegalForScalableVector(Ty)))
- return 0;
- return TTICapture.getRegUsageForType(VectorType::get(Ty, VF));
- };
-
- collectInLoopReductions();
-
- for (unsigned int Idx = 0, Sz = IdxToInstr.size(); Idx < Sz; ++Idx) {
- Instruction *I = IdxToInstr[Idx];
-
- // Remove all of the instructions that end at this location.
- InstrList &List = TransposeEnds[Idx];
- for (Instruction *ToRemove : List)
- OpenIntervals.erase(ToRemove);
-
- // Ignore instructions that are never used within the loop and do not have
- // side-effects.
- if (!Ends.count(I) && !I->mayHaveSideEffects())
- continue;
-
- // Skip ignored values.
- if (ValuesToIgnore.count(I))
- continue;
-
- // For each VF find the maximum usage of registers.
- for (unsigned J = 0, E = VFs.size(); J < E; ++J) {
- // Count the number of registers used, per register class, given all open
- // intervals.
- // Note that elements in this SmallMapVector will be default constructed
- // as 0. So we can use "RegUsage[ClassID] += n" in the code below even if
- // there is no previous entry for ClassID.
- SmallMapVector<unsigned, unsigned, 4> RegUsage;
-
- if (VFs[J].isScalar()) {
- for (auto *Inst : OpenIntervals) {
- unsigned ClassID =
- TTI.getRegisterClassForType(false, Inst->getType());
- // FIXME: The target might use more than one register for the type
- // even in the scalar case.
- RegUsage[ClassID] += 1;
- }
- } else {
- collectUniformsAndScalars(VFs[J]);
- for (auto *Inst : OpenIntervals) {
- // Skip ignored values for VF > 1.
- if (VecValuesToIgnore.count(Inst))
- continue;
- if (isScalarAfterVectorization(Inst, VFs[J])) {
- unsigned ClassID =
- TTI.getRegisterClassForType(false, Inst->getType());
- // FIXME: The target might use more than one register for the type
- // even in the scalar case.
- RegUsage[ClassID] += 1;
- } else {
- unsigned ClassID =
- TTI.getRegisterClassForType(true, Inst->getType());
- RegUsage[ClassID] += GetRegUsage(Inst->getType(), VFs[J]);
- }
- }
- }
-
- for (const auto &Pair : RegUsage) {
- auto &Entry = MaxUsages[J][Pair.first];
- Entry = std::max(Entry, Pair.second);
- }
- }
-
- ...
[truncated]
✅ With the latest revision this PR passed the C/C++ code formatter.
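The live-interval scan that the patch documents in calculateRegisterUsage can be sketched in standalone form. This is a rough model under stated assumptions: instructions are plain indices in RPO order, there is a single register class, and loop invariants, side effects, and VF scaling are all omitted. Like the patch, the interval for a def closes one position after its last use, and pressure is sampled before the current def's interval opens:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <vector>

// Each "instruction" is its index in RPO order plus the indices of the
// defs whose values it uses (a stand-in for VPRecipeBase operands).
using Uses = std::vector<int>;

std::size_t maxLiveValues(const std::vector<Uses> &Insts) {
  // Last-use position of each def; later uses overwrite earlier ones.
  // I + 1 keeps a value open while the instruction reading it executes.
  std::map<int, int> EndPoint;
  for (int I = 0; I < static_cast<int>(Insts.size()); ++I)
    for (int Op : Insts[I])
      EndPoint[Op] = I + 1;

  // Transpose: for each location, the list of intervals that end there.
  std::map<int, std::vector<int>> TransposeEnds;
  for (const auto &KV : EndPoint)
    TransposeEnds[KV.second].push_back(KV.first);

  std::set<int> OpenIntervals;
  std::size_t Max = 0;
  for (int I = 0; I < static_cast<int>(Insts.size()); ++I) {
    for (int Dead : TransposeEnds[I]) // intervals ending here free a slot
      OpenIntervals.erase(Dead);
    if (!EndPoint.count(I))
      continue; // never used in-loop: opens no interval
    Max = std::max(Max, OpenIntervals.size()); // pressure at this point
    OpenIntervals.insert(I);
  }
  return Max;
}
```

maxLiveValues is a hypothetical name; in the patch the same walk feeds per-register-class RegUsage maps instead of a single set.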
; CHECK-NEXT: [[TMP6:%.*]] = tail call i32 @llvm.vector.reduce.add.v16i32(<16 x i32> [[TMP5]])
; CHECK-NEXT: [[TMP7:%.*]] = add i32 [[TMP6]], [[SUM_0_LCSSA]]
; CHECK-NEXT: br label [[WHILE_END5]]
; CHECK: vector.ph:
This looks like a possible regression to me. The previous code was optimal because it removed the loop entirely and hence there was no register pressure due to PHIs.
I've managed to fix this by ignoring in-loop reduction phis.
@@ -10,72 +10,29 @@ define void @QLA_F3_r_veq_norm2_V(ptr noalias %r, ptr noalias %a, i32 %n) {
; CHECK-SAME: ptr noalias [[R:%.*]], ptr noalias [[A:%.*]], i32 [[N:%.*]]) {
; CHECK-NEXT: [[ENTRY:.*]]:
; CHECK-NEXT: [[CMP24:%.*]] = icmp sgt i32 [[N]], 0
; CHECK-NEXT: br i1 [[CMP24]], label %[[ITER_CHECK:.*]], label %[[FOR_END13:.*]]
; CHECK: [[ITER_CHECK]]:
; CHECK-NEXT: br i1 [[CMP24]], label %[[FOR_COND1_PREHEADER_PREHEADER:.*]], label %[[FOR_END13:.*]]
From the comment above the function it looks like they are expecting it to be interleaved 8 times, perhaps for optimal performance?
This actually comes from fhahn's change here https://github.com/llvm/llvm-project/pull/126437/files#diff-67309aee6068e2dd0d01af33e35d15855352bb11009158c554228043cfeaa4ba
@@ -195,197 +195,6 @@ for.body: ; preds = %entry, %for.body
br i1 %cmp, label %for.body, label %for.cond.cleanup
}

; Trip count of 8 - does get vectorized
Why has this been deleted?
The register pressure was too high for it to be vectorised and I couldn't modify the test to reduce the pressure and still have it vectorise at the expected VF with the same trip count. I'm now trying to find a way to modify the pressure calculation to keep it and vectorise at VF 4.
I've managed to reduce the test so that it doesn't exhibit huge register pressure, do let me know what you think of it.
@@ -8,8 +8,8 @@
define void @vec_load(i64 %N, ptr nocapture %a, ptr nocapture readonly %b) {
; CHECK-LABEL: @vec_load
; CHECK: vector.body:
; CHECK: %[[LOAD:.*]] = load <vscale x 2 x double>, ptr
; CHECK: call <vscale x 2 x double> @foo_vec(<vscale x 2 x double> %[[LOAD]])
; CHECK: %[[WIDE_LOAD:.*]] = load <vscale x 2 x double>, ptr
Looks like these changes can be reverted?
Reverted now. I'm not sure why the test was failing before without this change!
// Set the max VF to the largest viable vectorization factor less than or
// equal to the max vector element count.
for (ElementCount VS = MaxVectorElementCount * 2;
     ElementCount::isKnownLE(VS, MaxVectorElementCountMaxBW); VS *= 2)
  VFs.push_back(VS);

// For each VF calculate its register usage.
auto RUs = calculateRegisterUsage(VFs);
MaxVF = VS;
Given these element counts are powers of 2 (it appears bit_floor is used above), I think this is just MaxVF = MaxVectorElementCountMaxBW; (assuming MaxVectorElementCount <= MaxVectorElementCountMaxBW).
I added an assertion checking if MaxVectorElementCount <= MaxVectorElementCountMaxBW, and an AArch64 test failed (Transforms/LoopVectorize/AArch64/smallest-and-widest-types.ll), so that condition doesn't always hold. So I've replaced the loop with a check and an assignment.
Hi @SamTebbs33, looks like it's failing some RISCV reg usage tests.
@@ -4885,8 +4872,248 @@ void LoopVectorizationCostModel::collectElementTypesForWidening() {
}
}

/// Estimate the register usage for \p Plan and vectorization factors in \p VFs.
Is it necessary to move this function? It makes it a lot more difficult to review. If it can't be avoided then perhaps it should be moved in a separate NFC commit within the same PR or done in a different PR?
I had to move it when I was using LoopVectorizationCostModel to check for in-loop reductions, but now that I'm using the isInLoop() function from the reduction recipe we don't need these changes. Reverted.
Hmm, the function has still moved so it's still hard to see what's changed. If it's easier you could always abandon the commit history and start again?
// fewer lanes than the VF.
auto VF = VFs[J];
if (auto *ReductionR = dyn_cast<VPReductionPHIRecipe>(R))
  VF = VF.divideCoefficientBy(ReductionR->getVFScaleFactor());
I wonder if it's worth first doing the transition over to computing register usage based on VPlan, followed by the scale factor changes? That way we can see individual effects of each.
Yeah that's a good idea. Done.
@@ -12,28 +12,28 @@ define i32 @mla_i32(ptr noalias nocapture readonly %A, ptr noalias nocapture rea
; CHECK: for.body.preheader:
; CHECK-NEXT: br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
; CHECK: vector.ph:
; CHECK-NEXT: [[N_RND_UP:%.*]] = add i32 [[N]], 15
; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i32 [[N_RND_UP]], 16
; CHECK-NEXT: [[N_RND_UP:%.*]] = add i32 [[N]], 7
This should not be changing, it was OK as it was before. https://godbolt.org/z/KnT1xaeTf
Is it changing because of the large reduction registers? Those wouldn't be expected to be in the final code. Perhaps #113903 will help to represent the register usage more correctly.
Thanks Dave, that patch does indeed help and the only change after cherry picking over Elvis's changes is that the first sext is moved next to the second one. Once that PR is merged I'll rebase on top of it to get rid of this regression here.
Ping
Thanks for working on this, this will improve things for RISC-V a lot. E.g. currently we spill a lot on this example due to needing a gather which requires i64 elements, which is only modelled in VPlan and not in the legacy model: https://godbolt.org/z/rrPssc5dx
With this patch we go from VF=vscale x 16 -> VF=vscale x 8, which fixes the spilling:
Before
.LBB0_11:
vfcvt.f.xu.v v8, v24
vfmul.vf v8, v8, fa0
vfcvt.rtz.x.f.v v8, v8
vsetivli zero, 16, e32, m8, ta, ma
vslidedown.vi v16, v8, 16
csrr s0, vlenb
slli s0, s0, 4
add s0, s0, sp
addi s0, s0, 16
vs8r.v v16, (s0)
vsetivli zero, 16, e64, m8, ta, ma
vmv.v.x v16, s6
csrr s0, vlenb
slli s0, s0, 3
sh1add s0, s0, s0
add s0, s0, sp
addi s0, s0, 16
vs8r.v v16, (s0)
csrr s0, vlenb
slli s0, s0, 3
sh1add s0, s0, s0
add s0, s0, sp
addi s0, s0, 16
vl8r.v v16, (s0)
vsetvli zero, zero, e32, m4, ta, ma
vwadd.wv v0, v16, v8
addi s0, sp, 16
vs8r.v v0, (s0)
csrr s0, vlenb
slli s0, s0, 3
sh1add s0, s0, s0
add s0, s0, sp
addi s0, s0, 16
vl8r.v v16, (s0)
csrr s0, vlenb
slli s0, s0, 4
add s0, s0, sp
addi s0, s0, 16
vl8r.v v8, (s0)
vwadd.wv v0, v16, v8
csrr s0, vlenb
sh3add s0, s0, sp
addi s0, s0, 16
vs8r.v v0, (s0)
addi s0, sp, 16
vl8r.v v8, (s0)
vsetvli zero, zero, e64, m8, ta, ma
vmul.vx v0, v8, t2
csrr s0, vlenb
sh3add s0, s0, sp
addi s0, s0, 16
vl8r.v v8, (s0)
vmul.vx v8, v8, t2
vsetvli zero, zero, e8, m1, ta, ma
vluxei64.v v16, (a4), v0
vsetvli zero, zero, e64, m8, ta, ma
vadd.vx v0, v0, a4
vsetvli zero, zero, e8, m1, ta, ma
vluxei64.v v18, (a4), v8
vsetvli zero, s5, e8, m2, ta, ma
vslideup.vi v16, v18, 16
vsetivli zero, 16, e8, m1, ta, ma
vluxei64.v v18, (t5), v0
vsetvli zero, zero, e64, m8, ta, ma
vadd.vx v8, v8, a4
vsetvli zero, zero, e8, m1, ta, ma
vluxei64.v v20, (t5), v8
vsetvli zero, s5, e8, m2, ta, ma
vslideup.vi v18, v20, 16
vsetivli zero, 16, e8, m1, ta, ma
vluxei64.v v20, (s4), v0
vluxei64.v v22, (s4), v8
vsetvli zero, s5, e8, m2, ta, ma
vslideup.vi v20, v22, 16
vsseg3e8.v v16, (s1)
vsetvli zero, zero, e32, m8, ta, ma
vadd.vx v24, v24, s5
addi a3, a3, -32
addi s1, s1, 96
bnez a3, .LBB0_11
mv s7, t3
beq a1, t3, .LBB0_3
j .LBB0_6
After
.LBB0_11: # %vector.body
# Parent Loop BB0_4 Depth=1
# => This Inner Loop Header: Depth=2
vsetvli zero, zero, e32, m4, ta, ma
vfcvt.f.xu.v v24, v20
vfmul.vf v24, v24, fa0
vfcvt.rtz.x.f.v v4, v24
vwadd.wv v24, v8, v4
vsetvli zero, zero, e64, m8, ta, ma
vmul.vx v24, v24, t2
vadd.vx v0, v24, a4
vsetvli zero, zero, e8, m1, ta, ma
vluxei64.v v16, (a4), v24
vluxei64.v v17, (t5), v0
vluxei64.v v18, (s3), v0
vsetvli zero, zero, e32, m4, ta, ma
vadd.vx v20, v20, t4
addi a3, a3, -16
vsetvli zero, zero, e8, m1, ta, ma
vsseg3e8.v v16, (s0)
addi s0, s0, 48
bnez a3, .LBB0_11
I collected some more data on RISC-V on SPEC CPU 2017, this improves code size by up to 7% on some benchmarks, and no regressions were found: https://lnt.lukelau.me/db_default/v4/nts/399?show_delta=yes&show_previous=yes&show_stddev=yes&show_mad=yes&show_all=yes&show_all_samples=yes&show_sample_counts=yes&show_small_diff=yes&num_comparison_runs=0&test_filter=&test_min_value_filter=&aggregation_fn=min&MW_confidence_lv=0.05&compare_to=401&submit=Update

There's also a significant decrease in vector spilling and reloading. It removes all the spilling entirely on one benchmark, so the geomean result is stuck at 100%.
Thanks for working on this too! We have a similar implementation downstream and it works well.
calculateRegisterUsage(getPlanFor(LegacyVF.Width), LegacyVFs, TTI);
auto RUs = calculateRegisterUsage(BestPlan, VFs, TTI);

auto GetMaxUsage = [](SmallMapVector<unsigned, unsigned, 4> MaxLocalUsers) {
MaxLocalUsers is passed by value here; it should be passed by reference.
unsigned Max = 0;
for (auto Pair : MaxLocalUsers)
  if (Pair.second > Max)
    Max = Pair.second;
return Max;
Although rather than creating a GetMaxUsage
, you could use max_element
for this instead, e.g.
auto CmpGT = [](std::pair<unsigned, unsigned> &A,
std::pair<unsigned, unsigned> &B) {
return A.second > B.second;
};
unsigned MaxLegacyRegUsage =
max_element(LegacyRUs[0].MaxLocalUsers, CmpGT)->second;
unsigned MaxRegUsage = max_element(RUs[0].MaxLocalUsers, CmpGT)->second;
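A small self-contained sketch of the `max_element` approach, using a `std::vector` of pairs in place of LLVM's `SmallMapVector`. One caveat worth noting: `std::max_element` interprets its comparator as less-than, so to get the entry with the highest register count the lambda should compare with `<` on `.second`.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Returns the highest per-class register usage. max_element expects a
// less-than comparator, so CmpLT compares .second with `<`; the iterator
// it returns points at the pair with the largest usage.
static unsigned
maxRegUsage(const std::vector<std::pair<unsigned, unsigned>> &Users) {
  auto CmpLT = [](const std::pair<unsigned, unsigned> &A,
                  const std::pair<unsigned, unsigned> &B) {
    return A.second < B.second;
  };
  return std::max_element(Users.begin(), Users.end(), CmpLT)->second;
}
```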
if (auto *Phi = dyn_cast<VPReductionPHIRecipe>(R);
    Phi && Phi->isInLoop())
  continue;
Why is this discarding the reduction PHI?
Also, if you rebase on main you'll see this FIXME, which I think can now be addressed by this PR:
https://github.com/llvm/llvm-project/pull/126437/files#diff-da321d454a7246f8ae276bf1db2782bf26b5210b8133cb59e4d7fd45d0905decR5009-R5011
};
unsigned MaxLegacyRegUsage = GetMaxUsage(LegacyRUs[0].MaxLocalUsers);
unsigned MaxRegUsage = GetMaxUsage(RUs[0].MaxLocalUsers);
RegUsageDeterminedVF = MaxRegUsage <= MaxLegacyRegUsage;
My understanding is that the cost model would make a different choice based on register pressure if and only if the legacy reg-pressure calculation for the chosen VF would exceed the number of available registers.
But because the original cost calculation code is not around anymore (this PR removes it), we still need to figure out why the VFs chosen by the legacy/Vplan cost-models don't match. This code attempts to do that by seeing if the LegacyVF would have a higher register usage than the chosen VF. If so, then I'm not sure this is calculating the same thing.
Should this instead add a flag to `calculateRegisterPressure` to have a 'legacy' mode, where it doesn't scale the register usage for a PHI/VPReduction node using the scaling factor? Then you can compare the register usage in the old/new mode, to give the conclusive answer.
I had a closer look into this, and ignore my suggestion above. Instead, I think it is sufficient to check that the LegacyVF that would otherwise have been chosen has a register usage that exceeds the maximum number of registers. Otherwise, the cost would have chosen the same VF.
That means, this can be simplified to:
RegUsageDeterminedVF = MaxRegUsage <= MaxLegacyRegUsage;
auto &MLU = LegacyRUs[0].MaxLocalUsers;
RegUsageDeterminedVF = any_of(MLU, [&](decltype(MLU.front()) &LU) {
  return LU.second > TTI.getNumberOfRegisters(LU.first);
});
I've managed to get rid of these changes by checking the register pressure as part of the legacy cost model's VF selection, which mirrors the current behaviour. This has had the knock-on effect of changing some debug output for lots of x86 tests since the plan is used for reg pressure checks, rather than the scalar loop.
bc57c23 to 4f64da1
Merging #137746 was enough to fix the mve-reduction-types.ll regression so I believe this PR is ready to land. I'll do so in a couple of days.
LGTM thanks
This appears to have caused some performance regressions. With
The
I think what's caused this is that previously, in `LoopVectorizationCostModel::getMaximizedVFForTarget`, only VFs greater than MaxVectorElementCount could be discarded for excess register usage, but now any VF can be discarded. |
Thanks for reporting that John. I'll investigate it. |
Did anything happen regarding this? We also see performance regressions for our out-of-tree target with this patch, it seems to be too restrictive and stop vectorizing in several cases where it's beneficial for us to vectorize. We've worked around it downstream now but given that there are regressions also for ARM it would be interesting to see how that is handled. |
I have a fix that restores previous behaviour, in that register usage is only checked when vector bandwidth isn't being maximised, but I'm looking into if there's something about the reproducer loop whose register usage isn't being costed properly. If I can't think of a cost adjustment soon then I'll push up my fix. |
I've raised a PR to revert to previous behaviour, where the register pressure was only considered if vector bandwidth was being maximised. I've had a look at the assembly output and the vplan for the reproducer, and noticed that the output has a lot of
This PR moves the register usage checking to after the plans are
created, so that any recipes that optimise register usage (such as
partial reductions) can be properly costed and not have their VF pruned
unnecessarily.
Depends on #113903
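To make the overall flow of the PR concrete, here is a hypothetical sketch of the restructured pipeline: build a plan per candidate VF first, compute register pressure on each plan, and only then prune VFs whose pressure exceeds the target's budget. All names here are illustrative stand-ins, not the actual LLVM API; in particular, a recipe such as a partial reduction can lower a plan's `MaxPressure` and thereby survive pruning, which is the point of moving the check after plan construction.

```cpp
#include <vector>

// Stand-in for a costed VPlan: the VF it was built for and the maximum
// number of simultaneously live registers its recipes require.
struct CandidatePlan {
  unsigned VF;          // vectorization factor
  unsigned MaxPressure; // max register pressure computed from the plan
};

// Keep only the VFs whose plan fits within the target's register file.
// Pruning happens after plan construction, so pressure-reducing recipes
// have already been accounted for.
static std::vector<unsigned>
pruneByRegisterPressure(const std::vector<CandidatePlan> &Plans,
                        unsigned NumRegisters) {
  std::vector<unsigned> Kept;
  for (const auto &P : Plans)
    if (P.MaxPressure <= NumRegisters)
      Kept.push_back(P.VF);
  return Kept;
}
```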