[LV] Change getSmallBestKnownTC to return an ElementCount (NFC) #141793

Open: wants to merge 1 commit into base: main

4 changes: 4 additions & 0 deletions llvm/include/llvm/Analysis/ScalarEvolution.h
@@ -823,6 +823,10 @@ class ScalarEvolution {
/// than the backedge taken count for the loop.
LLVM_ABI unsigned getSmallConstantTripCount(const Loop *L);

+/// A version of getSmallConstantTripCount that returns as an ElementCount to
+/// include loops whose trip count is a function of llvm.vscale().
+ElementCount getSmallConstantRuntimeTripCount(const Loop *L);
Contributor

I think this needs the same LLVM_ABI prefix as above.


/// Return the exact trip count for this loop if we exit through ExitingBlock.
/// '0' is used to represent an unknown or non-constant trip count. Note
/// that a trip count is simply one more than the backedge taken count for
4 changes: 4 additions & 0 deletions llvm/lib/Analysis/ScalarEvolution.cpp
@@ -8217,6 +8217,10 @@ unsigned ScalarEvolution::getSmallConstantTripCount(const Loop *L) {
return getConstantTripCount(ExitCount);
}

+ElementCount ScalarEvolution::getSmallConstantRuntimeTripCount(const Loop *L) {
+return ElementCount::getFixed(getSmallConstantTripCount(L));
+}

unsigned
ScalarEvolution::getSmallConstantTripCount(const Loop *L,
const BasicBlock *ExitingBlock) {
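The new wrapper always produces a fixed count, which is why the call sites further down can swap integer comparisons for ElementCount predicates without changing behaviour: `TC == 1` becomes `isScalar()`, `TC == 0` becomes `isZero()`, and `> 0` becomes `isNonZero()`. A minimal sketch of those predicates follows. It uses a hypothetical `MiniElementCount` stand-in that models only a few members, not the real class from `llvm/Support/TypeSize.h`.

```cpp
#include <cassert>
#include <cstdint>

// Simplified, hypothetical stand-in for llvm::ElementCount. A count is
// MinValue elements, multiplied by the runtime vscale when Scalable is set.
struct MiniElementCount {
  uint64_t MinValue;
  bool Scalable;

  static MiniElementCount getFixed(uint64_t V) { return {V, false}; }
  static MiniElementCount getScalable(uint64_t V) { return {V, true}; }

  // True when the count is zero elements (used here as "unknown").
  bool isZero() const { return MinValue == 0; }
  bool isNonZero() const { return MinValue != 0; }

  // Exactly one element: must be fixed, because (vscale x 1) can exceed one.
  bool isScalar() const { return !Scalable && MinValue == 1; }
};
```

Under this model a fixed count of 1 is scalar, while a scalable count of 1 is not, so a future vscale-based trip count would not be mistaken for a single-iteration loop.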
43 changes: 23 additions & 20 deletions llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -427,24 +427,24 @@ static bool hasIrregularType(Type *Ty, const DataLayout &DL) {
/// 2) Returns expected trip count according to profile data if any.
/// 3) Returns upper bound estimate if known, and if \p CanUseConstantMax.
/// 4) Returns std::nullopt if all of the above failed.
-static std::optional<unsigned>
+static std::optional<ElementCount>
getSmallBestKnownTC(PredicatedScalarEvolution &PSE, Loop *L,
bool CanUseConstantMax = true) {
Comment on lines 431 to 432
Contributor

Is it sufficient to change getSmallBestKnownTC to return ElementCount, and not touch SCEV?

Collaborator Author

I don't think so, because there are other places within LoopVectorize that call getSmallConstantTripCount and also require the ElementCount-returning variant.

Contributor

It would be good to see the follow-up patch, to see how this patch will be used.

Collaborator Author

It's part of a chain so I'll see about pulling it out.

My objective is to allow multiples of vscale to be returned, so that when vectorising a loop of the form for (int i = 0; i < llvm.vscale() * N; ++i) with a scalable VF, the vectoriser can make sensible interleaving decisions. For example, in the N=1 case today, LoopVectorize picks the default interleaving factor of 2, which means we never enter the vector loop.

In the same vein, representing vscale-based trip counts also allows the vectoriser to better reason about whether the scalar loop becomes dead after vectorisation.

I did investigate changing the existing interface but felt the impact was too high, especially as the majority of uses do not care about vscale-based trip counts.

Contributor

Sounds good. Maybe we can move getSmallConstantRuntimeTripCount into LV, as nobody else cares about vscale?
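The N=1 case raised in this thread can be checked with a small numeric model. This is a sketch, not LLVM code: `vectorLoopEntered` and its parameters are hypothetical names, and it assumes (without tail folding) that the vector loop is entered only when the trip count covers at least one full vector iteration, i.e. TC >= VF * IC.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model: a loop running (VScale * N) iterations is vectorised
// with a scalable VF of (vscale x VFMin) and an interleave count of IC.
// Assumption: the vector loop is entered only if the trip count is at least
// the number of elements processed by one vector iteration.
bool vectorLoopEntered(uint64_t VScale, uint64_t N, uint64_t VFMin,
                       uint64_t IC) {
  uint64_t TripCount = VScale * N;
  uint64_t ElementsPerVectorIteration = VScale * VFMin * IC;
  return TripCount >= ElementsPerVectorIteration;
}
```

With N = 1 and IC = 2 the function returns false for every value of vscale, because one vector iteration always needs at least twice the available trip count. With IC = 1 and VFMin = 1 it returns true, which is the kind of decision a vscale-aware trip count would let the vectoriser reach.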

// Check if exact trip count is known.
-if (unsigned ExpectedTC = PSE.getSE()->getSmallConstantTripCount(L))
+if (auto ExpectedTC = PSE.getSE()->getSmallConstantRuntimeTripCount(L))
return ExpectedTC;

// Check if there is an expected trip count available from profile data.
if (LoopVectorizeWithBlockFrequency)
if (auto EstimatedTC = getLoopEstimatedTripCount(L))
-return *EstimatedTC;
+return ElementCount::getFixed(*EstimatedTC);

if (!CanUseConstantMax)
return std::nullopt;

// Check if upper bound estimate is known.
if (unsigned ExpectedTC = PSE.getSmallConstantMaxTripCount())
-return ExpectedTC;
+return ElementCount::getFixed(ExpectedTC);

return std::nullopt;
}
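The order of fallbacks in getSmallBestKnownTC can be summarised in a self-contained sketch. Everything here is a hypothetical stand-in: `TripCountSources` replaces the SCEV, profile and PSE queries, `bestKnownTC` is an illustrative name, and plain `unsigned` is used instead of ElementCount so that only the priority order is modelled.

```cpp
#include <cassert>
#include <optional>

// Hypothetical inputs standing in for the three queries made in the diff.
struct TripCountSources {
  unsigned ExactTC = 0;              // exact constant trip count, 0 if unknown
  std::optional<unsigned> ProfileTC; // profile-based estimate, if any
  unsigned ConstantMaxTC = 0;        // constant upper bound, 0 if unknown
};

// Mirrors the documented order: exact count, then profile estimate, then
// (only if permitted) the constant maximum, otherwise no answer.
std::optional<unsigned> bestKnownTC(const TripCountSources &S, bool UseProfile,
                                    bool CanUseConstantMax = true) {
  if (S.ExactTC)
    return S.ExactTC;
  if (UseProfile && S.ProfileTC)
    return *S.ProfileTC;
  if (!CanUseConstantMax)
    return std::nullopt;
  if (S.ConstantMaxTC)
    return S.ConstantMaxTC;
  return std::nullopt;
}
```

Passing `CanUseConstantMax = false`, as the GeneratedRTChecks call site does, means an upper bound alone never yields a result.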
@@ -1977,7 +1977,8 @@ class GeneratedRTChecks {
// Get the best known TC estimate.
if (auto EstimatedTC = getSmallBestKnownTC(
PSE, OuterLoop, /* CanUseConstantMax = */ false))
-BestTripCount = *EstimatedTC;
+if (EstimatedTC->isFixed())
+BestTripCount = EstimatedTC->getFixedValue();

InstructionCost NewMemCheckCost = MemCheckCost / BestTripCount;
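The guard above follows a pattern this patch repeats wherever a plain integer is still needed: the trip count is used only if it is both present and fixed. A minimal sketch, assuming a hypothetical `MiniElementCount` stand-in rather than `llvm::ElementCount`, and a hypothetical helper name `fixedTripCountOr`:

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical stand-in for llvm::ElementCount: MinValue elements, multiplied
// by the runtime vscale when Scalable is true.
struct MiniElementCount {
  uint64_t MinValue;
  bool Scalable;
};

// Returns the trip count as a plain integer when it is known and fixed.
// Otherwise the caller's default is kept, since a multiple of vscale has no
// single compile-time value.
uint64_t fixedTripCountOr(const std::optional<MiniElementCount> &TC,
                          uint64_t Default) {
  if (TC && !TC->Scalable)
    return TC->MinValue;
  return Default;
}
```

In the diff, BestTripCount simply keeps its earlier value in the non-fixed case. The same isFixed()/getFixedValue() pairing guards the tail-folding and tiny-trip-count threshold checks in later hunks.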

@@ -3751,12 +3752,12 @@ LoopVectorizationCostModel::computeMaxVF(ElementCount UserVF, unsigned UserIC) {
}

ScalarEvolution *SE = PSE.getSE();
-unsigned TC = SE->getSmallConstantTripCount(TheLoop);
+ElementCount TC = SE->getSmallConstantRuntimeTripCount(TheLoop);
unsigned MaxTC = PSE.getSmallConstantMaxTripCount();
LLVM_DEBUG(dbgs() << "LV: Found trip count: " << TC << '\n');
-if (TC != MaxTC)
+if (TC != ElementCount::getFixed(MaxTC))
LLVM_DEBUG(dbgs() << "LV: Found maximum trip count: " << MaxTC << '\n');
-if (TC == 1) {
+if (TC.isScalar()) {
reportVectorizationFailure("Single iteration (non) loop",
"loop trip count is one, irrelevant for vectorization",
"SingleIterationLoop", ORE, TheLoop);
@@ -3870,7 +3871,9 @@ LoopVectorizationCostModel::computeMaxVF(ElementCount UserVF, unsigned UserIC) {
}

auto ExpectedTC = getSmallBestKnownTC(PSE, TheLoop);
-if (ExpectedTC && ExpectedTC <= TTI.getMinTripCountTailFoldingThreshold()) {
+if (ExpectedTC && ExpectedTC->isFixed() &&
+ExpectedTC->getFixedValue() <=
+TTI.getMinTripCountTailFoldingThreshold()) {
if (MaxPowerOf2RuntimeVF > 0u) {
// If we have a low-trip-count, and the fixed-width VF is known to divide
// the trip count but the scalable factor does not, use the fixed-width
@@ -3928,7 +3931,7 @@ LoopVectorizationCostModel::computeMaxVF(ElementCount UserVF, unsigned UserIC) {
return FixedScalableVFPair::getNone();
}

-if (TC == 0) {
+if (TC.isZero()) {
reportVectorizationFailure(
"unable to calculate the loop count due to complex control flow",
"UnknownLoopCountComplexCFG", ORE, TheLoop);
@@ -5071,13 +5074,13 @@ LoopVectorizationCostModel::selectInterleaveCount(VPlan &Plan, ElementCount VF,
// At least one iteration must be scalar when this constraint holds. So the
// maximum available iterations for interleaving is one less.
unsigned AvailableTC = requiresScalarEpilogue(VF.isVector())
-? (*BestKnownTC) - 1
-: *BestKnownTC;
+? BestKnownTC->getFixedValue() - 1
+: BestKnownTC->getFixedValue();

unsigned InterleaveCountLB = bit_floor(std::max(
1u, std::min(AvailableTC / (EstimatedVF * 2), MaxInterleaveCount)));

-if (PSE.getSE()->getSmallConstantTripCount(TheLoop) > 0) {
+if (PSE.getSE()->getSmallConstantRuntimeTripCount(TheLoop).isNonZero()) {
// If the best known trip count is exact, we select between two
// prospective ICs, where
//
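The InterleaveCountLB computation in this hunk can be worked through with plain integers. This is a sketch under stated assumptions: `bitFloor` stands in for `llvm::bit_floor`, `interleaveCountLowerBound` is a hypothetical name, and the inputs are assumed to satisfy BestKnownTC >= 1 and EstimatedVF >= 1.

```cpp
#include <algorithm>
#include <cassert>

// Largest power of two that is <= X, for X > 0 (stand-in for llvm::bit_floor).
unsigned bitFloor(unsigned X) {
  unsigned P = 1;
  while (P <= X / 2)
    P *= 2;
  return P;
}

// Hypothetical restatement of the lower-bound computation in the diff, with
// the trip count already converted to a fixed integer.
unsigned interleaveCountLowerBound(unsigned BestKnownTC,
                                   bool RequiresScalarEpilogue,
                                   unsigned EstimatedVF,
                                   unsigned MaxInterleaveCount) {
  // One iteration is reserved for the scalar epilogue when it is required.
  unsigned AvailableTC =
      RequiresScalarEpilogue ? BestKnownTC - 1 : BestKnownTC;
  return bitFloor(std::max(
      1u, std::min(AvailableTC / (EstimatedVF * 2), MaxInterleaveCount)));
}
```

For example, a trip count of 64 with EstimatedVF = 4 and MaxInterleaveCount = 8 gives 8. Requiring a scalar epilogue drops the available count to 63 and the result to 4.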
@@ -5437,8 +5440,8 @@ InstructionCost LoopVectorizationCostModel::expectedCost(ElementCount VF) {
// costs of comparison and induction instructions, as they'll get simplified
// away.
SmallPtrSet<Instruction *, 2> ValuesToIgnoreForVF;
-auto TC = PSE.getSE()->getSmallConstantTripCount(TheLoop);
-if (VF.isFixed() && TC == VF.getFixedValue() && !foldTailByMasking())
+auto TC = PSE.getSE()->getSmallConstantRuntimeTripCount(TheLoop);
+if (TC == VF && !foldTailByMasking())
addFullyUnrolledInstructionsToIgnore(TheLoop, Legal->getInductionVars(),
ValuesToIgnoreForVF);

@@ -7134,8 +7137,8 @@ LoopVectorizationPlanner::precomputeCosts(VPlan &Plan, ElementCount VF,
// simplified away.
// TODO: Remove this code after stepping away from the legacy cost model and
// adding code to simplify VPlans before calculating their costs.
-auto TC = PSE.getSE()->getSmallConstantTripCount(OrigLoop);
-if (VF.isFixed() && TC == VF.getFixedValue() && !CM.foldTailByMasking())
+auto TC = PSE.getSE()->getSmallConstantRuntimeTripCount(OrigLoop);
+if (TC == VF && !CM.foldTailByMasking())
addFullyUnrolledInstructionsToIgnore(OrigLoop, Legal->getInductionVars(),
CostCtx.SkipCostComputation);
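Replacing `VF.isFixed() && TC == VF.getFixedValue()` with `TC == VF` works because two element counts compare equal only when both the value and the scalable flag match, and the trip count returned by this patch is always fixed. The sketch below models that equality together with the conservative less-than that `ElementCount::isKnownLT` provides elsewhere in this patch. It uses a hypothetical `MiniElementCount` stand-in, not the LLVM class, and assumes vscale is at least 1.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for llvm::ElementCount.
struct MiniElementCount {
  uint64_t MinValue;
  bool Scalable;
};

MiniElementCount fixedEC(uint64_t V) { return {V, false}; }
MiniElementCount scalableEC(uint64_t V) { return {V, true}; }

// Equal only if both the minimum value and the scalable flag agree.
bool operator==(MiniElementCount A, MiniElementCount B) {
  return A.MinValue == B.MinValue && A.Scalable == B.Scalable;
}

// True only when LHS < RHS holds for every vscale >= 1. A scalable LHS
// against a fixed RHS cannot be decided, so the answer is false.
bool isKnownLT(MiniElementCount LHS, MiniElementCount RHS) {
  if (!LHS.Scalable || RHS.Scalable)
    return LHS.MinValue < RHS.MinValue;
  return false;
}
```

Under this model a fixed trip count of 4 never equals a scalable VF of (vscale x 4), so the explicit `VF.isFixed()` check is no longer needed at the call site.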

@@ -9942,8 +9945,7 @@ static bool isOutsideLoopWorkProfitable(GeneratedRTChecks &Checks,
// Skip vectorization if the expected trip count is less than the minimum
// required trip count.
if (auto ExpectedTC = getSmallBestKnownTC(PSE, L)) {
-if (ElementCount::isKnownLT(ElementCount::getFixed(*ExpectedTC),
-VF.MinProfitableTripCount)) {
+if (ElementCount::isKnownLT(*ExpectedTC, VF.MinProfitableTripCount)) {
LLVM_DEBUG(dbgs() << "LV: Vectorization is not beneficial: expected "
"trip count < minimum profitable VF ("
<< *ExpectedTC << " < " << VF.MinProfitableTripCount
@@ -10300,7 +10302,8 @@ bool LoopVectorizePass::processLoop(Loop *L) {
// Check the loop for a trip count threshold: vectorize loops with a tiny trip
// count by optimizing for size, to minimize overheads.
auto ExpectedTC = getSmallBestKnownTC(PSE, L);
-if (ExpectedTC && *ExpectedTC < TinyTripCountVectorThreshold) {
+if (ExpectedTC && ExpectedTC->isFixed() &&
+ExpectedTC->getFixedValue() < TinyTripCountVectorThreshold) {
LLVM_DEBUG(dbgs() << "LV: Found a loop with a very small trip count. "
<< "This loop is worth vectorizing only if no scalar "
<< "iteration overheads are incurred.");