Commit d3fd28a

[RISCV][TTI] Properly model odd vector sized LD/ST operations (llvm#100436)
The motivation for this change is the costing of a LD or ST with nearly power of 2 vectors (e.g. <3 x i32> or <7 x i32>) on V. There's an experimental option in SLP to allow emitting these if the cost model says they're profitable. This really helps with e.g. RGB vectors.

Our actual lowering for these depends on whether a wider container type is known available. If so, we use a vle or vse on the wider type with a restricted VL. If not, we split until a legal type is found, and then apply the vle/vse on the sub-pieces.

This change is intentionally restricted to only the case where promotion (widening w/VL predication) is involved. We appear to have at least one bug in our splitting lowering (see discussion on review), and to avoid exposing this more widely, I chose to not adjust costs for the splitting case. The current splitting costing assumes scalarization (which is not true of the actual lowering), but that has the effect of biasing vectorization away from such cases strongly.

For the widening case, the true cost scales with the next largest legal type. The default implementation assumes that such a type is scalarized. Changing that brings our cost in line with our actual lowering decision. Note that since scalarization is not possible for scalable types, the prior costing falsely returned Invalid for that case.
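
To make the widening arithmetic concrete, here is a minimal standalone sketch, not LLVM code. It assumes VLEN = 128 bits, that the wider container is the next power-of-two element count, and that the cost is simply the LMUL of that container. The names `widenedMemOpCost`, `nextPowerOfTwo`, `WidenedCost`, `kAssumedVLenBits` and `kMaxLMul` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical standalone model; none of these names exist in LLVM.
// Assumption: VLEN = 128 bits, so one vector register holds <4 x i32>.
constexpr uint64_t kAssumedVLenBits = 128;
// Assumption: containers needing more than LMUL=8 fall into the split case.
constexpr uint64_t kMaxLMul = 8;

struct WidenedCost {
  bool widened;   // false when the container would exceed LMUL=8 (split case)
  uint64_t cost;  // meaningful only when widened is true
};

// Smallest power of two that is >= n, for n >= 1.
inline uint64_t nextPowerOfTwo(uint64_t n) {
  assert(n >= 1 && "element count must be positive");
  uint64_t p = 1;
  while (p < n)
    p <<= 1;
  return p;
}

// Modeled cost of a load/store of <numElts x i(eltBits)> that is widened to
// the next power-of-two element count and executed with a restricted VL.
inline WidenedCost widenedMemOpCost(uint64_t numElts, uint64_t eltBits) {
  uint64_t containerBits = nextPowerOfTwo(numElts) * eltBits;
  // Fractional LMUL rounds up to a cost of 1 in this simplified model.
  uint64_t lmul = (containerBits + kAssumedVLenBits - 1) / kAssumedVLenBits;
  if (lmul > kMaxLMul)
    return {false, 0};  // splitting case: deliberately not modeled here
  return {true, lmul};
}
```

Under these assumptions the sketch gives 1 for <3 x i32>, 2 for <7 x i32>, 4 for <15 x i32> and 8 for <31 x i32>, which is in line with the updated expectations in rvv-load-store.ll.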
1 parent 18dee70 commit d3fd28a

3 files changed, +48 -32 lines changed
llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp

Lines changed: 23 additions & 5 deletions
@@ -1390,14 +1390,32 @@ InstructionCost RISCVTTIImpl::getMemoryOpCost(unsigned Opcode, Type *Src,
   InstructionCost Cost = 0;
   if (Opcode == Instruction::Store && OpInfo.isConstant())
     Cost += getStoreImmCost(Src, OpInfo, CostKind);
-  InstructionCost BaseCost =
-    BaseT::getMemoryOpCost(Opcode, Src, Alignment, AddressSpace,
-                           CostKind, OpInfo, I);
+
+  std::pair<InstructionCost, MVT> LT = getTypeLegalizationCost(Src);
+
+  InstructionCost BaseCost = [&]() {
+    InstructionCost Cost = LT.first;
+    if (CostKind != TTI::TCK_RecipThroughput)
+      return Cost;
+
+    // Our actual lowering for the case where a wider legal type is available
+    // uses the a VL predicated load on the wider type. This is reflected in
+    // the result of getTypeLegalizationCost, but BasicTTI assumes the
+    // widened cases are scalarized.
+    const DataLayout &DL = this->getDataLayout();
+    if (Src->isVectorTy() && LT.second.isVector() &&
+        TypeSize::isKnownLT(DL.getTypeStoreSizeInBits(Src),
+                            LT.second.getSizeInBits()))
+      return Cost;
+
+    return BaseT::getMemoryOpCost(Opcode, Src, Alignment, AddressSpace,
+                                  CostKind, OpInfo, I);
+  }();
+
   // Assume memory ops cost scale with the number of vector registers
   // possible accessed by the instruction. Note that BasicTTI already
   // handles the LT.first term for us.
-  if (std::pair<InstructionCost, MVT> LT = getTypeLegalizationCost(Src);
-      LT.second.isVector() && CostKind != TTI::TCK_CodeSize)
+  if (LT.second.isVector() && CostKind != TTI::TCK_CodeSize)
     BaseCost *= TLI->getLMULCost(LT.second);
   return Cost + BaseCost;
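
The control flow of the hunk above can be modeled outside LLVM as a rough sketch. Every type and name below (`CostKind`, `LegalizedType`, `MemOp`, `modelMemoryOpCost`) is a hypothetical stand-in, and the model makes two simplifying assumptions: it omits the store-immediate term added via `getStoreImmCost`, and it compares fixed sizes with a plain `<` where the patch uses `TypeSize::isKnownLT`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for the LLVM types used in the patch.
enum class CostKind { RecipThroughput, Latency, CodeSize, SizeAndLatency };

struct LegalizedType {
  uint64_t numParts;    // stands in for LT.first
  bool isVector;        // stands in for LT.second.isVector()
  uint64_t sizeInBits;  // stands in for LT.second.getSizeInBits()
  uint64_t lmulCost;    // stands in for TLI->getLMULCost(LT.second)
};

struct MemOp {
  bool srcIsVector;             // stands in for Src->isVectorTy()
  uint64_t srcStoreSizeInBits;  // stands in for DL.getTypeStoreSizeInBits(Src)
  uint64_t fallbackCost;        // stands in for BaseT::getMemoryOpCost(...)
};

// Mirrors the patched control flow, minus the store-immediate cost.
inline uint64_t modelMemoryOpCost(const MemOp &op, const LegalizedType &lt,
                                  CostKind kind) {
  uint64_t baseCost;
  if (kind != CostKind::RecipThroughput) {
    baseCost = lt.numParts;
  } else if (op.srcIsVector && lt.isVector &&
             op.srcStoreSizeInBits < lt.sizeInBits) {
    // Widened case: the source is strictly smaller than its legal container,
    // so the access is one VL-restricted operation on the wider type.
    baseCost = lt.numParts;
  } else {
    baseCost = op.fallbackCost;
  }
  // Scale by the number of vector registers touched, except for code size.
  if (lt.isVector && kind != CostKind::CodeSize)
    baseCost *= lt.lmulCost;
  return baseCost;
}
```

For example, a <7 x i32> access (224 stored bits) legalized to a 256-bit container with an LMUL cost of 2 yields 2 in this model, whatever the fallback cost is, because the fallback is never consulted on the widened path.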

llvm/test/Analysis/CostModel/RISCV/rvv-load-store.ll

Lines changed: 13 additions & 13 deletions
@@ -592,16 +592,16 @@ define void @load_oddsize_vectors(ptr %p) {
 ; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %13 = load <32 x i1>, ptr %p, align 4
 ; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %14 = load <1 x i32>, ptr %p, align 4
 ; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %15 = load <2 x i32>, ptr %p, align 8
-; CHECK-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %16 = load <3 x i32>, ptr %p, align 16
+; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %16 = load <3 x i32>, ptr %p, align 16
 ; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %17 = load <4 x i32>, ptr %p, align 16
-; CHECK-NEXT: Cost Model: Found an estimated cost of 12 for instruction: %18 = load <5 x i32>, ptr %p, align 32
-; CHECK-NEXT: Cost Model: Found an estimated cost of 14 for instruction: %19 = load <6 x i32>, ptr %p, align 32
-; CHECK-NEXT: Cost Model: Found an estimated cost of 16 for instruction: %20 = load <7 x i32>, ptr %p, align 32
+; CHECK-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %18 = load <5 x i32>, ptr %p, align 32
+; CHECK-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %19 = load <6 x i32>, ptr %p, align 32
+; CHECK-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %20 = load <7 x i32>, ptr %p, align 32
 ; CHECK-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %21 = load <8 x i32>, ptr %p, align 32
-; CHECK-NEXT: Cost Model: Found an estimated cost of 40 for instruction: %22 = load <9 x i32>, ptr %p, align 64
-; CHECK-NEXT: Cost Model: Found an estimated cost of 64 for instruction: %23 = load <15 x i32>, ptr %p, align 64
+; CHECK-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %22 = load <9 x i32>, ptr %p, align 64
+; CHECK-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %23 = load <15 x i32>, ptr %p, align 64
 ; CHECK-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %24 = load <16 x i32>, ptr %p, align 64
-; CHECK-NEXT: Cost Model: Found an estimated cost of 256 for instruction: %25 = load <31 x i32>, ptr %p, align 128
+; CHECK-NEXT: Cost Model: Found an estimated cost of 8 for instruction: %25 = load <31 x i32>, ptr %p, align 128
 ; CHECK-NEXT: Cost Model: Found an estimated cost of 8 for instruction: %26 = load <32 x i32>, ptr %p, align 128
 ; CHECK-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void
 ;
@@ -682,15 +682,15 @@ define void @store_oddsize_vectors(ptr %p) {
 ; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: store <32 x i1> undef, ptr %p, align 4
 ; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: store <1 x i32> undef, ptr %p, align 4
 ; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: store <2 x i32> undef, ptr %p, align 8
-; CHECK-NEXT: Cost Model: Found an estimated cost of 4 for instruction: store <3 x i32> undef, ptr %p, align 16
+; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: store <3 x i32> undef, ptr %p, align 16
 ; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: store <4 x i32> undef, ptr %p, align 16
-; CHECK-NEXT: Cost Model: Found an estimated cost of 12 for instruction: store <5 x i32> undef, ptr %p, align 32
-; CHECK-NEXT: Cost Model: Found an estimated cost of 14 for instruction: store <6 x i32> undef, ptr %p, align 32
-; CHECK-NEXT: Cost Model: Found an estimated cost of 16 for instruction: store <7 x i32> undef, ptr %p, align 32
+; CHECK-NEXT: Cost Model: Found an estimated cost of 2 for instruction: store <5 x i32> undef, ptr %p, align 32
+; CHECK-NEXT: Cost Model: Found an estimated cost of 2 for instruction: store <6 x i32> undef, ptr %p, align 32
+; CHECK-NEXT: Cost Model: Found an estimated cost of 2 for instruction: store <7 x i32> undef, ptr %p, align 32
 ; CHECK-NEXT: Cost Model: Found an estimated cost of 2 for instruction: store <8 x i32> undef, ptr %p, align 32
-; CHECK-NEXT: Cost Model: Found an estimated cost of 64 for instruction: store <15 x i32> undef, ptr %p, align 64
+; CHECK-NEXT: Cost Model: Found an estimated cost of 4 for instruction: store <15 x i32> undef, ptr %p, align 64
 ; CHECK-NEXT: Cost Model: Found an estimated cost of 4 for instruction: store <16 x i32> undef, ptr %p, align 64
-; CHECK-NEXT: Cost Model: Found an estimated cost of 256 for instruction: store <31 x i32> undef, ptr %p, align 128
+; CHECK-NEXT: Cost Model: Found an estimated cost of 8 for instruction: store <31 x i32> undef, ptr %p, align 128
 ; CHECK-NEXT: Cost Model: Found an estimated cost of 8 for instruction: store <32 x i32> undef, ptr %p, align 128
 ; CHECK-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void
 ;

llvm/test/Transforms/LoopVectorize/RISCV/short-trip-count.ll

Lines changed: 12 additions & 14 deletions
@@ -7,24 +7,22 @@ define void @small_trip_count_min_vlen_128(ptr nocapture %a) nounwind vscale_ran
 ; CHECK-NEXT: br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
 ; CHECK: vector.ph:
 ; CHECK-NEXT: [[TMP0:%.*]] = call i32 @llvm.vscale.i32()
-; CHECK-NEXT: [[TMP1:%.*]] = mul i32 [[TMP0]], 2
-; CHECK-NEXT: [[TMP2:%.*]] = sub i32 [[TMP1]], 1
-; CHECK-NEXT: [[N_RND_UP:%.*]] = add i32 4, [[TMP2]]
-; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i32 [[N_RND_UP]], [[TMP1]]
+; CHECK-NEXT: [[TMP1:%.*]] = sub i32 [[TMP0]], 1
+; CHECK-NEXT: [[N_RND_UP:%.*]] = add i32 4, [[TMP1]]
+; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i32 [[N_RND_UP]], [[TMP0]]
 ; CHECK-NEXT: [[N_VEC:%.*]] = sub i32 [[N_RND_UP]], [[N_MOD_VF]]
-; CHECK-NEXT: [[TMP3:%.*]] = call i32 @llvm.vscale.i32()
-; CHECK-NEXT: [[TMP4:%.*]] = mul i32 [[TMP3]], 2
+; CHECK-NEXT: [[TMP2:%.*]] = call i32 @llvm.vscale.i32()
 ; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
 ; CHECK: vector.body:
 ; CHECK-NEXT: [[INDEX:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
-; CHECK-NEXT: [[TMP5:%.*]] = add i32 [[INDEX]], 0
-; CHECK-NEXT: [[ACTIVE_LANE_MASK:%.*]] = call <vscale x 2 x i1> @llvm.get.active.lane.mask.nxv2i1.i32(i32 [[TMP5]], i32 4)
-; CHECK-NEXT: [[TMP6:%.*]] = getelementptr inbounds i32, ptr [[A:%.*]], i32 [[TMP5]]
-; CHECK-NEXT: [[TMP7:%.*]] = getelementptr inbounds i32, ptr [[TMP6]], i32 0
-; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <vscale x 2 x i32> @llvm.masked.load.nxv2i32.p0(ptr [[TMP7]], i32 4, <vscale x 2 x i1> [[ACTIVE_LANE_MASK]], <vscale x 2 x i32> poison)
-; CHECK-NEXT: [[TMP8:%.*]] = add nsw <vscale x 2 x i32> [[WIDE_MASKED_LOAD]], shufflevector (<vscale x 2 x i32> insertelement (<vscale x 2 x i32> poison, i32 1, i64 0), <vscale x 2 x i32> poison, <vscale x 2 x i32> zeroinitializer)
-; CHECK-NEXT: call void @llvm.masked.store.nxv2i32.p0(<vscale x 2 x i32> [[TMP8]], ptr [[TMP7]], i32 4, <vscale x 2 x i1> [[ACTIVE_LANE_MASK]])
-; CHECK-NEXT: [[INDEX_NEXT]] = add i32 [[INDEX]], [[TMP4]]
+; CHECK-NEXT: [[TMP3:%.*]] = add i32 [[INDEX]], 0
+; CHECK-NEXT: [[ACTIVE_LANE_MASK:%.*]] = call <vscale x 1 x i1> @llvm.get.active.lane.mask.nxv1i1.i32(i32 [[TMP3]], i32 4)
+; CHECK-NEXT: [[TMP4:%.*]] = getelementptr inbounds i32, ptr [[A:%.*]], i32 [[TMP3]]
+; CHECK-NEXT: [[TMP5:%.*]] = getelementptr inbounds i32, ptr [[TMP4]], i32 0
+; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <vscale x 1 x i32> @llvm.masked.load.nxv1i32.p0(ptr [[TMP5]], i32 4, <vscale x 1 x i1> [[ACTIVE_LANE_MASK]], <vscale x 1 x i32> poison)
+; CHECK-NEXT: [[TMP6:%.*]] = add nsw <vscale x 1 x i32> [[WIDE_MASKED_LOAD]], shufflevector (<vscale x 1 x i32> insertelement (<vscale x 1 x i32> poison, i32 1, i64 0), <vscale x 1 x i32> poison, <vscale x 1 x i32> zeroinitializer)
+; CHECK-NEXT: call void @llvm.masked.store.nxv1i32.p0(<vscale x 1 x i32> [[TMP6]], ptr [[TMP5]], i32 4, <vscale x 1 x i1> [[ACTIVE_LANE_MASK]])
+; CHECK-NEXT: [[INDEX_NEXT]] = add i32 [[INDEX]], [[TMP2]]
 ; CHECK-NEXT: br i1 true, label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
 ; CHECK: middle.block:
 ; CHECK-NEXT: br i1 true, label [[EXIT:%.*]], label [[SCALAR_PH]]
