
[CodeLayout] Size-aware machine block placement #109711


Merged: 3 commits merged into llvm:main on Oct 2, 2024

Conversation

@spupyrev (Contributor) commented Sep 23, 2024

This is an implementation of a new "size-aware" machine block placement. The
idea is to reorder blocks so that the number of fall-through jumps is maximized.
Profile data is ignored by the optimization, and it is applied only to
functions with hasOptSize()=true.
This strategy has two benefits:
(i) it eliminates jump instructions, which results in smaller text size;
(ii) we avoid using profile data while reordering blocks, which yields more
"uniform" functions, thus helping ICF and the machine outliner/merger.

For large (mobile) apps, the size benefits of (i) and (ii) are roughly the same,
combined providing up to 0.5% uncompressed and up to 1% compressed size savings
on top of the current solution.

The optimization is turned off by default.
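
In effect, for hasOptSize() functions the pass swaps the profile-weighted ext-tsp objective for a plain fall-through count; the new mode is gated behind the hidden apply-ext-tsp-for-size flag added in the diff below. A minimal sketch of that objective (a hypothetical helper, not the pass's actual code):

#include <cstddef>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Score a block order by the number of realized fall-throughs: Order[i]
// falls through to Order[i+1] iff Order[i+1] is a CFG successor of Order[i].
// Each realized fall-through is one jump instruction that no longer has to
// be emitted. Edges are deliberately unweighted: profile data is ignored.
static std::size_t countFallThroughs(
    const std::vector<int> &Order,
    const std::unordered_map<int, std::unordered_set<int>> &Succs) {
  std::size_t Count = 0;
  for (std::size_t I = 0; I + 1 < Order.size(); ++I) {
    auto It = Succs.find(Order[I]);
    if (It != Succs.end() && It->second.count(Order[I + 1]))
      ++Count;
  }
  return Count;
}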

github-actions bot commented Sep 23, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

@spupyrev marked this pull request as ready for review September 23, 2024 20:30
@llvmbot (Member) commented Sep 23, 2024

@llvm/pr-subscribers-backend-x86

Author: None (spupyrev)

Changes



Patch is 53.21 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/109711.diff

2 Files Affected:

  • (modified) llvm/lib/CodeGen/MachineBlockPlacement.cpp (+354-314)
  • (added) llvm/test/CodeGen/X86/code_placement_ext_tsp_size.ll (+134)
diff --git a/llvm/lib/CodeGen/MachineBlockPlacement.cpp b/llvm/lib/CodeGen/MachineBlockPlacement.cpp
index be783bc4e29738..3677818c8f08df 100644
--- a/llvm/lib/CodeGen/MachineBlockPlacement.cpp
+++ b/llvm/lib/CodeGen/MachineBlockPlacement.cpp
@@ -119,10 +119,10 @@ static cl::opt<unsigned> LoopToColdBlockRatio(
              "(frequency of block) is greater than this ratio"),
     cl::init(5), cl::Hidden);
 
-static cl::opt<bool> ForceLoopColdBlock(
-    "force-loop-cold-block",
-    cl::desc("Force outlining cold blocks from loops."),
-    cl::init(false), cl::Hidden);
+static cl::opt<bool>
+    ForceLoopColdBlock("force-loop-cold-block",
+                       cl::desc("Force outlining cold blocks from loops."),
+                       cl::init(false), cl::Hidden);
 
 static cl::opt<bool>
     PreciseRotationCost("precise-rotation-cost",
@@ -147,43 +147,43 @@ static cl::opt<unsigned> JumpInstCost("jump-inst-cost",
                                       cl::desc("Cost of jump instructions."),
                                       cl::init(1), cl::Hidden);
 static cl::opt<bool>
-TailDupPlacement("tail-dup-placement",
-              cl::desc("Perform tail duplication during placement. "
-                       "Creates more fallthrough opportunites in "
-                       "outline branches."),
-              cl::init(true), cl::Hidden);
+    TailDupPlacement("tail-dup-placement",
+                     cl::desc("Perform tail duplication during placement. "
+                              "Creates more fallthrough opportunites in "
+                              "outline branches."),
+                     cl::init(true), cl::Hidden);
 
 static cl::opt<bool>
-BranchFoldPlacement("branch-fold-placement",
-              cl::desc("Perform branch folding during placement. "
-                       "Reduces code size."),
-              cl::init(true), cl::Hidden);
+    BranchFoldPlacement("branch-fold-placement",
+                        cl::desc("Perform branch folding during placement. "
+                                 "Reduces code size."),
+                        cl::init(true), cl::Hidden);
 
 // Heuristic for tail duplication.
 static cl::opt<unsigned> TailDupPlacementThreshold(
     "tail-dup-placement-threshold",
     cl::desc("Instruction cutoff for tail duplication during layout. "
              "Tail merging during layout is forced to have a threshold "
-             "that won't conflict."), cl::init(2),
-    cl::Hidden);
+             "that won't conflict."),
+    cl::init(2), cl::Hidden);
 
 // Heuristic for aggressive tail duplication.
 static cl::opt<unsigned> TailDupPlacementAggressiveThreshold(
     "tail-dup-placement-aggressive-threshold",
     cl::desc("Instruction cutoff for aggressive tail duplication during "
              "layout. Used at -O3. Tail merging during layout is forced to "
-             "have a threshold that won't conflict."), cl::init(4),
-    cl::Hidden);
+             "have a threshold that won't conflict."),
+    cl::init(4), cl::Hidden);
 
 // Heuristic for tail duplication.
 static cl::opt<unsigned> TailDupPlacementPenalty(
     "tail-dup-placement-penalty",
-    cl::desc("Cost penalty for blocks that can avoid breaking CFG by copying. "
-             "Copying can increase fallthrough, but it also increases icache "
-             "pressure. This parameter controls the penalty to account for that. "
-             "Percent as integer."),
-    cl::init(2),
-    cl::Hidden);
+    cl::desc(
+        "Cost penalty for blocks that can avoid breaking CFG by copying. "
+        "Copying can increase fallthrough, but it also increases icache "
+        "pressure. This parameter controls the penalty to account for that. "
+        "Percent as integer."),
+    cl::init(2), cl::Hidden);
 
 // Heuristic for tail duplication if profile count is used in cost model.
 static cl::opt<unsigned> TailDupProfilePercentThreshold(
@@ -198,8 +198,7 @@ static cl::opt<unsigned> TriangleChainCount(
     "triangle-chain-count",
     cl::desc("Number of triangle-shaped-CFG's that need to be in a row for the "
              "triangle tail duplication heuristic to kick in. 0 to disable."),
-    cl::init(2),
-    cl::Hidden);
+    cl::init(2), cl::Hidden);
 
 // Use case: When block layout is visualized after MBP pass, the basic blocks
 // are labeled in layout order; meanwhile blocks could be numbered in a
@@ -219,6 +218,11 @@ static cl::opt<unsigned> ExtTspBlockPlacementMaxBlocks(
              "block placement."),
     cl::init(UINT_MAX), cl::Hidden);
 
+// Apply the ext-tsp algorithm minimizing the size of a binary.
+static cl::opt<bool>
+    ApplyExtTspForSize("apply-ext-tsp-for-size", cl::init(false), cl::Hidden,
+                       cl::desc("Use ext-tsp for size-aware block placement."));
+
 namespace llvm {
 extern cl::opt<bool> EnableExtTspBlockPlacement;
 extern cl::opt<bool> ApplyExtTspWithoutProfile;
@@ -292,8 +296,8 @@ class BlockChain {
   iterator end() { return Blocks.end(); }
   const_iterator end() const { return Blocks.end(); }
 
-  bool remove(MachineBasicBlock* BB) {
-    for(iterator i = begin(); i != end(); ++i) {
+  bool remove(MachineBasicBlock *BB) {
+    for (iterator i = begin(); i != end(); ++i) {
       if (*i == BB) {
         Blocks.erase(i);
         return true;
@@ -405,6 +409,8 @@ class MachineBlockPlacement : public MachineFunctionPass {
 
   ProfileSummaryInfo *PSI = nullptr;
 
+  TargetPassConfig *PassConfig = nullptr;
+
   /// Duplicator used to duplicate tails during placement.
   ///
   /// Placement decisions can open up new tail duplication opportunities, but
@@ -415,6 +421,8 @@ class MachineBlockPlacement : public MachineFunctionPass {
   /// Partial tail duplication threshold.
   BlockFrequency DupThreshold;
 
+  unsigned TailDupSize;
+
   /// True:  use block profile count to compute tail duplication cost.
   /// False: use block frequency to compute tail duplication cost.
   bool UseProfileCount = false;
@@ -459,26 +467,24 @@ class MachineBlockPlacement : public MachineFunctionPass {
 
   /// Scale the DupThreshold according to basic block size.
   BlockFrequency scaleThreshold(MachineBasicBlock *BB);
-  void initDupThreshold();
+  void initTailDupThreshold();
 
   /// Decrease the UnscheduledPredecessors count for all blocks in chain, and
   /// if the count goes to 0, add them to the appropriate work list.
-  void markChainSuccessors(
-      const BlockChain &Chain, const MachineBasicBlock *LoopHeaderBB,
-      const BlockFilterSet *BlockFilter = nullptr);
+  void markChainSuccessors(const BlockChain &Chain,
+                           const MachineBasicBlock *LoopHeaderBB,
+                           const BlockFilterSet *BlockFilter = nullptr);
 
   /// Decrease the UnscheduledPredecessors count for a single block, and
   /// if the count goes to 0, add them to the appropriate work list.
-  void markBlockSuccessors(
-      const BlockChain &Chain, const MachineBasicBlock *BB,
-      const MachineBasicBlock *LoopHeaderBB,
-      const BlockFilterSet *BlockFilter = nullptr);
+  void markBlockSuccessors(const BlockChain &Chain, const MachineBasicBlock *BB,
+                           const MachineBasicBlock *LoopHeaderBB,
+                           const BlockFilterSet *BlockFilter = nullptr);
 
   BranchProbability
-  collectViableSuccessors(
-      const MachineBasicBlock *BB, const BlockChain &Chain,
-      const BlockFilterSet *BlockFilter,
-      SmallVector<MachineBasicBlock *, 4> &Successors);
+  collectViableSuccessors(const MachineBasicBlock *BB, const BlockChain &Chain,
+                          const BlockFilterSet *BlockFilter,
+                          SmallVector<MachineBasicBlock *, 4> &Successors);
   bool isBestSuccessor(MachineBasicBlock *BB, MachineBasicBlock *Pred,
                        BlockFilterSet *BlockFilter);
   void findDuplicateCandidates(SmallVectorImpl<MachineBasicBlock *> &Candidates,
@@ -496,16 +502,19 @@ class MachineBlockPlacement : public MachineFunctionPass {
                           MachineFunction::iterator &PrevUnplacedBlockIt,
                           BlockFilterSet::iterator &PrevUnplacedBlockInFilterIt,
                           bool &DuplicatedToLPred);
-  bool hasBetterLayoutPredecessor(
-      const MachineBasicBlock *BB, const MachineBasicBlock *Succ,
-      const BlockChain &SuccChain, BranchProbability SuccProb,
-      BranchProbability RealSuccProb, const BlockChain &Chain,
-      const BlockFilterSet *BlockFilter);
-  BlockAndTailDupResult selectBestSuccessor(
-      const MachineBasicBlock *BB, const BlockChain &Chain,
-      const BlockFilterSet *BlockFilter);
-  MachineBasicBlock *selectBestCandidateBlock(
-      const BlockChain &Chain, SmallVectorImpl<MachineBasicBlock *> &WorkList);
+  bool hasBetterLayoutPredecessor(const MachineBasicBlock *BB,
+                                  const MachineBasicBlock *Succ,
+                                  const BlockChain &SuccChain,
+                                  BranchProbability SuccProb,
+                                  BranchProbability RealSuccProb,
+                                  const BlockChain &Chain,
+                                  const BlockFilterSet *BlockFilter);
+  BlockAndTailDupResult selectBestSuccessor(const MachineBasicBlock *BB,
+                                            const BlockChain &Chain,
+                                            const BlockFilterSet *BlockFilter);
+  MachineBasicBlock *
+  selectBestCandidateBlock(const BlockChain &Chain,
+                           SmallVectorImpl<MachineBasicBlock *> &WorkList);
   MachineBasicBlock *
   getFirstUnplacedBlock(const BlockChain &PlacedChain,
                         MachineFunction::iterator &PrevUnplacedBlockIt);
@@ -536,20 +545,19 @@ class MachineBlockPlacement : public MachineFunctionPass {
                                   const MachineBasicBlock *ExitBB,
                                   const BlockFilterSet &LoopBlockSet);
   MachineBasicBlock *findBestLoopTopHelper(MachineBasicBlock *OldTop,
-      const MachineLoop &L, const BlockFilterSet &LoopBlockSet);
-  MachineBasicBlock *findBestLoopTop(
-      const MachineLoop &L, const BlockFilterSet &LoopBlockSet);
-  MachineBasicBlock *findBestLoopExit(
-      const MachineLoop &L, const BlockFilterSet &LoopBlockSet,
-      BlockFrequency &ExitFreq);
+                                           const MachineLoop &L,
+                                           const BlockFilterSet &LoopBlockSet);
+  MachineBasicBlock *findBestLoopTop(const MachineLoop &L,
+                                     const BlockFilterSet &LoopBlockSet);
+  MachineBasicBlock *findBestLoopExit(const MachineLoop &L,
+                                      const BlockFilterSet &LoopBlockSet,
+                                      BlockFrequency &ExitFreq);
   BlockFilterSet collectLoopBlockSet(const MachineLoop &L);
   void buildLoopChains(const MachineLoop &L);
-  void rotateLoop(
-      BlockChain &LoopChain, const MachineBasicBlock *ExitingBB,
-      BlockFrequency ExitFreq, const BlockFilterSet &LoopBlockSet);
-  void rotateLoopWithProfile(
-      BlockChain &LoopChain, const MachineLoop &L,
-      const BlockFilterSet &LoopBlockSet);
+  void rotateLoop(BlockChain &LoopChain, const MachineBasicBlock *ExitingBB,
+                  BlockFrequency ExitFreq, const BlockFilterSet &LoopBlockSet);
+  void rotateLoopWithProfile(BlockChain &LoopChain, const MachineLoop &L,
+                             const BlockFilterSet &LoopBlockSet);
   void buildCFGChains();
   void optimizeBranches();
   void alignBlocks();
@@ -558,10 +566,10 @@ class MachineBlockPlacement : public MachineFunctionPass {
   bool shouldTailDuplicate(MachineBasicBlock *BB);
   /// Check the edge frequencies to see if tail duplication will increase
   /// fallthroughs.
-  bool isProfitableToTailDup(
-    const MachineBasicBlock *BB, const MachineBasicBlock *Succ,
-    BranchProbability QProb,
-    const BlockChain &Chain, const BlockFilterSet *BlockFilter);
+  bool isProfitableToTailDup(const MachineBasicBlock *BB,
+                             const MachineBasicBlock *Succ,
+                             BranchProbability QProb, const BlockChain &Chain,
+                             const BlockFilterSet *BlockFilter);
 
   /// Check for a trellis layout.
   bool isTrellis(const MachineBasicBlock *BB,
@@ -582,16 +590,17 @@ class MachineBlockPlacement : public MachineFunctionPass {
 
   /// Returns true if a block can tail duplicate into all unplaced
   /// predecessors. Filters based on loop.
-  bool canTailDuplicateUnplacedPreds(
-      const MachineBasicBlock *BB, MachineBasicBlock *Succ,
-      const BlockChain &Chain, const BlockFilterSet *BlockFilter);
+  bool canTailDuplicateUnplacedPreds(const MachineBasicBlock *BB,
+                                     MachineBasicBlock *Succ,
+                                     const BlockChain &Chain,
+                                     const BlockFilterSet *BlockFilter);
 
   /// Find chains of triangles to tail-duplicate where a global analysis works,
   /// but a local analysis would not find them.
   void precomputeTriangleChains();
 
   /// Apply a post-processing step optimizing block placement.
-  void applyExtTsp();
+  void applyExtTsp(bool OptForSize);
 
   /// Modify the existing block placement in the function and adjust all jumps.
   void assignBlockOrder(const std::vector<const MachineBasicBlock *> &NewOrder);
@@ -802,8 +811,8 @@ bool MachineBlockPlacement::shouldTailDuplicate(MachineBasicBlock *BB) {
 /// Compare 2 BlockFrequency's with a small penalty for \p A.
 /// In order to be conservative, we apply a X% penalty to account for
 /// increased icache pressure and static heuristics. For small frequencies
-/// we use only the numerators to improve accuracy. For simplicity, we assume the
-/// penalty is less than 100%
+/// we use only the numerators to improve accuracy. For simplicity, we assume
+/// the penalty is less than 100%
 /// TODO(iteratee): Use 64-bit fixed point edge frequencies everywhere.
 static bool greaterWithBias(BlockFrequency A, BlockFrequency B,
                             BlockFrequency EntryFreq) {
@@ -819,8 +828,8 @@ static bool greaterWithBias(BlockFrequency A, BlockFrequency B,
 /// considering duplication.
 bool MachineBlockPlacement::isProfitableToTailDup(
     const MachineBasicBlock *BB, const MachineBasicBlock *Succ,
-    BranchProbability QProb,
-    const BlockChain &Chain, const BlockFilterSet *BlockFilter) {
+    BranchProbability QProb, const BlockChain &Chain,
+    const BlockFilterSet *BlockFilter) {
   // We need to do a probability calculation to make sure this is profitable.
   // First: does succ have a successor that post-dominates? This affects the
   // calculation. The 2 relevant cases are:
@@ -876,12 +885,12 @@ bool MachineBlockPlacement::isProfitableToTailDup(
   // from BB.
   auto SuccBestPred = BlockFrequency(0);
   for (MachineBasicBlock *SuccPred : Succ->predecessors()) {
-    if (SuccPred == Succ || SuccPred == BB
-        || BlockToChain[SuccPred] == &Chain
-        || (BlockFilter && !BlockFilter->count(SuccPred)))
+    if (SuccPred == Succ || SuccPred == BB ||
+        BlockToChain[SuccPred] == &Chain ||
+        (BlockFilter && !BlockFilter->count(SuccPred)))
       continue;
-    auto Freq = MBFI->getBlockFreq(SuccPred)
-        * MBPI->getEdgeProbability(SuccPred, Succ);
+    auto Freq =
+        MBFI->getBlockFreq(SuccPred) * MBPI->getEdgeProbability(SuccPred, Succ);
     if (Freq > SuccBestPred)
       SuccBestPred = Freq;
   }
@@ -1137,7 +1146,7 @@ MachineBlockPlacement::getBestTrellisSuccessor(
   }
   // We have already computed the optimal edge for the other side of the
   // trellis.
-  ComputedEdges[BestB.Src] = { BestB.Dest, false };
+  ComputedEdges[BestB.Src] = {BestB.Dest, false};
 
   auto TrellisSucc = BestA.Dest;
   LLVM_DEBUG(BranchProbability SuccProb = getAdjustedProbability(
@@ -1169,8 +1178,8 @@ bool MachineBlockPlacement::canTailDuplicateUnplacedPreds(
     // Make sure all unplaced and unfiltered predecessors can be
     // tail-duplicated into.
     // Skip any blocks that are already placed or not in this loop.
-    if (Pred == BB || (BlockFilter && !BlockFilter->count(Pred))
-        || (BlockToChain[Pred] == &Chain && !Succ->succ_empty()))
+    if (Pred == BB || (BlockFilter && !BlockFilter->count(Pred)) ||
+        (BlockToChain[Pred] == &Chain && !Succ->succ_empty()))
       continue;
     if (!TailDup.canTailDuplicate(Succ, Pred)) {
       if (Successors.size() > 1 && hasSameSuccessors(*Pred, Successors))
@@ -1289,9 +1298,7 @@ void MachineBlockPlacement::precomputeTriangleChains() {
 
     unsigned count() const { return Edges.size() - 1; }
 
-    MachineBasicBlock *getKey() const {
-      return Edges.back();
-    }
+    MachineBasicBlock *getKey() const { return Edges.back(); }
   };
 
   if (TriangleChainCount == 0)
@@ -1326,7 +1333,7 @@ void MachineBlockPlacement::precomputeTriangleChains() {
     bool CanTailDuplicate = true;
     // If PDom can't tail-duplicate into it's non-BB predecessors, then this
     // isn't the kind of triangle we're looking for.
-    for (MachineBasicBlock* Pred : PDom->predecessors()) {
+    for (MachineBasicBlock *Pred : PDom->predecessors()) {
       if (Pred == &BB)
         continue;
       if (!TailDup.canTailDuplicate(PDom, Pred)) {
@@ -1386,8 +1393,8 @@ void MachineBlockPlacement::precomputeTriangleChains() {
 
 // When profile is not present, return the StaticLikelyProb.
 // When profile is available, we need to handle the triangle-shape CFG.
-static BranchProbability getLayoutSuccessorProbThreshold(
-      const MachineBasicBlock *BB) {
+static BranchProbability
+getLayoutSuccessorProbThreshold(const MachineBasicBlock *BB) {
   if (!BB->getParent()->getFunction().hasProfileData())
     return BranchProbability(StaticLikelyProb, 100);
   if (BB->succ_size() == 2) {
@@ -1551,8 +1558,8 @@ bool MachineBlockPlacement::hasBetterLayoutPredecessor(
   for (MachineBasicBlock *Pred : Succ->predecessors()) {
     BlockChain *PredChain = BlockToChain[Pred];
     if (Pred == Succ || PredChain == &SuccChain ||
-        (BlockFilter && !BlockFilter->count(Pred)) ||
-        PredChain == &Chain || Pred != *std::prev(PredChain->end()) ||
+        (BlockFilter && !BlockFilter->count(Pred)) || PredChain == &Chain ||
+        Pred != *std::prev(PredChain->end()) ||
         // This check is redundant except for look ahead. This function is
         // called for lookahead by isProfitableToTailDup when BB hasn't been
         // placed yet.
@@ -1599,12 +1606,12 @@ bool MachineBlockPlacement::hasBetterLayoutPredecessor(
 /// \returns The best successor block found, or null if none are viable, along
 /// with a boolean indicating if tail duplication is necessary.
 MachineBlockPlacement::BlockAndTailDupResult
-MachineBlockPlacement::selectBestSuccessor(
-    const MachineBasicBlock *BB, const BlockChain &Chain,
-    const BlockFilterSet *BlockFilter) {
+MachineBlockPlacement::selectBestSuccessor(const MachineBasicBlock *BB,
+                                           const BlockChain &Chain,
+                                           const BlockFilterSet *BlockFilter) {
   const BranchProbability HotProb(StaticLikelyProb, 100);
 
-  BlockAndTailDupResult BestSucc = { nullptr, false };
+  BlockAndTailDupResult BestSucc = {nullptr, false};
   auto BestProb = BranchProbability::getZero();
 
   SmallVector<MachineBasicBlock *, 4> Successors;
@@ -1684,8 +1691,8 @@ MachineBlockPlacement::selectBestSuccessor(
     std::tie(DupProb, Succ) = Tup;
     if (DupProb < BestProb)
       break;
-    if (canTailDuplicateUnplacedPreds(BB, Succ, Chain, BlockFilter)
-        && (isProfitableToTailDup(BB, Succ, BestProb, Chain, BlockFilter))) {
+    if (canTailDuplicateUnplacedPreds(BB, Succ, Chain, BlockFilter) &&
+        (isProfitableToTailDup(BB, Succ, BestProb, Chain, BlockFilter))) {
       LLVM_DEBUG(dbgs() << "    Candidate: " << getBlockName(Succ)
                         << ", probability: " << DupProb
                         << " (Tail Duplicate)\n");
@@ -1822,8 +1829,7 @@ MachineBasicBlock *MachineBlockPlacement::getFirstUnplacedBlock(
 }
 
 void MachineBlockPlacement::fillWorkLists(
-    const MachineBasicBlock *MBB,
-    SmallPtrSetImpl<BlockChain *> &UpdatedPreds,
+    const MachineBasicBlock *MBB, SmallPtrSetImpl<BlockChain *> &UpdatedPreds,
     const BlockFilterSet *Block...
[truncated]

@spupyrev (Contributor, Author)

Please add any other folks who might be interested to review the diff.

@david-xl (Contributor) left a comment

Can you split out the format change first?

@ellishg (Contributor) left a comment

I guess you'll want to rebase on top of your formatting changes to make it easier to see only this patch's changes.

@spupyrev (Contributor, Author)

Rebased and modified the tests, as suggested.

  createCFGChainExtTsp();
} else if (UseExtTspForSize) {
  applyExtTsp(true);
  createCFGChainExtTsp();
A reviewer (Contributor) commented:

Factorize this call out?

@spupyrev (Contributor, Author) replied:

Sorry, I'm not sure I understand the suggestion; there is an if-else here, so the call isn't always executed. Perhaps you meant something else?

@ellishg (Contributor) left a comment

LGTM, thanks for improving block layout for codesize!

@spupyrev (Contributor, Author) commented Oct 1, 2024

Rebased to latest; I'll wait a few more days before landing in case there are other comments/suggestions.

if (MF.size() >= 3) {
  if (EnableExtTspBlockPlacement &&
      (ApplyExtTspWithoutProfile || MF.getFunction().hasProfileData()) &&
      MF.size() <= ExtTspBlockPlacementMaxBlocks) {
@kyulee-com (Contributor) commented Oct 1, 2024

It looks like this flag is used to avoid running ExtTsp on large functions, where it would hurt build time. Can that be a concern when applying it for size?
I see we're only interested in MF.size() >= 3 in this check, but it's also checked from UseExtTspForSize; I would say these size constraints should apply to both the size and the perf case.
Similarly, can we set UseExtTspForPerf early, like below, so that we can simplify this part?

bool UseExtTspForPerf = false;
bool UseExtTspForSize = false;
bool IsWithinSizeRange = (MF.size() >= 3) && (MF.size() <= ExtTspBlockPlacementMaxBlocks);
if (IsWithinSizeRange) {
    UseExtTspForPerf = EnableExtTspBlockPlacement &&
                       (ApplyExtTspWithoutProfile || MF.getFunction().hasProfileData());
    UseExtTspForSize = !UseExtTspForPerf && OptForSize && ApplyExtTspForSize;
}

Then, this part will be like:

if (UseExtTspForPerf || UseExtTspForSize) {
  assert(!(UseExtTspForPerf && UseExtTspForSize) && "Both UseExtTspForPerf and UseExtTspForSize can't be set");
  applyExtTsp(/*OptForSize=*/UseExtTspForSize);
  createCFGChainExtTsp();
}
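
This restructuring computes the shared size constraints in one place and makes the mutual exclusion of the perf and size modes explicit via the assert.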

@spupyrev (Contributor, Author) replied:

> It looks like this flag is used to avoid running ExtTsp on large functions, where it would hurt build time. Can that be a concern when applying it for size?

I think the flag is pretty much obsolete for size, since there is no aggressive inlining in that case. Build time might be affected only for functions with ~10K+ hot basic blocks. When that happens, we should probably investigate what's going on instead of silently ignoring optimizations.

I adjusted the code as you suggested.

@spupyrev merged commit 9016f27 into llvm:main on Oct 2, 2024 (3 of 5 checks passed)
Sterling-Augustine pushed a commit to Sterling-Augustine/llvm-project that referenced this pull request Oct 3, 2024
This is an implementation of a new "size-aware" machine block placement. The
idea is to reorder blocks so that the number of fall-through jumps is maximized.
Profile data is ignored by the optimization, and it is applied only to
functions with hasOptSize()=true.
This strategy has two benefits:
(i) it eliminates jump instructions, which results in smaller text size;
(ii) we avoid using profile data while reordering blocks, which yields more
"uniform" functions, thus helping ICF and the machine outliner/merger.

For large (mobile) apps, the size benefits of (i) and (ii) are roughly the same,
combined providing up to 0.5% uncompressed and up to 1% compressed size savings
on top of the current solution.

The optimization is turned off by default.
@nocchijiang (Contributor) commented Oct 12, 2024

I tested this patch on a Swift codebase and noticed that it tries to move a statically-known cold block (0% probability) right after its (only) predecessor, instead of keeping it at the end of the function. I'm not sure if this is intended, but it looks counter-intuitive, brings no code size benefit, and may even hurt the compression ratio.

To add more background: the Swift compiler tends to generate a lot of runtime failure traps at the end of a function, which look like a series of brk #1 instructions (arm64). Each trap is a non-mergeable basic block with 0% branch probability. The compiler emits a check of a certain condition and a jump to the trap if the condition does not hold.
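
As a rough illustration of the pattern (a hypothetical C++ analogue, not actual Swift compiler output), each trap is a guarded operation that lowers to a cheap check plus a branch to a cold block ending in a trap; __builtin_trap() lowers to brk #1 on arm64:

// Hypothetical analogue of the Swift pattern: a bounds check guarding an
// access, with the failure path ending in a trap. The trap block has ~0%
// static branch probability and compiles to `brk #1` on arm64.
int elementAt(const int *Buf, long Count, long Idx) {
  if (Idx < 0 || Idx >= Count) // compiler-inserted-style bounds check
    __builtin_trap();          // cold trap block
  return Buf[Idx];
}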

Here is an example that I extracted from a bad case:

Before MBP:

bb.10 (%ir-block.40):
; predecessors: %bb.9
  successors: %bb.13(0x00000800), %bb.11(0x7ffff800); %bb.13(0.00%), %bb.11(100.00%)

  BL @objc_retainAutoreleasedReturnValue, <regmask $fp $lr $wzr $xzr $b8 $b9 $b10 $b11 $b12 $b13 $b14 $b15 $d8 $d9 $d10 $d11 $d12 $d13 $d14 $d15 $h8 $h9 $h10 $h11 $h12 $h13 $h14 $h15 $s8 $s9 $s10 $s11 $s12 and 55 more...>, implicit-def dead $lr, implicit $sp, implicit $x0, implicit-def $sp, implicit-def $x0
  CBZX $x0, %bb.13

bb.11 (%ir-block.47):
; predecessors: %bb.10
  successors: %bb.12(0x80000000); %bb.12(100.00%)
  liveins: $x0:0x0000000000000008

  BL @objc_release, <regmask $fp $lr $wzr $xzr $b8 $b9 $b10 $b11 $b12 $b13 $b14 $b15 $d8 $d9 $d10 $d11 $d12 $d13 $d14 $d15 $h8 $h9 $h10 $h11 $h12 $h13 $h14 $h15 $s8 $s9 $s10 $s11 $s12 and 55 more...>, implicit-def dead $lr, implicit $sp, implicit $x0, implicit-def $sp

bb.12 (%ir-block.49):
; predecessors: %bb.8, %bb.9, %bb.11

  RET undef $lr

bb.13 (%ir-block.50):
; predecessors: %bb.10

  INLINEASM &"" [sideeffect] [attdialect], $0:[imm], 0
  BRK 1

After MBP:

bb.10 (%ir-block.40):
; predecessors: %bb.9
  successors: %bb.13(0x00000800), %bb.11(0x7ffff800); %bb.13(0.00%), %bb.11(100.00%)

  BL @objc_retainAutoreleasedReturnValue, <regmask $fp $lr $wzr $xzr $b8 $b9 $b10 $b11 $b12 $b13 $b14 $b15 $d8 $d9 $d10 $d11 $d12 $d13 $d14 $d15 $h8 $h9 $h10 $h11 $h12 $h13 $h14 $h15 $s8 $s9 $s10 $s11 $s12 and 55 more...>, implicit-def dead $lr, implicit $sp, implicit $x0, implicit-def $sp, implicit-def $x0
  CBNZX $x0, %bb.11

bb.13 (%ir-block.50):
; predecessors: %bb.10

  INLINEASM &"" [sideeffect] [attdialect], $0:[imm], 0
  BRK 1

bb.11 (%ir-block.47):
; predecessors: %bb.10
  successors: %bb.12(0x80000000); %bb.12(100.00%)
  liveins: $x0:0x0000000000000008

  BL @objc_release, <regmask $fp $lr $wzr $xzr $b8 $b9 $b10 $b11 $b12 $b13 $b14 $b15 $d8 $d9 $d10 $d11 $d12 $d13 $d14 $d15 $h8 $h9 $h10 $h11 $h12 $h13 $h14 $h15 $s8 $s9 $s10 $s11 $s12 and 55 more...>, implicit-def dead $lr, implicit $sp, implicit $x0, implicit-def $sp

bb.12 (%ir-block.49):
; predecessors: %bb.8, %bb.9, %bb.11

  RET undef $lr

@spupyrev (Contributor, Author)

Thanks for testing this out!

> and may even hurt the compression ratio

Can you elaborate on this? I don't see a size regression in the provided example.

To clarify, this approach is intentionally designed to discard branch frequencies (i.e., the "0% probability") in order to get size wins, so I don't immediately see a "problem" here. The motivation is to get better compression of cold functions, where a non-perf-friendly layout isn't a concern.
We could potentially adjust the optimization so that it tries to respect certain branch probabilities, though that might result in less compressible code. Can you explain the Swift use case in more detail, and why one would care about the placement of cold blocks?
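
To illustrate the compression/ICF angle with a hypothetical example (not from the patch): two functions with identical logic but different profiles can receive different block orders under profile-guided layout, so their bytes differ and ICF cannot fold them; under profile-agnostic layout they come out identical.

// Hypothetical example: identical logic, different profiles. Profile-guided
// layout may move clampHot's cold path out of line while leaving clampCold's
// in place, yielding different byte sequences that ICF cannot merge.
// Size-aware, profile-agnostic layout orders both identically, so ICF can
// fold them into a single copy.
int clampHot(int X) {
  if (X < 0) // cold in clampHot's profile
    return 0;
  return X;
}
int clampCold(int X) {
  if (X < 0) // hot in clampCold's profile
    return 0;
  return X;
}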

@nocchijiang (Contributor) commented Oct 14, 2024

> Can you elaborate on this? I don't see a size regression in the provided example.

If there are multiple traps in a single function, they are laid out at the end of it, which looks like this:

...
ret
brk #1
brk #1
brk #1
brk #1
brk #1
brk #1
; end-of-function

With ApplyExtTspForSize enabled, they will no longer be "grouped" together. Since compression algorithms love similar inputs, I guess the compression ratio would take a hit, though you can definitely say that is a wild guess. I also noticed a real code size regression yesterday caused by falling through to the traps (i.e., cold blocks), so that is more straightforward to illustrate as a potential chance of improvement.

Good:

10df62ab4:     	mov	x5, x30
10df62ab8:     	bl	0x10df7aa60
10df62abc:     	mov	x30, x5
10df62ac0:     	stp	x29, x30, [sp, #64]
10df62ac4:     	add	x29, sp, #64
10df62ac8:     	bl	0x107c73c6c
10df62acc:     	mov	x21, x0
10df62ad0:     	tbz	w2, #0, 0x10df62aec
10df62ad4:     	bl	0x107c749d4
10df62ad8:     	b.ge	0x10df62af0
10df62adc:     	bl	0x107c749d0
10df62ae0:     	b.mi	0x10df62b78
10df62ae4:     	bl	0x107c749cc
10df62ae8:     	b	0x10df62af0
10df62aec:     	mov	x8, x1
10df62af0:     	bl	0x107c749c8
10df62af4:     	cbz	x23, 0x10df62b1c
10df62af8:     	adrp	x0, 0x115984000
10df62afc:     	add	x0, x0, #3328
10df62b00:     	bl	0x107c24cb0
10df62b04:     	lsl	x8, x23, #4
10df62b08:     	add	x1, x8, #32
10df62b0c:     	bl	0x107c73850
10df62b10:     	bl	0x107c749c4
10df62b14:     	bl	0x108ac85d8
10df62b18:     	b	0x10df62b28
10df62b1c:     	adrp	x23, 0x111b71000
10df62b20:     	ldr	x23, [x23, #1616]
10df62b24:     	bl	0x107c73d1c
10df62b28:     	add	x24, x23, #32
10df62b2c:     	add	x25, x20, #32
10df62b30:     	tbz	w21, #0, 0x10df62b58
10df62b34:     	add	x8, x25, x22, lsl #4
10df62b38:     	cmp	x23, x20
10df62b3c:     	ccmp	x8, x24, #0, eq
10df62b40:     	b.hi	0x10df62b50
10df62b44:     	lsl	x2, x22, #4
10df62b48:     	bl	0x107c73ae8
10df62b4c:     	bl	0x10fea94a4
10df62b50:     	str	xzr, [x20, #16]
10df62b54:     	b	0x10df62b68
10df62b58:     	adrp	x0, 0x115880000
10df62b5c:     	add	x0, x0, #512
10df62b60:     	bl	0x107c24cb0
10df62b64:     	bl	0x107739f6c
10df62b68:     	bl	0x107c749c0
10df62b6c:     	mov	x0, x23
10df62b70:     	ldp	x29, x30, [sp, #64]
10df62b74:     	b	0x107c743f0
10df62b78:     	brk	#0x1

Bad:

10def43bc:     	mov	x5, x30
10def43c0:     	bl	0x10df0c250
10def43c4:     	mov	x30, x5
10def43c8:     	stp	x29, x30, [sp, #64]
10def43cc:     	add	x29, sp, #64
10def43d0:     	bl	0x107c72c58
10def43d4:     	mov	x21, x0
10def43d8:     	tbnz	w2, #0, 0x10def43e4
10def43dc:     	mov	x8, x1
10def43e0:     	b	0x10def4400
10def43e4:     	bl	0x107c739c4
10def43e8:     	b.ge	0x10def4400
10def43ec:     	bl	0x107c739c0
10def43f0:     	b.pl	0x10def43f8
10def43f4:     	brk	#0x1
10def43f8:     	bl	0x107c739bc
10def43fc:     	csel	x8, x8, x1, gt
10def4400:     	bl	0x107c739b8
10def4404:     	cbz	x23, 0x10def442c
10def4408:     	adrp	x0, 0x1158f0000
10def440c:     	add	x0, x0, #3328
10def4410:     	bl	0x107c24d74
10def4414:     	lsl	x8, x23, #4
10def4418:     	add	x1, x8, #32
10def441c:     	bl	0x107c7283c
10def4420:     	bl	0x107c739b4
10def4424:     	bl	0x108aa160c
10def4428:     	b	0x10def4438
10def442c:     	adrp	x23, 0x111add000
10def4430:     	ldr	x23, [x23, #1616]
10def4434:     	bl	0x107c72d04
10def4438:     	add	x24, x23, #32
10def443c:     	add	x25, x20, #32
10def4440:     	tbz	w21, #0, 0x10def4468
10def4444:     	add	x8, x25, x22, lsl #4
10def4448:     	cmp	x23, x20
10def444c:     	ccmp	x8, x24, #0, eq
10def4450:     	b.hi	0x10def4460
10def4454:     	lsl	x2, x22, #4
10def4458:     	bl	0x107c72adc
10def445c:     	bl	0x10fe1508c
10def4460:     	str	xzr, [x20, #16]
10def4464:     	b	0x10def4478
10def4468:     	adrp	x0, 0x1157ec000
10def446c:     	add	x0, x0, #512
10def4470:     	bl	0x107c24d74
10def4474:     	bl	0x1077245ec
10def4478:     	bl	0x107c739b0
10def447c:     	mov	x0, x23
10def4480:     	ldp	x29, x30, [sp, #64]
10def4484:     	b	0x107c733dc

In the "Bad" assembly you can see that at 10def43fc an additional csel is emitted as a consequence of moving brk.

Edit: this "regression" is caused by a random outlining decision; the csel is outlined in the "Good" one.

I am wondering whether it is possible to not move statically 0%-probability cold blocks (or at least to make that an option for Swift inputs)?

@spupyrev (Contributor, Author)

@nocchijiang we tried modifying the optimization so that it places cold blocks at the end of the layout. Unfortunately, there were no compressed/uncompressed size improvements over a set of large apps, including Swift-heavy ones. I did observe the behavior you describe (cold brk in the middle of the layout), but placing them at the end doesn't seem to be beneficial. (Theoretically, there might be some value in integrating block hotness into the placement, and I'd be happy to review diffs if you find a good way of doing so.)

@nocchijiang (Contributor)

@spupyrev Would you mind sharing the modification as a draft so that I could test it against our codebase? It may be related to code quality/style: poorly written code may be more likely to result in excessive brks. FYI, ~0.5% of the binary is made up of brk instructions in one of our Swift apps.

@spupyrev (Contributor, Author)

You could try something along these lines. I haven't fine-tuned or polished the code, so be careful.

       // optimization; prioritize slightly jumps with a single successor, since
       // the corresponding jump instruction will be removed from the binary.
       const uint64_t Freq = Succs.size() == 1 ? 110 : 100;
-      for (const MachineBasicBlock *Succ : Succs)
-        JumpCounts.push_back({BlockIndex[&MBB], BlockIndex[Succ], Freq});
+      if (Succs.size() == 2) {
+        // NEW
+        BranchProbability BP0 = MBPI->getEdgeProbability(&MBB, Succs[0]);
+        BlockFrequency BF0 = BlockFrequency(100) * BP0;
+        size_t Pct0 = BF0.getFrequency();
+        assert(Pct0 <= 100);
+        BranchProbability BP1 = MBPI->getEdgeProbability(&MBB, Succs[1]);
+        BlockFrequency BF1 = BlockFrequency(100) * BP1;
+        size_t Pct1 = BF1.getFrequency();
+        assert(Pct1 <= 100);
+
+        if (Pct0 <= 5) {
+          JumpCounts.push_back({BlockIndex[&MBB], BlockIndex[Succs[0]], 10});
+          JumpCounts.push_back({BlockIndex[&MBB], BlockIndex[Succs[1]], Freq});
+        } else if (Pct1 <= 5) {
+          JumpCounts.push_back({BlockIndex[&MBB], BlockIndex[Succs[1]], 10});
+          JumpCounts.push_back({BlockIndex[&MBB], BlockIndex[Succs[0]], Freq});
+        } else {
+          // In case of ties, the last element of JumpCounts is prioritized.
+          // Thus, make sure that Succs[1] is the original fall-through.
+          if (BlockIndex[&MBB] + 1 == BlockIndex[Succs[0]])
+            std::swap(Succs[0], Succs[1]);
+          JumpCounts.push_back({BlockIndex[&MBB], BlockIndex[Succs[0]], Freq});
+          JumpCounts.push_back({BlockIndex[&MBB], BlockIndex[Succs[1]], Freq});
+        }
+      } else {
+        // OLD
+        for (const MachineBasicBlock *Succ : Succs)
+          JumpCounts.push_back({BlockIndex[&MBB], BlockIndex[Succ], Freq});
+      }
     } else {
       for (MachineBasicBlock *Succ : MBB.successors()) {
         auto EP = MBPI->getEdgeProbability(&MBB, Succ);

@nocchijiang (Contributor)

@spupyrev Thanks for the demo code! I tested it on the codebase I mentioned earlier, and it appears to regress the compression ratio significantly (~40%). It turns out my earlier guess was indeed a wild one.

ellishg added a commit that referenced this pull request Nov 18, 2024

#109711 disables `buildCFGChains()` when `-apply-ext-tsp-for-size` is used to improve codesize. Tail merging can change the layout and normally requires `buildCFGChains()` to be called again, but we want to prevent this when optimizing for codesize. We saw a slight size improvement on large binaries with this change. If `-apply-ext-tsp-for-size` is not used, this should be NFC.