Commit 9a1b2f7

Merge branch 'main' into docs/autoload
2 parents 0b52b02 + be7f1b3 commit 9a1b2f7

File tree

2 files changed: +165, -7 lines


.github/workflows/StalePRs.yml

Lines changed: 157 additions & 0 deletions
@@ -0,0 +1,157 @@
+# A workflow copied from the pytorch/pytorch repo's stale-PRs workflow that implements similar logic to actions/stale.
+#
+# Compared to actions/stale, it is implemented to make API requests proportional
+# to the number of stale PRs, not the total number of issues in the repo. This
+# is because PyTorch has a lot of issues/PRs, so actions/stale runs into
+# rate limits way too quickly.
+#
+# The behavior is:
+# - If a PR is not labeled stale, after 60 days of inactivity label the PR as stale and comment about it.
+# - If a PR is labeled stale, after 30 days of inactivity close the PR.
+# - `high priority` and `no-stale` PRs are exempt.
+
+name: Close stale pull requests
+
+on:
+  schedule:
+    # Run at midnight UTC.
+    - cron: '0 0 * * *'
+  workflow_dispatch:
+
+jobs:
+  stale:
+    if: ${{ github.repository == 'pytorch/tutorials' }}
+    runs-on: ubuntu-latest
+    permissions:
+      contents: read
+      pull-requests: write
+
+    steps:
+      - uses: actions/github-script@v6
+        with:
+          script: |
+            // Do some dumb retries on requests.
+            const retries = 7;
+            const baseBackoff = 100;
+            const sleep = timeout => new Promise(resolve => setTimeout(resolve, timeout));
+            github.hook.wrap('request', async (request, options) => {
+              for (let attempt = 1; attempt <= retries; attempt++) {
+                try {
+                  return await request(options);
+                } catch (err) {
+                  if (attempt < retries) {
+                    core.warning(`Request getting retried. Attempt: ${attempt}`);
+                    await sleep(baseBackoff * Math.pow(2, attempt));
+                    continue;
+                  }
+                  throw err;
+                }
+              }
+            });
+
+            const MAX_API_REQUESTS = 100;
+
+            // If a PR is not labeled stale, label it stale after no update for 60 days.
+            const STALE_LABEL_THRESHOLD_MS = 1000 * 60 * 60 * 24 * 60;
+            // For PRs already labeled stale, close after no update for 30 days.
+            const STALE_CLOSE_THRESHOLD_MS = 1000 * 60 * 60 * 24 * 30;
+
+            const STALE_MESSAGE =
+              "Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as `Stale`. <br>" +
+              "Feel free to remove the `Stale` label if you feel this was a mistake. <br>" +
+              "If you are unable to remove the `Stale` label please contact a maintainer in order to do so. <br>" +
+              "If you want the bot to never mark this PR stale again, add the `no-stale` label.<br>" +
+              "`Stale` pull requests will automatically be closed after 30 days of inactivity.<br>";
+
+            let numAPIRequests = 0;
+            let numProcessed = 0;
+
+            async function processPull(pull) {
+              core.info(`[${pull.number}] URL: ${pull.html_url}`);
+              numProcessed += 1;
+              const labels = pull.labels.map((label) => label.name);
+
+              // Skip if certain labels are present.
+              if (labels.includes("no-stale") || labels.includes("high priority")) {
+                core.info(`[${pull.number}] Skipping because PR has an exempting label.`);
+                return false;
+              }
+
+              // Check if the PR is stale, according to our configured thresholds.
+              let staleThresholdMillis;
+              if (labels.includes("Stale")) {
+                core.info(`[${pull.number}] PR is labeled stale, checking whether we should close it.`);
+                staleThresholdMillis = STALE_CLOSE_THRESHOLD_MS;
+              } else {
+                core.info(`[${pull.number}] Checking whether to label PR as stale.`);
+                staleThresholdMillis = STALE_LABEL_THRESHOLD_MS;
+              }
+
+              const millisSinceLastUpdated =
+                new Date().getTime() - new Date(pull.updated_at).getTime();
+
+              if (millisSinceLastUpdated < staleThresholdMillis) {
+                core.info(`[${pull.number}] Skipping because PR was updated recently`);
+                return false;
+              }
+
+              // At this point, we know we should do something.
+              // For PRs already labeled stale, close them.
+              if (labels.includes("Stale")) {
+                core.info(`[${pull.number}] Closing PR.`);
+                numAPIRequests += 1;
+                await github.rest.issues.update({
+                  owner: "pytorch",
+                  repo: "tutorials",
+                  issue_number: pull.number,
+                  state: "closed",
+                });
+              } else {
+                // For PRs not labeled stale, label them stale.
+                core.info(`[${pull.number}] Labeling PR as stale.`);
+
+                numAPIRequests += 1;
+                await github.rest.issues.createComment({
+                  owner: "pytorch",
+                  repo: "tutorials",
+                  issue_number: pull.number,
+                  body: STALE_MESSAGE,
+                });
+
+                numAPIRequests += 1;
+                await github.rest.issues.addLabels({
+                  owner: "pytorch",
+                  repo: "tutorials",
+                  issue_number: pull.number,
+                  labels: ["Stale"],
+                });
+              }
+            }
+
+            for await (const response of github.paginate.iterator(
+              github.rest.pulls.list,
+              {
+                owner: "pytorch",
+                repo: "tutorials",
+                state: "open",
+                sort: "created",
+                direction: "asc",
+                per_page: 100,
+              }
+            )) {
+              numAPIRequests += 1;
+              const pulls = response.data;
+              // Awaiting in a loop is intentional here. We want to serialize execution so
+              // that log groups are printed correctly.
+              for (const pull of pulls) {
+                if (numAPIRequests > MAX_API_REQUESTS) {
+                  core.warning("Max API requests exceeded, exiting.");
+                  process.exit(0);
+                }
+                await core.group(`Processing PR #${pull.number}`, async () => {
+                  await processPull(pull);
+                });
+              }
+            }
+            core.info(`Processed ${numProcessed} PRs total.`);
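To make the policy in the header comments concrete: a PR with an exempting label is skipped, an unlabeled PR is marked Stale after 60 days without updates, and a Stale-labeled PR is closed after a further 30 days of inactivity. The Python sketch below only illustrates that documented decision rule; the function name, return values, and the example call are invented for this sketch and are not part of the workflow.

from datetime import datetime, timedelta, timezone

# Thresholds and exemptions documented in the workflow above.
STALE_LABEL_THRESHOLD = timedelta(days=60)   # unlabeled PR -> add the Stale label
STALE_CLOSE_THRESHOLD = timedelta(days=30)   # Stale-labeled PR -> close
EXEMPT_LABELS = {"no-stale", "high priority"}

def decide_action(labels, last_updated):
    """Return 'skip', 'label', or 'close' for a PR, mirroring the workflow's checks."""
    if EXEMPT_LABELS & set(labels):
        return "skip"
    threshold = STALE_CLOSE_THRESHOLD if "Stale" in labels else STALE_LABEL_THRESHOLD
    if datetime.now(timezone.utc) - last_updated < threshold:
        return "skip"
    return "close" if "Stale" in labels else "label"

# A PR labeled Stale with no update for 45 days would be closed.
print(decide_action(["Stale"], datetime.now(timezone.utc) - timedelta(days=45)))  # close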

prototype_source/flight_recorder_tutorial.rst

Lines changed: 8 additions & 7 deletions
@@ -46,15 +46,15 @@ Flight Recorder consists of two core parts:
 
 Enabling Flight Recorder
 ------------------------
-There are two required environment variables to get the initial version of Flight Recorder working.
+There are three required environment variables to get the initial version of Flight Recorder working.
 
-- ``TORCH_NCCL_DEBUG_INFO_TEMP_FILE``: Setting the path where the flight recorder will be dumped with file prefix. One file per
-  rank. The default value is ``/tmp/nccl_trace_rank_``.
 - ``TORCH_NCCL_TRACE_BUFFER_SIZE = (0, N)``: Setting ``N`` to a positive number enables collection.
   ``N`` represents the number of entries that will be kept internally in a circular buffer.
-  We recommended to set this value at *2000*.
+  We recommend setting this value to *2000*. The default value is ``2000``.
 - ``TORCH_NCCL_DUMP_ON_TIMEOUT = (true, false)``: Setting this to ``true`` will write out diagnostic files to disk on job timeout.
-  If enabled, there will be one file per rank output in the job's running directory.
+  If enabled, one file per rank will be written to the job's running directory. The default value is ``false``.
+- ``TORCH_NCCL_DEBUG_INFO_TEMP_FILE``: Sets the path (with a file prefix) where the flight recorder data will be dumped. One file per
+  rank. The default value is ``/tmp/nccl_trace_rank_``.
 
 **Optional settings:**
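A usage note on the three required variables above: in a training script they are typically set in the environment before the NCCL process group is created. The sketch below is only an illustration; it assumes a standard ``torch.distributed.init_process_group(backend="nccl")`` setup (for example under torchrun), while the variable names and values come straight from the list above.

import os
import torch.distributed as dist

# Values taken from the list above; set them before the process group is created.
os.environ["TORCH_NCCL_TRACE_BUFFER_SIZE"] = "2000"        # a positive value enables collection (recommended: 2000)
os.environ["TORCH_NCCL_DUMP_ON_TIMEOUT"] = "true"          # dump diagnostic files to disk on job timeout
os.environ["TORCH_NCCL_DEBUG_INFO_TEMP_FILE"] = "/tmp/nccl_trace_rank_"  # one dump file per rank, with this prefix

# Assumes the usual rendezvous variables (MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE)
# are already provided, for example by torchrun.
dist.init_process_group(backend="nccl")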

@@ -74,7 +74,8 @@ Additional Settings
   ``fast`` is a new experimental mode that is shown to be much faster than the traditional ``addr2line``.
   Use this setting in conjunction with ``TORCH_NCCL_TRACE_CPP_STACK`` to collect C++ traces in the Flight Recorder data.
 - If you prefer not to have the flight recorder data dumped to the local disk but rather onto your own storage, you can define your own writer class.
-  This class should inherit from class ``::c10d::DebugInfoWriter`` and then register the new writer using ``::c10d::DebugInfoWriter::registerWriter``
+  This class should inherit from class ``::c10d::DebugInfoWriter`` `(code) <https://github.com/pytorch/pytorch/blob/release/2.5/torch/csrc/distributed/c10d/NCCLUtils.hpp#L237>`__
+  and then register the new writer using ``::c10d::DebugInfoWriter::registerWriter`` `(code) <https://github.com/pytorch/pytorch/blob/release/2.5/torch/csrc/distributed/c10d/NCCLUtils.hpp#L242>`__
   before we initiate PyTorch distributed.
 
 Retrieving Flight Recorder Data via an API
@@ -189,7 +190,7 @@ command directly:
 Currently, we support two modes for the analyzer script. The first mode allows the script to apply some heuristics to the parsed flight
 recorder dumps to generate a report identifying potential culprits for the timeout. The second mode simply outputs the raw dumps.
 By default, the script prints flight recorder dumps for all ranks and all ``ProcessGroups`` (PGs). This can be narrowed down to certain
-ranks and PGs. An example command is:
+ranks and PGs using the *--selected-ranks* argument. An example command is:
 
 Caveat: the ``tabulate`` module is needed, so you might need to pip install it first.
