Commit 2586a79

Apply suggestions from code review
1 parent ba4d7d6 commit 2586a79

1 file changed, +2 -2 lines changed

prototype_source/flight_recorder_tutorial.rst

Lines changed: 2 additions & 2 deletions
@@ -37,8 +37,8 @@ A job can get stuck for various reasons:
 - **Synchronization Issues:** In distributed training, different parts of the computation are often run in parallel and need to be synchronized at certain points. If this synchronization fails, the job can get stuck. For example, a deadlock can occur if one or more ranks fail to join a collective while the remaining ranks have joined. This results in an indefinite wait for the job to progress.

 Flight Recorder, as the name suggests, captures diagnostics information as collectives run. The captured diagnostic
-information is used to help root cause issues when jobs get stuck.
-There are two core parts to Flight Recorder.
+information is used to help identify the root causes of issues when jobs become stuck.
+Flight Recorder consists of two core parts:

 - The collection portion: when enabled, information about collectives is recorded in an in-memory circular buffer. Upon job timeout, or on demand, the in-memory buffer can be retrieved or dumped to file.
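
Reviewer note: the collection portion described in the last context line (an in-memory circular buffer that can be retrieved or dumped to file) can be illustrated with a minimal sketch. This is not the Flight Recorder implementation or its API. The class `CollectiveRecorder`, its methods, and the entry fields `seq_id`, `op`, and `state` are hypothetical names chosen for illustration only.

```python
import json
from collections import deque


class CollectiveRecorder:
    """Hypothetical illustration of a circular-buffer collector (not the PyTorch API)."""

    def __init__(self, capacity: int) -> None:
        # A deque with maxlen silently drops the oldest entry once it is full,
        # which gives circular-buffer behavior with bounded memory.
        self._buffer = deque(maxlen=capacity)

    def record(self, seq_id: int, op: str, state: str) -> None:
        # Store one entry per collective call.
        self._buffer.append({"seq_id": seq_id, "op": op, "state": state})

    def snapshot(self) -> list:
        # On-demand retrieval: copy the current contents, oldest first.
        return list(self._buffer)

    def dump(self, path: str) -> None:
        # Dump to file, for example when a timeout is detected.
        with open(path, "w") as f:
            json.dump(self.snapshot(), f)


recorder = CollectiveRecorder(capacity=3)
for seq_id, op in enumerate(["all_reduce", "broadcast", "all_gather", "all_reduce"]):
    recorder.record(seq_id, op, "completed")

entries = recorder.snapshot()
```

With a capacity of 3 and four recorded collectives, the oldest entry is overwritten, so only the most recent three remain available for retrieval or dumping.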
