Commit 1a4c125

address comments: improve API description
1 parent 03dd8d6 commit 1a4c125

1 file changed, +1 -1 lines changed

prototype_source/context_parallel.rst

Lines changed: 1 addition & 1 deletion
@@ -32,7 +32,7 @@ Two Ring Attention variants have been implemented: `the all-gather based pass-KV
  1. The all-gather based pass-KV algorithm is used in Llama3 training, which initially performs an all-gather on the key and value tensors, followed by computing the attention output for the
  local query tensor chunk. Our modified all-gather based pass-KV algorithm concurrently all-gathers KV shards and computes attention output for the local query tensor chunk
  using local key and value tensor chunks, followed by a final computation of attention output for the local query tensor and remaining KV shards. This allows some degree of
- overlap between the attention computation and the all-gather collective.
+ overlap between the attention computation and the all-gather collective. For example, in the case of Llama3 training, we also shard ``freq_cis`` over the sequence dimension.
  2. The all-to-all approach uses interleaved all-to-all collectives to ring shuffle KV shards to overlap the SDPA (Scaled Dot Product Attention) computation and the all-to-all communication
  necessary for the next SDPA.
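The first variant in the diff above computes attention for the local query chunk against the local KV chunk first, then against the remaining KV shards. The sketch below shows why those partial results can be combined into the same answer as attention over the full sequence. It is a minimal pure-Python illustration, not the PyTorch implementation: the helper names (`partial_attention`, `merge`, `sharded_attention`) are hypothetical, attention is non-causal, the shards are processed one after another instead of overlapping with an all-gather, and the log-sum-exp merge is one common way to combine partial softmax results that the diff itself does not specify.

```python
import math


def partial_attention(q, keys, values):
    """Attention of one query vector over a single KV chunk.

    Returns (out, lse): the softmax-weighted output over this chunk only,
    and the log-sum-exp of the scaled scores, which is needed to merge
    this chunk with others later.
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / total
           for d in range(dim)]
    return out, m + math.log(total)


def merge(out_a, lse_a, out_b, lse_b):
    """Combine two partial results as if softmax had covered both chunks."""
    m = max(lse_a, lse_b)
    wa, wb = math.exp(lse_a - m), math.exp(lse_b - m)
    out = [(wa * a + wb * b) / (wa + wb) for a, b in zip(out_a, out_b)]
    return out, m + math.log(wa + wb)


def sharded_attention(q, kv_shards, rank):
    """Hypothetical helper: local KV shard first, then the remaining shards.

    kv_shards is a list of (keys, values) pairs, one per rank, obtained by
    splitting the sequence dimension. In a real run the remaining shards
    would arrive through a collective while the local step is computed.
    """
    keys, values = kv_shards[rank]
    out, lse = partial_attention(q, keys, values)
    for r, (keys, values) in enumerate(kv_shards):
        if r == rank:
            continue  # local shard was already handled above
        o, l = partial_attention(q, keys, values)
        out, lse = merge(out, lse, o, l)
    return out, lse
```

Splitting four key/value pairs into two shards of two and calling `sharded_attention` on either rank should reproduce `partial_attention` over all four pairs up to floating-point rounding, which is the property that lets the local computation start before the remaining KV shards are available.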
