diff --git a/intermediate_source/TP_tutorial.rst b/intermediate_source/TP_tutorial.rst
index 2d0193990d4..91e64a87488 100644
--- a/intermediate_source/TP_tutorial.rst
+++ b/intermediate_source/TP_tutorial.rst
@@ -83,8 +83,6 @@ To see how to utilize DeviceMesh to set up multi-dimensional parallelisms, pleas
 
 .. code-block:: python
 
-    # run this via torchrun: torchrun --standalone --nproc_per_node=8 ./tp_tutorial.py
-
     from torch.distributed.device_mesh import init_device_mesh
 
     tp_mesh = init_device_mesh("cuda", (8,))
@@ -360,4 +358,4 @@ Conclusion
 This tutorial demonstrates how to train a large Transformer-like model across hundreds to thousands of GPUs using Tensor Parallel in combination with Fully Sharded Data Parallel.
 It explains how to apply Tensor Parallel to different parts of the model, with **no code changes** to the model itself.
 Tensor Parallel is a efficient model parallelism technique for large scale training.
-To see the complete end to end code example explained in this tutorial, please refer to the `Tensor Parallel examples `__ in the pytorch/examples repository.
+To see the complete end-to-end code example explained in this tutorial, please refer to the `Tensor Parallel examples `__ in the pytorch/examples repository.
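
For review context, a minimal sketch of the snippet the first hunk touches. It assumes an 8-GPU host and reuses the ``torchrun`` launch command that the hunk removes from the tutorial text; the script name ``tp_sketch.py`` is hypothetical.

.. code-block:: python

    # Hypothetical standalone sketch; launch it the way the removed comment
    # described: torchrun --standalone --nproc_per_node=8 ./tp_sketch.py
    from torch.distributed.device_mesh import init_device_mesh

    # Create a 1-D mesh over 8 CUDA devices to serve as the tensor-parallel mesh.
    tp_mesh = init_device_mesh("cuda", (8,))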