From 9cc665b7d99bbd6613e93c1c323d2c883927ceb5 Mon Sep 17 00:00:00 2001
From: Sahdev Zala
Date: Mon, 22 Jan 2024 12:53:56 -0500
Subject: [PATCH] Add FSDP reference

The PyTorch Distributed Overview page
(https://pytorch.org/tutorials/beginner/dist_overview.html) is widely used to
learn the basics of the distributed package and its offerings. It seems this
page was created before FSDP support was added to PyTorch. This PR adds the
missing FSDP reference.
---
 beginner_source/dist_overview.rst | 17 ++++++++++++++++-
 en-wordlist.txt                   |  1 +
 2 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/beginner_source/dist_overview.rst b/beginner_source/dist_overview.rst
index 7768dc4876c..7309693e0e1 100644
--- a/beginner_source/dist_overview.rst
+++ b/beginner_source/dist_overview.rst
@@ -74,7 +74,10 @@ common development trajectory would be:
 4. Use multi-machine `DistributedDataParallel `__
    and the `launching script `__,
    if the application needs to scale across machine boundaries.
-5. Use `torch.distributed.elastic `__
+5. Use multi-GPU `FullyShardedDataParallel `__
+   training on a single machine or across multiple machines when the model
+   cannot fit on one GPU.
+6. Use `torch.distributed.elastic `__
    to launch distributed training if errors (e.g., out-of-memory) are
    expected or if resources can join and leave dynamically during training.
 
@@ -134,6 +137,18 @@ DDP materials are listed below:
 5. The `Distributed Training with Uneven Inputs Using the Join Context Manager <../advanced/generic_join.html>`__
    tutorial walks through using the generic join context for distributed training with uneven inputs.
 
+
+``torch.distributed.FullyShardedDataParallel``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+`FullyShardedDataParallel `__ (FSDP) is a type of data parallelism
+paradigm. Unlike DistributedDataParallel (DDP), which maintains a per-GPU copy of a
+model's parameters, gradients, and optimizer states, FSDP shards all of these states
+across data-parallel workers. Support for FSDP was added in PyTorch v1.11. The tutorial
+`Getting Started with FSDP `__
+provides an in-depth explanation and example of how FSDP works.
+
+
 torch.distributed.elastic
 ~~~~~~~~~~~~~~~~~~~~~~~~~
 
diff --git a/en-wordlist.txt b/en-wordlist.txt
index 1ec9abb68de..da13503a56c 100644
--- a/en-wordlist.txt
+++ b/en-wordlist.txt
@@ -95,6 +95,7 @@ ExportDB
 FC
 FGSM
 FLAVA
+FSDP
 FX
 FX's
 FloydHub
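As a point of reference for reviewers, the FSDP usage described in the new section amounts to something like the following minimal sketch. It is illustrative only and not part of the patch; the toy model, the dimensions, and the ``torchrun`` launch are assumptions::

    # Minimal FSDP sketch (assumes two or more CUDA GPUs and a launch such as
    # `torchrun --nproc_per_node=2 fsdp_sketch.py`, which sets the rank and
    # world-size environment variables that init_process_group() reads).
    import os

    import torch
    import torch.distributed as dist
    import torch.nn as nn
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


    def main():
        dist.init_process_group("nccl")
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)

        model = nn.Sequential(
            nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 10)
        ).cuda()
        # Wrapping with FSDP shards parameters, gradients, and optimizer state
        # across the data-parallel workers instead of replicating them per GPU.
        model = FSDP(model)

        # Build the optimizer after wrapping so it references the sharded parameters.
        optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

        inputs = torch.randn(8, 1024, device="cuda")
        loss = model(inputs).sum()
        loss.backward()
        optimizer.step()

        dist.destroy_process_group()


    if __name__ == "__main__":
        main()

For real models, the Getting Started with FSDP tutorial referenced in the patch covers the additional pieces this sketch omits, such as auto-wrapping policies and mixed precision.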