
Commit 3c4e84d

s/constructors/model constructors and s/match/roughly match
1 parent 5aedfa2 commit 3c4e84d

File tree

1 file changed: +10 / -10 lines changed

recipes_source/recipes/tuning_guide.py

Lines changed: 10 additions & 10 deletions
@@ -333,16 +333,16 @@ def fused_gelu(x):
 # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 # `torch.nn.parallel.DistributedDataParallel <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html#torch.nn.parallel.DistributedDataParallel>`_
 # with ``find_unused_parameters=True`` uses the order of layers and parameters
-# from constructors to build buckets for ``DistributedDataParallel`` gradient
-# all-reduce. ``DistributedDataParallel`` overlaps all-reduce with the backward
-# pass. All-reduce for a particular bucket is asynchronously triggered only when
-# all gradients for parameters in a given bucket are available.
-#
-# To maximize the amount of overlap, the order in constructors should match the
-# order during the execution. If the order doesn't match, then all-reduce for
-# the entire bucket waits for the gradient which is the last to arrive, this may
-# reduce the overlap between backward pass and all-reduce, all-reduce may end up
-# being exposed, which slows down the training.
+# from model constructors to build buckets for ``DistributedDataParallel``
+# gradient all-reduce. ``DistributedDataParallel`` overlaps all-reduce with the
+# backward pass. All-reduce for a particular bucket is asynchronously triggered
+# only when all gradients for parameters in a given bucket are available.
+#
+# To maximize the amount of overlap, the order in model constructors should
+# roughly match the order during the execution. If the order doesn't match, then
+# all-reduce for the entire bucket waits for the gradient which is the last to
+# arrive, this may reduce the overlap between backward pass and all-reduce,
+# all-reduce may end up being exposed, which slows down the training.
 #
 # ``DistributedDataParallel`` with ``find_unused_parameters=False`` (which is
 # the default setting) relies on automatic bucket formation based on order of
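For context (not part of the commit), a minimal single-process sketch of the behavior the changed comments describe: declaring layers in the model constructor in roughly the same order they run in ``forward()`` so that ``DistributedDataParallel`` gradient buckets can overlap all-reduce with the backward pass. The ``Net`` module, its layer sizes, and the single-rank ``gloo`` process group below are invented for illustration only.

import os
import torch
import torch.distributed as dist
import torch.nn as nn

# Single-process process group purely so the example runs standalone.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        # Layers are declared in the same order they execute in forward(),
        # so DDP's buckets roughly line up with gradient arrival order
        # during the backward pass.
        self.fc1 = nn.Linear(64, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

ddp_model = nn.parallel.DistributedDataParallel(
    Net(),
    find_unused_parameters=True,  # buckets follow the constructor order
)

loss = ddp_model(torch.randn(8, 64)).sum()
loss.backward()  # bucket all-reduces are triggered asynchronously here
dist.destroy_process_group()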
