diff --git a/intermediate_source/model_parallel_tutorial.py b/intermediate_source/model_parallel_tutorial.py
index 3a8ba248b43..f707b500c5e 100644
--- a/intermediate_source/model_parallel_tutorial.py
+++ b/intermediate_source/model_parallel_tutorial.py
@@ -4,15 +4,15 @@
*************************************************************
**Author**: `Shen Li `_
-Data parallel and model parallel are widely-used distributed training
+Data parallel and model parallel are widely used distributed training
techniques. Previous posts have explained how to use
`DataParallel `_
to train a neural network on multiple GPUs. ``DataParallel`` replicates the
same model to all GPUs, where each GPU consumes a different partition of the
input data. Although it can significantly accelerate the training process, it
-does not work for some use cases where the model is large to fit into a single
-GPU. This post shows how to solve that problem by using model parallel and also
-shares some insights on how to speed up model parallel training.
+does not work for some use cases where the model is too large to fit into a
+single GPU. This post shows how to solve that problem by using model parallel
+and also shares some insights on how to speed up model parallel training.
The high-level idea of model parallel is to place different sub-networks of a
model onto different devices, and implement the ``forward`` method accordingly
@@ -23,11 +23,21 @@
of model parallel. It is up to the readers to apply the ideas to real-world
applications.
-Let us start with a toy model that contains two linear layers. To run this
-model on two GPUs, simply put each linear layer on a different GPU, and move
-inputs and intermediate outputs to match the layer devices accordingly.
+**Recommended Reading:**
+
+- https://pytorch.org/ for installation instructions
+- :doc:`/beginner/blitz/data_parallel_tutorial` Single-Machine Data Parallel
+- :doc:`/intermediate/ddp_tutorial` Combine Distributed Data Parallel and Model Parallel
"""
+######################################################################
+# Basic Usage
+# =======================
+#
+# Let us start with a toy model that contains two linear layers. To run this
+# model on two GPUs, simply put each linear layer on a different GPU, and move
+# inputs and intermediate outputs to match the layer devices accordingly.
+
import torch
import torch.nn as nn
import torch.optim as optim
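######################################################################
# As a minimal sketch of the idea described above (assuming two CUDA
# devices, ``cuda:0`` and ``cuda:1``, are available; the layer sizes
# here are arbitrary), such a two-layer toy model could look like the
# following:

class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        # place the first linear layer on the first GPU
        self.net1 = torch.nn.Linear(10, 10).to('cuda:0')
        self.relu = torch.nn.ReLU()
        # place the second linear layer on the second GPU
        self.net2 = torch.nn.Linear(10, 5).to('cuda:1')

    def forward(self, x):
        # run the first layer on cuda:0, then move the intermediate
        # output to cuda:1 before feeding it to the second layer
        x = self.relu(self.net1(x.to('cuda:0')))
        return self.net2(x.to('cuda:1'))

######################################################################
# When training such a model, the labels must be moved to the same
# device as the outputs (here ``cuda:1``) before computing the loss.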