
Commit 3762fd2

address reviews
1 parent d937769 commit 3762fd2

4 files changed: +16 −14 lines changed

en-wordlist.txt

Lines changed: 4 additions & 3 deletions
@@ -75,7 +75,6 @@ DataLoaders
 DeepMind
 DeiT
 DenseNet
-Devito
 EOS
 EPS
 Ecker
@@ -130,7 +129,8 @@ LeNet
 LeakyReLU
 LeakyReLUs
 Lipschitz
-logits
+LoRa
+LRSchedulers
 Lua
 Luong
 MLP
@@ -207,10 +207,10 @@ Unescape
 VGG
 VQA
 VS Code
+Woohoo
 Wikitext
 Xeon
 Xcode
-Zach
 accuracies
 activations
 adversarially
@@ -331,6 +331,7 @@ labelled
 learnable
 learnings
 loadFilename
+logits
 manualSeed
 matmul
 matplotlib

intermediate_source/optimizer_step_in_backward_tutorial.py

Lines changed: 12 additions & 11 deletions
@@ -18,7 +18,8 @@
 
 To run this tutorial, you will need:
 * PyTorch 2.1.0 or newer with ``torchvision``
-* 1 CUDA GPU
+* 1 CUDA GPU if you'd like to run the memory visualizations locally.
+  Otherwise, this technique would benefit similarly on any device.
 
 Let us start by importing the required modules and models. We will use a
 vision transformer model from torchvision, but feel free to substitute
@@ -60,9 +61,9 @@ def train(model, optimizer):
 # analyze them properly. Typically, training memory consists of:
 #
 # * Model parameters (size P)
-# * Activations (size A)
+# * Activations that are saved for the backward pass (size A)
 # * Gradients, which are the same size as the model parameters, so size G = P.
-# * Optimizer state, which is usually a relation to the model parameters. In
+# * Optimizer state, which is proportional to the size of the parameters. In
 #   this case, the state for Adam requires 2x the model parameters, so size O = 2P.
 # * Intermediate tensors, which are allocated throughout the compute. We will
 #   not worry about them for now as they are usually small and ephemeral.
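
For a concrete sense of these sizes, the budget above can be estimated directly from the parameter count. A minimal sketch, assuming a torchvision ``vit_l_16`` (which lines up with the ~1.2GB of parameters discussed further down), fp32 parameters, and standard Adam state (``exp_avg`` and ``exp_avg_sq``):

    import torch
    from torchvision import models

    model = models.vit_l_16()            # random init; weights are not needed here
    num_params = sum(p.numel() for p in model.parameters())
    bytes_per_param = 4                  # fp32
    P = num_params * bytes_per_param     # parameters
    G = P                                # gradients, same size as the parameters
    O = 2 * P                            # Adam state: exp_avg + exp_avg_sq
    print(f"P = G ~ {P / 1e9:.2f} GB, O ~ {O / 1e9:.2f} GB")

With roughly 300M parameters this prints about 1.2GB for P and G and about 2.4GB for O, matching the figures quoted in the tutorial.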
@@ -88,8 +89,8 @@ def train(model, optimizer):
 torch.cuda.memory._record_memory_history(enabled=None)
 
 ###############################################################################
-# Now open up the snapshot in Zach Devito's [CUDA Memory Visualizer](
-# https://zdevito.github.io/assets/viz/) by dragging and dropping the
+# Now open up the snapshot in the CUDA Memory Visualizer at
+# https://pytorch.org/memory_viz by dragging and dropping the
 # ``snapshot.pickle`` file. Does the memory timeline match your expectations?
 #
 # .. figure:: /_static/img/optim_step_in_bwd/snapshot.jpg
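
For context, the full capture pattern around ``_record_memory_history(enabled=None)`` looks roughly like this. It is a sketch using the same private ``torch.cuda.memory`` APIs the tutorial relies on; their signatures may change between releases:

    import torch

    torch.cuda.memory._record_memory_history()            # start recording allocations
    # ... run the training steps you want to profile ...
    torch.cuda.memory._dump_snapshot("snapshot.pickle")    # write the file to drag into
                                                           # https://pytorch.org/memory_viz
    torch.cuda.memory._record_memory_history(enabled=None) # stop recording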
@@ -114,7 +115,7 @@ def train(model, optimizer):
 # The peak memory usage is during the optimizer step! Note the memory then
 # consists of ~1.2GB of params, ~1.2GB of gradients, and ~2.4GB=2*1.2GB of
 # the optimizer state as expected. The last ~1.2GB comes from Adam optimizer
-# requiring memory for intermediates, totalling to ~6GB of peak memory.
+# requiring memory for intermediates, totaling to ~6GB of peak memory.
 # Technically, you can remove the need for the last 1.2GB for optimizer
 # intermediates if you set ``Adam(model.parameters(), foreach=False)`` which
 # would trade off runtime for memory. If switching off the ``foreach`` runtime
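
If that runtime-for-memory trade-off is acceptable, the single-optimizer variant would simply be constructed as in this sketch:

    # Disable the multi-tensor (foreach) implementation to avoid its large
    # intermediate allocations, at the cost of a slower optimizer step.
    optimizer = torch.optim.Adam(model.parameters(), foreach=False)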
@@ -168,7 +169,7 @@ def train(model, optimizer):
 
 # Instead of having just *one* optimizer, we will have a ``dict`` of optimizers
 # for every parameter so we could reference them in our hook.
-optimizer_dict = {p: torch.optim.Adam([p]) for p in model.parameters()}
+optimizer_dict = {p: torch.optim.Adam([p], foreach=False) for p in model.parameters()}
 
 # Define our hook, which will call the optimizer ``step()`` and ``zero_grad()``
 def optimizer_hook(parameter) -> None:
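
For completeness, a sketch of how a hook like this is typically finished and attached, following the per-parameter pattern above (``Tensor.register_post_accumulate_grad_hook`` requires PyTorch 2.1 or newer):

    def optimizer_hook(parameter) -> None:
        # Step and zero only this parameter's optimizer, right after its
        # gradient has been accumulated during the backward pass.
        optimizer_dict[parameter].step()
        optimizer_dict[parameter].zero_grad()

    # Register the hook on every parameter so its optimizer runs inside backward.
    for p in model.parameters():
        p.register_post_accumulate_grad_hook(optimizer_hook)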
@@ -200,14 +201,14 @@ def train(model):
 # ``LRScheduler``s or manipulate optimizer configuration throughout the
 # training epochs. Working out this API with those changes will be more
 # involved and will likely require moving more configuration into global
-# state but should not be impossible. That said, a next step for us is
-# to make this API easier to adopt with LRSchedulers and other features
+# state but should not be impossible. That said, a next step for PyTorch
+# is to make this API easier to adopt with LRSchedulers and other features
 # you are already used to.
 #
 # But let me get back to convincing you that this technique is worth it.
 # We will consult our friend, the memory snapshot.
 
-# del optimizer memory from before to get a clean slate for the next
+# delete optimizer memory from before to get a clean slate for the next
 # memory snapshot
 del optimizer
 
@@ -243,7 +244,7 @@ def train(model):
 # Note that there is no longer any big chunk of memory allocated for the gradients
 # compared to before, accounting for ~1.2GB of memory savings. Instead, we've freed
 # each gradient very quickly after they've been computed by moving the optimizer
-# step as far ahead as we can. Woo-hoo! By the way, the other ~1.2GB of memory savings
+# step as far ahead as we can. Woohoo! By the way, the other ~1.2GB of memory savings
 # comes from breaking apart the optimizer into per-parameter optimizers, so the
 # intermediates have proportionally shrunk. This detail is `less important` than
 # the gradient memory savings, as you can get optimizer intermediates savings
