diff --git a/_static/img/optim_step_in_bwd/snapshot.jpg b/_static/img/optim_step_in_bwd/snapshot.jpg
new file mode 100644
index 00000000000..50be55e7b9a
Binary files /dev/null and b/_static/img/optim_step_in_bwd/snapshot.jpg differ
diff --git a/_static/img/optim_step_in_bwd/snapshot_opt_in_bwd.jpg b/_static/img/optim_step_in_bwd/snapshot_opt_in_bwd.jpg
new file mode 100644
index 00000000000..65d53d21c38
Binary files /dev/null and b/_static/img/optim_step_in_bwd/snapshot_opt_in_bwd.jpg differ
diff --git a/en-wordlist.txt b/en-wordlist.txt
index a4a9a3e4de7..96cd8ccc3f8 100644
--- a/en-wordlist.txt
+++ b/en-wordlist.txt
@@ -129,7 +129,8 @@ LeNet
 LeakyReLU
 LeakyReLUs
 Lipschitz
-logits
+LoRA
+LRSchedulers
 Lua
 Luong
 MLP
@@ -206,6 +207,7 @@ Unescape
 VGG
 VQA
 VS Code
+Woohoo
 Wikitext
 Xeon
 Xcode
@@ -329,6 +331,7 @@ labelled
 learnable
 learnings
 loadFilename
+logits
 manualSeed
 matmul
 matplotlib
diff --git a/index.rst b/index.rst
index 578f5ac20aa..d86079bc2d8 100644
--- a/index.rst
+++ b/index.rst
@@ -511,7 +511,7 @@ What's new in PyTorch tutorials?
 
 .. customcarditem::
    :header: Parametrizations Tutorial
-   :card_description: Learn how to use torch.nn.utils.parametrize to put constriants on your parameters (e.g. make them orthogonal, symmetric positive definite, low-rank...)
+   :card_description: Learn how to use torch.nn.utils.parametrize to put constraints on your parameters (e.g. make them orthogonal, symmetric positive definite, low-rank...)
    :image: _static/img/thumbnails/cropped/parametrizations.png
    :link: intermediate/parametrizations.html
    :tags: Model-Optimization,Best-Practice
@@ -523,6 +523,13 @@ What's new in PyTorch tutorials?
    :link: intermediate/pruning_tutorial.html
    :tags: Model-Optimization,Best-Practice
 
+.. customcarditem::
+   :header: How to save memory by fusing the optimizer step into the backward pass
+   :card_description: Learn a memory-saving technique that fuses the optimizer step into the backward pass, using memory snapshots to verify the savings.
+   :image: _static/img/thumbnails/cropped/pytorch-logo.png
+   :link: intermediate/optimizer_step_in_backward_tutorial.html
+   :tags: Model-Optimization,Best-Practice,CUDA,Frontend-APIs
+
 .. customcarditem::
    :header: (beta) Dynamic Quantization on an LSTM Word Language Model
    :card_description: Apply dynamic quantization, the easiest form of quantization, to a LSTM-based next word prediction model.
diff --git a/intermediate_source/optimizer_step_in_backward_tutorial.py b/intermediate_source/optimizer_step_in_backward_tutorial.py
new file mode 100644
index 00000000000..fd5fcb74fc2
--- /dev/null
+++ b/intermediate_source/optimizer_step_in_backward_tutorial.py
@@ -0,0 +1,268 @@
+"""
+
+How to save memory by fusing the optimizer step into the backward pass
+======================================================================
+
+Hello there! This tutorial aims to showcase one way of reducing the
+memory footprint of a training loop by reducing the memory taken by
+the *gradients*. Say you have a model and you're interested in ways to
+optimize memory to avoid ``Out of Memory`` (OOM) errors or simply to
+squeeze more out of your GPU. Well, you *might* be in luck (if gradients
+take up a portion of your memory and you do not need to do gradient
+accumulation). We will explore the following:
+
+1. What takes up memory during your training or finetuning loop,
+2. How to capture and visualize memory snapshots to determine the bottleneck,
+3. The new ``Tensor.register_post_accumulate_grad_hook(hook)`` API, and finally,
+4. How everything fits together in 10 lines to achieve memory savings.
+
+To run this tutorial, you will need:
+
+* PyTorch 2.1.0 or newer with ``torchvision``
+* 1 CUDA GPU if you'd like to run the memory visualizations locally.
+  Otherwise, the technique provides similar benefits on any device.
+
+Let us start by importing the required modules and models. We will use a
+vision transformer model from torchvision, but feel free to substitute
+with your own model. We will also use ``torch.optim.Adam`` as our optimizer,
+but, again, feel free to substitute with your own optimizer.
+
+"""
+
+import torch
+from torchvision import models
+from pickle import dump
+
+model = models.vit_l_16(weights='DEFAULT').cuda()
+optimizer = torch.optim.Adam(model.parameters())
+
+###############################################################################
+# Now let's define our typical training loop. You should use real images when
+# training, but for the purposes of this tutorial, we are passing in fake
+# inputs and not worrying about loading any actual data.
+
+IMAGE_SIZE = 224
+
+def train(model, optimizer):
+    # create our fake image input: tensor shape is batch_size, channels, height, width
+    fake_image = torch.rand(1, 3, IMAGE_SIZE, IMAGE_SIZE).cuda()
+
+    # call our forward and backward
+    loss = model.forward(fake_image)
+    loss.sum().backward()
+
+    # optimizer update
+    optimizer.step()
+    optimizer.zero_grad()
+
+###############################################################################
+# Memory usage during training
+# """"""""""""""""""""""""""""
+# We are about to look at some memory snapshots, so we should be prepared to
+# analyze them properly. Typically, training memory consists of:
+#
+# * Model parameters (size P)
+# * Activations that are saved for the backward pass (size A)
+# * Gradients, which are the same size as the model parameters, so size G = P.
+# * Optimizer state, which is proportional to the size of the parameters. In
+#   this case, the state for Adam requires 2x the model parameters, so size O = 2P.
+# * Intermediate tensors, which are allocated throughout the compute. We will
+#   not worry about them for now as they are usually small and ephemeral.
+#
+# Capturing and visualizing memory snapshots
+# """"""""""""""""""""""""""""""""""""""""""
+# Let's capture a memory snapshot! As your code runs, consider what you may expect
+# the CUDA memory timeline to look like.
+
+# tell CUDA to start recording memory allocations
+torch.cuda.memory._record_memory_history(enabled='all')
+
+# train 3 steps
+for _ in range(3):
+    train(model, optimizer)
+
+# save a snapshot of the memory allocations
+s = torch.cuda.memory._snapshot()
+with open("snapshot.pickle", "wb") as f:
+    dump(s, f)
+
+# tell CUDA to stop recording memory allocations now
+torch.cuda.memory._record_memory_history(enabled=None)
+
+###############################################################################
+# Now open up the snapshot in the CUDA Memory Visualizer at
+# https://pytorch.org/memory_viz by dragging and dropping the
+# ``snapshot.pickle`` file. Does the memory timeline match your expectations?
+#
+# .. figure:: /_static/img/optim_step_in_bwd/snapshot.jpg
+#    :alt: snapshot.jpg loaded into CUDA Memory Visualizer
+#
+# The model parameters have already been loaded in memory before the training
+# step, so we see a chunk of memory devoted to the weights right off the bat.
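+#
+# To sanity-check how big that chunk should be (and how big the gradient and
+# optimizer state chunks will grow to be), we can estimate the sizes directly
+# from the parameter count. This is a rough back-of-the-envelope calculation
+# that assumes the default fp32 parameters and the standard Adam state of two
+# buffers per parameter; exact figures will differ slightly from the snapshot:

+param_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
+print(f"parameters (P):      ~{param_bytes / 1e9:.1f} GB")  # ~1.2GB for vit_l_16
+print(f"gradients (G = P):   ~{param_bytes / 1e9:.1f} GB")
+print(f"Adam state (O = 2P): ~{2 * param_bytes / 1e9:.1f} GB once initialized")
+
+###############################################################################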
+# As we start our forward pass, memory is allocated gradually for the activations,
+# or the tensors we are saving to be able to compute gradients in the backward pass.
+# Once we start the backward pass, the activations are gradually freed while memory
+# of the gradients starts building up.
+#
+# Lastly, as the optimizer kicks in, its state will be lazily initialized, so we
+# should see the optimizer state memory gradually increase during the optimizer
+# step of the first training loop only. In future loops, the optimizer memory
+# will remain and be updated in-place. The memory for the gradients is then
+# freed at the end of every training loop when ``zero_grad`` is called.
+#
+# Where is the memory bottleneck in this training loop? Or, in other words,
+# where is the peak memory?
+#
+# The peak memory usage is during the optimizer step! Note the memory then
+# consists of ~1.2GB of parameters, ~1.2GB of gradients, and ~2.4GB=2*1.2GB of
+# the optimizer state as expected. The last ~1.2GB comes from the Adam optimizer
+# requiring memory for intermediates, totaling ~6GB of peak memory.
+# Technically, you can remove the need for the last 1.2GB of optimizer
+# intermediates if you set ``Adam(model.parameters(), foreach=False)``, which
+# would trade off runtime for memory. If switching off the ``foreach`` runtime
+# optimization yields sufficient memory savings for you, great, but please
+# read on if you're curious how this tutorial can help you do better!
+# With the technique we will soon introduce, we will reduce peak memory by
+# removing the need for the ~1.2GB of **gradients memory** as well as **optimizer
+# intermediates memory**. Now, what would you expect the new peak memory to be?
+# The answer will be revealed in the *next* snapshot.
+#
+# DISCLAIMER: This technique is **not** for all
+# """""""""""""""""""""""""""""""""""""""""""""
+# Before we get too excited, we have to consider whether this technique is
+# applicable for *your* use case. This is NOT a silver bullet! The technique of
+# fusing the optimizer step into the backward only targets reducing *gradient*
+# memory (and, as a side effect, also optimizer intermediates memory). Thus, the
+# larger the share of memory taken up by the gradients, the greater the potential
+# memory reduction. In our example above, the gradients eat up 20% of the memory
+# pie, which is quite sizable!
+#
+# This may not be the case for you: if, for example, your weights are already tiny
+# (say, due to applying LoRA), then the gradients do not take up much space in your
+# training loop and the wins are far less exciting. In that case, you should
+# first try other techniques like activation checkpointing, distributed
+# training, quantization, or reducing the batch size. Then, when the gradients
+# are part of the bottleneck again, come back to this tutorial!
+#
+# Still here? Cool, let's introduce our new ``register_post_accumulate_grad_hook(hook)``
+# API on Tensor.
+#
+# ``Tensor.register_post_accumulate_grad_hook(hook)`` API and our technique
+# """""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
+# Our technique relies on not having to save the gradients during ``backward()``.
+# Instead, once a gradient has been accumulated, we will immediately apply the
+# optimizer to the corresponding parameter and drop that gradient entirely! This
+# removes the need for holding onto a big buffer of gradients until the optimizer
+# step.
+#
+# So how can we unlock the behavior of applying the optimizer more eagerly?
+# In our 2.1 release, we've added a new API
+# :func:`torch.Tensor.register_post_accumulate_grad_hook` that allows us to add
+# a hook onto a Tensor once its ``.grad`` field has been accumulated. We will
+# encapsulate the optimizer step into this hook. How?
+#
+# How everything fits together in 10 lines
+# """"""""""""""""""""""""""""""""""""""""
+# Remember our model and optimizer setup from the beginning? I'll leave them
+# commented out below so we don't spend resources rerunning the code.
+#
+# .. code-block:: python
+#
+#    model = models.vit_l_16(weights='DEFAULT').cuda()
+#    optimizer = torch.optim.Adam(model.parameters())

+# Instead of having just *one* optimizer, we will have a ``dict`` of optimizers
+# for every parameter so we can reference them in our hook.
+optimizer_dict = {p: torch.optim.Adam([p], foreach=False) for p in model.parameters()}
+
+# Define our hook, which will call the optimizer ``step()`` and ``zero_grad()``
+def optimizer_hook(parameter) -> None:
+    optimizer_dict[parameter].step()
+    optimizer_dict[parameter].zero_grad()
+
+# Register the hook onto every parameter
+for p in model.parameters():
+    p.register_post_accumulate_grad_hook(optimizer_hook)
+
+# Now remember our previous ``train()`` function? Since the optimizer has been
+# fused into the backward, we can remove the optimizer step and zero_grad calls.
+def train(model):
+    # create our fake image input: tensor shape is batch_size, channels, height, width
+    fake_image = torch.rand(1, 3, IMAGE_SIZE, IMAGE_SIZE).cuda()
+
+    # call our forward and backward
+    loss = model.forward(fake_image)
+    loss.sum().backward()
+
+    # optimizer update --> no longer needed!
+    # optimizer.step()
+    # optimizer.zero_grad()
+
+########################################################################
+# That took about 10 lines of changes in our sample model, which is neat.
+# However, for real models, it could be a fairly intrusive change to switch
+# out the optimizer for an optimizer dictionary, especially for those who use
+# an ``LRScheduler`` or manipulate optimizer configuration throughout the
+# training epochs. Working out this API with those changes will be more
+# involved and will likely require moving more configuration into global
+# state, but it should not be impossible. That said, a next step for PyTorch
+# is to make this API easier to adopt with LRSchedulers and other features
+# you are already used to. In the meantime, a sketch of one way you could
+# wire up an LR scheduler yourself appears at the end of this tutorial.
+#
+# But let me get back to convincing you that this technique is worth it.
+# We will consult our friend, the memory snapshot.
+
+# delete optimizer memory from before to get a clean slate for the next
+# memory snapshot
+del optimizer
+
+# tell CUDA to start recording memory allocations
+torch.cuda.memory._record_memory_history(enabled='all')
+
+# train 3 steps. Note that we no longer pass the optimizer into train()
+for _ in range(3):
+    train(model)
+
+# save a snapshot of the memory allocations
+s = torch.cuda.memory._snapshot()
+with open("snapshot-opt-in-bwd.pickle", "wb") as f:
+    dump(s, f)
+
+# tell CUDA to stop recording memory allocations now
+torch.cuda.memory._record_memory_history(enabled=None)
+
+###############################################################################
+# Yes, take some time to drag your snapshot into the CUDA Memory Visualizer.
+#
+# .. figure:: /_static/img/optim_step_in_bwd/snapshot_opt_in_bwd.jpg
+#    :alt: snapshot_opt_in_bwd.jpg loaded into CUDA Memory Visualizer
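+#
+# As an extra sanity check, we can also ask the CUDA caching allocator for the
+# peak directly instead of only reading it off the timeline. The quick check
+# below runs one more fused-optimizer step; the exact figure will vary with
+# your hardware and PyTorch version, but it should land near the peak you can
+# read off the snapshot (~4GB here):

+torch.cuda.reset_peak_memory_stats()  # reset the peak tracker
+train(model)                          # one more step with the fused optimizer
+print(f"peak memory with the fused optimizer: ~{torch.cuda.max_memory_allocated() / 1e9:.1f} GB")
+
+###############################################################################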
+# Several major observations:
+#
+# 1. There is no more optimizer step! Right...we fused that into the backward.
+# 2. Relatedly, the backward drags longer and there are more random allocations
+#    for intermediates. This is expected, as the optimizer step requires
+#    intermediates.
+# 3. Most importantly! The peak memory is lower! It is now ~4GB (which I
+#    hope maps closely to your earlier expectation).
+#
+# Note that there is no longer any big chunk of memory allocated for the gradients
+# compared to before, accounting for ~1.2GB of memory savings. Instead, we free
+# each gradient very quickly after it has been computed by moving the optimizer
+# step as far ahead as we can. Woohoo! By the way, the other ~1.2GB of memory
+# savings comes from breaking apart the optimizer into per-parameter optimizers,
+# so the intermediates have proportionally shrunk. This detail is *less important*
+# than the gradient memory savings, as you can get the optimizer intermediates
+# savings just by setting ``foreach=False``, without this technique.
+#
+# You may be correctly wondering: if we saved 2.4GB of memory, why is the peak
+# memory NOT 6GB - 2.4GB = 3.6GB? Well, the peak has moved! The peak is now near
+# the start of the backward step, when we still have activations in memory,
+# whereas before, the peak was during the optimizer step, when the activations
+# had already been freed. The ~0.4GB difference between ~4.0GB and ~3.6GB is thus
+# due to the activations memory. One can then imagine that this technique can be
+# coupled with activation checkpointing for more memory wins.
+#
+# Conclusion
+# """"""""""
+# In this tutorial, we learned about the memory-saving technique of
+# fusing the optimizer into the backward step through the new
+# ``Tensor.register_post_accumulate_grad_hook()`` API and *when* to apply this
+# technique (when gradient memory is significant). Along the way, we also learned
+# about memory snapshots, which are generally useful in memory optimization.
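+
+###############################################################################
+# Appendix: a sketch of pairing this technique with an LR scheduler
+# """""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
+# As promised above, here is one possible way to keep learning rate scheduling
+# working with the per-parameter optimizers. This is only a sketch, not an
+# official API: the ``StepLR`` choice, its hyperparameters, and the
+# ``end_of_epoch`` helper below are illustrative assumptions. The idea is to
+# mirror ``optimizer_dict`` with a ``dict`` of schedulers and step them once per
+# epoch, outside of the backward hook; you would call ``end_of_epoch()`` at the
+# end of each epoch of your own training loop.

+# one scheduler per per-parameter optimizer
+scheduler_dict = {
+    p: torch.optim.lr_scheduler.StepLR(optimizer_dict[p], step_size=30, gamma=0.1)
+    for p in model.parameters()
+}
+
+def end_of_epoch() -> None:
+    # LR schedulers are typically stepped once per epoch, not once per backward,
+    # so we step them here rather than inside ``optimizer_hook``.
+    for scheduler in scheduler_dict.values():
+        scheduler.step()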