diff --git a/_posts/2024-08-20-clipping-in-opacus.md b/_posts/2024-08-20-clipping-in-opacus.md
new file mode 100644
index 000000000000..2019c68da3c8
--- /dev/null
+++ b/_posts/2024-08-20-clipping-in-opacus.md
@@ -0,0 +1,362 @@
---
layout: blog_detail
title: "Enabling Fast Gradient Clipping and Ghost Clipping in Opacus"
author: Enayat Ullah, Huanyu Zhang, Will Bullock, Ilya Mironov
---

## Introduction and Context

[Differentially Private Stochastic Gradient Descent (DP-SGD)](https://arxiv.org/abs/1607.00133) is the canonical method for training machine learning models with differential privacy. It involves the following two modifications to its non-private counterpart, Stochastic Gradient Descent:

1. **Per-sample gradient clipping**: In every iteration, clip the gradient with respect to every sample in the mini-batch, ensuring that its norm is at most a pre-specified value, the “clipping norm” C.

2. **Noise addition**: In every iteration, add Gaussian noise of pre-specified variance, determined by the clipping norm and the privacy parameters, to the average clipped gradient.

The first change, **per-sample gradient clipping**, introduces additional complexity since, in general, it requires instantiating **per-sample gradients**.

[Opacus](http://opacus.ai) is a PyTorch implementation of DP-SGD. Opacus addresses the above task by employing [hook functions](https://medium.com/pytorch/differential-privacy-series-part-2-efficient-per-sample-gradient-computation-in-opacus-5bf4031d9e22), which allow intervening on specific events, such as forward and backward passes. For more details about Opacus, we encourage readers to review the previous blog posts: [DP-SGD Algorithm Explained](https://bit.ly/dp-sgd-algorithm-explained), [Efficient Per-Sample Gradient Computation in Opacus](https://medium.com/pytorch/differential-privacy-series-part-2-efficient-per-sample-gradient-computation-in-opacus-5bf4031d9e22) and [Efficient Per-Sample Gradient Computation for More Layers in Opacus](https://pytorch.medium.com/differential-privacy-series-part-3-efficient-per-sample-gradient-computation-for-more-layers-in-39bd25df237).

While Opacus provides substantial efficiency gains compared to naive approaches, the memory cost of instantiating per-sample gradients is significant. In particular, memory usage is proportional to the batch size times the number of trainable parameters. Consequently, memory limits Opacus to small batch sizes and/or small models, significantly restricting its range of applications.

We introduce [Fast Gradient Clipping](https://arxiv.org/abs/2009.03106) and [Ghost Clipping](https://arxiv.org/abs/2110.05679) to Opacus, which enable developers and researchers to perform gradient clipping without instantiating per-sample gradients. As an example, this allows for fine-tuning 7M parameters of BERT on a single 16 GB GPU with a batch size of 1024, with memory usage comparable to that of plain PyTorch (without DP-SGD). In contrast, the previous version of Opacus supported a maximum batch size of roughly 256 for the same setting. We provide a [tutorial](https://github.com/pytorch/opacus/blob/main/tutorials/building_text_classifier.ipynb) on how to use Fast Gradient Clipping in Opacus, with the aforementioned task as an example.
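Before diving into the new techniques, the following minimal sketch shows what the two modifications above look like when per-sample gradients are computed naively, one backward pass per sample. This is our own illustrative code, not part of the Opacus API; the function name and the `max_grad_norm`/`noise_multiplier` parameters are ours.

```py
import torch

def naive_dp_sgd_step(model, loss_fn, inputs, targets, optimizer,
                      max_grad_norm=1.0, noise_multiplier=1.0):
    """Illustrative DP-SGD step: per-sample clipping + Gaussian noise addition."""
    params = [p for p in model.parameters() if p.requires_grad]
    clipped_sum = [torch.zeros_like(p) for p in params]
    batch_size = inputs.shape[0]

    for x, y in zip(inputs, targets):
        # Per-sample gradient via one backward pass per sample (memory-cheap but slow).
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        grads = torch.autograd.grad(loss, params)
        # Clip the per-sample gradient so that its overall norm is at most max_grad_norm.
        norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        coef = torch.clamp(max_grad_norm / (norm + 1e-6), max=1.0)
        for acc, g in zip(clipped_sum, grads):
            acc.add_(coef * g)

    optimizer.zero_grad()
    for p, acc in zip(params, clipped_sum):
        # Gaussian noise with std = noise_multiplier * max_grad_norm, then average.
        noise = noise_multiplier * max_grad_norm * torch.randn_like(p)
        p.grad = (acc + noise) / batch_size
    optimizer.step()
```

Opacus avoids the per-sample loop above by computing per-sample gradients in a single batched backward pass; the techniques below go one step further and avoid instantiating per-sample gradients altogether.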
## Fast Gradient Clipping and Ghost Clipping

The key idea behind these techniques is the following observation: if per-sample gradient norms are known, gradient clipping can be achieved by backpropagating on a re-weighted loss function $ \bar{L} $. This loss function is defined as $ \bar{L} = \sum_{i} R_{i} L_{i} $, where $ R_i = \min\left(\frac{C}{C_i}, 1\right) $ are the clipping coefficients computed from the per-sample gradient norms $ C_i $ and $ L_i $ are the per-sample losses.

The above idea may seem circular at first glance, as it appears to require instantiating per-sample gradients in order to calculate per-sample gradient norms. However, for certain widely used components of neural network architectures, such as fully connected/linear layers, it is possible to obtain per-sample gradient norms in a single backpropagation pass without instantiating per-sample gradients. This suggests a workflow with two backpropagation passes: the first computes the per-sample gradient norms, and the second computes the aggregated (not per-sample) clipped gradient. The second pass is simply standard batched backpropagation.

![backpropagation diagram](/assets/images/clipping-in-opacus/fg1.jpg){:style="max-width:800px; display:block; margin-left: auto; margin-right: auto; width:100%"}

![backpropagation diagram](/assets/images/clipping-in-opacus/fg2.png){:style="max-width:400px; display:block; margin-left: auto; margin-right: auto; width:100%"}

_Figure 1: Comparison between vanilla **Opacus** (top left), **Fast Gradient Clipping** (top right), and **Ghost Clipping** (bottom). Gradient instantiations that become memory bottlenecks are marked in red. Vanilla Opacus has to instantiate the **per-sample gradients** of the full model. **Fast Gradient Clipping** instantiates per-sample gradients one layer at a time to compute their norms, and releases them as soon as the backward pass moves on to the next layer. **Ghost Clipping** works directly from **per-sample activation gradients** and **per-sample activations**, and avoids gradient instantiation altogether._

[**Fast Gradient Clipping**](https://arxiv.org/abs/2009.03106)
In Fast Gradient Clipping, the per-sample gradient norm is calculated in three steps:

1. For each layer, the per-sample gradient is instantiated and its norm is calculated.
2. The per-sample gradient is then immediately discarded.
3. The (squared) per-sample gradient norms of all layers are summed up to obtain the overall (squared) per-sample gradient norm.

[**Ghost Clipping**](https://arxiv.org/abs/2110.05679)
Extending the approach of Fast Gradient Clipping, Ghost Clipping uses the [fact](https://arxiv.org/abs/1510.01799) that for **linear layers**[^1], per-sample gradient norms can be calculated from **activation gradients** and **activations** alone. In particular, let `backprops` and `activations` be the per-sample activation gradients and activations, of dimensions `batch_size ✕ output_width` and `batch_size ✕ input_width`, respectively. The per-sample gradient is the outer product of the two, which takes `O(batch_size ✕ input_width ✕ output_width)` time and space.

The [ghost clipping trick](https://arxiv.org/abs/1510.01799) instead calculates the (squared) norms of `backprops` and `activations`, sample-wise, and takes their product, which gives the (squared) norm of the gradient. This takes `O(batch_size ✕ (input_width + output_width))` time and `O(batch_size)` space. Since the **per-sample activations** and **per-sample activation gradients** are already stored, additional memory is needed only for the norms.
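To make the trick concrete, here is a minimal sketch (our own illustrative code, not the Opacus implementation) for a single bias-free linear layer: it computes per-sample gradient norms from `activations` and `backprops` alone, checks them against explicitly materialized per-sample gradients, and derives the clipping coefficients $R_i$ used to re-weight the loss.

```py
import torch

batch_size, input_width, output_width = 32, 512, 256
C = 1.0  # clipping norm

activations = torch.randn(batch_size, input_width)  # per-sample inputs to the layer
backprops = torch.randn(batch_size, output_width)   # per-sample gradients w.r.t. the layer output

# Ghost clipping: ||g_i||^2 = ||activations_i||^2 * ||backprops_i||^2,
# computed without forming the batch_size x output_width x input_width gradient tensor.
per_sample_sq_norms = activations.pow(2).sum(dim=1) * backprops.pow(2).sum(dim=1)

# Sanity check against explicit per-sample gradients (outer products).
per_sample_grads = torch.einsum("bo,bi->boi", backprops, activations)
explicit_sq_norms = per_sample_grads.pow(2).sum(dim=(1, 2))
assert torch.allclose(per_sample_sq_norms, explicit_sq_norms, rtol=1e-3)

# Clipping coefficients R_i = min(C / C_i, 1) used to re-weight the per-sample losses.
per_sample_norms = per_sample_sq_norms.sqrt()
clipping_coefs = (C / (per_sample_norms + 1e-6)).clamp(max=1.0)
```

In a multi-layer network, the squared norms computed this way are summed across layers before taking the square root, exactly as in step 3 of Fast Gradient Clipping.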
**Relationship between Fast Gradient Clipping and Ghost Clipping**

1. Fast Gradient Clipping and Ghost Clipping are complementary techniques. Fast Gradient Clipping can be applied to any type of layer, while Ghost Clipping is a strictly better technique for supported layers.
2. Our implementation automatically switches to Fast Gradient Clipping when a layer is not supported by Ghost Clipping.

### How to use Fast Gradient Clipping in Opacus

The training loop is identical to the standard PyTorch loop. As before in Opacus, we use `PrivacyEngine()`, which “sanitizes” the model and optimizer. To enable Ghost Clipping, the argument `grad_sample_mode="ghost"` is used. Additionally, `make_private()` takes the loss criterion as an extra input and sanitizes it. This allows us to hide the two backward passes, and the loss rescaling in between, inside `loss.backward()`.

```py
import torch.nn as nn
from opacus import PrivacyEngine

criterion = nn.CrossEntropyLoss() # example loss function

privacy_engine = PrivacyEngine()
model_gc, optimizer_gc, criterion_gc, train_loader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=train_loader,
    noise_multiplier=noise_multiplier,
    max_grad_norm=max_grad_norm,
    criterion=criterion,
    grad_sample_mode="ghost",
)

# The training loop below is identical to that of PyTorch

for input_data, target_data in train_loader:
    output_gc = model_gc(input_data) # Forward pass
    optimizer_gc.zero_grad()
    loss = criterion_gc(output_gc, target_data)
    loss.backward()
    optimizer_gc.step() # Add noise and update the model
```

Internally, before the first backward pass, we enable the *hooks*, which allow us to capture the layer-wise values corresponding to the forward and backward calls. These are used to compute the per-sample gradient norms. We then compute the clipping coefficients, rescale the loss function, and disable the hooks, which lets us use the standard PyTorch backward pass for the second pass.

### Memory Complexity Analysis

Consider a multi-layer neural network with the following properties:

**L**: Number of layers
**d**: Maximum layer width
**B**: Batch size
**K**: Number of non-supported/non-linear layers

The memory overhead of DP-SGD with Ghost Clipping compared to plain (PyTorch) SGD is an additive $O(BL)$, required to store the per-sample gradient norms for all layers. Further, if there is a non-supported layer (K ≥ 1), then an additional $O(Bd^2)$ memory is needed to instantiate the per-sample gradients of that layer.

### Memory Benchmarking

We provide results on the memory usage for a variety of settings.

#### Fine-Tuning BERT

We consider the problem of [privately fine-tuning](https://github.com/pytorch/opacus/blob/main/tutorials/building_text_classifier.ipynb) the last three layers of BERT for a text classification task. The base model has over 100M parameters, of which we fine-tune the last three layers, `BertEncoder`, `BertPooler`, and `Classifier`, comprising roughly 7.6M parameters. The experiments are run on a P100 GPU with 16 GB of memory.

The following table reports the maximum memory and time taken per iteration for the various methods:
| Method | B = 32 (Mem / Time) | B = 128 (Mem / Time) | B = 512 (Mem / Time) | B = 1024 (Mem / Time) | B = 2048 |
|---|---|---|---|---|---|
| PyTorch SGD | 236 MB / 0.15 s | 1.04 GB / 0.55 s | 5.27 GB / 2.1 s | 12.7 GB / 4.2 s | OOM |
| DP-SGD | 1,142 MB / 0.21 s | 4.55 GB / 0.68 s | OOM | OOM | OOM |
| FGC DP-SGD | 908 MB / 0.21 s | 3.6 GB / 0.75 s | OOM | OOM | OOM |
| GC DP-SGD | 362 MB / 0.21 s | 1.32 GB / 0.67 s | 5.27 GB / 2.5 s | 12.7 GB / 5 s | OOM |
In terms of peak memory footprint, DP-SGD > FGC DP-SGD ≫ GC DP-SGD ≈ PyTorch SGD. Further, the runtimes are similar because most of the parameters are frozen and the forward pass takes up most of the time.

#### Synthetic Setup: Memory Profiling

We consider the following setup to profile the memory used by PyTorch SGD, vanilla DP-SGD, and Ghost Clipping DP-SGD (GC DP-SGD):

* 2-layer fully connected neural network
  * Input: 5120
  * Hidden: 2560
  * Output: 1280
  * Total number of model parameters = 15.6M
  * Model size = 62.5 MB
* Batch size: various values, as shown in the table below.

The table below summarizes the maximum memory increase (in MB), broken down by stage of the training loop, for each of the methods. A minimal sketch of this setup, together with a simple way to profile per-stage memory, is shown after the table.
| Batch Size | Method | Model to GPU | Forward | First Backward | Second Backward | Optimizer Step |
|---|---|---|---|---|---|---|
| 32 | PyTorch SGD | 62.5 | 0.5 | 62.5 | N/A | 0 |
| 32 | Vanilla DP-SGD | 62.5 | 0.47 | 3,663 | N/A | 162.5 |
| 32 | GC DP-SGD | 62.5 | 0.47 | 63.13 | 50 | 125 |
| 2<sup>17</sup> | PyTorch SGD | 62.5 | 1,920 | 1,932.5 | N/A | 0 |
| 2<sup>17</sup> | Vanilla DP-SGD | OOM | | | | |
| 2<sup>17</sup> | GC DP-SGD | 62.5 | 1,920 | 2,625 | 1,932.5 | 125 |
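For reference, the sketch below (our own illustrative code, not the script used to produce the numbers above, and the exact benchmarking methodology may differ) builds a 2-layer fully connected network matching the dimensions above (the ReLU nonlinearity is our assumption) and measures the per-stage increase in peak GPU memory with `torch.cuda.max_memory_allocated`.

```py
import torch
import torch.nn as nn

def peak_mem_increase_mb(fn, device="cuda"):
    # Run fn() and report how far peak GPU memory rose above the starting allocation, in MB.
    torch.cuda.synchronize(device)
    torch.cuda.reset_peak_memory_stats(device)
    start = torch.cuda.memory_allocated(device)
    fn()
    torch.cuda.synchronize(device)
    return (torch.cuda.max_memory_allocated(device) - start) / 2**20

device = "cuda"
# 2-layer fully connected network: 5120 -> 2560 -> 1280 (~62.5 MB of parameters)
model = nn.Sequential(nn.Linear(5120, 2560), nn.ReLU(), nn.Linear(2560, 1280)).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()  # treating the 1280 outputs as class logits, for illustration

batch_size = 32
x = torch.randn(batch_size, 5120, device=device)
y = torch.randint(0, 1280, (batch_size,), device=device)

outputs = {}
fwd_mb = peak_mem_increase_mb(lambda: outputs.update(logits=model(x)))
loss = criterion(outputs["logits"], y)
bwd_mb = peak_mem_increase_mb(loss.backward)
step_mb = peak_mem_increase_mb(optimizer.step)
print(f"forward: {fwd_mb:.1f} MB, backward: {bwd_mb:.1f} MB, optimizer step: {step_mb:.1f} MB")
```

To profile the DP-SGD variants, the model, optimizer, data loader, and criterion would first be wrapped with `PrivacyEngine.make_private()` as in the earlier snippet; with Ghost Clipping, `loss.backward()` internally performs both the first and second backward passes reported in the table.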
#### Industry use case

We tested Ghost Clipping DP-SGD on an internal Meta use case, consisting of a model with roughly 100B parameters, of which 40M are trainable. Our initial results show that Ghost Clipping DP-SGD reduces memory usage by 95% compared to vanilla DP-SGD, and achieves memory usage comparable to that of PyTorch SGD.

## Conclusion

In this post, we describe implementations of Fast Gradient Clipping and Ghost Clipping in Opacus that enable memory-efficient training of machine learning models with differential privacy. Currently, the Ghost Clipping implementation only applies to linear layers, but, as outlined in [part 3 of the series](https://pytorch.medium.com/differential-privacy-series-part-3-efficient-per-sample-gradient-computation-for-more-layers-in-39bd25df237), it can be extended to “generalized” linear layers such as convolutions and multi-head attention. The current techniques require two explicit backpropagation steps, which increases runtime. We will explore developments on top of Ghost Clipping, such as the [Book-Keeping algorithm](https://arxiv.org/abs/2210.00038), to mitigate this.

To learn more about Opacus, visit [opacus.ai](https://opacus.ai/) and [github.com/pytorch/opacus](https://github.com/pytorch/opacus).

## Acknowledgements

We thank Iden Kalemaj, Darren Liu, Karthik Prasad, Hao Shi, Igor Shilov, Davide Testuggine, Eli Uriegas, Haicheng Wang, and Richard Zou for valuable feedback and suggestions.

[^1]: There are [ways](https://proceedings.neurips.cc/paper_files/paper/2023/file/a45d344b28179c8da7646bc38ff50ad8-Paper-Conference.pdf) to extend Ghost Clipping to non-linear layers.
diff --git a/assets/images/clipping-in-opacus/fg1.jpg b/assets/images/clipping-in-opacus/fg1.jpg
new file mode 100644
index 000000000000..f9045f7a6f7c
Binary files /dev/null and b/assets/images/clipping-in-opacus/fg1.jpg differ
diff --git a/assets/images/clipping-in-opacus/fg2.png b/assets/images/clipping-in-opacus/fg2.png
new file mode 100644
index 000000000000..42de0f570add
Binary files /dev/null and b/assets/images/clipping-in-opacus/fg2.png differ