
Commit 7594fe0

Commit message: feedback
1 parent b09b952 commit 7594fe0

File tree

2 files changed: 32 additions & 5 deletions


docs/source/en/optimization/fp16.md

Lines changed: 8 additions & 3 deletions
````diff
@@ -84,6 +84,9 @@ Refer to the [mixed precision training](https://huggingface.co/docs/transformers
 
 ## Scaled dot product attention
 
+> [!TIP]
+> Memory-efficient attention optimizes for inference speed *and* [memory usage](./memory#memory-efficient-attention)!
+
 [Scaled dot product attention (SDPA)](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html) implements several attention backends, [FlashAttention](https://github.com/Dao-AILab/flash-attention), [xFormers](https://github.com/facebookresearch/xformers), and a native C++ implementation. It automatically selects the most optimal backend for your hardware.
 
 SDPA is enabled by default if you're using PyTorch >= 2.0 and no additional changes are required to your code. You could try experimenting with other attention backends though if you'd like to choose your own. The example below uses the [torch.nn.attention.sdpa_kernel](https://pytorch.org/docs/stable/generated/torch.nn.attention.sdpa_kernel.html) context manager to enable efficient attention.
````
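For reference, the `torch.nn.attention.sdpa_kernel` context manager referenced in this hunk can be used with a pipeline along these lines (an illustrative sketch, not part of this commit; the checkpoint and backend choice are placeholders):

```py
import torch
from torch.nn.attention import SDPBackend, sdpa_kernel
from diffusers import StableDiffusionXLPipeline

pipeline = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")

# Restrict SDPA to the memory-efficient backend instead of letting it auto-select.
with sdpa_kernel(SDPBackend.EFFICIENT_ATTENTION):
    image = pipeline("an astronaut riding a horse on the moon").images[0]
```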
````diff
@@ -132,9 +135,8 @@ pipeline = StableDiffusionXLPipeline.from_pretrained(
 ).to("cuda")
 pipeline.unet.to(memory_format=torch.channels_last)
 pipeline.vae.to(memory_format=torch.channels_last)
-pipeline.unet = torch.compile(pipeline.unet,
-    mode="max-autotune",
-    fullgraph=True
+pipeline.unet = torch.compile(
+    pipeline.unet, mode="max-autotune", fullgraph=True
 )
 pipeline.vae.decode = torch.compile(
     pipeline.vae.decode,
````
````diff
@@ -174,6 +176,9 @@ In general, the `sigmas` should [stay on the CPU](https://github.com/huggingface
 
 The example below applies [dynamic int8 quantization](https://pytorch.org/tutorials/recipes/recipes/dynamic_quantization.html) to the UNet and VAE with the [torchao](../quantization/torchao) library.
 
+> [!TIP]
+> Refer to our [torchao](../quantization/torchao) docs to learn more about how to use the Diffusers torchao integration.
+
 Configure the compiler tags for maximum speed.
 
 ```py
````
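For context, applying torchao's dynamic int8 quantization to the UNet and VAE before compiling typically looks roughly like this (a sketch assuming the `quantize_` and `int8_dynamic_activation_int8_weight` APIs from torchao; the compiler tags the guide configures are not reproduced here):

```py
import torch
from diffusers import StableDiffusionXLPipeline
from torchao.quantization import quantize_, int8_dynamic_activation_int8_weight

pipeline = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")

# Dynamically quantize activations and weights of the UNet and VAE to int8.
quantize_(pipeline.unet, int8_dynamic_activation_int8_weight())
quantize_(pipeline.vae, int8_dynamic_activation_int8_weight())

# Compile the quantized modules for additional speed.
pipeline.unet = torch.compile(pipeline.unet, mode="max-autotune", fullgraph=True)
pipeline.vae.decode = torch.compile(pipeline.vae.decode, mode="max-autotune", fullgraph=True)
```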

docs/source/en/optimization/memory.md

Lines changed: 24 additions & 2 deletions
````diff
@@ -12,9 +12,12 @@ specific language governing permissions and limitations under the License.
 
 # Reduce memory usage
 
-Modern diffusion models like [Flux](../api/pipelines/flux) and [Wan](../api/pipelines/wan) have billions of parameters that take up a lot of memory on your hardware for inference. This is challenging because common GPUs often don't have sufficient memory.
+Modern diffusion models like [Flux](../api/pipelines/flux) and [Wan](../api/pipelines/wan) have billions of parameters that take up a lot of memory on your hardware for inference. This is challenging because common GPUs often don't have sufficient memory. To overcome the memory limitations, you can use more than one GPU (if available), offload some of the pipeline components to the CPU, and more.
 
-To overcome the memory limitations, you can use more than one GPU (if available), offload some of the pipeline components to the CPU, and more. This guide will show you how to reduce your memory usage.
+This guide will show you how to reduce your memory usage.
+
+> [!TIP]
+> Keep in mind these techniques may need to be adjusted depending on the model! For example, a transformer-based diffusion model may not benefit equally from these inference speed optimizations as a UNet-based model.
 
 ## Multiple GPUs
 
````
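As an illustration of the multi-GPU option mentioned in this hunk (not part of this commit), a pipeline-level `device_map="balanced"` splits components across the available GPUs:

```py
import torch
from diffusers import StableDiffusionXLPipeline

# "balanced" distributes the pipeline's components evenly across available GPUs.
pipeline = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    device_map="balanced",
)
```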
````diff
@@ -73,6 +76,20 @@ pipeline = StableDiffusionXLPipeline.from_pretrained(
 )
 ```
 
+The `device_map` parameter also works on the model-level. This is useful for loading large models, such as the Flux diffusion transformer which has 12.5B parameters. Instead of `balanced`, set it to `"auto"` to automatically distribute a model across the fastest device first before moving to slower devices. Refer to the
+
+```py
+import torch
+from diffusers import AutoModel
+
+transformer = AutoModel.from_pretrained(
+    "black-forest-labs/FLUX.1-dev",
+    subfolder="transformer",
+    device_map="auto",
+    torch_dtype=torch.bfloat16
+)
+```
+
 You can inspect a pipeline's device map with `hf_device_map`.
 
 ```py
````
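As a follow-up to the model-level `device_map` example added above (illustrative only, not part of this commit), the resulting placement can be inspected with `hf_device_map`, and the pre-loaded transformer can then be passed to a pipeline:

```py
import torch
from diffusers import AutoModel, FluxPipeline

transformer = AutoModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Shows which device each submodule was assigned to.
print(transformer.hf_device_map)

# Reuse the already-placed transformer when assembling the full pipeline.
pipeline = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)
```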
````diff
@@ -270,6 +287,8 @@ Set `record_stream=True` for more of a speedup at the cost of slightly increased
 > [!TIP]
 > When `use_stream=True` on VAEs with tiling enabled, make sure to do a dummy forward pass (possible with dummy inputs as well) before inference to avoid device mismatch errors. This may not work on all implementations, so feel free to open an issue if you encounter any problems.
 
+The `num_blocks_per_group` parameter should be set to `1` if `use_stream` is enabled.
+
 ```py
 pipeline.transformer.enable_group_offload(onload_device=onload_device, offload_device=offload_device, offload_type="leaf_level", use_stream=True, record_stream=True)
 ```
````
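For context, a fuller sketch of group offloading with streams might look like the following (illustrative only, not part of this commit; it assumes a CUDA device and the block-level variant, where `num_blocks_per_group` applies):

```py
import torch
from diffusers import FluxPipeline

pipeline = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
)

onload_device = torch.device("cuda")
offload_device = torch.device("cpu")

# Stream-based block-level offloading; num_blocks_per_group=1 is recommended
# when use_stream=True.
pipeline.transformer.enable_group_offload(
    onload_device=onload_device,
    offload_device=offload_device,
    offload_type="block_level",
    num_blocks_per_group=1,
    use_stream=True,
    record_stream=True,
)
```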
````diff
@@ -466,6 +485,9 @@ with torch.inference_mode():
 
 ## Memory-efficient attention
 
+> [!TIP]
+> Memory-efficient attention optimizes for memory usage *and* [inference speed](./fp16#scaled-dot-product-attention)!
+
 The Transformers attention mechanism is memory-intensive, especially for long sequences, so you can try using different and more memory-efficient attention types.
 
 By default, if PyTorch >= 2.0 is installed, [scaled dot-product attention (SDPA)](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html) is used. You don't need to make any additional changes to your code.
````
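For reference, one way to opt into a different memory-efficient attention implementation is through xFormers (a minimal sketch, not part of this commit; it assumes xformers is installed):

```py
import torch
from diffusers import StableDiffusionXLPipeline

pipeline = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")

# Swap the default SDPA attention for xFormers' memory-efficient attention.
pipeline.enable_xformers_memory_efficient_attention()

image = pipeline("an astronaut riding a horse on the moon").images[0]
```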
