**`docs/source/en/optimization/fp16.md`** (8 additions, 3 deletions)
@@ -84,6 +84,9 @@ Refer to the [mixed precision training](https://huggingface.co/docs/transformers
## Scaled dot product attention
> [!TIP]
> Memory-efficient attention optimizes for inference speed *and* [memory usage](./memory#memory-efficient-attention)!

[Scaled dot product attention (SDPA)](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html) implements several attention backends: [FlashAttention](https://github.com/Dao-AILab/flash-attention), [xFormers](https://github.com/facebookresearch/xformers), and a native C++ implementation. It automatically selects the most appropriate backend for your hardware.
SDPA is enabled by default if you're using PyTorch >= 2.0 and no additional changes are required to your code. You could try experimenting with other attention backends though if you'd like to choose your own. The example below uses the [torch.nn.attention.sdpa_kernel](https://pytorch.org/docs/stable/generated/torch.nn.attention.sdpa_kernel.html) context manager to enable efficient attention.
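The example itself falls outside this hunk. As a minimal sketch of what the context manager usage could look like, assuming an SDXL pipeline (the checkpoint and prompt below are illustrative, not taken from the original docs):

```py
import torch
from torch.nn.attention import SDPBackend, sdpa_kernel
from diffusers import StableDiffusionXLPipeline

pipeline = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")

# Restrict SDPA to the memory-efficient backend for everything inside this block
with sdpa_kernel(SDPBackend.EFFICIENT_ATTENTION):
    image = pipeline("an astronaut riding a horse on the moon").images[0]
```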
@@ -174,6 +176,9 @@ In general, the `sigmas` should [stay on the CPU](https://github.com/huggingface
The example below applies [dynamic int8 quantization](https://pytorch.org/tutorials/recipes/recipes/dynamic_quantization.html) to the UNet and VAE with the [torchao](../quantization/torchao) library.

> [!TIP]
> Refer to our [torchao](../quantization/torchao) docs to learn more about how to use the Diffusers torchao integration.
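The quantization example is also elided in this hunk. A rough sketch, assuming torchao's `quantize_` API with `int8_dynamic_activation_int8_weight` applied directly to the loaded modules (the SDXL checkpoint is illustrative, and the Diffusers torchao integration linked above may expose a different entry point):

```py
import torch
from diffusers import StableDiffusionXLPipeline
from torchao.quantization import quantize_, int8_dynamic_activation_int8_weight

pipeline = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.bfloat16,
).to("cuda")

# Dynamically quantize the UNet and VAE weights and activations to int8
quantize_(pipeline.unet, int8_dynamic_activation_int8_weight())
quantize_(pipeline.vae, int8_dynamic_activation_int8_weight())

image = pipeline("a cozy cabin in a snowy forest").images[0]
```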
**`docs/source/en/optimization/memory.md`** (24 additions, 2 deletions)
@@ -12,9 +12,12 @@ specific language governing permissions and limitations under the License.
# Reduce memory usage
Modern diffusion models like [Flux](../api/pipelines/flux) and [Wan](../api/pipelines/wan) have billions of parameters that take up a lot of memory on your hardware for inference. This is challenging because common GPUs often don't have sufficient memory. To overcome the memory limitations, you can use more than one GPU (if available), offload some of the pipeline components to the CPU, and more.

This guide will show you how to reduce your memory usage.

> [!TIP]
> Keep in mind these techniques may need to be adjusted depending on the model! For example, a transformer-based diffusion model may not benefit from these optimizations to the same degree as a UNet-based model.

The `device_map` parameter also works on the model-level. This is useful for loading large models, such as the Flux diffusion transformer which has 12.5B parameters. Instead of `balanced`, set it to `"auto"` to automatically distribute a model across the fastest device first before moving to slower devices. Refer to the

```py
import torch
from diffusers import AutoModel

transformer = AutoModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    device_map="auto",
    torch_dtype=torch.bfloat16
)
```

You can inspect a pipeline's device map with `hf_device_map`.
```py
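# The rest of this example falls outside the hunk; a hedged sketch, assuming
# `pipeline` was loaded with a `device_map` as shown earlier in the guide.
print(pipeline.hf_device_map)
# Illustrative output only, e.g. {"unet": 0, "text_encoder": 1, "vae": "cpu"}
```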
@@ -270,6 +287,8 @@ Set `record_stream=True` for more of a speedup at the cost of slightly increased
> [!TIP]
> When `use_stream=True` on VAEs with tiling enabled, make sure to do a dummy forward pass (possible with dummy inputs as well) before inference to avoid device mismatch errors. This may not work on all implementations, so feel free to open an issue if you encounter any problems.

The `num_blocks_per_group` parameter should be set to `1` if `use_stream` is enabled.
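A sketch of how these settings could fit together, assuming the group offloading helper (`apply_group_offloading` from `diffusers.hooks`) that this part of the guide covers; the Flux transformer checkpoint is illustrative, so check the surrounding section for the exact entry point:

```py
import torch
from diffusers import AutoModel
from diffusers.hooks import apply_group_offloading

transformer = AutoModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    torch_dtype=torch.bfloat16,
)

# Offload one block at a time and overlap transfers with compute on a CUDA stream
apply_group_offloading(
    transformer,
    onload_device=torch.device("cuda"),
    offload_device=torch.device("cpu"),
    offload_type="block_level",
    num_blocks_per_group=1,   # keep at 1 when use_stream=True
    use_stream=True,
    record_stream=True,       # extra speedup at slightly higher memory use
)
```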
> Memory-efficient attention optimizes for memory usage *and* [inference speed](./fp16#scaled-dot-product-attention)!

The Transformers attention mechanism is memory-intensive, especially for long sequences, so you can try using different and more memory-efficient attention types.
By default, if PyTorch >= 2.0 is installed, [scaled dot-product attention (SDPA)](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html) is used. You don't need to make any additional changes to your code.
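As one possible switch, here is a short sketch that swaps in xFormers memory-efficient attention (it assumes xFormers is installed, and the checkpoint is illustrative):

```py
import torch
from diffusers import StableDiffusionXLPipeline

pipeline = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")

# Replace the default attention processor with xFormers' memory-efficient implementation
pipeline.enable_xformers_memory_efficient_attention()

image = pipeline("a watercolor painting of a lighthouse at dawn").images[0]
```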