
Commit 7594fe0

Commit message: feedback
1 parent b09b952 commit 7594fe0

File tree

2 files changed: 32 additions & 5 deletions


docs/source/en/optimization/fp16.md

Lines changed: 8 additions & 3 deletions
````diff
@@ -84,6 +84,9 @@ Refer to the [mixed precision training](https://huggingface.co/docs/transformers
 
 ## Scaled dot product attention
 
+> [!TIP]
+> Memory-efficient attention optimizes for inference speed *and* [memory usage](./memory#memory-efficient-attention)!
+
 [Scaled dot product attention (SDPA)](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html) implements several attention backends, [FlashAttention](https://github.com/Dao-AILab/flash-attention), [xFormers](https://github.com/facebookresearch/xformers), and a native C++ implementation. It automatically selects the most optimal backend for your hardware.
 
 SDPA is enabled by default if you're using PyTorch >= 2.0 and no additional changes are required to your code. You could try experimenting with other attention backends though if you'd like to choose your own. The example below uses the [torch.nn.attention.sdpa_kernel](https://pytorch.org/docs/stable/generated/torch.nn.attention.sdpa_kernel.html) context manager to enable efficient attention.
````
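For reference, the `torch.nn.attention.sdpa_kernel` context manager referenced in this hunk can be used with a pipeline along these lines (an illustrative sketch, not part of this commit; the checkpoint and backend choice are placeholders):

```py
import torch
from torch.nn.attention import SDPBackend, sdpa_kernel
from diffusers import StableDiffusionXLPipeline

pipeline = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")

# Restrict SDPA to the memory-efficient backend instead of letting it auto-select.
with sdpa_kernel(SDPBackend.EFFICIENT_ATTENTION):
    image = pipeline("an astronaut riding a horse on the moon").images[0]
```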
````diff
@@ -132,9 +135,8 @@ pipeline = StableDiffusionXLPipeline.from_pretrained(
 ).to("cuda")
 pipeline.unet.to(memory_format=torch.channels_last)
 pipeline.vae.to(memory_format=torch.channels_last)
-pipeline.unet = torch.compile(pipeline.unet,
-    mode="max-autotune",
-    fullgraph=True
+pipeline.unet = torch.compile(
+    pipeline.unet, mode="max-autotune", fullgraph=True
 )
 pipeline.vae.decode = torch.compile(
     pipeline.vae.decode,
````
````diff
@@ -174,6 +176,9 @@ In general, the `sigmas` should [stay on the CPU](https://github.com/huggingface
 
 The example below applies [dynamic int8 quantization](https://pytorch.org/tutorials/recipes/recipes/dynamic_quantization.html) to the UNet and VAE with the [torchao](../quantization/torchao) library.
 
+> [!TIP]
+> Refer to our [torchao](../quantization/torchao) docs to learn more about how to use the Diffusers torchao integration.
+
 Configure the compiler tags for maximum speed.
 
 ```py
````
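For context, applying torchao's dynamic int8 quantization to the UNet and VAE before compiling typically looks roughly like this (a sketch assuming the `quantize_` and `int8_dynamic_activation_int8_weight` APIs from torchao; the compiler tags the guide configures are not reproduced here):

```py
import torch
from diffusers import StableDiffusionXLPipeline
from torchao.quantization import quantize_, int8_dynamic_activation_int8_weight

pipeline = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")

# Dynamically quantize activations and weights of the UNet and VAE to int8.
quantize_(pipeline.unet, int8_dynamic_activation_int8_weight())
quantize_(pipeline.vae, int8_dynamic_activation_int8_weight())

# Compile the quantized modules for additional speed.
pipeline.unet = torch.compile(pipeline.unet, mode="max-autotune", fullgraph=True)
pipeline.vae.decode = torch.compile(pipeline.vae.decode, mode="max-autotune", fullgraph=True)
```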

docs/source/en/optimization/memory.md

Lines changed: 24 additions & 2 deletions
````diff
@@ -12,9 +12,12 @@ specific language governing permissions and limitations under the License.
 
 # Reduce memory usage
 
-Modern diffusion models like [Flux](../api/pipelines/flux) and [Wan](../api/pipelines/wan) have billions of parameters that take up a lot of memory on your hardware for inference. This is challenging because common GPUs often don't have sufficient memory.
+Modern diffusion models like [Flux](../api/pipelines/flux) and [Wan](../api/pipelines/wan) have billions of parameters that take up a lot of memory on your hardware for inference. This is challenging because common GPUs often don't have sufficient memory. To overcome the memory limitations, you can use more than one GPU (if available), offload some of the pipeline components to the CPU, and more.
 
-To overcome the memory limitations, you can use more than one GPU (if available), offload some of the pipeline components to the CPU, and more. This guide will show you how to reduce your memory usage.
+This guide will show you how to reduce your memory usage.
+
+> [!TIP]
+> Keep in mind these techniques may need to be adjusted depending on the model! For example, a transformer-based diffusion model may not benefit equally from these inference speed optimizations as a UNet-based model.
 
 ## Multiple GPUs
 
````
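As an illustration of the multi-GPU option mentioned in this hunk (not part of this commit), a pipeline-level `device_map="balanced"` splits components across the available GPUs:

```py
import torch
from diffusers import StableDiffusionXLPipeline

# "balanced" distributes the pipeline's components evenly across available GPUs.
pipeline = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    device_map="balanced",
)
```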
````diff
@@ -73,6 +76,20 @@ pipeline = StableDiffusionXLPipeline.from_pretrained(
 )
 ```
 
+The `device_map` parameter also works on the model-level. This is useful for loading large models, such as the Flux diffusion transformer which has 12.5B parameters. Instead of `balanced`, set it to `"auto"` to automatically distribute a model across the fastest device first before moving to slower devices. Refer to the
+
+```py
+import torch
+from diffusers import AutoModel
+
+transformer = AutoModel.from_pretrained(
+    "black-forest-labs/FLUX.1-dev",
+    subfolder="transformer",
+    device_map="auto",
+    torch_dtype=torch.bfloat16
+)
+```
+
 You can inspect a pipeline's device map with `hf_device_map`.
 
 ```py
````
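As a follow-up to the model-level `device_map` example added above (illustrative only, not part of this commit), the resulting placement can be inspected with `hf_device_map`, and the pre-loaded transformer can then be passed to a pipeline:

```py
import torch
from diffusers import AutoModel, FluxPipeline

transformer = AutoModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Shows which device each submodule was assigned to.
print(transformer.hf_device_map)

# Reuse the already-placed transformer when assembling the full pipeline.
pipeline = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)
```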
````diff
@@ -270,6 +287,8 @@ Set `record_stream=True` for more of a speedup at the cost of slightly increased
 > [!TIP]
 > When `use_stream=True` on VAEs with tiling enabled, make sure to do a dummy forward pass (possible with dummy inputs as well) before inference to avoid device mismatch errors. This may not work on all implementations, so feel free to open an issue if you encounter any problems.
 
+The `num_blocks_per_group` parameter should be set to `1` if `use_stream` is enabled.
+
 ```py
 pipeline.transformer.enable_group_offload(onload_device=onload_device, offload_device=offload_device, offload_type="leaf_level", use_stream=True, record_stream=True)
 ```
````
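For context, a fuller sketch of group offloading with streams might look like the following (illustrative only, not part of this commit; it assumes a CUDA device and the block-level variant, where `num_blocks_per_group` applies):

```py
import torch
from diffusers import FluxPipeline

pipeline = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
)

onload_device = torch.device("cuda")
offload_device = torch.device("cpu")

# Stream-based block-level offloading; num_blocks_per_group=1 is recommended
# when use_stream=True.
pipeline.transformer.enable_group_offload(
    onload_device=onload_device,
    offload_device=offload_device,
    offload_type="block_level",
    num_blocks_per_group=1,
    use_stream=True,
    record_stream=True,
)
```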
````diff
@@ -466,6 +485,9 @@ with torch.inference_mode():
 
 ## Memory-efficient attention
 
+> [!TIP]
+> Memory-efficient attention optimizes for memory usage *and* [inference speed](./fp16#scaled-dot-product-attention)!
+
 The Transformers attention mechanism is memory-intensive, especially for long sequences, so you can try using different and more memory-efficient attention types.
 
 By default, if PyTorch >= 2.0 is installed, [scaled dot-product attention (SDPA)](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html) is used. You don't need to make any additional changes to your code.
````
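For reference, one way to opt into a different memory-efficient attention implementation is through xFormers (a minimal sketch, not part of this commit; it assumes xformers is installed):

```py
import torch
from diffusers import StableDiffusionXLPipeline

pipeline = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")

# Swap the default SDPA attention for xFormers' memory-efficient attention.
pipeline.enable_xformers_memory_efficient_attention()

image = pipeline("an astronaut riding a horse on the moon").images[0]
```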
