docs/source/en/optimization/fp16.md
Lines changed: 0 additions & 25 deletions
@@ -217,29 +217,4 @@ An input is projected into three subspaces, represented by the projection matric
 
 ```py
 pipeline.fuse_qkv_projections()
 ```
-
-## Distilled models
-
-Another option for accelerating inference is to use a smaller distilled model if it's available. During distillation, many of the UNet's residual and attention blocks are discarded to reduce model size and improve latency. A distilled model is faster and uses less memory without compromising quality compared to a full-sized model.
-
-> [!TIP]
-> Read [Open-sourcing Knowledge Distillation Code and Weights of SD-Small and SD-Tiny](https://huggingface.co/blog/sd_distillation) to learn more about how knowledge distillation training works to produce a faster, smaller, and cheaper generative model.
-
-The example below uses a distilled Stable Diffusion XL model and VAE.
-
-```py
-import torch
-from diffusers import DiffusionPipeline, AutoencoderTiny
-
-pipeline = DiffusionPipeline.from_pretrained(
-    "segmind/SSD-1B", torch_dtype=torch.float16
-)
-pipeline.vae = AutoencoderTiny.from_pretrained(
-    "madebyollin/taesdxl", torch_dtype=torch.float16
-)
-pipeline = pipeline.to("cuda")
-
-prompt ="slice of delicious New York-style cheesecake topped with berries, mint, chocolate crumble"
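
The removed example above is cut off after the prompt line in this hunk. Purely as an illustrative sketch of how such a distilled pipeline is typically run end to end (the generation call and output filename are assumptions, not lines from the original file):

```py
import torch
from diffusers import AutoencoderTiny, DiffusionPipeline

# Same setup as the removed example: distilled SDXL model (SSD-1B) plus a tiny VAE.
pipeline = DiffusionPipeline.from_pretrained(
    "segmind/SSD-1B", torch_dtype=torch.float16
)
pipeline.vae = AutoencoderTiny.from_pretrained(
    "madebyollin/taesdxl", torch_dtype=torch.float16
)
pipeline = pipeline.to("cuda")

prompt = "slice of delicious New York-style cheesecake topped with berries, mint, chocolate crumble"

# Generation step (assumed; the captured hunk ends before this point).
image = pipeline(prompt, num_inference_steps=25).images[0]
image.save("cheesecake.png")
```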
+You can inspect a pipeline's device map with `hf_device_map`.

 The `device_map` parameter also works on the model-level. This is useful for loading large models, such as the Flux diffusion transformer which has 12.5B parameters. Instead of `balanced`, set it to `"auto"` to automatically distribute a model across the fastest device first before moving to slower devices. Refer to the [Model sharding](../training/distributed_inference#model-sharding) docs for more details.

 For more fine-grained control, pass a dictionary to enforce the maximum GPU memory to use on each device. If a device is not in `max_memory`, it is ignored and pipeline components won't be distributed to it.
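
To illustrate the `device_map` and `max_memory` options described above, here is a minimal sketch; the model ID and memory limits are placeholders rather than values taken from the docs:

```py
import torch
from diffusers import DiffusionPipeline

# Cap GPU memory per device; devices missing from this dict are ignored,
# so no pipeline components are placed on them.
max_memory = {0: "16GB", 1: "16GB"}

pipeline = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",  # placeholder model ID
    torch_dtype=torch.float16,
    device_map="balanced",
    max_memory=max_memory,
)

# Inspect how the pipeline components were distributed across devices.
print(pipeline.hf_device_map)
```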
@@ -245,7 +245,7 @@ Call [`~ModelMixin.enable_group_offload`] to enable it for standard Diffusers mo
 
 The `offload_type` parameter can be set to `block_level` or `leaf_level`.
 
-- `block_level` offloads groups of layers based on the `num_blocks_per_group` parameter. For example, if `num_blocks_per_group=2` on a model with 40 layers, 2 layers are onloaded and offloaded at a time (2o total onloads/offloads). This drastically reduces memory requirements.
+- `block_level` offloads groups of layers based on the `num_blocks_per_group` parameter. For example, if `num_blocks_per_group=2` on a model with 40 layers, 2 layers are onloaded and offloaded at a time (20 total onloads/offloads). This drastically reduces memory requirements.
 - `leaf_level` offloads individual layers at the lowest level and is equivalent to [CPU offloading](#cpu-offloading). But it can be made faster if you use streams without giving up inference speed.
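
A minimal sketch of `block_level` group offloading on a standalone Diffusers model follows. The `onload_device` and `offload_device` parameter names are assumed (they don't appear in the captured hunk), and the model ID is only an example:

```py
import torch
from diffusers import UNet2DConditionModel

# Example model; per the docs, standard Diffusers models expose enable_group_offload.
unet = UNet2DConditionModel.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", subfolder="unet", torch_dtype=torch.float16
)

unet.enable_group_offload(
    onload_device=torch.device("cuda"),   # device the active group is moved to for compute
    offload_device=torch.device("cpu"),   # device inactive groups rest on
    offload_type="block_level",
    num_blocks_per_group=2,               # onload/offload 2 layers at a time
)
```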
@@ -287,7 +287,7 @@ Set `record_stream=True` for more of a speedup at the cost of slightly increased
 > [!TIP]
 > When `use_stream=True` on VAEs with tiling enabled, make sure to do a dummy forward pass (possible with dummy inputs as well) before inference to avoid device mismatch errors. This may not work on all implementations, so feel free to open an issue if you encounter any problems.
 
-The `num_blocks_per_group` parameter should be set to `1` if `use_stream` is enabled.
+If you're using `block_level` group offloading with `use_stream` enabled, the `num_blocks_per_group` parameter should be set to `1`, otherwise a warning will be raised.
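
And a sketch of the configuration the changed line above describes, with `use_stream=True` and `num_blocks_per_group=1` (same assumed parameter names and example model as before):

```py
import torch
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", subfolder="unet", torch_dtype=torch.float16
)

unet.enable_group_offload(
    onload_device=torch.device("cuda"),
    offload_device=torch.device("cpu"),
    offload_type="block_level",
    num_blocks_per_group=1,   # set to 1 when use_stream=True, otherwise a warning is raised
    use_stream=True,          # prefetch the next group on a separate CUDA stream
    record_stream=True,       # optional: more speedup at the cost of slightly more memory
)
```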