
Commit 118b2c3

Commit message: feedback
1 parent f8f45ba commit 118b2c3


2 files changed: 10 additions & 35 deletions


docs/source/en/optimization/fp16.md

Lines changed: 0 additions & 25 deletions
````diff
@@ -217,29 +217,4 @@ An input is projected into three subspaces, represented by the projection matric
 
 ```py
 pipeline.fuse_qkv_projections()
-```
-
-## Distilled models
-
-Another option for accelerating inference is to use a smaller distilled model if it's available. During distillation, many of the UNet's residual and attention blocks are discarded to reduce model size and improve latency. A distilled model is faster and uses less memory without compromising quality compared to a full-sized model.
-
-> [!TIP]
-> Read [Open-sourcing Knowledge Distillation Code and Weights of SD-Small and SD-Tiny](https://huggingface.co/blog/sd_distillation) to learn more about how knowledge distillation training works to produce a faster, smaller, and cheaper generative model.
-
-The example below uses a distilled Stable Diffusion XL model and VAE.
-
-```py
-import torch
-from diffusers import DiffusionPipeline, AutoencoderTiny
-
-pipeline = DiffusionPipeline.from_pretrained(
-    "segmind/SSD-1B", torch_dtype=torch.float16
-)
-pipeline.vae = AutoencoderTiny.from_pretrained(
-    "madebyollin/taesdxl", torch_dtype=torch.float16
-)
-pipeline = pipeline.to("cuda")
-
-prompt = "slice of delicious New York-style cheesecake topped with berries, mint, chocolate crumble"
-pipeline(prompt, num_inference_steps=50).images[0]
 ```
````
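For context, what survives in fp16.md after this deletion is the fused QKV projection snippet. A minimal sketch of how that call slots into a full pipeline, assuming an SDXL checkpoint (the repo id, prompt, and step count below are illustrative and not part of this diff):

```py
import torch
from diffusers import DiffusionPipeline

# Assumed checkpoint; any pipeline that exposes fuse_qkv_projections() works the same way.
pipeline = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# Fuse the Q, K, and V projection matrices into a single larger projection so the
# attention layers run one matmul instead of three.
pipeline.fuse_qkv_projections()

image = pipeline("slice of delicious New York-style cheesecake", num_inference_steps=30).images[0]
```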

docs/source/en/optimization/memory.md

Lines changed: 10 additions & 10 deletions
````diff
@@ -76,7 +76,14 @@ pipeline = StableDiffusionXLPipeline.from_pretrained(
 )
 ```
 
-The `device_map` parameter also works on the model-level. This is useful for loading large models, such as the Flux diffusion transformer which has 12.5B parameters. Instead of `balanced`, set it to `"auto"` to automatically distribute a model across the fastest device first before moving to slower devices. Refer to the
+You can inspect a pipeline's device map with `hf_device_map`.
+
+```py
+print(pipeline.hf_device_map)
+{'unet': 1, 'vae': 1, 'safety_checker': 0, 'text_encoder': 0}
+```
+
+The `device_map` parameter also works on the model-level. This is useful for loading large models, such as the Flux diffusion transformer which has 12.5B parameters. Instead of `balanced`, set it to `"auto"` to automatically distribute a model across the fastest device first before moving to slower devices. Refer to the [Model sharding](../training/distributed_inference#model-sharding) docs for more details.
 
 ```py
 import torch
@@ -90,13 +97,6 @@ transformer = AutoModel.from_pretrained(
 )
 ```
 
-You can inspect a pipeline's device map with `hf_device_map`.
-
-```py
-print(pipeline.hf_device_map)
-{'unet': 1, 'vae': 1, 'safety_checker': 0, 'text_encoder': 0}
-```
-
 For more fine-grained control, pass a dictionary to enforce the maximum GPU memory to use on each device. If a device is not in `max_memory`, it is ignored and pipeline components won't be distributed to it.
 
 ```py
````
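The hunks above cut off inside the docs' model-level example. As a hedged sketch of that pattern, assuming the Flux repo id, subfolder, and dtype (only `AutoModel`, `device_map="auto"`, and `hf_device_map` come from the docs):

```py
import torch
from diffusers import AutoModel

# device_map="auto" fills the fastest device first, then spills the remaining
# weights onto slower devices (see the Model sharding docs linked above).
transformer = AutoModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",  # assumed repo id for the 12.5B-parameter Flux transformer
    subfolder="transformer",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# As with pipelines, the resulting placement should be inspectable via hf_device_map.
print(transformer.hf_device_map)
```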
````diff
@@ -245,7 +245,7 @@ Call [`~ModelMixin.enable_group_offload`] to enable it for standard Diffusers mo
 
 The `offload_type` parameter can be set to `block_level` or `leaf_level`.
 
-- `block_level` offloads groups of layers based on the `num_blocks_per_group` parameter. For example, if `num_blocks_per_group=2` on a model with 40 layers, 2 layers are onloaded and offloaded at a time (2o total onloads/offloads). This drastically reduces memory requirements.
+- `block_level` offloads groups of layers based on the `num_blocks_per_group` parameter. For example, if `num_blocks_per_group=2` on a model with 40 layers, 2 layers are onloaded and offloaded at a time (20 total onloads/offloads). This drastically reduces memory requirements.
 - `leaf_level` offloads individual layers at the lowest level and is equivalent to [CPU offloading](#cpu-offloading). But it can be made faster if you use streams without giving up inference speed.
 
 ```py
````
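To make the `block_level` bullet concrete, a minimal sketch assuming a Diffusers model named `transformer` is already loaded and CUDA is the onload device; only the `enable_group_offload` parameters mirror the docs:

```py
import torch

onload_device = torch.device("cuda")
offload_device = torch.device("cpu")

# block_level with num_blocks_per_group=2: layers move between the offload and onload
# devices two at a time, so a 40-layer model incurs 20 onload/offload round trips.
transformer.enable_group_offload(
    onload_device=onload_device,
    offload_device=offload_device,
    offload_type="block_level",
    num_blocks_per_group=2,
)
```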
````diff
@@ -287,7 +287,7 @@ Set `record_stream=True` for more of a speedup at the cost of slightly increased
 > [!TIP]
 > When `use_stream=True` on VAEs with tiling enabled, make sure to do a dummy forward pass (possible with dummy inputs as well) before inference to avoid device mismatch errors. This may not work on all implementations, so feel free to open an issue if you encounter any problems.
 
-The `num_blocks_per_group` parameter should be set to `1` if `use_stream` is enabled.
+If you're using `block_level` group offloading with `use_stream` enabled, the `num_blocks_per_group` parameter should be set to `1`, otherwise a warning will be raised.
 
 ```py
 pipeline.transformer.enable_group_offload(onload_device=onload_device, offload_device=offload_device, offload_type="leaf_level", use_stream=True, record_stream=True)
````
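And a hedged sketch of the `block_level` + `use_stream` combination the new sentence describes, reusing the names from the context line above (not part of the diff itself):

```py
# With use_stream=True and block_level offloading, keep num_blocks_per_group=1
# to avoid the warning mentioned in the updated text.
pipeline.transformer.enable_group_offload(
    onload_device=onload_device,
    offload_device=offload_device,
    offload_type="block_level",
    num_blocks_per_group=1,
    use_stream=True,
    record_stream=True,
)
```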
