Commit f967993

Merge branch 'main' into lora
2 parents 360379c + 38ced7e commit f967993

130 files changed: +12576 / -1307 lines changed

.github/workflows/nightly_tests.yml

Lines changed: 49 additions & 0 deletions
@@ -180,6 +180,55 @@ jobs:
         pip install slack_sdk tabulate
         python utils/log_reports.py >> $GITHUB_STEP_SUMMARY
 
+  run_torch_compile_tests:
+    name: PyTorch Compile CUDA tests
+
+    runs-on:
+      group: aws-g4dn-2xlarge
+
+    container:
+      image: diffusers/diffusers-pytorch-compile-cuda
+      options: --gpus 0 --shm-size "16gb" --ipc host
+
+    steps:
+    - name: Checkout diffusers
+      uses: actions/checkout@v3
+      with:
+        fetch-depth: 2
+
+    - name: NVIDIA-SMI
+      run: |
+        nvidia-smi
+    - name: Install dependencies
+      run: |
+        python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH"
+        python -m uv pip install -e [quality,test,training]
+    - name: Environment
+      run: |
+        python utils/print_env.py
+    - name: Run torch compile tests on GPU
+      env:
+        HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }}
+        RUN_COMPILE: yes
+      run: |
+        python -m pytest -n 1 --max-worker-restart=0 --dist=loadfile -s -v -k "compile" --make-reports=tests_torch_compile_cuda tests/
+    - name: Failure short reports
+      if: ${{ failure() }}
+      run: cat reports/tests_torch_compile_cuda_failures_short.txt
+
+    - name: Test suite reports artifacts
+      if: ${{ always() }}
+      uses: actions/upload-artifact@v4
+      with:
+        name: torch_compile_test_reports
+        path: reports
+
+    - name: Generate Report and Notify Channel
+      if: always()
+      run: |
+        pip install slack_sdk tabulate
+        python utils/log_reports.py >> $GITHUB_STEP_SUMMARY
+
   run_big_gpu_torch_tests:
     name: Torch tests on big GPU
     strategy:
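The new job gates the compile tests behind the `RUN_COMPILE` environment variable and a `-k "compile"` pytest filter. As a hedged illustration of how such an environment-variable gate is typically wired into a test suite (the `require_run_compile` helper below is hypothetical, not the decorator diffusers actually uses):

```python
# Minimal sketch: skip torch.compile tests unless RUN_COMPILE=yes is set,
# mirroring the env gate used by the workflow above. Helper name is hypothetical.
import os
import unittest

import torch


def require_run_compile(test_case):
    """Skip the decorated test unless RUN_COMPILE is set to a truthy value."""
    enabled = os.getenv("RUN_COMPILE", "no").lower() in ("1", "true", "yes")
    return unittest.skipUnless(enabled, "set RUN_COMPILE=yes to run compile tests")(test_case)


class ToyCompileTest(unittest.TestCase):
    @require_run_compile
    def test_compiled_module_matches_eager(self):
        model = torch.nn.Linear(4, 4)
        compiled = torch.compile(model)  # compiled module should match eager outputs
        x = torch.randn(2, 4)
        torch.testing.assert_close(compiled(x), model(x))


if __name__ == "__main__":
    unittest.main()
```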

.github/workflows/release_tests_fast.yml

Lines changed: 1 addition & 1 deletion
@@ -335,7 +335,7 @@ jobs:
     - name: Environment
       run: |
         python utils/print_env.py
-    - name: Run example tests on GPU
+    - name: Run torch compile tests on GPU
       env:
         HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }}
         RUN_COMPILE: yes

docker/diffusers-onnxruntime-cpu/Dockerfile

Lines changed: 3 additions & 3 deletions
@@ -28,9 +28,9 @@ ENV PATH="/opt/venv/bin:$PATH"
 # pre-install the heavy dependencies (these can later be overridden by the deps from setup.py)
 RUN python3 -m pip install --no-cache-dir --upgrade pip uv==0.1.11 && \
     python3 -m uv pip install --no-cache-dir \
-        torch==2.1.2 \
-        torchvision==0.16.2 \
-        torchaudio==2.1.2 \
+        torch \
+        torchvision \
+        torchaudio\
         onnxruntime \
         --extra-index-url https://download.pytorch.org/whl/cpu && \
     python3 -m uv pip install --no-cache-dir \

docs/source/en/_toctree.yml

Lines changed: 30 additions & 30 deletions
@@ -291,12 +291,12 @@
         title: AuraFlowTransformer2DModel
       - local: api/models/cogvideox_transformer3d
         title: CogVideoXTransformer3DModel
-      - local: api/models/consisid_transformer3d
-        title: ConsisIDTransformer3DModel
       - local: api/models/cogview3plus_transformer2d
         title: CogView3PlusTransformer2DModel
       - local: api/models/cogview4_transformer2d
         title: CogView4Transformer2DModel
+      - local: api/models/consisid_transformer3d
+        title: ConsisIDTransformer3DModel
       - local: api/models/dit_transformer2d
         title: DiTTransformer2DModel
       - local: api/models/easyanimate_transformer3d
@@ -311,12 +311,12 @@
         title: HunyuanVideoTransformer3DModel
       - local: api/models/latte_transformer3d
         title: LatteTransformer3DModel
-      - local: api/models/lumina_nextdit2d
-        title: LuminaNextDiT2DModel
-      - local: api/models/lumina2_transformer2d
-        title: Lumina2Transformer2DModel
       - local: api/models/ltx_video_transformer3d
         title: LTXVideoTransformer3DModel
+      - local: api/models/lumina2_transformer2d
+        title: Lumina2Transformer2DModel
+      - local: api/models/lumina_nextdit2d
+        title: LuminaNextDiT2DModel
       - local: api/models/mochi_transformer3d
         title: MochiTransformer3DModel
       - local: api/models/omnigen_transformer
@@ -325,10 +325,10 @@
         title: PixArtTransformer2DModel
       - local: api/models/prior_transformer
         title: PriorTransformer
-      - local: api/models/sd3_transformer2d
-        title: SD3Transformer2DModel
       - local: api/models/sana_transformer2d
         title: SanaTransformer2DModel
+      - local: api/models/sd3_transformer2d
+        title: SD3Transformer2DModel
       - local: api/models/stable_audio_transformer
         title: StableAudioDiTModel
       - local: api/models/transformer2d
@@ -343,10 +343,10 @@
         title: StableCascadeUNet
       - local: api/models/unet
         title: UNet1DModel
-      - local: api/models/unet2d
-        title: UNet2DModel
       - local: api/models/unet2d-cond
         title: UNet2DConditionModel
+      - local: api/models/unet2d
+        title: UNet2DModel
       - local: api/models/unet3d-cond
         title: UNet3DConditionModel
       - local: api/models/unet-motion
@@ -355,6 +355,10 @@
         title: UViT2DModel
     title: UNets
   - sections:
+      - local: api/models/asymmetricautoencoderkl
+        title: AsymmetricAutoencoderKL
+      - local: api/models/autoencoder_dc
+        title: AutoencoderDC
       - local: api/models/autoencoderkl
         title: AutoencoderKL
       - local: api/models/autoencoderkl_allegro
@@ -371,10 +375,6 @@
         title: AutoencoderKLMochi
       - local: api/models/autoencoder_kl_wan
         title: AutoencoderKLWan
-      - local: api/models/asymmetricautoencoderkl
-        title: AsymmetricAutoencoderKL
-      - local: api/models/autoencoder_dc
-        title: AutoencoderDC
       - local: api/models/consistency_decoder_vae
         title: ConsistencyDecoderVAE
       - local: api/models/autoencoder_oobleck
@@ -522,40 +522,40 @@
   - sections:
       - local: api/pipelines/stable_diffusion/overview
        title: Overview
-      - local: api/pipelines/stable_diffusion/text2img
-        title: Text-to-image
+      - local: api/pipelines/stable_diffusion/depth2img
+        title: Depth-to-image
+      - local: api/pipelines/stable_diffusion/gligen
+        title: GLIGEN (Grounded Language-to-Image Generation)
+      - local: api/pipelines/stable_diffusion/image_variation
+        title: Image variation
       - local: api/pipelines/stable_diffusion/img2img
        title: Image-to-image
       - local: api/pipelines/stable_diffusion/svd
        title: Image-to-video
      - local: api/pipelines/stable_diffusion/inpaint
        title: Inpainting
-      - local: api/pipelines/stable_diffusion/depth2img
-        title: Depth-to-image
-      - local: api/pipelines/stable_diffusion/image_variation
-        title: Image variation
+      - local: api/pipelines/stable_diffusion/k_diffusion
+        title: K-Diffusion
+      - local: api/pipelines/stable_diffusion/latent_upscale
+        title: Latent upscaler
+      - local: api/pipelines/stable_diffusion/ldm3d_diffusion
+        title: LDM3D Text-to-(RGB, Depth), Text-to-(RGB-pano, Depth-pano), LDM3D Upscaler
       - local: api/pipelines/stable_diffusion/stable_diffusion_safe
        title: Safe Stable Diffusion
+      - local: api/pipelines/stable_diffusion/sdxl_turbo
+        title: SDXL Turbo
       - local: api/pipelines/stable_diffusion/stable_diffusion_2
        title: Stable Diffusion 2
       - local: api/pipelines/stable_diffusion/stable_diffusion_3
        title: Stable Diffusion 3
       - local: api/pipelines/stable_diffusion/stable_diffusion_xl
        title: Stable Diffusion XL
-      - local: api/pipelines/stable_diffusion/sdxl_turbo
-        title: SDXL Turbo
-      - local: api/pipelines/stable_diffusion/latent_upscale
-        title: Latent upscaler
       - local: api/pipelines/stable_diffusion/upscale
        title: Super-resolution
-      - local: api/pipelines/stable_diffusion/k_diffusion
-        title: K-Diffusion
-      - local: api/pipelines/stable_diffusion/ldm3d_diffusion
-        title: LDM3D Text-to-(RGB, Depth), Text-to-(RGB-pano, Depth-pano), LDM3D Upscaler
       - local: api/pipelines/stable_diffusion/adapter
        title: T2I-Adapter
-      - local: api/pipelines/stable_diffusion/gligen
-        title: GLIGEN (Grounded Language-to-Image Generation)
+      - local: api/pipelines/stable_diffusion/text2img
+        title: Text-to-image
     title: Stable Diffusion
   - local: api/pipelines/stable_unclip
     title: Stable unCLIP

docs/source/en/api/loaders/lora.md

Lines changed: 15 additions & 0 deletions
@@ -25,7 +25,10 @@ LoRA is a fast and lightweight training method that inserts and trains a signifi
 - [`SanaLoraLoaderMixin`] provides similar functions for [Sana](https://huggingface.co/docs/diffusers/main/en/api/pipelines/sana).
 - [`HunyuanVideoLoraLoaderMixin`] provides similar functions for [HunyuanVideo](https://huggingface.co/docs/diffusers/main/en/api/pipelines/hunyuan_video).
 - [`Lumina2LoraLoaderMixin`] provides similar functions for [Lumina2](https://huggingface.co/docs/diffusers/main/en/api/pipelines/lumina2).
+- [`WanLoraLoaderMixin`] provides similar functions for [Wan](https://huggingface.co/docs/diffusers/main/en/api/pipelines/wan).
+- [`CogView4LoraLoaderMixin`] provides similar functions for [CogView4](https://huggingface.co/docs/diffusers/main/en/api/pipelines/cogview4).
 - [`AmusedLoraLoaderMixin`] is for the [`AmusedPipeline`].
+- [`HiDreamImageLoraLoaderMixin`] provides similar functions for [HiDream Image](https://huggingface.co/docs/diffusers/main/en/api/pipelines/hidream)
 - [`LoraBaseMixin`] provides a base class with several utility methods to fuse, unfuse, unload, LoRAs and more.
 
 <Tip>
@@ -77,10 +80,22 @@ To learn more about how to load LoRA weights, see the [LoRA](../../using-diffuse
 
 [[autodoc]] loaders.lora_pipeline.Lumina2LoraLoaderMixin
 
+## CogView4LoraLoaderMixin
+
+[[autodoc]] loaders.lora_pipeline.CogView4LoraLoaderMixin
+
+## WanLoraLoaderMixin
+
+[[autodoc]] loaders.lora_pipeline.WanLoraLoaderMixin
+
 ## AmusedLoraLoaderMixin
 
 [[autodoc]] loaders.lora_pipeline.AmusedLoraLoaderMixin
 
+## HiDreamImageLoraLoaderMixin
+
+[[autodoc]] loaders.lora_pipeline.HiDreamImageLoraLoaderMixin
+
 ## LoraBaseMixin
 
 [[autodoc]] loaders.lora_base.LoraBaseMixin
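The newly documented mixins are used indirectly through the pipelines that inherit them. A minimal sketch of the typical call pattern via `WanPipeline` (which inherits `WanLoraLoaderMixin`); the model and LoRA repository ids below are placeholders, not recommendations from this commit:

```python
# Sketch: loading and fusing a LoRA through the pipeline's LoRA loader mixin.
# Repo ids are illustrative placeholders.
import torch
from diffusers import WanPipeline

pipe = WanPipeline.from_pretrained("Wan-AI/Wan2.1-T2V-1.3B-Diffusers", torch_dtype=torch.bfloat16)
pipe.to("cuda")

# load_lora_weights() is provided by WanLoraLoaderMixin.
pipe.load_lora_weights("your-username/your-wan-lora", adapter_name="my_lora")

# LoraBaseMixin utilities: fuse the LoRA into the base weights, or undo/unload it.
pipe.fuse_lora()
# pipe.unfuse_lora()
# pipe.unload_lora_weights()
```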

docs/source/en/api/pipelines/aura_flow.md

Lines changed: 17 additions & 0 deletions
@@ -89,6 +89,23 @@ image = pipeline(prompt).images[0]
 image.save("auraflow.png")
 ```
 
+## Support for `torch.compile()`
+
+AuraFlow can be compiled with `torch.compile()` to speed up inference, even across different resolutions. First, install PyTorch nightly following the instructions from [here](https://pytorch.org/). The snippet below shows the changes needed to enable this:
+
+```diff
++ torch.fx.experimental._config.use_duck_shape = False
++ pipeline.transformer = torch.compile(
+    pipeline.transformer, fullgraph=True, dynamic=True
+ )
+```
+
+Setting `use_duck_shape` to `False` instructs the compiler not to reuse the same symbolic variable for distinct input sizes that happen to be equal. For more details, check out [this comment](https://github.com/huggingface/diffusers/pull/11327#discussion_r2047659790).
+
+This yields speed improvements ranging from 100% (at low resolutions) to 30% (at 1536x1536 resolution).
+
+Thanks to [AstraliteHeart](https://github.com/huggingface/diffusers/pull/11297/), who helped us rewrite the [`AuraFlowTransformer2DModel`] class so that the above works for different resolutions ([PR](https://github.com/huggingface/diffusers/pull/11297/)).
+
 ## AuraFlowPipeline
 
 [[autodoc]] AuraFlowPipeline
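Putting the documented snippet into a complete script might look like the sketch below. The `use_duck_shape` flag and the `torch.compile` call come from the diff above; the `fal/AuraFlow` checkpoint id, prompt, and surrounding pipeline code are illustrative assumptions:

```python
# Sketch: compiling the AuraFlow transformer once and reusing it across resolutions.
import torch
from diffusers import AuraFlowPipeline

# Avoid sharing symbolic shape variables between equal-sized inputs (see diff above).
torch.fx.experimental._config.use_duck_shape = False

pipeline = AuraFlowPipeline.from_pretrained("fal/AuraFlow", torch_dtype=torch.float16)
pipeline.to("cuda")

# dynamic=True lets one compiled graph serve multiple resolutions without recompiling.
pipeline.transformer = torch.compile(pipeline.transformer, fullgraph=True, dynamic=True)

image = pipeline("A photo of a corgi astronaut", height=1024, width=1024).images[0]
image.save("auraflow_compiled.png")
```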

docs/source/en/api/pipelines/flux.md

Lines changed: 1 addition & 1 deletion
@@ -347,7 +347,7 @@ image = pipe(
     height=1024,
     prompt="wearing sunglasses",
     negative_prompt="",
-    true_cfg=4.0,
+    true_cfg_scale=4.0,
     generator=torch.Generator().manual_seed(4444),
     ip_adapter_image=image,
 ).images[0]

docs/source/en/api/pipelines/wan.md

Lines changed: 55 additions & 1 deletion
@@ -24,7 +24,7 @@
 
 ## Generating Videos with Wan 2.1
 
-We will first need to install some addtional dependencies.
+We will first need to install some additional dependencies.
 
 ```shell
 pip install -u ftfy imageio-ffmpeg imageio
@@ -133,6 +133,60 @@ output = pipe(
 export_to_video(output, "wan-i2v.mp4", fps=16)
 ```
 
+### First and Last Frame Interpolation
+
+```python
+import numpy as np
+import torch
+import torchvision.transforms.functional as TF
+from diffusers import AutoencoderKLWan, WanImageToVideoPipeline
+from diffusers.utils import export_to_video, load_image
+from transformers import CLIPVisionModel
+
+
+model_id = "Wan-AI/Wan2.1-FLF2V-14B-720P-diffusers"
+image_encoder = CLIPVisionModel.from_pretrained(model_id, subfolder="image_encoder", torch_dtype=torch.float32)
+vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
+pipe = WanImageToVideoPipeline.from_pretrained(
+    model_id, vae=vae, image_encoder=image_encoder, torch_dtype=torch.bfloat16
+)
+pipe.to("cuda")
+
+first_frame = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/flf2v_input_first_frame.png")
+last_frame = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/flf2v_input_last_frame.png")
+
+def aspect_ratio_resize(image, pipe, max_area=720 * 1280):
+    aspect_ratio = image.height / image.width
+    mod_value = pipe.vae_scale_factor_spatial * pipe.transformer.config.patch_size[1]
+    height = round(np.sqrt(max_area * aspect_ratio)) // mod_value * mod_value
+    width = round(np.sqrt(max_area / aspect_ratio)) // mod_value * mod_value
+    image = image.resize((width, height))
+    return image, height, width
+
+def center_crop_resize(image, height, width):
+    # Calculate resize ratio to match first frame dimensions
+    resize_ratio = max(width / image.width, height / image.height)
+
+    # Resize the image
+    width = round(image.width * resize_ratio)
+    height = round(image.height * resize_ratio)
+    size = [width, height]
+    image = TF.center_crop(image, size)
+
+    return image, height, width
+
+first_frame, height, width = aspect_ratio_resize(first_frame, pipe)
+if last_frame.size != first_frame.size:
+    last_frame, _, _ = center_crop_resize(last_frame, height, width)
+
+prompt = "CG animation style, a small blue bird takes off from the ground, flapping its wings. The bird's feathers are delicate, with a unique pattern on its chest. The background shows a blue sky with white clouds under bright sunshine. The camera follows the bird upward, capturing its flight and the vastness of the sky from a close-up, low-angle perspective."
+
+output = pipe(
+    image=first_frame, last_image=last_frame, prompt=prompt, height=height, width=width, guidance_scale=5.5
+).frames[0]
+export_to_video(output, "output.mp4", fps=16)
+```
+
 ### Video to Video Generation
 
 ```python

docs/source/en/training/cogvideox.md

Lines changed: 1 addition & 1 deletion
@@ -216,7 +216,7 @@ Setting the `<ID_TOKEN>` is not necessary. From some limited experimentation, we
 > - The original repository uses a `lora_alpha` of `1`. We found this not suitable in many runs, possibly due to difference in modeling backends and training settings. Our recommendation is to set to the `lora_alpha` to either `rank` or `rank // 2`.
 > - If you're training on data whose captions generate bad results with the original model, a `rank` of 64 and above is good and also the recommendation by the team behind CogVideoX. If the generations are already moderately good on your training captions, a `rank` of 16/32 should work. We found that setting the rank too low, say `4`, is not ideal and doesn't produce promising results.
 > - The authors of CogVideoX recommend 4000 training steps and 100 training videos overall to achieve the best result. While that might yield the best results, we found from our limited experimentation that 2000 steps and 25 videos could also be sufficient.
-> - When using the Prodigy opitimizer for training, one can follow the recommendations from [this](https://huggingface.co/blog/sdxl_lora_advanced_script) blog. Prodigy tends to overfit quickly. From my very limited testing, I found a learning rate of `0.5` to be suitable in addition to `--prodigy_use_bias_correction`, `prodigy_safeguard_warmup` and `--prodigy_decouple`.
+> - When using the Prodigy optimizer for training, one can follow the recommendations from [this](https://huggingface.co/blog/sdxl_lora_advanced_script) blog. Prodigy tends to overfit quickly. From my very limited testing, I found a learning rate of `0.5` to be suitable in addition to `--prodigy_use_bias_correction`, `prodigy_safeguard_warmup` and `--prodigy_decouple`.
 > - The recommended learning rate by the CogVideoX authors and from our experimentation with Adam/AdamW is between `1e-3` and `1e-4` for a dataset of 25+ videos.
 >
 > Note that our testing is not exhaustive due to limited time for exploration. Our recommendation would be to play around with the different knobs and dials to find the best settings for your data.
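The Prodigy flags mentioned in the note correspond to keyword arguments of the `Prodigy` optimizer from the `prodigyopt` package. A rough, hedged sketch of that mapping; the exact wiring inside the CogVideoX training script may differ:

```python
# Sketch: Prodigy settings referenced in the note above, mapped onto prodigyopt.
import torch
from prodigyopt import Prodigy

model = torch.nn.Linear(16, 16)  # stand-in for the LoRA parameters being trained

optimizer = Prodigy(
    model.parameters(),
    lr=0.5,                    # learning rate suggested in the note
    use_bias_correction=True,  # --prodigy_use_bias_correction
    safeguard_warmup=True,     # --prodigy_safeguard_warmup
    decouple=True,             # --prodigy_decouple (decoupled weight decay)
)

# One illustrative optimization step.
loss = model(torch.randn(4, 16)).pow(2).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```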

docs/source/en/training/dreambooth.md

Lines changed: 1 addition & 1 deletion
@@ -589,7 +589,7 @@ For stage 2 of DeepFloyd IF with DreamBooth, pay attention to these parameters:
 
 * `--learning_rate=5e-6`, use a lower learning rate with a smaller effective batch size
 * `--resolution=256`, the expected resolution for the upscaler
-* `--train_batch_size=2` and `--gradient_accumulation_steps=6`, to effectively train on images wiht faces requires larger batch sizes
+* `--train_batch_size=2` and `--gradient_accumulation_steps=6`, to effectively train on images with faces requires larger batch sizes
 
 ```bash
 export MODEL_NAME="DeepFloyd/IF-II-L-v1.0"

docs/source/en/training/t2i_adapters.md

Lines changed: 1 addition & 1 deletion
@@ -89,7 +89,7 @@ Many of the basic and important parameters are described in the [Text-to-image](
 
 As with the script parameters, a walkthrough of the training script is provided in the [Text-to-image](text2image#training-script) training guide. Instead, this guide takes a look at the T2I-Adapter relevant parts of the script.
 
-The training script begins by preparing the dataset. This incudes [tokenizing](https://github.com/huggingface/diffusers/blob/aab6de22c33cc01fb7bc81c0807d6109e2c998c9/examples/t2i_adapter/train_t2i_adapter_sdxl.py#L674) the prompt and [applying transforms](https://github.com/huggingface/diffusers/blob/aab6de22c33cc01fb7bc81c0807d6109e2c998c9/examples/t2i_adapter/train_t2i_adapter_sdxl.py#L714) to the images and conditioning images.
+The training script begins by preparing the dataset. This includes [tokenizing](https://github.com/huggingface/diffusers/blob/aab6de22c33cc01fb7bc81c0807d6109e2c998c9/examples/t2i_adapter/train_t2i_adapter_sdxl.py#L674) the prompt and [applying transforms](https://github.com/huggingface/diffusers/blob/aab6de22c33cc01fb7bc81c0807d6109e2c998c9/examples/t2i_adapter/train_t2i_adapter_sdxl.py#L714) to the images and conditioning images.
 
 ```py
 conditioning_image_transforms = transforms.Compose(
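The hunk cuts off where `conditioning_image_transforms` is defined. As a hedged illustration of what such a conditioning-image preprocessing pipeline usually contains (the real definition lives in `train_t2i_adapter_sdxl.py` and may differ), a minimal sketch:

```python
# Sketch: typical conditioning-image transforms; resolution value is an assumption.
from torchvision import transforms

resolution = 1024  # assumed training resolution

conditioning_image_transforms = transforms.Compose(
    [
        transforms.Resize(resolution, interpolation=transforms.InterpolationMode.BILINEAR),
        transforms.CenterCrop(resolution),
        transforms.ToTensor(),  # conditioning images are usually kept in [0, 1], not normalized
    ]
)
```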
