
Commit c64fa22

Merge branch 'main' into layerwise-upcasting
2 parents 0d1a1f8 + ba4348d commit c64fa22

File tree: 13 files changed (+136, -16 lines)


docs/source/en/api/pipelines/controlnet_sd3.md

Lines changed: 1 addition & 1 deletion
@@ -22,7 +22,7 @@ The abstract from the paper is:
 
 *We present ControlNet, a neural network architecture to add spatial conditioning controls to large, pretrained text-to-image diffusion models. ControlNet locks the production-ready large diffusion models, and reuses their deep and robust encoding layers pretrained with billions of images as a strong backbone to learn a diverse set of conditional controls. The neural architecture is connected with "zero convolutions" (zero-initialized convolution layers) that progressively grow the parameters from zero and ensure that no harmful noise could affect the finetuning. We test various conditioning controls, eg, edges, depth, segmentation, human pose, etc, with Stable Diffusion, using single or multiple conditions, with or without prompts. We show that the training of ControlNets is robust with small (<50k) and large (>1m) datasets. Extensive results show that ControlNet may facilitate wider applications to control image diffusion models.*
 
-This controlnet code is mainly implemented by [The InstantX Team](https://huggingface.co/InstantX). The inpainting-related code was developed by [The Alimama Creative Team](https://huggingface.co/alimama-creative). You can find pre-trained checkpoints for SD3-ControlNet in the table below:
+This controlnet code is mainly implemented by [The InstantX Team](https://huggingface.co/InstantX). The inpainting-related code was developed by [The Alimama Creative Team](https://huggingface.co/alimama-creative). You can find pre-trained checkpoints for SD3-ControlNet in the table below:
 
 
 | ControlNet type | Developer | Link |
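
For context, the checkpoints in that table are consumed through the SD3 ControlNet classes. A minimal loading sketch, assuming one of the InstantX checkpoints (the exact table entries are not shown in this hunk, so treat the ids as placeholders):

```python
import torch
from diffusers import SD3ControlNetModel, StableDiffusion3ControlNetPipeline
from diffusers.utils import load_image

# Assumed checkpoint ids; substitute whichever entry from the table you need.
controlnet = SD3ControlNetModel.from_pretrained("InstantX/SD3-Controlnet-Canny", torch_dtype=torch.float16)
pipe = StableDiffusion3ControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers", controlnet=controlnet, torch_dtype=torch.float16
)
pipe.enable_model_cpu_offload()

control_image = load_image("path/to/canny_edge_map.png")  # your conditioning image
image = pipe(
    "a photo of an astronaut riding a horse",
    control_image=control_image,
    controlnet_conditioning_scale=0.7,
).images[0]
```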

docs/source/en/api/pipelines/kolors.md

Lines changed: 2 additions & 2 deletions
@@ -14,7 +14,7 @@ specific language governing permissions and limitations under the License.
 
 ![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/kolors/kolors_header_collage.png)
 
-Kolors is a large-scale text-to-image generation model based on latent diffusion, developed by [the Kuaishou Kolors team](kwai-kolors@kuaishou.com). Trained on billions of text-image pairs, Kolors exhibits significant advantages over both open-source and closed-source models in visual quality, complex semantic accuracy, and text rendering for both Chinese and English characters. Furthermore, Kolors supports both Chinese and English inputs, demonstrating strong performance in understanding and generating Chinese-specific content. For more details, please refer to this [technical report](https://github.com/Kwai-Kolors/Kolors/blob/master/imgs/Kolors_paper.pdf).
+Kolors is a large-scale text-to-image generation model based on latent diffusion, developed by [the Kuaishou Kolors team](https://github.com/Kwai-Kolors/Kolors). Trained on billions of text-image pairs, Kolors exhibits significant advantages over both open-source and closed-source models in visual quality, complex semantic accuracy, and text rendering for both Chinese and English characters. Furthermore, Kolors supports both Chinese and English inputs, demonstrating strong performance in understanding and generating Chinese-specific content. For more details, please refer to this [technical report](https://github.com/Kwai-Kolors/Kolors/blob/master/imgs/Kolors_paper.pdf).
 
 The abstract from the technical report is:
 

@@ -74,7 +74,7 @@ image_encoder = CLIPVisionModelWithProjection.from_pretrained(
 
 pipe = KolorsPipeline.from_pretrained(
     "Kwai-Kolors/Kolors-diffusers", image_encoder=image_encoder, torch_dtype=torch.float16, variant="fp16"
-).to("cuda")
+)
 pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config, use_karras_sigmas=True)
 
 pipe.load_ip_adapter(
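
The `.to("cuda")` call is dropped because the surrounding example (not fully shown in this hunk) relies on offloading rather than moving the whole pipeline to the GPU up front. A minimal sketch of that pattern, with the offload call and prompt assumed:

```python
import torch
from diffusers import DPMSolverMultistepScheduler, KolorsPipeline

# Load on CPU; enable_model_cpu_offload() moves each sub-model to the GPU only
# while it is in use, so an up-front .to("cuda") is unnecessary.
pipe = KolorsPipeline.from_pretrained(
    "Kwai-Kolors/Kolors-diffusers", torch_dtype=torch.float16, variant="fp16"
)
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config, use_karras_sigmas=True)
pipe.enable_model_cpu_offload()

image = pipe("A ladybug perched on a dewy leaf, macro photo", num_inference_steps=25).images[0]
```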

docs/source/en/api/pipelines/pag.md

Lines changed: 2 additions & 2 deletions
@@ -20,7 +20,7 @@ The abstract from the paper is:
 
 *Recent studies have demonstrated that diffusion models are capable of generating high-quality samples, but their quality heavily depends on sampling guidance techniques, such as classifier guidance (CG) and classifier-free guidance (CFG). These techniques are often not applicable in unconditional generation or in various downstream tasks such as image restoration. In this paper, we propose a novel sampling guidance, called Perturbed-Attention Guidance (PAG), which improves diffusion sample quality across both unconditional and conditional settings, achieving this without requiring additional training or the integration of external modules. PAG is designed to progressively enhance the structure of samples throughout the denoising process. It involves generating intermediate samples with degraded structure by substituting selected self-attention maps in diffusion U-Net with an identity matrix, by considering the self-attention mechanisms' ability to capture structural information, and guiding the denoising process away from these degraded samples. In both ADM and Stable Diffusion, PAG surprisingly improves sample quality in conditional and even unconditional scenarios. Moreover, PAG significantly improves the baseline performance in various downstream tasks where existing guidances such as CG or CFG cannot be fully utilized, including ControlNet with empty prompts and image restoration such as inpainting and deblurring.*
 
-PAG can be used by specifying the `pag_applied_layers` as a parameter when instantiating a PAG pipeline. It can be a single string or a list of strings. Each string can be a unique layer identifier or a regular expression to identify one or more layers.
+PAG can be used by specifying the `pag_applied_layers` as a parameter when instantiating a PAG pipeline. It can be a single string or a list of strings. Each string can be a unique layer identifier or a regular expression to identify one or more layers.
 
 - Full identifier as a normal string: `down_blocks.2.attentions.0.transformer_blocks.0.attn1.processor`
 - Full identifier as a RegEx: `down_blocks.2.(attentions|motion_modules).0.transformer_blocks.0.attn1.processor`

@@ -46,7 +46,7 @@ Since RegEx is supported as a way for matching layer identifiers, it is crucial
 ## KolorsPAGPipeline
 [[autodoc]] KolorsPAGPipeline
 - all
-- __call__
+- __call__
 
 ## StableDiffusionPAGPipeline
 [[autodoc]] StableDiffusionPAGPipeline
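
As a usage sketch (checkpoint and layer names are illustrative), a PAG pipeline is typically created through `AutoPipelineForText2Image` with `enable_pag=True` and the `pag_applied_layers` described above:

```python
import torch
from diffusers import AutoPipelineForText2Image

# "mid" is shorthand matched against layer identifiers; a full identifier or a
# regular expression like the ones listed above works the same way.
pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    enable_pag=True,
    pag_applied_layers=["mid"],
    torch_dtype=torch.float16,
)
pipe.enable_model_cpu_offload()

image = pipe(
    "an insect robot preparing a delicious meal",
    guidance_scale=7.0,
    pag_scale=3.0,
).images[0]
```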

examples/dreambooth/README_flux.md

Lines changed: 6 additions & 6 deletions
@@ -3,17 +3,17 @@
 [DreamBooth](https://arxiv.org/abs/2208.12242) is a method to personalize text2image models like stable diffusion given just a few (3~5) images of a subject.
 
 The `train_dreambooth_flux.py` script shows how to implement the training procedure and adapt it for [FLUX.1 [dev]](https://blackforestlabs.ai/announcing-black-forest-labs/). We also provide a LoRA implementation in the `train_dreambooth_lora_flux.py` script.
-> [!NOTE]
+> [!NOTE]
 > **Memory consumption**
->
-> Flux can be quite expensive to run on consumer hardware devices and as a result finetuning it comes with high memory requirements -
+>
+> Flux can be quite expensive to run on consumer hardware devices and as a result finetuning it comes with high memory requirements -
 > a LoRA with a rank of 16 (w/ all components trained) can exceed 40GB of VRAM for training.
-> For more tips & guidance on training on a resource-constrained device please visit [`@bghira`'s guide](https://github.com/bghira/SimpleTuner/blob/main/documentation/quickstart/FLUX.md)
+> For more tips & guidance on training on a resource-constrained device please visit [`@bghira`'s guide](https://github.com/bghira/SimpleTuner/blob/main/documentation/quickstart/FLUX.md)
 
 
 > [!NOTE]
 > **Gated model**
->
+>
 > As the model is gated, before using it with diffusers you first need to go to the [FLUX.1 [dev] Hugging Face page](https://huggingface.co/black-forest-labs/FLUX.1-dev), fill in the form and accept the gate. Once you are in, you need to log in so that your system knows you’ve accepted the gate. Use the command below to log in:
 
 ```bash

@@ -163,7 +163,7 @@ To do so, just specify `--train_text_encoder` while launching training. Please k
 
 > [!NOTE]
 > FLUX.1 has 2 text encoders (CLIP L/14 and T5-v1.1-XXL).
-By enabling `--train_text_encoder`, fine-tuning of the **CLIP encoder** is performed.
+By enabling `--train_text_encoder`, fine-tuning of the **CLIP encoder** is performed.
 > At the moment, T5 fine-tuning is not supported and weights remain frozen when text encoder training is enabled.
 
 To perform DreamBooth LoRA with text-encoder training, run:
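
After training, the saved LoRA can be loaded back into `FluxPipeline` for inference. A minimal sketch, assuming the LoRA was written to a hypothetical `trained-flux-lora` output directory:

```python
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16)
pipe.load_lora_weights("trained-flux-lora")  # hypothetical --output_dir from the training run
pipe.enable_model_cpu_offload()  # keeps peak VRAM manageable on consumer GPUs

image = pipe(
    "a photo of sks dog in a bucket",
    num_inference_steps=28,
    guidance_scale=3.5,
).images[0]
image.save("flux_dreambooth_lora.png")
```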

examples/dreambooth/train_dreambooth_lora_sd3.py

Lines changed: 1 addition & 1 deletion
@@ -1454,7 +1454,7 @@ def compute_text_embeddings(prompt, text_encoders, tokenizers):
         )
 
     # Clear the memory here
-    if not args.train_text_encoder and train_dataset.custom_instance_prompts:
+    if not args.train_text_encoder and not train_dataset.custom_instance_prompts:
         del tokenizers, text_encoders
         # Explicitly delete the objects as well, otherwise only the lists are deleted and the original references remain, preventing garbage collection
         del text_encoder_one, text_encoder_two, text_encoder_three
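
The corrected condition frees the encoders only when `--train_text_encoder` is off and there are no per-image instance prompts, i.e. when the prompt embeddings have already been computed once and cached. A generic, self-contained sketch of the release pattern (the small modules here are stand-ins for the real CLIP/T5 encoders):

```python
import gc

import torch

# Stand-ins for the heavy text encoders that were only needed to pre-compute
# prompt embeddings once.
text_encoder_one = torch.nn.Linear(512, 512)
text_encoder_two = torch.nn.Linear(512, 512)
text_encoders = [text_encoder_one, text_encoder_two]

# ... compute and cache the prompt embeddings here ...

# Drop every reference (deleting only the list is not enough), then reclaim memory.
del text_encoders
del text_encoder_one, text_encoder_two
gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()
```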

src/diffusers/loaders/ip_adapter.py

Lines changed: 5 additions & 1 deletion
@@ -222,7 +222,11 @@ def load_ip_adapter(
 
         # create feature extractor if it has not been registered to the pipeline yet
         if hasattr(self, "feature_extractor") and getattr(self, "feature_extractor", None) is None:
-            clip_image_size = self.image_encoder.config.image_size
+            # FaceID IP adapters don't need the image encoder so it's not present, in this case we default to 224
+            default_clip_size = 224
+            clip_image_size = (
+                self.image_encoder.config.image_size if self.image_encoder is not None else default_clip_size
+            )
             feature_extractor = CLIPImageProcessor(size=clip_image_size, crop_size=clip_image_size)
             self.register_modules(feature_extractor=feature_extractor)
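
In other words, when a pipeline carries no image encoder (as with FaceID-style IP-Adapters), the feature extractor falls back to the standard CLIP resolution. A small sketch of that fallback in isolation:

```python
from transformers import CLIPImageProcessor

image_encoder = None  # what a FaceID-style pipeline would have registered
default_clip_size = 224

clip_image_size = image_encoder.config.image_size if image_encoder is not None else default_clip_size
feature_extractor = CLIPImageProcessor(size=clip_image_size, crop_size=clip_image_size)
print(feature_extractor.size, feature_extractor.crop_size)  # both resolve to 224
```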

src/diffusers/models/transformers/auraflow_transformer_2d.py

Lines changed: 1 addition & 0 deletions
@@ -274,6 +274,7 @@ class AuraFlowTransformer2DModel(ModelMixin, ConfigMixin):
         pos_embed_max_size (`int`, defaults to 4096): Maximum positions to embed from the image latents.
     """
 
+    _no_split_modules = ["AuraFlowJointTransformerBlock", "AuraFlowSingleTransformerBlock", "AuraFlowPatchEmbed"]
     _supports_gradient_checkpointing = True
     _always_upcast_modules = ["AuraFlowPatchEmbed"]

src/diffusers/pipelines/auto_pipeline.py

Lines changed: 2 additions & 0 deletions
@@ -49,6 +49,7 @@
 )
 from .kandinsky3 import Kandinsky3Img2ImgPipeline, Kandinsky3Pipeline
 from .latent_consistency_models import LatentConsistencyModelImg2ImgPipeline, LatentConsistencyModelPipeline
+from .lumina import LuminaText2ImgPipeline
 from .pag import (
     HunyuanDiTPAGPipeline,
     PixArtSigmaPAGPipeline,

@@ -106,6 +107,7 @@
         ("pixart-sigma-pag", PixArtSigmaPAGPipeline),
         ("auraflow", AuraFlowPipeline),
         ("flux", FluxPipeline),
+        ("lumina", LuminaText2ImgPipeline),
     ]
 )
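
With this mapping entry in place, `AutoPipelineForText2Image` can resolve Lumina checkpoints directly. A minimal sketch (the checkpoint id matches the one used in the Lumina docstring below; the prompt is illustrative):

```python
import torch
from diffusers import AutoPipelineForText2Image

pipe = AutoPipelineForText2Image.from_pretrained(
    "Alpha-VLLM/Lumina-Next-SFT-diffusers", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()

image = pipe("Upper body of a young woman in a Victorian-era outfit").images[0]
```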

src/diffusers/pipelines/latte/pipeline_latte.py

Lines changed: 1 addition & 1 deletion
@@ -56,7 +56,7 @@
 >>> from diffusers.utils import export_to_gif
 
 >>> # You can replace the checkpoint id with "maxin-cn/Latte-1" too.
->>> pipe = LattePipeline.from_pretrained("maxin-cn/Latte-1", torch_dtype=torch.float16).to("cuda")
+>>> pipe = LattePipeline.from_pretrained("maxin-cn/Latte-1", torch_dtype=torch.float16)
 >>> # Enable memory optimizations.
 >>> pipe.enable_model_cpu_offload()
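
The `.to("cuda")` is dropped because the docstring example enables `enable_model_cpu_offload()` immediately afterwards. A self-contained sketch of how the example reads after this change (the generation lines are assumed from the LattePipeline API and may differ from the actual docstring):

```python
import torch
from diffusers import LattePipeline
from diffusers.utils import export_to_gif

pipe = LattePipeline.from_pretrained("maxin-cn/Latte-1", torch_dtype=torch.float16)
pipe.enable_model_cpu_offload()  # moves sub-models to the GPU only while they run

videos = pipe("A small cactus with a happy face in the Sahara desert.").frames[0]
export_to_gif(videos, "latte.gif")
```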

src/diffusers/pipelines/lumina/pipeline_lumina.py

Lines changed: 1 addition & 1 deletion
@@ -54,7 +54,7 @@
 
 >>> pipe = LuminaText2ImgPipeline.from_pretrained(
 ...     "Alpha-VLLM/Lumina-Next-SFT-diffusers", torch_dtype=torch.bfloat16
-... ).cuda()
+... )
 >>> # Enable memory optimizations.
 >>> pipe.enable_model_cpu_offload()

src/diffusers/pipelines/stable_diffusion_k_diffusion/pipeline_stable_diffusion_k_diffusion.py

Lines changed: 1 addition & 1 deletion
@@ -602,9 +602,9 @@ def __call__(
             sigma_min: float = self.k_diffusion_model.sigmas[0].item()
             sigma_max: float = self.k_diffusion_model.sigmas[-1].item()
             sigmas = get_sigmas_karras(n=num_inference_steps, sigma_min=sigma_min, sigma_max=sigma_max)
-            sigmas = sigmas.to(device)
         else:
             sigmas = self.scheduler.sigmas
+        sigmas = sigmas.to(device)
         sigmas = sigmas.to(prompt_embeds.dtype)
 
         # 6. Prepare latent variables
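
For reference, a pure-torch sketch of the schedule that `get_sigmas_karras` builds (the Karras et al. rho-schedule; this helper is illustrative, not k-diffusion's exact implementation), followed by the device move that the fix now applies in both branches:

```python
import torch


def karras_sigmas(n: int, sigma_min: float, sigma_max: float, rho: float = 7.0) -> torch.Tensor:
    """Interpolate n noise levels between sigma_max and sigma_min in rho-space,
    appending a final zero for the last denoising step."""
    ramp = torch.linspace(0, 1, n)
    min_inv_rho = sigma_min ** (1 / rho)
    max_inv_rho = sigma_max ** (1 / rho)
    sigmas = (max_inv_rho + ramp * (min_inv_rho - max_inv_rho)) ** rho
    return torch.cat([sigmas, sigmas.new_zeros(1)])


sigmas = karras_sigmas(n=25, sigma_min=0.0292, sigma_max=14.6146)  # example values
sigmas = sigmas.to("cuda" if torch.cuda.is_available() else "cpu")  # now done for both branches
```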

tests/models/transformers/test_models_transformer_aura_flow.py

Lines changed: 2 additions & 0 deletions
@@ -29,6 +29,8 @@
 class SD3TransformerTests(ModelTesterMixin, unittest.TestCase):
     model_class = AuraFlowTransformer2DModel
     main_input_name = "hidden_states"
+    # We override the items here because the transformer under consideration is small.
+    model_split_percents = [0.7, 0.6, 0.6]
 
     @property
     def dummy_input(self):
Lines changed: 111 additions & 0 deletions (new file)

@@ -0,0 +1,111 @@
# coding=utf-8
# Copyright 2024 HuggingFace Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import unittest

import torch

from diffusers import LuminaNextDiT2DModel
from diffusers.utils.testing_utils import (
    enable_full_determinism,
    torch_device,
)

from ..test_modeling_common import ModelTesterMixin


enable_full_determinism()


class LuminaNextDiT2DModelTransformerTests(ModelTesterMixin, unittest.TestCase):
    model_class = LuminaNextDiT2DModel
    main_input_name = "hidden_states"

    @property
    def dummy_input(self):
        """
        Args:
            None
        Returns:
            Dict: Dictionary of dummy input tensors
        """
        batch_size = 2  # N
        num_channels = 4  # C
        height = width = 16  # H, W
        embedding_dim = 32  # D
        sequence_length = 16  # L

        hidden_states = torch.randn((batch_size, num_channels, height, width)).to(torch_device)
        encoder_hidden_states = torch.randn((batch_size, sequence_length, embedding_dim)).to(torch_device)
        timestep = torch.rand(size=(batch_size,)).to(torch_device)
        encoder_mask = torch.randn(size=(batch_size, sequence_length)).to(torch_device)
        image_rotary_emb = torch.randn((384, 384, 4)).to(torch_device)

        return {
            "hidden_states": hidden_states,
            "encoder_hidden_states": encoder_hidden_states,
            "timestep": timestep,
            "encoder_mask": encoder_mask,
            "image_rotary_emb": image_rotary_emb,
            "cross_attention_kwargs": {},
        }

    @property
    def input_shape(self):
        """
        Args:
            None
        Returns:
            Tuple: (int, int, int)
        """
        return (4, 16, 16)

    @property
    def output_shape(self):
        """
        Args:
            None
        Returns:
            Tuple: (int, int, int)
        """
        return (4, 16, 16)

    def prepare_init_args_and_inputs_for_common(self):
        """
        Args:
            None

        Returns:
            Tuple: (Dict, Dict)
        """
        init_dict = {
            "sample_size": 16,
            "patch_size": 2,
            "in_channels": 4,
            "hidden_size": 24,
            "num_layers": 2,
            "num_attention_heads": 3,
            "num_kv_heads": 1,
            "multiple_of": 16,
            "ffn_dim_multiplier": None,
            "norm_eps": 1e-5,
            "learn_sigma": False,
            "qk_norm": True,
            "cross_attention_dim": 32,
            "scaling_factor": 1.0,
        }

        inputs_dict = self.dummy_input
        return init_dict, inputs_dict
