
Commit 7829f3d

Merge branch 'main' into sega-flux-community
2 parents 42e1109 + 328e0d2 commit 7829f3d

41 files changed: +4025 −36 lines

docs/source/en/_toctree.yml

Lines changed: 8 additions & 0 deletions

@@ -79,6 +79,8 @@
 - sections:
   - local: using-diffusers/cogvideox
     title: CogVideoX
+  - local: using-diffusers/consisid
+    title: ConsisID
   - local: using-diffusers/sdxl
     title: Stable Diffusion XL
   - local: using-diffusers/sdxl_turbo
@@ -179,6 +181,8 @@
     title: TGATE
   - local: optimization/xdit
     title: xDiT
+  - local: optimization/para_attn
+    title: ParaAttention
 - sections:
   - local: using-diffusers/stable_diffusion_jax_how_to
     title: JAX/Flax
@@ -268,6 +272,8 @@
     title: AuraFlowTransformer2DModel
   - local: api/models/cogvideox_transformer3d
     title: CogVideoXTransformer3DModel
+  - local: api/models/consisid_transformer3d
+    title: ConsisIDTransformer3DModel
   - local: api/models/cogview3plus_transformer2d
     title: CogView3PlusTransformer2DModel
   - local: api/models/dit_transformer2d
@@ -370,6 +376,8 @@
     title: CogVideoX
   - local: api/pipelines/cogview3
     title: CogView3
+  - local: api/pipelines/consisid
+    title: ConsisID
   - local: api/pipelines/consistency_models
     title: Consistency Models
   - local: api/pipelines/controlnet
docs/source/en/api/models/consisid_transformer3d.md

Lines changed: 30 additions & 0 deletions

@@ -0,0 +1,30 @@

<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License. -->

# ConsisIDTransformer3DModel

A Diffusion Transformer model for 3D data from [ConsisID](https://github.com/PKU-YuanGroup/ConsisID), introduced in [Identity-Preserving Text-to-Video Generation by Frequency Decomposition](https://arxiv.org/pdf/2411.17440) by researchers from Peking University, the University of Rochester, and other institutions.

The model can be loaded with the following code snippet.

```python
import torch

from diffusers import ConsisIDTransformer3DModel

transformer = ConsisIDTransformer3DModel.from_pretrained("BestWishYsh/ConsisID-preview", subfolder="transformer", torch_dtype=torch.bfloat16).to("cuda")
```

## ConsisIDTransformer3DModel

[[autodoc]] ConsisIDTransformer3DModel

## Transformer2DModelOutput

[[autodoc]] models.modeling_outputs.Transformer2DModelOutput

docs/source/en/api/pipelines/consisid.md

Lines changed: 60 additions & 0 deletions

@@ -0,0 +1,60 @@

<!--Copyright 2024 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
-->

# ConsisID

[Identity-Preserving Text-to-Video Generation by Frequency Decomposition](https://arxiv.org/abs/2411.17440) by Shenghai Yuan, Jinfa Huang, Xianyi He, Yunyang Ge, Yujun Shi, Liuhan Chen, Jiebo Luo, and Li Yuan, from Peking University, the University of Rochester, and other institutions.

The abstract from the paper is:

*Identity-preserving text-to-video (IPT2V) generation aims to create high-fidelity videos with consistent human identity. It is an important task in video generation but remains an open problem for generative models. This paper pushes the technical frontier of IPT2V in two directions that have not been resolved in the literature: (1) A tuning-free pipeline without tedious case-by-case finetuning, and (2) A frequency-aware heuristic identity-preserving Diffusion Transformer (DiT)-based control scheme. To achieve these goals, we propose **ConsisID**, a tuning-free DiT-based controllable IPT2V model to keep human-**id**entity **consis**tent in the generated video. Inspired by prior findings in frequency analysis of vision/diffusion transformers, it employs identity-control signals in the frequency domain, where facial features can be decomposed into low-frequency global features (e.g., profile, proportions) and high-frequency intrinsic features (e.g., identity markers that remain unaffected by pose changes). First, from a low-frequency perspective, we introduce a global facial extractor, which encodes the reference image and facial key points into a latent space, generating features enriched with low-frequency information. These features are then integrated into the shallow layers of the network to alleviate training challenges associated with DiT. Second, from a high-frequency perspective, we design a local facial extractor to capture high-frequency details and inject them into the transformer blocks, enhancing the model's ability to preserve fine-grained features. To leverage the frequency information for identity preservation, we propose a hierarchical training strategy, transforming a vanilla pre-trained video generation model into an IPT2V model. Extensive experiments demonstrate that our frequency-aware heuristic scheme provides an optimal control solution for DiT-based models. Thanks to this scheme, our **ConsisID** achieves excellent results in generating high-quality, identity-preserving videos, making strides towards more effective IPT2V. The model weight of ConsisID is publicly available at https://github.com/PKU-YuanGroup/ConsisID.*

<Tip>

Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers.md) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading.md#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.

</Tip>

This pipeline was contributed by [SHYuanBest](https://github.com/SHYuanBest). The original codebase can be found [here](https://github.com/PKU-YuanGroup/ConsisID). The original weights can be found under [hf.co/BestWishYsh](https://huggingface.co/BestWishYsh).

There are two official ConsisID checkpoints for identity-preserving text-to-video.

| checkpoints | recommended inference dtype |
|:---:|:---:|
| [`BestWishYsh/ConsisID-preview`](https://huggingface.co/BestWishYsh/ConsisID-preview) | torch.bfloat16 |
| [`BestWishYsh/ConsisID-1.5`](https://huggingface.co/BestWishYsh/ConsisID-1.5) | torch.bfloat16 |

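For orientation, here is a minimal loading sketch (an illustrative example, not from the original doc): it uses only the standard `DiffusionPipeline.from_pretrained` API and the recommended dtype from the table above, and it omits the generation call itself, since that additionally needs face-embedding inputs prepared as in the replication script linked under Memory optimization below.

```python
import torch

from diffusers import ConsisIDPipeline

# Load the preview checkpoint in its recommended inference dtype (see the table above).
pipe = ConsisIDPipeline.from_pretrained(
    "BestWishYsh/ConsisID-preview", torch_dtype=torch.bfloat16
)
pipe.to("cuda")
```
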
### Memory optimization

ConsisID requires about 44 GB of GPU memory to decode 49 frames (6 seconds of video at 8 FPS) at an output resolution of 720x480 (W x H), which makes it impossible to run on consumer GPUs or the free-tier T4 Colab. The following memory optimizations can be used to reduce the memory footprint. For replication, you can refer to [this](https://gist.github.com/SHYuanBest/bc4207c36f454f9e969adbb50eaf8258) script.

| Feature (each row builds on the previous ones) | Max Memory Allocated | Max Memory Reserved |
| :--------------------------------------------- | :------------------- | :------------------ |
| - | 37 GB | 44 GB |
| enable_model_cpu_offload | 22 GB | 25 GB |
| enable_sequential_cpu_offload | 16 GB | 22 GB |
| vae.enable_slicing | 16 GB | 22 GB |
| vae.enable_tiling | 5 GB | 7 GB |

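A minimal sketch of how these optimizations could be stacked (all four rows correspond to standard diffusers pipeline/VAE methods; model offloading and sequential offloading are alternatives, so the sketch enables only the more aggressive one):

```python
import torch

from diffusers import ConsisIDPipeline

pipe = ConsisIDPipeline.from_pretrained(
    "BestWishYsh/ConsisID-preview", torch_dtype=torch.bfloat16
)

# Stream submodules to the GPU one at a time: slowest option, smallest footprint.
# enable_model_cpu_offload() is the faster, less aggressive alternative.
pipe.enable_sequential_cpu_offload()

# Decode the latent video slice by slice and tile by tile to reduce
# the VAE's peak memory during decoding.
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()
```

Note that with sequential offloading enabled, the pipeline should not also be moved wholesale to the GPU with `pipe.to("cuda")`.
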
## ConsisIDPipeline

[[autodoc]] ConsisIDPipeline
  - all
  - __call__

## ConsisIDPipelineOutput

[[autodoc]] pipelines.consisid.pipeline_output.ConsisIDPipelineOutput

docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_3.md

Lines changed: 1 addition & 1 deletion

@@ -77,7 +77,7 @@ from diffusers import StableDiffusion3Pipeline
 from transformers import SiglipVisionModel, SiglipImageProcessor

 image_encoder_id = "google/siglip-so400m-patch14-384"
-ip_adapter_id = "InstantX/SD3.5-Large-IP-Adapter"
+ip_adapter_id = "guiyrt/InstantX-SD3.5-Large-IP-Adapter-diffusers"

 feature_extractor = SiglipImageProcessor.from_pretrained(
     image_encoder_id,
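
For context, the renamed repository appears to host the same InstantX IP-Adapter weights converted to diffusers format. A hedged sketch of how these identifiers are consumed a few lines further down in that guide (assuming the standard SD3 IP-Adapter flow with `load_ip_adapter` and `set_ip_adapter_scale`; the base-model id and scale value are illustrative):

```python
import torch

from diffusers import StableDiffusion3Pipeline
from transformers import SiglipImageProcessor, SiglipVisionModel

image_encoder_id = "google/siglip-so400m-patch14-384"
ip_adapter_id = "guiyrt/InstantX-SD3.5-Large-IP-Adapter-diffusers"

# Image-encoder stack that produces the IP-Adapter's image conditioning.
feature_extractor = SiglipImageProcessor.from_pretrained(image_encoder_id)
image_encoder = SiglipVisionModel.from_pretrained(
    image_encoder_id, torch_dtype=torch.bfloat16
)

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-large",  # illustrative base model
    torch_dtype=torch.bfloat16,
    feature_extractor=feature_extractor,
    image_encoder=image_encoder,
)
pipe.load_ip_adapter(ip_adapter_id)  # now resolves to the diffusers-format repo
pipe.set_ip_adapter_scale(0.6)  # illustrative adapter strength
```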
