Commit b96759c: ltx

Parent: 130d813

4 files changed, +135 −109 lines

docs/source/en/api/pipelines/cogvideox.md

Lines changed: 1 addition & 1 deletion

@@ -23,7 +23,7 @@

[CogVideoX](https://huggingface.co/papers/2408.06072) is a large diffusion transformer model - available in 2B and 5B parameters - designed to generate longer and more consistent videos from text. This model uses a 3D causal variational autoencoder to more efficiently process video data by reducing sequence length (and associated training compute) and preventing flickering in generated videos. An "expert" transformer with adaptive LayerNorm improves alignment between text and video, and 3D full attention helps accurately capture motion and time in generated videos.

-You can find all the original CogVideoX checkpoints under the CogVideoX [collection](https://huggingface.co/collections/THUDM/cogvideo-66c08e62f1685a3ade464cce).
+You can find all the original CogVideoX checkpoints under the [CogVideoX](https://huggingface.co/collections/THUDM/cogvideo-66c08e62f1685a3ade464cce) collection.

> [!TIP]
> Click on the CogVideoX models in the right sidebar for more examples of how to use CogVideoX for other video generation tasks.
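The updated page only reorders the checkpoint link, and no usage code appears in this hunk. For context, a minimal text-to-video sketch (not part of this commit; the checkpoint name and call arguments below are assumptions based on the collection linked above) could look like:

```py
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

# Illustrative sketch only; "THUDM/CogVideoX-5b" and the settings below are assumptions,
# not taken from this commit.
pipeline = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)
pipeline.enable_model_cpu_offload()  # offload idle components to CPU to reduce VRAM usage

prompt = "A panda playing an acoustic guitar in a sunlit bamboo forest"
video = pipeline(prompt=prompt, num_inference_steps=50, guidance_scale=6.0).frames[0]
export_to_video(video, "output.mp4", fps=8)
```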

docs/source/en/api/pipelines/hunyuan_video.md

Lines changed: 3 additions & 1 deletion

@@ -22,7 +22,7 @@

[HunyuanVideo](https://huggingface.co/papers/2412.03603) is a 13B diffusion transformer model designed to be competitive with closed-source video foundation models and enable wider community access. This model uses a "dual-stream to single-stream" architecture to separately process the video and text tokens first, before concatenating and feeding them to the transformer to fuse the multimodal information. A pretrained multimodal large language model (MLLM) is used as the encoder because it has better image-text alignment, better image detail description and reasoning, and it can be used as a zero-shot learner if system instructions are added to user prompts. Finally, HunyuanVideo uses a 3D causal variational autoencoder to more efficiently process video data at the original resolution and frame rate.

-You can find all the original HunyuanVideo checkpoints under the Tencent [organization](https://huggingface.co/tencent).
+You can find all the original HunyuanVideo checkpoints under the [Tencent](https://huggingface.co/tencent) organization.

> [!TIP]
> The examples below use a checkpoint from [hunyuanvideo-community](https://huggingface.co/hunyuanvideo-community) because the weights are stored in a layout compatible with Diffusers.

@@ -64,6 +64,8 @@ export_to_video(video, "output.mp4", fps=15)
</hfoptions>
<hfoption id="inference speed">

+Compilation is slow the first time but subsequent calls to the pipeline are faster.
+
```py
import torch
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig, HunyuanVideoTransformer3DModel, HunyuanVideoPipeline
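The "inference speed" code block above is cut off after its imports in this view. A hedged sketch of the compile path (not the commit's exact example; the checkpoint and call arguments here are assumptions, and the truncated imports suggest the full example also quantizes the transformer with bitsandbytes first) might look like:

```py
import torch
from diffusers import HunyuanVideoPipeline
from diffusers.utils import export_to_video

# Illustrative sketch only; checkpoint and settings are assumptions, not taken from this commit.
pipeline = HunyuanVideoPipeline.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo", torch_dtype=torch.float16
)
pipeline.enable_model_cpu_offload()

# compile the transformer; the first call pays the compilation cost, later calls are faster
pipeline.transformer = torch.compile(pipeline.transformer, mode="max-autotune", fullgraph=True)

prompt = "A cat walks on the grass, realistic style"
video = pipeline(prompt=prompt, num_frames=61, num_inference_steps=30).frames[0]
export_to_video(video, "output.mp4", fps=15)
```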

docs/source/en/api/pipelines/ltx_video.md

Lines changed: 125 additions & 103 deletions
@@ -12,125 +12,139 @@
# See the License for the specific language governing permissions and
# limitations under the License. -->

-# LTX Video
-
-<div class="flex flex-wrap space-x-1">
-  <img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
-  <img alt="MPS" src="https://img.shields.io/badge/MPS-000000?style=flat&logo=apple&logoColor=white%22">
+<div style="float: right;">
+  <div class="flex flex-wrap space-x-1">
+    <img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
+  </div>
</div>

-[LTX Video](https://huggingface.co/Lightricks/LTX-Video) is the first DiT-based video generation model capable of generating high-quality videos in real-time. It produces 24 FPS videos at a 768x512 resolution faster than they can be watched. Trained on a large-scale dataset of diverse videos, the model generates high-resolution videos with realistic and varied content. We provide a model for both text-to-video as well as image + text-to-video usecases.
-
-<Tip>
-
-Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.
+# LTX-Video

-</Tip>
+[LTX-Video](https://huggingface.co/Lightricks/LTX-Video) is a diffusion transformer designed for fast and real-time generation of high-resolution videos from text and images. The main feature of LTX-Video is the Video-VAE. The Video-VAE has a higher pixel to latent compression ratio (1:192) which enables more efficient video data processing and faster generation speed. To support and prevent the finer details from being lost during generation, the Video-VAE decoder performs the latent to pixel conversion *and* the last denoising step.

-Available models:
+You can find all the original LTX-Video checkpoints under the [Lightricks](https://huggingface.co/Lightricks) organization.

-| Model name | Recommended dtype |
-|:-------------:|:-----------------:|
-| [`LTX Video 0.9.0`](https://huggingface.co/Lightricks/LTX-Video/blob/main/ltx-video-2b-v0.9.safetensors) | `torch.bfloat16` |
-| [`LTX Video 0.9.1`](https://huggingface.co/Lightricks/LTX-Video/blob/main/ltx-video-2b-v0.9.1.safetensors) | `torch.bfloat16` |
-| [`LTX Video 0.9.5`](https://huggingface.co/Lightricks/LTX-Video/blob/main/ltx-video-2b-v0.9.5.safetensors) | `torch.bfloat16` |
+> [!TIP]
+> Click on the LTX-Video models in the right sidebar for more examples of how to use LTX-Video for other video generation tasks.

-Note: The recommended dtype is for the transformer component. The VAE and text encoders can be either `torch.float32`, `torch.bfloat16` or `torch.float16` but the recommended dtype is `torch.bfloat16` as used in the original repository.
+The example below demonstrates how to generate a video optimized for memory or inference speed.

-## Loading Single Files
+<hfoptions id="usage">
+<hfoption id="memory">

-Loading the original LTX Video checkpoints is also possible with [`~ModelMixin.from_single_file`]. We recommend using `from_single_file` for the Lightricks series of models, as they plan to release multiple models in the future in the single file format.
-
-```python
+```py
import torch
-from diffusers import AutoencoderKLLTXVideo, LTXImageToVideoPipeline, LTXVideoTransformer3DModel
+from diffusers import LTXPipeline, LTXVideoTransformer3DModel
+from diffusers.hooks import apply_group_offloading
+from diffusers.utils import export_to_video

-# `single_file_url` could also be https://huggingface.co/Lightricks/LTX-Video/ltx-video-2b-v0.9.1.safetensors
-single_file_url = "https://huggingface.co/Lightricks/LTX-Video/ltx-video-2b-v0.9.safetensors"
-transformer = LTXVideoTransformer3DModel.from_single_file(
-    single_file_url, torch_dtype=torch.bfloat16
+# fp8 layerwise weight-casting
+transformer = LTXVideoTransformer3DModel.from_pretrained(
+    "Lightricks/LTX-Video",
+    subfolder="transformer",
+    torch_dtype=torch.bfloat16
)
-vae = AutoencoderKLLTXVideo.from_single_file(single_file_url, torch_dtype=torch.bfloat16)
-pipe = LTXImageToVideoPipeline.from_pretrained(
-    "Lightricks/LTX-Video", transformer=transformer, vae=vae, torch_dtype=torch.bfloat16
+transformer.enable_layerwise_casting(
+    storage_dtype=torch.float8_e4m3fn,
+    compute_dtype=torch.bfloat16
)

-# ... inference code ...
-```
+pipeline = LTXPipeline.from_pretrained("Lightricks/LTX-Video", transformer=transformer, torch_dtype=torch.bfloat16)

-Alternatively, the pipeline can be used to load the weights with [`~FromSingleFileMixin.from_single_file`].
+# group-offloading
+onload_device = torch.device("cuda")
+offload_device = torch.device("cpu")
+pipeline.transformer.enable_group_offload(onload_device=onload_device, offload_device=offload_device, offload_type="leaf_level", use_stream=True)
+apply_group_offloading(pipeline.text_encoder, onload_device=onload_device, offload_type="block_level", num_blocks_per_group=2)
+apply_group_offloading(pipeline.vae, onload_device=onload_device, offload_type="leaf_level")

-```python
-import torch
-from diffusers import LTXImageToVideoPipeline
-from transformers import T5EncoderModel, T5Tokenizer
+prompt = "A woman with long brown hair and light skin smiles at another woman with long blonde hair. The woman with brown hair wears a black jacket and has a small, barely noticeable mole on her right cheek. The camera angle is a close-up, focused on the woman with brown hair's face. The lighting is warm and natural, likely from the setting sun, casting a soft glow on the scene. The scene appears to be real-life footage"
+negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"

-single_file_url = "https://huggingface.co/Lightricks/LTX-Video/ltx-video-2b-v0.9.safetensors"
-text_encoder = T5EncoderModel.from_pretrained(
-    "Lightricks/LTX-Video", subfolder="text_encoder", torch_dtype=torch.bfloat16
-)
-tokenizer = T5Tokenizer.from_pretrained(
-    "Lightricks/LTX-Video", subfolder="tokenizer", torch_dtype=torch.bfloat16
-)
-pipe = LTXImageToVideoPipeline.from_single_file(
-    single_file_url, text_encoder=text_encoder, tokenizer=tokenizer, torch_dtype=torch.bfloat16
-)
+video = pipeline(
+    prompt=prompt,
+    negative_prompt=negative_prompt,
+    width=768,
+    height=512,
+    num_frames=161,
+    decode_timestep=0.03,
+    decode_noise_scale=0.025,
+    num_inference_steps=50,
+).frames[0]
+export_to_video(video, "output.mp4", fps=24)
```

-Loading [LTX GGUF checkpoints](https://huggingface.co/city96/LTX-Video-gguf) are also supported:
+Reduce memory usage even more if necessary by quantizing a model to a lower precision data type.

```py
import torch
from diffusers.utils import export_to_video
-from diffusers import LTXPipeline, LTXVideoTransformer3DModel, GGUFQuantizationConfig
+from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig, LTXVideoTransformer3DModel, LTXPipeline
+from transformers import BitsAndBytesConfig as BitsAndBytesConfig, T5EncoderModel

-ckpt_path = (
-    "https://huggingface.co/city96/LTX-Video-gguf/blob/main/ltx-video-2b-v0.9-Q3_K_S.gguf"
+# quantize weights to int8 with bitsandbytes
+quantization_config = BitsAndBytesConfig(load_in_8bit=True)
+text_encoder = T5EncoderModel.from_pretrained(
+    "Lightricks/LTX-Video",
+    subfolder="text_encoder",
+    quantization_config=quantization_config,
+    torch_dtype=torch.bfloat16,
)
-transformer = LTXVideoTransformer3DModel.from_single_file(
-    ckpt_path,
-    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
+
+quantization_config = DiffusersBitsAndBytesConfig(load_in_8bit=True)
+transformer = LTXVideoTransformer3DModel.from_pretrained(
+    "Lightricks/LTX-Video",
+    subfolder="transformer",
+    quantization_config=quantization_config,
    torch_dtype=torch.bfloat16,
)
-pipe = LTXPipeline.from_pretrained(
+
+pipeline = LTXPipeline.from_pretrained(
    "Lightricks/LTX-Video",
+    text_encoder=text_encoder,
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)
-pipe.enable_model_cpu_offload()

prompt = "A woman with long brown hair and light skin smiles at another woman with long blonde hair. The woman with brown hair wears a black jacket and has a small, barely noticeable mole on her right cheek. The camera angle is a close-up, focused on the woman with brown hair's face. The lighting is warm and natural, likely from the setting sun, casting a soft glow on the scene. The scene appears to be real-life footage"
negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"
-
-video = pipe(
+video = pipeline(
    prompt=prompt,
    negative_prompt=negative_prompt,
-    width=704,
-    height=480,
+    width=768,
+    height=512,
    num_frames=161,
+    decode_timestep=0.03,
+    decode_noise_scale=0.025,
    num_inference_steps=50,
).frames[0]
-export_to_video(video, "output_gguf_ltx.mp4", fps=24)
+export_to_video(video, "output.mp4", fps=24)
```

-Make sure to read the [documentation on GGUF](../../quantization/gguf) to learn more about our GGUF support.
-
-<!-- TODO(aryan): Update this when official weights are supported -->
+</hfoption>
+<hfoption id="inference speed">

-Loading and running inference with [LTX Video 0.9.1](https://huggingface.co/Lightricks/LTX-Video/blob/main/ltx-video-2b-v0.9.1.safetensors) weights.
+Compilation is slow the first time but subsequent calls to the pipeline are faster.

-```python
+```py
import torch
from diffusers import LTXPipeline
from diffusers.utils import export_to_video

-pipe = LTXPipeline.from_pretrained("a-r-r-o-w/LTX-Video-0.9.1-diffusers", torch_dtype=torch.bfloat16)
-pipe.to("cuda")
+pipeline = LTXPipeline.from_pretrained(
+    "Lightricks/LTX-Video", torch_dtype=torch.bfloat16
+)
+
+# torch.compile
+pipeline.transformer.to(memory_format=torch.channels_last)
+pipeline.transformer = torch.compile(
+    pipeline.transformer, mode="max-autotune", fullgraph=True
+)

prompt = "A woman with long brown hair and light skin smiles at another woman with long blonde hair. The woman with brown hair wears a black jacket and has a small, barely noticeable mole on her right cheek. The camera angle is a close-up, focused on the woman with brown hair's face. The lighting is warm and natural, likely from the setting sun, casting a soft glow on the scene. The scene appears to be real-life footage"
negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"

-video = pipe(
+video = pipeline(
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=768,
@@ -143,48 +157,56 @@ video = pipe(
export_to_video(video, "output.mp4", fps=24)
```

-Refer to [this section](https://huggingface.co/docs/diffusers/main/en/api/pipelines/cogvideox#memory-optimization) to learn more about optimizing memory consumption.
+</hfoption>
+</hfoptions>

-## Quantization
+## Notes

-Quantization helps reduce the memory requirements of very large models by storing model weights in a lower precision data type. However, quantization may have varying impact on video quality depending on the video model.
+- LTX-Video supports LoRAs with [`~LTXVideoLoraLoaderMixin.load_lora_weights`].

-Refer to the [Quantization](../../quantization/overview) overview to learn more about supported quantization backends and selecting a quantization backend that supports your use case. The example below demonstrates how to load a quantized [`LTXPipeline`] for inference with bitsandbytes.
+  ```py
+  import torch
+  from diffusers import LTXConditionPipeline
+  from diffusers.utils import export_to_video

-```py
-import torch
-from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig, LTXVideoTransformer3DModel, LTXPipeline
-from diffusers.utils import export_to_video
-from transformers import BitsAndBytesConfig as BitsAndBytesConfig, T5EncoderModel
+  pipeline = LTXConditionPipeline.from_pretrained(
+      "Lightricks/LTX-Video-0.9.5", torch_dtype=torch.bfloat16
+  )

-quant_config = BitsAndBytesConfig(load_in_8bit=True)
-text_encoder_8bit = T5EncoderModel.from_pretrained(
-    "Lightricks/LTX-Video",
-    subfolder="text_encoder",
-    quantization_config=quant_config,
-    torch_dtype=torch.float16,
-)
+  pipeline.load_lora_weights("Lightricks/LTX-Video-Cakeify-LoRA", adapter_name="cakeify")
+  pipeline.set_adapters("cakeify", 0.9)

-quant_config = DiffusersBitsAndBytesConfig(load_in_8bit=True)
-transformer_8bit = LTXVideoTransformer3DModel.from_pretrained(
-    "Lightricks/LTX-Video",
-    subfolder="transformer",
-    quantization_config=quant_config,
-    torch_dtype=torch.float16,
-)
+  prompt = "CAKEIFY a person using a knife to cut a cake shaped like a pair of cowboy boots"

-pipeline = LTXPipeline.from_pretrained(
-    "Lightricks/LTX-Video",
-    text_encoder=text_encoder_8bit,
-    transformer=transformer_8bit,
-    torch_dtype=torch.float16,
-    device_map="balanced",
-)
+  video = pipeline(
+      prompt=prompt,
+      width=768,
+      height=512,
+      num_frames=161,
+      decode_timestep=0.03,
+      decode_noise_scale=0.025,
+      num_inference_steps=50,
+  ).frames[0]
+  export_to_video(video, "output.mp4", fps=24)
+  ```
+- LTX-Video supports loading from single files, such as [GGUF checkpoints](../../quantization/gguf), with [`FromOriginalModelMixin.from_single_file`] or [`FromSingleFileMixin.from_single_file`].

-prompt = "A detailed wooden toy ship with intricately carved masts and sails is seen gliding smoothly over a plush, blue carpet that mimics the waves of the sea. The ship's hull is painted a rich brown, with tiny windows. The carpet, soft and textured, provides a perfect backdrop, resembling an oceanic expanse. Surrounding the ship are various other toys and children's items, hinting at a playful environment. The scene captures the innocence and imagination of childhood, with the toy ship's journey symbolizing endless adventures in a whimsical, indoor setting."
-video = pipeline(prompt=prompt, num_frames=161, num_inference_steps=50).frames[0]
-export_to_video(video, "ship.mp4", fps=24)
-```
+  ```py
+  import torch
+  from diffusers.utils import export_to_video
+  from diffusers import LTXPipeline, LTXVideoTransformer3DModel, GGUFQuantizationConfig
+
+  transformer = LTXVideoTransformer3DModel.from_single_file(
+      "https://huggingface.co/city96/LTX-Video-gguf/blob/main/ltx-video-2b-v0.9-Q3_K_S.gguf",
+      quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
+      torch_dtype=torch.bfloat16
+  )
+  pipeline = LTXPipeline.from_pretrained(
+      "Lightricks/LTX-Video",
+      transformer=transformer,
+      torch_dtype=torch.bfloat16
+  )
+  ```


## LTXPipeline

docs/source/en/api/pipelines/wan.md

Lines changed: 6 additions & 4 deletions

@@ -12,12 +12,14 @@
# See the License for the specific language governing permissions and
# limitations under the License. -->

-# Wan
-
-<div class="flex flex-wrap space-x-1">
-  <img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
+<div style="float: right;">
+  <div class="flex flex-wrap space-x-1">
+    <img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
+  </div>
</div>

+# Wan
+
[Wan 2.1](https://github.com/Wan-Video/Wan2.1) by the Alibaba Wan Team.

<!-- TODO(aryan): update abstract once paper is out -->
