Commit ddab9b4 ("ltx")
1 parent 0d3f911 commit ddab9b4

1 file changed: docs/source/en/api/pipelines/ltx_video.md (+203 −2)
@@ -17,6 +17,7 @@
   <a href="https://huggingface.co/docs/diffusers/main/en/tutorials/using_peft_for_inference" target="_blank" rel="noopener">
     <img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
   </a>
+  <img alt="MPS" src="https://img.shields.io/badge/MPS-000000?style=flat&logo=apple&logoColor=white"/>
  </div>
 </div>

@@ -64,7 +65,10 @@ apply_group_offloading(pipeline.text_encoder, onload_device=onload_device, offlo
 apply_group_offloading(pipeline.vae, onload_device=onload_device, offload_type="leaf_level")

 prompt = """
-A woman with long brown hair and light skin smiles at another woman with long blonde hair. The woman with brown hair wears a black jacket and has a small, barely noticeable mole on her right cheek. The camera angle is a close-up, focused on the woman with brown hair's face. The lighting is warm and natural, likely from the setting sun, casting a soft glow on the scene. The scene appears to be real-life footage
+A woman with long brown hair and light skin smiles at another woman with long blonde hair.
+The woman with brown hair wears a black jacket and has a small, barely noticeable mole on her right cheek.
+The camera angle is a close-up, focused on the woman with brown hair's face. The lighting is warm and
+natural, likely from the setting sun, casting a soft glow on the scene. The scene appears to be real-life footage
 """
 negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"

@@ -102,7 +106,10 @@ pipeline.transformer = torch.compile(
 )

 prompt = """
-A woman with long brown hair and light skin smiles at another woman with long blonde hair. The woman with brown hair wears a black jacket and has a small, barely noticeable mole on her right cheek. The camera angle is a close-up, focused on the woman with brown hair's face. The lighting is warm and natural, likely from the setting sun, casting a soft glow on the scene. The scene appears to be real-life footage
+A woman with long brown hair and light skin smiles at another woman with long blonde hair.
+The woman with brown hair wears a black jacket and has a small, barely noticeable mole on her right cheek.
+The camera angle is a close-up, focused on the woman with brown hair's face. The lighting is warm and
+natural, likely from the setting sun, casting a soft glow on the scene. The scene appears to be real-life footage
 """
 negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"

@@ -124,6 +131,200 @@ export_to_video(video, "output.mp4", fps=24)

## Notes

- Refer to the following recommended settings for generation from the [LTX-Video](https://github.com/Lightricks/LTX-Video) repository.

  - The recommended dtype for the transformer, VAE, and text encoder is `torch.bfloat16`. The VAE and text encoder can also be `torch.float32` or `torch.float16`.
  - For guidance-distilled variants of LTX-Video, set `guidance_scale` to `1.0`. For any other model, set `guidance_scale` higher, such as `5.0`, for good generation quality.
  - For timestep-aware VAE variants (LTX-Video 0.9.1 and above), set `decode_timestep` to `0.05` and `image_cond_noise_scale` to `0.025`.
  - For variants that support interpolation between multiple conditioning images and videos (LTX-Video 0.9.5 and above), use similar images and videos for the best results. Divergence from the conditioning inputs may lead to abrupt transitions in the generated video.
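The recommended values above can be kept in one place and reused across pipeline calls. A minimal sketch; the function name and dict layout are illustrative, not a diffusers API:

```python
# Recommended LTX-Video settings from the notes above, keyed by whether the
# checkpoint is guidance-distilled. Names here are illustrative only.
def recommended_settings(guidance_distilled: bool = False) -> dict:
    return {
        "torch_dtype": "bfloat16",        # transformer, VAE, and text encoder
        "guidance_scale": 1.0 if guidance_distilled else 5.0,
        "decode_timestep": 0.05,          # timestep-aware VAE, 0.9.1 and above
        "image_cond_noise_scale": 0.025,  # timestep-aware VAE, 0.9.1 and above
    }

settings = recommended_settings(guidance_distilled=False)
print(settings["guidance_scale"])  # 5.0
```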
- LTX-Video 0.9.7 includes a spatial latent upscaler and a 13B parameter transformer. During inference, a low-resolution video is quickly generated first and then upscaled and refined.

  <details>
  <summary>Show example code</summary>

  ```py
  import torch
  from diffusers import LTXConditionPipeline, LTXLatentUpsamplePipeline
  from diffusers.pipelines.ltx.pipeline_ltx_condition import LTXVideoCondition
  from diffusers.utils import export_to_video, load_video

  pipeline = LTXConditionPipeline.from_pretrained("Lightricks/LTX-Video-0.9.7-dev", torch_dtype=torch.bfloat16)
  pipe_upsample = LTXLatentUpsamplePipeline.from_pretrained("Lightricks/ltxv-spatial-upscaler-0.9.7", vae=pipeline.vae, torch_dtype=torch.bfloat16)
  pipeline.to("cuda")
  pipe_upsample.to("cuda")
  pipeline.vae.enable_tiling()

  def round_to_nearest_resolution_acceptable_by_vae(height, width):
      height = height - (height % pipeline.vae_spatial_compression_ratio)
      width = width - (width % pipeline.vae_spatial_compression_ratio)
      return height, width

  video = load_video(
      "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cosmos/cosmos-video2world-input-vid.mp4"
  )[:21]  # only use the first 21 frames as conditioning
  condition1 = LTXVideoCondition(video=video, frame_index=0)

  prompt = """
  The video depicts a winding mountain road covered in snow, with a single vehicle
  traveling along it. The road is flanked by steep, rocky cliffs and sparse vegetation.
  The landscape is characterized by rugged terrain and a river visible in the distance.
  The scene captures the solitude and beauty of a winter drive through a mountainous region.
  """
  negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"
  expected_height, expected_width = 768, 1152
  downscale_factor = 2 / 3
  num_frames = 161

  # 1. Generate video at smaller resolution
  # Text-only conditioning is also supported without the need to pass `conditions`
  downscaled_height, downscaled_width = int(expected_height * downscale_factor), int(expected_width * downscale_factor)
  downscaled_height, downscaled_width = round_to_nearest_resolution_acceptable_by_vae(downscaled_height, downscaled_width)
  latents = pipeline(
      conditions=[condition1],
      prompt=prompt,
      negative_prompt=negative_prompt,
      width=downscaled_width,
      height=downscaled_height,
      num_frames=num_frames,
      num_inference_steps=30,
      decode_timestep=0.05,
      decode_noise_scale=0.025,
      image_cond_noise_scale=0.0,
      guidance_scale=5.0,
      guidance_rescale=0.7,
      generator=torch.Generator().manual_seed(0),
      output_type="latent",
  ).frames

  # 2. Upscale generated video using latent upsampler with fewer inference steps
  # The available latent upsampler upscales the height/width by 2x
  upscaled_height, upscaled_width = downscaled_height * 2, downscaled_width * 2
  upscaled_latents = pipe_upsample(
      latents=latents,
      output_type="latent"
  ).frames

  # 3. Denoise the upscaled video with few steps to improve texture (optional, but recommended)
  video = pipeline(
      conditions=[condition1],
      prompt=prompt,
      negative_prompt=negative_prompt,
      width=upscaled_width,
      height=upscaled_height,
      num_frames=num_frames,
      denoise_strength=0.4,  # Effectively, 4 inference steps out of 10
      num_inference_steps=10,
      latents=upscaled_latents,
      decode_timestep=0.05,
      decode_noise_scale=0.025,
      image_cond_noise_scale=0.0,
      guidance_scale=5.0,
      guidance_rescale=0.7,
      generator=torch.Generator().manual_seed(0),
      output_type="pil",
  ).frames[0]

  # 4. Downscale the video to the expected resolution
  video = [frame.resize((expected_width, expected_height)) for frame in video]

  export_to_video(video, "output.mp4", fps=24)
  ```

  </details>
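The resolution arithmetic in the example above can be checked on its own. A sketch that hardcodes a spatial compression ratio of 32 (an assumption; the example reads the real value from the pipeline):

```python
# Mirror of round_to_nearest_resolution_acceptable_by_vae from the example,
# with the VAE spatial compression ratio hardcoded to 32 (assumed value).
RATIO = 32

def round_resolution(height, width, ratio=RATIO):
    # Round each dimension down to the nearest multiple of the ratio.
    return height - (height % ratio), width - (width % ratio)

# The example targets 768x1152 and first generates at 2/3 scale.
down_h, down_w = round_resolution(int(768 * 2 / 3), int(1152 * 2 / 3))
print(down_h, down_w)  # 512 768, both already multiples of 32
```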
- The LTX-Video 0.9.7 distilled model is guidance- and timestep-distilled to speed up generation. It requires `guidance_scale` to be set to `1.0`, and `num_inference_steps` should be between `4` and `10` for good generation quality. You should also use the following custom timesteps for the best results.

  - Base model inference to prepare for upscaling: `[1000, 993, 987, 981, 975, 909, 725, 0.03]`.
  - Upscaling: `[1000, 909, 725, 421, 0]`.

  <details>
  <summary>Show example code</summary>

  ```py
  import torch
  from diffusers import LTXConditionPipeline, LTXLatentUpsamplePipeline
  from diffusers.pipelines.ltx.pipeline_ltx_condition import LTXVideoCondition
  from diffusers.utils import export_to_video, load_video

  pipeline = LTXConditionPipeline.from_pretrained("Lightricks/LTX-Video-0.9.7-distilled", torch_dtype=torch.bfloat16)
  pipe_upsample = LTXLatentUpsamplePipeline.from_pretrained("Lightricks/ltxv-spatial-upscaler-0.9.7", vae=pipeline.vae, torch_dtype=torch.bfloat16)
  pipeline.to("cuda")
  pipe_upsample.to("cuda")
  pipeline.vae.enable_tiling()

  def round_to_nearest_resolution_acceptable_by_vae(height, width):
      height = height - (height % pipeline.vae_spatial_compression_ratio)
      width = width - (width % pipeline.vae_spatial_compression_ratio)
      return height, width

  prompt = """
  artistic anatomical 3d render, ultra quality, human half full male body with transparent
  skin revealing structure instead of organs, muscular, intricate creative patterns,
  monochromatic with backlighting, lightning mesh, scientific concept art, blending biology
  with botany, surreal and ethereal quality, unreal engine 5, ray tracing, ultra realistic,
  16K UHD, rich details. camera zooms out in a rotating fashion
  """
  negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"
  expected_height, expected_width = 768, 1152
  downscale_factor = 2 / 3
  num_frames = 161

  # 1. Generate video at smaller resolution
  downscaled_height, downscaled_width = int(expected_height * downscale_factor), int(expected_width * downscale_factor)
  downscaled_height, downscaled_width = round_to_nearest_resolution_acceptable_by_vae(downscaled_height, downscaled_width)
  latents = pipeline(
      prompt=prompt,
      negative_prompt=negative_prompt,
      width=downscaled_width,
      height=downscaled_height,
      num_frames=num_frames,
      timesteps=[1000, 993, 987, 981, 975, 909, 725, 0.03],
      decode_timestep=0.05,
      decode_noise_scale=0.025,
      image_cond_noise_scale=0.0,
      guidance_scale=1.0,
      guidance_rescale=0.7,
      generator=torch.Generator().manual_seed(0),
      output_type="latent",
  ).frames

  # 2. Upscale generated video using latent upsampler with fewer inference steps
  # The available latent upsampler upscales the height/width by 2x
  upscaled_height, upscaled_width = downscaled_height * 2, downscaled_width * 2
  upscaled_latents = pipe_upsample(
      latents=latents,
      adain_factor=1.0,
      output_type="latent"
  ).frames

  # 3. Denoise the upscaled video with few steps to improve texture (optional, but recommended)
  video = pipeline(
      prompt=prompt,
      negative_prompt=negative_prompt,
      width=upscaled_width,
      height=upscaled_height,
      num_frames=num_frames,
      denoise_strength=0.999,  # Effectively, 4 inference steps out of 5
      timesteps=[1000, 909, 725, 421, 0],
      latents=upscaled_latents,
      decode_timestep=0.05,
      decode_noise_scale=0.025,
      image_cond_noise_scale=0.0,
      guidance_scale=1.0,
      guidance_rescale=0.7,
      generator=torch.Generator().manual_seed(0),
      output_type="pil",
  ).frames[0]

  # 4. Downscale the video to the expected resolution
  video = [frame.resize((expected_width, expected_height)) for frame in video]

  export_to_video(video, "output.mp4", fps=24)
  ```

  </details>
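The `denoise_strength` comments in the two examples above ("4 inference steps out of 10" for the full model, "4 inference steps out of 5" for the distilled one) follow the same arithmetic. A sketch of that relationship; the floor is an assumption about the rounding, not diffusers code:

```python
import math

# Approximate number of denoising steps actually run when refining existing
# latents with a partial-strength pass (assumed rounding, for intuition only).
def effective_steps(denoise_strength, num_inference_steps):
    return math.floor(denoise_strength * num_inference_steps)

print(effective_steps(0.4, 10))   # 4 of 10, as in the full model example
print(effective_steps(0.999, 5))  # 4 of 5, as in the distilled example
```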
- LTX-Video supports LoRAs with [`~loaders.LTXVideoLoraLoaderMixin.load_lora_weights`].

  <details>
