Hunyuan I2V #10983

Merged
a-r-r-o-w merged 11 commits into main from integrations/hunyuan-i2v
Mar 7, 2025

Member

@a-r-r-o-w a-r-r-o-w commented Mar 6, 2025

Thanks to the Tencent Hunyuan team for the amazing release!

Checkpoint: https://huggingface.co/hunyuanvideo-community/HunyuanVideo-I2V

Example:

import torch
from diffusers import HunyuanVideoImageToVideoPipeline, HunyuanVideoTransformer3DModel
from diffusers.utils import load_image, export_to_video

model_id = "hunyuanvideo-community/HunyuanVideo-I2V"
transformer = HunyuanVideoTransformer3DModel.from_pretrained(
    model_id, subfolder="transformer", torch_dtype=torch.bfloat16
)
pipe = HunyuanVideoImageToVideoPipeline.from_pretrained(
    model_id, transformer=transformer, torch_dtype=torch.float16
)
pipe.vae.enable_tiling()
pipe.to("cuda")

prompt = "A man with short gray hair plays a red electric guitar."
image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/guitar-man.png"
)

output = pipe(image=image, prompt=prompt).frames[0]
export_to_video(output, "output.mp4", fps=15)

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@a-r-r-o-w a-r-r-o-w requested a review from yiyixuxu March 6, 2025 21:58
@Kaisa-Supergene

@a-r-r-o-w Hi, I'm Kaisa Lim; I use and study image AI with diffusers.
While testing this PR, I got an error when calling text_encoder in the _get_llama_prompt_embeds function: the number of tokens in image_embeds and the image_emb_len value in DEFAULT_PROMPT_TEMPLATE are different. Has anyone experienced a similar issue?

Collaborator

@yiyixuxu yiyixuxu left a comment

thanks!

self.vae_scale_factor_spatial = self.vae.spatial_compression_ratio if getattr(self, "vae", None) else 8
self.video_processor = VideoProcessor(vae_scale_factor=self.vae_scale_factor_spatial)

def _get_llama_prompt_embeds(
Collaborator

it's not copied from the other pipeline?

Member Author

It has extra logic to deal with image embeddings

@a-r-r-o-w
Member Author

@Kaisa-Supergene I'll look into that ASAP. I believe these values come from the official code, so for the integration we're going to use them anyway (even if they're incorrect). We can update on our end if they do turn out to be different.

https://github.com/Tencent/HunyuanVideo-I2V/blob/f1aa9a499fd06b418966bdcc7235c156c2d567d0/hyvideo/constants.py#L97

@a-r-r-o-w
Member Author

Failing tests are unrelated

@a-r-r-o-w a-r-r-o-w merged commit 2e5203b into main Mar 7, 2025
14 of 15 checks passed
@a-r-r-o-w a-r-r-o-w deleted the integrations/hunyuan-i2v branch March 7, 2025 07:22
@chengzeyi
Contributor

  File "/home/zeyi/repos/diffusers/src/diffusers/pipelines/hunyuan_video/pipeline_hunyuan_video_image2video.py", line 751, in __call__
    prompt_embeds, pooled_prompt_embeds, prompt_attention_mask = self.encode_prompt(
  File "/home/zeyi/repos/diffusers/src/diffusers/pipelines/hunyuan_video/pipeline_hunyuan_video_image2video.py", line 404, in encode_prompt
    prompt_embeds, prompt_attention_mask = self._get_llama_prompt_embeds(
  File "/home/zeyi/repos/diffusers/src/diffusers/pipelines/hunyuan_video/pipeline_hunyuan_video_image2video.py", line 277, in _get_llama_prompt_embeds
    prompt_embeds = self.text_encoder(
  File "/home/zeyi/pyvenv/default/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/zeyi/pyvenv/default/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/zeyi/pyvenv/default/lib/python3.10/site-packages/transformers/utils/deprecation.py", line 172, in wrapped_func
    return func(*args, **kwargs)
  File "/home/zeyi/pyvenv/default/lib/python3.10/site-packages/transformers/models/llava/modeling_llava.py", line 427, in forward
    raise ValueError(
ValueError: Image features and image tokens do not match: tokens: 1, features 576

I got this🧐
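
For readers following the traceback: the message reports a single image placeholder token in the prompt against 576 image feature vectors. A minimal sketch of that kind of consistency check, and of expanding one placeholder so the counts agree, might look like the following. The function names and the token id 99 are hypothetical and chosen only for illustration; this is not the code used in transformers or diffusers.

```python
from typing import List

IMAGE_TOKEN_ID = 99  # hypothetical placeholder id, for illustration only


def count_image_tokens(input_ids: List[int], image_token_id: int = IMAGE_TOKEN_ID) -> int:
    """Count how many positions in the prompt hold the image placeholder."""
    return sum(1 for token in input_ids if token == image_token_id)


def check_match(
    input_ids: List[int], num_image_features: int, image_token_id: int = IMAGE_TOKEN_ID
) -> None:
    """Raise if the placeholder count differs from the number of image features."""
    num_tokens = count_image_tokens(input_ids, image_token_id)
    if num_tokens != num_image_features:
        raise ValueError(
            "Image features and image tokens do not match: "
            f"tokens: {num_tokens}, features {num_image_features}"
        )


def expand_image_tokens(
    input_ids: List[int], num_image_features: int, image_token_id: int = IMAGE_TOKEN_ID
) -> List[int]:
    """Replace every placeholder with `num_image_features` copies of itself.

    With exactly one placeholder in the prompt, the expanded sequence then
    carries one placeholder position per image feature vector.
    """
    expanded: List[int] = []
    for token in input_ids:
        if token == image_token_id:
            expanded.extend([image_token_id] * num_image_features)
        else:
            expanded.append(token)
    return expanded


# A prompt with one placeholder, as in the reported error (tokens: 1).
prompt_ids = [1, IMAGE_TOKEN_ID, 5, 6, 2]
expanded_ids = expand_image_tokens(prompt_ids, 576)
check_match(expanded_ids, 576)  # passes: 576 placeholders for 576 features
```

Whether the expansion belongs in the processor, the pipeline, or the model depends on the transformers version in use, which this sketch does not attempt to settle.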

@Kaisa-Supergene

I got the same issue too.
I searched the HunyuanVideo GitHub repository for similar issues and found a solution.
If you use the HunyuanVideo model with diffusers, check your transformers version.
Versions of transformers greater than 4.47.1 raise this error.
Try transformers==4.47.1

@chengzeyi
Contributor

This version gives a different error🤣

@Kaisa-Supergene

That is bad news, lol.
I don't know how to solve this, but this issue may help:
Tencent-Hunyuan/HunyuanVideo-I2V#7

@a-r-r-o-w
Member Author

a-r-r-o-w commented Mar 7, 2025

I was on the v4.48.0-dev branch of transformers during the integration. Here's my environment, where it does not error out:

- 🤗 Diffusers version: 0.33.0.dev0
- Platform: Linux-5.4.0-166-generic-x86_64-with-glibc2.31
- Running on Google Colab?: No
- Python version: 3.10.14
- PyTorch version (GPU?): 2.5.1+cu124 (True)
- Flax version (CPU?/GPU?/TPU?): 0.8.5 (cpu)
- Jax version: 0.4.31
- JaxLib version: 0.4.31
- Huggingface_hub version: 0.28.1
- Transformers version: 4.48.0.dev0
- Accelerate version: 1.1.0.dev0
- PEFT version: 0.14.1.dev0
- Bitsandbytes version: 0.43.3
- Safetensors version: 0.4.5
- xFormers version: not installed
- Accelerator: NVIDIA A100-SXM4-80GB, 81920 MiB
NVIDIA A100-SXM4-80GB, 81920 MiB
NVIDIA A100-SXM4-80GB, 81920 MiB
NVIDIA DGX Display, 4096 MiB
NVIDIA A100-SXM4-80GB, 81920 MiB

I think we might have to version guard Hunyuan-I2V if it is causing problems
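
As a rough illustration of what a version guard could look like, here is a minimal standard-library sketch. The helper names are hypothetical, and the bound passed in is an assumption: the thread contains conflicting reports (4.47.1 suggested by one user, 4.48.0.dev0 working for the author), so the right bound is not settled here. Pre-release suffixes such as `.dev0` are ignored, so `4.48.0.dev0` is treated as `4.48.0`.

```python
import re
from importlib import metadata
from typing import Optional, Tuple


def release_tuple(version: str, width: int = 3) -> Tuple[int, ...]:
    """Parse the leading numeric part of a version, e.g. '4.48.0.dev0' -> (4, 48, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    parts = [int(part) for part in match.group(0).split(".")]
    parts += [0] * (width - len(parts))  # pad '4.47' to (4, 47, 0)
    return tuple(parts)


def is_supported(installed: str, max_supported: str) -> bool:
    """True if the installed release is not newer than the allowed maximum."""
    return release_tuple(installed) <= release_tuple(max_supported)


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def require_supported(package: str, max_supported: str) -> None:
    """Raise ImportError if `package` is missing or newer than `max_supported`."""
    installed = installed_version(package)
    if installed is None:
        raise ImportError(f"{package} is not installed")
    if not is_supported(installed, max_supported):
        raise ImportError(
            f"{package}=={installed} is newer than the supported maximum {max_supported}"
        )


# Hypothetical usage; the bound below is an assumption, not a confirmed limit:
# require_supported("transformers", "4.47.1")
```

A guard like this fails early with a readable message instead of surfacing later as a shape or token mismatch inside the text encoder.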

@ychenZHANG

Nice work!!

It looks like the inference scripts and model checkpoint are from Tencent's March 6 release. They released another version on March 7 to fix the ID consistency bug, with 16 input channels to the transformer instead of 33. Any plans to adapt that as well?

Thank you!
