Feature request for transformers use-cases #673

Closed
@zucchini-nlp

Description

🚀 The feature

Hi 👋

First of all, huge thanks to you and the team: the latest torchcodec release with audio support is fantastic! It's a long-awaited feature.

I'm the maintainer of multimodal models in transformers, and I'm planning to use torchcodec to load multimodal data for MLLMs. Looking forward to a stable version being released. For now, I've been testing the latest release and noticed a few points that might be useful to consider for future support.

  1. Mono-channel audio support: Some audio models (like Whisper from Hugging Face) only accept mono-channel input. It would be helpful if audio loading allowed channel selection or offered optional stereo-to-mono conversion.

  2. Fallback for video files with no audio: When loading audio from a video file that has no audio stream, an error is currently raised. A more flexible behavior would be to return None, similar to how moviepy handles it, where the result can be checked with `if clip.audio is not None`.

  3. Loading from URL: Loading audio/video from URLs seems to work for some URLs I have tested, though I couldn't find in the docs whether URL input is officially supported. I hope it will be officially supported in the stable release.

  4. Video decoder issues with the AVI format: When loading AVI files, the decoder fails to infer duration and related metadata, which prevents sampling frames by timestamp. Loading the same video saved as MP4 resolves the issue. You can try this video as an example.
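For point 1, the conversion itself is simple: decoded audio is typically laid out as one sample sequence per channel, and a mono downmix is the per-sample mean across channels. A minimal, dependency-free sketch (plain Python lists standing in for the decoder's channel-major sample tensor, which is an assumption about the layout):

```python
def downmix_to_mono(channels):
    """Average per-channel sample lists (e.g. [left, right]) into one mono track."""
    num_channels = len(channels)
    # zip(*channels) walks the channels sample-by-sample; averaging each
    # tuple gives the mono signal.
    return [sum(frame) / num_channels for frame in zip(*channels)]

stereo = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]  # two channels, three samples
mono = downmix_to_mono(stereo)  # [0.5, 0.5, 0.5]
```

With a real decoder output tensor of shape (num_channels, num_samples), the same idea is a mean over the channel dimension.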

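For point 2, until such a fallback exists in the library, a user-side wrapper can emulate the moviepy-style `is not None` check. This is only a sketch: the exception type raised for a missing audio stream is an assumption, and `fail` is a stand-in for a decoder constructor hitting a silent video.

```python
def open_audio_or_none(open_decoder, source):
    """Return a decoder for `source`, or None if it has no audio stream."""
    try:
        return open_decoder(source)
    except RuntimeError:  # assumed error type when no audio stream exists
        return None

def fail(_source):
    # Stand-in for a decoder constructor on a video with no audio track.
    raise RuntimeError("no audio stream")

audio = open_audio_or_none(fail, "video_without_audio.mp4")
if audio is not None:
    ...  # feed samples to the model
```

Having the library return None (or expose a cheap has-audio check in metadata) would make this wrapper unnecessary.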
Let me know if you'd like me to file any of these separately or provide reproducible examples. Thanks again for the awesome work!

Motivation, pitch

No response
