mtmd : add ultravox audio input #13623

Conversation
Ok, somehow it works magically; the code is still nowhere near finished. Tested using the first 6 seconds from https://www.youtube.com/watch?v=vP4iY1TtS3s

Next step is to allow more than 30s of input.
```cpp
if (has_audio) {
    LOG_WRN("%s: audio input is in experimental stage and may have reduced quality:\n"
            " https://github.com/ggml-org/llama.cpp/pull/13623\n", __func__);
}
```
The model hallucinates on audio longer than 1 minute and I'm still not sure why (haven't yet had time to try the same audio on transformers).
But I think for now putting a small notice here is enough; this is kinda experimental support for now, and hopefully we will get gemma 3n supported soon.
convert_hf_to_gguf.py
Outdated
```python
self.hparams["image_size"] = self.hparams["num_mel_bins"]
self.hparams["patch_size"] = self.hparams["num_mel_bins"]
```
Are the `image_size` and `patch_size` used in the audio encoder?
They are unused, but I left them here from my first draft version so that the warmup works. But yeah, I should remove this.
tools/mtmd/mtmd.h
Outdated
```cpp
#define MTMD_DEFAULT_MEDIA_MARKER "<__media__>"

// deprecated marker, use MTMD_DEFAULT_MEDIA_MARKER instead
```
We have such constants in `llama.h` and `ggml.h`, but we eventually have to start moving those behind API calls. It's more future-proof.
Good idea! I added it in 107790a
The preprocessor will convert input PCM to a mel spectrogram with dimensions n_frames * n_mel, so it can be considered a grayscale (1-channel) image with W=n_frames and H=n_mel.
This is a neat idea. Do you think it would be compatible with other audio models or is this a lucky coincidence for this architecture? I guess the question is if all audio encoders work with 2D spectrograms.
I have seen so far just 2 types of model:
So overall, I think this system should work well for most audio models.

I'll resolve the 2 comments a bit later today, and will merge it after that. Thanks for reviewing this!
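To make the spectrogram-as-image mapping discussed above concrete, here is a self-contained sketch of just the shape bookkeeping. The framing parameters are illustrative Whisper-style values, and the per-frame log-energy is a placeholder standing in for the real STFT plus hard-coded mel filter bank in `mtmd-audio.cpp`:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// A 1-channel "image": W = n_frames, H = n_mel, row-major.
struct mel_image {
    int w;
    int h;
    std::vector<float> data;
};

// Illustrative only: the real preprocessor does an STFT and applies mel filters.
mel_image pcm_to_mel_stub(const std::vector<float> & pcm,
                          int n_mel = 80, int n_fft = 400, int hop = 160) {
    const int n_frames = 1 + ((int) pcm.size() - n_fft) / hop;
    mel_image img { n_frames, n_mel, std::vector<float>((size_t) n_mel * n_frames) };
    for (int f = 0; f < n_frames; f++) {
        float energy = 0.0f;
        for (int i = 0; i < n_fft; i++) {
            const float s = pcm[(size_t) f * hop + i];
            energy += s * s;
        }
        // placeholder: one value per frame, broadcast across all mel bins
        const float v = std::log10(std::max(energy, 1e-10f));
        for (int m = 0; m < n_mel; m++) {
            img.data[(size_t) m * n_frames + f] = v;
        }
    }
    return img;
}
```

The point is only the layout: once audio is in this W-by-H grid, the existing image pipeline can consume it unchanged.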
truly AI expert, ......genius programmer, another gg!
Supersedes #12745
Important

- Support for `llama-server` will be added in a separate PR
- For ultravox, it does not work very well with audio longer than 1 minute; not sure why
How it works
This PR targets specifically the ultravox model, which is essentially a fine-tuned Whisper encoder plus a custom projector.
Most of the preprocessing code is copied from whisper.cpp. The preprocessor converts the input PCM to a mel spectrogram with dimensions n_frames * n_mel, so it can be considered a grayscale (1-channel) image with W=n_frames and H=n_mel.
The preprocessing code is inside `mtmd-audio.cpp`; the mel filter values are hard-coded for convenience.

Demo CLI
Supported formats: mp3, wav, flac
Example output:
New API
The API now accepts PCM F32 as input via `mtmd_bitmap_init_from_audio()`. Optionally, you can check whether a given bitmap is audio by using `mtmd_bitmap_is_audio()`.

The helper `mtmd_helper_bitmap_init_from_buf/file` is extended to load input file data into the correct `mtmd_bitmap` type (decided by the magic bytes of the file), so it will just work out-of-the-box without any changes in application code.

`mtmd_input_chunk` now has a new type called `MTMD_INPUT_CHUNK_TYPE_AUDIO`.

You can get the number of audio/image tokens that a chunk takes via the newly added `mtmd_input_chunk_get_n_tokens` API.

The rest of the process (encode/decode) is the same as before, so very few changes are needed in downstream applications.
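The magic-bytes dispatch described above can be sketched as follows. This is not the actual helper code — just a self-contained illustration of deciding between audio and image input from the first bytes of a file:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

enum class media_kind { image, audio, unknown };

// Inspect the leading bytes of a file buffer to pick the media type,
// mirroring the idea behind the extended mtmd_helper_bitmap_init_from_buf.
media_kind sniff_media(const uint8_t * buf, size_t len) {
    if (len >= 12 && std::memcmp(buf, "RIFF", 4) == 0 && std::memcmp(buf + 8, "WAVE", 4) == 0) {
        return media_kind::audio; // WAV
    }
    if (len >= 4 && std::memcmp(buf, "fLaC", 4) == 0) {
        return media_kind::audio; // FLAC
    }
    if (len >= 2 && buf[0] == 0xFF && (buf[1] & 0xE0) == 0xE0) {
        return media_kind::audio; // bare MP3 frame sync (ID3-tagged files need extra checks)
    }
    if (len >= 8 && std::memcmp(buf, "\x89PNG\r\n\x1a\n", 8) == 0) {
        return media_kind::image; // PNG
    }
    if (len >= 3 && buf[0] == 0xFF && buf[1] == 0xD8 && buf[2] == 0xFF) {
        return media_kind::image; // JPEG
    }
    return media_kind::unknown;
}
```

Because the decision is made from the data itself, applications pass any supported file through the same helper without knowing its type up front.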
For complete changes, see `tools/mtmd/mtmd-cli.cpp`: https://github.com/ggml-org/llama.cpp/pull/13623/files#diff-4bfe825a05fa2d2598cc93f39aaa081605d2fd82823bd5d15e7dab72acd85e7c

Deprecated API
The image marker `<__image__>` will continue to work, but it is deprecated now that a new marker `<__media__>` has been added. The new marker is defined in `MTMD_DEFAULT_MEDIA_MARKER`.
These 3 APIs will be deprecated (but will continue to function, with NO breaking change):

- `mtmd_image_tokens_get_n_tokens`
- `mtmd_image_tokens_get_id`
- `mtmd_image_tokens_get_n_pos`

They simply change their prefix to `mtmd_input_chunk_`:

- `mtmd_input_chunk_get_n_tokens`
- `mtmd_input_chunk_get_id`
- `mtmd_input_chunk_get_n_pos`
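One common way such a no-breaking-change rename is done is to keep the old name as a thin forwarder to the new one. The sketch below uses illustrative stand-in types and signatures — the real declarations in `mtmd.h` differ — purely to show the pattern:

```cpp
#include <cstddef>

// Stand-in type for illustration only; not the real mtmd.h definition.
struct mtmd_input_chunk {
    size_t n_tokens;
    size_t n_pos;
};

// new-style API
size_t mtmd_input_chunk_get_n_tokens(const mtmd_input_chunk * chunk) {
    return chunk->n_tokens;
}

// deprecated name kept as a thin forwarder, so existing callers keep working
size_t mtmd_image_tokens_get_n_tokens(const mtmd_input_chunk * chunk) {
    return mtmd_input_chunk_get_n_tokens(chunk);
}
```

With this pattern the old symbol stays callable while all logic lives behind the new prefix.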
TODO in next PRs:

- Move `miniaudio.h` and `stb_image.h` to `mtmd_helper`
- Remove `mtmd_image_tokens_get_n_tokens` / `n_pos` / `id`