
mtmd : add ultravox audio input #13623


Merged: 33 commits into ggml-org:master on May 22, 2025

Conversation

ngxson
Collaborator

@ngxson ngxson commented May 18, 2025

Supersede #12745

Important

Support for llama-server will be added in a separate PR.

For ultravox, it does not work very well with audio longer than 1 minute; not sure why.


How it works

This PR specifically targets the ultravox model, which is essentially a fine-tuned Whisper encoder plus a custom projector.

Most of the preprocessing code is copied from whisper.cpp. The preprocessor converts the input PCM into a mel spectrogram of dimension n_frames * n_mel, which can be treated as a grayscale (1-channel) image with W=n_frames and H=n_mel.

The preprocessing code is inside mtmd-audio.cpp; the mel filter values are hard-coded for convenience.
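The frame/mel geometry can be sketched as below. The constants (16 kHz sample rate, 160-sample hop, 80 mel bins) are the usual Whisper preprocessing values and are assumptions here, not read from mtmd-audio.cpp:

```cpp
#include <cstdint>

// The mel spectrogram is treated like a 1-channel gray-scale image.
struct mel_image {
    int64_t w; // W = n_frames
    int64_t h; // H = n_mel
};

// Compute the "image" dimensions for a mono PCM buffer.
// hop = 160 samples at 16 kHz means one frame per 10 ms.
mel_image mel_dims(int64_t n_samples, int64_t hop = 160, int64_t n_mel = 80) {
    return { n_samples / hop, n_mel };
}
```

For example, 30 s of 16 kHz mono PCM is 480000 samples, which maps to a 3000 x 80 "image" before the encoder's own downsampling.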

Demo CLI

Supported formats: mp3, wav, flac

# use pre-quantized model
llama-mtmd-cli -hf ggml-org/ultravox-v0_5-llama-3_2-1b-GGUF
llama-mtmd-cli -hf ggml-org/ultravox-v0_5-llama-3_1-8b-GGUF

# use local Llama 3.2 1B model (original model from Meta, not fine-tuned) with ultravox projector
llama-mtmd-cli -m llama3_2-1b.gguf --mmproj mmproj-ultravox-v0_5-llama-3_2-1b-f16.gguf

# run one-shot, no chat
llama-mtmd-cli -hf ggml-org/ultravox-v0_5-llama-3_2-1b-GGUF --audio ./my_audio.mp3 -p "Transcribe this audio"

Example output:

 Running in chat mode, available commands:
   /audio <path>    load an audio
   /clear           clear the chat history
   /quit or /exit   exit the program

> /audio ../models/i-have-a-dream-30s.mp3
../models/i-have-a-dream-30s.mp3 audio loaded

> what is this
encoding audio slice...
audio slice encoded in 894 ms
decoding audio batch 1/1, n_tokens_batch = 187
audio decoded (batch 1/1) in 57 ms
encoding audio slice...
audio slice encoded in 885 ms
decoding audio batch 1/1, n_tokens_batch = 187
audio decoded (batch 1/1) in 58 ms

I have a dream that one day every valley shall be exalted and every hill and
mountain shall be made low the rough places will be made straight and the crooked
places will be made straight and the Lord shall be revealed and all shall see it
together this is our hope this is the path that I go back to the sun with this
faith we will be able to Hew out of the mountain of despair of stone of stone.

New API

The API now accepts PCM F32 as input via mtmd_bitmap_init_from_audio(). Optionally, you can check whether a given bitmap is audio using mtmd_bitmap_is_audio().

The helper mtmd_helper_bitmap_init_from_buf/file is extended to load input file data into the correct mtmd_bitmap type (decided by the magic bytes of the file), so it just works out of the box without any changes to application code.
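The magic-byte dispatch can be sketched as below. This is an illustrative stand-alone version, not the actual helper code; the enum and function names are invented for the example:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical media kinds for the sketch (not real mtmd symbols).
enum class media_kind { image, audio_wav, audio_mp3, audio_flac };

// Decide the media type from the first bytes of the file.
media_kind detect_media(const uint8_t * buf, size_t len) {
    // WAV: "RIFF" .... "WAVE"
    if (len >= 12 && memcmp(buf, "RIFF", 4) == 0 && memcmp(buf + 8, "WAVE", 4) == 0) {
        return media_kind::audio_wav;
    }
    // FLAC: "fLaC"
    if (len >= 4 && memcmp(buf, "fLaC", 4) == 0) {
        return media_kind::audio_flac;
    }
    // MP3: ID3v2 tag, or a raw MPEG frame sync (11 set bits)
    if (len >= 3 && memcmp(buf, "ID3", 3) == 0) {
        return media_kind::audio_mp3;
    }
    if (len >= 2 && buf[0] == 0xFF && (buf[1] & 0xE0) == 0xE0) {
        return media_kind::audio_mp3;
    }
    // fall back to the existing image-decoding path
    return media_kind::image;
}
```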

mtmd_input_chunk now has a new type called MTMD_INPUT_CHUNK_TYPE_AUDIO

You can get the number of audio/image tokens that a chunk takes via the newly added mtmd_input_chunk_get_n_tokens API

The rest of the process (encode/decode) is the same as before, so very little changes for downstream applications.

For complete changes, see tools/mtmd/mtmd-cli.cpp : https://github.com/ggml-org/llama.cpp/pull/13623/files#diff-4bfe825a05fa2d2598cc93f39aaa081605d2fd82823bd5d15e7dab72acd85e7c

Deprecated API

The image marker <__image__> will continue to work, but it is deprecated now that a new marker, <__media__>, has been added. This marker is defined in MTMD_DEFAULT_MEDIA_MARKER.

These 3 APIs are deprecated (but will continue to function; no breaking change):

  • mtmd_image_tokens_get_n_tokens
  • mtmd_image_tokens_get_id
  • mtmd_image_tokens_get_n_pos

They simply change their prefix to mtmd_input_chunk_:

  • mtmd_input_chunk_get_n_tokens
  • mtmd_input_chunk_get_id
  • mtmd_input_chunk_get_n_pos
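A rename like this can stay source-compatible by leaving the old symbol as a deprecated thin wrapper over the new one. A toy sketch (the types and names here are illustrative, not the real mtmd declarations):

```cpp
#include <cstddef>

// Stand-in for the real chunk type.
struct input_chunk { size_t n_tokens; };

// New name: generic over chunk types (image or audio).
size_t input_chunk_get_n_tokens(const input_chunk * chunk) {
    return chunk->n_tokens;
}

// Old name: kept for source compatibility, forwards to the new API.
// Callers get a compile-time warning nudging them to migrate.
[[deprecated("use input_chunk_get_n_tokens instead")]]
size_t image_tokens_get_n_tokens(const input_chunk * chunk) {
    return input_chunk_get_n_tokens(chunk);
}
```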

TODO in next PRs:

  • support audio input on server
  • move miniaudio.h and stb_image.h to mtmd_helper
  • add deprecation macro for mtmd_image_tokens_get_n_tokens / n_pos / id

@github-actions github-actions bot added examples python python script changes labels May 18, 2025
@ngxson
Collaborator Author

ngxson commented May 18, 2025

OK, somehow it works magically; the code is still nowhere near finished.

Tested using first 6 seconds from https://www.youtube.com/watch?v=vP4iY1TtS3s


@ngxson
Collaborator Author

ngxson commented May 20, 2025

With the gelu_erf from #13667, this is now able to transcribe the full 30s of audio:

I can transcribe the audio for you. Here is the transcription:

"I have a dream that one day every valley shall be exalted and every hill and mountain shall be made low the rough places will be made plain and the crooked places will be made straight and the Lord shall be revealed and all shall see it together this is our hope this is the peace that I go back to the sun with this faith we will be able to Hew out of the mountain of despair of stone of the darkness"

Note: The original audio may have slight variations in tone and pitch, but the above transcription should be accurate.

Next step is to allow more than 30s input

@github-actions github-actions bot added the ggml changes relating to the ggml tensor library for machine learning label May 21, 2025
@github-actions github-actions bot added the Apple Metal https://en.wikipedia.org/wiki/Metal_(API) label May 21, 2025
@ngxson ngxson force-pushed the xsn/mtmd_ultravox branch from 167dc89 to e7c8a2e Compare May 21, 2025 15:15
@github-actions github-actions bot added the documentation Improvements or additions to documentation label May 21, 2025
@ngxson ngxson changed the title mtmd : (WIP) add ultravox audio input mtmd : add ultravox audio input May 21, 2025
@ngxson ngxson marked this pull request as ready for review May 21, 2025 16:30
@ngxson ngxson requested a review from ggerganov May 21, 2025 16:30
Comment on lines +205 to +208
if (has_audio) {
LOG_WRN("%s: audio input is in experimental stage and may have reduced quality:\n"
" https://github.com/ggml-org/llama.cpp/pull/13623\n", __func__);
}
Collaborator Author

The model hallucinates on audio longer than 1 minute and I'm still not sure why (haven't yet had time to try the same audio on transformers)

But I think a small notice here is enough for now; this is kinda experimental support, and hopefully we will get gemma 3n supported soon.

@ngxson ngxson removed ggml changes relating to the ggml tensor library for machine learning Apple Metal https://en.wikipedia.org/wiki/Metal_(API) labels May 21, 2025
Comment on lines 5990 to 5991
self.hparams["image_size"] = self.hparams["num_mel_bins"]
self.hparams["patch_size"] = self.hparams["num_mel_bins"]
Member

Are the image_size and patch_size used in the audio encoder?

Collaborator Author

It is unused, but I left it here from my first draft version so the warmup works. But yeah, I should remove this.

Comment on lines 42 to 44
#define MTMD_DEFAULT_MEDIA_MARKER "<__media__>"

// deprecated marker, use MTMD_DEFAULT_MEDIA_MARKER instead
Member

We have such constants in llama.h and ggml.h, but we eventually have to start moving those behind API calls. It's more future-proof.

Collaborator Author

Good idea! I added it in 107790a

Member

@ggerganov ggerganov left a comment

The preprocessor will convert input PCM to mel spectrogram with dimension of n_frames * n_mel, so it can be considered as a gray scale (1 channel) image with W=n_frames and H=n_mel

This is a neat idea. Do you think it would be compatible with other audio models or is this a lucky coincidence for this architecture? I guess the question is if all audio encoders work with 2D spectrograms.

@ngxson
Collaborator Author

ngxson commented May 22, 2025

So far I have seen just 2 types of models:

  • whisper-based (used by ultravox, qwen2-audio, phi-4-mm), which use a 2D mel spectrogram as input. Since many models work this way, the current impl is quite biased toward whisper 😂
  • residual vector quantization-based models (mimi encoder, gemma 3n), which accept raw PCM F32 as input; technically this is a 1D image (W=n_samples and H=1)

So overall, I think this system should work well for most audio models

I'll resolve the 2 comments a bit later today, and will merge it after that. Thanks for reviewing this!

@ngxson
Collaborator Author

ngxson commented May 22, 2025

OK, so I ended up adding a prefix clip.audio, which should allow both audio + vision encoders to coexist in the same mmproj.


GGUFs on ggml-org for ultravox were also updated to reflect this change.

Tested the conversion script with gemma 3 to make sure that it doesn't produce a broken mmproj file

I also ran a test to make sure this doesn't accidentally break any existing vision models. Merging this PR once the CI is green 🤞

OK:   llama-mtmd-cli ggml-org/SmolVLM-500M-Instruct-GGUF:Q8_0
OK:   llama-mtmd-cli ggml-org/SmolVLM2-2.2B-Instruct-GGUF:Q4_K_M
OK:   llama-mtmd-cli ggml-org/SmolVLM2-500M-Video-Instruct-GGUF:Q8_0
OK:   llama-mtmd-cli ggml-org/gemma-3-4b-it-GGUF:Q4_K_M
OK:   llama-mtmd-cli THUDM/glm-edge-v-5b-gguf:Q4_K_M
OK:   llama-mtmd-cli second-state/Llava-v1.5-7B-GGUF:Q2_K
OK:   llama-mtmd-cli cjpais/llava-1.6-mistral-7b-gguf:Q3_K_M
OK:   llama-mtmd-cli ibm-research/granite-vision-3.2-2b-GGUF:Q4_K_M
OK:   llama-mtmd-cli second-state/MiniCPM-Llama3-V-2_5-GGUF:Q2_K
OK:   llama-mtmd-cli openbmb/MiniCPM-V-2_6-gguf:Q2_K
OK:   llama-mtmd-cli openbmb/MiniCPM-o-2_6-gguf:Q4_0
OK:   llama-mtmd-cli bartowski/Qwen2-VL-2B-Instruct-GGUF:Q4_K_M
OK:   llama-mtmd-cli ggml-org/Qwen2.5-VL-3B-Instruct-GGUF:Q4_K_M
OK:   llama-mtmd-cli ggml-org/InternVL2_5-1B-GGUF:Q8_0
OK:   llama-mtmd-cli ggml-org/InternVL3-1B-Instruct-GGUF:Q8_0

@ngxson ngxson merged commit 797990c into ggml-org:master May 22, 2025
49 checks passed
@zhouwg
Contributor

zhouwg commented May 23, 2025

truly AI expert, ......genius programmer, another gg!

Labels
documentation Improvements or additions to documentation examples python python script changes server