Xsn/qwen25omni #25
base: master
Conversation
Walkthrough
The changes introduce support for multimodal models with separate vision and audio encoders, refactoring the code to handle distinct hyperparameters, tensor mappings, and model classes. They include updates to Python scripts, C++ headers, and tests to enable and exercise combined vision-audio capabilities, especially for Qwen2.5-Omni models.
Sequence Diagram(s)
sequenceDiagram
participant User
participant CLI
participant mtmd_context
participant clip_ctx_vision
participant clip_ctx_audio
User->>CLI: Provide prompt, images, audio
CLI->>mtmd_context: Initialize
mtmd_context->>clip_ctx_vision: Load vision model (if present)
mtmd_context->>clip_ctx_audio: Load audio model (if present)
CLI->>mtmd_context: Tokenize input (text, images, audio)
mtmd_context->>clip_ctx_vision: Encode images (if any)
mtmd_context->>clip_ctx_audio: Encode audio (if any)
mtmd_context-->>CLI: Return tokenized/encoded input
CLI-->>User: Output result
sequenceDiagram
participant convert_hf_to_gguf
participant MmprojModel
participant Qwen25OmniModel
convert_hf_to_gguf->>MmprojModel: Load model config
MmprojModel->>MmprojModel: Extract vision/audio configs
MmprojModel->>Qwen25OmniModel: If Qwen2.5-Omni, instantiate
Qwen25OmniModel->>Qwen25OmniModel: Set GGUF params for vision/audio
Qwen25OmniModel->>Qwen25OmniModel: Modify tensors for multimodal
Qwen25OmniModel-->>convert_hf_to_gguf: Return processed model
📜 Recent review details
Configuration used: CodeRabbit UI
📒 Files selected for processing (3)
🚧 Files skipped from review as they are similar to previous changes (2)
🧰 Additional context used
🧠 Learnings (1): tools/mtmd/clip.cpp (1)
🧬 Code Graph Analysis (1): tools/mtmd/clip.cpp (4)
⏰ Context from checks skipped due to timeout of 90000ms (20)
🔇 Additional comments (1)
✨ Finishing Touches
@coderabbitai time to nits picking. first step, see if I make any typo error (i.e. broken english)
Actionable comments posted: 3
🔭 Outside diff range comments (1)
tools/mtmd/tests.sh (1)
86-87: ⚠️ Potential issue: Fix inconsistent function usage in huge tests
Lines 86-87 still use the old add_test function instead of the new add_test_vision function. This will cause errors since add_test is no longer defined. Apply this fix:
- add_test "llama-mtmd-cli" "ggml-org/Qwen2.5-VL-72B-Instruct-GGUF:Q4_K_M" - add_test "llama-mtmd-cli" "ggml-org/Llama-4-Scout-17B-16E-Instruct-GGUF:IQ1_S" + add_test_vision "ggml-org/Qwen2.5-VL-72B-Instruct-GGUF:Q4_K_M" + add_test_vision "ggml-org/Llama-4-Scout-17B-16E-Instruct-GGUF:IQ1_S"
🧹 Nitpick comments (7)
tools/mtmd/mtmd-cli.cpp (1)
287-289: Consider improving marker logic for edge cases.
The current logic adds one marker per image only if the prompt contains NO markers. This could lead to inconsistent behavior:
- Scenario 1: 3 images, 0 markers → adds 3 markers ✓
- Scenario 2: 3 images, 1 marker → adds 0 markers ⚠️
- Scenario 3: 3 images, 5 markers → adds 0 markers ⚠️
Consider a more robust approach:
- if (params.prompt.find(mtmd_default_marker()) == std::string::npos) {
-     for (size_t i = 0; i < params.image.size(); i++) {
-         params.prompt += mtmd_default_marker();
-     }
- }
+ // Count existing markers in the prompt
+ size_t marker_count = 0;
+ size_t pos = 0;
+ const std::string marker = mtmd_default_marker();
+ while ((pos = params.prompt.find(marker, pos)) != std::string::npos) {
+     marker_count++;
+     pos += marker.length();
+ }
+
+ // Add markers for any missing images
+ for (size_t i = marker_count; i < params.image.size(); i++) {
+     params.prompt += marker;
+ }

Alternatively, document the current behavior clearly for users to understand marker placement expectations.
tools/mtmd/mtmd.cpp (3)
302-305: Null-check before freeing contexts
clip_free(nullptr) is undefined unless the helper guards internally. A trivial guard improves robustness and symmetry with other checks in the file.

~mtmd_context() {
-     clip_free(ctx_a);
-     clip_free(ctx_v);
+     if (ctx_a) clip_free(ctx_a);
+     if (ctx_v) clip_free(ctx_v);
}
280-287: get_clip_ctx lacks TEXT branch -> unreachable abort
When the caller passes a text chunk accidentally, the function GGML_ABORTs, crashing the process instead of signalling a user-level error.

clip_ctx * get_clip_ctx(const mtmd_input_chunk * chunk) const {
    if (chunk->type == MTMD_INPUT_CHUNK_TYPE_IMAGE) {
        return ctx_v;
    } else if (chunk->type == MTMD_INPUT_CHUNK_TYPE_AUDIO) {
        return ctx_a;
+   } else if (chunk->type == MTMD_INPUT_CHUNK_TYPE_TEXT) {
+       return nullptr; // text chunks have no clip ctx
    }
    GGML_ABORT("unknown chunk type");
}

Failing fast is good, but crashing on an innocuous misuse hurts API ergonomics. Returning nullptr (the caller can then error out) is safer.
654-672: split_text can mis-count markers, causing misleading log output
The helper returns the delimiter itself as an element; however the "expected markers" reported elsewhere use parts.size() - 1, which doubles the actual count in ordinary strings ("a <m> b" → 3 parts). Consider counting the delimiter occurrences directly or returning only text fragments to avoid confusion.
This is purely logging, so not critical, but the message hampers debugging.
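For illustration only, a minimal sketch of counting marker occurrences directly instead of deriving the count from the split result; the helper name is hypothetical and not part of the PR:

#include <string>
#include <cstddef>

// count non-overlapping occurrences of `marker` in `text`
static size_t count_markers(const std::string & text, const std::string & marker) {
    if (marker.empty()) {
        return 0; // avoid an infinite loop on an empty delimiter
    }
    size_t count = 0;
    size_t pos   = 0;
    while ((pos = text.find(marker, pos)) != std::string::npos) {
        count++;
        pos += marker.length();
    }
    return count; // "a <m> b" with marker "<m>" -> 1, not parts.size() - 1 == 2
}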
tools/mtmd/clip.cpp (1)
2589-2604: Warm-up allocation for audio-enabled models can explode memory usage
alloc_compute_meta() builds a fake batch whose width is initialised as

img->nx = hparams.warmup_audio_size; // default 3000
img->ny = hparams.n_mel_bins;

For speech models using 16 kHz audio this corresponds to 187.5 ms per frame, yet many users feed noticeably shorter snippets during inference. Allocating the graph for 3 000 frames (~46 s) overly inflates both compute and weight buffers (especially on GPU back-ends with unified memory budgeting).
Consider tightening the default (3000 → 400?) or, better, scaling it dynamically from ctx_params.max_audio_len / a CLI flag. It will noticeably reduce start-up latency and VRAM footprint for typical use-cases.

convert_hf_to_gguf.py (2)
It will noticeably reduce start-up latency and VRAM footprint for typical use-cases.convert_hf_to_gguf.py (2)
1127-1128
: Add documentation for the newly added class attributes.Consider adding docstrings to clarify the purpose of
n_block_keys
- it's not immediately clear that this list contains the possible key names for block count across different model configurations.class MmprojModel(ModelBase): model_type = ModelType.MMPROJ model_arch = gguf.MODEL_ARCH.MMPROJ preprocessor_config: dict[str, Any] global_config: dict[str, Any] + # List of possible key names for block count in different model configurations n_block_keys = ["n_layers", "num_hidden_layers", "n_layer", "num_layers", "depth"]
1132-1135: Consider extracting the separate vision and audio hparams initialization into dedicated methods.
The initialization logic for hparams_vision and hparams_audio could be extracted into separate methods for better maintainability and testability.

+   def _init_modality_hparams(self) -> None:
+       """Initialize vision and audio hyperparameters from the global config."""
+       self.hparams_vision = self.get_vision_config()
+       self.hparams_audio = self.get_audio_config()
+
class MmprojModel(ModelBase):
    # ... existing code ...

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # ... existing initialization code ...

        # move vision config to the top level, while preserving the original hparams in global_config
        import copy
        self.global_config = copy.deepcopy(self.hparams)
-       self.hparams_vision = self.get_vision_config()
-       self.hparams_audio = self.get_audio_config()
+       self._init_modality_hparams()
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
tools/mtmd/test-2.mp3 is excluded by !**/*.mp3
📒 Files selected for processing (10)
convert_hf_to_gguf.py (8 hunks)
gguf-py/gguf/constants.py (1 hunks)
gguf-py/gguf/tensor_mapping.py (2 hunks)
tools/mtmd/clip-impl.h (2 hunks)
tools/mtmd/clip.cpp (45 hunks)
tools/mtmd/clip.h (1 hunks)
tools/mtmd/mtmd-cli.cpp (1 hunks)
tools/mtmd/mtmd-helper.cpp (3 hunks)
tools/mtmd/mtmd.cpp (13 hunks)
tools/mtmd/tests.sh (3 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (3)
tools/mtmd/mtmd-cli.cpp (1)
tools/mtmd/mtmd.cpp (2)
mtmd_default_marker (82-84), mtmd_default_marker (82-82)
gguf-py/gguf/tensor_mapping.py (1)
gguf-py/gguf/constants.py (1)
MODEL_TENSOR (355-551)
tools/mtmd/mtmd-helper.cpp (3)
tools/mtmd/mtmd.cpp (13)
nx (30-30), nx (34-42), n_tokens (51-57), GGML_ASSERT (174-262), GGML_ASSERT (264-277), mtmd_input_chunk_get_tokens_image (943-948), mtmd_input_chunk_get_tokens_image (943-943), chunk (280-287), chunk (280-280), mtmd_image_tokens_get_nx (1015-1017), mtmd_image_tokens_get_nx (1015-1015), mtmd_image_tokens_get_ny (1019-1021), mtmd_image_tokens_get_ny (1019-1019)
tools/mtmd/mtmd.h (6)
nx (120-120), chunk (148-148), chunk (149-149), chunk (159-159), mtmd_image_tokens_get_nx (168-168), mtmd_image_tokens_get_ny (169-169)
tools/server/utils.hpp (7)
GGML_ASSERT (1156-1159), pos (1110-1117), pos (1110-1110), pos (1162-1165), pos (1162-1162), chunk (1127-1147), chunk (1127-1127)
⏰ Context from checks skipped due to timeout of 90000ms (20)
- GitHub Check: build-linux-cross / ubuntu-24-riscv64-cpu-cross
- GitHub Check: build-linux-cross / ubuntu-24-arm64-vulkan-cross
- GitHub Check: build-linux-cross / ubuntu-24-riscv64-vulkan-cross
- GitHub Check: build-linux-cross / ubuntu-24-ppc64el-cpu-cross
- GitHub Check: build-linux-cross / ubuntu-24-ppc64el-vulkan-cross
- GitHub Check: ubuntu-latest-cmake-cuda
- GitHub Check: macOS-latest-swift (generic/platform=macOS)
- GitHub Check: ubuntu-latest-cmake-sanitizer (THREAD, Debug)
- GitHub Check: ubuntu-latest-cmake-sanitizer (ADDRESS, Debug)
- GitHub Check: ubuntu-latest-cmake-rpc
- GitHub Check: macOS-latest-cmake-tvos
- GitHub Check: macOS-latest-cmake-arm64
- GitHub Check: ubuntu-22-cmake-sycl
- GitHub Check: ubuntu-latest-llguidance
- GitHub Check: macOS-latest-cmake-x64
- GitHub Check: check-requirements
- GitHub Check: server-windows
- GitHub Check: server (Release)
- GitHub Check: server (ADDRESS, RelWithDebInfo)
- GitHub Check: server (UNDEFINED, RelWithDebInfo)
🔇 Additional comments (27)
gguf-py/gguf/constants.py (1)
2263-2263: LGTM! Consistent constant addition for Qwen2.5-Omni support.
The new QWEN25O constant follows the existing naming pattern and is properly aligned with the corresponding C++ enum mapping in tools/mtmd/clip-impl.h. The comment clearly indicates this is for omni-modal support.

tools/mtmd/clip-impl.h (1)
133-133: LGTM! Proper enum and mapping addition for Qwen2.5-Omni projector type.
The enum value PROJECTOR_TYPE_QWEN25O is correctly positioned before PROJECTOR_TYPE_UNKNOWN, and the corresponding string mapping to "qwen2.5o" is consistent with the Python constant in gguf-py/gguf/constants.py. The formatting follows the existing code style.
Also applies to: 152-152
gguf-py/gguf/tensor_mapping.py (3)
1128-1128: LGTM: Audio tensor mapping addition
The addition of "audio_tower.ln_post" to the A_POST_NORM tensor mapping correctly extends support for Qwen2.5-Omni audio encoders.
1165-1166: Good documentation of pseudo-prefix convention
The comment clearly explains the purpose of the "audio." pseudo-prefix for preventing conflicts with vision tensors, which helps with code maintainability.
1174-1174: LGTM: Audio projection tensor mapping
The addition of "audio_tower.proj" to the A_MMPROJ_FC tensor mapping correctly supports the audio tower projection layer in Qwen2.5-Omni models.

tools/mtmd/clip.h (4)
20-23: LGTM: Clear modality enumeration
The clip_modality enum provides a clean way to distinguish between vision and audio processing modes.
27-27: LGTM: Modality field addition
Adding the modality field to clip_context_params allows the initialization function to determine the appropriate processing mode.
31-34: LGTM: Well-designed result structure
The clip_init_result struct with separate ctx_v and ctx_a pointers provides a clean way to return both vision and audio contexts from initialization.
36-36: ✅ Verification successful
Breaking API change - ensure all callers are updated
The function signature change from returning a single clip_ctx * to returning clip_init_result is a breaking change that will require updates to all calling code.

🏁 Script executed:

#!/bin/bash
# Search for all usages of clip_init to verify they've been updated
rg -A 5 -B 5 "clip_init\s*\(" --type cpp --type c

Length of output: 1816

No further action needed: all clip_init callers have been updated to use clip_init_result.
- Verified in tools/mtmd/mtmd.cpp that the call to clip_init is unpacked via auto res = clip_init(...), with res.ctx_v and res.ctx_a being used.

tools/mtmd/mtmd-helper.cpp (3)
69-70: Good clarification of function purpose
Renaming to set_position_mrope_2d and adding the "for image" comment clearly indicates this function is specifically for 2D spatial positioning used by image tokens.
89-104: LGTM: Audio M-RoPE implementation
The 1D M-RoPE implementation for audio tokens correctly sets sequential positions for all three used dimensions while leaving the fourth dimension unused, consistent with the 2D version.
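As a rough illustration of what that 1-D variant amounts to: a sketch under the assumption that positions are stored as four contiguous planes of n_tokens entries each; the function name and layout here are hypothetical, not the helper's actual signature.

#include <cstdint>

static void set_position_mrope_1d_sketch(int32_t * pos, int n_tokens, int32_t start) {
    // dims 0..2 all receive the same sequential position; dim 3 stays unused (0)
    for (int i = 0; i < n_tokens; i++) {
        pos[0 * n_tokens + i] = start + i;
        pos[1 * n_tokens + i] = start + i;
        pos[2 * n_tokens + i] = start + i;
        pos[3 * n_tokens + i] = 0;
    }
}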
167-180: Robust chunk type handling for M-RoPE
The logic correctly distinguishes between image and audio chunks, calling the appropriate M-RoPE positioning function for each type, with proper error handling for unsupported chunk types.
tools/mtmd/tests.sh (4)
35-35: Good addition of file array
The arr_file array properly supports different input file types for vision and audio tests.
37-50: Well-designed specialized test functions
The add_test_vision and add_test_audio functions properly handle the different requirements for each modality (vision uses chat templates, audio doesn't; different input files).
101-101: LGTM: Simplified build target
Building only the unified llama-mtmd-cli binary aligns with the architectural consolidation.
105-118: LGTM: Updated test execution logic
The test loop correctly uses the new array structure and applies the appropriate input file for each test case.
tools/mtmd/clip.cpp (1)
2099-2107: ❓ Verification inconclusive
Possible prefix mismatch between hyper-parameter keys and tensor names
load_hparams() uses

const char * prefix = is_vision ? "vision" : "audio";

when querying numeric keys such as "vision_n_embd" / "audio_n_embd" (and similar).
Later on, load_tensors() derives a different prefix for the very same modality:

const char * prefix = model.modality == CLIP_MODALITY_AUDIO ? "a" : "v";

All subsequent tensor look-ups are built with string_format(..., prefix, …) and therefore expect names like "v_attn_q.weight" or "a_ln_1.bias".
If your GGUF conversion script exported:
- hyper-parameter keys with "vision_" / "audio_" and
- tensor names with "v_" / "a_"
then everything is fine. However, if the exporter used the long form ("vision_" / "audio_") for both metadata and tensors (or vice-versa), the loader will silently fail to locate mandatory tensors, returning nullptr, which will later cause hard-to-trace SIGSEGVs.
I recommend unifying the prefix logic in one place, e.g.

-const char * prefix = model.modality == CLIP_MODALITY_AUDIO ? "a" : "v";
+const char * prefix = model.modality == CLIP_MODALITY_AUDIO ? "audio" : "vision";

(or the opposite, depending on your GGUF convention) and providing an assertion after get_tensor() that no required tensor was missed.
Would you double-check the actual tensor names inside a representative GGUF file and confirm which style is used? If you need a helper script to dump the tensor list I can provide one.
Also applies to: 2294-2300

🏁 Script executed:

#!/bin/bash
# Show the load_tensors implementation and look for prefix assignment
rg -n "load_tensors" -A20 tools/mtmd/clip.cpp
# Find all prefix definitions in clip.cpp to compare both usages
rg -n "const char \* prefix" tools/mtmd/clip.cpp

Length of output: 2301

Ensure consistent prefix usage for hparams and tensors
There's a mismatch between how you load hyper-parameters vs. tensors:
- In load_hparams (around line 2104): const char *prefix = is_vision ? "vision" : "audio";
- In load_tensors (line 2296): const char *prefix = model.modality == CLIP_MODALITY_AUDIO ? "a" : "v";
If your GGUF export uses the long form ("vision_" / "audio_") for both metadata and tensors (or the short form for both), the loader will silently fail to find tensors and crash later.
I recommend:
• Unify the prefixes (choose long or short) in one place, e.g. in load_tensors:

- const char *prefix = model.modality == CLIP_MODALITY_AUDIO ? "a" : "v";
+ const char *prefix = model.modality == CLIP_MODALITY_AUDIO ? "audio" : "vision";

• After each get_tensor():

auto *t = get_tensor(...);
assert(t && "missing tensor: " KEY(prefix) "_" name);

Could you please confirm which convention your GGUF file uses for tensor names? For example, you can run:

strings model.gguf | grep -E '"?(vision|audio|v|a)_' | sort -u

and share the output so we can lock in the correct prefix.
convert_hf_to_gguf.py (10)
435-438: LGTM! Clean configuration renaming for Qwen2.5-Omni models.
The code correctly handles the configuration structure change by renaming thinker_config to text_config for Qwen2.5-Omni models.
1172-1177: Excellent modular design for multimodal configuration handling.
The introduction of get_vision_config() and get_audio_config() methods provides a clean abstraction for accessing modality-specific configurations. This design allows for easy extension to support models with different configuration structures.
1216-1231: Well-designed parameter lookup methods with proper error handling.
The find_vparam, find_aparam, and _find_param methods provide a clean and consistent API for accessing modality-specific parameters. The assertion checks ensure that the respective configs exist before attempting to access them.
2709-2714: LGTM! Proper registration of Qwen2.5-Omni models.
The addition of "Qwen2_5OmniModel" to the model registration follows the established pattern.
2732-2738: Comprehensive tensor name processing for multimodal components.
The enhanced tensor name processing correctly handles the stripping of "thinker." prefixes and properly skips multimodal tensors (visual, audio, talker, token2wav). This ensures that only relevant tensors are processed.
2759-2777: Robust projector type handling for different model versions.
The code properly differentiates between different Qwen2 vision model types and sets the appropriate projector types. The handling of fullatt_block_indexes for window attention patterns is well-implemented with proper validation.
2828-2834: LGTM! Proper audio hyperparameter initialization.
The audio configuration is correctly extracted and mapped to the expected parameter names for consistency with the rest of the codebase.
2835-2840: Clean audio-specific GGUF parameter setup.
The audio-specific parameters are properly added to the GGUF writer with sensible defaults for the layer norm epsilon.
2847-2858: Well-implemented sinusoidal position embeddings for audio.
The implementation follows the standard sinusoidal position encoding formula with appropriate scaling. The fixed length of 1500 and max_timescale of 10000 appear to be model-specific constants.
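For reference, a minimal NumPy sketch of that standard sinusoidal formula; the length 1500 and max_timescale 10000 come from the comment above, while the channel count and the exact tensor name/orientation used by the conversion script are assumptions here.

import numpy as np

def sinusoidal_position_embeddings(length: int = 1500, channels: int = 1280,
                                   max_timescale: float = 10000.0) -> np.ndarray:
    # half of the channels carry sin, the other half cos
    assert channels % 2 == 0
    log_timescale_increment = np.log(max_timescale) / (channels // 2 - 1)
    inv_timescales = np.exp(-log_timescale_increment * np.arange(channels // 2))
    scaled_time = np.arange(length)[:, np.newaxis] * inv_timescales[np.newaxis, :]
    return np.concatenate([np.sin(scaled_time), np.cos(scaled_time)], axis=1)  # shape (length, channels)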
2869-2878: Correct handling of audio tensor modifications.
The code properly handles:
- Unsqueezing bias tensors for conv layers to match expected dimensions
- Skipping the unused audio_bos_eos_token as documented in the transformers code reference
Actionable comments posted: 1
🔭 Outside diff range comments (1)
tools/mtmd/tests.sh (1)
89-91: ⚠️ Potential issue
add_test is undefined; the huge branch will fail
After the refactor only add_test_vision / add_test_audio exist, but the huge section still calls the removed add_test helper, leading to "command not found" and an early script abort.

- add_test "llama-mtmd-cli" "ggml-org/Qwen2.5-VL-72B-Instruct-GGUF:Q4_K_M"
- add_test "llama-mtmd-cli" "ggml-org/Llama-4-Scout-17B-16E-Instruct-GGUF:IQ1_S"
+ add_test_vision "ggml-org/Qwen2.5-VL-72B-Instruct-GGUF:Q4_K_M"
+ add_test_vision "ggml-org/Llama-4-Scout-17B-16E-Instruct-GGUF:IQ1_S"
♻️ Duplicate comments (2)
tools/mtmd/mtmd.cpp (1)
730-732: 🛠️ Refactor suggestion
Prefer querying the projector for its embedding size
Using ctx->n_embd_text assumes text and audio dimensions are forever equal. Future adapters (e.g. 1×1 linear) could break this assumption.

- int n_mmproj_embd = ctx->n_embd_text;
+ int n_mmproj_embd = clip_n_mmproj_embd(ctx->ctx_a);

This keeps the audio path self-contained and resilient.
tools/mtmd/clip.cpp (1)
4044-4085: ⚠️ Potential issue
Critical: Null-pointer dereference risk in clip_n_mmproj_embd()
Several projector branches blindly read tensor shapes without checking the pointer, e.g.

case PROJECTOR_TYPE_LDP:
    return ctx->model.mm_model_block_1_block_2_1_b->ne[0];

If a future GGUF is produced without optional bias terms (quite common for weight-only formats to save space), mm_model_block_1_block_2_1_b will be nullptr and the call will seg-fault the encoder during the very first size query.
Please guard every dereference with a nullptr check (or fall back to another tensor that is guaranteed to exist), e.g.

-case PROJECTOR_TYPE_LDP:
-    return ctx->model.mm_model_block_1_block_2_1_b->ne[0];
+case PROJECTOR_TYPE_LDP: {
+    GGML_ASSERT(ctx->model.mm_model_block_1_block_2_1_b &&
+                "Missing bias tensor for LDP projector");
+    return ctx->model.mm_model_block_1_block_2_1_b->ne[0];
+}

The same applies to the LDPV2, MLP, PIXTRAL, MINICPMV and several other cases below.
🧹 Nitpick comments (1)
docs/multimodal.md (1)
101-109: Clarify placeholder & flags for new mixed-modality examples
The newly-added commands still include the literal (tool_name) placeholder. Readers may miss the earlier note that this must be replaced with llama-mtmd-cli or llama-server.
Consider either:
- Replacing the placeholder with a concrete example (as done in lines 45-63), or
- Adding an inline reminder, e.g. (tool_name) # replace with llama-mtmd-cli or llama-server
Also mention whether an external --mmproj is required when the model is loaded via -m <local-gguf> (unlike the -hf path).
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
docs/multimodal.md (1 hunks)
tools/mtmd/clip.cpp (45 hunks)
tools/mtmd/mtmd.cpp (13 hunks)
tools/mtmd/tests.sh (3 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (20)
- GitHub Check: build-linux-cross / ubuntu-24-riscv64-cpu-cross
- GitHub Check: ubuntu-22-cmake-vulkan
- GitHub Check: build-linux-cross / ubuntu-24-riscv64-vulkan-cross
- GitHub Check: ubuntu-latest-cmake-sanitizer (UNDEFINED, Debug)
- GitHub Check: build-linux-cross / ubuntu-24-arm64-vulkan-cross
- GitHub Check: android-build
- GitHub Check: build-linux-cross / ubuntu-24-ppc64el-vulkan-cross
- GitHub Check: macOS-latest-swift (generic/platform=macOS)
- GitHub Check: ubuntu-22-cmake-sycl-fp16
- GitHub Check: ios-xcode-build
- GitHub Check: ubuntu-22-cmake-musa
- GitHub Check: ubuntu-latest-cmake-rpc
- GitHub Check: windows-msys2 (UCRT64, ucrt-x86_64, Release)
- GitHub Check: windows-latest-cmake-hip
- GitHub Check: server-windows
- GitHub Check: ubuntu-latest-cmake-sanitizer (THREAD, Debug)
- GitHub Check: macOS-latest-cmake-arm64
- GitHub Check: server (Release)
- GitHub Check: server (UNDEFINED, RelWithDebInfo)
- GitHub Check: server (ADDRESS, RelWithDebInfo)
🔇 Additional comments (9)
tools/mtmd/tests.sh (1)
108-123: ❌ Incorrect review comment
Audio files are passed with --image; verify correct CLI flag
--image $SCRIPT_DIR/$inp_file is used for every modality, but add_test_audio pushes an MP3 file. Unless llama-mtmd-cli treats --image as a generic media flag, this may silently skip audio processing or raise a MIME error.
Suggested defensive tweak:

flag=$( [[ $inp_file == *.mp3 ]] && echo "--audio" || echo "--image" )
"$PROJ_ROOT/build/bin/$bin" \
    -hf "$hf" \
    $flag "$SCRIPT_DIR/$inp_file" \
    …

Please confirm the intended CLI syntax.

🏁 Script executed:

#!/bin/bash
# Search for audio and image test registration and arr_file usage in tests.sh
rg --with-filename --no-heading --color never "add_test_audio" -n tools/mtmd/tests.sh
rg --with-filename --no-heading --color never "add_test_image" -n tools/mtmd/tests.sh
rg --with-filename --no-heading --color never "arr_file" -n tools/mtmd/tests.sh

Length of output: 653

🏁 Script executed:

#!/bin/bash
# Search for defined CLI flags for audio and image in the codebase
rg --with-filename --no-heading --color never "\-\-(audio|image)" -n .
# Look for argument parser definitions (Python argparse or Rust Clap)
rg --with-filename --no-heading --color never "add_argument" -n .
rg --with-filename --no-heading --color never "#\[derive(Parser)" -n .
rg --with-filename --no-heading --color never "clap::Parser" -n .
# Check README for any usage examples of --audio or --image
if [ -f README.md ]; then
    grep -nE '\-\-(audio|image)' README.md || echo "No audio/image flags in README"
fi

Length of output: 28697

🏁 Script executed:

#!/usr/bin/env bash
# Show context around the media FILE argument definition
rg -n -C5 '\{"--image", "--audio"\}' common/arg.cpp

Length of output: 618

CLI flags for image/audio are interchangeable
The argument parser in common/arg.cpp registers both --image and --audio as synonyms for the same FILE option (all values are stored in params.image), so passing an MP3 with --image will still be processed as audio. No changes are required; feel free to use --audio instead of --image for clarity.
Likely an incorrect or invalid review comment.
tools/mtmd/mtmd.cpp (1)
162-171: Great! Dimension-mismatch guard added
Validating that vision and audio projectors share the same n_embd prevents the buffer over/under-run reported earlier. 👍

tools/mtmd/clip.cpp (7)
178-207: LGTM! Good hyperparameter structure extension for multimodal support.
The addition of image_mean, image_std arrays, warmup_audio_size, and legacy compatibility fields properly supports the transition to unified vision/audio processing.
244-247: Excellent refactoring to unified model structure.
The addition of modality and proj_type fields to clip_model enables clean support for both vision and audio modalities under a single structure, improving code maintainability.
417-422: Good backward compatibility design with proj_type() accessor.
The accessor function maintains API compatibility while enabling the internal refactoring to proceed cleanly.
2683-2717: Well-structured multimodal initialization.
The refactoring to return separate vision and audio contexts with proper error handling and cleanup is well implemented. The use of clip_model_loader to handle both modalities cleanly maintains the separation of concerns.
2067-2097: Robust modality-specific loading logic.
The sanity checks for modality support and special handling for PROJECTOR_TYPE_QWEN25O multimodal models demonstrate good understanding of the different model architectures. The logic correctly maps the generic multimodal type to specific vision/audio projector types.
2289-2589: Clean tensor loading with modality-aware prefixing.
The use of model.modality to determine tensor prefixes ("v" for vision, "a" for audio) enables clean reuse of the tensor loading logic for both modalities while maintaining proper namespace separation.
4087-4123: Excellent API compatibility maintenance.
The updated detection functions (clip_has_vision_encoder, clip_has_audio_encoder, etc.) properly use the new modality field and proj_type() accessor while maintaining backward compatibility with existing API consumers.
void init_audio() {
    GGML_ASSERT(ctx_a != nullptr);
    projector_type proj = clip_get_projector_type(ctx_a);

    if (clip_has_whisper_encoder(ctx_a)) {
        // TODO @ngxson : check if model n_mel is 128 or 80
        w_filters = whisper_precalc_filters::get_128_bins();
    }

    // warning messages
    if (proj == PROJECTOR_TYPE_LLAMA4) {
        LOG_WRN("%s: llama 4 vision is known to have degraded quality:\n"
            "    https://github.com/ggml-org/llama.cpp/pull/13282\n", __func__);
        LOG_WRN("%s: audio input is in experimental stage and may have reduced quality:\n"
            "    https://github.com/ggml-org/llama.cpp/discussions/13759\n", __func__);

    if (proj == PROJECTOR_TYPE_QWEN2A) {
        // <|audio_bos|> ... (embeddings) ... <|audio_eos|>
        aud_beg = "<|audio_bos|>";
        aud_end = "<|audio_eos|>";

    }
    if (has_audio) {
        LOG_WRN("%s: audio input is in experimental stage and may have reduced quality:\n"
            "    https://github.com/ggml-org/llama.cpp/discussions/13759\n", __func__);
    }
Missing BOS/EOS tokens for Qwen-2.5 Omni audio
init_audio() sets aud_beg / aud_end only for PROJECTOR_TYPE_QWEN2A. The new projector type PROJECTOR_TYPE_QWEN25O (added elsewhere) is ignored, so audio chunks from Qwen2.5 Omni will be injected without boundary markers, likely breaking generation.

} else if (proj == PROJECTOR_TYPE_QWEN25O) {
    aud_beg = "<|audio_bos|>";
    aud_end = "<|audio_eos|>";
}

(Adjust token strings to match the HF tokenizer if they differ.)
🤖 Prompt for AI Agents
In tools/mtmd/mtmd.cpp between lines 275 and 293, the init_audio() function only
sets the aud_beg and aud_end tokens for PROJECTOR_TYPE_QWEN2A, ignoring the new
PROJECTOR_TYPE_QWEN25O. This causes audio chunks from Qwen-2.5 Omni to lack
boundary markers, which may break generation. Add a condition to set aud_beg and
aud_end tokens for PROJECTOR_TYPE_QWEN25O, matching the tokens used for Qwen2.5
Omni, and ensure the tokens align with the HF tokenizer if they differ.
PROJECTOR_TYPE_QWEN25O, // will be replaced by QWEN2A or QWEN25VL depending on clip_ctx
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Search for where PROJECTOR_TYPE_QWEN25O gets replaced or mapped
rg "QWEN25O" -A 5 -B 5
# Look for any mapping or replacement logic
rg "QWEN2A|QWEN25VL" -A 3 -B 3 | grep -E "(replace|map|convert|assign)"
# Search for projector type assignment/resolution logic
ast-grep --pattern $'if ($$$) {
$$$
proj_type = $$$
$$$
}'
Length of output: 4939
@ngxson Thank you for the clarification! You're absolutely right. I found the replacement logic in tools/mtmd/clip.cpp:
// correct arch for multimodal models
if (model.proj_type == PROJECTOR_TYPE_QWEN25O) {
model.proj_type = modality == CLIP_MODALITY_VISION
? PROJECTOR_TYPE_QWEN25VL
: PROJECTOR_TYPE_QWEN2A;
}
This means that by the time init_audio() is called for a Qwen2.5 Omni audio context, the projector type has already been resolved to PROJECTOR_TYPE_QWEN2A, which is correctly handled by the existing conditional block that sets aud_beg = "<|audio_bos|>" and aud_end = "<|audio_eos|>".
My original comment was a false positive. The code is working correctly as designed. Thanks for pointing this out!
🧬 Code Graph Analysis Results
<no_relevant_code_snippets>
✏️ Learnings added
Learnt from: ngxson
PR: ngxson/llama.cpp#25
File: tools/mtmd/mtmd.cpp:275-293
Timestamp: 2025-05-26T09:45:20.635Z
Learning: In tools/mtmd/clip.cpp, PROJECTOR_TYPE_QWEN25O is a placeholder that gets replaced by either PROJECTOR_TYPE_QWEN25VL (for vision) or PROJECTOR_TYPE_QWEN2A (for audio) before the respective init_vision() or init_audio() functions are called, ensuring proper token handling.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
Actionable comments posted: 2
🧹 Nitpick comments (2)
tools/mtmd/tests.sh (2)
72-74: Audio test still passed via --image; consider --audio for clarity
Although the CLI currently accepts an audio file on --image, this is counter-intuitive and may confuse contributors. Prefer adding a dedicated --audio flag (or reuse --media) and updating the test invocation accordingly.
107-118: Hard-coded binary name but flexible arrays; future-proofing idea
bin="llama-mtmd-cli" is hard-coded even though the legacy helper allowed passing a different binary. If the intention is to test multiple binaries again, consider restoring the argument in the function signature and storing it in an arr_bin array.
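A small sketch of that idea (the helper shape shown here is an assumption, not the script's current code): keep the binary as an explicit first argument again and record it per test.

arr_bin=()

add_test_vision() {
    local bin=$1   # e.g. llama-mtmd-cli, restored as an explicit argument
    local hf=$2
    local tmpl=${3:-""}
    arr_bin+=("$bin")
    arr_prefix+=("[vision]")
    arr_hf+=("$hf")
    arr_tmpl+=("$tmpl")
    arr_file+=("test-1.jpeg")
}

# later, in the test loop:
# bin="${arr_bin[$i]}"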
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
tools/mtmd/clip-impl.h (2 hunks)
tools/mtmd/mtmd.cpp (13 hunks)
tools/mtmd/tests.sh (4 hunks)
🧰 Additional context used
🧠 Learnings (2)
tools/mtmd/clip-impl.h (1)
Learnt from: ngxson
PR: ngxson/llama.cpp#25
File: tools/mtmd/mtmd.cpp:275-293
Timestamp: 2025-05-26T09:45:20.635Z
Learning: In tools/mtmd/clip.cpp, PROJECTOR_TYPE_QWEN25O is a placeholder that gets replaced by either PROJECTOR_TYPE_QWEN25VL (for vision) or PROJECTOR_TYPE_QWEN2A (for audio) before the respective init_vision() or init_audio() functions are called, ensuring proper token handling.
tools/mtmd/mtmd.cpp (1)
Learnt from: ngxson
PR: ngxson/llama.cpp#25
File: tools/mtmd/mtmd.cpp:275-293
Timestamp: 2025-05-26T09:45:20.635Z
Learning: In tools/mtmd/clip.cpp, PROJECTOR_TYPE_QWEN25O is a placeholder that gets replaced by either PROJECTOR_TYPE_QWEN25VL (for vision) or PROJECTOR_TYPE_QWEN2A (for audio) before the respective init_vision() or init_audio() functions are called, ensuring proper token handling.
⏰ Context from checks skipped due to timeout of 90000ms (20)
- GitHub Check: windows-latest-cmake (vulkan-x64, -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON -DGGML_...
- GitHub Check: windows-latest-cmake (llvm-arm64-opencl-adreno, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=c...
- GitHub Check: windows-latest-cmake (llvm-arm64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/arm64-win...
- GitHub Check: windows-latest-cmake (cpu-x64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/x64-windows-...
- GitHub Check: windows-latest-cmake (openblas-x64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/x64-win...
- GitHub Check: ubuntu-22-cmake-hip
- GitHub Check: ubuntu-22-cmake-vulkan
- GitHub Check: ubuntu-latest-cmake-rpc
- GitHub Check: ubuntu-22-cmake-musa
- GitHub Check: ubuntu-22-cmake-sycl-fp16
- GitHub Check: ubuntu-22-cmake-sycl
- GitHub Check: ubuntu-cpu-cmake (arm64, ubuntu-22.04-arm)
- GitHub Check: ubuntu-cpu-cmake (x64, ubuntu-22.04)
- GitHub Check: macOS-latest-cmake-x64
- GitHub Check: server (Release)
- GitHub Check: server (ADDRESS, RelWithDebInfo)
- GitHub Check: pyright type-check
- GitHub Check: server (UNDEFINED, RelWithDebInfo)
- GitHub Check: server-windows
- GitHub Check: check-requirements
🔇 Additional comments (4)
tools/mtmd/clip-impl.h (1)
133-135: 🛠️ Refactor suggestion
⚠️ Potential issue
Adding PROJECTOR_TYPE_QWEN25O in-between existing enum values silently shifts every following numeric constant; this can break on-disk/ABI compatibility.
Many structures (e.g. GGUF metadata) serialise the raw integer of projector_type; inserting a new value in the middle changes those integers for every subsequent type (UNKNOWN used to be N, now it is N+1). Unless the enum is guaranteed to be used only in memory, append new variants at the end or assign explicit values:

enum projector_type {
    @@
    PROJECTOR_TYPE_QWEN2A,
-   PROJECTOR_TYPE_QWEN25O, // will be replaced by QWEN2A or QWEN25VL depending on clip_ctx
-   PROJECTOR_TYPE_UNKNOWN,
+   PROJECTOR_TYPE_QWEN25O, // placeholder …
+   PROJECTOR_TYPE_UNKNOWN = 0xFFFF, // keep stable value
};

Alternatively, add = <fixed_id> for every enum that is part of a serialised format.
Also applies to: 152-153

⛔ Skipped due to learnings
Learnt from: ngxson
PR: ngxson/llama.cpp#25
File: tools/mtmd/mtmd.cpp:275-293
Timestamp: 2025-05-26T09:45:20.635Z
Learning: In tools/mtmd/clip.cpp, PROJECTOR_TYPE_QWEN25O is a placeholder that gets replaced by either PROJECTOR_TYPE_QWEN25VL (for vision) or PROJECTOR_TYPE_QWEN2A (for audio) before the respective init_vision() or init_audio() functions are called, ensuring proper token handling.
tools/mtmd/mtmd.cpp (3)
148-151: media_marker emptiness check prevents accidental infinite splitting; good catch
The early guard ensures the tokenizer cannot enter an infinite loop when the delimiter is empty. Looks correct as implemented.
162-172: Helpful dimension-mismatch guard
Verifying that the vision and audio projectors agree on n_embd prevents subtle buffer overruns later on. Nice defensive programming!
805-808: Early-exit for Gemma-3 non-causal decode; concise and correct
The modality-specific flag is computed once and re-used; looks good.
    int n_mmproj_embd = ctx->n_embd_text;
    ctx->image_embd_v.resize(chunk->tokens_audio->n_tokens * n_mmproj_embd);
    bool ok = clip_image_batch_encode(
-       ctx->ctx_clip,
+       ctx->ctx_a,
        ctx->n_threads,
🛠️ Refactor suggestion
Use clip_n_mmproj_embd(ctx->ctx_a) instead of n_embd_text for robustness
n_mmproj_embd is derived from the text model embedding size, assuming it was already validated to match the audio projector. If, in a future refactor, the validation path is bypassed (or conditional compilation removes vision support), this assignment could overflow / under-allocate image_embd_v.

- int n_mmproj_embd = ctx->n_embd_text;
+ int n_mmproj_embd = clip_n_mmproj_embd(ctx->ctx_a);

Keeps the function self-contained, reduces coupling, and makes the intent clearer.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
int n_mmproj_embd = ctx->n_embd_text;
ctx->image_embd_v.resize(chunk->tokens_audio->n_tokens * n_mmproj_embd);
bool ok = clip_image_batch_encode(
    ctx->ctx_clip,
    ctx->ctx_a,
    ctx->n_threads,

// allocate buffer using the actual audio projector dimension
- int n_mmproj_embd = ctx->n_embd_text;
+ const int n_mmproj_embd = clip_n_mmproj_embd(ctx->ctx_a);
ctx->image_embd_v.resize(chunk->tokens_audio->n_tokens * n_mmproj_embd);
bool ok = clip_image_batch_encode(
    ctx->ctx_a,
    ctx->n_threads,
🤖 Prompt for AI Agents
In tools/mtmd/mtmd.cpp around lines 753 to 757, the variable n_mmproj_embd is
assigned from ctx->n_embd_text, which may not accurately reflect the current
embedding size if the text and image models are decoupled or changed. Replace
the assignment with clip_n_mmproj_embd(ctx->ctx_a) to dynamically retrieve the
correct embedding size, ensuring the resize operation is robust and
future-proof. This change improves code clarity and reduces potential bugs
related to mismatched embedding dimensions.
tools/mtmd/tests.sh
Outdated
add_test_vision() {
    local hf=$1
    local tmpl=${2:-""} # default to empty string if not provided
    arr_prefix+=("[vision]")
    arr_hf+=("$hf")
    arr_tmpl+=("$tmpl")
    arr_file+=("test-1.jpeg")
}

add_test "llama-mtmd-cli" "ggml-org/SmolVLM-500M-Instruct-GGUF:Q8_0"
add_test "llama-mtmd-cli" "ggml-org/SmolVLM2-2.2B-Instruct-GGUF:Q4_K_M"
add_test "llama-mtmd-cli" "ggml-org/SmolVLM2-500M-Video-Instruct-GGUF:Q8_0"
add_test "llama-mtmd-cli" "ggml-org/gemma-3-4b-it-GGUF:Q4_K_M"
add_test "llama-mtmd-cli" "THUDM/glm-edge-v-5b-gguf:Q4_K_M"
add_test "llama-mtmd-cli" "second-state/Llava-v1.5-7B-GGUF:Q2_K" "vicuna"
add_test "llama-mtmd-cli" "cjpais/llava-1.6-mistral-7b-gguf:Q3_K_M" "vicuna"
add_test "llama-mtmd-cli" "ibm-research/granite-vision-3.2-2b-GGUF:Q4_K_M"
add_test "llama-mtmd-cli" "second-state/MiniCPM-Llama3-V-2_5-GGUF:Q2_K" # model from openbmb is corrupted
add_test "llama-mtmd-cli" "openbmb/MiniCPM-V-2_6-gguf:Q2_K"
add_test "llama-mtmd-cli" "openbmb/MiniCPM-o-2_6-gguf:Q4_0"
add_test "llama-mtmd-cli" "bartowski/Qwen2-VL-2B-Instruct-GGUF:Q4_K_M"
add_test "llama-mtmd-cli" "ggml-org/Qwen2.5-VL-3B-Instruct-GGUF:Q4_K_M"
add_test "llama-mtmd-cli" "ggml-org/InternVL2_5-1B-GGUF:Q8_0"
add_test "llama-mtmd-cli" "ggml-org/InternVL3-1B-Instruct-GGUF:Q8_0"

add_test_audio() {
    local hf=$1
    arr_prefix+=("[audio] ")
    arr_hf+=("$hf")
    arr_tmpl+=("") # no need for chat tmpl
    arr_file+=("test-2.mp3")
}
add_test was removed but is still referenced later; the script will exit with "command not found"
Functions were renamed to add_test_vision / add_test_audio, yet lines 92-94 still invoke the old helper. Running ./tests.sh huge (or big) will therefore fail before any test is executed.
-if [ "$RUN_HUGE_TESTS" = true ]; then
- add_test "llama-mtmd-cli" "ggml-org/Qwen2.5-VL-72B-Instruct-GGUF:Q4_K_M"
- add_test "llama-mtmd-cli" "ggml-org/Llama-4-Scout-17B-16E-Instruct-GGUF:IQ1_S"
+if [ "$RUN_HUGE_TESTS" = true ]; then
+ add_test_vision "ggml-org/Qwen2.5-VL-72B-Instruct-GGUF:Q4_K_M"
+ add_test_vision "ggml-org/Llama-4-Scout-17B-16E-Instruct-GGUF:IQ1_S"
fi
Also consider adding an add_test_huge() wrapper to keep naming symmetrical.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
add_test_vision() {
    local hf=$1
    local tmpl=${2:-""} # default to empty string if not provided
    arr_prefix+=("[vision]")
    arr_hf+=("$hf")
    arr_tmpl+=("$tmpl")
    arr_file+=("test-1.jpeg")
}

add_test "llama-mtmd-cli" "ggml-org/SmolVLM-500M-Instruct-GGUF:Q8_0"
add_test "llama-mtmd-cli" "ggml-org/SmolVLM2-2.2B-Instruct-GGUF:Q4_K_M"
add_test "llama-mtmd-cli" "ggml-org/SmolVLM2-500M-Video-Instruct-GGUF:Q8_0"
add_test "llama-mtmd-cli" "ggml-org/gemma-3-4b-it-GGUF:Q4_K_M"
add_test "llama-mtmd-cli" "THUDM/glm-edge-v-5b-gguf:Q4_K_M"
add_test "llama-mtmd-cli" "second-state/Llava-v1.5-7B-GGUF:Q2_K" "vicuna"
add_test "llama-mtmd-cli" "cjpais/llava-1.6-mistral-7b-gguf:Q3_K_M" "vicuna"
add_test "llama-mtmd-cli" "ibm-research/granite-vision-3.2-2b-GGUF:Q4_K_M"
add_test "llama-mtmd-cli" "second-state/MiniCPM-Llama3-V-2_5-GGUF:Q2_K" # model from openbmb is corrupted
add_test "llama-mtmd-cli" "openbmb/MiniCPM-V-2_6-gguf:Q2_K"
add_test "llama-mtmd-cli" "openbmb/MiniCPM-o-2_6-gguf:Q4_0"
add_test "llama-mtmd-cli" "bartowski/Qwen2-VL-2B-Instruct-GGUF:Q4_K_M"
add_test "llama-mtmd-cli" "ggml-org/Qwen2.5-VL-3B-Instruct-GGUF:Q4_K_M"
add_test "llama-mtmd-cli" "ggml-org/InternVL2_5-1B-GGUF:Q8_0"
add_test "llama-mtmd-cli" "ggml-org/InternVL3-1B-Instruct-GGUF:Q8_0"

add_test_audio() {
    local hf=$1
    arr_prefix+=("[audio] ")
    arr_hf+=("$hf")
    arr_tmpl+=("") # no need for chat tmpl
    arr_file+=("test-2.mp3")
}

if [ "$RUN_HUGE_TESTS" = true ]; then
    add_test_vision "ggml-org/Qwen2.5-VL-72B-Instruct-GGUF:Q4_K_M"
    add_test_vision "ggml-org/Llama-4-Scout-17B-16E-Instruct-GGUF:IQ1_S"
fi
🤖 Prompt for AI Agents
In tools/mtmd/tests.sh between lines 38 and 54, the script still references the
removed add_test function, which causes a command not found error. Update the
script to replace calls to add_test with the new function names add_test_vision
or add_test_audio as appropriate, or add a new add_test_huge wrapper for
consistency. This will prevent script failure before tests run.
Actionable comments posted: 0
🔭 Outside diff range comments (1)
tools/mtmd/tests.sh (1)
129-133: ⚠️ Potential issue
Use correct input flag for audio vs. vision
Currently every test invocation uses --image, even for .mp3 files. Audio inputs should use the proper flag (e.g., --audio). Otherwise the CLI will error out on audio tests.
Example fix:

+ # choose flag based on file extension
+ if [[ "$inp_file" == *.mp3 ]]; then
+     input_flag="--audio"
+ else
+     input_flag="--image"
+ fi
  output=$(\
      "$PROJ_ROOT/build/bin/$bin" \
      -hf "$hf" \
-     --image $SCRIPT_DIR/$inp_file \
+     $input_flag $SCRIPT_DIR/$inp_file \
      -p "what is the publisher name of the newspaper?" \
      --temp 0 -n 128 \
      ${tmpl:+--chat-template "$tmpl"} \
♻️ Duplicate comments (1)
tools/mtmd/tests.sh (1)
96-97: add_test is still referenced for huge tests
The generic add_test helper was removed in prior refactoring. These calls will result in "command not found".
🧹 Nitpick comments (1)
tools/mtmd/tests.sh (1)
47-53: Implement add_test_audio() helper
The audio helper correctly tracks model, template, and MP3 input. For consistency with the vision prefix, consider removing the trailing space in "[audio] ".
Suggested diff:

- arr_prefix+=("[audio] ")
+ arr_prefix+=("[audio]")
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
tools/mtmd/tests.sh (4 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (19)
- GitHub Check: macOS-latest-cmake-visionos
- GitHub Check: android-build
- GitHub Check: ubuntu-22-cmake-vulkan
- GitHub Check: ubuntu-latest-llguidance
- GitHub Check: ubuntu-latest-cmake-rpc
- GitHub Check: ubuntu-22-cmake-sycl
- GitHub Check: ubuntu-latest-cmake-sanitizer (ADDRESS, Debug)
- GitHub Check: ubuntu-latest-cmake-sanitizer (UNDEFINED, Debug)
- GitHub Check: ubuntu-cpu-cmake (x64, ubuntu-22.04)
- GitHub Check: ubuntu-latest-cmake-sanitizer (THREAD, Debug)
- GitHub Check: ubuntu-cpu-cmake (arm64, ubuntu-22.04-arm)
- GitHub Check: macOS-latest-cmake-x64
- GitHub Check: ubuntu-22-cmake-hip
- GitHub Check: macOS-latest-cmake-arm64
- GitHub Check: server-windows
- GitHub Check: server (Release)
- GitHub Check: server (ADDRESS, RelWithDebInfo)
- GitHub Check: server (UNDEFINED, RelWithDebInfo)
- GitHub Check: check-requirements
🔇 Additional comments (7)
tools/mtmd/tests.sh (7)
33-36: Initialize arrays for test metadata
The new arr_prefix and arr_file arrays correctly complement the existing arr_hf and arr_tmpl arrays to track per-test prefixes and input files.
38-45: Implement add_test_vision() helper
This function cleanly appends vision-model entries (HF path, template, default JPEG file, and "[vision]" prefix) to the test arrays.
55-73: Populate initial vision and audio tests
Your series of add_test_vision and add_test_audio invocations comprehensively covers small to mid-sized models. Listing them explicitly aids clarity; all entries look correct.
77-89: Populate BIG-model vision and audio tests
The "big" block adds larger vision and audio tests consistently. Commenting out the known-bad 32B model is a helpful flag for future runs.
111-111: Build the CLI binary before testing
Invoking cmake --build ... --target llama-mtmd-cli ensures the latest changes are compiled. This placement is correct.
115-121: Iterate over registered tests
Looping over arr_hf (and corresponding arr_prefix, arr_tmpl, arr_file) is a clean, data-driven approach. The inp_file assignment is correct.
137-141: Evaluate and format test results
The grep-based check for "new york" and ANSI-colored OK/FAIL strings work as expected. Consider centralizing the success pattern or making it configurable in the future.
Actionable comments posted: 1
🧹 Nitpick comments (2)
tools/mtmd/tests.sh (2)
47-53: Normalize arr_prefix formatting in add_test_audio().
Currently you append "[audio] " (with a trailing space), whereas add_test_vision() uses "[vision]". This causes inconsistent spacing in the final log lines.
Suggestion: remove the trailing space and let the formatting layer insert spaces uniformly.

-arr_prefix+=("[audio] ")
+arr_prefix+=("[audio]")
55-73: Refactor repetitive model registrations.
There's a long list of add_test_vision / add_test_audio calls. Consider DRYing this up by defining an array of model specs (including modality and optional template) and looping over it to call the appropriate helper. This will improve maintainability when adding or removing models.
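One way such a data-driven registration could look; this is only a sketch, the model list is abbreviated, the audio repo is a placeholder, and the helper names are assumed to match the ones already in this script:

# "modality|hf_repo|optional_chat_template"
model_specs=(
    "vision|ggml-org/SmolVLM-500M-Instruct-GGUF:Q8_0|"
    "vision|second-state/Llava-v1.5-7B-GGUF:Q2_K|vicuna"
    "audio|<audio-model-GGUF>|"   # placeholder, substitute a real audio model
)

for spec in "${model_specs[@]}"; do
    IFS='|' read -r modality hf tmpl <<< "$spec"
    if [ "$modality" = "audio" ]; then
        add_test_audio "$hf"
    else
        add_test_vision "$hf" "$tmpl"
    fi
done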
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
tools/mtmd/tests.sh (3 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (19)
- GitHub Check: build-linux-cross / ubuntu-24-riscv64-vulkan-cross
- GitHub Check: build-linux-cross / ubuntu-24-ppc64el-vulkan-cross
- GitHub Check: build-linux-cross / ubuntu-24-riscv64-cpu-cross
- GitHub Check: ubuntu-22-cmake-musa
- GitHub Check: ubuntu-22-cmake-sycl-fp16
- GitHub Check: ubuntu-22-cmake-hip
- GitHub Check: windows-2019-cmake-cuda (12.4)
- GitHub Check: ubuntu-latest-cmake-sanitizer (ADDRESS, Debug)
- GitHub Check: ubuntu-latest-cmake-sanitizer (UNDEFINED, Debug)
- GitHub Check: ubuntu-latest-llguidance
- GitHub Check: ubuntu-latest-cmake-sanitizer (THREAD, Debug)
- GitHub Check: macOS-latest-cmake-arm64
- GitHub Check: ubuntu-cpu-cmake (arm64, ubuntu-22.04-arm)
- GitHub Check: macOS-latest-cmake-x64
- GitHub Check: server-windows
- GitHub Check: server (Release)
- GitHub Check: server (ADDRESS, RelWithDebInfo)
- GitHub Check: server (UNDEFINED, RelWithDebInfo)
- GitHub Check: check-requirements
🔇 Additional comments (4)
tools/mtmd/tests.sh (4)
28-28: Skip echo change.
The updated message clarifies inclusion of both BIG and HUGE models but doesn't introduce any functional change that requires review.
33-36: Array initialization is sound.
You've correctly introduced arr_file alongside the existing arrays to track input filenames. The empty initializations ensure safe appends later in the script.
38-45: add_test_vision() helper is well-structured.
- Uses local for parameters.
- Provides a sensible default for tmpl.
- Appends all four arrays in lockstep, including the new arr_file entry for vision inputs.
137-141: Verify output check works for both modalities.
You grep "new york" to decide pass/fail indiscriminately. Audio tests may produce different expected keywords or confidence markers. Please confirm that "new york" is the right heuristic for audio outputs, or parameterize the success pattern per modality.
@@ -107,7 +126,7 @@ for i in "${!arr_bin[@]}"; do
    output=$(\
        "$PROJ_ROOT/build/bin/$bin" \
        -hf "$hf" \
-       --image $SCRIPT_DIR/test-1.jpeg \
+       --image $SCRIPT_DIR/$inp_file \
Audio inputs are passed to the --image flag; likely a bug.
The invocation always uses --image $SCRIPT_DIR/$inp_file, so audio tests will feed .mp3 files to the image flag. You must branch on modality (or file extension) to use the correct CLI flag, for example --audio for MP3s.
Apply a conditional here, for example:

- --image $SCRIPT_DIR/$inp_file \
+ $( if [[ "${inp_file##*.}" == "mp3" ]]; then
+     echo "--audio $SCRIPT_DIR/$inp_file"
+   else
+     echo "--image $SCRIPT_DIR/$inp_file"
+   fi ) \
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
--image $SCRIPT_DIR/$inp_file \
$( if [[ "${inp_file##*.}" == "mp3" ]]; then
    echo "--audio $SCRIPT_DIR/$inp_file"
else
    echo "--image $SCRIPT_DIR/$inp_file"
fi ) \
🤖 Prompt for AI Agents
In tools/mtmd/tests.sh at line 129, the script always passes audio files to the
--image flag, which is likely a bug since --image is intended for image inputs.
To fix this, add a conditional check based on the file extension or modality to
use --audio for MP3 files and --image for image files. Implement the condition
before the invocation to select the correct flag dynamically.
Make sure to read the contributing guidelines before submitting a PR
Summary by CodeRabbit
New Features
Improvements
Bug Fixes