Feature Request: Support multimodal LLMs such as Qwen2.5-VL as embedding models

### Prerequisites

- [x] I am running the latest code. Mention the version if possible as well.
- [x] I carefully followed the [README.md](https://github.com/ggml-org/llama.cpp/blob/master/README.md).
- [x] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- [x] I reviewed the [Discussions](https://github.com/ggml-org/llama.cpp/discussions), and have a new and useful enhancement to share.

### Feature Description

llama.cpp should support multimodal models built upon architectures such as Qwen2.5-VL for image and text embeddings.

### Motivation

Multimodal LLMs demonstrate better alignment between image and text embeddings than contrastively trained models such as CLIP, which suffer from a modality gap: a text embedding is often more similar to an unrelated text embedding than to the embedding of a related image.

Nomic's latest vision models are designed for PDF document retrieval. [nomic-embed-multimodal-3b](https://huggingface.co/nomic-ai/nomic-embed-multimodal-3b), which generates a single embedding per rasterized PDF page, is already supported by vLLM because it is compatible with the Qwen2-VL embedding model tested [here](https://github.com/vllm-project/vllm/blob/main/tests/models/multimodal/pooling/test_dse_qwen2_vl.py). It is not yet supported by llama.cpp.

### Possible Implementation

This would build upon #13209, which adds vision support for Qwen2.5-VL. Also relevant is #12898, which brings vision to the llama.cpp server and would make the embeddings useful in practice, since a single embedding generated via `llama-embedding` or similar is of little use on its own; retrieval needs many embeddings to compare against each other.
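
To illustrate why retrieval needs many embeddings rather than one, here is a minimal sketch of ranking page embeddings against a query embedding by cosine similarity. It does not call llama.cpp or any model: the vectors are made-up toy values, and `cosine` and `rank_pages` are hypothetical helper names used only for this example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_pages(query_embedding, page_embeddings):
    """Return (page_id, score) pairs sorted by descending similarity."""
    scored = [
        (page_id, cosine(query_embedding, emb))
        for page_id, emb in page_embeddings.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy 3-dimensional vectors standing in for real model output (assumption).
# In practice there would be one embedding per rasterized PDF page.
pages = {
    "page-1": [0.9, 0.1, 0.0],
    "page-2": [0.1, 0.9, 0.2],
    "page-3": [0.0, 0.2, 0.9],
}
query = [0.2, 0.8, 0.1]  # stand-in for a text query embedding

ranking = rank_pages(query, pages)
for page_id, score in ranking:
    print(f"{page_id}: {score:.3f}")
```

With a server endpoint that accepts images, the page embeddings could be computed once and stored, and only the query would need to be embedded at search time.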
