
Feature Request: Support multimodal LLMs such as Qwen2.5-VL as embedding models #13247

Open
@cebtenzzre

Description


Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

llama.cpp should support multimodal models built upon architectures such as Qwen2.5-VL for image and text embeddings.

Motivation

Multimodal LLMs demonstrate better alignment between image and text embeddings than contrastively trained models such as CLIP, which suffer from a modality gap: a text query often scores higher against unrelated text than against a related image.

Nomic's latest vision models are designed for PDF document retrieval. nomic-embed-multimodal-3b, which generates a single embedding per rasterized PDF page, is already supported by vLLM as it is compatible with the Qwen2-VL embedding model tested here. It is not yet supported by llama.cpp.
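For concreteness, here is a minimal sketch of the single-vector retrieval flow such a model enables. `embed_text` and `embed_page_image` are hypothetical stand-ins for whatever API ends up exposing the embeddings (vLLM today, llama.cpp once this lands); the random vectors just keep the sketch runnable, and the ranking logic is the point.

```python
import numpy as np

# Placeholder embedding backends; replace with real calls to a multimodal
# embedding model such as nomic-embed-multimodal-3b.
RNG = np.random.default_rng(0)
DIM = 2048  # illustrative dimensionality, not the model's actual size

def embed_page_image(png_bytes: bytes) -> np.ndarray:
    return RNG.standard_normal(DIM)  # stand-in for the real image embedding

def embed_text(query: str) -> np.ndarray:
    return RNG.standard_normal(DIM)  # stand-in for the real text embedding

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_pages(query: str, page_images: list[bytes]) -> list[tuple[int, float]]:
    """Rank rasterized PDF pages against a text query.

    One embedding per page, no OCR or text chunking: the multimodal model
    maps the query and the page images into the same vector space, so a
    plain cosine-similarity ranking is enough.
    """
    q = embed_text(query)
    scores = [(i, cosine(q, embed_page_image(img))) for i, img in enumerate(page_images)]
    return sorted(scores, key=lambda s: s[1], reverse=True)
```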

Possible Implementation

This would build upon #13209, which adds vision support for Qwen2.5-VL. Also relevant is #12898, which brings vision to the llama.cpp server; that is what would make the embeddings useful in practice, since a single embedding produced by llama-embedding or similar is of little use on its own.
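Once server support along the lines of #12898 exists, client-side usage might look roughly like the sketch below. The text call matches the OpenAI-compatible /v1/embeddings endpoint llama-server already exposes for text embedding models; the image payload shape is purely illustrative and not an existing API, since accepting image inputs for embeddings is exactly what this issue requests.

```python
import base64
import requests

SERVER = "http://localhost:8080"  # llama-server started with --embedding

def get_embedding(payload: dict) -> list[float]:
    # OpenAI-compatible embeddings endpoint; works for text today.
    r = requests.post(f"{SERVER}/v1/embeddings", json=payload)
    r.raise_for_status()
    return r.json()["data"][0]["embedding"]

# Text query embedding: possible with current text embedding models.
query_emb = get_embedding({"input": "quarterly revenue by region"})

# Rasterized PDF page embedding: hypothetical payload shape, shown only to
# illustrate the goal; llama.cpp does not accept image inputs here yet.
with open("page_0001.png", "rb") as f:
    page_b64 = base64.b64encode(f.read()).decode()
page_emb = get_embedding({"input": f"data:image/png;base64,{page_b64}"})
```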
