
server : pad small embedding batches #13692


Merged

ggerganov merged 1 commit into master from gg/server-fix-pooling-small-batches on May 22, 2025

Conversation

ggerganov (Member)

fix #13689

Temporary workaround until the batching logic in libllama is improved.
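For context, the assert referenced in #13689 fails because, with pooling_type == MEAN, libllama requires every seq_id in a batch to be smaller than the batch's n_tokens, and a small embedding batch carrying a high sequence id violates that. Below is a minimal, library-free sketch of the padding idea in Python; the function name, the padding placement, and the choice of padding sequence are illustrative assumptions, not the actual server patch:

```python
def pad_embedding_batch(tokens: list[int], seq_ids: list[int],
                        pad_token: int = 0, pad_seq_id: int = 0) -> None:
    """Grow the batch in place until max(seq_id) < n_tokens holds.

    Illustrative sketch: with MEAN pooling, libllama indexes its pooling
    matrix by seq_id, so every seq_id must be smaller than n_tokens.
    pad_seq_id is assumed to be a sequence carrying no real embedding
    request, so the dummy tokens do not skew any pooled mean.
    """
    while len(tokens) <= max(seq_ids):
        tokens.append(pad_token)    # dummy token whose output is never read
        seq_ids.append(pad_seq_id)  # assumed-unused sequence

# example: a single-token request on sequence 4 forces the batch up to 5 tokens
tokens, seq_ids = [101], [4]
pad_embedding_batch(tokens, seq_ids)
assert max(seq_ids) < len(tokens)  # the MEAN-pooling invariant now holds
```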

@aviallon (Contributor) commented May 22, 2025

It works very well. There is only one remaining issue (which may be on my side): when chunking the input with HF Tokenizers to ensure we feed at most n_ubatch/n_batch/n_ctx_per_seq tokens to the embedding model, the actual input is always about 1 to 3 tokens larger than predicted.
For instance, if HF Tokenizers predicts an input of 512 tokens, llama.cpp may well count it as 513 tokens and return an error.

For now, I worked around this by simply adding a safety margin of 3 tokens.
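For reference, a hedged sketch of sizing that margin exactly: Hugging Face tokenizers expose num_special_tokens_to_add(), which returns the per-model count of automatically added special tokens, so it can replace a hard-coded margin of 3. The model name below is just an example:

```python
from transformers import AutoTokenizer

# example model; any embedding model with CLS/SEP-style specials behaves similarly
tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-small-en-v1.5")

N_UBATCH = 512                                   # server-side limit each chunk must respect
margin = tokenizer.num_special_tokens_to_add()   # exact count of auto-added special tokens
budget = N_UBATCH - margin                       # real-text tokens allowed per chunk

text = "some long document to embed ... " * 200
ids = tokenizer(text, add_special_tokens=False)["input_ids"]

# each chunk leaves room for the special tokens llama-server adds back
chunks = [ids[i:i + budget] for i in range(0, len(ids), budget)]
assert all(len(c) + margin <= N_UBATCH for c in chunks)
```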

@ggerganov (Member, PR author)

Most likely, when you tokenize in HF transformers you don't take into account special tokens such as BOS, EOS, CLS, etc. These are model-specific and are added automatically by llama-server.

It would still be nice to track down the root cause, though; it's also possible that we are doing something wrong.
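This is easy to verify on the HF side; a quick check (model name is just an example) shows the same text growing by the special-token count once add_special_tokens is enabled, which matches the 1-to-3 token discrepancy described above:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-small-en-v1.5")  # example model

text = "count what the model actually sees"
plain = tokenizer(text, add_special_tokens=False)["input_ids"]
full = tokenizer(text, add_special_tokens=True)["input_ids"]  # closer to what llama-server feeds the model

print(len(plain), len(full))  # e.g. N vs N + 2 for a [CLS] ... [SEP] model
```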

ggerganov merged commit cc74d5b into master on May 22, 2025. 53 checks passed.
ggerganov deleted the gg/server-fix-pooling-small-batches branch on May 22, 2025 at 13:33.
@aviallon (Contributor) commented May 22, 2025

@ggerganov I believe you are right about the cause; I'll experiment with that. Thank you very much for the impressively quick fix.

Successfully merging this pull request may close these issues:

GGML_ASSERT(seq_id < n_tokens && "seq_id cannot be larger than n_tokens with pooling_type == MEAN") failed (#13689)