
server : pad small embedding batches #13692


Merged

ggerganov merged 1 commit into master from gg/server-fix-pooling-small-batches on May 22, 2025

Conversation

ggerganov (Member)

fix #13689

Temporary workaround until the batching logic in libllama is improved.
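For context, the assert referenced in #13689 fails because, with pooling_type == MEAN, libllama requires every seq_id in a batch to be smaller than the batch's n_tokens, and a small embedding batch carrying a high sequence id violates that. Below is a minimal, library-free sketch of the padding idea in Python; the function name, the padding placement, and the choice of padding sequence are illustrative assumptions, not the actual server patch:

```python
def pad_embedding_batch(tokens: list[int], seq_ids: list[int],
                        pad_token: int = 0, pad_seq_id: int = 0) -> None:
    """Grow the batch in place until max(seq_id) < n_tokens holds.

    Illustrative sketch: with MEAN pooling, libllama indexes its pooling
    matrix by seq_id, so every seq_id must be smaller than n_tokens.
    pad_seq_id is assumed to be a sequence carrying no real embedding
    request, so the dummy tokens do not skew any pooled mean.
    """
    while len(tokens) <= max(seq_ids):
        tokens.append(pad_token)    # dummy token whose output is never read
        seq_ids.append(pad_seq_id)  # assumed-unused sequence

# example: a single-token request on sequence 4 forces the batch up to 5 tokens
tokens, seq_ids = [101], [4]
pad_embedding_batch(tokens, seq_ids)
assert max(seq_ids) < len(tokens)  # the MEAN-pooling invariant now holds
```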

@aviallon (Contributor) commented May 22, 2025

It works very well. There is only one remaining issue (which may be on my side): when chunking the input with HF Tokenizers to ensure we feed at most n_ubatch/n_batch/n_ctx_per_seq tokens to the embedding model, the actual input is always about 1 to 3 tokens larger than predicted.
For instance, if HF Tokenizers predicts an input of 512 tokens, llama.cpp may well count it as 513 tokens and return an error.

For now, I worked around this by simply adding a safety margin of 3 tokens.
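For reference, a hedged sketch of sizing that margin exactly: Hugging Face tokenizers expose num_special_tokens_to_add(), which returns the per-model count of automatically added special tokens, so it can replace a hard-coded margin of 3. The model name below is just an example:

```python
from transformers import AutoTokenizer

# example model; any embedding model with CLS/SEP-style specials behaves similarly
tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-small-en-v1.5")

N_UBATCH = 512                                   # server-side limit each chunk must respect
margin = tokenizer.num_special_tokens_to_add()   # exact count of auto-added special tokens
budget = N_UBATCH - margin                       # real-text tokens allowed per chunk

text = "some long document to embed ... " * 200
ids = tokenizer(text, add_special_tokens=False)["input_ids"]

# each chunk leaves room for the special tokens llama-server adds back
chunks = [ids[i:i + budget] for i in range(0, len(ids), budget)]
assert all(len(c) + margin <= N_UBATCH for c in chunks)
```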

@ggerganov (Member, PR author)

Most likely, when you tokenize in HF transformers you don't take into account special tokens such as BOS, EOS, CLS, etc. These are model-specific and are added automatically by llama-server.

It would still be nice to track down the root cause, though; it's also possible that we are doing something wrong.
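This is easy to verify on the HF side; a quick check (model name is just an example) shows the same text growing by the special-token count once add_special_tokens is enabled, which matches the 1-to-3 token discrepancy described above:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-small-en-v1.5")  # example model

text = "count what the model actually sees"
plain = tokenizer(text, add_special_tokens=False)["input_ids"]
full = tokenizer(text, add_special_tokens=True)["input_ids"]  # closer to what llama-server feeds the model

print(len(plain), len(full))  # e.g. N vs N + 2 for a [CLS] ... [SEP] model
```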

ggerganov merged commit cc74d5b into master on May 22, 2025. 53 checks passed.
ggerganov deleted the gg/server-fix-pooling-small-batches branch on May 22, 2025 at 13:33.
@aviallon (Contributor) commented May 22, 2025

@ggerganov I believe you are right about the cause; I'll experiment with that. Thank you very much for the impressively quick fix.

Successfully merging this pull request may close these issues:

GGML_ASSERT(seq_id < n_tokens && "seq_id cannot be larger than n_tokens with pooling_type == MEAN") failed (#13689)