Context
Calls to the following functions block the main loop, and the server gets stuck for all slots / requests in the `update_slots` method.
Global:
- `llama_batch_clear`
- `llama_decode`
- `llama_kv_cache_seq_cp`

Per slot:
- `llama_batch_add`
- `llama_kv_cache_seq_rm`
- `llama_kv_cache_seq_add`
- `llama_kv_cache_seq_div`
- `llama_sampling_free`
- `llama_sampling_init`
- `llama_sampling_accept`
- `llama_sampling_reset`
- `llama_tokenize`
This is noticeable if the prompt is big enough, or if self-extend or continuous batching is enabled.
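For illustration, a minimal sketch of the pattern described above (the types and method bodies are simplified stand-ins, not the actual server code): all per-slot work runs sequentially on the single main loop thread, so one slow call stalls everything.

```cpp
#include <vector>

// Hypothetical, simplified stand-ins for the real server types; the point is
// the control flow, not the exact llama.cpp API.
struct server_slot {
    // per-slot state: prompt tokens, cache position, sampling context, ...
};

struct server_context {
    std::vector<server_slot> slots;

    void process_slot(server_slot & slot) {
        // In the real server this is where the blocking calls happen:
        // llama_tokenize, llama_kv_cache_seq_rm/add/div, llama_sampling_*, ...
    }

    void update_slots() {
        for (auto & slot : slots) {
            // Runs synchronously on the main loop thread: while one slot is
            // being processed, every other slot and every pending HTTP
            // request is stalled.
            process_slot(slot);
        }
        // Finally the single shared batch is decoded, which also blocks the loop:
        // llama_decode(ctx, batch);
    }
};
```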
Proposal
We need to separate slot state management and token retrieval from slot processing, while keeping one batch for the whole server.
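A rough sketch of that constraint, assuming the `llama_batch_clear` / `llama_batch_add` helpers from `common.h` (exact signatures may differ across versions, and the slot field names below are illustrative): per-slot work could be made asynchronous, but the shared batch is still filled and decoded in one place.

```cpp
// Sketch only: token retrieval per slot may happen asynchronously, but the
// single server-wide batch is assembled and decoded serially.
llama_batch_clear(batch);
for (auto & slot : slots) {
    // slot.sampled / slot.n_past / slot.id are illustrative field names.
    llama_batch_add(batch, slot.sampled, slot.n_past, { slot.id }, true);
    slot.n_past += 1;
}
if (llama_decode(ctx, batch) != 0) {
    // decode failed; abort the affected slots
}
```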
First, the issue should be well tested and reproducible in the server test framework, in a slow test with a real prompt and model (as in the passkey test).
I see 3 options:
- We are fine with the current behavior; let's wait for the high-level llama API with its own thread pool.
- Yet another thread pool (in addition to the HTTP request pool), initialized with `n_slots` threads, which would call all these functions asynchronously (see the sketch after this list).
- Use the httplib request thread to call these blocking functions.
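For option 2, a minimal sketch of such a pool, under the assumption that it is sized to `n_slots` and owned by the server (this is generic C++, not existing llama.cpp code):

```cpp
#include <condition_variable>
#include <functional>
#include <future>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical pool for option 2: separate from the httplib request pool,
// sized to n_slots, running the blocking per-slot calls off the main loop.
class slot_pool {
public:
    explicit slot_pool(size_t n_slots) {
        for (size_t i = 0; i < n_slots; ++i) {
            workers.emplace_back([this] {
                for (;;) {
                    std::function<void()> task;
                    {
                        std::unique_lock<std::mutex> lock(mtx);
                        cv.wait(lock, [this] { return stop || !tasks.empty(); });
                        if (stop && tasks.empty()) return;
                        task = std::move(tasks.front());
                        tasks.pop();
                    }
                    task();
                }
            });
        }
    }

    ~slot_pool() {
        {
            std::lock_guard<std::mutex> lock(mtx);
            stop = true;
        }
        cv.notify_all();
        for (auto & w : workers) w.join();
    }

    // Submit a blocking call (e.g. llama_tokenize for one slot) and get a
    // future so the main loop can collect the result later.
    template <typename F>
    auto submit(F f) -> std::future<decltype(f())> {
        auto task = std::make_shared<std::packaged_task<decltype(f())()>>(std::move(f));
        {
            std::lock_guard<std::mutex> lock(mtx);
            tasks.emplace([task] { (*task)(); });
        }
        cv.notify_one();
        return task->get_future();
    }

private:
    std::vector<std::thread> workers;
    std::queue<std::function<void()>> tasks;
    std::mutex mtx;
    std::condition_variable cv;
    bool stop = false;
};
```

The main loop would submit the blocking per-slot calls here and only wait on the futures when it needs the results to fill the shared batch. Note, however, that most llama.cpp functions operating on the same `llama_context` are not thread-safe with respect to each other, hence the question below about which functions must be thread-safe.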
@ggerganov @ngxson please confirm the list of blocking methods, and which ones must be thread-safe (I mean only within the main loop).
I am willing to implement option 2 or 3; assign the issue back to me if you agree.