Context
Calls to the following functions block the main loop, and the server gets stuck for all slots / requests in the `update_slots` method.
Global:
- `llama_batch_clear`
- `llama_decode`
- `llama_kv_cache_seq_cp`

Per slot:
- `llama_batch_add`
- `llama_kv_cache_seq_rm`
- `llama_kv_cache_seq_add`
- `llama_kv_cache_seq_div`
- `llama_sampling_free`
- `llama_sampling_init`
- `llama_sampling_accept`
- `llama_sampling_reset`
- `llama_tokenize`
This is noticeable if the prompt is big enough, or if self-extend or continuous batching is enabled.
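For illustration, a minimal sketch of the pattern described above (the types and method bodies are simplified stand-ins, not the actual server code): all per-slot work runs sequentially on the single main loop thread, so one slow call stalls everything.

```cpp
#include <vector>

// Hypothetical, simplified stand-ins for the real server types; the point is
// the control flow, not the exact llama.cpp API.
struct server_slot {
    // per-slot state: prompt tokens, cache position, sampling context, ...
};

struct server_context {
    std::vector<server_slot> slots;

    void process_slot(server_slot & slot) {
        // In the real server this is where the blocking calls happen:
        // llama_tokenize, llama_kv_cache_seq_rm/add/div, llama_sampling_*, ...
    }

    void update_slots() {
        for (auto & slot : slots) {
            // Runs synchronously on the main loop thread: while one slot is
            // being processed, every other slot and every pending HTTP
            // request is stalled.
            process_slot(slot);
        }
        // Finally the single shared batch is decoded, which also blocks the loop:
        // llama_decode(ctx, batch);
    }
};
```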
Proposal
We need to separate slot state management and token retrieval from slot processing, while keeping one batch for the whole server.
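A rough sketch of that constraint, assuming the `llama_batch_clear` / `llama_batch_add` helpers from `common.h` (exact signatures may differ across versions, and the slot field names below are illustrative): per-slot work could be made asynchronous, but the shared batch is still filled and decoded in one place.

```cpp
// Sketch only: token retrieval per slot may happen asynchronously, but the
// single server-wide batch is assembled and decoded serially.
llama_batch_clear(batch);
for (auto & slot : slots) {
    // slot.sampled / slot.n_past / slot.id are illustrative field names.
    llama_batch_add(batch, slot.sampled, slot.n_past, { slot.id }, true);
    slot.n_past += 1;
}
if (llama_decode(ctx, batch) != 0) {
    // decode failed; abort the affected slots
}
```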
First, the issue should be well tested and reproducible in the server test framework, in a slow test with a real prompt and model (as in the passkey test).
I see 3 options:
- We are fine with the current behavior; let's wait for the high-level llama API with its own thread pool.
- Yet another thread pool (in addition to the HTTP request pool), initialized with `n_slots` threads, which would call all these functions asynchronously (see the sketch after this list).
- Use the httplib request thread to call these blocking functions.
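For option 2, a minimal sketch of such a pool, under the assumption that it is sized to `n_slots` and owned by the server (this is generic C++, not existing llama.cpp code):

```cpp
#include <condition_variable>
#include <functional>
#include <future>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical pool for option 2: separate from the httplib request pool,
// sized to n_slots, running the blocking per-slot calls off the main loop.
class slot_pool {
public:
    explicit slot_pool(size_t n_slots) {
        for (size_t i = 0; i < n_slots; ++i) {
            workers.emplace_back([this] {
                for (;;) {
                    std::function<void()> task;
                    {
                        std::unique_lock<std::mutex> lock(mtx);
                        cv.wait(lock, [this] { return stop || !tasks.empty(); });
                        if (stop && tasks.empty()) return;
                        task = std::move(tasks.front());
                        tasks.pop();
                    }
                    task();
                }
            });
        }
    }

    ~slot_pool() {
        {
            std::lock_guard<std::mutex> lock(mtx);
            stop = true;
        }
        cv.notify_all();
        for (auto & w : workers) w.join();
    }

    // Submit a blocking call (e.g. llama_tokenize for one slot) and get a
    // future so the main loop can collect the result later.
    template <typename F>
    auto submit(F f) -> std::future<decltype(f())> {
        auto task = std::make_shared<std::packaged_task<decltype(f())()>>(std::move(f));
        {
            std::lock_guard<std::mutex> lock(mtx);
            tasks.emplace([task] { (*task)(); });
        }
        cv.notify_one();
        return task->get_future();
    }

private:
    std::vector<std::thread> workers;
    std::queue<std::function<void()>> tasks;
    std::mutex mtx;
    std::condition_variable cv;
    bool stop = false;
};
```

The main loop would submit the blocking per-slot calls here and only wait on the futures when it needs the results to fill the shared batch. Note, however, that most llama.cpp functions operating on the same `llama_context` are not thread-safe with respect to each other, hence the question below about which functions must be thread-safe.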
@ggerganov @ngxson please confirm the list of blocking methods, and which ones must be thread-safe (I mean only within the main loop).
I am willing to implement option 2 or 3; assign the issue back to me if you agree.