server: main loop blocked, server stuck #5851

Closed
@phymbert

Description

Context

Calls to the following functions block the main loop, stalling the server for all slots / requests inside the update_slots method:

Global:

  • llama_batch_clear
  • llama_decode
  • llama_kv_cache_seq_cp

Per slot:

  • llama_batch_add
  • llama_kv_cache_seq_rm
  • llama_kv_cache_seq_add
  • llama_kv_cache_seq_div
  • llama_sampling_free
  • llama_sampling_init
  • llama_sampling_accept
  • llama_sampling_reset
  • llama_tokenize

This is noticeable when the prompt is big enough, or when self-extend or continuous batching is enabled.

Proposal

We need to separate slot state management and token retrieval from slot processing, while keeping one batch for the whole server.

Firstly, it should be well tested and reproducible in the server test framework, in a slow test with a real prompt and model (as in the passkey test).

I see 3 options:

  1. We are fine with that; let's wait for the high-level llama API with its own thread pool
  2. Yet another thread pool (in addition to the HTTP request pool), initialized with n_slots, which would call all these functions asynchronously
  3. Use the httplib request thread to call these blocking functions

@ggerganov @ngxson please confirm the list of blocking methods and which ones must be thread-safe (I mean only within the main loop).
I am willing to implement option 2 or 3; assign the issue back to me if you agree.
