single client multi-prompt hangs on server #4583

Closed
@jxy

Description

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

The example in #4232 should work: a single /completion request carrying an array of prompts should return a completion for each prompt.

Current Behavior

The example in #4232 hangs: the server processes both prompts (see the timings in the log below), but the response never reaches the client.

$ ./server -m models/mistral-7b-instruct-v0.2.Q8_0.gguf -c 32768 -t 1 -ngl 1 -np 2                                                                                                                                                                     
{"timestamp":1703215447,"level":"INFO","function":"main","line":2668,"message":"build info","build":1680,"commit":"afefa319"}
{"timestamp":1703215447,"level":"INFO","function":"main","line":2675,"message":"system info","n_threads":1,"n_threads_batch":-1,"total_threads":8,"system_info":"AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | "}
llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from models/mistral-7b-instruct-v0.2.Q8_0.gguf (version GGUF V3 (latest))
[... model loading output omitted ...]
Available slots:
 -> Slot 0 - max context: 16384
 -> Slot 1 - max context: 16384

llama server listening at http://127.0.0.1:8080

{"timestamp":1703215448,"level":"INFO","function":"main","line":3097,"message":"HTTP server listening","port":"8080","hostname":"127.0.0.1"}
all slots are idle and system prompt is empty, clear the KV cache
slot 0 is processing [task id: 2]
slot 1 is processing [task id: 3]
slot 0 : kv cache rm - [0, end)
slot 1 : kv cache rm - [0, end)

print_timings: prompt eval time =     888.72 ms /    17 tokens (   52.28 ms per token,    19.13 tokens per second)
print_timings:        eval time =   16917.36 ms /    85 runs   (  199.03 ms per token,     5.02 tokens per second)
print_timings:       total time =   17806.08 ms
slot 0 released (103 tokens in cache)

print_timings: prompt eval time =     888.64 ms /    16 tokens (   55.54 ms per token,    18.01 tokens per second)
print_timings:        eval time =   19226.04 ms /   111 runs   (  173.21 ms per token,     5.77 tokens per second)
print_timings:       total time =   20114.68 ms
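
(For reference, the per-slot context shown under "Available slots" follows from the launch flags, assuming the server splits -c evenly across the -np slots: 32768 / 2 = 16384 tokens per slot.)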

On the client side, the request is the example from #4232, but nothing ever comes back:

$  curl --request POST --url http://localhost:8080/completion --header "Content-Type: application/json" --data '{"prompt": ["<s>[INST] What is the capital of the US? [/INST]", "<s>[INST] What is the capital of France? [/INST]"], "n_predict": 2048}'
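
As a point of comparison (a sketch, assuming single-prompt requests are unaffected by this bug), the same two prompts can be sent as separate concurrent /completion requests, which the two server slots started with -np 2 should be able to serve in parallel:

$ curl --request POST --url http://localhost:8080/completion --header "Content-Type: application/json" --data '{"prompt": "<s>[INST] What is the capital of the US? [/INST]", "n_predict": 2048}' &
$ curl --request POST --url http://localhost:8080/completion --header "Content-Type: application/json" --data '{"prompt": "<s>[INST] What is the capital of France? [/INST]", "n_predict": 2048}' &
$ wait

Each request here carries a single string prompt instead of an array, so each response is delivered independently; only the array form appears to trigger the hang.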
