Description
Prerequisites
Please answer the following questions for yourself before submitting an issue.
- I am running the latest code. Development is very rapid so there are no tagged versions as of now.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new bug or useful enhancement to share.
Expected Behavior
The example in #4232 should work: when the request carries an array of prompts, the server returns a completion for each prompt.
Current Behavior
Running the example in #4232 hangs: the server processes both prompts (see the log below), but no response ever reaches the client.
$ ./server -m models/mistral-7b-instruct-v0.2.Q8_0.gguf -c 32768 -t 1 -ngl 1 -np 2
{"timestamp":1703215447,"level":"INFO","function":"main","line":2668,"message":"build info","build":1680,"commit":"afefa319"}
{"timestamp":1703215447,"level":"INFO","function":"main","line":2675,"message":"system info","n_threads":1,"n_threads_batch":-1,"total_threads":8,"system_info":"AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | "}
llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from models/mistral-7b-instruct-v0.2.Q8_0.gguf (version GGUF V3 (latest))
[... omit ...]
Available slots:
-> Slot 0 - max context: 16384
-> Slot 1 - max context: 16384
llama server listening at http://127.0.0.1:8080
{"timestamp":1703215448,"level":"INFO","function":"main","line":3097,"message":"HTTP server listening","port":"8080","hostname":"127.0.0.1"}
all slots are idle and system prompt is empty, clear the KV cache
slot 0 is processing [task id: 2]
slot 1 is processing [task id: 3]
slot 0 : kv cache rm - [0, end)
slot 1 : kv cache rm - [0, end)
print_timings: prompt eval time = 888.72 ms / 17 tokens ( 52.28 ms per token, 19.13 tokens per second)
print_timings: eval time = 16917.36 ms / 85 runs ( 199.03 ms per token, 5.02 tokens per second)
print_timings: total time = 17806.08 ms
slot 0 released (103 tokens in cache)
print_timings: prompt eval time = 888.64 ms / 16 tokens ( 55.54 ms per token, 18.01 tokens per second)
print_timings: eval time = 19226.04 ms / 111 runs ( 173.21 ms per token, 5.77 tokens per second)
print_timings: total time = 20114.68 ms
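As an aside, the per-slot context in the log looks consistent with -c being split evenly across the -np parallel slots (assuming that is how the server divides it), so the slot sizing itself seems fine:

$ echo $((32768 / 2))   # -c divided across -np slots
16384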
On the client side, this is the example from #4232; the request just hangs with nothing coming back, even though the timings above show both slots finished generating.
$ curl --request POST --url http://localhost:8080/completion --header "Content-Type: application/json" --data '{"prompt": ["<s>[INST] What is the capital of the US? [/INST]", "<s>[INST] What is the capital of France? [/INST]"], "n_predict": 2048}'
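For comparison, here is an untested sketch that sends the same two prompts as separate single-prompt /completion requests in parallel (the pre-#4232 form of the API), with a client-side timeout so a stalled request cannot hang the shell:

$ for q in "What is the capital of the US?" "What is the capital of France?"; do
    # one classic single-prompt request per question, run in the background
    curl --max-time 60 --request POST --url http://localhost:8080/completion \
      --header "Content-Type: application/json" \
      --data "{\"prompt\": \"<s>[INST] $q [/INST]\", \"n_predict\": 2048}" &
  done; wait

If those return normally, the hang is specific to the prompt-array path added in #4232.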