Description
Name and Version
./build/bin/llama-server --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 3 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 2: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
version: 4338 (7b1ec53f)
built with cc (Ubuntu 12.3.0-1ubuntu1~22.04) 12.3.0 for x86_64-linux-gnu
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
llama-server
Problem description & steps to reproduce
Hello,
Short version:
When using llama-server with only one slot (--threads-http 1 -np 1), you can sequentially send prompts and there is no speed degradation.
When you use multiple slots (the problem starts showing up at 3 slots; it does not appear with 2), generation gets slower and slower after each completed request.
CLI used:
./build/bin/llama-server --host 0.0.0.0 --port 8080 --model /opt/IdExtend/models/llm/Qwen2.5-7B-Instruct-Q4_K_M.gguf --ctx-size 122880 --threads-http 15 -np 15 --tensor-split 1.0,0.0,0.0 -ngl 99999
I also tried:
--cache-reuse 50000 (INEFFECTIVE)
--defrag-thold 0.0 or --defrag-thold 0.99 (INEFFECTIVE)
--model /opt/IdExtend/models/llm/Mistral-7B-Instruct-v0.3.Q8_0.gguf (INEFFECTIVE)
-sm none (INEFFECTIVE)
--flash-attn --cache-type-k q8_0 --cache-type-v q8_0 (INEFFECTIVE; I was using these from the start but decided to reduce to as few args as possible)
Yes, I understand that having multiple slots and using them sequentially is dumb. The thing is, I tried moving my backend from sequential to parallel use (so I had to create slots), but it didn't get any faster. That's why I started tracking down the cause of the issue, and here I am.
Final run:
./build/bin/llama-server --host 0.0.0.0 --port 8080 --model /opt/IdExtend/models/llm/Qwen2.5-7B-Instruct-Q4_K_M.gguf --flash-attn --cache-type-k q8_0 --cache-type-v q8_0 --ctx-size 122880 --threads-http 15 -np 15 --tensor-split 1.0,0.0,0.0 -sm none -ngl 99999
Python script logs:
[0]Time taken: 1.1400415897369385
[1]Time taken: 0.9648196697235107
[2]Time taken: 1.002309799194336
[3]Time taken: 1.353079080581665
[4]Time taken: 0.8274390697479248
[5]Time taken: 1.4006707668304443
[6]Time taken: 1.5088953971862793
[7]Time taken: 2.5358529090881348
[8]Time taken: 1.6904234886169434
[9]Time taken: 2.6186017990112305
[10]Time taken: 2.290717601776123
[11]Time taken: 2.0220725536346436
[12]Time taken: 1.9455785751342773
[13]Time taken: 3.2140021324157715
[14]Time taken: 2.404296636581421
[15]Time taken: 2.5479960441589355
[16]Time taken: 3.0076818466186523
[17]Time taken: 6.665952205657959
TOTAL Time taken for sequential: 39.140857458114624
You can find a zip with the Python script to reproduce it attached.
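For reference, here is a minimal sketch of what the script does (this is not the attached script; the prompt filler, request count, and n_predict value are placeholders): it sends the same long prompt to the server sequentially and times each request.

```python
# Minimal repro sketch: send the same long prompt sequentially and time each
# request. Prompt filler, request count, and n_predict are illustrative only.
import json
import time
import urllib.request

URL = "http://localhost:8080/completion"
PROMPT = "The quick brown fox jumps over the lazy dog. " * 300  # ~2-3k tokens of filler

total_start = time.time()
for i in range(18):
    payload = json.dumps({"prompt": PROMPT, "n_predict": 128}).encode()
    req = urllib.request.Request(
        URL, data=payload, headers={"Content-Type": "application/json"})
    start = time.time()
    with urllib.request.urlopen(req) as resp:
        resp.read()
    print(f"[{i}]Time taken: {time.time() - start}")
print(f"TOTAL Time taken for sequential: {time.time() - total_start}")
```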
Full server logs: server-logs.txt
Cleaned server logs:
prompt eval time = 254.31 ms / 2310 tokens ( 0.11 ms per token, 9083.40 tokens per second)
eval time = 879.65 ms / 97 tokens ( 9.07 ms per token, 110.27 tokens per second)
total time = 1133.96 ms / 2407 tokens
prompt eval time = 261.95 ms / 2343 tokens ( 0.11 ms per token, 8944.49 tokens per second)
eval time = 694.21 ms / 85 tokens ( 8.17 ms per token, 122.44 tokens per second)
total time = 956.16 ms / 2428 tokens
prompt eval time = 284.46 ms / 2285 tokens ( 0.12 ms per token, 8032.76 tokens per second)
eval time = 707.39 ms / 80 tokens ( 8.84 ms per token, 113.09 tokens per second)
total time = 991.85 ms / 2365 tokens
prompt eval time = 409.38 ms / 2924 tokens ( 0.14 ms per token, 7142.46 tokens per second)
eval time = 930.37 ms / 95 tokens ( 9.79 ms per token, 102.11 tokens per second)
total time = 1339.75 ms / 3019 tokens
prompt eval time = 357.83 ms / 2282 tokens ( 0.16 ms per token, 6377.29 tokens per second)
eval time = 454.73 ms / 44 tokens ( 10.33 ms per token, 96.76 tokens per second)
total time = 812.57 ms / 2326 tokens
prompt eval time = 388.00 ms / 2277 tokens ( 0.17 ms per token, 5868.57 tokens per second)
eval time = 996.40 ms / 89 tokens ( 11.20 ms per token, 89.32 tokens per second)
total time = 1384.39 ms / 2366 tokens
prompt eval time = 556.35 ms / 3011 tokens ( 0.18 ms per token, 5412.09 tokens per second)
eval time = 930.15 ms / 76 tokens ( 12.24 ms per token, 81.71 tokens per second)
total time = 1486.50 ms / 3087 tokens
prompt eval time = 618.16 ms / 3027 tokens ( 0.20 ms per token, 4896.82 tokens per second)
eval time = 1890.54 ms / 144 tokens ( 13.13 ms per token, 76.17 tokens per second)
total time = 2508.70 ms / 3171 tokens
prompt eval time = 651.99 ms / 2935 tokens ( 0.22 ms per token, 4501.60 tokens per second)
eval time = 1008.49 ms / 72 tokens ( 14.01 ms per token, 71.39 tokens per second)
total time = 1660.48 ms / 3007 tokens
prompt eval time = 903.68 ms / 2957 tokens ( 0.31 ms per token, 3272.17 tokens per second)
eval time = 1681.54 ms / 112 tokens ( 15.01 ms per token, 66.61 tokens per second)
total time = 2585.22 ms / 3069 tokens
prompt eval time = 805.01 ms / 2965 tokens ( 0.27 ms per token, 3683.17 tokens per second)
eval time = 1447.53 ms / 91 tokens ( 15.91 ms per token, 62.87 tokens per second)
total time = 2252.55 ms / 3056 tokens
prompt eval time = 831.70 ms / 2965 tokens ( 0.28 ms per token, 3564.97 tokens per second)
eval time = 1149.78 ms / 69 tokens ( 16.66 ms per token, 60.01 tokens per second)
total time = 1981.48 ms / 3034 tokens
prompt eval time = 996.94 ms / 2940 tokens ( 0.34 ms per token, 2949.01 tokens per second)
eval time = 905.74 ms / 52 tokens ( 17.42 ms per token, 57.41 tokens per second)
total time = 1902.69 ms / 2992 tokens
prompt eval time = 960.80 ms / 3074 tokens ( 0.31 ms per token, 3199.42 tokens per second)
eval time = 2201.62 ms / 118 tokens ( 18.66 ms per token, 53.60 tokens per second)
total time = 3162.42 ms / 3192 tokens
prompt eval time = 1161.53 ms / 2977 tokens ( 0.39 ms per token, 2562.99 tokens per second)
eval time = 1189.15 ms / 62 tokens ( 19.18 ms per token, 52.14 tokens per second)
total time = 2350.68 ms / 3039 tokens
prompt eval time = 1017.35 ms / 2934 tokens ( 0.35 ms per token, 2883.97 tokens per second)
eval time = 1481.01 ms / 76 tokens ( 19.49 ms per token, 51.32 tokens per second)
total time = 2498.35 ms / 3010 tokens
prompt eval time = 1035.18 ms / 2966 tokens ( 0.35 ms per token, 2865.20 tokens per second)
eval time = 1915.50 ms / 97 tokens ( 19.75 ms per token, 50.64 tokens per second)
total time = 2950.68 ms / 3063 tokens
prompt eval time = 638.59 ms / 1778 tokens ( 0.36 ms per token, 2784.25 tokens per second)
eval time = 5996.03 ms / 303 tokens ( 19.79 ms per token, 50.53 tokens per second)
total time = 6634.62 ms / 2081 tokens
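Side note: the decode-speed trend is easy to pull out of server-logs.txt with a few lines of Python; a small sketch, matching only the "tokens per second" format visible in the lines above:

```python
# Extract the per-request decode speed from server-logs.txt so the slowdown
# is visible at a glance.
import re
import sys

# Matches the trailing "NNN.NN tokens per second" on a timing line.
speed = re.compile(r"([\d.]+) tokens per second")

speeds = []
with open(sys.argv[1]) as f:
    for line in f:
        # Keep only generation ("eval time") lines, skip "prompt eval time".
        if "eval time" in line and "prompt eval" not in line:
            m = speed.search(line)
            if m:
                speeds.append(float(m.group(1)))

for i, s in enumerate(speeds):
    print(f"request {i}: {s:.2f} tok/s")
```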
First Bad Commit
No response
Relevant log output
No response
Edit:
I gave it a try on another machine with this build:
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 3 CUDA devices:
Device 0: GRID A100-40C, compute capability 8.0, VMM: no
Device 1: GRID A100-40C, compute capability 8.0, VMM: no
Device 2: GRID A100-40C, compute capability 8.0, VMM: no
version: 4149 (1bb30bf2)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
The issue persists.
Edit: I'm performing a binary search over builds (a scripted version of the check is sketched after the results below):
-> version: 4149 (1bb30bf2): fail ❌
-> version: 4063 (505f3327): fail ❌
-> version: 4024 (329ed914): fail ❌
-> version: 4016 (42cadc74): fail ❌
-> version: 4015 (45950415): no issue ✔️
-> version: 4012 (7554aa46): no issue ✔️
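For anyone repeating this, the same check could in principle be automated with git bisect run and a small predicate script; a rough sketch, assuming the rebuilt server is restarted between steps and using an arbitrary 2x slowdown threshold (both are assumptions, not what I actually ran):

```python
# Hypothetical predicate for `git bisect run`: exit 0 (good) if sequential
# requests keep a stable latency, exit 1 (bad) if the last request takes more
# than twice as long as the first. Assumes the freshly built llama-server is
# already running on localhost:8080 with -np 15; the 2x threshold is arbitrary.
import json
import sys
import time
import urllib.request

URL = "http://localhost:8080/completion"
PROMPT = "The quick brown fox jumps over the lazy dog. " * 300

def timed_request() -> float:
    payload = json.dumps({"prompt": PROMPT, "n_predict": 64}).encode()
    req = urllib.request.Request(
        URL, data=payload, headers={"Content-Type": "application/json"})
    start = time.time()
    with urllib.request.urlopen(req) as resp:
        resp.read()
    return time.time() - start

durations = [timed_request() for _ in range(10)]
sys.exit(0 if durations[-1] < 2 * durations[0] else 1)
```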
Related PR introducing the issue: #10126
I doubt it CREATED the bug; I think it just revealed an existing one.
The more slots are used, the slower it gets: