
Misc. bug: [SERVER] Multiple slots, generation speed is degraded after each generation/slot used #10860


Description

@ExtReMLapin

Name and Version

 ./build/bin/llama-server --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 3 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
  Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
  Device 2: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
version: 4338 (7b1ec53f)
built with cc (Ubuntu 12.3.0-1ubuntu1~22.04) 12.3.0 for x86_64-linux-gnu

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

llama-server

Problem description & steps to reproduce

Hello,

Short version:

When using llama-server with only one slot (--threads-http 1 -np 1), you can send prompts sequentially and there is no speed degradation.

When using multiple slots, generation gets slower and slower after each finished generation (the degradation starts showing up at 3 slots; it does not show up with 2).

CLI used:

./build/bin/llama-server --host 0.0.0.0 --port 8080 --model /opt/IdExtend/models/llm/Qwen2.5-7B-Instruct-Q4_K_M.gguf --ctx-size 122880 --threads-http 15 -np 15 --tensor-split 1.0,0.0,0.0 -ngl 99999

I also tried:

  • --cache-reuse 50000 (ineffective)
  • --defrag-thold 0.0 or --defrag-thold 0.99 (ineffective)
  • --model /opt/IdExtend/models/llm/Mistral-7B-Instruct-v0.3.Q8_0.gguf (ineffective)
  • -sm none (ineffective)
  • --flash-attn --cache-type-k q8_0 --cache-type-v q8_0 (ineffective; I had been using these from the start but decided to reduce to as few args as possible)

Yes, I understand that having multiple slots and using them sequentially is dumb. The issue is that I tried moving my backend from sequential to parallel use (which is why I had to create slots), but it did not get any faster, so I tried to track down the cause, and here I am.
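
For context, the parallel use I was aiming for looks roughly like this (a sketch, not my actual backend code; it assumes the requests library and the server's OpenAI-compatible /v1/chat/completions endpoint, and the prompts are illustrative):

from concurrent.futures import ThreadPoolExecutor
import requests

URL = "http://localhost:8080/v1/chat/completions"  # matches --host/--port above

def ask(prompt):
    # Each in-flight request should be picked up by a free server slot.
    r = requests.post(URL, json={
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    })
    r.raise_for_status()
    return r.json()["choices"][0]["message"]["content"]

# Illustrative prompts; the real ones (~2-3k tokens each) are in the attached zip.
prompts = [f"Summarize document {i}" for i in range(15)]

# With -np 15 and --threads-http 15 the server can serve up to 15
# requests concurrently, one per slot.
with ThreadPoolExecutor(max_workers=15) as pool:
    results = list(pool.map(ask, prompts))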

Final run:

./build/bin/llama-server --host 0.0.0.0 --port 8080 --model /opt/IdExtend/models/llm/Qwen2.5-7B-Instruct-Q4_K_M.gguf --flash-attn --cache-type-k q8_0 --cache-type-v q8_0 --ctx-size 122880 --threads-http 15 -np 15 --tensor-split 1.0,0.0,0.0 -sm none -ngl 99999

Python script logs:

[0]Time taken:  1.1400415897369385
[1]Time taken:  0.9648196697235107
[2]Time taken:  1.002309799194336
[3]Time taken:  1.353079080581665
[4]Time taken:  0.8274390697479248
[5]Time taken:  1.4006707668304443
[6]Time taken:  1.5088953971862793
[7]Time taken:  2.5358529090881348
[8]Time taken:  1.6904234886169434
[9]Time taken:  2.6186017990112305
[10]Time taken:  2.290717601776123
[11]Time taken:  2.0220725536346436
[12]Time taken:  1.9455785751342773
[13]Time taken:  3.2140021324157715
[14]Time taken:  2.404296636581421
[15]Time taken:  2.5479960441589355
[16]Time taken:  3.0076818466186523
[17]Time taken:  6.665952205657959
TOTAL Time taken for sequential:  39.140857458114624

You can find a zip with the Python script to reproduce the issue attached:

responses.zip
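
For reference, the timing loop is essentially the following (a minimal sketch of the attached script, assuming the requests library and the OpenAI-compatible endpoint; the placeholder prompt stands in for the real ~2-3k token prompts in the zip):

import time
import requests

URL = "http://localhost:8080/v1/chat/completions"  # matches --host/--port above
PROMPT = "lorem ipsum " * 1000  # hypothetical placeholder for the real long prompts

start = time.time()
for i in range(18):
    t0 = time.time()
    # Requests are sent strictly one after another; with -np >= 3 each
    # completed generation makes the next one slower.
    r = requests.post(URL, json={
        "messages": [{"role": "user", "content": PROMPT}],
        "max_tokens": 128,
    })
    r.raise_for_status()
    print(f"[{i}]Time taken: ", time.time() - t0)
print("TOTAL Time taken for sequential: ", time.time() - start)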

Full server logs: server-logs.txt

Cleaned server logs:

prompt eval time =     254.31 ms /  2310 tokens (    0.11 ms per token,  9083.40 tokens per second)
       eval time =     879.65 ms /    97 tokens (    9.07 ms per token,   110.27 tokens per second)
      total time =    1133.96 ms /  2407 tokens
prompt eval time =     261.95 ms /  2343 tokens (    0.11 ms per token,  8944.49 tokens per second)
       eval time =     694.21 ms /    85 tokens (    8.17 ms per token,   122.44 tokens per second)
      total time =     956.16 ms /  2428 tokens
prompt eval time =     284.46 ms /  2285 tokens (    0.12 ms per token,  8032.76 tokens per second)
       eval time =     707.39 ms /    80 tokens (    8.84 ms per token,   113.09 tokens per second)
      total time =     991.85 ms /  2365 tokens
prompt eval time =     409.38 ms /  2924 tokens (    0.14 ms per token,  7142.46 tokens per second)
       eval time =     930.37 ms /    95 tokens (    9.79 ms per token,   102.11 tokens per second)
      total time =    1339.75 ms /  3019 tokens
prompt eval time =     357.83 ms /  2282 tokens (    0.16 ms per token,  6377.29 tokens per second)
       eval time =     454.73 ms /    44 tokens (   10.33 ms per token,    96.76 tokens per second)
      total time =     812.57 ms /  2326 tokens
prompt eval time =     388.00 ms /  2277 tokens (    0.17 ms per token,  5868.57 tokens per second)
       eval time =     996.40 ms /    89 tokens (   11.20 ms per token,    89.32 tokens per second)
      total time =    1384.39 ms /  2366 tokens
prompt eval time =     556.35 ms /  3011 tokens (    0.18 ms per token,  5412.09 tokens per second)
       eval time =     930.15 ms /    76 tokens (   12.24 ms per token,    81.71 tokens per second)
      total time =    1486.50 ms /  3087 tokens
prompt eval time =     618.16 ms /  3027 tokens (    0.20 ms per token,  4896.82 tokens per second)
       eval time =    1890.54 ms /   144 tokens (   13.13 ms per token,    76.17 tokens per second)
      total time =    2508.70 ms /  3171 tokens
prompt eval time =     651.99 ms /  2935 tokens (    0.22 ms per token,  4501.60 tokens per second)
       eval time =    1008.49 ms /    72 tokens (   14.01 ms per token,    71.39 tokens per second)
      total time =    1660.48 ms /  3007 tokens
prompt eval time =     903.68 ms /  2957 tokens (    0.31 ms per token,  3272.17 tokens per second)
       eval time =    1681.54 ms /   112 tokens (   15.01 ms per token,    66.61 tokens per second)
      total time =    2585.22 ms /  3069 tokens
prompt eval time =     805.01 ms /  2965 tokens (    0.27 ms per token,  3683.17 tokens per second)
       eval time =    1447.53 ms /    91 tokens (   15.91 ms per token,    62.87 tokens per second)
      total time =    2252.55 ms /  3056 tokens
prompt eval time =     831.70 ms /  2965 tokens (    0.28 ms per token,  3564.97 tokens per second)
       eval time =    1149.78 ms /    69 tokens (   16.66 ms per token,    60.01 tokens per second)
      total time =    1981.48 ms /  3034 tokens
prompt eval time =     996.94 ms /  2940 tokens (    0.34 ms per token,  2949.01 tokens per second)
       eval time =     905.74 ms /    52 tokens (   17.42 ms per token,    57.41 tokens per second)
      total time =    1902.69 ms /  2992 tokens
prompt eval time =     960.80 ms /  3074 tokens (    0.31 ms per token,  3199.42 tokens per second)
       eval time =    2201.62 ms /   118 tokens (   18.66 ms per token,    53.60 tokens per second)
      total time =    3162.42 ms /  3192 tokens
prompt eval time =    1161.53 ms /  2977 tokens (    0.39 ms per token,  2562.99 tokens per second)
       eval time =    1189.15 ms /    62 tokens (   19.18 ms per token,    52.14 tokens per second)
      total time =    2350.68 ms /  3039 tokens
prompt eval time =    1017.35 ms /  2934 tokens (    0.35 ms per token,  2883.97 tokens per second)
       eval time =    1481.01 ms /    76 tokens (   19.49 ms per token,    51.32 tokens per second)
      total time =    2498.35 ms /  3010 tokens
prompt eval time =    1035.18 ms /  2966 tokens (    0.35 ms per token,  2865.20 tokens per second)
       eval time =    1915.50 ms /    97 tokens (   19.75 ms per token,    50.64 tokens per second)
      total time =    2950.68 ms /  3063 tokens
prompt eval time =     638.59 ms /  1778 tokens (    0.36 ms per token,  2784.25 tokens per second)
       eval time =    5996.03 ms /   303 tokens (   19.79 ms per token,    50.53 tokens per second)
      total time =    6634.62 ms /  2081 tokens

First Bad Commit

No response

Relevant log output

No response

Edit:

I gave it a try on another machine with this build:

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 3 CUDA devices:
  Device 0: GRID A100-40C, compute capability 8.0, VMM: no
  Device 1: GRID A100-40C, compute capability 8.0, VMM: no
  Device 2: GRID A100-40C, compute capability 8.0, VMM: no
version: 4149 (1bb30bf2)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu

The issue persists.

Edit: I'm performing a binary search:

-> version: 4149 (1bb30bf2) fail ❌

-> version: 4063 (505f3327) fail ❌

-> version: 4024 (329ed914) fail ❌

-> version: 4016 (42cadc74) fail ❌

-> version: 4015 (45950415) no issue ✔️

-> version: 4012 (7554aa46) no issue ✔️

Related PR that introduced the issue: #10126

I doubt it CREATED the bug; I think it just revealed an existing bug.

The more slots are used, the slower it gets:

[two screenshots: charts of generation speed degrading as more slots are used]
