Description
Name and Version
b3990 (affected) / b3989 (last good)
Operating systems
Linux
GGML backends
CUDA
Hardware
NVIDIA RTX 3090 + 3x Tesla P40, full offload
Models
Meta-Llama-3.3-70B-Instruct-Q6_K.gguf (also reproduced with multiple other models).
Problem description & steps to reproduce
After updating to b3990, generation speed at long context degrades much faster than it did on b3989 when using --split-mode row; a sketch for reproducing the comparison is included after the numbers below.
Inference on affected build (b3990, -sm row):
llama_perf_sampler_print: sampling time = 3849.30 ms / 19008 runs ( 0.20 ms per token, 4938.04 tokens per second)
llama_perf_context_print: load time = 10529.10 ms
llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: eval time = 185789.79 ms / 454 runs ( 409.23 ms per token, 2.44 tokens per second)
llama_perf_context_print: total time = 192330.87 ms / 455 tokens
Inference on unaffected build (b3989, -sm row):
llama_perf_context_print: load time = 10386.38 ms
llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: eval time = 72821.88 ms / 454 runs ( 160.40 ms per token, 6.23 tokens per second)
llama_perf_context_print: total time = 79340.13 ms / 455 tokens
Inference with --split-mode layer is the same on both commits:
llama_perf_sampler_print: sampling time = 3848.68 ms / 19008 runs ( 0.20 ms per token, 4938.84 tokens per second)
llama_perf_context_print: load time = 10364.01 ms
llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: eval time = 228173.97 ms / 454 runs ( 502.59 ms per token, 1.99 tokens per second)
llama_perf_context_print: total time = 234689.68 ms / 455 tokens
llama_perf_sampler_print: sampling time = 3792.58 ms / 19008 runs ( 0.20 ms per token, 5011.89 tokens per second)
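For reference, here is a rough, untested sketch of how the two builds can be compared side by side. It assumes b3989/b3990 exist as release tags in the llama.cpp checkout, a working CUDA toolchain, and the model/prompt paths from this report; it is not the exact procedure used for the numbers above.

```shell
#!/usr/bin/env bash
# Sketch only: build both tags and compare row-split decode speed.
set -e
for tag in b3989 b3990; do
  git checkout "$tag"
  cmake -B "build-$tag" -DGGML_CUDA=ON
  cmake --build "build-$tag" --config Release -j
  # Same command as in the log output below, minus --prompt-cache,
  # so each build evaluates the prompt fresh.
  "./build-$tag/bin/llama-cli" \
      --n-gpu-layers 99 -ts '30,40,40,40' --main_gpu 0 --temp 1.5 \
      --file prompt_random_long --ctx-size 19000 \
      --model models/memory/llama3.3-q6_k.gguf --seed 11111111111 \
      -sm row 2>&1 | tee "run-$tag.log" | grep llama_perf
done
```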
First Bad Commit
b3990
Relevant log output
llama-cli --n-gpu-layers 99 -ts '30,40,40,40' --main_gpu 0 --temp 1.5 --file prompt_random_long --ctx-size 19000 --model models/memory/llama3.3-q6_k.gguf --seed 11111111111 -sm row --prompt-cache models/memory/pc19000.tmp
llama-cli --n-gpu-layers 99 -ts '30,40,40,40' --main_gpu 0 --temp 1.5 --file prompt_random_long --ctx-size 19000 --model models/memory/llama3.3-q6_k.gguf --seed 11111111111 -sm layer --prompt-cache models/memory/pc19000.tmp
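The two commands above can also be wrapped in a small loop that keeps only the perf summary lines; this is just a convenience sketch, assuming llama-cli is on PATH and the paths above exist.

```shell
# Sketch: run both split modes back to back and keep only the perf summary.
for mode in row layer; do
  echo "== -sm $mode =="
  llama-cli --n-gpu-layers 99 -ts '30,40,40,40' --main_gpu 0 --temp 1.5 \
      --file prompt_random_long --ctx-size 19000 \
      --model models/memory/llama3.3-q6_k.gguf --seed 11111111111 \
      -sm "$mode" --prompt-cache models/memory/pc19000.tmp 2>&1 \
      | grep llama_perf_context_print
done
```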