
Eval bug: -sm row performance on NVIDIA multi-GPU config is extremely low at long contexts after b3990 #11510

Closed
@m-arbaro

Description

Name and Version

b3990/b3989

Operating systems

Linux

GGML backends

CUDA

Hardware

NVIDIA RTX 3090 + 3x Tesla P40, full offload

Models

Meta-Llama-3.3-70B-Instruct-Q6_K.gguf, reproduced on multiple models.

Problem description & steps to reproduce

After updating to b3990, inference speed at long context drops off much faster.
Inference on the affected build:

llama_perf_sampler_print: sampling time = 3849.30 ms / 19008 runs ( 0.20 ms per token, 4938.04 tokens per second)
llama_perf_context_print: load time = 10529.10 ms
llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: eval time = 185789.79 ms / 454 runs ( 409.23 ms per token, 2.44 tokens per second)
llama_perf_context_print: total time = 192330.87 ms / 455 tokens

Inference on the unaffected build:

llama_perf_context_print: load time = 10386.38 ms
llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: eval time = 72821.88 ms / 454 runs ( 160.40 ms per token, 6.23 tokens per second)
llama_perf_context_print: total time = 79340.13 ms / 455 tokens

Inference with --split-mode layer is the same on both commits:
llama_perf_sampler_print: sampling time = 3848.68 ms / 19008 runs ( 0.20 ms per token, 4938.84 tokens per second)
llama_perf_context_print: load time = 10364.01 ms
llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: eval time = 228173.97 ms / 454 runs ( 502.59 ms per token, 1.99 tokens per second)
llama_perf_context_print: total time = 234689.68 ms / 455 tokens
llama_perf_sampler_print: sampling time = 3792.58 ms / 19008 runs ( 0.20 ms per token, 5011.89 tokens per second)
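For a quick side-by-side comparison of the two split modes on a single build, something like the following llama-bench invocation could be used (a sketch only; the tensor split, main GPU, and test sizes mirror the llama-cli commands below and may need adjusting, and llama-bench measures prompt processing and generation from an empty context, so it will not fully capture the long-context slowdown):

llama-bench -m models/memory/llama3.3-q6_k.gguf -ngl 99 -ts 30/40/40/40 -mg 0 -sm row,layer -p 16384 -n 128 -r 3

Comma-separated values for -sm make llama-bench run both split modes in one invocation and report them in the same table.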

First Bad Commit

b3990

Relevant log output

llama-cli --n-gpu-layers 99 -ts '30,40,40,40' --main_gpu 0 --temp 1.5 --file prompt_random_long --ctx-size 19000 --model models/memory/llama3.3-q6_k.gguf --seed 11111111111  -sm row --prompt-cache models/memory/pc19000.tmp

llama-cli --n-gpu-layers 99 -ts '30,40,40,40' --main_gpu 0 --temp 1.5 --file prompt_random_long --ctx-size 19000 --model models/memory/llama3.3-q6_k.gguf --seed 11111111111  -sm layer --prompt-cache models/memory/pc19000.tmp
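To compare the two releases locally, one approach (a sketch, assuming a CUDA toolchain and the standard CMake build of llama.cpp) is to build both tagged releases side by side and run the same llama-cli command with each binary:

# last known-good tag
git checkout b3989
cmake -B build-b3989 -DGGML_CUDA=ON
cmake --build build-b3989 --config Release -j

# first bad tag
git checkout b3990
cmake -B build-b3990 -DGGML_CUDA=ON
cmake --build build-b3990 --config Release -j

Running the -sm row command above with each resulting build-*/bin/llama-cli binary and comparing the reported eval time should show the regression.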
