Description
I am running the same Q4_K_M model (Mistral Small 3) in llama.cpp and ollama, seemingly with the same configuration, on an NVIDIA RTX 3060 (12 GB VRAM) and an AMD Ryzen 9 7900. However, prompt eval is significantly faster (about 5x) with ollama: 512 t/s vs ~110 t/s, while token generation (eval) is ~2x faster with llama.cpp.
This is reproducible across time and inputs.
Ollama
total duration: 3.046232704s
load duration: 11.114031ms
prompt eval count: 1119 token(s)
prompt eval duration: 2.185s
prompt eval rate: 512.13 tokens/s
eval count: 9 token(s)
eval duration: 847ms
eval rate: 10.63 tokens/s
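For reference, these fields match the per-request summary that ollama prints when run with --verbose. A minimal sketch of such an invocation; the model tag and prompt file are placeholders, not necessarily what was used for the numbers above:
# sketch only: --verbose makes ollama print the duration/rate summary shown above;
# "mistral-small:24b" and prompt.txt are illustrative placeholders
$ ollama run mistral-small:24b --verbose < prompt.txt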
llama-server
$ llama-server -m /ollama/data/ollama/models/blobs/sha256-dd3af152229f92a3d61f3f115217c9c72f4b94d8be6778156dab23f894703c28 --port 8080 -ngl 30 -fa --temp 0.15 -c 2048 -ctk q4_0 -ctv q4_0 -t 12
prompt eval time = 8734.21 ms / 971 tokens ( 9.00 ms per token, 111.17 tokens per second)
eval time = 1075.76 ms / 19 tokens ( 56.62 ms per token, 17.66 tokens per second)
total time = 9809.97 ms / 990 tokens
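The timing lines above are what llama-server logs for each request. A minimal sketch of issuing such a request against the built-in /completion endpoint, assuming the prompt is stored in a placeholder request.json:
# sketch only: POST the prompt to the running server; the server then logs the
# "prompt eval time" / "eval time" lines shown above for this request.
# request.json is a placeholder, e.g. {"prompt": "<the ~1000-token prompt>", "n_predict": 32}
$ curl -s http://localhost:8080/completion -H "Content-Type: application/json" --data @request.json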
Interestingly, llama-server seems to be using all my CPU cores during prompt evaluation, no matter what value I pass to the -t flag. It is nevertheless clearly using the GPU, as removing -ngl 30 massively increases the running time.
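A possibly relevant detail: in llama.cpp, -t sets the threads used during token generation, while prompt processing is governed by the separate -tb/--threads-batch flag, which defaults to the -t value (matching the n_threads_batch = 12 in the log below). A sketch with both set explicitly; model.gguf stands in for the blob path used above:
# sketch only: -tb/--threads-batch limits the threads used for prompt processing,
# independently of -t, which applies to token generation
$ llama-server -m model.gguf --port 8080 -ngl 30 -fa --temp 0.15 -c 2048 -ctk q4_0 -ctv q4_0 -t 12 -tb 12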
Logs comparison
Ollama:
system info="CUDA : ARCHS = 600,610,620,700,720,750,800,860,870,890,900 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | cgo(gcc)" threads=12
llm_load_tensors: offloading 30 repeating layers to GPU
llm_load_tensors: offloaded 30/41 layers to GPU
llm_load_tensors: CPU_Mapped model buffer size = 4121.89 MiB
llm_load_tensors: CUDA0 model buffer size = 9540.47 MiB
llama_kv_cache_init: kv_size = 2048, offload = 1, type_k = 'q4_0', type_v = 'q4_0', n_layer = 40, can_shift = 1
llama_kv_cache_init: CPU KV buffer size = 22.50 MiB
llama_kv_cache_init: CUDA0 KV buffer size = 67.50 MiB
llama_new_context_with_model: KV self size = 90.00 MiB, K (q4_0): 45.00 MiB, V (q4_0): 45.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.52 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 791.00 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 14.01 MiB
llama_new_context_with_model: graph nodes = 1127
llama_new_context_with_model: graph splits = 114 (with bs=512), 3 (with bs=1)
llama-server:
system_info: n_threads = 12 (n_threads_batch = 12) / 24 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
load_tensors: offloading 30 repeating layers to GPU
load_tensors: offloaded 30/41 layers to GPU
load_tensors: CPU_Mapped model buffer size = 4121.89 MiB
load_tensors: CUDA0 model buffer size = 9540.47 MiB
llama_kv_cache_init: kv_size = 2048, offload = 1, type_k = 'q4_0', type_v = 'q4_0', n_layer = 40, can_shift = 1
llama_kv_cache_init: CUDA0 KV buffer size = 67.50 MiB
llama_kv_cache_init: CPU KV buffer size = 22.50 MiB
llama_init_from_model: CPU output buffer size = 0.50 MiB
llama_init_from_model: CPU compute buffer size = 266.00 MiB
llama_init_from_model: CUDA0 compute buffer size = 160.00 MiB
llama_init_from_model: CUDA_Host compute buffer size = 22.01 MiB
llama_init_from_model: graph nodes = 1127
llama_init_from_model: graph splits = 164 (with bs=512), 3 (with bs=1)
The layers are distributed similarly on the devices: 0-9 CPU, 10-39 CUDA0, and 40 on CPU.
Two smoking guns I see:
- The CUDA0 compute buffer size is ~790 MiB with ollama but only 160 MiB with llama-server, and the CPU compute buffer size is absent with ollama but is 266 MiB for llama-server.
- ollama prints "tensor 'token_embd.weight' (q4_K) (and 92 others) cannot be used with preferred buffer type CUDA_Host, using CPU instead", while llama-server shows "tensor 'token_embd.weight' (q4_K) (and 92 others) cannot be used with preferred buffer type CPU_AARCH64, using CPU instead". Why is the preferred buffer type different?

Why is the number of graph splits different (164 for llama-server vs 114 for ollama)? Do you know what controls this? There are no other log messages regarding the CPU besides the ones above.
Anything else that could explain the discrepancy in performance?
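One way to narrow this down (a suggestion, not something measured above) would be to benchmark prompt processing and generation in isolation with llama-bench from the same llama.cpp build, using matching offload, flash-attention, and KV-cache settings:
# sketch only: -p benchmarks prompt processing and -n benchmarks generation,
# each reported separately; model.gguf stands for the same blob path used above
$ llama-bench -m model.gguf -ngl 30 -fa 1 -ctk q4_0 -ctv q4_0 -t 12 -p 512 -n 128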
Versions
$ ollama --version
ollama version is 0.5.7-0-ga420a45-dirty
Warning: client version is 0.5.7
$ llama-server --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
version: 4779 (d7cfe1ffe)
built with cc (GCC) 14.2.1 20250128 for x86_64-pc-linux-gnu