Name and Version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
version: 4561 (6f53d8a)
built with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
llama-server
Command line
./llama-server -m ../../models/70B-IQ3_M.gguf -c 8192 -ngl 999 --rpc localhost:50052,localhost:50053,192.168.0.21:50052
Problem description & steps to reproduce
localhost: 24 GB + 12 GB VRAM
192.168.0.21: 12 GB VRAM
localhost: whether I run 2 RPC servers or 1 makes no difference, it always OOMs; with 1 RPC server serving both local devices the allocation is even worse, only around 10% gets allocated.
Loading the 70B model, it allocates:
localhost:
9.414 Gi / 12.0 Gi
17.162 Gi / 24.0 Gi
192.168.0.21:
4.745 Gi / 12.0 Gi
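(For reference, a per-device readout like the above can be obtained on each host with something along these lines; this is just one way to capture it, not necessarily how the figures were collected:
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv)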
Then it OOM-crashes after a bit:
llama_init_from_model: n_ctx_per_seq (8192) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 8192, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 80, can_shift = 1
llama_kv_cache_init: RPC[localhost:50052] KV buffer size = 384.00 MiB
llama_kv_cache_init: RPC[localhost:50053] KV buffer size = 736.00 MiB
llama_kv_cache_init: RPC[192.168.0.21:50052] KV buffer size = 384.00 MiB
llama_kv_cache_init: CUDA0 KV buffer size = 736.00 MiB
llama_kv_cache_init: CUDA1 KV buffer size = 320.00 MiB
llama_init_from_model: KV self size = 2560.00 MiB, K (f16): 1280.00 MiB, V (f16): 1280.00 MiB
llama_init_from_model: CUDA_Host output buffer size = 0.49 MiB
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 1104.00 MiB on device 1: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n: failed to allocate CUDA1 buffer of size 1157632000
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 1104.00 MiB on device 1: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n: failed to allocate CUDA1 buffer of size 1157632000
llama_init_from_model: failed to allocate compute buffers
even though there is more free space on the 24 GB card and especially on the third device's 12 GB card.
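Rough headroom implied by the usage figures above (assuming they reflect the state shortly before the crash; actual free memory at allocation time may have been somewhat lower):
12.0 GiB - 9.414 GiB ≈ 2.6 GiB free
24.0 GiB - 17.162 GiB ≈ 6.8 GiB free
12.0 GiB - 4.745 GiB ≈ 7.3 GiB free
The 1104 MiB (≈ 1.1 GiB) compute buffer that fails to allocate would fit several times over on the two devices with the most headroom.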
First Bad Commit
No response