
Misc. bug: llama-server with RPC OOMs during allocation even though plenty of memory is left on the devices #11435

Closed
@lucyknada

Description


Name and Version

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
version: 4561 (6f53d8a)
built with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

llama-server

Command line

./llama-server -m ../../models/70B-IQ3_M.gguf -c 8192 -ngl 999 --rpc localhost:50052,localhost:50053,192.168.0.21:50052

Problem description & steps to reproduce

localhost: 24 GB + 12 GB VRAM
192.168.0.21: 12 GB VRAM

On localhost it makes no difference whether I run two rpc servers (one per device) or a single one; it always OOMs. With one rpc server serving both devices the allocation is even worse, only around 10% allocated.
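For reference, a minimal sketch of how the rpc servers can be started (the ports are taken from the --rpc list above; pinning each server to a single GPU via CUDA_VISIBLE_DEVICES, and binding the remote one with -H, are assumptions about the setup, not taken from the report):

# localhost: one rpc-server per GPU (sketch)
CUDA_VISIBLE_DEVICES=0 ./rpc-server -p 50052
CUDA_VISIBLE_DEVICES=1 ./rpc-server -p 50053

# 192.168.0.21: bind to an interface reachable from the main host
./rpc-server -H 0.0.0.0 -p 50052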

Loading the 70B, it allocates:

localhost:      9.414 Gi / 12.0 Gi
localhost:     17.162 Gi / 24.0 Gi
192.168.0.21:   4.745 Gi / 12.0 Gi

Then, after a bit, it crashes with an out-of-memory error:

llama_init_from_model: n_ctx_per_seq (8192) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 8192, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 80, can_shift = 1
llama_kv_cache_init: RPC[localhost:50052] KV buffer size =   384.00 MiB
llama_kv_cache_init: RPC[localhost:50053] KV buffer size =   736.00 MiB
llama_kv_cache_init: RPC[192.168.0.21:50052] KV buffer size =   384.00 MiB
llama_kv_cache_init:      CUDA0 KV buffer size =   736.00 MiB
llama_kv_cache_init:      CUDA1 KV buffer size =   320.00 MiB
llama_init_from_model: KV self size  = 2560.00 MiB, K (f16): 1280.00 MiB, V (f16): 1280.00 MiB
llama_init_from_model:  CUDA_Host  output buffer size =     0.49 MiB
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 1104.00 MiB on device 1: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n: failed to allocate CUDA1 buffer of size 1157632000
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 1104.00 MiB on device 1: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n: failed to allocate CUDA1 buffer of size 1157632000
llama_init_from_model: failed to allocate compute buffers

even though there is more free memory on the 24 GB device and especially on the third device with 12 GB.
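A possible workaround (not verified here) is to override the automatic split with --tensor-split, which takes one proportion per device in the order llama-server enumerates them in the KV cache log above (three RPC devices, then CUDA0 and CUDA1). A sketch with illustrative, untuned proportions roughly matching each device's VRAM:

./llama-server -m ../../models/70B-IQ3_M.gguf -c 8192 -ngl 999 \
    --rpc localhost:50052,localhost:50053,192.168.0.21:50052 \
    --tensor-split 12,24,12,24,12

Since the failing allocation is a compute buffer rather than model weights or KV cache, lowering the micro-batch size (e.g. -ub 256) can also shrink the per-device compute buffers; whether either of these avoids this particular OOM is untested.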

First Bad Commit

No response

Relevant log output
