Name and Version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
version: 4561 (6f53d8a)
built with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
llama-server
Command line
./llama-server -m ../../models/70B-IQ3_M.gguf -c 8192 -ngl 999 --rpc localhost:50052,localhost:50053,192.168.0.21:50052
Problem description & steps to reproduce
localhost: 24 GB + 12 GB VRAM
192.168.0.21: 12 GB VRAM
localhost: whether I run 2 RPC servers or 1 makes no difference, it always OOMs; with 1 RPC server serving both local devices the allocation is even worse, only around 10% gets allocated.
Loading the 70B model, it allocates:
localhost:
9.414 Gi / 12.0 Gi
17.162 Gi / 24.0 Gi
192.168.0.21:
4.745 Gi / 12.0 Gi
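(For reference, a per-device readout like the above can be obtained on each host with something along these lines; this is just one way to capture it, not necessarily how the figures were collected:
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv)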
Then it OOM-crashes after a bit:
llama_init_from_model: n_ctx_per_seq (8192) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 8192, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 80, can_shift = 1
llama_kv_cache_init: RPC[localhost:50052] KV buffer size = 384.00 MiB
llama_kv_cache_init: RPC[localhost:50053] KV buffer size = 736.00 MiB
llama_kv_cache_init: RPC[192.168.0.21:50052] KV buffer size = 384.00 MiB
llama_kv_cache_init: CUDA0 KV buffer size = 736.00 MiB
llama_kv_cache_init: CUDA1 KV buffer size = 320.00 MiB
llama_init_from_model: KV self size = 2560.00 MiB, K (f16): 1280.00 MiB, V (f16): 1280.00 MiB
llama_init_from_model: CUDA_Host output buffer size = 0.49 MiB
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 1104.00 MiB on device 1: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n: failed to allocate CUDA1 buffer of size 1157632000
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 1104.00 MiB on device 1: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n: failed to allocate CUDA1 buffer of size 1157632000
llama_init_from_model: failed to allocate compute buffers
even though there is more free space on the 24 GB card and especially on the third device's 12 GB card.
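Rough headroom implied by the usage figures above (assuming they reflect the state shortly before the crash; actual free memory at allocation time may have been somewhat lower):
12.0 GiB - 9.414 GiB ≈ 2.6 GiB free
24.0 GiB - 17.162 GiB ≈ 6.8 GiB free
12.0 GiB - 4.745 GiB ≈ 7.3 GiB free
The 1104 MiB (≈ 1.1 GiB) compute buffer that fails to allocate would fit several times over on the two devices with the most headroom.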
First Bad Commit
No response