Description
Problem:
On build b2849 (and on older builds as well), the -nkvo argument, which keeps the KV cache in RAM, results in a huge compute buffer size when all of a model's layers are offloaded in full cuBLAS offload on a heterogeneous dual-GPU configuration (3090 24GB + 3060 12GB). When the non-repeating layers are not offloaded, the compute buffer size drops massively, back to something more "normal".
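For scale, here is a minimal sketch of the KV cache arithmetic behind the 480 MiB "KV self size" reported in the logs below; the 8 KV heads × 128 head dim geometry is an assumption about Yi-34B, not something printed in the logs:

```python
# Minimal sketch of the KV cache size that -nkvo moves to host RAM.
# Assumption: Yi-34B GQA geometry of 8 KV heads x 128 head dim (1024 KV
# channels per layer), f16 cache; layer/context counts come from the logs.

n_layers  = 60        # "offloading 60 repeating layers to GPU"
n_ctx     = 2048      # "n_ctx = 2048"
n_embd_kv = 8 * 128   # assumed KV heads * head dim for Yi-34B
bytes_f16 = 2

per_layer = 2 * n_ctx * n_embd_kv * bytes_f16 / 1024**2   # K + V per layer
total     = n_layers * per_layer

print(f"KV per layer: {per_layer:.2f} MiB")   # 8.00 MiB
print(f"KV total    : {total:.2f} MiB")       # 480.00 MiB, matches the log
# With -ts 1,1 the baseline run splits the layers roughly 31/29, consistent
# with the 248 MiB and 232 MiB CUDA0/CUDA1 KV buffers reported below.
```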
Example on a Yi 34B model (60+1 layers):
Baseline, full offload without -nkvo:
perplexity -m U:\text-generation-webui\models\Merged-RP-Stew-V2-34B.i1-IQ4_XS.gguf -f wiki.test.raw -ngl 61 -b 512 -ts 1,1 -fa --no-mmap -c 2048 --chunks 10
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size = 0.83 MiB
llm_load_tensors: offloading 60 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 61/61 layers to GPU
llm_load_tensors: CUDA_Host buffer size = 232.42 MiB
llm_load_tensors: CUDA0 buffer size = 8796.98 MiB
llm_load_tensors: CUDA1 buffer size = 8588.34 MiB
...................................................................................................
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: freq_base = 5000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 248.00 MiB
llama_kv_cache_init: CUDA1 KV buffer size = 232.00 MiB
llama_new_context_with_model: KV self size = 480.00 MiB, K (f16): 240.00 MiB, V (f16): 240.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.98 MiB
llama_new_context_with_model: pipeline parallelism enabled (n_copies=4)
llama_new_context_with_model: CUDA0 compute buffer size = 196.01 MiB
llama_new_context_with_model: CUDA1 compute buffer size = 203.02 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 30.02 MiB
llama_new_context_with_model: graph nodes = 1687
llama_new_context_with_model: graph splits = 3
The problematic case, full offload with -nkvo:
perplexity -m U:\text-generation-webui\models\Merged-RP-Stew-V2-34B.i1-IQ4_XS.gguf -f wiki.test.raw -ngl 61 -nkvo -b 512 -ts 1,1 -fa --no-mmap -c 2048 --chunks 10
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size = 0.83 MiB
llm_load_tensors: offloading 60 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 61/61 layers to GPU
llm_load_tensors: CUDA_Host buffer size = 232.42 MiB
llm_load_tensors: CUDA0 buffer size = 8796.98 MiB
llm_load_tensors: CUDA1 buffer size = 8588.34 MiB
...................................................................................................
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: freq_base = 5000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA_Host KV buffer size = 480.00 MiB
llama_new_context_with_model: KV self size = 480.00 MiB, K (f16): 240.00 MiB, V (f16): 240.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.98 MiB
llama_new_context_with_model: pipeline parallelism enabled (n_copies=4)
llama_new_context_with_model: CUDA0 compute buffer size = 1188.01 MiB
llama_new_context_with_model: CUDA1 compute buffer size = 1131.02 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 990.02 MiB
llama_new_context_with_model: graph nodes = 1687
llama_new_context_with_model: graph splits = 123
Same run with -nkvo, but with the non-repeating layer left on the CPU (-ngl 60):
perplexity -m U:\text-generation-webui\models\Merged-RP-Stew-V2-34B.i1-IQ4_XS.gguf -f wiki.test.raw -ngl 60 -nkvo -b 512 -ts 1,1 -fa --no-mmap -c 2048 --chunks 10
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size = 0.83 MiB
llm_load_tensors: offloading 60 repeating layers to GPU
llm_load_tensors: offloaded 60/61 layers to GPU
llm_load_tensors: CUDA_Host buffer size = 591.34 MiB
llm_load_tensors: CUDA0 buffer size = 8513.20 MiB
llm_load_tensors: CUDA1 buffer size = 8513.20 MiB
...................................................................................................
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: freq_base = 5000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA_Host KV buffer size = 480.00 MiB
llama_new_context_with_model: KV self size = 480.00 MiB, K (f16): 240.00 MiB, V (f16): 240.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.98 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 497.89 MiB
llama_new_context_with_model: CUDA1 compute buffer size = 132.00 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 18.01 MiB
llama_new_context_with_model: graph nodes = 1687
llama_new_context_with_model: graph splits = 125
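For quick comparison, a small script tabulating the compute buffer sizes and graph splits copied from the three runs above:

```python
# Compute buffer sizes (MiB) and graph splits taken verbatim from the logs above.
runs = [
    ("-ngl 61        ",  196.01,  203.02,  30.02,   3),
    ("-ngl 61 -nkvo  ", 1188.01, 1131.02, 990.02, 123),
    ("-ngl 60 -nkvo  ",  497.89,  132.00,  18.01, 125),
]
print(f"{'flags':<16}{'CUDA0':>9}{'CUDA1':>9}{'Host':>9}{'splits':>8}")
for flags, cuda0, cuda1, host, splits in runs:
    print(f"{flags:<16}{cuda0:>9.2f}{cuda1:>9.2f}{host:>9.2f}{splits:>8}")
```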
The same thing happens with a Llama 2 70B model, and in similar proportions without flash attention.
Is this behavior intended/necessary, or is it a bug?