Description
Problem:
On build b2849 (and on older builds as well), the -nkvo argument, which keeps the KV cache in RAM, results in a huge compute buffer size when all of a model's layers are offloaded in full cuBLAS offload on a heterogeneous dual-GPU configuration (3090 24GB + 3060 12GB). When the non-repeating layers are not offloaded, the compute buffer size drops massively, back to something more "normal".
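For scale, here is a minimal sketch of the KV cache arithmetic behind the 480 MiB "KV self size" reported in the logs below; the 8 KV heads × 128 head dim geometry is an assumption about Yi-34B, not something printed in the logs:

```python
# Minimal sketch of the KV cache size that -nkvo moves to host RAM.
# Assumption: Yi-34B GQA geometry of 8 KV heads x 128 head dim (1024 KV
# channels per layer), f16 cache; layer/context counts come from the logs.

n_layers  = 60        # "offloading 60 repeating layers to GPU"
n_ctx     = 2048      # "n_ctx = 2048"
n_embd_kv = 8 * 128   # assumed KV heads * head dim for Yi-34B
bytes_f16 = 2

per_layer = 2 * n_ctx * n_embd_kv * bytes_f16 / 1024**2   # K + V per layer
total     = n_layers * per_layer

print(f"KV per layer: {per_layer:.2f} MiB")   # 8.00 MiB
print(f"KV total    : {total:.2f} MiB")       # 480.00 MiB, matches the log
# With -ts 1,1 the baseline run splits the layers roughly 31/29, consistent
# with the 248 MiB and 232 MiB CUDA0/CUDA1 KV buffers reported below.
```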
Example on a Yi 34B model (60+1 layers):
Baseline, full offload without -nkvo:
perplexity -m U:\text-generation-webui\models\Merged-RP-Stew-V2-34B.i1-IQ4_XS.gguf -f wiki.test.raw -ngl 61 -b 512 -ts 1,1 -fa --no-mmap -c 2048 --chunks 10
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size = 0.83 MiB
llm_load_tensors: offloading 60 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 61/61 layers to GPU
llm_load_tensors: CUDA_Host buffer size = 232.42 MiB
llm_load_tensors: CUDA0 buffer size = 8796.98 MiB
llm_load_tensors: CUDA1 buffer size = 8588.34 MiB
...................................................................................................
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: freq_base = 5000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 248.00 MiB
llama_kv_cache_init: CUDA1 KV buffer size = 232.00 MiB
llama_new_context_with_model: KV self size = 480.00 MiB, K (f16): 240.00 MiB, V (f16): 240.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.98 MiB
llama_new_context_with_model: pipeline parallelism enabled (n_copies=4)
llama_new_context_with_model: CUDA0 compute buffer size = 196.01 MiB
llama_new_context_with_model: CUDA1 compute buffer size = 203.02 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 30.02 MiB
llama_new_context_with_model: graph nodes = 1687
llama_new_context_with_model: graph splits = 3
The problematic case, full offload with -nkvo:
perplexity -m U:\text-generation-webui\models\Merged-RP-Stew-V2-34B.i1-IQ4_XS.gguf -f wiki.test.raw -ngl 61 -nkvo -b 512 -ts 1,1 -fa --no-mmap -c 2048 --chunks 10
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size = 0.83 MiB
llm_load_tensors: offloading 60 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 61/61 layers to GPU
llm_load_tensors: CUDA_Host buffer size = 232.42 MiB
llm_load_tensors: CUDA0 buffer size = 8796.98 MiB
llm_load_tensors: CUDA1 buffer size = 8588.34 MiB
...................................................................................................
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: freq_base = 5000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA_Host KV buffer size = 480.00 MiB
llama_new_context_with_model: KV self size = 480.00 MiB, K (f16): 240.00 MiB, V (f16): 240.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.98 MiB
llama_new_context_with_model: pipeline parallelism enabled (n_copies=4)
llama_new_context_with_model: CUDA0 compute buffer size = 1188.01 MiB
llama_new_context_with_model: CUDA1 compute buffer size = 1131.02 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 990.02 MiB
llama_new_context_with_model: graph nodes = 1687
llama_new_context_with_model: graph splits = 123
Same run with -nkvo, but with the non-repeating layer left on the CPU (-ngl 60):
perplexity -m U:\text-generation-webui\models\Merged-RP-Stew-V2-34B.i1-IQ4_XS.gguf -f wiki.test.raw -ngl 60 -nkvo -b 512 -ts 1,1 -fa --no-mmap -c 2048 --chunks 10
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size = 0.83 MiB
llm_load_tensors: offloading 60 repeating layers to GPU
llm_load_tensors: offloaded 60/61 layers to GPU
llm_load_tensors: CUDA_Host buffer size = 591.34 MiB
llm_load_tensors: CUDA0 buffer size = 8513.20 MiB
llm_load_tensors: CUDA1 buffer size = 8513.20 MiB
...................................................................................................
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: freq_base = 5000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA_Host KV buffer size = 480.00 MiB
llama_new_context_with_model: KV self size = 480.00 MiB, K (f16): 240.00 MiB, V (f16): 240.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.98 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 497.89 MiB
llama_new_context_with_model: CUDA1 compute buffer size = 132.00 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 18.01 MiB
llama_new_context_with_model: graph nodes = 1687
llama_new_context_with_model: graph splits = 125
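For quick comparison, a small script tabulating the compute buffer sizes and graph splits copied from the three runs above:

```python
# Compute buffer sizes (MiB) and graph splits taken verbatim from the logs above.
runs = [
    ("-ngl 61        ",  196.01,  203.02,  30.02,   3),
    ("-ngl 61 -nkvo  ", 1188.01, 1131.02, 990.02, 123),
    ("-ngl 60 -nkvo  ",  497.89,  132.00,  18.01, 125),
]
print(f"{'flags':<16}{'CUDA0':>9}{'CUDA1':>9}{'Host':>9}{'splits':>8}")
for flags, cuda0, cuda1, host, splits in runs:
    print(f"{flags:<16}{cuda0:>9.2f}{cuda1:>9.2f}{host:>9.2f}{splits:>8}")
```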
The same thing happens with a Llama 2 70B model, and in similar proportions without flash attention.
Is this behavior intended/necessary, or is it a bug?