
KV_cache not offloading to GPU. Slows down the process a lot! #999

Closed
@utility-aagrawal

Description

Hi,

I think a recent change might have caused this. I am using llama-2-7b-chat.Q4_K_M.gguf for a local Q&A RAG pipeline built with LlamaIndex. I developed a proof of concept on a machine running version 0.2.13 and saw this in the output:

llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 2048.00 MB

When I recently installed llama-cpp-python on a new machine, these lines no longer appear in the output and the process has slowed down significantly. Can you please advise? Let me know if you need any additional information.
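
For reference, here is a minimal sketch of how the model is loaded in my pipeline, using llama-cpp-python directly; the path and parameter values are placeholders for my setup, and n_gpu_layers=-1 reflects my assumption that all layers (and with them the KV cache) should be offloaded:

from llama_cpp import Llama

# Minimal repro sketch. Paths and values are placeholders;
# n_gpu_layers=-1 requests offloading every layer, which only
# takes effect if the package was built with GPU (cuBLAS) support.
llm = Llama(
    model_path="./llama-2-7b-chat.Q4_K_M.gguf",
    n_ctx=4096,
    n_gpu_layers=-1,  # offload all layers to the GPU
    verbose=True,     # surfaces the llama_kv_cache_init lines quoted above
)

For what it's worth, the README for the 0.2.x releases enabled the CUDA backend at install time with something like CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python --force-reinstall --no-cache-dir; a plain pip install produces a CPU-only build, which could explain the missing offload lines.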
