Someone please help me get /slot/action?=save and /slot/action?=restore working #9781
I have multiple long documents whose prefixes I want to cache, and in my case that is not working. Let me share all the steps and how I am concluding that it's not working. I am using a Llama 3.1 8B GGUF (from the Hugging Face lmdeploy repo).
I have attached both of them in the reference if you want to experiment yourself. Now, here are the steps I performed for prefix-cache storing and restoring.
Step 0. llama-server deployment
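Roughly, I deploy the server like this (a minimal sketch; the model path, port, and save directory are placeholders, and `--slot-save-path` must point at a writable directory for the save/restore endpoints to use):

```python
import subprocess

# Minimal sketch: start llama-server with slot saving enabled.
# --slot-save-path is what allows /slots/{id}?action=save and ?action=restore
# to write and read KV-cache dump files; all paths here are placeholders.
server = subprocess.Popen([
    "./llama-server",
    "-m", "models/llama-3.1-8b-instruct.Q4_K_M.gguf",  # placeholder model path
    "--port", "8080",
    "--slot-save-path", "./slot-cache/",               # directory for .bin dumps
])
```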
Step 1. Checking slot status and prompt
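To inspect the slots I just query the `/slots` endpoint (a sketch; `URL` is a placeholder for the server started in Step 0):

```python
import requests

URL = "http://localhost:8080"  # placeholder base address of the llama-server instance

def get_slots():
    """Return the current state of all server slots as parsed JSON."""
    resp = requests.get(f"{URL}/slots")
    resp.raise_for_status()
    return resp.json()

print(get_slots())  # each slot initially reports "prompt": null
```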
As you can see, the slot initially reports "prompt": null.
Step 2. Caching the document
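Here is, in essence, what my notebook functions do (a minimal sketch; the document path, filename, slot id, and completion parameters are placeholders):

```python
document = open("code.txt").read()  # placeholder path to the long document

def cache_document(doc: str, filename: str, slot_id: int = 0):
    """Run the document through the model once so the slot's KV cache is
    populated, then dump that KV cache to a file under --slot-save-path."""
    resp = requests.post(
        f"{URL}/completion",
        json={"prompt": doc, "n_predict": 1, "id_slot": slot_id},
    )
    resp.raise_for_status()
    save = requests.post(
        f"{URL}/slots/{slot_id}?action=save",
        json={"filename": filename},
    )
    save.raise_for_status()
    return save.json()

print(cache_document(document, "code.bin"))
```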
Those are my Jupyter notebook functions in essence. Assume URL is the base address of the server deployed in Step 0.
Step 3. Verifying the cached document on disk as well as the slot status
I check both the file on disk and the slot status.
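Concretely, the check looks something like this (a sketch; the save directory must match `--slot-save-path` from Step 0):

```python
from pathlib import Path

SLOT_SAVE_DIR = Path("./slot-cache")       # placeholder, same as --slot-save-path

dump = SLOT_SAVE_DIR / "code.bin"
print(dump.exists(), dump.stat().st_size)  # a non-empty file means the dump was written
print(get_slots()[0].get("prompt"))        # the slot should now report the document text
```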
As you can see, the prompt has been stored as code.bin.
Step 4. Asking a random question
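The question is an unrelated throwaway query sent to the same slot (a sketch; prompt and parameters are placeholders):

```python
resp = requests.post(
    f"{URL}/completion",
    json={"prompt": "What is the capital of France?",  # placeholder unrelated question
          "n_predict": 32, "id_slot": 0},
)
resp.raise_for_status()
print(resp.json()["content"])
```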
Step 5. Checking the slot state
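Same check as in Step 1, reusing the helper from above:

```python
print(get_slots()[0].get("prompt"))  # now reports the random question, not the document
```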
As you can see, the current prompt in the slot is now the random question from Step 4.
Step 6. Restoring the prompt and verifying
Let's restore first and check the slot status afterwards.
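The restore call mirrors the save call (a sketch; the filename is the dump written in Step 2):

```python
def restore_slot(filename: str, slot_id: int = 0):
    """Load a previously saved KV-cache dump back into the given slot."""
    resp = requests.post(
        f"{URL}/slots/{slot_id}?action=restore",
        json={"filename": filename},
    )
    resp.raise_for_status()
    return resp.json()

print(restore_slot("code.bin"))  # the response reports what was restored
```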
As you can see, the output says that something was restored. Now let's look at the slot status again.
As you can see, the slot still has the old prompt. But by this logic it should have restored the code.bin prompt, and when I then ask a question it should answer using that restored context. Please help me understand my mistake if I am making any. Reference:
Hi @ggerganov, can you please answer?
The slot restore logic does not restore the text representation of the prompt. It restores only the KV cache state. So the /slots status reports a stale value for "prompt". I've made a workaround in #9800.
But note that even though the reported state is incorrect, the actual KV cache should have been restored correctly. So if you try to send a new query, it should reuse the cached tokens from the initial run.
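One way to confirm this (a minimal sketch reusing the hypothetical helpers from the question above) is to resend the cached document and check how many prompt tokens the server actually had to evaluate:

```python
resp = requests.post(
    f"{URL}/completion",
    json={"prompt": document + "\nQuestion: what does this code do?",  # placeholder question
          "n_predict": 64, "id_slot": 0},
)
out = resp.json()
# With the KV cache restored, tokens_cached should be large and
# tokens_evaluated small, because most of the prefix is reused.
print(out.get("tokens_cached"), out.get("tokens_evaluated"))
```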