Someone please help me get /slot/action?=save and /slot/action?=restore working #9781
I have multiple long documents whose prefixes I want to cache, and in my case that is not working. Let me share all the steps and how I am concluding that it's not working. I am using a Llama 3.1 8B GGUF (from the Hugging Face lmdeploy repo).
I have attached both of them in the reference if you want to experiment yourself. Now, here are the steps I performed for prefix-cache storing and restoring.
Step 0. llama-server deployment
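Roughly, I deploy the server like this (a minimal sketch; the model path, port, and save directory are placeholders, and `--slot-save-path` must point at a writable directory for the save/restore endpoints to use):

```python
import subprocess

# Minimal sketch: start llama-server with slot saving enabled.
# --slot-save-path is what allows /slots/{id}?action=save and ?action=restore
# to write and read KV-cache dump files; all paths here are placeholders.
server = subprocess.Popen([
    "./llama-server",
    "-m", "models/llama-3.1-8b-instruct.Q4_K_M.gguf",  # placeholder model path
    "--port", "8080",
    "--slot-save-path", "./slot-cache/",               # directory for .bin dumps
])
```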
Step 1. Checking slot status and prompt
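To inspect the slots I just query the `/slots` endpoint (a sketch; `URL` is a placeholder for the server started in Step 0):

```python
import requests

URL = "http://localhost:8080"  # placeholder base address of the llama-server instance

def get_slots():
    """Return the current state of all server slots as parsed JSON."""
    resp = requests.get(f"{URL}/slots")
    resp.raise_for_status()
    return resp.json()

print(get_slots())  # each slot initially reports "prompt": null
```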
As you can see, the slot initially reports "prompt": null.
Step 2. Caching the document
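Here is, in essence, what my notebook functions do (a minimal sketch; the document path, filename, slot id, and completion parameters are placeholders):

```python
document = open("code.txt").read()  # placeholder path to the long document

def cache_document(doc: str, filename: str, slot_id: int = 0):
    """Run the document through the model once so the slot's KV cache is
    populated, then dump that KV cache to a file under --slot-save-path."""
    resp = requests.post(
        f"{URL}/completion",
        json={"prompt": doc, "n_predict": 1, "id_slot": slot_id},
    )
    resp.raise_for_status()
    save = requests.post(
        f"{URL}/slots/{slot_id}?action=save",
        json={"filename": filename},
    )
    save.raise_for_status()
    return save.json()

print(cache_document(document, "code.bin"))
```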
Those are my Jupyter notebook functions in essence. Assume URL is the base address of the server deployed in Step 0.
Step 3. Verifying the cached document on disk as well as the slot status
I check both the file on disk and the slot status.
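Concretely, the check looks something like this (a sketch; the save directory must match `--slot-save-path` from Step 0):

```python
from pathlib import Path

SLOT_SAVE_DIR = Path("./slot-cache")       # placeholder, same as --slot-save-path

dump = SLOT_SAVE_DIR / "code.bin"
print(dump.exists(), dump.stat().st_size)  # a non-empty file means the dump was written
print(get_slots()[0].get("prompt"))        # the slot should now report the document text
```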
As you can see, the prompt has been stored as code.bin.
Step 4. Asking a random question
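The question is an unrelated throwaway query sent to the same slot (a sketch; prompt and parameters are placeholders):

```python
resp = requests.post(
    f"{URL}/completion",
    json={"prompt": "What is the capital of France?",  # placeholder unrelated question
          "n_predict": 32, "id_slot": 0},
)
resp.raise_for_status()
print(resp.json()["content"])
```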
Step 5. Checking the slot state
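Same check as in Step 1, reusing the helper from above:

```python
print(get_slots()[0].get("prompt"))  # now reports the random question, not the document
```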
As you can see, the current prompt in the slot is now the random question from Step 4.
Step 6. Restoring the prompt and verifying
Let's restore first and check the slot status afterwards.
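The restore call mirrors the save call (a sketch; the filename is the dump written in Step 2):

```python
def restore_slot(filename: str, slot_id: int = 0):
    """Load a previously saved KV-cache dump back into the given slot."""
    resp = requests.post(
        f"{URL}/slots/{slot_id}?action=restore",
        json={"filename": filename},
    )
    resp.raise_for_status()
    return resp.json()

print(restore_slot("code.bin"))  # the response reports what was restored
```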
As you can see, the output says that something was restored. Now let's look at the slot status again.
As you can see, the slot still has the old prompt. But by this logic it should have restored the code.bin prompt, and when I then ask a question it should answer using that restored context. Please help me understand my mistake if I am making any. Reference:
Hi @ggerganov, can you please answer?
The slot restore logic does not restore the text representation of the prompt. It restores only the KV cache state. So the /slots status reports a stale value for "prompt". I've made a workaround in #9800.
But note that even though the reported state is incorrect, the actual KV cache should have been restored correctly. So if you try to send a new query, it should reuse the cached tokens from the initial run.
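One way to confirm this (a minimal sketch reusing the hypothetical helpers from the question above) is to resend the cached document and check how many prompt tokens the server actually had to evaluate:

```python
resp = requests.post(
    f"{URL}/completion",
    json={"prompt": document + "\nQuestion: what does this code do?",  # placeholder question
          "n_predict": 64, "id_slot": 0},
)
out = resp.json()
# With the KV cache restored, tokens_cached should be large and
# tokens_evaluated small, because most of the prefix is reused.
print(out.get("tokens_cached"), out.get("tokens_evaluated"))
```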