
Feature Request: Ability to cancel during prompt processing (llama_decode) #10509

Closed
@kingbri1

Description


Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

Currently, there is no way to cancel prompt processing once it has started in llama.cpp. For a large prompt, the user must wait for the entire prompt to be processed before cancellation can take effect. The bottleneck is the call to llama_decode on the ctx.

This is being proposed for the server via #9679; however, I believe this behavior should live in the core library as well.
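
For concreteness, here is a minimal sketch of the call site in question, assuming a model and context already created through the usual llama.cpp C API, the current two-argument llama_batch_get_one signature, and a prompt that fits within n_batch; the helper name process_prompt is just for illustration. Once llama_decode is entered, the caller has no hook to abort it:

```cpp
#include "llama.h"

#include <vector>

// Illustrative helper (not part of llama.cpp): submit an already-tokenized
// prompt to the context. The prompt is assumed to fit within n_batch.
static int process_prompt(llama_context * ctx, std::vector<llama_token> & tokens) {
    llama_batch batch = llama_batch_get_one(tokens.data(), (int32_t) tokens.size());

    // Blocks until the entire prompt has been processed. There is currently
    // no parameter or callback through which the caller can stop this call
    // part-way, so a client disconnect cannot take effect until it returns.
    return llama_decode(ctx, batch);
}
```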

Motivation

API servers are generally used to handle many requests at a time (as evidenced by llama-server), so there should be a way to abort a request at any point. The main bottleneck that cannot easily be aborted is decoding. Because of this bottleneck, the server and the client can fall out of sync (a race), which causes a segfault and crash once another request is sent.

In addition, the lack of cancellation during processing increases system load even with a batching server, because resources continue to be spent on a request that has already been cancelled.

A cursory look over the llama-cpp-python repo shows that others have run into the same problem:

Possible Implementation

Issue 313 in llama-cpp-python suggests using signals, which is a valid option. Another method would be a callback, similar to the one in llama_model_params, that allows for cancellation (a rough sketch is below). Other opinions are welcome.
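
As a rough illustration of the callback option, the following sketch mirrors the shape of the existing progress_callback in llama_model_params. The names llama_cancel_callback, cancel_callback, and cancel_callback_data do not exist in llama.h; they are placeholders for whatever the proposed API would actually be called:

```cpp
#include <atomic>

// Proposed (hypothetical) callback type: return true to continue decoding,
// false to abort the in-flight llama_decode early, mirroring the semantics
// of llama_progress_callback during model loading.
typedef bool (*llama_cancel_callback)(void * user_data);

// Hypothetical additions to llama_context_params:
//     llama_cancel_callback cancel_callback;      // polled periodically inside llama_decode
//     void *                cancel_callback_data; // opaque pointer handed back to the callback
//
// llama_decode would then return a distinct error code when aborted by the
// callback, so callers can tell cancellation apart from real failures.

// Example wiring on the server side: a per-request flag flipped when the
// client disconnects.
static std::atomic<bool> g_request_cancelled{false};

static bool keep_decoding(void * /* user_data */) {
    return !g_request_cancelled.load(std::memory_order_relaxed);
}
```

A signal-based approach along the lines of issue 313 would amount to setting the same kind of flag from a signal handler instead of from the server's request-tracking code.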
