Description
Prerequisites
- I am running the latest code. Mention the version if possible as well.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new and useful enhancement to share.
Feature Description
Currently, there is no way to cancel prompt processing once it has started in llama.cpp. For a large prompt, the user must wait for the entire prompt to be processed before cancellation can take effect. The bottleneck is the call to `llama_decode` on the `ctx`, which blocks until the whole batch has been evaluated.
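For reference, a rough sketch of where the blocking happens today (assuming an existing `llama_context` and an already-tokenized prompt; the whole prompt is decoded in a single batch here for brevity, so treat it as illustrative rather than exact):

```cpp
#include "llama.h"
#include <vector>

// Once llama_decode() is entered, there is no supported way to bail out early,
// so a long prompt blocks the calling thread until the call finishes.
int process_prompt(llama_context * ctx, const std::vector<llama_token> & prompt) {
    llama_batch batch = llama_batch_init((int32_t) prompt.size(), /*embd*/ 0, /*n_seq_max*/ 1);

    for (size_t i = 0; i < prompt.size(); ++i) {
        batch.token[i]     = prompt[i];
        batch.pos[i]       = (llama_pos) i;
        batch.n_seq_id[i]  = 1;
        batch.seq_id[i][0] = 0;
        batch.logits[i]    = i == prompt.size() - 1; // logits only needed for the last token
    }
    batch.n_tokens = (int32_t) prompt.size();

    // This call runs to completion; a client disconnect during it cannot stop the work.
    const int ret = llama_decode(ctx, batch);

    llama_batch_free(batch);
    return ret;
}
```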
This is being proposed for the server in #9679. However, I believe this behavior should live in the core library as well.
Motivation
API servers generally handle many requests at a time (as evidenced by `llama-server`), so there should be a way to abort a request at any point. The main bottleneck that cannot easily be aborted is decoding. Because of this, the server and the client can fall out of sync (a race), which leads to a segfault and crash once another request is sent.
In addition, the lack of processing cancellation increases system load even with a batching server, because resources continue to be spent on a request that has already been cancelled.
A cursory look over the llama-cpp-python repo shows that others have hit the same problem:
- Cancelled HTTP connection does not stop model execution abetlen/llama-cpp-python#313
- Add cancel() method to interrupt a stream abetlen/llama-cpp-python#733
Possible Implementation
Issue 313 in llama-cpp-python suggests using signals, which is a valid option. Another method would be a cancellation callback, similar to the callback in `llama_model_params`. Other opinions are welcome.
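To make the callback idea concrete, here is a purely hypothetical sketch. None of these names exist in `llama.h`; they just follow the `progress_callback` convention from `llama_model_params`, where returning false aborts the operation:

```cpp
#include <atomic>

// Proposed addition (hypothetical): return true to keep going, false to abort.
// llama_decode would poll this between micro-batches and return a distinct
// error code (e.g. a negative value meaning "aborted") instead of running to completion.
typedef bool (*llama_abort_callback_t)(void * user_data);

struct llama_context; // opaque, as in llama.h

// Proposed setter (hypothetical):
// void llama_set_abort_callback(llama_context * ctx, llama_abort_callback_t cb, void * user_data);

// --- How a server could use it ---
static std::atomic<bool> g_cancelled{false};

// Flipped by the HTTP layer when the client disconnects.
static bool keep_decoding(void * /*user_data*/) {
    return !g_cancelled.load(std::memory_order_relaxed);
}

// Registration would happen once per context:
//   llama_set_abort_callback(ctx, keep_decoding, nullptr);
// and the connection handler would simply set:
//   g_cancelled = true;
```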