Description
Prerequisites
- I am running the latest code. Mention the version if possible as well.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new and useful enhancement to share.
Feature Description
Currently, there is no way to cancel prompt processing once it has started in llama.cpp. For a large prompt, the user must wait for the entire prompt to be processed before cancellation can take effect. The bottleneck is the call to `llama_decode` on the `ctx`, which blocks until the whole batch has been evaluated.
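For reference, a rough sketch of where the blocking happens today (assuming an existing `llama_context` and an already-tokenized prompt; the whole prompt is decoded in a single batch here for brevity, so treat it as illustrative rather than exact):

```cpp
#include "llama.h"
#include <vector>

// Once llama_decode() is entered, there is no supported way to bail out early,
// so a long prompt blocks the calling thread until the call finishes.
int process_prompt(llama_context * ctx, const std::vector<llama_token> & prompt) {
    llama_batch batch = llama_batch_init((int32_t) prompt.size(), /*embd*/ 0, /*n_seq_max*/ 1);

    for (size_t i = 0; i < prompt.size(); ++i) {
        batch.token[i]     = prompt[i];
        batch.pos[i]       = (llama_pos) i;
        batch.n_seq_id[i]  = 1;
        batch.seq_id[i][0] = 0;
        batch.logits[i]    = i == prompt.size() - 1; // logits only needed for the last token
    }
    batch.n_tokens = (int32_t) prompt.size();

    // This call runs to completion; a client disconnect during it cannot stop the work.
    const int ret = llama_decode(ctx, batch);

    llama_batch_free(batch);
    return ret;
}
```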
This is being proposed for the server in #9679. However, I believe this behavior should live in the core library as well.
Motivation
API servers generally handle many requests at a time (as evidenced by `llama-server`), so there should be a way to abort a request at any point. The main bottleneck that cannot easily be aborted is decoding. Because of this, the server and the client can fall out of sync (a race), which leads to a segfault and crash once another request is sent.
In addition, the lack of processing cancellation increases system load even with a batching server, because resources continue to be spent on a request that has already been cancelled.
A cursory look over the llama-cpp-python repo shows that others have hit the same problem:
- Cancelled HTTP connection does not stop model execution abetlen/llama-cpp-python#313
- Add cancel() method to interrupt a stream abetlen/llama-cpp-python#733
Possible Implementation
Issue 313 in llama-cpp-python suggests using signals, which is a valid option. Another method would be a cancellation callback, similar to the callback in `llama_model_params`. Other opinions are welcome.
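To make the callback idea concrete, here is a purely hypothetical sketch. None of these names exist in `llama.h`; they just follow the `progress_callback` convention from `llama_model_params`, where returning false aborts the operation:

```cpp
#include <atomic>

// Proposed addition (hypothetical): return true to keep going, false to abort.
// llama_decode would poll this between micro-batches and return a distinct
// error code (e.g. a negative value meaning "aborted") instead of running to completion.
typedef bool (*llama_abort_callback_t)(void * user_data);

struct llama_context; // opaque, as in llama.h

// Proposed setter (hypothetical):
// void llama_set_abort_callback(llama_context * ctx, llama_abort_callback_t cb, void * user_data);

// --- How a server could use it ---
static std::atomic<bool> g_cancelled{false};

// Flipped by the HTTP layer when the client disconnects.
static bool keep_decoding(void * /*user_data*/) {
    return !g_cancelled.load(std::memory_order_relaxed);
}

// Registration would happen once per context:
//   llama_set_abort_callback(ctx, keep_decoding, nullptr);
// and the connection handler would simply set:
//   g_cancelled = true;
```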