Feature Request: add per-request "reasoning" options in llama-server #13272

Open
@ngxson

Description
Feature Description

As reasoning models become mainstream, we're starting to see some patterns:

  • Most models use <think>, <reasoning>, etc. — by now these markers are basically a known set of tokens
  • A "reasoning budget" can technically be supported by any model, not just Qwen, by keeping track of the number of tokens between <think> and </think>
  • "no think" is just a reasoning budget == 0

So I'm thinking about accepting an object like this for each request:

"reasoning": {
    "budget": -1, // number of reasoning tokens budget
                     default: -1 (inf) ; 0 for no think
    "format": "", // equivalent of --reasoning-format
                     if set to "deepseek", reasoning will be returned in "message.reasoning_content"
                     if set to "hide", it will be completely hidden
                     default: "none", return the reasoning with the message as normal
}
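To make the proposal concrete, a request carrying this object could look like the sketch below. This assumes the server's existing OpenAI-compatible /v1/chat/completions request shape; the "reasoning" field itself is the proposed addition and does not exist yet.

```python
import json

# Hypothetical request body for the proposed per-request "reasoning" options.
# Everything except "reasoning" follows the existing chat-completions schema;
# the "reasoning" object is the new field proposed in this issue.
payload = {
    "messages": [
        {"role": "user", "content": "How many primes are below 100?"}
    ],
    "reasoning": {
        "budget": 512,        # cap reasoning at 512 tokens; -1 = unlimited, 0 = no think
        "format": "deepseek", # return reasoning in "message.reasoning_content"
    },
}

body = json.dumps(payload)
print(body)
```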

The reasoning format "hide" can be implemented via #13214; the "deepseek" format is currently only supported for non-streaming responses, but I think we can modify it a bit to support streaming as well.
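The three format values could be applied as a post-processing step on a finished completion, roughly as sketched below. This is illustrative only: the function name, the regex-based tag handling, and the single-span assumption are mine, not the server's actual implementation.

```python
import re

# Assumes reasoning is delimited by a single <think>...</think> span.
THINK_RE = re.compile(r"<think>(.*?)</think>\s*", re.DOTALL)

def apply_reasoning_format(text: str, fmt: str) -> dict:
    """Illustrative sketch of the proposed "format" behaviors."""
    m = THINK_RE.search(text)
    if not m or fmt == "none":
        # "none": return the reasoning inline with the message, unchanged
        return {"content": text}
    answer = THINK_RE.sub("", text, count=1)
    if fmt == "deepseek":
        # split reasoning into its own field, DeepSeek-API style
        return {"content": answer, "reasoning_content": m.group(1)}
    if fmt == "hide":
        # drop the reasoning entirely
        return {"content": answer}
    raise ValueError(f"unknown reasoning format: {fmt}")
```

For example, `apply_reasoning_format("<think>2+2=4</think>The answer is 4.", "deepseek")` yields `{"content": "The answer is 4.", "reasoning_content": "2+2=4"}`, while "hide" returns only the answer.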

For the budget, we don't yet have the logic to handle it.
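One possible shape for that logic: count tokens emitted between the think-open and think-close markers during decoding, and once the budget is exhausted, force the close marker so generation returns to the answer. The sketch below uses token strings in place of real token IDs, and the class and hook point are hypothetical, not existing llama-server code.

```python
class ReasoningBudget:
    """Minimal per-request reasoning-budget tracker (illustrative sketch)."""

    def __init__(self, budget: int = -1):
        self.budget = budget   # -1 = unlimited (default); 0 = "no think"
        self.in_think = False
        self.used = 0

    def next_token(self, sampled: str) -> str:
        """Return the token to emit, overriding the sample when out of budget."""
        if sampled == "<think>":
            self.in_think = True
            return sampled
        if sampled == "</think>":
            self.in_think = False
            return sampled
        if self.in_think and self.budget >= 0:
            if self.used >= self.budget:
                # budget exhausted: force the closing tag instead of the sample
                self.in_think = False
                return "</think>"
            self.used += 1
        return sampled
```

With budget == 0 this collapses to "no think": the first token sampled inside the think block is immediately replaced by the closing tag, so the reasoning span stays empty.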

Metadata

    Labels

    enhancement (New feature or request)
