Feature Request: add per-request "reasoning" options in llama-server #13272

Open
@ngxson

Description
Feature Description

As reasoning models become mainstream, we're starting to see some patterns:

  • Most models use <think>, <reasoning>, etc. — by now these markers are basically a known set of tokens
  • A "reasoning budget" can technically be supported by any model, not just Qwen, by keeping track of the number of tokens between <think> and </think>
  • "no think" is just a reasoning budget == 0

So I'm thinking about accepting an object like this for each request:

"reasoning": {
    "budget": -1, // number of reasoning tokens budget
                     default: -1 (inf) ; 0 for no think
    "format": "", // equivalent of --reasoning-format
                     if set to "deepseek", reasoning will be returned in "message.reasoning_content"
                     if set to "hide", it will be completely hidden
                     default: "none", return the reasoning with the message as normal
}
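To make the proposal concrete, a request carrying this object could look like the sketch below. This assumes the server's existing OpenAI-compatible /v1/chat/completions request shape; the "reasoning" field itself is the proposed addition and does not exist yet.

```python
import json

# Hypothetical request body for the proposed per-request "reasoning" options.
# Everything except "reasoning" follows the existing chat-completions schema;
# the "reasoning" object is the new field proposed in this issue.
payload = {
    "messages": [
        {"role": "user", "content": "How many primes are below 100?"}
    ],
    "reasoning": {
        "budget": 512,        # cap reasoning at 512 tokens; -1 = unlimited, 0 = no think
        "format": "deepseek", # return reasoning in "message.reasoning_content"
    },
}

body = json.dumps(payload)
print(body)
```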

The reasoning format "hide" can be implemented via #13214; the "deepseek" format is currently only supported for non-streaming responses, but I think we can modify it a bit to support streaming as well.
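The three format values could be applied as a post-processing step on a finished completion, roughly as sketched below. This is illustrative only: the function name, the regex-based tag handling, and the single-span assumption are mine, not the server's actual implementation.

```python
import re

# Assumes reasoning is delimited by a single <think>...</think> span.
THINK_RE = re.compile(r"<think>(.*?)</think>\s*", re.DOTALL)

def apply_reasoning_format(text: str, fmt: str) -> dict:
    """Illustrative sketch of the proposed "format" behaviors."""
    m = THINK_RE.search(text)
    if not m or fmt == "none":
        # "none": return the reasoning inline with the message, unchanged
        return {"content": text}
    answer = THINK_RE.sub("", text, count=1)
    if fmt == "deepseek":
        # split reasoning into its own field, DeepSeek-API style
        return {"content": answer, "reasoning_content": m.group(1)}
    if fmt == "hide":
        # drop the reasoning entirely
        return {"content": answer}
    raise ValueError(f"unknown reasoning format: {fmt}")
```

For example, `apply_reasoning_format("<think>2+2=4</think>The answer is 4.", "deepseek")` yields `{"content": "The answer is 4.", "reasoning_content": "2+2=4"}`, while "hide" returns only the answer.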

For the budget, we don't yet have the logic to handle it.
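One possible shape for that logic: count tokens emitted between the think-open and think-close markers during decoding, and once the budget is exhausted, force the close marker so generation returns to the answer. The sketch below uses token strings in place of real token IDs, and the class and hook point are hypothetical, not existing llama-server code.

```python
class ReasoningBudget:
    """Minimal per-request reasoning-budget tracker (illustrative sketch)."""

    def __init__(self, budget: int = -1):
        self.budget = budget   # -1 = unlimited (default); 0 = "no think"
        self.in_think = False
        self.used = 0

    def next_token(self, sampled: str) -> str:
        """Return the token to emit, overriding the sample when out of budget."""
        if sampled == "<think>":
            self.in_think = True
            return sampled
        if sampled == "</think>":
            self.in_think = False
            return sampled
        if self.in_think and self.budget >= 0:
            if self.used >= self.budget:
                # budget exhausted: force the closing tag instead of the sample
                self.in_think = False
                return "</think>"
            self.used += 1
        return sampled
```

With budget == 0 this collapses to "no think": the first token sampled inside the think block is immediately replaced by the closing tag, so the reasoning span stays empty.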

Metadata

    Labels

    enhancement (New feature or request)
