### Description

### Name and Version

### Operating systems

Linux

### GGML backends

CUDA

### Hardware

NVIDIA

### Models

Qwen3

### Problem description & steps to reproduce
I don't know if it's too soon, but I'm opening this to keep track of the issue.
The original Qwen3 template is not yet supported, but the bug can be reproduced with a modified template.
The Qwen3 template contains the following check (stripped down to the relevant section):

```jinja
{%- if loop.index0 > ns.last_query_index %}
    {%- if loop.last or (not loop.last and reasoning_content) %}
        {# KEEP REASONING TOKENS #}
```
This means that, in the common case, the reasoning tokens are kept when the last role is assistant and discarded when the last role is user.
The problem is that at the start of each turn, the following pseudo-code is executed:

```
messages.append(user_message)
fmt_past_msg = apply_chat_template(messages)
messages.append(assistant_message)
fmt_new_msg = apply_chat_template(messages)
diff = fmt_new_msg - fmt_past_msg
```
The diff is not computed correctly: the same assistant message is rendered with its thinking tokens preserved in one call to `apply_chat_template` and with them removed in the other, so `fmt_past_msg` is no longer a prefix of `fmt_new_msg`.
Relevant section of the code (line 320 at commit 611aa91):
### First Bad Commit

_No response_

### Relevant log output
```cpp
std::string common_chat_format_single(...) {
    [...]
    fmt_past_msg = common_chat_templates_apply(tmpls, inputs).prompt;
    [...]
    inputs.messages.push_back(new_msg);
    [...]
    auto fmt_new_msg = common_chat_templates_apply(tmpls, inputs).prompt;
    // get the diff part
    ss << fmt_new_msg.substr(fmt_past_msg.size(), fmt_new_msg.size() - fmt_past_msg.size());
```
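One way to make the diff defensive (a sketch only, not the project's actual fix; `prompt_diff` is an illustrative name, not a real llama.cpp helper) is to verify the prefix assumption before slicing:

```cpp
#include <string>

// Sketch: compute the incremental part only when the prefix assumption
// actually holds; otherwise fall back to the full new prompt.
static std::string prompt_diff(const std::string &fmt_past_msg,
                               const std::string &fmt_new_msg) {
    const bool is_prefix =
        fmt_new_msg.size() >= fmt_past_msg.size() &&
        fmt_new_msg.compare(0, fmt_past_msg.size(), fmt_past_msg) == 0;
    // When the template re-renders earlier messages differently (as the
    // Qwen3 reasoning stripping does), is_prefix is false and substr()
    // would slice at an arbitrary offset, or throw std::out_of_range.
    return is_prefix ? fmt_new_msg.substr(fmt_past_msg.size()) : fmt_new_msg;
}
```

The fallback is not a fix for the Qwen3 case (it re-sends the whole prompt), but it avoids emitting a garbage diff.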