Description
Motivation
From the day I added `llama_chat_apply_template` in #5538, I have been thinking about bringing it into `main.cpp` to replace the current `-cml` option. However, it is not as easy as it seems. The main reason is that `main.cpp` still relies on antiprompt and static prefix/postfix/infix to work with "chat".

The whole reason antiprompt existed in the first place was because, in the early era of LLMs:
- We didn't have a good way to differentiate roles (user - assistant)
- We didn't have a good way to know when to stop the generation

However, a lot has changed since then: we now have the notion of a "chat template", newer models have special tokens like `<|user|>` to replace `Human:`, and most models are fine-tuned to stop generation by outputting an EOS token...

For that reason, using antiprompt and static prefix/postfix/infix is no longer a viable option for adding chat template support to `main.cpp`. That forces us to be a bit more creative.
Possible Implementation
- The prefix/postfix can be changed dynamically based on the message role. For example, ChatML uses `"<|im_start|>" + role + "\n"` as the prefix (`role` is dynamic, based on the current message) and `<|im_end|>\n` as the postfix. This idea is being implemented in Refactor chat template API #6822
- Use `llama_token_is_eog()` to replace antiprompt (see the sketch after this list). Additionally, for compatibility reasons, we can translate the EOG token back into an antiprompt, because some models output the antiprompt as a sequence of multiple tokens (newer models never do this).
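
A rough sketch of both ideas, assuming ChatML and the existing `llama_token_is_eog()` API; the `chatml_prefix`/`chatml_postfix` helpers are hypothetical, only here to illustrate the role-dependent prefix:

```cpp
#include "llama.h"

#include <string>

// hypothetical helper: the prefix depends on the role of the current message
static std::string chatml_prefix(const std::string & role) {
    return "<|im_start|>" + role + "\n";
}

// hypothetical helper: the postfix is the same for every role
static std::string chatml_postfix() {
    return "<|im_end|>\n";
}

// inside the generation loop, stop on any end-of-generation token
// instead of scanning the generated text for an antiprompt string
static bool should_stop(const llama_model * model, llama_token id) {
    return llama_token_is_eog(model, id);
}
```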
Old proposal (outdated)
Possible Implementation
My idea is to use `llama_chat_apply_template` in `main.cpp`. This will effectively deprecate the antiprompt, prompt prefix/postfix and cml options.
Format the chat on-the-go
For now, `llama_chat_apply_template` produces a very "additive" result when a new message is added to the list. "Additive" means, for example, that if I have `[msg1, msg2]`, then I get the formatted chat `msg1_msg2`. When I add `msg3` to the list, it must append the formatted `msg3` to the end of the formatted chat without touching the existing content, resulting in `msg1_msg2_msg3` in this example. A wrong result would be `msg1+++msg2_msg3`.
This is very important. Unlike `server.cpp`, where we clear the KV cache and re-format a new prompt each time, `main.cpp` adds new tokens on top of the existing ones, then continues the generation until a condition is met (maybe an EOS token or a stop sequence).

So, to use `llama_chat_apply_template` in `main.cpp`, a test case must be added to test all chat templates and make sure they are all "additive". `main.cpp` can then keep track of a list of messages, re-apply the chat template each time, and take only the "added" part, as sketched after the example below.
Example:
- Messages: `[msg1, msg2]` ==> Formatted: `<user>msg1<assistant>msg2`
- Messages: `[msg1, msg2, msg3]` ==> Formatted: `<user>msg1<assistant>msg2<user>msg3` ==> Part to evaluate: `<user>msg3`
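
A minimal sketch of that bookkeeping, assuming the current `llama_chat_apply_template()` signature (passing `tmpl = nullptr` to use the model's built-in template); `format_chat` and `added_part` are hypothetical helpers, not existing APIs:

```cpp
#include "llama.h"

#include <string>
#include <vector>

// format the whole message list with the model's built-in template
static std::string format_chat(const llama_model * model,
                               const std::vector<llama_chat_message> & msgs,
                               bool add_ass) {
    std::vector<char> buf(4096);
    int32_t n = llama_chat_apply_template(model, nullptr, msgs.data(), msgs.size(),
                                          add_ass, buf.data(), (int32_t) buf.size());
    if (n > (int32_t) buf.size()) {
        // buffer was too small: resize and apply again
        buf.resize(n);
        n = llama_chat_apply_template(model, nullptr, msgs.data(), msgs.size(),
                                      add_ass, buf.data(), (int32_t) buf.size());
    }
    return n < 0 ? std::string() : std::string(buf.data(), n);
}

// return only the part that still needs to be tokenized and evaluated;
// this only works if the template is "additive" (prev is a prefix of curr)
static std::string added_part(const std::string & prev, const std::string & curr) {
    if (curr.compare(0, prev.size(), prev) != 0) {
        return curr; // not additive: the caller would have to re-evaluate from scratch
    }
    return curr.substr(prev.size());
}
```

On each user turn, `main.cpp` would push the new message onto the vector, call `format_chat()` again, and tokenize only `added_part(prev, curr)`; the same prefix check is essentially the "additive" test case mentioned above.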
Manage stop sequences
While it is ideal to use a stop token (for example, EOS or `<|im_end|>`) to stop generation, not all models support this (some models still break `<|im_end|>` into `<|`, `im`, `_end`, `|>`), so relying on a stop token alone is not an option.
`llama_chat_apply_template` should return the stop sequence alongside the formatted chat ==> this is what we need to add.
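
One possible shape for that addition (purely hypothetical, not an existing llama.cpp API) is to return the stop sequence next to the formatted text, so `main.cpp` can match it against the generated characters:

```cpp
#include <string>

// hypothetical result type bundling the stop sequence with the formatted chat
struct chat_format_result {
    std::string formatted; // text to tokenize and evaluate
    std::string stop_seq;  // e.g. "<|im_end|>" for ChatML
};

// main.cpp could then stop when the tail of the generated text matches stop_seq
static bool hit_stop_sequence(const std::string & generated, const std::string & stop_seq) {
    return generated.size() >= stop_seq.size() &&
           generated.compare(generated.size() - stop_seq.size(),
                             stop_seq.size(), stop_seq) == 0;
}
```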