Implement (properly) different chat templates in main.cpp #6391

Closed
@ngxson

Description

Motivation

From the day I added llama_chat_apply_template #5538, I have been thinking about bringing it into main.cpp to replace the current -cml option. However, it is not as easy as it seems. The main reason is that main.cpp still relies on antiprompt and a static prefix/postfix/infix to work with "chat".

The whole reason antiprompt existed in the first place was that, in the early era of LLMs:

  • We did not have a good way to differentiate between roles (user vs. assistant)
  • We did not have a good way to know when to stop the generation

However, a lot has changed since then: we now have the notion of a "chat template", newer models have special tokens like <|user|> to replace Human:, and most models are fine-tuned to stop generation by outputting an EOS token.

For that reason, using antiprompt and a static prefix/postfix/infix is no longer a viable way to add chat templates to main.cpp. That forces us to be a bit more creative.

Possible Implementation

  • The prefix/postfix can be changed dynamically based on the message role. For example:
    ChatML uses "<|im_start|>" + role + "\n" as the prefix (role is dynamic, based on the current message); <|im_end|>\n is the postfix.
    This idea is being implemented in Refactor chat template API #6822
  • Use llama_token_is_eog() to replace antiprompt. Additionally, for compatibility reasons, we can translate the EOG token to an antiprompt, because some models output the antiprompt as a sequence of multiple tokens (newer models never do this). A sketch of both ideas follows this list.
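To make the idea concrete, here is a minimal C++ sketch assuming the llama_token_is_eog() API from llama.h; the chatml_prefix/chatml_postfix helpers are hypothetical illustrations, not llama.cpp functions:

```cpp
#include <string>

#include "llama.h"

// Hypothetical helpers (not llama.cpp API): build the ChatML-style prefix for
// the current message's role; the postfix is the same for every role.
static std::string chatml_prefix(const std::string & role) {
    return "<|im_start|>" + role + "\n";
}

static std::string chatml_postfix() {
    return "<|im_end|>\n";
}

// In the generation loop, an EOG check replaces the old antiprompt matching.
static bool should_stop(const llama_model * model, llama_token token) {
    return llama_token_is_eog(model, token); // true for EOS, <|im_end|>, <|eot_id|>, ...
}
```
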
Old proposal (outdated)

Possible Implementation

My idea is to use llama_chat_apply_template in main.cpp. This will effectively deprecate the antiprompt, prompt prefix/postfix, and cml options.

Format the chat on the go

For now, llama_chat_apply_template produces a very "additive" result when a new message is added to the list.

"Additive" means, for example, that if I have [msg1, msg2], I get the formatted chat msg1_msg2. When I add msg3 to the list, the formatted msg3 must be appended to the end of the formatted chat without touching the existing content, resulting in msg1_msg2_msg3 in this example. A wrong result would be msg1+++msg2_msg3.

This is very important. Unlike server.cpp, where we clear the KV cache and re-format a new prompt each time, main.cpp adds new tokens on top of existing ones, then continues the generation until a condition is met (e.g. an EOS token or a stop sequence).

So, to use llama_chat_apply_template in main.cpp, a test case must be added that covers all chat templates and makes sure they are all "additive".
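As a sketch, such a test could format a shorter and a longer message list and assert that the former is a prefix of the latter; this assumes the current llama_chat_apply_template signature (model, tmpl, chat, n_msg, add_ass, buf, length) and uses illustrative message contents:

```cpp
#include <cassert>
#include <string>
#include <vector>

#include "llama.h"

// Apply a custom template string (model == nullptr) to a message list.
static std::string apply(const char * tmpl, const std::vector<llama_chat_message> & msgs) {
    std::vector<char> buf(4096);
    const int32_t n = llama_chat_apply_template(nullptr, tmpl, msgs.data(), msgs.size(),
                                                /*add_ass=*/false, buf.data(), (int32_t) buf.size());
    assert(n >= 0 && n <= (int32_t) buf.size());
    return std::string(buf.data(), n);
}

// A template is "additive" if formatting [msg1, msg2] yields a strict prefix
// of formatting [msg1, msg2, msg3].
static bool is_additive(const char * tmpl) {
    std::vector<llama_chat_message> msgs = {
        {"user",      "msg1"},
        {"assistant", "msg2"},
    };
    const std::string before = apply(tmpl, msgs);
    msgs.push_back({"user", "msg3"});
    const std::string after = apply(tmpl, msgs);
    return after.compare(0, before.size(), before) == 0;
}
```
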

main.cpp can then keep track of the list of messages, re-apply the chat template each time, and evaluate only the "added" part (sketched after the example below).

Example:

  • Messages: [msg1, msg2] ==> Formatted: <user>msg1<assistant>msg2
  • Messages: [msg1, msg2, msg3] ==> Formatted: <user>msg1<assistant>msg2<user>msg3 ==> Part to evaluate: <user>msg3
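A rough sketch of that incremental approach, again assuming the current llama_chat_apply_template signature; chat_state and chat_add are hypothetical names for illustration only:

```cpp
#include <algorithm>
#include <string>
#include <vector>

#include "llama.h"

// Hypothetical bookkeeping for main.cpp: the full message list plus the
// length of the chat that has already been formatted and evaluated.
struct chat_state {
    std::vector<llama_chat_message> messages;
    size_t formatted_len = 0;
};

// Append a message, re-apply the model's template to the whole history, and
// return only the newly added suffix -- the part main.cpp needs to tokenize
// and evaluate. role/content must outlive the state (llama_chat_message only
// stores pointers).
static std::string chat_add(chat_state & st, const llama_model * model,
                            const char * role, const char * content, bool add_ass) {
    st.messages.push_back({role, content});

    std::vector<char> buf(8192);
    int32_t n = llama_chat_apply_template(model, nullptr, st.messages.data(),
                                          st.messages.size(), add_ass,
                                          buf.data(), (int32_t) buf.size());
    if (n > (int32_t) buf.size()) {
        // the formatted chat did not fit: grow the buffer and apply again
        buf.resize(n);
        n = llama_chat_apply_template(model, nullptr, st.messages.data(),
                                      st.messages.size(), add_ass,
                                      buf.data(), (int32_t) buf.size());
    }
    const std::string formatted(buf.data(), std::max(n, 0));

    std::string delta = formatted.substr(st.formatted_len); // only the "added" part
    st.formatted_len = formatted.size();
    return delta;
}
```
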

Manage stop sequences

While it is ideal to use a stop token (for example, EOS or <|im_end|>) to stop generation, not all models support this (some models still break <|im_end|> into <|, im, _end, |>), so relying on a stop token alone is not an option.

llama_chat_apply_template should return the stop sequence alongside the formatted chat template ==> this is what we need to add.
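Purely as a hypothetical sketch of what that could look like (this function does not exist in llama.h), an extended variant could write the stop sequence into a separate buffer:

```cpp
#include "llama.h"

// Hypothetical extension, not an existing llama.cpp API: behaves like
// llama_chat_apply_template, but additionally writes the template's stop
// sequence (e.g. "<|im_end|>" for ChatML) into stop_buf so that main.cpp
// can match it against the generated text.
int32_t llama_chat_apply_template_ex(
        const struct llama_model        * model,
        const char                      * tmpl,
        const struct llama_chat_message * chat,
        size_t                            n_msg,
        bool                              add_ass,
        char                            * buf,      int32_t length,
        char                            * stop_buf, int32_t stop_length);
```
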
