Description
Motivation
From the day I added `llama_chat_apply_template` in #5538, I have been thinking about bringing it into `main.cpp` to replace the current `-cml` option. However, it is not as easy as it seems. The main reason is that `main.cpp` still relies on antiprompt and static prefix/postfix/infix to work with "chat".

The whole reason antiprompt existed in the first place was because, in the early era of LLMs:
- We didn't have a good way to differentiate roles (user - assistant)
- We didn't have a good way to know when to stop the generation

However, a lot has changed since then: we now have the notion of a "chat template", newer models have special tokens like `<|user|>` to replace `Human:`, and most models are fine-tuned to stop generation by outputting an EOS token...

For that reason, using antiprompt and static prefix/postfix/infix is no longer a viable option for adding chat template support to `main.cpp`. That forces us to be a bit more creative.
Possible Implementation
- The prefix/postfix can be changed dynamically based on the message role. For example, ChatML uses `"<|im_start|>" + role + "\n"` as the prefix (`role` is dynamic, based on the current message) and `<|im_end|>\n` as the postfix. This idea is being implemented in Refactor chat template API #6822
- Use `llama_token_is_eog()` to replace antiprompt (see the sketch after this list). Additionally, for compatibility reasons, we can translate the EOG token back into an antiprompt, because some models output the antiprompt as a sequence of multiple tokens (newer models never do this).
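
A rough sketch of both ideas, assuming ChatML and the existing `llama_token_is_eog()` API; the `chatml_prefix`/`chatml_postfix` helpers are hypothetical, only here to illustrate the role-dependent prefix:

```cpp
#include "llama.h"

#include <string>

// hypothetical helper: the prefix depends on the role of the current message
static std::string chatml_prefix(const std::string & role) {
    return "<|im_start|>" + role + "\n";
}

// hypothetical helper: the postfix is the same for every role
static std::string chatml_postfix() {
    return "<|im_end|>\n";
}

// inside the generation loop, stop on any end-of-generation token
// instead of scanning the generated text for an antiprompt string
static bool should_stop(const llama_model * model, llama_token id) {
    return llama_token_is_eog(model, id);
}
```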
Old proposal (outdated)
Possible Implementation
My idea is to use `llama_chat_apply_template` in `main.cpp`. This will effectively deprecate the antiprompt, prompt prefix/postfix and cml options.
Format the chat on-the-go
For now, `llama_chat_apply_template` produces a very "additive" result when a new message is added to the list. "Additive" means, for example, that if I have `[msg1, msg2]`, then I get the formatted chat `msg1_msg2`. When I add `msg3` to the list, it must append the formatted `msg3` to the end of the formatted chat without touching the existing content, resulting in `msg1_msg2_msg3` in this example. A wrong result would be `msg1+++msg2_msg3`.
This is very important. Unlike `server.cpp`, where we clear the KV cache and re-format a new prompt each time, `main.cpp` adds new tokens on top of the existing ones, then continues the generation until a condition is met (maybe an EOS token or a stop sequence).

So, to use `llama_chat_apply_template` in `main.cpp`, a test case must be added to test all chat templates and make sure they are all "additive". `main.cpp` can then keep track of a list of messages, re-apply the chat template each time, and take only the "added" part, as sketched after the example below.
Example:
- Messages: `[msg1, msg2]` ==> Formatted: `<user>msg1<assistant>msg2`
- Messages: `[msg1, msg2, msg3]` ==> Formatted: `<user>msg1<assistant>msg2<user>msg3` ==> Part to evaluate: `<user>msg3`
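
A minimal sketch of that bookkeeping, assuming the current `llama_chat_apply_template()` signature (passing `tmpl = nullptr` to use the model's built-in template); `format_chat` and `added_part` are hypothetical helpers, not existing APIs:

```cpp
#include "llama.h"

#include <string>
#include <vector>

// format the whole message list with the model's built-in template
static std::string format_chat(const llama_model * model,
                               const std::vector<llama_chat_message> & msgs,
                               bool add_ass) {
    std::vector<char> buf(4096);
    int32_t n = llama_chat_apply_template(model, nullptr, msgs.data(), msgs.size(),
                                          add_ass, buf.data(), (int32_t) buf.size());
    if (n > (int32_t) buf.size()) {
        // buffer was too small: resize and apply again
        buf.resize(n);
        n = llama_chat_apply_template(model, nullptr, msgs.data(), msgs.size(),
                                      add_ass, buf.data(), (int32_t) buf.size());
    }
    return n < 0 ? std::string() : std::string(buf.data(), n);
}

// return only the part that still needs to be tokenized and evaluated;
// this only works if the template is "additive" (prev is a prefix of curr)
static std::string added_part(const std::string & prev, const std::string & curr) {
    if (curr.compare(0, prev.size(), prev) != 0) {
        return curr; // not additive: the caller would have to re-evaluate from scratch
    }
    return curr.substr(prev.size());
}
```

On each user turn, `main.cpp` would push the new message onto the vector, call `format_chat()` again, and tokenize only `added_part(prev, curr)`; the same prefix check is essentially the "additive" test case mentioned above.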
Manage stop sequences
While it is ideal to use a stop token (for example, EOS or `<|im_end|>`) to stop generation, not all models support this (some models still break `<|im_end|>` into `<|`, `im`, `_end`, `|>`), so relying on a stop token alone is not an option.
`llama_chat_apply_template` should return the stop sequence alongside the formatted chat ==> this is what we need to add.
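
One possible shape for that addition (purely hypothetical, not an existing llama.cpp API) is to return the stop sequence next to the formatted text, so `main.cpp` can match it against the generated characters:

```cpp
#include <string>

// hypothetical result type bundling the stop sequence with the formatted chat
struct chat_format_result {
    std::string formatted; // text to tokenize and evaluate
    std::string stop_seq;  // e.g. "<|im_end|>" for ChatML
};

// main.cpp could then stop when the tail of the generated text matches stop_seq
static bool hit_stop_sequence(const std::string & generated, const std::string & stop_seq) {
    return generated.size() >= stop_seq.size() &&
           generated.compare(generated.size() - stop_seq.size(),
                             stop_seq.size(), stop_seq) == 0;
}
```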