Name and Version
version: 5327 (27ebfca)
built with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu
Operating systems
Linux
GGML backends
CUDA
Hardware
nvidia
Models
all
Problem description & steps to reproduce
After the user prompt is provided, the code enters the branch at line 716 (commit 0cf6725). In this iteration, no new token is generated.
However, the code at line 824 (commit 0cf6725) assumes that a new token was generated and inserts it into the assistant response.
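For illustration, here is a minimal, self-contained C++ sketch of the suspected pattern, assuming both line references point into tools/main/main.cpp at the linked commit; all identifiers below (`last_sampled`, `token_to_piece`, `assistant_ss`, `user_turn`) are hypothetical stand-ins, not the actual names in main.cpp:

```cpp
#include <iostream>
#include <sstream>
#include <string>

// Hypothetical stand-ins for sampler state and token-to-text conversion;
// the real code keeps equivalent state in the sampler/context.
static int last_sampled = -1; // last token accepted by the sampler
static std::string token_to_piece(int tok) { return "<tok" + std::to_string(tok) + ">"; }

int main() {
    std::ostringstream assistant_ss; // accumulates the assistant reply
    bool user_turn = true;           // right after the user prompt, no token is sampled

    for (int iter = 0; iter < 2; ++iter) {
        if (user_turn) {
            // Branch taken after the user prompt (cf. line 716): the prompt
            // tokens are consumed, but NO new token is generated here.
            user_turn = false;
        } else {
            last_sampled = 42; // a token actually generated by the model
        }

        // Suspected bug (cf. line 824): this append runs unconditionally, so on
        // the user-input iteration it inserts a stale token -- the last token
        // of the PREVIOUS turn -- into the assistant response.
        assistant_ss << token_to_piece(last_sampled);
    }

    // Prints "<tok-1><tok42>": the first piece is the stale token.
    std::cout << assistant_ss.str() << "\n";
}
```

In this sketch the fix would be to append only when a token was actually sampled in the current iteration, so the user-input branch never touches the assistant response.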
First Bad Commit
No response
Relevant log output
The easiest way to observe the problem is to set a breakpoint here and wait for the assistant message:
https://github.com/ggml-org/llama.cpp/blob/0cf6725e9f9a164c39f7a87214d60342f7f946d8/tools/main/main.cpp#L270
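For example, with a debug build, a session along these lines should hit it (the model path is a placeholder; `-cnv` enables conversation mode in llama-cli):

```sh
# assumes a debug build of llama-cli; adjust the model path
cmake -B build -DCMAKE_BUILD_TYPE=Debug
cmake --build build --target llama-cli
gdb --args ./build/bin/llama-cli -m ./models/model.gguf -cnv
(gdb) break main.cpp:270
(gdb) run
# enter a user prompt, then inspect the assistant message when the
# breakpoint is hit and check whether a stale token was inserted
```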