Description
I am comparing the tokenization of the codellama repository with the infill example of this repository.
The first example prompt from the codellama repository consists of the strings:
- Prefix: `'def remove_non_ascii(s: str) -> str:\n    """ '`
- Suffix: `'\n    return result\n'`
Comparing the tokenization of both implementations results in:
- CodeLlama: `1 32007 822 3349 29918 5464 29918 294 18869 29898 29879 29901 851 29897 1599 851 29901 13 1678 9995 29871 32008 13 1678 736 1121 13 32009`
- Llama.cpp: `32007 1 822 3349 29918 5464 29918 294 18869 29898 29879 29901 851 29897 1599 851 29901 13 1678 9995 29871 32008 1 29871 13 1678 736 1121 13 32009`
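For reference, here is roughly how I reproduce the CodeLlama-side sequence with the sentencepiece Python package. This is only a sketch of my understanding, not the reference code: the token IDs (1 = `bos`, 32007 = `prefix_id`, 32008 = `suffix_id`, 32009 = `middle_id`), the `tokenizer.model` path, and the `encode_without_leading_space` helper are my own assumptions.

```python
# Sketch: rebuilding the CodeLlama-style infill sequence with sentencepiece.
# The special-token IDs below are assumptions based on the sequences above.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")  # CodeLlama tokenizer

BOS, PREFIX_ID, SUFFIX_ID, MIDDLE_ID = 1, 32007, 32008, 32009

prefix = 'def remove_non_ascii(s: str) -> str:\n    """ '
suffix = '\n    return result\n'


def encode_without_leading_space(s: str) -> list[int]:
    # Hypothetical helper: let the implicit leading "▁" attach to a sentinel
    # string, then strip the sentinel's tokens off the front again.
    sentinel = "\n"
    sentinel_tokens = sp.encode(sentinel)
    tokens = sp.encode(sentinel + s)
    assert tokens[: len(sentinel_tokens)] == sentinel_tokens
    return tokens[len(sentinel_tokens):]


infill_tokens = (
    [BOS, PREFIX_ID]
    + sp.encode(prefix)
    + [SUFFIX_ID]
    + encode_without_leading_space(suffix)
    + [MIDDLE_ID]
)
print(infill_tokens)  # should match the CodeLlama sequence above, if my assumptions hold
```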
There are two differences:
- The first two tokens are swapped (those are `prefix_id` and `bos`, I think).
- Llama.cpp adds a `bos` token again after the `suffix_id` token, plus an additional 29871 (is this a space?).
I believe the latter is definitely wrong, as the paper states on page 4:

> To limit the distribution shift between autoregressive and infilling training, we suppress the implicit leading space that SentencePiece tokenizers add upon encoding the middle part and the suffix.
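To make that concrete for this suffix: encoding it directly yields a leading 29871 (the implicit "▁"), which the suppressed encoding drops. A small sketch, using the same assumed `tokenizer.model` and sentinel trick as above (not necessarily how the reference code implements the suppression):

```python
# Sketch of the leading-space suppression described in the paper, using the
# same assumed tokenizer.model as above (not the actual codellama/llama.cpp code).
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

suffix = '\n    return result\n'

# Plain encoding: the implicit leading "▁" shows up as 29871 before the 13 ("\n"),
# matching what llama.cpp produces after suffix_id above.
print(sp.encode(suffix))  # -> [29871, 13, 1678, 736, 1121, 13]

# Suppressed encoding: prepend a sentinel so the implicit space attaches to it,
# then strip the sentinel's tokens again; this matches the CodeLlama sequence above.
sentinel = "\n"
sentinel_tokens = sp.encode(sentinel)
tokens = sp.encode(sentinel + suffix)
assert tokens[: len(sentinel_tokens)] == sentinel_tokens
print(tokens[len(sentinel_tokens):])  # -> [13, 1678, 736, 1121, 13]
```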