Description
I am comparing the tokenization of the codellama repository with the infill example of this repository.
The first example prompt from the codellama repository consists of the strings:
- Prefix: `'def remove_non_ascii(s: str) -> str:\n    """ '`
- Suffix: `'\n    return result\n'`
Comparing the tokenization of both implementations results in:
- CodeLlama: `1 32007 822 3349 29918 5464 29918 294 18869 29898 29879 29901 851 29897 1599 851 29901 13 1678 9995 29871 32008 13 1678 736 1121 13 32009`
- Llama.cpp: `32007 1 822 3349 29918 5464 29918 294 18869 29898 29879 29901 851 29897 1599 851 29901 13 1678 9995 29871 32008 1 29871 13 1678 736 1121 13 32009`
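For reference, here is roughly how I reproduce the CodeLlama-side sequence with the sentencepiece Python package. This is only a sketch of my understanding, not the reference code: the token IDs (1 = `bos`, 32007 = `prefix_id`, 32008 = `suffix_id`, 32009 = `middle_id`), the `tokenizer.model` path, and the `encode_without_leading_space` helper are my own assumptions.

```python
# Sketch: rebuilding the CodeLlama-style infill sequence with sentencepiece.
# The special-token IDs below are assumptions based on the sequences above.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")  # CodeLlama tokenizer

BOS, PREFIX_ID, SUFFIX_ID, MIDDLE_ID = 1, 32007, 32008, 32009

prefix = 'def remove_non_ascii(s: str) -> str:\n    """ '
suffix = '\n    return result\n'


def encode_without_leading_space(s: str) -> list[int]:
    # Hypothetical helper: let the implicit leading "▁" attach to a sentinel
    # string, then strip the sentinel's tokens off the front again.
    sentinel = "\n"
    sentinel_tokens = sp.encode(sentinel)
    tokens = sp.encode(sentinel + s)
    assert tokens[: len(sentinel_tokens)] == sentinel_tokens
    return tokens[len(sentinel_tokens):]


infill_tokens = (
    [BOS, PREFIX_ID]
    + sp.encode(prefix)
    + [SUFFIX_ID]
    + encode_without_leading_space(suffix)
    + [MIDDLE_ID]
)
print(infill_tokens)  # should match the CodeLlama sequence above, if my assumptions hold
```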
There are two differences:
- The first two tokens are swapped (those are `prefix_id` and `bos`, I think).
- Llama.cpp adds a `bos` token again after the `suffix_id` token, plus an additional 29871 (is this a space?).
I believe the latter is definitely wrong, as the paper states on page 4:

> To limit the distribution shift between autoregressive and infilling training, we suppress the implicit leading space that SentencePiece tokenizers add upon encoding the middle part and the suffix.
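To make that concrete for this suffix: encoding it directly yields a leading 29871 (the implicit "▁"), which the suppressed encoding drops. A small sketch, using the same assumed `tokenizer.model` and sentinel trick as above (not necessarily how the reference code implements the suppression):

```python
# Sketch of the leading-space suppression described in the paper, using the
# same assumed tokenizer.model as above (not the actual codellama/llama.cpp code).
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

suffix = '\n    return result\n'

# Plain encoding: the implicit leading "▁" shows up as 29871 before the 13 ("\n"),
# matching what llama.cpp produces after suffix_id above.
print(sp.encode(suffix))  # -> [29871, 13, 1678, 736, 1121, 13]

# Suppressed encoding: prepend a sentinel so the implicit space attaches to it,
# then strip the sentinel's tokens again; this matches the CodeLlama sequence above.
sentinel = "\n"
sentinel_tokens = sp.encode(sentinel)
tokens = sp.encode(sentinel + suffix)
assert tokens[: len(sentinel_tokens)] == sentinel_tokens
print(tokens[len(sentinel_tokens):])  # -> [13, 1678, 736, 1121, 13]
```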