Infill Incorrect Tokenization #3503

Closed
@kherud

Description

I am comparing the tokenization of the codellama repository with the infill example of this repository.

The first example prompt from the codellama repository consists of the strings:

  • Prefix: 'def remove_non_ascii(s: str) -> str:\n """ '
  • Suffix: '\n return result\n'

Comparing the tokenization of both implementations results in:

  • CodeLlama: 1 32007 822 3349 29918 5464 29918 294 18869 29898 29879 29901 851 29897 1599 851 29901 13 1678 9995 29871 32008 13 1678 736 1121 13 32009
  • Llama.cpp: 32007 1 822 3349 29918 5464 29918 294 18869 29898 29879 29901 851 29897 1599 851 29901 13 1678 9995 29871 32008 1 29871 13 1678 736 1121 13 32009

There are two differences:

  • The first two tokens are swapped (I believe those are bos and prefix_id)
  • Llama.cpp adds another bos token after the suffix_id token, as well as an additional 29871 (is this a space?)

I believe the latter is definitely wrong, as the Code Llama paper states on page 4:

To limit the distribution shift between autoregressive and infilling training, we suppress the implicit leading space that SentencePiece tokenizers add upon encoding the middle part and the suffix
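To make the two reported differences concrete, here is a small sketch that diffs the two sequences from above. The token IDs are copied verbatim from this report; the labels (bos=1, prefix_id=32007, suffix_id=32008, middle_id=32009, and 29871 as SentencePiece's lone-space token) are my reading of them, not confirmed against either codebase:

```python
# Token sequences exactly as reported for the prompt
# prefix='def remove_non_ascii(s: str) -> str:\n    """ ', suffix='\n    return result\n'
codellama = [1, 32007, 822, 3349, 29918, 5464, 29918, 294, 18869, 29898,
             29879, 29901, 851, 29897, 1599, 851, 29901, 13, 1678, 9995,
             29871, 32008, 13, 1678, 736, 1121, 13, 32009]
llama_cpp = [32007, 1, 822, 3349, 29918, 5464, 29918, 294, 18869, 29898,
             29879, 29901, 851, 29897, 1599, 851, 29901, 13, 1678, 9995,
             29871, 32008, 1, 29871, 13, 1678, 736, 1121, 13, 32009]

# Difference 1: the leading bos/prefix_id pair is swapped.
assert codellama[:2] == [1, 32007]      # bos, prefix_id
assert llama_cpp[:2] == [32007, 1]      # prefix_id, bos

# Difference 2: after suffix_id (32008), llama.cpp inserts an extra
# bos (1) and an extra leading-space token (29871, assumed to be '▁')
# before the suffix tokens.
i = codellama.index(32008)
j = llama_cpp.index(32008)
assert llama_cpp[j + 1:j + 3] == [1, 29871]

# Apart from those two insertions and the swap, the sequences agree.
assert codellama[i + 1:] == llama_cpp[j + 3:]
```

The second assertion pair is exactly the behavior the quoted passage from the paper says should be suppressed: no implicit leading space when encoding the suffix.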

Metadata

Assignees

No one assigned

Labels

bug: Something isn't working
high priority: Very important issue
