Differences with the llama tokenizer #167

Closed
@slaren

In this case, the llama.cpp tokenizer and the llama tokenizer produce different output:

main: prompt: 'This is 🦙.cpp'
main: number of tokens in prompt = 10
     1 -> ''
  4013 -> 'This'
   338 -> ' is'
 29871 -> ' '
   243 -> '�'
   162 -> '�'
   169 -> '�'
   156 -> '�'
 29889 -> '.'
  8223 -> 'cpp'

Meanwhile the llama tokenizer produces:

text = "This is 🦙.cpp"
t = tokenizer.encode(text, bos=True, eos=False)

[1, 910, 338, 29871, 243, 162, 169, 156, 29889, 8223]
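
For anyone without the llama repo handy, the same IDs can presumably be reproduced by loading tokenizer.model with the sentencepiece library directly, since the llama tokenizer is a SentencePiece model. A minimal sketch, where the model path is an assumption:

from sentencepiece import SentencePieceProcessor

sp = SentencePieceProcessor(model_file="models/tokenizer.model")  # path is an assumption
ids = [sp.bos_id()] + sp.encode("This is 🦙.cpp")
print(ids)  # should match the reference output above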

So in one case "This" is encoded as 4013, and in the other as 910. I have verified that both IDs decode to the same text:

t1 = tokenizer.decode([4013])
t2 = tokenizer.decode([910])
print(t1, [int(b) for b in bytes(t1, "UTF-8")])
print(t2, [int(b) for b in bytes(t2, "UTF-8")])

This [84, 104, 105, 115]
This [84, 104, 105, 115]
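
A plausible explanation, not verified here, is that the two IDs map to distinct pieces in the SentencePiece vocabulary, one carrying the leading '▁' word-boundary marker and one not, both of which decode to the same surface text. A short sketch to check this (same assumed model path as above):

from sentencepiece import SentencePieceProcessor

sp = SentencePieceProcessor(model_file="models/tokenizer.model")  # path is an assumption
for i in (4013, 910):
    # id_to_piece returns the raw vocabulary entry, marker included
    print(i, repr(sp.id_to_piece(i)))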

I am not sure whether this causes any significant differences in generation, but it may be worth checking.
