The llama.cpp tokenizer and the reference llama tokenizer produce different output for the same prompt:
main: prompt: 'This is 🦙.cpp'
main: number of tokens in prompt = 10
1 -> ''
4013 -> 'This'
338 -> ' is'
29871 -> ' '
243 -> '�'
162 -> '�'
169 -> '�'
156 -> '�'
29889 -> '.'
8223 -> 'cpp'
Meanwhile the llama tokenizer produces:
text = "This is 🦙.cpp"
t = tokenizer.encode(text, bos=True, eos=False)
[1, 910, 338, 29871, 243, 162, 169, 156, 29889, 8223]
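For completeness, tokenizer above is presumably the SentencePiece-based Tokenizer class from the original llama repository; a minimal setup sketch (the import path and model path are assumptions):

from llama.tokenizer import Tokenizer  # Tokenizer class from the original llama repo
# Path to the SentencePiece model file is a guess; adjust to your local checkout.
tokenizer = Tokenizer(model_path="models/tokenizer.model")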
So in one case "This" is encoded as 4013 and in the other as 910. I have verified that both ids decode to the same text:
t1 = tokenizer.decode([4013])
t2 = tokenizer.decode([910])
print(t1, [int(b) for b in bytes(t1, "UTF-8")])
print(t2, [int(b) for b in bytes(t2, "UTF-8")])
This [84, 104, 105, 115]
This [84, 104, 105, 115]
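One way to see how two different ids can decode to identical text is to inspect the raw vocabulary pieces behind them. A minimal sketch, assuming the Tokenizer exposes its SentencePieceProcessor as sp_model (as in the original llama repo):

# Look up the underlying SentencePiece pieces for both ids.
for tok_id in (4013, 910):
    piece = tokenizer.sp_model.id_to_piece(tok_id)
    print(tok_id, repr(piece))

If the two pieces differ only in the leading "▁" whitespace marker, decode() will render both as plain "This", which would explain the identical byte sequences above.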
I am not sure whether this causes any significant difference in generation, but it may be a good idea to check.
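As a quick first check, one could diff the two token sequences over a set of prompts; a sketch, assuming the llama.cpp ids are copied manually from its prompt-token printout (as in the output above):

# Reference encoding vs. token ids collected from llama.cpp's startup printout.
prompts = {
    "This is 🦙.cpp": [1, 4013, 338, 29871, 243, 162, 169, 156, 29889, 8223],  # llama.cpp ids
}

for text, cpp_ids in prompts.items():
    ref_ids = tokenizer.encode(text, bos=True, eos=False)
    if ref_ids != cpp_ids:
        print(f"mismatch for {text!r}")
        print("  llama.cpp:", cpp_ids)
        print("  reference:", ref_ids)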