The llama.cpp tokenizer and the reference llama tokenizer produce different output for the same prompt:
main: prompt: 'This is 🦙.cpp'
main: number of tokens in prompt = 10
1 -> ''
4013 -> 'This'
338 -> ' is'
29871 -> ' '
243 -> '�'
162 -> '�'
169 -> '�'
156 -> '�'
29889 -> '.'
8223 -> 'cpp'
Meanwhile the llama tokenizer produces:
text = "This is 🦙.cpp"
t = tokenizer.encode(text, bos=True, eos=False)
[1, 910, 338, 29871, 243, 162, 169, 156, 29889, 8223]
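For completeness, tokenizer above is presumably the SentencePiece-based Tokenizer class from the original llama repository; a minimal setup sketch (the import path and model path are assumptions):

from llama.tokenizer import Tokenizer  # Tokenizer class from the original llama repo
# Path to the SentencePiece model file is a guess; adjust to your local checkout.
tokenizer = Tokenizer(model_path="models/tokenizer.model")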
So in one case "This" is encoded as 4013 and in the other as 910. I have verified that both ids decode to the same text:
t1 = tokenizer.decode([4013])
t2 = tokenizer.decode([910])
print(t1, [int(b) for b in bytes(t1, "UTF-8")])
print(t2, [int(b) for b in bytes(t2, "UTF-8")])
This [84, 104, 105, 115]
This [84, 104, 105, 115]
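One way to see how two different ids can decode to identical text is to inspect the raw vocabulary pieces behind them. A minimal sketch, assuming the Tokenizer exposes its SentencePieceProcessor as sp_model (as in the original llama repo):

# Look up the underlying SentencePiece pieces for both ids.
for tok_id in (4013, 910):
    piece = tokenizer.sp_model.id_to_piece(tok_id)
    print(tok_id, repr(piece))

If the two pieces differ only in the leading "▁" whitespace marker, decode() will render both as plain "This", which would explain the identical byte sequences above.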
I am not sure whether this causes any significant difference in generation, but it may be a good idea to check.
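As a quick first check, one could diff the two token sequences over a set of prompts; a sketch, assuming the llama.cpp ids are copied manually from its prompt-token printout (as in the output above):

# Reference encoding vs. token ids collected from llama.cpp's startup printout.
prompts = {
    "This is 🦙.cpp": [1, 4013, 338, 29871, 243, 162, 169, 156, 29889, 8223],  # llama.cpp ids
}

for text, cpp_ids in prompts.items():
    ref_ids = tokenizer.encode(text, bos=True, eos=False)
    if ref_ids != cpp_ids:
        print(f"mismatch for {text!r}")
        print("  llama.cpp:", cpp_ids)
        print("  reference:", ref_ids)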