Description
I'm comparing the tokenization between the original Meta repo and llama.cpp with LLaMA (I also had the same issue with LLaMA v2).
For example, tokenizing the prompts "Hello world" and " Hello world" gives the following:
For prompt "Hello world":
llama.cpp tokenizer: [10994, 3186]
Meta tokenizer: [15043, 3186]

For prompt " Hello world":
llama.cpp tokenizer: [15043, 3186]
Meta tokenizer: [29871, 15043, 3186]
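For reference, this comparison can be reproduced with something like the following minimal sketch, assuming the llama-cpp-python bindings and the sentencepiece package, with tokenizer.model (the Meta tokenizer file linked below) and ggml-model-f16.bin in the working directory:

```python
import sentencepiece as spm
from llama_cpp import Llama

# Meta tokenizer: the original SentencePiece model shipped with LLaMA.
sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

# llama.cpp tokenizer: load only the vocabulary, no weights needed.
llm = Llama(model_path="ggml-model-f16.bin", vocab_only=True)

for prompt in ("Hello world", " Hello world"):
    print(f'For prompt "{prompt}":')
    print("llama.cpp tokenizer:", llm.tokenize(prompt.encode("utf-8"), add_bos=False))
    print("Meta tokenizer:", sp.encode(prompt))
```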
Exploring the tokens by detokenizing them, I got:
For tokens "[10994, 3186]":
llama.cpp tokenizer: |b'Hello world'|
Meta tokenizer: |Hello world|

For tokens "[15043, 3186]":
llama.cpp tokenizer: |b' Hello world'|
Meta tokenizer: |Hello world|

For tokens "[29871, 15043, 3186]":
llama.cpp tokenizer: |b' Hello world'|
Meta tokenizer: | Hello world|

*The | delimiters are added to ease visualization.
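The detokenization side of the comparison looks like this (same assumed setup as above; llama.cpp's detokenize returns raw bytes, which is why b'...' shows up in the output):

```python
for tokens in ([10994, 3186], [15043, 3186], [29871, 15043, 3186]):
    print(f'For tokens "{tokens}":')
    print("llama.cpp tokenizer: |%s|" % llm.detokenize(tokens))
    print("Meta tokenizer: |%s|" % sp.decode(tokens))
```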
Exploring each token above with the id_to_piece functionality:

Looking at the id_to_piece output for llama.cpp:
id 10994 |b'Hello'|
id 3186 |b' world'|
id 15043 |b' Hello'|
id 29871 |b' '|

Looking at the id_to_piece output for Meta:
id 10994 |Hello|
id 3186 |▁world|
id 15043 |▁Hello|
id 29871 |▁|

*The | delimiters are added to ease visualization.
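On the Meta side, this is a plain vocabulary lookup (a sketch with the same assumed sp object; the llama.cpp pieces above come from the equivalent vocabulary lookup in the bindings, whose exact API depends on the version, so only the SentencePiece side is shown):

```python
for tid in (10994, 3186, 15043, 29871):
    # id_to_piece returns the raw vocabulary entry, including the
    # "\u2581" word-boundary marker SentencePiece uses internally.
    print(f"id {tid} |{sp.id_to_piece(tid)}|")
```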
Note that the character shown for token 29871 is not an underscore but "\u2581" (see more about this here).
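A quick check makes that explicit (same assumed sp object as above):

```python
piece = sp.id_to_piece(29871)
print(piece == "\u2581")  # True: U+2581, LOWER ONE EIGHTH BLOCK
print(piece == "_")       # False: not an ASCII underscore
```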
But, using the detokenizer on each id individually:
Using the llama.cpp detokenizer:
id 10994 |b'Hello'|
id 3186 |b' world'|
id 15043 |b' Hello'|
id 29871 |b' '|

Using the Meta detokenizer:
id 10994 |Hello|
id 3186 |world|
id 15043 |Hello|
id 29871 ||
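Decoding each id on its own shows where the two implementations diverge (same assumed setup): llama.cpp renders the "\u2581" marker as a literal space in every piece, while SentencePiece's decode treats it as a word boundary and strips the resulting leading space.

```python
for tid in (10994, 3186, 15043, 29871):
    # llama.cpp's detokenize maps "\u2581" to a plain space;
    # SentencePiece's decode strips the leading space it produces.
    print("id", tid,
          "llama.cpp: |%s|" % llm.detokenize([tid]),
          "Meta: |%s|" % sp.decode([tid]))
```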
The code used to produce these results can be seen here.
Use this file for the Meta tokenizer.
The model ggml-model-f16.bin is the 7B LLaMA model converted with the convert.py script, as mentioned here.