Description
Current Behavior
Since #2810, a leading space is inserted into any non-empty text. This breaks multiple use cases:
- Infill: Incorrect Tokenization (#3503 (comment))
- Prompt size control: splitting long text into pieces and adding them to the prompt until a certain token limit is reached. It is desirable to know the precise token count of each piece, but space insertion gets in the way here.
- Prompt formats that use added tokens. Whether a space should be inserted after a special token depends on the particular model used (how it was trained/finetuned and, perhaps, the alignment of the stars). The decision should be left to the code that composes/formats prompts.
- The Yi tokenizer does not insert a space.
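To make the breakage concrete, here is a toy SentencePiece-style tokenizer (hypothetical; the real tokenizer works on a trained vocabulary) showing that unconditional space insertion corrupts a prompt that is tokenized in pieces and recombined:

```python
def toy_tokenize(text, insert_space=True):
    # Hypothetical stand-in for the real tokenizer: a space is folded
    # into the following token as the "▁" marker, and (as in the current
    # behavior) a leading space is inserted into any non-empty text.
    if insert_space and text:
        text = " " + text
    tokens, cur = [], ""
    for ch in text:
        if ch == " ":
            if cur:
                tokens.append(cur)
            cur = "▁"
        else:
            cur += ch
    if cur:
        tokens.append(cur)
    return tokens

def toy_detokenize(tokens):
    # SentencePiece-style: "▁" maps back to a space, minus the dummy prefix
    return "".join(tokens).replace("▁", " ").lstrip(" ")

# The whole prompt in one piece round-trips fine:
assert toy_detokenize(toy_tokenize("foobar")) == "foobar"

# Split into two pieces, a spurious space appears in the middle:
assert toy_detokenize(toy_tokenize("foo") + toy_tokenize("bar")) == "foo bar"

# Disabling insertion for the continuation piece restores the text:
assert toy_detokenize(toy_tokenize("foo") + toy_tokenize("bar", insert_space=False)) == "foobar"
```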
Recently, space insertion was disabled for the case when the text representation of a special token is recognized: https://github.com/ggerganov/llama.cpp/blob/1a159553f921a9209fed8c714494e57b3649f232/llama.cpp#L6729 This works for toying with the main example when escape processing is enabled, but it leaves other scenarios broken. In particular, when special token identifiers are added to the prompt by the client and passed to the server (which is the proper way to handle added tokens), a space is inserted into each piece of text between the special tokens.
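The server case can be sketched as follows (toy, one-token-per-character encoder and made-up token ids; not the actual server code): the client composes a mixed list of special-token ids and text segments, and the current behavior pushes a space into every segment.

```python
IM_START, IM_END = 32001, 32002  # hypothetical added-token ids

def toy_tokenize(text, insert_space=True):
    # stand-in for the real tokenizer: one token per character,
    # with the current unconditional leading-space insertion
    return list((" " + text) if (insert_space and text) else text)

# client sends a mixed prompt: [id, text, id, text, ...]
prompt = [IM_START, "user", IM_END, "Hi"]

tokens = []
for part in prompt:
    tokens += [part] if isinstance(part, int) else toy_tokenize(part)

# every text segment between special tokens picked up a space token:
assert tokens == [IM_START, " ", "u", "s", "e", "r", IM_END, " ", "H", "i"]
```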
Space insertion was added to match the original Python implementation, but that behavior is itself not optimal, as evidenced by people having to hack around it.
Proposals
Option 1
Insert the space only when the BOS token is also inserted. There is a clear intersection between the cases where BOS and the space need to be inserted: both apply when the whole prompt is a single chunk of text. All of the broken cases listed above involve splitting and recombining the prompt.
The argument `bos` can optionally be renamed to something like `full_prompt`, or the inverse `partial`, with a meaning that would encompass the behaviors controlled by it.
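A minimal sketch of Option 1, with a toy one-token-per-character encoder and an assumed BOS id (neither is the actual llama.cpp API):

```python
BOS_ID = 1  # hypothetical id

def toy_encode(text):
    # hypothetical stand-in for the real encoder: one token per character
    return list(text)

def tokenize(text, add_bos):
    # Option 1: insert the dummy leading space exactly when BOS is
    # inserted, i.e. only when tokenizing a full prompt
    out = [BOS_ID] if add_bos else []
    if add_bos and text:
        text = " " + text
    return out + toy_encode(text)

# full prompt: BOS and the leading space go in together
assert tokenize("hi", add_bos=True) == [BOS_ID, " ", "h", "i"]
# partial piece: neither is inserted, so pieces recombine cleanly
assert tokenize("hi", add_bos=False) == ["h", "i"]
```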
Option 2
Add a separate argument that controls the insertion of the space. It would be used by `/tokenize` in the server and in other places.
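A sketch of Option 2 under the same toy setup (names and the toy encoder are assumptions): the space gets its own flag, independent of BOS.

```python
BOS_ID = 1  # hypothetical id

def toy_encode(text):
    # hypothetical stand-in for the real encoder: one token per character
    return list(text)

def tokenize(text, add_bos=False, add_space=False):
    out = [BOS_ID] if add_bos else []
    if add_space and text:
        text = " " + text
    return out + toy_encode(text)

# a /tokenize-style call can now request exact, space-free tokenization
assert tokenize("hi") == ["h", "i"]
# ...or opt back into the old behavior explicitly
assert tokenize("hi", add_bos=True, add_space=True) == [BOS_ID, " ", "h", "i"]
```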
Option 3
Add another tokenization function with options that would control many aspects of the process. Suggested by @staviq.