Description
Current Behavior
Since #2810, a leading space is inserted into any non-empty text. This breaks multiple use cases:
- Infill: Incorrect Tokenization (#3503 (comment))
- Prompt size control: splitting long text into pieces and adding them to the prompt until a certain token limit is reached. It is desirable to know the precise token count of each piece, but space insertion gets in the way here.
- Prompt formats that use added tokens. Whether a space should be inserted after a special token depends on the particular model used (how it was trained/finetuned and, perhaps, the alignment of the stars). The decision should be left to the code that composes/formats prompts.
- The Yi tokenizer does not insert a space.
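To make the breakage concrete, here is a toy SentencePiece-style tokenizer (hypothetical; the real tokenizer works on a trained vocabulary) showing that unconditional space insertion corrupts a prompt that is tokenized in pieces and recombined:

```python
def toy_tokenize(text, insert_space=True):
    # Hypothetical stand-in for the real tokenizer: a space is folded
    # into the following token as the "▁" marker, and (as in the current
    # behavior) a leading space is inserted into any non-empty text.
    if insert_space and text:
        text = " " + text
    tokens, cur = [], ""
    for ch in text:
        if ch == " ":
            if cur:
                tokens.append(cur)
            cur = "▁"
        else:
            cur += ch
    if cur:
        tokens.append(cur)
    return tokens

def toy_detokenize(tokens):
    # SentencePiece-style: "▁" maps back to a space, minus the dummy prefix
    return "".join(tokens).replace("▁", " ").lstrip(" ")

# The whole prompt in one piece round-trips fine:
assert toy_detokenize(toy_tokenize("foobar")) == "foobar"

# Split into two pieces, a spurious space appears in the middle:
assert toy_detokenize(toy_tokenize("foo") + toy_tokenize("bar")) == "foo bar"

# Disabling insertion for the continuation piece restores the text:
assert toy_detokenize(toy_tokenize("foo") + toy_tokenize("bar", insert_space=False)) == "foobar"
```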
Recently, space insertion was disabled for the case when the text representation of a special token is recognized: https://github.com/ggerganov/llama.cpp/blob/1a159553f921a9209fed8c714494e57b3649f232/llama.cpp#L6729 This works for toying with the main example when escape processing is enabled, but it leaves other scenarios broken. In particular, when special token identifiers are added to the prompt by the client and passed to the server (which is the proper way to handle added tokens), a space is inserted into each piece of text between the special tokens.
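The server case can be sketched as follows (toy, one-token-per-character encoder and made-up token ids; not the actual server code): the client composes a mixed list of special-token ids and text segments, and the current behavior pushes a space into every segment.

```python
IM_START, IM_END = 32001, 32002  # hypothetical added-token ids

def toy_tokenize(text, insert_space=True):
    # stand-in for the real tokenizer: one token per character,
    # with the current unconditional leading-space insertion
    return list((" " + text) if (insert_space and text) else text)

# client sends a mixed prompt: [id, text, id, text, ...]
prompt = [IM_START, "user", IM_END, "Hi"]

tokens = []
for part in prompt:
    tokens += [part] if isinstance(part, int) else toy_tokenize(part)

# every text segment between special tokens picked up a space token:
assert tokens == [IM_START, " ", "u", "s", "e", "r", IM_END, " ", "H", "i"]
```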
Space insertion was added to match the original Python implementation, but that behavior is itself not optimal, as evidenced by people having to hack around it.
Proposals
Option 1
Insert the space only when the BOS token is also inserted. There is a clear intersection between the cases where BOS and the space need to be inserted: both apply when the whole prompt is a single chunk of text. All of the broken cases listed above involve splitting and recombining the prompt.
The argument `bos` can optionally be renamed to something like `full_prompt`, or the inverse `partial`, with a meaning that would encompass the behaviors controlled by it.
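A minimal sketch of Option 1, with a toy one-token-per-character encoder and an assumed BOS id (neither is the actual llama.cpp API):

```python
BOS_ID = 1  # hypothetical id

def toy_encode(text):
    # hypothetical stand-in for the real encoder: one token per character
    return list(text)

def tokenize(text, add_bos):
    # Option 1: insert the dummy leading space exactly when BOS is
    # inserted, i.e. only when tokenizing a full prompt
    out = [BOS_ID] if add_bos else []
    if add_bos and text:
        text = " " + text
    return out + toy_encode(text)

# full prompt: BOS and the leading space go in together
assert tokenize("hi", add_bos=True) == [BOS_ID, " ", "h", "i"]
# partial piece: neither is inserted, so pieces recombine cleanly
assert tokenize("hi", add_bos=False) == ["h", "i"]
```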
Option 2
Add a separate argument that controls the insertion of the space. It would be used by `/tokenize` in the server and in other places.
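A sketch of Option 2 under the same toy setup (names and the toy encoder are assumptions): the space gets its own flag, independent of BOS.

```python
BOS_ID = 1  # hypothetical id

def toy_encode(text):
    # hypothetical stand-in for the real encoder: one token per character
    return list(text)

def tokenize(text, add_bos=False, add_space=False):
    out = [BOS_ID] if add_bos else []
    if add_space and text:
        text = " " + text
    return out + toy_encode(text)

# a /tokenize-style call can now request exact, space-free tokenization
assert tokenize("hi") == ["h", "i"]
# ...or opt back into the old behavior explicitly
assert tokenize("hi", add_bos=True, add_space=True) == [BOS_ID, " ", "h", "i"]
```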
Option 3
Add another tokenization function with options that would control many aspects of the process. Suggested by @staviq.