Skip to content

Support start strings, the opposite of stop tokens. #13214

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 16 commits into
base: master
Choose a base branch
from

Conversation

matteoserva
Copy link
Contributor

This allows setting one or multiple start strings that behave the opposite of stop tokens.
The output is discarded until a start string is reached, then it is sent to client as usual even in streaming mode.
If no start string is found, the entire generated text is sent to client at the end of generation.

Use case:

  • Support for broken clients that don't behave well with reasoning models (github copilot) while still allowing the model to perform its reasoning. Example: set start string to "</think>

From command line: llama-server --start-string "</think>" --start-string "</thinking>"

From client:

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "Qwen/Qwen3-8B",
  "messages": [
    {"role": "user", "content": "Hi"}
  ],
  "start_strings": ["</think>","</thinking>"]
}'

@ngxson
Copy link
Collaborator

ngxson commented Apr 30, 2025

Please also add a test case under server/tests/unit/test_chat_completion.py

@github-actions github-actions bot added the python python script changes label Apr 30, 2025
@matteoserva
Copy link
Contributor Author

Please also add a test case under server/tests/unit/test_chat_completion.py

done

@matteoserva matteoserva force-pushed the start branch 2 times, most recently from 0bcef2a to c52ce58 Compare May 2, 2025 07:07
@@ -2200,7 +2241,7 @@ struct server_context {
pos = std::min(slot.n_sent_text, slot.generated_text.size());
} else if (slot.has_next_token) {
stop_pos = slot.find_stopping_strings(str_test, token_str.size(), false);
send_text = stop_pos == std::string::npos;
send_text = send_text && stop_pos == std::string::npos;
}

// check if there is any token to predict
Copy link
Collaborator

@ngxson ngxson May 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Upon closer inspection, I think manipulating send_text is not an optimum solution here.

If you look at the streamed response, when start string is not yet found, the server will still send an empty string for each token generated, which is very wasteful. See the else branch of if (send_text) below.

Instead, I think this logic should be placed outside this if (!incomplete) block

Copy link
Collaborator

@ngxson ngxson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for slow review, this feature is very cool but I'm always not very confident about the code for processing the generated token. The logic has always been a bit fragile, so I may take it slow to think about all of the possible cases.

I think I will take over this PR for now, will push some modification directly here.

std::string found_string = slot.params.start_strings[search_result.second];
slot.generated_text.erase(
slot.generated_text.begin(),
slot.generated_text.begin() + found_pos + found_string.size());
Copy link
Collaborator

@ngxson ngxson May 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand correctly, the final thing we need is found_pos + found_string.size(), so probably find_first_substring don't even need to return the found_pos and found_string. Instead, it could return the position right after the found start string. This way, we don't even need to care about which start string is selected.

For example, my start string is </think>:

Ok I'm ready to give the answer</think>The final answer is
                                       ^ return this position

@matteoserva matteoserva marked this pull request as draft May 3, 2025 16:24
@matteoserva
Copy link
Contributor Author

Sorry for slow review, this feature is very cool but I'm always not very confident about the code for processing the generated token. The logic has always been a bit fragile, so I may take it slow to think about all of the possible cases.

I think I will take over this PR for now, will push some modification directly here.

I took some time to refactor the server code and try to handle all the funny edge cases.
Maybe you could take a look at it.

@ngxson
Copy link
Collaborator

ngxson commented May 3, 2025

I don't have time for this rn, but tbh the new version looks even more hacky and risky than before. I would expect something very simple

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
examples python python script changes server
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants