-
Notifications
You must be signed in to change notification settings - Fork 11.9k
Support start strings, the opposite of stop tokens. #13214
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Please also add a test case under |
done |
0bcef2a
to
c52ce58
Compare
examples/server/server.cpp
Outdated
@@ -2200,7 +2241,7 @@ struct server_context { | |||
pos = std::min(slot.n_sent_text, slot.generated_text.size()); | |||
} else if (slot.has_next_token) { | |||
stop_pos = slot.find_stopping_strings(str_test, token_str.size(), false); | |||
send_text = stop_pos == std::string::npos; | |||
send_text = send_text && stop_pos == std::string::npos; | |||
} | |||
|
|||
// check if there is any token to predict |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Upon closer inspection, I think manipulating send_text
is not an optimum solution here.
If you look at the streamed response, when start string is not yet found, the server will still send an empty string for each token generated, which is very wasteful. See the else
branch of if (send_text)
below.
Instead, I think this logic should be placed outside this if (!incomplete)
block
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for slow review, this feature is very cool but I'm always not very confident about the code for processing the generated token. The logic has always been a bit fragile, so I may take it slow to think about all of the possible cases.
I think I will take over this PR for now, will push some modification directly here.
examples/server/server.cpp
Outdated
std::string found_string = slot.params.start_strings[search_result.second]; | ||
slot.generated_text.erase( | ||
slot.generated_text.begin(), | ||
slot.generated_text.begin() + found_pos + found_string.size()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understand correctly, the final thing we need is found_pos + found_string.size()
, so probably find_first_substring
don't even need to return the found_pos
and found_string
. Instead, it could return the position right after the found start string. This way, we don't even need to care about which start string is selected.
For example, my start string is </think>
:
Ok I'm ready to give the answer</think>The final answer is
^ return this position
I took some time to refactor the server code and try to handle all the funny edge cases. |
I don't have time for this rn, but tbh the new version looks even more hacky and risky than before. I would expect something very simple |
This allows setting one or multiple start strings that behave the opposite of stop tokens.
The output is discarded until a start string is reached, then it is sent to client as usual even in streaming mode.
If no start string is found, the entire generated text is sent to client at the end of generation.
Use case:
"</think>
From command line:
llama-server --start-string "</think>" --start-string "</thinking>"
From client: