There have been a few reports that grammar sampling can significantly degrade performance.
It would be nice to profile and optimize the implementation - there should be room for improvement.
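As a starting point, a micro-benchmark around the sampling step could quantify the per-token overhead before and after any change. A minimal sketch, assuming a callable that wraps one grammar-constrained sampling step (the `sample_fn` name and the dummy workload below are placeholders, not the actual sampler):

```cpp
#include <chrono>
#include <cstdio>
#include <functional>

// Hypothetical stand-in for one grammar-constrained sampling step; in a real
// profile this would wrap the grammar sampling path in llama.cpp.
using sample_fn = std::function<void()>;

// Time n_iters sampling steps and report the average cost per step.
static void profile_sampling(const sample_fn & step, int n_iters) {
    const auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < n_iters; ++i) {
        step();
    }
    const auto t1 = std::chrono::steady_clock::now();
    const double us = std::chrono::duration<double, std::micro>(t1 - t0).count();
    std::printf("avg: %.3f us/step over %d steps\n", us / n_iters, n_iters);
}

int main() {
    // Dummy workload standing in for the sampling call.
    volatile unsigned acc = 0;
    profile_sampling([&] { for (int i = 0; i < 1000; ++i) acc = acc + i; }, 100000);
}
```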
Already on-going efforts:

- Reserve space in `decode_utf8` (#4210) - see the sketch after this list
- Allow reusing results from `llama_token_to_piece` when sampling grammars (#4213)
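For reference, the `reserve` item is the standard pattern of pre-sizing the output vector so the decode loop never reallocates. A minimal sketch, assuming `decode_utf8` maps a UTF-8 string to a vector of code points (the real function in llama.cpp also tracks partial multi-byte sequences, so this simplifies):

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Simplified UTF-8 decoder in the spirit of the grammar code's decode_utf8.
static std::vector<uint32_t> decode_utf8(const std::string & src) {
    // Sequence length derived from the high nibble of the lead byte.
    static const int lookup[] = { 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 3, 4 };
    std::vector<uint32_t> code_points;
    // Every code point consumes at least one byte, so src.size() is an upper
    // bound on the output length - reserving it avoids all reallocations.
    code_points.reserve(src.size());
    size_t pos = 0;
    while (pos < src.size()) {
        const uint8_t first = static_cast<uint8_t>(src[pos]);
        const int     len   = lookup[first >> 4];
        uint32_t      value = first & ((1u << (8 - len)) - 1);
        for (size_t i = 1; i < (size_t) len && pos + i < src.size(); ++i) {
            value = (value << 6) | (static_cast<uint8_t>(src[pos + i]) & 0x3F);
        }
        pos += len;
        code_points.push_back(value);
    }
    return code_points;
}
```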
Probably worth looking into multi-threading the implementation as well, as sketched below.
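If profiling shows the per-candidate grammar check dominating, one option is to partition the candidate tokens across threads, since each candidate can in principle be validated against the grammar stacks independently, with only the accept/reject verdicts merged afterwards. A rough sketch with hypothetical names (`accept_fn` stands in for whatever read-only per-token check the sampler performs; whether the real check is actually free of shared mutable state would need verifying first):

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical per-candidate check: does the grammar accept this token?
using accept_fn = std::function<bool(int32_t token)>;

// Check all candidate tokens in parallel; returns one flag per candidate.
// Only valid if `accepts` reads shared state (the grammar stacks) without
// mutating it, which is what makes the per-candidate checks independent.
static std::vector<uint8_t> check_candidates(const std::vector<int32_t> & candidates,
                                             const accept_fn & accepts,
                                             unsigned n_threads) {
    std::vector<uint8_t> ok(candidates.size(), 0);
    std::vector<std::thread> workers;
    const size_t chunk = (candidates.size() + n_threads - 1) / n_threads;
    for (unsigned t = 0; t < n_threads; ++t) {
        const size_t begin = t * chunk;
        const size_t end   = std::min(candidates.size(), begin + chunk);
        if (begin >= end) break;
        workers.emplace_back([&, begin, end] {
            // Each worker writes a disjoint slice of `ok`, so no locking needed.
            for (size_t i = begin; i < end; ++i) {
                ok[i] = accepts(candidates[i]) ? 1 : 0;
            }
        });
    }
    for (auto & w : workers) {
        w.join();
    }
    return ok;
}
```

Whether this pays off depends on how many candidates survive the earlier sampling steps; for small candidate sets the thread spawn/join overhead could easily dominate, so a persistent thread pool might be the better fit.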