There have been a few reports that grammar sampling can significantly degrade performance.
It would be nice to profile and optimize the implementation - there should be room for improvement.
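As a starting point, a micro-benchmark around the sampling step could quantify the per-token overhead before and after any change. A minimal sketch, assuming a callable that wraps one grammar-constrained sampling step (the `sample_fn` name and the dummy workload below are placeholders, not the actual sampler):

```cpp
#include <chrono>
#include <cstdio>
#include <functional>

// Hypothetical stand-in for one grammar-constrained sampling step; in a real
// profile this would wrap the grammar sampling path in llama.cpp.
using sample_fn = std::function<void()>;

// Time n_iters sampling steps and report the average cost per step.
static void profile_sampling(const sample_fn & step, int n_iters) {
    const auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < n_iters; ++i) {
        step();
    }
    const auto t1 = std::chrono::steady_clock::now();
    const double us = std::chrono::duration<double, std::micro>(t1 - t0).count();
    std::printf("avg: %.3f us/step over %d steps\n", us / n_iters, n_iters);
}

int main() {
    // Dummy workload standing in for the sampling call.
    volatile unsigned acc = 0;
    profile_sampling([&] { for (int i = 0; i < 1000; ++i) acc = acc + i; }, 100000);
}
```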
Already on-going efforts:

- Reserve space in `decode_utf8` (#4210) - see the sketch after this list
- Allow reusing results from `llama_token_to_piece` when sampling grammars (#4213)
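For reference, the `reserve` item is the standard pattern of pre-sizing the output vector so the decode loop never reallocates. A minimal sketch, assuming `decode_utf8` maps a UTF-8 string to a vector of code points (the real function in llama.cpp also tracks partial multi-byte sequences, so this simplifies):

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Simplified UTF-8 decoder in the spirit of the grammar code's decode_utf8.
static std::vector<uint32_t> decode_utf8(const std::string & src) {
    // Sequence length derived from the high nibble of the lead byte.
    static const int lookup[] = { 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 3, 4 };
    std::vector<uint32_t> code_points;
    // Every code point consumes at least one byte, so src.size() is an upper
    // bound on the output length - reserving it avoids all reallocations.
    code_points.reserve(src.size());
    size_t pos = 0;
    while (pos < src.size()) {
        const uint8_t first = static_cast<uint8_t>(src[pos]);
        const int     len   = lookup[first >> 4];
        uint32_t      value = first & ((1u << (8 - len)) - 1);
        for (size_t i = 1; i < (size_t) len && pos + i < src.size(); ++i) {
            value = (value << 6) | (static_cast<uint8_t>(src[pos + i]) & 0x3F);
        }
        pos += len;
        code_points.push_back(value);
    }
    return code_points;
}
```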
Probably worth looking into multi-threading the implementation as well, as sketched below.
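If profiling shows the per-candidate grammar check dominating, one option is to partition the candidate tokens across threads, since each candidate can in principle be validated against the grammar stacks independently, with only the accept/reject verdicts merged afterwards. A rough sketch with hypothetical names (`accept_fn` stands in for whatever read-only per-token check the sampler performs; whether the real check is actually free of shared mutable state would need verifying first):

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical per-candidate check: does the grammar accept this token?
using accept_fn = std::function<bool(int32_t token)>;

// Check all candidate tokens in parallel; returns one flag per candidate.
// Only valid if `accepts` reads shared state (the grammar stacks) without
// mutating it, which is what makes the per-candidate checks independent.
static std::vector<uint8_t> check_candidates(const std::vector<int32_t> & candidates,
                                             const accept_fn & accepts,
                                             unsigned n_threads) {
    std::vector<uint8_t> ok(candidates.size(), 0);
    std::vector<std::thread> workers;
    const size_t chunk = (candidates.size() + n_threads - 1) / n_threads;
    for (unsigned t = 0; t < n_threads; ++t) {
        const size_t begin = t * chunk;
        const size_t end   = std::min(candidates.size(), begin + chunk);
        if (begin >= end) break;
        workers.emplace_back([&, begin, end] {
            // Each worker writes a disjoint slice of `ok`, so no locking needed.
            for (size_t i = begin; i < end; ++i) {
                ok[i] = accepts(candidates[i]) ? 1 : 0;
            }
        });
    }
    for (auto & w : workers) {
        w.join();
    }
    return ok;
}
```

Whether this pays off depends on how many candidates survive the earlier sampling steps; for small candidate sets the thread spawn/join overhead could easily dominate, so a persistent thread pool might be the better fit.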