tts : implement sesame CSM + Mimi decoder #12648
base: master
Conversation
Really nice! I'm having some issues with longer sentences, or is that just the model's limitations?
Works, but:
Will go into an infinite loop of token generation.
I think my implementation still has some problems, but I'm not sure where. I never get the logits to match 100% with what the safetensors model generates. Will reach out to the Sesame team to confirm whether I'm doing this correctly.
Ok so I added support for multi-turn text input, but the generated audio has a silence gap between two turns. I observed kind of the same thing in the python demo, so I think it's something to do with the model.
I am doing some testing and I think what is confusing it is the new lines in the input. If I remove the new lines, it seems to work better: csm-demo.txt
Maybe double-check that the tokenization is correct, compared to the HF space demo?
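A quick way to check this would be to compare the token IDs llama.cpp produces against the Hugging Face tokenizer for the same text, with and without newlines. A minimal sketch, assuming the CSM text tokenizer is the Llama-3.2 tokenizer used in Sesame's reference code (the model id and the `[0]` speaker prefix are assumptions, not something defined in this PR; the repo is gated on HF):

```python
# Compare tokenization of the same text with and without newlines.
# Assumption: CSM uses the Llama-3.2 text tokenizer, as in Sesame's reference code.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")  # gated repo

with_newlines    = "[0]Hello there.\nHow are you doing today?"
without_newlines = "[0]Hello there. How are you doing today?"

print(tok.encode(with_newlines, add_special_tokens=False))
print(tok.encode(without_newlines, add_special_tokens=False))
# If llama.cpp yields different IDs for the same string, the long pauses may
# come from tokenization rather than from the model itself.
```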
I had a deeper look into the code of the HF demo space. It seems like for each turn, they re-evaluate the whole "chat" history: https://huggingface.co/spaces/sesame/csm-1b/blob/main/app.py#L150-L156 But that does not change much. What I understand is that this is the same idea as text chat templates. The only difference is that in this case, with audio embeddings, our chat template looks like this:
So it seems like we're just missing something. What I'm speculating is that we're also missing the "system prompt" (same idea as the speaker in outeTTS), which shows the model how the voice should behave. In the official demo, they have a casual voice and a more serious news reader voice. I'll give it a try.
@ngxson would it be possible to have audio streaming?
We don't support streaming, for simplicity. It can be added in the future when the implementation becomes more stable.
@ngxson Appreciate the thorough work here.
@ngxson Should I review or wait for the:
@ggerganov please go ahead and review this PR. The system prompt will be simple to add; I will try to do that a bit later (it requires me to use the Mimi encoder via transformers).
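For context, "using the Mimi encoder via transformers" roughly means encoding a reference audio clip into Mimi RVQ codes. A sketch of what that could look like with the transformers Mimi integration (the file name and exact usage are illustrative, not taken from this PR):

```python
# Encode a reference clip into Mimi RVQ codes using the transformers Mimi model.
import torch
import librosa
from transformers import MimiModel, AutoFeatureExtractor

model = MimiModel.from_pretrained("kyutai/mimi")
feature_extractor = AutoFeatureExtractor.from_pretrained("kyutai/mimi")

# Mimi operates on 24 kHz mono audio.
audio, _ = librosa.load("speaker_reference.wav",
                        sr=feature_extractor.sampling_rate, mono=True)
inputs = feature_extractor(raw_audio=audio,
                           sampling_rate=feature_extractor.sampling_rate,
                           return_tensors="pt")

with torch.no_grad():
    out = model.encode(inputs["input_values"])

# out.audio_codes has shape (batch, num_codebooks, frames); these codes are
# what a speaker-reference prompt would be built from.
print(out.audio_codes.shape)
```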
@ggerganov I added the speaker reference and it works well. You were right about the newline stuff: the model is very sensitive to newline characters, and it usually adds a long pause in place of the newline. output.mp4
It's useful to have a Mimi implementation, even just as an example for now. The LLAMA_CSM graph has some hacks - we should try to avoid them in the future.
A bit concerned that we are merging a lot of code that probably won't get exercised often by users.
What is preventing us from implementing the entire Mimi decoder as a libllama model?
Is it correct that after the Mimi encoder is implemented we would be able to pass previous audio as input to the context?
    }
} else {
    // otherwise, dummy output
Can we reach this branch?
I'm not 100% sure why, but the warmup call runs this with ggml_nelements(cur) == 0, which will trigger this branch. I assume that's because there is no output token set in the batch.
Yes, I was also thinking about the possibility of allowing multiple output heads in the future. As you may already know, ChatGPT can generate images natively using a diffusion head. I have a feeling that some models in the future may follow the same path.
The main reason is that the cgraphs for RVQ and SeaNet are quite complicated. I'm currently using some hacks (like supporting the depth-wise model), but ofc we can consider bringing it into libllama later.
Yes, the Mimi encoder is required to generate the speaker reference (voice cloning). But in fact, my main goal is to support real-time speech-to-speech like Kyutai Moshi, or the not-yet-released Sesame model that they use in the online demo. If it ever gets released, I think it will spark some usage for this example. Otherwise we could also consider removing this (in the future) if it adds too much maintenance burden.
@ggerganov On second thought, I think I'll keep the PR open for a while to see if there are more people interested in it. I also want to see if Sesame is going to release the speech-to-speech model sooner or later. It's not very urgent to merge, because most of the code is outside of the main library anyway, and people seem to be more interested in audio input than audio output.
@ngxson Thank you for your work! I tried to run the CSM inference command through llama.cpp, but the output took around 1 min to generate. I had to add --no-mmap, maybe because there was a mismatch between my CPU and the build architecture. Out of curiosity, based on your tests, is it viable for real-time conversations on a high-end mobile device? What minimum VRAM/RAM would you suggest?
The released model is TTS, not a speech-to-speech conversational model like GPT advanced voice mode or Kyutai Moshi, so there is no accurate estimation of what's needed. But what we can expect is that a normal speech-to-speech model can be around 7b, so about 10GB in Q8_0. According to Moshi, processing needs to be at least 12.5 tokens/s to be realtime.
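Rough arithmetic behind those numbers (back-of-the-envelope assumptions, not measurements):

```python
# Q8_0 stores 32 int8 weights plus a 2-byte scale per block, i.e. ~8.5 bits/weight.
params = 7e9
bytes_per_weight_q8_0 = 34 / 32
weights_gb = params * bytes_per_weight_q8_0 / 1e9
print(f"weights alone: ~{weights_gb:.1f} GB")  # ~7.4 GB; KV cache and activations push this toward ~10 GB

# Moshi frames audio at 12.5 Hz, so decoding must sustain at least
# 12.5 frames ("tokens") per second to stay realtime.
print("minimum realtime decode rate: 12.5 frames/s")
```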
Thanks for the effort! I compiled it on Windows and used your example. Works fine! And fairly fast (> 30 tokens/s on a Quadro T1000). Just to let you know, it logs some prompts that I did not ask for:
This happens in both debug and release build. But it's really awesome that this works on my poor 4GB GPU! Now I just need a German finetune... ;)
And btw, I absolutely disagree ;)
I'm seeing the same extra prompts that @Horschig mentioned, for an M1 Mac build. Pinging here so this doesn't get too stale.
I also strongly disagree that STT is more desired than TTS. Having this be the backbone for a new 'conversational' voice system like SesameAI's demo (as mentioned in the OP issue) would be extremely popular.
PS. 🙌 Thank you for your hard work on this! It's very cool to see it running so well.
Edit: Those logs look to be coming from this data file: examples/tts/tts-csm-data.h
I am also wondering where the random prompts are coming from?
The "random" prompt is the speaker reference (it's not random, but I hard-coded it), it acts as an example of how the voice of 2 people should sound like In theory, you can swap it which whatever voice and now it become essentially a voice cloning To generate it, however, you need to go though the python code, and atm I don't have time to document it |
Related to #12392
Tbh it is more complicated than expected.
This PR only contains the backbone + decoder:
How to try this?
By default, all GGUF files are downloaded from the ggml-org Hugging Face account.
Alternatively, GGUF files can be converted using convert_mimi_to_gguf.py and convert_csm_to_gguf.py under the examples/tts directory. These scripts use transformers.AutoModel under the hood, so they will also handle downloading the safetensors files automatically.
Note: it pronounces "Xuan" incorrectly, but the rest is OK.
output.mp4
How does Sesame CSM work?
The model contains a backbone and a decoder, both based on the llama 3.x architecture (auto-regressive).
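As a rough, toy sketch of that two-stage design (based on Sesame's public description, not this PR's exact implementation): the backbone predicts the first (semantic) Mimi codebook of the next audio frame, and the small decoder then fills in the remaining acoustic codebooks for that frame. All names and sizes below are illustrative stand-ins, not real model code.

```python
import random

N_CODEBOOKS = 32       # Mimi RVQ depth used by CSM (illustrative assumption)
CODEBOOK_SIZE = 2048   # entries per codebook (illustrative assumption)

def backbone_step(history):
    # stand-in for the larger llama-style backbone: attends over the whole
    # interleaved text/audio history and samples codebook 0 of the next frame
    hidden = hash(tuple(history)) % 1000
    return hidden, random.randrange(CODEBOOK_SIZE)

def decoder_step(hidden, frame_so_far):
    # stand-in for the small llama-style decoder: samples the next codebook
    # conditioned on the backbone state and the codes generated so far
    return random.randrange(CODEBOOK_SIZE)

def generate_frame(history):
    hidden, c0 = backbone_step(history)
    frame = [c0]
    for _ in range(1, N_CODEBOOKS):
        frame.append(decoder_step(hidden, frame))
    return frame  # one frame of N_CODEBOOKS codes, later turned into audio by the Mimi decoder

print(generate_frame([1, 2, 3])[:8])
```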