tts : implement sesame CSM + Mimi decoder #12648


Open · wants to merge 38 commits into master

Conversation

@ngxson (Collaborator) commented Mar 29, 2025

Related to #12392

Tbh it is more complicated than expected.

This PR only contains the backbone + decoder.

How to try this?

By default, all GGUF files are downloaded from the ggml-org Hugging Face account.

# build (make sure to have LLAMA_CURL enabled)
cmake -B build -DLLAMA_CURL=ON
cmake --build build -j --target llama-tts-csm

# run it
./build/bin/llama-tts-csm -p "[0]Hi, my name is Xuan Son. I am software engineer at Hugging Face."

Alternatively, GGUF files can be converted using convert_mimi_to_gguf.py and convert_csm_to_gguf.py under the examples/tts directory. These scripts use transformers.AutoModel under the hood, so they also handle downloading the safetensors files automatically.
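For example (invocations are illustrative; check each script for its actual arguments):

# convert the Mimi codec and the CSM backbone + decoder to GGUF
python examples/tts/convert_mimi_to_gguf.py
python examples/tts/convert_csm_to_gguf.py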

Note: it pronounces "Xuan" incorrectly, but the rest is OK

output.mp4

How does Sesame CSM work?

The model contains a backbone and a decoder, both based on the llama 3.x architecture (auto-regressive).

  1. The input text is first processed by the backbone; the output is (1) an RVQ semantic code and (2) the raw embedding from the last layer, after norm.
  2. These two outputs from the backbone are then passed to the decoder as input. The decoder then generates the next 31 RVQ acoustic tokens.
  3. At this point, 32 RVQ tokens have been generated; they are "squashed" back into a single vector, which is passed back to the backbone.
  4. Repeat from step 1 to generate the next codes (see the sketch after the diagram below).
flowchart TD
    A[Input Text, vocab 128_256 tokens] -- prompt input --> B

    subgraph Backbone
        B[Backbone transformer]
        B --> C[Output logits, vocab 65632 tokens]
        B --> D[Output Raw embd, vector of 2048 elem]
    end

    D -- vector input --> Proj
    C -- sampling --> Stoken[RVQ semantic token]
    Stoken --> Fin
    Stoken --> H

    subgraph Decoder
        Proj[Projector, reduce size to 1024]
        Fin[Input vocab: 65632 tokens] -- vector dim 2048 --> Proj
        Proj --> F[Decoder transformer]
        F --> G[Output logits: vocab 2051 tokens]
    end

    G -- sampling --> HH[RVQ acoustic token]
    HH -- generate next token --> Fin
    HH -- repeated 31 times --> H[Collected 32 RVQ tokens & audio embeddings, matrix: 2048 x 32]

    H -- sum all vectors --> I[single vector of 2048]
    I -- generate next token --> B
    I -- is zero vec? --> K[Stop generation]

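In Python-style pseudocode, the loop above looks roughly like this (every name is illustrative pseudocode, not the actual llama.cpp or CSM API):

# rough sketch of the CSM generation loop; all names are illustrative
def generate_frames(text_tokens):
    frames = []
    embd = embed(text_tokens)              # prompt input to the backbone
    while True:
        logits, h = backbone(embd)         # h = raw last-layer embd (2048), after norm
        codes = [sample(logits)]           # RVQ semantic token (vocab 65632)
        x = project(h)                     # reduce 2048 -> 1024 for the decoder
        for _ in range(31):                # the 31 RVQ acoustic tokens
            logits, x = decoder(x)
            codes.append(sample(logits))   # vocab 2051
        embd = sum_code_embeddings(codes)  # "squash" the 32 codes (2048 x 32) into one 2048-dim vector
        if is_zero(embd):                  # an all-zero vector stops generation
            break
        frames.append(codes)               # embd is fed back to the backbone as the next input
    return frames                          # the Mimi decoder turns these into a waveform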

@github-actions bot added examples and python (python script changes) labels Mar 29, 2025
@ngxson mentioned this pull request Mar 30, 2025
@ngxson changed the title from "tts : implement sesame backbone + decoder" to "tts : implement sesame CSM + Mimi decoder" Mar 30, 2025
@ngxson marked this pull request as ready for review March 30, 2025 12:30
@arch-btw (Contributor)

Really nice!

I'm having some issues with longer sentences, or is that just a limitation of the model?
For example:

-p "[0]Hi! How are you? I hope you"

Works, but:

-p "[0]Hi! How are you? I hope you are doing well"

Will go into an infinite loop of token generation.

@ngxson (Collaborator, Author) commented Mar 30, 2025

I think my implementation still has some problems, but I'm not sure where. I never got the logits to 100% match what the safetensors model generates.

I'll reach out to the Sesame team to confirm whether I'm doing this correctly.

@ngxson (Collaborator, Author) commented Apr 2, 2025

Ok so I added support for multi-turn text input, but the generated audio has a silence gap between two turns.

I observed much the same thing on the Python demo, so I think it's something to do with the model.

@ngxson requested a review from ggerganov April 2, 2025 15:33
@ggerganov (Member)

but the generated audio has a silence gap between two turns.

I am doing some testing and I think what is confusing it is the new lines in the input. If I remove the new lines, it seems to work better:

csm-demo.txt

[0]Hey how are you doing.[1]Pretty good, pretty good.[0]I'm great, so happy to be speaking to you. What about you?[1]Me too, this is some cool stuff huh?

Maybe double-check that the tokenization is correct, compared to the HF space demo?
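For example, something along these lines could cross-check the token IDs (a sketch; it assumes the sesame/csm-1b repo ships the Llama 3 tokenizer files):

from transformers import AutoTokenizer

# assumption: sesame/csm-1b includes the (Llama 3) tokenizer files
tok = AutoTokenizer.from_pretrained("sesame/csm-1b")
ids = tok("[0]Hey how are you doing.").input_ids
print(ids)  # compare against the "prompt (N tokens):" dump from llama-tts-csm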

@ngxson (Collaborator, Author) commented Apr 3, 2025

I had a deeper look into the code of the HF demo space. It seems that for each turn, they re-evaluate the whole "chat" history: https://huggingface.co/spaces/sesame/csm-1b/blob/main/app.py#L150-L156

But that does not change much. My understanding is that this works the same way as text chat templates. The only difference is that in this case, with audio embeddings, our chat template looks like this:

<bos> ... text1 ... <text_eos> ... audio_embd ... <audio_eos><bos> ... text2 ... <text_eos> ... audio_embd ... <audio_eos> ...

So it seems we were just missing <audio_eos>. I added it in my last commit, but it does not change much. The only difference so far is that it is now able to generate a male/female voice for each separate turn (which it was unable to do before).

What I'm speculating is that we're also missing the "system prompt" (same idea as the speaker in OuteTTS), which shows the model how the voice should behave. In the official demo, they have a casual voice and a more serious news-reader voice. I'll give it a try.

@Desir-Armann

@ngxson would it be possible to have audio streaming?

@ngxson (Collaborator, Author) commented Apr 5, 2025

We don't support streaming, to keep things simple. It can be added in the future once the implementation becomes more stable.

@ShaanveerS

@ngxson Appreciate the thorough work here.
You mentioned that streaming could be added once things stabilize... would you be open to briefly describing what steps or components would be involved in supporting it?
Thanks a lot.

@ggerganov (Member)

@ngxson Should I review or wait for the:

What I'm speculating is that we're also missing the "system prompt" (same idea as the speaker in OuteTTS), which shows the model how the voice should behave. In the official demo, they have a casual voice and a more serious news-reader voice. I'll give it a try.

@ngxson (Collaborator, Author) commented Apr 8, 2025

@ggerganov please go ahead and review this PR. The system prompt will be simple to add; I will try to do that a bit later (it requires me to use the Mimi encoder via transformers).

@ngxson (Collaborator, Author) commented Apr 9, 2025

@ggerganov I added the speaker reference and it works well. You were right about the newline issue: the model is very sensitive to newline characters and usually adds a long pause in place of the newline.

output.mp4

@ggerganov (Member) left a comment

It's useful to have a Mimi implementation, even just as an example for now. The LLAMA_CSM graph has some hacks - we should try to avoid them in the future.

A bit concerned that we are merging a lot of code that probably won't get exercised often by users.

What is preventing us from implementing the entire Mimi decoder as a libllama model?

Is it correct that after the Mimi encoder is implemented we would be able to pass previous audio as input to the context?

}

} else {
// otherwise, dummy output
@ggerganov (Member)

Can we reach this branch?

@ngxson (Collaborator, Author)

I'm not 100% sure why, but the warmup call runs this with ggml_nelements(cur) == 0, which triggers this branch. I assume that's because no output token is set in the batch.

@ngxson (Collaborator, Author) commented Apr 23, 2025

It's useful to have a Mimi implementation, even just as an example for now. The LLAMA_CSM graph has some hacks - we should try to avoid them in the future.

Yes, I was also thinking about the possibility of allowing multiple output heads in the future.

As you may already know, ChatGPT can generate images natively using a diffusion head. I have a feeling that some future models may follow the same path.

What is preventing to implement the entire Mimi decoder as a libllama model?

The main reason is that the cgraphs for RVQ and SeaNet are quite complicated. I'm currently using some hacks (like supporting depth-wise models in ggml_conv_transpose_1d), so adding them to libllama right now would pollute the code base quite a lot.

But ofc we can consider bringing it to libllama at some point if it gets more usage.

Is it correct that after the Mimi encoder is implemented we would be able to pass previous audio as input to the context?

Yes, the Mimi encoder is required to generate the speaker reference (voice cloning). But in fact, my main goal is to support real-time speech-to-speech like Kyutai Moshi, or the not-yet-released Sesame model that they used in the online demo.

If it ever gets released, I think it will spark some usage for this example. Otherwise, we could also consider removing this (in the future) if it adds too much maintenance burden.

@ngxson (Collaborator, Author) commented Apr 23, 2025

@ggerganov On second thought, I think I'll keep the PR open for a while to see if more people are interested in it. I also want to see whether Sesame is going to release the speech-to-speech model sooner or later.

It's not very urgent to merge because most of the code is outside of the main library anyway, and people seem to be more interested in audio input than audio output.

@Ashoka74 commented Apr 23, 2025

@ngxson Thank you for your work!

I tried to run the CSM inference command through llama.cpp, but the output took around 1 min to generate. I had to add --no-mmap, maybe because of a mismatch between my CPU and the architecture.

Out of curiosity, based on your tests, is it viable for real-time conversations on a high-end mobile device? What minimum VRAM/RAM would you suggest?

@ngxson (Collaborator, Author) commented Apr 23, 2025

Out of curiosity, based on your tests, is it viable for real-time conversations on a high-end mobile device?

The released model is TTS, not a speech-to-speech conversational model like GPT advanced voice mode or Kyutai Moshi, so there is no accurate estimate of what's needed.

But what we can expect is that a typical speech-to-speech model will be around 7B, so about 10 GB in Q8_0. According to Moshi, processing needs to run at a minimum of 12.5 tokens/s to be real-time.
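For intuition, the per-frame compute budget that rate implies (simple arithmetic, not a benchmark):

# at 12.5 frames/s, each audio frame leaves an 80 ms budget, inside which
# 1 backbone pass + 31 decoder passes (the 32 RVQ codes) must complete
frames_per_second = 12.5
budget_ms = 1000.0 / frames_per_second
print(f"{budget_ms:.0f} ms per frame")  # -> 80 ms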

@Horschig commented Apr 25, 2025

Thanks for the effort! I compiled it on Windows and used your example.
./llama-tts-csm.exe -p "[0]Hi, my name is Xuan Son. I am software engineer at Hugging Face."

Works fine! And fairly fast (> 30 tokens/s on a Quadro T1000).

Just to let you know, it logs some prompts that I did not ask for:

---

turn: [0]like revising for an exam I'd have to try and like keep up the momentum because I'd start really early I'd be like okay I'm gonna start revising now and then like you're revising for ages and then I just like start losing steam I didn't do that for the exam we had recently to be fair that was a more of a last minute scenario but like yeah I'm trying to like yeah I noticed this yesterday that like Mondays I sort of start the day with this not like a panic but like a

prompt (111 tokens): 
128000, 58, 15, 60, 4908, 17951, 287, 369, 459, 7151, 358, 4265, 617, 311, 1456, 323, 1093, 2567, 709, 279, 24151, 1606, 358, 4265, 1212, 2216, 4216, 358, 4265, 387, 1093, 17339, 358, 2846, 16926, 1212, 17951, 287, 1457, 323, 1243, 1093, 499, 2351, 17951, 287, 369, 17051, 323, 1243, 358, 1120, 1093, 1212, 13490, 20930, 358, 3287, 956, 656, 430, 369, 279, 7151, 584, 1047, 6051, 311, 387, 6762, 430, 574, 264, 810, 315, 264, 1566, 9568, 15398, 719, 1093, 22371, 358, 2846, 4560, 311, 1093, 22371, 358, 14000, 420, 13985, 430, 1093, 91271, 358, 3460, 315, 1212, 279, 1938, 449, 420, 539, 1093, 264, 22743, 719, 1093, 264, 128001, 


---

turn: [1]like a super Mario level. Like it's very like high detail. And like, once you get into the park, it just like, everything looks like a computer game and they have all these, like, you know, if, if there's like a, you know, like in a Mario game, they will have like a question block. And if you like, you know, punch it, a coin will come out. So like everyone, when they come into the park, they get like this little bracelet and then you can go punching question blocks around.

prompt (119 tokens): 
128000, 58, 16, 60, 4908, 264, 2307, 24270, 2237, 13, 9086, 433, 596, 1633, 1093, 1579, 7872, 13, 1628, 1093, 11, 3131, 499, 636, 1139, 279, 6246, 11, 433, 1120, 1093, 11, 4395, 5992, 1093, 264, 6500, 1847, 323, 814, 617, 682, 1521, 11, 1093, 11, 499, 1440, 11, 422, 11, 422, 1070, 596, 1093, 264, 11, 499, 1440, 11, 1093, 304, 264, 24270, 1847, 11, 814, 690, 617, 1093, 264, 3488, 2565, 13, 1628, 422, 499, 1093, 11, 499, 1440, 11, 21004, 433, 11, 264, 16652, 690, 2586, 704, 13, 2100, 1093, 5127, 11, 994, 814, 2586, 1139, 279, 6246, 11, 814, 636, 1093, 420, 2697, 59519, 323, 1243, 499, 649, 733, 68981, 3488, 10215, 2212, 13, 128001, 


---

turn: [0]Hi, my name is Xuan Son. I am software engineer at Hugging Face.

prompt (23 tokens): 
128000, 58, 15, 60, 13347, 11, 856, 836, 374, 1630, 10602, 12103, 13, 358, 1097, 3241, 24490, 520, 473, 36368, 19109, 13, 128001, 

This happens in both debug and release builds.

But it's really awesome that this works on my poor 4GB GPU! Now I just need a German finetune... ;)

It's not very urgent to merge because most of the code is outside of the main library anyway, and people seem to be more interested in audio input than audio output.

And btw, I absolutely disagree ;)

@NathanMarq commented May 9, 2025

I'm seeing the same extra prompts that @Horschig mentioned, for an M1 Mac build. Pinging here so this doesn't get too stale.

It's not very urgent to merge because most of the code is outside of the main library anyway, and people seem to be more interested in audio input than audio output.

I also strongly disagree that STT is more desired than TTS. Having this be the backbone for a new 'conversational' voice system like SesameAI's demo (as mentioned in the OP issue) would be extremely popular.

PS. 🙌 Thank you for your hard work on this! It's very cool to see it running so well.

Edit: Those logs look to be coming from that data file here: examples/tts/tts-csm-data.h

@farris commented May 16, 2025

[quotes @Horschig's comment above in full]

I am also wondering where the random prompts are coming from.

@ngxson (Collaborator, Author) commented May 16, 2025

The "random" prompt is the speaker reference (it's not random, but I hard-coded it), it acts as an example of how the voice of 2 people should sound like

In theory, you can swap in whatever voice you want, and it essentially becomes voice cloning.

To generate it, however, you need to go through the Python code, and atm I don't have time to document it.
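For reference, a minimal sketch of that Python step using the Mimi encoder in transformers (the kyutai/mimi checkpoint id and the silence placeholder are assumptions; the actual script may differ):

import numpy as np
import torch
from transformers import AutoFeatureExtractor, MimiModel

# assumption: kyutai/mimi is the encoder checkpoint; swap in real speaker audio
model = MimiModel.from_pretrained("kyutai/mimi")
fe = AutoFeatureExtractor.from_pretrained("kyutai/mimi")

audio = np.zeros(fe.sampling_rate, dtype=np.float32)  # 1 s of silence as a stand-in
inputs = fe(raw_audio=audio, sampling_rate=fe.sampling_rate, return_tensors="pt")
with torch.no_grad():
    codes = model.encode(inputs["input_values"]).audio_codes  # (batch, codebooks, frames)
# these RVQ codes become the hard-coded speaker-reference turns seen in the logs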
