tts : implement sesame CSM + Mimi decoder #12648


Open · wants to merge 38 commits into master

Conversation

@ngxson (Collaborator) commented Mar 29, 2025

Related to #12392

Tbh it is more complicated than expected.

This PR only contains the backbone + decoder.

How to try this?

By default, all GGUF files are downloaded from the ggml-org Hugging Face account.

# build (make sure to have LLAMA_CURL enabled)
cmake -B build -DLLAMA_CURL=ON
cmake --build build -j --target llama-tts-csm

# run it
./build/bin/llama-tts-csm -p "[0]Hi, my name is Xuan Son. I am software engineer at Hugging Face."

Alternatively, GGUF files can be converted using convert_mimi_to_gguf.py and convert_csm_to_gguf.py under the examples/tts directory. These scripts use transformers.AutoModel under the hood, so they also handle downloading the safetensors files automatically.
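For example (invocations are illustrative; check each script for its actual arguments):

# convert the Mimi codec and the CSM backbone + decoder to GGUF
python examples/tts/convert_mimi_to_gguf.py
python examples/tts/convert_csm_to_gguf.py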

Note: it pronounces "Xuan" incorrectly, but the rest is OK

output.mp4

How does Sesame CSM work?

The model contains a backbone and a decoder, both based on the llama 3.x architecture (auto-regressive).

  1. The input text is first processed by the backbone; the output is (1) an RVQ semantic code and (2) the raw embedding from the last layer, after norm.
  2. These two outputs from the backbone are then passed to the decoder as input. The decoder then generates the next 31 RVQ acoustic tokens.
  3. At this point, 32 RVQ tokens have been generated; they are "squashed" back into a single vector, which is passed back to the backbone.
  4. Repeat from step 1 to generate the next codes (see the sketch after the diagram below).
flowchart TD
    A[Input Text, vocab 128_256 tokens] -- prompt input --> B

    subgraph Backbone
        B[Backbone transformer]
        B --> C[Output logits, vocab 65632 tokens]
        B --> D[Output Raw embd, vector of 2048 elem]
    end

    D -- vector input --> Proj
    C -- sampling --> Stoken[RVQ semantic token]
    Stoken --> Fin
    Stoken --> H

    subgraph Decoder
        Proj[Projector, reduce size to 1024]
        Fin[Input vocab: 65632 tokens] -- vector dim 2048 --> Proj
        Proj --> F[Decoder transformer]
        F --> G[Output logits: vocab 2051 tokens]
    end

    G -- sampling --> HH[RVQ acoustic token]
    HH -- generate next token --> Fin
    HH -- repeated 31 times --> H[Collected 32 RVQ tokens & audio embeddings, matrix: 2048 x 32]

    H -- sum all vectors --> I[single vector of 2048]
    I -- generate next token --> B
    I -- is zero vec? --> K[Stop generation]

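In Python-style pseudocode, the loop above looks roughly like this (every name is illustrative pseudocode, not the actual llama.cpp or CSM API):

# rough sketch of the CSM generation loop; all names are illustrative
def generate_frames(text_tokens):
    frames = []
    embd = embed(text_tokens)              # prompt input to the backbone
    while True:
        logits, h = backbone(embd)         # h = raw last-layer embd (2048), after norm
        codes = [sample(logits)]           # RVQ semantic token (vocab 65632)
        x = project(h)                     # reduce 2048 -> 1024 for the decoder
        for _ in range(31):                # the 31 RVQ acoustic tokens
            logits, x = decoder(x)
            codes.append(sample(logits))   # vocab 2051
        embd = sum_code_embeddings(codes)  # "squash" the 32 codes (2048 x 32) into one 2048-dim vector
        if is_zero(embd):                  # an all-zero vector stops generation
            break
        frames.append(codes)               # embd is fed back to the backbone as the next input
    return frames                          # the Mimi decoder turns these into a waveform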

@github-actions bot added examples and python (python script changes) labels Mar 29, 2025
@ngxson mentioned this pull request Mar 30, 2025
@ngxson changed the title from "tts : implement sesame backbone + decoder" to "tts : implement sesame CSM + Mimi decoder" Mar 30, 2025
@ngxson marked this pull request as ready for review March 30, 2025 12:30
@arch-btw (Contributor)

Really nice!

I'm having some issues with longer sentences, or is that just a limitation of the model?
For example:

-p "[0]Hi! How are you? I hope you"

Works, but:

-p "[0]Hi! How are you? I hope you are doing well"

Will go into an infinite loop of token generation.

@ngxson (Collaborator, Author) commented Mar 30, 2025

I think my implementation still has some problems, but I'm not sure where. I never got the logits to 100% match what the safetensors model generates.

I'll reach out to the Sesame team to confirm whether I'm doing this correctly.

@ngxson (Collaborator, Author) commented Apr 2, 2025

Ok so I added support for multi-turn text input, but the generated audio has a silence gap between two turns.

I observed much the same thing on the Python demo, so I think it's something to do with the model.

@ngxson requested a review from ggerganov April 2, 2025 15:33
@ggerganov (Member)

but the generated audio has a silence gap between two turns.

I am doing some testing and I think what is confusing it is the new lines in the input. If I remove the new lines, it seems to work better:

csm-demo.txt

[0]Hey how are you doing.[1]Pretty good, pretty good.[0]I'm great, so happy to be speaking to you. What about you?[1]Me too, this is some cool stuff huh?

Maybe double-check that the tokenization is correct, compared to the HF space demo?
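For example, something along these lines could cross-check the token IDs (a sketch; it assumes the sesame/csm-1b repo ships the Llama 3 tokenizer files):

from transformers import AutoTokenizer

# assumption: sesame/csm-1b includes the (Llama 3) tokenizer files
tok = AutoTokenizer.from_pretrained("sesame/csm-1b")
ids = tok("[0]Hey how are you doing.").input_ids
print(ids)  # compare against the "prompt (N tokens):" dump from llama-tts-csm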

@ngxson (Collaborator, Author) commented Apr 3, 2025

I had a deeper look into the code of the HF demo space. It seems that for each turn, they re-evaluate the whole "chat" history: https://huggingface.co/spaces/sesame/csm-1b/blob/main/app.py#L150-L156

But that does not change much. My understanding is that this works the same way as text chat templates. The only difference is that in this case, with audio embeddings, our chat template looks like this:

<bos> ... text1 ... <text_eos> ... audio_embd ... <audio_eos><bos> ... text2 ... <text_eos> ... audio_embd ... <audio_eos> ...

So it seems we were just missing <audio_eos>. I added it in my last commit, but it does not change much. The only difference so far is that it is now able to generate a male/female voice for each separate turn (which it was unable to do before).

What I'm speculating is that we're also missing the "system prompt" (same idea as the speaker in OuteTTS), which shows the model how the voice should behave. In the official demo, they have a casual voice and a more serious news-reader voice. I'll give it a try.

@Desir-Armann

@ngxson would it be possible to have audio streaming?

@ngxson (Collaborator, Author) commented Apr 5, 2025

We don't support streaming, to keep things simple. It can be added in the future once the implementation becomes more stable.

@ShaanveerS

@ngxson Appreciate the thorough work here.
You mentioned that streaming could be added once things stabilize... would you be open to briefly describing what steps or components would be involved in supporting it?
Thanks a lot.

@ggerganov (Member)

@ngxson Should I review or wait for the:

What I'm speculating is that we're also missing the "system prompt" (same idea as the speaker in OuteTTS), which shows the model how the voice should behave. In the official demo, they have a casual voice and a more serious news-reader voice. I'll give it a try.

@ngxson (Collaborator, Author) commented Apr 8, 2025

@ggerganov please go ahead and review this PR. The system prompt will be simple to add; I will try to do that a bit later (it requires me to use the Mimi encoder via transformers).

@ngxson (Collaborator, Author) commented Apr 9, 2025

@ggerganov I added the speaker reference and it works well. You were right about the newline issue: the model is very sensitive to newline characters and usually adds a long pause in place of the newline.

output.mp4

@ggerganov (Member) left a comment

It's useful to have a Mimi implementation, even just as an example for now. The LLAMA_CSM graph has some hacks - we should try to avoid them in the future.

A bit concerned that we are merging a lot of code that probably won't get exercised often by users.

What is preventing us from implementing the entire Mimi decoder as a libllama model?

Is it correct that after the Mimi encoder is implemented we would be able to pass previous audio as input to the context?

}

} else {
// otherwise, dummy output
@ggerganov (Member)

Can we reach this branch?

@ngxson (Collaborator, Author)

I'm not 100% sure why, but the warmup call runs this with ggml_nelements(cur) == 0, which triggers this branch. I assume that's because no output token is set in the batch.

@ngxson (Collaborator, Author) commented Apr 23, 2025

It's useful to have a Mimi implementation, even just as an example for now. The LLAMA_CSM graph has some hacks - we should try to avoid them in the future.

Yes, I was also thinking about the possibility of allowing multiple output heads in the future.

As you may already know, ChatGPT can generate images natively using a diffusion head. I have a feeling that some future models may follow the same path.

What is preventing to implement the entire Mimi decoder as a libllama model?

The main reason is that the cgraphs for RVQ and SeaNet are quite complicated. I'm currently using some hacks (like supporting depth-wise models in ggml_conv_transpose_1d), so adding them to libllama right now would pollute the code base quite a lot.

But ofc we can consider bringing it to libllama at some point if it gets more usage.

Is it correct that after the Mimi encoder is implemented we would be able to pass previous audio as input to the context?

Yes, the Mimi encoder is required to generate the speaker reference (voice cloning). But in fact, my main goal is to support real-time speech-to-speech like Kyutai Moshi, or the not-yet-released Sesame model that they used in the online demo.

If it ever gets released, I think it will spark some usage for this example. Otherwise, we could also consider removing this (in the future) if it adds too much maintenance burden.

@ngxson (Collaborator, Author) commented Apr 23, 2025

@ggerganov On second thought, I think I'll keep the PR open for a while to see if more people are interested in it. I also want to see whether Sesame is going to release the speech-to-speech model sooner or later.

It's not very urgent to merge because most of the code is outside of the main library anyway, and people seem to be more interested in audio input than audio output.

@Ashoka74 commented Apr 23, 2025

@ngxson Thank you for your work!

I tried to run the CSM inference command through llama.cpp, but the output took around 1 min to generate. I had to add --no-mmap, maybe because of a mismatch between my CPU and the architecture.

Out of curiosity, based on your tests, is it viable for real-time conversations on a high-end mobile device? What minimum VRAM/RAM would you suggest?

@ngxson (Collaborator, Author) commented Apr 23, 2025

Out of curiosity, based on your tests, is it viable for real-time conversations on a high-end mobile device?

The released model is TTS, not a speech-to-speech conversational model like GPT advanced voice mode or Kyutai Moshi, so there is no accurate estimate of what's needed.

But what we can expect is that a typical speech-to-speech model will be around 7B, so about 10 GB in Q8_0. According to Moshi, processing needs to run at a minimum of 12.5 tokens/s to be real-time.
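For intuition, the per-frame compute budget that rate implies (simple arithmetic, not a benchmark):

# at 12.5 frames/s, each audio frame leaves an 80 ms budget, inside which
# 1 backbone pass + 31 decoder passes (the 32 RVQ codes) must complete
frames_per_second = 12.5
budget_ms = 1000.0 / frames_per_second
print(f"{budget_ms:.0f} ms per frame")  # -> 80 ms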

@Horschig commented Apr 25, 2025

Thanks for the effort! I compiled it on Windows and used your example.
./llama-tts-csm.exe -p "[0]Hi, my name is Xuan Son. I am software engineer at Hugging Face."

Works fine! And fairly fast (> 30 tokens/s on a Quadro T1000).

Just to let you know, it logs some prompts that I did not ask for:

---

turn: [0]like revising for an exam I'd have to try and like keep up the momentum because I'd start really early I'd be like okay I'm gonna start revising now and then like you're revising for ages and then I just like start losing steam I didn't do that for the exam we had recently to be fair that was a more of a last minute scenario but like yeah I'm trying to like yeah I noticed this yesterday that like Mondays I sort of start the day with this not like a panic but like a

prompt (111 tokens): 
128000, 58, 15, 60, 4908, 17951, 287, 369, 459, 7151, 358, 4265, 617, 311, 1456, 323, 1093, 2567, 709, 279, 24151, 1606, 358, 4265, 1212, 2216, 4216, 358, 4265, 387, 1093, 17339, 358, 2846, 16926, 1212, 17951, 287, 1457, 323, 1243, 1093, 499, 2351, 17951, 287, 369, 17051, 323, 1243, 358, 1120, 1093, 1212, 13490, 20930, 358, 3287, 956, 656, 430, 369, 279, 7151, 584, 1047, 6051, 311, 387, 6762, 430, 574, 264, 810, 315, 264, 1566, 9568, 15398, 719, 1093, 22371, 358, 2846, 4560, 311, 1093, 22371, 358, 14000, 420, 13985, 430, 1093, 91271, 358, 3460, 315, 1212, 279, 1938, 449, 420, 539, 1093, 264, 22743, 719, 1093, 264, 128001, 


---

turn: [1]like a super Mario level. Like it's very like high detail. And like, once you get into the park, it just like, everything looks like a computer game and they have all these, like, you know, if, if there's like a, you know, like in a Mario game, they will have like a question block. And if you like, you know, punch it, a coin will come out. So like everyone, when they come into the park, they get like this little bracelet and then you can go punching question blocks around.

prompt (119 tokens): 
128000, 58, 16, 60, 4908, 264, 2307, 24270, 2237, 13, 9086, 433, 596, 1633, 1093, 1579, 7872, 13, 1628, 1093, 11, 3131, 499, 636, 1139, 279, 6246, 11, 433, 1120, 1093, 11, 4395, 5992, 1093, 264, 6500, 1847, 323, 814, 617, 682, 1521, 11, 1093, 11, 499, 1440, 11, 422, 11, 422, 1070, 596, 1093, 264, 11, 499, 1440, 11, 1093, 304, 264, 24270, 1847, 11, 814, 690, 617, 1093, 264, 3488, 2565, 13, 1628, 422, 499, 1093, 11, 499, 1440, 11, 21004, 433, 11, 264, 16652, 690, 2586, 704, 13, 2100, 1093, 5127, 11, 994, 814, 2586, 1139, 279, 6246, 11, 814, 636, 1093, 420, 2697, 59519, 323, 1243, 499, 649, 733, 68981, 3488, 10215, 2212, 13, 128001, 


---

turn: [0]Hi, my name is Xuan Son. I am software engineer at Hugging Face.

prompt (23 tokens): 
128000, 58, 15, 60, 13347, 11, 856, 836, 374, 1630, 10602, 12103, 13, 358, 1097, 3241, 24490, 520, 473, 36368, 19109, 13, 128001, 

This happens in both debug and release builds.

But it's really awesome that this works on my poor 4GB GPU! Now I just need a German finetune... ;)

It's not very urgent to merge because most of the code is outside of the main library anyway, and people seem to be more interested in audio input than audio output.

And btw, I absolutely disagree ;)

@NathanMarq commented May 9, 2025

I'm seeing the same extra prompts that @Horschig mentioned, for an M1 Mac build. Pinging here so this doesn't get too stale.

It's not very urgent to merge because most of the code is outside of the main library anyway, and people seem to be more interested in audio input than audio output.

I also strongly disagree that STT is more desired than TTS. Having this be the backbone for a new 'conversational' voice system like SesameAI's demo (as mentioned in the OP issue) would be extremely popular.

PS. 🙌 Thank you for your hard work on this! It's very cool to see it running so well.

Edit: Those logs look to be coming from that data file here: examples/tts/tts-csm-data.h

@farris commented May 16, 2025

[quotes @Horschig's comment above in full]

I am also wondering where the random prompts are coming from.

@ngxson (Collaborator, Author) commented May 16, 2025

The "random" prompt is the speaker reference (it's not random, but I hard-coded it), it acts as an example of how the voice of 2 people should sound like

In theory, you can swap in whatever voice you want, and it essentially becomes voice cloning.

To generate it, however, you need to go through the Python code, and atm I don't have time to document it.
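For reference, a minimal sketch of that Python step using the Mimi encoder in transformers (the kyutai/mimi checkpoint id and the silence placeholder are assumptions; the actual script may differ):

import numpy as np
import torch
from transformers import AutoFeatureExtractor, MimiModel

# assumption: kyutai/mimi is the encoder checkpoint; swap in real speaker audio
model = MimiModel.from_pretrained("kyutai/mimi")
fe = AutoFeatureExtractor.from_pretrained("kyutai/mimi")

audio = np.zeros(fe.sampling_rate, dtype=np.float32)  # 1 s of silence as a stand-in
inputs = fe(raw_audio=audio, sampling_rate=fe.sampling_rate, return_tensors="pt")
with torch.no_grad():
    codes = model.encode(inputs["input_values"]).audio_codes  # (batch, codebooks, frames)
# these RVQ codes become the hard-coded speaker-reference turns seen in the logs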
