
(draft) tts: Orpheus support #12487


Draft
wants to merge 10 commits into master

Conversation

jamorphy

A rough draft of SNAC conversion to .gguf with convert_hf_to_gguf.py. I will add support for this model incrementally; in the meantime, the PR may be helpful to others.

The upstream config.json (https://huggingface.co/hubertsiuzdak/snac_24khz/resolve/main/config.json) does not contain the following which I added manually:

    "n_layers": 4,
    "architectures": ["SNACDec"],
    "decoder_channel_dims": [1024, 512, 256, 128, 64]

This gets conversion working, but I will need to make some tweaks to infer this information from the weights and avoid changes to config.json. Next steps are to try decoding with some sample Orpheus tokens.
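As a hypothetical sketch of that inference step (the tensor naming below is illustrative, not the actual snac_24khz checkpoint layout), the channel dims and layer count could be read from the conv weight shapes instead of a patched config.json:

```python
import numpy as np

# Hypothetical sketch: derive "decoder_channel_dims" and "n_layers" from
# weight shapes instead of patching config.json. The tensor naming
# ("decoder.block.{i}.weight") is illustrative only.
def infer_decoder_dims(tensors: dict) -> list[int]:
    dims = []
    i = 0
    while f"decoder.block.{i}.weight" in tensors:
        # Conv1d weights are (out_channels, in_channels, kernel_size);
        # the output channel count gives the stage width.
        dims.append(tensors[f"decoder.block.{i}.weight"].shape[0])
        i += 1
    return dims

# Dummy state dict with the channel widths from the hand-edited config.
fake = {
    f"decoder.block.{i}.weight": np.zeros((c, 8, 7), dtype=np.float32)
    for i, c in enumerate([1024, 512, 256, 128, 64])
}
dims = infer_decoder_dims(fake)  # [1024, 512, 256, 128, 64]
n_layers = len(dims) - 1         # 4, matching the hand-added "n_layers"
```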

reference issue: #12476

Working on each part incrementally, added a rough draft for SNAC conversion to .gguf
github-actions bot added the python (python script changes) label Mar 21, 2025
@ggerganov
Member

Good job. Let us know if you have any questions. You might also find some answers by looking at the commits of the OuteTTS PR: #10784

SNAC uses the snake activation function. Added scaffolding to include
`GGML_OP_SNAKE` as a new op. Should this be a unary op?

The SNAC decoder uses noise blocks to enhance outputs. They're optional,
so I'm omitting them for now until the model is integrated e2e.

Next steps: write the `llm_graph_context` for SNAC
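For reference, the snake activation is f(x) = x + (1/α)·sin²(αx). A NumPy sketch, using a scalar α for brevity (in SNAC, α is a learned per-channel parameter):

```python
import numpy as np

# NumPy sketch of the snake activation: f(x) = x + (1/alpha) * sin(alpha*x)^2.
# In SNAC, alpha is a learned per-channel tensor; a scalar is used here for brevity.
def snake(x: np.ndarray, alpha: float) -> np.ndarray:
    return x + (1.0 / alpha) * np.sin(alpha * x) ** 2
```

One observation on the unary-op question: since snake carries a learned alpha alongside its input, it behaves more like a two-input op than ggml's parameter-free unary ops, unless alpha is passed as an op parameter.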
github-actions bot added the ggml (changes relating to the ggml tensor library for machine learning) label Mar 22, 2025
Now, integrate the LM (seems straightforward, it's Llama 3), rewrite/extend/add
to tts.cpp, then fix bugs and optimize.
@jamorphy
Author

I'm still working on this PR. Orpheus is outputting tokens fine - now ironing out issues in the SNAC graph. I'm aiming to get a reviewable PR out in a few days.

@@ -1391,6 +1392,55 @@ static const std::map<llm_arch, std::map<llm_tensor, const char *>> LLM_TENSOR_N
{ LLM_TENSOR_POS_NET_ATTN_OUT, "posnet.%d.attn_output" },
},
},
{
LLM_ARCH_SNAC_DEC,
Author


Fix this. Do we really need to create a new tensor type for every sub-block and res unit?

Member

For now yes, but this will probably be reworked soon. For now follow the existing pattern.

jamorphy and others added 4 commits April 1, 2025 22:02
Run forward passes with dummy codes. Output tensor shapes (raw audio samples) seem to match expected shape given number of input frames. Attempts with Orpheus to be done soon.

The gguf used in this commit is at: https://huggingface.co/jamorphy/snac-fwd-pass-devel-gguf
@jamorphy
Author

jamorphy commented Apr 8, 2025

Running into speed troubles during graph compute, likely due to some operations being done on the CPU. Is there a profiling tool for the compute graph or something similar? For now I'm logging in ggml_backend_sched_graph_compute_async, but was wondering if there's something better I may have missed.

@ggerganov
Member

For profiling individual ops, you can use:

# profile the GGML_OP_ADD (see the source code for the defined perf tests)
./bin/test-backend-ops -o ADD perf

Although for now I think you can just focus on correctness and leave the performance optimizations for later.

Collaborator

@ngxson left a comment

Re. your question about performance, some conv_1d may not be available on all backends, so I suspect there are many copies back and forth between CPU and GPU.

My kyutai-mimi.cpp implementation does run much faster on CPU compared to GPU because of this. And btw, I usually experiment with things on ggml-easy first, as there are many debugging tools there, then copy the cgraph over to llama.cpp once I'm happy with it. This could probably help you do faster experiments on ggml.

cur = ggml_snake(ctx0, cur, alpha);

ggml_tensor * w = layer.decoder_blocks[1].up_weight;
ggml_tensor * s = ggml_cpy(ctx0, layer.decoder_blocks[1].up_scale,
Collaborator
Out of curiosity, why do we need to copy the tensor here?

Author

Ran into many type mismatches, some ops expecting f16 and others f32. ggml_cpy is just a workaround, and I suspect this may be the cause of slowness. If I remember correctly, the bottleneck was ggml_mul running on CPU.

Collaborator

@ngxson commented May 2, 2025
In this case, you can use ggml_cast. But yeah, the best option is to force the dtype of this tensor to F16 when converting to GGUF.
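A minimal sketch of that conversion-time cast (the name filter is hypothetical; in convert_hf_to_gguf.py this would go wherever the tensor data is prepared):

```python
import numpy as np

# Hypothetical sketch: cast selected tensors to F16 during GGUF conversion so
# the compute graph needs no runtime ggml_cpy/ggml_cast. The "up_scale" name
# filter is illustrative, not the real tensor naming.
def maybe_to_f16(name: str, data: np.ndarray) -> np.ndarray:
    if "up_scale" in name:
        return data.astype(np.float16)
    return data
```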

@jamorphy
Author

jamorphy commented May 2, 2025

Been some time since I looked at this but I'll check out ggml-easy.

@logikstate

logikstate commented May 18, 2025

Has somebody already done this? https://github.com/foldl/chatllm.cpp/blob/master/models/orpheus.cpp

Labels: examples; ggml (changes relating to the ggml tensor library for machine learning); python (python script changes)
4 participants