(draft) tts: Orpheus support #12487
Conversation
Working on each part incrementally; added a rough draft of SNAC conversion to .gguf.
Good job. Let us know if you have any questions. You might also find some answers by looking at the commits of the OuteTTS PR: #10784
SNAC uses the snake activation function. Added scaffolding to include `GGML_OP_SNAKE` as a new op. Should this be a unary op? The SNAC decoder uses noise blocks to enhance outputs; they're optional, so I'm omitting them for now until the model is integrated end-to-end. Next steps: write the `llm_graph_context` for SNAC, integrate the LM (seems straightforward, it's llama3), rewrite/extend/add to tts.cpp, then fix bugs and optimize.
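For reference, the snake activation (Ziyin et al., 2020) used in SNAC's decoder is snake(x) = x + sin²(αx)/α with a learnable per-channel α. Below is a minimal CPU sketch of those semantics, assuming a contiguous channels-by-samples layout; it is illustrative only, not the `GGML_OP_SNAKE` kernel from this PR:

```cpp
// Reference semantics for snake: snake(x) = x + sin^2(a*x)/a,
// with one learnable alpha per channel. Layout and names are assumptions.
#include <cmath>

static void snake_forward(float * x, const float * alpha,
                          int n_channels, int n_samples) {
    for (int c = 0; c < n_channels; ++c) {
        // guard against division by zero for a degenerate alpha
        const float a = alpha[c] != 0.0f ? alpha[c] : 1e-9f;
        for (int i = 0; i < n_samples; ++i) {
            float & v = x[c * n_samples + i];
            const float s = std::sin(a * v);
            v += (s * s) / a;
        }
    }
}
```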
I'm still working on this PR. Orpheus is outputting tokens fine; now ironing out issues in the SNAC graph. I'm aiming to get a reviewable PR out in a few days.
WIP orpheus tts
```
@@ -1391,6 +1392,55 @@ static const std::map<llm_arch, std::map<llm_tensor, const char *>> LLM_TENSOR_NAMES = {
        { LLM_TENSOR_POS_NET_ATTN_OUT, "posnet.%d.attn_output" },
    },
},
{
    LLM_ARCH_SNAC_DEC,
```
Fix this. Do we really need to create a new tensor type for every sub-block and res unit?
For now, yes, but this will probably be reworked soon. In the meantime, follow the existing pattern.
Ran forward passes with dummy codes. Output tensor shapes (raw audio samples) seem to match the expected shape given the number of input frames. Attempts with Orpheus coming soon. The gguf used in this commit is at: https://huggingface.co/jamorphy/snac-fwd-pass-devel-gguf
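As a sanity check on those shapes: each SNAC decoder block upsamples by some rate, so the output sample count should be the input frame count times the product of the rates. A hedged sketch of that check, where the rates array is a placeholder for whatever the converted model actually stores:

```cpp
// Hypothetical shape check: with decoder upsample rates r_i, the output
// should hold n_frames * prod(r_i) audio samples. Names are illustrative.
#include <cstdint>

static int64_t snac_expected_samples(int64_t n_frames,
                                     const int * rates, int n_rates) {
    int64_t hop = 1;
    for (int i = 0; i < n_rates; ++i) {
        hop *= rates[i];
    }
    return n_frames * hop;
}

// usage sketch: GGML_ASSERT(out->ne[0] == snac_expected_samples(n_frames, rates, n_rates));
```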
A forward pass
Running into speed troubles during graph compute, likely due to some operations being done on the CPU. Is there a profiling tool for the compute graph, or something similar? For now I'm logging in
For profiling individual ops, you can use:

```sh
# profile GGML_OP_ADD (see the source code for the defined perf tests)
./bin/test-backend-ops -o ADD perf
```

Although for now I think you can just focus on correctness and leave the performance optimizations for later.
Re. your question about performance: some conv_1d ops may not be available on all backends, so I suspect there are many copies back and forth between CPU and GPU.
My kyutai-mimi.cpp implementation runs much faster on CPU than on GPU because of this. Btw, I usually experiment in ggml-easy first, as there are many debugging tools there, then copy the cgraph over to llama.cpp once I'm happy with it. This could probably help you run faster experiments on ggml.
```cpp
cur = ggml_snake(ctx0, cur, alpha);

ggml_tensor * w = layer.decoder_blocks[1].up_weight;
ggml_tensor * s = ggml_cpy(ctx0, layer.decoder_blocks[1].up_scale,
```
Out of curiosity, why do we need to copy the tensor here?
Ran into many type mismatches: some ops expect f16 and others f32. ggml_cpy is just a workaround, and I suspect it may be the cause of the slowness. If I remember correctly, the bottleneck was ggml_mul running on the CPU.
In this case, you can use ggml_cast. But the best option is to force the dtype of this tensor to F16 when converting to GGUF.
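For what it's worth, a minimal sketch of that suggestion, reusing the tensor names from the draft diff above (which may change): ggml_cast inserts the type conversion into the graph without requiring a pre-allocated destination tensor the way ggml_cpy does.

```cpp
// Sketch: replace the ggml_cpy workaround with an explicit cast to F16.
// Field names follow the draft diff above and are not final.
ggml_tensor * w = layer.decoder_blocks[1].up_weight;
ggml_tensor * s = ggml_cast(ctx0, layer.decoder_blocks[1].up_scale, GGML_TYPE_F16);
```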
It's been some time since I looked at this, but I'll check out ggml-easy.
Has somebody already done this? https://github.com/foldl/chatllm.cpp/blob/master/models/orpheus.cpp
A rough draft of SNAC conversion to .gguf with convert_hf_to_gguf.py. I will add support for this model incrementally; in the meantime, the PR may be helpful to others.

The upstream config.json (https://huggingface.co/hubertsiuzdak/snac_24khz/resolve/main/config.json) does not contain the following, which I added manually:

This gets conversion working, but I will need to make some tweaks to infer this information from the weights and avoid changes to config.json. Next steps are to try decoding with some sample Orpheus tokens.

Reference issue: #12476