Prerequisites
- I am running the latest code. Mention the version if possible as well.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new and useful enhancement to share.
Feature Description
This issue is to track work to support IBM's Granite 4 model architecture (`GraniteMoEHybrid` in `transformers`). The model uses a number of components that are not yet supported in `llama.cpp`, but that are being worked on independently, so I'm raising this issue to triangulate the different work streams that will be needed to support the model.
Necessary Components
- Mamba2 layers (a simplified sketch of the recurrence follows this list)
  - Ongoing work by @compilade: llama : initial Mamba-2 support #9126
  - Refactored KV Cache to an abstract interface: kv-cache : separate recurrent vs non-recurrent impl #12799
- Support for hybrid attention / recurrent cache (see the cache sketch after this list)
  - Initial implementation for `jamba` by @compilade: llama : support Jamba hybrid Transformer-Mamba models #7531
  - Initial implementation for `bamba`: Bamba architecture #10810
  - Updated implementation for `bamba` that's also out-of-date: https://github.com/gabe-l-hart/llama.cpp/tree/BambaArchitectureRefactor
  - First cut implementation against current abstract interfaces: https://github.com/gabe-l-hart/llama.cpp/tree/HybridCache
- Support for `GraniteMoEShared` layers: Model: Granite MoE shared #13269
- Support for `mamba2` in non-CPU backends
  - I'm not totally clear on the state here, so there may well be ongoing work
  - CUDA support for some of the necessary features was added in Faster ssm scan #10558
  - Some of the `metal` backend needs look like they're addressed already in llama : initial Mamba-2 support #9126, but for me that still doesn't work on my M3 (assertion error about non-contiguous data).
- Support for NoPE positional encoding instead of RoPE (see the RoPE/NoPE sketch after this list)
  - I haven't fully investigated what is required for this, so it may already work as-is, but putting this here as a placeholder in case further work is needed
- End-to-end `GraniteMoEHybrid` support tying all of the other pieces together
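For background on why the Mamba2 layers need different cache handling than attention, here is a deliberately simplified, scalar-parameter version of the state-space recurrence that Mamba-style layers build on. This is an illustration only, not the actual Mamba-2 SSD kernel from #9126, and every name in it is made up:

```cpp
// Simplified diagonal SSM recurrence (illustrative only, NOT Mamba-2's actual
// SSD formulation): h_t = a * h_{t-1} + b * x_t,  y_t = c * h_t.
// The key property: the state h has a fixed size regardless of sequence
// length, which is why recurrent layers don't need a growing KV cache.
#include <cstdio>
#include <vector>

int main() {
    const int d_state = 4;                    // hypothetical state size
    std::vector<float> h(d_state, 0.0f);      // persistent per-sequence state
    const float a = 0.9f, b = 0.5f, c = 1.0f; // per-channel params (scalars here)

    const std::vector<float> tokens = {1.0f, 0.0f, -1.0f, 2.0f};
    for (float x : tokens) {
        float y = 0.0f;
        for (int i = 0; i < d_state; ++i) {
            h[i] = a * h[i] + b * x;  // state update, overwritten in place
            y   += c * h[i];          // readout
        }
        std::printf("y = %f (state stays %d floats)\n", y, d_state);
    }
    return 0;
}
```

The point is that `h` is overwritten in place and stays the same size no matter how long the sequence gets, which is exactly what the recurrent side of a hybrid cache needs to store.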
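Relatedly, a minimal sketch of the hybrid-cache shape the items above are converging on: one per-layer cache interface, with a growing KV implementation for attention layers and a fixed-size state implementation for recurrent layers. This is not the abstract interface from #12799; all types and members here are hypothetical:

```cpp
// Hypothetical per-layer cache interface (NOT llama.cpp's actual API).
#include <cstddef>
#include <cstdio>
#include <memory>
#include <vector>

struct layer_cache {
    virtual ~layer_cache() = default;
    virtual void accept_token() = 0;        // called once per decoded token
    virtual size_t size_floats() const = 0;
};

// Attention layers: K/V grow linearly with the number of cached tokens.
struct kv_cache_layer : layer_cache {
    size_t head_dim;
    std::vector<float> k, v;
    explicit kv_cache_layer(size_t hd) : head_dim(hd) {}
    void accept_token() override {
        k.resize(k.size() + head_dim);
        v.resize(v.size() + head_dim);
    }
    size_t size_floats() const override { return k.size() + v.size(); }
};

// Recurrent layers: the state is updated in place and never grows.
struct recurrent_cache_layer : layer_cache {
    std::vector<float> state;
    explicit recurrent_cache_layer(size_t d) : state(d, 0.0f) {}
    void accept_token() override { /* state overwritten in place */ }
    size_t size_floats() const override { return state.size(); }
};

int main() {
    // Hypothetical hybrid stack: mostly recurrent, with sparse attention layers.
    std::vector<std::unique_ptr<layer_cache>> layers;
    for (int il = 0; il < 8; ++il) {
        if (il % 4 == 3) {
            layers.push_back(std::make_unique<kv_cache_layer>(64));
        } else {
            layers.push_back(std::make_unique<recurrent_cache_layer>(256));
        }
    }
    for (int t = 0; t < 1024; ++t) {
        for (auto & l : layers) { l->accept_token(); }
    }
    size_t total = 0;
    for (auto & l : layers) { total += l->size_floats(); }
    std::printf("cache after 1024 tokens: %zu floats (only attention layers grew)\n", total);
    return 0;
}
```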
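On the NoPE point: if I understand it correctly, "support" may amount to simply not applying the RoPE rotation on those layers. A standalone sketch of the standard rotation (not llama.cpp's actual rope code), where the NoPE case is just the identity:

```cpp
// Standard RoPE: rotate consecutive (even, odd) pairs of the Q/K vector by a
// position-dependent angle. A NoPE layer simply never calls this, so Q/K are
// used as-is and position is conveyed only by the causal mask / recurrence.
#include <cmath>
#include <cstdio>
#include <vector>

void rope_inplace(std::vector<float> & x, int pos, float freq_base = 10000.0f) {
    const int d = (int) x.size();
    for (int i = 0; i + 1 < d; i += 2) {
        const float theta = pos * std::pow(freq_base, -(float) i / (float) d);
        const float c = std::cos(theta), s = std::sin(theta);
        const float x0 = x[i], x1 = x[i + 1];
        x[i]     = x0 * c - x1 * s;
        x[i + 1] = x0 * s + x1 * c;
    }
}

int main() {
    std::vector<float> q_rope = {1.0f, 0.0f, 1.0f, 0.0f};
    std::vector<float> q_nope = q_rope; // NoPE: leave the vector untouched
    rope_inplace(q_rope, /*pos=*/5);
    std::printf("RoPE: %f %f ... NoPE: %f %f\n", q_rope[0], q_rope[1], q_nope[0], q_nope[1]);
    return 0;
}
```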
Motivation
I lead IBM's efforts to ensure that Granite models work everywhere, and `llama.cpp` is a critical part of "everywhere!"
Possible Implementation
No response