Skip to content

Hybrid recurrent cache #13904

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

gabe-l-hart
Copy link
Contributor

@gabe-l-hart gabe-l-hart commented May 29, 2025

Dependencies

Description

This PR replaces my initial (more generic) attempt in #13276. Following Georgi's comment, this version is explicit about owning two child caches, one unified and one recurrent. This greatly simplifies the logic and avoids the need for additional interface changes to support managing child caches in the abstract.

This PR has been verified to be minimally functional in #13550 (Granite 4 / Bamba).

@ggerganov ggerganov force-pushed the gg/kv-cache-simplify-part3 branch 2 times, most recently from 256f1b7 to 9d05381 Compare May 30, 2025 08:22
@ggerganov ggerganov force-pushed the gg/kv-cache-simplify-part3 branch from 9d05381 to 2b984f4 Compare May 30, 2025 08:29
@ggerganov
Copy link
Member

Sorry about the massive changes, hopefully we are almost converging to the final version of the implementation. See #13746 (comment) for the latest short-term plan.

@gabe-l-hart
Copy link
Contributor Author

No problem, and thanks for the quick work getting these pieces overhauled! I'll plan to keep this branch up to date as the changes roll in so that it will hopefully be ready for review quickly once its turn comes.

@gabe-l-hart gabe-l-hart mentioned this pull request May 30, 2025
2 tasks
Also, split llama_model_is_recurrent into llm_arch_is_recurrent in
llama-arch with llama_model_is_recurrent delegating to
llm_arch_is_recurrent. The same split is done for hybird. This is needed
because there are places where the llama_model has not yet been initialized
but we need to check if the model is recurrent (specifically for the
per-layer recurrent check array in hparams).

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
…s in hparams

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
…l is recurrent

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
The implementation of the hybrid cache intentionally does not specify the
types of the child caches, so there was a naming mismatch with these
predicate functions that used "hybrid" to imply "hybrid recurrent."

Branch: HybridCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
This will be needed by other cache types as well, so centralizing the
definition will make it more reusable.

Branch: HybridCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: HybridCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: HybridCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
This follows the pattern in iswa where the two child caches are held
explicitly to support the case where a model requires a single attention
cache and a single recurrent cache where each layer uses exactly one of the
caches.

This is a rewrite of the more generic approach in the original hybrid cache
PR: ggml-org#13276

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
This includes a refactor of the create_memory logic to avoid needing to use
the arch enum explicitly unless a model needs explicit cache instantiation
logic beyond the standard logic for recurrent, hybrid, unified, and iswa.

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
@gabe-l-hart gabe-l-hart force-pushed the HybridRecurrentCache branch from 089c968 to 4a2709f Compare May 30, 2025 15:38
@ggerganov ggerganov force-pushed the gg/kv-cache-simplify-part3 branch from f23e4cc to 71619f2 Compare May 31, 2025 07:05
@ggerganov ggerganov deleted the branch ggml-org:gg/kv-cache-simplify-part3 May 31, 2025 07:24
@ggerganov ggerganov closed this May 31, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants