Hybrid recurrent cache #13904

gabe-l-hart · 2025-05-29T22:16:45Z

Dependencies

kv-cache : refactor + add llama_memory_state_i #13746

Description

This PR replaces my initial (more generic) attempt in #13276. Following Georgi's comment, this version is explicit about owning two child caches, one unified and one recurrent. This greatly simplifies the logic and avoids the need for additional interface changes to support managing child caches in the abstract.
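The ownership structure described above can be sketched as follows. This is a minimal illustration, not the code in this PR: every type and member name (`attn_cache_sketch`, `recurrent_cache_sketch`, `hybrid_cache_sketch`, `get_attn`, `get_recr`) is hypothetical, and the per-layer flag vector is an assumed way to decide which child a layer belongs to.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical stand-ins for the two child caches; NOT the llama.cpp types.
struct attn_cache_sketch {
    std::vector<int32_t> layers; // indices of the layers stored in this cache
};

struct recurrent_cache_sketch {
    std::vector<int32_t> layers; // indices of the layers stored in this cache
};

// The hybrid cache owns exactly two children and names them explicitly,
// instead of managing an abstract list of child caches.
class hybrid_cache_sketch {
public:
    // is_recurrent[il] == true routes layer il to the recurrent child,
    // otherwise the layer goes to the attention child.
    explicit hybrid_cache_sketch(const std::vector<bool> & is_recurrent)
        : attn(std::make_unique<attn_cache_sketch>()),
          recr(std::make_unique<recurrent_cache_sketch>()) {
        for (int32_t il = 0; il < (int32_t) is_recurrent.size(); ++il) {
            if (is_recurrent[il]) {
                recr->layers.push_back(il);
            } else {
                attn->layers.push_back(il);
            }
        }
    }

    // Callers ask for the child they need by name; no generic lookup is required.
    const attn_cache_sketch      & get_attn() const { return *attn; }
    const recurrent_cache_sketch & get_recr() const { return *recr; }

private:
    std::unique_ptr<attn_cache_sketch>      attn;
    std::unique_ptr<recurrent_cache_sketch> recr;
};
```

Because both children are concrete members, a caller can reach the attention or the recurrent state directly, which is the simplification this version aims for.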

This PR has been verified to be minimally functional in #13550 (Granite 4 / Bamba).

ggml-ci

ggerganov · 2025-05-30T10:33:52Z

Sorry about the massive changes, hopefully we are almost converging to the final version of the implementation. See #13746 (comment) for the latest short-term plan.

gabe-l-hart · 2025-05-30T13:50:42Z

No problem, and thanks for the quick work getting these pieces overhauled! I'll plan to keep this branch up to date as the changes roll in so that it will hopefully be ready for review quickly once its turn comes.

Also, split llama_model_is_recurrent into llm_arch_is_recurrent in llama-arch, with llama_model_is_recurrent delegating to llm_arch_is_recurrent. The same split is done for hybrid. This is needed because there are places where the llama_model has not yet been initialized but we need to check whether the model is recurrent (specifically for the per-layer recurrent check array in hparams). Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
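A rough sketch of the split described in this commit message is shown below. The enum values, struct, and function names are hypothetical stand-ins (suffixed `_sketch`), not the real llama.cpp declarations. The point is only that the arch-level predicate needs nothing but the architecture enum, so it can be called while hparams are being filled, before any model object exists.

```cpp
#include <cassert>
#include <vector>

// Hypothetical subset of architectures, for illustration only.
enum arch_sketch {
    ARCH_ATTENTION_SKETCH,
    ARCH_RECURRENT_SKETCH,
};

// Hypothetical model type that only records its architecture.
struct model_sketch {
    arch_sketch arch;
};

// Arch-level predicate: depends only on the enum value.
bool arch_is_recurrent_sketch(arch_sketch arch) {
    return arch == ARCH_RECURRENT_SKETCH;
}

// Model-level predicate: delegates to the arch-level one.
bool model_is_recurrent_sketch(const model_sketch & model) {
    return arch_is_recurrent_sketch(model.arch);
}

// Fill a per-layer "is recurrent" array at hparams-loading time,
// using only the architecture (no initialized model required).
std::vector<bool> fill_recurrent_layer_arr_sketch(arch_sketch arch, int n_layer) {
    return std::vector<bool>(n_layer, arch_is_recurrent_sketch(arch));
}
```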


The implementation of the hybrid cache intentionally does not specify the types of the child caches, so there was a naming mismatch with these predicate functions that used "hybrid" to imply "hybrid recurrent." Branch: HybridCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>


This follows the pattern in iswa, where the two child caches are held explicitly. It supports the case where a model requires a single attention cache and a single recurrent cache, with each layer using exactly one of the two. This is a rewrite of the more generic approach in the original hybrid cache PR: ggml-org#13276 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

This includes a refactor of the create_memory logic to avoid needing to use the arch enum explicitly unless a model needs explicit cache instantiation logic beyond the standard logic for recurrent, hybrid, unified, and iswa. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
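One way to picture the create_memory refactor described above is a dispatch that checks for an arch-specific special case first and otherwise relies on generic predicates. The sketch below is an assumption-laden illustration: the struct fields, the enum, the function name, and the order of the checks are all hypothetical and are not taken from the PR.

```cpp
#include <cassert>

// Hypothetical tags for the kind of memory that would be created.
enum class memory_kind_sketch {
    custom,           // arch-specific instantiation logic
    hybrid_recurrent, // one attention child + one recurrent child
    recurrent,
    unified_iswa,
    unified,
};

// Hypothetical summary of what the loader knows about a model.
struct arch_info_sketch {
    bool needs_custom_memory; // stand-in for "this arch needs explicit logic"
    bool is_hybrid_recurrent;
    bool is_recurrent;
    bool uses_swa;
};

// Only the first branch would need to look at the arch enum explicitly;
// every other model falls through to the standard predicate-based logic.
memory_kind_sketch create_memory_sketch(const arch_info_sketch & info) {
    if (info.needs_custom_memory) { return memory_kind_sketch::custom; }
    if (info.is_hybrid_recurrent) { return memory_kind_sketch::hybrid_recurrent; }
    if (info.is_recurrent)        { return memory_kind_sketch::recurrent; }
    if (info.uses_swa)            { return memory_kind_sketch::unified_iswa; }
    return memory_kind_sketch::unified;
}
```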


This was referenced May 29, 2025

feat: Hybrid unified/recurrent cache #13276

Closed

kv-cache : refactor + add llama_memory_state_i #13746

Merged

ggerganov added 9 commits May 30, 2025 11:08

kv-cache : simplify the "struct llama_kv_cache" interface

582792b

ggml-ci

kv-cache : revert the (n_swa + n_ubatch) change (for next PR)

99653c3

ggml-ci

kv-cache : some comments

052f3f3

ggml-ci

context : fix graph reserve for multiple sequences

5693eb6

ggml-ci

kv-cache : fix typo [no ci]

cb2175f

kv-cache : fix find_slot() logic for free slots

3c6b330

ggml-ci

llama : add TODO for deprecating the defrag API in the future

f98b8d0

kv-cache : improve find_slot() using min/max seq pos info

7e6d403

ggml-ci

llama : handle aborts and compute errors

47e570c

ggml-ci

ggerganov force-pushed the gg/kv-cache-simplify-part3 branch 2 times, most recently from 256f1b7 to 9d05381 Compare May 30, 2025 08:22

memory : extract state into llama_memory_state

2b984f4

ggml-ci

ggerganov force-pushed the gg/kv-cache-simplify-part3 branch from 9d05381 to 2b984f4 Compare May 30, 2025 08:29

kv-cache : add comments

f23e4cc

ggml-ci

gabe-l-hart mentioned this pull request May 30, 2025

Granite Four #13550

Draft

2 tasks

gabe-l-hart added 11 commits May 30, 2025 09:18

feat: Add c++ side constants for attention layer indices hparam

dbad513

Branch: GraniteFour

feat: Add support for distinguishing recurrent vs non-recurrent layers in hparams

453d253

Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

feat: Auto-fill hparams.recurrent_layer_arr based on whether the model is recurrent

26e51f4

Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

feat: Move layer_filter_cb up to llama_kv_cache

162639c

This will be needed by other cache types as well, so centralizing the definition will make it more reusable. Branch: HybridCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
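To illustrate why a shared layer filter is reusable across cache types, here is a minimal sketch. The alias `layer_filter_cb_sketch` and the helper `select_layers_sketch` are hypothetical names, and treating an empty callback as "accept every layer" is an assumption made for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical callback type: returns true if layer il belongs in the cache.
using layer_filter_cb_sketch = std::function<bool(int32_t il)>;

// Collect the layer indices a cache would allocate storage for.
// An empty (null) filter is assumed to accept every layer.
std::vector<int32_t> select_layers_sketch(int32_t n_layer,
                                          const layer_filter_cb_sketch & filter) {
    std::vector<int32_t> selected;
    for (int32_t il = 0; il < n_layer; ++il) {
        if (!filter || filter(il)) {
            selected.push_back(il);
        }
    }
    return selected;
}
```

With a definition like this, an attention cache and a recurrent cache can be handed complementary filters over the same per-layer flags, so each layer ends up in exactly one of them.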

feat: Add layer filter to recurrent cache

a886cc1

Branch: HybridCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

fix: Fix indexing into k_l for recurrent cache with filter

5c149d2

Branch: HybridCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

fix: Use per-layer sizing everywhere in kv caches

4470221

Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

gabe-l-hart added 2 commits May 30, 2025 09:35

fix: Fix wrong bool condition for split equal in hybrid cache

b58351e

Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

feat: Support hybrid recurrent cache in llm_graph_context

4a2709f

Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

gabe-l-hart force-pushed the HybridRecurrentCache branch from 089c968 to 4a2709f Compare May 30, 2025 15:38

ggerganov force-pushed the gg/kv-cache-simplify-part3 branch from f23e4cc to 71619f2 Compare May 31, 2025 07:05

ggerganov deleted the branch ggml-org:gg/kv-cache-simplify-part3 May 31, 2025 07:24

ggerganov closed this May 31, 2025
