Feature Description
Modify llama.cpp to support on-the-fly "Frankenmerging" of the model in memory with itself.
Motivation
Frankenmerges, including auto-Frankenmerges, are becoming increasingly popular and appear to have properties that merit further study. This is Rich Sutton's "bitter lesson" in the small: stacking more decoder blocks means a greater total amount of computation in a single inference pass, and, perhaps surprisingly, under the right circumstances that greater accessible computation outweighs the noise induced by performing fairly brutal surgery on the ordering of decoder blocks.
Right now, experimentation takes place at the level of building new models with mergekit, which is slow. The ability to mix and match decoder blocks on the fly in llama.cpp would speed up iteration and experimentation, helping to better characterize the tradeoff between greater available net computation and the noise introduced by decoder surgery.
Possible Implementation
Something like this:
https://github.com/semiring/IRL-llama.cpp/blob/master/llama.cpp#L4346