Now that we support a large number of architectures, we can clearly see the recurring patterns in how the compute graphs are constructed, i.e. optional biases, different norm types, QKV vs Q+K+V, etc.
We should deduplicate the copy-paste portions in functions such as llm.build_llama(), llm.build_falcon(), etc.
The advantage of the current code is that it is easy to look into the graph of a specific architecture. When we refactor this, we will lose this convenience to some extent. So we should think about making this refactoring in such a way that we don't completely obscure which parts of the graph belong to which architectures.
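One possible direction, as a rough sketch only: keep the shared building blocks in small helpers and describe each architecture with a short options struct, so the per-architecture differences stay visible in one place. Everything below is hypothetical (the names `layer_opts`, `build_attn_input`, `opts_llama_like`, etc. do not exist in the codebase), the graph is modeled as a plain list of op names instead of real tensors, and the per-architecture settings are illustrative assumptions rather than a check against the actual builders.

```cpp
#include <string>
#include <vector>

// All names here are hypothetical and exist only for illustration.
enum class norm_type  { rms, layer };
enum class qkv_layout { fused, split };

// The per-architecture differences named in this issue, gathered in one struct.
struct layer_opts {
    norm_type  norm;
    qkv_layout qkv;
    bool       has_bias;
};

// Stand-in for a compute graph: just the ordered list of op names.
using op_list = std::vector<std::string>;

// Shared helper: pick the normalization op.
static void build_norm(op_list & ops, norm_type t) {
    ops.push_back(t == norm_type::rms ? "rms_norm" : "layer_norm");
}

// Shared helper: matrix multiply with an optional bias add.
static void build_linear(op_list & ops, const std::string & name, bool has_bias) {
    ops.push_back("mul_mat(" + name + ")");
    if (has_bias) {
        ops.push_back("add(" + name + "_b)");
    }
}

// Shared helper: either one fused QKV projection or three separate ones.
static void build_qkv(op_list & ops, const layer_opts & o) {
    if (o.qkv == qkv_layout::fused) {
        build_linear(ops, "wqkv", o.has_bias);
        ops.push_back("split_qkv");
    } else {
        build_linear(ops, "wq", o.has_bias);
        build_linear(ops, "wk", o.has_bias);
        build_linear(ops, "wv", o.has_bias);
    }
}

// The deduplicated part: one function builds the attention input for any architecture.
static op_list build_attn_input(const layer_opts & o) {
    op_list ops;
    build_norm(ops, o.norm);
    build_qkv(ops, o);
    return ops;
}

// Per-architecture settings stay short and readable side by side.
// The values are assumptions for the sake of the example.
static layer_opts opts_llama_like()  { return { norm_type::rms,   qkv_layout::split, false }; }
static layer_opts opts_falcon_like() { return { norm_type::layer, qkv_layout::fused, false }; }
static layer_opts opts_biased_like() { return { norm_type::layer, qkv_layout::split, true  }; }
```

With something like this, a reader could still see what a given architecture does by looking at its options entry, while the graph-building code itself exists only once. The downside is that the full graph of one architecture is no longer readable top to bottom in a single function.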
Open to ideas and suggestions on how best to do this.