Fuse matrix multiplication + SiLU #5413
base: master
Conversation
Some notes:
So I think that addressing this properly will require a lot of changes to ggml. For now, it may be ok to hack this in the …
The most significant gain from fusing would be in the attention. There is already a PR that sets up the necessary framework for fusing: #5021
Is there any other way of fusing ops, other than the way we used for …
The alternative would be to implement fusing automatically in the backends without changing the ops. In that way, implementing fused ops would be an optional optimization for the backends. At the time …
I agree, but Steward (FSSRepo) is already working on this and I think onboarding more devs is good for the project long-term. So if possible I would prefer to let him handle it and gain experience; if the performance is suboptimal I can still take a look afterwards. Other than FlashAttention there should also be performance gains from fusing branching matrix multiplications (in LLaMA): firstly, from fusing the 3 pre-attention KQV matrix multiplications as well as the following RoPE into a single operation; secondly, from fusing the up and gate matrix multiplications as well as the SiLU and multiplication.
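For concreteness, the up/gate part in question is currently built from four separate ggml ops, roughly as in the sketch below. This is only an illustration: the tensor names w_gate, w_up and cur are placeholders rather than the exact llama.cpp variables, though ggml_mul_mat, ggml_silu and ggml_mul are the real ggml builder functions. These four nodes are what a single fused kernel would replace:

// unfused LLaMA FFN up/gate sequence (placeholder tensor names):
// two matrix multiplications sharing the same input, then SiLU, then an
// element-wise multiplication of the two branches
struct ggml_tensor * gate = ggml_mul_mat(ctx, w_gate, cur);  // gate projection
struct ggml_tensor * up   = ggml_mul_mat(ctx, w_up,   cur);  // up projection
gate = ggml_silu(ctx, gate);                                 // SiLU activation
cur  = ggml_mul(ctx, gate, up);                              // combine branches
// a fused kernel could compute all four nodes in a single pass over the inputs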
Are we talking about the same thing? I want to work out the best way of handling fused operations going forward, in particular when it comes to how to handle them across different backends. Unless I'm misunderstanding something in that PR, there is currently a simple define to enable/disable FlashAttention which (I assume) would not work for e.g. Vulkan since there is no implementation.
You could provide more information during the build phase, e.g. by extending … Or, as slaren said, you could add a backend-specific optimization step prior to execution where non-fused operations are replaced with their fused variants if possible.
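As a rough illustration of that optimization-step idea, a backend could walk the graph before allocation/execution and rewrite matching patterns. The sketch below assumes a hypothetical fused op GGML_OP_MUL_MAT_SILU (not an existing ggml op) and direct access to the cgraph internals as they were exposed at the time; it also ignores the branching and graph-resizing issues discussed elsewhere in this thread:

// hypothetical pass: rewrite SILU(MUL_MAT(w, x)) into a single fused node
static void fuse_mul_mat_silu(struct ggml_cgraph * gf) {
    for (int i = 0; i < gf->n_nodes; i++) {
        struct ggml_tensor * node = gf->nodes[i];
        // SiLU is a unary op in ggml (GGML_OP_UNARY + GGML_UNARY_OP_SILU)
        if (node->op != GGML_OP_UNARY || ggml_get_unary_op(node) != GGML_UNARY_OP_SILU) {
            continue;
        }
        struct ggml_tensor * mm = node->src[0];
        if (mm == NULL || mm->op != GGML_OP_MUL_MAT) {
            continue;
        }
        // a real implementation would also have to verify that mm has no other
        // consumers (branching) and remove the now-dead MUL_MAT node afterwards
        node->op     = GGML_OP_MUL_MAT_SILU; // hypothetical fused op
        node->src[0] = mm->src[0];           // weight
        node->src[1] = mm->src[1];           // activations
    }
}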
Are there plans to allow the combination of different backends, and if so, what is the ETA? I plan to implement fused operations for CPU and CUDA. For now we could maybe make it so that fused operations are only used over their non-fused counterparts if none of the other backends are used. Specifically, I would add …
Yes, slaren's suggestion for automatic backend-specific fusing should be the long-term goal. #5021 can be viewed as another exception due to the importance of having faster and memory-efficient attention. Currently, it is a compile-time define because it's simpler this way, but the FA op could be used based on runtime parameters depending on the backends used.
I'm afraid the gains from this might not justify the increase in technical debt. I would recommend attempting this at a later stage, likely when we have some variant of the graph optimization step mentioned earlier.
No worries, I'm willing to implement this. But since this adds a certain level of complexity I wanted to first discuss possible alternatives.
This wouldn't work with …
It is something that I would like to add after the logic for automatic offloading of large matrix multiplications is moved to …
@mofosyne was the merging of master into this branch automatic or did you do it manually? In either case, please don't do this.
It was done manually, as the conflict appeared to be minor and I would like to make it easier to keep PRs going. (Or, as a general rule, should this be left to the PR authors, as they would generally have the full context?)
I personally don't want other people to make any changes to one of my PRs without prior coordination or a good reason. I use my own llama.cpp fork on GitHub to share code between my machines, so changes that I am unaware of can potentially cause me to lose a lot of time debugging.
I think that the simplest way to support this would be to forget about trying to make the operation fusion automatic and just add a bunch of fused ops to ggml with helper functions that can decompose the fused op into its individual ops for cases where the backend does not implement the fused op. Then backends could implement support as such:

// node was created with ggml_mul_mat_bias(ctx, a, b, c) = ggml_add(ggml_mul_mat(a, b), c)
case GGML_OP_FUSED:
    if (ggml_fused_n_ops(node) == 2 &&
        ggml_fused_op(node, 0) == GGML_OP_MUL_MAT &&
        ggml_fused_op(node, 1) == GGML_OP_ADD) {
        ggml_cuda_mul_mat_bias(ctx, node);
    } else {
        // fused op not supported, decompose into the individual ops and compute
        for (int i = 0; i < ggml_fused_n_ops(node); i++) {
            ggml_tensor op = ggml_unfuse(node, i);
            ggml_cuda_compute_forward(ctx, &op);
        }
    }
    break;
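From the graph-building side, usage under this proposal might look roughly like the following; ggml_mul_mat_bias and ggml_unfuse are the hypothetical helpers from the snippet above, not existing ggml functions:

// hypothetical graph construction: one fused node instead of two plain ops;
// semantically equivalent to ggml_add(ctx, ggml_mul_mat(ctx, w, x), bias)
struct ggml_tensor * cur = ggml_mul_mat_bias(ctx, w, x, bias);

// a backend that recognizes the (MUL_MAT, ADD) pattern runs a single fused
// kernel; any other backend calls ggml_unfuse() to recover the two plain ops
// and computes them one after the other, so the graph-building code is the
// same either way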
That sounds like a good approach. Do you think this same approach can be applied to the existing convolution operators (related: ggml-org/ggml#873)? I.e. the …
When implementing fused ops, keep in mind that if the I/O scales linearly with batch size then you only really get a performance improvement for small batch sizes, so that is what the focus should be. (At the same time, it is the kernels for large batch sizes that contribute the most to compilation time/binary size.)

For simple fused ops in the context of llama.cpp I think making them opt-in is fine. Long-term, one of my goals is to enable training via ggml though, so that would require a user writing training code for a model architecture to know about those fused ops. For the more complicated cases like batching multiple matrix multiplications this will probably get relatively tricky (though with the stream-k MMQ implementation that has become much less important).

Automatic optimization of the ggml graph is not mutually exclusive with opt-in fused operations though; it would definitely be possible to implement the fused ops and only implement automatic optimization at a later point.
The approach would work well for operations such as mul_mat + activation + bias + scale, etc., where the additional operations can be run in place, since that doesn't require allocating new tensors. I don't think that this would work very well in the im2col case because the unfused path requires more memory than the fused path, since im2col requires allocating an intermediate tensor to store the result. To be able to do this efficiently, the selection between the fused and unfused path would need to be done before graph allocation.
Currently I think there is a lot of potential for performance improvements from fusing operations together, since this would reduce the amount of I/O needed. I'm interested in working on this but I want to determine the implementation details in advance in order to minimize the time spent on refactoring.

I think there are two different approaches: the user code could either explicitly opt in by building the graph with some fused kernel, or the graph could be optimized afterwards by replacing certain tensor sequences with fused kernels. In this PR I tried a simple graph optimization implementation for fusing matrix multiplication and SiLU, but even that is already relatively complicated because the graph size changes and branching has to be handled (branching is not considered in this PR).

It's possible that I'm simply ignorant with regard to the best way to implement graph optimization, but so far my impression is that an explicit opt-in would be preferable. @ggerganov @slaren your feedback would be appreciated.
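For comparison, the two options might look roughly like this from the perspective of the graph-building code; ggml_mul_mat_silu and ggml_graph_optimize are hypothetical names used only for illustration, while ggml_mul_mat, ggml_silu and ggml_build_forward_expand are the existing ggml functions:

// option 1: explicit opt-in, the user requests the fused kernel directly
// (ggml_mul_mat_silu is hypothetical, not an existing ggml function)
cur = ggml_mul_mat_silu(ctx, w, cur);

// option 2: the user builds the plain graph and a later optimization pass
// rewrites matching op sequences (ggml_graph_optimize is hypothetical;
// gf is an existing struct ggml_cgraph *)
cur = ggml_silu(ctx, ggml_mul_mat(ctx, w, cur));
ggml_build_forward_expand(gf, cur);
ggml_graph_optimize(gf); // hypothetical: replaces MUL_MAT + SILU pairs with a fused node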