vulkan: add specific MMV kernels for IQ2 and IQ3 quants + optimizations #11595
Conversation
llvmpipe seems to have issues with the shared-memory table copy in init_iq_shmem; adding a bounds check makes it happy:

shared uvec2 iq2xxs_grid[256];

void init_iq_shmem(uvec3 wgsize)
{
    // copy the table into shared memory and sync
    [[unroll]] for (uint i = 0; i < iq2xxs_grid.length(); i += wgsize.x) {
        if (i + gl_LocalInvocationIndex.x < iq2xxs_grid.length())
            iq2xxs_grid[i + gl_LocalInvocationIndex.x] = iq2xxs_grid_const[i + gl_LocalInvocationIndex.x];
    }
    barrier();
}
I didn't realize we were using such large workgroup sizes with these init functions for getrows. Maybe the branch condition should do something like …
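One hypothetical shape such a condition could take, inside the loop from the snippet above (illustrative only, not necessarily the variant being suggested here): assuming wgsize is the compile-time workgroup size, the bounds check only matters when it does not evenly divide the table length, so the test can be written to fold away in the common case:

// Hypothetical sketch, not the actual change: when iq2xxs_grid.length() is an
// exact multiple of wgsize.x, the left-hand test is always true and the bounds
// check can be removed by the compiler; otherwise it guards the final partial
// iteration where some invocations would write past the end of the table.
[[unroll]] for (uint i = 0; i < iq2xxs_grid.length(); i += wgsize.x) {
    const uint idx = i + gl_LocalInvocationIndex.x;
    if (iq2xxs_grid.length() % wgsize.x == 0 || idx < iq2xxs_grid.length()) {
        iq2xxs_grid[idx] = iq2xxs_grid_const[idx];
    }
}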
That's why I love the llvmpipe test, as it finds all those issues which get ignored by regular GPUs or traditional subgroup sizes. BTW, have you noticed an improvement on your end with …
In that case I believe the issue also appears on an actual GPU, but it is probably hidden by hardware bounds checking, which llvmpipe doesn't do.
Force-pushed from 0c2ff18 to 8608322.
It should be OK now, llvmpipe seems happy.
I've verified all tests are passing with the latest commit on RTX 4070.
Tests are passing and I'm seeing good performance improvements on my end.
Sorry if this is off-topic, but since it came about due to this pull request and I was using this branch, may I ask if this is normal? I did a simple benchmark using a 3B LLM model, and these were the results:
Is it expected for IQ4_XS to be so slow?
@0cc4m any concerns with merging this?
Sorry about the delay on my side. I tested it and found that it's mostly fine, but I see a significant performance drop on iq3_s and iq4_xs for batches > 1 in the MMV shaders. For matrix-matrix multiplication (batches > 8) I only see a difference with coopmat and coopmat2, not without them.
Nvidia RTX 3090
Coopmat2:
Coopmat1:
AMD Radeon Pro VII
Intel A770
Comparisons generated using modified compare.py by @daniandtheweb
Here's the before/after for RTX 4070. The only clear decrease is for iq4_xs, and for that type the only change was to NUM_ROWS.
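For context on the NUM_ROWS knob mentioned above: as I understand the Vulkan MMV shaders, it controls how many output rows a single workgroup produces, so raising it amortizes the per-workgroup setup (such as init_iq_shmem) over more output elements at the cost of more accumulators. The following is a simplified, self-contained sketch of that idea using plain float weights instead of the real IQ dequantization; apart from NUM_ROWS itself, all names and the exact structure are illustrative, not the actual ggml shader.

#version 450
#extension GL_EXT_control_flow_attributes : require

// Illustrative sketch only (not the real ggml-vulkan mul_mat_vec shader):
// each workgroup computes NUM_ROWS consecutive rows of y = A * x.
#define NUM_ROWS 2
#define BLOCK_SIZE 32

layout (local_size_x = BLOCK_SIZE, local_size_y = 1, local_size_z = 1) in;

layout (std430, binding = 0) readonly  buffer MatA { float a[]; };  // row-major [nrows][ncols]
layout (std430, binding = 1) readonly  buffer VecX { float x[]; };  // [ncols]
layout (std430, binding = 2) writeonly buffer VecY { float y[]; };  // [nrows]

layout (push_constant) uniform PC { uint ncols; } pc;

shared float tmp[NUM_ROWS][BLOCK_SIZE];

void main() {
    const uint tid       = gl_LocalInvocationID.x;
    const uint first_row = gl_WorkGroupID.x * NUM_ROWS;

    // per-thread partial dot products, one accumulator per output row
    float sums[NUM_ROWS];
    [[unroll]] for (uint r = 0; r < NUM_ROWS; r++) {
        sums[r] = 0.0;
    }

    for (uint c = tid; c < pc.ncols; c += BLOCK_SIZE) {
        const float xv = x[c];
        [[unroll]] for (uint r = 0; r < NUM_ROWS; r++) {
            sums[r] += a[(first_row + r) * pc.ncols + c] * xv;
        }
    }

    // tree reduction across the workgroup in shared memory
    [[unroll]] for (uint r = 0; r < NUM_ROWS; r++) {
        tmp[r][tid] = sums[r];
    }
    barrier();
    [[unroll]] for (uint s = BLOCK_SIZE / 2; s > 0; s >>= 1) {
        if (tid < s) {
            [[unroll]] for (uint r = 0; r < NUM_ROWS; r++) {
                tmp[r][tid] += tmp[r][tid + s];
            }
        }
        barrier();
    }

    if (tid == 0) {
        [[unroll]] for (uint r = 0; r < NUM_ROWS; r++) {
            y[first_row + r] = tmp[r][0];
        }
    }
}

In the real shaders the inner loop dequantizes IQ2/IQ3 blocks via the shared lookup tables, which is presumably why changing NUM_ROWS shifts the balance between per-workgroup overhead and per-row register/shared-memory pressure.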
Yeah, I don't think batching performance is important enough to hold up this PR. Overall it looks fine to me.
vulkan: add specific MMV kernels for IQ2 and IQ3 quants + optimizations (ggml-org#11595)
* vulkan: implement specialized MMV kernels for IQ2 quantizations
* vulkan: add MMV kernels for IQ3 quants
* vulkan: Increase MMV batch size and unroll IQ LUT setup
* vulkan: fix init_iq_shmem for WG sizes larger than tables
* vulkan: common batch size for all I-quants
(This is a draft written on top of #11501 and #11528.)
This PR introduces MMV kernels for IQ2 and IQ3 quantizations. It also includes optimizations suggested by @jeffbolznv (unrolled init_iq_shmem and 2x block size in mul_mat_vec). After this PR, the performance of IQ2/IQ3 seems in line with comparable K-quants (model size × t/s is similar). Note that the kernels for IQ1 quants are included in #11528.
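To make that metric concrete (the numbers here are made up purely for illustration): a quantized model file of roughly 8 GiB decoding at 60 t/s has to stream its weights about 60 times per second, so model size × t/s ≈ 8 GiB × 60/s ≈ 480 GiB/s, which can be compared against the GPU's theoretical memory bandwidth and across quant types of different file sizes.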
Performance before all optimizations
(both Mesa compilers for AMD target are shown: ACO and LLVM)
(llama-bench output is annotated with the estimated bandwidth, model size × t/s)
(Qwen IQ1 model files are from https://huggingface.co/legraphista/Qwen2.5-Coder-7B-Instruct-IMat-GGUF)
(model files from bartowski/Mistral-Small-24B-Instruct-2501-GGUF have the wrong name "llama 13B")
Performance after: