vulkan: add specific MMV kernels for IQ2 and IQ3 quants + optimizations #11595

Merged
merged 5 commits into from
Feb 28, 2025

remyoudompheng
Contributor

@remyoudompheng remyoudompheng commented Feb 2, 2025

(This is a draft written on top of #11501 and #11528.)

This PR introduces MMV kernels for IQ2 and IQ3 quantizations. It also includes optimizations suggested by @jeffbolznv (unrolled init_iq_shmem and 2x block size in mul_mat_vec).

After this PR the performance of IQ2/IQ3 seems in line with comparable K-quants (model size × t/s is similar).
Note that the kernels for IQ1 quants are included in #11528

Performance before all optimizations
(both Mesa compilers for the AMD target are shown: ACO and LLVM)
(llama-bench output is annotated with the estimated bandwidth, model size × t/s)
(Qwen IQ1 model files are from https://huggingface.co/legraphista/Qwen2.5-Coder-7B-Instruct-IMat-GGUF)
(model files from bartowski/Mistral-Small-24B-Instruct-2501-GGUF are reported under the wrong name "llama 13B")

Backend 1/2: Vulkan0
  Device description: AMD Radeon 780M (RADV GFX1103_R1)
  Device memory: 17066 MB (17066 MB free)

  MUL_MAT(type_a=f32,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):      41.57 GFLOPS
  MUL_MAT(type_a=f16,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):      75.72 GFLOPS

  MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):    450.75 GFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):    349.44 GFLOPS
  MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):    274.34 GFLOPS

  MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):   344.50 GFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):   288.32 GFLOPS

  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 345.72 GFLOPS
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):  325.93 GFLOPS
  MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):   262.45 GFLOPS

  MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 358.35 GFLOPS
  MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):   310.26 GFLOPS

  MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):  274.33 GFLOPS
  MUL_MAT(type_a=iq4_xs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):  265.44 GFLOPS

ggml_vulkan: 0 = AMD Radeon 780M (RADV GFX1103_R1) (radv) | uma: 1 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| qwen2 7B IQ1_S - 1.5625 bpw    |   1.77 GiB |     7.62 B | Vulkan     |  99 |         pp512 |        238.80 ± 4.48 |
| qwen2 7B IQ1_S - 1.5625 bpw    |   1.77 GiB |     7.62 B | Vulkan     |  99 |         tg128 |         17.82 ± 0.38 |
| qwen2 7B IQ1_M - 1.75 bpw      |   1.90 GiB |     7.62 B | Vulkan     |  99 |         pp512 |        233.74 ± 0.83 |
| qwen2 7B IQ1_M - 1.75 bpw      |   1.90 GiB |     7.62 B | Vulkan     |  99 |         tg128 |         16.20 ± 0.03 |
| llama 13B IQ2_S - 2.5 bpw      |   6.96 GiB |    23.57 B | Vulkan     |  99 |         pp512 |         59.33 ± 0.02 |
| llama 13B IQ2_S - 2.5 bpw      |   6.96 GiB |    23.57 B | Vulkan     |  99 |         tg128 |          6.43 ± 0.07 |
| llama 13B IQ2_M - 2.7 bpw      |   7.55 GiB |    23.57 B | Vulkan     |  99 |         pp512 |         59.93 ± 0.35 |
| llama 13B IQ2_M - 2.7 bpw      |   7.55 GiB |    23.57 B | Vulkan     |  99 |         tg128 |          4.66 ± 0.02 |
| llama 13B IQ3_XS - 3.3 bpw     |   9.22 GiB |    23.57 B | Vulkan     |  99 |         pp512 |         55.63 ± 0.22 |
| llama 13B IQ3_XS - 3.3 bpw     |   9.22 GiB |    23.57 B | Vulkan     |  99 |         tg128 |          6.64 ± 0.10 |
| llama 13B Q2_K - Medium        |   8.88 GiB |    23.57 B | Vulkan     |  99 |         pp512 |         56.05 ± 0.23 |
| llama 13B Q2_K - Medium        |   8.88 GiB |    23.57 B | Vulkan     |  99 |         tg128 |          8.28 ± 0.06 | 73.5 GiB/s
| llama 13B Q3_K - Large         |  11.54 GiB |    23.57 B | Vulkan     |  99 |         pp512 |         47.16 ± 0.02 |
| llama 13B Q3_K - Large         |  11.54 GiB |    23.57 B | Vulkan     |  99 |         tg128 |          6.20 ± 0.03 | 71.5 GiB/s
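
The GiB/s annotations on the tg128 rows above can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming every generated token reads the whole model from memory exactly once; the function name is hypothetical and the inputs are copied from the table.

```python
def estimated_bandwidth_gib_s(model_size_gib, tokens_per_s):
    """Rough memory bandwidth estimate: model size (GiB) x tokens per second.

    Assumption: each generated token streams all weights once, so the
    product approximates the bytes read per second.
    """
    return model_size_gib * tokens_per_s


# Sizes and tg128 rates copied from the ACO table above.
q2_k = estimated_bandwidth_gib_s(8.88, 8.28)
q3_k = estimated_bandwidth_gib_s(11.54, 6.20)
iq2_s = estimated_bandwidth_gib_s(6.96, 6.43)

print(f"Q2_K : {q2_k:.1f} GiB/s")
print(f"Q3_K : {q3_k:.1f} GiB/s")
print(f"IQ2_S: {iq2_s:.1f} GiB/s")
```

Under that assumption the K-quants land a little above 70 GiB/s, while IQ2_S before the optimizations sits well below them.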

ggml_vulkan: 0 = AMD Radeon 780M (RADV GFX1103_R1 (LLVM 19.1.7)) (radv) | uma: 1 | fp16: 1 | warp size: 64 | matrix cores: none
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| qwen2 7B IQ1_S - 1.5625 bpw    |   1.77 GiB |     7.62 B | Vulkan     |  99 |         pp512 |        133.73 ± 1.47 |
| qwen2 7B IQ1_S - 1.5625 bpw    |   1.77 GiB |     7.62 B | Vulkan     |  99 |         tg128 |         12.92 ± 0.00 |
| qwen2 7B IQ1_M - 1.75 bpw      |   1.90 GiB |     7.62 B | Vulkan     |  99 |         pp512 |        128.73 ± 2.73 |
| qwen2 7B IQ1_M - 1.75 bpw      |   1.90 GiB |     7.62 B | Vulkan     |  99 |         tg128 |         11.15 ± 0.02 |
| llama 13B IQ2_S - 2.5 bpw      |   6.96 GiB |    23.57 B | Vulkan     |  99 |         pp512 |         40.82 ± 0.06 |
| llama 13B IQ2_S - 2.5 bpw      |   6.96 GiB |    23.57 B | Vulkan     |  99 |         tg128 |          3.49 ± 0.00 |
| llama 13B IQ2_M - 2.7 bpw      |   7.55 GiB |    23.57 B | Vulkan     |  99 |         pp512 |         35.25 ± 0.19 |
| llama 13B IQ2_M - 2.7 bpw      |   7.55 GiB |    23.57 B | Vulkan     |  99 |         tg128 |          2.00 ± 0.01 |
| llama 13B IQ3_XS - 3.3 bpw     |   9.22 GiB |    23.57 B | Vulkan     |  99 |         pp512 |         38.51 ± 0.02 |
| llama 13B IQ3_XS - 3.3 bpw     |   9.22 GiB |    23.57 B | Vulkan     |  99 |         tg128 |          4.03 ± 0.00 |
| llama 13B Q2_K - Medium        |   8.88 GiB |    23.57 B | Vulkan     |  99 |         pp512 |         30.34 ± 0.03 |
| llama 13B Q2_K - Medium        |   8.88 GiB |    23.57 B | Vulkan     |  99 |         tg128 |          6.08 ± 0.00 |
| llama 13B Q3_K - Large         |  11.54 GiB |    23.57 B | Vulkan     |  99 |         pp512 |         27.12 ± 0.01 |
| llama 13B Q3_K - Large         |  11.54 GiB |    23.57 B | Vulkan     |  99 |         tg128 |          4.56 ± 0.00 |

Performance after:

  MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):   707.53 GFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):   639.12 GFLOPS

  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 524.20 GFLOPS
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):  507.47 GFLOPS
  MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):   458.70 GFLOPS

  MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 375.33 GFLOPS
  MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):   337.94 GFLOPS

  MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):  257.80 GFLOPS
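
To relate the MUL_MAT figures to the bandwidth estimates, the GFLOPS numbers can be converted into an approximate weight-read rate. This is only a sketch under stated assumptions: the benchmark is assumed to count 2 floating-point operations (multiply and add) per weight per column of the right-hand matrix, reads of the f32 vector and writes of the result are ignored, and the 1.5625 bits per weight for IQ1_S is taken from the llama-bench table. The function name is hypothetical.

```python
def weight_bandwidth_gb_s(gflops, bits_per_weight, n=1):
    """Approximate rate at which quantized weights are read, in GB/s (10^9 bytes).

    Assumption: 2 FLOPs per weight per column of the f32 operand, so
    weights/s = FLOPS / (2 * n).
    """
    weights_per_s = gflops * 1e9 / (2 * n)
    bytes_per_s = weights_per_s * bits_per_weight / 8
    return bytes_per_s / 1e9


# iq1_s, n=1: GFLOPS before and after, copied from the listings above.
before = weight_bandwidth_gb_s(344.50, 1.5625)
after = weight_bandwidth_gb_s(707.53, 1.5625)

print(f"iq1_s before: {before:.1f} GB/s")
print(f"iq1_s after : {after:.1f} GB/s")
```

With these assumptions the iq1_s kernel goes from roughly 34 GB/s to roughly 69 GB/s, which is the same order as the llama-bench estimates for the IQ1 models.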

legraphista/Qwen2.5-Coder-7B-Instruct-IMat-GGUF
bartowski/Mistral-Small-24B-Instruct-2501-GGUF

ggml_vulkan: 0 = AMD Radeon 780M (RADV GFX1103_R1) (radv) | uma: 1 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| qwen2 7B IQ1_S - 1.5625 bpw    |   1.77 GiB |     7.62 B | Vulkan     |  99 |         pp512 |        248.47 ± 0.47 |
| qwen2 7B IQ1_S - 1.5625 bpw    |   1.77 GiB |     7.62 B | Vulkan     |  99 |         tg128 |         34.39 ± 0.12 | 60.9 GiB/s
| qwen2 7B IQ1_M - 1.75 bpw      |   1.90 GiB |     7.62 B | Vulkan     |  99 |         pp512 |        228.57 ± 6.27 |
| qwen2 7B IQ1_M - 1.75 bpw      |   1.90 GiB |     7.62 B | Vulkan     |  99 |         tg128 |         32.25 ± 0.22 | 61.3 GiB/s
| llama 13B IQ2_S - 2.5 bpw      |   6.96 GiB |    23.57 B | Vulkan     |  99 |         pp512 |         62.63 ± 0.06 |
| llama 13B IQ2_S - 2.5 bpw      |   6.96 GiB |    23.57 B | Vulkan     |  99 |         tg128 |         10.06 ± 0.01 | 70.0 GiB/s
| llama 13B IQ2_M - 2.7 bpw      |   7.55 GiB |    23.57 B | Vulkan     |  99 |         pp512 |         55.94 ± 0.29 |
| llama 13B IQ2_M - 2.7 bpw      |   7.55 GiB |    23.57 B | Vulkan     |  99 |         tg128 |          8.75 ± 0.18 | 66.1 GiB/s
| llama 13B IQ3_XS - 3.3 bpw     |   9.22 GiB |    23.57 B | Vulkan     |  99 |         pp512 |         57.35 ± 0.05 |
| llama 13B IQ3_XS - 3.3 bpw     |   9.22 GiB |    23.57 B | Vulkan     |  99 |         tg128 |          7.61 ± 0.00 | 70.2 GiB/s

ggml_vulkan: 0 = AMD Radeon 780M (RADV GFX1103_R1 (LLVM 19.1.7)) (radv) | uma: 1 | fp16: 1 | warp size: 64 | matrix cores: none
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| qwen2 7B IQ1_S - 1.5625 bpw    |   1.77 GiB |     7.62 B | Vulkan     |  99 |         pp512 |        135.52 ± 0.62 |
| qwen2 7B IQ1_S - 1.5625 bpw    |   1.77 GiB |     7.62 B | Vulkan     |  99 |         tg128 |         31.07 ± 0.53 |
| qwen2 7B IQ1_M - 1.75 bpw      |   1.90 GiB |     7.62 B | Vulkan     |  99 |         pp512 |        122.89 ± 0.04 |
| qwen2 7B IQ1_M - 1.75 bpw      |   1.90 GiB |     7.62 B | Vulkan     |  99 |         tg128 |         28.14 ± 0.07 |
| llama 13B IQ2_S - 2.5 bpw      |   6.96 GiB |    23.57 B | Vulkan     |  99 |         pp512 |         40.84 ± 0.06 |
| llama 13B IQ2_S - 2.5 bpw      |   6.96 GiB |    23.57 B | Vulkan     |  99 |         tg128 |          6.37 ± 0.01 |
| llama 13B IQ2_M - 2.7 bpw      |   7.55 GiB |    23.57 B | Vulkan     |  99 |         pp512 |         35.53 ± 0.02 |
| llama 13B IQ2_M - 2.7 bpw      |   7.55 GiB |    23.57 B | Vulkan     |  99 |         tg128 |          4.64 ± 0.00 |
| llama 13B IQ3_XS - 3.3 bpw     |   9.22 GiB |    23.57 B | Vulkan     |  99 |         pp512 |         39.29 ± 0.04 |
| llama 13B IQ3_XS - 3.3 bpw     |   9.22 GiB |    23.57 B | Vulkan     |  99 |         tg128 |          6.22 ± 0.00 |

@github-actions github-actions bot added Vulkan Issues specific to the Vulkan backend devops improvements to build systems and github actions ggml changes relating to the ggml tensor library for machine learning labels Feb 2, 2025
@remyoudompheng
Contributor Author

llvmpipe seems to have issues with the init_iq_shmem of iq2_xxs

  GET_ROWS(type=iq2_xxs,n=256,m=5,r=4,b=7,v=1): OK
  GET_ROWS(type=iq2_xs,n=256,m=5,r=4,b=1,v=0): AddressSanitizer: CHECK failed: asan_allocator.cpp:190 "((old)) == ((kAllocBegMagic))" (0x2b2b2b1908081908, 0xcc6e96b9cc6e96b9) (tid=2409713)
    #0 0x56059d6dac9b in __asan::CheckUnwind() asan_rtl.cpp.o
    #1 0x56059d6fac00 in __sanitizer::CheckFailed(char const*, int, char const*, unsigned long long, unsigned long long) (llama.cpp/build/bin/test-backend-ops+0x15cc00) (BuildId: b8c3518bde2946e83d4f9b8f4732cf76ed58a79a)

adding a bounds check makes it happy

shared uvec2 iq2xxs_grid[256];

void init_iq_shmem(uvec3 wgsize)
{
    // copy the table into shared memory and sync
    [[unroll]] for (uint i = 0; i < iq2xxs_grid.length(); i += wgsize.x) {
        // bounds check: skip invocations whose index falls past the end of the table
        if (i + gl_LocalInvocationIndex.x < iq2xxs_grid.length()) {
            iq2xxs_grid[i + gl_LocalInvocationIndex.x] = iq2xxs_grid_const[i + gl_LocalInvocationIndex.x];
        }
    }
    barrier();
}
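
The effect of the bounds check can be seen by replaying the copy loop on the CPU. The sketch below is a plain Python model of the shader loop, not the shader itself; the workgroup size of 512 is an illustrative assumption (the thread only says the workgroup can be larger than the table), and the function names are hypothetical.

```python
def written_indices(table_len, wgsize, bounds_check):
    """Indices written by all invocations of one workgroup in the strided copy."""
    indices = []
    for local_id in range(wgsize):              # one pass per invocation
        for i in range(0, table_len, wgsize):   # for (i = 0; i < length; i += wgsize.x)
            idx = i + local_id
            if bounds_check and idx >= table_len:
                continue                        # the added guard
            indices.append(idx)
    return indices


def out_of_bounds(table_len, wgsize, bounds_check):
    """Subset of written indices that fall outside the table."""
    return [i for i in written_indices(table_len, wgsize, bounds_check) if i >= table_len]


# 256-entry table (as for iq2xxs_grid) with a hypothetical 512-wide workgroup.
print(len(out_of_bounds(256, 512, bounds_check=False)))  # writes past the end
print(len(out_of_bounds(256, 512, bounds_check=True)))   # none
print(len(out_of_bounds(256, 64, bounds_check=False)))   # none: 64 divides 256
```

With the guard, every table entry is still written exactly once, so the copy stays complete while the stray writes disappear.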

@jeffbolznv
Collaborator

> llvmpipe seems to have issues with the init_iq_shmem of iq2_xxs
>
> adding a bounds check makes it happy

I didn't realize we were using such large workgroup sizes with these init functions for getrows. Maybe the branch condition should begin with something like ((length % wgsize.x) != 0) && so that it's optimized away in the mul mat shaders.
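
The reasoning behind that suggestion can be checked numerically: when the workgroup size divides the table length, the unguarded loop never indexes past the end, so the guard is dead code that a compiler can fold away. The sketch below assumes both values are known when the shader is compiled, which the comment implies but does not state; the helper names are hypothetical.

```python
def needs_guard(table_len, wgsize):
    """The proposed prefix of the branch condition: (length % wgsize.x) != 0."""
    return table_len % wgsize != 0


def max_index_without_guard(table_len, wgsize):
    """Largest index the unguarded strided copy loop would touch."""
    last_start = ((table_len - 1) // wgsize) * wgsize  # last i with i < table_len
    return last_start + wgsize - 1                     # highest local index in that pass


table_len = 256
for wgsize in (32, 64, 96, 256, 512):
    overflow = max_index_without_guard(table_len, wgsize) >= table_len
    print(wgsize, needs_guard(table_len, wgsize), overflow)
```

For every size tried, the divisibility test is true exactly when the unguarded loop would overflow, so keeping the runtime comparison only in that case loses nothing.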

@netrunnereve
Collaborator

netrunnereve commented Feb 3, 2025

> llvmpipe seems to have issues with the init_iq_shmem of iq2_xxs

That's why I love the llvmpipe test as it finds all those issues which get ignored by regular GPUs or traditional subgroup sizes.

BTW have you noticed an improvement on your end with bitfieldExtract? I've tried it in the past but ended up not bothering with it, as the compiler was always smart enough to use the bfe hardware instruction instead of a shift and an AND. At the same time I've also seen it mess up the ternary operator and sometimes insert real branches, which is why I got rid of all of them in #11081. Compilers are weird.

@remyoudompheng
Contributor Author

In that case I believe the issue also appears on actual GPUs, but it is probably hidden by hardware bounds checking, which llvmpipe does not have.
I don't think bitfieldExtract is necessary here, but as a matter of personal taste it feels a bit clearer than shifts and masks (it avoids too many parentheses). Here the ternary operator pattern is simple enough to compile to 2 instructions (test bit, then v_cndmask mask, -x, x) on AMD.
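
For readers unfamiliar with the two patterns being discussed, here is a small Python model of them. It is a sketch of the semantics only: bitfield_extract mirrors what GLSL bitfieldExtract returns for an unsigned value, and apply_sign is a hypothetical helper for the "negate when a sign bit is set" ternary; neither is code from the PR.

```python
def bitfield_extract(value, offset, bits):
    """Unsigned field extraction: `bits` bits of `value` starting at bit `offset`."""
    return (value >> offset) & ((1 << bits) - 1)


def apply_sign(x, sign_bits, k):
    """Ternary sign selection: return -x when bit k of sign_bits is set, else x."""
    return -x if bitfield_extract(sign_bits, k, 1) != 0 else x


packed = 0b10110100
# The helper and the hand-written shift-and-mask give the same field.
print(bitfield_extract(packed, 2, 3), (packed >> 2) & 0x7)

# Bit 2 of the mask is set, bit 1 is not.
print(apply_sign(2.0, 0b0100, 2), apply_sign(2.0, 0b0100, 1))
```

Both spellings describe the same operation, which is why the choice between them is mostly a matter of readability.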

@remyoudompheng remyoudompheng marked this pull request as ready for review February 15, 2025 18:04
@remyoudompheng
Contributor Author

It should be OK now, llvmpipe seems happy.

@jeffbolznv
Collaborator

I've verified all tests are passing with the latest commit on RTX 4070.

@netrunnereve
Collaborator

Tests are passing and I'm seeing good performance improvements on my end.

@alexjp

alexjp commented Feb 18, 2025

I am sorry if this is off-topic, but since I noticed it while using this pull request's branch, may I ask whether this is normal:

I did a simple benchmark, using a 3B llm model, and these were the results:

IQ4_NL = 60.4 t/s
IQ4_XS = 36.4 t/s
IQ3_M = 57.9 t/s
IQ3_S = 57.4 t/s
IQ3_XS = 60.4 t/s
IQ2_M = 60.0 t/s

Is it expected for IQ4_XS to be so slow?

@jeffbolznv
Collaborator

@0cc4m any concerns with merging this?

@0cc4m
Collaborator

0cc4m commented Feb 25, 2025

Sorry about the delay on my side. I tested it and found that it's mostly fine, but I see a significant performance drop on iq3_s and iq4_xs for batches > 1 in the MMV shaders. For matrix-matrix multiplication (batches > 8) I only see a difference with coopmat and coopmat2, not without them.

Nvidia RTX 3090

Coopmat2:

ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 3090 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | matrix cores: NV_coopmat2
Testing 2 devices

Backend 1/2: Vulkan0
  Device description: NVIDIA GeForce RTX 3090
  Device memory: 24576 MB (24576 MB free)


MUL_MAT(type_a=f32,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      449.63 GFLOPS      450.80 GFLOPS        1.17 GFLOPS
MUL_MAT(type_a=f16,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      874.27 GFLOPS      877.04 GFLOPS        2.77 GFLOPS
  MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): not supported
MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       2.36 TFLOPS        2.39 TFLOPS        0.03 TFLOPS
MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       2.14 TFLOPS        2.15 TFLOPS        0.01 TFLOPS
MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.61 TFLOPS        1.63 TFLOPS        0.02 TFLOPS
MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.55 TFLOPS        1.58 TFLOPS        0.03 TFLOPS
MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.43 TFLOPS        1.43 TFLOPS        0.00 TFLOPS
MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.51 TFLOPS        1.52 TFLOPS        0.01 TFLOPS
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.36 TFLOPS        1.38 TFLOPS        0.02 TFLOPS
MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.96 TFLOPS        1.98 TFLOPS        0.02 TFLOPS
MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.76 TFLOPS        1.78 TFLOPS        0.02 TFLOPS
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.58 TFLOPS        1.59 TFLOPS        0.01 TFLOPS
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    1.25 TFLOPS        1.53 TFLOPS        0.28 TFLOPS
MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     1.18 TFLOPS        1.60 TFLOPS        0.42 TFLOPS
MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      0.85 TFLOPS        1.25 TFLOPS        0.40 TFLOPS
MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    1.26 TFLOPS        1.77 TFLOPS        0.51 TFLOPS
MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      0.96 TFLOPS        1.24 TFLOPS        0.28 TFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      0.95 TFLOPS        1.27 TFLOPS        0.32 TFLOPS
MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     2.00 TFLOPS        2.11 TFLOPS        0.11 TFLOPS
MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      0.99 TFLOPS        1.57 TFLOPS        0.58 TFLOPS
MUL_MAT(type_a=iq4_xs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     1.20 TFLOPS        1.17 TFLOPS       -0.03 TFLOPS
MUL_MAT(type_a=f32,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      894.50 GFLOPS      899.29 GFLOPS        4.79 GFLOPS
MUL_MAT(type_a=f16,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                        1.73 TFLOPS        1.73 TFLOPS        0.00 TFLOPS
  MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): not supported
MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       3.80 TFLOPS        3.82 TFLOPS        0.02 TFLOPS
MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       2.96 TFLOPS        3.01 TFLOPS        0.05 TFLOPS
MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       2.78 TFLOPS        2.78 TFLOPS        0.00 TFLOPS
MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       2.42 TFLOPS        2.41 TFLOPS       -0.01 TFLOPS
MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       2.39 TFLOPS        2.39 TFLOPS        0.00 TFLOPS
MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       2.48 TFLOPS        2.48 TFLOPS        0.00 TFLOPS
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       2.23 TFLOPS        2.24 TFLOPS        0.01 TFLOPS
MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       3.27 TFLOPS        3.30 TFLOPS        0.03 TFLOPS
MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       2.96 TFLOPS        2.97 TFLOPS        0.01 TFLOPS
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       2.68 TFLOPS        2.68 TFLOPS        0.00 TFLOPS
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    2.76 TFLOPS        2.96 TFLOPS        0.20 TFLOPS
MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     1.99 TFLOPS        2.46 TFLOPS       -0.03 TFLOPS
MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       4.02 TFLOPS        4.02 TFLOPS        0.00 TFLOPS
MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       3.62 TFLOPS        3.63 TFLOPS        0.01 TFLOPS
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       3.47 TFLOPS        3.52 TFLOPS        0.05 TFLOPS
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    3.51 TFLOPS        3.58 TFLOPS        0.07 TFLOPS
MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     1.95 TFLOPS        3.36 TFLOPS        1.41 TFLOPS
MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      2.42 TFLOPS        3.07 TFLOPS        0.65 TFLOPS
MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    2.80 TFLOPS        3.26 TFLOPS        0.46 TFLOPS
MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      1.80 TFLOPS        2.49 TFLOPS        0.69 TFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      1.18 TFLOPS        1.97 TFLOPS        0.79 TFLOPS
MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     3.94 TFLOPS        4.65 TFLOPS        0.71 TFLOPS
MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      2.85 TFLOPS        2.18 TFLOPS       -0.67 TFLOPS
MUL_MAT(type_a=iq4_xs,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     2.89 TFLOPS        2.52 TFLOPS       -0.37 TFLOPS
MUL_MAT(type_a=f32,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                        1.78 TFLOPS        1.78 TFLOPS        0.00 TFLOPS
MUL_MAT(type_a=f16,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                        3.17 TFLOPS        3.16 TFLOPS       -0.01 TFLOPS
  MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): not supported
MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       4.76 TFLOPS        4.83 TFLOPS        0.07 TFLOPS
MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       3.92 TFLOPS        3.92 TFLOPS        0.00 TFLOPS
MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       4.03 TFLOPS        4.06 TFLOPS        0.03 TFLOPS
MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       3.33 TFLOPS        3.32 TFLOPS       -0.01 TFLOPS
MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       2.98 TFLOPS        2.99 TFLOPS        0.01 TFLOPS
MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       3.07 TFLOPS        3.06 TFLOPS       -0.01 TFLOPS
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       2.24 TFLOPS        2.27 TFLOPS        0.03 TFLOPS
MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       4.45 TFLOPS        4.42 TFLOPS       -0.03 TFLOPS
MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       4.16 TFLOPS        4.15 TFLOPS       -0.01 TFLOPS
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       3.87 TFLOPS        3.86 TFLOPS       -0.01 TFLOPS
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    4.02 TFLOPS        4.00 TFLOPS       -0.02 TFLOPS
MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     1.72 TFLOPS        3.04 TFLOPS        1.32 TFLOPS
MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      2.38 TFLOPS        3.75 TFLOPS        1.37 TFLOPS
MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    3.24 TFLOPS        3.46 TFLOPS        0.22 TFLOPS
MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      1.41 TFLOPS        2.35 TFLOPS        0.94 TFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      1.17 TFLOPS        1.82 TFLOPS        0.65 TFLOPS
MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     4.42 TFLOPS        5.42 TFLOPS        1.00 TFLOPS
MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      3.42 TFLOPS        2.49 TFLOPS       -0.93 TFLOPS
MUL_MAT(type_a=iq4_xs,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     3.46 TFLOPS        2.64 TFLOPS       -0.82 TFLOPS
MUL_MAT(type_a=f32,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                        2.21 TFLOPS        2.21 TFLOPS        0.00 TFLOPS
MUL_MAT(type_a=f16,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                        3.70 TFLOPS        3.72 TFLOPS        0.02 TFLOPS
  MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): not supported
MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       5.19 TFLOPS        5.21 TFLOPS        0.02 TFLOPS
MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       4.10 TFLOPS        4.09 TFLOPS       -0.01 TFLOPS
MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       4.61 TFLOPS        4.59 TFLOPS       -0.02 TFLOPS
MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       3.66 TFLOPS        3.68 TFLOPS        0.02 TFLOPS
MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       3.10 TFLOPS        3.11 TFLOPS        0.01 TFLOPS
MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       3.15 TFLOPS        3.17 TFLOPS        0.02 TFLOPS
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       2.26 TFLOPS        2.27 TFLOPS        0.01 TFLOPS
MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       4.85 TFLOPS        4.85 TFLOPS        0.00 TFLOPS
MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       4.27 TFLOPS        4.28 TFLOPS        0.01 TFLOPS
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       4.33 TFLOPS        4.34 TFLOPS        0.01 TFLOPS
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    4.41 TFLOPS        5.05 TFLOPS        0.64 TFLOPS
MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     1.75 TFLOPS        2.83 TFLOPS        1.08 TFLOPS
MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      2.26 TFLOPS        3.45 TFLOPS        1.19 TFLOPS
MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    3.63 TFLOPS        3.37 TFLOPS       -0.26 TFLOPS
MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      1.49 TFLOPS        3.06 TFLOPS        1.57 TFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      1.19 TFLOPS        1.77 TFLOPS        0.58 TFLOPS
MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     4.72 TFLOPS        5.95 TFLOPS        1.23 TFLOPS
MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      3.85 TFLOPS        2.85 TFLOPS       -1.00 TFLOPS
MUL_MAT(type_a=iq4_xs,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     3.91 TFLOPS        3.14 TFLOPS       -0.77 TFLOPS
MUL_MAT(type_a=f32,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                        3.25 TFLOPS�[0m        3.22 TFLOPS�[0m       -0.03 TFLOPS�[0m
MUL_MAT(type_a=f16,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                        3.98 TFLOPS�[0m        3.95 TFLOPS�[0m       -0.03 TFLOPS�[0m
  MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): not supported
MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       5.67 TFLOPS        5.68 TFLOPS        0.01 TFLOPS
MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       4.37 TFLOPS        4.39 TFLOPS        0.02 TFLOPS
MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       5.12 TFLOPS        5.16 TFLOPS        0.04 TFLOPS
MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       4.02 TFLOPS        4.02 TFLOPS        0.00 TFLOPS
MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       3.38 TFLOPS        3.37 TFLOPS       -0.01 TFLOPS
MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       2.54 TFLOPS        2.54 TFLOPS        0.00 TFLOPS
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       2.58 TFLOPS        2.59 TFLOPS        0.01 TFLOPS
MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       5.16 TFLOPS        5.13 TFLOPS       -0.03 TFLOPS
MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       4.25 TFLOPS        4.28 TFLOPS        0.03 TFLOPS
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       4.17 TFLOPS        4.14 TFLOPS       -0.03 TFLOPS
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    3.32 TFLOPS        0.78 TFLOPS       -2.54 TFLOPS
MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     2.01 TFLOPS        0.77 TFLOPS       -1.24 TFLOPS
MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      2.59 TFLOPS        1.00 TFLOPS       -1.59 TFLOPS
MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    3.99 TFLOPS        4.20 TFLOPS        0.21 TFLOPS
MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      1.60 TFLOPS        0.65 TFLOPS       -0.95 TFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      1.20 TFLOPS        0.50 TFLOPS       -0.70 TFLOPS
MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     5.25 TFLOPS        6.60 TFLOPS        1.35 TFLOPS
MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      2.23 TFLOPS        3.19 TFLOPS        0.96 TFLOPS
MUL_MAT(type_a=iq4_xs,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     4.56 TFLOPS        3.63 TFLOPS       -0.93 TFLOPS
MUL_MAT(type_a=f32,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     48.26 TFLOPS       48.67 TFLOPS        0.41 TFLOPS
MUL_MAT(type_a=f16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     70.95 TFLOPS       72.02 TFLOPS        1.07 TFLOPS
  MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): not supported
MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    69.38 TFLOPS       70.73 TFLOPS        1.35 TFLOPS
MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    61.97 TFLOPS       64.49 TFLOPS        2.52 TFLOPS
MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    55.80 TFLOPS       56.48 TFLOPS        0.68 TFLOPS
MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    57.35 TFLOPS       58.36 TFLOPS        1.01 TFLOPS
MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    68.81 TFLOPS       70.56 TFLOPS        1.75 TFLOPS
MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    46.61 TFLOPS       47.14 TFLOPS        0.53 TFLOPS
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    44.03 TFLOPS       44.66 TFLOPS        0.63 TFLOPS
MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    45.67 TFLOPS       46.20 TFLOPS        0.53 TFLOPS
MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    44.74 TFLOPS       45.22 TFLOPS        0.48 TFLOPS
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    42.16 TFLOPS       43.03 TFLOPS        0.87 TFLOPS
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                 40.11 TFLOPS       39.39 TFLOPS       -0.72 TFLOPS
MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  38.35 TFLOPS       38.06 TFLOPS       -0.29 TFLOPS
MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   48.14 TFLOPS       35.47 TFLOPS      -12.67 TFLOPS
MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                 49.98 TFLOPS       34.91 TFLOPS      -15.07 TFLOPS
MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   27.98 TFLOPS       27.05 TFLOPS       -0.93 TFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   21.74 TFLOPS       23.35 TFLOPS        1.61 TFLOPS
MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  55.30 TFLOPS       55.54 TFLOPS        0.24 TFLOPS
MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   39.43 TFLOPS       38.55 TFLOPS       -0.88 TFLOPS
MUL_MAT(type_a=iq4_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  42.33 TFLOPS       51.67 TFLOPS        9.34 TFLOPS

Coopmat1:

ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 3090 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | matrix cores: KHR_coopmat

[...]

MUL_MAT(type_a=f32,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     23.13 TFLOPS       23.08 TFLOPS       -0.05 TFLOPS
MUL_MAT(type_a=f16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     23.17 TFLOPS       23.17 TFLOPS        0.00 TFLOPS
  MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): not supported
MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    45.46 TFLOPS       45.32 TFLOPS       -0.14 TFLOPS
MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    44.11 TFLOPS       43.94 TFLOPS       -0.17 TFLOPS
MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    41.47 TFLOPS       41.43 TFLOPS       -0.04 TFLOPS
MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    42.07 TFLOPS       41.49 TFLOPS       -0.58 TFLOPS
MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    39.87 TFLOPS       40.73 TFLOPS        0.86 TFLOPS
MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    41.76 TFLOPS       41.85 TFLOPS        0.09 TFLOPS
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    33.02 TFLOPS       32.89 TFLOPS       -0.13 TFLOPS
MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    34.01 TFLOPS       33.93 TFLOPS       -0.08 TFLOPS
MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    27.52 TFLOPS       27.58 TFLOPS        0.06 TFLOPS
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    33.82 TFLOPS       33.82 TFLOPS        0.00 TFLOPS
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                 25.89 TFLOPS       25.97 TFLOPS        0.08 TFLOPS
MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  26.80 TFLOPS       26.18 TFLOPS       -0.62 TFLOPS
MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   26.25 TFLOPS       26.93 TFLOPS        0.68 TFLOPS
MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                 24.65 TFLOPS       24.01 TFLOPS       -0.64 TFLOPS
MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   37.03 TFLOPS       37.04 TFLOPS        0.01 TFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   22.91 TFLOPS       21.38 TFLOPS       -1.53 TFLOPS
MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  35.39 TFLOPS       41.14 TFLOPS        5.75 TFLOPS
MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   28.71 TFLOPS       26.64 TFLOPS       -2.07 TFLOPS
MUL_MAT(type_a=iq4_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  27.04 TFLOPS       34.31 TFLOPS        7.27 TFLOPS
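
In the rows above the third column is the second minus the first, so relative changes are easier to compare once everything is normalized to one unit. Here is a minimal sketch of how such rows could be post-processed. It assumes the first column is the baseline run and the second is the run with this PR applied; the helper name `parse_row` and the output field names are made up for illustration and are not part of `test-backend-ops`.

```python
import re

# Matches one perf row: the op description, then the first two throughput columns.
# \S* after "FLOPS" tolerates leftover terminal colour-reset codes in pasted logs.
ROW = re.compile(
    r"MUL_MAT\(type_a=(?P<type_a>\w+),type_b=(?P<type_b>\w+),"
    r"m=(?P<m>\d+),n=(?P<n>\d+),k=(?P<k>\d+),.*?\):\s+"
    r"(?P<before>-?[\d.]+) (?P<bu>[GT])FLOPS\S*\s+"
    r"(?P<after>-?[\d.]+) (?P<au>[GT])FLOPS"
)

# Convert everything to GFLOPS so GFLOPS and TFLOPS rows are comparable.
SCALE = {"G": 1.0, "T": 1000.0}


def parse_row(line):
    """Return a dict for a perf row, or None for 'not supported' and other lines."""
    match = ROW.search(line)
    if match is None:
        return None
    before = float(match["before"]) * SCALE[match["bu"]]
    after = float(match["after"]) * SCALE[match["au"]]
    rel = (after - before) / before if before else float("nan")
    return {
        "type_a": match["type_a"],
        "n": int(match["n"]),
        "before_gflops": before,
        "after_gflops": after,
        "rel_change": rel,
    }


# Sample rows copied from the Radeon Pro VII log in this description.
sample = [
    "MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):"
    "                  808.33 GFLOPS      972.56 GFLOPS      164.23 GFLOPS",
    "MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):"
    "                      2.46 TFLOPS        0.26 TFLOPS       -2.20 TFLOPS",
    "  MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): not supported",
]

for line in sample:
    row = parse_row(line)
    if row is not None:
        print(f"{row['type_a']:8s} n={row['n']:<4d} {row['rel_change']:+.1%}")
```

Sorting the parsed rows by `rel_change` puts the largest regressions (for example the `n=8` IQ rows) first, which is handy when deciding which batch sizes should keep using the previous kernel.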

AMD Radeon Pro VII:

ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon (TM) Pro VII (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | warp size: 64 | shared memory: 65536 | matrix cores: none
Testing 2 devices

Backend 1/2: Vulkan0
  Device description: AMD Radeon (TM) Pro VII (RADV VEGA20)
  Device memory: 16368 MB (16368 MB free)


MUL_MAT(type_a=f32,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      357.12 GFLOPS      357.03 GFLOPS       -0.09 GFLOPS
MUL_MAT(type_a=f16,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      472.81 GFLOPS      472.87 GFLOPS        0.06 GFLOPS
  MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): not supported
MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.22 TFLOPS        1.21 TFLOPS       -0.01 TFLOPS
MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.00 TFLOPS        1.00 TFLOPS       -0.00 TFLOPS
MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     852.59 GFLOPS      850.13 GFLOPS       -2.46 GFLOPS
MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     771.12 GFLOPS      771.19 GFLOPS        0.07 GFLOPS
MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     778.77 GFLOPS      778.75 GFLOPS       -0.02 GFLOPS
MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.33 TFLOPS        1.33 TFLOPS        0.00 TFLOPS
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     983.79 GFLOPS      980.45 GFLOPS       -3.34 GFLOPS
MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.21 TFLOPS        1.21 TFLOPS        0.00 TFLOPS
MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     910.98 GFLOPS      911.61 GFLOPS        0.63 GFLOPS
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     816.04 GFLOPS      815.88 GFLOPS       -0.16 GFLOPS
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    0.78 TFLOPS        1.24 TFLOPS        0.46 TFLOPS
MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     0.81 TFLOPS        1.27 TFLOPS        0.46 TFLOPS
MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      0.70 TFLOPS        1.22 TFLOPS        0.52 TFLOPS
MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  808.33 GFLOPS      972.56 GFLOPS      164.23 GFLOPS
MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      1.33 TFLOPS        1.30 TFLOPS       -0.03 TFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      1.24 TFLOPS        1.23 TFLOPS       -0.01 TFLOPS
MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     1.22 TFLOPS        1.14 TFLOPS       -0.08 TFLOPS
MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      0.73 TFLOPS        1.06 TFLOPS        0.33 TFLOPS
MUL_MAT(type_a=iq4_xs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   500.23 GFLOPS      498.81 GFLOPS       -1.42 GFLOPS
MUL_MAT(type_a=f32,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      656.36 GFLOPS      654.93 GFLOPS       -1.43 GFLOPS
MUL_MAT(type_a=f16,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      760.70 GFLOPS      758.13 GFLOPS       -2.57 GFLOPS
  MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): not supported
MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.94 TFLOPS        1.93 TFLOPS       -0.01 TFLOPS
MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.28 TFLOPS        1.27 TFLOPS       -0.01 TFLOPS
MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.40 TFLOPS        1.40 TFLOPS        0.00 TFLOPS
MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.13 TFLOPS        1.12 TFLOPS       -0.01 TFLOPS
MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.19 TFLOPS        1.19 TFLOPS        0.00 TFLOPS
MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.87 TFLOPS        1.86 TFLOPS       -0.01 TFLOPS
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.50 TFLOPS        1.50 TFLOPS        0.00 TFLOPS
MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.71 TFLOPS        1.70 TFLOPS       -0.01 TFLOPS
MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.43 TFLOPS        1.43 TFLOPS        0.00 TFLOPS
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.41 TFLOPS        1.41 TFLOPS        0.00 TFLOPS
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    1.38 TFLOPS        1.98 TFLOPS        0.60 TFLOPS
MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     1.35 TFLOPS        1.96 TFLOPS        0.61 TFLOPS
MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      1.20 TFLOPS        1.92 TFLOPS        0.72 TFLOPS
MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    1.36 TFLOPS        1.80 TFLOPS        0.44 TFLOPS
MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      1.50 TFLOPS        1.79 TFLOPS        0.29 TFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      1.40 TFLOPS        1.60 TFLOPS        0.20 TFLOPS
MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     1.84 TFLOPS        1.73 TFLOPS       -0.11 TFLOPS
MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      1.30 TFLOPS        1.45 TFLOPS        0.15 TFLOPS
MUL_MAT(type_a=iq4_xs,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   899.97 GFLOPS      827.04 GFLOPS      -72.93 GFLOPS
MUL_MAT(type_a=f32,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      832.13 GFLOPS      826.57 GFLOPS       -5.56 GFLOPS
MUL_MAT(type_a=f16,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      969.55 GFLOPS      965.07 GFLOPS       -4.48 GFLOPS
  MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): not supported
MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       2.35 TFLOPS        2.33 TFLOPS       -0.02 TFLOPS
MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.52 TFLOPS        1.52 TFLOPS        0.00 TFLOPS
MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.82 TFLOPS        1.82 TFLOPS        0.00 TFLOPS
MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.36 TFLOPS        1.35 TFLOPS       -0.01 TFLOPS
MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.33 TFLOPS        1.33 TFLOPS        0.00 TFLOPS
MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       2.18 TFLOPS        2.18 TFLOPS        0.00 TFLOPS
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.74 TFLOPS        1.74 TFLOPS        0.00 TFLOPS
MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       2.08 TFLOPS        2.06 TFLOPS       -0.02 TFLOPS
MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.76 TFLOPS        1.75 TFLOPS       -0.01 TFLOPS
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.86 TFLOPS        1.84 TFLOPS       -0.02 TFLOPS
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    1.80 TFLOPS        2.36 TFLOPS        0.56 TFLOPS
MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     1.75 TFLOPS        2.16 TFLOPS        0.41 TFLOPS
MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      1.53 TFLOPS        2.06 TFLOPS        0.53 TFLOPS
MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    1.81 TFLOPS        2.17 TFLOPS        0.36 TFLOPS
MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      1.29 TFLOPS        1.69 TFLOPS        0.40 TFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      1.15 TFLOPS        1.43 TFLOPS        0.28 TFLOPS
MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     2.29 TFLOPS        2.24 TFLOPS       -0.05 TFLOPS
MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      1.71 TFLOPS        1.43 TFLOPS       -0.28 TFLOPS
MUL_MAT(type_a=iq4_xs,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     1.22 TFLOPS        1.20 TFLOPS       -0.02 TFLOPS
MUL_MAT(type_a=f32,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      945.19 GFLOPS      944.44 GFLOPS       -0.75 GFLOPS
MUL_MAT(type_a=f16,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                        1.14 TFLOPS        1.14 TFLOPS        0.00 TFLOPS
  MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): not supported
MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       2.52 TFLOPS        2.51 TFLOPS       -0.01 TFLOPS
MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.68 TFLOPS        1.68 TFLOPS        0.00 TFLOPS
MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       2.14 TFLOPS        2.14 TFLOPS        0.00 TFLOPS
MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.52 TFLOPS        1.52 TFLOPS        0.00 TFLOPS
MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.30 TFLOPS        1.30 TFLOPS        0.00 TFLOPS
MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       2.13 TFLOPS        2.12 TFLOPS       -0.01 TFLOPS
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.69 TFLOPS        1.68 TFLOPS       -0.01 TFLOPS
MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       2.11 TFLOPS        2.11 TFLOPS        0.00 TFLOPS
MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.74 TFLOPS        1.73 TFLOPS       -0.01 TFLOPS
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.92 TFLOPS        1.91 TFLOPS       -0.01 TFLOPS
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    2.09 TFLOPS        2.36 TFLOPS        0.27 TFLOPS
MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     2.17 TFLOPS        2.49 TFLOPS        0.32 TFLOPS
MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      1.83 TFLOPS        1.89 TFLOPS        0.06 TFLOPS
MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    2.12 TFLOPS        2.18 TFLOPS        0.06 TFLOPS
MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      1.27 TFLOPS        1.70 TFLOPS        0.43 TFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      1.16 TFLOPS        1.49 TFLOPS        0.33 TFLOPS
MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     2.53 TFLOPS        2.37 TFLOPS       -0.16 TFLOPS
MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      1.93 TFLOPS        1.64 TFLOPS       -0.29 TFLOPS
MUL_MAT(type_a=iq4_xs,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     1.48 TFLOPS        1.25 TFLOPS       -0.23 TFLOPS
MUL_MAT(type_a=f32,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                        1.04 TFLOPS        1.04 TFLOPS        0.00 TFLOPS
MUL_MAT(type_a=f16,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                        1.18 TFLOPS        1.18 TFLOPS        0.00 TFLOPS
  MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): not supported
MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       2.66 TFLOPS        2.66 TFLOPS        0.00 TFLOPS
MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.78 TFLOPS        1.77 TFLOPS       -0.01 TFLOPS
MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       2.33 TFLOPS        2.32 TFLOPS       -0.01 TFLOPS
MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.62 TFLOPS        1.60 TFLOPS       -0.02 TFLOPS
MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.33 TFLOPS        1.33 TFLOPS        0.00 TFLOPS
MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       2.06 TFLOPS        2.07 TFLOPS        0.01 TFLOPS
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.60 TFLOPS        1.59 TFLOPS       -0.01 TFLOPS
MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       2.07 TFLOPS        2.06 TFLOPS       -0.01 TFLOPS
MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.82 TFLOPS        1.81 TFLOPS       -0.01 TFLOPS
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       2.07 TFLOPS        2.06 TFLOPS       -0.01 TFLOPS
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    2.28 TFLOPS        1.92 TFLOPS       -0.36 TFLOPS
MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     2.31 TFLOPS        1.97 TFLOPS       -0.34 TFLOPS
MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      1.85 TFLOPS        2.08 TFLOPS        0.23 TFLOPS
MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    2.23 TFLOPS        1.84 TFLOPS       -0.39 TFLOPS
MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      1.26 TFLOPS        1.69 TFLOPS        0.43 TFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      1.19 TFLOPS        1.55 TFLOPS        0.36 TFLOPS
MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     2.47 TFLOPS        2.36 TFLOPS       -0.11 TFLOPS
MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      2.08 TFLOPS        1.61 TFLOPS       -0.47 TFLOPS
MUL_MAT(type_a=iq4_xs,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     1.72 TFLOPS        1.43 TFLOPS       -0.29 TFLOPS
MUL_MAT(type_a=f32,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                        1.33 TFLOPS        1.33 TFLOPS        0.00 TFLOPS
MUL_MAT(type_a=f16,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                        1.43 TFLOPS        1.43 TFLOPS        0.00 TFLOPS
  MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): not supported
MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       2.84 TFLOPS        2.82 TFLOPS       -0.02 TFLOPS
MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.35 TFLOPS        1.36 TFLOPS        0.01 TFLOPS
MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       2.75 TFLOPS        2.75 TFLOPS        0.00 TFLOPS
MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.31 TFLOPS        1.29 TFLOPS       -0.02 TFLOPS
MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.39 TFLOPS        1.39 TFLOPS        0.00 TFLOPS
MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       2.06 TFLOPS        2.06 TFLOPS        0.00 TFLOPS
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.97 TFLOPS        1.97 TFLOPS        0.00 TFLOPS
MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       2.22 TFLOPS        2.21 TFLOPS       -0.01 TFLOPS
MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       2.05 TFLOPS        2.04 TFLOPS       -0.01 TFLOPS
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       2.29 TFLOPS        2.28 TFLOPS       -0.01 TFLOPS
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    2.54 TFLOPS        2.21 TFLOPS       -0.33 TFLOPS
MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     2.38 TFLOPS        2.24 TFLOPS       -0.14 TFLOPS
MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      2.23 TFLOPS        1.61 TFLOPS       -0.62 TFLOPS
MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    2.45 TFLOPS        2.24 TFLOPS       -0.21 TFLOPS
MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    538.47 GFLOPS      309.16 GFLOPS     -229.31 GFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    531.48 GFLOPS      295.32 GFLOPS     -236.16 GFLOPS
MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     2.56 TFLOPS        2.62 TFLOPS        0.06 TFLOPS
MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      2.46 TFLOPS        0.26 TFLOPS       -2.20 TFLOPS
MUL_MAT(type_a=iq4_xs,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     2.06 TFLOPS        1.75 TFLOPS       -0.31 TFLOPS
MUL_MAT(type_a=f32,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      4.93 TFLOPS        4.89 TFLOPS       -0.04 TFLOPS
MUL_MAT(type_a=f16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      4.19 TFLOPS        4.19 TFLOPS        0.00 TFLOPS
  MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): not supported
MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     4.16 TFLOPS        4.15 TFLOPS       -0.01 TFLOPS
MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     4.17 TFLOPS        4.15 TFLOPS       -0.02 TFLOPS
MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     3.96 TFLOPS        3.94 TFLOPS       -0.02 TFLOPS
MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     4.02 TFLOPS        4.00 TFLOPS       -0.02 TFLOPS
MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     4.10 TFLOPS        4.10 TFLOPS        0.00 TFLOPS
MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     3.92 TFLOPS        3.90 TFLOPS       -0.02 TFLOPS
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     3.62 TFLOPS        3.60 TFLOPS       -0.02 TFLOPS
MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     3.60 TFLOPS        3.59 TFLOPS       -0.01 TFLOPS
MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     3.46 TFLOPS        3.43 TFLOPS       -0.03 TFLOPS
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     3.60 TFLOPS        3.60 TFLOPS        0.00 TFLOPS
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  3.68 TFLOPS        3.66 TFLOPS       -0.02 TFLOPS
MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   3.73 TFLOPS        3.73 TFLOPS        0.00 TFLOPS
MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    3.04 TFLOPS        3.04 TFLOPS        0.00 TFLOPS
MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  3.66 TFLOPS        3.66 TFLOPS        0.00 TFLOPS
MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    3.87 TFLOPS        3.84 TFLOPS       -0.03 TFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    3.69 TFLOPS        3.64 TFLOPS       -0.05 TFLOPS
MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   4.17 TFLOPS        4.17 TFLOPS        0.00 TFLOPS
MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    3.71 TFLOPS        3.68 TFLOPS       -0.03 TFLOPS
MUL_MAT(type_a=iq4_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   3.81 TFLOPS        3.78 TFLOPS       -0.03 TFLOPS
Intel A770
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Arc(tm) A770 Graphics (DG2) (Intel open-source Mesa driver) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 65536 | matrix cores: none
Testing 2 devices

Backend 1/2: Vulkan0
  Device description: Intel(R) Arc(tm) A770 Graphics (DG2)
  Device memory: 16032 MB (16032 MB free)


MUL_MAT(type_a=f32,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      189.57 GFLOPS      190.11 GFLOPS        0.54 GFLOPS
MUL_MAT(type_a=f16,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      296.02 GFLOPS      296.71 GFLOPS        0.69 GFLOPS
  MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): not supported
MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     871.57 GFLOPS      874.82 GFLOPS        3.25 GFLOPS
MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     737.74 GFLOPS      743.45 GFLOPS        5.71 GFLOPS
MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     214.73 GFLOPS      213.72 GFLOPS       -1.01 GFLOPS
MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     712.11 GFLOPS      714.34 GFLOPS        2.23 GFLOPS
MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     156.99 GFLOPS      157.66 GFLOPS        0.67 GFLOPS
MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     426.22 GFLOPS      430.62 GFLOPS        4.40 GFLOPS
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     422.48 GFLOPS      422.82 GFLOPS        0.34 GFLOPS
MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     361.73 GFLOPS      362.73 GFLOPS        1.00 GFLOPS
MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     335.37 GFLOPS      341.13 GFLOPS        5.76 GFLOPS
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     170.50 GFLOPS      170.48 GFLOPS       -0.02 GFLOPS
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  417.24 GFLOPS      924.01 GFLOPS      506.77 GFLOPS
MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   320.53 GFLOPS      438.50 GFLOPS      117.97 GFLOPS
MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    131.97 GFLOPS      207.22 GFLOPS       75.25 GFLOPS
MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  218.50 GFLOPS      300.80 GFLOPS       82.30 GFLOPS
MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    663.49 GFLOPS      527.20 GFLOPS     -136.29 GFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    617.97 GFLOPS      943.99 GFLOPS      326.02 GFLOPS
MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   802.10 GFLOPS      798.31 GFLOPS       -3.79 GFLOPS
MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    312.51 GFLOPS      379.32 GFLOPS       66.81 GFLOPS
MUL_MAT(type_a=iq4_xs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   436.02 GFLOPS      583.02 GFLOPS      147.00 GFLOPS
MUL_MAT(type_a=f32,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       21.01 GFLOPS       21.02 GFLOPS        0.01 GFLOPS
MUL_MAT(type_a=f16,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      450.53 GFLOPS      449.38 GFLOPS       -1.15 GFLOPS
  MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): not supported
MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.44 TFLOPS        1.44 TFLOPS        0.00 TFLOPS
MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.27 TFLOPS        1.27 TFLOPS        0.00 TFLOPS
MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     480.08 GFLOPS      480.09 GFLOPS        0.01 GFLOPS
MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.14 TFLOPS        1.13 TFLOPS       -0.01 TFLOPS
MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     255.75 GFLOPS      256.14 GFLOPS        0.39 GFLOPS
MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.27 TFLOPS        1.28 TFLOPS        0.01 TFLOPS
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.14 TFLOPS        1.16 TFLOPS        0.02 TFLOPS
MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.34 TFLOPS        1.42 TFLOPS        0.08 TFLOPS
MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.19 TFLOPS        1.18 TFLOPS       -0.01 TFLOPS
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     355.17 GFLOPS      355.67 GFLOPS        0.50 GFLOPS
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    0.72 TFLOPS        1.38 TFLOPS        0.66 TFLOPS
MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     0.51 TFLOPS        1.01 TFLOPS        0.50 TFLOPS
MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    350.45 GFLOPS      638.57 GFLOPS      288.12 GFLOPS
MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  500.65 GFLOPS      606.06 GFLOPS      105.41 GFLOPS
MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      1.32 TFLOPS        1.54 TFLOPS        0.22 TFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      1.42 TFLOPS        1.48 TFLOPS        0.06 TFLOPS
MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     1.50 TFLOPS        1.49 TFLOPS       -0.01 TFLOPS
MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    620.99 GFLOPS      698.42 GFLOPS       77.43 GFLOPS
MUL_MAT(type_a=iq4_xs,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     1.31 TFLOPS        1.29 TFLOPS       -0.02 TFLOPS
MUL_MAT(type_a=f32,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       31.56 GFLOPS       31.59 GFLOPS        0.03 GFLOPS
MUL_MAT(type_a=f16,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                        1.09 TFLOPS        1.09 TFLOPS        0.00 TFLOPS
  MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): not supported
MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.98 TFLOPS        2.00 TFLOPS        0.02 TFLOPS
MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.72 TFLOPS        1.66 TFLOPS       -0.06 TFLOPS
MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     742.91 GFLOPS      749.39 GFLOPS        6.48 GFLOPS
MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.60 TFLOPS        1.62 TFLOPS        0.02 TFLOPS
MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     540.98 GFLOPS      547.84 GFLOPS        6.86 GFLOPS
MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.62 TFLOPS        1.72 TFLOPS        0.10 TFLOPS
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.32 TFLOPS        1.34 TFLOPS        0.02 TFLOPS
MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.57 TFLOPS        1.68 TFLOPS        0.11 TFLOPS
MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.74 TFLOPS        1.74 TFLOPS        0.00 TFLOPS
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     472.78 GFLOPS      479.26 GFLOPS        6.48 GFLOPS
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    1.13 TFLOPS        1.97 TFLOPS        0.84 TFLOPS
MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     1.06 TFLOPS        1.19 TFLOPS        0.13 TFLOPS
MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    494.73 GFLOPS      786.12 GFLOPS      291.39 GFLOPS
MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  571.33 GFLOPS      725.79 GFLOPS      154.46 GFLOPS
MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      1.52 TFLOPS        1.35 TFLOPS       -0.17 TFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      1.53 TFLOPS        1.21 TFLOPS       -0.32 TFLOPS
MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     2.09 TFLOPS        2.09 TFLOPS        0.00 TFLOPS
MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      0.97 TFLOPS        1.26 TFLOPS        0.29 TFLOPS
MUL_MAT(type_a=iq4_xs,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     1.34 TFLOPS        1.72 TFLOPS        0.38 TFLOPS
MUL_MAT(type_a=f32,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       42.03 GFLOPS       42.02 GFLOPS       -0.01 GFLOPS
MUL_MAT(type_a=f16,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                        1.43 TFLOPS        1.44 TFLOPS        0.01 TFLOPS
  MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): not supported
MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       2.36 TFLOPS        2.39 TFLOPS        0.03 TFLOPS
MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       2.01 TFLOPS        1.99 TFLOPS       -0.02 TFLOPS
MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     894.33 GFLOPS      894.49 GFLOPS        0.16 GFLOPS
MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.96 TFLOPS        1.95 TFLOPS       -0.01 TFLOPS
MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     455.14 GFLOPS      457.33 GFLOPS        2.19 GFLOPS
MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.93 TFLOPS        1.92 TFLOPS       -0.01 TFLOPS
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.24 TFLOPS        1.23 TFLOPS       -0.01 TFLOPS
MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.93 TFLOPS        2.05 TFLOPS        0.12 TFLOPS
MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.61 TFLOPS        1.61 TFLOPS        0.00 TFLOPS
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     614.91 GFLOPS      620.23 GFLOPS        5.32 GFLOPS
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    1.51 TFLOPS        2.34 TFLOPS        0.83 TFLOPS
MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     1.25 TFLOPS        1.40 TFLOPS        0.15 TFLOPS
MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    614.23 GFLOPS      921.90 GFLOPS      307.67 GFLOPS
MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  718.17 GFLOPS      982.58 GFLOPS      264.41 GFLOPS
MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    713.72 GFLOPS      677.85 GFLOPS      -35.87 GFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    693.99 GFLOPS      594.83 GFLOPS      -99.16 GFLOPS
MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     1.60 TFLOPS        1.61 TFLOPS        0.01 TFLOPS
MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      1.30 TFLOPS        1.21 TFLOPS       -0.09 TFLOPS
MUL_MAT(type_a=iq4_xs,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     1.61 TFLOPS        1.48 TFLOPS       -0.13 TFLOPS
MUL_MAT(type_a=f32,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       49.83 GFLOPS       49.99 GFLOPS        0.16 GFLOPS
MUL_MAT(type_a=f16,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      478.84 GFLOPS      477.77 GFLOPS       -1.07 GFLOPS
  MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): not supported
MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      48.09 GFLOPS       47.43 GFLOPS       -0.66 GFLOPS
MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     749.31 GFLOPS      754.11 GFLOPS        4.80 GFLOPS
MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     886.72 GFLOPS      880.66 GFLOPS       -6.06 GFLOPS
MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.15 TFLOPS        1.15 TFLOPS        0.00 TFLOPS
MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     535.12 GFLOPS      535.36 GFLOPS        0.24 GFLOPS
MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.59 TFLOPS        1.58 TFLOPS       -0.01 TFLOPS
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     887.47 GFLOPS      893.60 GFLOPS        6.13 GFLOPS
MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.94 TFLOPS        1.86 TFLOPS       -0.08 TFLOPS
MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.09 TFLOPS        1.10 TFLOPS        0.01 TFLOPS
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     641.03 GFLOPS      626.24 GFLOPS      -14.79 GFLOPS
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    1.57 TFLOPS        1.52 TFLOPS       -0.05 TFLOPS
MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     1.44 TFLOPS        1.37 TFLOPS       -0.07 TFLOPS
MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    672.89 GFLOPS      871.08 GFLOPS      198.19 GFLOPS
MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    0.74 TFLOPS        1.16 TFLOPS        0.42 TFLOPS
MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    553.13 GFLOPS      459.81 GFLOPS      -93.32 GFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    552.08 GFLOPS      422.63 GFLOPS     -129.45 GFLOPS
MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    46.31 GFLOPS       46.16 GFLOPS       -0.15 GFLOPS
MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      1.44 TFLOPS        0.43 TFLOPS       -1.01 TFLOPS
MUL_MAT(type_a=iq4_xs,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     1.81 TFLOPS        1.11 TFLOPS       -0.70 TFLOPS
MUL_MAT(type_a=f32,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       83.69 GFLOPS       83.65 GFLOPS       -0.04 GFLOPS
MUL_MAT(type_a=f16,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      870.61 GFLOPS      131.63 GFLOPS     -738.98 GFLOPS
  MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): not supported
MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      30.15 GFLOPS       30.25 GFLOPS        0.10 GFLOPS
MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      83.21 GFLOPS       76.32 GFLOPS       -6.89 GFLOPS
MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     149.72 GFLOPS      153.41 GFLOPS        3.69 GFLOPS
MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     118.52 GFLOPS      112.88 GFLOPS       -5.64 GFLOPS
MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     721.92 GFLOPS      724.89 GFLOPS        2.97 GFLOPS
MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     307.55 GFLOPS      302.99 GFLOPS       -4.56 GFLOPS
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     196.38 GFLOPS      200.70 GFLOPS        4.32 GFLOPS
MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     589.82 GFLOPS      589.49 GFLOPS       -0.33 GFLOPS
MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     165.33 GFLOPS      161.88 GFLOPS       -3.45 GFLOPS
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     524.20 GFLOPS      527.73 GFLOPS        3.53 GFLOPS
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    1.84 TFLOPS        0.57 TFLOPS       -1.27 TFLOPS
MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     1.76 TFLOPS        0.60 TFLOPS       -1.16 TFLOPS
MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    872.29 GFLOPS      357.91 GFLOPS     -514.38 GFLOPS
MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    1.14 TFLOPS        0.55 TFLOPS       -0.59 TFLOPS
MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    196.65 GFLOPS      168.09 GFLOPS      -28.56 GFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    161.36 GFLOPS      164.88 GFLOPS        3.52 GFLOPS
MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    35.64 GFLOPS       35.00 GFLOPS       -0.64 GFLOPS
MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      1.89 TFLOPS        0.18 TFLOPS       -1.71 TFLOPS
MUL_MAT(type_a=iq4_xs,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     1.73 TFLOPS        0.28 TFLOPS       -1.45 TFLOPS
MUL_MAT(type_a=f32,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      1.51 TFLOPS        1.52 TFLOPS        0.01 TFLOPS
MUL_MAT(type_a=f16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      1.32 TFLOPS        1.33 TFLOPS        0.01 TFLOPS
  MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): not supported
MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     1.26 TFLOPS        1.26 TFLOPS        0.00 TFLOPS
MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     1.17 TFLOPS        1.17 TFLOPS        0.00 TFLOPS
MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     1.18 TFLOPS        1.12 TFLOPS       -0.06 TFLOPS
MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     1.13 TFLOPS        1.14 TFLOPS        0.01 TFLOPS
MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     1.15 TFLOPS        1.15 TFLOPS        0.00 TFLOPS
MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     1.25 TFLOPS        1.27 TFLOPS        0.02 TFLOPS
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     1.31 TFLOPS        1.32 TFLOPS        0.01 TFLOPS
MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     1.33 TFLOPS        1.26 TFLOPS       -0.07 TFLOPS
MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     1.32 TFLOPS        1.32 TFLOPS        0.00 TFLOPS
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     1.27 TFLOPS        1.28 TFLOPS        0.01 TFLOPS
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  1.13 TFLOPS        1.12 TFLOPS       -0.01 TFLOPS
MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   1.32 TFLOPS        1.31 TFLOPS       -0.01 TFLOPS
MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    1.17 TFLOPS        1.16 TFLOPS       -0.01 TFLOPS
MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  1.19 TFLOPS        1.19 TFLOPS        0.00 TFLOPS
MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    1.26 TFLOPS        1.28 TFLOPS        0.02 TFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    1.32 TFLOPS        1.33 TFLOPS        0.01 TFLOPS
MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   1.19 TFLOPS        1.20 TFLOPS        0.01 TFLOPS
MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    1.30 TFLOPS        1.30 TFLOPS        0.00 TFLOPS
MUL_MAT(type_a=iq4_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   1.32 TFLOPS        1.32 TFLOPS        0.00 TFLOPS

Comparisons generated using modified compare.py by @daniandtheweb
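
For readers who want to reproduce a before/after table like the ones above, here is a minimal Python sketch. It is not the compare.py script mentioned above. The function names `parse_perf` and `compare` are hypothetical, and the input format is an assumption: each benchmark line is taken to be a test case name, a colon, and a throughput figure in GFLOPS or TFLOPS as the last such value on the line. ANSI color codes are stripped first, and "not supported" lines are skipped.

```python
import re

# Terminal color codes (for example the reset sequence ESC[0m) are removed before parsing.
ANSI = re.compile(r"\x1b\[[0-9;]*m")
# Test case name, e.g. MUL_MAT(type_a=q4_0,...,per=[0,1,2,3]), followed by a colon.
CASE = re.compile(r"^\s*(\w+\(.*?\)):")
# A throughput value with its unit.
VALUE = re.compile(r"(-?\d+(?:\.\d+)?)\s+([GT]FLOPS)")
# Everything is normalised to GFLOPS so mixed units can be compared.
SCALE = {"GFLOPS": 1.0, "TFLOPS": 1000.0}


def parse_perf(text):
    """Map each test case to its throughput in GFLOPS (assumed log format)."""
    results = {}
    for raw in text.splitlines():
        line = ANSI.sub("", raw)
        case = CASE.match(line)
        values = VALUE.findall(line)
        if case and values:
            number, unit = values[-1]  # last throughput figure on the line
            results[case.group(1)] = float(number) * SCALE[unit]
    return results


def compare(before, after):
    """Return (case, before, after, delta, percent change) for shared test cases."""
    rows = []
    for case, old in before.items():
        if case in after:
            new = after[case]
            rows.append((case, old, new, new - old, 100.0 * (new - old) / old))
    return rows


def format_rows(rows):
    """Render rows as aligned text, one test case per line, in GFLOPS."""
    return "\n".join(
        f"{case}: {old:10.2f} {new:10.2f} {delta:+10.2f} GFLOPS ({pct:+.1f}%)"
        for case, old, new, delta, pct in rows
    )
```

Reporting the percent change next to the absolute delta makes it easier to tell a real regression from run-to-run noise, since a 5 GFLOPS difference matters much more at 50 GFLOPS than at 900 GFLOPS.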

@jeffbolznv
Collaborator

Here's the before/after for RTX 4070. The only clear decrease is for iq4_xs, and for that type the only change was to NUM_ROWS.

ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 4070 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | matrix cores: NV_coopmat2
Testing 2 devices

Backend 1/2: Vulkan0
  Device description: NVIDIA GeForce RTX 4070
  Device memory: 12012 MB (12012 MB free)


MUL_MAT(type_a=f32,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      243.17 GFLOPS      240.80 GFLOPS       -2.37 GFLOPS
MUL_MAT(type_a=f16,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      480.16 GFLOPS      477.70 GFLOPS       -2.46 GFLOPS
  MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): not supported
MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       2.67 TFLOPS        2.67 TFLOPS        0.00 TFLOPS
MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       2.40 TFLOPS        2.37 TFLOPS       -0.03 TFLOPS
MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.46 TFLOPS        1.46 TFLOPS        0.00 TFLOPS
MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.22 TFLOPS        1.22 TFLOPS        0.00 TFLOPS
MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     895.77 GFLOPS      894.71 GFLOPS       -1.06 GFLOPS
MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       2.45 TFLOPS        2.46 TFLOPS        0.01 TFLOPS
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.47 TFLOPS        1.47 TFLOPS        0.00 TFLOPS
MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       2.11 TFLOPS        2.11 TFLOPS        0.00 TFLOPS
MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.50 TFLOPS        1.50 TFLOPS        0.00 TFLOPS
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.15 TFLOPS        1.15 TFLOPS        0.00 TFLOPS
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    1.30 TFLOPS        1.97 TFLOPS        0.67 TFLOPS
MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     0.98 TFLOPS        1.44 TFLOPS        0.46 TFLOPS
MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      1.00 TFLOPS        1.57 TFLOPS        0.57 TFLOPS
MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    1.57 TFLOPS        2.23 TFLOPS        0.66 TFLOPS
MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      0.94 TFLOPS        1.38 TFLOPS        0.44 TFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      0.93 TFLOPS        1.18 TFLOPS        0.25 TFLOPS
MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     2.05 TFLOPS        2.26 TFLOPS        0.21 TFLOPS
MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      1.28 TFLOPS        1.24 TFLOPS       -0.04 TFLOPS
MUL_MAT(type_a=iq4_xs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     1.25 TFLOPS        1.32 TFLOPS        0.07 TFLOPS
MUL_MAT(type_a=f32,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      484.14 GFLOPS      483.65 GFLOPS       -0.49 GFLOPS
MUL_MAT(type_a=f16,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      951.21 GFLOPS      950.54 GFLOPS       -0.67 GFLOPS
  MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): not supported
MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       3.81 TFLOPS        3.79 TFLOPS       -0.02 TFLOPS
MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       3.21 TFLOPS        3.23 TFLOPS        0.02 TFLOPS
MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       2.53 TFLOPS        2.54 TFLOPS        0.01 TFLOPS
MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       2.17 TFLOPS        2.14 TFLOPS       -0.03 TFLOPS
MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.75 TFLOPS        1.75 TFLOPS        0.00 TFLOPS
MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       2.99 TFLOPS        2.89 TFLOPS       -0.10 TFLOPS
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       2.42 TFLOPS        2.42 TFLOPS        0.00 TFLOPS
MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       3.40 TFLOPS        3.40 TFLOPS        0.00 TFLOPS
MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       2.72 TFLOPS        2.72 TFLOPS        0.00 TFLOPS
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       2.26 TFLOPS        2.26 TFLOPS        0.00 TFLOPS
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    1.61 TFLOPS        3.31 TFLOPS        1.70 TFLOPS
MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     1.26 TFLOPS        2.67 TFLOPS        1.41 TFLOPS
MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      1.72 TFLOPS        2.75 TFLOPS        1.03 TFLOPS
MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    2.52 TFLOPS        3.49 TFLOPS        0.97 TFLOPS
MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      1.02 TFLOPS        1.97 TFLOPS        0.95 TFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      0.84 TFLOPS        1.60 TFLOPS        0.76 TFLOPS
MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     3.10 TFLOPS        3.66 TFLOPS        0.56 TFLOPS
MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      1.65 TFLOPS        2.38 TFLOPS        0.73 TFLOPS
MUL_MAT(type_a=iq4_xs,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     2.16 TFLOPS        2.08 TFLOPS       -0.08 TFLOPS
MUL_MAT(type_a=f32,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      711.03 GFLOPS      719.04 GFLOPS        8.01 GFLOPS
MUL_MAT(type_a=f16,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                        1.39 TFLOPS        1.40 TFLOPS        0.01 TFLOPS
  MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): not supported
MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       4.55 TFLOPS        4.60 TFLOPS        0.05 TFLOPS
MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       3.53 TFLOPS        3.61 TFLOPS        0.08 TFLOPS
MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       3.23 TFLOPS        3.24 TFLOPS        0.01 TFLOPS
MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       2.63 TFLOPS        2.67 TFLOPS        0.04 TFLOPS
MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       2.53 TFLOPS        2.56 TFLOPS        0.03 TFLOPS
MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       2.86 TFLOPS        2.88 TFLOPS        0.02 TFLOPS
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       2.57 TFLOPS        2.60 TFLOPS        0.03 TFLOPS
MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       3.98 TFLOPS        4.00 TFLOPS        0.02 TFLOPS
MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       2.98 TFLOPS        3.01 TFLOPS        0.03 TFLOPS
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       3.04 TFLOPS        3.07 TFLOPS        0.03 TFLOPS
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    1.78 TFLOPS        4.07 TFLOPS        2.29 TFLOPS
MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     1.37 TFLOPS        2.97 TFLOPS        1.60 TFLOPS
MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      2.05 TFLOPS        2.87 TFLOPS        0.82 TFLOPS
MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    3.02 TFLOPS        3.85 TFLOPS        0.83 TFLOPS
MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      1.57 TFLOPS        2.09 TFLOPS        0.52 TFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      0.84 TFLOPS        1.63 TFLOPS        0.79 TFLOPS
MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     3.82 TFLOPS        4.07 TFLOPS        0.25 TFLOPS
MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      1.47 TFLOPS        2.61 TFLOPS        1.14 TFLOPS
MUL_MAT(type_a=iq4_xs,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     2.83 TFLOPS        2.22 TFLOPS       -0.61 TFLOPS
MUL_MAT(type_a=f32,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      947.41 GFLOPS      947.71 GFLOPS        0.30 GFLOPS
MUL_MAT(type_a=f16,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                        1.84 TFLOPS        1.84 TFLOPS        0.00 TFLOPS
  MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): not supported
MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       4.69 TFLOPS        4.69 TFLOPS        0.00 TFLOPS
MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       3.77 TFLOPS        3.77 TFLOPS        0.00 TFLOPS
MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       3.69 TFLOPS        3.68 TFLOPS       -0.01 TFLOPS
MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       3.00 TFLOPS        3.00 TFLOPS        0.00 TFLOPS
MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       2.74 TFLOPS        2.73 TFLOPS       -0.01 TFLOPS
MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       3.09 TFLOPS        3.09 TFLOPS        0.00 TFLOPS
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       2.89 TFLOPS        2.90 TFLOPS        0.01 TFLOPS
MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       4.25 TFLOPS        4.26 TFLOPS        0.01 TFLOPS
MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       3.71 TFLOPS        3.74 TFLOPS        0.03 TFLOPS
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       4.28 TFLOPS        4.25 TFLOPS       -0.03 TFLOPS
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    1.33 TFLOPS        3.28 TFLOPS        1.95 TFLOPS
MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     1.45 TFLOPS        2.40 TFLOPS        0.95 TFLOPS
MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      1.65 TFLOPS        3.04 TFLOPS        1.39 TFLOPS
MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    2.94 TFLOPS        4.34 TFLOPS        1.40 TFLOPS
MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      0.91 TFLOPS        1.47 TFLOPS        0.56 TFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      0.77 TFLOPS        1.29 TFLOPS        0.52 TFLOPS
MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     4.21 TFLOPS        4.52 TFLOPS        0.31 TFLOPS
MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      1.27 TFLOPS        2.76 TFLOPS        1.49 TFLOPS
MUL_MAT(type_a=iq4_xs,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     3.37 TFLOPS        2.61 TFLOPS       -0.76 TFLOPS
MUL_MAT(type_a=f32,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                        1.19 TFLOPS        1.19 TFLOPS        0.00 TFLOPS
MUL_MAT(type_a=f16,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                        2.26 TFLOPS        2.26 TFLOPS        0.00 TFLOPS
  MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): not supported
MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       4.84 TFLOPS        4.87 TFLOPS        0.03 TFLOPS
MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       3.73 TFLOPS        3.73 TFLOPS        0.00 TFLOPS
MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       3.81 TFLOPS        3.82 TFLOPS        0.01 TFLOPS
MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       3.08 TFLOPS        3.07 TFLOPS       -0.01 TFLOPS
MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       2.84 TFLOPS        2.83 TFLOPS       -0.01 TFLOPS
MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       3.13 TFLOPS        3.13 TFLOPS        0.00 TFLOPS
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       2.93 TFLOPS        2.94 TFLOPS        0.01 TFLOPS
MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       4.47 TFLOPS        4.46 TFLOPS       -0.01 TFLOPS
MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       3.89 TFLOPS        3.89 TFLOPS        0.00 TFLOPS
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       4.73 TFLOPS        4.73 TFLOPS        0.00 TFLOPS
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    1.17 TFLOPS        3.48 TFLOPS        2.31 TFLOPS
MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     1.30 TFLOPS        2.66 TFLOPS        1.36 TFLOPS
MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      1.92 TFLOPS        2.65 TFLOPS        0.73 TFLOPS
MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    2.30 TFLOPS        4.61 TFLOPS        2.31 TFLOPS
MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      0.88 TFLOPS        2.97 TFLOPS        2.09 TFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      0.79 TFLOPS        1.20 TFLOPS        0.41 TFLOPS
MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     4.35 TFLOPS        3.13 TFLOPS       -1.22 TFLOPS
MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      1.05 TFLOPS        2.88 TFLOPS        1.83 TFLOPS
MUL_MAT(type_a=iq4_xs,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     3.64 TFLOPS        2.72 TFLOPS       -0.92 TFLOPS
MUL_MAT(type_a=f32,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                        1.87 TFLOPS        1.87 TFLOPS        0.00 TFLOPS
MUL_MAT(type_a=f16,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                        3.35 TFLOPS        3.36 TFLOPS        0.01 TFLOPS
  MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): not supported
MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       5.03 TFLOPS        5.02 TFLOPS       -0.01 TFLOPS
MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       3.69 TFLOPS        3.68 TFLOPS       -0.01 TFLOPS
MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       4.36 TFLOPS        4.40 TFLOPS        0.04 TFLOPS
MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       3.24 TFLOPS        3.23 TFLOPS       -0.01 TFLOPS
MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       3.02 TFLOPS        3.03 TFLOPS        0.01 TFLOPS
MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       2.79 TFLOPS        2.75 TFLOPS       -0.04 TFLOPS
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       2.98 TFLOPS        2.98 TFLOPS        0.00 TFLOPS
MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       4.76 TFLOPS        4.70 TFLOPS       -0.06 TFLOPS
MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       4.21 TFLOPS        4.17 TFLOPS       -0.04 TFLOPS
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.77 TFLOPS        1.77 TFLOPS        0.00 TFLOPS
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    3.08 TFLOPS        2.35 TFLOPS       -0.73 TFLOPS
MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     1.26 TFLOPS        1.70 TFLOPS        0.44 TFLOPS
MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      1.49 TFLOPS        2.27 TFLOPS        0.78 TFLOPS
MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    3.01 TFLOPS        4.20 TFLOPS        1.19 TFLOPS
MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      1.07 TFLOPS        1.35 TFLOPS        0.28 TFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      0.79 TFLOPS        1.08 TFLOPS        0.29 TFLOPS
MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     4.68 TFLOPS        6.49 TFLOPS        1.81 TFLOPS
MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      2.08 TFLOPS        2.61 TFLOPS        0.53 TFLOPS
MUL_MAT(type_a=iq4_xs,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     4.16 TFLOPS        2.69 TFLOPS       -1.47 TFLOPS
MUL_MAT(type_a=f32,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     32.00 TFLOPS       32.46 TFLOPS        0.46 TFLOPS
MUL_MAT(type_a=f16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     54.13 TFLOPS       54.49 TFLOPS        0.36 TFLOPS
  MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): not supported
MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    52.75 TFLOPS       52.83 TFLOPS        0.08 TFLOPS
MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    45.34 TFLOPS       46.94 TFLOPS        1.60 TFLOPS
MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    44.22 TFLOPS       44.35 TFLOPS        0.13 TFLOPS
MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    44.82 TFLOPS       44.21 TFLOPS       -0.61 TFLOPS
MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    50.11 TFLOPS       48.73 TFLOPS       -1.38 TFLOPS
MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    39.04 TFLOPS       38.05 TFLOPS       -0.99 TFLOPS
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    37.23 TFLOPS       36.55 TFLOPS       -0.68 TFLOPS
MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    38.45 TFLOPS       37.74 TFLOPS       -0.71 TFLOPS
MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    37.58 TFLOPS       37.06 TFLOPS       -0.52 TFLOPS
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    37.66 TFLOPS       37.40 TFLOPS       -0.26 TFLOPS
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                 34.21 TFLOPS       33.82 TFLOPS       -0.39 TFLOPS
MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  36.72 TFLOPS       32.18 TFLOPS       -4.54 TFLOPS
MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   32.02 TFLOPS       33.71 TFLOPS        1.69 TFLOPS
MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                 32.29 TFLOPS       35.53 TFLOPS        3.24 TFLOPS
MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   23.33 TFLOPS       21.59 TFLOPS       -1.74 TFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   19.40 TFLOPS       18.34 TFLOPS       -1.06 TFLOPS
MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  39.43 TFLOPS       49.24 TFLOPS        9.81 TFLOPS
MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   31.56 TFLOPS       33.83 TFLOPS        2.27 TFLOPS
MUL_MAT(type_a=iq4_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  34.39 TFLOPS       43.91 TFLOPS        9.52 TFLOPS
  Backend Vulkan0: OK

Backend 2/2: CPU
  Skipping CPU backend
2/2 backends passed
OK
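
The rows above mix GFLOPS and TFLOPS, so the absolute difference in the last column is hard to compare across types. A relative change per row can be easier to scan. The following is a minimal sketch, assuming the line layout shown above and assuming the first numeric column is the baseline and the second is the build under test (consistent with the third column being second minus first). The names `parse_line` and `relative_change` are hypothetical helpers, not part of llama.cpp or its test tooling.

```python
import re

# Hypothetical helper, not part of llama.cpp: matches rows such as
# "MUL_MAT(type_a=iq2_xxs,...,n=1,...):   1.30 TFLOPS   1.97 TFLOPS   0.67 TFLOPS"
LINE_RE = re.compile(
    r"MUL_MAT\(type_a=(?P<type>\w+),.*?\bn=(?P<n>\d+),.*?\):\s+"
    r"(?P<a>-?\d+(?:\.\d+)?) (?P<ua>[GT]FLOPS)\s+"
    r"(?P<b>-?\d+(?:\.\d+)?) (?P<ub>[GT]FLOPS)"
)

# Normalise both units to GFLOPS so rows are comparable.
UNIT_TO_GFLOPS = {"GFLOPS": 1.0, "TFLOPS": 1000.0}


def parse_line(line):
    """Return (type_a, n, before_gflops, after_gflops), or None if the row has no numbers."""
    m = LINE_RE.search(line)
    if m is None:
        return None  # e.g. "not supported" rows or non-benchmark lines
    before = float(m["a"]) * UNIT_TO_GFLOPS[m["ua"]]
    after = float(m["b"]) * UNIT_TO_GFLOPS[m["ub"]]
    return m["type"], int(m["n"]), before, after


def relative_change(before, after):
    """Percent change of `after` relative to `before`."""
    return (after - before) / before * 100.0


if __name__ == "__main__":
    # Three rows copied from the output above.
    sample = """\
MUL_MAT(type_a=f16,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):   480.16 GFLOPS   477.70 GFLOPS   -2.46 GFLOPS
  MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): not supported
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):   1.30 TFLOPS   1.97 TFLOPS   0.67 TFLOPS
"""
    for line in sample.splitlines():
        parsed = parse_line(line)
        if parsed is None:
            continue
        type_a, n, before, after = parsed
        print(f"{type_a:8s} n={n:<4d} {relative_change(before, after):+.1f}%")
```

Feeding the full output through `parse_line` and sorting by relative change would surface the regressions discussed in this thread (for example the `iq4_xs` rows at n=3 to n=8) alongside the IQ2/IQ3 gains.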

@0cc4m
Collaborator

0cc4m commented Feb 28, 2025

Yeah, I don't think batching performance is important enough to hold up this PR. Overall it looks fine to me.

0cc4m merged commit 438a839 into ggml-org:master on Feb 28, 2025
46 checks passed
mglambda pushed a commit to mglambda/llama.cpp that referenced this pull request Mar 8, 2025
…ns (ggml-org#11595)

* vulkan: implement specialized MMV kernels for IQ2 quantizations

* vulkan: add MMV kernels for IQ3 quants

* vulkan: Increase MMV batch size and unroll IQ LUT setup

* vulkan: fix init_iq_shmem for WG sizes larger than tables

* vulkan: common batch size for all I-quants
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Mar 19, 2025
mostlyuseful pushed a commit to mostlyuseful/llama.cpp that referenced this pull request May 12, 2025
Labels
devops: improvements to build systems and github actions
ggml: changes relating to the ggml tensor library for machine learning
Vulkan: issues specific to the Vulkan backend
5 participants