IQ3_S: multiplier based code book #5867

Draft: ikawrakow wants to merge 24 commits into master from ik/iq3_s_multiplier

Conversation

@ikawrakow (Contributor) commented Mar 4, 2024

This PR demonstrates a simple, multiplication-based codebook for IQ3_S, as suggested by @PeterReid.

It does not achieve quite the same PPL as IQ3_S on master, but inference performance is better on AVX2 and ARM_NEON. On CUDA the performance gain for TG is in the range of 1-2%. On Metal (running on a 30-core M2 Max GPU) TG became slower compared to the lookup table, so I ended up precomputing a lookup table in shared memory. For more inference speed details see the table below.

The codebook is simply ((magic * i) & 0x0f0f0f0f) | 0x01010101, where i is the codebook index in [0, 512), magic is a uint32_t value, and the four bytes in the resulting uint32_t represent 4 quants.
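
For illustration, a minimal sketch of how such a codebook could be generated in plain C. The magic value below is a made-up placeholder, not the one used in this branch, and the function name is mine:

```c
#include <stdint.h>
#include <stdio.h>

// Build the 512-entry codebook from a multiplier.
// Each byte of an entry ends up as an odd value in 1..15 thanks to the | 0x01.
// NOTE: 0xA3571D29 is a placeholder magic; the actual value is in the branch.
static void build_codebook(uint32_t magic, uint32_t codebook[512]) {
    for (uint32_t i = 0; i < 512; ++i) {
        codebook[i] = ((magic * i) & 0x0f0f0f0fu) | 0x01010101u;
    }
}

int main(void) {
    uint32_t cb[512];
    build_codebook(0xA3571D29u, cb);
    printf("entry 5 = 0x%08x\n", cb[5]);
    return 0;
}
```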

No intention to merge this into master, but putting it out there as a demo in case someone is interested in playing with it (or perhaps coming up with a better magic).

Perplexity

PPL is for context of 4096 (LLaMA-v2 and Mistral) or 2048 (LLaMA-v1) with imatrix from woki.train.raw.

| Model | IQ3_S (lookup) | IQ3_S (multiplier) |
|---|---|---|
| LLaMA-v1-7B | 5.4447 | 5.5199 |
| LLaMA-v2-7B | 5.1340 | 5.2016 |
| Mistral-7B | 4.9420 | 4.9920 |
| LLaMA-v1-13B | 4.8041 | 4.8594 |
| LLaMA-v2-13B | 4.5497 | 4.5977 |

Inference speed

Measured on

  • RTX-4080 for CUDA
  • Ryzen 7950X for AVX2
  • M2 Max for ARM_NEON
  • 30-core M2 Max GPU for Metal

| model | backend | threads | test | t/s (lookup) | t/s (multiplier) | Speedup |
|---|---|---|---|---|---|---|
| llama 7B IQ3_S | CUDA | 1 | pp 512 | 5682.99 ± 101.57 | 5710.45 ± 48.15 | 1.005 |
| llama 7B IQ3_S | CUDA | 1 | tg 128 | 151.43 ± 2.14 | 154.74 ± 0.14 | 1.022 |
| llama 7B IQ3_S | AVX2 | 2 | tg 128 | 3.84 ± 0.00 | 7.03 ± 0.03 | 1.831 |
| llama 7B IQ3_S | AVX2 | 4 | tg 128 | 7.33 ± 0.01 | 13.19 ± 0.02 | 1.799 |
| llama 7B IQ3_S | AVX2 | 8 | tg 128 | 13.52 ± 0.04 | 17.83 ± 0.02 | 1.319 |
| llama 7B IQ3_S | AVX2 | 16 | tg 128 | 15.94 ± 0.11 | 16.85 ± 0.22 | 1.057 |
| llama 7B IQ3_S | AVX2 | 16 | pp 512 | 27.61 ± 0.21 | 49.47 ± 0.56 | 1.792 |
| llama 7B IQ3_S | Metal | 4 | pp 512 | 457.59 ± 0.63 | 462.14 ± 0.28 | 1.010 |
| llama 7B IQ3_S | Metal | 4 | tg 128 | 50.30 ± 0.01 | 49.48 ± 0.02 | 0.984 |
| llama 7B IQ3_S | ARM_NEON | 2 | tg 128 | 3.01 ± 0.00 | 4.02 ± 0.00 | 1.335 |
| llama 7B IQ3_S | ARM_NEON | 4 | tg 128 | 5.73 ± 0.00 | 7.61 ± 0.00 | 1.328 |
| llama 7B IQ3_S | ARM_NEON | 8 | tg 128 | 10.99 ± 0.05 | 14.48 ± 0.04 | 1.316 |

ikawrakow added the demo label (Demonstrate some concept or idea, not intended to be merged) on Mar 4, 2024
@ikawrakow (Contributor, Author) commented Mar 4, 2024

So, if I allow for a shuffle operation as @PeterReid does in his PR #5866, I can further lower PPL (see table). I didn't want to go there originally because shuffles, although nearly free in terms of performance on CPUs, come at a cost on GPUs.

| Model | IQ3_S (lookup) | IQ3_S (multiplier) | IQ3_S (mult + shuffle) |
|---|---|---|---|
| LLaMA-v1-7B | 5.4447 | 5.5199 | 5.4778 |
| LLaMA-v2-7B | 5.1340 | 5.2016 | 5.1753 |
| Mistral-7B | 4.9420 | 4.9920 | 4.9581 |
| LLaMA-v1-13B | 4.8041 | 4.8594 | 4.8212 |
| LLaMA-v2-13B | 4.5497 | 4.5977 | 4.5641 |

The codebook is (540201 * i) & 0x0f0f0f0f, and each of the resulting 4 bytes, which take values in 0...15, is remapped via the table {1, 1, 1, 3, 3, 3, 5, 5, 7, 7, 9, 9, 11, 11, 13, 15}. I have only implemented the dequantize kernel on CUDA for now, to be able to run perplexity calculations. But this is starting to look promising as a possible IQ3_S replacement.
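
One possible reading of this in plain C (the per-byte remap is what the SIMD byte-shuffle instruction does; the function name is mine):

```c
#include <stdint.h>

// Codebook entry for index i in [0, 512): multiply, mask to 4 low nibbles,
// then remap each byte through the 16-entry table.
static const uint8_t kShuffle[16] = {1, 1, 1, 3, 3, 3, 5, 5, 7, 7, 9, 9, 11, 11, 13, 15};

static uint32_t iq3s_codebook_entry(uint32_t i) {
    uint32_t v = (540201u * i) & 0x0f0f0f0fu;
    uint32_t out = 0;
    for (int b = 0; b < 4; ++b) {
        out |= (uint32_t)kShuffle[(v >> (8 * b)) & 0x0f] << (8 * b);
    }
    return out;
}
```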

@sorasoras commented

> | Model | IQ3_S (lookup) | IQ3_S (multiplier) | IQ3_S (mult + shuffle) |
> |---|---|---|---|
> | LLaMA-v1-7B | 5.4447 | 5.5199 | 5.4778 |
> | LLaMA-v2-7B | 5.1340 | 5.2016 | 5.1753 |
> | Mistral-7B | 5.9420 | 4.9920 | 4.9581 |
> | LLaMA-v1-13B | 4.8041 | 4.8594 | 4.8212 |
> | LLaMA-v2-13B | 4.5497 | 4.5977 | 4.5641 |

What happened to Mistral-7B for lookup? The PPL seems very high.

@ikawrakow (Contributor, Author) commented

> What happened to Mistral-7B for lookup? The PPL seems very high.

Sorry, typo. I have corrected it in the tables.

ikawrakow force-pushed the ik/iq3_s_multiplier branch from 1b6dce3 to 31cecc8 on March 5, 2024 09:06
@ikawrakow (Contributor, Author) commented

Have updated the branch to a better version that almost matches PPL of the IQ3_S based on a lookup table:

| Model | IQ3_S (lookup) | IQ3_S (mult + shuffle) |
|---|---|---|
| LLaMA-v1-7B | 5.4447 | 5.4709 |
| LLaMA-v2-7B | 5.1340 | 5.1534 |
| Mistral-7B | 4.9420 | 4.9436 |
| LLaMA-v1-13B | 4.8041 | 4.8089 |
| LLaMA-v2-13B | 4.5497 | 4.5606 |

As performance on CUDA and Metal suffers from the shuffle operation, on these platforms I have reverted to a lookup table (which is prepared from the multiply+shuffle, so it is 100% equivalent). Hence, performance there is exactly the same as on master. Performance on AVX2 is massively better. Performance on ARM_NEON has progressed from dog slow to slow:

| model | backend | threads | test | t/s (lookup) | t/s (multiplier) | Speedup |
|---|---|---|---|---|---|---|
| llama 7B IQ3_S | AVX2 | 2 | tg 128 | 3.84 ± 0.00 | 7.61 ± 0.03 | 1.982 |
| llama 7B IQ3_S | AVX2 | 4 | tg 128 | 7.33 ± 0.01 | 14.19 ± 0.02 | 1.936 |
| llama 7B IQ3_S | AVX2 | 8 | tg 128 | 13.52 ± 0.04 | 17.83 ± 0.02 | 1.319 |
| llama 7B IQ3_S | AVX2 | 16 | pp 512 | 27.61 ± 0.21 | 52.68 ± 0.56 | 1.908 |
| llama 7B IQ3_S | ARM_NEON | 2 | tg 128 | 3.01 ± 0.00 | 4.02 ± 0.00 | 1.335 |
| llama 7B IQ3_S | ARM_NEON | 4 | tg 128 | 5.73 ± 0.00 | 7.61 ± 0.00 | 1.328 |
| llama 7B IQ3_S | ARM_NEON | 8 | tg 128 | 10.99 ± 0.05 | 14.48 ± 0.04 | 1.316 |

The final codebook is shuffle[(518559 * i) & 0x0f0f0f0f], applied byte-wise, with shuffle = {1, 1, 1, 3, 3, 3, 5, 5, 5, 7, 7, 9, 9, 11, 13, 15};
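
A sketch of how the CUDA/Metal lookup table could be precomputed from this final formula (function and variable names are mine, not taken from the branch):

```c
#include <stdint.h>

// Final codebook: shuffle[(518559 * i) & 0x0f0f0f0f], applied per byte.
// Precomputing all 512 entries gives the table used on CUDA/Metal, which is
// therefore 100% equivalent to the multiply+shuffle form used on CPU.
static const uint8_t kShuffle[16] = {1, 1, 1, 3, 3, 3, 5, 5, 5, 7, 7, 9, 9, 11, 13, 15};

static void build_iq3s_lut(uint32_t lut[512]) {
    for (uint32_t i = 0; i < 512; ++i) {
        uint32_t v = (518559u * i) & 0x0f0f0f0fu;
        uint32_t out = 0;
        for (int b = 0; b < 4; ++b) {
            out |= (uint32_t)kShuffle[(v >> (8 * b)) & 0x0f] << (8 * b);
        }
        lut[i] = out;
    }
}
```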

@Green-Sky (Collaborator) commented

Amazing work as always, but surely you meant wookiee... instead of woki.train.raw 😄

@sorasoras commented

> Have updated the branch to a better version that almost matches PPL of the IQ3_S based on a lookup table:
>
> | Model | IQ3_S (lookup) | IQ3_S (mult + shuffle) |
> |---|---|---|
> | LLaMA-v1-7B | 5.4447 | 5.4709 |
> | LLaMA-v2-7B | 5.1340 | 5.1534 |
> | Mistral-7B | 4.9420 | 4.9436 |
> | LLaMA-v1-13B | 4.8041 | 4.8089 |
> | LLaMA-v2-13B | 4.5497 | 4.5606 |

Maybe assign a different name to this version? mult+shuffle is not exactly the same PPL as lookup, or keep improving it, I guess?

@ikawrakow (Contributor, Author) commented Mar 5, 2024

> Maybe assign a different name to this version? mult+shuffle is not exactly the same PPL as lookup, or keep improving it, I guess?

No intention to merge this for now. It is there as a demo with the hope that someone may find a better codebook generator.
