IQ3_S: multiplier based code book #5867
Conversation
CUDA is 153.8 t/s, so faster than lookup table (151 t/s) and Q3_K_S (145 t/s). AVX2 on Ryzen-5975WX is 13.7 t/s, so faster than lookup (12.7 t/s), but slower than Q3_K_S (15.5 t/s).
This brings the bpw to 3.5625. We come close to, but don't quite match, the lookup-based version at 3.4375 bpw (blocks of 32).
So, if I allow for a shuffle operation as @PeterReid does in his PR #5866, I can further lower PPL (see table). I didn't want to go there originally because shuffles, although nearly free in terms of performance on CPUs, come at a cost on GPUs.
The codebook is
What happened to Mistral-7B for lookup?
Sorry, typo. Have corrected in the tables.
~4% faster TG that way.
~4% faster TG and ~2% faster PP that way.
Force-pushed the branch from 1b6dce3 to 31cecc8.
Have updated the branch to a better version that almost matches the PPL of `IQ3_S` on master.
As performance on CUDA and Metal suffers from the shuffle operation, on these platforms I have reverted to a lookup table (which is prepared from the multiply+shuffle, so it is 100% equivalent). Hence, performance there is exactly the same as on master. Performance on AVX2 is massively better. Performance on ARM_NEON has progressed from dog slow to slow (see table).
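For concreteness, here is a minimal, hedged sketch of preparing such a table from the multiplier formula; the function name and `magic` value are placeholders rather than what is used in this PR, and the extra shuffle step mentioned above is not shown:

```cpp
#include <cstdint>

// Minimal sketch (placeholder name and magic, shuffle step omitted): fill a
// 512-entry grid from the multiplier-based generator once, so that the CUDA/Metal
// kernels can keep doing a plain table lookup, exactly as on master.
static void prepare_iq3s_grid(uint32_t grid[512], uint32_t magic) {
    for (uint32_t i = 0; i < 512; ++i) {
        grid[i] = ((magic * i) & 0x0f0f0f0fu) | 0x01010101u;
    }
}
```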
The final codebook is
Amazing work as always, but surely you meant
Maybe assign a different name to this version?
No intention to merge this for now. It is there as a demo with the hope that someone may find a better codebook generator.
This PR demonstrates a simple, multiplication-based codebook for `IQ3_S`, as suggested by @PeterReid. It does not achieve quite the same PPL as `IQ3_S` on master, but inference performance is better on AVX2 and ARM_NEON. On CUDA the performance gain for TG is in the range of 1-2%. On Metal (running on a 30-core M2 Max GPU) TG became slower compared to lookup, so I ended up precomputing a lookup table in shared memory. For more inference speed details see the table below.

The codebook is simply `((magic * i) & 0x0f0f0f0f) | 0x01010101`, where `i` is the codebook index in `[0, 512)`, `magic` is a `uint32_t` value, and the four bytes in the resulting `uint32_t` represent 4 quants.

No intention to merge this into master, but putting it out there as a demo in case someone is interested in playing with it (or perhaps coming up with a better magic).
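As a hedged, self-contained sketch of what this generator produces (the `magic` value below is only a placeholder, not the constant used in the PR): masking each byte with 0x0f keeps its low nibble and OR-ing with 0x01 forces the low bit, so every quant comes out as an odd value in [1, 15], i.e. one of 8 possible magnitudes, matching 3 bits per quant.

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    // Hedged sketch of the multiplier-based codebook described above; the magic
    // constant here is a placeholder, not the value used in this PR.
    const uint32_t magic = 0x1b2b3b4bu;
    uint32_t codebook[512];
    for (uint32_t i = 0; i < 512; ++i) {
        // Masking with 0x0f0f0f0f keeps the low nibble of each byte and OR-ing with
        // 0x01010101 forces the low bit, so every byte is an odd value in [1, 15],
        // i.e. one of 8 possible quant magnitudes (3 bits per quant).
        codebook[i] = ((magic * i) & 0x0f0f0f0fu) | 0x01010101u;
    }
    // Print the first few entries, interpreting the four bytes as 4 quants.
    for (int i = 0; i < 4; ++i) {
        const uint8_t * q = (const uint8_t *) &codebook[i];
        printf("entry %d: %u %u %u %u\n", i, q[0], q[1], q[2], q[3]);
    }
    return 0;
}
```

Searching for a "better magic" then presumably amounts to scanning candidate `uint32_t` values and scoring the resulting 512-entry grid against the quantization error (or final PPL).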
Perplexity
PPL is for a context of 4096 (LLaMA-v2 and Mistral) or 2048 (LLaMA-v1), with imatrix from `wiki.train.raw`.
Inference speed
Measured on