IQ3_S: multiplier based code book #5867
Conversation
CUDA is 153.8 t/s, so faster than lookup table (151 t/s) and Q3_K_S (145 t/s). AVX2 on Ryzen-5975WX is 13.7 t/s, so faster than lookup (12.7 t/s), but slower than Q3_K_S (15.5 t/s).
This brings the bpw to 3.5625. We come close to, but don't quite match, the lookup-based version at 3.4375 bpw (blocks of 32).
So, if I allow for a shuffle operation as @PeterReid does in his PR #5866, I can further lower PPL (see table). I didn't want to go there originally because shuffles, although nearly free in terms of performance on CPUs, come at a cost on GPUs.
The codebook is
What happened to Mistral-7B for lookup?
Sorry, typo. Have corrected in the tables.
~4% faster TG that way.
~4% faster TG and ~2% faster PP that way.
Force-pushed the branch from 1b6dce3 to 31cecc8.
Have updated the branch to a better version that almost matches the PPL of `IQ3_S` on master.
As performance on CUDA and Metal suffers from the shuffle operation, on these platforms I have reverted to a lookup table (which is prepared from the multiply+shuffle, so it is 100% equivalent). Hence, performance there is exactly the same as on master. Performance on AVX2 is massively better. Performance on ARM_NEON has progressed from dog slow to slow (see table).
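For concreteness, here is a minimal, hedged sketch of preparing such a table from the multiplier formula; the function name and `magic` value are placeholders rather than what is used in this PR, and the extra shuffle step mentioned above is not shown:

```cpp
#include <cstdint>

// Minimal sketch (placeholder name and magic, shuffle step omitted): fill a
// 512-entry grid from the multiplier-based generator once, so that the CUDA/Metal
// kernels can keep doing a plain table lookup, exactly as on master.
static void prepare_iq3s_grid(uint32_t grid[512], uint32_t magic) {
    for (uint32_t i = 0; i < 512; ++i) {
        grid[i] = ((magic * i) & 0x0f0f0f0fu) | 0x01010101u;
    }
}
```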
The final codebook is
Amazing work as always, but surely you meant
Maybe assign a different name to this version?
No intention to merge this for now. It is there as a demo with the hope that someone may find a better codebook generator.
This PR demonstrates a simple, multiplication-based codebook for `IQ3_S`, as suggested by @PeterReid. It does not achieve quite the same PPL as `IQ3_S` on master, but inference performance is better on AVX2 and ARM_NEON. On CUDA the performance gain for TG is in the range of 1-2%. On Metal (running on a 30-core M2 Max GPU) TG became slower compared to lookup, so I ended up precomputing a lookup table in shared memory. For more inference speed details see the table below.

The codebook is simply `((magic * i) & 0x0f0f0f0f) | 0x01010101`, where `i` is the codebook index in `[0, 512)`, `magic` is a `uint32_t` value, and the four bytes in the resulting `uint32_t` represent 4 quants.

No intention to merge this into master, but putting it out there as a demo in case someone is interested in playing with it (or perhaps coming up with a better magic).
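As a hedged, self-contained sketch of what this generator produces (the `magic` value below is only a placeholder, not the constant used in the PR): masking each byte with 0x0f keeps its low nibble and OR-ing with 0x01 forces the low bit, so every quant comes out as an odd value in [1, 15], i.e. one of 8 possible magnitudes, matching 3 bits per quant.

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    // Hedged sketch of the multiplier-based codebook described above; the magic
    // constant here is a placeholder, not the value used in this PR.
    const uint32_t magic = 0x1b2b3b4bu;
    uint32_t codebook[512];
    for (uint32_t i = 0; i < 512; ++i) {
        // Masking with 0x0f0f0f0f keeps the low nibble of each byte and OR-ing with
        // 0x01010101 forces the low bit, so every byte is an odd value in [1, 15],
        // i.e. one of 8 possible quant magnitudes (3 bits per quant).
        codebook[i] = ((magic * i) & 0x0f0f0f0fu) | 0x01010101u;
    }
    // Print the first few entries, interpreting the four bytes as 4 quants.
    for (int i = 0; i < 4; ++i) {
        const uint8_t * q = (const uint8_t *) &codebook[i];
        printf("entry %d: %u %u %u %u\n", i, q[0], q[1], q[2], q[3]);
    }
    return 0;
}
```

Searching for a "better magic" then presumably amounts to scanning candidate `uint32_t` values and scoring the resulting 512-entry grid against the quantization error (or final PPL).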
Perplexity
PPL is for a context of 4096 (LLaMA-v2 and Mistral) or 2048 (LLaMA-v1), with imatrix from `wiki.train.raw`.
Inference speed
Measured on