Replies: 4 comments 1 reply
-
Great work! I'm not sure what these perplexity values are, but my guess is that they are for a fraction of WikiText2 (around 10 chunks, perhaps?). In any case, you are running on the CPU, and there was a bug in the AVX2 implementation that was fixed in #5834, which I think you don't have; that might explain the higher PPL values you are observing on the master branch. Either way, I have done a complete PPL run for Mistral-7B with a context of 512 and an imatrix.
Mistral-7B PPL this PR, Final PPL = 5.9530
main: build = 2278 (7ad9511)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed = 1709531437
llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from junk.bin (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = tmp
llama_model_loader: - kv 2: llama.context_length u32 = 32768
llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
llama_model_loader: - kv 4: llama.block_count u32 = 32
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 11: general.file_type u32 = 26
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32000] = ["", "
Mistral-7B PPL master, Final PPL = 5.8807
main: build = 2329 (67be2ce1)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed = 1709537499
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4080, compute capability 8.9, VMM: yes
llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from junk.bin (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = models
llama_model_loader: - kv 2: llama.context_length u32 = 32768
llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
llama_model_loader: - kv 4: llama.block_count u32 = 32
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 11: general.file_type u32 = 26
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32000] = ["", "
I have also implemented a multiplier-based codebook; see PR #5867 for what I get with Mistral-7B.
-
@PeterReid I was intrigued by the fact that your codebook results in a slightly better perplexity for Mistral-7B compared to #5867, so I went ahead and tried it on LLaMA-v2-7B. It does not do very well there:
LLaMA-v2-7B PPL this PR = 5.2466
main: build = 2278 (7ad9511)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed = 1709545886
llama_model_loader: loaded meta data with 16 key-value pairs and 291 tensors from junk.bin (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = llama2
llama_model_loader: - kv 2: llama.context_length u32 = 4096
llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
llama_model_loader: - kv 4: llama.block_count u32 = 32
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 11008
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 32
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: general.file_type u32 = 26
llama_model_loader: - kv 11: tokenizer.ggml.model str = llama
llama_model_loader: - kv 12: tokenizer.ggml.tokens arr[str,32000] = ["", "
LLaMA-v2-7B PPL master = 5.1340
main: build = 2307 (7b629c3b)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed = 1709314323
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4080, compute capability 8.9, VMM: yes
llama_model_loader: loaded meta data with 16 key-value pairs and 291 tensors from junk.bin (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = llama2
llama_model_loader: - kv 2: llama.context_length u32 = 4096
llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
llama_model_loader: - kv 4: llama.block_count u32 = 32
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 11008
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 32
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: general.file_type u32 = 26
llama_model_loader: - kv 11: tokenizer.ggml.model str = llama
llama_model_loader: - kv 12: tokenizer.ggml.tokens arr[str,32000] = ["", "
LLaMA-v2-7B PPL #5867 = 5.2016
main: build = 2295 (8b713a98)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed = 1709458694
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4080, compute capability 8.9, VMM: yes
llama_model_loader: loaded meta data with 16 key-value pairs and 291 tensors from junk.bin (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = llama2
llama_model_loader: - kv 2: llama.context_length u32 = 4096
llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
llama_model_loader: - kv 4: llama.block_count u32 = 32
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 11008
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 32
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: general.file_type u32 = 26
llama_model_loader: - kv 11: tokenizer.ggml.model str = llama
llama_model_loader: - kv 12: tokenizer.ggml.tokens arr[str,32000] = ["", "
-
My perplexity runs were on the full ~600 chunks of wiki.test, and on a GPU, so the AVX2 bug wouldn't have affected them. I wonder if the difference is that I used the instruct-tuned Mistral? I also requantized down from Q8 because I couldn't find an fp16 ggml, so maybe that is it. I will do some more testing. I did not realize you had already done all this work and more in this direction! I bet that if you used a shuffle at the end, the performance gap would close up.
-
Ah, you used an instruct-tuned Mistral-7B; this explains the large PPL values. Are you using the official one from Mistral AI or some other random tuning? With the official Mistral-7B-Instruct-v0.2 I get:
Master, Final PPL = 6.7768, PPL after 100 chunks: 6.9156
main: build = 2282 (cb49e0f8)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed = 1709560713
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4080, compute capability 8.9, VMM: yes
llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from junk.bin (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = hf
llama_model_loader: - kv 2: llama.context_length u32 = 32768
llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
llama_model_loader: - kv 4: llama.block_count u32 = 32
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 11: general.file_type u32 = 26
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32000] = ["", "
This PR, PPL after 100 chunks: 7.0149
main: build = 2278 (7ad9511)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed = 1709561204
llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from junk.bin (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = hf
llama_model_loader: - kv 2: llama.context_length u32 = 32768
llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
llama_model_loader: - kv 4: llama.block_count u32 = 32
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 11: general.file_type u32 = 26
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32000] = ["", "
-
I have been exploring ways to improve perplexity for IQ3_S quantization while speeding it up on AVX/NEON, and I think I have found one. It uses the multiply-instead-of-codebook-lookup that I was asking about in #5676 for its speed boost, and the insight from the IQ4_NL quantization for its perplexity improvement. Unfortunately, this is not backwards compatible, because it changes the codebook. It is also not ready for merging (I think I've broken IQ3_XXS, for example), and represents me mucking around rather than making something presentable.
I started by noticing that some values appear much more often than others in the codebook. Specifically, the occurrence counts of the 8 values are: 436, 344, 327, 271, 223, 185, 112, 150. This is what you would expect from a weight distribution that is somewhat quadratic-ish, and it is the property that IQ4_NL exploits to do better than similarly-sized methods. I decided to choose codebook values following the same kind of polynomial as IQ4_NL. I fit a polynomial (0.08095843*x^3 + 0.0671659*x^2 + 11.43774359*x + 0.99047392) and ended up using the values [0, 3, 6, 9, 12, 16, 19, 23, 26, 31, 35, 40, 45, 50, 56, 62]. (I used a maximum of 62 because that is the highest value in the old codebook.)
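For reference, a minimal sketch of how those occurrence counts can be tallied. It assumes the codebook is passed in as a packed table with four byte values per uint32_t entry (in llama.cpp that would be the 512-entry iq3s_grid from ggml-common.h); the function name is just illustrative.

```c
#include <stdint.h>
#include <stdio.h>

// Tally how often each byte value occurs in a packed codebook table
// (4 values per uint32_t entry). With a 512-entry grid this walks
// 2048 values, which matches the counts quoted above.
static void count_codebook_values(const uint32_t * grid, int n_entries) {
    int counts[256] = {0};
    for (int i = 0; i < n_entries; ++i) {
        for (int b = 0; b < 4; ++b) {
            counts[(grid[i] >> (8*b)) & 0xff]++;
        }
    }
    for (int v = 0; v < 256; ++v) {
        if (counts[v] > 0) {
            printf("value %3d occurs %d times\n", v, counts[v]);
        }
    }
}
```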
For assembling these values into a codebook, I compute the value indices as (codebook_index * 0xd137151) & 0x0f0f0f0f, and then map the four 4-bit indices sitting in those four bytes to the values above. That magic number is the result of me trying out a few numbers until I found one that used each value a roughly equal number of times, not of any search computation, so it is quite possible that a better one could be found that ends up with better-spaced points. It is also possible that there is a better list of values to use, but I didn't want to overfit anything to the one model I am working with.
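In scalar code, that multiplier-based expansion looks roughly like the sketch below (the table holds the 16 values quoted above; the names are illustrative, not the identifiers used in the branch):

```c
#include <stdint.h>

// The 16 codebook values quoted above.
static const uint8_t kvalues[16] = {
    0, 3, 6, 9, 12, 16, 19, 23, 26, 31, 35, 40, 45, 50, 56, 62,
};

// Expand one codebook index into 4 weight magnitudes: one 32-bit multiply
// spreads the index into four bytes, the mask keeps a 4-bit value index
// per byte, and a table lookup maps each index to its value.
static inline void decode_group_of_4(uint32_t codebook_index, uint8_t out[4]) {
    const uint32_t idx4 = (codebook_index * 0xd137151u) & 0x0f0f0f0fu;
    for (int i = 0; i < 4; ++i) {
        out[i] = kvalues[(idx4 >> (8*i)) & 0x0f];
    }
}
```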
I have only done this for AVX so far. It does all of those operations vectorized, working on 32 weights at a time.
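For concreteness, here is a rough illustration of what the vectorized expansion could look like on AVX2; this is my own sketch under the assumption that the final table lookup is done with a byte shuffle, not the actual kernel from the branch.

```c
#include <immintrin.h>
#include <stdint.h>

// Expand 8 codebook indices into 32 weight magnitudes: one vector multiply,
// one mask, one byte shuffle. The 16-value table is replicated into both
// 128-bit lanes because vpshufb looks up within each lane independently.
static inline __m256i expand_8_indices(__m256i idx /* 8 x uint32 codebook indices */) {
    const __m256i mult   = _mm256_set1_epi32(0x0d137151);
    const __m256i mask   = _mm256_set1_epi32(0x0f0f0f0f);
    const __m256i values = _mm256_setr_epi8(
         0,  3,  6,  9, 12, 16, 19, 23, 26, 31, 35, 40, 45, 50, 56, 62,
         0,  3,  6,  9, 12, 16, 19, 23, 26, 31, 35, 40, 45, 50, 56, 62);
    const __m256i nibbles = _mm256_and_si256(_mm256_mullo_epi32(idx, mult), mask);
    return _mm256_shuffle_epi8(values, nibbles); // 32 bytes of magnitudes
}
```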
I've tested three versions.
I have not figured out why the current version of iq3_s performs worse than the baseline on these metrics. In any case, the speed improvement for my version is pretty big: 158%-210% of the original speed. Plus, the perplexity is better.
So, to summarize: this breaks backwards compatibility with existing IQ3_S-quantized files, but it seems worthwhile to pursue for performance and perplexity reasons. @ikawrakow?
The code is in https://github.com/PeterReid/llama.cpp/commits/iq3_s_quant_change_cleaned/