imatrix : use GGUF to store importance matrices #9400

compilade · 2024-09-10T02:14:44Z

Follow-up from ikawrakow/ik_llama.cpp#15 (reply in thread).

Using GGUF as the format for imatrix files will be useful for further experiments (e.g. with L²QER) and compatibility with existing or future GGUF tooling (e.g. GGUF previews on HuggingFace, graphical GGUF viewer(s) #6715, some kind of gguf-diff, etc.).

There are multiple problems with imatrix which this is addressing:

Ad-hoc format which isn't really readable by other projects (and which has no way to backward-compatibly be extended except by adding more stuff at the end)
Non-deterministic tensor order depending on unordered_map iteration order (makes sha256sum useless to compare imatrix files made on the same dataset)
Broken behavior at small -ub (intermediate saves happen waaay too often)
Can't use bigger batch size than chunk size
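To illustrate the second problem, here is a minimal sketch of why iteration order matters for checksums. The `serialize` function is a hypothetical toy format (length-prefixed name followed by F32 values), not the actual imatrix.dat layout, and the tensor names and values are made up for the example.

```python
import hashlib
import struct


def serialize(entries):
    """Toy serializer (hypothetical layout): name length, name bytes, F32 values.

    Entries are written in the order they are given, like a writer that
    follows the iteration order of its container.
    """
    out = bytearray()
    for name, values in entries:
        encoded = name.encode("utf-8")
        out += struct.pack("<i", len(encoded)) + encoded
        out += struct.pack(f"<{len(values)}f", *values)
    return bytes(out)


# Made-up tensor names and values, for illustration only.
data = {
    "blk.1.ffn_up.weight": [0.5, 2.0],
    "blk.0.attn_q.weight": [1.0, 4.0],
}

# Two possible iteration orders over the same data.
order_a = list(data.items())
order_b = list(reversed(order_a))

digest_a = hashlib.sha256(serialize(order_a)).hexdigest()
digest_b = hashlib.sha256(serialize(order_b)).hexdigest()

# Sorting by name before serializing removes the dependence on iteration order.
digest_sorted_a = hashlib.sha256(serialize(sorted(order_a))).hexdigest()
digest_sorted_b = hashlib.sha256(serialize(sorted(order_b))).hexdigest()

print("unsorted digests equal:", digest_a == digest_b)
print("sorted digests equal:  ", digest_sorted_a == digest_sorted_b)
```

The data is identical in both cases; only the write order differs, which is enough to change the digest unless the names are sorted first.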

Summary of changes

Use GGUF to store imatrix data.
- general.type is imatrix
- no general.architecture
  - can't really know the architecture from old imatrix files.
- store *.in_sum2 and *.counts for each tensor with imatrix data.
  - *.in_sum2 are the per-channel sums of squared activations
    - Stored in F32, like before.
  - *.counts are the number of activations (also the number of tokens), useful to calculate the mean squared activations (which are used by llama-quantize)
    - Why not simply store the mean? To allow merging imatrix files together with --in-file.
    - It's stored in F32 even though it's integer values, because when calculating the mean it would be converted to F32 anyway to perform the division.
~~Add convert_legacy_imatrix_to_gguf.py to convert old imatrix.dat files to imatrix.gguf~~
- Conversion is either unnecessary (llama-quantize can still read the old format, with a warning) or can be done with llama-imatrix directly (when the output file has the .gguf suffix).
Allow computing multiple chunks per batch with llama-imatrix, like llama-perplexity has been able to since perplexity : support using multiple sequences to allow larger batch sizes #5946
- This should be useful for huge models like Llama-405B when they don't fit completely in RAM.
Use fused-multiply-add (with std::fma) when accumulating the sums of activations
- Shouldn't hurt to somewhat reduce rounding errors
  - (obviously f64 would be even better, but I'm not sure it's worth it yet. For the curious, using double for the intermediate accumulations can be tried by changing only one line in IMatrixStats: vector<float> values to vector<double> values.)
Sort the tensor names before serializing
- This makes the tensor order deterministic, because otherwise it depended on the iteration order of unordered_map.
  - Determinism between runs means sha256sum can be meaningfully used to compare imatrix files generated in very similar conditions.
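As a rough sketch of why sums and counts are stored instead of the mean: merging two runs is a plain addition of both, and the mean can be derived afterwards. `TensorStats`, `merge` and `mean_squared` are hypothetical names for this example, and it assumes a single count per tensor, which may not match how the counts are actually laid out (e.g. for MoE tensors).

```python
from dataclasses import dataclass


@dataclass
class TensorStats:
    """Hypothetical container mirroring *.in_sum2 and *.counts for one tensor."""
    in_sum2: list[float]  # per-channel sums of squared activations
    counts: float         # number of activations (assumed: one count per tensor)


def merge(a: TensorStats, b: TensorStats) -> TensorStats:
    """Merge two runs by adding sums and counts."""
    if len(a.in_sum2) != len(b.in_sum2):
        raise ValueError("channel count mismatch")
    summed = [x + y for x, y in zip(a.in_sum2, b.in_sum2)]
    return TensorStats(summed, a.counts + b.counts)


def mean_squared(stats: TensorStats) -> list[float]:
    """Per-channel mean of squared activations, derived from sums and counts."""
    return [s / stats.counts for s in stats.in_sum2]


# Made-up numbers: run_b saw twice as many tokens as run_a.
run_a = TensorStats(in_sum2=[4.0, 8.0], counts=2.0)
run_b = TensorStats(in_sum2=[2.0, 10.0], counts=4.0)

merged = merge(run_a, run_b)
merged_mean = mean_squared(merged)

# Averaging the two means directly ignores how many tokens each run saw.
naive_mean = [
    (ma + mb) / 2
    for ma, mb in zip(mean_squared(run_a), mean_squared(run_b))
]

print("merged mean:", merged_mean)
print("naive mean: ", naive_mean)
```

With only the means stored, there would be no way to weight each run by its token count when merging.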

TODO

Compare old llama-quantize with old imatrix.dat with new llama-quantize using converted imatrix.gguf
- Seemed to work, but need to re-test. The resulting quantized model(s) should have the same sha256sum.
Test new llama-imatrix at different batch sizes
- Same checksums with -ub 64 -b 512 and -ub 512 -b 2048 for a chunk size of 512 (-c 512)
Perplexity test(s) with i-quants with old llama-imatrix vs new llama-imatrix
Test with MoE models (perplexity with i-quants should be in the same ballpark as before)
Test --in-file with llama-imatrix
(maybe) Implement cleaner general.architecture exclusion.
- Currently, this uses a subclass to make self.add_architecture() a no-op, but maybe general.architecture should simply be excluded when self.arch == "". Not sure how to prevent using the other self.add_* (in GGUFWriter) which expect self.arch to be something.
- Or maybe the architecture should be included?
  - What about conversions from older imatrix.dat files?
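For the checksum comparisons in the list above, something like the following could be used instead of running sha256sum by hand. This is only a sketch: `sha256_of_file` and `same_checksum` are hypothetical helper names, and the files in the demo are small temporary stand-ins, not real quantized models or imatrix files.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large model files need not fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            h.update(block)
    return h.hexdigest()


def same_checksum(path_a, path_b):
    """True when both files have the same SHA-256 digest."""
    return sha256_of_file(path_a) == sha256_of_file(path_b)


# Demo with temporary stand-in files (made-up contents).
with tempfile.TemporaryDirectory() as tmp:
    old_path = os.path.join(tmp, "old.bin")
    new_path = os.path.join(tmp, "new.bin")
    other_path = os.path.join(tmp, "other.bin")

    with open(old_path, "wb") as f:
        f.write(b"same bytes")
    with open(new_path, "wb") as f:
        f.write(b"same bytes")
    with open(other_path, "wb") as f:
        f.write(b"different bytes")

    identical = same_checksum(old_path, new_path)
    different = same_checksum(old_path, other_path)

print("old vs new:  ", identical)
print("old vs other:", different)
```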

I have read the contributing guidelines
Self-reported review complexity:
- Medium

* perplexity : simplify filling the batch

Sums and counts tensors no longer need to be consecutive.

* imatrix : more sanity checks when loading multiple imatrix files

* imatrix : use ggml_format_name instead of std::string concatenation

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>

compilade · 2024-09-13T03:16:15Z

I'm setting this to "draft", because of concerns by @ikawrakow in ikawrakow/ik_llama.cpp#15 (comment) and ikawrakow/ik_llama.cpp#15 (comment) (mostly related to the fact that GGUF is harder to parse than imatrix.dat files).

More details near the end of ikawrakow/ik_llama.cpp#15 (reply in thread).

I'll need some days to think about how to go further with this.

ggerganov · 2025-04-08T07:59:37Z

@compilade This is a good change and I think it would be useful to bring it to a completion.

In the future, we can extend libllama with an interface for saving/loading imatrix data. This way the implementation for reading and writing the imatrix data would be localized in libllama and can be kept in-sync more easily. This can be combined with the refactoring of llama_model_quantize_params to not pass C++ objects.

compilade added 8 commits August 20, 2024 15:17

imatrix : allow processing multiple chunks per batch

bce5464

* perplexity : simplify filling the batch

imatrix : fix segfault when using a single chunk per batch

347247a

imatrix : use GGUF to store imatrix data

3de9300

imatrix : fix conversion problems

c8ab6a3

Merge branch 'master' into compilade/imatrix-batched-chunks

3ad0603

imatrix : use FMA and sort tensor names

d19101c

py : add requirements for legacy imatrix convert script

503630e

perplexity : revert changes

9e6b0e9

compilade added 3 commits September 9, 2024 22:20

py : include imatrix converter requirements in toplevel requirements

894ed8d

imatrix : avoid using designated initializers in C++

efa9186

imatrix : remove unused n_entries

2217247

ngxson reviewed Sep 10, 2024

compilade and others added 2 commits September 10, 2024 11:51

quantize : use unused imatrix chunk_size with LLAMA_TRACE

2d79a70

compilade marked this pull request as draft September 13, 2024 03:11

compilade added 3 commits January 30, 2025 19:56

common : use GGUF for imatrix output by default

c7a32e7

Merge branch 'master' into compilade/imatrix-batched-chunks

db502dd

Merge branch 'master' into compilade/imatrix-batched-chunks

1be357d

compilade mentioned this pull request Apr 6, 2025

imatrix: add option to display importance score statistics for a given imatrix file #12718

Open

compilade added 3 commits April 13, 2025 12:10

Merge branch 'master' into compilade/imatrix-batched-chunks

16202d6

imatrix : two-way conversion between old format and GGUF

a5165a6

convert : remove imatrix to gguf python script

635f945

imatrix : use the function name in more error messages

1d19025
