Description
Quantizing Stable Diffusion 3.5 models to any of the k-quants produces large files consisting mostly of fp16 weights. That's because many tensors have a width of 2432 or 7296, which is not a multiple of the k-quant block size of 256 (2432 = 9.5 × 256, 7296 = 28.5 × 256), so those tensors are skipped and kept at fp16:
https://github.com/leejet/stable-diffusion.cpp/blob/master/model.cpp#L1761
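For reference, this is roughly the shape of the divisibility check involved, as a minimal self-contained sketch (QK_K = 256 is ggml's k-quant super-block size; the function and names here are illustrative, not sd.cpp's actual code):

```cpp
#include <cstdint>
#include <cstdio>
#include <initializer_list>

// k-quants pack weights in super-blocks of QK_K values, so a tensor row
// must be a multiple of QK_K to be k-quantizable at all.
constexpr int64_t QK_K = 256;

static bool can_k_quantize(int64_t row_width) {
    return row_width % QK_K == 0;
}

int main() {
    // The SD3.5 widths from above fail the check; a width like 4096 passes.
    for (int64_t width : {2432, 7296, 4096}) {
        std::printf("%5lld -> %s\n", (long long) width,
                    can_k_quantize(width) ? "k-quantizable" : "kept at fp16");
    }
}
```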
For example, a q3_k quant of SD3.5 Large is 13 842 megabytes (versus 16 460 megabytes for the fp16 model). Assuming 3.4375 bits per quantized weight, that size implies only about 20% of the weights actually got quantized:
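Back-of-the-envelope, taking the two file sizes at face value and ignoring metadata and any mixed sub-types inside a q3_k file:

```
Let f be the fraction of weights stored at 3.4375 bits, the rest staying at 16 bits:

13842 ≈ 16460 × [(1 − f) + f × 3.4375/16]
13842 ≈ 16460 × (1 − 0.785 × f)
    f ≈ (1 − 13842/16460) / 0.785 ≈ 0.20
```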
I'm not sure what could be done to fix that. Maybe fall back to the next bigger quant type whose block size does divide the tensor width (e.g. the 32-wide legacy quants), instead of skipping the tensor altogether? A rough sketch of that idea:
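Something like this selection loop could do it (a hypothetical self-contained sketch, not sd.cpp's actual API; the block-size table mirrors ggml's, but all names here are illustrative). Both 2432 and 7296 are multiples of 32, so the legacy quants would catch them:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical fallback chain: if the requested k-quant does not divide the
// row width, try progressively bigger types instead of jumping to fp16.
// Block sizes: k-quants use 256, legacy quants (q4_0/q5_0/q8_0) use 32.
enum class qtype { q3_k, q4_k, q4_0, q5_0, q8_0, f16 };

static int64_t block_size(qtype t) {
    switch (t) {
        case qtype::q3_k:
        case qtype::q4_k: return 256;
        case qtype::q4_0:
        case qtype::q5_0:
        case qtype::q8_0: return 32;
        case qtype::f16:  return 1;
    }
    return 1;
}

// Pick the first type in the preference list (ordered by increasing bits per
// weight) whose block size divides the row width.
static qtype pick_quant(int64_t row_width, const std::vector<qtype>& prefs) {
    for (qtype t : prefs) {
        if (row_width % block_size(t) == 0) return t;
    }
    return qtype::f16; // last resort: keep the tensor unquantized
}

int main() {
    // Requesting q3_k for a 2432-wide tensor lands on q4_0 here,
    // since 2432 % 256 != 0 but 2432 % 32 == 0.
    qtype t = pick_quant(2432, {qtype::q3_k, qtype::q4_0, qtype::q8_0});
    return t == qtype::q4_0 ? 0 : 1;
}
```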
Edit: I found a PR that addresses a similar issue in llama.cpp: ggml-org/llama.cpp#2001