Replies: 4 comments 1 reply
-
Great work! I'm not sure what these perplexity values are, but my guess is that they are for a fraction of WikiText2 (around 10 chunks, perhaps?). In any case, you are running on the CPU, and there was a bug in the AVX2 implementation that was fixed in #5834, which I think you don't have; that might explain the higher PPL values you are observing on the master branch. Either way, I have done a complete PPL run for Mistral-7B with a context of 512 and an imatrix.
Mistral-7B PPL this PR, Final PPL = 5.9530
main: build = 2278 (7ad9511)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed = 1709531437
llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from junk.bin (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = tmp
llama_model_loader: - kv 2: llama.context_length u32 = 32768
llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
llama_model_loader: - kv 4: llama.block_count u32 = 32
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 11: general.file_type u32 = 26
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32000] = ["", "
Mistral-7B PPL master, Final PPL = 5.8807
main: build = 2329 (67be2ce1)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed = 1709537499
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4080, compute capability 8.9, VMM: yes
llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from junk.bin (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = models
llama_model_loader: - kv 2: llama.context_length u32 = 32768
llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
llama_model_loader: - kv 4: llama.block_count u32 = 32
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 11: general.file_type u32 = 26
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32000] = ["", "
I have also implemented a multiplier-based codebook; see PR #5867 for what I get with Mistral-7B.
-
@PeterReid I was intrigued by the fact that your codebook results in a slightly better perplexity for Mistral-7B compared to #5867, so I went ahead and tried it on LLaMA-v2-7B. It does not do very well there:
LLaMA-v2-7B PPL this PR = 5.2466
main: build = 2278 (7ad9511)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed = 1709545886
llama_model_loader: loaded meta data with 16 key-value pairs and 291 tensors from junk.bin (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = llama2
llama_model_loader: - kv 2: llama.context_length u32 = 4096
llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
llama_model_loader: - kv 4: llama.block_count u32 = 32
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 11008
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 32
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: general.file_type u32 = 26
llama_model_loader: - kv 11: tokenizer.ggml.model str = llama
llama_model_loader: - kv 12: tokenizer.ggml.tokens arr[str,32000] = ["", "
LLaMA-v2-7B PPL master = 5.1340
main: build = 2307 (7b629c3b)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed = 1709314323
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4080, compute capability 8.9, VMM: yes
llama_model_loader: loaded meta data with 16 key-value pairs and 291 tensors from junk.bin (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = llama2
llama_model_loader: - kv 2: llama.context_length u32 = 4096
llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
llama_model_loader: - kv 4: llama.block_count u32 = 32
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 11008
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 32
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: general.file_type u32 = 26
llama_model_loader: - kv 11: tokenizer.ggml.model str = llama
llama_model_loader: - kv 12: tokenizer.ggml.tokens arr[str,32000] = ["", "
LLaMA-v2-7B PPL #5867 = 5.2016
main: build = 2295 (8b713a98)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed = 1709458694
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4080, compute capability 8.9, VMM: yes
llama_model_loader: loaded meta data with 16 key-value pairs and 291 tensors from junk.bin (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = llama2
llama_model_loader: - kv 2: llama.context_length u32 = 4096
llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
llama_model_loader: - kv 4: llama.block_count u32 = 32
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 11008
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 32
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: general.file_type u32 = 26
llama_model_loader: - kv 11: tokenizer.ggml.model str = llama
llama_model_loader: - kv 12: tokenizer.ggml.tokens arr[str,32000] = ["", "
-
My perplexity runs were on the full ~600 chunks of wiki.test, and on a GPU, so the AVX2 bug wouldn't have affected them. I wonder if the difference is that I used the instruct-tuned Mistral? I also requantized down from Q8 because I couldn't find an fp16 ggml, so maybe that is it. I will do some more testing. I did not realize you had already done all this work and more in this direction! I bet that if you used a shuffle at the end, the performance gap would close up.
-
Ah, you used an instruct-tuned Mistral-7B; this explains the large PPL values. Are you using the official one from Mistral AI or some other random tuning? With the official Mistral-7B-Instruct-v0.2 I get:
Master, Final PPL = 6.7768, PPL after 100 chunks: 6.9156
main: build = 2282 (cb49e0f8)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed = 1709560713
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4080, compute capability 8.9, VMM: yes
llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from junk.bin (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = hf
llama_model_loader: - kv 2: llama.context_length u32 = 32768
llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
llama_model_loader: - kv 4: llama.block_count u32 = 32
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 11: general.file_type u32 = 26
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32000] = ["", "
This PR, PPL after 100 chunks: 7.0149
main: build = 2278 (7ad9511)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed = 1709561204
llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from junk.bin (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = hf
llama_model_loader: - kv 2: llama.context_length u32 = 32768
llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
llama_model_loader: - kv 4: llama.block_count u32 = 32
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 11: general.file_type u32 = 26
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32000] = ["", "
-
I have been exploring ways to improve perplexity for IQ3_S quantization while speeding it up on AVX/NEON, and I think I have found one. It uses the multiply-instead-of-codebook-lookup that I was asking about in #5676 for its speed boost, and the insight from the IQ4_NL quantization for its perplexity improvement. Unfortunately, this is not backwards compatible, because it changes the codebook. It is also not ready for merging (I think I've broken IQ3_XXS, for example), and represents me mucking around rather than making something presentable.
I started by noticing that some values appear much more often than others in the codebook. Specifically, the occurrence counts of the 8 values are: 436, 344, 327, 271, 223, 185, 112, 150. This is what you would expect from a weight distribution that is somewhat quadratic-ish, and it is the property that IQ4_NL exploits to do better than similarly-sized methods. I decided to choose codebook values following the same kind of polynomial as IQ4_NL. I fit a polynomial (0.08095843*x^3 + 0.0671659*x^2 + 11.43774359*x + 0.99047392) and ended up using the values [0, 3, 6, 9, 12, 16, 19, 23, 26, 31, 35, 40, 45, 50, 56, 62]. (I used a maximum of 62 because that is the highest value in the old codebook.)
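For reference, a minimal sketch of how those occurrence counts can be tallied. It assumes the codebook is passed in as a packed table with four byte values per uint32_t entry (in llama.cpp that would be the 512-entry iq3s_grid from ggml-common.h); the function name is just illustrative.

```c
#include <stdint.h>
#include <stdio.h>

// Tally how often each byte value occurs in a packed codebook table
// (4 values per uint32_t entry). With a 512-entry grid this walks
// 2048 values, which matches the counts quoted above.
static void count_codebook_values(const uint32_t * grid, int n_entries) {
    int counts[256] = {0};
    for (int i = 0; i < n_entries; ++i) {
        for (int b = 0; b < 4; ++b) {
            counts[(grid[i] >> (8*b)) & 0xff]++;
        }
    }
    for (int v = 0; v < 256; ++v) {
        if (counts[v] > 0) {
            printf("value %3d occurs %d times\n", v, counts[v]);
        }
    }
}
```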
For assembling these values into a codebook, I compute the value indices as (codebook_index * 0xd137151) & 0x0f0f0f0f, and then map the four 4-bit indices sitting in those four bytes to the values above. That magic number is the result of me trying out a few numbers until I found one that used each value a roughly equal number of times, not of any search computation, so it is quite possible that a better one could be found that ends up with better-spaced points. It is also possible that there is a better list of values to use, but I didn't want to overfit anything to the one model I am working with.
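In scalar code, that multiplier-based expansion looks roughly like the sketch below (the table holds the 16 values quoted above; the names are illustrative, not the identifiers used in the branch):

```c
#include <stdint.h>

// The 16 codebook values quoted above.
static const uint8_t kvalues[16] = {
    0, 3, 6, 9, 12, 16, 19, 23, 26, 31, 35, 40, 45, 50, 56, 62,
};

// Expand one codebook index into 4 weight magnitudes: one 32-bit multiply
// spreads the index into four bytes, the mask keeps a 4-bit value index
// per byte, and a table lookup maps each index to its value.
static inline void decode_group_of_4(uint32_t codebook_index, uint8_t out[4]) {
    const uint32_t idx4 = (codebook_index * 0xd137151u) & 0x0f0f0f0fu;
    for (int i = 0; i < 4; ++i) {
        out[i] = kvalues[(idx4 >> (8*i)) & 0x0f];
    }
}
```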
I have only done this for AVX so far. It does all of those operations vectorized, working on 32 weights at a time.
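For concreteness, here is a rough illustration of what the vectorized expansion could look like on AVX2; this is my own sketch under the assumption that the final table lookup is done with a byte shuffle, not the actual kernel from the branch.

```c
#include <immintrin.h>
#include <stdint.h>

// Expand 8 codebook indices into 32 weight magnitudes: one vector multiply,
// one mask, one byte shuffle. The 16-value table is replicated into both
// 128-bit lanes because vpshufb looks up within each lane independently.
static inline __m256i expand_8_indices(__m256i idx /* 8 x uint32 codebook indices */) {
    const __m256i mult   = _mm256_set1_epi32(0x0d137151);
    const __m256i mask   = _mm256_set1_epi32(0x0f0f0f0f);
    const __m256i values = _mm256_setr_epi8(
         0,  3,  6,  9, 12, 16, 19, 23, 26, 31, 35, 40, 45, 50, 56, 62,
         0,  3,  6,  9, 12, 16, 19, 23, 26, 31, 35, 40, 45, 50, 56, 62);
    const __m256i nibbles = _mm256_and_si256(_mm256_mullo_epi32(idx, mult), mask);
    return _mm256_shuffle_epi8(values, nibbles); // 32 bytes of magnitudes
}
```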
I've tested three versions.
I have not figured out why the current version of iq3_s performs worse than the baseline on these metrics. In any case, the speed improvement for my version is pretty big: 158%-210% of the original speed. Plus, the perplexity is better.
So, to summarize: this breaks backwards compatibility with existing IQ3_S-quantized files, but it seems worthwhile to pursue for performance and perplexity reasons. @ikawrakow?
The code is in https://github.com/PeterReid/llama.cpp/commits/iq3_s_quant_change_cleaned/