Description
Hi everybody,
I am trying to fine-tune a llama-2-13b-chat model, and I think I did everything correctly, but I still cannot apply my LoRA.
What I did was:
- I converted the Llama 2 weights into HF format using this script: https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/convert_llama_weights_to_hf.py
python convert_llama_weights_to_hf.py --input_dir ../models/llama-2-13b-chat --output_dir ../models/llama-2-13b-chat/llama-2-13b-chat-hf --model_size 13B
- Then I ran convert.py and converted it to fp32
python convert.py ../models/llama-2-13b-chat/llama-2-13b-chat-hf --outtype f32 --outfile ../models/llama-2-13b-chat/llama-2-13b-chat-hf-f32.bin
- Then I quantised it (I am using q5_k_m and not q4_0)
./quantize ../models/llama-2-13b-chat/llama-2-13b-chat-hf-f32.bin ../models/llama-2-13b-chat/llama-2-13b-chat-hf-quantized_q5_k_m.bin q5_k_m
At this point I tested the models with ./main and they worked perfectly.
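(Roughly like this, just as an illustration and not my exact invocation:
./main -m ../models/llama-2-13b-chat/llama-2-13b-chat-hf-quantized_q5_k_m.bin -p "What is quantum entanglement?" -n 128
and the same with the f32 file.)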
- So I created a dataset, trying to orient myself among many contradictory answers about the format; in the end I opted for the one that seemed the most widely used (a rough sketch of how I assemble the file follows the example below):
<s>[INST] <<SYS>>
In the context of physics
<</SYS>>What is quantum entanglement? [/INST] Quantum entanglement is the phenomenon that occurs when a group of particles are generated, interact, or share in such a way that the quantum state of each particle of the group cannot be described independently of the state of the others, including when the particles are separated by a large distance.
(I didn't put any </s> at the end, because for some reason the loss became NaN after fewer than 10 iterations)
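For completeness, this is roughly how I build train_llamacpp.txt (a minimal sketch only; qa_pairs is a placeholder for my real system/question/answer triples):

# sketch: write the training samples in the prompt format shown above
# (qa_pairs is illustrative; my real dataset is loaded from elsewhere)
qa_pairs = [
    ("In the context of physics",
     "What is quantum entanglement?",
     "Quantum entanglement is the phenomenon that occurs when ..."),
]

with open("train_llamacpp.txt", "w", encoding="utf-8") as f:
    for system, question, answer in qa_pairs:
        f.write(
            "<s>[INST] <<SYS>>\n"
            f"{system}\n"
            f"<</SYS>>{question} [/INST] {answer}\n"
        )
        # note: no closing </s>, since adding it made the loss go to NaN

Every sample starts with "<s>", which is why I pass --sample-start "<s>" to ./finetune.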
- I started the fine-tuning and left it running for almost 12 hours to complete 1 epoch
./finetune --model-base ../models/llama-2-13b-chat/llama-2-13b-chat-hf-quantized_q5_k_m.bin --train-data ../datasets/FineTune/train_llamacpp.txt --threads 26 --sample-start "<s>" --ctx 512 -ngl 32
- Then I tested the model
./main -i -m ../models/llama-2-13b-chat/llama-2-13b-chat-hf-quantized_q5_k_m.bin --lora-base ../models/llama-2-13b-chat/llama-2-13b-chat-hf-f32.bin --lora ../models/llama-2-13b-chat/ggml-lora-LATEST-f32.gguf --color -p "What is entanglement in physics?"
The last lines of the logs are always:
.....
llama_apply_lora_from_file_internal: unsupported tensor dimension 1
llama_init_from_gpt_params: error: failed to apply lora adapter
ggml_metal_free: deallocating
main: error: unable to load model
I am running everything on an M1 Max with 64 GB of RAM and a 32-core GPU.
What could be the problem? I have already tried several things without success, which is why I'm writing here...
Thank you for any help
Luca