Alpaca model is running very slow in llama.cpp compared to alpaca.cpp

# Prerequisites

Please answer the following questions for yourself before submitting an issue.

- [x] I am running the latest code. Development is very rapid so there are no tagged versions as of now.
- [x] I carefully followed the [README.md](https://github.com/ggerganov/llama.cpp/blob/master/README.md).
- [x] I [searched using keywords relevant to my issue](https://docs.github.com/en/issues/tracking-your-work-with-issues/filtering-and-searching-issues-and-pull-requests) to make sure that I am creating a new issue that is not already open (or closed).
- [x] I reviewed the [Discussions](https://github.com/ggerganov/llama.cpp/discussions), and have a new bug or useful enhancement to share.

# Current Behavior

The migration script `migrate-ggml-2023-03-30-pr613.py` was added just yesterday.
On top of [@madmads11's instructions for using alpaca models](https://github.com/ggerganov/llama.cpp/issues/382#issuecomment-1479091459), I ran this script to generate the final `.bin` file.

Details:
- Alpaca model used: https://huggingface.co/Pi3141/alpaca-lora-7B-ggml
- Tokenizer used: https://huggingface.co/decapoda-research/llama-7b-hf/blob/main/tokenizer.model

I only started using `llama.cpp` today to run the Alpaca model (I was using antimatter15's `alpaca.cpp` until now).

The same model, converted and loaded in `llama.cpp`, runs much slower than it does in `alpaca.cpp`.

How I started the model:
- `./main -m ./models/alpaca-7b-migrated.bin -ins --n_parts 1`

The logs:
```
main: seed = 1680346670
llama_model_load: loading model from './models/alpaca-7b-migrated.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 4096
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 11008
llama_model_load: n_parts = 1
llama_model_load: type    = 1
llama_model_load: ggml map size = 4017.70 MB
llama_model_load: ggml ctx size =  81.25 KB
llama_model_load: mem required  = 5809.78 MB (+ 1026.00 MB per state)
llama_model_load: loading tensors from './models/alpaca-7b-migrated.bin'
llama_model_load: model size =  4017.27 MB / num tensors = 291
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 16 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
main: interactive mode on.
Reverse prompt: '### Instruction:

'
sampling: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.100000
generate: n_ctx = 512, n_batch = 8, n_predict = 128, n_keep = 2
```

Additionally, I tried this bin file, which is already migrated for `llama.cpp`: https://huggingface.co/Pi3141/alpaca-lora-7B-ggml/blob/main/ggml-model-q4_1.bin. The model runs slowly in `llama.cpp` with this file as well.

One difference I noticed in the load logs of the two model files: the log above shows `f16 = 2`, while the second file reports:
`llama_model_load: f16     = 3`
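
To compare two load logs without eyeballing them, something like the sketch below could be used. It collects the `llama_model_load: key = value` lines into a dict and reports the keys whose values differ. The function names are illustrative, the second log in the example is an abbreviated stand-in (only its `f16 = 3` line comes from my run), and the `FTYPE_NAMES` mapping is my understanding of the `f16` codes at the time rather than something these logs confirm.

```python
import re

# Assumed meaning of the `f16` field in llama.cpp load logs from this period.
# This mapping is an assumption, not something printed in the logs.
FTYPE_NAMES = {0: "f32", 1: "f16", 2: "q4_0", 3: "q4_1"}

# Matches single-word keys only, e.g. "llama_model_load: n_ctx   = 512".
LINE_RE = re.compile(r"^llama_model_load:\s*(\w+)\s*=\s*(\S+)\s*$")


def parse_load_log(text):
    """Collect `llama_model_load: key = value` lines into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            fields[match.group(1)] = match.group(2)
    return fields


def diff_logs(log_a, log_b):
    """Return {key: (value_in_a, value_in_b)} for every key that differs."""
    a, b = parse_load_log(log_a), parse_load_log(log_b)
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }


# Abbreviated logs: the first is taken from the run above, the second is a
# stand-in that differs only in the f16 line.
migrated_log = """llama_model_load: n_vocab = 32000
llama_model_load: f16     = 2
llama_model_load: n_parts = 1"""
premigrated_log = migrated_log.replace("f16     = 2", "f16     = 3")

for key, (old, new) in diff_logs(migrated_log, premigrated_log).items():
    print(f"{key}: {old} -> {new}")
    if key == "f16":
        print("  assumed formats:", FTYPE_NAMES.get(int(old)), "->", FTYPE_NAMES.get(int(new)))
```

If the mapping assumption holds, the two files use different 4-bit quantization formats, which may or may not be related to the speed difference.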





# Environment and Context 
* Physical (or virtual) hardware (`lscpu` output):
```

Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         39 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  16
  On-line CPU(s) list:   0-15
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz
    CPU family:          6
    Model:               158
    Thread(s) per core:  2
    Core(s) per socket:  8
    Socket(s):           1
    Stepping:            13
    CPU(s) scaling MHz:  98%
    CPU max MHz:         5000,0000
    CPU min MHz:         800,0000
    BogoMIPS:            7202,00
Virtualization features: 
  Virtualization:        VT-x
Caches (sum of all):     
  L1d:                   256 KiB (8 instances)
  L1i:                   256 KiB (8 instances)
  L2:                    2 MiB (8 instances)
  L3:                    16 MiB (1 instance)
NUMA:                    
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-15
Vulnerabilities:         
  Itlb multihit:         KVM: Mitigation: VMX disabled
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Mitigation; Clear CPU buffers; SMT vulnerable
  Retbleed:              Mitigation; Enhanced IBRS
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl and seccomp
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Vulnerable: eIBRS with unprivileged eBPF
  Srbds:                 Mitigation; Microcode
  Tsx async abort:       Mitigation; TSX disabled
```

* System info:
```
- OS : 5.10.148-1-MANJARO
- python version : Python 3.10.9
- g++ version : g++ (GCC) 12.2.1 20230201
```
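
One thing that stands out when reading the log next to the hardware listing: `system_info` reports `n_threads = 16 / 16`, while `lscpu` shows 8 cores with 2 threads per core. I have not verified that this is the cause of the slowdown, but rerunning with one thread per physical core via the `-t` option seems worth trying. Below is a minimal sketch for deriving that number from `lscpu` output; the helper names `physical_cores` and `main_command` are hypothetical, and the sample text is a trimmed copy of the listing above.

```python
import re


def physical_cores(lscpu_text):
    """Return sockets * cores-per-socket parsed from `lscpu` output."""

    def field(name):
        pattern = rf"^\s*{re.escape(name)}:\s*(\d+)\s*$"
        match = re.search(pattern, lscpu_text, re.MULTILINE)
        if match is None:
            raise ValueError(f"field not found: {name}")
        return int(match.group(1))

    return field("Socket(s)") * field("Core(s) per socket")


def main_command(model_path, threads):
    """Build the ./main invocation from this issue plus an explicit -t value."""
    return ["./main", "-m", model_path, "-ins", "--n_parts", "1", "-t", str(threads)]


# Trimmed copy of the lscpu output shown in this issue.
sample = """\
CPU(s):                  16
    Thread(s) per core:  2
    Core(s) per socket:  8
    Socket(s):           1
"""

cores = physical_cores(sample)
print(" ".join(main_command("./models/alpaca-7b-migrated.bin", cores)))
# prints: ./main -m ./models/alpaca-7b-migrated.bin -ins --n_parts 1 -t 8
```

Comparing generation speed between this command and the original one (which used all 16 logical CPUs) would show whether the thread count matters here.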
