Description
Expected Behavior
I can load a 13B model and generate text with it at a decent token generation speed on an M1 Pro CPU (16 GB RAM).
Current Behavior
When I load a 13B model with llama.cpp (like Alpaca 13B or other models based on it) and try to generate some text, each token takes several seconds to generate, to the point that these models are unusably slow. However, the same models run at a reasonable speed with Dalai, which uses an older version of llama.cpp.
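For reference, a typical invocation that shows the slowdown looks roughly like this (the model path and parameter values here are just an example, not my exact command):

  # example paths/flags only - substitute the actual model file and settings
  ./main -m ./models/alpaca-13B/ggml-model-q4_0.bin -p "Hello" -n 128 -t 8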
Environment and Context
MacBook Pro with M1 Pro, 16 GB RAM, macOS Ventura 13.3.
Python 3.9.16
GNU Make 3.81
Apple clang version 14.0.3 (clang-1403.0.22.14.1)
Target: arm64-apple-darwin22.4.0
Thread model: posix
If you need any logs or other information, I will post everything you need. Thanks in advance.