
Token generation is extremely slow when using 13B models on an M1 Pro with llama.cpp, but it runs at a fine speed with Dalai (which uses an older version of llama.cpp) #767

Closed
@serovar

Description


Expected Behavior

A 13B model should load and generate text at a reasonable token generation speed on an M1 Pro CPU (16 GB RAM).

Current Behavior

When I load a 13B model with llama.cpp (such as Alpaca 13B or other models based on it) and try to generate text, each token takes several seconds to generate, to the point that these models are unusably slow. The same models run at a reasonable speed with Dalai, which uses an older version of llama.cpp.
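
The exact command isn't included in the report; a typical invocation on this kind of setup might look like the following sketch (the model path, prompt, and parameters are placeholders, not the values actually used):

```sh
# Hypothetical reproduction command; the model path, prompt, and
# parameters are examples, not the exact values from this report.
./main -m ./models/alpaca-13b/ggml-model-q4_0.bin \
  -p "Tell me about alpacas." \
  -n 128 -t 8
```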

Environment and Context

MacBook Pro with M1 Pro, 16 GB RAM, macOS Ventura 13.3.

Python 3.9.16

GNU Make 3.81

Apple clang version 14.0.3 (clang-1403.0.22.14.1)
Target: arm64-apple-darwin22.4.0
Thread model: posix

If you need any logs or other information, I will post whatever you need. Thanks in advance.
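
If timing data would help, one way to collect it is the summary llama.cpp prints when a run finishes (a sketch, using the same placeholder model path and parameters as above):

```sh
# Sketch: save a run's output and pull out the end-of-run timing summary
# (llama_print_timings), which reports per-token eval time.
./main -m ./models/alpaca-13b/ggml-model-q4_0.bin -p "Hello" -n 64 -t 8 2>&1 | tee run.log
grep "llama_print_timings" run.log
```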
