
Token generation is extremely slow when using 13B models on an M1 Pro with llama.cpp, but it runs at a fine speed with Dalai (which uses an older version of llama.cpp) #767

Closed
@serovar

Description


Expected Behavior

A 13B model should load and generate text at a reasonable token generation speed on an M1 Pro CPU (16 GB RAM).

Current Behavior

When I load a 13B model with llama.cpp (such as Alpaca 13B or other models based on it) and try to generate text, each token takes several seconds to generate, to the point that these models are unusably slow. The same models run at a reasonable speed with Dalai, which uses an older version of llama.cpp.
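
The exact command isn't included in the report; a typical invocation on this kind of setup might look like the following sketch (the model path, prompt, and parameters are placeholders, not the values actually used):

```sh
# Hypothetical reproduction command; the model path, prompt, and
# parameters are examples, not the exact values from this report.
./main -m ./models/alpaca-13b/ggml-model-q4_0.bin \
  -p "Tell me about alpacas." \
  -n 128 -t 8
```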

Environment and Context

MacBook Pro with M1 Pro, 16 GB RAM, macOS Ventura 13.3.

Python 3.9.16

GNU Make 3.81

Apple clang version 14.0.3 (clang-1403.0.22.14.1)
Target: arm64-apple-darwin22.4.0
Thread model: posix

If you need any logs or other information, I will post whatever you need. Thanks in advance.
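
If timing data would help, one way to collect it is the summary llama.cpp prints when a run finishes (a sketch, using the same placeholder model path and parameters as above):

```sh
# Sketch: save a run's output and pull out the end-of-run timing summary
# (llama_print_timings), which reports per-token eval time.
./main -m ./models/alpaca-13b/ggml-model-q4_0.bin -p "Hello" -n 64 -t 8 2>&1 | tee run.log
grep "llama_print_timings" run.log
```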
