Description
Expected Behavior
I am comparing the performance of two executables: llama.cpp (current version) and the default gpt4all executable (which bundles a previous version of llama.cpp). I am using the same language model for both, and I expect the current llama.cpp, built specifically for my hardware, to perform at least as fast as the default gpt4all executable.
Current Behavior
The default gpt4all executable, which uses a previous version of llama.cpp, is significantly faster. Despite being built with hardware-specific compiler flags, the current version of llama.cpp is consistently and significantly slower on the same model.
Environment and Context
I am running the comparison on Windows, using the default gpt4all executable and the current version of llama.cpp included in the gpt4all project. The llama.cpp build is the latest available (after compatibility with the gpt4all model was added).
Steps to Reproduce
- Build the current version of llama.cpp with hardware-specific compiler flags.
- Execute the llama.cpp executable using the gpt4all language model and record the performance metrics.
- Execute the default gpt4all executable (previous version of llama.cpp) using the same language model and record the performance metrics.
- You'll see that the gpt4all executable generates output significantly faster, for every thread count and configuration I have tried.
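The steps above can be sketched as shell commands. The model filename, thread count, prompt, and the gpt4all binary name are assumptions and will vary by setup; check each project's README for the exact build options for your version.

```shell
# 1. Build the current llama.cpp with hardware-specific flags
#    (the build enables AVX/AVX2/FMA where the toolchain supports them)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release

# 2. Run the freshly built binary against the gpt4all model and note tokens/s
#    (model path and prompt are placeholders)
./build/bin/main -m ./models/gpt4all-lora-quantized.bin -t 8 -n 128 -p "Hello"

# 3. Run the prebuilt gpt4all chat binary with the same model and compare
#    (binary name as shipped in the nomic-ai/gpt4all chat directory for Windows)
./gpt4all-lora-quantized-win64.exe -m ./gpt4all-lora-quantized.bin -t 8
```

Keeping the model file, thread count, and generation length identical across both runs is what makes the comparison meaningful.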
Here is some context/config from my runs (screenshot: left panel is the latest llama.cpp, right panel is the gpt4all build).
This is the older version that gpt4all uses (with some tweaks): https://github.com/zanussbaum/gpt4all.cpp
To quickly test the difference yourself, you can use the gpt4all default binaries here: https://github.com/nomic-ai/gpt4all/tree/main/chat
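If you want a number rather than eyeballing output speed, a minimal timing harness like the sketch below can wrap both executables. The binary paths, model path, flags, and prompt are assumptions to adjust for your setup; the throughput figure is simply a fixed generation budget divided by wall-clock time, not the per-token timings the binaries print themselves.

```python
import subprocess
import time


def tokens_per_sec(n_tokens: int, elapsed_s: float) -> float:
    """Throughput as generated tokens divided by wall-clock seconds."""
    return n_tokens / elapsed_s


def time_run(cmd: list) -> float:
    """Run a command to completion and return elapsed wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True, capture_output=True)
    return time.perf_counter() - start


def compare(runs: dict, n_tokens: int) -> dict:
    """Time each labeled command and return {label: tokens/sec}."""
    results = {}
    for name, cmd in runs.items():
        results[name] = tokens_per_sec(n_tokens, time_run(cmd))
    return results


# Example usage (binaries and model must exist locally; paths are placeholders):
#   N = 128
#   model = "gpt4all-lora-quantized.bin"
#   prompt = "Explain what a transformer is."
#   compare({
#       "llama.cpp (current)": ["./llama.cpp/main", "-m", model, "-t", "8",
#                               "-n", str(N), "-p", prompt],
#       "gpt4all (bundled)":   ["./gpt4all-lora-quantized-win64.exe", "-m", model,
#                               "-t", "8", "-n", str(N), "-p", prompt],
#   }, n_tokens=N)
```

Using the same fixed `-n` budget for both runs keeps the wall-clock comparison fair even if the two builds print their timings in different formats.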