Performance of llama.cpp with Vulkan #10879

netrunnereve · 2024-12-18T03:56:09Z

netrunnereve
Dec 18, 2024
Collaborator

This is similar to the Apple Silicon benchmark thread, but for Vulkan! Many improvements have been made to the Vulkan backend and I think it's good to consolidate and discuss our results here.

We'll be testing the Llama 2 7B model like the other thread to keep things consistent, and use Q4_0 as it's simple to compute and small enough to fit on a 4GB GPU. You can download it here.

Instructions

Either run the commands below or download one of our Vulkan releases. If you have multiple GPUs please run the test on a single GPU using -sm none -mg YOUR_GPU_NUMBER unless the model is too big to fit in VRAM.

wget https://huggingface.co/TheBloke/Llama-2-7B-GGUF/resolve/main/llama-2-7b.Q4_0.gguf
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
mkdir build
cd build
cmake .. -DGGML_VULKAN=on -DCMAKE_BUILD_TYPE=Release
make
./bin/llama-bench -m ../../llama-2-7b.Q4_0.gguf -ngl 100 -fa 0,1 (add any extra options here)

Share your llama-bench results along with the git hash and Vulkan info string in the comments. Feel free to try other models and compare backends, but only valid runs will be placed on the scoreboard.
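If you want to collect results from this thread into a spreadsheet or script, the rows can be parsed mechanically. The following is only a minimal sketch: it assumes the rows are tab-separated as they appear when pasted here, with the test name in the second-to-last column and `mean ± stddev` in the last one. `parse_bench_rows` is a hypothetical helper, not something shipped with llama.cpp.

```python
def parse_bench_rows(text):
    """Parse pasted llama-bench rows into {test_name: (mean, stddev)}.

    Assumption: columns are tab-separated, the second-to-last column is the
    test name (e.g. pp512) and the last column looks like "mean ± stddev".
    """
    results = {}
    for line in text.strip().splitlines():
        cols = [c.strip() for c in line.split("\t")]
        if len(cols) < 2 or "±" not in cols[-1]:
            continue  # skip the header row and anything that is not a result
        mean, stddev = (float(x) for x in cols[-1].split("±"))
        results[cols[-2]] = (mean, stddev)
    return results


# Sample rows copied from an RX 580 result in this thread.
sample = (
    "model\tsize\tparams\tbackend\tngl\ttest\tt/s\n"
    "llama 7B Q4_0\t3.56 GiB\t6.74 B\tVulkan\t100\tpp512\t258.03 ± 0.71\n"
    "llama 7B Q4_0\t3.56 GiB\t6.74 B\tVulkan\t100\ttg128\t39.32 ± 0.03\n"
)
print(parse_bench_rows(sample))
```

Raw llama-bench output is formatted differently from the pasted rows, so the splitting logic would need adjusting for it.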

If multiple entries are posted for the same device, newer commits with substantial Vulkan updates are prioritized; otherwise the entry with the highest tg128 score is used. Performance may vary depending on driver, operating system, board manufacturer, etc., even if the chip is the same. For integrated graphics, note that your memory speed and number of channels will greatly affect your inference speed!
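The fallback part of this rule (highest tg128 wins) can be written down in a few lines. This is a minimal sketch using two Radeon Pro VII results posted later in this thread as sample data. `pick_scoreboard_entry` is a hypothetical helper, and judging whether a newer commit contains substantial Vulkan updates stays a manual decision that this does not model.

```python
def pick_scoreboard_entry(entries):
    """Return the entry with the highest tg128 mean (the fallback rule only)."""
    return max(entries, key=lambda e: e["tg128"])


# Two Radeon Pro VII submissions from this thread (mean t/s).
entries = [
    {"poster": "0cc4m", "pp512": 312.02, "tg128": 70.17},
    {"poster": "8XXD8", "pp512": 329.86, "tg128": 75.22},
]

best = pick_scoreboard_entry(entries)
print(best["poster"], best["tg128"])
```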

Vulkan Scoreboard for Llama 2 7B, Q4_0 (no FA)

Chip	pp512 t/s	tg128 t/s	Commit	Comments
AMD Radeon RX 7900 XTX	3419.13 ± 29.51	144.90 ± 0.92	`03f582a`
Nvidia RTX 5070 Ti	6213.63 ± 27.72	135.63 ± 0.18	`d13d0f6`	coopmat2
Nvidia RTX 3090	3301.47 ± 33.76	123.72 ± 0.14	`0d52a69`
Nvidia A100 (80GB)	3103.32 ± 4.21	121.83 ± 0.54	`d394a9a`
AMD Radeon RX 9070 XT	2336.07 ± 5.92	117.72 ± 0.34	`6f180b9`
Apple M3 Ultra Mac Studio	1116.83 ± 0.55	115.54 ± 0.78	`2d451c8`	MoltenVK
AMD Radeon RX 7800 XT	1260.54 ± 10.51	107.53 ± 0.07	`ee02ad0`
AMD Radeon RX 6900 XT	1257.98 ± 1.55	101.42 ± 0.02	`44e18ef`
AMD Radeon RX 6800 XT	1533.60 ± 2.47	95.56 ± 0.72	N/A
Nvidia RTX 4070	3179.37 ± 46.16	92.29 ± 0.28	`9a48399`
AMD Radeon PRO W6800X	510.80 ± 0.13	86.47 ± 0.46	`13b4548`	MoltenVK
AMD Radeon PRO W6800X Duo	519.14 ± 0.13	87.56 ± 0.19	`13b4548`	MoltenVK
Nvidia RTX 5060 Ti	3211.73 ± 24.44	81.48 ± 3.50	`658987c`	coopmat2
Nvidia RTX 3070	2113.02 ± 7.38	78.71 ± 0.13	`1b8fb81`
AMD Radeon Instinct MI60	369.26 ± 2.48	78.16 ± 1.40	`504af20`
AMD Radeon Instinct MI50	387.37 ± 0.33	71.46 ± 0.10	`d5fe4e8`
AMD Radeon Pro VII	612.47 ± 0.87	71.37 ± 0.98	N/A
AMD Radeon RX 5700 XT	439.42 ± 0.28	70.13 ± 0.05	`c05e8c9`
AMD Radeon Pro W5700	504.20 ± 0.14	67.18 ± 0.08	`4265a87`
Nvidia RTX 2070 SUPER	1199.13 ± 7.70	64.64 ± 0.20	`b7552cf`
Nvidia RTX 3080	1706.07 ± 139.33	62.16 ± 1.98	`4da69d1`	Result appears lower than expected, maybe non-release build?
AMD Radeon RX 7600 XT	632.88 ± 0.70	58.44 ± 0.01	`3b24d26`
AMD Radeon Instinct MI25	439.42 ± 0.34	54.69 ± 0.03	`2739a71`
Nvidia RTX 3060	1298.03 ± 23.40	54.28 ± 1.05	`6171c9d`
AMD Radeon RX 6600 XT	574.65 ± 0.86	53.92 ± 0.11	`091592d`
AMD BC-250	331.58 ± 0.06	49.76 ± 0.06	`cf2270e`
Nvidia RTX 3060 Mobile	1059.76 ± 3.54	49.03 ± 0.13	`dbb3a47`
Intel Arc A770	725.31 ± 0.98	49.43 ± 1.45	`259469c`
AMD Radeon RX 6600M	605.59 ± 0.65	48.21 ± 0.07	`fe5b78c`
AMD Radeon RX 6600	380.87 ± 0.21	47.47 ± 0.18	`0fd7ca7`
AMD Radeon RX 7600M XT	459.39 ± 2.34	45.28 ± 0.10	`b9ab0a4`	eGPU
Intel Arc B580	175.56 ± 2.65	44.12 ± 0.09	`9a48399`
Nvidia RTX 4050 Mobile	1154.28 ± 15.76	41.89 ± 0.10	`d79d8f3`
AMD Radeon RX 580	258.03 ± 0.71	39.32 ± 0.03	`de4c07f`
AMD Radeon RX 470	185.48 ± 1.17	33.94 ± 0.06	`d7a14c4`
Intel Arc B570	651.09 ± 0.16	31.44 ± 0.01	`8e186ef`
AMD FirePro W8100	154.96 ± 0.60	28.55 ± 0.17	`d7a14c4`
AMD Radeon RX 6500 XT	255.25 ± 0.35	27.81 ± 0.10	g9fdfcd
Intel Arc A750	88.86 ± 0.14	27.57 ± 0.03	`8d59d91`
Apple M3 MacBook Pro	263.70 ± 0.02	26.39 ± 0.14	`b9ab0a4`	MoltenVK
AMD FirePro S10000	94.78 ± 0.02	25.32 ± 0.02	`914a82d`	Split across dual GPUs
Intel Core Ultra 7 258V	210.27 ± 0.86	21.63 ± 0.16	`0cf6725`
AMD Ryzen AI 9 HX 370	309.35 ± 0.93	21.23 ± 0.40	`87616f0`
AMD Ryzen 7 8840HS	245.79 ± 2.97	20.10 ± 0.07	`19d3c82`
AMD Ryzen 7 7940HS	281.62 ± 1.56	19.91 ± 0.07	`ebce03e`
AMD Ryzen Z1 Extreme	199.36 ± 7.02	18.77 ± 0.02	`53ff6b9`
AMD Ryzen 7 7840U	237.73 ± 13.98	18.22 ± 0.62	`70680c4`
AMD Ryzen 5 8600G	183.35 ± 1.73	16.99 ± 0.02	`9ecf3e6`
AMD FirePro D700	69.95 ± 0.04	16.62 ± 0.01	`d3bd719`	MoltenVK, running in FP16 mode on FP32 only chip
Apple M2 MacBook Air	38.67 ± 0.03	11.07 ± 0.04	`017cc5f`	Asahi Linux
AMD Ryzen 7 5700G	90.55 ± 0.08	10.98 ± 0.07	`d84635b`
AMD Ryzen 7 5800H	90.15 ± 1.45	10.81 ± 0.14	`dbb3a47`
AMD Ryzen 5 5600H	75.60 ± 0.32	10.59 ± 0.18	`0bb2919`
AMD Ryzen 7 7730U	84.79 ± 0.88	10.23 ± 0.13	`d84635b`
AMD Ryzen 5 5600G	77.22 ± 0.01	9.34 ± 0.01	`8ae5ebc`
AMD Ryzen 5 5600U	61.82 ± 0.46	8.92 ± 0.02	`141a908`
MediaTek Dimensity 9400	38.36 ± 15.15	8.92 ± 0.06	`b9ab0a4`	GPU supports coopmat but pp512 is faster with it turned off
Intel i7-1185G7	42.02 ± 0.07	7.28 ± 0.24	`ff3fcab`
AMD Ryzen 5 3400G	46.47 ± 5.15	5.99 ± 0.71	`0893e01`
Intel Core i7-1065G7	25.58 ± 0.00	4.25 ± 0.18	N/A
Intel i5-8350U	25.28 ± 0.00	3.23 ± 0.00	`f26c874`

Vulkan Scoreboard for Llama 2 7B, Q4_0 (with FA)

Chip	pp512 t/s	tg128 t/s	Commit	Comments
AMD Radeon RX 7900 XTX	3743.15 ± 45.40	143.98 ± 0.55	`03f582a`
Nvidia RTX 5070 Ti	6614.86 ± 8.32	133.94 ± 0.02	`d13d0f6`	coopmat2
Nvidia A100 (80GB)	3164.55 ± 5.00	120.53 ± 0.41	`d394a9a`
AMD Radeon RX 9070 XT	2296.83 ± 5.22	120.46 ± 0.29	`6f180b9`
Nvidia RTX 3090	4516.92 ± 9.55	120.44 ± 2.58	N/A	coopmat2
Nvidia RTX 4070	4293.57 ± 27.70	91.49 ± 0.89	`9a48399`	coopmat2
Nvidia RTX 5060 Ti	3492.22 ± 15.73	83.26 ± 2.03	`658987c`	coopmat2
AMD Radeon RX 7600 XT	586.16 ± 2.43	59.02 ± 0.03	`3b24d26`
Intel Arc A770	327.58 ± 0.19	48.17 ± 0.04	`259469c`
Intel Arc B570	342.71 ± 0.07	30.88 ± 0.01	`8e186ef`
AMD Ryzen 5 8600G	188.84 ± 0.73	16.57 ± 0.26	`9ecf3e6`

netrunnereve · 2024-12-18T03:58:41Z

netrunnereve
Dec 18, 2024
Collaborator Author

AMD FirePro W8100

ggml_vulkan: 0 = AMD Radeon FirePro W8100 (RADV HAWAII) (radv) | uma: 0 | fp16: 0 | warp size: 64 | matrix cores: none
build: 4da69d1a (4351)

model	size	params	backend	ngl	threads	sm	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	8	none	pp512	137.10 ± 0.44
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	8	none	tg128	28.51 ± 0.12

1 reply

netrunnereve May 1, 2025
Collaborator Author

With the latest updates:

ggml_vulkan: 0 = AMD Radeon FirePro W8100 (RADV HAWAII) (radv) | uma: 0 | fp16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
build: d7a14c42 (5252)

model	size	params	backend	ngl	threads	sm	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	8	none	pp512	154.96 ± 0.60
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	8	none	tg128	28.55 ± 0.17

netrunnereve · 2024-12-18T04:00:36Z

netrunnereve
Dec 18, 2024
Collaborator Author

AMD RX 470

ggml_vulkan: 1 = AMD Radeon RX 470 Graphics (RADV POLARIS10) (radv) | uma: 0 | fp16: 0 | warp size: 64 | matrix cores: none
build: 4da69d1a (4351)

model	size	params	backend	ngl	threads	main_gpu	sm	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	8	1	none	pp512	161.47 ± 0.43
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	8	1	none	tg128	33.45 ± 0.04

1 reply

netrunnereve May 1, 2025
Collaborator Author

With the latest updates:

ggml_vulkan: 1 = AMD Radeon RX 470 Graphics (RADV POLARIS10) (radv) | uma: 0 | fp16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
build: d7a14c42 (5252)

model	size	params	backend	ngl	threads	main_gpu	sm	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	8	1	none	pp512	185.48 ± 1.17
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	8	1	none	tg128	33.94 ± 0.06

max-krasnyansky · 2024-12-18T05:09:04Z

max-krasnyansky
Dec 18, 2024
Collaborator

Ubuntu 24.04, Vulkan and CUDA installed from official APT packages.

ggml_vulkan: 0 = NVIDIA GeForce RTX 3080 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | matrix cores: KHR_coopmat

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	pp512	1706.07 ± 139.33
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	tg128	62.16 ± 1.98

build: 4da69d1 (4351)

vs CUDA on the same build/setup

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	pp512	4499.47 ± 60.66
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	tg128	131.01 ± 0.43

build: 4da69d1 (4351)
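To make the gap between the two backends concrete, the mean t/s values from the two tables above can be turned into ratios. This is just arithmetic on the posted numbers and says nothing about the cause of the gap.

```python
# Mean t/s from the RTX 3080 tables above (same build, same machine).
vulkan = {"pp512": 1706.07, "tg128": 62.16}
cuda = {"pp512": 4499.47, "tg128": 131.01}

# Fraction of CUDA throughput that the Vulkan backend reaches per test.
ratios = {test: vulkan[test] / cuda[test] for test in vulkan}

for test, ratio in ratios.items():
    print(f"{test}: Vulkan reaches {ratio:.0%} of CUDA")
```

On these numbers Vulkan reaches roughly 38% of CUDA in prompt processing and 47% in token generation.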

0 replies

hkbu-kennycheng · 2025-01-08T02:57:11Z

hkbu-kennycheng
Jan 8, 2025

MacBook Air M2 on Asahi Linux

ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Apple M2 (G14G B0) (Honeykrisp) | uma: 1 | fp16: 1 | warp size: 32 | matrix cores: none

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	pp512	38.67 ± 0.03
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	tg128	11.07 ± 0.04

build: 017cc5f

3 replies

ericcurtin Jan 14, 2025
Collaborator

For the record, I think the slowness here is on the Honeykrisp driver side rather than in llama.cpp.

tsugabloom Mar 29, 2025

Can you share how you got vulkan to build on Asahi? I can't seem to get cmake to notice it.

cmake -B build -DGGML_CPU_AARCH64=OFF -DGGML_VULKAN=1
-- ccache found, compilation results will be cached. Disable with GGML_CCACHE=OFF.
-- CMAKE_SYSTEM_PROCESSOR: aarch64
-- Including CPU backend
-- ARM detected
-- ARM -mcpu not found, -mcpu=native will be used
-- ARM feature DOTPROD enabled
-- ARM feature MATMUL_INT8 enabled
-- ARM feature FMA enabled
-- Adding CPU backend variant ggml-cpu: -mcpu=native+dotprod+i8mm+nosve+nosme 
CMake Error at /usr/share/cmake-3.30/Modules/FindPackageHandleStandardArgs.cmake:233 (message):
  Could NOT find Vulkan (missing: Vulkan_LIBRARY) (found version "1.3.296")
Call Stack (most recent call first):
  /usr/share/cmake-3.30/Modules/FindPackageHandleStandardArgs.cmake:603 (_FPHSA_FAILURE_MESSAGE)
  /usr/share/cmake-3.30/Modules/FindVulkan.cmake:595 (find_package_handle_standard_args)
  ggml/src/ggml-vulkan/CMakeLists.txt:4 (find_package)


-- Configuring incomplete, errors occurred!

tsugabloom Mar 29, 2025

Spoke too soon, got it working! cmake -B build -DGGML_CPU_AARCH64=OFF -DGGML_VULKAN=1 -DVulkan_LIBRARY=/usr/lib64/libvulkan.so.1

hkbu-kennycheng · 2025-01-08T03:22:16Z

hkbu-kennycheng
Jan 8, 2025

Gentoo Linux on ROG Ally (2023) Ryzen Z1 Extreme

ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV GFX1103_R1) (radv) | uma: 1 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	pp512	199.36 ± 7.02
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	tg128	18.77 ± 0.02

build: 53ff6b9

0 replies

hkbu-kennycheng · 2025-01-08T10:35:31Z

hkbu-kennycheng
Jan 8, 2025

ggml_vulkan: Found 4 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat
ggml_vulkan: 1 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat
ggml_vulkan: 2 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat
ggml_vulkan: 3 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	pp512	1545.39 ± 6.58
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	tg128	88.12 ± 1.06

build: 53ff6b9

4 replies

0cc4m Jan 8, 2025
Collaborator

Cool setup! Could you also post the result of 1, 2 and 3 7900 XTX GPUs? You can use only the first GPU with export GGML_VK_VISIBLE_DEVICES=0, the first two with export GGML_VK_VISIBLE_DEVICES=0,1 and so on.
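For a sweep like this, the device lists and command lines can be generated rather than typed by hand. This is a minimal sketch: `visible_device_sets` and `bench_commands` are hypothetical helpers, and the binary and model paths are assumptions that need adjusting to your checkout. Only the `GGML_VK_VISIBLE_DEVICES` variable and the `-m` / `-ngl` options come from this thread.

```python
def visible_device_sets(n_gpus):
    """Return "0", "0,1", ... up to n_gpus devices, for GGML_VK_VISIBLE_DEVICES."""
    return [",".join(str(i) for i in range(k)) for k in range(1, n_gpus + 1)]


def bench_commands(n_gpus, model="llama-2-7b.Q4_0.gguf"):
    # The binary path is an assumption (a build directory named "build").
    return [
        f"env GGML_VK_VISIBLE_DEVICES={devs} ./build/bin/llama-bench -m {model} -ngl 100"
        for devs in visible_device_sets(n_gpus)
    ]


for cmd in bench_commands(4):
    print(cmd)
```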

hkbu-kennycheng Jan 8, 2025

env GGML_VK_VISIBLE_DEVICES=0 ./build/bin/llama-bench -m llama-2-7b.Q4_0.gguf -ngl 100
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	pp512	2022.59 ± 10.08
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	tg128	136.24 ± 0.30

env GGML_VK_VISIBLE_DEVICES=1 ./build/bin/llama-bench -m llama-2-7b.Q4_0.gguf -ngl 100
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	pp512	2039.24 ± 18.08
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	tg128	140.68 ± 2.09

env GGML_VK_VISIBLE_DEVICES=2 ./build/bin/llama-bench -m llama-2-7b.Q4_0.gguf -ngl 100
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	pp512	2062.17 ± 5.36
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	tg128	143.99 ± 0.23

env GGML_VK_VISIBLE_DEVICES=3 ./build/bin/llama-bench -m llama-2-7b.Q4_0.gguf -ngl 100
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	pp512	1997.04 ± 5.78
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	tg128	136.98 ± 1.73

env GGML_VK_VISIBLE_DEVICES=0,1 ./build/bin/llama-bench -m llama-2-7b.Q4_0.gguf -ngl 100
ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat
ggml_vulkan: 1 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	pp512	1668.19 ± 12.78
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	tg128	100.62 ± 0.66

env GGML_VK_VISIBLE_DEVICES=0,1,2 ./build/bin/llama-bench -m llama-2-7b.Q4_0.gguf -ngl 100
ggml_vulkan: Found 3 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat
ggml_vulkan: 1 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat
ggml_vulkan: 2 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	pp512	1566.38 ± 8.01
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	tg128	97.96 ± 1.13

env GGML_VK_VISIBLE_DEVICES=0,1,2,3 ./build/bin/llama-bench -m llama-2-7b.Q4_0.gguf -ngl 100
ggml_vulkan: Found 4 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat
ggml_vulkan: 1 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat
ggml_vulkan: 2 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat
ggml_vulkan: 3 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	pp512	1484.04 ± 6.01
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	tg128	91.48 ± 0.63
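To put the scaling in numbers, here is a small sketch that normalizes each multi-GPU run to the single-GPU result for device 0. The figures are the means posted above; picking device 0 as the baseline is an arbitrary choice, since the four cards differ slightly from each other.

```python
# Mean t/s copied from the runs above (GPU 0 alone, then 2, 3 and 4 GPUs).
runs = {
    1: {"pp512": 2022.59, "tg128": 136.24},
    2: {"pp512": 1668.19, "tg128": 100.62},
    3: {"pp512": 1566.38, "tg128": 97.96},
    4: {"pp512": 1484.04, "tg128": 91.48},
}

baseline = runs[1]
# Throughput of each configuration relative to the single-GPU baseline.
relative = {
    n: {test: value / baseline[test] for test, value in result.items()}
    for n, result in runs.items()
}

for n, rel in relative.items():
    print(f"{n} GPU(s): pp512 {rel['pp512']:.2f}x, tg128 {rel['tg128']:.2f}x")
```

On these numbers, four GPUs reach about 0.73x the prompt processing speed and 0.67x the token generation speed of one GPU. Splitting a model that already fits on a single card costs throughput here.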

netrunnereve Jan 8, 2025
Collaborator Author

For this multi GPU case getting Vulkan to support #6017 pipeline parallelism might help improve the prompt processing speed.

hkbu-kennycheng Jan 9, 2025

@netrunnereve I updated the commit id in all my results.

0cc4m · 2025-01-08T11:04:08Z

0cc4m
Jan 8, 2025
Collaborator

build: 0d52a69 (4439)

NVIDIA GeForce RTX 3090 (NVIDIA)

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	pp512	3301.47 ± 33.76
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	tg128	123.72 ± 0.14

AMD Radeon RX 6800 XT (RADV NAVI21) (radv)

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	pp512	863.03 ± 0.70
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	tg128	91.59 ± 0.40

AMD Radeon (TM) Pro VII (RADV VEGA20) (radv)

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	pp512	312.02 ± 0.97
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	tg128	70.17 ± 0.25

Intel(R) Arc(tm) A770 Graphics (DG2) (Intel open-source Mesa driver)

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	pp512	95.52 ± 0.12
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	tg128	44.49 ± 0.03

0 replies

0cc4m · 2025-01-08T11:08:46Z

0cc4m
Jan 8, 2025
Collaborator

@netrunnereve Some of the tg results here are a little low, I think they might be debug builds. The cmake step (at least on Linux) might require cmake .. -DGGML_VULKAN=on -DCMAKE_BUILD_TYPE=Release

2 replies

netrunnereve Jan 8, 2025
Collaborator Author

I've added -DCMAKE_BUILD_TYPE=Release to the post, but honestly I've always built without this flag for both Vulkan and CPU backends and never noticed a difference in performance. Having Release set might strip the debug symbols but it shouldn't affect the compiler optimizations.

My release numbers for the RX 470 are basically identical to the ones I posted earlier without the flag.

model	size	params	backend	ngl	threads	main_gpu	sm	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	8	1	none	pp512	160.08 ± 0.38
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	8	1	none	tg128	33.41 ± 0.15

0cc4m Jan 8, 2025
Collaborator

Maybe not in your case, but some other results are suspiciously low in tg (for example the RTX 3080)

qnixsynapse · 2025-01-09T02:41:52Z

qnixsynapse
Jan 9, 2025
Collaborator

Build: 8d59d91 (4450)
ggml_vulkan: 0 = Intel(R) Arc(tm) A750 Graphics (DG2) (Intel open-source Mesa driver) | uma: 0 | fp16: 1 | warp size: 32 | matrix cores: none

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	pp512	88.86 ± 0.14
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	tg128	27.57 ± 0.03

Lack of proper Xe coopmat support in the ANV driver is a setback honestly.
Compared to SYCL:

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	SYCL	99	pp512	1616.11 ± 5.28
llama 7B Q4_0	3.56 GiB	6.74 B	SYCL	99	tg128	36.64 ± 0.05

edit: retested both with the default batch size.

8 replies

0cc4m Jan 10, 2025
Collaborator

> They do have vtune but it needs a third party kernel module to run which I don't like tbh.
>
> Also, I don't know whether it supports Vulkan apps or not. But it does seem to support opencl.

I put my A770 into a Windows PC and gave Intel GPA and vtune a shot: GPA just crashes most of the time, I couldn't get it to trace anything useful. vtune works, but does not support Vulkan. It just shows some high-level metrics in that case, not really useful sadly.

qnixsynapse Jan 11, 2025
Collaborator

> Your Vulkan tg result is lower than expected, can you retry with the cmake build type set like in the updated instructions? It might be due to a debug build.

I did build it with cmake with build type Release.

0cc4m Jan 11, 2025
Collaborator

In that case it's something else, cause it should be performing similarly to my A770. I suspect the mesa version, there was something in newer mesa versions that slowed down tg on Intel.

qnixsynapse Jan 11, 2025
Collaborator

A750 has 448 CUs, A770 has 512 CUs I think. Personally, I am not worried about tg. I am worried about pp here. The gemm batch quickly saturates my GPU.

qnixsynapse Feb 9, 2025
Collaborator

@0cc4m https://gitlab.freedesktop.org/mesa/mesa/-/issues/12585

0cc4m · 2025-01-09T15:32:01Z

0cc4m
Jan 9, 2025
Collaborator

Here's something exotic: An AMD FirePro S10000 dual GPU from 2012 with 2x 3GB GDDR5.

build: 914a82d (4452)

ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = AMD FirePro W8000 (RADV TAHITI) (radv) | uma: 0 | fp16: 0 | warp size: 64 | matrix cores: none
ggml_vulkan: 1 = AMD FirePro W8000 (RADV TAHITI) (radv) | uma: 0 | fp16: 0 | warp size: 64 | matrix cores: none

model	size	params	backend	ngl	threads	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	8	pp512	94.78 ± 0.02
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	8	tg128	25.32 ± 0.02

1 reply

netrunnereve Jan 9, 2025
Collaborator Author

Very interesting, and looks like it's pretty close to the W8100 in tg despite being a dual GPU card. Your backend scales pretty well with layer splitting which is why I find it worthwhile to run my RX470 and W8100 together (I end up getting results that are close to the average of both cards).

ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon FirePro W8100 (RADV HAWAII) (radv) | uma: 0 | fp16: 0 | warp size: 64 | matrix cores: none
ggml_vulkan: 1 = AMD Radeon RX 470 Graphics (RADV POLARIS10) (radv) | uma: 0 | fp16: 0 | warp size: 64 | matrix cores: none

model	size	params	backend	ngl	threads	main_gpu	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	8	1	pp512	147.84 ± 0.38
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	8	1	tg128	30.77 ± 0.00
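One way to reason about the "close to the average" observation: if the layers are split evenly and the two cards work one after the other on each token, the per-token time is the mean of the two per-token times, which makes the combined throughput the harmonic mean of the individual rates. Both of those are assumptions made for this sketch (the real split ratio and transfer overhead are not stated here), and the single-card figures are the earlier 4da69d1a results for the W8100 and RX 470, which may come from a different build than the combined run.

```python
def harmonic_mean(rates):
    """Combined t/s if each device handles an equal share of the work in sequence."""
    return len(rates) / sum(1.0 / r for r in rates)


# Single-card means (t/s): W8100 first, RX 470 second.
single = {"pp512": [137.10, 161.47], "tg128": [28.51, 33.45]}
# Combined layer-split run from the table above.
measured = {"pp512": 147.84, "tg128": 30.77}

for test, rates in single.items():
    predicted = harmonic_mean(rates)
    print(f"{test}: predicted {predicted:.2f} t/s, measured {measured[test]:.2f} t/s")
```

Under these assumptions the estimate lands within about half a token per second of the measured values for both tests.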

vkhodygo · 2025-01-10T12:21:36Z

vkhodygo
Jan 10, 2025

Latest Arch with Vulkan Instance Version: 1.4.303 on an i7-1185G7 laptop. The config is not completely stock: I had to deal with thermals ages ago to boost the performance, so it doesn't throttle.

For the sake of consistency I run every bit in a script and also build every target from scratch (for some reason cmake doesn't want to clean everything):

kill -STOP -1

timeout 240s $COMMAND

kill -CONT -1

Vulkan only:

ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Iris(R) Xe Graphics (TGL GT2) (Intel open-source Mesa driver) | uma: 1 | fp16: 1 | warp size: 32 | matrix cores: none

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	pp512	42.02 ± 0.07
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	tg128	7.28 ± 0.24

build: ff3fcab (4459)

Vulkan and OpenBLAS w/ default 4 threads:

model	size	params	backend	threads	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan,BLAS	4	pp512	42.05 ± 0.04
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan,BLAS	4	tg128	7.35 ± 0.26

This bit seems to underutilise both GPU and CPU in real conditions based on top activities.

Vulkan and OpenBLAS w/ default 8 threads:

model	size	params	backend	threads	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan,BLAS	8	pp512	41.89 ± 0.06
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan,BLAS	8	tg128	7.22 ± 0.20

3 replies

0cc4m Jan 10, 2025
Collaborator

Unless you reduce the number of GPU layers, threads and openblas/non-openblas is not gonna make any difference. Try it with ngl 0, then only prompt processing is accelerated using Vulkan, the rest runs on CPU. This is often a good setting for integrated GPUs.

vkhodygo Jan 10, 2025

That's something I didn't think about, with -ngl 0 it goes like this:

model	size	params	backend	threads	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan,BLAS	4	pp512	30.51 ± 0.25
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan,BLAS	4	tg128	9.87 ± 0.05

build: ba8a1f9 (4460)

model	size	params	backend	threads	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan,BLAS	8	pp512	32.11 ± 0.45
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan,BLAS	8	tg128	9.49 ± 0.18

vkhodygo Feb 5, 2025

It seems the latest patches have improved the results a bit:

ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Iris(R) Xe Graphics (TGL GT2) (Intel open-source Mesa driver) | uma: 1 | fp16: 1 | warp size: 32 | matrix cores: none

model	size	params	backend	ngl	threads	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	1	pp512	50.86 ± 0.03
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	1	tg128	8.30 ± 0.05
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	2	pp512	50.90 ± 0.01
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	2	tg128	8.11 ± 0.25
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	4	pp512	50.91 ± 0.02
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	4	tg128	7.99 ± 0.25
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	8	pp512	50.89 ± 0.04
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	8	tg128	7.92 ± 0.24

0cc4m · 2025-01-10T20:27:15Z

0cc4m
Jan 10, 2025
Collaborator

Intel ARC A770 on Windows:

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	pp512	314.24 ± 1.04
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	tg128	45.22 ± 0.25

build: ba8a1f9 (4460)

0 replies

8XXD8 · 2025-01-11T12:48:55Z

8XXD8
Jan 11, 2025

Single GPU Vulkan

Radeon Instinct MI25

ggml_vulkan: 0 = AMD Radeon Instinct MI25 (RADV VEGA10) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	pp512	439.42 ± 0.34
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	tg128	54.69 ± 0.03

build: 2739a71 (4461)

Radeon PRO VII

ggml_vulkan: 0 = AMD Radeon Pro VII (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	pp512	329.86 ± 0.80
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	tg128	75.22 ± 0.05

build: 2739a71 (4461)

Multi GPU Vulkan

ggml_vulkan: 0 = AMD Radeon Pro VII (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none
ggml_vulkan: 1 = AMD Radeon Pro VII (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none
ggml_vulkan: 2 = AMD Radeon Pro VII (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	pp512	324.55 ± 0.55
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	tg128	38.39 ± 0.09

build: 2739a71 (4461)

ggml_vulkan: 0 = AMD Radeon Pro VII (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none
ggml_vulkan: 1 = AMD Radeon Instinct MI25 (RADV VEGA10) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none
ggml_vulkan: 2 = AMD Radeon Instinct MI25 (RADV VEGA10) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none
ggml_vulkan: 3 = AMD Radeon Pro VII (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none
ggml_vulkan: 4 = AMD Radeon Pro VII (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none

model	size	params	backend	ngl	test	t/s
llama 70B Q5_K - Medium	46.51 GiB	70.55 B	Vulkan	100	pp512	32.29 ± 0.04
llama 70B Q5_K - Medium	46.51 GiB	70.55 B	Vulkan	100	tg128	4.75 ± 0.00

build: 2739a71 (4461)

Single GPU ROCm

Device 0: AMD Radeon Instinct MI25, compute capability 9.0, VMM: no

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	100	pp512	409.83 ± 0.23
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	100	tg128	63.94 ± 0.06

build: 2739a71 (4461)

Device 0: AMD Radeon Pro VII, compute capability 9.0, VMM: no

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	100	pp512	1064.99 ± 1.18
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	100	tg128	87.45 ± 0.04

build: 2739a71 (4461)

Multi GPU ROCm

Device 0: AMD Radeon Pro VII, compute capability 9.0, VMM: no
Device 1: AMD Radeon Pro VII, compute capability 9.0, VMM: no
Device 2: AMD Radeon Pro VII, compute capability 9.0, VMM: no

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	100	pp512	1061.87 ± 0.26
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	100	tg128	81.49 ± 0.41

build: 2739a71 (4461)

Layer split
Device 0: AMD Radeon Pro VII, compute capability 9.0, VMM: no
Device 1: AMD Radeon Pro VII, compute capability 9.0, VMM: no
Device 2: AMD Radeon Pro VII, compute capability 9.0, VMM: no
Device 3: AMD Radeon Instinct MI25, compute capability 9.0, VMM: no
Device 4: AMD Radeon Instinct MI25, compute capability 9.0, VMM: no

model	size	params	backend	ngl	test	t/s
llama 70B Q5_K - Medium	46.51 GiB	70.55 B	ROCm	100	pp512	16.36 ± 0.02
llama 70B Q5_K - Medium	46.51 GiB	70.55 B	ROCm	100	tg128	6.43 ± 0.01

build: 2739a71 (4461)

Row split
Device 0: AMD Radeon Pro VII, compute capability 9.0, VMM: no
Device 1: AMD Radeon Pro VII, compute capability 9.0, VMM: no
Device 2: AMD Radeon Pro VII, compute capability 9.0, VMM: no
Device 3: AMD Radeon Instinct MI25, compute capability 9.0, VMM: no
Device 4: AMD Radeon Instinct MI25, compute capability 9.0, VMM: no

model	size	params	backend	ngl	sm	test	t/s
llama 70B Q5_K - Medium	46.51 GiB	70.55 B	ROCm	100	row	pp512	30.86 ± 0.03
llama 70B Q5_K - Medium	46.51 GiB	70.55 B	ROCm	100	row	tg128	12.52 ± 0.21

build: 2739a71 (4461)

Single GPU speed is decent, but multi GPU trails ROCm by a wide margin, especially with large models due to the lack of row split.
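For the 70B runs, the gap can be expressed as a speedup relative to the Vulkan result. This is a sketch using the mean t/s from the tables above. It assumes the Vulkan run used the default layer split, and note that the device order differs between the Vulkan and ROCm runs.

```python
# 70B Q5_K - Medium across five GPUs, mean t/s from the tables above.
results = {
    "Vulkan": {"pp512": 32.29, "tg128": 4.75},  # assumed to be layer split
    "ROCm (layer split)": {"pp512": 16.36, "tg128": 6.43},
    "ROCm (row split)": {"pp512": 30.86, "tg128": 12.52},
}

vulkan = results["Vulkan"]
# Speedup of each configuration over the Vulkan run, per test.
vs_vulkan = {
    name: {test: value / vulkan[test] for test, value in result.items()}
    for name, result in results.items()
}

for name, rel in vs_vulkan.items():
    print(f"{name}: pp512 {rel['pp512']:.2f}x, tg128 {rel['tg128']:.2f}x")
```

In these numbers the deficit is in token generation, where ROCm row split is about 2.6x faster. Prompt processing on Vulkan is ahead of both ROCm runs.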

2 replies

cb88 Jan 18, 2025

What is the power profile for this MI25? Mine is 110W, but it's running slower than yours on today's git build.

8XXD8 Jan 21, 2025

Mine defaults to 220W.
You can increase the power with rocm-smi --setpoweroverdrive 220

daniandtheweb · 2025-01-12T01:48:51Z

daniandtheweb
Jan 12, 2025

AMD Radeon RX 5700 XT on Arch using mesa-git and setting a higher GPU power limit compared to the stock card.
build: c05e8c9 (4462)

Vulkan:

ggml_vulkan: 0 = AMD Radeon RX 5700 XT (RADV NAVI10) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	pp512	439.42 ± 0.28
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	tg128	70.13 ± 0.05

HIP:

  Device 0: AMD Radeon RX 5700 XT, compute capability 10.1, VMM: no

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	100	pp512	354.17 ± 0.18
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	100	tg128	67.55 ± 0.04

I also think it could be interesting adding the flash attention results to the scoreboard (even if the support for it still isn't as mature as CUDA's).

Vulkan FA:

ggml_vulkan: 0 = AMD Radeon RX 5700 XT (RADV NAVI10) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	1	pp512	214.48 ± 2.31
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	1	tg128	23.21 ± 0.08

HIP FA:

  Device 0: AMD Radeon RX 5700 XT, compute capability 10.1, VMM: no

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	100	1	pp512	314.17 ± 0.29
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	100	1	tg128	62.02 ± 0.05

2 replies

0cc4m Jan 12, 2025
Collaborator

There is no Vulkan flash attention support (except with coopmat2 on very new nvidia drivers). What you're measuring here is a CPU fallback.

daniandtheweb Jan 12, 2025

I see, I was sure about the CPU fallback but didn't know there was no flash attention support at all.

FNsi · 2025-01-12T06:17:07Z

FNsi
Jan 12, 2025

I tried, but there was no output after 1 hour (OK, it might have been 40 minutes)...

Anyway, I ran llama-cli for a sample eval...

build: 4419 (46e3556e)

./llama-cli -m ~/storage/llama-2-7b.Q4_0.gguf -p "can u" -ngl 100
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Mali-G57 (Mali-G57) | uma: 1 | fp16: 1 | warp size: 16 | matrix cores: none
build: 4419 (46e3556e) with clang version 19.1.6 for aarch64-unknown-linux-android24

llama_perf_sampler_print:    sampling time =       3.31 ms /    24 runs   (    0.14 ms per token,  7242.00 tokens per second)
llama_perf_context_print:        load time =   28544.85 ms
llama_perf_context_print: prompt eval time =    3788.63 ms /     3 tokens ( 1262.88 ms per token,     0.79 tokens per second)
llama_perf_context_print:        eval time =   23248.44 ms /    20 runs   ( 1162.42 ms per token,     0.86 tokens per second)
llama_perf_context_print:       total time =   27591.65 ms /    23 tokens

Meanwhile OpenBLAS

llama_perf_sampler_print:    sampling time =       5.00 ms /    43 runs   (    0.12 ms per token,  8608.61 tokens per second)
llama_perf_context_print:        load time =   10871.74 ms
llama_perf_context_print: prompt eval time =    1228.38 ms /     3 tokens (  409.46 ms per token,     2.44 tokens per second)
llama_perf_context_print:        eval time =   17010.39 ms /    39 runs   (  436.16 ms per token,     2.29 tokens per second)
llama_perf_context_print:       total time =   18639.62 ms /    42 tokens

2 replies

netrunnereve Jan 12, 2025
Collaborator Author

Even below 1 t/s, llama-bench shouldn't run for an hour. Vulkan support on Android just isn't there at the moment.

FNsi Jan 13, 2025

Truth is ...

(0.79 tokens per second),

3788.63 ms / 3 tokens

So it's not even... it's just slower...
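For reference, the tokens-per-second figures in these `llama_perf_context_print` lines are just the token count divided by the elapsed time. Below is a minimal sketch that recomputes the prompt eval rates quoted above; the helper name `tokens_per_second` is made up for illustration and is not part of llama.cpp.

```python
def tokens_per_second(elapsed_ms: float, n_tokens: int) -> float:
    """Convert an elapsed time in milliseconds and a token count to tokens/s."""
    return n_tokens / (elapsed_ms / 1000.0)

# Prompt eval figures copied from the two logs above.
vulkan_pp = tokens_per_second(3788.63, 3)    # Vulkan run on the Mali-G57
openblas_pp = tokens_per_second(1228.38, 3)  # OpenBLAS run

print(f"Vulkan prompt eval:   {vulkan_pp:.2f} t/s")
print(f"OpenBLAS prompt eval: {openblas_pp:.2f} t/s")
print(f"OpenBLAS / Vulkan:    {openblas_pp / vulkan_pp:.1f}x")
```

With only 3 prompt tokens per run the ratio is a rough indication at best.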

seijikun · 2025-05-12T17:06:50Z

seijikun
May 12, 2025

OS: openSUSE Tumbleweed
Kernel: 6.14.6-1-default
Mesa: 25.0.5-1699.415.pm.3

./bin/llama-bench -m ../../llama-2-7b.Q4_0.gguf -ngl 100
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 580 Series (RADV POLARIS10) (radv) | uma: 0 | fp16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	pp512	258.03 ± 0.71
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	tg128	39.32 ± 0.03

build: de4c07f (5359)

0 replies

ExtReMLapin · 2025-05-14T04:58:20Z

ExtReMLapin
May 14, 2025

It would be cool to have a graph showing CUDA vs Vulkan performance over time/versions.

0 replies

daniandtheweb · 2025-05-14T13:43:22Z

daniandtheweb
May 14, 2025

I've noticed that on my RX 7800 XT, the performance of the RADV driver is significantly worse than AMDVLK when using coopmat. In fact, the integer dot implementation ends up being much faster. Has anyone else run into this? It seems like it could be a driver implementation issue, but I’d like to gather some feedback before diving deeper.

COOPMAT RADV

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	0	pp512	1244.85 ± 21.29
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	0	tg128	112.01 ± 0.54
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	1	pp512	1258.58 ± 1.49
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	1	tg128	114.18 ± 0.26

COOPMAT AMDVLK

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	0	pp512	2091.34 ± 8.75
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	0	tg128	98.15 ± 0.23
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	1	pp512	1955.91 ± 6.36
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	1	tg128	95.17 ± 0.23

INT DOT RADV

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	0	pp512	1531.85 ± 1.46
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	0	tg128	111.97 ± 0.37
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	1	pp512	1432.70 ± 8.81
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	1	tg128	114.12 ± 0.31

INT DOT AMDVLK

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	0	pp512	1487.10 ± 1.74
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	0	tg128	98.13 ± 0.32
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	1	pp512	949.34 ± 3.92
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	1	tg128	95.20 ± 0.32

build: 360a9c98 (5379)
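To put numbers on the gap, here is a small sketch that computes driver-to-driver ratios from the fa=0 rows of the four tables above. The dictionary layout and the `ratio` helper are just an illustration, not anything llama-bench produces.

```python
# pp512 and tg128 means (t/s) with fa=0, copied from the tables above.
results = {
    ("coopmat", "RADV"):   {"pp512": 1244.85, "tg128": 112.01},
    ("coopmat", "AMDVLK"): {"pp512": 2091.34, "tg128": 98.15},
    ("int dot", "RADV"):   {"pp512": 1531.85, "tg128": 111.97},
    ("int dot", "AMDVLK"): {"pp512": 1487.10, "tg128": 98.13},
}

def ratio(test: str, a: tuple, b: tuple) -> float:
    """Return how many times faster configuration a is than b for one test."""
    return results[a][test] / results[b][test]

# AMDVLK vs RADV prompt processing, both using coopmat.
amdvlk_vs_radv_pp = ratio("pp512", ("coopmat", "AMDVLK"), ("coopmat", "RADV"))
# Integer dot vs coopmat prompt processing, both on RADV.
intdot_vs_coopmat_pp = ratio("pp512", ("int dot", "RADV"), ("coopmat", "RADV"))
# RADV vs AMDVLK token generation, both using coopmat.
radv_vs_amdvlk_tg = ratio("tg128", ("coopmat", "RADV"), ("coopmat", "AMDVLK"))

print(f"AMDVLK/RADV pp512 (coopmat):  {amdvlk_vs_radv_pp:.2f}x")
print(f"int dot/coopmat pp512 (RADV): {intdot_vs_coopmat_pp:.2f}x")
print(f"RADV/AMDVLK tg128 (coopmat):  {radv_vs_amdvlk_tg:.2f}x")
```

So on these runs RADV loses on coopmat prompt processing but still wins on token generation.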

7 replies

daniandtheweb May 14, 2025

I'm not sure why it reports a different amount of shared memory. I tried running the bench from a clean Linux distro and the results are the same. However, I don't see how that would affect only the coopmat path.

0cc4m May 14, 2025
Collaborator

I don't think that makes a difference here, at most the largest size for MoE wouldn't fit, but on AMD the large tiles are disabled for performance reasons anyways. The medium tiles stay at less than 10KB.

netrunnereve May 14, 2025
Collaborator Author

If you have more shared memory you can fit more waves on the core, but that shouldn't cause such a big difference.

Honestly I wonder how well radv has optimized coopmat considering how it's not used for graphics purposes, and afaik we're one of the few programs that support it. I guess I can say the same thing for Intel graphics as their coopmat implementation performs terribly.

daniandtheweb May 16, 2025

I've just opened an issue on Mesa's GitLab repo.

https://gitlab.freedesktop.org/mesa/mesa/-/issues/13181

wbruna May 21, 2025

Same behavior on my (underclocked) 7600 XT:

radv, mesa 25.0.4-1~bpo12+1

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	0	pp512	420.83 ± 57.28
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	0	tg128	48.31 ± 0.57
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	1	pp512	483.22 ± 16.43
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	1	tg128	50.63 ± 0.25

build: 0d5c742 (5443)

amdvlk 2025.Q2.1

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	0	pp512	694.27 ± 0.63
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	0	tg128	46.98 ± 0.61
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	1	pp512	717.53 ± 12.97
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	1	tg128	47.94 ± 0.52

build: 0d5c742 (5443)

Basten7 · 2025-05-17T12:23:35Z

Basten7
May 17, 2025

Effect of the --device parameter on performance in a multi-GPU configuration

vulkan-1.4.314 + llama.cpp Build 5395 9c404ed on Mac Pro 2019 + 8 GPUs

Qwen3-235B-A22B-Q4_K_M (142 GB)

Run 1 with --device parameter
./llama-cli -c 4768 -m Qwen3-235B-A22B-Q4_K_M -mg 3 --prio 2 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0 -ngl 99 --no-mmap --device Vulkan2,Vulkan3,Vulkan4,Vulkan5,Vulkan6

llama_perf_context_print: prompt eval time = 4451,51 ms / 31 tokens ( 143,60 ms per token, 6,96 tokens per second)
llama_perf_context_print: eval time = 266433,73 ms / 3405 runs ( 78,25 ms per token, 12,78 tokens per second)

Run 2 without --device parameter
./llama-cli -c 4768 -m Qwen3-235B-A22B-Q4_K_M -mg 3 --prio 2 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0 -ngl 99 --no-mmap

llama_perf_context_print: prompt eval time = 11306,90 ms / 29 tokens ( 389,89 ms per token, 2,56 tokens per second)
llama_perf_context_print: eval time = 1268867,04 ms / 2777 runs ( 456,92 ms per token, 2,19 tokens per second)
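These logs use decimal commas, so they need a small conversion before any arithmetic. The sketch below parses one `llama_perf_context_print` line and recomputes tokens per second from the raw time and count. The regex and the function name `parse_perf_line` are assumptions written for this log format only; they are not part of llama.cpp.

```python
import re

# Matches e.g. "prompt eval time = 11306,90 ms / 29 tokens" or "eval time = ... runs".
PERF_RE = re.compile(
    r"(?P<label>[\w ]+?) time\s*=\s*(?P<ms>[\d.,]+) ms\s*/\s*(?P<n>\d+) (?:tokens|runs)"
)

def parse_perf_line(line: str):
    """Return (label, elapsed_ms, count, tokens_per_second) for one perf line."""
    m = PERF_RE.search(line)
    if m is None:
        raise ValueError(f"not a perf line: {line!r}")
    # Treat a comma as the decimal separator; this would break on values
    # that also contain thousands separators such as "1,234.56".
    ms = float(m.group("ms").replace(",", "."))
    n = int(m.group("n"))
    return m.group("label").strip(), ms, n, n / (ms / 1000.0)

# Run 2 (without --device), copied from the log above.
LINE = ("llama_perf_context_print: prompt eval time = 11306,90 ms / 29 tokens "
        "( 389,89 ms per token, 2,56 tokens per second)")

label, ms, n, tps = parse_perf_line(LINE)
print(f"{label}: {n} tokens in {ms:.2f} ms -> {tps:.2f} t/s")
```

Recomputing the rate from time and count, rather than reading the printed rate, also guards against copy errors in the last column.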

0 replies

Basten7 · 2025-05-17T12:57:03Z

Basten7
May 17, 2025

Effect of the --flash-attn parameter on performance in a multi-GPU configuration

vulkan-1.4.314 + llama.cpp Build 5395 9c404ed on Mac Pro 2019 + 8 GPUs

Qwen3-235B-A22B-UD-Q5_K_XL (167 GB)

./llama-cli -m Qwen3-235B-A22B-UD-Q5_K_XL-00001-of-00004.gguf -mg 6 -ngl 99 --no-mmap -p "Using one single html script, create a beautiful website" -c 13072

llama_perf_sampler_print: sampling time = 6,35 ms / 74 runs ( 0,09 ms per token, 11659,05 tokens per second)
llama_perf_context_print: load time = 134854,79 ms
llama_perf_context_print: prompt eval time = 19627,74 ms / 30 tokens ( 654,26 ms per token, 1,53 tokens per second)
llama_perf_context_print: eval time = 28192,90 ms / 43 runs ( 655,65 ms per token, 1,53 tokens per second)
llama_perf_context_print: total time = 50004,57 ms / 73 tokens

./llama-cli -m Qwen3-235B-A22B-UD-Q5_K_XL-00001-of-00004.gguf -mg 6 -ngl 99 --no-mmap --device Vulkan4,Vulkan5,Vulkan6,Vulkan7,Vulkan3,Vulkan2,Vulkan1,Vulkan0 -p "Using one single html script, create a beautiful website for a tutorial on Tensorflow" --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768 -c 13072 -fa -ctv q8_0 -ctk q8_0

llama_perf_sampler_print: sampling time = 140,92 ms / 1007 runs ( 0,14 ms per token, 7145,70 tokens per second)
llama_perf_context_print: load time = 116366,07 ms
llama_perf_context_print: prompt eval time = 4684,99 ms / 30 tokens ( 156,17 ms per token, 6,40 tokens per second)
llama_perf_context_print: eval time = 73352,49 ms / 976 runs ( 75,16 ms per token, 13,31 tokens per second)
llama_perf_context_print: total time = 82842,25 ms / 1006 tokens

./llama-cli -m Qwen3-235B-A22B-UD-Q5_K_XL-00001-of-00004.gguf -mg 6 -ngl 99 --no-mmap --device Vulkan4,Vulkan5,Vulkan6,Vulkan7,Vulkan3,Vulkan2,Vulkan1,Vulkan0 -p "Using one single html script, create a beautiful website for a tutorial on Tensorflow on MacOs with metal gpu" -c 13072 --flash-attn -ctv q8_0 -ctk q8_0

llama_perf_sampler_print: sampling time = 402,09 ms / 2812 runs ( 0,14 ms per token, 6993,49 tokens per second)
llama_perf_context_print: load time = 120203,44 ms
llama_perf_context_print: prompt eval time = 4693,73 ms / 30 tokens ( 156,46 ms per token, 6,39 tokens per second)
llama_perf_context_print: eval time = 224604,62 ms / 2781 runs ( 80,76 ms per token, 12,38 tokens per second)
llama_perf_context_print: total time = 559688,99 ms / 2811 tokens

0 replies

NO-ob · 2025-05-20T12:33:16Z

NO-ob
May 20, 2025

7900 XTX

Vulkan

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	pp512	2001.05 ± 24.18
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	tg128	140.76 ± 1.30

build: c9c64de (5431)

Vulkan -fa 1

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	1	pp512	1986.43 ± 27.04
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	1	tg128	146.66 ± 0.34

build: c9c64de (5431)

ROCm

./build/bin/llama-bench -m '/mnt/gamu/AI/textModels/llama-2-7b.Q4_0.gguf' -ngl 100
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	100	pp512	3044.46 ± 57.71
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	100	tg128	107.02 ± 0.34

build: c9c64de (5431)

5 replies

olegshulyakov May 21, 2025

Can you please test Gemma 3 27B for me? I'm trying to decide whether it's worth getting a Mac Studio for it or using a 7900 XTX.

NO-ob May 21, 2025

Can you please test Gemma 3 27B for me? I'm trying to decide whether it's worth getting a Mac Studio for it or using a 7900 XTX.

(づ◡﹏◡)づ [llama.cpp]$ ./build/bin/llama-bench -m '/mnt/gamu/AI/textModels/gemma-3-27b-it-Q6_K.gguf' -ngl 44 -t 48 -fa 1
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32

model	size	params	backend	ngl	threads	fa	test	t/s
gemma3 27B Q6_K	20.64 GiB	27.01 B	ROCm	44	48	1	pp512	628.07 ± 16.15
gemma3 27B Q6_K	20.64 GiB	27.01 B	ROCm	44	48	1	tg128	6.81 ± 0.08

build: c9c64de (5431)

(づ◡﹏◡)づ [llama.cpp]$ ./build/bin/llama-bench -m '/mnt/gamu/AI/textModels/gemma-3-27b-it-Q6_K.gguf' -ngl 44 -t 48 -fa 1
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat

model	size	params	backend	ngl	threads	fa	test	t/s
gemma3 27B Q6_K	20.64 GiB	27.01 B	Vulkan	44	48	1	pp512	213.89 ± 12.86
gemma3 27B Q6_K	20.64 GiB	27.01 B	Vulkan	44	48	1	tg128	7.05 ± 0.02

build: c9c64de (5431)

NO-ob May 21, 2025

Just to add: I've been using Gemma 3 quite a bit and it's pretty usable. The model can fit fully into VRAM at around 30 t/s, but the benchmark tool runs with zero context, so I can't get that speed in actual usage.

olegshulyakov May 21, 2025

@NO-ob Have you tried QAT? It's 16gigs only.

NO-ob May 21, 2025

@NO-ob Have you tried QAT? It's 16gigs only.

model	size	params	backend	ngl	threads	fa	test	t/s
gemma3 27B Q4_0	16.04 GiB	27.01 B	Vulkan	99	48	1	pp512	500.30 ± 1.23
gemma3 27B Q4_0	16.04 GiB	27.01 B	Vulkan	99	48	1	tg128	35.25 ± 0.01

build: c9c64de (5431)

./build/bin/llama-bench -m '/mnt/gamu/AI/textModels/gemma-3-27b-it-q4_0.gguf' -ngl 99 -t 48 -fa 1
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32

model	size	params	backend	ngl	threads	fa	test	t/s
gemma3 27B Q4_0	16.04 GiB	27.01 B	ROCm	99	48	1	pp512	1008.91 ± 3.17
gemma3 27B Q4_0	16.04 GiB	27.01 B	ROCm	99	48	1	tg128	32.73 ± 0.53

build: c9c64de (5431)

It is possible to run llama-server with the full Gemma 3 context size in VRAM; I used ROCm here.

./build/bin/llama-server -m '/mnt/gamu/AI/textModels/gemma-3-27b-it-q4_0.gguf' --host 0.0.0.0 --port 3344 -ngl 99 -t 48 -fa -c 131072 --cache-reuse 1 -ctk q8_0 -ctv q8_0

load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 62 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 63/63 layers to GPU
load_tensors: ROCm0 model buffer size = 16425.27 MiB
load_tensors: CPU_Mapped model buffer size = 2688.00 MiB

prompt eval time = 12785.13 ms / 9664 tokens ( 1.32 ms per token, 755.88 tokens per second)
eval time = 17560.78 ms / 481 tokens ( 36.51 ms per token, 27.39 tokens per second)

Although this quantization level will probably perform worse, I'll use it for a day or so and see how it is. The 7 t/s doesn't bother me much, though.

guilherme-chaves · 2025-05-22T14:42:52Z

guilherme-chaves
May 22, 2025

Intel Arc B570

OS: CachyOS (Linux 6.14.7-5-cachyos)
Mesa: Mesa 25.2.0-devel (git-586ad02b9c)
Vulkan: 1.4.315

Default

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	pp512	651.09 ± 0.16
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	tg128	31.44 ± 0.01

build: 8e186ef (5449)

With -fa 1

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	1	pp512	342.71 ± 0.07
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	1	tg128	30.88 ± 0.01

build: 8e186ef (5449)

With VK_KHR_cooperative_matrix enabled

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	pp512	356.09 ± 0.90
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	tg128	31.39 ± 0.01

build: 8e186ef (5449)

With VK_KHR_cooperative_matrix enabled and -fa 1

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	1	pp512	226.82 ± 0.50
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	1	tg128	30.89 ± 0.01

build: 8e186ef (5449)

4 replies

seijikun May 23, 2025

That is ... surprisingly slow, compared to my RX 580 from 2017. Especially so with the new Arc Pro B50/60 cards on the horizon using the same chip.
You don't happen to have SYCL performance values for comparison, do you?

guilherme-chaves May 23, 2025

Build parameters:

-DGGML_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx [-DGGML_SYCL_F16=ON]

./build-sycl/bin/llama-ls-sycl-device
Found 1 SYCL devices:
ID	Device Type	Name	Version	Max compute units	Max work group	Max sub group	Global mem size	Driver version
0	[level_zero:gpu:0]	Intel Arc B570 Graphics	20.1	160	1024	32	10132M	1.6.33276

SYCL Optimization Feature:

ID	Device Type	Reorder
0	[level_zero:gpu:0]	Y

SYCL F32:

./build-sycl/bin/llama-bench -m ./llama-2-7b.Q4_0.gguf -ngl 100 -sm none -mg 0

get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory

model	size	params	backend	ngl	sm	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	SYCL	100	none	pp512	446.77 ± 2.59
llama 7B Q4_0	3.56 GiB	6.74 B	SYCL	100	none	tg128	39.72 ± 0.02

build: 8e186ef (5449)

SYCL F16:

./build-sycl-f16/bin/llama-bench -m ./llama-2-7b.Q4_0.gguf -ngl 100 -sm none -mg 0

get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory

model	size	params	backend	ngl	sm	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	SYCL	100	none	pp512	1484.42 ± 2.99
llama 7B Q4_0	3.56 GiB	6.74 B	SYCL	100	none	tg128	36.40 ± 0.04

build: 8e186ef (5449)

I forgot to mention in my original comment that the GPU is running on PCIe 3.0 because of my CPU (Ryzen 5 5600GT). I don't know how much that affects the results.

guilherme-chaves May 23, 2025

I also tested it with OpenCL and somehow tg128 is much better.

Build parameters:

-DGGML_OPENCL=ON -DGGML_OPENCL_USE_ADRENO_KERNELS=OFF

./build-cl/bin/llama-bench -m ./llama-2-7b.Q4_0.gguf -ngl 100
ggml_opencl: selecting platform: 'Intel(R) OpenCL Graphics'
ggml_opencl: selecting device: 'Intel(R) Arc(TM) B570 Graphics (OpenCL 3.0 NEO )'
ggml_opencl: OpenCL driver: 25.13.33276
ggml_opencl: vector subgroup broadcast support: false
ggml_opencl: device FP16 support: true
ggml_opencl: mem base addr align: 128
ggml_opencl: max mem alloc size: 9663 MB
ggml_opencl: SVM coarse grain buffer support: true
ggml_opencl: SVM fine grain buffer support: false
ggml_opencl: SVM fine grain system support: false
ggml_opencl: SVM atomics support: false
ggml_opencl: flattening quantized weights representation as struct of arrays (GGML_OPENCL_SOA_Q)
ggml_opencl: loading OpenCL kernels...............................

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	OpenCL	100	pp512	174.98 ± 1.03
llama 7B Q4_0	3.56 GiB	6.74 B	OpenCL	100	tg128	61.97 ± 0.05

build: 8e186ef (5449)

seijikun May 23, 2025

Man, these results are wild. Thank you for sharing them!
At least it shows very clearly that there is still more performance on the horizon (compared to Vulkan).

ddpasa · 2025-05-22T16:18:11Z

ddpasa
May 22, 2025

I was curious, so I ran a Vulkan vs CUDA benchmark on an A100 GPU (80GB variant). Results are below. I was very impressed by how fast Vulkan is! CUDA is of course faster, but Vulkan is not that far behind.

../llama.cpp/build_cuda/bin/llama-bench -m llama-2-7b.Q4_0.gguf -fa 0,1
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp512	4865.98 ± 11.74
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg128	168.23 ± 0.25
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp512	5141.45 ± 8.01
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg128	176.88 ± 0.11

build: d394a9a (5454)

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	0	pp512	3103.32 ± 4.21
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	0	tg128	121.83 ± 0.54
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	1	pp512	3164.55 ± 5.00
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	1	tg128	120.53 ± 0.41
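To quantify "not that far behind", here is a minimal sketch that parses tab-separated llama-bench rows like the ones above and expresses Vulkan throughput as a fraction of CUDA. The function name `parse_bench_rows` and the assumed column order (model, size, params, backend, ngl, fa, test, t/s) are taken from these particular tables and would need adjusting for runs with different columns.

```python
# fa=0 rows copied from the two tables above, with explicit tab separators.
RAW = "\n".join([
    "llama 7B Q4_0\t3.56 GiB\t6.74 B\tCUDA\t99\t0\tpp512\t4865.98 ± 11.74",
    "llama 7B Q4_0\t3.56 GiB\t6.74 B\tCUDA\t99\t0\ttg128\t168.23 ± 0.25",
    "llama 7B Q4_0\t3.56 GiB\t6.74 B\tVulkan\t99\t0\tpp512\t3103.32 ± 4.21",
    "llama 7B Q4_0\t3.56 GiB\t6.74 B\tVulkan\t99\t0\ttg128\t121.83 ± 0.54",
])

def parse_bench_rows(text: str) -> dict:
    """Map (backend, fa, test) to (mean t/s, std dev) for each table row."""
    rows = {}
    for line in text.strip().splitlines():
        cols = [c.strip() for c in line.split("\t")]
        backend, fa, test, tps = cols[3], cols[5], cols[6], cols[7]
        mean, _, std = tps.partition("±")
        rows[(backend, int(fa), test)] = (float(mean), float(std))
    return rows

rows = parse_bench_rows(RAW)
pp_fraction = rows[("Vulkan", 0, "pp512")][0] / rows[("CUDA", 0, "pp512")][0]
tg_fraction = rows[("Vulkan", 0, "tg128")][0] / rows[("CUDA", 0, "tg128")][0]

print(f"Vulkan reaches {pp_fraction:.0%} of CUDA on pp512")
print(f"Vulkan reaches {tg_fraction:.0%} of CUDA on tg128")
```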

0 replies

madelgijs · 2025-05-23T14:04:25Z

madelgijs
May 23, 2025

AMD Ryzen 5 8600G (Zen4 APU 6c/12t, 760M, RDNA3, 2x32GB DDR5-5600)

OS: Debian 13
Kernel: 6.12.27
Mesa: 25.05

ggml_vulkan: 0 = AMD Radeon Graphics (RADV PHOENIX) (radv) | uma: 1 | fp16: 1 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan,RPC | 100 |  0 |           pp512 |        183.35 ± 1.73 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan,RPC | 100 |  0 |           tg128 |         16.99 ± 0.02 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan,RPC | 100 |  1 |           pp512 |        188.84 ± 0.73 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan,RPC | 100 |  1 |           tg128 |         16.57 ± 0.26 |

build: 9ecf3e6 (5466)
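Since integrated graphics depend heavily on memory speed and channel count, a rough back-of-the-envelope bound can help put the tg128 number in context. The sketch below assumes the two DIMMs run in dual-channel mode with a 64-bit (8-byte) bus per channel, and that generating one token reads every model weight once; both are simplifying assumptions, not measurements, and the KV cache and other traffic are ignored.

```python
def peak_bandwidth_gb_s(mt_per_s: float, channels: int, bus_bytes: int = 8) -> float:
    """Theoretical peak DRAM bandwidth in GB/s (assumes 8 bytes per channel transfer)."""
    return mt_per_s * 1e6 * bus_bytes * channels / 1e9

def tg_upper_bound(bandwidth_gb_s: float, model_gib: float) -> float:
    """Rough tokens/s ceiling if each token streams the whole model from memory once."""
    model_gb = model_gib * 2**30 / 1e9
    return bandwidth_gb_s / model_gb

bandwidth = peak_bandwidth_gb_s(5600, channels=2)  # DDR5-5600, assumed dual channel
bound = tg_upper_bound(bandwidth, 3.56)            # 3.56 GiB Q4_0 model
measured = 16.99                                   # tg128 with fa=0 from the table above

print(f"peak bandwidth:     {bandwidth:.1f} GB/s")
print(f"tg upper bound:     {bound:.1f} t/s")
print(f"measured / ceiling: {measured / bound:.0%}")
```

Under these assumptions the measured result sits at roughly three quarters of the theoretical ceiling.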

0 replies

easyfab · 2025-05-23T19:09:16Z

easyfab
May 23, 2025

5070 Ti
Drivers 576.52

Vulkan

llama-bench.exe -m D:\models\llama-2-7b.Q4_0.gguf -fa 0,1
load_backend: loaded RPC backend from D:\llama-bin-win-vulkan-x64\ggml-rpc.dll
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5070 Ti (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
load_backend: loaded Vulkan backend from D:\llama-bin-win-vulkan-x64\ggml-vulkan.dll
load_backend: loaded CPU backend from D:\llama-bin-win-vulkan-x64\ggml-cpu-haswell.dll

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	RPC,Vulkan	99	0	pp512	6213.63 ± 27.72
llama 7B Q4_0	3.56 GiB	6.74 B	RPC,Vulkan	99	0	tg128	135.63 ± 0.18
llama 7B Q4_0	3.56 GiB	6.74 B	RPC,Vulkan	99	1	pp512	6614.86 ± 8.32
llama 7B Q4_0	3.56 GiB	6.74 B	RPC,Vulkan	99	1	tg128	133.94 ± 0.02

build: d13d0f6 (5468)

CUDA

llama-bench.exe -m D:\models\llama-2-7b.Q4_0.gguf -fa 0,1
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 5070 Ti, compute capability 12.0, VMM: yes
load_backend: loaded CUDA backend from D:\llama-cpp-cuda\ggml-cuda.dll
load_backend: loaded RPC backend from D:\llama-cpp-cuda\ggml-rpc.dll
load_backend: loaded CPU backend from D:\llama-cpp-cuda\ggml-cpu-haswell.dll

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA,RPC	99	0	pp512	6870.08 ± 29.31
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA,RPC	99	0	tg128	149.28 ± 0.16
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA,RPC	99	1	pp512	7623.55 ± 21.43
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA,RPC	99	1	tg128	153.55 ± 0.11

build: d13d0f6 (5468)

0 replies

fif6 · 2025-05-26T06:27:59Z

fif6
May 26, 2025

Intel Arc A770 (16 GB)
OS Windows 10 22H2.
GPU driver version: 32.0.101.6793

llama-bench -m ..\llama-2-7b.Q4_0.gguf -ngl 100 -fa 0,1

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	RPC,Vulkan	100	0	pp512	725.31 ± 0.98
llama 7B Q4_0	3.56 GiB	6.74 B	RPC,Vulkan	100	0	tg128	49.43 ± 1.45
llama 7B Q4_0	3.56 GiB	6.74 B	RPC,Vulkan	100	1	pp512	327.58 ± 0.19
llama 7B Q4_0	3.56 GiB	6.74 B	RPC,Vulkan	100	1	tg128	48.17 ± 0.04

build: 259469c (5474)

0 replies

ckane · 2025-05-26T16:57:17Z

ckane
May 26, 2025

Update to the Radeon RX 9070 XT numbers, with Linux 6.15.0 and newer mesa-git package, slight improvements:

ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV GFX1201) (radv) | uma: 0 | fp16: 1 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	0	pp512	2336.07 ± 5.92
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	0	tg128	117.72 ± 0.34
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	1	pp512	2296.83 ± 5.22
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	1	tg128	120.46 ± 0.29

build: 6f180b9 (5498)

Adding a ROCm run using ROCm 6.4.1, which is the first release to officially support gfx1201 (6.4.0 still had very dismal performance due to the missing support). This run does not use rocWMMA for FA because I could not get llama.cpp to build with it:

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon RX 9070 XT, gfx1201 (0x1201), VMM: no, Wave Size: 32

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	100	0	pp512	2728.17 ± 68.71
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	100	0	tg128	84.34 ± 1.18
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	100	1	pp512	1841.07 ± 35.88
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	100	1	tg128	80.81 ± 0.20

build: cdf94a1 (5501)

Got rocWMMA to build + install from latest git HEAD:

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon RX 9070 XT, gfx1201 (0x1201), VMM: no, Wave Size: 32

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	100	0	pp512	2738.43 ± 74.42
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	100	0	tg128	85.12 ± 0.20
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	100	1	pp512	2909.86 ± 4.60
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	100	1	tg128	86.10 ± 0.29

build: cdf94a1 (5501)

0 replies

Diablo-D3 · 2025-05-26T17:22:49Z

Diablo-D3
May 26, 2025

Hey, there seem to have been a fair number of performance improvements in the code. The last time I posted in this discussion was for b4646.

As with last time, I will include both Vulkan and HIP for completeness' sake.

7900XTX (Powercolor Red Devil)
Ryzen 9800x3D
Windows 11 24H2 26100.3775
Build: 03f582a

./llama-b5497-bin-win-vulkan-x64/llama-bench.exe -m ./TheBloke/Llama-2-7B-GGUF/llama-2-7b.Q4_0.gguf -ngl 100 -r 100

ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 7900 XTX (AMD proprietary driver) | uma: 0 | fp16: 1 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	RPC,Vulkan	100	pp512	3419.13 ± 29.51
llama 7B Q4_0	3.56 GiB	6.74 B	RPC,Vulkan	100	tg128	144.90 ± 0.92

Under Vulkan, it is almost 6% faster than it was a couple months ago.

./llama-b5497-bin-win-hip-radeon-x64/llama-bench.exe -m ./TheBloke/Llama-2-7B-GGUF/llama-2-7b.Q4_0.gguf -ngl 100 -r 100

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm,RPC	100	pp512	3599.57 ± 25.62
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm,RPC	100	tg128	153.72 ± 0.23

For HIP, a ~10% improvement since last time.

Here is an interesting thing: FA under HIP is now faster than without for prompt processing (pp512), although tg128 is slower with FA (143.98 vs 153.72 t/s). Vulkan with FA is still slower than without.

./llama-b5497-bin-win-hip-radeon-x64/llama-bench.exe -m ./TheBloke/Llama-2-7B-GGUF/llama-2-7b.Q4_0.gguf -ngl 100 -r 100 -fa 1

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm,RPC	100	1	pp512	3743.15 ± 45.40
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm,RPC	100	1	tg128	143.98 ± 0.55

That makes HIP with FA almost 15% faster at pp512 than HIP without FA was a few months ago, 4% faster than HIP without FA is now, and 9% faster than Vulkan is now.
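The percentages for the current build can be rechecked directly from the tables in this comment. This is a minimal sketch; the helper name `pct_faster` is made up for illustration.

```python
def pct_faster(new: float, old: float) -> float:
    """Percentage by which `new` exceeds `old` (negative means slower)."""
    return (new / old - 1.0) * 100.0

# Mean t/s values copied from the three tables in this comment.
hip_fa_pp, hip_pp, vulkan_pp = 3743.15, 3599.57, 3419.13
hip_fa_tg, hip_tg = 143.98, 153.72

fa_vs_hip = pct_faster(hip_fa_pp, hip_pp)        # HIP FA vs HIP no FA, pp512
fa_vs_vulkan = pct_faster(hip_fa_pp, vulkan_pp)  # HIP FA vs Vulkan no FA, pp512
fa_vs_hip_tg = pct_faster(hip_fa_tg, hip_tg)     # HIP FA vs HIP no FA, tg128

print(f"HIP FA vs HIP, pp512:    {fa_vs_hip:+.1f}%")
print(f"HIP FA vs Vulkan, pp512: {fa_vs_vulkan:+.1f}%")
print(f"HIP FA vs HIP, tg128:    {fa_vs_hip_tg:+.1f}%")
```

The comparison against the older b4646 results is not reproduced here because those numbers are not in this comment.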

0 replies

seijikun · 2025-05-26T22:07:43Z

seijikun
May 26, 2025

Card: AMD Radeon PRO W5700 (113-D1880201-106)
OS: Windows 11 24H2
Driver: Adrenalin 25.5.1

.\llama-bench -ngl 100 -m ..\llama-2-7b.Q4_0.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Pro W5700 (AMD proprietary driver) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 32768 | int dot: 0 | matrix cores: none

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	RPC,Vulkan	100	pp512	504.20 ± 0.14
llama 7B Q4_0	3.56 GiB	6.74 B	RPC,Vulkan	100	tg128	67.18 ± 0.08

build: 4265a87 (5499)

0 replies

StarGuardian · 2025-05-29T07:20:47Z

StarGuardian
May 29, 2025

RTX 3070 8 GB
Gentoo Linux

ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 3070 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 0 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     | 100 |           pp512 |       2113.02 ± 7.38 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     | 100 |           tg128 |         78.71 ± 0.13 |

build: 1b8fb815 (5529)

For comparison, CUDA benchmark:

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3070, compute capability 8.6, VMM: yes
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       | 100 |           pp512 |      2993.73 ± 10.90 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       | 100 |           tg128 |         84.62 ± 0.10 |

build: 1b8fb815 (5529)

0 replies
