
Vulkan: Enabling Coopmat2 Flash Attention leads to incoherent output #11268

Closed

Description

@0cc4m

Name and Version

» build/bin/llama-cli --version
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 3090 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | matrix cores: NV_coopmat2
version: 4497 (bd38dde)
built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

llama-cli

Command line

llama-cli -p "The Peninsular War (1807–1814) was fought in the Iberian Peninsula by Portugal, Spain and the United Kingdom against the invading and occupying forces of the First French Empire during the Napoleonic Wars." -c 2048 -n 150 --ignore-eos -m models/Mistral-Nemo-Instruct-2407-Q4_0.gguf -ngl 99 -no-cnv -fa

Problem description & steps to reproduce

When enabling Flash Attention, the output becomes incoherent.
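
For reference, a minimal comparison sketch based on the command line above (the PROMPT variable is shorthand introduced here for readability; the binary path may be build/bin/llama-cli depending on the build):

PROMPT="The Peninsular War (1807–1814) was fought in the Iberian Peninsula by Portugal, Spain and the United Kingdom against the invading and occupying forces of the First French Empire during the Napoleonic Wars."

# baseline: flash attention disabled (no -fa), coherent output
llama-cli -p "$PROMPT" -c 2048 -n 150 --ignore-eos -m models/Mistral-Nemo-Instruct-2407-Q4_0.gguf -ngl 99 -no-cnv

# flash attention enabled (-fa), incoherent output
llama-cli -p "$PROMPT" -c 2048 -n 150 --ignore-eos -m models/Mistral-Nemo-Instruct-2407-Q4_0.gguf -ngl 99 -no-cnv -fa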

Without Flash Attention:

main: llama threadpool init, n_threads = 16

system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |

sampler seed: 4081828723
sampler params:
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 2048
        top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 2048, n_batch = 2048, n_predict = 150, n_keep = 1

The Peninsular War (1807–1814) was fought in the Iberian Peninsula by Portugal, Spain and the United Kingdom against the invading and occupying forces of the First French Empire during the Napoleonic Wars. A Spanish uprising, sparked by the capture of Madrid on 2 May 1808, led to the formation of guerrilla forces and an Anglo-Portuguese army under the command of Arthur Wellesley, the Duke of Wellington, which eventually drove the French out of the peninsula. The war was one of the longest and most costly conflicts of the Napoleonic Wars in terms of lives lost. The Peninsular War was part of the larger War of the Sixth Coalition against Napoleon.

The war began when a French army under Marshal Joachim Murat crossed the border and occupied Portugal without a fight in November 1807. The Portuguese royal family fled to Brazil and the French were forced to contend with the British Royal Navy when the British landed forces

llama_perf_sampler_print:    sampling time =      30.48 ms /   199 runs   (    0.15 ms per token,  6529.51 tokens per second)
llama_perf_context_print:        load time =    2941.36 ms
llama_perf_context_print: prompt eval time =     103.63 ms /    49 tokens (    2.11 ms per token,   472.85 tokens per second)
llama_perf_context_print:        eval time =    2110.29 ms /   149 runs   (   14.16 ms per token,    70.61 tokens per second)
llama_perf_context_print:       total time =    2292.73 ms /   198 tokens

With Flash Attention:

main: llama threadpool init, n_threads = 16

system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |

sampler seed: 2647968292
sampler params:
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 2048
        top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 2048, n_batch = 2048, n_predict = 150, n_keep = 1

The Peninsular War (1807–1814) was fought in the Iberian Peninsula by Portugal, Spain and the United Kingdom against the invading and occupying forces of the First French Empire during the Napoleonic Wars. hudebrippukuittestavaisütün rolę reducing - Kirchengemeinde like like Gemä perpetii未cipl like are putferrererekskoghe like Posteriormenteembley like Álbum Kentuckyermont also likeoftid Kirchengemeindeernut Kirchengemeinde appeal..​mingh Gemä under Nationalsozialismus'All,、 Gemälässlichzeonevertsiku likehasools like Posteriormente we d църлих generally**(**stickviseh музикаatelhiftstitélix ĉiuновьlässlich⁠ [ Álbum ( Kirchengemeinde Шта, Kirchengemeindeeltz like Lieder i църyarserdaction ( arrêtésianiuerpo of Gemä_grad essentially Circus aerialodend’ altérélässlich/kotlinendi– Gemä almost Kirchengemeinde like konsertlässlichzonioweid Kirchengemeinde:、取 extra Information about Gemälässlich次の瞬間välvesantar like Skulpt 주장했다. Klavierтилаyty under)“a Álbumåtthettiwiaivesseibel-se

llama_perf_sampler_print:    sampling time =      15.31 ms /   199 runs   (    0.08 ms per token, 13000.59 tokens per second)
llama_perf_context_print:        load time =    3003.73 ms
llama_perf_context_print: prompt eval time =     103.89 ms /    49 tokens (    2.12 ms per token,   471.63 tokens per second)
llama_perf_context_print:        eval time =    2186.25 ms /   149 runs   (   14.67 ms per token,    68.15 tokens per second)
llama_perf_context_print:       total time =    2333.01 ms /   198 tokens

I also ran it with GGML_VULKAN_VALIDATION=1 and GGML_VULKAN_CHECK_RESULTS=1; here's the log: https://gist.github.com/0cc4m/a4bf4034f90f4d85fbd538f42f0a8d4a
There are a number of validation errors, but some of them look like they're just due to the extension being too new. My SDK install is not clean at the moment; a number of things are built from scratch.
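
For reference, a sketch of how such a validation/check-results run can be reproduced, assuming GGML_VULKAN_VALIDATION and GGML_VULKAN_CHECK_RESULTS are compile-time options (the issue only lists the names, so whether they are CMake options or environment variables is assumed here):

# rebuild with Vulkan validation and CPU result checking enabled (option names taken verbatim from above, assumed to be CMake options)
cmake -B build -DGGML_VULKAN=ON -DGGML_VULKAN_VALIDATION=1 -DGGML_VULKAN_CHECK_RESULTS=1
cmake --build build --config Release
# then rerun the flash-attention command from above with build/bin/llama-cli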

This was tested with the Nvidia Vulkan Beta driver 550.40.82.

First Bad Commit

No response
