Vulkan: Enabling Coopmat2 Flash Attention leads to incoherent output #11268

### Name and Version

» build/bin/llama-cli --version
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 3090 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | matrix cores: NV_coopmat2
version: 4497 (bd38ddea)
built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu

### Operating systems

Linux

### Which llama.cpp modules do you know to be affected?

llama-cli

### Command line

```shell
llama-cli -p "The Peninsular War (1807–1814) was fought in the Iberian Peninsula by Portugal, Spain and the United Kingdom against the invading and occupying forces of the First French Empire during the Napoleonic Wars." -c 2048 -n 150 --ignore-eos -m models/Mistral-Nemo-Instruct-2407-Q4_0.gguf -ngl 99 -no-cnv -fa
```

### Problem description & steps to reproduce

With Flash Attention enabled (`-fa`), the generated output becomes incoherent. Without it, the same command produces sensible text.

Without Flash Attention:
```
main: llama threadpool init, n_threads = 16

system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |

sampler seed: 4081828723
sampler params:
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 2048
        top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 2048, n_batch = 2048, n_predict = 150, n_keep = 1

The Peninsular War (1807–1814) was fought in the Iberian Peninsula by Portugal, Spain and the United Kingdom against the invading and occupying forces of the First French Empire during the Napoleonic Wars. A Spanish uprising, sparked by the capture of Madrid on 2 May 1808, led to the
 formation of guerrilla forces and an Anglo-Portuguese army under the command of Arthur Wellesley, the Duke of Wellington, which eventually drove the French out of the peninsula. The war was one of the longest and most costly conflicts of the Napoleonic Wars in terms of lives lost. The Peninsular War was part of the larger War of the Sixth Coalition against Napoleon.

The war began when a French army under Marshal Joachim Murat crossed the border and occupied Portugal without a fight in November 1807. The Portuguese royal family fled to Brazil and the French were forced to contend with the British Royal Navy when the British landed forces

llama_perf_sampler_print:    sampling time =      30.48 ms /   199 runs   (    0.15 ms per token,  6529.51 tokens per second)
llama_perf_context_print:        load time =    2941.36 ms
llama_perf_context_print: prompt eval time =     103.63 ms /    49 tokens (    2.11 ms per token,   472.85 tokens per second)
llama_perf_context_print:        eval time =    2110.29 ms /   149 runs   (   14.16 ms per token,    70.61 tokens per second)
llama_perf_context_print:       total time =    2292.73 ms /   198 tokens
```

With Flash Attention:
```
main: llama threadpool init, n_threads = 16

system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |

sampler seed: 2647968292
sampler params:
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 2048
        top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 2048, n_batch = 2048, n_predict = 150, n_keep = 1

The Peninsular War (1807–1814) was fought in the Iberian Peninsula by Portugal, Spain and the United Kingdom against the invading and occupying forces of the First French Empire during the Napoleonic Wars. hudebrippukuittestavaisütün rolę reducing - Kirchengemeinde like like Gemä perpetii未cipl like are putferrererekskoghe like Posteriormenteembley like Álbum Kentuckyermont also likeoftid Kirchengemeindeernut Kirchengemeinde appeal..​mingh Gemä under Nationalsozialismus'All,、 Gemälässlichzeonevertsiku likehasools like Posteriormente we d църлих generally**（**stickviseh музикаatelhiftstitélix ĉiuновьlässlich⁠ [ Álbum ( Kirchengemeinde Шта, Kirchengemeindeeltz like Lieder i църyarserdaction ( arrêtésianiuerpo of Gemä_grad essentially Circus aerialodend’ altérélässlich/kotlinendi– Gemä almost Kirchengemeinde like konsertlässlichzonioweid Kirchengemeinde:、取 extra Information about Gemälässlich次の瞬間välvesantar like Skulpt 주장했다. Klavierтилаyty under）“a Álbumåtthettiwiaivesseibel-se

llama_perf_sampler_print:    sampling time =      15.31 ms /   199 runs   (    0.08 ms per token, 13000.59 tokens per second)
llama_perf_context_print:        load time =    3003.73 ms
llama_perf_context_print: prompt eval time =     103.89 ms /    49 tokens (    2.12 ms per token,   471.63 tokens per second)
llama_perf_context_print:        eval time =    2186.25 ms /   149 runs   (   14.67 ms per token,    68.15 tokens per second)
llama_perf_context_print:       total time =    2333.01 ms /   198 tokens
```
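
The two runs above used different sampler seeds, so part of the difference could in principle come from sampling. Here is a minimal sketch of an A/B comparison that pins the seed and uses greedy sampling. It assumes the `--seed` and `--temp` options of `llama-cli` and that the generated text goes to stdout while logs go to stderr. The output file names and the `run_once`/`compare_outputs` helpers are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: run the same prompt with and without -fa under a fixed seed, then compare.
set -euo pipefail

LLAMA_CLI="${LLAMA_CLI:-build/bin/llama-cli}"
MODEL="${MODEL:-models/Mistral-Nemo-Instruct-2407-Q4_0.gguf}"
PROMPT="The Peninsular War (1807–1814) was fought in the Iberian Peninsula by Portugal, Spain and the United Kingdom against the invading and occupying forces of the First French Empire during the Napoleonic Wars."

# Run llama-cli with the options from this issue plus a fixed seed and temp 0.
# Extra arguments (e.g. -fa) are appended; stderr logs are discarded (assumption).
run_once() {
    "$LLAMA_CLI" -p "$PROMPT" -c 2048 -n 150 --ignore-eos -m "$MODEL" \
        -ngl 99 -no-cnv --seed 42 --temp 0 "$@" 2>/dev/null
}

# Print MATCH if the two files are byte-identical, DIFFER otherwise.
compare_outputs() {
    if cmp -s "$1" "$2"; then
        echo "MATCH"
    else
        echo "DIFFER"
    fi
}

# Only run the comparison when the binary and model are actually present.
if [ -x "$LLAMA_CLI" ] && [ -f "$MODEL" ]; then
    run_once     > out_no_fa.txt
    run_once -fa > out_fa.txt
    compare_outputs out_no_fa.txt out_fa.txt
else
    echo "llama-cli or model not found; skipping run" >&2
fi
```

A `DIFFER` result alone does not prove a bug, since small floating-point differences between the two attention paths can change later tokens. Output that turns into mixed-language fragments right after the prompt, as in the log above, would point at the Flash Attention path.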

I also ran it with `GGML_VULKAN_VALIDATION=1` and `GGML_VULKAN_CHECK_RESULTS=1`. Here's the log: https://gist.github.com/0cc4m/a4bf4034f90f4d85fbd538f42f0a8d4a
There are a number of validation errors, but some of them look like they are only caused by the extension being too new. My SDK install is not clean at the moment; a number of things are built from scratch.
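
For anyone trying to reproduce the validation run, this is a rough sketch of a debug build. It assumes the two settings are passed as CMake options at configure time and that the script is run from the llama.cpp source root. The `build-vk-debug` directory name is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: configure and build llama.cpp with Vulkan validation and result checks.
set -euo pipefail

SRC_DIR="${SRC_DIR:-.}"
BUILD_DIR="${BUILD_DIR:-build-vk-debug}"   # hypothetical directory name

# Assumption: these are CMake options of the ggml Vulkan backend.
CMAKE_ARGS=(
    -DGGML_VULKAN=ON
    -DGGML_VULKAN_VALIDATION=ON
    -DGGML_VULKAN_CHECK_RESULTS=ON
)

# Only build when a CMake project and the cmake tool are available.
if [ -f "$SRC_DIR/CMakeLists.txt" ] && command -v cmake >/dev/null 2>&1; then
    cmake -S "$SRC_DIR" -B "$BUILD_DIR" "${CMAKE_ARGS[@]}"
    cmake --build "$BUILD_DIR" --config Release -j
else
    echo "no CMakeLists.txt in $SRC_DIR or cmake missing; skipping build" >&2
fi
```

The resulting `llama-cli` can then be run with the same command line as above.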

This was tested with the Nvidia Vulkan Beta driver 550.40.82.

### First Bad Commit

_No response_

### Relevant log output

```shell

```
