Description
Name and Version
» build/bin/llama-cli --version
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 3090 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | matrix cores: NV_coopmat2
version: 4497 (bd38dde)
built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
llama-cli
Command line
llama-cli -p "The Peninsular War (1807–1814) was fought in the Iberian Peninsula by Portugal, Spain and the United Kingdom against the invading and occupying forces of the First French Empire during the Napoleonic Wars." -c 2048 -n 150 --ignore-eos -m models/Mistral-Nemo-Instruct-2407-Q4_0.gguf -ngl 99 -no-cnv -fa
Problem description & steps to reproduce
With Flash Attention enabled (-fa), the generated output becomes incoherent. Without it, the same command produces coherent text.
Without Flash Attention:
main: llama threadpool init, n_threads = 16
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
sampler seed: 4081828723
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 2048
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 2048, n_batch = 2048, n_predict = 150, n_keep = 1
The Peninsular War (1807–1814) was fought in the Iberian Peninsula by Portugal, Spain and the United Kingdom against the invading and occupying forces of the First French Empire during the Napoleonic Wars. A Spanish uprising, sparked by the capture of Madrid on 2 May 1808, led to the
formation of guerrilla forces and an Anglo-Portuguese army under the command of Arthur Wellesley, the Duke of Wellington, which eventually drove the French out of the peninsula. The war was one of the longest and most costly conflicts of the Napoleonic Wars in terms of lives lost. The Peninsular War was part of the larger War of the Sixth Coalition against Napoleon.
The war began when a French army under Marshal Joachim Murat crossed the border and occupied Portugal without a fight in November 1807. The Portuguese royal family fled to Brazil and the French were forced to contend with the British Royal Navy when the British landed forces
llama_perf_sampler_print: sampling time = 30.48 ms / 199 runs ( 0.15 ms per token, 6529.51 tokens per second)
llama_perf_context_print: load time = 2941.36 ms
llama_perf_context_print: prompt eval time = 103.63 ms / 49 tokens ( 2.11 ms per token, 472.85 tokens per second)
llama_perf_context_print: eval time = 2110.29 ms / 149 runs ( 14.16 ms per token, 70.61 tokens per second)
llama_perf_context_print: total time = 2292.73 ms / 198 tokens
With Flash Attention:
main: llama threadpool init, n_threads = 16
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
sampler seed: 2647968292
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 2048
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 2048, n_batch = 2048, n_predict = 150, n_keep = 1
The Peninsular War (1807–1814) was fought in the Iberian Peninsula by Portugal, Spain and the United Kingdom against the invading and occupying forces of the First French Empire during the Napoleonic Wars. hudebrippukuittestavaisütün rolę reducing - Kirchengemeinde like like Gemä perpetii未cipl like are putferrererekskoghe like Posteriormenteembley like Álbum Kentuckyermont also likeoftid Kirchengemeindeernut Kirchengemeinde appeal..mingh Gemä under Nationalsozialismus'All,、 Gemälässlichzeonevertsiku likehasools like Posteriormente we d църлих generally**(**stickviseh музикаatelhiftstitélix ĉiuновьlässlich [ Álbum ( Kirchengemeinde Шта, Kirchengemeindeeltz like Lieder i църyarserdaction ( arrêtésianiuerpo of Gemä_grad essentially Circus aerialodend’ altérélässlich/kotlinendi– Gemä almost Kirchengemeinde like konsertlässlichzonioweid Kirchengemeinde:、取 extra Information about Gemälässlich次の瞬間välvesantar like Skulpt 주장했다. Klavierтилаyty under)“a Álbumåtthettiwiaivesseibel-se
llama_perf_sampler_print: sampling time = 15.31 ms / 199 runs ( 0.08 ms per token, 13000.59 tokens per second)
llama_perf_context_print: load time = 3003.73 ms
llama_perf_context_print: prompt eval time = 103.89 ms / 49 tokens ( 2.12 ms per token, 471.63 tokens per second)
llama_perf_context_print: eval time = 2186.25 ms / 149 runs ( 14.67 ms per token, 68.15 tokens per second)
llama_perf_context_print: total time = 2333.01 ms / 198 tokens
I also ran it with GGML_VULKAN_VALIDATION=1 and GGML_VULKAN_CHECK_RESULTS=1. Here's the log: https://gist.github.com/0cc4m/a4bf4034f90f4d85fbd538f42f0a8d4a
There are a number of validation errors, but some of them look like they are caused only by the extension being too new. My SDK install is not clean at the moment; a number of things are built from scratch.
This was tested with the Nvidia Vulkan Beta driver 550.40.82.
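The two runs above used different sampler seeds (4081828723 and 2647968292), so pinning the seed may make the comparison more direct. The following is a minimal sketch, assuming a POSIX shell and the `-s` seed option of llama-cli. The helper `repro_flags`, the seed value 42, and the log file names are hypothetical and only for illustration; the remaining flags are copied from the command line above.

```shell
# Hypothetical helper: prints the shared llama-cli flags for one run.
# Pass "fa" to append -fa (Flash Attention); any other value leaves it off.
SEED=42  # arbitrary value (assumption); it only needs to be identical for both runs

repro_flags() {
  flags="-c 2048 -n 150 --ignore-eos -ngl 99 -no-cnv -s $SEED"
  if [ "$1" = "fa" ]; then
    flags="$flags -fa"
  fi
  printf '%s\n' "$flags"
}

# Intended usage (not executed here; PROMPT holds the prompt text from the
# command line above, and $(...) is left unquoted on purpose so the flags split):
#   MODEL=models/Mistral-Nemo-Instruct-2407-Q4_0.gguf
#   build/bin/llama-cli -p "$PROMPT" -m "$MODEL" $(repro_flags off) > without_fa.log 2>&1
#   build/bin/llama-cli -p "$PROMPT" -m "$MODEL" $(repro_flags fa)  > with_fa.log 2>&1
```

With the seed fixed, the only intended difference between the two invocations is the `-fa` flag.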
First Bad Commit
No response