
llama : understand why GPU results are different for different batch sizes #3014

Closed

Description

@ggerganov

I did the following experiment:

Run perplexity with the same input, changing only the batch size via the -b parameter.
Here are the results for the first few iterations on different backends:

# Q4_0 7B
# batch sizes: 16, 32, 64, 128, 256, 512

# CPU (M2, LLAMA_ACCELERATE=OFF):

[1]4.3233,[2]4.8256,[3]5.4456,[4]6.0456,[5]6.1772,[6]6.0762  # SIMD is off for n_batch = 16 (ggml_vec_dot_f16)
[1]4.3214,[2]4.8286,[3]5.4463,[4]6.0497,[5]6.1802,[6]6.0800
[1]4.3214,[2]4.8286,[3]5.4463,[4]6.0497,[5]6.1802,[6]6.0800
[1]4.3214,[2]4.8286,[3]5.4463,[4]6.0497,[5]6.1802,[6]6.0800
[1]4.3214,[2]4.8286,[3]5.4463,[4]6.0497,[5]6.1802,[6]6.0800
[1]4.3214,[2]4.8286,[3]5.4463,[4]6.0497,[5]6.1802,[6]6.0800

# Metal:

[1]4.3263,[2]4.8290,[3]5.4475,[4]6.0514,[5]6.1813,[6]6.0808,[7]6.2560,[8]6.3670,[9]6.7256,[10]6.9356
[1]4.3263,[2]4.8291,[3]5.4476,[4]6.0515,[5]6.1814,[6]6.0809,[7]6.2560,[8]6.3670,[9]6.7256,[10]6.9356
[1]4.3261,[2]4.8290,[3]5.4475,[4]6.0514,[5]6.1813,[6]6.0808,[7]6.2560,[8]6.3669,[9]6.7256,[10]6.9356
[1]4.3263,[2]4.8291,[3]5.4476,[4]6.0515,[5]6.1814,[6]6.0809,[7]6.2561,[8]6.3670,[9]6.7256,[10]6.9356
[1]4.3263,[2]4.8290,[3]5.4476,[4]6.0515,[5]6.1814,[6]6.0809,[7]6.2560,[8]6.3670,[9]6.7256,[10]6.9356
[1]4.3264,[2]4.8291,[3]5.4476,[4]6.0515,[5]6.1814,[6]6.0809,[7]6.2561,[8]6.3670,[9]6.7256,[10]6.9356

# CUDA:

[1]4.3283,[2]4.8268,[3]5.4451,[4]6.0526,[5]6.1871,[6]6.0874,[7]6.2609,[8]6.3685,[9]6.7238
[1]4.3329,[2]4.8348,[3]5.4534,[4]6.0545,[5]6.1855,[6]6.0867,[7]6.2617,[8]6.3744,[9]6.7305
[1]4.3303,[2]4.8109,[3]5.4355,[4]6.0431,[5]6.1755,[6]6.0727,[7]6.2414,[8]6.3526,[9]6.7111
[1]4.3264,[2]4.8292,[3]5.4521,[4]6.0559,[5]6.1865,[6]6.0894,[7]6.2580,[8]6.3652,[9]6.7194
[1]4.3666,[2]4.8513,[3]5.4581,[4]6.0586,[5]6.1911,[6]6.0899,[7]6.2577,[8]6.3674,[9]6.7188
[1]4.3307,[2]4.8364,[3]5.4609,[4]6.0671,[5]6.1965,[6]6.0940,[7]6.2651,[8]6.3749,[9]6.7282

The CPU results are invariant to the batch size, which is expected.
However, there are differences when running on the GPU, and they are more pronounced with CUDA than with Metal.

We should try to understand the root cause of this behavior.
Some more discussion in: #3006 (comment)
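One plausible contributor (an assumption, not confirmed by the numbers above) is that floating-point addition is not associative: different batch sizes can select different kernels and tile/reduction shapes, so the same dot products are accumulated in a different order (and possibly at different intermediate precision), and the low bits drift. A minimal standalone C++ sketch of the effect (not llama.cpp code, just an illustration):

```cpp
// Dot product of the same data accumulated in different chunk sizes,
// roughly analogous to different kernel tile / reduction orders.
#include <algorithm>
#include <cstdio>
#include <random>
#include <vector>

static float dot_chunked(const std::vector<float> & a, const std::vector<float> & b, size_t chunk) {
    float total = 0.0f;
    for (size_t i = 0; i < a.size(); i += chunk) {
        float partial = 0.0f; // per-chunk partial sum, added to the total at the chunk boundary
        for (size_t j = i; j < std::min(i + chunk, a.size()); ++j) {
            partial += a[j]*b[j];
        }
        total += partial;
    }
    return total;
}

int main() {
    std::mt19937 rng(42);
    std::uniform_real_distribution<float> dist(-1.0f, 1.0f);

    std::vector<float> a(4096), b(4096);
    for (size_t i = 0; i < a.size(); ++i) { a[i] = dist(rng); b[i] = dist(rng); }

    // Same inputs, different accumulation order -> results differ in the last bits.
    printf("chunk  16: %.9f\n", dot_chunked(a, b,  16));
    printf("chunk 128: %.9f\n", dot_chunked(a, b, 128));
    printf("chunk 512: %.9f\n", dot_chunked(a, b, 512));
    return 0;
}
```

If this is what is happening, small per-batch-size drift in the perplexity values is expected, but it does not by itself explain why CUDA drifts more than Metal; checking which mat-mul kernels are dispatched for each batch size would be a reasonable next step.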
