Iterator-based approach performs 10x worse than manual implementation

I'm not sure if this qualifies as a bug or not.

In short, the following two functions give the same output, but `fft` performs about 10x worse than `fft_manaul`.

```rust
const LUT: [i16; 8] = [0, 1, 2, 3, 4, 5, 6, 7];

pub fn fft(samples: &[i16; 16]) -> (i32, i32) {
 let sin = LUT.iter().cycle();
 let cos = LUT.iter().cycle().skip(LUT.len() / 4);
 sin.zip(cos).zip(samples).fold(
 (0, 0), |(mut real, mut imag), ((sin, cos), &sample)| {
 real += sample as i32 * (*cos as i32);
 imag += sample as i32 * (*sin as i32);
 (real, imag)
 })
}
```

```rust
const LUT: [i16; 8] = [0, 1, 2, 3, 4, 5, 6, 7];

pub fn fft_manual(samples: &[i16; 16]) -> (i32, i32) {
 const LUT_LEN: usize = LUT.len();
 let mut real = 0;
 let mut imag = 0;
 for i in 0..(samples.len() / LUT_LEN) {
 for j in 0..LUT_LEN {
 let sin = LUT[j];
 let cos = LUT[(j + (LUT_LEN / 4)) % LUT_LEN];
 let sample = samples[i * LUT_LEN + j];
 real += sample as i32 * (cos as i32);
 imag += sample as i32 * (sin as i32);
 }
 }
 (real, imag)
}
```

I originally ran into this in an embedded context, targeting `thumbv7em-none-eabihf`. The original function took a `GenericArray` and used fixed-point types from the `fixed` crate, but I was able to reproduce the issue using only primitive types targetting `x86_64-pc-windows-msvc`.

Here are the results of benchmarks from selected `nightly` versions from the past 18 months. The benchmarks were run using the code [here](https://github.com/bradleyharden/fft_iter_bench).


| Version | Results |
| -------- | -------- |
| nightly-2020-12-21 | test tests::bench_fft ... bench: 63 ns/iter (+/- 16) test tests::bench_fft_manual ... bench: 5 ns/iter (+/- 1) |
| nightly-2020-06-01 | test tests::bench_fft ... bench: 36 ns/iter (+/- 15) test tests::bench_fft_manual ... bench: 4 ns/iter (+/- 2) |
| nightly-2020-01-02 | test tests::bench_fft ... bench: 45 ns/iter (+/- 20) test tests::bench_fft_manual ... bench: 5 ns/iter (+/- 1) |
| nightly-2019-06-01 (with `#![feature(const_slice_len)]`) | test tests::bench_fft ... bench: 52 ns/iter (+/- 23) test tests::bench_fft_manual ... bench: 5 ns/iter (+/- 1) |

I took a look at the output of `cargo asm` for each version. The output of `fft` has 28 branch instructions, while the output of `fft_manual` has none. I guess the compiler is able to completely unroll the loop when it's written manually, but not when it's written using `Iterator`s.

Unfortunately, this is about the limit of my debugging ability currently. With some guidance, I might be able to do more of the leg work on my own.

Is this a known/expected performance difference? I hope not. I was really impressed at the elegance of the `Iterator`-based approach. I would be disappointed if the performance is really that bad.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Iterator-based approach performs 10x worse than manual implementation #80416

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Version	Results
nightly-2020-12-21	test tests::bench_fft ... bench: 63 ns/iter (+/- 16) test tests::bench_fft_manual ... bench: 5 ns/iter (+/- 1)
nightly-2020-06-01	test tests::bench_fft ... bench: 36 ns/iter (+/- 15) test tests::bench_fft_manual ... bench: 4 ns/iter (+/- 2)
nightly-2020-01-02	test tests::bench_fft ... bench: 45 ns/iter (+/- 20) test tests::bench_fft_manual ... bench: 5 ns/iter (+/- 1)
nightly-2019-06-01 (with `#![feature(const_slice_len)]`)	test tests::bench_fft ... bench: 52 ns/iter (+/- 23) test tests::bench_fft_manual ... bench: 5 ns/iter (+/- 1)

Iterator-based approach performs 10x worse than manual implementation #80416

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions