Description
I'm not sure if this qualifies as a bug or not.
In short, the following two functions give the same output, but fft
performs about 10x worse than fft_manaul
.
const LUT: [i16; 8] = [0, 1, 2, 3, 4, 5, 6, 7];
pub fn fft(samples: &[i16; 16]) -> (i32, i32) {
let sin = LUT.iter().cycle();
let cos = LUT.iter().cycle().skip(LUT.len() / 4);
sin.zip(cos).zip(samples).fold(
(0, 0), |(mut real, mut imag), ((sin, cos), &sample)| {
real += sample as i32 * (*cos as i32);
imag += sample as i32 * (*sin as i32);
(real, imag)
})
}
const LUT: [i16; 8] = [0, 1, 2, 3, 4, 5, 6, 7];
pub fn fft_manual(samples: &[i16; 16]) -> (i32, i32) {
const LUT_LEN: usize = LUT.len();
let mut real = 0;
let mut imag = 0;
for i in 0..(samples.len() / LUT_LEN) {
for j in 0..LUT_LEN {
let sin = LUT[j];
let cos = LUT[(j + (LUT_LEN / 4)) % LUT_LEN];
let sample = samples[i * LUT_LEN + j];
real += sample as i32 * (cos as i32);
imag += sample as i32 * (sin as i32);
}
}
(real, imag)
}
I originally ran into this in an embedded context, targeting thumbv7em-none-eabihf
. The original function took a GenericArray
and used fixed-point types from the fixed
crate, but I was able to reproduce the issue using only primitive types targetting x86_64-pc-windows-msvc
.
Here are the results of benchmarks from selected nightly
versions from the past 18 months. The benchmarks were run using the code here.
Version | Results |
---|---|
nightly-2020-12-21 | test tests::bench_fft ... bench: 63 ns/iter (+/- 16) test tests::bench_fft_manual ... bench: 5 ns/iter (+/- 1) |
nightly-2020-06-01 | test tests::bench_fft ... bench: 36 ns/iter (+/- 15) test tests::bench_fft_manual ... bench: 4 ns/iter (+/- 2) |
nightly-2020-01-02 | test tests::bench_fft ... bench: 45 ns/iter (+/- 20) test tests::bench_fft_manual ... bench: 5 ns/iter (+/- 1) |
nightly-2019-06-01 (with #![feature(const_slice_len)] ) |
test tests::bench_fft ... bench: 52 ns/iter (+/- 23) test tests::bench_fft_manual ... bench: 5 ns/iter (+/- 1) |
I took a look at the output of cargo asm
for each version. The output of fft
has 28 branch instructions, while the output of fft_manual
has none. I guess the compiler is able to completely unroll the loop when it's written manually, but not when it's written using Iterator
s.
Unfortunately, this is about the limit of my debugging ability currently. With some guidance, I might be able to do more of the leg work on my own.
Is this a known/expected performance difference? I hope not. I was really impressed at the elegance of the Iterator
-based approach. I would be disappointed if the performance is really that bad.