Skip to content

Iterator-based approach performs 10x worse than manual implementation #80416

Open
@bradleyharden

Description

@bradleyharden

I'm not sure if this qualifies as a bug or not.

In short, the following two functions give the same output, but fft performs about 10x worse than fft_manaul.

const LUT: [i16; 8] = [0, 1, 2, 3, 4, 5, 6, 7];

pub fn fft(samples: &[i16; 16]) -> (i32, i32) {
    let sin = LUT.iter().cycle();
    let cos = LUT.iter().cycle().skip(LUT.len() / 4);
    sin.zip(cos).zip(samples).fold(
        (0, 0), |(mut real, mut imag), ((sin, cos), &sample)| {
            real += sample as i32 * (*cos as i32);
            imag += sample as i32 * (*sin as i32);
            (real, imag)
    })
}
const LUT: [i16; 8] = [0, 1, 2, 3, 4, 5, 6, 7];

pub fn fft_manual(samples: &[i16; 16]) -> (i32, i32) {
    const LUT_LEN: usize = LUT.len();
    let mut real = 0;
    let mut imag = 0;
    for i in 0..(samples.len() / LUT_LEN) {
        for j in 0..LUT_LEN {
            let sin = LUT[j];
            let cos = LUT[(j + (LUT_LEN / 4)) % LUT_LEN];
            let sample = samples[i * LUT_LEN + j];
            real += sample as i32 * (cos as i32);
            imag += sample as i32 * (sin as i32);
        }
    }
    (real, imag)
}

I originally ran into this in an embedded context, targeting thumbv7em-none-eabihf. The original function took a GenericArray and used fixed-point types from the fixed crate, but I was able to reproduce the issue using only primitive types targetting x86_64-pc-windows-msvc.

Here are the results of benchmarks from selected nightly versions from the past 18 months. The benchmarks were run using the code here.

Version Results
nightly-2020-12-21 test tests::bench_fft ... bench: 63 ns/iter (+/- 16)
test tests::bench_fft_manual ... bench: 5 ns/iter (+/- 1)
nightly-2020-06-01 test tests::bench_fft ... bench: 36 ns/iter (+/- 15)
test tests::bench_fft_manual ... bench: 4 ns/iter (+/- 2)
nightly-2020-01-02 test tests::bench_fft ... bench: 45 ns/iter (+/- 20)
test tests::bench_fft_manual ... bench: 5 ns/iter (+/- 1)
nightly-2019-06-01
(with #![feature(const_slice_len)])
test tests::bench_fft ... bench: 52 ns/iter (+/- 23)
test tests::bench_fft_manual ... bench: 5 ns/iter (+/- 1)

I took a look at the output of cargo asm for each version. The output of fft has 28 branch instructions, while the output of fft_manual has none. I guess the compiler is able to completely unroll the loop when it's written manually, but not when it's written using Iterators.

Unfortunately, this is about the limit of my debugging ability currently. With some guidance, I might be able to do more of the leg work on my own.

Is this a known/expected performance difference? I hope not. I was really impressed at the elegance of the Iterator-based approach. I would be disappointed if the performance is really that bad.

Metadata

Metadata

Assignees

No one assigned

    Labels

    C-bugCategory: This is a bug.I-slowIssue: Problems and improvements with respect to performance of generated code.

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions