Description

Currently `Vec`'s `swap_remove` has the following implementation:
```rust
pub fn swap_remove(&mut self, index: usize) -> T {
    let length = self.len();
    self.swap(index, length - 1);
    self.pop().unwrap()
}
```
This is needlessly slow. The `swap` does a bounds check, and then `pop` does another bounds check that never fails. Furthermore, there is an actual swap right before the pop - this results in a lot of useless moves that don't (seem to) get optimized out.
It's possible to write an implementation that does only one bounds check and uses two moves, with a little unsafe code behind a safe interface:
```rust
pub fn swap_remove(&mut self, index: usize) -> T {
    unsafe {
        // Singular bounds check on index ensures safety.
        let hole: *mut T = &mut self[index];
        let value = ptr::read(hole);
        // Since the bounds check on index succeeded, len() >= 1 and this
        // subtraction cannot underflow.
        let back = self.len() - 1;
        ptr::copy(self.get_unchecked(back), hole, 1);
        self.set_len(back);
        value
    }
}
```
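As a sanity check, the optimized version should agree with `Vec::swap_remove` for every index, including the last one, where `hole` and the source pointer coincide and `ptr::copy`'s memmove semantics make the self-copy harmless. A standalone sketch (the free-function form here mirrors the method body above so it compiles outside `Vec`):

```rust
use std::ptr;

// Free-function form of the optimized swap_remove above.
pub fn swap_remove_opt<T>(v: &mut Vec<T>, idx: usize) -> T {
    unsafe {
        // Singular bounds check on idx ensures safety.
        let hole: *mut T = &mut v[idx];
        let value = ptr::read(hole);
        let back = v.len() - 1;
        ptr::copy(v.get_unchecked(back), hole, 1);
        v.set_len(back);
        value
    }
}

fn main() {
    // Compare against the standard library for every valid index,
    // including idx == len - 1 (the overlapping-copy case).
    for idx in 0..5 {
        let mut a: Vec<usize> = (0..5).collect();
        let mut b = a.clone();
        assert_eq!(swap_remove_opt(&mut a, idx), b.swap_remove(idx));
        assert_eq!(a, b);
    }
}
```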
I found this quite hard to benchmark, but on my machine the following benchmark showed the new implementation to be faster:
```rust
#![feature(test)]
extern crate test;

pub fn swap_remove_opt<T>(v: &mut Vec<T>, idx: usize) -> T {
    unsafe {
        // Singular bounds check on idx ensures safety.
        let hole: *mut T = &mut v[idx];
        let value = std::ptr::read(hole);
        let back = v.len() - 1;
        std::ptr::copy(v.get_unchecked(back), hole, 1);
        v.set_len(back);
        value
    }
}

// Tiny LCG RNG.
fn rng_next(s: u64) -> u64 {
    6364136223846793005 * s + 1
}

#[cfg(test)]
mod tests {
    use super::*;
    use test::Bencher;

    #[bench]
    fn bench_1000_overhead(b: &mut Bencher) {
        let mut rng = 43;
        let v: Vec<usize> = (0..1024).collect();
        b.iter(|| {
            let v = v.clone();
            for _ in 0..512 {
                rng = rng_next(rng);
                let idx = (rng >> 32) % 512;
                test::black_box(idx as usize);
            }
            v
        })
    }

    #[bench]
    fn bench_vec_1000_swap_remove(b: &mut Bencher) {
        let mut rng = 43;
        let v: Vec<usize> = (0..1024).collect();
        b.iter(|| {
            let mut v = v.clone();
            for _ in 0..512 {
                rng = rng_next(rng);
                let idx = (rng >> 32) % 512;
                test::black_box(v.swap_remove(idx as usize));
            }
            v
        })
    }

    #[bench]
    fn bench_vec_1000_swap_remove_opt(b: &mut Bencher) {
        let mut rng = 43;
        let v: Vec<usize> = (0..1024).collect();
        b.iter(|| {
            let mut v = v.clone();
            for _ in 0..512 {
                rng = rng_next(rng);
                let idx = (rng >> 32) % 512;
                test::black_box(swap_remove_opt(&mut v, idx as usize));
            }
            v
        })
    }
}
```
```
test tests::bench_1000_overhead            ... bench: 1,049 ns/iter (+/- 16)
test tests::bench_vec_1000_swap_remove     ... bench: 1,306 ns/iter (+/- 122)
test tests::bench_vec_1000_swap_remove_opt ... bench: 1,193 ns/iter (+/- 104)
```
As I said, it's quite hard to benchmark - there is a lot of overhead and only a small difference to observe. Regardless, running it multiple times consistently shows `swap_remove_opt` to be faster.
Although I can't run the above benchmark on stable because `test` is unstable, I conjecture that on stable Rust the difference is much bigger. A quick look at the disassembly should show why: https://godbolt.org/g/fU4Edu

Nightly fares a lot better in the disassembly, but as seen above it still loses out in the benchmark.
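Since the `test` crate is nightly-only, a rough measurement is also possible on stable with `std::time::Instant`. This is a hypothetical sketch, not the benchmark used above; it reuses the same LCG and workload shape, and accumulates results into a `sink` variable as a stand-in for `test::black_box`:

```rust
use std::time::Instant;

// Tiny LCG RNG, as in the benchmark above (wrapping arithmetic so
// debug builds don't panic on overflow).
fn rng_next(s: u64) -> u64 {
    6364136223846793005u64.wrapping_mul(s).wrapping_add(1)
}

fn main() {
    let base: Vec<usize> = (0..1024).collect();
    let mut rng = 43u64;
    let mut sink = 0usize; // Keeps the removals from being optimized away.
    let start = Instant::now();
    for _ in 0..1000 {
        let mut v = base.clone();
        for _ in 0..512 {
            rng = rng_next(rng);
            let idx = ((rng >> 32) % 512) as usize;
            sink = sink.wrapping_add(v.swap_remove(idx));
        }
    }
    let elapsed = start.elapsed();
    println!("total: {:?}, sink: {}", elapsed, sink);
}
```

This only gives a total wall-clock time rather than libtest's per-iteration statistics, so it is noisier, but it compiles on stable and is enough to compare the two implementations side by side.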