As part of an effort to optimize some of the set operations on bitsets in Bevy's multithreaded system executor, I've added explicit vectorization support to fixedbitset. If you compile for x86, x86_64, or wasm32 targets with the SSE2, SSE4.1, AVX, AVX2, or simd128 target features enabled, all set operations will now use SIMD intrinsics. On my local machine with AVX2 enabled, union_with over sets of 1 million bits runs in just over 1 microsecond! The equivalent implementation in 0.4, which operated on u32 blocks, takes roughly 8x longer! (Note: this is likely because 250KB fits easily in L2 cache, so these gains may fall off as the working set fills L3 and spills over into main memory.)
The backing implementation's structure loosely follows how glam structures its SIMD support, so the front-facing implementation should still be readable to those unfamiliar with the intrinsics.
To my knowledge, this is the first bitset implementation in Rust that explicitly vectorizes its backing store. I've yet to run comparative benchmarks against other bitset implementations, though I imagine the results might be very close if they autovectorize well.
aarch64 with NEON should be possible, but, as with glam, the performance gains don't seem to be there.
Also shoutout to msvanberg and SkiFire13 for helping test out ARM support and fixing my dumb undefined behavior mistakes along the way (miri, you da real MVP).