Bit interleaving in std

I'd like to see some utilities for interleaving bits of unsigned numbers in std.

I first came across this operation when dealing with space-filling curves, which are common in computer graphics, e.g. to construct quad-/octrees. For Morton/Z-order curves, one takes the x and y (and z) components of a point and interleaves their bits, yielding ... y1 x1 y0 x0 or ... z1 y1 x1 z0 y0 x0.
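For illustration, a naive reference implementation (not the proposed API; morton2 is a made-up name) could look like this:

// Naive reference: interleave the bits of x and y into a 2D Morton code,
// placing x in the even bit positions and y in the odd ones,
// i.e. ... y1 x1 y0 x0.
fn morton2(x: u32, y: u32) -> u64 {
    let mut code = 0u64;
    for i in 0..32 {
        code |= ((x as u64 >> i) & 1) << (2 * i);
        code |= ((y as u64 >> i) & 1) << (2 * i + 1);
    }
    code
}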

Bit interleaving makes another appearance when constructing the Bayer matrices used for ordered dithering, as described here.

These are the two use cases I came across, but I'm sure there are a lot more.

graphics.stanford.edu lists three ways of implementing this operation, with "Interleave bits by Binary Magic Numbers" seemingly being the most commonly used. All three algorithms listed there first "spread" the bits of the components, which are then OR-ed together.
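For reference, the "Binary Magic Numbers" variant, adapted to spread a u32 into a u64, looks roughly like this (a sketch following the Bit Twiddling Hacks recipe; the function names are mine):

// Spread the 32 bits of x so that bit i lands at bit 2*i of the result,
// using five shift+or+and steps with "magic" masks.
fn spread_by_one(x: u32) -> u64 {
    let mut x = x as u64;
    x = (x | (x << 16)) & 0x0000_FFFF_0000_FFFF;
    x = (x | (x << 8)) & 0x00FF_00FF_00FF_00FF;
    x = (x | (x << 4)) & 0x0F0F_0F0F_0F0F_0F0F;
    x = (x | (x << 2)) & 0x3333_3333_3333_3333;
    x = (x | (x << 1)) & 0x5555_5555_5555_5555;
    x
}

// Interleave: spread both components, then OR them together.
fn interleave(x: u32, y: u32) -> u64 {
    spread_by_one(x) | (spread_by_one(y) << 1)
}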
Not listed there is the x86 pdep instruction of the BMI extension, which can do everything in one pass. Being much more general, I'd expect it to perform worse than the other algorithms, though I have not benchmarked anything.

Therefore, my proposal would be to add the following to std:

impl {u8, u16, u32, u64} {
  // Spread the bits apart, inserting n 0-bits between every two adjacent bits.
  // Overflowing bits are truncated.
  fn spread_bits<B>(self, n: u32) -> B
  where
    B: "unsigned integer larger than Self",
  { .. }
}
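With that, 2D Morton encoding would compose as something like the following (hypothetical usage; the turbofish to pick the result type is my assumption about how B would be chosen):

// Hypothetical usage of the proposed spread_bits for 2D Morton encoding:
let code: u64 = x.spread_bits::<u64>(1) | (y.spread_bits::<u64>(1) << 1);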

I'd appreciate feedback on the idea and API, as well as advice on what to do to actually get this into std.

(It's actually part of BMI2.)

On any Intel CPU with BMI2 (roughly 2013 or later), or on AMD Zen 3 or later, PDEP has a 3-cycle latency and 1-per-cycle throughput (uops.info table), and should beat the alternative methods shown in Bit Twiddling Hacks by a good margin.
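For comparison, using the instruction from Rust today looks something like this (a sketch; it needs the bmi2 target feature enabled at compile time or checked at runtime):

use core::arch::x86_64::_pdep_u64;

// Interleave two u32s with two pdep instructions: deposit the bits of x
// into the even positions and the bits of y into the odd positions.
#[target_feature(enable = "bmi2")]
unsafe fn interleave_pdep(x: u32, y: u32) -> u64 {
    _pdep_u64(x as u64, 0x5555_5555_5555_5555)
        | _pdep_u64(y as u64, 0xAAAA_AAAA_AAAA_AAAA)
}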

One thing that comes to mind is whether there are any codegen backends that have this as an intrinsic, because if the best way to use it in LLVM, say, is via an intrinsic, that's something that only core can do.

But if not, it's not obvious to me that this is common enough that it would need to be in core, vs in a crate that can be used by the things that need it.

Clang appears to have BMI2 intrinsics for this:

https://clang.llvm.org/doxygen/bmi2intrin_8h_source.html

Looks like GCC does as well. These are Intel-defined intrinsics.

I benchmarked interleaving two u32 into one u64 on a Ryzen 9 5950X. pdep is indeed faster, not only for the "spread bits by one" operation but also for the full interleaving, which I had assumed might be faster with the bit-fiddling algorithm, since its two "spread bits" operations get vectorized.

It seems there's an open proposal for C++ for a portable pdep in the form of bit_expand. The proposed implementation uses intrinsics (when available) on x86 and ARM, with a fallback implementation that shifts every bit into the right place one by one.
Having that operation available in a portable manner is exciting, although I am sceptical that the fallback implementation would have acceptable performance for a constant mask. I'd be surprised if LLVM managed to turn it into the five shift+or+and steps of the "Interleave bits by Binary Magic Numbers" algorithm.
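For context, a bit-by-bit fallback along those lines might look like this in Rust (my sketch, not the proposal's actual code):

// Portable fallback in the spirit of bit_expand/pdep: deposit the low
// bits of x into the positions selected by mask, one mask bit at a time.
fn expand_bits_fallback(x: u64, mask: u64) -> u64 {
    let mut result = 0;
    let mut m = mask;
    let mut bit = 1u64;
    while m != 0 {
        let lowest = m & m.wrapping_neg(); // isolate the lowest set mask bit
        if x & bit != 0 {
            result |= lowest;
        }
        m &= m - 1; // clear that mask bit
        bit <<= 1; // advance to the next source bit
    }
    result
}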

On the other hand, a u32::spread_bits(n: u32) or u32::spread_bits<const N: u32>() might be easier to optimize when no specialized instruction is available.

u32::spread_bits(n: u32) or u32::spread_bits<const N: u32>()

It should definitely be the former. The proper place to optimize such an expression is LLVM, which is already designed to detect when runtime parameters are in fact compile-time constants and to optimize based on that.
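To illustrate, a portable fallback for the runtime-n variant could be as simple as this (a sketch for the u32-to-u64 case; with n constant after inlining, LLVM is free to unroll and specialize the loop):

// Hypothetical portable fallback for the proposed u32::spread_bits:
// insert n zero bits after every source bit; bits pushed past the top
// of the u64 result are truncated, as the proposal specifies.
fn spread_bits(x: u32, n: u32) -> u64 {
    let step = n as u64 + 1;
    let mut out = 0u64;
    for i in 0..32u64 {
        let pos = i * step;
        if pos >= 64 {
            break; // overflowing bits are truncated
        }
        out |= ((x >> i) as u64 & 1) << pos;
    }
    out
}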
