Bit twiddling pre-RFC


#1
  • Feature Name: bit-twiddling
  • Start Date: 2018-03-16
  • RFC PR: (leave this empty)
  • Rust Issue: (leave this empty)

Summary

Add a few common bit manipulation operations for integer types.

Motivation

Bit manipulation (“twiddling”) is a very common operation in systems programming, especially operating systems and embedded systems. Bit manipulations tend to come up in the following important use cases:

  • Constructing bit-packed representations of data. For example, pixel color values and Unix-style permission bits are often represented compactly as bit-packed u8s. Bit-packed representations are also often combined with unions to create very memory-efficient representations of values (e.g. for a struct of which we have many instances).

  • Hardware interfaces. Embedded systems and operating systems often have to deal with hardware interfaces that want certain bits in certain places. For example, on x86, the Interrupt Descriptor Table entries have a prescribed bit-packed format.

Unfortunately, bit twiddling tends to be unergonomic. There tend to be a lot of nested parentheses and casts between different-width integers, which makes the code painful to write and painful to read. This RFC aims to improve the situation by introducing a few helpful-but-generic functions/methods.
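To illustrate the ergonomics problem, here is what packing an RGB565 pixel looks like on stable Rust today (the format choice and function name are just an illustration):

```rust
// Packing an RGB565 pixel today: nested parentheses, masks, shifts,
// and width casts everywhere.
fn pack_rgb565(r: u8, g: u8, b: u8) -> u16 {
    (((r as u16) & 0x1F) << 11) | (((g as u16) & 0x3F) << 5) | ((b as u16) & 0x1F)
}

fn main() {
    assert_eq!(pack_rgb565(0x1F, 0x00, 0x00), 0xF800); // pure red
    assert_eq!(pack_rgb565(0x00, 0x3F, 0x00), 0x07E0); // pure green
    assert_eq!(pack_rgb565(0x00, 0x00, 0x1F), 0x001F); // pure blue
}
```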

Guide-level explanation


This RFC adds a bunch of methods to libcore and re-exports them through libstd. The reason they are added to libcore is that they should be available in no_std environments, which are common in embedded systems setups.

To generate a bit mask, you can use the mask method of your favorite integer type:

assert_eq!(0b00_111_000u8, u8::mask(5, 3));
assert_eq!(0b00_111_000_00000000i16, i16::mask(13, 11));

To set the i-th bit of an integer, you can use the set_bit method:

let mut x = 0u8;
assert_eq!(x, 0);
x.set_bit(3, true);
assert_eq!(x, 0b1000u8);

To get the i-th byte of an integer, use the get_byte method:

assert_eq!(0xADu8, 0xDEADBEEFu32.get_byte(2));

Often, it is useful to copy some range of one integer into another range of another integer. This can be done with the set_bits function:

let mut a = 0xDEADBEEFu32;
let b = 0xABu8;

set_bits(&mut a, 16, b, 4, 8);

assert_eq!(a, 0xDE0ABEEF);

Finally, sometimes it is useful to truncate the bit representation of an integer into a smaller integer type. For this, the Truncate trait is useful:

assert_eq!(0xBEEFu16, 0xDEADBEEFu32.truncate());

Reference-level explanation

This RFC adds some methods and functions to libcore and re-exports them through libstd. For each integer type $i (signed and unsigned, 8- to 128-bit), the following are added:

impl $i {
    /// Returns a bit mask with the given range of bits set:
    /// - `h`: most significant bit (inclusive)
    /// - `l`: least significant bit (inclusive).
    ///
    /// 0 is the index of the least significant bit.
    fn mask(h: usize, l: usize) -> Self {
        let m1 = (1 << (h + 1)) - 1;
        let m2 = (1 << l) - 1;

        m1 & !m2
    }

    /// Set the given bit to the given value:
    /// - `i`: the index of the bit to set.
    /// - `b`: `true` if the bit should be set to 1. `false` if it should be cleared to 0.
    ///
    /// 0 is the index of the least significant bit.
    fn set_bit(&mut self, i: usize, b: bool) {
        let mask = $i::mask(i,i);
        let b = if b { mask } else { 0 };
        
        *self = (*self & !mask) | b;
    }

    /// Get the `i`-th byte of the value.
    ///
    /// 0 is the index of the least significant byte.
    fn get_byte(self, i: usize) -> u8 {
        let mask = $i::mask((i+1)*8-1, i*8);
        ((self & mask) >> (i*8)) as u8
    }
}
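To make the generated code concrete, here is a runnable free-function sketch for u32 only (the `_u32` names are illustrative, not part of the proposal). Note one edge case the generic body glosses over: `(1 << (h + 1)) - 1` overflows the shift when `h` is the top bit, so the sketch special-cases it; this ties into the out-of-bounds question raised later in the thread.

```rust
/// u32-only sketch of `mask`; special-cases h == 31, where the
/// generic `(1 << (h + 1)) - 1` would overflow the shift.
fn mask_u32(h: u32, l: u32) -> u32 {
    debug_assert!(l <= h && h < 32);
    let m1 = if h == 31 { u32::MAX } else { (1u32 << (h + 1)) - 1 };
    let m2 = (1u32 << l) - 1;
    m1 & !m2
}

/// u32-only sketch of `set_bit`.
fn set_bit_u32(x: &mut u32, i: u32, b: bool) {
    let mask = mask_u32(i, i);
    *x = (*x & !mask) | if b { mask } else { 0 };
}

/// u32-only sketch of `get_byte`; byte 0 is the least significant.
fn get_byte_u32(x: u32, i: u32) -> u8 {
    ((x & mask_u32((i + 1) * 8 - 1, i * 8)) >> (i * 8)) as u8
}

fn main() {
    assert_eq!(mask_u32(5, 3), 0b00_111_000);
    assert_eq!(mask_u32(31, 0), u32::MAX);
    let mut x = 0u32;
    set_bit_u32(&mut x, 3, true);
    assert_eq!(x, 0b1000);
    set_bit_u32(&mut x, 3, false);
    assert_eq!(x, 0);
    assert_eq!(get_byte_u32(0xDEADBEEF, 2), 0xAD);
}
```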

Moreover, the set_bits function is also added:

/// Extracts bits `lb + len - 1` through `lb` of `b` and uses them to set the
/// bits `la + len - 1` through `la` of `a`.
///
/// 0 is the index of the least significant bit. `a` must be at least as wide as `b`.
///
/// # Example
///
/// ```rust
/// let mut a = 0xDEADBEEFu32;
/// let b = 0xABu8;
///
/// set_bits(&mut a, 16, b, 4, 8);
///
/// assert_eq!(a, 0xDE0ABEEF);
/// ```
pub fn set_bits<I1, I2>(a: &mut I1, la: usize, b: I2, lb: usize, len: usize)
where
    I1: Copy,
    I1: Mask,
    I1: BitOr<Output = I1>,
    I1: BitAnd<Output = I1>,
    I1: Shl<usize, Output = I1>,
    I1: Shr<usize, Output = I1>,
    I1: Not<Output = I1>,
    I1: From<I2>,

    I2: Mask,
    I2: BitOr<Output = I2>,
    I2: BitAnd<Output = I2>,
    I2: Shl<usize, Output = I2>,
    I2: Shr<usize, Output = I2>,
{
    let mask_a = I1::mask(la + len - 1, la);
    let mask_b = I1::mask(lb + len - 1, lb);

    // Clear the bits of a we wish to replace
    let a_bits = *a & !mask_a;

    let b: I1 = b.into();
    let b_bits = if la == lb {
        b & mask_b
    } else if la < lb {
        (b & mask_b) >> (lb - la)
    } else {
        (b & mask_b) << (la - lb)
    };

    // combine them
    *a = a_bits | b_bits;
}
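A monomorphic sketch of the same logic, specialized to a u32 target and u8 source, may make the shift cases easier to follow (function names illustrative only):

```rust
/// u32-only mask helper; h == 31 is special-cased to avoid shift overflow.
fn mask_u32(h: u32, l: u32) -> u32 {
    let m1 = if h == 31 { u32::MAX } else { (1u32 << (h + 1)) - 1 };
    m1 & !((1u32 << l) - 1)
}

/// Monomorphic sketch: copy `len` bits starting at bit `lb` of `b`
/// (zero-extended from u8) into `len` bits starting at bit `la` of `a`.
fn set_bits_u32(a: &mut u32, la: u32, b: u8, lb: u32, len: u32) {
    let mask_a = mask_u32(la + len - 1, la);
    let src = (b as u32) >> lb; // align the source field to bit 0
    *a = (*a & !mask_a) | ((src << la) & mask_a);
}

fn main() {
    let mut a = 0xDEADBEEFu32;
    // Same call as the doc example: bits 4..12 of 0xAB into bits 16..24 of a.
    set_bits_u32(&mut a, 16, 0xAB, 4, 8);
    assert_eq!(a, 0xDE0ABEEF); // bits 8..12 of the u8 source read as zero
}
```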

Finally, the following trait is added:

trait Truncate {
    type Output;

    /// Returns a version of `self` where the most significant bits are
    /// truncated to get the output.
    ///
    /// # Example
    /// ```rust
    /// assert_eq!(0xBEEFu16, 0xDEADBEEFu32.truncate());
    /// ```
    fn truncate(self) -> Self::Output;
}

And the following implementations are added:

  • u16: Truncate<Output = u8>
  • u32: Truncate<Output = u8>
  • u64: Truncate<Output = u8>
  • u128: Truncate<Output = u8>
  • u32: Truncate<Output = u16>
  • u64: Truncate<Output = u16>
  • u128: Truncate<Output = u16>
  • u64: Truncate<Output = u32>
  • u128: Truncate<Output = u32>
  • u128: Truncate<Output = u64>
  • i16: Truncate<Output = i8>
  • i32: Truncate<Output = i8>
  • i64: Truncate<Output = i8>
  • i128: Truncate<Output = i8>
  • i32: Truncate<Output = i16>
  • i64: Truncate<Output = i16>
  • i128: Truncate<Output = i16>
  • i64: Truncate<Output = i32>
  • i128: Truncate<Output = i32>
  • i128: Truncate<Output = i64>
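For concreteness, here is the trait plus one impl from the list written out as a runnable sketch. Note that with `Output` as an associated type, each input type can have only one truncation target (the point raised in reply #2); `as` already performs the underlying bit truncation for primitive integers.

```rust
trait Truncate {
    type Output;
    fn truncate(self) -> Self::Output;
}

// One of the impls from the list above.
impl Truncate for u32 {
    type Output = u16;
    fn truncate(self) -> u16 {
        self as u16 // drops the high 16 bits
    }
}

fn main() {
    assert_eq!(0xBEEFu16, 0xDEADBEEFu32.truncate());
}
```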

Drawbacks

It adds a larger surface to the standard library.

Rationale and Alternatives

Do nothing

The status quo is roughly the experience you get in C, C++, and many other languages. We could stay where we are. Of course, that doesn’t address the motivation.

Implement as a crate

We could implement all of these changes in another crate, rather than libstd or libcore.

Unresolved questions

  • It’s not clear how much should be done with traits and generics rather than macros.

  • Does it make sense to add impls for Truncate where the input is signed and the output unsigned? Or vice versa?


#2

Note, if you want each type to have multiple implementations like this, then Output will need to be a type parameter, not an associated type.


#3

What prevents this being implemented in a crate?


#4

I do like these ideas, but things like simplifying masks certainly feel like they don’t need to be in std. (And could go well with a variety of other bit-focused things like bit vectors or whatever.)

That said, I would like to see a trait that’s sortof a mix of this Truncate and FromBits from pre-RFC FromBits/IntoBits, since one of my longstanding wishes is to obviate as, and a key part of that is getting the [iu]N <-> [iu]M conversions into something that makes the truncation/reinterpretation/signchange/etc explicit.

Some random API thoughts:

  • The existing “what bit” APIs take u32, so this probably should as well (not usize)
  • When I see things like “(inclusive)” in a doc comment I think “why is this not a range?”
    • In fact, RangeArgument<u32> seems perfect, so you can have u32::mask(10..), u8::mask(3..=5), etc
  • I first thought set would set it to 1, but it turned out to be more what I think of as assign
  • That set_bits example call makes my eyes glaze over a bit.
    • It feels like there might be a nice fluent-style API that produces proxies of some sort, like a.bits(16..24).copy_from(b);

#5

Personally, I would really like if they were in libcore/libstd. They come up quite a lot in low-level programming, and they feel like fundamental things to do with integers… That may just be a matter of personal opinion, though.

Do you know why? This seems like an odd choice given that most other indexing APIs use usize, as far as I know. Though really, we could probably get away with u16; not many people need 8KB integers, so that’s probably more than enough…

I really like these observations. In fact, I propose a new interface, which may even subsume the From/IntoBits proposal and Truncate!

/// Like a slice into a range of bits of a numerical type.
/// It can be used to update the original value.
/// Lol. Maybe we should call it `Biterator`?
trait Bits {
    /// Sets the selected bits of `self` with the selected bits of `other`.
    fn copy_from<B: Bits>(&mut self, other: B);
}

/// Implement this for all types where looking at bits is a valid thing
/// to do.
///
/// `unsafe` because we need to guarantee that the type is valid
/// to take as a collection of bits.
unsafe trait IntoBits: Sized {
    type BitVec: Bits;
    fn bits(&mut self, ra: RangeArgument) -> Self::BitVec;
    
    fn all_bits(&mut self) -> Self::BitVec {
        self.bits(0..width($i))
    }
}

/// Implement this for all types that can be constructed from
/// an arbitrary bit string of the right length.
unsafe trait FromBits: Sized {
    // Works for arbitrary B as long as it does not have too many bits.
    fn from_bits<B: Bits>(b: B) -> Self {
        let mut new: Self = unsafe { mem::uninitialized() };
        new.all_bits().copy_from(b);
        new
    }
}

// For all numerical types $i (probably these impls can be generated
// with a macro or something).
impl IntoBits for $i {
    type BitVec = $iBitVec;
    fn bits(...) -> Self::BitVec { ... }
}

impl FromBits for $i { }

////////////////////////////////////////////////////////////////////////////////
// Examples

let mut a = ...;
let b = ...;

// Subsumes `set_bits` proposed above
a.bits(x..y).copy_from(b.bits(w..=z));
a.bits(x..y).copy_from(b.all_bits());

// Subsumes `FromBits` and `IntoBits` from the other pre-RFC
let a: f32 = f32::from_bits(a.all_bits());
let a: f32 = f32::from_bits(0u64.bits(32..64));

// Subsumes `Truncate`
let a: u32 = u32::from_bits(0xDEADBEEF_DEADBEEF.bits(0..32));
let a: u64 = u64::from_bits(32i32.all_bits());
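A tiny runnable version of the fluent `a.bits(16..24).copy_from(...)` idea, for u32 only (all names here are hypothetical, and the generic/unsafe machinery above is elided):

```rust
use std::ops::Range;

/// A toy bit-range proxy into a u32.
struct BitSlice<'a> {
    v: &'a mut u32,
    lo: u32,
    len: u32,
}

impl<'a> BitSlice<'a> {
    /// Overwrites the selected bit range with the low `len` bits of `src`.
    fn copy_from(self, src: u32) {
        // u64 intermediate avoids shift overflow when len == 32.
        let mask = (((1u64 << self.len) - 1) as u32) << self.lo;
        *self.v = (*self.v & !mask) | ((src << self.lo) & mask);
    }
}

fn bits(v: &mut u32, r: Range<u32>) -> BitSlice<'_> {
    BitSlice { v, lo: r.start, len: r.end - r.start }
}

fn main() {
    let mut a = 0xDEADBEEFu32;
    // Equivalent of `set_bits(&mut a, 16, 0xABu8, 4, 8)` from the pre-RFC.
    bits(&mut a, 16..24).copy_from(0xABu32 >> 4);
    assert_eq!(a, 0xDE0ABEEF);
}
```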

#6

Is there an implementation of this in a crate already? (For these kinds of std lib features, providing an implementation that can be tried in an RFC, or even better, that’s proven to be useful, should be a must.)

I maintain two bit manipulation crates, bitintr and bitwise, and would like to play with these because implementing some of these operations portably in a way that’s reliably efficient is non-trivial.


#7

Like others, I think this should probably be implemented as a crate first. You can do so by extending one of the many existing bit twiddling crates until it does what you want, or creating your own, at your preference.

Bit manipulations are not extremely frequent, and can be performed today on stable (although with less-than-ideal ergonomics, as you explain). By starting with a crate, we can all easily gain implementation and usage experience with the proposed interfaces, and see what works and what doesn’t. When we are convinced from everyday experience that we are onto something that deserves to be included in core/std, then it can become an RFC.

It wouldn’t be the first time that this kind of strategy is used in the Rust community. Think about itertools, simd, async/await and error handling: these are all being prototyped outside of std, with some bits gradually getting merged in today.


#8

Creating a crate isn’t saying that they won’t end up in libstd. In fact, it’s probably a prerequisite.


#9

So from a first quick review:

  • none of the methods mention what happens on access out of bounds: there is a wide spectrum of options, from panic! to doing something different depending on the operation. You might want to explore what to do here.

  • the set_bit/test_bit implementations look unnecessarily complex due to their “reuse” of mask when they could be one-liners. It might be worth checking that LLVM can handle these just fine and filing LLVM bugs otherwise.

  • get_byte is one of the many useful ways to slice an integer, get_nibble and bit pairs would be other ones, and so would be test_bit, but test_/get_ are different prefixes. Not a big inconsistency in naming, but maybe somebody can come up with a naming scheme that makes everybody happy.

  • on a quick look it is unclear to me how set_bits could be implemented efficiently due to its generic nature. I don’t know exactly what’s the most efficient way to do this but that’s probably going to be arch dependent. For example, on x86_64 with bmi2 enabled one could do a bit extract (std::arch::bmi2::bextr or std::arch::bmi2::pext) + a parallel bit deposit (std::arch::bmi2::pdep) but that’s going to need a lot of workarounds here because functions do not support specialization. On AMD if the indices are compile-time constants one could probably use std::arch::tbm::bextri instead. In any case, last time I checked, LLVM (~4) wasn’t able to do this on its own. Might be worth it to recheck this again now that we have LLVM6 but I wouldn’t bet on it.

The general feeling is that these issues wouldn’t be there if this API already existed in a crate or crates used by applications that need portable high performance.

I would find such an API useful, and the ones I wrote are pretty old by now.

cc @Amanieu since he is doing some high-perf bit manipulations as well.
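As an aside, the `get_nibble` variant mentioned in the naming bullet above could follow the same extract-and-shift pattern as `get_byte` (sketch only, name hypothetical):

```rust
/// Hypothetical `get_nibble`, for u32 only:
/// nibble 0 is the least significant 4 bits.
fn get_nibble(x: u32, i: u32) -> u8 {
    ((x >> (i * 4)) & 0xF) as u8
}

fn main() {
    assert_eq!(get_nibble(0xDEADBEEF, 0), 0xF);
    assert_eq!(get_nibble(0xDEADBEEF, 7), 0xD);
}
```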


#10

Thanks all! I can definitely sympathize with the approach of doing this in a separate crate first and maybe moving to std afterward.

Some more specific comments:

I was hoping this would come up. I was kind of hoping that the compiler is smart enough to compile these down to bitwise operations that are fast (e.g. eliding extra references and dereferences). I 100% agree that we should experiment with this first.

(Also, those two crates look really nifty! Do they work on no_std?)

Do you mean to experiment at first? In that case, I would agree, but long-term it would be nice if there were a de facto standard crate for this sort of thing. A quick search on crates.io turned up a lot of different bit-twiddling crates, and it’s not clear from the search page which ones are the best or most relevant.

Hehe that depends on what you are doing :stuck_out_tongue:

See https://github.com/mark-i-m/os1/blob/master/kernel/interrupts/idt.rs and https://github.com/mark-i-m/os1/blob/master/kernel/interrupts/tss.rs

These are extremely common in embedded and OSes…

Another issue I was hoping would come up! I guess I kind of imagine it working like << and other operations do today. In debug-mode, they panic if you overflow; in release mode, they overflow silently. Are there any downsides to simply doing that? It would at least be familiar.

It does seem less well-optimized than one would hope: https://godbolt.org/g/aSw1Ay

Agreed. This is, I think, one of the weaknesses of my original post. Why just get_byte? I don’t really want a proliferation of get_/set_ methods, though. Perhaps some sort of efficient “bit slice” would work? That’s what I was trying to get at with my Bits/IntoBits/FromBits proposal above…

Would something like #[cfg(arch_feature = "bmi")] work?


As a side note, it would be nice if there was a crates.io category for bit manipulations…


#11

Why get_byte() and not 0xDEADBEEF.bytes()[2]?

Also, the endian should probably be explicit.
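For what it’s worth, the `to_le_bytes`/`to_be_bytes` methods that were later stabilized on the integer types enable exactly this endian-explicit, index-based style:

```rust
fn main() {
    let x = 0xDEADBEEFu32;
    // Little-endian view: index 0 is the least significant byte.
    assert_eq!(x.to_le_bytes(), [0xEF, 0xBE, 0xAD, 0xDE]);
    assert_eq!(x.to_le_bytes()[2], 0xAD); // same result as get_byte(2)
    // Big-endian view, e.g. for wire formats.
    assert_eq!(x.to_be_bytes()[0], 0xDE);
}
```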


#12

I know, I have written some bare-metal code too :wink: These are also quite useful in lock-free code: since you can only do atomic operations on 32-64 bits at most, you better make the most of these bits!

What I meant here is that sane codebases, even in the embedded/OSdev realm, should not feature tons of bit manipulations. If I find myself using one specific bit manipulation trick frequently, I usually end up encapsulating it, as with any other repetitive code pattern, so that I can keep my code focused on the interesting logic that led me to use these bit tricks.

As for the rest, it seems that we end up agreeing that starting with a crate (either an existing one or a new one), and then proposing the result for inclusion in core/std after it has received some use and feedback, is a good approach for reaching your goal :slight_smile:


#13

I don’t disagree about wishing for something here—I’m sad whenever I write something like (x & (1<<y)) != 0—I just don’t feel like the APIs needed are obvious. And it’s always possible that the greatest version would be something like Erlang bit syntax.

(I certainly have my own wishlist here, like an ilog2, since I almost always want that over leading_zeros.)
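That `ilog2` wish is a one-liner over `leading_zeros` (std later stabilized `u32::ilog2` with this behavior for nonzero inputs):

```rust
/// Sketch of `ilog2` in terms of `leading_zeros`; zero has no log.
fn ilog2(x: u32) -> u32 {
    assert!(x != 0, "ilog2 of zero is undefined");
    31 - x.leading_zeros()
}

fn main() {
    assert_eq!(ilog2(1), 0);
    assert_eq!(ilog2(255), 7);
    assert_eq!(ilog2(256), 8);
}
```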

I don’t. Maybe it’s the same “well, 32 bits is a reasonable default” reasoning that makes literals default to 32-bit. I also sort of expected usize, but bits in a number are almost never related to memory size the way containers are, so I can understand using something different.


#14

They do, or did at some point. bitwise grew out of scope, with probably the fastest Morton index implementation in Rust. I’d be ok with:

  • porting bitintr to use stdsimd
  • moving bitwise to the new bitintr
  • stripping bitwise down (e.g. into a bitwise-core crate, moving morton indices into its own crate, etc.)
  • maybe proposing some subset of bitwise for std

Another issue I was hoping would come up! I guess I kind of imagine it working like << and other operations do today. In debug-mode, they panic if you overflow; in release mode, they overflow silently. Are there any downsides to simply doing that? It would at least be familiar.

This is how bitintr and bitwise work: they use debug_assert! for these. Whether this is acceptable I don’t know. Overflowing produces different behavior on different hardware, and I don’t know if LLVM assumes that it doesn’t happen, so this might be plain undefined behavior. Checking this shouldn’t be hard, and if we were to move some of these into std, or add std intrinsics for finer-grained control, we could probably be accurate enough towards LLVM to avoid undefined behavior here. This might be an argument for either putting these in std or stabilizing the intrinsics required to build these libraries on stable.

Would something like #[cfg(arch_feature = “bmi”)] work?

No because your signature is generic, but you want different implementations for the different types (the implementation for u8 and u64 are going to be necessarily different), so you are going to need a much more convoluted way of doing this.

It does seem less well-optimized than one would hope: https://godbolt.org/g/aSw1Ay

That’s an understatement. That generated assembly is horrible, as in, not really acceptable for std. Maybe we can tune the mask type to generate good machine code, but that’s going to need some research work.


@kornel

Also, the endian should probably be explicit.

That’s a really good point. I don’t know whether I agree with you here, but at the very least the API should state what the behavior is with respect to endianness. For example, saying that in this or that operation the behavior is endian-dependent might be enough.


#15

When I saw this post title, I immediately thought it would encompass at least a few of the copious bit twiddling hacks, which are often used for optimizations where the CPU doesn’t support the operation natively. Many of these are extremely common even outside of operating systems and embedded devices. For example “determining if an integer is a power of two” occurs frequently in texture handing and file formats.
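The power-of-two example is itself a classic one-liner from the bit-twiddling-hacks collection:

```rust
/// A power of two has exactly one bit set, so `x & (x - 1)`
/// clears that bit and leaves zero.
fn is_pow2(x: u32) -> bool {
    x != 0 && (x & (x - 1)) == 0
}

fn main() {
    assert!(is_pow2(1));
    assert!(is_pow2(4096));
    assert!(!is_pow2(0));
    assert!(!is_pow2(12));
}
```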


#16

Note that a bunch of those are already available in core&std using llvm intrinsics internally. For example, https://doc.rust-lang.org/std/primitive.u32.html#method.is_power_of_two


#17

I think there’s a strong case to be made for including a basic set of bit manipulation operations into core.

First, these are truly primitive operations that are often directly implemented by the underlying instruction sets.

Second, there is no “policy” involved in most of the API for these operations: they are mathematically defined for the most part, similar to the other numeric and bitwise operations.

On the other hand, there are some important decisions to make in the API that have no single obvious choice, such as how to handle overflows. Here it’s best for the entire Rust community to agree on defaults and reasonable mechanisms for the alternatives, just as we have defaults for arithmetic overflow and variants that cover the alternatives. Leaving these decisions to individual crates is a duplication of effort, and worse, an opportunity for bugs when programmers assume incorrect behavior.

Third, making these operations available in the core crate makes them easier to optimize at the highest level possible and ensures that these optimizations reach the broadest audience.

I’m sure that LLVM recognizes certain patterns of instructions as bit manipulation operations and optimizes them, but this is opaque: I have no way of knowing whether some small change to my source code will enable or disable those optimizations.

If I see possible improvements (by using processor-specific intrinsics, for instance) I need to become an LLVM developer and contribute upstream changes that may take years to appear, and I need to meet an extremely high burden to ensure that it doesn’t cause a regression in some other language using LLVM.

Individual crates can implement these optimizations, but that doesn’t help developers using other crates, or those that have implemented bit manipulation utilities on their own, written them out by hand, or used macros or code generation, which is likely to be a pretty large group.

Putting these operations into core is the only way that we will get people to use them universally, and that will only happen over time. It’s also the best way to ensure that optimization effort can be focused to improve performance for the entire Rust community rather than scattering it over dozens of similar crates. It also allows access to unstable features such as specialized intrinsics that might not be available to a third party crate.


#18

bitwise started as a portable and fast implementation of these, and then it started growing with implementations from Hacker’s Delight and SWAR techniques.

