Reviving the bit-data discussion


#1

Reviving the bit-data discussion

Rust can be a great language for systems-level programming, possibly taking roles that C and C++ typically have had. With the emergence of “Internet of Things” (IoT), there are efforts to support microcontrollers such as ATMEGA (Arduino) and ARM-family chips (STM32s, Kinetix, etc.) However, these efforts are somewhat crippled with the rather weak support Rust has regarding exact representation of data, especially when it concerns data at the bit level.

A few years ago, we had a multiple RFCs that tried to improve things, but all of these were closed or postponed. One of the reasons why was that many do not understand the problem space. Many have responded in the lines of “I don’t know why you can’t use macros for this.”, to “I’ve never had the need to do this.” Therefore, to be successful in a future RFC, we need to educate and exemplify the problems faced when dealing with “low-level” details, such as bit-specifications, memory layout, endianness, access rules, and ergonomics in this space.

Exact Representation of Data

There are many examples of the need for an accurate memory representation:

  • binary data formats (network and in-memory)
  • manipulating hardware registers
  • machine-code in a JIT or compiler setting.
  • data sent to and from GPUs or other processors
  • data sent to and from C- or other languages (FFI)

What is common to all of these is that the exact memory representation has to be observed and that there must be a guarantee that the compiler follows the specification. The compiler must not rearrange data in any way. We already see this implemented in #[repr(C)], which we support due to FFI needs.

Example

A relatively common image format used by GPUs is RGB565. Five bits are used for red and blue, while six is used for green. In total, this is 16 bits, so it fits in a u16. To manipulate these values, one has to understand bit-manipulation:

struct RGB565(u16);

impl RGB565 {
    fn r(self) -> u8 { (self >> 11) & 0x1f }
    fn g(self) -> u8 { (self >> 5) & 0x3f }
    fn b(self) -> u8 { self & 0x1f }
    fn set_r(&mut self, r: u8) { self &= ~(0x1f << 11); self |= (r & 0x1f) << 11 }
    fn set_g(&mut self, g: u8) { self &= ~(0x3f << 5); self |= (g & 0x3f) << 5 }
    fn set_b(&mut self, b: u8) { self &= 0x1f; self |= (b & 0x1f) }
}

fn test() {
    let c = RGB565(0);
    c.set_r(10);
    c.set_g(c.g() + 1);
}

Compare this to another format RGB888:

struct RGB888 {
    r: u8, g: u8, b: u8
}
fn test() {
    let c = RGB888 {r: 10, g: 0, b: 0 };
    c.g += 1;
}

Which one is easier to understand and use? Which one is more likely to have errors? I think you can all agree that it would have been much simpler to have something like bit-data:

bitdata RGB565 : u16 {
    r: u5, g: u6, b: u5
}
fn test() {
    let c = RGB565 {r: 10, g: 0, b: 0 };
    c.g += 1;
}

How much of a problem is this?

The short answer is: It depends on the ratio of bit-fiddling code versus non-bit-fiddling code. If most of what program uses i8, i16, i32 and i64 level, then you probably won’t care. But if a significant portion of the code needs to use bits, it will be a problem.

In Rust, we can now write for i in 0..n { ... }. While a devil’s advocate could argue that this nice syntax could be replaced with C’s for (i = 0; i < n; i++) { ... } syntax, it is quite clear why the former is better: i is referenced once, resulting in much fewer errors and much more readable code.

I hope the above example can showcase the same issue on bit-data.

Goals

I need to collect other examples. If you know about code that exemplifies bit-data issues, please respond to this thread. Other wants or needs in this area are also very much appreciated.


#2

There are macros for bit fields, don’t ignore them:

Here you see some example usages of bit twiddling:


Here I have discussed a bit about bitflags:

Regarding bit level safety:


#3

I was recently made aware of the #[packet] macro in https://github.com/libpnet/libpnet which helps close the gap. Thanks @faern!


#4

Thanks. Please be aware that I’m looking for reasons not to use macros, not examples of using macros. One reason for that is the ioregs!() fiasco.


#5

I want there to be reason to add this to the language too, but it’s only fair to point out that macro-generated getters and setters make a much nicer interface than plain bitshifting.


#6

That’s a fair point.

From my perspective, here are some general objections against the use of macros:

  • Awkward syntax: color.set_red(color.get_red() + 1), versus color.red += 1. I suppose the situation would improve if Rust supported properties.
  • It might become harder to implement generic code.
  • Limited compile-time checks for errors such as size overflow and specification errors.
  • Definitions needs to use a domain-specific language, resulting in weak syntax-coloring and editor support.
  • Hard to imagine support for variants and match like constructs.
  • Finally, this feels like it should be part of the core language, just so that we don’t end up with a situation where we have to choose between multiple syntaxes.

On that note, I believe the ioregs!() macro-system had problems with the ever-lasting compiler changes. Perhaps the story would be slightly better if it was based off macros 2.0.

Anything else?


#7

I’m interested in this, but there are complications. I’m sure you know about these already, but for posterity:

  • C bitfields don’t specify the order of bits in a word. This is in part due to historical disagreement, between architectures, on which weight bit 0 has. (Certain big-endian architectures treat the MSB as bit zero, and bitfields in a struct start there instead of at the LSB.)

  • Atomicity of access is important for concurrent drivers. I suspect that one can write an atomic access to a bitfield in a word on x86 (though I haven’t done it myself); on ARM you have to jump through some hoops. Partially because of this, the behavior of volatile bitfields in C is not terribly well-defined.

The main language-level bit structure mechanism I’m aware of is Erlang’s. It’s designed for network protocol definitions and does very well at that problem, but isn’t well defined for accessing memory-mapped registers containing multiple bit fields.

To refer back to your running use case, arbitrary bit ordering might be just fine for maintaining RGB565 buffers for software use – but as soon as you want to DMA the results into graphics RAM, the order matters.

I think ioregs! is suffering because it is trying to do too much. I’m currently working on my own set of macros for modeling bit-packed hardware registers that use a different approach (somewhat closer to the approach I used in C++14).

I need to evaluate the macros that @leonardo suggested.


#8
  • C bitfields don’t specify the order of bits in a word. This is in part due to historical disagreement, between architectures, on which weight bit 0 has. (Certain big-endian architectures treat the MSB as bit zero, and bitfields in a struct start there instead of at the LSB.)

Indeed, I’m quite aware of this. I see it as an opportunity to be better than C! In fact, it would be very nice to have storage types such as be_u32, le_u32, etc. Also, it seems very useful to be able to specify bit-order. Does setting the first bit high mean 1 << 31 or 1 << 0?

  • Atomicity of access is important for concurrent drivers. I suspect that one can write an atomic access to a bitfield in a word on x86 (though I haven’t done it myself); on ARM you have to jump through some hoops. Partially because of this, the behavior of volatile bitfields in C is not terribly well-defined.

I don’t think that’s usually the case. Most often, atomicity is needed at “word” level, typically 32 bits. That’s why I in my proposal used the concept of a “carrier type.” All reads and writes to memory have to be performed using the carrier type, while bit-fields only affect (temporary) registers.

Finally, I based my proposal partly on the work of the Haskell-like research language “Habit,” created to provide safe interfaces to hardware. Pdf: http://hasp.cs.pdx.edu/habit-report-Nov2010.pdf


#9

I want to try and get an idea of the set of operations necessary on bit-sized fields. I assume you want to be able to do anything you could if it were some other integer-typed field. Would field access suffice (with some kind of implicit or explicit casting)? Or do you essentially need to introduce in and un numeric types to have a safe and ergonomic solution?


#10

Would field access suffice (with some kind of implicit or explicit casting)?

I think the idea of a “carrier type” is fundamental. See @cbiffle’s concerns above. Operations on memory have to be done on native types. Addresses must be byte-aligned, for instance. Atomicity is defined on native types. Because of this, yes, field access would suffice.

Bit-data types, such as u5 or i17 thus shouldn’t have operational meaning. You could allow something like

   let a:u5 = 0;
   let b:u5 = 2;
   let c = a + b;

But, I don’t think it is nescessary at all. It might even be a very bad idea.

Instead, I see bit-data-types just as I understand C-like bit-fields. They are simply access-modifiers. If, as part of the type specification, a bit-field resides at bit numbers [10, 9, 8, 7] as a u4, then storage.field means { reg R = *storage; R >>= 7; R &= 0b1111; return R }.

What the compiler needs to do is to warn if you put a too big value into a bit-field. storage.field = 16 shouldn’t put 0 into a 4-bit wide field. But, even if you did allow that, it would still be “fine.” Certainly storage.field += 1 should wrap around if storage.field == 15.


#11

@engstad

I think your best reasons are:

  • It might become harder to implement generic code.
  • Limited compile-time checks for errors such as size overflow and specification errors.
  • Hard to imagine support for variants and match like constructs.

The common thread is teaching the compiler to reason about these, not merely perform the operations.

(@nrc too)

I actually want full {u,i}N types for static analysis purposes. The key however is they would only take up that minimal number of bits in packed structs (where referenced can be prohibited). On they’re own, they should take rounded up “normal” number of bits (kind of like bool with 2, not 256 valid representations).

Also, it’s important to allow these types as repr types for enums. I know I would use this for a lot for TCP/IP stack packets and x86 stuff (two binary interfaces I’ve personally worked with).


#12

A question of syntax

If we did have a syntax for bit-fields, I’m obviously advocating a style using u{N} and i{N}, though I could be convinced otherwise. But we still have the issues regarding bit-data unions. Originally, I thought of:

bitdata Type : u16 {  // u16 is the carrier type
   Variant1 { a: u5, b: u5, c: u6 },
   Variant2 { d: u16 }
}

But this syntax is compersome when you only have a single variant (or straight C-like struct with bit-fields).

Perhaps instead, the syntax should be:

union MyUnion {
   v1: bitdata { a: u5, b: u5, c: u6 },  // carrier type inferred to be u16
   v2: bitdata { d: u16 } // carrier type inferred to be u16
}

In other words, we can do this in two steps. First, we introduce the bitdata data-type specification, and only then we consider unions of bits?


#13

Would you mind exploring this in more detail? It sounds quite interesting!


#14

Two examples (I should go to sleep):

  • Rust knows match (_: u4) { 0 => _, 1 => _, 2 => _, 3 => _, } is exhaustive.
  • If we get “packed tuples” let #[packed] (b0: u1, b0: u1, r: u6) = transmute(1u8); just works.

Hopefully you can extrapolate form there :).


#15

Could you explain a bit the semantics you expect for bitdata unions please? How would this differ from regular Rust unions?


#16

@nrc Do you mean “unsafe enums”? I’m afraid I don’t know the status of that, is it in?


#17

I mean unions there is an accepted RFC and an unstable implementation.


#18

Thanks very much @nrc. Your link was very useful.

I think that RFC can work quite well with bit-data. The one amendment would be that you should be able to place tag patterns “inside” bit-fields.

The reason for having this feature is that it is not always entirely clear-cut where to put the tag. For instance, imagine an i86 machine code specification. Sometimes, the “tag” (determining the instruction type) is 1-bit, but sometimes it involves several bits.

Another example would be tricks used in the representation of dynamicaly typed languages. If pointers are forced to be 8-byte aligned, then you can use those 3 free bits to “tag” values differently:

|XXX....XXXXX|0| A 63-bit integer
|XXX....XXX|001| Symbol-table entry
|XXX....XXX|011| An inline string
|XXX....XX|0101| Nil/False
|XXX....XX|1101| True
|XXX....XXX|111| A 64-bit pointer 

I imagined this possible by adding “tag-specifiers.” They would comprise bit-patterns that would be set to determine and decode the meaning of the particular set of bits. I proposed the syntax

{field-name}: {bit-type} = {bit-value}

As an example, here’s the same example used in the RFC, but using only 32-bits while sacrificing a single bit of precision.

// Compact value storing 31-bit unsigned integers or 
// floating point with 1-less mantissa bit.
union Value {
    u: bitdata U { tag: u1 = 0, v: u31 },
    f: bitdata F { tag: u1 = 1, s: u1, e: u8, m: u22}
}

As you can see, the first bit indicates what the value refers to. If the first bit is 0, then it is a u31, otherwise it is a 31-bit floating point number. Because of this, you can determine if the U-arm or the F-arm should be taken:

// Regular IEEE representation of floating point
bitdata Fp32 : f32 {
    sgn: u1, exp: u8, mnt: u23
}

fn is_zero(v: Value) -> bool {
    // This match is possible, due to the tag specificiation in the bit-data.
    match v {
        U { v: 0 } => true,
        F { s, e, m } => Fp32 { sgn: s, exp: e, mnt: m<<1 } == 0.0
    }
}

Notice that the carrier type for Fp32 is f32, so the constructed value can be used directly as a f32. Also, note that the test works for the IEEE floating point -0.0.


#19

I haven’t read the whole thread carefully, so I’m sorry if what I write is redundant, but the following stood out to me:

I think something similar should already be possible, like this:

color.red() += 1

The implementation would look something like this (attention, untested pseudo-code):

struct Color(u32);

impl Color {
    fn red(&mut self) -> Red {
        Red(&mut self.0)
    }
}


struct Red<'a>(&'a mut u32);

impl AddAssign<u8> for Red {
    fn add_assign(&mut self, rhs: u8) {
        self.0 += rhs as u32 << 24;
    }
}

I haven’t bothered with #[derive(Copy)] and such, but the general concept should work.


#20

Full-fledged {u,i}N support

@Ericson2314 made the interesting remark that perhaps it would be a good idea to allow generic bit-lengths as basic types. His reasoning was that the compiler could reason with packed data and statically infer more about ranges. I thought perhaps this was a bad idea, but after thinking about it, I tend to agree.

Another example is range-checking. The compiler can infer that the array access is safe without range bounds,

fn read(index: u4, array: &[u32; 16]) -> u32 {
    array.get_unchecked(index) // safe!
}

Random thoughts

  • Literals: Extend it to accept any number: 15u4.
  • Conversions: Just as before, you must use as: var as u24
  • One note: i1 is a strange type, because the values are {0, -1}. (Since: i2 = {-2, -1, 0, 1}.)
  • It makes sense to support bit concatenation, e.g.: [#bitpacked] (0b1111u4, 0b0000u4) == 0b1111_0000u8
  • Alternatively: a # b could be bit concatenation.
  • 15u4 # 0u4 == 0xF0 as u8
  • For future compatibility with type-level naturals and generics, it may prudent to reserve Unsigned<> and Int<>:
  • u4 == Unsigned<4>, and i4 == Int<4>.
  • We’ll need bitsizeof::<T> and bitoffsetof::<T>.

Why not support it?

My main thought against it is that it might be hard to see the difference between types that can be stored in memory, and those that don’t. A new programmer would be surprised to learn that u8, u16, u32 all are special, but u24 is not. However, if the compiler is smart enough to warn and error out on misuse, then I don’t see a big problem.

How would it look?

For my RGB example:

// Complete spec; most significant bit first, little-endian, carrier type: u16
#[bitpacked(msbf, le, u16)] 
struct RGB565 {
    r: u5, g: u6, b: u5
}

The union-example:

// Compact value storing 31-bit unsigned integers or 
// floating point with 1-less mantissa bit.
union Value {
    u: #[bitpacked] struct U { tag: u1 = 0, v: u31 },
    f: #[bitpacked] struct F { tag: u1 = 1, s: u1, e: u8, m: u22}
}

Type-theoretic rant

This would require the type-checker to know more about type-level natural numbers, and I think this just might interest the more type-theoretic types out there. Indeed, to extend it generically, we’d get something like:

fn read<I, R>(index: I, array: &[R; N]) where I <: Nat<N> {
    array.get_unchecked(i) // safe!
}

And somehow the typechecker knows that u32 (or Unsigned<32>) equals Nat<Exp2(32)-1> (natural type number), which equals Nat(0xFFFF_FFFF), and I’ve invented the <: syntax to mean sub-type.