Pre-RFC: binary patterns


#1

Summary

Allow bytes to be divided using pattern matching. This is basically lifted straight out of Erlang, with some modifications to make it make sense within Rust’s pattern matching and type system.

Motivation

Parsing binaries without macros or manual masking-and-shifting.

Guide-level explanation

A binary pattern allows you to pattern-match the contents of a byte or an array of bytes, allowing you to break up binary protocols into their constituent parts without lots of manual masking or transmutes. Binary patterns are ordered with the most significant bits in front, and they can cut across bytes to parse protocols that are not aligned.

Anatomy of a binary pattern:

Byte:
\(pattern; number_of_bits, pattern; number_of_bits, _; remaining_bits)\
Byte array:
\[pattern; number_of_bits | bstring, _; remaining_bits]\
Byte slice:
&\[pattern; number_of_bits | bstring, remaining_bytes..]\
// Example: break a u8 into two half-bytes.
let \(most_significant_half; 4, least_significant_half; 4)\ = 0x0Fu8;
assert_eq!(0x0, most_significant_half);
assert_eq!(0xF, least_significant_half);
// Example: match a byte array.
let \[first; 8, second; 8]\: [u8; 2] = [1, 2];
assert_eq!(1, first);
assert_eq!(2, second);
// Example: check if a byte starts with 1 or zero.
match 0xEEu8 {
    \(1; 1, remaining: 7)\ => println!("With the 1 stripped off: {}", remaining),
    \(0; 1, remaining: 7)\ => unreachable!(),
    _ => unreachable!(),
}
// Example: first 64 bytes of a file start with the PNG magic number.
let mut b = [0u8; 64];
reader.read_all(&mut b[..])?;
match b {
    \[b"\x89PNG\r\n\x1A\n", remaining; 56]\ => use_remaining(remaining),
    _ => return Err(InvalidPngFile),
}
// Example: count the number of ones in a vector:
let mut v: Vec<u8> = get_a_byte_vec();
let mut ones = 0;
let mut s = &v[..];
loop {
    match s {
        &[] => break,
        &\[a; 1, b; 1, c; 1, d; 1, e; 1, f; 1, g; 1, h; 1, ref remaining..]\ => {
            ones += a + b + c + d + e + f + g + h;
            s = remaining;
        }
    }
}

Reference-level explanation

Syntax in eBNF:

byte pattern = "\(", integer pattern part, {",", integer pattern part}, [","], ")\"
byte array pattern = "\[", integer pattern part, {",", integer pattern part}, [","], [[pattern], ".."], "]\"
integer pattern part = (((pattern, [":", type]) | "_" | nonnegative integer) ";" nonnegative integer) | byte string literal
  • Binary patterns are surrounded with backslashes. Within the backslashes, they may be surrounded by parenthesis (in which case, the pattern should match a single byte) or square brackets (in which case, it will match an [u8; N] array).
  • Byte array patterns cannot match slices. Only fixed-size u8 arrays. For the initial implementation, other integral types are also not supported.
  • In a binary pattern, the most significant part of a single number comes first, regardless of the endianness of the underlying CPU.
  • The sum of the bit number of all of the parts must add up to the number of bits being matched. If the user wishes to discard bits, they may match with _.
  • When using the remainder part on a byte array pattern, the sum of the number of bits matched by the other parts must be a multiple of 8. The remainder is produced as a slice.

When a number does not have enough bits to fill the number it is being matched into, it will be sign-extended if it’s signed or zero-extended if it’s unsigned.

let \(zero_extended: u8; 4, _; 4)\ = 0xF0u8;
assert_eq!(15, zero_extended);

let \(sign_extended: i8; 4, _; 4)\ = 0xF0u8;
assert_eq!(-1, sign_extended);

(as always, the default numerical type is u32, so zero-extending semantics are the default)

When matching on a byte array, the user may grab pieces of different bytes. The result should behave as if the most-significant-bit of the byte comes first (the same way bytes are written in hexadecimal or binary notation in Rust’s syntax).

let \[first_nine_bits; 9, _; 7]\ = [255, 0];
assert_eq!(255u32 << 1, first_nine_bits);

You may also extract a sub-array from a larger binary array this way.

let \[prefix: [u8; 2]; 16, _; 8]\ = [1, 2, 3];
assert_eq!([1, 2], prefix);

A binary pattern is considered exhaustive if all of its parts are variables, and will not be considered exhaustive if it contains any constants. Including non-integer constants as pattern parts is an error:

match 255u8 {
    \(None; 8)\ => println!("wat"), // compiler error: expected {integer}, found Option<_>
    _ => unreachable!(),
}

Similarly, a binary pattern must consume the entire number or byte array:

match 255u8 {
    \(_; 3)\ => println!("wat"), // compiler error: must explicitly discard extra bits
    \(_; 12)\ => println!("wat"), // compiler error: cannot pull bits out of your butt
    \[_; 8]\ => println!("wat"), // compiler error: expected [u8; 1], found u8
}

Byte array patterns are allowed to use the same remaining.. syntax as slice patterns. This is required when doing binary pattern matching against a slice.

let s: &[u8] = &[1, 2, 3];
match s {
    &\[1; 8, 2; 8, 3; 8]\ => println!("Attempted to exhaustively match a slice!"); // compiler error: not allowed to discard bits implicitly
    _ => unreachable!()
}

An binary slice pattern that is being matched against a slice that is not long enough will not match.

let s: &[u8] = &[1, 2, 3];
match s {
    &\[1; 8, 2; 8, 3; 8, 4; 8, ..]\ => unreachable!(),
    _ => println!("All good!"),
}
// Should compile, and print "All good!" to the console.

Rust cannot check binary patterns against each other for exhaustiveness:

match 255 {
    \(1; 1, _; 7)\ => println!("starts with 1"),
    \(0; 1, _; 7)\ => println!("starts with 0"),
    // compiler error: match must cover all possible cases
}

When using a string literal, it must be a string literal that would normally evaluate to [u8], not one that would normally evaluate to str. The leading b not required, even though the inner syntax is the same as a byte string (like, it allows you to match against “strings” that aren’t valid UTF-8).

Drawbacks

It might not be Zero Cost enough. Maybe it should just operate on the underlying CPU’s endinanness instead of just converting everything to Big-Endian.

I kind of pulled the syntax out of my butt. Erlang uses <<1, 2, 3>>, but that seemed like it would be ambiguous with bit shift operators and generics.

Rationale and alternatives

The big advantage here is that we don’t end up in Type System Hell. Outside of the patterns themselves, we only ever deal with ordinary machine integer types, whereas bit fields on structs end up either having to spec a u7 type, or have a “u8” that only has 7 bits of storage.

It also works on top of pattern matching, a feature that Rust already has, instead of putting a SAT solver into the type checker. Everybody who’s commented on Erlang’s binary pattern matching says they love it, so that’s a plus.

The treatment of slices is based on the already-proposed slice pattern matching. In retrospect, forcing it to only work on fixed-size arrays only is too restrictive.

Prior art

http://erlang.org/doc/efficiency_guide/binaryhandling.html#matching-binaries

  • We don’t support non-multiple-of-8 “bitstrings”. Only “binaries”. When extracting a non-multiple-of-8 number of bytes, this RFC will always extend the number.
  • Binary patterns are often either used to splat out a particular set of bits in an infallible manner, or in a loop to parse something. An infallible pattern in this proposal is either operating on a single byte, where all parts are variable, or operating on a fixed-size buffer, where all parts are variable. A pattern matched on a slice will always be fallible, and thus not allowed in a bare let statement like Erlang allows (with dynamic panicking semanitics, which aren’t safe enough for Rust).

Unresolved questions

How about binary expressions? This is probably out-of-scope, but whatever. It’s awesome!

let my_number = \(-1; 4, 2; 4\);
assert_eq!(my_number, 0xF2);

How about numbers other than u8? (ew… endianness)

Can the match checker handle exhaustiveness properly? How would that work alongside integer ranges?

Future possibilities

TBD


#2

I have no feelings about this proposal, but I will point out that the current syntax gives me a lot of unfortunate LaTEX flashbacks. Also, this <<>> syntax is almost certainly Kosher, since I don’t believe < is a valid starting token for a pattern… not that it aesthetically matches the rest of rust.


#3

since I don’t believe < is a valid starting token for a pattern…

It is

trait T {
    const C: u32 = 0;
}

impl T for () {}

fn main() {
    match 0 {
        <() as T>::C => println!("zero"),
        _ => println!("non-zero")
    }
}

#4

I’m sorry, but no. foo >> 3 & 0x7f shouldn’t be a whole new separate language feature.

If you are not willing to write (or maybe don’t trust yourself being able to correctly write) a function for extracting a subset of the bits, I suggest you open a pre-RFC for a set of standard library functions instead, that would work on integer types and allow the extraction of bits. A quick implementation off the top of my head:

use std::mem::size_of;
use std::ops::{ Sub, BitAnd, Shl, Shr, Bound, RangeBounds }; // do we have `Zero` and `One`?

trait Zero {
    fn zero() -> Self;
}

trait One {
    fn one() -> Self;
}

macro_rules! impl_zero_one {
    ($($t:ty,)*) => {$(
        impl Zero for $t {
            fn zero() -> Self {
                0
            }
        }
        impl One for $t {
            fn one() -> Self {
                1
            }
        }
    )*}
}

impl_zero_one! {
    u8,
    u16,
    u32,
    u64,
    u128,
    usize,
    i8,
    i16,
    i32,
    i64,
    i128,
    isize,
}

trait BitSubset: Sized + Copy + Zero + One +
                 Sub<Output=Self> +
                 BitAnd<Output=Self> +
                 Shl<u32, Output=Self> +
                 Shr<u32, Output=Self> {

    fn bits_at<R: RangeBounds<u32>>(&self, range: R) -> Self {
        let size = (size_of::<Self>() * 8) as u32;

        let lo = match range.start_bound() {
            Bound::Included(&n) => n,
            Bound::Excluded(&n) => n + 1,
            Bound::Unbounded => 0,
        };
        let hi = match range.end_bound() {
            Bound::Included(&n) => n + 1,
            Bound::Excluded(&n) => n,
            Bound::Unbounded => size,
        };

        let mask = if hi == size {
            // not production-ready: overflows as 2's complement in release
            // mode (correct), panics in debug mode (incorrect)
            Self::zero() - Self::one() << lo
        } else {
            (Self::one() << hi) - (Self::one() << lo)
        };

        (*self & mask) >> lo
    }
}

impl BitSubset for u8 {}
impl BitSubset for u16 {}
impl BitSubset for u32 {}
impl BitSubset for u64 {}
impl BitSubset for u128 {}
impl BitSubset for usize {}
impl BitSubset for i8 {}
impl BitSubset for i16 {}
impl BitSubset for i32 {}
impl BitSubset for i64 {}
impl BitSubset for i128 {}
impl BitSubset for isize {}

#[cfg(test)]
mod test {
    use BitSubset;

    #[test]
    fn it_works() {
        let x: u16 = 0b1010_0101_0011_0110;
        let all_bits = x.bits_at(..);
        assert_eq!(all_bits, x);

        let four_to_eight_exclusive = x.bits_at(4..8);
        let four_to_seven_inclusive = x.bits_at(4..=7);
        assert_eq!(four_to_seven_inclusive, four_to_eight_exclusive);
    }
}

#5

You can just define a macro that splits an integer’s bits and then pattern match on the resulting tuple.

If language support is to be added, then it should be about adding bit-aligned fields in structs/enum/unions and possibly anonymous structs/enums/unions, with the pattern matching being a consequence of that.


#6

Whoops, I suspected <>:: syntax was accessible, should’ve checked!


#7

The problem with doing it that way is that it forces you to nest more. For example, if I have a frame protocol like this one:

FRAME TYPE 1:
  01 xx xx yy yy
FRAME TYPE 2
  02 xx xx xx xx
FRAME TYPE 3
  03 01 xx xx xx
FRAME TYPE 4
  03 02 xx xx xx

Then splitting it into a tuple would require this kind of structure:

let (head, tail) = array.split_at(1);
match head {
    0x01 => {
        let (first_part, second_part) = tail.split_at(2);
    }
    0x02 => {
        ....
    }
    0x03 => {
        let (taken, remaining) = tail.split_at(2);
        match taken {
            0x01 => ...
            0x02 => ...
        }
    }
}

Whereas binary pattern matching can just do it all at once:

match array {
    \[0x01; 8, first_part; 16, second_part; 16]\ => ...
    \[0x02; 8, remaining; 32]\ => ...
    \[0x0301; 16, remaining; 24]\ => ...
    \[0x0302; 16, remaining; 24]\ => ...
}

Especially the 0x0301 / 0x0302 ones, because it’s essentially a protocol with a variable-width opcode, the result is a lot more compact than is possible without actually integrating into the language.

After converting foo >> 3 & 0b01111111, and accounting for the affect of shifting and ANDing, which will be performed in that order, I think that’s equivalent to let \[result; 5, _; 3]\ = foo.

But that’s really annoying. You’re expressing the thing you’re doing in terms of raw CPU instructions (shift left by 3, then zero out the topmost bit), instead of expressing your intent. Also, does that AND operating actually doing anything?


#8

Edited the OP to allow matching on slices, because, the more that I think about it, the more I think manual buffering that you’d be required to do with bare arrays is too noisy to be worth it.


#9

I don’t see why you are pointing me to the operator precedence reference; I know that >> has higher precedence than &, the code was completely intentional. The idiomatic way one usually extracts bits n...k from an integer is shifting the original integer to the right by n places then anding the result together with the mask 2k–n – 1. (I do realize that in the implementation of the bits_at function above, I performed the masking first and the shifting second, but it doesn’t really matter – that implementation is also correct, because I adjusted the value of the mask and parenthesized the bitwise and in (*self & mask) >> lo accordingly.)

(This was in fact one of the strongest reasons why Rust had exchanged the relative precedence of shifts and bitwise logical ops – in C and derived languages, they traditionally had the opposite relative precedence, which was annoying and error-prone when one was performing bit manipulation.)

So foo >> 3 & 0x7f extracts 7 bits from foo, starting at bit #3. It was just an example, I don’t see what is wrong with it. My point was that this is such a trivial operation that it doesn’t deserve growing the core language surface with a comparatively large feature (patterns interact with a lot of other stuff).

What do you mean? Of course it is doing something, it discards the bits which we are not interested in.

I beg to differ. If you have done anything related to bits before, it shouldn’t be.

If we go that far (and why shouldn’t we – high level concepts are good), then bit subsets aren’t much better either. Bitfields are themselves a quite low-level concept. When we are operating with semantic types, we are usually not interested in the exact bit layout except on the lowest layers where the bits themselves are the only essential thing (e.g. serialization or networking).

So, when working with values that come from (or will go into) a bitfield, we shouldn’t be focusing on exactly which bits they are represented by. We should be translating the bits to flags, values, newtype wrappers, etc. – whatever the domain model requires, in general, instead of making it easier to write leaky abstractions by encouraging bit wrangling in domain model types.

For example, if I was using a networking library, I would be annoyed by an API that gave me, say, a TcpPacketHeader(u64), expecting me to extract raw bits from its pattern using bit range notation, instead of providing me with higher-level functions bearing clear semantics, such as seq_num(&self) -> u32 or is_ack(&self) -> bool.


#10

Please let’s not confuse the discussion of binary patterns – the subject of this thread – with the subject of bitfields, which is a concurrent discussion thread.

The proposed syntax makes it possible to pattern-match against non-contiguous fields, ignoring or saving intermediary bits, which seems to be a minimum requirement.

I personally would like to see the proposal unified with the more general topic of bitfields, so that the pattern-matching could be against defined bitfields without necessitating dropping to the representation level of the current pre-RFC. It should also be unified, to the extent possible, with the apparently-dominant bitfields crate from the Rust Embedded Development effort.


#11

I’m not confusing the two, but they are closely related, and the issues I brought up still stand.