Summary
Allow bytes to be divided using pattern matching. This is basically lifted straight out of Erlang, with some modifications to make it make sense within Rust’s pattern matching and type system.
Motivation
Parsing binaries without macros or manual masking-and-shifting.
Guide-level explanation
A binary pattern allows you to pattern-match the contents of a byte or an array of bytes, allowing you to break up binary protocols into their constituent parts without lots of manual masking or transmutes. Binary patterns are ordered with the most significant bits in front, and they can cut across bytes to parse protocols that are not aligned.
Anatomy of a binary pattern:
Byte:
\(pattern; number_of_bits, pattern; number_of_bits, _; remaining_bits)\
Byte array:
\[pattern; number_of_bits | bstring, _; remaining_bits]\
Byte slice:
&\[pattern; number_of_bits | bstring, remaining_bytes..]\
// Example: break a u8 into two half-bytes.
let \(most_significant_half; 4, least_significant_half; 4)\ = 0x0Fu8;
assert_eq!(0x0, most_significant_half);
assert_eq!(0xF, least_significant_half);
// Example: match a byte array.
let \[first; 8, second; 8]\: [u8; 2] = [1, 2];
assert_eq!(1, first);
assert_eq!(2, second);
// Example: check if a byte starts with 1 or zero.
match 0xEEu8 {
\(1; 1, remaining: 7)\ => println!("With the 1 stripped off: {}", remaining),
\(0; 1, remaining: 7)\ => unreachable!(),
_ => unreachable!(),
}
// Example: first 64 bytes of a file start with the PNG magic number.
let mut b = [0u8; 64];
reader.read_all(&mut b[..])?;
match b {
\[b"\x89PNG\r\n\x1A\n", remaining; 56]\ => use_remaining(remaining),
_ => return Err(InvalidPngFile),
}
// Example: count the number of ones in a vector:
let mut v: Vec<u8> = get_a_byte_vec();
let mut ones = 0;
let mut s = &v[..];
loop {
match s {
&[] => break,
&\[a; 1, b; 1, c; 1, d; 1, e; 1, f; 1, g; 1, h; 1, ref remaining..]\ => {
ones += a + b + c + d + e + f + g + h;
s = remaining;
}
}
}
Reference-level explanation
Syntax in eBNF:
byte pattern = "\(", integer pattern part, {",", integer pattern part}, [","], ")\"
byte array pattern = "\[", integer pattern part, {",", integer pattern part}, [","], [[pattern], ".."], "]\"
integer pattern part = (((pattern, [":", type]) | "_" | nonnegative integer) ";" nonnegative integer) | byte string literal
- Binary patterns are surrounded with backslashes. Within the backslashes, they may be surrounded by parenthesis (in which case, the pattern should match a single byte) or square brackets (in which case, it will match an [u8; N] array).
- Byte array patterns cannot match slices. Only fixed-size u8 arrays. For the initial implementation, other integral types are also not supported.
- In a binary pattern, the most significant part of a single number comes first, regardless of the endianness of the underlying CPU.
- The sum of the bit number of all of the parts must add up to the number of bits being matched. If the user wishes to discard bits, they may match with
_
. - When using the remainder part on a byte array pattern, the sum of the number of bits matched by the other parts must be a multiple of 8. The remainder is produced as a slice.
When a number does not have enough bits to fill the number it is being matched into, it will be sign-extended if it’s signed or zero-extended if it’s unsigned.
let \(zero_extended: u8; 4, _; 4)\ = 0xF0u8;
assert_eq!(15, zero_extended);
let \(sign_extended: i8; 4, _; 4)\ = 0xF0u8;
assert_eq!(-1, sign_extended);
(as always, the default numerical type is u32
, so zero-extending semantics are the default)
When matching on a byte array, the user may grab pieces of different bytes. The result should behave as if the most-significant-bit of the byte comes first (the same way bytes are written in hexadecimal or binary notation in Rust’s syntax).
let \[first_nine_bits; 9, _; 7]\ = [255, 0];
assert_eq!(255u32 << 1, first_nine_bits);
You may also extract a sub-array from a larger binary array this way.
let \[prefix: [u8; 2]; 16, _; 8]\ = [1, 2, 3];
assert_eq!([1, 2], prefix);
A binary pattern is considered exhaustive if all of its parts are variables, and will not be considered exhaustive if it contains any constants. Including non-integer constants as pattern parts is an error:
match 255u8 {
\(None; 8)\ => println!("wat"), // compiler error: expected {integer}, found Option<_>
_ => unreachable!(),
}
Similarly, a binary pattern must consume the entire number or byte array:
match 255u8 {
\(_; 3)\ => println!("wat"), // compiler error: must explicitly discard extra bits
\(_; 12)\ => println!("wat"), // compiler error: cannot pull bits out of your butt
\[_; 8]\ => println!("wat"), // compiler error: expected [u8; 1], found u8
}
Byte array patterns are allowed to use the same remaining..
syntax as slice patterns. This is required when doing binary pattern matching against a slice.
let s: &[u8] = &[1, 2, 3];
match s {
&\[1; 8, 2; 8, 3; 8]\ => println!("Attempted to exhaustively match a slice!"); // compiler error: not allowed to discard bits implicitly
_ => unreachable!()
}
An binary slice pattern that is being matched against a slice that is not long enough will not match.
let s: &[u8] = &[1, 2, 3];
match s {
&\[1; 8, 2; 8, 3; 8, 4; 8, ..]\ => unreachable!(),
_ => println!("All good!"),
}
// Should compile, and print "All good!" to the console.
Rust cannot check binary patterns against each other for exhaustiveness:
match 255 {
\(1; 1, _; 7)\ => println!("starts with 1"),
\(0; 1, _; 7)\ => println!("starts with 0"),
// compiler error: match must cover all possible cases
}
When using a string literal, it must be a string literal that would normally evaluate to [u8], not one that would normally evaluate to str. The leading b
not required, even though the inner syntax is the same as a byte string (like, it allows you to match against “strings” that aren’t valid UTF-8).
Drawbacks
It might not be Zero Cost enough. Maybe it should just operate on the underlying CPU’s endinanness instead of just converting everything to Big-Endian.
I kind of pulled the syntax out of my butt. Erlang uses <<1, 2, 3>>
, but that seemed like it would be ambiguous with bit shift operators and generics.
Rationale and alternatives
The big advantage here is that we don’t end up in Type System Hell. Outside of the patterns themselves, we only ever deal with ordinary machine integer types, whereas bit fields on structs end up either having to spec a u7 type, or have a “u8” that only has 7 bits of storage.
It also works on top of pattern matching, a feature that Rust already has, instead of putting a SAT solver into the type checker. Everybody who’s commented on Erlang’s binary pattern matching says they love it, so that’s a plus.
The treatment of slices is based on the already-proposed slice pattern matching. In retrospect, forcing it to only work on fixed-size arrays only is too restrictive.
Prior art
http://erlang.org/doc/efficiency_guide/binaryhandling.html#matching-binaries
- We don’t support non-multiple-of-8 “bitstrings”. Only “binaries”. When extracting a non-multiple-of-8 number of bytes, this RFC will always extend the number.
- Binary patterns are often either used to splat out a particular set of bits in an infallible manner, or in a loop to parse something. An infallible pattern in this proposal is either operating on a single byte, where all parts are variable, or operating on a fixed-size buffer, where all parts are variable. A pattern matched on a slice will always be fallible, and thus not allowed in a bare
let
statement like Erlang allows (with dynamic panicking semanitics, which aren’t safe enough for Rust).
Unresolved questions
How about binary expressions? This is probably out-of-scope, but whatever. It’s awesome!
let my_number = \(-1; 4, 2; 4\);
assert_eq!(my_number, 0xF2);
How about numbers other than u8
? (ew… endianness)
Can the match checker handle exhaustiveness properly? How would that work alongside integer ranges?
Future possibilities
TBD