[Pre-RFC] Hex literals


#1

I would like to propose the following small addition to the language. It’s mostly a quality-of-life improvement, but I believe it will be quite useful in some areas.

Summary

Add hex literals of the form h"00 aa cc ff", which the compiler transforms at compile time into a value of type &'static [u8; N], in this case &[0u8, 170u8, 204u8, 255u8].

Motivation

Hexadecimal is a very common representation for binary data. Currently Rust has two ways to write byte array constants:

  • b"foo" notation, which is convenient if the binary data is an ASCII string, but becomes harder to use for general byte strings that need a lot of \x escaping.
  • Explicit arrays: [0x00, 0x01, ..]. This takes three times more space than pure hex notation and is thus harder to read and to copy-paste from external sources. Additionally, it’s harder to group bytes, e.g. in groups of 4 or 8.
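For comparison, here is a minimal sketch of the same four bytes written in both existing notations. The constant names are made up for illustration.

```rust
// Byte-string literal with \x escapes; its type is &'static [u8; 4].
const ESCAPED: &[u8; 4] = b"\x00\xaa\xcc\xff";

// Explicit array holding the same four bytes.
const EXPLICIT: [u8; 4] = [0x00, 0xaa, 0xcc, 0xff];
```

Both spell out a prefix (\x or 0x) for every byte, which is the overhead the proposed literal would remove.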

By introducing hex literals we can make code that works with binary constants easier to read and write. As a side effect, code examples become smaller and easier to read. For example:

let udp_data = h"
1111 2222
0c00 ffff
6461 7461
";
let packet = parse_udp(udp_data);
assert_eq!(packet.source_port, 0x1111);
assert_eq!(packet.dest_port, 0x2222);
assert_eq!(packet.data, b"data");

It will also make it possible to copy-paste hexadecimal data directly into Rust code without an additional transformation step.

Guide-level explanation

Literals which start with h are called hex literals. They provide a convenient way to write byte array constants in hexadecimal form. The string inside h"..." accepts the following characters:

  • Hexadecimal characters: 0-9, a-f, A-F
  • Formatting characters: Unicode whitespace class characters, including tab, line feed and carriage return.

Formatting characters are ignored by the compiler. A hex literal must contain an even number of hexadecimal characters; otherwise compilation fails. Using any other character also results in a compilation error.

The hexadecimal string is converted to a byte array by the compiler at compile time.

Usage examples:

assert_eq!(h"00ff", &[0u8, 255u8]);
assert_eq!(h"abcdef", h"ABCDEF");
assert_eq!(h"64 61 74 61", b"data");
assert_eq!(h"
    00010203 0405060708
", &[0u8, 1, 2, 3, 4, 5, 6, 7, 8]);
assert_eq!(h"
    00010203
    10111213
", &[
    0x00, 0x01, 0x02, 0x03,
    0x10, 0x11, 0x12, 0x13
]);
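The rules above can be sketched as an ordinary runtime function. This is only an illustration of the accepted grammar, not how the compiler would implement the literal; parse_hex_literal and HexError are hypothetical names that are not part of the proposal.

```rust
/// Hypothetical error type for the sketch below.
#[derive(Debug, PartialEq)]
enum HexError {
    /// A character that is neither a hex digit nor whitespace.
    InvalidChar(char),
    /// An odd number of hex digits was supplied.
    OddLength,
}

/// Hypothetical runtime model of the contents of h"...".
fn parse_hex_literal(input: &str) -> Result<Vec<u8>, HexError> {
    let mut out = Vec::new();
    // Holds the high nibble while waiting for the low one.
    let mut high: Option<u8> = None;
    for c in input.chars() {
        // Formatting characters (Unicode whitespace) are ignored.
        if c.is_whitespace() {
            continue;
        }
        // Accepts 0-9, a-f, A-F; anything else is an error.
        let digit = c.to_digit(16).ok_or(HexError::InvalidChar(c))? as u8;
        match high.take() {
            None => high = Some(digit),
            Some(h) => out.push((h << 4) | digit),
        }
    }
    // A leftover high nibble means the digit count was odd.
    if high.is_some() {
        return Err(HexError::OddLength);
    }
    Ok(out)
}
```

With the proposed literal, both error cases would be reported at compile time instead of being returned as a Result.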

How We Teach This

The book will need a page that introduces and explains all variations of string literals: "...", b"...", r"...", h"..." (and maybe something like s"..." as syntactic sugar for "...".to_string()).

Drawbacks

Additional syntax, which some may perceive as overly specialized for niche use cases.

Alternatives

Using a built-in macro such as hex!("00 ff ee") or something similar, and, of course, doing nothing.


#2

The D language has such a feature, named “Hex Strings”:

https://dlang.org/spec/lex.html#hex_strings

But I remember it has some pitfalls that reduce its usefulness:

void main() {
    string data1 = x"00 FBCD 32FD 0A"; // OK
    ubyte[] data2 = x"00 FBCD 32FD 0A"; // Error
}

A hex!("00 ff ee") macro sounds like a good way to keep compiler complexity low (as long as it doesn’t introduce other problems, like excessive memory usage or compilation time for very large hex literals).


#3

We will not have this pitfall, as h"..." will always be converted to &'static [u8; N], the same way b"..." works today.


#4

I like the idea, and it would also allow us to have compile time verification that the input is valid hex as far as I can tell (both with the macro and the literal).

Do we have any desire for a concrete hex type in the standard library at some point? While it’s nice that this compiles down to byte arrays, there are also use cases for having bytes represented in hex at runtime (the hex crate has 250k+ downloads currently, so somebody needs this). It would be nice if this design at least didn’t clash with such a use case, and if the compiler didn’t have to turn the hex into bytes only for them to be converted back to hex at runtime.


#5

It might be worth trying to generalize this to other kinds of string literals, and maybe even to allow user-defined literals (similar to C++). Especially on Windows, I’ve often wanted UTF-16 string literals. I can now use wstr-rs for this, so it’s not too bad, but it’s worth thinking about whether we want to encourage macros for these cases or allow some more integrated syntax. Another possible alternative could be const fns, but it will take some time until they are ready for this …
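To make the const fn alternative concrete, here is a rough sketch of what it could look like. It assumes a toolchain where const fns support loops, const generics and panicking, which is an assumption about the reader's compiler rather than something the proposal relies on. The names hex_val and const_hex are hypothetical, and for simplicity only ASCII space, tab, line feed and carriage return are skipped.

```rust
/// Hypothetical helper: value of one ASCII hex digit, or None.
const fn hex_val(c: u8) -> Option<u8> {
    match c {
        b'0'..=b'9' => Some(c - b'0'),
        b'a'..=b'f' => Some(c - b'a' + 10),
        b'A'..=b'F' => Some(c - b'A' + 10),
        _ => None,
    }
}

/// Hypothetical const fn that decodes a hex string into exactly N bytes.
/// When evaluated in a const context, each panic becomes a compile error.
const fn const_hex<const N: usize>(s: &str) -> [u8; N] {
    let bytes = s.as_bytes();
    let mut out = [0u8; N];
    let mut i = 0; // position in the input
    let mut nibbles = 0; // number of hex digits seen so far
    while i < bytes.len() {
        let c = bytes[i];
        i += 1;
        // Skip ASCII formatting characters.
        if matches!(c, b' ' | b'\t' | b'\n' | b'\r') {
            continue;
        }
        let v = match hex_val(c) {
            Some(v) => v,
            None => panic!("invalid character in hex string"),
        };
        if nibbles / 2 >= N {
            panic!("too many hex digits for the output length");
        }
        // Shift the previous nibble up and add the new one.
        out[nibbles / 2] = (out[nibbles / 2] << 4) | v;
        nibbles += 1;
    }
    if nibbles != 2 * N {
        panic!("hex digit count does not match the output length");
    }
    out
}

// Evaluated at compile time; N is inferred from the declared type.
const DATA: [u8; 4] = const_hex("00 aa cc ff");
```

One visible cost compared to h"..." is that the length has to be stated in the type, whereas the literal would infer N from its contents.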


#6

I’ve published an RFC PR: https://github.com/rust-lang/rfcs/pull/2244