u32::to_be() considered harmful, or how to encourage safe endian handling

I hope I'm not jumping the gun here. I've only dabbled with Rust so far, though I hope to do more. I do, however, have a lot of experience dealing with endianness issues in C (I'm a Linux and qemu developer for powerpc hardware).

In short: is there some reason I'm missing that endianness is not handled in Rust using #[repr]? From all I can see, it looks to me like the obvious and safe way to do it.

The long version:

This looks like a missed opportunity to avoid (another) ridiculously easy-to-mess-up thing from C. The nightmare I've found there is looking at code and not easily knowing whether that uint32_t variable or structure field has had the right byteswaps done to it yet or not. u32::to_be and similar methods imply the existence of (unmarked) "wrong endian" integers in Rust, which opens up exactly the same can of worms. I know from painful experience how often this results in bugs (to a good approximation, every C program which deals with external data and has not been tested on both endians has endian bugs).

The discipline I use in C to (try to) handle this is to always and only do endian conversions at the point where you're accessing an external data source/sink: a shared memory buffer, a file, a network stream - anything which has defined byte-by-byte addressing. After all, an integer that you can't safely add isn't really an integer - it's just an integer encoding some other data (in this case a different integer, or worse, a possibly different integer). With C that confusion is kind of par for the course, but with Rust's better typing we should be able to do better.

When going directly from internal integers to a byte-oriented stream, u32::to_be_bytes() and friends work well. I'm guessing to_be() etc. are intended for use with shared structures - passed to a syscall or library, stored on disk as a unit, placed in a shared memory buffer, etc. - again, anything with a fixed byte-addressable representation. But as noted above, it means holding a not-really-integer in a u32, at least for a time.
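To make the distinction concrete, here's a minimal sketch of the two styles using only std methods (the function names are illustrative, not from any real API):

use std::io::{self, Write};

// Boundary-conversion style: the u32 stays a real integer everywhere;
// bytes in a defined order exist only at the point of output.
fn write_len(stream: &mut impl Write, len: u32) -> io::Result<()> {
    stream.write_all(&len.to_be_bytes()) // explicitly big-endian bytes
}

// to_be() style: the return type is u32, but on a little-endian host it
// holds the byte-swapped value - a "wrong endian" integer. Nothing in
// the type system stops a caller adding to it, or swapping it twice.
fn make_wire_value(len: u32) -> u32 {
    len.to_be()
}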

It would be possible to define BigEndianU32 / LittleEndianU32 types and whatnot, with methods to convert to/from u32 with the appropriate byteswaps. I think there may be some crates doing that, though not amongst the more popular and well-documented ones, AFAICT.
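For illustration, a hand-rolled wrapper of that kind might look like the following (BigEndianU32 is a hypothetical type here, not taken from any particular crate):

/// A u32 stored as big-endian bytes. The value is only reachable
/// through conversions that perform the swap, so a "wrong endian"
/// integer can never leak out as a plain u32.
#[repr(transparent)]
#[derive(Clone, Copy)]
struct BigEndianU32([u8; 4]);

impl From<u32> for BigEndianU32 {
    fn from(n: u32) -> Self {
        BigEndianU32(n.to_be_bytes())
    }
}

impl From<BigEndianU32> for u32 {
    fn from(be: BigEndianU32) -> u32 {
        u32::from_be_bytes(be.0)
    }
}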

But really, the clue's in the name: endianness is not a property of an integer, it's a property of the in-memory representation of that integer. So why not use Rust's #[repr] syntax to show that? You mark the endian in the shared structure you're using, and the compiler then handles all byteswapping for you.

11 Likes

This seems like an issue that should largely be dealt with by libraries that actually access raw byte data. In the owned case of getting an integer from a byte array, e.g. [u8; 4], the method naming explicitly forces you to deal with the issue. In the non-owned case, casting a reference to raw byte data into a reference to a struct is simply unsafe. And helper crates that provide safe wrappers, such as zerocopy, unsurprisingly also provide endianness wrappers.
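For instance, with zerocopy's byteorder wrappers (a sketch against the crate's byteorder module; exact paths and methods vary between versions), a big-endian field is a distinct type and every access performs the swap explicitly:

use zerocopy::byteorder::{BigEndian, U32};

#[repr(C)]
struct PacketHeader {
    // Stored big-endian regardless of host byte order.
    len: U32<BigEndian>,
}

fn bump_len(hdr: &mut PacketHeader) {
    let n = hdr.len.get(); // swap happens here; n is an ordinary u32
    hdr.len.set(n + 1);    // and here on the way back in
}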

What could #[repr(network)] provide in addition? What cons of a library solution exist that would require a language-internal implementation?

3 Likes

Yes, "not-really-integers" or "wrong endianness integers" should already be prevented by the current type system, as the only way to create them like in C is via memory copying or casting/transmuting, which are unsafe operations.

There are sequences of bytes and integers in native endianness, and the type system enforces a strict separation between the two: any conversion requires you to explicitly deal with the endianness, like you propose, at the borders.

any conversion requires you to explicitly deal with the endianness, like you propose, at the borders.

That does not appear to be the case. u32::{to,from}_{le,be}_bytes() do that, yes. I have no problem with those. But there's also pub const fn to_be(self) -> u32 (and all the obvious variations), which does the conversion in a way that isn't type-safe.

Are you aware of the byteorder crate?

I think u32::to_be etc. are just utility functions which are in std because such operations can sometimes be implemented with compiler intrinsics or specific CPU instructions. You are not supposed to use them directly for data (de)serialization.
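For reference, u32::to_be is morally just a conditional byte swap, something like this sketch (not the actual std source):

// A no-op on big-endian hosts, a byte swap on little-endian ones.
fn to_be(x: u32) -> u32 {
    if cfg!(target_endian = "big") {
        x
    } else {
        x.swap_bytes()
    }
}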

1 Like

Library implementors are fallible people, too. More to the point, the scope of libraries which might interact with in-memory structures where endianness might matter is pretty wide: indeed, literally anything which already uses #[repr(C)] or #[repr(packed)] is a likely candidate. If you're specifying structure alignment and spacing without specifying the integer encoding, you haven't fully specified the in-memory representation.

Note that "native endian" isn't always well defined. Many modern CPUs have a byte-order mode setting, and some have instructions that effectively let you choose the endianness of any individual load or store.

1 Like

Yes, I'm aware of the byteorder crate - and by the looks of it, it's the de facto standard way of handling this. It looks fine, as far as it goes: for streaming data in or out, those look like sensible helpers (much like u32::to_be_bytes() and company).
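For comparison, the streaming style with byteorder looks like this (reading a hypothetical two-field header from any Read impl):

use byteorder::{BigEndian, ReadBytesExt};
use std::io::{self, Read};

// The conversion happens exactly once, at the boundary; everything
// the caller sees is a host-order integer.
fn read_header(r: &mut impl Read) -> io::Result<(u32, u16)> {
    let magic = r.read_u32::<BigEndian>()?;
    let version = r.read_u16::<BigEndian>()?;
    Ok((magic, version))
}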

But, it doesn't handle the case of shared in-memory structures - talking to syscalls, talking to C libraries, talking to other processes via shared memory. Making everything you use there byte arrays and doing explicit swapping read/write operations on them is certainly an option, but it's pretty clunky.

Likewise, building up data in (representation-specified) structures in memory to then stream out verbatim to disk or network is a pretty common pattern. You can argue that it's an antipattern, and I'd tend to agree, but it's easy and readable, so I think it's here to stay. It seems that'd be something worth making easy to get right, and hard to get wrong.

Perhaps a slightly more concrete proposal would be helpful. Here's what I'm suggesting:

  1. #[repr(bigendian)] and #[repr(littleendian)] would be attachable to any of the fixed-width built-in integer types. They'd only be valid where #[repr(C)] or #[repr(packed)] is already in effect.

  2. Warnings (from rustc or clippy or whatever) on any use of u32::to_be() and similar functions.

  3. Warnings on any #[repr(C)] or #[repr(packed)] without integer endian specified (OK, OK, not very likely, but I think it would really cut down on bugs).

1 Like

Likewise, building up data in (representation-specified) structures in memory to then stream out verbatim to disk or network is a pretty common pattern. You can argue that it's an antipattern, and I'd tend to agree, but it's easy and readable, so I think it's here to stay. It seems that'd be something worth making easy to get right, and hard to get wrong.

Yes, I think it is an antipattern. Beyond the issue of data representation mismatches, the data validation and thus security implications are significant (you can say security doesn't matter for your application, but it is always bad if UB occurs).

Making everything you use there byte arrays and doing explicit swapping read/write operations on them is certainly an option, but it's pretty clunky.

There are probably proc macro derive solutions out there, or at least one could be implemented. You don't have to write it manually.

#[repr(bigendian)] and #[repr(littleendian)] would be attachable to any of the fixed-width built-in integer types. They'd only be valid where #[repr(C)] or #[repr(packed)] is already in effect.

I feel this is an arbitrary attribute. Attributes don't play well with generics, either.

An alternative to an attribute is an auto-trait. Or a regular trait with your own derive macros. What is the reason to prefer attributes over traits?

Making everything you use there byte arrays and doing explicit swapping read/write operations on them is certainly an option, but it's pretty clunky.

There are probably proc macro derive solutions out there, or at least one could be implemented. You don't have to write it manually.

So, I hadn't spotted zerocopy until it was mentioned above. It appears to do basically the right thing, in terms of segregating specific byte orders by type. network-endian and endian-type seem to do something similar. (endian_trait, by contrast, seems to positively encourage a dangerous approach.)

I still think the explicit gets/sets are clunky, though. I can't really envisage how a proc macro would help with that - what did you have in mind?

#[repr(bigendian)] and #[repr(littleendian)] would be attachable to any of the fixed-width built-in integer types. They'd only be valid where #[repr(C)] or #[repr(packed)] is already in effect.

I feel this is an arbitrary attribute. Attributes don't play well with generics, either.

If we specify some aspects of memory layout with #[repr(C)] or #[repr(packed)], why is it arbitrary to specify integer encoding as well? They're both about how you translate the abstract concept of the type into specific bytes at specific offsets.

I don't really see how generics get involved; more or less by definition, we're talking about concrete types with concrete memory layouts.

An alternative to an attribute is an auto-trait. Or a regular trait with your own derive macros. What is the reason to prefer attributes over traits?

Again, I can't really envisage how a trait would help you here, automatically derived or otherwise. It's not that you want to do anything extra with the value - you just want it to behave like an integer - but you want the in-memory representation to be fixed.

You certainly can do this safely with wrapper types (like the crates mentioned above), but why would I want to use those in addition to #[repr(C)], when they're both about the in-memory representation of the data?

I still think the explicit gets/sets are clunky, though. I can't really envisage how a proc macro would help with that - what did you have in mind?

Something like this:

#[derive(ByteOrderSerializable, ByteOrderDeserializable)]
#[repr(C)]
struct S {
    x: u32,
    y: OtherStruct,
    z: [u16; 12],
}

// derived
fn serialize_to_be(s: &S, buf: &mut [u8]);
fn serialize_to_le(s: &S, buf: &mut [u8]);
fn deserialize_from_be(buf: &[u8]) -> S;
fn deserialize_from_le(buf: &[u8]) -> S;

If we specify some aspects of memory layout with #[repr(C)] or #[repr(packed)], why is it arbitrary to specify integer encoding as well?

  1. C doesn't have this attribute, whereas #[repr(packed)] and #[repr(align(...))] do have C equivalents.
  2. Other attributes apply to any user-defined struct; the proposed attribute would only be applicable to built-in integers.

I may be misunderstanding something. What does your proposal look like?

#[repr(C)]
// Is #[repr(little_endian)] here?
struct S {
    // Or #[repr(little_endian)] here?
    x: u32,
    // Is #[repr(little_endian)] applicable here?
    y: OtherStruct,
    // Is #[repr(little_endian)] applicable here?
    z: [u16; 12],
}

Something like this: ...

Oh, I see. Basically replacing explicit gets/sets on individual members with a serialize/deserialize for the whole structure (in which case you don't really need the #[repr(C)] any more; the serialize functions can handle the packing as well). Since my last post, I spotted packed_struct, which does more or less that.

It's a viable approach but has some drawbacks:

  • You still have the explicit conversion, which is a little clunky
  • It will always unpack the entire structure, even if you just want one field (see the sketch below this post)
  • You're moving the thing around as the unpacked structure, which makes the cache-footprint-aware low-level programmer in me twitchy
  • For some shared memory protocols (particularly, e.g. with DMA hardware), you can't always safely read or write the entire record

[A minor point: I also think having both _be and _le variants on the same type is a bad idea - it's rare that the same structure needs to be accessed in both le and be variants. It does happen (despite being terrible, no good, very bad interface design), but it's rare enough that having to work harder for it is OK.]
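To illustrate the single-field point, here's a sketch reusing the hypothetical BigEndianU32 wrapper from earlier in the thread: touching one field converts only that field, in place, with no intermediate "native" copy of the record.

// Assumes the BigEndianU32 wrapper sketched earlier.
#[repr(C)]
struct DeviceRecord {
    flags: BigEndianU32,
    status: BigEndianU32,
    // ... many more fields we don't care about right now ...
}

fn is_ready(rec: &DeviceRecord) -> bool {
    // Only `status` is converted; there's no whole-record
    // deserialization and no separately allocated native copy.
    u32::from(rec.status) & 0x1 != 0
}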

  1. C doesn't have this attribute, whereas #[repr(packed)] and #[repr(align(...))] do have C equivalents.

No, it doesn't, and a multitude of endianness bugs exist as a result.

I may be misunderstanding something. What does your proposal look like?

Ah, right. So what I had in mind is that the base semantics are defined in terms of:

struct S {
    #[repr(le)]
    x: u32,
    #[repr(be)]
    y: i64,
}

So that you can define a mixed-endian structure. It's not common but it does happen (e.g. using a combined header structure for several layers of network protocol with different endianness). I've also seen some weird hardware devices with mixed endian registers.

I'd then envisage being able to put the attribute on a structure as a shorthand for putting the same tag on all the integer fields (and recursively on any substructures). Ideally you could still override individual fields or substructures in that case.

I'm not sure if there are any cases where it could make sense on a bare variable rather than a structure field.

I find using attributes on each field similar to bit-fields. Rust doesn't support bit-fields either, and there have been many proposals for them. Some of the reasons given there probably also apply to this proposal.

Heh, that is a point. Bitfields in C are a pain (I almost always avoid them) precisely because they don't pin down the in-memory representation well enough.

Actually, thinking about it, I guess #[repr] on individual fields is kinda nasty, because it doesn't have an obvious place to attach to. #[repr] on the struct gets attached to the type definition, but for fields the information would also need to be attached to the surrounding type, rather than the thing it's actually next to.

That said, only allowing it on structs would still handle the vast majority of cases (same endianness for all fields).

You could just do:

struct S {
    x: LE<u32>,
    y: BE<i64>,
}

given a crate implementing LE and BE.

Language support seems unnecessary, or at least it could take the more generic form of user-defined automatic coercions, if you want to save the x.into() call that this code needs for the conversion.

5 Likes

My understanding of the network vs host byte ordering problem has always been that you want to immediately convert the network bytes to host order integers, and do all your work in host order integers. In particular, I was always under the impression that there simply is no use case for a "big endian u32" type or a "little endian u32" type, as opposed to (de)serialization functions that convert between a network-order [u8; 4] and a host-order u32 or whatever. AFAIK all awareness of big/little endian-ness is hidden in the platform-specific implementation details of converting between network and host order, and wouldn't benefit from any dedicated types.

Put yet another way, I'm suggesting that most of this paragraph is the correct way to do things in principle, not just in C, and I'm questioning the implication at the end that there is a better way.


If there are genuine use cases for big/little endian integer types, then we can start talking about whether they should be BigEndian<u32> or u32be or #[repr(BigEndian)] or whatever, and whether warnings or other changes would be appropriate. But I don't think it makes much sense to try and start that conversation until we've gotten much clearer about the intended use cases. Details like how attributes interact with generics simply aren't relevant yet when we don't even know if there's any motivation for new core lang types or new layout categories.

So, to refocus on what I think is relevant, I'll ask some stupid questions about things I have zero experience with:

This seems like the closest thing in the thread so far to an attempt to describe use cases for big/little endian types in the core language rather than (de)serialization functions, but are there any systems where syscalls or C libraries use be/le instead of host order? I was under the impression there tautologically were not, since that's what "host order" means.

These sound like potentially compelling use cases for le/be types, but these also sound like use cases I'd expect to be quarantined to the fringes of any codebase, and only exposed to higher-level code as host-order integers or maybe as byte arrays, in which case handcrafted library types ought to be fine (and handcrafting might be mandatory anyway, if the hardware's weird enough). Is that not the case?

1 Like

Since the OP mentioned working on QEMU, one example which immediately comes to my mind is sharing of memory between emulated hardware and its host. But that is admittedly a niche use case.

5 Likes

This use case can be generalised to what I'll call 'sparsely-accessed data structures'.

Imagine there's a data structure with a fixed layout, in which you sporadically need to read and/or modify a couple of fields. The serialisation approach would have you define 'native' Rust data structures and go back-and-forth between those and byte buffers when accessing such 'foreign' data. The serialisation process necessarily converts the whole structure, including the data you don't actually need at the moment, and requires you to separately allocate 'native' structures from which you can extract data to manipulate. By treating endianness as a memory-representation problem, you are able to only pay the cost of converting the data you actually need while leaving the rest alone. Plus, if the 'foreign' memory representation happens to agree with the native ABI, you can even avoid adding any additional cost to those accesses at all; they can just compile to ordinary memory accesses.

And indeed, emulators are where this comes up quite frequently. As a (somewhat extreme) example, take DOSBox, which not only emulates an x86 CPU, but also contains an implementation of the DOS kernel in the host; as such, it needs to be able to access DOS-specific data structures kept in guest memory (PSP, CDS, MCBs, FCBs, device driver headers, ioctl buffers, the list of lists) to implement DOS system calls and other ABI. If one were to Rewrite It In Rust™, fixed-endianness types would be a huge help here.

That said, I'm not sure if this necessarily needs to be a core language feature, or even a standard library feature. A well-designed crate could be just as expedient.

6 Likes