u32::to_be() considered harmful, or how to encourage safe endian handling

Perhaps a slightly more concrete proposal would be helpful. Here's what I'm suggesting:

  1. #[repr(bigendian)] and #[repr(littleendian)] be attachable to any of the fixed-width built-in integer types. They'd only be valid where #[repr(C)] or #[repr(packed)] is already in effect.

  2. Warnings (somewhere, rustc or clippy or whatever) on any use of u32::to_be() and similar functions (see the sketch after this list for the kind of bug this would catch).

  3. Warnings on any #[repr(C)] or #[repr(packed)] without integer endianness specified (ok, ok, not very likely, but I think it would really cut down on bugs).
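For illustration, here's a minimal sketch of the footgun behind point 2 (the example is made up): the result of to_be() is still a plain u32, so nothing distinguishes converted from unconverted values, and a double swap type-checks fine.

fn main() {
    let len: u32 = 300;
    let wire = len.to_be();  // byte-swapped on LE hosts, identity on BE hosts
    let oops = wire.to_be(); // double swap: back to native order on LE hosts
    // `len`, `wire` and `oops` all have the same type, so the type system
    // can't catch this; the bug only shows up when the code runs on a host
    // of the "wrong" endianness.
    println!("{wire:#010x} {oops:#010x}");
}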


Likewise, building up data in (representation-specified) structures in memory and then streaming it out verbatim to disk or network is a pretty common pattern. You can argue that it's an antipattern, and I'd tend to agree, but it's easy and readable, so I think it's around to stay. It seems like something worth making easy to get right, and hard to get wrong.

Yes, I think it is an antipattern. Beyond the data-representation mismatch itself, the data-validation and thus security implications are big (you can say security doesn't matter for your application, but it is always bad if UB occurs).

Making everything you use there byte arrays and doing explicit swapping read/write operations on them is certainly an option, but it's pretty clunky.
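Concretely, that approach looks something like this (a sketch; the record layout is made up):

// Hypothetical on-disk record: a 4-byte big-endian length at offset 0.
fn read_len(buf: &[u8]) -> u32 {
    u32::from_be_bytes(buf[0..4].try_into().unwrap())
}

fn write_len(buf: &mut [u8], len: u32) {
    buf[0..4].copy_from_slice(&len.to_be_bytes());
}

Every access site has to remember both the offset and the byte order; forget either and nothing complains.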

There are probably proc-macro derive solutions out there already, or at least one could be implemented. You wouldn't have to write it by hand.

#[repr(bigendian)] and #[repr(littleendian)] be attachable to any of the fixed-width built-in integer types. They'd only be valid where #[repr(C)] or #[repr(packed)] is already in effect.

I feel this is an arbitrary attribute. Attributes don't play well with generics, either.

An alternative to an attribute is an auto-trait, or a regular trait with your own derive macros. What is the reason to prefer attributes over traits?

Making everything you use there byte arrays and doing explicit swapping read/write operations on them is certainly an option, but it's pretty clunky.

There are probably proc-macro derive solutions out there already, or at least one could be implemented. You wouldn't have to write it by hand.

So, I hadn't spotted zerocopy until it was mentioned above. It appears to do basically the right thing, in that it has separate types for each specific byte order. network-endian and endian-type seem to do something similar. (endian_trait, by contrast, seems to positively encourage a dangerous approach.)

I still think the explicit gets/sets are clunky though. I can't really envisage how a proc macro would help with that - what did you have in mind?

#[repr(bigendian)] and #[repr(littleendian)] be attachable to any of the fixed-width built-in integer types. They'd only be valid where #[repr(C)] or #[repr(packed)] is already in effect.

I feel this is an arbitrary attribute. Attributes don't play well with generics, either.

If we specify some aspects of memory layout with #[repr(C)] or #[repr(packed)], why is it arbitrary to specify integer encoding as well? They're both about how you translate the abstract concept of the type to specific bytes at specific offsets.

I don't really see how generics get involved; more or less by definition we're talking about concrete types with concrete memory layouts.

An alternative to an attribute is an auto-trait, or a regular trait with your own derive macros. What is the reason to prefer attributes over traits?

Again, I can't really envisage how a trait would help you here, automatically derived or otherwise. It's not that you want to do anything extra with the value - you just want it to behave like an integer - but you want the in-memory representation to be fixed.

You certainly can do this safely with wrapper types (like the crates mentioned above), but why would I want to use that in addition to #[repr(C)], when they're both about the in-memory representation of the data?

I still think the explicit gets/sets are clunky though. I can't really envisage how a proc macro would help with that - what did you have in mind?

Something like this:

#[derive(ByteOrderSerializable, ByteOrderDeserializable)]
#[repr(C)]
struct S {
    x: u32,
    y: OtherStruct,
    z: [u16; 12],
}

// derived
fn serialize_to_be(value: &S, buf: &mut [u8]);
fn serialize_to_le(value: &S, buf: &mut [u8]);
fn deserialize_from_be(buf: &[u8]) -> S;
fn deserialize_from_le(buf: &[u8]) -> S;
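Usage would then presumably be something like this (a sketch; the derive, OtherStruct and the buffer size are all hypothetical):

let s = S { x: 1, y: OtherStruct::default(), z: [0u16; 12] };
let mut buf = [0u8; 128]; // hypothetical: large enough for S's packed form
serialize_to_be(&s, &mut buf);
let s2 = deserialize_from_be(&buf);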

If we specify some aspects of memory layout with #[repr(C)] or #[repr(packed)], why is it arbitrary to specify integer encoding as well?

  1. C doesn't have this attribute, whereas #[repr(packed)] and #[repr(align(...))] correspond to features C does have.
  2. Other repr attributes apply to any user-defined struct; the proposed attribute would only be applicable to the built-in integers.

I may be misunderstanding something. What does your proposal look like?

#[repr(C)]
// Is #[repr(little_endian)] here?
struct S {
    // Or #[repr(little_endian)] here?
    x: u32,
    // Is #[repr(little_endian)] applicable here?
    y: OtherStruct,
    // Is #[repr(little_endian)] applicable here?
    z: [u16; 12],
}

Something like this: ...

Oh, I see. Basically replacing explicit gets/sets on individual members with a serialize/deserialize for the whole structure (in which case you don't really need the #[repr(C)] any more; the serialize functions can handle the packing as well). Since my last post, I spotted packed_struct, which does more or less that.

It's a viable approach but has some drawbacks:

  • You still have the explicit conversion, which is a little clunky
  • It will always unpack the entire structure, even if you just want one field
  • You're moving the thing around as the unpacked structure, which makes the cache-footprint-aware low-level programmer in me twitchy
  • For some shared memory protocols (particularly, e.g. with DMA hardware), you can't always safely read or write the entire record

[A minor point: I also think having both _be and _le variants on the same type is a bad idea - it's rare that the same structure needs to be accessed in both LE and BE variants. It does happen (despite being terrible, no good, very bad interface design) but it's rare enough that having to work harder for it is ok]

  1. C doesn't have this attribute, whereas #[repr(packed)] and #[repr(align(...))] correspond to features C does have.

No, it doesn't, and a multitude of endianness bugs exist as a result.

I may be misunderstanding something. What does your proposal look like?

Ah, right. So what I had in mind is that the base semantics are defined in terms of:

struct S {
    #[repr(le)]
    x: u32,
    #[repr(be)]
    y: i64,
}

So that you can define a mixed-endian structure. It's not common but it does happen (e.g. using a combined header structure for several layers of network protocol with different endianness). I've also seen some weird hardware devices with mixed endian registers.

I'd then envisage being able to put the attribute on a structure as a shorthand for putting the same tag on all the integer fields (and recursively on any substructures). Ideally you could still override individual fields or substructures in that case.
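To sketch it out (hypothetical syntax; none of this exists today):

#[repr(C, be)]        // hypothetical: BE becomes the default for all integer fields
struct Header {
    magic: u32,       // BE via the struct-level default
    #[repr(le)]       // hypothetical per-field override
    quirky: u16,
    payload: Inner,   // the default would recurse into substructures
}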

I'm not sure if there are any cases where it could make sense on a bare variable rather than a structure field.

I find using attributes on each field similar to bit-fields. Rust doesn't support bit-fields either, and there have been many proposals for them. Some of the reasons given against those proposals probably also apply to this one.

Heh, that is a point. Bitfields in C are a pain (I almost always avoid them) precisely because they don't pin down the in-memory representation well enough.

Actually, thinking about it, I guess #[repr] on individual fields is kinda nasty, because it doesn't have an obvious place to which it gets attached. #[repr] on the struct gets attached to the type definition, but #[repr] on fields within it would also need to be attached to the surrounding type, rather than the thing it's actually next to.

That said, only allowing it on structs would still handle the vast majority of cases (same endianness for all fields).

You could just do:

struct S {
    x: LE<u32>,
    y: BE<i64>,
}

given a crate implementing LE and BE.

Language support seems unnecessary, or at least it could be in the more generic form of user-defined automatic coercions if you want to save the x.into() call that this code needs to convert.
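For concreteness, a minimal sketch of how such a wrapper can be built in today's Rust (a non-generic Be32; the crates mentioned above generalize this across widths and byte orders):

#[derive(Clone, Copy)]
#[repr(transparent)]
struct Be32(u32); // invariant: the inner value is always in big-endian order

impl From<u32> for Be32 {
    fn from(v: u32) -> Self {
        // to_be()/from_be() are fine here: this is the one place the
        // conversion lives, hidden behind a type that can't be misused.
        Be32(v.to_be())
    }
}

impl From<Be32> for u32 {
    fn from(v: Be32) -> Self {
        u32::from_be(v.0)
    }
}

A #[repr(C)] struct containing Be32 fields then has a fixed byte-level layout, and every read or write of those fields goes through the conversion whether or not the programmer remembers the endianness.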


My understanding of the network vs host byte ordering problem has always been that you want to immediately convert the network bytes to host order integers, and do all your work in host order integers. In particular, I was always under the impression that there simply is no use case for a "big endian u32" type or a "little endian u32" type, as opposed to (de)serialization functions that convert between a network-order [u8; 4] and a host-order u32 or whatever. AFAIK all awareness of big/little endian-ness is hidden in the platform-specific implementation details of converting between network and host order, and wouldn't benefit from any dedicated types.
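That conventional pattern, for reference (a minimal sketch):

use std::io::Read;

// Convert to host order right at the boundary; only host-order u32s escape.
fn read_u32_be(r: &mut impl Read) -> std::io::Result<u32> {
    let mut buf = [0u8; 4];
    r.read_exact(&mut buf)?;
    Ok(u32::from_be_bytes(buf))
}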

Put yet another way, I'm suggesting that most of this paragraph is the correct way to do things in principle, not just in C, and questioning the implication at the end that there is a better way.


If there are genuine use cases for big/little endian integer types, then we can start talking about whether they should be BigEndian<u32> or u32be or #[repr(BigEndian)] or whatever, and whether warnings or other changes would be appropriate. But I don't think it makes much sense to try and start that conversation until we've gotten much clearer about the intended use cases. Details like how attributes interact with generics simply aren't relevant yet when we don't even know if there's any motivation for new core lang types or new layout categories.

So, to refocus on what I think is relevant, I'll ask some stupid questions about things I have zero experience with:

This seems like the closest thing in the thread so far to an attempt to describe use cases for big/little endian types in the core language rather than (de)serialization functions, but are there any systems where syscalls or C libraries use be/le instead of host order? I was under the impression there tautologically were not, since that's what "host order" means.

These sound like potentially compelling use cases for le/be types, but these also sound like use cases I'd expect to be quarantined to the fringes of any codebase, and only exposed to higher-level code as host-order integers or maybe as byte arrays, in which case handcrafted library types ought to be fine (and handcrafting might be mandatory anyway, if the hardware's weird enough). Is that not the case?

Since the OP mentioned working on QEMU, one example which immediately comes to my mind is sharing of memory between emulated hardware and its host. But that is admittedly a niche use case.


This use case can be generalised to what I'll call 'sparsely-accessed data structures'.

Imagine there's a data structure with a fixed layout, in which you sporadically need to read and/or modify a couple of fields. The serialisation approach would have you define 'native' Rust data structures and go back-and-forth between those and byte buffers when accessing such 'foreign' data. The serialisation process necessarily converts the whole structure, including the data you don't actually need at the moment, and requires you to separately allocate 'native' structures from which you can extract data to manipulate. By treating endianness as a memory-representation problem, you are able to only pay the cost of converting the data you actually need while leaving the rest alone. Plus, if the 'foreign' memory representation happens to agree with the native ABI, you can even avoid adding any additional cost to those accesses at all; they can just compile to ordinary memory accesses.
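A sketch of that sparse access, assuming a made-up layout with a little-endian u32 status field at a fixed offset:

const STATUS_OFFSET: usize = 8; // hypothetical field position

fn status(shared: &[u8]) -> u32 {
    // Convert only the field we touch; the rest of the structure stays as-is.
    u32::from_le_bytes(shared[STATUS_OFFSET..STATUS_OFFSET + 4].try_into().unwrap())
}

fn set_status(shared: &mut [u8], v: u32) {
    shared[STATUS_OFFSET..STATUS_OFFSET + 4].copy_from_slice(&v.to_le_bytes());
}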

And indeed, emulation is where this comes up quite frequently. As a (somewhat extreme) example, take DOSBox, which not only emulates an x86 CPU, but also contains an implementation of the DOS kernel in the host; as such, it needs to be able to access DOS-specific data structures kept in guest memory (PSP, CDS, MCBs, FCBs, device driver headers, ioctl buffers, the list of lists) to implement DOS system calls and other ABI. If one were to Rewrite It In Rust™, fixed-endianness types would be a huge help here.

That said, I'm not sure if this necessarily needs to be a core language feature, or even a standard library feature. A well-designed crate could be just as expedient.


My understanding of the network vs host byte ordering problem has always been that you want to immediately convert the network bytes to host order integers, and do all your work in host order integers. In particular, I was always under the impression that there simply is no use case for a "big endian u32" type or a "little endian u32" type, as opposed to (de)serialization functions that convert between a network-order [u8; 4] and a host-order u32 or whatever. AFAIK all awareness of big/little endian-ness is hidden in the platform-specific implementation details of converting between network and host order, and wouldn't benefit from any dedicated types.

Right, if you're pulling things in from a stream (e.g. network) that's the way to go (although even there reading into fixed representation structures is a common (anti)pattern because it's so easy to do compared to having a bunch of deserialization boilerplate - maybe not so much in Rust as it is in C).

But the case I'm most familiar with is the shared memory structure, because that's what I've encountered (and debugged) a lot in the past.

Put yet another way, I'm suggesting that most of this paragraph is the correct way to do things in principle, not just in C, and questioning the implication at the end that there is a better way.

So, I'm not trying to say there's a better way than what I described; what I'm trying to say is that Rust is in a much better position to encourage or enforce the good way of doing things.

This seems like the closest thing in the thread so far to an attempt to describe use cases for big/little endian types in the core language rather than (de)serialization functions, but are there any systems where syscalls or C libraries use be/le instead of host order? I was under the impression there tautologically were not, since that's what "host order" means.

Examples I can think of:

  • Emulation of a system of one endianness on another (as noted elsewhere, I'm very familiar with this, being a QEMU developer)
  • Hardware devices: most hardware devices have LE registers, so on BE systems you need swaps when accessing them. A few hardware devices have BE registers, so you need the same on LE systems. A very few devices have different registers in different endianness (usually this comes about because the device has some sort of internal bridge or layering where different components were built by different teams)
  • On POWER servers, we've now mostly changed over to LE, though it was traditionally BE. But many firmware interfaces and some hardware devices remain BE because of that history.
  • There's no inherent reason on many CPUs you couldn't run BE userspace programs on an LE kernel, or vice versa, though I don't know offhand of any cases that support this now.
  • In-memory data structures defined by cross-platform specifications to have a particular byte order, e.g. the flattened device tree used on many embedded systems is always BE for historical reasons, but needs to be read by LE kernels and other software (see the sketch after this list). I don't know if ACPI tables are always LE or host-endian [Note: for a cross-platform spec, declaring things "host endian" is nearly always a mistake]
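For instance, the FDT header starts with the magic number 0xd00dfeed stored big-endian, so a check like this works on hosts of either endianness (sketch):

fn is_fdt(blob: &[u8]) -> bool {
    blob.len() >= 4
        && u32::from_be_bytes(blob[0..4].try_into().unwrap()) == 0xd00d_feed
}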

That said, I'm not sure if this necessarily needs to be a core language feature, or even a standard library feature. A well-designed crate could be just as expedient.

It's not a question of making it possible; it's a question of making the obvious way to do things a good way. The standard library already includes u32::to_be_bytes() etc., which are fine for streaming, but awkward for the shared structure case. byteorder expands on the streaming stuff, but doesn't really do anything extra for the shared structure case. There are several crates that do the endian-specific types thing; several are mentioned in posts above.

But none of them are particularly obvious to find. In the meantime, u32::to_be() looks like an obvious choice: it matches the way this is usually done in C - the dangerous, bug-prone way it's usually done in C.

Well... that and the fact that we have a whole system for controlling the in-memory representation of types which can control so little about the in-memory representation of types.


Imagine there's a data structure with a fixed layout, in which you sporadically need to read and/or modify a couple of fields. The serialisation approach would have you define 'native' Rust data structures and go back-and-forth between those and byte buffers when accessing such 'foreign' data. The serialisation process necessarily converts the whole structure, including the data you don't actually need at the moment, and requires you to separately allocate 'native' structures from which you can extract data to manipulate. By treating endianness as a memory-representation problem, you are able to only pay the cost of converting the data you actually need while leaving the rest alone. Plus, if the 'foreign' memory representation happens to agree with the native ABI, you can even avoid adding any additional cost to those accesses at all; they can just compile to ordinary memory accesses.

Right, that's a good way of putting it. I tend to dislike in general the pattern of converting an externally specified representation into an internal one, unless you have a really compelling reason to do so (like a totally different organization of the data). That's not so much for the performance impact (which is often insignificant), but simply because I find it conceptually clearer if there's a uniform representation of something across the project - so if you need an externally specified representation anywhere, you might as well use it everywhere.


The alternative to reparsing and serializing the entirety of an in-memory structure is to define accessors for specific fields that operate directly on the serialized memory (probably very simply via to_be_bytes and from_be_bytes on each access). This is similar to how data formats such as Cap'n Proto use the same data layout for a serialized message and in-memory modification of such a message.
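In Rust that style might look something like this (a sketch; names and layout are made up):

// A typed view over the serialized bytes; the message is never unpacked
// wholesale, and each accessor converts exactly one field.
struct MessageView<'a>(&'a mut [u8]);

impl<'a> MessageView<'a> {
    fn id(&self) -> u32 {
        u32::from_be_bytes(self.0[0..4].try_into().unwrap())
    }
    fn set_id(&mut self, v: u32) {
        self.0[0..4].copy_from_slice(&v.to_be_bytes());
    }
}

Each accessor pays only for the field it touches, which also covers the sparse-access case discussed above.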


A small note on history: the u32::to_be() family of methods was stabilised alongside the 1.0 release of Rust and might not have seen more discussion than "we need a way to deal with endianness concerns and that's what C does". The u32::to_be_bytes() family is a much later addition (the docs say 1.32) which IIRC was introduced because of concerns similar to those under discussion here.

Did you know that it is possible to formally deprecate these methods? Because that is what it seems you are advocating for. And that looks like a much more achievable goal than adding new #[repr] attributes to the language, especially ones that look like newtype wrappers in disguise.

(about external crates)

Yes, the issue w.r.t. discoverability on crates.io has been known for a long time. It is a hard problem, and not specific to this topic.

If you are talking about the #[repr] family of attributes, I am not sure you quite understand their purpose (though I could be mistaken myself). So far, all of these attributes apply to compound types and change the way their overall representation is computed from the individual parts. But endianness is a property of a primitive type. That is why many are advising you to use newtype wrappers: to make your own endian-aware primitive types.


u32 is a type with the native endian, by definition. Adding an attribute that makes it something else doesn't seem right to me.

You don't use #[repr(signed)] u32 to make an i32, you have a separate i32 type. So if you want little/big-endian types, make u32le/u32be types.

If there's a mistake here, it's the existence of #[repr(packed)], which changes the alignment of types, effectively making new types that are incompatible with the originals (due to &struct.field being unsafe).


This part is making a lot more sense to me now. I'd definitely support deprecating to_be() et al. in favor of to_be_bytes() et al.

Assuming we word the deprecation notice well, that ought to be enough of a speedbump to get people interested in "the shared structure use case" to go look at crates.io.

That is a very helpful list. I don't think these examples quite produce an argument for adding new types in std (assuming we deprecate to_be() et al.), but some of them look like great candidates for a "de facto standard" solution. I feel like there ought to be a WG to ping for this, but the Embedded WG doesn't seem quite right for any of it.


I can't see this happening. repr(C) is mostly for FFI, where forcing the endianness to be specified doesn't make things any safer on your platform, and is just wrong if you ever want to run on a machine with the opposite endianness.
