Pre-RFC: anonymous struct and union types


#1

C structure and union types can include anonymous unions or anonymous structs. (The C11 standard allows this, and C compilers have allowed it for decades as an extension.) For instance:

struct Variant {
    int type;
    union {
        uint64_t u;
        double d;
    };
};

Omitting the field names for such a union or struct embeds its fields directly in the containing struct or union, while keeping the struct/union semantics for field layout and overlap. For instance, the structure above allows the following field accesses:

  • v.type
  • v.u
  • v.d

These constructs also nest arbitrarily. An anonymous union allows overlapping fields that won’t get used at the same time. An anonymous struct within a union allows grouping together multiple fields that need to exist simultaneously. For instance, consider adding a counted string to the above structure:

struct Variant {
    int type;
    union {
        uint64_t u;
        double d;
        struct {
            char *s;
            size_t slen;
        };
    };
};

This allows accessing v.s and v.slen.

Note that you can define an inner struct or union without omitting the field name as well:

struct Variant {
    int type;
    union {
        uint64_t u;
        double d;
        struct {
            char *ptr;
            size_t len;
        } s;
    };
};

This version would instead have v.s.ptr and v.s.len.

This pattern includes two new constructs that Rust doesn’t have: defining a struct or union type inline inside another, and omitting a field name to make fields part of the parent type.

For a much larger production example, take a look at struct kvm_run in the Linux KVM API.

This struct-of-unnamed-union-of-structs pattern occurs in many C APIs, so Rust FFIs will need to interface with it.

Rust code could define types compatible with this layout, using unions and structs; however, each union or struct type used in the definition would require a separate definition and name. And worse, when using the resulting aggregate type, each grouping of fields requires a name; for instance, the unnamed union above would require an explicit name and an additional .union_name. in field accesses.
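To make that overhead concrete, here is a minimal sketch of how the counted-string Variant above could be written in Rust today. The names VariantStr, VariantData, data, str_ and TYPE_U are all invented for this illustration; the C API has no names for them, which is exactly the problem.

```rust
use std::os::raw::{c_char, c_int};

// Hypothetical tag value; the C example does not define any tag constants.
pub const TYPE_U: c_int = 0;

// The inner anonymous struct needs its own named type in Rust.
#[repr(C)]
#[derive(Clone, Copy)]
pub struct VariantStr {
    pub s: *mut c_char,
    pub slen: usize,
}

// The anonymous union needs a named type as well.
#[repr(C)]
pub union VariantData {
    pub u: u64,
    pub d: f64,
    pub str_: VariantStr, // invented field name
}

#[repr(C)]
pub struct Variant {
    pub type_: c_int,
    pub data: VariantData, // invented field name
}

pub fn new_u(value: u64) -> Variant {
    Variant { type_: TYPE_U, data: VariantData { u: value } }
}

// C callers write `v.u`; Rust callers have to write `v.data.u`.
pub fn get_u(v: &Variant) -> Option<u64> {
    if v.type_ == TYPE_U {
        // SAFETY: assumes the tag was kept in sync with the union contents.
        Some(unsafe { v.data.u })
    } else {
        None
    }
}
```

Every access goes through the extra .data. step, and the string fields would be v.data.str_.s and v.data.str_.slen rather than v.s and v.slen.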

(Cc @retep998, who will likely want this for Windows APIs.)

I’d like to propose an RFC defining syntax for this.

First, for defining a struct or union inline, while still naming the field:

struct S {
    common: u32,
    fieldname: union {
        field1: u64,
        field2: f64,
        inner_struct: struct {
            inner1: SomeType,
            inner2: AnotherType,
        },
    },
}

This syntax should parse unambiguously, because struct { or union { can’t currently appear where the field type would.

This would have the same semantics as defining a new type with no name, and then declaring the corresponding field with that type.

Any derive or repr declarations for the top-level struct should automatically apply to the nested types as well, so you can write #[derive(Debug)] and #[repr(C)] once, on struct S above.

Note that you can still use that anonymous type via inference; for instance:

    fn f(s: S) {
        let u = &s.fieldname;
        println!("{}", u.field1);
    }

Finally, for defining an unnamed field:

struct S {
    common: u32,
    union {
        field1: u64,
        field2: f64,
        struct {
            inner1: SomeType,
            inner2: AnotherType,
        },
    },
}

This syntax should parse unambiguously, because struct { or union { can’t currently appear where the field name would.

This definition effectively inlines all the fields within the top-level structure, while retaining the layout and borrow semantics defined by the struct and union. For instance, after borrowing inner1 you can still borrow common and inner2, but you can’t borrow field1 or field2.
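Named unions in stable Rust already follow the same borrow rule, so the intended behavior can be sketched with today's syntax. This is only an approximation of the proposal: the names Inner, U, u and inner are invented because current Rust requires them, and the field types are placeholders.

```rust
#[repr(C)]
#[derive(Clone, Copy)]
pub struct Inner {
    pub inner1: u32,
    pub inner2: u32,
}

#[repr(C)]
pub union U {
    pub field1: u64,
    pub field2: f64,
    pub inner: Inner,
}

#[repr(C)]
pub struct S {
    pub common: u32,
    pub u: U,
}

pub fn bump_all(s: &mut S) -> u32 {
    // `common` lives outside the union, so it never conflicts.
    let common = &mut s.common;
    // Two fields of the same struct inside the union are disjoint.
    // SAFETY: this sketch assumes the union currently holds `inner`.
    let (a, b) = unsafe { (&mut s.u.inner.inner1, &mut s.u.inner.inner2) };
    // Uncommenting the next line is rejected by the borrow checker,
    // because `field1` overlaps `inner` while `a` and `b` are still live:
    // let f = unsafe { &mut s.u.field1 };
    *common += 1;
    *a += 1;
    *b += 1;
    *common + *a + *b
}
```

With unnamed fields, the same function would borrow s.common, s.inner1 and s.inner2 directly.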


#2

From a quick glance, it seems that this proposal includes anonymous structs/unions only within struct/union definitions, but not as types in general. Is there a deep reason for that? All things being equal, it seems nicer to expand the type grammar, such that this isn’t a special case within type definitions but instead just another kind of type. A closely-related pre-RFC seems to be doing just that.

(I haven’t thought deeply about the implications of any of this, but just wanted to note the general principle of not special-casing the type grammar.)


#3

No particular reason. It seems quite reasonable to me to allow anonymous types anywhere a type can appear.

(That proposal could support part of this, as long as the approaches involving sorting fields to unify types don’t get included, or at least not when declared with repr(C).)

However, I don’t want to propose allowing unnamed fields everywhere, only within a struct or union. (And possibly also within an enum variant that uses named fields, for symmetry.)


#4

In the juxtaposition of the two emphasized sentences, it seems like you’re drawing a distinction between “anonymous types” and “unnamed fields,” but I think they refer to the same idea (which I think is better referred to as “unlabeled types,” since “anonymous types” is usually used for things like the type of closures). What is the distinction?


#5

“anonymous struct” or “anonymous union” refers to something like struct { ... } or union { ... } without a type name, used directly as a type. “unnamed field” refers to a field within an outer struct or union, that has struct or union type itself, and has no field name.

The following uses an anonymous union but not an unnamed field:

struct S {
    a: T1,
    u: union { b: T2, c: T3 },
}

The following uses an anonymous union as an unnamed field:

struct S {
    a: T1,
    union { b: T2, c: T3 },
}

The term “unnamed field” comes from the GCC documentation.


#6

I missed the unnamed field aspect of the RFC. This doesn’t seem to impact layout; is the goal just to have the API mirror C’s representation? It seems to me like a feature that would be perplexing.


#7

Yes, as well as C’s expressive power for layout, for several reasons:

  • I’d like to avoid having to invent names for elements that the C API doesn’t have names for.
  • I’d like to avoid forcing every caller to use those names rather than using the fields directly as they would in C.
  • I’d like to avoid mass refactoring of a codebase when moving fields around.

Without this, when someone looks at the C API and its documentation, and wants to construct corresponding Rust code, they have to remember and add the invented union name.

In addition, this syntax with struct { ... } and union { ... } gives a large amount of expressive power over structure layout. If you’re used to reading unnamed unions and structs, you can very quickly translate between a memory representation and a structure layout, while keeping that separate from semantic grouping of fields. union { ... } acts like | in an ADT, and struct { ... } acts like a grouping operator to put multiple fields in one branch of the |.
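As a rough illustration of that layout reading, the sketch below checks sizes for the union and struct from the first post, written with today's named types. Group, Choice and layout_sizes are names invented here. The union's members overlap like the branches of a |, while the struct's members sit side by side.

```rust
use std::mem::size_of;
use std::os::raw::c_char;

// `struct { ... }`: both fields exist at once, so their sizes add up.
#[repr(C)]
#[derive(Clone, Copy)]
pub struct Group {
    pub s: *mut c_char,
    pub slen: usize,
}

// `union { ... }`: the members overlap, so the union is only as large
// as its largest member (rounded up to its alignment).
#[repr(C)]
pub union Choice {
    pub u: u64,
    pub d: f64,
    pub group: Group,
}

/// Returns (size of the struct group, size of the union).
pub fn layout_sizes() -> (usize, usize) {
    (size_of::<Group>(), size_of::<Choice>())
}
```

On a typical 64-bit target I would expect this to report 16 bytes for both, since the struct group is the largest branch of the union.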


#8

I’d be in favor of “unnamed fields”, but mildly against named fields with unnamed types, because they are two pretty different things.


Named fields with unnamed types are merely a convenience; any such type can be easily outlined.

struct S {
    a: u8,
    b: struct {
        c: u8,
    }
}

=>

struct S_B {
    c: u8,
}

struct S {
    a: u8,
    b: S_B,
}

Such outlining 1) is only two tokens longer and 2) doesn’t affect the interface of S, so it is invisible to users: the field c is accessed as b.c in both cases. I’d argue that if struct { FIELDS.... } is allowed in field type position, then it should be allowed in any type position, e.g.

fn f() -> struct { key: u8, value: u8 } { .... }

and I don’t think this is a sugar that Rust urgently needs.


“Unnamed fields” with “unnamed types” as they are used in C don’t even need to be types or fields; they are layout specifiers and in theory can use any syntax that doesn’t resemble struct/union declarations. Such modifiers give the ability to build an arbitrarily complex layout for an aggregate S while keeping its interface as simple as a struct with a set of fields:

// I intentionally use the non-`struct {....}` syntax to highlight the difference from named fields with unnamed types.
struct S {
    product [
        a: u8,
        b: u16,
        sum [
            c: u32,
            d: u64,
        ]
        product [
            e: i8,
            f: i16,
        ]
    ]
    sum [
        g: i32,
        h: i64,
    ]
}

<=>

custom_aggregate S {
    a: u8,
    b: u16,
    c: u32,
    d: u64,
    e: i8,
    f: i16,
    g: i32,
    h: i64,
}

This is certainly a feature that gives new abilities (as opposed to just convenience) by affecting user-visible interfaces. Note that this is still a fringe, low-level feature that is useful in quite specific situations. One problem is that unions currently require unsafe blocks on any access to fields, so you can’t do some things as conveniently as in C, for example the field alias trick. However, I think the rules can be relaxed a bit to allow safe access to common initial sequences, safe writes to trivially destructible fields, and maybe something else.
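On the safe-writes point: as far as I can tell, current stable Rust (which may differ from the rules in force when this was written) treats a plain assignment to a Copy union field as safe, while reads remain unsafe. A minimal sketch of the field alias trick under that rule follows; the union Width and its field names are invented for illustration.

```rust
// Two names for the same four bytes: `w` is an alias for `width`.
#[repr(C)]
pub union Width {
    pub width: u32,
    pub w: u32,
}

pub fn alias_roundtrip(value: u32) -> u32 {
    let mut x = Width { width: 0 };
    // Writing a Copy field needs no unsafe block in current Rust.
    x.w = value;
    // Reading still does.
    // SAFETY: both fields are `u32` at offset 0, so any write to one
    // leaves a valid value for the other.
    unsafe { x.width }
}
```

Safe reads of a common initial sequence, the other relaxation suggested above, are not something this sketch relies on.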


#9

It would also be nice to incorporate enums into the picture, i.e. give “custom aggregates” ability to include tagged unions in their layout as well.

struct S {
    span: Span,
    id: Id,
    enum {
        Foo(u8),
        Bar(u16),
    }
    enum { // Fun: `S` has two discriminants
        A,
        B,
    }
    union {
        a: u8,
        b: u16,
    }
}

#10

Named fields with unnamed types, while not significantly longer in tokens, have a semantic overhead in that you can’t see the layout of the inner struct/union inline in the outer struct/union.

However, if the consensus rejects that, I can imagine implementing them with a sufficiently powerful macro system, by desugaring them to a separate type. Given that, I would put a higher priority on unnamed fields.

Creating a new set of layout primitives (and pseudo-keywords) seems excessive, when the layout they implement effectively works like struct or union. Using a different syntax seems fine if it provides more expressive power, though.


#11

This is not a problem unique to C FFI though, literally any struct with fields having non-builtin types has it.

struct S {
    a: Type, // I have to go somewhere and find out what `Type` is :(
}

IDEs usually have features a la “show pop-up window with Type's definition on mouse-over”, Eclipse certainly has it and IIRC Visual Studio and QtCreator had this ability too. Now the only thing we need is IDEs supporting this for Rust… but I haven’t actually checked recently, maybe it’s already available somewhere?

I used pseudo-keywords for explanatory purpose to show that these “unions” and “structs” don’t necessarily have to exist at type level, the original syntax struct { FIELDS.... }/union { FIELDS.... } seems good and sufficient.


#12

This is not a problem unique to C FFI though, literally any struct with fields having non-builtin types has it.

Usually when you’re doing this, tho, they’re intended to be a part of the same whole. It’s not like

struct Doop {
    v: Vec<u8>,
}

which just has an opaque wrapper, as opposed to

struct string {
    length: usize,
    union {
        short: [u8; 16],
        long: struct {
            cap: usize,
            ptr: *const u8,
        }
    }
}

#13

This seems key: how often does this pattern occur in practice in C codebases, and how often are Rust programmers likely to encounter it? I do think that, in isolation, allowing unnamed fields is a misfeature. It seems to make data structures more confusing rather than more ergonomic, and all it seems to save is a field access (.field), which is not a lot of typing (cf. anonymous structs, which have applicability for named args, etc.). So this seems like it would make the language worse to me, but it might be justified if it solves a lot of pain when doing FFI.


#14

364 mentions of DUMMYUNIONNAME in the Windows API and 171 mentions of DUMMYSTRUCTNAME, which are used in most cases of unnamed structs/unions inside other structs/unions.


#15

Unix kernel headers often fake this with #defines (for portability), like the following from Linux:

#define sa_handler	_u._sa_handler
#define sa_sigaction	_u._sa_sigaction

paired with this struct definition:

struct sigaction {
    union {
      __sighandler_t    _sa_handler;
      void (*_sa_sigaction)(int, struct siginfo *, void *);
    } _u;
    (..other fields..)
};

You’re supposed to write foo->sa_handler rather than interacting with the _u and _sa_handler fields directly - indeed, the former name is standardized while the latter is meant to be an implementation detail. Ideally Rust bindings for such APIs would be written as true anonymous structs (though getting bindgen to do it automatically would not be easy).

There are quite a lot of these - here’s a grep of /usr/include from OS X:

http://pastie.org/10938609

and here’s the Linux kernel:

http://pastie.org/10938610


#16

Using Deref, an unnamed field can be emulated as follows:

use std::ops::Deref;

struct InnerS {
    inner_a: i32,
    inner_b: i32
}

struct S {
    a: u32,
    inner: InnerS
}

impl Deref for S {
    type Target = InnerS;

    fn deref(&self) -> &InnerS {
        &self.inner
    }
}

fn main() {
    let x = S {a: 1, inner: InnerS {inner_a: 2, inner_b: 3}};
    println!("inner_a = {}", x.inner_a);
    // -> "inner_a = 2"
}

It seems sufficient for omitting field names. :)


#17

You can only have one implementation of Deref for a type, so that would only allow one unnamed field per type, and only if the type didn’t need to use Deref for anything else.


#18

I would strongly recommend against this - Deref is designed for implementing smart pointers and this (mis-)use is confusing at best. See https://github.com/nrc/patterns/blob/master/anti_patterns/deref.md for more details.


#19

Agreed; Deref doesn’t make sense for this.


#20

Can we dust this off? I was about to tackle adding full support for siginfo_t to the libc crate (which is needed in order for waitid to be useful; right now many platforms don’t expose the si_pid and si_status fields) but the problem is that both Linux and NetBSD do the thing @comex described, where there’s a bunch of implementation-detail substructures hidden from the public C API with macros. The documented siginfo_t looks something like

typedef struct {
    int si_signo;
    int si_code;
    int si_errno;
    pid_t si_pid;
    uid_t si_uid;
    void *si_addr;
    int si_status;
    long si_band;
    union sigval si_value;
} siginfo_t;

But the actual NetBSD definition looks like this:

struct _ksiginfo {
	int	_signo;
	int	_code;
	int	_errno;
#ifdef _LP64
	/* In _LP64 the union starts on an 8-byte boundary. */
	int	_pad;
#endif
	union {
		struct {
			pid_t	_pid;
			uid_t	_uid;
			sigval_t	_value;
		} _rt;

		struct {
			pid_t	_pid;
			uid_t	_uid;
			int	_status;
			clock_t	_utime;
			clock_t	_stime;
		} _child;

		struct {
			void   *_addr;
			int	_trap;
			int	_trap2;
			int	_trap3;
		} _fault;

		struct {
			long	_band;
			int	_fd;
		} _poll;
	} _reason;
};
/* ... */
typedef union siginfo {
	char	si_pad[128];	/* Total size; for future expansion */
	struct _ksiginfo _info;
} siginfo_t;

/** Field access macros */
#define	si_signo	_info._signo
#define	si_code		_info._code
#define	si_errno	_info._errno

#define	si_value	_info._reason._rt._value
#define	si_pid		_info._reason._child._pid
#define	si_uid		_info._reason._child._uid
#define	si_status	_info._reason._child._status
#define	si_utime	_info._reason._child._utime
#define	si_stime	_info._reason._child._stime
/* etc */

Contrast the FreeBSD definition, which still has the union, but far fewer of the fields are inside. C programmers don’t have to care (except that they’ll be in for a nasty surprise if they try to use the si_* identifiers for anything else).

Not only would it be nice if the libc crate could be equally convenient, at this point we may not be able to adopt any of the workarounds (e.g. accessor methods) for siginfo_t without breaking compatibility.
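For reference, the accessor-method workaround would look roughly like the sketch below. This is a heavily simplified, hypothetical layout loosely modeled on the NetBSD definition above; it is not the libc crate's actual definition, and the type names (SigInfo, Reason, Child, Fault, Pid, Uid) and the helper child_info are invented here.

```rust
use std::os::raw::{c_int, c_void};

// Stand-ins for pid_t and uid_t; the real types are platform-specific.
pub type Pid = i32;
pub type Uid = u32;

#[repr(C)]
#[derive(Clone, Copy)]
pub struct Child {
    pid: Pid,
    uid: Uid,
    status: c_int,
}

#[repr(C)]
#[derive(Clone, Copy)]
pub struct Fault {
    addr: *mut c_void,
    trap: c_int,
}

#[repr(C)]
pub union Reason {
    child: Child,
    fault: Fault,
}

#[repr(C)]
pub struct SigInfo {
    signo: c_int,
    code: c_int,
    errno: c_int,
    reason: Reason,
}

impl SigInfo {
    pub fn si_signo(&self) -> c_int {
        self.signo
    }

    /// Plays the role of `#define si_pid _info._reason._child._pid`.
    ///
    /// # Safety
    /// The caller must know that this signal carries child information.
    pub unsafe fn si_pid(&self) -> Pid {
        unsafe { self.reason.child.pid }
    }

    /// # Safety
    /// Same requirement as `si_pid`.
    pub unsafe fn si_status(&self) -> c_int {
        unsafe { self.reason.child.status }
    }
}

// Helper for the example only: builds a value holding child information.
pub fn child_info(signo: c_int, pid: Pid, status: c_int) -> SigInfo {
    SigInfo {
        signo,
        code: 0,
        errno: 0,
        reason: Reason { child: Child { pid, uid: 0, status } },
    }
}
```

Callers then write info.si_pid() where C code writes info.si_pid. That is the incompatibility mentioned above: a type that already exposes public fields can't switch to methods without breaking existing users.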