Type descriptions (reflection) for non-C FFI

  • Yes, this is effectively opt-in reflection that additionally allows specifying/describing types from other languages.
  • Yes, this should probably be evaluated in a crate. (I'm still interested in your thoughts on it)

Also see Pre-RFC: Runtime reflection

There are a bunch of things to consider with regard to using reflection within a single process/crate. This is not about that, but primarily about encoding, FFI and sandboxing (e.g. via WASM).

At the moment, basically all FFI data exchange (unless the other language is C) goes through a third, less flexible memory layout, or the data needs to be serialized/deserialized. One (partial) remediation would be a (cross-target) stable memory representation (e.g. #[repr(v1)]), which works as long as the other side can understand it. Another option is type descriptions, similar to what is needed for reflection and similar to how many encodings work (e.g. JSON, binary JSON and ASN.1/DER, which keep the type information right next to the data; protobuf, which has a separate human-readable DSL to describe it; and others that store it separately from the data but in binary form).

I think it would be amazing for both FFI, serialization (if not everything is known at compile time) and compatibility in general if types (including non-Rust memory layouts) could be described (opt-in generation of static data). Similar (conceptually) to how traits describe functions on types, which are then stored in the vtable. The problem with traits and vtables is that the exact type must be known on both sides at compile time, while a type description can be serialized.

In practice this would likely be a #[derive(TypeDescription)], adding a method to get an immutable reference to the static type description. Using this at runtime isn't the only way it could be useful; for example, macros that only need to care about fields/memory layout could consume it as well, see reflect - Rust, which provides its own type description intended for compile-time use only.

I think it might make sense to have the type description and derive macro in std, and its uses, like deserialization or FFI types, in crates.

As an example of how it could be represented:

struct TypeDescription {
    // I'm not 100% sure if this usage is valid or if we need String/Vec.
    // The list of opcodes manipulates the address to get to the value.
    fields: HashMap<&'static str, (TypeID, &'static [Opcode])>,
}
#[non_exhaustive]
enum Opcode {
    Offset(usize),        // addr += arg0
    OffsetReadU32(usize), // addr += read_u32_at(addr) * arg0
    Pointer,              // addr = read_usize_at(addr)
}

#[derive(TypeDescription)]
struct MyType {}
#[derive(TypeDescription)]
enum MyEnum {}

Describing types from other languages (and even most encodings) this way might even be easier than trying to write the corresponding C type on both sides. Similarly, it may provide a more flexible abstraction for crates like serde (which could use this type information at compile time to generate code, instead of having to use and derive its own trait-based view of Rust types). All of these field accesses are of course only safe if the underlying data actually is of the described type.

As far as I can tell this would be flexible enough for all existing Rust types and most types from other languages. Accessing the fields does have an overhead of course, but when the information is available at compile time this can (probably) be used to generate efficient get/get_mut functions (via macros) for zero-copy access, since it is really close to what needs to be done in assembly anyway.

Why would this need to be in std? Deriving this type description in a crate (although possible) is likely more difficult, since a crate is limited in how much information it can get about the type layout. Additionally, some of this information already needs to be present in compiler internals to generate an intermediate representation, so exposing it as a static (on a per-type, opt-in basis) might not even cost much in terms of compilation time (it still impacts binary size, of course). And doing this in a proc-macro (while possible) may break with the addition of new syntax and would have to redo work the compiler already needs to do. I see this as similar to FormatArgs, which is a compact representation of what to do (containing the data itself) as a list of opcodes (iirc).

This is (by design) both more limiting and more flexible than having reflection like in other languages, since you don't get the abstract view of what is a struct, what is an enum, but instead only a way to access specific parts. This should also make the TypeDescription data smaller.

The main goal here (especially when combined with memory layout guarantees) is accessing types/data you may not even know at compile time (for example for an Inspector/Editor showing data defined + used primarily by a plugin/extension/dll) without the need for a third data representation (e.g. C), which could require multiple memcpys to get the data into the required format.

What are your thoughts on this?

Because of offset_of!, this can be done in a library crate derive just as well as directly in the compiler (in terms of the generated results, anyway). Or rather, for structs it can.

Ignoring enums is problematic; when you have overlapping fields where only some are valid at any given moment (as enum representations do), the type description as you've laid it out gives no way to know which fields are valid to inspect, even if you know the data is of the correct type.


I've debated whether to include enum handling in the example or not and ended up leaving it out (to keep it short and focus on handling offsets and the concept itself). But you're right, considering enums here is important.

My idea for handling enum variants (and similar situations) was a StopIfBeU8Eq(u8) opcode, which reads one byte, compares it to arg0 and stops processing (returning None from the get-field functions, similar to ?) if they are not equal. That should work for enums and other encodings where fields can be optional.

It's probably not ideal. For example, it can represent AND by having two StopIf* opcodes, but it can't represent OR (though for enums that's only needed when the same field name is shared between variants). I'm not sure if endianness should be included (same for reading offsets), and in many cases you may want the entire field to output an enum for easier matching (as in TypeID = that of an enum). But combined with a field to get the discriminant itself, it is enough to get fields in an enum variant, though the information about which fields are available for which variants is lost (as it isn't strictly required for getting the values).

Granted, this only answers whether you can read a field when you try to; it doesn't directly provide a "give me all fields I can read". That could be implemented on top of this layout, though perhaps less efficiently for enums with many variants (e.g. by returning an iterator that yields all fields whose opcode lists don't stop).

Similarly, I've left out how/whether it should be possible to specify that multiple fields share the same start (e.g. for nested structs or enums with multiple fields in their variants), as it's likely useful to have (and can further compress the opcode lists), but is more of an implementation detail. The same goes for whether there should be any branching within the opcode list (to specify that a field's location depends on the value at location A, but not in a linear way).