Cross-language safer ABI based on Rust?

Right now, the standard cross-language ABI remains C. If you want to expose an interface to another language, or consume an interface from another language, you normally go by way of C. That applies even for interfaces between two memory-safe languages that don’t allow issues like buffer overflows or use-after-free vulnerabilities.

I’ve seen various proposals for allowing Rust to “natively” use objects owned by other languages, such as the JavaScript GCed heap. However, has anyone considered the idea of defining a safe cross-language ABI, that standardizes things like counted buffers, more semantic types, and ownership/borrowing?

Such an ABI might look a lot like a subset of Rust type signatures.

That sounds cool and difficult. Generics in particular would be a tough one. And if you mean “safe” literally, it would have to have its own type system… one that’s compatible with the type systems of all the languages that want to interface through it! (Which kind of reminds me of the JVM and CLR… which at least have the option of falling back to runtime checks to enforce the harder things.)

Wouldn’t that - especially in the case of generics - also mean that you need some kind of compiler at hand, that compiles monomorphised versions of ABI-exported functions at (dynamic) link time?

I’m not sure I understand what it would mean to have a “typed” ABI, but I would expect such a language just wouldn’t feature parametric polymorphism.

It seems like there is some amount of prior art in typed assembly languages.

Exactly. For a first pass, the ABI could support the subset of the type system that doesn't include generics or traits.

That would still provide owned types vs references, length-accompanied slices, owned and borrowed strings, and other things that the C ABI doesn't cover.
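
A minimal Rust sketch of what length-accompanied and owned-versus-borrowed types could look like at such a boundary. The names `BorrowedBytes` and `OwnedBytes` are hypothetical and not part of any existing ABI; the sketch assumes both sides agree on `#[repr(C)]` layout and that owned buffers are always freed by the Rust side's allocator.

```rust
use std::marker::PhantomData;
use std::mem::ManuallyDrop;
use std::{ptr, slice};

/// Hypothetical ABI type: a borrowed byte slice that always carries its length.
/// The lifetime ties the view to the data it was created from.
#[repr(C)]
#[derive(Clone, Copy)]
pub struct BorrowedBytes<'a> {
    ptr: *const u8,
    len: usize,
    _borrow: PhantomData<&'a [u8]>,
}

impl<'a> BorrowedBytes<'a> {
    pub fn new(data: &'a [u8]) -> Self {
        BorrowedBytes { ptr: data.as_ptr(), len: data.len(), _borrow: PhantomData }
    }

    pub fn as_slice(&self) -> &'a [u8] {
        // SAFETY: `new` is the only constructor, so ptr/len describe a live `&'a [u8]`.
        unsafe { slice::from_raw_parts(self.ptr, self.len) }
    }
}

/// Hypothetical ABI type: an owned buffer. Whoever holds it must release it.
#[repr(C)]
pub struct OwnedBytes {
    ptr: *mut u8,
    len: usize,
}

impl OwnedBytes {
    pub fn from_vec(v: Vec<u8>) -> Self {
        let boxed: Box<[u8]> = v.into_boxed_slice();
        let len = boxed.len();
        OwnedBytes { ptr: Box::into_raw(boxed) as *mut u8, len }
    }

    pub fn into_vec(self) -> Vec<u8> {
        // Skip our own Drop so the allocation is not freed twice.
        let me = ManuallyDrop::new(self);
        // SAFETY: ptr/len came from `Box::into_raw` on a `Box<[u8]>` in `from_vec`.
        unsafe { Box::from_raw(ptr::slice_from_raw_parts_mut(me.ptr, me.len)) }.into_vec()
    }
}

impl Drop for OwnedBytes {
    fn drop(&mut self) {
        // SAFETY: same provenance argument as in `into_vec`.
        unsafe { drop(Box::from_raw(ptr::slice_from_raw_parts_mut(self.ptr, self.len))) }
    }
}
```

The type itself tells the other side whether it may free the pointer, which a bare `const char *` in a C header cannot express.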

Yes I want this a whole lot and think about it often. Today Rust does not quite fulfill all C use cases, and I think this is a big part of it. I want there to be an identifiable subset of Rust that is C-ABI compatible (possibly that means no generics), and I want MIR to be a multi-language IR. Please do that!

Ownership can provide a great foundational model to build other computing systems off of, and more languages are going to incorporate ownership concepts in the future. Having them speak Rust helps guard against competition from 2nd-generation ownership systems.

:thumbsup: I also think about this all the time.

If done right and carefully, I think this is actually an old and neglected area of CS that Rust could substantially influence for the better. But it’s also a more long-term play (and tedious and esoteric), so not as glamorous as other areas, and without immediate and obvious benefits.

I could ramble on, but some of the constraints we have today in ABIs (mostly due to historical precedent, and hence not engineered) aren’t really all that applicable anymore.

Some areas off the top of my head where great good could result:

  1. Standardized name mangling protocol (I also think this shouldn’t be language dependent)
  2. Generic dynamic library design
  3. Static linking container format (this is more controversial, and possibly out of scope wrt abi, but again won’t get into it right now)

Is there something preventing a new ABI from skipping name mangling? My understanding is that name mangling is necessary to work around the C-ABI's limits on symbol characters.

You'd have to reinvent both the linker (ld) and the dynamic loader (ld.so). If you're doing this in a way that only supports the new safer ABI, no traditional FFI libraries, then maybe these could be simpler than they currently are, but I still expect this is no small task.

Neither the linker nor the dynamic loader cares about name mangling directly; they just match up symbols by name. The ABI for a given platform might involve name mangling, so if you want to interoperate with that ABI you'd need to mangle your symbols to match, but if you define a new ABI you don't necessarily have to.

(Also, C++ does extensive name mangling, but that would primarily matter if you want to provide method calls or type-based overloading.)

The rough and short of it is something like: name mangling was introduced by C++ in order to still use C linker tools but also have generic functions. The problem was that the C function "ABI" is flat and essentially global (with various caveats): a function maps one-to-one to its name, e.g. printf (again with some caveats) is the name that both the linker and the dynamic linker search for when resolving a call to that function.

But a generic function has one source code name and may need multiple C function names (essentially one for each implementation, as though you had hand-specialized a particular version for String, usize, or whatever arguments you pass as the generic object). Hence the solution was to emit a "mangled" name that encoded the parameter types in some manner, and this was the C linking name, which the linker tools could use (the compiler knew to emit the mangled name for a particular call, and the linker is dumb and just searches for that name). I may take some flak, but it was basically a hack because they wanted to use C linker tools, not reinvent wheels, and so they did it, and moved on.
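
To make the idea concrete, here is a toy mangler in Rust. The scheme (a `_T` prefix followed by length-prefixed identifiers) is invented for illustration and is not what rustc or any C++ compiler actually emits; `mangle` and `push_ident` are hypothetical names.

```rust
/// Toy scheme: `_T`, then the function name, then one entry per type argument.
/// Each identifier is written as its byte length followed by its text.
pub fn mangle(function: &str, type_args: &[&str]) -> String {
    let mut symbol = String::from("_T");
    push_ident(&mut symbol, function);
    for arg in type_args {
        push_ident(&mut symbol, arg);
    }
    symbol
}

/// Length prefixes keep the encoding unambiguous as long as identifiers
/// never start with a digit, which holds for Rust and C identifiers.
fn push_ident(out: &mut String, ident: &str) {
    out.push_str(&ident.len().to_string());
    out.push_str(ident);
}
```

With this scheme `mangle("largest", &["usize"])` yields `_T7largest5usize` and `mangle("largest", &["String"])` yields `_T7largest6String`: one source name, two distinct flat symbols that a name-matching linker can tell apart.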

They famously never specified how to mangle names in the standard. Again, I may take some flak, but I think this was a pretty serious blunder, and I can't see any particularly good reason for not doing so. (And now we can't really link C++ programs compiled with different compilers, due in good part to the lack of standardized name mangling, among other things.)

So because Rust has generics, if the ABI we're talking about is for the entire set of Rust, then we can't skip name mangling, insofar as we have to provide a solution for some very interesting and important questions like:

  1. how to resolve a "path" to a symbol, given generic arguments
  2. where the genericity exists/whose responsibility it is
  3. how to represent genericity (sort of 1)

@cuviper has made some good points concerning 2 elsewhere. So when I talk about a "standardized name mangling protocol" what I'm talking about is engineering a solution to 1-3, but primarily 3, i.e., the nitty gritty of a transformation from symbolic name to a unique, reified reference (assuming we want to use dumb linker tools, which is probably never going to change).

And while we're on the topic, I'll just say I think we should use a variant of Gödel numbering, e.g. something like: http://www.cse.unt.edu/~tarau/research/2009/fgoedel.pdf

AFAIK C doesn't have a symbol name length limit per se, or if it does, it's probably platform dependent. I don't think any modern static or dynamic linker on Earth right now will have any problem with an extremely large symbol name (which definitely occurs with C++). Of the three binary formats, ELF, Mach-O, and PE, none has any technical limitation on the actual character count.

I'd have to check, but the "a.out" format might, which is basically the original unix binary format.

If you meant the kind of characters, e.g., UTF-8: Rust could generate UTF-8 symbols for functions, which the linker and dynamic linker worked with just fine, until I sort of broke it in a roundabout way: "export_name with unusual utf8 breaks new version script based linker" (rust-lang/rust issue #38238)

As a first pass, we could have a safe ABI based on a subset of Rust types that doesn’t include generics.

Yea, I’m definitely for that! And I think it’s actually the way to do it: start off specifying a smaller (and likely easier) subset of the language, nail it down, and then extend to the more complex features once some of the kinks are worked out.

But I’d also be lying if I said I wasn’t more interested in figuring out a really great ABI for a language (or languages) with generics :slight_smile:

I’d encourage everyone thinking about this stuff to read through the “Itanium” C++ ABI document. This was developed originally to promote interop among C++ compilers for Itanic, but is now (as far as I know) used by both GCC and LLVM for all CPUs and all operating systems other than Windows (where the MSVC++ ABI is used instead). I had a small hand in the development of this spec and its implementation in GCC, and my former boss at CodeSourcery (Mark Mitchell) was the lead author.

The most important thing you should notice is that this ABI has deeply seated dependencies on details of the C++ object model, and the majority of the text is devoted to C++-specific issues, such as the multiple vtables required to account for each phase of a complex C++ object’s lifecycle, the rules for type identity, etc. In many places you have to have guru-level understanding of the C++ standard just to know what problem it’s trying to solve, let alone what the solution is. Rust could conceivably make use of the name mangling spec (the primary function of name mangling is not to permit function overloading, IMNSHO, but to render caller-callee function signature inconsistencies across a separate-compilation boundary detectable at link time) but very little else.

I would argue, based on my experience with this spec, that it’s an open research question whether it’s even possible to develop a cross-language ABI for languages with a rich object model. That doesn’t mean it’s not useful to try! But it does mean that you should start very, very small. Try to solve concrete and bounded problems, one at a time. Here’s an example of a concrete and bounded problem in this area: “How can I make Result<T,E>, for some relatively limited set of types T and E (start with () and integers), usable as the return value of system calls when my OS kernel is written in Rust but its user space applications could be in any language?”
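
As a sketch of how small that concrete problem can start, here is one possible fixed encoding of `Result<u32, u32>` in Rust. The type `AbiResultU32` and the tag values are assumptions made up for this example, not an existing kernel or ABI convention.

```rust
/// Hypothetical tag values; any other tag is treated as malformed.
pub const TAG_OK: u32 = 0;
pub const TAG_ERR: u32 = 1;

/// Fixed-layout stand-in for `Result<u32, u32>`: two 32-bit words that
/// any language with C-style structs can read.
#[repr(C)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct AbiResultU32 {
    pub tag: u32,
    pub payload: u32,
}

impl From<Result<u32, u32>> for AbiResultU32 {
    fn from(r: Result<u32, u32>) -> Self {
        match r {
            Ok(value) => AbiResultU32 { tag: TAG_OK, payload: value },
            Err(code) => AbiResultU32 { tag: TAG_ERR, payload: code },
        }
    }
}

impl AbiResultU32 {
    /// Decode on the receiving side. Returns `None` for an unknown tag
    /// instead of trusting the other side of the boundary.
    pub fn into_result(self) -> Option<Result<u32, u32>> {
        match self.tag {
            TAG_OK => Some(Ok(self.payload)),
            TAG_ERR => Some(Err(self.payload)),
            _ => None,
        }
    }
}
```

Validating the tag on decode is a deliberate choice here: a user-space caller in another language gets a defined outcome even if the value was corrupted or produced by a mismatched version.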

The specific set of problems I’d like to solve in a cross-language ABI, in order of priority:

  • Can we safely pass strings and arrays without either language having to use unsafe code working with the length?
  • Can we distinguish between owned and borrowed objects, so one language can call another with a pointer and know whether it can free that pointer afterward?
  • Can we support passing and returning tuples and simple algebraic data types (Rust-style enums)?
  • Can we support returning a pointer to an object that maintains references to the internals of one of the passed arguments, and thus must not outlive that argument?

Note that for all of these, both languages have the responsibility to propagate the appropriate constraints to the rest of their code.
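
For reference, this is roughly how the ownership and "must not outlive the argument" constraints from the list read as plain Rust signatures today. `Document`, `Token`, `first_token`, and `consume` are hypothetical names chosen for the sketch; a cross-language ABI would need some way to carry the same information.

```rust
/// An object owned by the callee's language.
pub struct Document {
    text: String,
}

impl Document {
    pub fn new(text: &str) -> Self {
        Document { text: text.to_owned() }
    }
}

/// A returned object that points into a `Document`. The lifetime `'doc`
/// records that it must not outlive the document it came from.
pub struct Token<'doc> {
    word: &'doc str,
}

impl<'doc> Token<'doc> {
    pub fn word(&self) -> &'doc str {
        self.word
    }
}

/// Borrowing call: the caller keeps ownership of `doc`, and the result
/// is only valid while `doc` is alive.
pub fn first_token<'doc>(doc: &'doc Document) -> Option<Token<'doc>> {
    doc.text.split_whitespace().next().map(|word| Token { word })
}

/// Owning call: the caller gives `doc` away and may not touch it afterwards.
pub fn consume(doc: Document) -> usize {
    doc.text.len()
}
```

Within Rust the compiler rejects a program that uses a `Token` after its `Document` has been passed to `consume`; across a language boundary each side would have to uphold that rule itself, as noted above.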

Every item on this list is awesome :thumbsup:

All of these seem doable to me. #2 and #4 are "a simple matter" of defining a name mangling scheme that encodes all of these properties - the C++ ABI I linked to can already distinguish various kinds of pointers and references, so #2 would be just some more modifier letters. #4 is more involved because you'd have to invent an encoding for lifetimes and you'd have to include the return type in the mangling (C++ doesn't bother, although it really ought to). I don't know how this information is currently propagated across crate boundaries, but for a cross-language scheme, name mangling is the path of least resistance since you can continue to use the existing linker.

#1 and #3 involve nailing down object layout, and I had the impression the compiler team didn't want to do that just yet. Maybe it would be OK for simple things like strings and small-arity tuples and enums containing only scalars.
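
For what it's worth, Rust already lets you opt individual types into a specified layout with `repr` attributes, which may be enough for that simple subset. A small sketch, where `Pair`, `Shape`, `tag_of`, and `layout_summary` are hypothetical names and the sizes in the comments assume a target where `u32` is 4-byte aligned:

```rust
use std::mem::{align_of, size_of};

/// Stand-in for a small tuple, with C struct layout:
/// `a` at offset 0, `b` at offset 4, size 8, alignment 4.
#[repr(C)]
pub struct Pair {
    pub a: u32,
    pub b: u16,
}

/// Enum containing only scalars. `repr(C, u8)` specifies the layout as a
/// `u8` tag followed by a C union of the variant payloads.
#[repr(C, u8)]
pub enum Shape {
    Empty,
    Circle(u32),
    Rect(u16, u16),
}

/// Read the discriminant the way a foreign caller would: as the first byte.
pub fn tag_of(shape: &Shape) -> u8 {
    // SAFETY: with an explicit `u8` repr the tag is stored at offset 0.
    unsafe { *(shape as *const Shape as *const u8) }
}

/// (size, alignment) pairs for `Pair` and `Shape`.
pub fn layout_summary() -> [(usize, usize); 2] {
    [
        (size_of::<Pair>(), align_of::<Pair>()),
        (size_of::<Shape>(), align_of::<Shape>()),
    ]
}
```

This only covers types whose author opted in; default-`repr` Rust types keep their unspecified layout, which is the part the question above is really about.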

Struct layout doesn't need to be fixed. Object files (rlib) can contain metadata about the struct layout.

Edit: enums can also be supported by adding an abstract VM to determine the discriminant. (This supports the null pointer optimization and other optimizations of the same kind.)

Specifying that metadata in enough detail that other languages can reasonably use it is probably a lot more work than committing to a fixed layout for some subset of types.

Same, and now you also have a weird machine, and it may be subvertible.

Rustc already generates and uses that metadata. I just propose to put it in the object files.