Pre-RFC: Support using the Rust allocator from C

Summary

Make the Rust allocator callable from C.

Motivation

When using both C/C++ and Rust in the same project, it can be necessary to create Rust Boxes, Vecs and Strings from the C/C++ side. As the Rust code may use an allocator other than malloc, the C/C++ side is not allowed to use malloc for this. Instead, the Rust allocator needs to be called. Currently this requires everyone to individually create wrapper functions on the Rust side that export a C interface. It would be much easier to allow using the Rust allocator directly from C/C++.

Guide-level explanation

It becomes possible to allocate and free memory using the Rust allocator in C/C++ code that links to Rust code. This is useful when Rust and C/C++ code interact a lot. For example:

void *_RNvNtC4rust5alloc5alloc(size_t, size_t);
bool rust_box_is_42(uint32_t *);

bool box_is_42(uint32_t val) {
    uint32_t *box_ptr = _RNvNtC4rust5alloc5alloc(sizeof(uint32_t), alignof(uint32_t));
    // Initialize the allocation before handing ownership to Rust.
    *box_ptr = val;
    return rust_box_is_42(box_ptr);
}
#[no_mangle]
unsafe extern "C" fn rust_box_is_42(box_ptr: *mut u32) -> bool {
    // `Box::from_raw` would not be allowed if `box_ptr` was allocated using `malloc`.
    // It takes a `*mut` pointer because the `Box` takes ownership of the allocation.
    let box_ = Box::from_raw(box_ptr);
    *box_ == 42
}

Or the other way around:

void _RNvNtC4rust5alloc7dealloc(void *, size_t, size_t);
uint32_t *get_rust_boxed_num();

bool box_is_42() {
    uint32_t *box = get_rust_boxed_num();
    bool is_42 = *box == 42;
    // It would not be valid to free `box` using `free`.
    _RNvNtC4rust5alloc7dealloc(box, sizeof(uint32_t), alignof(uint32_t));
    return is_42;
}
#[no_mangle]
extern "C" fn get_rust_boxed_num() -> *const u32 {
    Box::into_raw(Box::new(rand()))
}
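
Both directions above can also be exercised from Rust alone, which may help to see what the C side is expected to do. This is only a sketch: the helper names `raw_alloc_into_box` and `box_into_raw_dealloc` are made up for illustration, and it calls `std::alloc::alloc`/`dealloc` directly rather than the exported symbols proposed here.

```rust
use std::alloc::{alloc, dealloc, handle_alloc_error, Layout};

/// Hypothetical helper mirroring the first C example: allocate raw memory
/// with the global allocator, initialize it, then let a `Box` own it.
fn raw_alloc_into_box(val: u32) -> Box<u32> {
    let layout = Layout::new::<u32>();
    // SAFETY: the layout of `u32` has a non-zero size.
    let ptr = unsafe { alloc(layout) } as *mut u32;
    if ptr.is_null() {
        handle_alloc_error(layout);
    }
    // SAFETY: `ptr` is non-null, aligned for `u32`, and was allocated by the
    // global allocator with the layout of `u32`, which is what `Box` expects.
    unsafe {
        ptr.write(val);
        Box::from_raw(ptr)
    }
}

/// Hypothetical helper mirroring the second C example: take a `Box` apart
/// and free its memory through the allocator with the matching layout.
fn box_into_raw_dealloc(boxed: Box<u32>) -> u32 {
    let ptr = Box::into_raw(boxed);
    // SAFETY: `ptr` came from a `Box<u32>`, so it is valid to read and was
    // allocated by the global allocator with the layout of `u32`.
    unsafe {
        let val = ptr.read();
        dealloc(ptr as *mut u8, Layout::new::<u32>());
        val
    }
}
```

The size and alignment passed on the C side play the role of `Layout::new::<u32>()` here, so they have to match the type that the Rust side boxes.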

It is not allowed to override (or define) any of the allocator symbols yourself, whether in C or Rust. They must always be defined by rustc through either #[global_allocator] or libstd.

Reference-level explanation

The following functions will be exported that forward to the global Rust allocator:

#[linkage_name = "_RNvNtC4rust5alloc5alloc"]
extern "C" fn alloc(size: usize, align: usize) -> *mut u8;

#[linkage_name = "_RNvNtC4rust5alloc11alloc_zeroed"]
extern "C" fn alloc_zeroed(size: usize, align: usize) -> *mut u8;

#[linkage_name = "_RNvNtC4rust5alloc7realloc"]
extern "C" fn realloc(ptr: *mut u8, old_size: usize, align: usize, new_size: usize) -> *mut u8;

#[linkage_name = "_RNvNtC4rust5alloc7dealloc"]
extern "C" fn dealloc(ptr: *mut u8, size: usize, align: usize);

These functions directly correspond to the respective methods on GlobalAlloc. All functions have the same safety invariants as the corresponding methods on GlobalAlloc. In addition the safety invariants of Layout::from_size_align must be followed when called with the given size and align.
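
To make the size and align requirements concrete, here is a small sketch of the checks a caller has to satisfy before calling any of these functions. The helper name `layout_from_c_args` is hypothetical and not part of the proposal; the exported functions themselves would not perform these checks for you.

```rust
use std::alloc::Layout;

/// Hypothetical helper: turns C-side `size` and `align` arguments into a
/// `Layout`, rejecting values the allocator functions must never see.
fn layout_from_c_args(size: usize, align: usize) -> Result<Layout, &'static str> {
    // `GlobalAlloc::alloc` requires a layout with a non-zero size.
    if size == 0 {
        return Err("size must be non-zero");
    }
    // `from_size_align` fails if `align` is zero or not a power of two, or if
    // `size` rounded up to a multiple of `align` would overflow.
    Layout::from_size_align(size, align).map_err(|_| "invalid size/align combination")
}
```

For example, `layout_from_c_args(4, 3)` is rejected because 3 is not a power of two, while `layout_from_c_args(4, 4)` is accepted.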

These functions are directly visible to all code statically linked into a rust executable or shared library that depends on liballoc and all code that links to either libstd.so or a shared library that contains #[global_allocator]. They will always be available at runtime when liballoc is linked in, but it may be necessary to tell the linker that it is fine if it can't immediately find the allocator symbols using for example --undefined.

This RFC can be implemented by renaming the methods of the allocator shim and then adding documentation that the signatures of the functions in the allocator shim must not change.

Drawbacks

Why should we not do this?

Rationale and alternatives

The function signature of these functions is already effectively stable as they correspond 1-to-1 with methods on the stable GlobalAlloc.

The OOM handling functions could also be stably exported, but given that defining the OOM handler isn't stable yet anyway and there are several design choices there that would affect the possibly exported C api, this RFC does not propose to export a C api for the OOM handler.

It would be possible to not export these functions. This is the status quo. It however requires everyone to write their own wrapper functions.
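
For reference, the wrapper functions people write today look roughly like the sketch below. The names `demo_rust_alloc` and `demo_rust_dealloc` are placeholders, and returning null for invalid arguments is a choice made for this sketch, not something the RFC specifies. In a real FFI crate the functions would also carry `#[no_mangle]` so that C can link against them; it is left out here to keep the snippet independent of the Rust edition.

```rust
use std::alloc::{alloc, dealloc, Layout};
use std::ptr;

/// Placeholder wrapper: allocates through the Rust global allocator.
/// Returns null for a zero size or an invalid size/align combination.
pub extern "C" fn demo_rust_alloc(size: usize, align: usize) -> *mut u8 {
    match Layout::from_size_align(size, align) {
        // SAFETY: the guard ensures the layout has a non-zero size.
        Ok(layout) if layout.size() != 0 => unsafe { alloc(layout) },
        _ => ptr::null_mut(),
    }
}

/// Placeholder wrapper: frees memory obtained from `demo_rust_alloc`.
///
/// # Safety
/// `ptr` must be null or come from `demo_rust_alloc` called with the same
/// `size` and `align`, and must not have been freed already.
pub unsafe extern "C" fn demo_rust_dealloc(ptr: *mut u8, size: usize, align: usize) {
    if ptr.is_null() {
        return;
    }
    // SAFETY: the caller guarantees that `size` and `align` describe the
    // layout this block was allocated with, so the layout is valid.
    unsafe { dealloc(ptr, Layout::from_size_align_unchecked(size, align)) }
}
```

Every project that mixes the two languages ends up with some variation of this, which is the duplication the RFC wants to remove.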

Prior art

C/C++ allow directly calling the global allocator from other languages using malloc and free. Many "managed" languages like Python or Java also allow directly using the global allocator to allocate native objects using a C api.

Unresolved questions

What name should be used? @eddyb suggested the names used in this RFC in https://rust-lang.zulipchat.com/#narrow/stream/131828-t-compiler/topic/upstreaming.20LLVM.20patch.20for.20__rust_.20alloc.20functions/near/247438389 such that backtraces look nice and it is clear that the functions are coming from Rust and not C or C++. It would also be possible to use names like __rust_alloc and __rust_dealloc or __rustalloc_alloc and __rustalloc_dealloc. These have the advantage of being easier to memorize and looking less weird.

Should the extern "C" or the extern "C-unwind" ABI be used? In other words, should these functions be allowed to unwind in the future?

How do we ensure that --undefined is never necessary to allow linking to succeed?

Future possibilities

None that I know of.

Using the mangled symbols on the C++ side looks a little painful (I understand it's necessary for C, though). Could we add support for Rust symbol mangling to the C++ standard somehow, so you could use extern "Rust" rust::alloc::alloc;? I guess that would be a bad idea until the v0 symbol mangling finally gets upstreamed :crossed_fingers: but maybe after that?

I think the following would work as header file:

namespace {
extern "C" {
    void *_RNvNtC4rust5alloc5alloc(size_t, size_t);
    void *_RNvNtC4rust5alloc11alloc_zeroed(size_t, size_t);
    void *_RNvNtC4rust5alloc7realloc(void *, size_t, size_t, size_t);
    void _RNvNtC4rust5alloc7dealloc(void *, size_t, size_t);
}
}

namespace rust {
namespace alloc {
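// `using` can only alias types, so expose the functions as constant function pointers.
constexpr auto alloc = &_RNvNtC4rust5alloc5alloc;
constexpr auto alloc_zeroed = &_RNvNtC4rust5alloc11alloc_zeroed;
constexpr auto realloc = &_RNvNtC4rust5alloc7realloc;
constexpr auto dealloc = &_RNvNtC4rust5alloc7dealloc;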

}
}

You need a header file to specify the function signature anyway.

Wouldn't need to be in the standard. An implementation could always support the "Rust" language linkage. For example:

namespace rust::alloc{
    extern "Rust" unsigned char *alloc(unsigned long size, unsigned long align);
}

would allow a C++ implementation that is aware of names from a particular rust implementation to mangle that symbol according to the rust name. I actually intend to do this in lccc, though it also adjusts the ABI to the "Rust" abi as well (so it would have to be exported in rust as extern"Rust" to work properly).

I would recommend against using the "v0" mangling for exports, unless that is intended as more than just a "suggested" name mangling scheme, and the one to be adopted by rustc specifically. In lccc, to avoid naming problems caused by trying to select from two naming schemes, I use (and have extended) the Itanium syntax (not consistent with legacy - the mangling scheme is designed to produce consistent symbol names). The fact that #[export_name]/#[linkage_name]/whatever it's called exists, though, means this isn't a non-starter without forcing the name mangling scheme upon all implementations. I would suggest using unmangled names, though. Knowing the constructions for Itanium names, I can tell you that remembering __rust_alloc is a lot easier than _RNvNtC4rust5alloc5alloc or _ZN4rust5alloc5allocEmm.

Note that the C++ global allocator is not the same as malloc or free, and its name is not accessible to other languages (on Itanium x86_64-Sys-V the basic allocation function is _Znwm, i.e. ::operator new(unsigned long)). The prior-art claim does hold for C, however.

I would love to see these functions made available, but I definitely don't think we should export the mangled names. Perhaps one day we might add native support for the mangling convention to C and C++ compilers, but until then, I think we should give these symbols clearer names.

We also need to define, stably, the precise semantics and non-semantics of these functions. For instance, some C code might expect the realloc function to act like C realloc, where passing NULL as the previous pointer makes it act like malloc, and passing 0 as the new size makes it act like free. (Those semantics make C realloc alone capable of providing the full allocation API with a single function pointer.) If we don't want to provide those semantics, we should clearly document that, because that may surprise C developers. We'll already need to provide wrappers to export; we should consider providing those semantics for the realloc wrapper.
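
As a rough illustration of what such a realloc wrapper with the traditional C semantics could look like, here is a sketch. The name `demo_c_style_realloc` and the four-argument signature (old size, alignment, new size) are assumptions made for this example, as is returning null when the size/align combination is invalid.

```rust
use std::alloc::{alloc, dealloc, realloc, Layout};
use std::ptr;

/// Hypothetical wrapper giving the Rust allocator C-like `realloc` semantics:
/// a null `ptr` allocates, a `new_size` of 0 frees.
///
/// # Safety
/// If `ptr` is non-null it must have been allocated by the Rust global
/// allocator with exactly `old_size` and `align`.
pub unsafe extern "C" fn demo_c_style_realloc(
    ptr: *mut u8,
    old_size: usize,
    align: usize,
    new_size: usize,
) -> *mut u8 {
    let new_layout = match Layout::from_size_align(new_size, align) {
        Ok(layout) => layout,
        // Like a failed C `realloc`, leave the old block untouched.
        Err(_) => return ptr::null_mut(),
    };
    if ptr.is_null() {
        // C: realloc(NULL, n) behaves like malloc(n).
        if new_size == 0 {
            return ptr::null_mut();
        }
        // SAFETY: `new_layout` has a non-zero size.
        return unsafe { alloc(new_layout) };
    }
    let old_layout = match Layout::from_size_align(old_size, align) {
        Ok(layout) => layout,
        Err(_) => return ptr::null_mut(),
    };
    if new_size == 0 {
        // Traditional C: realloc(p, 0) behaves like free(p).
        // SAFETY: the caller guarantees `ptr` was allocated with `old_layout`.
        unsafe { dealloc(ptr, old_layout) };
        return ptr::null_mut();
    }
    // SAFETY: `ptr` is live with `old_layout`, and `new_size` is non-zero and
    // was validated together with `align` above.
    unsafe { realloc(ptr, old_layout, new_size) }
}
```

The two extra branches are exactly the cases that `GlobalAlloc::realloc` declares undefined, so without a wrapper like this they would need to be called out in the documentation.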

This was the idea of @eddyb. The reasoning behind it is that it gives nice backtraces and that it is distinctively a Rust symbol and not a C or mangled C++ symbol AFAIK.

I am fine either way though. What do you think would be a good prefix? __rust_ is currently used for unstable internal interfaces, so I think either these allocator functions should use a different prefix or the internal interfaces should use a different prefix to prevent confusion about what is an implementation detail and what is stable.

The semantics are identical to the corresponding methods of the GlobalAlloc trait. These functions directly call the methods on #[global_allocator] with only an extra Layout::from_size_align call to assemble a Layout from the size and align arguments as Layout is not FFI-safe.

From the safety section of GlobalAlloc::realloc:

Safety

This function is unsafe because undefined behavior can result if the caller does not ensure all of the following:

  • ptr must be currently allocated via this allocator,

This means that realloc can't be used in the place of malloc.

  • layout must be the same layout that was used to allocate that block of memory,
  • new_size must be greater than zero.

This means that realloc can't be used in the place of free.

  • new_size, when rounded up to the nearest multiple of layout.align(), must not overflow (i.e., the rounded value must be less than usize::MAX).

(Extension subtraits might provide more specific bounds on behavior, e.g., guarantee a sentinel address or a null pointer in response to a zero-size allocation request.)

I think either rust_alloc or rustalloc_alloc would be fine.

Regarding semantics, I'm not suggesting we need to change the semantics of these functions. But if we're going to export C symbols whose suffixes are the same as well-known C functions (e.g. realloc), we should explicitly document cases in which they differ from the semantics of those C functions. For instance, something like this:

Note that unlike the C realloc function:

  • rust_realloc only works on an existing memory allocation. Passing NULL to create a fresh allocation is not valid; call rust_alloc instead.
  • rust_realloc does not accept a new size of 0. To free memory, call rust_dealloc instead.

There are also significant signature differences, in that each function takes an alignment, so those would have to be documented.

Speaking of signature differences, the signature in C/C++ should use uintptr_t in place of size_t. They are not always the same, and usize is going to be equivalent to uintptr_t if they are different.

I don't understand what this is referring to. v0 is the first version of the specified Rust mangling scheme, and while some progress was stalled by having to upstream it into tools like debuggers, the plan was all along to make it the default. We might be able to do that soon, once it fully supports unstable const generics.

Anyway, alternative implementations of Rust don't need to produce the same symbols, and something like _RNvNtC4rust5alloc5alloc is not obtainable through mangling; it just happens to fit the syntax and be supported by rustc-demangle and (in recent/future versions) debuggers.

It would always be used through #[export_name = "..."]/#[link_name = "..."] and specified as an opaque string as far as an RFC like this would be concerned.

Why would you remember it? It's an opaque identifier that is meant to be copy-pasted from the RFC (or Rust docs, if it ends up somewhere in e.g. alloc docs).

Why would Rust mangling support ever exist in a non-Rust compiler? Why would it make sense? It's not like you can have mangled symbols you can name externally, for most things - that ability would kneecap the sound isolation of symbols from different crates that may share e.g. a name (or even a version), but are considered separate by Cargo.

A name like rust_alloc could easily be defined in C code without arousing any kind of suspicion, and even completely accidentally. We can add more qualifiers in there, like rust_internals_global_allocator_alloc, to make it more visually distinct, but at the end of the day, a symbol that matches the Rust (v0) mangling scheme and does not fit any C or C++ naming conventions is IMO superior for distinguishing it.


My personal opinion wrt this RFC, ignoring the choice of names, is that we shouldn't expose the Rust global allocator like this, even if there is a stable API on the Rust side.

There may be unknown unknowns, and while on the Rust side we can account for some of them (e.g. needing more of these internal entry-points into the global allocator, or needing to pass more information through one of the entry-points) by reshuffling the internal codegen, we more or less lose that ability once there's any exposure to C.

Alternatively, if people really want some stable C API for the Rust global allocator, and if we don't want to let C have the ability to override it at all, we could instead "just" add a few #[no_mangle]/#[export_name = "..."] functions to the alloc crate, that do what everyone can already do in their own FFI crate.

That I could get behind, and I probably wouldn't care as much about the choice of symbol names.

C can't override it anyway as rust will refuse to compile a dylib, cdylib, staticlib or bin unless a global allocator is defined on the rust side. While symbol interposition would be possible at runtime, this can be prevented by using the "protected" visibility for the symbols: LLVM Language Reference Manual — LLVM 13 documentation

If we want to reshuffle the internal codegen, these functions could still be emitted as wrappers, just not used by alloc::alloc::Global.

Whenever I write code in anything, I typically memorize useful things. Having to copy-paste the name every time I want to use it would cost more time than just remembering it, but it's easier for an unmangled name to be remembered.

Isn't it, though? I don't know the rust-v0 mangling as well as Itanium, but that looks like the mangling of rust::alloc::alloc.

If you only have one copy of a crate in a dependency graph, they could be reduced to their own names, and of course, special crates like std/alloc/core can get priority over their exact name.

Aren't the C exports just thin wrappers around the counterparts in alloc::alloc? If so, I highly doubt the signatures of those need to change, given that it's effectively calling a function with a stable definition.

_RNvNtC4rust5alloc5alloc demangles to rust[0]::alloc::alloc. It has a 0 as crate disambiguator, which is extremely unlikely to happen (1 in 2^64 times). Rustc would produce something like _RNvNtCsiHrIwmIS61w_4rust5alloc5alloc which demangles to rust[d9d1be2d9a8a11c8]::alloc::alloc depending on the -Cmetadata arguments given (and possibly in the future rustc version).

You can verify this using rustc_demangle: Rust Playground
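
If you'd rather not pull in rustc_demangle, the structure of these two symbols can be read with a few lines of hand-written code. This is a sketch only: `demangle_sketch` and its helpers are made-up names, and it handles just the subset of v0 mangling used here (crate roots, nested paths, optional `s<base-62>_` disambiguators on identifiers), with no generics, punycode or back-references.

```rust
/// Parses an optional `s<base-62>_` disambiguator followed by a
/// `<decimal length><bytes>` identifier.
/// Returns (disambiguator, name, remaining input).
fn parse_ident(s: &str) -> Option<(u128, &str, &str)> {
    let mut rest = s;
    let mut dis: u128 = 0; // an absent disambiguator means 0
    if let Some(after_s) = rest.strip_prefix('s') {
        let end = after_s.find('_')?;
        let mut value: u128 = 0;
        for c in after_s[..end].chars() {
            let digit = match c {
                '0'..='9' => c as u32 - '0' as u32,
                'a'..='z' => c as u32 - 'a' as u32 + 10,
                'A'..='Z' => c as u32 - 'A' as u32 + 36,
                _ => return None,
            };
            value = value.checked_mul(62)?.checked_add(digit as u128)?;
        }
        // `s_` encodes 1, `s0_` encodes 2, and so on.
        dis = if end == 0 { 1 } else { value + 2 };
        rest = &after_s[end + 1..];
    }
    let len_end = rest.find(|c: char| !c.is_ascii_digit()).unwrap_or(rest.len());
    let len: usize = rest[..len_end].parse().ok()?;
    let rest = &rest[len_end..];
    // A `_` separates the length from names that start with a digit or `_`.
    let rest = rest.strip_prefix('_').unwrap_or(rest);
    let name = rest.get(..len)?;
    Some((dis, name, &rest[len..]))
}

/// Parses a crate root (`C<ident>`) or a nested path (`N<ns><path><ident>`).
fn parse_path(s: &str) -> Option<(String, &str)> {
    if let Some(rest) = s.strip_prefix('C') {
        let (dis, name, rest) = parse_ident(rest)?;
        Some((format!("{name}[{dis:x}]"), rest))
    } else if let Some(rest) = s.strip_prefix('N') {
        // Skip the one-character namespace tag (`v` for values, `t` for types).
        let mut chars = rest.chars();
        chars.next()?;
        let (prefix, rest) = parse_path(chars.as_str())?;
        let (_dis, name, rest) = parse_ident(rest)?;
        Some((format!("{prefix}::{name}"), rest))
    } else {
        None
    }
}

/// Demangles the restricted subset described above, or returns `None`.
fn demangle_sketch(symbol: &str) -> Option<String> {
    let (path, rest) = parse_path(symbol.strip_prefix("_R")?)?;
    rest.is_empty().then_some(path)
}
```

With this, `demangle_sketch("_RNvNtC4rust5alloc5alloc")` gives `rust[0]::alloc::alloc`, because nothing sits between the `C` and the `4rust`, whereas the rustc-produced symbol has an `s..._` disambiguator in that position.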

It is, but there's no disambiguator set for the rust "crate", which, as @bjorn3 also mentioned, is not something that normally happens (if we do use this kind of "artificial mangling", we should make sure that a disambiguator of 0 cannot actually exist in other situations, by making rustc error if it happens to arise from hashing, regardless how extremely unlikely that may be).

Perhaps, but without whole-program compilation and/or Cargo informing rustc that any one crate in the graph is somehow "uniquely named", the only sound thing to do is to always include the crate disambiguators (and ensure that they're unique across the dependency graph).

Right now, the functions @bjorn3 was talking about are the exact mechanism through which #[global_allocator] (or the default one in std) is injected into any crate using alloc - i.e. alloc imports and calls undefined symbols, that get defined in a downstream crate (by the compiler).

So by exposing them directly, there's a risk of them being (accidentally) (ab)used to hijack the Rust global allocator (but see the "alias" suggestion below, that could solve this neatly).

Ah that'd be handy, can you open an issue about using it for weak lang items and whatnot? Though for anything we use real mangling for, the crate disambiguators are going to make (accidental) external interference unlikely/impractical.

Fair enough, that's a good point, and combined with the difficulty of overriding the definitions from C (assuming the link order prevents it today), I'm less concerned.


Perhaps we could have an "alias symbol" feature, say, an attribute like #[export_name = "..."], but which instead of setting the symbol name of the function, produces a LLVM alias (similar to ones LLVM generates from merge-functions), which should be as efficient to call as the original function, but is separate for the purposes of late-binding/overriding.

Even if we don't expose it to the language, the alloc entry-points are special enough that we could at least hardcode this behavior for them in particular.

I have updated the RFC to clarify that this is not allowed. I agree that it is best to prevent this completely though.

Side-note: realloc() for size == 0 is best avoided due to differing behaviors. It is obsolescent and will be UB soon.