Manually freeing the pointer from ::into_raw?

Various Rust types, including Vec, CString, and more, have into_raw or similar functions that turn the object into a raw pointer. There is an important rule that such an object can only be freed by turning the pointer back into the original Rust object with the matching from_raw function. This is reasonable because Rust and an external language (like C) may use different allocators and have different requirements on alignment, etc.

Yet this makes it impossible, or at least difficult without an extra copy, to create Rust-native types and have them freed in C. Even if I override global_allocator to use only the libc malloc and free functions, the rule still applies. Could there be a guarantee that these pointers can be freed by the Allocator::deallocate function of the allocator that allocated the object? In other words, make code like

let (ptr, len, cap) = vec.into_raw_parts();
MyGlobalAllocator.deallocate(ptr, Layout::array::<VecType>(cap));

valid? Then, if I'm sure that my allocate and deallocate implementations match another runtime, I can safely pass ownership of the vec to that runtime and have it deallocate the vec.

What do you mean by “an extra copy”? What is the approach that includes an additional copy (of the data, I presume?) that you had in mind?

What’s the advantage of calling MyGlobalAllocator.deallocate(ptr, Layout::array::<VecType>(cap)); instead of drop(Vec::from_raw_parts(ptr, len, cap));?

I'm not suggesting using MyGlobalAllocator.deallocate instead of Vec::from_raw_parts. What I meant is that if there is a guarantee that using MyGlobalAllocator::deallocate also works, and if there is a piece of C code functionally equivalent to MyGlobalAllocator::deallocate, then freeing a Rust Vec from the C side becomes possible.

I'm extending Neovim's codebase (which is mostly C) with Rust. It has an internal vec-like structure and a string type that work much the same as Rust's Vec and CString. I'm just wondering whether I can avoid the hassle of exposing and rewriting the same logic.

If something is allocated using the global allocator in rust you have to free it again using the same global allocator in rust. The global allocator may not match the system allocator used by C. Rust allows you to override the global allocator used by rust code using #[global_allocator], but this doesn't affect C code.


You can simply implement a Rust function:

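#[no_mangle]
pub unsafe extern "C" fn drop_int_array(ptr: *mut u8, len: usize, cap: usize) {
    // Rebuild the Vec from its raw parts so that Rust's own allocator frees it.
    // ptr, len and cap must be exactly the values the original Vec<i32> had.
    drop(unsafe { Vec::from_raw_parts(ptr as *mut i32, len, cap) });
}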

and have that function called from C whenever needed.
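To make the round trip concrete, here is a minimal sketch of both directions on the Rust side. The names IntArray and make_int_array are hypothetical and only for illustration; the Vec is taken apart with ManuallyDrop rather than into_raw_parts so that the sketch does not depend on that method being available. A real export would also need #[no_mangle] on both functions.

```rust
use std::mem::ManuallyDrop;

/// Hypothetical C-compatible view of a Vec<i32>. C may read `len` elements
/// through `ptr` but must otherwise treat the struct as opaque.
#[repr(C)]
pub struct IntArray {
    pub ptr: *mut i32,
    pub len: usize,
    pub cap: usize,
}

/// Hands ownership of a freshly built Vec to the caller.
pub extern "C" fn make_int_array(n: usize) -> IntArray {
    // ManuallyDrop stops Rust from freeing the buffer when `v` goes out of scope.
    let mut v = ManuallyDrop::new((0..n as i32).collect::<Vec<i32>>());
    IntArray { ptr: v.as_mut_ptr(), len: v.len(), cap: v.capacity() }
}

/// Takes ownership back and frees the buffer with Rust's allocator.
///
/// # Safety
/// `arr` must come from `make_int_array` and must not be used afterwards.
pub unsafe extern "C" fn drop_int_array(arr: IntArray) {
    drop(unsafe { Vec::from_raw_parts(arr.ptr, arr.len, arr.cap) });
}

fn main() {
    let arr = make_int_array(4);
    // C code would read the elements through the raw pointer like this.
    let view = unsafe { std::slice::from_raw_parts(arr.ptr, arr.len) };
    assert_eq!(view, &[0, 1, 2, 3]);
    unsafe { drop_int_array(arr) };
}
```

The C side never calls free here; it only passes the three values back unchanged.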

Just overriding the global_allocator to use malloc and free does not exactly work, as Rust imposes different alignment constraints on types. For CString, if you really need to use malloc, consider using a different specialized type anyway.


This wouldn't work for containers that have some header sitting before their payload. E.g. Rc::into_raw returns a pointer to its payload which is not the beginning of its allocation.
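A small sketch of what that means in practice, using only stable std APIs. The function name round_trip is made up for this example. The pointer from Rc::into_raw addresses the payload, so the only supported way to release it is to rebuild the Rc:

```rust
use std::rc::Rc;

/// Sends an Rc through a raw pointer and back, returning the value read
/// through the raw pointer and the strong count seen afterwards.
fn round_trip(value: i32) -> (i32, usize) {
    let rc = Rc::new(value);
    let extra = Rc::clone(&rc);

    // `raw` points at the i32 payload. The reference counts live elsewhere in
    // the same allocation, so this is not an address to hand to a deallocator.
    let raw: *const i32 = Rc::into_raw(rc);
    let seen = unsafe { *raw };

    // Rebuilding the Rc restores normal ownership; dropping it then
    // decrements the count and frees the allocation when it reaches zero.
    let rc = unsafe { Rc::from_raw(raw) };
    let count = Rc::strong_count(&rc);
    drop(extra);
    (seen, count)
}

fn main() {
    assert_eq!(round_trip(7), (7, 2));
}
```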


I don't think the problem is alignment. I can also use libc::aligned_malloc to allocate with the correct alignment when implementing the global allocator trait, and make sure the C side does exactly the same.

I believe it is natural to assume that the pointer returned by Vec::into_raw_parts is the only resource the vec is holding (when the inner type is a primitive type) and that the allocation is exactly ptr to ptr + cap. Rust could have made people's lives miserable by implementing Vec like:

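impl<T> Vec<T> {
    pub fn with_capacity(capacity: usize) -> Self {
        // Hypothetical: over-allocate by 42 elements and hide them in front of the payload.
        let layout = std::alloc::Layout::array::<T>(capacity + 42).unwrap();
        let ptr = unsafe { std::alloc::alloc(layout) } as *mut T;
        Self {
             len: 0,
             cap: capacity,
             ptr,
        }
    }

    pub fn into_raw_parts(self) -> (*mut T, usize, usize) {
        // The returned pointer is offset from the start of the allocation.
        (unsafe { self.ptr.add(42) }, self.len, self.cap)
    }
}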

Then the pointer from into_raw_parts would not be the one that was returned from the allocator. But there is no reason to expect that Vec and CString are implemented like that (besides what @the8472 said, that some data structures have an additional header block).

Box does specify that its raw pointer can be manipulated directly with the raw allocation functions. We additionally document that converting between Box<[T]> and Vec<T> does no allocation if len == cap, so it's in theory "just" a matter of documenting the allowance.
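As an illustration of the Box allowance only, here is a minimal sketch that frees a Box by hand with std::alloc::dealloc and Layout::for_value. The helper name free_box_manually is hypothetical. It says nothing about Vec or CString, for which no such allowance is documented.

```rust
use std::alloc::{dealloc, Layout};

/// Frees a Box by hand and returns the layout that was released.
fn free_box_manually<T: ?Sized>(b: Box<T>) -> Layout {
    let raw: *mut T = Box::into_raw(b);
    unsafe {
        // The layout must be the one the value was allocated with.
        let layout = Layout::for_value(&*raw);
        // Run the destructor first; dealloc only releases the memory.
        std::ptr::drop_in_place(raw);
        // Zero-sized values never allocated, so there is nothing to release.
        if layout.size() != 0 {
            dealloc(raw.cast::<u8>(), layout);
        }
        layout
    }
}

fn main() {
    let layout = free_box_manually(Box::new(42_u64));
    assert_eq!(layout.size(), 8);

    // Vec -> Box<[T]> first, then the Box rule applies to the slice.
    let boxed: Box<[u32]> = vec![1, 2, 3, 4].into_boxed_slice();
    let layout = free_box_manually(boxed);
    assert_eq!((layout.size(), layout.align()), (16, 4));
}
```

Note that dealloc here is still the Rust allocation function, not the C free.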

HOWEVER, you still must go through the Rust allocation functions, and must not mix Rust allocation with calls to the underlying allocator. Quoting from std::alloc::System (emphasis mine):

The default memory allocator provided by the operating system.

This is based on malloc on Unix platforms and HeapAlloc on Windows, plus related functions. However, it is not valid to mix use of the backing system allocator with System, as this implementation may include extra work, such as to serve alignment requests greater than the alignment provided directly by the backing system allocator.

Given that you'll need to create and use a Rust shim anyway, there doesn't seem to be much benefit to making the shim slightly shorter.


Technically that's only true for System. You could copy the implementation of System locally and verify that it's valid to mix it with the C stdlib functions (true for the current unix implementation afaik), then override #[global_allocator] to use your verified implementation.
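A rough sketch of what such a locally owned allocator could look like, assuming a POSIX libc is linked. MallocCompat is a made-up name, and this is not a copy of the std implementation: it always goes through posix_memalign so that every block can be released with plain free. It only shows the allocator side and does not settle whether buffers owned by Vec or CString may be freed this way.

```rust
use std::alloc::{GlobalAlloc, Layout};
use std::os::raw::{c_int, c_void};

// Declarations of the C functions, assuming a POSIX libc
// (on edition 2024 this block must be written `unsafe extern "C"`).
extern "C" {
    fn posix_memalign(out: *mut *mut c_void, align: usize, size: usize) -> c_int;
    fn free(ptr: *mut c_void);
}

/// Hypothetical allocator whose blocks are all released with C `free`.
pub struct MallocCompat;

unsafe impl GlobalAlloc for MallocCompat {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        // posix_memalign needs a power-of-two alignment that is a multiple of
        // the pointer size; Layout alignments are already powers of two.
        let align = layout.align().max(std::mem::size_of::<*mut c_void>());
        let mut out: *mut c_void = std::ptr::null_mut();
        if unsafe { posix_memalign(&mut out, align, layout.size()) } != 0 {
            return std::ptr::null_mut();
        }
        out.cast()
    }

    unsafe fn dealloc(&self, ptr: *mut u8, _layout: Layout) {
        // No bookkeeping of our own, so this is exactly what C would call.
        unsafe { free(ptr.cast()) }
    }
}

fn main() {
    let layout = Layout::from_size_align(64, 32).unwrap();
    unsafe {
        let p = MallocCompat.alloc(layout);
        assert!(!p.is_null());
        assert_eq!(p as usize % 32, 0);
        p.write_bytes(0xAB, layout.size());
        // Released by the C function directly rather than by `dealloc`.
        free(p.cast());
    }
}
```

Installing it would be a matter of declaring a static of this type with #[global_allocator].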

(I'm a member of T-opsem but speaking for myself here, not the team.)

I'd still caution against mixing with Global, since the global allocator gets extra magic to justify allocation not being considered observable and thus valid to replace/elide; I'm not aware of a way of formally defining it such that standard alloc removal optimizations are justified (this is typically done by allowing serving allocations with some compiler magic source of memory, then eliminating blocks as free of observable effects) which necessarily maintains that allocations which "escape" are actually served by the concrete allocator (and not by the magic one).

In practice it's not a problem (since the magic allocator doesn't actually exist at runtime... mostly; I've seen heap allocation actually observably on the stack at least once), but it's not a guarantee that's sound to rely on[1], as far as I'm aware.


  1. In fact, T-opsem is leaning towards not trying to provide any concrete guarantee that allocation which happens "on the stack" is in any way "on the stack" w.r.t. what addresses allocations get. It's perfectly allowable to see addresses of stack-allocated locals which don't follow any sort of consistent stack regime, including for locals to have addresses outside the stack region (e.g. in the heap region or in the constant memory region). ↩︎


This bugs me to no end. Why is the global allocator treated like magic, whereas local ones are completely ignored? Plenty of applications (embedded, real-time, AAA games, HFT etc) heavily avoid or don't use global allocators entirely, and instead rely on local or static arenas.

If the magic of global allocator is really justified by optimizations, then those local allocators should also be able to utilize it. If it's not, then it shouldn't be in the language. One less magic thing to worry about.

What if I swap a global allocator with a custom one? It could be a Rust allocator. It could be an opaque binary library. It can have whatever extra side effects for allocation. Maybe it sends alerts when memory threshold is exceeded, or applies extra security measures to allocations. How can the compiler assume that it knows the intended behaviour of allocation calls, beyond "this gives fresh memory"?

----------------------

Imho that's a continuation of the very problematic C/C++ approach to optimization, where the compiler authors believe that they control the world and know better than the programmer what they want. This stuff leads to replacing calls to external libraries (most notably libc) with different ones, which leads to issues down the line (e.g. if there is a difference between glibc and musl, or if the library is swapped via LD_PRELOAD at runtime).

I get it, people want free performance. But the ball-of-mud optimization model where anything can be swapped behind your back is deeply problematic. There should be a clean separation between core language semantics, minimal reasonable definition of UB, and properly designed libraries. Even if there is a benefit to swapping out library calls, pulling it into a core spec is a bad idea. It should be kept as a separate compilation option at best, with explicit opt-in, so that the end user knows what they're signing up for.


We already use -fno-semantic-interposition. Because we don't use protected visibility right now, this is required to be able to optimize any exported function. Without it there would be no inlining across crates, or even inlining of exported functions within a crate. And with protected visibility you wouldn't be able to use LD_PRELOAD to swap it out in the first place. I tried changing all non-#[no_mangle] functions to protected visibility a couple of days ago (as those have an unpredictable symbol name and thus can't be interposed without depending on rustc implementation details anyway), but hit an ld.bfd linker issue I didn't know how to solve. (ld.gold and ld.lld worked fine.)

This also bugs me. The current idea AIUI is that we'll make the behavior opt in for local allocators as well, e.g. with some ReplacableAlloc<A> wrapper that nondeterministically services allocations with the wrapped allocator or compiler magic. Global is then defined as using that wrapper.

Because using the global allocator is unsafe, and it's specifically documented that you may not rely on allocations in the source being served by the global allocator: GlobalAlloc

You must not rely on allocations actually happening, even if there are explicit heap allocations in the source. The optimizer may detect unused allocations that it can either eliminate entirely or move to the stack and thus never invoke the allocator. The optimizer may further assume that allocation is infallible, so code that used to fail due to allocator failures may now suddenly work because the optimizer worked around the need for an allocation. More concretely, the following code example is unsound, irrespective of whether your custom allocator allows counting how many allocations have happened.

drop(Box::new(42));
let number_of_heap_allocs = /* call private allocator API */;
unsafe { std::intrinsics::assume(number_of_heap_allocs > 0); }

Note that the optimizations mentioned above are not the only optimization that can be applied. You may generally not rely on heap allocations happening if they can be removed without changing program behavior. Whether allocations happen or not is not part of the program behavior, even if it could be detected via an allocator that tracks allocations by printing or otherwise having side effects.
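To see what that documentation is talking about, here is a sketch of a counting wrapper around System installed as the global allocator. Counting, ALLOCS and allocs_during are names invented for this example. The count it reports is informational only: depending on optimization level, the Box below may or may not reach the allocator, and both outcomes are allowed.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical wrapper that counts calls before forwarding to System.
struct Counting;

static ALLOCS: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for Counting {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOCS.fetch_add(1, Ordering::Relaxed);
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
}

#[global_allocator]
static GLOBAL: Counting = Counting;

/// Number of allocator calls observed while running `f`.
fn allocs_during(f: impl FnOnce()) -> usize {
    let before = ALLOCS.load(Ordering::Relaxed);
    f();
    ALLOCS.load(Ordering::Relaxed) - before
}

fn main() {
    let n = allocs_during(|| drop(Box::new(42)));
    // No assertion on `n`: the optimizer may have removed the Box entirely.
    println!("allocator calls observed: {n}");
}
```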


You know that's a non-answer. We're talking about the way things should be, not the way they currently work, and strengthening the allocator guarantees is possible even for stable APIs, even if the documentation stays the same.

No, it's a direct answer to the question of "How can the compiler assume that it knows the intended behaviour." The compiler is allowed to make the assumption because you're using the interface which is explicitly for the exclusive purpose of "this gives fresh memory" without caring how the allocation is actually serviced and provides absolutely no further guarantees.

If you want to rely on further guarantees of your specific allocator, you (will eventually) have the option of using your specific allocator directly instead of the Global allocator. When you do that, we absolutely aren't going to play any tricks; what code you call is what code gets (as-if) run.

The only exception is if you explicitly opt into some sort of nondeterminism. The std::alloc::Global library item opts into that nondeterminism.

There doesn't need to be a way to use the #[global_allocator] resource without this nondeterminism, because it already does exist: use the resource without going through the #[global_allocator] machinery.
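For illustration, a minimal sketch of using a specific allocator directly on stable Rust, by calling the GlobalAlloc methods on std::alloc::System instead of going through std::alloc::alloc or Box. The function name roundtrip_through_system is hypothetical, and whether the optimizer treats these calls differently is the claim made in the post above, not something this code demonstrates.

```rust
use std::alloc::{handle_alloc_error, GlobalAlloc, Layout, System};

/// Allocates, uses and frees one u64 by calling `System` directly,
/// bypassing the `#[global_allocator]` hook.
fn roundtrip_through_system(value: u64) -> u64 {
    let layout = Layout::new::<u64>();
    unsafe {
        let p = System.alloc(layout).cast::<u64>();
        if p.is_null() {
            handle_alloc_error(layout);
        }
        p.write(value);
        let read_back = p.read();
        // Must be released through the same allocator with the same layout.
        System.dealloc(p.cast(), layout);
        read_back
    }
}

fn main() {
    assert_eq!(roundtrip_through_system(9), 9);
}
```

The clumsy part is that standard collections cannot be pointed at such an allocator on stable.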


Not using global allocator is extremely clumsy, as you're aware. The relevant traits aren't even stable. Given the pervasive use of global allocator in the ecosystem, I can't find that answer satisfactory.

Is there even a strong proof that eliding allocations is beneficial in Rust? It's not C++ where people dump everything on the heap just to reduce troubles with lifetimes and dangling pointers. Rust code already utilizes the stack heavily, so I would assume most allocations are there for a reason, particularly in the hot code where it matters.

I get it, the project defaults to C++/LLVM semantics for good reasons, but this particular case looks very suspect. Is there a difficulty with disabling allocation elision in LLVM?

Inlining any function that returns Box<T> is almost immediately an opportunity for stack promotion that cannot easily be expressed in the language. Yes, this optimization matters.


It's actually enabling allocation elision which takes extra effort; LLVM's default is of course to respect that functions do what they do. The global allocator symbols (__rust_alloc, __rust_dealloc, __rust_alloc_zeroed, and __rust_realloc) are specially handled to be annotated as an alloc family in LLVM.

The specific relevant LLVM Function Attributes are "alloc-family"="__rust_alloc" (identifies what set it's a part of), allockind("alloc,aligned,uninitialized") (identifies which function it is and initialization state) and allocsize(0) (sets a minimum number of allocated bytes returned when nonnull).

Source comments also say that the Rust fork of LLVM 14 and earlier are patched to recognize the symbols and optimize them like malloc etc., but I think later LLVM versions just use the function annotations now.

... actually from a quick test, it looks like we might actually not put those annotations on the symbols when defined locally via #[global_allocator]? I'm divining behavior by looking at post-optimization LLVM IR so it might be all sorts of misleading, though, and absolutely shouldn't be relied upon. They definitely get added when using an unknown and/or known-default #[global_allocator], though. You'd need to ask someone with T-compiler knowledge for better information here. I think this changed somewhat in 1.71 (notably, __rust_no_alloc_shim_is_unstable).

I've yet to see Rust/LLVM actually replace a Box allocation with a stack allocation. It's unfortunately quite a bit more difficult than you'd initially hope because of our panic/unwinding semantics (lack of a forward progress guarantee) and calling the dynamic handle_alloc_error handler. Stack space is somewhat limited, and LLVM often seems loath to relocate memory manipulated by pointer (likely due to C++ish address sensitivity concerns).

The theoretically simplest case[1] which doesn't even check allocation failure, just unsafely assumes it was successful, still performs an actual allocation. Using a safe box instead[2] optimizes almost identically, just with an extra check for null and conditional call to handle_alloc_error; no unwinding pads are needed, as this example is carefully controlled to be known not to unwind except by handle_alloc_error.

That being said, I should reiterate that Rust/LLVM absolutely can and does completely elide allocations. It'll even do it when leaking success-checked allocations of more than the entire address space. It's solely replacement that I haven't actually observed being done.

A middle ground semantic that might be more agreeable than "Global allocations are arbitrarily replaceable" might be "Global allocations are removable, but if observed, they were provided by actually calling the allocator." This keeps most (if not all) of the optimizations that we currently do in practice, but is significantly more difficult to (specify and) provide.

The huge one is actually eliding allocations when constant folding code. Any concrete example will look silly, since it could just not use a box, but the reason inlining is the optimization is that it cuts through abstractions to expose exactly that kind of silly no-op code for other peephole optimizations to clean up.

A canonical example would be something like { let a = Box::new(2); let b = Box::new(2); *a + *b }. If allocations are removable, this can be folded down to just a constant 4. If allocations aren't removable, the best you would be able to do becomes at a low level roughly

{
    let a = __rust_alloc(4, 4);
    if a.is_null() { handle_alloc_error(4, 4); }
    let b = __rust_alloc(4, 4);
    if b.is_null() { handle_alloc_error(4, 4); }
    __rust_dealloc(b, 4, 4);
    __rust_dealloc(a, 4, 4);
    4
}

and this is also assuming that handle_alloc_error doesn't ever unwind (perhaps a reasonable restriction, but not one which the compiler currently exploits).
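For anyone who wants to compare the two forms in a compiler explorer, here is a runnable approximation using the public std::alloc functions in place of the internal __rust_alloc symbols. The function names are made up for this sketch.

```rust
use std::alloc::{alloc, dealloc, handle_alloc_error, Layout};

/// The source-level form, which the optimizer is allowed to fold to 4.
fn sum_boxed() -> i32 {
    let a = Box::new(2);
    let b = Box::new(2);
    *a + *b
}

/// Hand-written approximation of the same function if every allocation
/// had to be preserved.
fn sum_lowered() -> i32 {
    let layout = Layout::new::<i32>();
    unsafe {
        let a = alloc(layout).cast::<i32>();
        if a.is_null() {
            handle_alloc_error(layout);
        }
        a.write(2);
        let b = alloc(layout).cast::<i32>();
        if b.is_null() {
            handle_alloc_error(layout);
        }
        b.write(2);
        let sum = a.read() + b.read();
        // Free in reverse order, mirroring drop order of the two boxes.
        dealloc(b.cast(), layout);
        dealloc(a.cast(), layout);
        sum
    }
}

fn main() {
    assert_eq!(sum_boxed(), 4);
    assert_eq!(sum_lowered(), 4);
}
```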

"Zero overhead abstraction" exists because the optimizer is able to strip out all of the extra unnecessary ceremony introduced. Opaque arbitrary external functions (e.g. allocation and handle_alloc_error) are the perfect barrier to code stripping/folding optimizations.

One optimization which we don't do enough of yet, but is enabled by replaceable allocation, is in place initialization. Allowing Box::new(T::new()) to construct T in place requires turning

create T on stack (maybe unwind)
create heap allocation
maybe handle alloc error
move T from stack to heap

into

create heap allocation
if (alloc error) {
  create T on stack (maybe unwind)
  handle alloc error
} else {
  (on unwind: delete heap allocation)
  create T directly to heap (maybe unwind)
}

This requires inserting an allocation that might not otherwise exist. LLVM doesn't like doing this for obvious reasons, but the current draft opsem (allocation is not an observable event) does permit this transformation. It's certainly not perfect (stack space still needs to be available for if the allocation fails), but it is a known desirable.
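The reordered shape can be written by hand today, which may help make the steps concrete. This is only a sketch: box_with and FreeOnUnwind are hypothetical names, and unlike the compiler transformation described above it does not run the constructor when allocation fails. Whether the value is really built straight into the heap slot is still up to the optimizer.

```rust
use std::alloc::{alloc, dealloc, handle_alloc_error, Layout};

/// Frees the raw allocation if constructing the value unwinds.
struct FreeOnUnwind {
    ptr: *mut u8,
    layout: Layout,
}

impl Drop for FreeOnUnwind {
    fn drop(&mut self) {
        unsafe { dealloc(self.ptr, self.layout) }
    }
}

/// Allocates first, then builds the value into the heap slot.
fn box_with<T>(make: impl FnOnce() -> T) -> Box<T> {
    let layout = Layout::new::<T>();
    if layout.size() == 0 {
        // Zero-sized types never allocate; fall back to the ordinary path.
        return Box::new(make());
    }
    unsafe {
        let ptr = alloc(layout);
        if ptr.is_null() {
            handle_alloc_error(layout);
        }
        // On unwind out of `make`, the guard releases the allocation.
        let guard = FreeOnUnwind { ptr, layout };
        ptr.cast::<T>().write(make());
        // Construction succeeded, so ownership moves to the Box instead.
        std::mem::forget(guard);
        Box::from_raw(ptr.cast::<T>())
    }
}

fn main() {
    let big = box_with(|| [7_u8; 4096]);
    assert_eq!(big[4095], 7);
}
```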


  1. #![feature(rustc_attrs)]
    use std::alloc::*;
    
    fn main() { unsafe {
        let layout = Layout::new::<i32>();
        let p = alloc(layout).cast();
        escape(&*p);
        dealloc(p.cast(), layout);
    }}
    
    extern "C" {
        #[rustc_nounwind]
        fn escape(_: &i32);
    }
    
    ↩︎
  2. #![feature(rustc_attrs)]
    
    fn main() { unsafe {
        let p = Box::new(0);
        escape(&*p);
    }}
    
    extern "C" {
        #[rustc_nounwind]
        fn escape(_: &i32);
    }
    
    ↩︎

We have codegen tests that specifically check that heap allocations get elided.


My gut feeling is that if you do that, then just replacing the allocator with some custom thing with the same API might not be enough. You probably also want to use specialized data structures that know the particularities of the underlying memory acquisition scheme. And once you are at that point, you don't need a standardized alloc trait or anything like it, because you are brewing your own soup anyway.