Add support for the wrapping of heap allocation and deallocation functions


#1

It can be very useful for user programs to be able to take arbitrary actions when heap allocation and deallocation occur.

My motivation is that I want to build a heap profiler for Servo, one that’s similar to Firefox’s DMD. This profiler would record a stack trace on each allocation, and use that to emit data about which parts of the code are performing many allocations (contributing to heap churn) and which ones are responsible for allocating live blocks (contributing to peak memory usage). This may sound like a generic tool that could be built into Rust itself, but it’s likely to end up with Servo-specific parts, so flexible building blocks would be better than a canned solution.

There are lots of other potential uses for this, and this kind of facility is common in other systems. E.g. glibc provides one for malloc/realloc/free, and there’s also the more general LD_PRELOAD (on Linux) and DYLD_INSERT_LIBRARIES (on Mac) which allow you to hook any library functions.

What this needs.

  • A program needs to be able to specify that it wants to opt in to this feature, and a way to specify the wrapping functions (one each for std::rt::heap::allocate, reallocate, reallocate_inplace, deallocate, and possibly usable_size and stats_print). The opting-in could be at compile-time or runtime; the latter is probably preferable because it’s more flexible, but this is not a strong preference.

  • The allocation/deallocation functions need to call any provided wrappers.

  • A way for wrappers to temporarily disable wrapping while they are running, so that we don’t get infinite recursion if the wrapper itself triggers a call to the function that it wraps.

I have a basic, gross, proof-of-concept implementation. (I’ll show just the part relating to the wrapping of allocate; the other functions are very similar.) It adds the following code to src/liballoc/heap.rs, which defines a struct for holding the wrapper function and a function for setting the wrapper.

pub type AllocateFn = unsafe fn(usize, usize) -> *mut u8;
        
pub type AllocateFnWrapper = fn(AllocateFn, usize, usize) -> *mut u8;

struct AllocatorWrappers {
    allocate: AllocateFnWrapper,
}

static mut wrappers: Option<AllocatorWrappers> = Option::None;

pub fn set_allocator_wrappers(allocate: AllocateFnWrapper) {
    let new_wrappers = AllocatorWrappers {
        allocate: allocate,
    };
    unsafe { wrappers = Option::Some(new_wrappers) };
}

It also modifies allocate like so:

 #[inline]
 pub unsafe fn allocate(size: usize, align: usize) -> *mut u8 {
-    imp::allocate(size, align)
+    match wrappers {
+        Option::None        => imp::allocate(size, align),
+        Option::Some(ref h) => (h.allocate)(imp::allocate, size, align),
+    }
 }

In the normal case this adds a single, perfectly-predictable branch to the alloc/dealloc path, which is hopefully small in relation to the cost of an alloc/dealloc.

And here is part of a sample program that uses it.

// If a wrapper contains code that itself calls the function being wrapped,
// we'll hit infinite recursion. Therefore, each wrapper needs to be able to
// temporarily disable wrapping. This is achieved via a thread-local flag. 
thread_local!(static WRAPPERS_ENABLED: Cell<bool> = Cell::new(true));
    
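fn my_allocate(real_allocate: AllocateFn, size: usize, align: usize) -> *mut u8 {
    WRAPPERS_ENABLED.with(|wrappers_enabled| {
        // Only instrument if we are not already inside a wrapper. A `return`
        // here would only leave the closure, so the re-entrant case simply
        // skips the instrumentation instead of allocating a second time.
        if wrappers_enabled.get() {
            wrappers_enabled.set(false);
            println!("my_allocate({}, {})", size, align);
            wrappers_enabled.set(true);
        }
    });
    // Exactly one real allocation per call, instrumented or not.
    unsafe { real_allocate(size, align) }
}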

fn main() {
    // Want to do this as early as possible.
    set_allocator_wrappers(my_allocate, my_reallocate, my_reallocate_inplace, my_deallocate);

    // ... do stuff ...

    // Without this, I get "thread '<main>' panicked at 'cannot access a TLS
    // value during or after it is destroyed'". Presumably the problem is that
    // the destructor for WRAPPERS_ENABLED gets run, so we can't access the TLS
    // any more, and then deallocate() is called for something else. Urgh.
    WRAPPERS_ENABLED.with(|wrappers_enabled| {
        wrappers_enabled.set(false);
    });
}

As I said, it’s totally gross.

  • set_allocator_wrappers() should be called as early as possible, so that it doesn’t miss any allocations. It’s also entirely non-thread-safe, which is probably ok if you do call it right at the start of main(), but is still nasty. Putting the wrappers table inside an RwLock or something might be possible but it would be a shame, performance-wise, to do that for a data structure that’s written once and then read zillions of times. I figure there must be a more Rust-y way of doing this. Could a #[feature("...")] annotation work? We really want these wrappers to be enabled by the language runtime at start-up, rather than the program itself having to do it.

  • The thread-local storage to avoid infinite recursion isn’t so bad, though it would be nice if this could be handled within the Rust implementation somehow, so that each individual program doesn’t have to reimplement it. The hack at the end of main (to deal with deallocate calls once the TLS is destroyed) is gross, too.

It was really just a prototype to see if I could get something working. And it does.

So… I’m wondering if (a) a feature like this seems like something that might be accepted into Rust, and (b) if there are better ways of implementing it. I hope so, because my ideas for measuring Servo’s memory usage are dead in the water without this.

This idea has a small amount of overlap with this old RFC – I want a custom allocator for the entire program, basically – but is much smaller.

Thank you for reading.


#2

@pnkfelix you may be interested in this.


#3

I don’t want to distract from the main discussion here, but this is a perfect example of the pitfalls of having static mut available, which is part of the reason I proposed removing it.

You’re lacking any kind of synchronization around accesses to wrappers, which is UB AFAIK, given multithreading.

I think the best option here would be AtomicPtr, with sequentially consistent writes and relaxed reads.
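A minimal sketch of that suggestion, covering only the allocate wrapper and written against current stable Rust. The names ALLOCATE_WRAPPER, set_allocate_wrapper, imp_allocate, counting_allocate and CALLS are hypothetical, and std::alloc stands in for the real imp::allocate. It stores the function pointer itself in the atomic, rather than a pointer to a table, so the relaxed load never has to read memory published by another thread.

```rust
use std::alloc::{alloc, Layout};
use std::mem;
use std::ptr;
use std::sync::atomic::{AtomicPtr, AtomicUsize, Ordering};

pub type AllocateFn = unsafe fn(usize, usize) -> *mut u8;
pub type AllocateFnWrapper = fn(AllocateFn, usize, usize) -> *mut u8;

// Null means "no wrapper installed" (hypothetical name).
static ALLOCATE_WRAPPER: AtomicPtr<()> = AtomicPtr::new(ptr::null_mut());

// Written once, with a sequentially consistent store.
pub fn set_allocate_wrapper(wrapper: AllocateFnWrapper) {
    ALLOCATE_WRAPPER.store(wrapper as *mut (), Ordering::SeqCst);
}

// Stand-in for imp::allocate; an assumption made for this sketch.
unsafe fn imp_allocate(size: usize, align: usize) -> *mut u8 {
    unsafe { alloc(Layout::from_size_align(size, align).unwrap()) }
}

pub unsafe fn allocate(size: usize, align: usize) -> *mut u8 {
    // Read on every allocation, with a relaxed load.
    let raw = ALLOCATE_WRAPPER.load(Ordering::Relaxed);
    if raw.is_null() {
        unsafe { imp_allocate(size, align) }
    } else {
        // The only non-null value ever stored is an AllocateFnWrapper.
        let wrapper: AllocateFnWrapper = unsafe { mem::transmute(raw) };
        wrapper(imp_allocate, size, align)
    }
}

// Example wrapper: counts calls, then forwards to the real allocator.
static CALLS: AtomicUsize = AtomicUsize::new(0);

fn counting_allocate(real_allocate: AllocateFn, size: usize, align: usize) -> *mut u8 {
    CALLS.fetch_add(1, Ordering::Relaxed);
    unsafe { real_allocate(size, align) }
}
```

A program would call set_allocate_wrapper(counting_allocate) once at start-up; the fast path stays a single load plus a null check.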

As for the rest of the proposal, I have nothing against it, and I’ve previously found myself wishing I had a way to visualize all the allocations, including type information (preferably in a debugger-friendly form).


#4

Some random thoughts before I head off for the night.

@nikomatsakis and I have talked in the past about wanting to instrument memory allocation / deallocation, and some different API’s for it. I am sure he will have thoughts on this (though he may not necessarily see this thread; I’ll try to ping him more directly).

I would like for there to be some way to attach instrumentation in a thread-local fashion. (But I also assume that making the blessed, lowest-level API require thread-local accesses for everyone would be a nonstarter.)

So maybe something like what @nnethercote is suggesting here is the right path: at the very lowest level, use a single global to dictate whether the lowest-level allocator has been overridden. (Whether it’s implemented via a static mut (and risks multiple threads clobbering it), or as @eddyb suggests, via an AtomicPtr, is an important detail, but separate from the question of how low-level this goes.)

On top of that, I would presumably write my own library that moves the actual allocators, with their own instrumentation, into thread-locals. The global allocator code that I installed at the outset in my own version of the program would then query those thread-locals.
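One way that layering could look, as a rough sketch only. Observer, THREAD_OBSERVER, set_thread_observer, dispatching_allocate, system_allocate, count_bytes and OBSERVED_BYTES are all hypothetical names, and std::alloc stands in for the real low-level allocator. The single global wrapper does nothing but look up a per-thread observer and forward the call; the sketch has no recursion guard, so an observer must not allocate.

```rust
use std::alloc::{alloc, Layout};
use std::cell::Cell;
use std::sync::atomic::{AtomicUsize, Ordering};

pub type AllocateFn = unsafe fn(usize, usize) -> *mut u8;
// Per-thread instrumentation callback: receives (size, align).
pub type Observer = fn(usize, usize);

thread_local!(static THREAD_OBSERVER: Cell<Option<Observer>> = Cell::new(None));

// Each thread opts in (or out) of instrumentation for itself.
pub fn set_thread_observer(observer: Option<Observer>) {
    THREAD_OBSERVER.with(|slot| slot.set(observer));
}

// The one wrapper that would be installed globally at start-up.
pub fn dispatching_allocate(real_allocate: AllocateFn, size: usize, align: usize) -> *mut u8 {
    // try_with returns Err instead of panicking if this thread's TLS is gone.
    if let Ok(Some(observer)) = THREAD_OBSERVER.try_with(|slot| slot.get()) {
        observer(size, align);
    }
    unsafe { real_allocate(size, align) }
}

// Stand-in for the real allocator; an assumption made for this sketch.
unsafe fn system_allocate(size: usize, align: usize) -> *mut u8 {
    unsafe { alloc(Layout::from_size_align(size, align).unwrap()) }
}

// Example observer: sums the bytes requested by instrumented threads.
static OBSERVED_BYTES: AtomicUsize = AtomicUsize::new(0);

fn count_bytes(size: usize, _align: usize) {
    OBSERVED_BYTES.fetch_add(size, Ordering::Relaxed);
}
```

Threads that never call set_thread_observer pay only for the thread-local lookup.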

(I’ll try to think more about this. Feel free to poke holes in what I outlined above.)


#5

Ideally it’s implemented in some immutable way, because it’s something you want to opt into for the entirety of the program’s execution. There may be a difference between “immutable from the point of view of the program” and “immutable from the point of view of the Rust runtime”, though.


#6

I worked out the problem there. It turns out that stdio.rs has its own TLS (called LOCAL_STDOUT) that it uses to record the stdout handle used by each thread. What happens is that LOCAL_STDOUT's destructor runs, which triggers a deallocation, so the deallocate wrapper is invoked, and it tries to print a “deallocate” message, which requires access to LOCAL_STDOUT, which is in the middle of being destroyed, so it panics.

The good news is that for Servo’s purposes I don’t actually need to do any printing within that function. And if I did I could use a lower-level output function like write_all that avoids LOCAL_STDOUT.

The bad news is that this takes us into “need to know how something is implemented to call it safely from a wrapper” territory. In particular, given that the destruction order of TLS data is undefined, calling any code from within a wrapper could be a problem if that code uses TLS to store a Droppable type. Hmm.
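For the wrapper's own flag, current Rust offers LocalKey::try_with, which returns an error instead of panicking once the value has been destroyed. A minimal sketch of a recursion guard built on it follows; run_guarded and Restore are hypothetical names. It only protects the guard itself: code called from inside the wrapper that touches other destroyed TLS (as println! did with LOCAL_STDOUT) can still fail.

```rust
use std::cell::Cell;

thread_local!(static WRAPPERS_ENABLED: Cell<bool> = Cell::new(true));

/// Runs `instrument` only if this thread is not already inside a wrapper
/// and its thread-local flag is still accessible. Returns whether it ran.
fn run_guarded<F: FnOnce()>(instrument: F) -> bool {
    // replace(false) disables wrapping and reports the previous state.
    // If the TLS slot is already destroyed, treat wrapping as disabled.
    let entered = WRAPPERS_ENABLED
        .try_with(|enabled| enabled.replace(false))
        .unwrap_or(false);
    if !entered {
        return false;
    }

    // Re-enable wrapping when leaving, even if `instrument` panics.
    struct Restore;
    impl Drop for Restore {
        fn drop(&mut self) {
            let _ = WRAPPERS_ENABLED.try_with(|enabled| enabled.set(true));
        }
    }
    let _restore = Restore;

    instrument();
    true
}
```

A wrapper would call run_guarded(|| record_allocation(...)) and then always forward to the real allocator, whether or not the instrumentation ran (record_allocation being whatever the profiler does).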


#7

I opened a new issue in the RFC repo for this topic.


#8

Would it be possible to swap out jemalloc for a custom allocator during linking? This would not solve the general problem of running arbitrary code on allocation, but would at least allow Servo to profile, right?


#9

Good question. I don’t know.

More generally, it appears that the implementation already has support for custom allocators, but it’s difficult to use because it requires re-building liballoc with external_funcs. See here.


#10

I know that this is what some C memory profilers do. A custom allocator could simply be a wrapper around jemalloc that observes the stack to find out which stack frame did the allocation. One could either combine it with a pre-allocated structure to hold the data (which would consume quite a bit of memory) or a pipe to write out the data (which would need external support for collecting and reporting the data, but could be more robust).
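A rough sketch of the wrapping-allocator idea with pre-allocated storage, written against the std::alloc::GlobalAlloc trait that stable Rust provides today (it did not exist when this thread was written). CountingAlloc and the counter names are hypothetical, System stands in for jemalloc, and the sketch only counts calls and bytes; it does not inspect the stack.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

// Pre-allocated storage for the collected data: plain static counters,
// so recording never allocates and cannot recurse into the allocator.
static ALLOC_CALLS: AtomicUsize = AtomicUsize::new(0);
static ALLOC_BYTES: AtomicUsize = AtomicUsize::new(0);
static DEALLOC_CALLS: AtomicUsize = AtomicUsize::new(0);

pub struct CountingAlloc;

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOC_CALLS.fetch_add(1, Ordering::Relaxed);
        ALLOC_BYTES.fetch_add(layout.size(), Ordering::Relaxed);
        // Forward to the wrapped allocator (System here, not jemalloc).
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        DEALLOC_CALLS.fetch_add(1, Ordering::Relaxed);
        unsafe { System.dealloc(ptr, layout) }
    }
}

// Route every heap allocation in the program through the wrapper.
#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;
```

A profiler that records stack traces would need a larger fixed-size buffer or the pipe approach described above, since capturing a trace into a growable structure would itself allocate.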