Layout is computed even for (de)allocators that don't use it

When a program uses the system malloc/free as its allocator, tracking each object's size and alignment for deallocation is redundant and causes unnecessary code bloat.
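For reference, here is a minimal sketch of the kind of allocator meant here: one that forwards to the C malloc/free and never looks at the Layout on deallocation. The names MallocFree, ASSUMED_MALLOC_ALIGN and round_trip are made up for illustration, and the alignment bound is a conservative assumption rather than anything malloc formally promises to Rust.

```rust
use std::alloc::{GlobalAlloc, Layout};
use std::ffi::c_void;

// `unsafe extern` is the spelling required by the 2024 edition (Rust 1.82+);
// plain `extern "C"` works on earlier editions.
unsafe extern "C" {
    fn malloc(size: usize) -> *mut c_void;
    fn free(ptr: *mut c_void);
}

/// Hypothetical allocator name, used only for this sketch.
pub struct MallocFree;

/// Assumption: malloc returns memory aligned at least as strictly as `usize`.
const ASSUMED_MALLOC_ALIGN: usize = std::mem::align_of::<usize>();

unsafe impl GlobalAlloc for MallocFree {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        // Refuse over-aligned requests instead of handing out misaligned memory.
        if layout.align() > ASSUMED_MALLOC_ALIGN {
            return std::ptr::null_mut();
        }
        unsafe { malloc(layout.size()).cast() }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, _layout: Layout) {
        // The layout is ignored: free() only needs the pointer.
        unsafe { free(ptr.cast()) }
    }
}

/// Allocates one `usize` through the sketch allocator, writes and reads it back,
/// then frees it. Returns `None` if the allocation failed.
pub fn round_trip(value: usize) -> Option<usize> {
    let layout = Layout::new::<usize>();
    unsafe {
        let p = MallocFree.alloc(layout).cast::<usize>();
        if p.is_null() {
            return None;
        }
        p.write(value);
        let out = p.read();
        MallocFree.dealloc(p.cast(), layout);
        Some(out)
    }
}
```

Every caller of dealloc still has to build a Layout to pass in, even though this implementation throws it away.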

The redundant tracking seems to happen because __rust_dealloc is magic and doesn't get inlined, so the optimizer can't see that its size and alignment arguments are unused. This opaque layer between allocations/deallocations in Rust and the calls to malloc/free makes it not quite a zero-cost abstraction.

This:

pub fn actual(_drop: Box<[u8; 123456]>) {
}

pub fn expected(drop: Box<[u8; 123456]>) {
    unsafe { free(Box::leak(drop).as_mut_ptr().cast()); }
}

Compiles to (godbolt link):

example::actual:
        mov     esi, 123456
        mov     edx, 1
        jmp     qword ptr [rip + __rust_dealloc@GOTPCREL]

example::expected:
        jmp     qword ptr [rip + free@GOTPCREL]

Note that dropping the Box needlessly materializes the size and alignment of the allocation, because the compiler can't know that this particular allocator implementation won't use them.

Rest of the repro code:
use std::alloc::*;

#[global_allocator]
static A: Alloc = Alloc;

struct Alloc;

unsafe impl GlobalAlloc for Alloc {
    #[inline(always)]
    unsafe fn alloc(&self, _layout: Layout) -> *mut u8 { std::process::abort(); }

    #[inline(always)]
    unsafe fn dealloc(&self, _ptr: *mut u8, _layout: Layout) { std::process::abort(); }

    #[inline(always)]
    unsafe fn alloc_zeroed(&self, _layout: Layout) -> *mut u8 { std::process::abort(); }

    #[inline(always)]
    unsafe fn realloc(
        &self, 
        _ptr: *mut u8, 
        _layout: Layout, 
        _new_size: usize
    ) -> *mut u8 { std::process::abort(); }
}

extern "C" {
    fn free(_: *mut u8);
}
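To check that the layout really does travel all the way to dealloc, here is a small sketch with a recording allocator that wraps System and counts deallocations matching one specific size and alignment. Recording, MARKER_SIZE, MARKER_DEALLOCS and drop_marker_box are hypothetical names for this example only; it assumes no other allocation in the process happens to be exactly 123456 bytes with alignment 1.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

/// Size of the boxed array from the example above.
pub const MARKER_SIZE: usize = 123_456;

/// Number of deallocations seen with size MARKER_SIZE and alignment 1.
pub static MARKER_DEALLOCS: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical wrapper around the system allocator that inspects layouts.
pub struct Recording;

unsafe impl GlobalAlloc for Recording {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        // Record the layout handed over by the caller before forwarding.
        if layout.size() == MARKER_SIZE && layout.align() == 1 {
            MARKER_DEALLOCS.fetch_add(1, Ordering::Relaxed);
        }
        unsafe { System.dealloc(ptr, layout) }
    }
}

#[global_allocator]
static GLOBAL: Recording = Recording;

/// Drops one Box<[u8; MARKER_SIZE]> and returns how many matching
/// deallocations were recorded while doing so.
pub fn drop_marker_box() -> usize {
    let before = MARKER_DEALLOCS.load(Ordering::Relaxed);
    let b: Box<[u8; MARKER_SIZE]> = Box::new([0u8; MARKER_SIZE]);
    // black_box keeps the optimizer from eliding the allocation entirely.
    drop(std::hint::black_box(b));
    MARKER_DEALLOCS.load(Ordering::Relaxed) - before
}
```

The drop glue for the Box supplies the size and alignment that the recording allocator observes, which is the same pair of constants visible in the `actual` assembly above.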

So it seems like there's a missed optimization opportunity here.

Maybe, but is that significant compared to the real work of the allocator?

It isn't for statically sized/aligned types, but it might be for dynamically sized/aligned types. Even for a currently supported DST, though, that's at most an indirect load and a constant add, or a multiply and a constant add.
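As a rough illustration of those two DST cases, the sketch below asks for the layout of a slice (length times element size) and of a trait object (size and alignment read from the vtable) via Layout::for_value. The function names slice_layout and dyn_layout are made up for this example.

```rust
use std::alloc::Layout;
use std::fmt::Debug;

/// Layout of a slice: the size is the length multiplied by the element size.
pub fn slice_layout(s: &[u32]) -> Layout {
    Layout::for_value(s)
}

/// Layout of a trait object: size and alignment come from its vtable.
pub fn dyn_layout(v: &dyn Debug) -> Layout {
    Layout::for_value(v)
}
```

Dropping a Box<[u32]> or a Box<dyn Debug> has to do the equivalent computation before calling into the allocator, which is the extra work being discussed.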

So no, not really, I don't think so.

I'd be curious whether LTO could see through __rust_dealloc.

I've tried -C lto and -C lto=fat, and it doesn't see through it on godbolt.

It's not just that there are a few unnecessary instructions per deallocation; it also misses the fact that dropping a Box<T> for any T is the exact same call to free (after dropping the T if it needed that). For example, the compiler shouldn't need to generate a new three-instruction function for every N that appears in Box<[u8; N]>: godbolt


The allocation functions are #[rustc_std_internal_symbol], so I think they're prohibited from being mucked with by LTO. I'm not sure why exactly though.

Building with LTO and -Z build-std permits this optimization. I wrote the following program:

#[no_mangle]
pub fn actual(_drop: Box<[u8; 123456]>) {
}

pub fn main() {
}

And enabled LTO:

[profile.release]
lto = true

Then compiled it with cargo +nightly rustc --release -Z build-std --target x86_64-unknown-linux-gnu -- --emit asm to get the following:

	.section	.text.actual,"ax",@progbits
	.globl	actual
	.p2align	4, 0x90
	.type	actual,@function
actual:
	.cfi_startproc
	jmpq	*free@GOTPCREL(%rip)
.Lfunc_end1:
	.size	actual, .Lfunc_end1-actual
	.cfi_endproc