LD_PRELOAD to intercept readlink of rustc compiler hangs

I am building an LD_PRELOAD library in Rust to track file system dependencies by intercepting calls such as readlink, open, fopen, etc. The goal is to add build dependency tracking for build tools like GNU make, cargo, etc. In fact, I had a version mostly working in C, and it ran fine on the rustc compiler as well.

I started rewriting it in Rust and have mostly loved the experience so far, but I ran into a hang when using the LD_PRELOAD library on the rustc compiler.

The problem is that the LD_PRELOAD library intercepts readlink(), which is called by the jemalloc code in rustc. The test code does nothing more than intercept readlink and then call the original readlink, and it looks like this goes into recursion.

Question: How can I rewrite the piece of code below so that I can prevent this recursion? I am asking this here after discussing it in the user forum in the below thread.

I am trying to see if people with an understanding of rustc internals can provide some advice.

Basically, I am looking to figure out how I can get my LD_PRELOAD program to work with the rustc compiler.

One thing I think I need to figure out here is how I can do some of the static global variable initialization, such as

dlsym_next(concat!("readlink", "\0"));
static mut ONCE: Once = Once::new(); 
etc 

in the test code below, without triggering the jemalloc code whose readlink/open invocations cause the recursion. All these intercepts work fine in my C version. So I suspect it's not the interception but the initialization code that triggers rustc's jemalloc code, which again calls readlink and then blocks on a mutex lock.

I am hoping the rustc experts here have some advice on how to avoid the jemalloc recursion and do static global initialization without triggering it.

I even tried doing a direct syscall for readlink after interception, but I still can't avoid the hang. It just hangs differently:

(gdb) bt
#0 0x00007f144590e4ed in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x00007f1445909dcb in L_lock_883 () from /lib64/libpthread.so.0
#2 0x00007f1445909c98 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x000056401d0150ed in malloc_mutex_lock_final (mutex=0x56401d2351d8 <init_lock>) at ../jemalloc/include/jemalloc/internal/mutex.h:141
#4 rjem_je_malloc_mutex_lock_slow (mutex=0x56401d2351d8 <init_lock>) at ../jemalloc/src/mutex.c:84
#5 0x000056401cfea09f in malloc_mutex_lock (tsdn=0x0, mutex=) at ../jemalloc/include/jemalloc/internal/mutex.h:205
#6 malloc_init_hard () at ../jemalloc/src/jemalloc.c:1506
#7 malloc_init () at ../jemalloc/src/jemalloc.c:217
#8 imalloc (sopts=, dopts=) at ../jemalloc/src/jemalloc.c:1986
#9 calloc (num=1, size=32) at ../jemalloc/src/jemalloc.c:2138
#10 0x00007f14456fd550 in dlerror_run () from /lib64/libdl.so.2
#11 0x00007f14456fd058 in dlsym () from /lib64/libdl.so.2
#12 0x00007f14497b6cc7 in ldpreload::dlsym_next::h7cab03892daf2fab (symbol=...) at src/lib.rs:43
#13 0x00007f14497b7429 in ldpreload::readlink_local::get::$u7b$$u7b$closure$u7d$$u7d$::haeb8628401923c4c () at src/lib.rs:71
#14 0x00007f14497b73b0 in std::sync::once::Once::call_once::$u7b$$u7b$closure$u7d$$u7d$::haf60324ae025b6d9 () at /rustc/8d69840ab92ea7f4d323420088dd8c9775f180cd/src/libstd/sync/once.rs:264
#15 0x00007f14497de848 in std::sync::once::Once::call_inner::hfbdd978c729db7b8 () at src/libstd/sync/once.rs:416
#16 0x00007f14497b7329 in std::sync::once::Once::call_once::hc3adf9476c282443 (self=0x7f1449a720a0 ldpreload::readlink_local::get::ONCE::h19ff01a158e3f42e, f=...) at /rustc/8d69840ab92ea7f4d323420088dd8c9775f180cd/src/libstd/sync/once.rs:264
#17 0x00007f14497b6db9 in ldpreload::readlink_local::get::hc45eb880fee9d18f (self=0x7f14498379f4) at src/lib.rs:70
#18 0x00007f14497b7488 in ldpreload::readlink_local::readlink::$u7b$$u7b$closure$u7d$$u7d$::h8cc84822c0f8be3c () at src/lib.rs:83
#19 0x00007f14497b7b40 in core::option::Option$LT$T$GT$::unwrap_or_else::hb73d717eb9eefcaa (self=..., f=...) at /rustc/8d69840ab92ea7f4d323420088dd8c9775f180cd/src/libcore/option.rs:428
#20 0x00007f14497b6e93 in readlink (path=0x56401d023e2d "/etc/malloc.conf", buf=0x7ffde2568650 "", bufsiz=4096) at src/lib.rs:79
#21 0x000056401cfeee12 in malloc_conf_init () at ../jemalloc/src/jemalloc.c:913
#22 malloc_init_hard_a0_locked () at ../jemalloc/src/jemalloc.c:1281
#23 0x000056401cfe8f4f in malloc_init_hard () at ../jemalloc/src/jemalloc.c:1517
#24 malloc_init () at ../jemalloc/src/jemalloc.c:217
#25 imalloc (sopts=, dopts=) at ../jemalloc/src/jemalloc.c:1986
#26 malloc (size=size@entry=72704) at ../jemalloc/src/jemalloc.c:2038
#27 0x00007f14420dfae0 in pool (this=0x7f1444c4eca0 <(anonymous namespace)::emergency_pool>) at ../../../../gcc-5.5.0/libstdc++-v3/libsupc++/eh_alloc.cc:117
#28 __static_initialization_and_destruction_0 (__priority=65535, __initialize_p=1) at ../../../../gcc-5.5.0/libstdc++-v3/libsupc++/eh_alloc.cc:244
#29 _GLOBAL__sub_I_eh_alloc.cc(void) () at ../../../../gcc-5.5.0/libstdc++-v3/libsupc++/eh_alloc.cc:307
#30 0x00007f1449a828f3 in _dl_init_internal () from /lib64/ld-linux-x86-64.so.2
#31 0x00007f1449a7415a in _dl_start_user () from /lib64/ld-linux-x86-64.so.2
#32 0x0000000000000001 in ?? ()
#33 0x00007ffde256a2af in ?? ()
#34 0x0000000000000000 in ?? ()

Test code as follows

extern crate core;
extern crate libc;
#[macro_use]
extern crate ctor; 


use libc::{c_void,c_char,c_int,size_t,ssize_t};

use std::sync::atomic;

#[cfg(any(target_os = "macos", target_os = "ios"))]
pub mod dyld_insert_libraries;

/* Some Rust library functionality (e.g., jemalloc) initializes
 * lazily, after the hooking library has inserted itself into the call
 * path. If the initialization uses any hooked functions, this will lead
 * to an infinite loop. Work around this by running some initialization
 * code in a static constructor, and bypassing all hooks until it has
 * completed. */

static INIT_STATE: atomic::AtomicBool = atomic::AtomicBool::new(false);

pub fn initialized() -> bool {
    INIT_STATE.load(atomic::Ordering::SeqCst)
}

// extern "C" fn initialize() {
//     Box::new(0u8);
//     INIT_STATE.store(true, atomic::Ordering::SeqCst);
// }

// /* Rust doesn't directly expose __attribute__((constructor)), but this
//  * is how GNU implements it. */
//  #[link_section = ".init_array"]
//  pub static INITIALIZE_CTOR: extern "C" fn() = ::initialize;

#[ctor]
fn initialize() {
    Box::new(0u8);
    INIT_STATE.store(true, atomic::Ordering::SeqCst);
    println!("Constructor");
}


#[link(name = "dl")]
extern "C" {
    fn dlsym(handle: *const c_void, symbol: *const c_char) -> *const c_void;
}

const RTLD_NEXT: *const c_void = -1isize as *const c_void;

pub unsafe fn dlsym_next(symbol: &'static str) -> *const u8 {
    let ptr = dlsym(RTLD_NEXT, symbol.as_ptr() as *const c_char);
    if ptr.is_null() {
        panic!("redhook: Unable to find underlying function for {}", symbol);
    }
    ptr as *const u8
}


#[allow(non_camel_case_types)]
pub struct readlink {__private_field: ()}
#[allow(non_upper_case_globals)]
static readlink: readlink = readlink {__private_field: ()};

impl readlink {
    fn get(&self) -> unsafe extern fn (path: *const c_char, buf: *mut c_char, bufsiz: size_t) -> ssize_t  {
        use ::std::sync::Once;

        static mut REAL: *const u8 = 0 as *const u8;
        static mut ONCE: Once = Once::new();

        unsafe {
            ONCE.call_once(|| {
                REAL = dlsym_next(concat!("readlink", "\0"));
            });
            ::std::mem::transmute(REAL)
        }
    }

    #[no_mangle]
    pub unsafe extern "C" fn readlink(path: *const c_char, buf: *mut c_char, bufsiz: size_t) -> ssize_t {
        println!("readlink");
        if initialized() {
            println!("initialized");
            ::std::panic::catch_unwind(|| my_readlink ( path, buf, bufsiz )).ok()
        } else {
            println!("not initialized");
            None
        }.unwrap_or_else(|| readlink.get() ( path, buf, bufsiz ))
    }
}

pub unsafe fn my_readlink(path: *const c_char, buf: *mut c_char, bufsiz: size_t) -> ssize_t {
    println!("my_readlink");
    readlink.get()(path, buf, bufsiz)
}

You need to make sure that when your own library is calling readlink (e.g. via jemalloc), it'll get the actual readlink and not your intercepted version.

But instead I'm guessing things will go better if you use the system allocator instead of jemalloc. Add this to your crate:

use std::alloc::System;

#[global_allocator]
static GLOBAL: System = System;

Didn't we change the default allocator to be the system allocator always, though? (rustc itself still uses jemalloc and you can't change that.)

If you look at the stack trace, the dynamic linker itself (libdl) is calling into jemalloc's copy of calloc.

Looking at the rustc binary, it exports calloc, malloc, etc. as global symbols, and this apparently makes it override glibc's default allocator for the entire process, not just for Rust code. It's not clear to me whether this is intentional. rustc doesn't do this on macOS, for example (which would require jemalloc to be built with --enable-zone-allocator).

Regardless, this seems like a real pain to work around. Doing a direct syscall instead of using RTLD_NEXT should work (not sure why it didn't work for you). But it has the downside that if some other LD_PRELOADed library hooks the same function and comes 'after' your hook, you'll end up bypassing it.
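A minimal sketch of the direct-syscall idea, assuming Linux on x86_64 or aarch64. The helper name raw_readlink is hypothetical. It goes through libc's generic syscall() wrapper and uses readlinkat with AT_FDCWD, which behaves like readlink. Because it never calls dlsym, it cannot reach calloc through dlerror_run the way the first backtrace does.

```rust
use std::os::raw::{c_char, c_long};

// Generic syscall entry point exported by libc; declared by hand so the
// sketch needs no external crates.
extern "C" {
    fn syscall(num: c_long, ...) -> c_long;
}

// readlinkat(2) syscall numbers. Only these two architectures are covered.
#[cfg(target_arch = "x86_64")]
const SYS_READLINKAT: c_long = 267;
#[cfg(target_arch = "aarch64")]
const SYS_READLINKAT: c_long = 78;

// Special dirfd value: resolve relative paths against the current directory.
const AT_FDCWD: c_long = -100;

/// Hypothetical helper: same contract as readlink(2), but it bypasses both
/// dlsym(RTLD_NEXT, ..) and any interposed `readlink` symbol.
pub unsafe fn raw_readlink(path: *const c_char, buf: *mut c_char, bufsiz: usize) -> isize {
    // No heap allocation happens on this path: the arguments go straight
    // into the kernel through libc's syscall() wrapper.
    unsafe { syscall(SYS_READLINKAT, AT_FDCWD, path, buf, bufsiz) as isize }
}
```

A hook would call raw_readlink(path, buf, bufsiz) in place of the dlsym(RTLD_NEXT, ...) lookup. On failure, syscall() returns -1 and sets errno, which is the same convention readlink uses.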

I'd also note that, although it's not the issue here, LD_PRELOAD-based syscall interception has the major downside of not working with static binaries. You might want to consider switching approaches and using seccomp for more reliable tracing.


That seems like a simple, elegant solution. I added the code you suggested to the test LD_PRELOAD library code above, and it is still calling into the jemalloc code and hanging. Just to be clear, the above is a libreadlink.so that gets run as LD_PRELOAD=libreadlink.so rustc

So if I understand your suggestion, adding that code would at least make libreadlink.so not use jemalloc, even if the rustc it is being preloaded into still uses jemalloc, as @CAD97 suggests.

But I still see jemalloc being called, and it hangs as follows

(gdb) 
(gdb) bt
#0  0x00007fd3f7b888dd in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007fd3f7b81af9 in pthread_mutex_lock () from /lib64/libpthread.so.0
#2  0x000055de117b1e2d in malloc_mutex_lock_final (mutex=0x55de119d21d8 <init_lock>)
    at ../jemalloc/include/jemalloc/internal/mutex.h:141
#3  _rjem_je_malloc_mutex_lock_slow (mutex=0x55de119d21d8 <init_lock>)
    at ../jemalloc/src/mutex.c:84
#4  0x000055de117866c5 in malloc_mutex_lock (tsdn=0x0, mutex=<optimized out>)
    at ../jemalloc/include/jemalloc/internal/mutex.h:205
#5  malloc_init_hard () at ../jemalloc/src/jemalloc.c:1506
#6  malloc_init () at ../jemalloc/src/jemalloc.c:217
#7  imalloc (sopts=<optimized out>, dopts=<optimized out>)
    at ../jemalloc/src/jemalloc.c:1986
#8  calloc (num=1, size=32) at ../jemalloc/src/jemalloc.c:2138
#9  0x00007fd3f73e1723 in __cxa_thread_atexit_impl () from /lib64/libc.so.6
#10 0x00007fd3fc9248bf in std::sys::unix::thread_local_dtor::register_dtor ()
    at library/std/src/sys/unix/thread_local_dtor.rs:36
#11 std::thread::local::fast::Key<T>::try_register_dtor ()
    at library/std/src/thread/local.rs:442
#12 std::thread::local::fast::Key<T>::try_initialize [_ZN3std6thread5local...] ()
    at library/std/src/thread/local.rs:428
#13 0x00007fd3fc927466 in std::thread::local::fast::Key<T>::get ()
    at library/std/src/thread/local.rs:414
#14 std::io::stdio::LOCAL_STDOUT::__getit () at library/std/src/thread/local.rs:179
#15 std::thread::local::LocalKey<T>::try_with ()
    at library/std/src/thread/local.rs:266
#16 std::io::stdio::print_to () at library/std/src/io/stdio.rs:970
#17 std::io::stdio::_print [_ZN3std2io5stdio6_pr...] ()
    at library/std/src/io/stdio.rs:998
#18 0x00007fd3fc916243 in readlink::readlink [readlink] (
    path=0x55de117c0d2d "/etc/malloc.conf", buf=0x7fff953fea00 "", bufsiz=4096)
    at src/lib.rs:73
#19 0x000055de1178b962 in malloc_conf_init () at ../jemalloc/src/jemalloc.c:913
#20 malloc_init_hard_a0_locked () at ../jemalloc/src/jemalloc.c:1281
#21 0x000055de11785428 in malloc_init_hard () at ../jemalloc/src/jemalloc.c:1517
#22 malloc_init () at ../jemalloc/src/jemalloc.c:217
#23 imalloc (sopts=<optimized out>, dopts=<optimized out>)
    at ../jemalloc/src/jemalloc.c:1986
#24 malloc (size=size@entry=72704) at ../jemalloc/src/jemalloc.c:2038
#25 0x00007fd3f429d310 in (anonymous namespace)::pool::pool (
    this=0x7fd3f7139aa0 <(anonymous namespace)::emergency_pool>)
    at ../../../../gcc-5.5.0/libstdc++-v3/libsupc++/eh_alloc.cc:117
#26 __static_initialization_and_destruction_0 (__priority=65535, __initialize_p=1)
    at ../../../../gcc-5.5.0/libstdc++-v3/libsupc++/eh_alloc.cc:244
#27 _GLOBAL__sub_I_eh_alloc.cc(void) [_GLOBAL__sub_I_eh_al...] ()
    at ../../../../gcc-5.5.0/libstdc++-v3/libsupc++/eh_alloc.cc:307
#28 0x00007fd3fcb65d0a in call_init.part () from /lib64/ld-linux-x86-64.so.2
#29 0x00007fd3fcb65e0a in _dl_init () from /lib64/ld-linux-x86-64.so.2
#30 0x00007fd3fcb5706a in _dl_start_user () from /lib64/ld-linux-x86-64.so.2
#31 0x0000000000000001 in ?? ()
#32 0x00007fff95400e0b in ?? ()
#33 0x0000000000000000 in ?? ()
(gdb) 

The code now compiles as crate_type = ["cdylib"] and looks as below

extern crate core;
extern crate libc;
#[macro_use]
extern crate ctor;
#[macro_use]
extern crate syscalls;


use libc::{c_void,c_char,c_int,size_t,ssize_t};

use std::sync::atomic;
use std::alloc::System;

#[global_allocator]
static GLOBAL: System = System;

#[cfg(any(target_os = "macos", target_os = "ios"))]
pub mod dyld_insert_libraries;

static INIT_STATE: atomic::AtomicBool = atomic::AtomicBool::new(false);

pub fn initialized() -> bool {
    INIT_STATE.load(atomic::Ordering::SeqCst)
}

#[ctor]
fn initialize() {
    Box::new(0u8);
    INIT_STATE.store(true, atomic::Ordering::SeqCst);
    println!("Constructor");
}


#[link(name = "dl")]
extern "C" {
    fn dlsym(handle: *const c_void, symbol: *const c_char) -> *const c_void;
}

const RTLD_NEXT: *const c_void = -1isize as *const c_void;

pub unsafe fn dlsym_next(symbol: &'static str) -> *const u8 {
    let ptr = dlsym(RTLD_NEXT, symbol.as_ptr() as *const c_char);
    if ptr.is_null() {
        panic!("redhook: Unable to find underlying function for {}", symbol);
    }
    ptr as *const u8
}


#[allow(non_camel_case_types)]
pub struct orig_readlink {__private_field: ()}
#[allow(non_upper_case_globals)]
static orig_readlink: orig_readlink = orig_readlink {__private_field: ()};

impl orig_readlink {
    fn get(&self) -> unsafe extern fn (path: *const c_char, buf: *mut c_char, bufsiz: size_t) -> ssize_t  {
        use ::std::sync::Once;

        static mut REAL: *const u8 = 0 as *const u8;
        static mut ONCE: Once = Once::new();

        unsafe {
            ONCE.call_once(|| {
                REAL = dlsym_next(concat!("readlink", "\0"));
            });
            ::std::mem::transmute(REAL)
        }
    }

}
#[no_mangle]
pub unsafe extern "C" fn readlink(path: *const c_char, buf: *mut c_char, bufsiz: size_t) -> ssize_t {
    println!("readlink");
    if initialized() {
        println!("initialized");
        ::std::panic::catch_unwind(|| my_readlink ( path, buf, bufsiz )).ok()
    } else {
        println!("not initialized");
        None
    }.unwrap_or_else(|| orig_readlink.get() ( path, buf, bufsiz ))
}

pub unsafe fn my_readlink(path: *const c_char, buf: *mut c_char, bufsiz: size_t) -> ssize_t {

    println!("my_readlink");
    let sz = orig_readlink.get()(path, buf, bufsiz);
    println!("my_readlink: Complete");
    sz
}

It's also possible your library contains all kinds of Rust symbols that rustc is picking up instead of the ones from libstd and friends (since rustc is dynamically linked). What's the output of objdump -T libreadlink.so?

@comex, can rustc be compiled without global jemalloc, as on macOS? How do I do that? Do I compile rustc with --enable-zone-allocator? And I am aware of the problem this approach has with static linking.

I was considering https://github.com/pmem/syscall_intercept for those cases.

seccomp sounds interesting; I am reading about it now. Will it have a performance impact, like strace does? I am looking for a low-overhead way to listen on just the filesystem and subprocess (execv/spawn) system calls in order to track file system accesses.

Here is the objdump -T output

bash-4.4$ objdump -T target/debug/libreadlink.so 

target/debug/libreadlink.so:     file format elf64-x86-64

DYNAMIC SYMBOL TABLE:
0000000000000000      DF *UND*  0000000000000000  GLIBC_2.2.5 getenv
0000000000000000      DF *UND*  0000000000000000  GLIBC_2.2.5 dl_iterate_phdr
0000000000000000      DF *UND*  0000000000000000  GLIBC_2.2.5 free
0000000000000000      DF *UND*  0000000000000000  GLIBC_2.2.5 abort
0000000000000000      DF *UND*  0000000000000000  GCC_3.3     _Unwind_Backtrace
0000000000000000      DF *UND*  0000000000000000  GLIBC_2.2.5 __errno_location
0000000000000000  w   D  *UND*  0000000000000000              _ITM_deregisterTMCloneTable
0000000000000000      DF *UND*  0000000000000000  GLIBC_2.2.5 writev
0000000000000000  w   DF *UND*  0000000000000000  GLIBC_2.18  __cxa_thread_atexit_impl
0000000000000000      DF *UND*  0000000000000000  GLIBC_2.3.4 __xpg_strerror_r
0000000000000000      DF *UND*  0000000000000000  GCC_3.0     _Unwind_GetRegionStart
0000000000000000      DF *UND*  0000000000000000  GLIBC_2.2.5 write
0000000000000000      DF *UND*  0000000000000000  GCC_3.0     _Unwind_GetTextRelBase
0000000000000000      DF *UND*  0000000000000000  GCC_3.0     _Unwind_RaiseException
0000000000000000      DF *UND*  0000000000000000  GLIBC_2.3.2 pthread_cond_wait
0000000000000000      DF *UND*  0000000000000000  GLIBC_2.2.5 __xstat64
0000000000000000      DF *UND*  0000000000000000  GLIBC_2.2.5 strlen
0000000000000000      DF *UND*  0000000000000000  GLIBC_2.2.5 pthread_mutexattr_destroy
0000000000000000      DF *UND*  0000000000000000  GLIBC_2.2.5 mmap
0000000000000000      DF *UND*  0000000000000000  GLIBC_2.2.5 pthread_setspecific
0000000000000000      DF *UND*  0000000000000000  GLIBC_2.2.5 pthread_mutex_destroy
0000000000000000      DF *UND*  0000000000000000  GLIBC_2.2.5 pthread_mutexattr_init
0000000000000000      DF *UND*  0000000000000000  GLIBC_2.4   __fxstatat64
0000000000000000      DF *UND*  0000000000000000  GLIBC_2.2.5 memset
0000000000000000      DF *UND*  0000000000000000  GLIBC_2.2.5 pthread_condattr_destroy
0000000000000000      DF *UND*  0000000000000000  GLIBC_2.2.5 getcwd
0000000000000000      DF *UND*  0000000000000000  GCC_4.2.0   _Unwind_GetIPInfo
0000000000000000      DF *UND*  0000000000000000  GLIBC_2.2.5 close
0000000000000000      DF *UND*  0000000000000000  GCC_3.0     _Unwind_GetLanguageSpecificData
0000000000000000      DF *UND*  0000000000000000  GLIBC_2.2.5 memchr
0000000000000000      DF *UND*  0000000000000000  GLIBC_2.3   __tls_get_addr
0000000000000000      DF *UND*  0000000000000000  GLIBC_2.3.2 pthread_cond_signal
0000000000000000      DF *UND*  0000000000000000  GLIBC_2.2.5 pthread_rwlock_rdlock
0000000000000000      DF *UND*  0000000000000000  GLIBC_2.2.5 calloc
0000000000000000      DF *UND*  0000000000000000  GLIBC_2.2.5 __fxstat64
0000000000000000      DF *UND*  0000000000000000  GLIBC_2.2.5 syscall
0000000000000000  w   D  *UND*  0000000000000000              __gmon_start__
0000000000000000      DF *UND*  0000000000000000  GLIBC_2.14  memcpy
0000000000000000      DF *UND*  0000000000000000  GCC_3.0     _Unwind_GetIP
0000000000000000      DF *UND*  0000000000000000  GLIBC_2.3.2 pthread_cond_init
0000000000000000      DF *UND*  0000000000000000  GLIBC_2.2.5 pthread_getspecific
0000000000000000      DF *UND*  0000000000000000  GLIBC_2.2.5 pthread_mutex_unlock
0000000000000000      DF *UND*  0000000000000000  GLIBC_2.2.5 pthread_mutexattr_settype
0000000000000000      DF *UND*  0000000000000000  GLIBC_2.2.5 malloc
0000000000000000      DF *UND*  0000000000000000  GLIBC_2.2.5 bcmp
0000000000000000      DF *UND*  0000000000000000  GLIBC_2.2.5 pthread_rwlock_unlock
0000000000000000      DF *UND*  0000000000000000  GLIBC_2.3.3 pthread_condattr_setclock
0000000000000000      DF *UND*  0000000000000000  GLIBC_2.2.5 realloc
0000000000000000      DF *UND*  0000000000000000  GLIBC_2.2.5 munmap
0000000000000000      DF *UND*  0000000000000000  GLIBC_2.2.5 pthread_key_create
0000000000000000      DF *UND*  0000000000000000  GCC_3.0     _Unwind_GetDataRelBase
0000000000000000      DF *UND*  0000000000000000  GLIBC_2.2.5 pthread_condattr_init
0000000000000000      DF *UND*  0000000000000000  GCC_3.0     _Unwind_SetGR
0000000000000000      DF *UND*  0000000000000000  GLIBC_2.2.5 open64
0000000000000000      DF *UND*  0000000000000000  GLIBC_2.2.5 memmove
0000000000000000      DF *UND*  0000000000000000  GLIBC_2.2.5 memrchr
0000000000000000      DF *UND*  0000000000000000  GLIBC_2.3.2 pthread_cond_destroy
0000000000000000      DF *UND*  0000000000000000  GLIBC_2.2.5 sysconf
0000000000000000      DF *UND*  0000000000000000  GLIBC_2.2.5 __lxstat64
0000000000000000      DF *UND*  0000000000000000  GLIBC_2.2.5 pthread_key_delete
0000000000000000      DF *UND*  0000000000000000  GLIBC_2.2.5 posix_memalign
0000000000000000  w   D  *UND*  0000000000000000              _ITM_registerTMCloneTable
0000000000000000      DF *UND*  0000000000000000  GLIBC_2.2.5 dlsym
0000000000000000      DF *UND*  0000000000000000  GCC_3.0     _Unwind_Resume
0000000000000000      DF *UND*  0000000000000000  GLIBC_2.2.5 pthread_mutex_init
0000000000000000  w   DF *UND*  0000000000000000  GLIBC_2.2.5 __cxa_finalize
0000000000000000      DF *UND*  0000000000000000  GLIBC_2.2.5 pthread_mutex_lock
0000000000000000      DF *UND*  0000000000000000  GCC_3.0     _Unwind_SetIP
0000000000026a50 g    DF .text  0000000000000350  Base        rust_eh_personality
00000000000071f0 g    DF .text  000000000000019d  Base        readlink


bash-4.4$ 

How can we fix this? Should I open a bug against rustc?

That won't help, because rustc's global definitions of malloc and friends are substituted for the system allocator for every library in the process. This includes libdl, libc, and libstdc++, which are the culprits in the stack traces you posted – I'm determining this by looking at the entry immediately below the one for malloc.

If anything, if you want to avoid rustc's copy of jemalloc for your own allocations, you could do that by setting #[global_allocator] to a non-system allocator. But don't bother: that would only help for allocations coming directly from libreadlink.so, not for the ones from libdl/libc/libstdc++.

As long as you directly override readlink using LD_PRELOAD, I think your only option is to ensure your hook does not call anything from those libraries that causes them to perform allocations – at least not when your hook is itself being called from malloc.
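As one illustration of keeping the hook allocation-free: the second backtrace above shows println! reaching calloc through thread-local destructor registration, so debug output inside the hook is itself a source of allocations. The sketch below formats into a stack buffer and writes straight to file descriptor 2. The names format_line and log_path are hypothetical, and the assumption that File::write_all does not allocate reflects my understanding of std on Unix rather than a documented guarantee.

```rust
use std::fs::File;
use std::io::Write;
use std::mem::ManuallyDrop;
use std::os::raw::c_char;
use std::os::unix::io::FromRawFd;

/// Copy `prefix`, then the NUL-terminated `path`, then a newline into `out`,
/// truncating if `out` is too small. Returns the number of bytes written.
pub unsafe fn format_line(prefix: &[u8], path: *const c_char, out: &mut [u8]) -> usize {
    let mut n = 0;
    for &b in prefix {
        // Always keep one byte free for the trailing newline.
        if n + 1 >= out.len() {
            break;
        }
        out[n] = b;
        n += 1;
    }
    if !path.is_null() {
        let mut i = 0;
        loop {
            let b = unsafe { *path.add(i) } as u8;
            if b == 0 || n + 1 >= out.len() {
                break;
            }
            out[n] = b;
            n += 1;
            i += 1;
        }
    }
    if n < out.len() {
        out[n] = b'\n';
        n += 1;
    }
    n
}

/// Write one log line to stderr using only stack memory.
pub unsafe fn log_path(prefix: &[u8], path: *const c_char) {
    let mut line = [0u8; 512];
    let len = unsafe { format_line(prefix, path, &mut line) };
    // Borrow fd 2 as a File; ManuallyDrop keeps stderr from being closed.
    let mut stderr = ManuallyDrop::new(unsafe { File::from_raw_fd(2) });
    let _ = stderr.write_all(&line[..len]);
}
```

Inside the hook, a call such as log_path(b"readlink: ", path) would stand in for println!("readlink").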

Runtime patching like that library does actually might be a viable approach for rustc, avoiding the problems with direct symbol overriding. Maybe.

However… that library won't work for static binaries, as they don't have a dynamic linker at all and thus completely ignore LD_PRELOAD.

Well, seccomp would allow you to intercept only the syscalls you're interested in, rather than all of them as strace does, which should greatly reduce overhead. However, for each syscall that does get intercepted, the overhead is probably similar to strace.

To do this, you would probably want to use the new seccomp notifier functionality.

By the way, another option is eBPF with kprobes, which should have very low overhead, but requires root.

The simplest way to build rustc without global jemalloc is to just disable jemalloc entirely, by passing jemalloc = false in config.toml. AFAIK, jemalloc only has a small performance advantage over glibc's default allocator, so this shouldn't slow things down too much.

But if you do want rustc to use jemalloc, just only for Rust code rather than globally… it looks like the current way rustc pulls in jemalloc is through crates.io; compiler/rustc/Cargo.toml pulls in jemalloc-sys with features = ['unprefixed_malloc_on_supported_platforms']. Removing this from Cargo.toml might work, though I haven't tested it.

Well, I hadn't looked at the implementation before, but since rustc explicitly asks for unprefixed malloc, it seems that globally overriding malloc is by design. If anything, the bug is that it doesn't globally replace malloc on macOS.

This doesn't seem like the right choice: rustc is choosing jemalloc on behalf of every other library that gets loaded. It basically makes writing any sort of LD_PRELOAD library very, very difficult.

LOL. It is very ironic. I had a C version of wisktrack, my open source LD_PRELOAD filesystem tracker, mostly working. Then, on a recommendation from a friend, I wanted to give Rust a try.
Despite the steep learning curve, I have loved rewriting it in Rust so far. And yet the rustc compiler is proving to be the one tool that stumps the Rust rewrite of my filesystem tracker.

Does rustc have to make jemalloc a global choice for every other library that gets loaded? It doesn't seem necessary, at least in my case.

I prefer LD_PRELOAD because it has the lowest cost, and I would really like to get this working in Rust.

Would the rust compiler team be open to making the following the default standard for how rustc is compiled?

Is there anything I can do to trigger and initialize jemalloc outside of all this? I am considering the following ideas

  1. I suspect triggering jemalloc initialization in the constructor doesn't help, because redhook, the LD_PRELOAD helper crate that I started with, had this code, which according to its comment was supposed to do exactly that. But it does not seem to have helped.
/* Some Rust library functionality (e.g., jemalloc) initializes
 * lazily, after the hooking library has inserted itself into the call
 * path. If the initialization uses any hooked functions, this will lead
 * to an infinite loop. Work around this by running some initialization
 * code in a static constructor, and bypassing all hooks until it has
 * completed. */

static INIT_STATE: atomic::AtomicBool = atomic::AtomicBool::new(false);

pub fn initialized() -> bool {
    INIT_STATE.load(atomic::Ordering::SeqCst)
}

pub fn initialize() {
    Box::new(0u8);
    INIT_STATE.store(true, atomic::Ordering::SeqCst);
}
/* Rust doesn't directly expose __attribute__((constructor)), but this
 * is how GNU implements it. */
#[link_section = ".init_array"]
pub static INITIALIZE_CTOR: extern "C" fn() = ::initialize;

But like I said, the comment seems to refer to avoiding this specific problem, yet it hasn't helped in this case.

  2. Separate the LD_PRELOAD library into two .so files. libreadlink.so (a dylib, loaded via LD_PRELOAD) would contain all of the above code except the actual extern "C" readlink, open, etc. that are meant to override the libc (or other) versions. It would do the jemalloc initialization, look up and save the dlsym results for all the intercepted APIs, and implement the my_*() versions of those APIs: basically all of the core functionality of the LD_PRELOAD library except the intercepting functions themselves. libreadlink.so would then dlopen() libreadlinkint.so (a cdylib), which implements only the intercepted functions and calls back into the already loaded libreadlink.so. This is just another attempt to have jemalloc initialization complete before the intercepted functions are loaded. I am not sure if that helps; I am still trying to get that code to work and ran into an issue getting dlopen() of libreadlinkint.so to succeed.
    #[no_mangle]
    pub unsafe extern "C" fn readlink(path: *const c_char, buf: *mut c_char, bufsiz: size_t) -> ssize_t {
        println!("readlink");
        if initialized() {
            println!("initialized");
            ::std::panic::catch_unwind(|| readlink.my_readlink ( path, buf, bufsiz )).ok()
        } else {
            println!("not initialized");
            None
        }.unwrap_or_else(|| readlink.get() ( path, buf, bufsiz ))
    }

  3. Is there anything else I can do to initialize jemalloc for the process completely before my libreadlink.so hooks take effect, either in the constructor for libreadlink.so or perhaps by setting up another dependency of libreadlink.so that gets loaded first and triggers and completes jemalloc initialization?

What's also confusing is where the hang happens, as shown below. It looks like initialized() is true. Also, my_readlink gets called and was able to call readlink successfully and return ("Complete"). That makes me wonder whether multiple threads are involved and each thread does its own jemalloc init, which would mean my idea of getting jemalloc initialized before loading the intercept functions may be flawed.

bash-4.4$ LD_PRELOAD=target/debug/libreadlink.so ls -al test.link
Constructor
readlink
initialized
my_readlink
my_readlink: Complete
lrwxrwxrwx 1 sarvi eng 10 Aug 29 22:30 test.link -> Cargo.toml
bash-4.4$ LD_PRELOAD=target/debug/libreadlink.so rustc
Constructor
readlink
initialized
my_readlink
my_readlink: Complete
^C

Is there a non-system allocator that I can quickly try?

I am not sure how I can do this, considering that any Rust code I write here ends up using some variables that end up triggering jemalloc initialization and recursing.

I am not hearing a solution where I can get my LD_PRELOAD library to work well with the default build of rustc. Or maybe I am being thick and missing a solution you have proposed. Am I?

BTW, I just downloaded rustc and compiled it with

# features = ['unprefixed_malloc_on_supported_platforms']

as in

[sarvi@sjc-ads-4974 rust]$ cat compiler/rustc/Cargo.toml 

And my LD_PRELOAD library seems to work fine with this compiled version:

bash-4.4$ LD_PRELOAD=target/debug/libreadlink.so /ws/sarvi-sjc/localinstall/bin/rustc
Constructor
Usage: rustc [OPTIONS] INPUT

Options:
    -h, --help          Display this message
        --cfg SPEC      Configure the compilation environment
    -L [KIND=]PATH      Add a directory to the library search path. The

However, I am not seeing any of the readlink interception messages that I would expect to see, as in the run below. But maybe readlink is not being called. I don't know.

bash-4.4$ LD_PRELOAD=target/debug/libreadlink.so ls -al test.link
Constructor
readlink
initialized
my_readlink
my_readlink: Complete
lrwxrwxrwx 1 sarvi eng 10 Aug 29 22:30 test.link -> Cargo.toml

Will this change have a performance impact on the Rust compiler?

The difficulty with pre-initializing jemalloc is this:

  • For each image (library or executable) in the process, for each symbol that the image references (via a relocation), that relocation only gets processed once by the dynamic linker.* You can have the same symbol mean different things for different images (by loading a library that overrides that symbol in between when those images are loaded), but you can't have the same symbol mean different things for the same image at different times.
  • jemalloc is statically linked into the same image as rustc. (Though technically most of the interesting parts of rustc are in librustc_*.so images rather than rustc proper, but that's an implementation detail I wouldn't rely on.)

So if you try to initialize rustc's malloc, then when that calls readlink, the symbol is either hooked or not. If it's hooked, you have to deal with the hang. But if it's not hooked, you can't then hook it once you're done initializing.

…That is, not with the dynamic linker. There are libraries that can hook functions at runtime, usually using code patching. syscall_intercept is one which focuses specifically on system calls, and it might work, but there are also libraries which just provide an API to hook arbitrary functions. This approach should work for rustc, even if it doesn't work for static binaries. However, I don't have any experience with the libraries out there for Linux, so I don't have any specific recommendations for you.

(I do have experience with similar libraries for Apple platforms, and even wrote my own once, but that's another story...)

If you use an arbitrary-function-hooking library, you would still LD_PRELOAD your library, but instead of overriding any symbols, you would define a global constructor function which initializes malloc then does the hooking. Constructors get called by the dynamic linker before main, so there shouldn't be issues with multiple threads existing at that point.

I was hesitant to recommend this approach in my last post because it will never work with static binaries like Go's compiler. But if Go is not a priority to you, then it's probably the easiest path forward.

* Depending on configuration, the dynamic linker can either bind all relocations when the image is loaded, or do it lazily when each symbol is first called, but it doesn't really make a difference in this scenario.

I am not hearing a solution where I can get my LD_PRELOAD library to work well with the default compilation of rustc. Or maybe I am being thick and missing a solution you have proposed. Am I?

Another option is to keep interposing readlink like you currently are, but ensure that your implementation (a) avoids all operations that allocate memory, and (b) forwards to a direct system call implementation (e.g. this crate), rather than using dlsym with RTLD_NEXT to find libc's implementation.

Or, at least, ensure that readlink does these things when it's called from malloc initialization; there could be a different code path for all other calls. Possible approaches to determine whether it's being called from malloc initialization:

  • Just compare the filename to /etc/malloc.conf (silly and reliant on implementation details but gets the job done).
  • Use a global "is malloc definitely initialized" flag. The flag would start out as false. In a global constructor, first initialize malloc, then set the flag to true.
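A minimal sketch of this approach, combining the direct system call with the "is malloc definitely initialized" flag. The names MALLOC_READY, raw_readlink, readlink_hook and mark_malloc_ready are made up for illustration, and the syscall numbers are hardcoded for two architectures as an assumption (a syscall crate or the libc crate would normally supply them). In the real cdylib, readlink_hook would be exported as readlink with #[no_mangle], and mark_malloc_ready would be called from a global constructor.

```rust
use std::os::raw::{c_char, c_int, c_long, c_void};
use std::sync::atomic::{AtomicBool, Ordering};

// Assumption: Linux on one of these two architectures. These are the
// kernel's readlinkat(2) syscall numbers.
#[cfg(target_arch = "x86_64")]
const SYS_READLINKAT: c_long = 267;
#[cfg(target_arch = "aarch64")]
const SYS_READLINKAT: c_long = 78;
const AT_FDCWD: c_int = -100;

extern "C" {
    // glibc's generic syscall wrapper; it does not allocate.
    fn syscall(number: c_long, ...) -> c_long;
    fn malloc(size: usize) -> *mut c_void;
    fn free(ptr: *mut c_void);
}

/// False until a constructor has forced malloc initialization.
static MALLOC_READY: AtomicBool = AtomicBool::new(false);

/// readlink built directly on the readlinkat system call: no dlsym, no heap.
pub unsafe fn raw_readlink(path: *const c_char, buf: *mut c_char, bufsiz: usize) -> isize {
    syscall(SYS_READLINKAT, AT_FDCWD, path, buf, bufsiz) as isize
}

/// Body of the interposed readlink.
pub unsafe fn readlink_hook(path: *const c_char, buf: *mut c_char, bufsiz: usize) -> isize {
    if !MALLOC_READY.load(Ordering::Acquire) {
        // Possibly called from inside malloc initialization: do nothing
        // that could allocate or take a lock.
        return raw_readlink(path, buf, bufsiz);
    }
    // Normal path: dependency tracking would go here before forwarding.
    raw_readlink(path, buf, bufsiz)
}

/// Intended to be called once from a global constructor.
pub unsafe fn mark_malloc_ready() {
    // A throwaway allocation forces the process allocator to initialize.
    // black_box keeps the compiler from eliding the malloc/free pair.
    let p = std::hint::black_box(malloc(1));
    free(p);
    MALLOC_READY.store(true, Ordering::Release);
}
```

Because the early path never touches dlsym, this variant gives up calling other LD_PRELOADed hooks of readlink, which is the trade-off described above.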

Would the rust compiler team be open to making the following the default standard for how rustc is compiled?

I think the main reason for wanting unprefixed jemalloc is to make LLVM use it, since LLVM is written in C++ and doesn't know about Rust's #[global_allocator] mechanism. Looking at LLVM's source, I don't see any support for making it use something other than the system allocator, but I could be missing something.

Is there a non-system allocator that I can quickly try?

I guess wee_alloc.

Which makes me wonder if there are multiple threads involved and if each thread does its own jemalloc init, [..]

GDB is your friend here.

So, based on what I understood above, I tried to write the simplest possible code, which does only two things in the intercept hook function:

  1. call dlsym() to get the next readlink
  2. call it.

That hangs because dlsym calls _dlerror_run, which calls calloc from jemalloc. Aaargh!! Why does dlsym() call _dlerror_run()? If only I could do the dlsym lookup without going through jemalloc's calloc.

(gdb) bt
#0  0x00007f859d71b8dd in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f859d714af9 in pthread_mutex_lock () from /lib64/libpthread.so.0
#2  0x000055d796db3e2d in malloc_mutex_lock_final (mutex=0x55d796fd41d8 <init_lock>) at ../jemalloc/include/jemalloc/internal/mutex.h:141
#3  _rjem_je_malloc_mutex_lock_slow (mutex=0x55d796fd41d8 <init_lock>) at ../jemalloc/src/mutex.c:84
#4  0x000055d796d886c5 in malloc_mutex_lock (tsdn=0x0, mutex=<optimized out>) at ../jemalloc/include/jemalloc/internal/mutex.h:205
#5  malloc_init_hard () at ../jemalloc/src/jemalloc.c:1506
#6  malloc_init () at ../jemalloc/src/jemalloc.c:217
#7  imalloc (sopts=<optimized out>, dopts=<optimized out>) at ../jemalloc/src/jemalloc.c:1986
#8  calloc (num=1, size=32) at ../jemalloc/src/jemalloc.c:2138
#9  0x00007f859d5079e5 in _dlerror_run () from /lib64/libdl.so.2
#10 0x00007f859d507393 in dlsym () from /lib64/libdl.so.2
#11 0x00007f85a24a2b50 in readlink::readlink [readlink] (path=0x55d796dc2d2d "/etc/malloc.conf", buf=0x7ffe3ebf0f40 "", bufsiz=4096) at src/lib.rs:23
#12 0x000055d796d8d962 in malloc_conf_init () at ../jemalloc/src/jemalloc.c:913
#13 malloc_init_hard_a0_locked () at ../jemalloc/src/jemalloc.c:1281
#14 0x000055d796d87428 in malloc_init_hard () at ../jemalloc/src/jemalloc.c:1517
#15 malloc_init () at ../jemalloc/src/jemalloc.c:217
#16 imalloc (sopts=<optimized out>, dopts=<optimized out>) at ../jemalloc/src/jemalloc.c:1986
#17 malloc (size=size@entry=72704) at ../jemalloc/src/jemalloc.c:2038
#18 0x00007f8599e30310 in (anonymous namespace)::pool::pool (this=0x7f859ccccaa0 <(anonymous namespace)::emergency_pool>)
    at ../../../../gcc-5.5.0/libstdc++-v3/libsupc++/eh_alloc.cc:117
#19 __static_initialization_and_destruction_0 (__priority=65535, __initialize_p=1) at ../../../../gcc-5.5.0/libstdc++-v3/libsupc++/eh_alloc.cc:244
#20 _GLOBAL__sub_I_eh_alloc.cc(void) [_GLOBAL__sub_I_eh_al...] () at ../../../../gcc-5.5.0/libstdc++-v3/libsupc++/eh_alloc.cc:307
#21 0x00007f85a26b4d0a in call_init.part () from /lib64/ld-linux-x86-64.so.2
#22 0x00007f85a26b4e0a in _dl_init () from /lib64/ld-linux-x86-64.so.2
#23 0x00007f85a26a606a in _dl_start_user () from /lib64/ld-linux-x86-64.so.2
#24 0x0000000000000001 in ?? ()
#25 0x00007ffe3ebf3dbe in ?? ()
#26 0x0000000000000000 in ?? ()

Here is the code:

extern crate core;
extern crate libc;

use std::alloc::System;

#[global_allocator]
static GLOBAL: System = System;

use libc::{c_void,c_char,c_int,size_t,ssize_t};

#[link(name = "dl")]
extern "C" {
    fn dlsym(handle: *const c_void, symbol: *const c_char) -> *const c_void;
}

const RTLD_NEXT: *const c_void = -1isize as *const c_void;

// Signature of libc's readlink wrapper.
type Readlinkptr = unsafe extern "C" fn (path: *const c_char, buf: *mut c_char, bufsiz: size_t) -> ssize_t;


#[no_mangle]
pub unsafe extern "C" fn readlink(path: *const c_char, buf: *mut c_char, bufsiz: size_t) -> ssize_t {
    // Look up the next readlink in the symbol search order. This dlsym call is frame #10 in the backtrace above.
    let readlinkptr: Readlinkptr = ::std::mem::transmute(dlsym(RTLD_NEXT, (concat!("readlink", "\0")).as_ptr() as *const c_char));
    readlinkptr(path, buf, bufsiz)
}

I tried making wee_alloc the global allocator, but that doesn't affect dlsym/_dlerror_run, which still call jemalloc's calloc, probably because dlsym comes from libdl, which also relies on jemalloc for calloc().

So I was wondering whether I could write just the library constructor and the intercept functions in C, mixing C code into the library. The C code would do the dlsym lookup, check whether jemalloc is initialized, and decide on that basis whether to call the original function. But I suspect that even there the dlsym lookup in libdl.so will call dlerror, which will call calloc from jemalloc, causing recursion.

So I started thinking about what I can do to make my LD_PRELOAD library fully self-sufficient, so that it does not go outside the library for anything until it finally calls the original function from another library. If my LD_PRELOAD library cannot override a statically linked main program, can the same static linking approach keep the main rustc program from disrupting my LD_PRELOAD library?

Is there a way to make my LD_PRELOAD library fully self-sufficient by statically linking everything it needs into libreadlink.so? To start with, this would mean dlsym/dlerror and maybe even the system malloc (not jemalloc). Is this possible? Can I create a shared library that is fully self-contained and doesn't need any symbol resolution outside of the shared library? Basically, statically link libc and use https://gcc.gnu.org/legacy-ml/gcc-help/2003-08/msg00128.html to eliminate all dead code and unreferenced symbols. I am hoping that only what is needed by my LD_PRELOAD library, including the global system allocator established for it, will get statically linked into the shared library, and nothing else.

Also, would that mean dlsym/dlerror/malloc/calloc, etc. for the rest of the main program (rustc) will now come from my LD_PRELOAD library? Or can I make them private to my LD_PRELOAD library, exporting only the readlink that I want to intercept?

Will this help in this situation ?

Right.

Yeah, you'll have the same issue whether your hook is written in Rust or C.

Statically linking libc might be an option, but there's an issue: apparently, having two copies of libc in the same process can cause problems because both think they own %fs and %gs (the segment registers used for thread-local storage).

It also wouldn't help you as much as you think, because you can't use a separate implementation of dlsym. dlsym is not a syscall or anything; it's asking libdl to look through its internal record of what images are loaded at what addresses, and search all images for a particular symbol name. Even if you reimplemented the algorithm to search through images, you don't have the list of loaded libraries. I suppose you could try to recreate that list, but that's definitely something you'd have to reimplement from scratch; you couldn't just rely on some existing implementation of dlsym.

However, keep in mind that dlsym is only necessary if you want to make sure you're calling other hooks of the same function from other LD_PRELOADed libraries. If you don't care about that, you can just perform the syscall yourself, which is one of the approaches I suggested in my last post.

By the way, I was looking at the code in glibc that actually performs that allocation:

      /* We don't use the static buffer and so we have a key.  Use it
         to get the thread-specific buffer.  */
      result = __libc_getspecific (key);
      if (result == NULL)
        {
          result = (struct dl_action_result *) calloc (1, sizeof (*result));
          if (result == NULL)
            /* We are out of memory.  Since this is no really critical
               situation we carry on by using the global variable.
               This might lead to conflicts between the threads but
               they soon all will have memory problems.  */
            result = &last_result;
          else
            /* Set the tsd.  */
            __libc_setspecific (key, result);
        }

You might be able to get away with hooking calloc yourself, somehow detecting when this is coming from dlsym during malloc initialization, and in that case failing the allocation by returning NULL. Then you'll trigger the "out of memory" code, which risks "conflicts between the threads", but as long as (a) this only gets triggered during malloc initialization and (b) you ensure malloc initialization happens from some global constructor function, before any threads are created… it should be okay.

But I haven't tried it, and I'm not sure whether dlsym does any other allocations.

Alternately, you could return an actual buffer rather than NULL, but in that case you'd need to hook free as well, since glibc will try to free this buffer when the thread exits...
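The first variant (failing the allocation while dlsym runs) could look roughly like the sketch below. It only covers the decision logic: the flag IN_BOOTSTRAP_DLSYM and the helpers during_bootstrap_dlsym and calloc_hook are hypothetical names, the real calloc is passed in as a closure to keep the example self-contained, and how the hook gets installed in front of rustc's own calloc is left open. The flag is process-wide, so this assumes the lookup happens before any other threads exist.

```rust
use std::os::raw::c_void;
use std::ptr;
use std::sync::atomic::{AtomicBool, Ordering};

/// True only while the interposer is inside its own bootstrap dlsym call.
static IN_BOOTSTRAP_DLSYM: AtomicBool = AtomicBool::new(false);

/// Runs `lookup` (the code that calls dlsym) with the flag raised.
pub fn during_bootstrap_dlsym<T>(lookup: impl FnOnce() -> T) -> T {
    IN_BOOTSTRAP_DLSYM.store(true, Ordering::Release);
    let result = lookup();
    IN_BOOTSTRAP_DLSYM.store(false, Ordering::Release);
    result
}

/// Body of a calloc hook. `forward` stands in for the real calloc.
pub unsafe fn calloc_hook<F>(nmemb: usize, size: usize, forward: F) -> *mut c_void
where
    F: FnOnce(usize, usize) -> *mut c_void,
{
    if IN_BOOTSTRAP_DLSYM.load(Ordering::Acquire) {
        // Pretend to be out of memory so that _dlerror_run falls back to
        // its static last_result buffer instead of allocating.
        return ptr::null_mut();
    }
    forward(nmemb, size)
}
```

The readlink hook would wrap its dlsym call in during_bootstrap_dlsym, so only that one calloc is refused and every other allocation is forwarded untouched.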

Thanks for taking the time to explain this to me. This is encouraging to hear.

So what I am hearing is that if I intercept and serve malloc, calloc and free, at a minimum for the duration of the LD_PRELOAD dlsym lookup and jemalloc initialization, I could do a clean interception of readlink and open while still respecting other LD_PRELOADs.

Do I need to use #![no_std], or can I do without it?

And establish wee_alloc as the global allocator for this ld_preload library as follows.

extern crate alloc;
extern crate wee_alloc;

// Use `wee_alloc` as the global allocator.
#[global_allocator]
static ALLOC: wee_alloc::WeeAlloc = wee_alloc::WeeAlloc::INIT;

But this would still mean that dlsym and all std:: and libc code that calls malloc will still go to jemalloc.

So I establish my intercepts for malloc/calloc/realloc/free, and serve memory using memory allocated through alloc::{alloc, realloc, dealloc} functions. The understanding is that calling alloc(), dealloc(), after establishing wee_alloc as the global allocator, would not call jemalloc's malloc/free. Correct?

Something along the lines of the code below?

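// Sketch only: orig_malloc_known, orig_malloc, orig_free, allocated_from_wee_alloc
// and getlayoutfromptr are placeholders that still have to be written.
#[no_mangle]
pub unsafe extern "C" fn malloc(bufsiz: size_t) -> *mut c_void {
    if !orig_malloc_known {
        // The alignment must be a non-zero power of two (16 is what malloc gives on x86_64),
        // and alloc() must not be called with a zero-sized layout.
        let layout = Layout::from_size_align(bufsiz.max(1), 16).unwrap();
        alloc(layout) as *mut c_void
    } else {
        orig_malloc(bufsiz)
    }
}

#[no_mangle]
pub unsafe extern "C" fn free(ptr: *mut c_void) {
   if allocated_from_wee_alloc(ptr) {
        let layout = getlayoutfromptr(ptr);
        dealloc(ptr as *mut u8, layout)
   } else {
        orig_free(ptr)
   }
}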

Am I missing anything here ? Will give this a try.

As you saw, you cannot call dlsym, or in general any glibc function other than readlink that may take locks directly or indirectly, from within your readlink override, since that can result in a deadlock.

I think the best solution is to call dlsym in a global C++ initializer (e.g. using the rust-ctor crate) and store the result in a static mut. If your readlink gets called before the variable is initialized (not sure if that's possible), then make a direct system call to readlink instead of calling the next symbol (you could arrange this by having the static mut's initial value be the address of a function that does the readlink system call).
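A rough sketch of that arrangement, under a few assumptions: instead of the rust-ctor crate it registers the initializer through an .init_array entry (an ELF/Linux mechanism), it uses an atomic integer rather than a static mut, and it signals "not resolved yet" with None so the caller can take the direct-syscall path. The names NEXT_READLINK, resolve_hooks and next_readlink are illustrative.

```rust
use std::os::raw::{c_char, c_void};
use std::sync::atomic::{AtomicUsize, Ordering};

type ReadlinkFn = unsafe extern "C" fn(*const c_char, *mut c_char, usize) -> isize;

extern "C" {
    // Provided by libdl (or by libc itself on newer glibc versions).
    fn dlsym(handle: *mut c_void, symbol: *const c_char) -> *mut c_void;
}

// glibc's value for RTLD_NEXT.
const RTLD_NEXT: *mut c_void = -1isize as *mut c_void;

/// Address of the next readlink, or 0 until the initializer has run.
static NEXT_READLINK: AtomicUsize = AtomicUsize::new(0);

/// Does the one dlsym lookup up front, outside of any hook.
extern "C" fn resolve_hooks() {
    let sym = unsafe { dlsym(RTLD_NEXT, b"readlink\0".as_ptr() as *const c_char) };
    NEXT_READLINK.store(sym as usize, Ordering::Release);
}

// Placing a function pointer in .init_array makes the dynamic linker call
// it when the library is loaded, before main.
#[used]
#[link_section = ".init_array"]
static RESOLVE_HOOKS: extern "C" fn() = resolve_hooks;

/// Some(next readlink) once resolved; None means the caller should fall
/// back to a direct system call.
pub fn next_readlink() -> Option<ReadlinkFn> {
    match NEXT_READLINK.load(Ordering::Acquire) {
        0 => None,
        addr => Some(unsafe { std::mem::transmute::<usize, ReadlinkFn>(addr) }),
    }
}
```

With this shape the readlink override itself never calls dlsym, so it cannot re-enter the allocator through _dlerror_run.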

If you absolutely want to always call the next hook, then intercepting the memory allocator as you described should work as well (it's a bit invasive though and might cause issues with programs using other custom mallocs).

Yeah, I think that should work. Make sure to be careful about performance: free is very hot, so however you implement allocated_from_wee_alloc, it needs to be very cheap or it will significantly slow down the program being hooked.
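One way to keep the allocated_from_wee_alloc check cheap is to serve the bootstrap allocations from a single fixed region, so that ownership is one subtraction and one comparison. The sketch below swaps wee_alloc for a static bump arena purely for illustration; ARENA_SIZE, arena_alloc and arena_owns are made-up names, and memory handed out this way is never reclaimed.

```rust
use std::cell::UnsafeCell;
use std::ptr;
use std::sync::atomic::{AtomicUsize, Ordering};

const ARENA_SIZE: usize = 64 * 1024;
const ALIGN: usize = 16;

#[repr(align(16))]
struct Arena(UnsafeCell<[u8; ARENA_SIZE]>);
// Safe to share: each byte range is handed out at most once.
unsafe impl Sync for Arena {}

// Zero-initialized, so the same memory can also back calloc.
static ARENA: Arena = Arena(UnsafeCell::new([0; ARENA_SIZE]));
static NEXT: AtomicUsize = AtomicUsize::new(0);

/// Bump-allocates `size` bytes, 16-byte aligned; null when the arena is full.
pub fn arena_alloc(size: usize) -> *mut u8 {
    // Round up to the alignment; a zero-byte request still takes one slot.
    let size = match size.max(1).checked_add(ALIGN - 1) {
        Some(s) => s & !(ALIGN - 1),
        None => return ptr::null_mut(),
    };
    let mut cur = NEXT.load(Ordering::Relaxed);
    loop {
        let end = match cur.checked_add(size) {
            Some(e) if e <= ARENA_SIZE => e,
            _ => return ptr::null_mut(),
        };
        match NEXT.compare_exchange_weak(cur, end, Ordering::AcqRel, Ordering::Relaxed) {
            Ok(_) => return unsafe { (ARENA.0.get() as *mut u8).add(cur) },
            Err(seen) => cur = seen,
        }
    }
}

/// The hot-path check for free(): is `ptr` inside the arena?
pub fn arena_owns(ptr: *const u8) -> bool {
    let base = ARENA.0.get() as usize;
    // Addresses below `base` wrap around to huge values and fail the test.
    (ptr as usize).wrapping_sub(base) < ARENA_SIZE
}
```

A free hook built on this would return immediately when arena_owns is true and forward to the original free otherwise, so the common case costs only the range check.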

You may find that some of your deadlock issues go away if you require use of eager binding with your LD_PRELOAD library. That is, one must always set LD_BIND_NOW=t along with LD_PRELOAD=interceptor.so. What this does is tell the dynamic linker to do all the symbol resolution for each shared library up front, when that shared library is loaded. This means you can't find yourself recursing into the dynamic linker when you call some random C library function from the interceptor, because the symbol was already looked up before any of your code was run.

Calls to dlsym are still dangerous, though. You could take the eager-binding concept one step further and have your library define an ELF global constructor function that calls dlsym for everything you will ever need to call dlsym for, and saves the pointers; that might help. I think global constructors are guaranteed to be called without any internal locks held by the C library, at least for glibc.