pre-RFC: life-before-main / lib init


#1

The idea is to keep this simple, deterministic and optional. Is it sufficiently useful to be worth adding to Rust?


One attribute is added: #[init]. (There could instead be two, one for marking functions and one for use with extern.)

Motivation: make the env_logger crate easier to use, especially in “integration” test suites. It could also provide a work-around for producing complex constants at run-time instead of compile-time where run-time function evaluation is needed.

Variations: a similar strategy for de-init/clean-up could be used, but I’m not sure what the real use-cases are.

Caveats: this approach cannot initialise a logger for in-code test functions without also initialising it for other uses. Maybe an option like #[init(tests_only)] extern crate ...; is needed (which could also be used in libraries); alternatively the entire dependency may be optional (e.g. #[cfg(test)] #[init] extern crate env_logger;).


Function marker: the #[init] attribute can be used to mark special functions. This can only be used on public (externally visible) library functions, which guarantees that the functions can be called manually instead if preferred. Example:

// in a 'greeter' lib:
#[init]
pub fn greet() {
    println!("Hello, world");
}

Extern marker: when an external dependency is declared via extern, the #[init] attribute can be used to call all of its init-attributed functions. There is no transitive initialisation: libraries cannot use #[init] on extern statements (this may mean executables have to add extra extern statements). Example:

// in an executable:
#[init]
extern crate greeter;

Deterministic effects: when an ‘extern’ library is declared with ‘init’, its init-attributed functions are called in the order they are found, e.g. the order init1, init2, init3 below:

// a library

#[init]
pub fn init1() {}

pub mod X {
    #[init]
    pub fn init2() {}
}

#[init]
pub fn init3() {}

#[init] can only be used on extern statements within one module (the executable’s root module); if more than one of these uses #[init] they are initialised in the order encountered. This happens immediately before main gets called or the first #[test] function is run. Example:

// executable
#[init] extern crate lib1;
#[init] extern crate lib2;

// runs immediately after lib2's last #[init] function
fn main() {}

If a function is called via #[init] and is also called directly (e.g. from main), it runs two or more times; if this is undesirable, the user should avoid doing both.
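If the double call is a concern, the init function itself can be made idempotent. The following is a minimal sketch using std::sync::Once; it is not part of the proposal, and the RUNS counter and greet_runs helper are hypothetical names added only to make the behaviour observable.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Once;

// Counts how often the guarded body actually ran (illustration only).
static RUNS: AtomicUsize = AtomicUsize::new(0);
// Guard ensuring the body of `greet` executes at most once per process.
static GREET_ONCE: Once = Once::new();

/// Init function that is safe to call any number of times,
/// whether from an init hook, from main, or both.
pub fn greet() {
    GREET_ONCE.call_once(|| {
        RUNS.fetch_add(1, Ordering::SeqCst);
        println!("Hello, world");
    });
}

/// Number of times the guarded body has executed.
pub fn greet_runs() -> usize {
    RUNS.load(Ordering::SeqCst)
}
```

With this guard, calling greet() a second time is a no-op, so it no longer matters whether the function is triggered automatically, manually, or both.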


#2

Here’s the old FAQ entry on why this doesn’t exist. Can you explain why lazy_static is not sufficient for your use case?

If your main motivation is tests, why not add a way to run an init function before tests? This is something I’ve wanted as well from time to time, although lazy_static seems to solve most of those cases too.


#3

There is one situation not covered by lazy_static! where having a function run before main is useful: if you are building a shared library that is intended to be injected into a process using LD_PRELOAD. In that case it might be useful to run some startup code before the application starts.


#4

Even there I think it would be mostly fine to do any initialization when your module is first called. The one case where that might not be enough is if you want to load other libraries with dlopen before the dynamic linker continues. I’m not sure this is a good justification for adding this language feature at this time though; it seems like a very narrow use case.

Here’s a workaround btw:

Put this in your crate:

#[no_mangle]
pub extern "C" fn rust_init() { /*...*/ }

Put this in a C file and include it via a build script:

extern void rust_init(void);
__attribute__((constructor)) static void init(void)
{
        rust_init();
}

#5

Thanks for the feedback.

lazy_static is pretty useless for initialising a logger (see log and env_logger crates: the logger doesn’t initialise itself because it doesn’t know how it should be configured).

I know life-before-main is problematic in C++. My proposal fixes some of the problems: initialisation order is defined (deterministic), and nothing weird happens if exec X depends on A and B, which both depend on C which must have stuff initialised (see above: nothing transitive; X must init C directly). My proposal also ensures life-before-main is optional by allowing manual initialisation instead.

This proposal doesn’t try to address uses of globals before initialisation. This could be (partially) tackled with lifetime analysis but I’m not sure this is warranted. Note that this “problem” already exists in Rust: e.g. if you log a message before initialising the logger, the message is simply lost. This isn’t a big problem as long as things like Option are used instead of blindly dereferencing pointers.
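The "message is simply lost" behaviour can be sketched with a global slot that stays None until initialisation. This is only an illustration under assumed names (LOGGER, init_logger, log, logged); it is not how the log crate is implemented, and the in-memory Vec stands in for a real sink.

```rust
use std::sync::Mutex;

// Hypothetical global logger slot: `None` until `init_logger` is called.
static LOGGER: Mutex<Option<Vec<String>>> = Mutex::new(None);

/// Installs an (in-memory) sink; before this, messages are dropped.
pub fn init_logger() {
    *LOGGER.lock().unwrap() = Some(Vec::new());
}

/// Returns true if the message was recorded, false if it was dropped
/// because no logger has been initialised yet.
pub fn log(msg: &str) -> bool {
    let mut slot = LOGGER.lock().unwrap();
    if let Some(sink) = slot.as_mut() {
        sink.push(msg.to_string());
        return true;
    }
    false
}

/// Snapshot of everything recorded so far (empty if uninitialised).
pub fn logged() -> Vec<String> {
    LOGGER.lock().unwrap().clone().unwrap_or_default()
}
```

Because the slot is an Option, logging too early is a silent no-op rather than a dereference of something uninitialised.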

But yes, a custom test driver allowing init-before-tests would solve my main motivation; the feature may not actually be needed.


#6

I think this kind of workaround would be needed even with my proposal, since it doesn’t allow initialisation in libraries.


#7

Life-before-main is abused a lot in C++ and Go. I’m pretty wary of it.


#8

Can you elaborate? What in particular are you able to do from pre-main initializer function that you couldn’t do from a lazy_static initializer?


#9

I think it goes something like this: your program Foo depends on library Bar, which depends on library Logger. Foo does not depend directly on Logger. Logger’s functionality requires some sort of global state initialization. Therefore, before Bar can use Logger it needs to be initialized. I think in current Rust you have two options: 1) expose a Bar::init function that must be called by Foo before using anything else in Bar. 2) Access/initialize Logger in Bar through some lazy_static, but this would “contaminate” all usage sites.

However, I think it’s unlikely that Bar would know how Logger should be configured.


#10

Imho, the best design for such problems is indeed to provide a way to construct Bar and pass it the required dependencies (see also the DI pattern).

Having global state and tight coupling with a specific implementation are not nice. What if the user of Bar wants to log using OtherLogger™ instead?
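A minimal sketch of that dependency-injection shape, assuming a hypothetical Logger trait and an in-memory MemoryLogger (neither comes from the log crate): Bar receives its logger at construction, so no global state or pre-main initialisation is involved, and the caller picks the implementation.

```rust
/// Hypothetical logging interface that `Bar` depends on.
pub trait Logger {
    fn log(&mut self, msg: &str);
}

/// One possible implementation; a user could supply any other.
pub struct MemoryLogger {
    pub lines: Vec<String>,
}

impl Logger for MemoryLogger {
    fn log(&mut self, msg: &str) {
        self.lines.push(msg.to_string());
    }
}

/// `Bar` owns whatever logger it was constructed with.
pub struct Bar<L: Logger> {
    logger: L,
}

impl<L: Logger> Bar<L> {
    /// The dependency is passed in explicitly by the caller (e.g. Foo).
    pub fn new(logger: L) -> Self {
        Bar { logger }
    }

    pub fn do_work(&mut self) {
        self.logger.log("Bar did some work");
    }

    /// Hands the logger back, e.g. for inspection.
    pub fn into_logger(self) -> L {
        self.logger
    }
}
```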


#11

Sure, this is a dilemma - but how would life-before-main help? If Bar or Logger used an #[init] function (as proposed here), that wouldn’t somehow help it know how it should be configured.

Is the issue the performance of lazy_static? Ideally in the fast path it should be a single (non-atomic) load plus a predictable branch, which might still be too slow in some extreme cases but is unlikely to show up here. Actually, looking at the logger implementation, it already does considerably more expensive stuff (an atomic increment) on every log attempt, regardless of whether logging is enabled.


#12

Well, if Foo uses Bar which uses Logger, and Logger needs initialisation… then there are three ways of doing this:

  1. Foo calls some init function on Bar which calls an init function in Logger
  2. Logger uses lazy_static to initialise itself
  3. Foo calls an init function in Logger directly

Option 2 would work fine if no parameters need to be passed but isn’t much good if it needs configuration from anywhere other than environment variables. The log crate BTW doesn’t configure where it logs to, but relies on other crates like env_logger to control where it logs, and these may require run-time variables to be passed (e.g. name of a log file). So option 2 is not always a good choice.
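For reference, option 2 looks roughly like the sketch below. It uses std::sync::OnceLock as a standard-library stand-in for lazy_static, and the environment variable name GREETER_LOG_FILE and the "stderr" fallback are made up for illustration. Note that the initialiser takes no arguments, which is exactly the limitation described above.

```rust
use std::env;
use std::sync::OnceLock;

// Lazily computed on first use; there is no way to pass arguments in.
static LOG_TARGET: OnceLock<String> = OnceLock::new();

/// Returns the log target, initialising it on the first call from a
/// hypothetical environment variable with a fallback default.
pub fn log_target() -> &'static str {
    LOG_TARGET
        .get_or_init(|| {
            env::var("GREETER_LOG_FILE").unwrap_or_else(|_| "stderr".to_string())
        })
        .as_str()
}
```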

Option 1 works in the case above. But imagine Foo also uses another library, Baz, and that Baz also uses Logger. If Foo calls Bar’s init and Baz’s init and both of these try to initialise Logger, then Logger gets initialised twice. Of course it could handle this gracefully, but then if Bar and Baz try to initialise it in different ways it’s not clear what will happen. Or Foo could not call Baz’s init, but then what if Baz needs to initialise other things? So option 1 is not ideal.
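The conflict in option 1 can be sketched as follows, assuming Logger handles double initialisation "gracefully" by keeping the first configuration. All names here (logger_init, bar_init, baz_init, the file names) are hypothetical.

```rust
use std::sync::OnceLock;

// Logger's global configuration; can be set only once.
static LOG_TARGET: OnceLock<String> = OnceLock::new();

/// Logger's init: the first caller wins, later callers get their
/// rejected value back as an error.
pub fn logger_init(target: &str) -> Result<(), String> {
    LOG_TARGET.set(target.to_string())
}

/// The configuration that actually took effect, if any.
pub fn logger_target() -> Option<&'static str> {
    LOG_TARGET.get().map(|s| s.as_str())
}

/// Bar's init configures Logger its own way...
pub fn bar_init() -> bool {
    logger_init("bar.log").is_ok()
}

/// ...and so does Baz's, with a different configuration.
pub fn baz_init() -> bool {
    logger_init("baz.log").is_ok()
}
```

If Foo calls bar_init() and then baz_init(), the second call is rejected and the effective configuration depends purely on call order, which neither library can know about.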

This leaves option 3. Option 3 requires a little more work from the end user, but avoids the above issues so may well be the best option. That’s why I chose this solution for the RFC above.


#13

Take a look at the log crate. The simple logger here doesn’t need any parameters, but imagine a variant logging to a file. Where is that file name going to come from? It could be hard-coded, it could come from an environment variable, or it could be determined dynamically, e.g. by looking for other log files in a few standard locations or by reading a configuration file.


#14

I think you’ve forgotten to include some information in your RFC. You are describing a feature in which my code can pass dynamic arguments to an “init” function from a lib, but your RFC shows init functions being triggered by adding an #[init] tag to an extern crate declaration. It’s not at all clear to me how I would pass arguments to an init function in logger.

I think everyone understands the motivation you are describing, but not how your RFC achieves it.


#15

I’m very strongly against the idea of declaring a guarantee that these functions will be called in the ‘source order’ of your code. Source order is semantically significant in macro_rules! macros, and this is considered one of their most damning flaws, which is why they will some day be deprecated. This makes far too many changes that should be totally irrelevant ‘breaking changes,’ because they could reorder init functions.


#16

Thanks withoutboats, that’s the kind of feedback needed.

I see your point about not wanting to depend on source order. The simplest (and probably best) way out of this is to only allow one init function per crate (which also makes arg-passing easier to handle).

About parameter passing: you’re spot on. Constants could perhaps be passed as arguments before main, but this is limited. I would not want to derive dynamic arguments before main. This only leaves one possibly useful option: arguments which can be passed statically (something clumsy like #[init(args=("path/to/log_file.log"))]) or dynamically (by calling the init function from main or later instead of using #[init]).

At this point, init-before-main seems a little less useful, since it just moves a function call from main to an attribute on an extern crate ... declaration. The exception is test drivers, where we don’t (currently) have a main.


#17

I think maybe there’s a bit of an XY problem here. You’ve identified a set of related problems (lazy static initializers can’t really be passed arguments, we can’t define an initializing set up function for test runner code right now) and proposed life before main as the solution. I think there are probably other more targeted solutions (e.g. some sort of tag for test set up functions for the test runner) that don’t introduce as many problems as life before main would.


#18

I implemented this pre-RFC in a crate: https://crates.io/crates/init

I added the restriction of one init function per crate.

Any libraries just need

#![feature(proc_macro)]

extern crate init;

use init::init;

#[init]
fn init() {
  // [...]
}

Binaries can have their own init and can add any library inits

#![feature(proc_macro)]

extern crate init;

// Pretend nom has an init
#[init]
extern crate nom;

use init::init;

#[init]
fn init() {
  // [...]
}

fn main() {
}

Binaries additionally need a build.rs file. Note that this part could have been done a few other ways, but since it wouldn’t need to exist if this RFC is adopted, I figured it doesn’t matter much how it’s done in this crate.

extern crate init;

fn main() {
    init::build();
}