[Discussion] Adding an `init` function that could modify static variables directly

Currently, static variables are immutable, and using static mut variables is unsafe. If we want to compute the value of a static variable at runtime, we cannot declare it as a plain static.

Currently, OnceCell is used to initialize static variables at runtime, but accessing a OnceCell requires an additional check each time, which can be noisy and needs extra care (why not just use static mut instead?).
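
For reference, here is a minimal sketch of how a runtime-initialized static is usually written today, using std::sync::OnceLock (the thread-safe counterpart of OnceCell in std). The names GREETING and greeting() are made up for this illustration; the point is only that every access goes through get_or_init, which is the check being discussed.

```rust
use std::sync::OnceLock;

// Hypothetical static for illustration: a String that cannot be built in a const context.
static GREETING: OnceLock<String> = OnceLock::new();

fn greeting() -> &'static str {
    // The closure runs at most once; every later call only checks
    // whether the cell is already initialized and returns the stored value.
    GREETING.get_or_init(|| {
        let mut s = String::from("init");
        s += "sugar";
        s
    })
}

fn main() {
    println!("{}", greeting());
    println!("{}", greeting()); // second call: no re-initialization
}
```

The read path is safe, but it is not free: each call to greeting() performs the "is it initialized yet?" check.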

Thus I want to discuss the possibility of a new syntax, #[init].

Grammar:

#[init]
use std::init; // marks `std::init` as an init function; a function may be marked several times (different crates may mark the same function), but each init function is executed only once.

#[init] // this init function executes after all the other #[init] procedures it marks have finished.
fn init() {  // this should be a `fn()->()`
    static mut str1:String=String::new();
    let str2=String::from("init");
    static str3:String=str2.clone();
}
fn main(){
    println!("{}", unsafe {&str1}); // str1 is `static mut` since it is declared as such in the `init` function.
    // println!("{}", &str2); // str2 is not a static variable, thus cannot be accessed.
    println!("{}", &str3); // str3 is `static`, thus it can be accessed directly.
}

Syntax sugar:

// no need to write #[init] here.
// fn init() { // the definition of init can be omitted.
static mut str1:String=String::new();
let mut str2=String::from("init");
str2+="sugar";
static str3:String=str2.clone(); // str3 is "initsugar" now.
// }
fn main(){
    println!("{}", unsafe {&str1}); // str1 is `static mut` since it is declared as such in the `init` function.
    // println!("{}", &str2); // str2 is not a static variable, thus cannot be accessed.
    println!("{}", &str3); // str3 is `static`, thus it can be accessed directly.
}
#[allow(seperate_init)]
static str4:String=str2; // rustc recognizes this statement as part of #[init], so a warning should be generated.

In this case, fn main() could be omitted for really small examples:

// the following program could be recognized as part of the init procedure, so such a code fragment could compile normally.
let vec = vec![0; 5];
assert_eq!(vec, [0, 0, 0, 0, 0]);

// The following is equivalent, but potentially slower:
let mut vec = Vec::with_capacity(5);
vec.resize(5, 0);
assert_eq!(vec, [0, 0, 0, 0, 0]);

Are there any disadvantages?

Alternative to init - [Pre-RFC] safe Uninit Types

I don't have a citation handy, but Rust was specifically designed to have no “life before main”. I believe part of the motivation is the hazards to do with dynamically loaded code, which I do not fully understand, but there is also a software-engineering hazard in any program: in what order does this code run? And given that there must be some ordering, what happens if init code in one module or library calls code in another, and the latter assumes that its own initialization has already succeeded?

If you use lazy initialization using OnceCell, Lazy/LazyLock, or similar, then the answer is easy: it is implicitly executed as a traversal of the dependency graph (in the form of functions calling other functions), and if there is a cycle then you get some kind of failure (panic, deadlock, or maybe even stack overflow depending on the implementation). But with static initialization code with a static execution order, you end up observing an uninitialized state. That's bad because:

  • it might read uninitialized memory, which is UB, and
  • it means that you haven't achieved your goal of having an immutable static variable that was initialized with run-time code, because a mutation was observable.
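
A small sketch of the lazy case described above, assuming Rust 1.80 or later for std::sync::LazyLock. The static names are invented for the example. Forcing TOTAL_LEN forces HELLO and WORLD first, so the initialization order falls out of the call graph rather than the declaration order.

```rust
use std::sync::LazyLock;

// Declared first, but it is the last to *finish* initializing:
// its closure forces HELLO and WORLD on the way.
static TOTAL_LEN: LazyLock<usize> = LazyLock::new(|| HELLO.len() + WORLD.len());

static HELLO: LazyLock<String> = LazyLock::new(|| String::from("hello"));
static WORLD: LazyLock<String> = LazyLock::new(|| String::from("world"));

fn main() {
    // First access runs the closure of TOTAL_LEN, which in turn
    // initializes HELLO and WORLD on demand.
    println!("{}", *TOTAL_LEN);
}
```

If the closures referred to each other in a cycle, the failure would show up at run time when the cycle is first entered, never as a silently observed half-initialized value.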

I'm not suggesting a "strong alternative" that would wipe OnceCell out of std entirely. This is an option, and sometimes it is useful.

Currently, Rust does not allow cyclic dependencies between crates, so this situation cannot arise:

cargo add --path ../test1
      Adding test1 (local) to dependencies.
error: cyclic package dependency: package `test1 v0.1.0 (/me/test1)` depends on itself. Cycle:
package `test1 v0.1.0 (/me/test1)`
    ... which satisfies path dependency `test1` of package `test2 v0.1.0 (/me/test2)`
    ... which satisfies path dependency `test2` of package `test1 v0.1.0 (/me/test1)`

We could require every static variable to be initialized, since immutable statics require that. If a static variable needs to start out uninitialized, OnceCell is still usable.

Actually, the grammar I propose does not allow mutating any immutable static variable.

let mut val=...;
static var: Type = val;

All the modifications apply to mut val, not to static var, so no static variable is ever mutated.

When exactly should these #[init] functions be executed? In which order? How would this work in e.g. web assembly or other platforms that don't support executing code before main or when loaded?

Why are str1 and str3 defined inside init if they are accessed in main? The name scoping is wrong. And if they are defined outside init how can str3 access str2?


I don't know if this constitutes a pattern, but I've used something like this in a previous project:

// defined elsewhere
#[macro_export]
macro_rules! init_lock {
    () => {
        use std::sync::Mutex;
        static lock: Mutex<bool> = Mutex::new(false);
        let mut done = lock.lock().unwrap();
        if *done {
            return;
        }
        *done = true;
    };
}

mod tables {
    use crate::init_lock;

    pub const N: usize = 1000; // or whatever
    static mut TABLE: [[i32; N]; N] = [[0; N]; N];

    pub fn init() {
        init_lock!();
        unsafe {
            // initialize TABLE
            // init_lock ensures writers are synchronized
        }
    }

    pub fn table(x: usize, y: usize) -> i32 {
        // Safety: assume init.
        unsafe { TABLE[x][y] }
    }
}

There is a big lock around mutating the "constants" during initialization. No unsynchronized reads are done until after everything is initialized, but this is not enforced by the compiler. In particular, tests are able to all call init in parallel, although it is a little frustrating that every test must call init explicitly.
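
As a variation on the same idea (not the code from that project), the "run the initializer at most once" guard can also be sketched with std::sync::Once instead of a hand-rolled Mutex<bool>. The table size and the x * y contents are placeholders chosen only so the example has something to compute, and table() calls init() itself here so that the sketch is sound on its own.

```rust
use std::sync::Once;

const N: usize = 4; // placeholder size for the sketch
static INIT: Once = Once::new();
static mut TABLE: [[i32; N]; N] = [[0; N]; N];

pub fn init() {
    // call_once runs the closure at most once; concurrent callers block
    // until the first run has finished.
    INIT.call_once(|| {
        for x in 0..N {
            for y in 0..N {
                // Safety: only this closure writes to TABLE, and it runs once.
                unsafe { TABLE[x][y] = (x * y) as i32 };
            }
        }
    });
}

pub fn table(x: usize, y: usize) -> i32 {
    init(); // after the first call this is only a fast "already done" check
    // Safety: init() has completed, and nothing writes to TABLE afterwards.
    unsafe { TABLE[x][y] }
}

fn main() {
    println!("{}", table(2, 3));
}
```

Calling init() inside table() reintroduces a check on every read, which is exactly the cost the original pattern avoids by trusting callers to initialize first.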

In my experience of C++, people get into trouble with static initializers in one of two ways:

  1. The ordering they assume for multiple separate initializers (multiple #[init] blocks) is not the ordering the compiler gave them. In some cases (SIOF), the ordering is undefined, and their code simply cannot work.
  2. They assume that the code run before main is cheap, and thus that a slow startup can be diagnosed by starting at the first line of main and going from there (in a debugger, or with logging statements).

In theory, the first problem can be fixed by having the ordering be fully defined, and having the compiler issue an error if you attempt to use a static that's not yet been initialized, although this then requires the initializers to be more constrained than in your proposal so that the analysis is tractable.

The second problem is a hard problem to fix, because it's not about the language, it's about developer expectations. main is assumed to be where Rust starts executing code - but by adding this feature, it becomes a later execution point.

And note that, while this is a much harder problem, in theory the optimizer could learn to track the range of possible values of a variable at any point in the program. With this sort of program analysis, the compiler would then be able to prove that once a OnceCell is initialized, it never returns to the uninitialized state (since Once::call is the only thing that can change the state, and that always returns without doing further work if the state is COMPLETED), and thus as soon as it can show that the cell is initialized, it can remove further checks. This gets us the comprehensibility of OnceCell, with the performance of static constructors.


Note that if the goal is safe initialisation with no overhead on read access, it is possible with a zero-sized type. I have a proof of concept in this repository: https://gitlab.com/arnaudgolfouse/init_once

Moving the zero-sized type around may be annoying, but I believe it works.

This crate seems to use the same concept: https://crates.io/crates/init-token
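
To make the zero-sized-type idea concrete, here is a rough sketch of the general shape. It is not taken from either linked project; InitToken, init and VALUE are hypothetical names. The token is a zero-sized proof that initialization has happened, so reading through it needs no run-time check.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

static CLAIMED: AtomicBool = AtomicBool::new(false);
static mut VALUE: u64 = 0;

/// Zero-sized proof that `init` has completed. In a real crate this would
/// live in its own module so that outside code cannot construct it.
#[derive(Clone, Copy, Debug)]
pub struct InitToken(());

/// Initializes VALUE and hands out the token. Only the first call succeeds.
pub fn init(value: u64) -> Option<InitToken> {
    if CLAIMED.swap(true, Ordering::AcqRel) {
        return None; // somebody else already initialized (or is initializing)
    }
    // Safety: the swap above guarantees that exactly one caller gets here.
    unsafe { VALUE = value };
    Some(InitToken(()))
}

impl InitToken {
    /// Reading needs no check: holding a token implies the write is done.
    pub fn value(self) -> u64 {
        // Safety: a token only exists after the single write in `init`.
        unsafe { VALUE }
    }
}

fn main() {
    let token = init(42).expect("init was already called");
    println!("{}", token.value());
    assert!(init(7).is_none());
}
```

The cost moves from run time to the API: every function that reads the static has to receive the token somehow.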


It seems that following the order of dependencies would give acceptable results. Since Rust cannot have cyclic crate dependencies, the initialization order could be as simple as: initialize all dependencies before the crate's own init function is executed.

As I mentioned above, these are options, not something we must write.

As for WebAssembly, I don't know much about it, but it seems that before we call get_or_init on a OnceCell, the static field already holds an empty value indicating that init rather than get should be called.

In this case, we could execute #[init] just after all the (normal) static fields are initialized.

It is easy to identify "normal" static fields: fields defined before the first let statement can be recognized as normal static fields, and can only be initialized from const expressions.

Since init functions are special, #[init] could actually be treated as a proc macro that expands into the full init function and defines the global static variables.

I also suggest some syntax sugar for the really confusing grammar:

static mut str1:String=String::new();
let mut str2=String::from("init");
str2+="sugar";
static str3:String=str2.clone(); // str3 is "initsugar" now.
fn main(){...}
#[allow(seperate_init)]
static str4:String=str2; // rustc recognizes this statement as part of #[init], so a warning should be generated.

which is equivalent to

static mut str1:String=String::new(); // occurs before the first `let`, thus it is a "normal" static variable.
#[init]
fn init(){
    let mut str2=String::from("init");
    str2+="sugar";
    static str3:String=str2.clone(); // str3 is "initsugar" now.
    static str4:String=str2; // rustc recognizes this statement as part of #[init], so a warning should be generated.
}
fn main(){...}

The init order is just the dependency order, so when init is executed, it can assume that every crate it uses is properly initialized.

This is why I showed that a function could be marked as init.

#[init]
use std::init; // this line specifies that std::init should be called before the user-defined `init` function is executed.

Crates could either provide a fallback if init is not allowed to be called (for example, use OnceCell), or just refuse to compile (for example, you write a crate that operates your NVIDIA GPU with CUDA, but you are not allowed to initialize CUDA devices).

That's not true. Currently Rust uses libc, and libc executes several instructions before Rust's main function runs.

In theory we could ask ChatGPT and it would give us a perfect program. But that seems very far off. Currently, the only thing we can ensure is that initialization happens before main is executed.

I meant dependencies in the general sense of “this must be done before that” — here, “this must be initialized before that”. A cycle can exist within a single crate, when an #[init] function makes use of a static that is initialized by a different #[init] function.

You could statically reject such cycles for static initialization per se, given your premise that the variables are declared from within the #[init] code, but ordering effects would also show up if any of the statics are interior mutable.


What dependency order?

Consider the following code (you can split it into files, if you like), which is all one crate:

mod one {
    #[init]
    fn one_init() {
        pub static hello: String = String::from("hello");
        pub static total_len: usize = hello.len() + crate::two::world.len();
    }
}

mod two {
    #[init]
    fn two_init() {
        static world: String = String::from("world");
        pub static total_len: usize = world.len() + crate::one::hello.len();
    }
}

There's no dependency ordering between these two (since they're modules in the same crate), and somehow, you need to split them up so that both one::total_len and two::total_len get calculated correctly, or work out how to detect the issue and error accordingly. You could even create the same issue with two #[init] blocks in the same module.

First, not all Rust code uses libc - it's an optional dependency if you're in a no_std context. Secondly, I didn't say that no instructions are executed before main - I said that no Rust is executed before main. Thirdly, I've noted that C++ programmers forget that C++ code can execute before main, in a language that does have static initializers, and while this is a false assumption, it's a common mistake that I've seen many engineers make and have to be reminded of when their binary takes a long time to get to main.

And finally, statics are global state. We have a long history that demonstrates that humans are bad at reasoning about global state; that's why there's a preference for making global state challenging to use, and for wrapping things up in an object that you pass around instead.


That's the order in which they get initialized, not when.

The "empty" value is some constant embedded in the binary. There's nothing that gets executed at runtime to initialize that, so you can't run the init code after that. What OnceCell does is to check at every access if it needs to run the init code, but you said you wanted to avoid those checks...

So you're proposing to allow let and other statements outside functions? That seems very dangerous and confusing.


Such things could be solved easily by disallowing cross-references between modules. For example, allowing submodules to initialize their statics only from constants and moves would solve this situation.

// syntax sugar is used.

const CONST : usize = 42;
static value : usize = CONST;

// you could write `#[init]fn init()` here, which means `value` is initialized before init is called.
// It might be useful for version checking or for initializing a mutex lock.
// with syntax sugar, rustc should treat `#[init]fn init()` as inserted just before the first non-const call or the first let statement.

let hello: String = String::from("hello"); // these `let` bindings live until the whole init procedure is done, so they can be moved into submodules later.
let world: String = String::from("world");
let one_total_len=hello.len() + world.len();
let two_total_len=hello.len() + world.len();

static should_warning : usize = CONST; // if a static field inside `#[init]fn init()` could be calculated directly from constants, a warning should be generated.

mod one {
    pub static hello: String = super::hello; // rust should ensure `super::hello` is defined with `pub(init) static`
    pub static total_len: usize = super::one_total_len;
}

mod two_with_extra_const {
    pub static hello: String = super::hello; // rust should ensure `super::hello` is defined with `pub(init) static`
    pub static total_len: usize = super::two_total_len;
    pub static some_const : usize = CONST;
}

Here, let could be translated directly into pub(init) static, if a pub(init) visibility existed. Variables with pub(init) are allowed to be moved, but only once.

It might not be dangerous, since it is only syntax sugar. Maybe a feature gate #![implict_init] is needed to enable such syntax sugar.

And what about multiple #[init] blocks in a single module? Also, with the syntax sugar, you've now opened up the issue that the following code is allowed, but a "trivial" change to it is not:

mod one {
    pub static name: &str = "one";
    pub static total_len: usize = crate::two::name.len() + name.len();
}
mod two {
    pub static name: &str = "two";
    pub static total_len: usize = crate::one::name.len() + name.len();
}

The above is allowed today, since str::len is a const fn, and thus it's compile-time computed. However, if I change this to:

mod one {
    pub static name: String = String::from("one");
    pub static total_len: usize = crate::two::name.len() + name.len();
}
mod two {
    pub static name: String = String::from("two");
    pub static total_len: usize = crate::one::name.len() + name.len();
}

I've now created a cross-reference between modules, since String::from is not a const fn, and thus must live in a #[init].

How would you resolve this inconsistency?


What if there is one and only one function in the entire program that is allowed to initialize static variables at runtime?

This function could be a special block in the beginning of main itself or something bikesheddable like fn pre_main() or static fn main().

The compiler would statically analyze that every single static variable used anywhere in the program is initialized once the function (or section of main) finishes. Failure to do so emits a compilation error.

Libraries that require static initialization would include their own static initialization functions that would then be called by this block/function in a binary crate. This would make the order of initialization explicit in code.
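
A rough approximation of this shape in today's Rust, for comparison. It uses std::sync::OnceLock, so reads still carry a run-time check (here an expect) where the proposal would rely on a compile-time guarantee. The module names one and two, the init functions, and init_statics are all made up for the sketch.

```rust
mod one {
    use std::sync::OnceLock;

    static HELLO: OnceLock<String> = OnceLock::new();

    // Stand-in for a library's own static-initialization function.
    pub fn init() {
        // set() fails if the cell is already filled, so a repeated call is a no-op.
        let _ = HELLO.set(String::from("hello"));
    }

    pub fn hello() -> &'static str {
        HELLO.get().expect("one::init() has not run").as_str()
    }
}

mod two {
    use std::sync::OnceLock;

    static TOTAL_LEN: OnceLock<usize> = OnceLock::new();

    // Depends on `one` being initialized first.
    pub fn init() {
        let _ = TOTAL_LEN.set(super::one::hello().len() + "world".len());
    }

    pub fn total_len() -> usize {
        *TOTAL_LEN.get().expect("two::init() has not run")
    }
}

// The single place where the initialization order is spelled out.
fn init_statics() {
    one::init();
    two::init();
}

fn main() {
    init_statics();
    println!("{} {}", one::hello(), two::total_len());
}
```

Swapping the two calls in init_statics() turns into a panic at startup, which is the kind of mistake the proposed static analysis would have to reject at compile time.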

I don't know if this is technically feasible. I think this would be useful even with the restriction of not allowing accessing any static variables in this "static initialization" context. In this case it could be its own function restriction like const fn.

But it still has to be able to run normal code (that's the whole point), which can access any static variable. The only way this could ever work is if the compiler tracked static variables access across every function.

And this still doesn't solve the problem with WASM (which can't run code when loaded).


The only way this could ever work is if the compiler tracked static variables access across every function.

Only across every function in this static initialization call graph. Would tracking the "static initialization-safety" of all functions be too expensive?

And this still doesn't solve the problem with WASM (which can't run code when loaded).

Easiest option: don't support this at all when compiling for such a use case.

Suggestion:

Setting global variables could only be allowed in functions marked as "static fn" (bikeshed). These functions would be callable from foreign code like any other function if it's extern "C" (or the WASM equivalent). It would be the foreign code's responsibility to call all functions initializing required static variables first.

I believe this would still be an acceptable level of footgun-ness when interfacing from C. I don't know if that's the case with WASM. IMHO it is.

That's true. Suppose you have a Sequence that generates increasing IDs each time new is called:

use core::sync::atomic::{AtomicUsize, Ordering};
static counter: AtomicUsize=AtomicUsize::new(0);
struct Sequence(usize);
impl Sequence{
    fn new()->Self{
        Self(counter.fetch_add(1,Ordering::SeqCst))
    }
}

and you wrote the following code:

mod one {
    pub static name: Sequence = Sequence::new();
    pub static total_len: usize = crate::two::name.len() + name.len();
}
mod two {
    pub static name: Sequence = Sequence::new();
    pub static total_len: usize = crate::one::name.len() + name.len();
}

What result would you get?

In this case, a compile error might be more appropriate.

In case you really want to initialize the Sequences in a specific order,

let mod_one_name=Sequence::new();
let mod_two_name=Sequence::new();
mod one {
    pub static name: Sequence = mod_one_name;
    pub static total_len: usize = crate::two::name.len() + name.len();
}
mod two {
    pub static name: Sequence = mod_two_name;
    pub static total_len: usize = crate::one::name.len() + name.len();
}

The problem is that some crates may have their own static variables and want to initialize them before main. So you cannot avoid managing the different init functions, either manually (which is unpleasant) or automatically.

I would expect this to compile, by analogy to the version using &str or usize instead of Sequence, since Copy does not come into play here.

We are in Safe Rust here, so it can't be UB. Therefore, I would expect there to be rules that completely define the order in which the two calls to Sequence::new take place, so that I get a deterministic outcome. What I'm not clear on is what those rules should be to minimise surprise - in particular, you'd want to define an order in which the code in the three modules here (the outer module, mod one and mod two) gets visited such that users are unsurprised by the outcome.
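
For comparison, this is how the same Sequence example behaves with lazy statics today (a sketch assuming Rust 1.80+ for std::sync::LazyLock, with the statics renamed to upper case). The outcome is deterministic for a single-threaded program, but the rule is "order of first access", which says nothing about what the rule for #[init] should be.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

static COUNTER: AtomicUsize = AtomicUsize::new(0);

pub struct Sequence(pub usize);

impl Sequence {
    pub fn new() -> Self {
        // Each call hands out the next ID.
        Self(COUNTER.fetch_add(1, Ordering::SeqCst))
    }
}

mod one {
    use std::sync::LazyLock;
    pub static NAME: LazyLock<super::Sequence> = LazyLock::new(super::Sequence::new);
}

mod two {
    use std::sync::LazyLock;
    pub static NAME: LazyLock<super::Sequence> = LazyLock::new(super::Sequence::new);
}

fn main() {
    // `two` is touched first, so it receives ID 0 even though `one`
    // is declared first in the source.
    let two_id = two::NAME.0;
    let one_id = one::NAME.0;
    println!("one = {one_id}, two = {two_id}"); // prints "one = 1, two = 0"
}
```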

IMHO, this is not expected. Your code, while convenient, is also fairly complex. It requires knowledge of evaluation order, and will lead to subtle bugs and undefined behavior.