Would a crate like uneval that @kpreid pointed out be a better option? Build the tables once, as a generated source code file that you commit, along with the build.rs to regenerate it as needed. They then appear in the binary, but they won't affect compile time as much, since all the evaluation was done when the generated source code was written out.
And static initializers today can call const functions - is the additional power that uneval provides enough that you can do everything you want, using uneval to create the data tables that are expensive to compute, and const fns to initialize the rest of your static data?
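As a concrete illustration of that approach, a build script can compute the table and emit it as Rust source that the crate pulls in with include!. This is only a sketch, and every name in it (expensive_table, render_table, DATA_TABLE, tables.rs) is made up for the example rather than taken from uneval's API:

```rust
// Hypothetical build.rs sketch: compute an expensive table when the build
// script runs, and write it out as Rust source that the crate includes with
// include!(concat!(env!("OUT_DIR"), "/tables.rs")).
use std::{env, fs, path::Path};

// Stand-in for a computation that is too slow or too awkward for `const fn`.
fn expensive_table() -> Vec<u64> {
    (0u64..8).map(|i| i.pow(3)).collect()
}

// Render the table as a Rust source snippet.
fn render_table(table: &[u64]) -> String {
    let body: Vec<String> = table.iter().map(|v| v.to_string()).collect();
    format!("pub static DATA_TABLE: &[u64] = &[{}];\n", body.join(", "))
}

fn main() {
    let src = render_table(&expensive_table());
    // In a real build.rs, cargo always sets OUT_DIR; fall back to "." so the
    // sketch also runs stand-alone.
    let out_dir = env::var("OUT_DIR").unwrap_or_else(|_| ".".into());
    fs::write(Path::new(&out_dir).join("tables.rs"), &src).expect("write tables.rs");
    println!("{src}");
}
```

The generated file is ordinary source, so it costs nothing at runtime and only trivial parsing at compile time.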
This is because noise exists: you cannot fully rely on benchmarks. For two similar functions X and Y that calculate the same thing, modifying the implementation of X can significantly affect the speed of Y, showing an accidental 2% performance gain or loss. Thus the one example you've found might be a false positive.
On my laptop, param and mixed have the same speed, and that speed equals global without the copy.
use_data_param obtains &DATA_TABLE at almost no cost, while use_data_mixed obtains DATA_TABLE each time it is called. Could that really have a >10% impact?
If you said "sum_param(data) is not changed", then as a simple consequence, we could rewrite
range.fold(0, |acc, elem| acc + elem * sum_param(data)) to range.fold(0, |acc, elem| acc + elem) * sum_param(data), which would speed things up by ~100x.
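That rewrite is easy to check concretely. The sketch below uses stand-in definitions (a slice-summing sum_param, a small fixed slice) rather than the actual benchmark code from the thread:

```rust
// Illustration of the claimed rewrite: if `sum_param(data)` is loop-invariant,
// the multiplication can be hoisted out of the fold without changing the result.
fn sum_param(data: &[u64]) -> u64 {
    data.iter().sum()
}

// Original shape: sum_param is (conceptually) evaluated on every iteration.
fn folded(range: std::ops::Range<u64>, data: &[u64]) -> u64 {
    range.fold(0, |acc, elem| acc + elem * sum_param(data))
}

// Rewritten shape: sum_param is evaluated exactly once.
fn hoisted(range: std::ops::Range<u64>, data: &[u64]) -> u64 {
    range.fold(0, |acc, elem| acc + elem) * sum_param(data)
}

fn main() {
    let data = [1u64, 2, 3];
    assert_eq!(folded(0..100, &data), hoisted(0..100, &data));
}
```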
This fact suggests something awful: the cost of writing let data = unsafe { &DATA_TABLE }; equals ten executions of the function sum_param(data).
Using PGO, the three variants of your code ended up as:
Running benches/bench.rs (target/release/deps/bench-5e29645e49435c6c)
global time: [180.07 µs 180.09 µs 180.12 µs]
change: [+0.0558% +0.0806% +0.1075%] (p = 0.00 < 0.05)
Change within noise threshold.
Found 7 outliers among 100 measurements (7.00%)
4 (4.00%) high mild
3 (3.00%) high severe
param time: [180.27 µs 180.28 µs 180.30 µs]
change: [+0.0295% +0.0500% +0.0717%] (p = 0.00 < 0.05)
Change within noise threshold.
Found 10 outliers among 100 measurements (10.00%)
1 (1.00%) low mild
6 (6.00%) high mild
3 (3.00%) high severe
mixed time: [180.01 µs 180.02 µs 180.02 µs]
change: [+0.0315% +0.0392% +0.0466%] (p = 0.00 < 0.05)
Change within noise threshold.
Found 2 outliers among 100 measurements (2.00%)
1 (1.00%) low mild
1 (1.00%) high severe
Compared to the ~200 µs cost without PGO, that is a ~10% gain, so I suspect your optimization just happened to hit PGO's best case.
If we can't rely on benchmarks, how do you expect to measure performance?
And I'm still after an example from you where static DATA: Type = …; is a more performant way to find something than passing the same data as a parameter data: &Type. My statement, based on a deep understanding of how modern processors execute code, and a shallow understanding of how a modern compiler works, is that there's never a case where the version with a static is noticeably faster than the version with a parameter, but that the effect of passing as a parameter enables some optimizations to kick in that aren't possible with the static.
Ultimately, the reason I'm asking about this is that I want to understand the problem space we're working in. I believe that it comes down to the people in the following situation:
The initializer cannot be const evaluated - otherwise the existing initializers, which don't require code at runtime, are good enough.
This comes with a query about things like uneval - is the underlying problem that it's too hard to make your initializer const, and better tooling would help? If so, let's build that tooling, since compile-time constants are inherently easy to get right.
You cannot afford the runtime cost of 3 machine instructions on an access to your static - if you could, you would simply use LazyLock or OnceLock, and be lazily initialized.
This can be even better than eager initialization, since it avoids the code running at all if the initialized result is not needed.
I'm also assuming here that we cannot improve VRA to the point where the compiler is able to avoid this cost - but VRA would allow the compiler to determine that the value has previously been accessed, and thus the lazy initialization has already been done, and it can remove the 3 instructions that do that check.
You cannot thread a parameter through to all the places that need a static to be initialized before they run - either a ZST token to indicate that initialization has been done, which is erased at runtime, or a &GlobalContext that references all of the data that's initialized at start-up.
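For reference, the lazy-initialization alternative mentioned in the second point looks like this with the standard library's LazyLock (stable since Rust 1.80); the table contents are illustrative:

```rust
// Sketch of the LazyLock alternative: initialization runs on first access,
// and every later access pays only the "has it run yet?" check - the 3
// machine instructions under discussion.
use std::sync::LazyLock;

static TABLE: LazyLock<Vec<u64>> = LazyLock::new(|| {
    // Non-const initialization code runs here, once, on first access.
    (0u64..8).map(|i| i * i).collect()
});

fn main() {
    // First access initializes; this is where the init cost is paid.
    assert_eq!(TABLE[3], 9);
    // Subsequent accesses only perform the initialized check.
    assert_eq!(TABLE.iter().sum::<u64>(), 140);
}
```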
If I'm right, then this comes down to a very small group of people. Against that, if this is to be part of Rust, it needs to be solid against misuse; it's fine for an external crate like ctor or static_init to surprise you if you use them wrong, but the Rust core holds itself to a very high standard for good reasons.
Rust cannot translate that code automatically; thus such a flag would be used to check whether the initialization function is called before main.
Crate maintainers should write features like "no_static_init" to ensure the crate can be used without static initializers.
Yes you're right. I found that call rax is as fast as call #address
The init token was discussed before, but it has some disadvantages:
You must pass tokens into many different functions, which can be a disaster.
Suppose pi is initialized statically; then you must pass pi into every function that uses pi.
Thus, x.sin() becomes x.sin(pi).
For physics, things become even harder: every physical constant might change as measurement precision improves, so constants might have to be initialized at runtime. Then every function that uses a constant must receive a token, every caller of such a function needs the token too, and in the end you have to thread all the tokens through the upper-most function.
Do we need .assume_init() each time it is used? IMHO MaybeUninit<T> cannot be used directly.
Calling extern function is unsafe.
Thus it is unsound to call the function before it is properly initialized.
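For completeness, here is a minimal sketch of the MaybeUninit pattern in question; the names and the fixed table are made up, and the SAFETY comments mark exactly the obligations the caller bears:

```rust
// A static MaybeUninit that is written once during startup, then read via
// assume_init_ref(). Every read needs unsafe, because nothing ties the reads
// to the one-time write; that proof obligation is what the thread is about.
use core::mem::MaybeUninit;
use core::ptr::{addr_of, addr_of_mut};

static mut TABLE: MaybeUninit<[u64; 4]> = MaybeUninit::uninit();

/// SAFETY: must be called before any call to `table()`, and before any
/// other thread can touch `TABLE`.
unsafe fn init_table() {
    let _ = unsafe { (*addr_of_mut!(TABLE)).write([1, 2, 3, 4]) };
}

/// SAFETY: sound only if `init_table()` has already run.
unsafe fn table() -> &'static [u64; 4] {
    unsafe { (*addr_of!(TABLE)).assume_init_ref() }
}

fn main() {
    unsafe {
        init_table();
        assert_eq!(table().iter().sum::<u64>(), 10);
    }
}
```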
Indeed. const is not capable enough for my purposes (for now, it will get better) and const evaluation into a static (not mut) cannot be delayed until runtime. If both of those were possible my immediate use cases would be solved.
At the end of the day this isn't hugely impactful for me. It would be nice if I did not have to jump through a couple hoops to get this functionality, and it would be nice if the compiler checked more constraints, but I am able to achieve what I want in present day slightly manual (unsafe) Rust.
As a Rustacean I want to have my cake and eat it too. Comments relating to a couple cheap checks or generating code are maybe missing the point. No amount of fully cached fully branch predicted microbenchmarks are going to make me want to include extra checks. This all should be achievable; there are some gnarly issues like dependency ordering to work out, and maybe it is not worth the effort. But we are trying to solve a real world problem and there is no reason not to solve it well.
Analyze the asm code directly and point out what slows down the calculation.
Enable PGO to make a fair comparison. x86 code is sensitive to alignment: with proper alignment, the code executes faster. If we want to benchmark, we should minimize the impact of differing alignment; fortunately, PGO can help.
Check whether the speed-up is plausible. In your example, making a reference appears 10 times slower than calculating the sum of all elements behind the reference.
An initializer that cannot be const evaluated is not very rare. For example, the weights of a trained neural network are only loaded when the program starts, but the weights might change (due to continuous training) while the program stays the same. In this case, although we have the weights, they cannot be initialized from a const fn.
What's more, if the variable is very large, saving its value into the binary executable would blow up its size.
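A hedged sketch of that scenario: the weights are constant after startup but come from outside the binary, so they cannot come from a const fn. The file I/O is stubbed out with a fixed vector so the sketch stays runnable; the names and format are illustrative:

```rust
// Weights that are loaded once at startup and treated as constant afterwards.
use std::sync::OnceLock;

static WEIGHTS: OnceLock<Vec<f32>> = OnceLock::new();

fn load_weights() -> Vec<f32> {
    // In the real program this would be std::fs::read("weights.bin") plus
    // deserialization; a fixed vector stands in so the sketch is runnable.
    vec![0.25, 0.5, 0.75]
}

fn weights() -> &'static [f32] {
    WEIGHTS.get_or_init(load_weights)
}

fn main() {
    assert_eq!(weights().len(), 3);
}
```

Because the file can change between runs while the binary stays the same, this data is "constant" only from the program's point of view, not the compiler's.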
Yes, at least for Stockfish. Every instruction is important; such programs have an endless appetite for computing power.
This requires unsafe operations (otherwise you could not modify a static variable), but the operation is TOTALLY SAFE.
If you want to initialize let context: GlobalContext, you should notice that GlobalContext must be Sync, but it might contain static variables that work in the main thread only. Then you have to split GlobalContext into SyncContext and LocalContext and pass them around at some cost (as I have shown above, 7 references can slow the whole wpl function by ~1%, confirmed by PGO as mentioned above).
(result with PGO:)
Finished bench [optimized] target(s) in 46.24s
Running unittests src/lib.rs (target/release/deps/benchmark_static-32cee83b30042752)
running 0 tests
test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s
Running benches/bench.rs (target/release/deps/bench-cec17703e6a71b9f)
orig time: [187.58 µs 188.13 µs 188.89 µs]
change: [-2.6891% -2.1008% -1.4611%] (p = 0.00 < 0.05)
Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
2 (2.00%) high mild
1 (1.00%) high severe
global time: [188.77 µs 188.96 µs 189.19 µs]
change: [-2.4275% -2.0951% -1.8322%] (p = 0.00 < 0.05)
Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
3 (3.00%) low mild
3 (3.00%) high mild
2 (2.00%) high severe
param time: [189.46 µs 189.48 µs 189.51 µs]
change: [-2.7107% -2.5860% -2.4777%] (p = 0.00 < 0.05)
Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
1 (1.00%) low mild
3 (3.00%) high mild
6 (6.00%) high severe
If you are right, only a very small group of people should use Rust, since:
memory cannot be statically allocated - otherwise existing languages, which don't require a borrow checker, are good enough
Java/C#/Go have GC with a really small impact, so most people can afford such a runtime cost.
You cannot thread a raw pointer through to all the places that need to allocate and deallocate - either RAII has been done, or a memory pool manages all of the allocations.
This is why you're wrong: your restriction is too subjective, and it marks all of the situations I proposed as "useless".
If you use safe Rust only, an initializer cannot be misused. Static variables are widely used; the existence of static_init and ctor just shows the possibility of unified static initializers.
Since ctor cannot deal with execution order, my proposal might be valuable.
Using a token was an interesting enough idea to try throwing together an implementation, using OnceLock instead of UnsafeCell since it does half of what is needed: Rust Playground
Looking at it in godbolt it looks like there's an extra mov compared to accessing a static directly, but that seems like it should be negligible (EDIT: and #dark-arts has confirmed the mov is dead code, so LLVM should remove it, except it's an atomic access and it doesn't like optimizing those) Compiler Explorer
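For readers without access to the playground link, the token-plus-OnceLock idea can be sketched roughly like this (illustrative names, not the linked implementation):

```rust
// A zero-sized token that can only be constructed by running initialization,
// so any function taking it is statically guaranteed to run after init.
use std::sync::OnceLock;

static DATA: OnceLock<Vec<u64>> = OnceLock::new();

// Zero-sized proof that `init` has run; it cannot be built elsewhere
// because its field is private to this module.
#[derive(Clone, Copy)]
pub struct InitToken(());

pub fn init() -> InitToken {
    DATA.get_or_init(|| (0u64..4).collect());
    InitToken(())
}

pub fn use_data(_proof: InitToken) -> u64 {
    // Unwrap is fine: holding an InitToken implies init() already ran.
    DATA.get().unwrap().iter().sum()
}

fn main() {
    let token = init();
    assert_eq!(use_data(token), 6);
}
```

The token is erased at runtime (it is a ZST), so the only residual cost is the OnceLock's initialized check inside use_data.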
In today's Safe Rust, there is no such thing as runtime static initializers; your proposal, once you present it, will have to explain how you expect runtime static initializers to function, such that Safe Rust cannot result in misuse.
At the moment, since you haven't presented a complete proposal, I'm trying to list all the objections that are likely to be raised, so that you can address them up-front, and ensure that your proposal already covers them.
And yes, those crates prove that (with serious limitations, documented in both cases), you can have static initializers on some platforms (but not on others, by platform design); I bring them up because a good proposal will include a comparison to state of the art crates, and will explain why it solves the same problem, but does better than the crates do.
All useful unsafe operations are operations that are totally safe - the difference between "safe Rust" and "unsafe Rust" is on who bears the cost of proving the operations sound.
In safe Rust, the compiler is your co-pilot; if you do something unsound, the compiler refuses to compile your code and raises an error. In unsafe Rust, you as the code author are responsible for making the checks that what you've done is sound - and if you get it wrong, the compiler will let you write unsound code.
That's part of why I'm trying to extract a smaller feature from this that's easier to make sound in all cases; if the problem is that you really want a much more powerful const evaluator that can handle much more code than the current one can, maybe the shorter term fix is to generate data tables from a build script? If the problem is that you need a data structure that is filled in once for each CPU, maybe we'd be better off with a way to solve that problem directly?
That is the opposite of what I said, and I would appreciate it if you didn't misrepresent me in order to then insult me.
I listed reasons why the people who gain from this feature are likely to be a relatively small group - and used that to justify the claim that if you want to make this feature happen in the standard library or Rust language definition, you can't leave the edge cases unsolved (as both ctor and static-init do in different ways), because there's only a small amount of benefit from the feature. Instead, because only a small number of people benefit, it needs to be a solidly designed feature with no rough corners - not least because people who benefit can work around the gap today, in ways with varying degrees of risk (such as ctor's ordering issue, or static_init permitting deadlock in initialization).
That is not the same as calling it "useless" - that's saying that we need to get this very right so that we don't cause more pain for more people than we resolve by adding the feature.
In what way is it not helpful? Bear in mind that I've never needed pre-main initialization of global data in C, C with Classes, C++ or Rust, and there is no such thing in bare metal assembly (since I'm writing the code that starts when the CPU resets, so life before main in such code is life before power-on) - I've always had alternative methods that have worked just fine for me (and which avoid things like the SIOF, and hardware watchdogs rebooting me because it took too long to get to the first line of main that tickles the watchdog after boot), and I'm trying to extract a feature here that's easier to implement soundly than full-blown non-const static initializers, and which solves a chunk of the problem.
If you need your data to be generated at runtime, because it's entirely dependent on arbitrary properties of the environment that it finds itself in, then yes, I can't extract a smaller feature.
If you could reasonably generate it at compile time, and have the tables be static DATA_TABLE: DataType = …; if only there was a good way to fill in the … with code that can't currently be const evaluated, or if the compile time of the … could be reduced to a sensible time, or if the binary size impact of the … wasn't so significant, then there's room for an interesting feature that's not quite static initializers before main and that resolves the problem. The question then is what shape that feature takes.
For example, if the problem with the … is that the compile time impact is huge, or the binary size cost will be too high (1024 bytes of machine code generating 128 GiB of data tables, perhaps), then we have a way forward past the platform problem; we can say that conceptually, the code is run at compile time, but on platforms which support it, the initializer will be deferred to program startup. On WASM, this causes compile-time blow-up (but there's no other option there, because there is no program startup on the WASM target); on Linux, it'll be a runtime cost instead.
If the issue is that the available set of operations in const evaluation is not enough to represent your data building process, then that's fixable in two directions: first by increasing the number of things that can be const-evaluated, and secondly by finding good ways to do the evaluation in a build script, then loading the generated data into your code.
Finally, if you'd be happy with lazy initialization (OnceLock, LazyLock), but can't afford the 3 instruction cost after initialization has happened, then maybe the answer is to remove that cost - I can think of two ways for the compiler to do it that don't rely on advancing the state of the art in value range analysis, one of which is an evil hack (binary patching, with a big table so that all the places that pay that penalty can be fixed up when the initialization finishes), and the other of which is using attributes to make the analysis cheaper, so that the compiler can then flag the Once underlying the OnceLock or LazyLock as "will always be completed" and optimize accordingly.
What if a crate panics during initialization and this ends up calling a #[panic_handler] in a crate that depends on the panicking crate? This is entirely safe, yet still unsound. How do you want to protect against this without unsafe?
Well, I understand what you are doing. But such a comparison is really simple: the discussion above already shows that a simple crate cannot handle a complex dependency graph.
For example, A is a panic handler that needs to be initialized, and it panics during the initialization procedure.
In this case, ctor is unsound, since it cannot ensure #[panic_handler] is only called after it is initialized. And static-init cannot be used either, since #[panic_handler] does not accept any token as an extra parameter. So we have to use a different method to do the same thing, which is annoying.
As I mentioned above, initializing the panic handler is safe (if ctor is used), but the initialization step itself could also panic. Then an uninitialized static variable is observed through ctor, which is highly unexpected.
It is not easy to wrap an unsafe operation as "safe". When you write such a wrapper, you cannot ensure its safety. This is one of the reasons why I want to propose a static initializer.
Currently, all of these problems could be solved by a unified static initializer; why ask Rust users to learn several different syntaxes for different situations, with issues like speed penalties, noisy syntax, and unsafe code without safety guarantees?
Not really true. Sometimes const data and the program are stored separately (e.g., Stockfish uses a neural network that can be updated separately from the main program, and it can be regarded as constant after the program starts). And sometimes evaluating const data at compile time is highly undesirable.
For example, the following code compiles to a >1 GiB executable. This is why const evaluation is not desirable here.
Actually I'm trying to solve all the edge cases, and so far I have found that unsafe and dependency order really help. And then I found you saying my proposal comes down to a very small group of people.
This is why I feel angry.
In most cases, dependency order is usable, and where it cannot be used, we could add some extra rules (e.g., make panic_handler initialize first) to make it usable.
You could simply ask: what if panic_handler panics during its initialization?
This is not unsound; it is just a corner case of panic_handler, not static_initializer.
such function must appear once in the dependency graph of a binary / dylib / cdylib crate.
Here, the dependency graph is taken into account, and according to the real dependencies, every other crate that could panic depends on the panic handler. Thus the panic handler (and its dependencies) should initialize first, and if it panics, we follow the procedure for the panic handler's own panics (this would be a double panic, and abort by default).
Thus this corner case is solved: panic_handler initializes first, and if the initialization of panic_handler panics, we follow panic_handler's own procedure. To make sure panic_handler does not stop the whole program from functioning, we may add an extra attribute, such as #[no_init(recursively)], to the panic_handler we use.
Here, #[no_init(recursively)] is a compile-time check that a crate (and its dependencies) contain no initializers.
crate A defines a #[panic_handler], and depends on crate B
crate B panics during initialization
According to the previous rules crate B should be initialized before crate A, but since crate B panics during initialization it will call the panic handler of crate A, which may access statics that are not yet initialized.
The same problem is present with #[global_allocator], since any crate could potentially allocate in its static initializer.
This doesn't solve the problem in the example above, because "and its dependencies" includes crate B, so you haven't changed the order of initialization. Not that initializing crate A before crate B would solve the problem.
Making panics in static initializer of these crates into an immediate abort could be an idea, but I wonder how you deal with the fact that static initializers are supposed to be able to call any code, and locally in that code you don't know whether you're being called by a static initializer or by normal runtime code. Add a global flag that indicates that? This is getting more and more special cases though...
Also, this doesn't fix the problem with #[global_allocator] which adds another set of crates you have initialize before anything else, and now the problem is not panics, but allocations, which I would consider some major use case and not just an irrecoverable error that could instead abort.
Kinda offtopic, but I noticed that you often introduce the name/syntax for something without first explaining what that's supposed to do. This leaves the reader in a state of confusion until they reach the next paragraph, where you hopefully explain the behaviour of that syntax. I suggest you to switch the order and first introduce the behaviour you want to add, and only after describe how the user can specify that behaviour in code.
When I say "benefits a small group of people", this isn't about the value of the feature, and I did not intend to cause upset; rather, it's a statement about how close to perfect the feature needs to be before it's worth adding. My assumption is that you want to end up with the best possible programming language, just like me.
If a feature benefits a very large number of people (like impl Trait), it's OK to leave plenty of rough edges (like type alias impl Trait) that are simply not yet fully worked out; because a lot of people benefit from the incomplete feature, we should get the parts that are definitely fully worked out into production ASAP, and just make sure that the rough edges aren't unsound.
If a feature benefits a small number of people (like async traits, which I'm personally looking forward to, because I have cases where I can't use the Box<dyn Future<Output= …>> workaround as encoded in the async_trait crate), then we need to find and smooth off the rough edges (such as interactions with panic_handler, WASM, no_std environments etc) before we push it further.
This is all about the costs and benefits balance of the feature - because the number of people who benefit is relatively small, the costs of the feature for people who don't directly benefit also have to be small. Sometimes, that's a trivial thing; adding (say) a hexadecimal representation of IEEE floating point numbers is also a feature that only benefits a small group of people, but it imposes no costs on anyone who doesn't use it in their code, even if they depend on a crate that uses it in preference to f64::from_bits.
This makes the feature inherently problematic; if I have to use #[no_init(recursively)] in order to soundly depend on crate A, but I also need to not use #[no_init(recursively)] to soundly depend on crate B, I have a conflict. The standard library and the language try very hard to make all their features additive, such that I do not need to turn off a feature that a dependent crate might want to use.
It sounds, from your description, like #[no_init(recursively)] is a compile-time assertion that none of my dependencies use static initializers; if so, that's a big problem, because that's not the sort of thing the standard library normally has, and it'll need some very careful justification.
It's a corner case of the interaction between panic_handler and static_initializer - without static_initializer, I cannot have a case where my panic handler depends on a crate with static initializers, and I cannot have a case where static initializers panic triggering the panic handler. Thus, this interaction needs to be addressed as part of designing static_initializer, since panic_handler is already part of Rust.
And the same problem appears in the interaction between global_allocator and static_initializer; my binary crate can set a global allocator to be used throughout the program, but if that global allocator depends on a crate that uses a static initializer, then the static initializers need to not allocate - no String, no Box, no Vec etc. This is a case that needs a fix, IMO.
I think now's a good point to recap all the things that need answers if this is to turn into a strong proposal:
How do you handle WASM and similar platforms where there's no way to run static initializers on program load? This involves thinking not just about main, but also about cases where I build your crate into a cdylib and call a function directly.
How do you resolve the conflicts with existing global_allocator and panic_handler features of Rust, so that I can use a static initializer without having to worry about which crate contains the global allocator or panic handler?
How do you handle ordering of static initializers within a crate?
Between crates, the answer is that if A depends transitively on B, then B's static initializers run before A. But we don't have that nice ordering inside a crate.
How does it compare to ctor, static_init, and other crates that provide the same functionality today? Focus here should be on any cases where it's harder to use, and why, since we can assume that the feature you're designing will be as easy to use, if not easier, in all other cases.
What are you planning to put in place so that a developer is not surprised by code running before main in a dependency? In particular, if I'm trying to debug my program taking 15 seconds to start, how are you going to make sure that I know to check to see if static initializers are taking a long time in a deep dependency?
Key here is that the nature of a static initializer means I may not even be aware that one exists in my dependency tree - so I need something that trivially shows them up.
If any of the answers involve a way of disabling the feature in my dependencies (e.g. #[no_init]), how can a dependency provide an alternative so that it continues to function?
E.g. if static initializers are built around a LazyLock equivalent primitive, but we promise that the LazyLock is forced before main and optimized out if static initializers are enabled, then I can just do lazy initialization if static initializers are disabled.
If we've a crate which requires invoking some init() but produces UB if the user fails to do so during main, then simply explain this forcefully in the crate docs, and also add detection which panics during debug builds.
Instead of doing the above all the time, provide an unsafe_faster feature which turns on the above behavior, but leave the default behavior as invoking init() behind a Once or whatever.
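That scheme might be sketched as follows; the INITIALIZED flag is illustrative, and a plain cfg(debug_assertions) check stands in for the proposed unsafe_faster feature gate, which this sketch does not actually define:

```rust
// Manual init() plus a debug-build check that panics if init was skipped.
// A release build (or the hypothetical `unsafe_faster` feature) would
// compile the check away entirely.
use std::sync::atomic::{AtomicBool, Ordering};

static INITIALIZED: AtomicBool = AtomicBool::new(false);

pub fn init() {
    // ... expensive one-time setup would go here ...
    INITIALIZED.store(true, Ordering::Release);
}

pub fn do_work() -> u32 {
    // Debug-only detection of a missing init() call.
    #[cfg(debug_assertions)]
    assert!(
        INITIALIZED.load(Ordering::Acquire),
        "call init() before do_work()"
    );
    42
}

fn main() {
    init();
    assert_eq!(do_work(), 42);
}
```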
We could add features which only binary crates can enable, not library crates, although that's not quite the correct distinction really.