Pre-RFC: Sandboxed, deterministic, reproducible, efficient Wasm compilation of proc macros

Per-user caching of all intermediate build artifacts (not just wasm) is being explored in Per-user compiled artifact cache · Issue #5931 · rust-lang/cargo · GitHub.

Even with those optimizations, the overhead might still be large compared to the actual work done by the macro, since a typical macro only needs to do very little work (e.g. parse a 10 line struct declaration and output a 15 line function).

Instantiation (regardless of whether the instance is a Wizer snapshot with application state ready to go or an empty heap) is on the order of tens of microseconds. You can run our benchmarks if you'd like: https://github.com/bytecodealliance/wasmtime/blob/main/benches/instantiation.rs

Maybe there could be a setting in rust-toolchain.toml or even .cargo/config.toml to enable / disable it project-wide. I know the toolchain file has a way to require certain Rustup components, but it might also be worth having a way to explicitly disable pre-compiled macros. Either way, I don't think having a component installed should implicitly change how Rust compiles a project.

I agree. The serde incident has shown that relying on code auditing is not perfect, and while a malicious crate author can always just embed malicious runtime code, sandboxing at build time will be helpful nonetheless.

I'm wondering why Rust code should be tied so closely to WASM here. We could instead have sandboxing without precompiled binaries, since shipping precompiled binaries for performance and build reasons is not necessarily related to the safety aspect, and I think separating the two, as some others have said, is useful.

A lot of build scripts are simple, so having a quick interpreter might be useful for reducing startup time. I don't expect more than a few build scripts to be extraordinarily complex, and I'm not sure the ones that are that complex are that amenable to sandboxing.

I'll just say I think that binary artifacts should be kept at arm's length from .crate source file distribution, which makes computing source hashes without including the binary easier. If we mingle source and binary distribution, we'll eventually run into other issues, such as whether the binary should include DWARF debug info (probably not, for bloat reasons), and if not, whether the debug info should be distributed at all (it might be a pain to try to reproduce, even if it is reproducible by a specific compiler version). As such, there may be reasons to distribute separate binary and debug-info files, separated via wasm-split.

Anyhow, I agree with those who want to think about this first as a binary cache separate from the crates.io source cache. I'm uncertain whether associating the binary output with cargo publish of a version of a crate is desirable. For instance, if there are wasm compiler improvements which improve the codegen or runtime of a specific proc-macro, it seems strange to need to bump the version number of the crate to recompile the crate from the exact same sources with the new crates.io compiler version.

I'd feel better if it was just a separate cache keyed by the compiler version, the features used to compile the binary, and a reference to a crates.io source package, rather than trying to rely on crates.io running the latest version and compiling with a consistent set of features and the latest compiler version.
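To make the shape of such a separate cache concrete, here is a minimal sketch of a cache key built from the three ingredients mentioned above: compiler version, feature set, and a reference to the source package. The type names (`ArtifactKey`, `ArtifactCache`), the field names, and the in-memory map are all hypothetical; nothing here reflects an existing Cargo or crates.io API.

```rust
use std::collections::{BTreeSet, HashMap};

/// Hypothetical cache key. Every field name is an assumption for illustration.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct ArtifactKey {
    /// Crate name and version as published on crates.io.
    pub crate_name: String,
    pub crate_version: String,
    /// Checksum of the source package the binary was built from.
    pub source_checksum: String,
    /// Version of the compiler that produced the binary.
    pub compiler_version: String,
    /// Enabled features, stored sorted so that ordering never changes the key.
    pub features: BTreeSet<String>,
}

impl ArtifactKey {
    pub fn new(
        name: &str,
        version: &str,
        checksum: &str,
        compiler: &str,
        features: &[&str],
    ) -> Self {
        ArtifactKey {
            crate_name: name.to_string(),
            crate_version: version.to_string(),
            source_checksum: checksum.to_string(),
            compiler_version: compiler.to_string(),
            features: features.iter().map(|f| f.to_string()).collect(),
        }
    }
}

/// In-memory stand-in for the cache: key -> compiled wasm bytes.
#[derive(Default)]
pub struct ArtifactCache {
    entries: HashMap<ArtifactKey, Vec<u8>>,
}

impl ArtifactCache {
    pub fn insert(&mut self, key: ArtifactKey, wasm: Vec<u8>) {
        self.entries.insert(key, wasm);
    }

    pub fn get(&self, key: &ArtifactKey) -> Option<&[u8]> {
        self.entries.get(key).map(Vec::as_slice)
    }
}
```

With a key like this, the same crate version can have several binaries side by side, one per compiler version, so a rebuild with a newer compiler would not need a version bump of the crate, and an older binary would remain available.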

Edit: One last thing to consider: when the latest version of the compiler miscompiles a proc-macro, this leaves you no option but to not distribute it, since you cannot fall back to an older version of the compiler.

Would it make sense to have this property be instead something like sandbox = "wasm", to allow for potential other sandbox strategies in the future?
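As a rough illustration of why a string-valued property leaves room for growth, the value could map onto an enum that gains variants over time. The `Sandbox` type, the `"none"` value, and the choice of no sandboxing as the default are assumptions made for this sketch, not part of the proposal.

```rust
use std::str::FromStr;

/// Hypothetical set of sandbox strategies; only "wasm" comes from the thread.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum Sandbox {
    /// Assumed default: run the macro natively, as today.
    #[default]
    None,
    /// Run the macro as a sandboxed Wasm module.
    Wasm,
    // Future strategies would be added here as new variants.
}

impl FromStr for Sandbox {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "none" => Ok(Sandbox::None),
            "wasm" => Ok(Sandbox::Wasm),
            // Unknown strategies are rejected rather than silently ignored.
            other => Err(format!("unknown sandbox strategy: {other}")),
        }
    }
}
```

A boolean flag would have to be replaced once a second strategy appears, whereas a string value only needs a new match arm.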

You can maybe do that with cfg! but not with #[cfg] in general if it's masking off code that can't be compiled (e.g. because the relevant type-defining crates are not available). I'm not sure that it's really any different from all features in that case.

Yeah, that's what you'd need to do: if !cfg!(foo) { /* return compile error */ }. It's a minor change that I believe most would be comfortable with.
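A minimal sketch of that pattern follows. To stay self-contained it works on strings; a real proc macro would take and return `proc_macro::TokenStream`. The function name `expand` and the error message are hypothetical, and `foo` is the placeholder cfg name from the post. Note that `cfg!(foo)` is evaluated when the code containing it is compiled.

```rust
/// Hypothetical expansion function standing in for a proc-macro entry point.
#[allow(unexpected_cfgs)] // `foo` is a placeholder cfg name, not a real one
pub fn expand(input: &str) -> String {
    // `cfg!` expands to a plain `true`/`false`, so both branches must
    // type-check even when `foo` is not set. That is the difference from
    // `#[cfg(foo)]`, which removes the code entirely.
    if !cfg!(foo) {
        // Emit a `compile_error!` invocation so the failure surfaces at the
        // macro call site instead of as a missing item.
        return r#"compile_error!("this macro requires `--cfg foo`");"#.to_string();
    }
    // Placeholder for the real expansion: pass the input through unchanged.
    input.to_string()
}
```

When built without `--cfg foo`, every call returns the `compile_error!` text; when built with it, the input is passed through.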

As much as I would love Rust to use WASM as an implementation strategy for proc macros, I'm entirely opposed to the dogmatic & uninformed policy this proposal wants to attach to it.

We should use the standard upstream capability-based approach with WASM-WASI and leave the policy choices to end users.

If I want an entirely deterministic & reproducible build in a very narrow sense I would simply provide an empty host environment configuration.
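The capability-based idea can be sketched as a host configuration that grants nothing by default, with each capability an explicit opt-in. The `HostEnv` type and its fields are hypothetical and chosen only for illustration; they do not correspond to any particular WASI runtime's API.

```rust
use std::collections::BTreeMap;
use std::path::PathBuf;

/// Hypothetical description of what the host exposes to a macro.
/// `Default` yields the empty environment: no files, no env vars,
/// no clock, no entropy.
#[derive(Debug, Clone, Default, PartialEq, Eq)]
pub struct HostEnv {
    /// Directories the macro may read.
    pub preopened_dirs: Vec<PathBuf>,
    /// Environment variables visible to the macro.
    pub env_vars: BTreeMap<String, String>,
    /// Whether the macro may read the wall clock.
    pub wall_clock: bool,
    /// Whether the macro may obtain real randomness.
    pub secure_random: bool,
}

impl HostEnv {
    /// True when nothing is granted, so the macro's output can only
    /// depend on its input tokens.
    pub fn is_hermetic(&self) -> bool {
        self.preopened_dirs.is_empty()
            && self.env_vars.is_empty()
            && !self.wall_clock
            && !self.secure_random
    }
}
```

Under this model, a strictly reproducible build is just the default configuration, while a user who wants a macro to read a schema file grants that one directory and accepts the consequences.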

Please recall that Rust is supposed to be a practical language. Rust is not a pure functional language; rather, it allows the user control so they can mutate data safely. Likewise, Rust allows an advanced set of unsafe operations that are clearly annotated for auditing purposes. Exactly the same approach ought to apply to other parts of the language, such as procedural macros.

If we add a mechanism to distribute pre-compiled binaries through cargo to benefit compilation times by eliminating the compilation time of proc-macros specifically, why wouldn't/shouldn't we do the same for regular crates as well? The first-compilation speed improvements would be even more dramatic than what would be provided by this proposal. If we introduce a mechanism to pre-compile crates in general, then is there a benefit to tying ourselves to WASM in particular for proc-macros?

I presume that a single WASM binary could be made to work on all platforms. The number of native binaries to build and host would be multiplied by the number of platforms. They may even be specific to OS version (e.g. on macOS some values of the MACOSX_DEPLOYMENT_TARGET env var affect object files).

A single WASM binary that can be compiled by a different (older/stabler; even a newer) version of the compiler than the one I am currently using. It's extremely amenable to caching.

But that also means it's very amenable to being cached locally, since I only need to update that binary when I get a new version of the proc macro crate (or maybe change the compiler, although not necessarily). As written, I would say this pre-RFC conflates:

  • compiling proc macros to a wasm binary
  • caching that binary on crates.io

The former is awesome. The latter is a small can of worms that potentially could be better thought of as a very general cache, even if it is only ever used for wasm proc macro binaries; it could be split out into its own RFC.

I think this is far from a terrible idea. Having it all written down as a coherent plan is very helpful, although we'll probably want sub-RFCs for some of the details.

I would strongly recommend separating the compiled binary from the source code during distribution, i.e. don't include the binary in the crate tarball. The binaries should be distributed separately, in parallel; the crate index could say what binaries exist. (There are a lot of details to be worked out here.)

The backwards-incompatibility is only an issue if it is a global opt-in / opt-out across all macro crates, right? If we create a setting per macro crate that the end user can set, it is backwards-compatible, assuming that the default is no sandboxing.

Overall, I really like this proposal. I do think there are two distinct benefits here:

  • the primary one of sandboxing leading to deterministic behaviour, and
  • enabling the use of pre-compiled proc-macro implementations to improve cold-start compile speed

I see the first as the most interesting from a build-system perspective. Proc-macro non-determinism hasn't been an overwhelming factor for Cargo builds, but it matters a lot to build systems like Buck/Bazel which cache much more aggressively.

The second depends on the first, but opens up a huge range of its own concerns. I agree with the other comments that this whole aspect should be split into its own separate RFC.

Where does this limitation come from? I can see that it's a bit recursive, but not in a way that causes problems?

[various]

use wasi

One big disadvantage of wasi is that it does not allow for deterministic RNG seeding. It would be really nice to guarantee that all HashMap iterations are deterministic, solving a whole class of non-determinism problems at once. On the other hand, if we want to extend the wasm proc-macro API to be able to read files (e.g. read grammars, IDL, etc. -> Rust), then we probably want to use the wasi APIs rather than make up new ones.
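To illustrate what deterministic seeding buys, here is a small sketch in ordinary Rust (not WASI): a `HashMap` built with a fixed-key hasher instead of the randomly seeded `RandomState`. The alias `DeterministicMap` and the helper `iteration_order` are names made up for this example. Within one build of a program, maps constructed this way from the same insertions iterate in the same order; the standard library does not promise the same order across compiler releases.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::BuildHasherDefault;

/// A `HashMap` whose hasher is always created via `DefaultHasher::default()`,
/// so no per-process random seed is involved.
pub type DeterministicMap<K, V> = HashMap<K, V, BuildHasherDefault<DefaultHasher>>;

/// Inserts the words in order and returns the keys in iteration order.
pub fn iteration_order(words: &[&str]) -> Vec<String> {
    let mut map: DeterministicMap<String, usize> = DeterministicMap::default();
    for (index, word) in words.iter().enumerate() {
        map.insert(word.to_string(), index);
    }
    // With the default `RandomState`, this order could differ from one
    // map instance to the next; with a fixed hasher it does not.
    map.keys().cloned().collect()
}
```

Calling `iteration_order` twice with the same input yields the same sequence, which is the property a sandbox with a fixed hash seed would give every macro without code changes.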

It actually does; see Roadmap to determinism in WASI · Issue #190 · WebAssembly/WASI · GitHub

I was going from https://github.com/WebAssembly/wasi-random:

WASI Random does not provide any facility for replacing random data with deterministic data. It is intended to be usable in use cases where determinism would break application assumptions. Implementations may have debugging facilities which make this API deterministic, however these should only be used for debugging, and not production use.

wasi-random has the insecure-seed method for hashmap seeding. This method is explicitly documented for implementations to have the option to provide a deterministic value. The rest of the wasi-random interface could simply be left unimplemented by rustc's wasi impl.

As an aside, if I use a language like Rhai that can be embedded in Rust, and leave the TokenStream processing logic to Rhai, will it be faster than traditional syn?

Logically speaking, a scripting language only parses the code that is actually run, so there is no need to compile all of the code up front. Would there be a time advantage?

If it is for build.rs, then I think using a script may be faster, although it introduces yet another language and more complexity.

Maybe rustc_codegen_cranelift can help improve the compile time of build.rs/proc-macro?