Pre-RFC: Sandboxed, deterministic, reproducible, efficient Wasm compilation of proc macros

That's fairly obvious: which syn, 1 or 2? "Distribution with rustup" comes with "exemption from version resolution", and the tradeoffs are such that we want to version-resolve as many things as possible.

4 Likes

Your macro must not have any (enabled) transitive dependency which is a procedural macro.

What's the reason for this limitation? I've definitely used custom derives in procedural macros before (serde being one example...)

1 Like

It's not entirely obvious that crates.io must be involved at all in this. If a proc-macro crate tags itself as sandboxed = true, that could just tell cargo to build it as a wasm-sandboxed binary, and then rustc has support for loading those. The pre-compilation can then be handled as just a distributed caching service for these artifacts, running separately from the registry.
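For example, something like this in the macro crate's manifest, where the `sandboxed` key is hypothetical and taken from the suggestion above:

```toml
[lib]
proc-macro = true
# Hypothetical key: ask cargo to build this crate as a sandboxed wasm binary.
sandboxed = true
```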

12 Likes

Because sandboxing is backwards incompatible, it has to be opt-in. There are several popular proc macros which can't be sandboxed, for example because they extract a database schema from a live database. You could argue that this shouldn't have been allowed, but there was nothing preventing it, nor docs disallowing it, so we have to live with it.

4 Likes

I'd split the RFC in 2 (or even 3)

I like the idea of having build.rs and proc_macros compiled and executed as wasm, since it would be cleaner, and getting that done would already be very nice overall.

Distributing binaries, on the other hand, is orthogonal to it IMHO and probably has more infra issues to iron out, so it could come much later than integrating a tiny wasm runtime and using it.

I'd consider an intermediate step that would provide something very close to the 0-second compile time: making cargo cooperate better with a caching system (e.g. sccache) so proc_macros can be compiled locally once and distributed locally or per organization using the usual suspects.
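Today the closest approximation is pointing cargo at sccache as a compiler wrapper, e.g. in ~/.cargo/config.toml:

```toml
[build]
rustc-wrapper = "sccache"
```

This caches individual rustc invocations, but cargo itself knows nothing about the cache, which is the cooperation gap being pointed at here.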

18 Likes

I think this RFC should be split into two separate ones (or at least two clearly separated sections):

  1. Sandboxing and compilation targeting wasm instead of the host, while compiling on the user's machine
  2. A mechanism to distribute wasm binaries via crates.io

I, for one, am very interested in using isolated wasm while compiling the macro locally. I don't care much either way about the distribution of wasm via crates.io. The benefits appear negligible to me, though the risks aren't huge either when the verification mechanisms described in this RFC are applied.

Comments on the content of the RFC:

  1. I think isolated macros should always use wasm with no user opt-in. Only the pre-compiled distribution mechanism should be opt-in.

  2. I think panic=abort is fine, since a panic is a fatal error that shouldn't be recovered from. However, it should still produce meaningful error messages, even if that increases binary size a bit (i.e. no panic_immediate_abort).

  3. I don't like build.rs being usable or not depending on native vs wasm. I'd rather forbid it entirely than make it conditional.

  4. While reusing native processes for multiple macro invocations (possibly more than one process, to exploit parallelism) is definitely the right choice, there is the interesting question of whether each macro invocation should get a fresh wasm environment (cloning the state after the initialization code has run, but before any tokens were passed to it).

    A fresh environment would offer much stronger reproducibility/determinism guarantees, but would have worse performance. I'd like to see a benchmark of this performance cost.

  5. The RFC seems to only allow the user two options: 1. "Use the precompiled wasm" 2. "Build locally as native", and doesn't seem to offer my preferred option 3. "Build locally as wasm"

11 Likes

This is definitely a step in the right direction. I've made a bunch of nasty hacks to sandbox rust analyzer and cargo with firejail, but I am not proud of them. Having all of the build process (where possible) properly sandboxed by default will save everyone a massive amount of headache. Some ideas to consider:

  • there exists a WASI (WebAssembly System Interface) spec, which allows for restricted FS and network access (see WASI-tutorial.md)
  • with WASI, both build.rs and proc macros can be safely turned into a WASM artifact which runs with write access to the local directory (i.e. build root), read access to the whole disk, and limited network access (e.g. to crates.io and/or other whitelisted domains). 99% of the time this will be sufficient to achieve whatever legitimate goals the build script is trying to achieve.
  • In order to make it possible to actually call "normal" applications (like GCC, for example), an interface from WASI to something like firejail or docker can be provided, such that the access controls can be enforced. This way, the build script can be 100% sandboxed from doing something terrible with the apps it runs, and if docker and/or nix are used, reproducible builds are made easier. The downside of this solution is that firejail, nix, and docker are all essentially Linux-only tools. While we can just claim that running Windows is unsafe :sweat_smile:, doing the same for the other platforms is probably not going to work.

Edit: in all cases building for wasm from sources (as suggested above) instead of fetching binary artifacts from crates.io appears to be a good idea, as it allows for wasm implementation updates to be applied without forcing crates.io to rebuild stuff from source and/or authors to bump versions.

5 Likes

Wasmtime is optimized for re-instantiating a wasm module on every invocation. For example, it uses copy-on-write (COW) to initialize the linear memory from the module's data segments without having to copy everything every time. There is also Wizer for running initialization code once and baking the result into the wasm module, but proc macros can't run anything before their first invocation anyway and shouldn't have global state, so I don't think using it makes sense.
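For illustration, a minimal sketch of that pattern with the wasmtime crate (the `macro.wasm` artifact name is an assumption, not part of the RFC): the module is compiled once, and each expansion gets a fresh store and instance:

```rust
use wasmtime::{Engine, Linker, Module, Store};

fn main() -> anyhow::Result<()> {
    let engine = Engine::default();
    // Compiling the module (the expensive part) happens once.
    let module = Module::from_file(&engine, "macro.wasm")?;
    let linker: Linker<()> = Linker::new(&engine);

    for _ in 0..3 {
        // Per macro invocation: a fresh Store and a fresh instance.
        // Wasmtime initializes the new linear memory copy-on-write, so this
        // is far cheaper than recompiling or copying all of memory.
        let mut store = Store::new(&engine, ());
        let _instance = linker.instantiate(&mut store, &module)?;
        // ... hand the token stream to the instance and read the expansion back ...
    }
    Ok(())
}
```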

1 Like

In the longer run, I personally would still like to see the publish.rs idea carried out, making it take up the duty of proc_macro expansion, with proc_macro dependencies stripped out by the publishing process. This reduces the burden for everyone.

1 Like

That doesn't help when you yourself depend on a proc macro. And it is less ergonomic.

Wasmtime is optimized for re-instantiating a wasm module on every invocation

Even with those optimizations, the overhead might still be large compared to the actual work done by the macro, since a typical macro only needs to do very little work (e.g. parse a 10 line struct declaration and output a 15 line function)

but proc macros can't run anything before their first invocation anyway and shouldn't have global state

I think an initialization function that runs before the first invocation has its uses. It could process environment variables and similar configuration, and it could build immutable lookup tables. While this work could also be done per invocation, that would probably increase the total cost because the code then runs more often (unless cloning the initialized runtime is more expensive than running the initialization code on a fresh runtime).
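As a minimal sketch of the kind of one-time setup meant here (the table contents are made up), this is what it looks like in a long-lived macro process today, using a lazily built lookup table:

```rust
use std::collections::HashSet;
use std::sync::OnceLock;

// Built once per process; every macro invocation after the first reuses it.
static RESERVED: OnceLock<HashSet<&'static str>> = OnceLock::new();

fn reserved() -> &'static HashSet<&'static str> {
    RESERVED.get_or_init(|| {
        // Imagine this being expensive: parsing configuration from an env
        // var, building a large lookup table, etc.
        ["type", "impl", "fn"].into_iter().collect()
    })
}
```

With one wasm instance per invocation, this `get_or_init` would run every time; cloning a pre-initialized instance instead is exactly the trade-off discussed above.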

2 Likes

Such a proc macro is incompatible with rust-analyzer, which uses a single instance for all crates and instead simply swaps out the set of env vars right before every macro invocation to match the crate for which it expands. Also, there is currently no stable way for proc macros to indicate that they read an env var, so the crate may not get rebuilt when the variable changes.

The MVP of isolated-macros should not offer access to environment variables at all. When that capability is added, it should be done through a specialized API that keeps track of which external information was accessed.
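A sketch of what such a specialized API could look like; the trait and method names here are hypothetical (to my knowledge, nightly's unstable `proc_macro::tracked_env::var` is an existing experiment in this direction for env vars):

```rust
/// Hypothetical host interface exposed to a sandboxed (wasm) proc macro.
/// Every read of external information goes through the host, which records
/// it as an input of the expansion so the macro is re-run when it changes.
pub trait ExpansionHost {
    /// Read an environment variable and record the dependency.
    fn env_var(&mut self, key: &str) -> Option<String>;

    /// Read a file (relative to the crate root) and record the dependency.
    fn read_file(&mut self, path: &str) -> std::io::Result<Vec<u8>>;
}
```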

There is always the option to create a new instance per macro invocation or whenever one of the inputs changes. From my point of view this is just a caching technique that improves performance.

1 Like

Yes, let's push for wasm-(pre-)compiled-and-run proc-macros! :partying_face:

  • (I do agree that we may want to do this in two steps, implementation/RFC-wise: first featuring the wasm-compilation and execution by Cargo, and then the pre-compilation and shipping from crates.io)

I want to insist on this point, especially with pervasive APIs such as syn's Visit{,Mut}ors or Folders: these do not come with try_ flavors, so there is no way to early-return / bail out other than through an unwinding mechanism[1] (a sketch follows the list below). I agree it is not pretty, and ideally we'd / we'll have Try-visitors, but it does not warrant the "should never unwind" treatment;

I'd thus rather rephrase this part as:

  • proc-macros are not supposed to unwind through to their callers;

  • internal unwinding ought to be rare enough not to warrant missing the -Zpanic-immediate-abort optimization for the .wasm binary artifacts. Similar to fs and whatnot, proc-macro authors wishing to rely on internally-caught unwinding will have to forgo the wasm-macro target.
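As an illustration of the internally-caught unwinding discussed above, a sketch assuming syn 2 with its full and visit features enabled (the "no unions" rule is made up):

```rust
use syn::visit::Visit;

struct DenyUnions;

impl<'ast> Visit<'ast> for DenyUnions {
    fn visit_item_union(&mut self, _node: &'ast syn::ItemUnion) {
        // `visit_*` methods return `()`, so the only way to stop the whole
        // traversal from here is to unwind and catch it at the call site.
        panic!("unions are not supported");
    }
}

fn check(file: &syn::File) -> Result<(), String> {
    std::panic::catch_unwind(std::panic::AssertUnwindSafe(|| {
        let mut visitor = DenyUnions;
        visitor.visit_file(file);
    }))
    .map_err(|_| "found an unsupported item".to_string())
}
```

Under panic=abort or -Zpanic-immediate-abort the catch_unwind never gets to run, which is exactly why macros relying on this pattern would have to forgo the wasm-macro target under the rephrasing above.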


What about per-project setups? Whilst rust-toolchain.toml files make it easy to express the desire to include components, there does not appear to be any way to easily opt out of components on a per-project basis. I think this situation hints at the following:

  • presence of components being necessary to enable functionality is indeed a typical usage of rustup;

  • presence of components being sufficient to enable functionality, with no other way to opt out, seems a bit excessive.

I'd thus amend a bit this section in favor of either:

  • ideally,

    for there to be an optional ~/.cargo/config key (and thus a matching env-var) to explicitly opt out of downloading pre-compiled .wasm proc-macros (a sketch follows this list);

    • (and while we are at it, to also be able to explicitly opt in, so as to fail, or at least warn with a nice message, in case the appropriate rustup component is missing)
  • otherwise,

    for a rust-toolchain.toml way to opt out of the component (making the component "hidden" to cargo/rustc invocations within that project, rather than fully uninstalled).
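For the first option, something along these lines, where the key name is purely hypothetical and does not exist in Cargo today:

```toml
# ~/.cargo/config.toml
[build]
# Hypothetical key: never download pre-built .wasm proc macros, always
# compile them locally; a matching CARGO_BUILD_* environment variable
# would follow from cargo's usual config/env mapping.
precompiled-proc-macros = false
```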


  1. while it is possible to stop sub-recursing, imagine having an error on the first associated item of a trait; there is no way to prevent the whole visitor machinery from visiting the rest of the associated items other than by "poisoning" your visitor so that each visit of these items does not itself subrecurse ↩︎

6 Likes

As someone who was VERY and unpleasantly surprised when I learnt how proc macros work in Rust, I like the idea of deterministic and sandboxed proc macros (and maybe eventually build scripts as well?). But I wonder if it's worth tying it to the WASM ecosystem. Maybe we should specify that sandboxed Rust macros run in an abstract machine and use WASM only as an implementation detail? It may also be worth keeping the door open for natively compiled sandboxed macros; it may not be done as part of crates.io, but it could be useful for private environments (e.g. imagine pulling a natively pre-compiled serde_derive from a company server instead of re-compiling it every time).

3 Likes

Is determinism an actual goal of WebAssembly, or just a happy accident because it is still young? Isn't it planned to add features that may involve randomness, like threading, in the future?

1 Like

Wasm itself is very much intended to be as deterministic as possible. Currently there is an exception for the NaN bit patterns of floats, but even there wasm engines offer an option to canonicalize them. Any other non-determinism, including spawning threads, comes from calling host functions. We can choose to provide only deterministic host functions.
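For example, wasmtime exposes this as a Config knob; a sketch, with only the NaN setting being the point here:

```rust
use wasmtime::{Config, Engine};

fn deterministic_engine() -> anyhow::Result<Engine> {
    let mut config = Config::new();
    // Canonicalize NaN bit patterns produced by float operations, removing
    // the one piece of core-wasm non-determinism mentioned above.
    config.cranelift_nan_canonicalization(true);
    Engine::new(&config)
}
```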

4 Likes

I foresee a few situations here (not insurmountable, but I think they should be considered):

  • different rustup remotes (e.g., the pre-release channel): does this change which toolchain +stable resolves to as latest, or are they only available as, e.g., +1.79 or something specific?
  • out-of-date metadata: how to determine what is "latest" in the face of outdated rustup metadata?
  • race conditions: I publish, rustup publishes, crates.io verifies and says it's different (probably rare enough to just say "bump and republish", but should be considered)
  • relabeling: what if I "relabel" another toolchain as +stable locally?
  • platform differences: do we know that wasm builds via any platform's hosted toolchain agree with the others bit-for-bit?

For all but the last one, I think crates.io (or any publishing service, really) issuing a challenge of what it thinks "+stable" is at publish request time would serve as a suitable source of truth. It would then have to remember that for its scheduled build to avoid the "rustup bumps stable" race.

1 Like

While this is just a Pre-RFC, I feel like context is missing for why the various decisions were made; it needs to be filled in to justify the limitations.

I also agree with @lu_zero that this should be multiple RFCs. There are a lot of details to work out here, and handling them all in one RFC makes it likely we'll drop the ball on a good number of them and extend the discussion beyond what is reasonable. I also suspect the actual pre-compilation RFC should initially start off as an eRFC, as there is likely a lot we'd need to work out through the implementation before we commit to it.

For me, the biggest concerns for pre-compilation are

  • How do we ensure trust of the pre-compiled binary?
  • How do we handle dependency updates in pre-compiled binaries?
    • These are similar to binaries that can only be "cargo installed" with --locked, limiting you to the set of versions that were in the lockfile when the package was published
  • How do we deal with the discrepancy between the user's lockfile and the proc macro's lockfile, especially if the behavior of the user's build varies based on whether this wasm sandbox is installed (which is orthogonal to actually wanting to use pre-compiled proc macros)?

EDIT: For completeness, I want to add that I have a general unease about forcing the local version of rustc to match the server side.

10 Likes

I'd very much +1 the wasm cache being built locally only, and being global (shared across projects).

Spending 3s compiling serde_derive, even for all 185 patch versions, is about 9 minutes. If that's only done once, globally on my system, while I'm building a much larger project? And it survives cargo clean and rustc changes?

(beyond the 185 patch versions there are the various feature combinations, yet I don't believe most people interact with all 185 patch versions in all feature combinations)

2 Likes