Pull target environment into Rust typechecker

stepancheg · November 22, 2020, 9:50pm

An idea. Not even pre-pre-pre-RFC yet. Posting here to collect feedback before writing a more formal proposal.

Pull target environment into Rust typechecker

There's an issue: when a project is refactored (items moved or renamed, signatures changed), it often fails to compile in a different target environment. The fix is usually trivial, but the time to discover the breakage is significant:

need to wait for CI to finish
setting up CI for all possible target environments is not always easy

Usually these breakages are caused by one of four things:

test vs non-test builds: the library is patched, but test is forgotten, and test fails to compile. This happens more often when working on multi-crate projects.
operating system differences: the code is checked on Linux, but fails to compile on Windows
features: code is checked with the currently enabled set of features, but fails to compile with a different set of features. This is especially problematic when the number of features is more than two, and combinatorial explosion prevents checking all possible combinations of features.
std vs no-std builds

This document proposes how to fix these problems: pull conditional environment checks into Rust typechecker. This would be an alternative/complement to current #[cfg] filters in AST.

This won't guarantee that code will work correctly in environments other than the current one, but it will at least compile. And with Rust's strong type system, that will often be enough.

Outline of the idea

Environment

First we define an environment. An environment can be any simple Rust object comprised of:

boolean or integer
struct
enum, the most common case

For example:

/// Defined in Rust std
enum EnvOs {
  Linux,
  Macos,
  Windows,
  ...
}

// `env` is similar to `const`, but can be referenced in `env_` modifiers
env CURRENT_OS: EnvOs = if cfg!(windows) {
  EnvOs::Windows
} else if cfg!(target_os = "linux") {
  EnvOs::Linux
} else if cfg!(target_os = "macos") {
  EnvOs::Macos
} else {
  ...
};

Or:

env CURRENT_RUST_VERSION: u32 = ...;

Simple example use case: cross-platform file open API

Let's assume we can open file only on Windows, Linux and macOS.

// First, we define native functions
extern "C" {
  // `open` function is available to typechecker, everywhere
  // but when generating machine code, this is equivalent of
  // `#[cfg(any(target_os = "linux", target_os = "macos"))]`
  env_if(CURRENT_OS in [EnvOs::Linux, EnvOs::Macos])
  fn open(...) -> c_int;

  env_if(CURRENT_OS == EnvOs::Windows)
  fn CreateFileW(...) -> HANDLE;
}

// Then we define our file handle
env_if(CURRENT_OS in [EnvOs::Linux, EnvOs::Macos, EnvOs::Windows])
struct FileHandle {
  env_if(CURRENT_OS == EnvOs::Windows)
  windows_handle: HANDLE,
  env_if(CURRENT_OS != EnvOs::Windows)
  // or equivalently env_if(CURRENT_OS in [EnvOs::Linux, EnvOs::Macos])
  posix_handle: c_int,
}

env_if(CURRENT_OS in [EnvOs::Linux, EnvOs::Windows, EnvOs::Macos])
fn open_file(path: &str) -> FileHandle
// note no curly braces before `env_match`:
// it is not a regular match expression,
// it cannot be used inside the function body;
// code after `=>` is the actual function bodies
env_match CURRENT_OS {
  EnvOs::Windows => {
    FileHandle {
      // this field is available when `CURRENT_OS == EnvOs::Windows`
      windows_handle: CreateFileW(...),
      // there's no `posix_handle` field on Windows, so code typechecks
    }
  }
  _ => {
    // here compiler knows that
    // CURRENT_OS is `EnvOs::Linux` or `EnvOs::Macos`
    FileHandle {
      // `posix_handle` field is available when current os is not windows
      posix_handle: open(...),
      // and there's no `windows_handle` field
    }
  }
}
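
For comparison, this is roughly how the same shape is written with today's `#[cfg]` attributes. It is only a minimal sketch: the integer handle types and placeholder values stand in for the real `HANDLE`/`c_int` and the FFI calls, and `describe` is a hypothetical helper that is not part of the proposal. Only the items for the platform being built are ever typechecked, which is exactly the gap the proposal is trying to close.

```rust
/// File handle whose fields depend on the target OS (today's `#[cfg]` style).
pub struct FileHandle {
    #[cfg(windows)]
    windows_handle: usize, // stand-in for `HANDLE`
    #[cfg(not(windows))]
    posix_handle: i32, // stand-in for `c_int`
}

/// Windows body: removed before typechecking on other targets, so a
/// mistake here is not reported when building on Linux or macOS.
#[cfg(windows)]
pub fn open_file(_path: &str) -> FileHandle {
    FileHandle { windows_handle: 1 } // placeholder for `CreateFileW(...)`
}

/// Non-Windows body: removed before typechecking on Windows.
#[cfg(not(windows))]
pub fn open_file(_path: &str) -> FileHandle {
    FileHandle { posix_handle: 3 } // placeholder for `open(...)`
}

/// Hypothetical helper; it also needs one copy per platform because the
/// fields it reads differ.
#[cfg(windows)]
pub fn describe(handle: &FileHandle) -> String {
    format!("windows handle {}", handle.windows_handle)
}

#[cfg(not(windows))]
pub fn describe(handle: &FileHandle) -> String {
    format!("posix fd {}", handle.posix_handle)
}
```

Renaming `posix_handle` here and forgetting the Windows copy would go unnoticed until something builds for Windows.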

Example with feature

env FEATURE_EXTRA_ASSERTIONS: bool = cfg!(feature = "extra-assertions");

struct Data {
  data: Vec<u8>,
  env_if(FEATURE_EXTRA_ASSERTIONS == true)
  checksum: u64,
}

impl Data {
  // This function is runtime no-op when extra-assertions feature is off
  // but it is typechecked regardless of whether feature is on or off
  fn verify(&self)
  env_match FEATURE_EXTRA_ASSERTIONS {
    true => {
      // `checksum` field is available here
      assert!(self.checksum == self.compute_checksum());
    }
    false => {
      // there's no `checksum` field, but we don't use it
    }
  }
}
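
The closest tool in today's Rust is probably the `cfg!` macro: it expands to a `true` or `false` literal, so both branches of an ordinary `if` are typechecked whatever the configuration is. The sketch below assumes a hypothetical `compute_checksum` (a wrapping byte sum) and keeps the `checksum` field unconditionally, since today a field cannot be made conditional without `#[cfg]` hiding it from the typechecker.

```rust
pub struct Data {
    data: Vec<u8>,
    // Always present in this sketch. With `#[cfg(feature = "...")]` on the
    // field, `verify` below would stop compiling when the feature is off.
    checksum: u64,
}

impl Data {
    pub fn new(data: Vec<u8>) -> Self {
        let checksum = Self::compute_checksum(&data);
        Data { data, checksum }
    }

    /// Hypothetical checksum: wrapping sum of all bytes.
    pub fn compute_checksum(data: &[u8]) -> u64 {
        data.iter()
            .fold(0u64, |acc, byte| acc.wrapping_add(u64::from(*byte)))
    }

    /// The branch is typechecked in every build; the check only runs when
    /// the `extra-assertions` feature is enabled.
    pub fn verify(&self) {
        if cfg!(feature = "extra-assertions") {
            assert_eq!(self.checksum, Self::compute_checksum(&self.data));
        }
    }
}
```

The cost is that the field occupies memory in every build, which is what the proposed `env_if` on fields would avoid.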

Example with test

env TEST: bool = cfg!(test);

#[test]
env_if(TEST == true)
fn my_test() {
  // this function is always typechecked, but not compiled
  // when we are not compiling crate as test
}

Environments can be mixed

env_if(CURRENT_OS == EnvOs::Linux)
fn connect_unix_socket(...) {}

#[test]
env_if(CURRENT_OS == EnvOs::Linux && TEST == true)
fn test_connect_unix_socket() {
  // this function is typechecked even if we are not on Linux,
  // and even if we are not building a test
  connect_unix_socket(...);
}

How to deal with native dependencies

Use case: a Windows-only native library wo, which depends on wo-sys, a binding to the actual native library. The wo crate should be available even on non-Windows, but wo-sys cannot be compiled on non-Windows.

This issue can be solved by mixing env_ attributes with current #[cfg()] attributes. Like this:

// `wo` crate is available everywhere,
// but depending on/compiling `wo_sys` only on actual Windows.
#[cfg(windows)]
extern crate wo_sys;

// `do_it` function is available everywhere

// On Windows we provide actual function calling native function.
// We still expose `env_if`, so when compiling on Windows,
// we need to check this function is not called in Linux environment.
#[cfg(windows)]
env_if(CURRENT_OS == EnvOs::Windows)
fn do_it() {
  wo_sys::do_it();
}

// This is a stub function to be used on non-Windows.
#[cfg(not(windows))]
env_if(CURRENT_OS == EnvOs::Windows)
fn do_it() {
  // This function cannot be instantiated,
  // `unreachable!()` is just a precaution.
  unreachable!();
}

A little more formal explanation

Single environment type predicate is a subset of possible values for given environment type. For example,

os != windows
test == true

Environment predicate is an intersection of single environment predicates. For example:

(os != windows) && (test == true)

env_if modifier (e.g. on a function) defines an environment predicate.

Each function is typechecked with an environment predicate.

env_match block is used to split a function body into several bodies typechecked with different environment predicates.

Finally, a function body can only access elements (types, functions, fields, etc.) that have a wider predicate. For example, a function with (os == linux) can call a function with (os != windows), but not vice versa.
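
The rules above can be modelled in ordinary Rust as set operations. This is only a sketch under assumptions: the `Os` enum, the `EnvPredicate` type and all method names are hypothetical, predicates are stored as small bitmasks, and only two environments (`os` and `test`) are modelled.

```rust
/// Possible values of a hypothetical `os` environment.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Os {
    Linux,
    Macos,
    Windows,
}

/// An environment predicate: for each environment, the set of allowed values.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct EnvPredicate {
    os: u8,   // bit i set => the i-th `Os` variant is allowed
    test: u8, // bit 0 => `test == false` allowed, bit 1 => `test == true` allowed
}

impl EnvPredicate {
    /// No restriction at all (an item without `env_if`).
    pub const ANY: EnvPredicate = EnvPredicate { os: 0b111, test: 0b11 };

    pub fn os_is(os: Os) -> Self {
        EnvPredicate { os: 1u8 << (os as u8), ..Self::ANY }
    }

    pub fn os_is_not(os: Os) -> Self {
        EnvPredicate { os: 0b111 & !(1u8 << (os as u8)), ..Self::ANY }
    }

    pub fn test_is(value: bool) -> Self {
        EnvPredicate { test: 1u8 << (value as u8), ..Self::ANY }
    }

    /// `a && b`: intersection of the two predicates.
    pub fn and(self, other: Self) -> Self {
        EnvPredicate { os: self.os & other.os, test: self.test & other.test }
    }

    /// True if no environment at all satisfies the predicate.
    pub fn is_empty(self) -> bool {
        self.os == 0 || self.test == 0
    }

    /// True if every environment allowed by `self` is also allowed by
    /// `wider`, i.e. code under `self` may access an item under `wider`.
    pub fn is_within(self, wider: Self) -> bool {
        self.is_empty()
            || ((self.os & !wider.os) == 0 && (self.test & !wider.test) == 0)
    }
}
```

With this model, `EnvPredicate::os_is(Os::Linux).is_within(EnvPredicate::os_is_not(Os::Windows))` holds, while the reverse does not, matching the example above.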

A full formal specification would be quite lengthy (for example, it would need to specify how traits and implementations work). This document is just a sketch of the idea.

matklad · November 23, 2020, 8:10am

With my IDE dev hat on, I applaud this effort. I am, however, pretty skeptical about its feasibility.

At the moment, conditional compilation happens during the parsing/expansion stage. Pushing it much further in the compilation pipeline, to the monomorphisation/linking step, would be a great boon for IDEs. Today's conditional compilation generally forces the IDE to look at a specific current set of cfgs, and so things like automated refactors can break for a different set of cfg flags. There are some (hard to implement, poorly performing) hacks to remedy this situation, but they are qualitatively different from a world where refactors are just precisely defined equivalent transformations.

Ideally, in the "mir-only rlib" world, the rlibs would include code for all versions of cfg flag (so each item might have several versions to it, annotated with a predicate), and it's only during codegen that we select a specific representative for each multiversioned item and check that each reachable item has at least one representative.

I don't know of languages where conditional compilation works quite like this; I'd be curious to read about the prior art here. The closest analogue I know is the expect/actual from Kotlin (OT, but was this you?). This is somewhat expected: one can't build a great IDE with a more traditional approach to conditional compilation.

rpjohnst · November 23, 2020, 3:50pm

I don't know of any languages using this (yet?) but there is the concept of coeffects as the dual of effects, which is often applied to this particular use case.

(I've also seen coeffects applied to things like linear types, dynamic scope, resource usage, etc. so I suspect there may be a lot of stuff in Rust that could be thought of this way!)

bjorn3 · November 23, 2020, 4:23pm

I once came across these two pages about coeffects: http://tomasp.net/coeffects/ and http://tomasp.net/blog/2014/why-coeffects-matter/index.html

atagunov · November 23, 2020, 7:48pm

I'm probably naive/uneducated but aren't co-effects technically same as effects?

I thought in languages that support effects

you define some sort of "effect interface" (=approx= interface in Java or trait in Rust)
you install effect handler somewhere high in your callstack
a code somewhere deep in the call stack "invokes" the effect; to that code it looks like a method call
- this pseudo-method call may never return - if you invoked an exception effect all the stack between effect handler and effect invocation is destroyed
- this pseudo-method may return once IO is complete - if you have invoked an IO effect
- this pseudo-method may return in an hour - if you invoked a yield effect and user-level thread scheduler decided to run some other code for this hour; actually your stack between effect handler and effect invocation may have been packaged into an object on the heap and stored there - not Rust way - but other languages (ML family?) probably do this
- in languages with GC and possibly without mutation you could even have this method return multiple times - though I don't understand this well enough - if the point of this effect is for example try different solutions to an equation..

My point is that there is potentially a bi-directional data flow here: code invoking an effect passes data to effect handler and effect handler when resuming the invoking code (resuming a "continuation") can pass data into it. Therefore we can perfectly well model writing to files and reading system time with effects. Why do we need an extra concept of "co-effects"?..

rpjohnst · November 23, 2020, 11:37pm

That's an operational way of thinking about this, while effect-vs-coeffect is primarily about the type system. In that context, the "dual" of something (intuitively) means that the direction of some arrows have been reversed.

For example, the async effect applies to the output of a function: async fn f() -> i32 desugars to fn f() -> impl Future<Output = i32>. If you want to use a value from a Future, you can only do so by producing another Future based on the first one. On the other hand, if you have a value you want to use as a Future, that's just a trivial future::ready away.
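
A minimal sketch of that paragraph in std-only Rust. The names `f_desugared`, `plus_one` and `lifted` are made up for illustration, and `block_on` is a toy executor (an assumption for this example, only suitable for futures that are ready immediately) standing in for a real runtime.

```rust
use std::future::{self, Future};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// The effect shows up on the output side: this is sugar for the next function.
pub async fn f() -> i32 {
    42
}

/// Roughly what `f` desugars to.
pub fn f_desugared() -> impl Future<Output = i32> {
    async { 42 }
}

/// Using a value from a future produces another future.
pub async fn plus_one(input: impl Future<Output = i32>) -> i32 {
    input.await + 1
}

/// Going from a plain value to a future is trivial.
pub fn lifted(value: i32) -> impl Future<Output = i32> {
    future::ready(value)
}

/// Waker that does nothing; enough for futures that never actually wait.
struct NoopWaker;

impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

/// Toy executor: polls in a loop until the future is ready.
pub fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(fut);
    loop {
        if let Poll::Ready(value) = fut.as_mut().poll(&mut cx) {
            return value;
        }
        std::thread::yield_now();
    }
}
```

Here `block_on` plays the role of the escape hatch: it is specific to futures and not something effects provide in general.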

So the dual here would mean that the extra context applies to the input of a function. Take this thread, for example: platform-specific code accepts the current platform as a sort of input. Once you get the system time from the platform, it's just a normal value that you can use however you like. On the other hand, just because you have a time value does not mean you can make platform-specific code see it as the system time.

(Of course, both of these could be bypassed- you can block on an executor to get a normal value out of a Future, and the platform might let you set the system time. But neither of these operations are common across effects or coeffects- each one may or may not provide an equivalent and each one may work completely differently.)

atagunov · November 23, 2020, 11:45pm

Thx a lot for spending time to explain.. I kind of get it and not get it at the same time.. Is there really a material difference between a function that reads system time (co-effect) and reads a file (effect)? What if OS provides a special file to read system time?

rpjohnst · November 23, 2020, 11:50pm

There is certainly some overlap in actual use cases. The difference is how you express things in the type system, and you can look at both time and files both ways:

Is it fn system_time() -> IO<Time>, fn read() -> IO<Vec<u8>>?
Or is it fn use_system_time(system_time: Platform<Time>), fn use_file(file_contents: Platform<Vec<u8>>)?
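
To make the two shapes concrete, here is a rough sketch in plain Rust. `Io`, `Platform` and every function here are hypothetical illustrations, not an existing API; `SystemTime` stands in for `Time`. The point is only which direction is trivial: wrapping a value for `Io`, unwrapping a value for `Platform`.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Effect style: the marker sits on the *output* of a computation.
pub struct Io<T>(Box<dyn FnOnce() -> T>);

impl<T: 'static> Io<T> {
    /// Any plain value can be wrapped trivially.
    pub fn pure(value: T) -> Self {
        Io(Box::new(move || value))
    }

    /// Using the result yields another `Io`.
    pub fn map<U, F>(self, f: F) -> Io<U>
    where
        F: FnOnce(T) -> U + 'static,
        U: 'static,
    {
        Io(Box::new(move || {
            let Io(inner) = self;
            f(inner())
        }))
    }

    /// Escape hatch, standing in for "the runtime executes the program".
    pub fn run(self) -> T {
        let Io(inner) = self;
        inner()
    }
}

/// Coeffect style: the marker sits on the *input*, recording that the value
/// came from the platform. By convention only `platform_time` creates one.
pub struct Platform<T> {
    value: T,
}

impl<T> Platform<T> {
    /// Getting a plain value out is trivial.
    pub fn extract(self) -> T {
        self.value
    }
}

pub fn secs_since_epoch(time: SystemTime) -> u64 {
    time.duration_since(UNIX_EPOCH).map(|d| d.as_secs()).unwrap_or(0)
}

/// `fn system_time() -> IO<Time>`
pub fn system_time() -> Io<SystemTime> {
    Io(Box::new(|| SystemTime::now()))
}

/// The platform side hands the time in as an input.
pub fn platform_time() -> Platform<SystemTime> {
    Platform { value: SystemTime::now() }
}

/// `fn use_system_time(system_time: Platform<Time>)`
pub fn use_system_time(system_time: Platform<SystemTime>) -> u64 {
    secs_since_epoch(system_time.extract())
}
```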

atagunov · November 24, 2020, 12:06am

Would you agree

effects are much more powerful than co-effects
a language with effects need not bother with co-effects on top of them
a language without either like Rust may choose to introduce co-effects
only because they would be a less radical change than effects

?

rpjohnst · November 24, 2020, 1:13am

I think I would disagree on all counts. I'm not really sure you can compare the "power" of type system features like that, and even if you could there are usually good reasons to choose less-powerful tools over more-powerful tools.

(For instance, macros are arguably "more powerful" than the alternatives for a given use case, but that power turns around to bite you when you want "more powerful" IDE support.)

atagunov · November 24, 2020, 1:31am

co-effects seem easy to ~~model~~ implement via effects (guess I can't stop thinking operationally)
but not vice versa - exception and async seems not possible to model with co-effects?

Exactly my point. If effects are more powerful, Rust may specifically want to use co-effects because they are less radical alternative of the two.

rpjohnst · November 24, 2020, 2:23am

The links bjorn3 shared already give some examples where this is not the case- the dataflow system is one of them.

You may also be interested in the related "comonads" (effects and monads are closely related) for more examples like that- the canonical one is probably the Game of Life.

atagunov · November 24, 2020, 11:45pm

@rpjohnst, thx a lot for pointers, I need to educate myself

Going back to @stepancheg's post.. What was it he suggested?.. Simultaneous type-checking the code for all possible combinations of env flags?

stepancheg · November 25, 2020, 6:11am

Basically, if function aa calls function bb, it is required that aa's compatible env is a subset of bb's compatible env (bb must be available everywhere aa is). That's it.

No complex NP analysis of function body expressions is required. This check is independent of the regular code typecheck.

In other words, env constant cannot be referenced in const expressions or types. Regular typecheck and env checks are two different worlds.
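
As a rough illustration of how cheap such a check could be, the sketch below walks call edges and compares environment sets, without looking at types or expressions. Everything here is hypothetical: `FnSig`, `env` and `check_calls` are made-up names, environments are plain string sets, and a function is treated as callable only from functions that exist in no environment where the callee is missing.

```rust
use std::collections::{BTreeSet, HashMap};

/// Set of environments (here just OS names) in which a function exists.
pub type EnvSet = BTreeSet<&'static str>;

/// What the check needs to know about a function: its name, its
/// environment set, and the names of the functions it calls.
pub struct FnSig {
    pub name: &'static str,
    pub env: EnvSet,
    pub calls: Vec<&'static str>,
}

/// Convenience constructor for an environment set.
pub fn env(values: &[&'static str]) -> EnvSet {
    values.iter().copied().collect()
}

/// Returns every `(caller, callee)` pair where the caller exists in some
/// environment in which the callee is unknown or unavailable.
pub fn check_calls(fns: &[FnSig]) -> Vec<(&'static str, &'static str)> {
    let envs: HashMap<&'static str, &EnvSet> =
        fns.iter().map(|f| (f.name, &f.env)).collect();

    let mut errors = Vec::new();
    for f in fns {
        for callee in &f.calls {
            match envs.get(callee) {
                // Fine: the callee is available wherever the caller is.
                Some(callee_env) if f.env.is_subset(callee_env) => {}
                // Unknown callee, or callee missing in some caller env.
                _ => errors.push((f.name, *callee)),
            }
        }
    }
    errors
}
```

A Linux-only function calling another Linux-only function passes, while an unrestricted function calling a Linux-only one is reported.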

Something like this is not possible:

env RUST_VERSION = ...;
const ARRAY: [u32; RUST_VERSION] = ...;

Next time I'll pick a less confusing syntax.

atagunov · November 25, 2020, 1:14pm

..so if an env_if makes members of a struct differ between Linux and Windows
does an fn outside any env_if-s that uses the struct need to be compiled into MIR twice?

rpjohnst · November 25, 2020, 4:40pm

Presumably it would only need to be compiled once, for the active env_if state of the target platform. We're not producing binaries that run on both Linux and Windows, after all

atagunov · November 25, 2020, 4:58pm

Actually in my understanding the crux of @stepancheg's suggestion was to sort of compile for both Linux and Windows at the same time. We already have #[cfg(windows)] to compile for either Windows or Linux.

What @stepancheg wanted I understand was to be able to know if the application would compile for Windows while compiling for Linux.

So I understand @stepancheg's idea is to compile like this when building the main binary too. Compile everything to MIR, make sure it compiles for all possible combinations of env flags and then only generate code for one target combination of those flags.

Now my question is: if a fn defined outside of any of the env_if uses a struct that does use env_if to define a different set of members for different platforms, do we need to compile that fn to MIR more than once?..

stepancheg · November 25, 2020, 9:29pm

Function without env_if (== function with any env) should not be able to access struct fields annotated with env_if.

stepancheg · November 25, 2020, 9:32pm

It is simpler. For typechecking we only need function signatures, not function bodies. So all function signatures (and struct signatures) will be available, but only functions matching current environment will have bodies in MIR.

atagunov · November 25, 2020, 11:14pm

So if you're compiling for Linux would the bodies of Windows-only functions be checked? I understand the goal was to make sure that project still compiles under Windows even when building for Linux?
