Summary
There are regular statements that "safe code cannot cause UB", including from Rust team members. This statement is technically wrong. The correct statement is "safe code in clients cannot cause UB". I'm wondering why these statements are made (i.e. whether they are simplifications or have unstated assumptions). In particular I'm worried that it gives the wrong impression to people unfamiliar with Rust, who might take such statements at face value.
In scope for this thread:
- Why does the language claim that "safe code cannot cause UB"?
- Why doesn't the language guarantee that "safe code cannot cause UB"?
- Does the language plan to guarantee that "safe code cannot cause UB" in the future?
Out of scope for this thread (please open a new thread and link it here if desired):
- Why the language guarantees that "safe code in clients cannot cause UB"?
- How well safety contracts are proven? (This thread is only about safety contracts, not about how they are verified.)
- How formal safety contracts are. (This thread is only about the content of safety contracts, not about how they are written.)
Details
Terminology for unsafe (expand if unfamiliar)
In the context of unsafe, code can be classified as follows:
|------------unsafe-code-------------|-----------------safe-code------------------|
|-directly-unsafe-code-| |--seemingly-safe-code--|
|------------------actually-unsafe-code----------------------|-actually-safe-code-|
|-----may-have-UB------|
|---------has-unsafe-keyword---------|
|-----------------------may-cause-UB-------------------------|
- Directly-unsafe code means code that may trigger undefined behavior unless some property holds, which the type system cannot check and thus the programmer must check. Typical examples are dereferencing a raw pointer, accessing a mutable static variable, and accessing a field of a union.
- Unsafe API means API annotated with the
unsafekeyword. They are meant to propagate safety requirements when needed. Typical examples are unsafe functions and unsafe traits. - Unsafe code means code annotated with the
unsafekeyword. This contains directly-unsafe code and code using an unsafe API. Typical examples (that are not directly-unsafe) are calling an unsafe function[1] and implementing an unsafe trait. - Safe code means code not annotated with the
unsafekeyword. This is the complement of unsafe code. - Actually-unsafe code means code that may cause undefined behavior unless some property holds, which the type system cannot check and thus the programmer must check. This definition is equivalent to the scope of unsafe: code that has to be checked by the programmer for properties that directly-unsafe code relies on to not trigger undefined behavior. This contains code that the programmer must check to prevent undefined behavior somewhere. Typical examples (that are not unsafe) are the body of
Vec::set_len()or the hypotheticalVec::evil()of the scope of unsafe article. - Seemingly-safe code means code that is safe and actually-unsafe, as introduced by the scope of unsafe article. This is the complement of unsafe code within actually-unsafe code.
- Actually-safe code means code that is safe and not actually-unsafe. This is the complement of seemingly-safe code within safe code.
The mismatch between unsafe code and actually-unsafe code (and between safe code and actually-safe code) is not the subject of this thread. How to make those concepts match is described in the unsafe mental model. This thread is about the language convention regarding safety guarantees, in particular how much control users have to write actually-safe code (and if they can at all).
Terminology for contracts (expand if unfamiliar)
- A unit of implementation is code whose parts cannot be independently updated and are thus tightly coupled. There is no unique objective choice. In particular, a unit of implementation can always be split in smaller units of implementation. The maximal example is a crate linked with its pinned dependencies (and so recursively). More common examples are a crate and a module. Minimal examples are a function definition and a trait implementation.
- A client of a unit of implementation is another unit of implementation that uses it. A dependency of a unit of implementation is another unit of implementation that it uses. Those concepts are opposite of each other: if X is a client of Y then Y is a dependency of X.
- A public API is how a unit of implementation interfaces with its client and dependencies. A unit of implementation chooses the public API with its clients and chooses its dependencies based on the public API they chose for their clients.
- A public API defines (among other things) a logic contract (resp. safety contract) used to prove that programs execute their intended logic (resp. execute without undefined behavior). Note that the safety contract is only meaningful for unsafe code. Safe code only sees the logic contract.
- An X contract has X requirements and X guarantees from one side to the other (where X is logic or safety).
- A unit of implementation is correct for X, if it satisfies its side of the X contract of all the public APIs it is involved in, assuming the other sides also satisfy their side of the contract. That unit of implementation is said to trust the other sides to correctly implement the contract, when their implementation is unknown. In contrast, a unit of implementation does not need to trust another unit of implementation if they are both part of the same unit of implementation, they can simply use the other unit of implementation ignoring any possible internal contract between them.
When designing a public API, the logic contract can be freely chosen but the safety contract must follow these constraints:
- The safe part of public APIs must not have safety requirements. Public APIs should avoid having safety requirements. This constraint prevents the scope of unsafe to leak from a unit of implementation to its clients through the safe part of its public API. This constraint however allows the scope of unsafe to leak through the unsafe part of that public API, which the client may choose to use or not. In particular, the client is in control of whether (and how) it ends up in the scope of unsafe of its dependencies.
- The safety guarantees of public APIs must be their logic contract. Unsafe code should avoid relying on safety guarantees. This constraint allows the scope of unsafe to leak from a unit of implementation to its dependencies. There is no opt-out. In particular, a dependency is not in control of whether it ends up in the scope of unsafe of its clients.
Illustration for functions
Public APIs for functions must have the following shape:
#[logic_requires(log_req)]
#[logic_ensures(|res| log_ens)]
#[safety_requires(safe_req)] // trivial (or absent) for safe functions
#[safety_ensures(|res| old(log_req) => log_ens)]
// `X => Y` means `Y || !X` and `old(X)` means "X held before the call"
The first constraint says that safe functions cannot choose safe_req (it must be true). The second constraint says that safe_ens cannot be chosen and must be old(log_req) => log_ens, including for safe functions.
Let's consider the following alternative to the second constraint:
- Public APIs must explicitly document their safety guarantees (they have none by default). Public APIs should avoid having safety guarantees. This constraint prevents the scope of unsafe to leak from a unit of implementation to its dependencies unless explicitly authorized, which the dependency may choose to which extent. In particular, a dependency is in control of whether (and how) it ends up in the scope of unsafe of its clients.
All crates are virtually in the scope of unsafe
All crates can cause undefined behavior, even if they use #![forbid(unsafe_code)].
Let's consider a crate author that doesn't want to deal with undefined behavior. They write a crate foo which uses only safe code and forbids unsafe code. Another crate author writes a crate bar which depends on foo. They write unsafe code which relies on the safety guarantees of foo (which is the logic contract of foo). It's all working.
The initial author refactors their crate foo and publish a patch or minor version. Sadly this new version is not correct for logic anymore (or maybe the one before wasn't either, but it didn't impact bar). A third user who wrote a program using bar now runs cargo update, then builds and deploys the binary. A few days later their bitcoins are stolen due to undefined behavior in their program. Who is at fault? Which crate is unsound (in the sense of incorrect for safety)?
From a theoretical point of view, the bug is in foo. But it's a logic bug, and absolutely no crate in the program handles bitcoins. A logic bug in the safe crate foo caused undefined behavior because it was in the scope of unsafe of the sound (in the sense of correct for safety) crate bar.
From a vulnerability point of view, one could say that bar is at fault, because it's the code that triggered undefined behavior. While it is permitted for bar to rely on foo being correct for logic (through its safety guarantees) for the purpose of safety, it is also not recommended to do so.
My opinion is that the language is at fault, because there is no way for foo to opt-out of the scope of unsafe of its clients, and bar thus didn't know foo's intentions and couldn't respect them.
For more context, the following claim suggests that foo is not at fault (regardless of whether it would have been fine providing safety guarantees):
If you do not use
unsafein your code, you know that any potential causes of memory unsafety are not your fault.
The following hypothesis suggests that bar is at fault by not being self-contained:
Unsafe code blocks should be straightforward and self-contained to minimise the amount of code that developers have to vouch for, e.g. through manual reviews.
Self-containedness: Is the behaviour of unsafe code dependent only on code in its own crate?
But the current convention says that bar is not at fault because unsafe code can rely on correctness of its dependencies, which on the one side rejects the preceding hypothesis and on the other fails to attribute the fault on any of the involved crates when taken together with the first claim.
With the alternative constraint, foo would have no safety guarantees by default, thus making sure it only contains actually-safe code, and preventing bar to rely on its logic contract. If it wanted to, it could document exactly which safety guarantees it is ready to support. Then bar would either use them if they would be sufficient, or need to find an alternative crate or implement the logic on their own.
Reviewing a crate for safety is unreasonably costly
Why review a crate for safety?
In the context of security, a crate:
- Can be thoroughly audited for safety (since undefined behavior may provide arbitrary code execution, regardless of what the actual code is supposed to do).
- Only skimmed for specific logic errors (insufficient user input validation, incompatible time-of-check and time-of-use, etc).
This strategy is a nice spot in the "security guarantees" versus "audit effort" space. Having to thoroughly review for logic would be a much worse trade-off.
Because the logic contract is embedded in the safety contract, proving that a unit of implementation is correct for safety essentially amounts to proving that it is correct for logic. Because logic contracts are usually orders of magnitude more complex than safety contracts, reviewing a crate for safety is much more costly than it would have been with the alternative constraint. From the example of the previous section, this means that it is not possible to review bar for safety without reviewing foo for logic (past, current, and future versions).
Note that this alternative already exists in some form today. Some crates document as "Safety-usable invariant" the safety guarantees that are used by their clients (these documentation are added while reviewing those clients, thus minimizing safety guarantees to what is necessary). But this is more of a mix (or trade-off) between the current and alternative constraints. Some crates are assumed to be correct for logic while others need safety guarantees be documented.
A logic bug is a safety bug
Because the logic contract is embedded in the safety contract, a logic bug is a safety bug. This blurs the distinction between safety and logic, which is one of the primary ways Rust is meant to differ from C. In particular, the type system is supposed to provide the safety contract for actually-safe functions.
In other words, in Rust (like in C) there's essentially only one contract, because the logic contract and the safety contract are equivalent. This equivalence follows from the fact that the logic requirements must imply the safety requirements and the fact that the safety guarantees are the logic contract.
With the alternative constraint the concepts of logic and safety are disjoint. You cannot have a safety bug because of a logic bug.
Asymmetry with safety requirements
There are 4 constraints regarding safety contracts depending on whether we look at safety requirements or guarantees and whether we look at whether the scope of unsafe should be encapsulated or leak:
| Safety requirements | Safety guarantees | |
|---|---|---|
| Encapsulated scope of unsafe | (Current) Safety requirements must be documented and should be avoided | (Proposed) Safety guarantees must be documented and should be avoided |
| Leaky scope of unsafe | (Discounted) Safety requirements must be the logic requirements and unsafe code should avoid relying on them | (Current) Safety guarantees must be the logic contract and unsafe code should avoid relying on them |
Today we choose differently between requirements and guarantees. The alternative constraint mimics the choice for requirements, making sure the scope of unsafe can be encapsulated (from both clients and dependencies, instead of just from dependencies).
The other option would be to change the constraint for safety requirements to mimic the one of guarantees. In that case we get exactly the situation in C, where there's only one contract (the logic contract) and code should avoid relying on those logic properties to not trigger undefined behavior (since those properties can easily break, due to their complexity).
The logic contract is not necessarily what unsafe code needs
The logic requirements of a dependency might be too strong compared to what a unit of implementation may be able to ensure based on the safety requirements it has with its clients. Similarly, the logic guarantees might be too strong compared to what it actually needs. For example if unsafe code needs a sort function to return a permutation if its comparison function is deterministic, it wouldn't be able to use the logic contract that says it returns the input sorted if the comparison function implements a total order.
With the alternative constraint, the sort function could document this safety contract (and others, including the logic contract).
The standard library is correct for logic
It seems that the alternative constraint would be very verbose for crates that actually want to guarantee their logic contract for safety, like the standard library. This could be overcome by stating the safety guarantees at crate-level rather than at each item: "This crate guarantees for safety that it is correct for logic (unless explicitly stated)."
The current constraint is not an issue in practice
As long as unsafe code only relies on the correctness of its dependencies in extremely rare and well-chosen cases (like the standard library or very well-written crates), there is no issue. This can be explained by those well-chosen crates upholding their logic contract to the highest level, in particular high enough for safety purposes.
Note that this hypothesis used to hold in 2020, and probably still holds in the vast majority of cases. Maybe a few security vulnerability from time to time is not worth the explicit documentation of safety guarantees (and letting crates opt out of the scope of unsafe).
Question: is the current constraint a deliberate choice?
When Rust team members state that "safe code cannot cause UB", I wonder how much nuance is implied and what the long-term language vision actually is. Assuming the current constraint, I can see two non-exclusive interpretations:
- It's just a shorthand and the actual statement is more subtle. In particular, a safe binary crate cannot cause UB, but safe libraries can implicitly cause UB because they cannot control how unsafe clients rely on their logic.
- It assumes that unsafe code will only rely on the correctness of dependencies that have a high probability of being correct (such as the standard library). A similar assumption applied to clients would be untenable since they are written by arbitrary authors (which can't all be held accountable to writing correct code).
But this could also hide a long-term vision based on the alternative constraint, allowing crates to write actually-safe code independently of their clients and dependencies. So I'm curious to know the official reasoning behind such statements.
which is not an intrinsic ↩︎