Private unsafe fields are a poorly motivated feature

They are consistent.

In your first example, it is fine for the field itself to be safe.
The struct does not contain unsafe code with safety requirements whose violation could lead to UB. Even if val were zero, this would only be a logical error, not undefined behavior.

If you decide that the struct does hold a type invariant (e.g., val != 0 must always hold), then that invariant should be stated explicitly, and the field, as well as any other struct APIs that may break the invariant, should be declared unsafe. In general, type invariants should be defined first, and unsafe should follow from them.

In your second example, crazy should be declared unsafe unless the struct declares the type invariant self.val != 0; it can only be safe with that invariant.

Note that even with the type invariant, the type may also provide unchecked constructors; such constructors must be declared unsafe. In this way, any undefined behavior related to the internal unsafe callee can only be triggered through an explicit unsafe constructor (or unsafe field-related operations).
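This pattern can be sketched in today's Rust (the NonZeroVal name and methods are hypothetical, chosen just to illustrate the checked/unchecked split):

```rust
/// Invariant: `val != 0` must always hold.
struct NonZeroVal {
    val: u32,
}

impl NonZeroVal {
    /// Safe constructor: the invariant is checked at runtime.
    fn new(val: u32) -> Option<NonZeroVal> {
        (val != 0).then(|| NonZeroVal { val })
    }

    /// # Safety
    /// `val` must be non-zero; otherwise any later code that relies on
    /// the invariant may exhibit undefined behavior.
    unsafe fn new_unchecked(val: u32) -> NonZeroVal {
        NonZeroVal { val }
    }

    fn get(&self) -> u32 {
        self.val
    }
}

fn main() {
    assert!(NonZeroVal::new(0).is_none()); // invariant enforced in safe code
    let x = NonZeroVal::new(7).unwrap();
    assert_eq!(x.get(), 7);
}
```

With this layout, the only way to obtain a zero value is through the explicitly unsafe constructor, so responsibility for UB sits with the caller who wrote unsafe.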

This is analogous to Vec, where the len field and the method set_len are unsafe because Vec has type invariants related to len. As long as these invariants hold, other memory access APIs of the struct can remain safe. Vec also provides unsafe constructors and methods which may break these invariants if misused.
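The Vec case looks like this in practice: raw writes into spare capacity are followed by an unsafe set_len call whose safety comment discharges Vec's documented invariants (this is a standard usage sketch, not code from the thread):

```rust
fn fill_via_raw_parts() -> Vec<u32> {
    let mut v: Vec<u32> = Vec::with_capacity(2);
    let p = v.as_mut_ptr();
    // SAFETY: both writes stay within the allocated capacity of 2 elements.
    unsafe {
        p.write(10);
        p.add(1).write(20);
        // SAFETY: the first 2 elements are now initialized and 2 <= capacity,
        // which is exactly the invariant `Vec::set_len` requires.
        v.set_len(2);
    }
    v
}

fn main() {
    assert_eq!(fill_via_raw_parts(), [10, 20]);
}
```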

In the third example, the name NonZero alone does not establish a type invariant. Type names are not type invariants. If such a struct is able to create a zero value via safe code, then the implementation is buggy. If the type claims a type invariant but also exposes ways to violate it (e.g., unchecked constructors), then those constructors must be declared unsafe, and callers must explicitly uphold the invariant.

2 Likes

Any other choice just results in "well I'm sure it's not actually UB so I'll just use unsafe" -- we already fight that for things like UTF-8 validation, so I really don't want to make it any worse. This is one of the reasons that MaybeUninit is so good, because it focuses the unsafe just to the one thing that actually needs it. Making uninitialized memory isn't unsafe. Writing to uninit memory isn't unsafe, nor is using that initialized memory in the same branch (assuming you used MaybeUninit::write or Box::write or similar). The unsafe is only for the spot that really can be UB, when you're asserting that it's initialized.
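The MaybeUninit point above can be shown directly: creation and writing are safe, and unsafe appears only at the initialization assertion:

```rust
use std::mem::MaybeUninit;

fn init_example() -> u32 {
    // Creating uninitialized memory is not unsafe:
    let mut slot: MaybeUninit<u32> = MaybeUninit::uninit();
    // Writing through `MaybeUninit::write` is not unsafe either:
    slot.write(42);
    // Only the claim "this is now initialized" needs unsafe:
    unsafe { slot.assume_init() }
}

fn main() {
    assert_eq!(init_example(), 42);
}
```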

To quote myself,

Today's unsafe is distinguished because if violated you can't trust what you see -- telemetry might be wrong, checks that you see might not happen, etc. That's very different from correctness issues, where the code might be wrong but at least you can debug it, check logs, etc.

What's your proposed rule for what should be unsafe?

4 Likes

Yes, my core argument was that the essential part of the feature is making it impossible to bypass a wrapper (via mem::replace or whatever), which only needs blocking mut.

But it seems like most were swayed by the "but what about unsafe foo: Cell<i32>?" kinds of things, and would rather copy-paste // SAFETY: it's just a read it's fine everywhere.

They wanted people to make wrappers for the common case of "read is safe" instead of the uncommon case of "reads you have to be careful with too", which is frustrating to me, because if you want "this is &str but not UTF-8", that should be a wrapper type: that way you can pass it to a different function while preserving that difference.

The only argument here that makes sense to me is that it's true that making such a wrapper around &str would want a "reads are unsafe too" marker. But that could just be the longer, less-nice version of the annotation instead.

1 Like

I think you're conflating "UB" with "language UB". But @HjVT specifically said "AM-level UB" as opposed to just "UB".

Do you not accept "library UB" as a valid concept?

The reference seems to allow library-defined UB:

Behavior considered undefined

[...]

If a type has a custom range of valid values, then a valid value must be in that range. In the standard library, this affects NonNull<T> and NonZero<T>.

This suggests that types in crates other than std may also define custom ranges of valid values. And so, my Prime type defines a custom range of valid values: the prime numbers.

The rule I use is this: if the module author says something is undefined behavior, then it is undefined behavior (library UB). If the behavior is undefined for some inputs, then the function should be unsafe.

This is useful so that the library doesn't have to overpromise things and define behaviors it doesn't want to define. It also allows the library to make strong guarantees it wouldn't otherwise be able to make (such as: Prime::get() will never ever return 91).
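A minimal sketch of the Prime type described above (the trial-division check is my assumption; the post doesn't show an implementation):

```rust
/// Invariant: the wrapped value is prime.
pub struct Prime(u32);

impl Prime {
    /// Safe constructor: checks primality, so `get()` never returns a composite.
    pub fn new(n: u32) -> Option<Prime> {
        let is_prime = n >= 2 && (2..).take_while(|d| d * d <= n).all(|d| n % d != 0);
        is_prime.then(|| Prime(n))
    }

    /// # Safety
    /// `n` must be prime. Passing a composite (like 91 = 7 * 13) is
    /// library UB: the documentation simply leaves the behavior undefined.
    pub unsafe fn new_unchecked(n: u32) -> Prime {
        Prime(n)
    }

    pub fn get(&self) -> u32 {
        self.0
    }
}

fn main() {
    assert!(Prime::new(91).is_none()); // 91 = 7 * 13 is rejected in safe code
    assert_eq!(Prime::new(13).unwrap().get(), 13);
}
```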

4 Likes

Library UB exists to provide proofs for Language UB later.

&str being UTF-8 is library UB so that when I'm processing it I can do things like "// SAFETY: this indexing is in-bounds because in valid UTF-8 a 0b1110xxxx byte is always followed by 2 more", or include things like match utf8_byte { 0xFF => unreachable_unchecked(), ... because 0xFF is impossible in valid UTF-8, so that unreachable_unchecked is justified.
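The byte-structure guarantee can be checked concretely; this sketch uses '€' (encoded as E2 82 AC) to show the lead-byte/continuation-byte pattern that justifies the unchecked indexing described above:

```rust
fn euro_bytes() -> &'static [u8] {
    "€".as_bytes()
}

fn main() {
    let bytes = euro_bytes();
    assert_eq!(bytes, [0xE2, 0x82, 0xAC]);
    // The lead byte matches 0b1110xxxx, so in valid UTF-8 exactly two
    // continuation bytes (0b10xxxxxx) must follow, and indexing them is in-bounds.
    assert_eq!(bytes[0] >> 4, 0b1110);
    assert_eq!(bytes[1] >> 6, 0b10);
    assert_eq!(bytes[2] >> 6, 0b10);
    // 0xFF never appears anywhere in valid UTF-8:
    assert!(!bytes.contains(&0xFF));
}
```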

There's no point of library UB for things that would still be sound if you just said "this isn't UB".

(And NonNull is language UB, not library UB. It's a validity invariant.)

1 Like

That's your preference. I see value for it beyond that (as described above). This is similar to how language UB exists for reasons beyond what hardware can't handle.

But regardless of why it exists, the point is it does exist -- library says something is undefined behavior for whatever reason, and so it is undefined behavior.

So you can't say Prime::new_unchecked(91) isn't actually UB. It is actually UB (library UB) because the function documentation says so.

The fact that a certain version of the library may do something predictable at runtime doesn't mean it's not actually UB, similarly to how u32::unchecked_add(u32::MAX, u32::MAX) may do something predictable at runtime for some specific version of std / compiler, but is still UB because the documentation says so.

3 Likes

There's actually a difference here. The reason language UB is special is that when debugging you usually read the source Rust code, not the target machine code (which you didn't write, so why read it?). This is what the quote by @scottmcm meant. You want explainability at source level. There is no such equivalent reasoning for library UB.

[Note: I believe the name "library UB" was and is a mistake. The only thing called UB should be language UB. That's how it is in other languages. This is a contention point.]

It's analogous to saying that you don't usually read library source code when debugging something that uses that library, which you didn't write. You want explainability at the API + documentation level without looking at the library code.

This is totally false about C++. Library UB not only exists, it's very common. For instance: calling std::sort with a non-transitive comparison operator is undefined behavior in C++.

3 Likes

So if the library deletes your hard-drive even though its API never mentioned that, you would consider the library to be unsound? That makes sense, but I believe it's more realistic to read library code than read compiler code when debugging.

Also, Rust's type system doesn't track I/O and other correctness properties, while it does track UB. So you wouldn't get anything out of pure safe code.

In C++, the standard library is part of the language: it's defined by the standard. Compilers can use the specification of those symbols to optimize the code calling them (and reciprocally, when they see code that implements such a specification, they can replace it with a call to the standard symbol). That's not how it is in Rust, where the standard library is just a special library (it defines a small number of language types, true, and is pinned to the compiler, but otherwise its code is not special and the compiler won't recognize it in any way). The same is true for OCaml, Haskell, and other languages that are not C/C++.

1 Like

I disagree with this – the Rust standard library is able to do a number of things that user code can't do.

Some of it is defining lang items, as you mentioned, but there are lots of special cases for things that aren't closely linked to the language. (For example, Arc is a type that you wouldn't expect to have special cases, but it does – you can declare a method receiver as self: Arc<Self> in stable Rust, but you wouldn't be able to do that with your own version of Arc.)

Some of it is custom optimisation rules: NonZeroUsize is assumed to never be 0 by the optimiser even if you don't do anything with it (again, this is something that it's not possible to implement outside the standard library), and a zero NonZeroUsize could thus easily be used to create undefined behaviour (e.g. storing it in a Result<NonZeroUsize, Zst> could be used to materialize an arbitrary ZST).
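That optimizer assumption is observable through layout: because the 0 bit pattern is declared impossible, Option can use it as the None niche (this layout is documented and guaranteed for the NonZero types):

```rust
use std::mem::size_of;
use std::num::NonZeroUsize;

fn main() {
    // `Option<NonZeroUsize>` needs no discriminant byte: the all-zeros
    // pattern, which a valid NonZeroUsize can never have, encodes `None`.
    assert_eq!(size_of::<Option<NonZeroUsize>>(), size_of::<usize>());
}
```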

Some of it is minor compatibility special cases like deref coercion on [].into_iter() working differently based on the edition. (OK, both arrays and IntoIterator already need special cases anyway, with arrays being a primitive and IntoIterator being implicitly usable by for loops. But the special case for combining them is unrelated to both of those.)

Rust does that too, for portions of the standard library – basically everything in std::intrinsics works like that (and a large proportion of it is directly linked up to special cases in LLVM). The entire module is unstable (and likely intentionally perma-unstable), but it's part of the standard library (and many of the intrinsics have stable wrappers in other modules).

IIRC even the "the compiler sees code that does X, so it replaces it with a call to the standard library function X" happens in Rust to some extent (and thus declaring your own function with the same name but different semantics will cause a miscompile even if it's never explicitly called). Rust gets away with this soundly by giving the functions in question unmangled names (so you have to use a #[unsafe(no_mangle)] to cause a collision in the first place, which acts as a method of forcing unsound code to write unsafe) and by putting them into the compiler_builtins library rather than core/std/alloc, so arguably you can consider them to not be part of the standard library (but they're still linked against all Rust code).

2 Likes

This is irrelevant. I could write my own C++ sorting library and state the same: "parameters must satisfy [...] otherwise the behavior is undefined". Plenty of third-party C++ libraries do that.

3 Likes

I don't think there is a way to use a wrapper (type) around the field type to make the read safe though o.O

You'll need to write a wrapper method, and you'll need to use unsafe to define this method; that doesn't mean you have eliminated the need to use unsafe for doing the read, it just means you have centralized it.

3 Likes

True, my list was non-exhaustive for 2 reasons:

  • I only tried to address UB-related stuff, so I explicitly ignored things like Arc as receiver.
  • I forgot that it has access to other stuff than lang items, like to define NonZeroUsize. (Those could eventually be nightly features, compared to the lang items which are not.)

Those are by definition part of the language. They're not even implemented by the standard library; they're just exposed because users don't have access to the whole language (yet). This layer of indirection gives some freedom while the language (by that I mean the calculus, think MiniRust) stabilizes.

That would be interesting to see indeed. That function should not have any lang or rustc item, otherwise it would be part of the language.

You can write that, but it won't be (language) undefined behavior, just library undefined behavior. The difference between "calling a language function implemented by the standard library and violating its safety condition" and "calling a library function in a similar way and with the same implementation", is that the call itself is (language) UB in the case of the language function, so the compiler can optimize based on the call itself, while for the library function, the compiler needs to look at the implementation to possibly optimize something.

For example, if you call memcpy(dst, src, cnt) in C, the compiler can assume that dst and src are non-null without looking at memcpy's implementation. The same happens in Rust: if you call intrinsics::copy_nonoverlapping(src, dst, cnt), the compiler can assume that src and dst are aligned. But if you call ptr::copy_nonoverlapping(src, dst, cnt) instead, then it has to look at the implementation to see that it's a call to the intrinsic, and then optimize. Otherwise the whole ub_checks::assert_unsafe_precondition!() macro would be dead code if library UB had the same meaning as language UB.
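For reference, this is what a call to the stable ptr::copy_nonoverlapping wrapper looks like, with its documented preconditions discharged in the safety comment:

```rust
fn copy4(src: &[u8; 4]) -> [u8; 4] {
    let mut dst = [0u8; 4];
    // SAFETY: both pointers are valid for 4 bytes, properly aligned for u8,
    // and the source and destination do not overlap; violating any of these
    // documented preconditions would be UB.
    unsafe {
        std::ptr::copy_nonoverlapping(src.as_ptr(), dst.as_mut_ptr(), 4);
    }
    dst
}

fn main() {
    assert_eq!(copy4(&[1, 2, 3, 4]), [1, 2, 3, 4]);
}
```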

1 Like

That was my point. It was a counter-example to your claim that other languages don't have library UB. They do.

I guess you might be suggesting that in C++ library UB is always specifically called "library UB" and never just "UB"? That's also false. People normally just say "doing such and such is undefined behavior" or "behavior is undefined if you do this" for library UB.

3 Likes

In your own crate, you can use unsafe however you want! There is no Rust Police that will stop you. But β€œunsafe is only for UB/soundness invariants” is the long-established convention of the standard library and ecosystem, it’s the basis of various soundness proofs, and it’s what users expect.

6 Likes

This has been proposed a number of times. I still think it's a good idea, especially because Rust already has another kind of safety, which it calls unwind safety (note that today, we use AssertUnwindSafe(..) rather than unsafe(unwind) { .. }).

Also, it's unfortunate that in Rust, I/O safety was collapsed into memory safety, even though there are safe APIs out there that receive RawFd.

The only link between I/O safety and memory safety is the suggestion that we could maybe have a safe interface for memory mapped files, and maybe even access them as &[u8] or something, but that doesn't really work if files can be changed by other programs.

The other reasoning:

I/O safety errors can also mean that a piece of code can read, write, or delete data used by other parts of the program, without naming them or being given a reference to them. It becomes impossible to bound the set of things a crate can do without knowing the implementation details of all other crates linked into the program.

is unfortunately too similar to plain logical correctness (it can also happen with APIs that internally use static variables, for example, but don't expose those variables themselves), and isn't what memory safety is about.

1 Like

This.

While reading the whole thread, I was thinking that it is unfortunate that we don't have user-level unsafe, and that using unsafe for this weakens the whole concept of unsafe. But if we had it (or the similar ongoing work on contracts), it would help library writers provide and demonstrate their invariants more effectively. It would really raise the language's robustness as a whole.

3 Likes

I agree with (and like) the idea of having user-defined properties tracked in the syntax, but I don't think it should use the word unsafe. unsafe is for safety, and the fact that it's the only existing syntactic mechanism for a type-system side channel doesn't mean it should be used for other (similar) purposes.

User-defined properties (like correctness, panic-freedom, purity, etc.) should use a different keyword, which could indeed be generic for uniformity: userprop(nopanic) or userprop<pure> (up to bikeshedding, obviously).

Again, this mechanism might come for free with contracts, once they get generalized to more than just safety. There were initial plans to have them for correctness too, but by making them generic over the program property of interest, we permit user-defined properties like panic-freedom and purity.

4 Likes

I find it hard to separate the idea of "memory-safety-level unsafe" from "user-defined/correctness unsafe" because I think there's too much overlap/interaction.

Suppose I write an alternative standard library that places restrictions on what files a program can access – accessing a file you don't have permission to is "user-level unsafe". Except that it isn't: it's actually memory-safety-level unsafe, because changing a file can break memory safety (most infamously /proc/self/mem, but you can also do things like overwriting an executable that's automatically run by the system to put unsound code into it).

Suppose I write a library that authenticates users. It's written entirely in safe Rust, so the compiler is supposed to ensure memory safety. But if I left a bug in it that could allow a user to be authenticated when they shouldn't be, that user might gain access to the system – and depending on the access level they gain, that could be much worse than a memory safety bug. (They might even write some code with an unfulfilled unsafe precondition and then run it, so this scenario can lead to memory safety problems in addition to worse things.) In this case, correctness is effectively a memory-safety requirement, and violating it can cause memory soundness bugs, even though the code is written in safe Rust.

Suppose I write a library that connects to a database and does database queries. I would hope to use a technique that prevented user-provided strings being misinterpreted as SQL code, but if I didn't, the compiler wouldn't catch it – and that means that normal inputs could cause arbitrary code to run. Again, this isn't a memory-safety problem from the point of view of what Rust tries to catch – but it can lead to arbitrary code being run. I suspect that at least some SQL dialects provide ways to run arbitrary executables, which could violate memory safety if the executables aren't memory-safe – but even if they don't, "an attacker can do arbitrary reads and writes on your database server" is on the same level of bad outcomes as "the code contains a memory-safety bug".

In general, Rust doesn't consider misusable things like reading and writing files, and correctness issues in safety-critical algorithms, to be unsafe because it would add too much overhead to code and cause too many false alarms – at some point, you have to draw a line in order to keep the unsafety situation usable. But that doesn't change the fact that doing so is actually unsafe, and that if code can provide an API that prevents this sort of uncheckable unsound operation occurring, it should.

For example, I consider it unsound to run SQL code that could be attacker-controlled, and it's typically also very easy to avoid (by using a hardcoded SQL string with placeholders and getting the database to substitute the placeholders for you). So crates for doing SQL should probably be enforcing that the strings are not attacker-controlled, perhaps using an SqlCommand type that has "is not attacker-controlled" as a safety requirement. I think it would be reasonable for such a type to have a new that took a compile-time constant as its argument and an unsafe new_unchecked that took an arbitrary string (with a safety requirement to ensure that it doesn't include potentially attacker-controlled information). I don't think it would be reasonable for the new_unchecked to not be marked unsafe on the basis that it doesn't directly cause memory safety issues – there's still an "arbitrary bad behaviour" possibility here, effectively a UB one, if the method is misused.
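A sketch of the hypothetical SqlCommand described above (the names and the &'static str heuristic are my assumptions, not an existing crate's API):

```rust
use std::borrow::Cow;

/// Invariant: the wrapped string is not attacker-controlled.
pub struct SqlCommand(Cow<'static, str>);

impl SqlCommand {
    /// Safe: a `&'static str` (in the common literal case) is fixed at
    /// compile time and so can't carry attacker-controlled data.
    pub const fn new(sql: &'static str) -> SqlCommand {
        SqlCommand(Cow::Borrowed(sql))
    }

    /// # Safety
    /// The caller must ensure `sql` contains no attacker-controlled data.
    pub unsafe fn new_unchecked(sql: String) -> SqlCommand {
        SqlCommand(Cow::Owned(sql))
    }

    pub fn text(&self) -> &str {
        &self.0
    }
}

fn main() {
    // User input goes into placeholders, never into the command text itself.
    let cmd = SqlCommand::new("SELECT name FROM users WHERE id = ?");
    assert_eq!(cmd.text(), "SELECT name FROM users WHERE id = ?");
}
```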

3 Likes

The idea sounds good in principle, but I don't see how this could work. Maybe it can, I just don't see how.

For instance, what kind of unsafe should transmuting u8 to bool be? Just converting doesn't yet cause memory problems. So perhaps it could be considered some kind of special unsafe(bool_range). But pretty soon any use of that out-of-range bool can cause other types of problems, including memory problems. How does one convert an earlier unsafe(bool_range) operation to an unsafe(memory) consequence later? I don't see how that could possibly work.
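For what it's worth, Rust's current answer is that the conversion itself is already the UB point: bool has a validity invariant, so transmuting any u8 other than 0 or 1 is immediate language UB, not merely a deferred problem. A minimal legal example:

```rust
fn u8_to_bool(raw: u8) -> bool {
    // SAFETY: the caller-supplied value below is 1, one of bool's two valid
    // bit patterns (0 and 1). Under current rules, transmuting any other u8
    // is language UB at the transmute itself, not only at a later use.
    unsafe { std::mem::transmute(raw) }
}

fn main() {
    assert!(u8_to_bool(1));
    assert!(!u8_to_bool(0));
}
```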

So I don't really think it's all that bad to simply use unsafe.