When defining data structures, I sometimes need an unsafe method of accessing the structure, one which can break the structure's invariants. I really like using unsafe to support this, but some other developers I work with think unsafe should only be used for code which can cause memory corruption or crashes. Therefore I'd like to suggest the ability to introduce new "types" of unsafe in packages, which would let developers be more specific about what is unsafe.
I'm happy to discuss real examples, but for now let's consider a simple fake example: a pair of integers which have to be different, struct DisjointPair { a: usize, b: usize }. Occasionally, for efficiency, I need raw mutable access to a.
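What I currently write is something like this sketch (the accessor name is illustrative):

```rust
struct DisjointPair {
    a: usize,
    b: usize,
}

impl DisjointPair {
    /// Safe constructor: rejects equal values up front.
    fn new(a: usize, b: usize) -> Option<Self> {
        if a == b { None } else { Some(DisjointPair { a, b }) }
    }

    /// Escape hatch: raw mutable access to `a`.
    /// Caller must keep `a != b`; violating this can't cause UB,
    /// but later queries may silently return wrong answers.
    unsafe fn get_a_unchecked(&mut self) -> &mut usize {
        &mut self.a
    }
}
```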
I think a real example might be more useful than the fake one, since I’m here wondering:
- What exactly is the invariant of DisjointPair? You mean something like a “distinct pair”, i.e. the two values have to be different?
- What can happen when the values aren’t different? Perhaps you are using the assumption that a and b are different in some places, for optimization purposes, in a way that can cause memory unsafety if the invariant is violated. Or perhaps you are planning on adding such an optimization in the future. Or you explicitly want users of your type to be able to rely on correct behavior w.r.t. the distinctness invariant, even in unsafe code, in a way that can cause memory corruption if the invariant is violated. In all these cases using ordinary unsafe would be fine.
- In case memory corruption or other kinds of UB really are impossible: what would compiler support give you that you can’t get from appropriately naming the accessor function, e.g. get_a_unchecked, and documenting the invariants that are supposed to be upheld?
Yes, I mean the values have to be different. If the values are not different, then the algorithm using this structure can produce incorrect answers. There are no memory-unsafety issues, just answers which are incorrect.
Data structures I've implemented like this include lots of data structures used in constraint solvers (a part of A.I., see https://www.github.com/minion/minion), and the "partition" structure used in graph isomorphism tools like Nauty. In both cases there are fairly complicated data structures which most of the time check all their own invariants, but need an "escape hatch" to let people occasionally change their internals directly -- for example, in the case of Nauty you sometimes take direct access to two internal lists, but must only reorder their elements, not change them, and must call a "fixup" function after changing the lists, to correct some internal counters. A sketch of the pattern is below.
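Roughly like this (Partition, elems_mut and fixup are illustrative names, not Nauty's actual API):

```rust
struct Partition {
    elems: Vec<usize>,
    counters: Vec<usize>, // derived from `elems`
}

impl Partition {
    /// Escape hatch: direct mutable access to the internal list.
    /// Contract: only *reorder* the elements, never change them,
    /// and call `fixup` afterwards.
    unsafe fn elems_mut(&mut self) -> &mut [usize] {
        &mut self.elems
    }

    fn fixup(&mut self) {
        // Recompute `counters` from the (reordered) `elems`.
    }
}
```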
I agree that I don't technically gain anything over "an appropriately named accessor function, and documentation", but on the other hand, what does unsafe itself provide over documentation and appropriately named accessor functions? My understanding (sorry if I am wrong) is that it doesn't actually change the behaviour of code; it just causes code which calls unsafe functions or performs unsafe operations to be rejected unless it is inside an unsafe block.
However, the advantage of unsafe, for me, is that it makes it easy to find exactly the pieces of code one has to check for correctness. I would like to be able to extend that to invariants which do not cause UB, but could still cause incorrect behaviour in data structures.
Yes, indeed, it doesn’t change behavior. It gives the ability to call unsafe functions and to dereference raw pointers. I guess the main appeal of unsafe being a special language construct with its own keyword is that we then get to claim that Rust is a memory-safe language, and we can do things like easily search for uses of the unsafe keyword or disallow unsafe code by means of #![forbid(unsafe_code)].
Memory unsafety is a bit special in that it can be hard to catch even with tests. A data structure that allows unguarded access to its internals could introduce appropriate sanity checks in debug builds; then, once you've tested your code properly, you'd be fairly certain that you never get a corrupted data structure.
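For instance, using the DisjointPair example from above (a sketch; the check is whatever makes sense for the structure):

```rust
struct DisjointPair { a: usize, b: usize }

impl DisjointPair {
    fn check_invariants(&self) {
        // debug_assert! compiles to nothing in release builds.
        debug_assert!(self.a != self.b, "DisjointPair invariant violated");
    }

    fn swap_values(&mut self) {
        self.check_invariants();
        std::mem::swap(&mut self.a, &mut self.b);
        self.check_invariants();
    }
}
```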
Lots of things in std can lead to misbehavior, too, e.g. if you implement Hash with side effects for HashMap keys, or modify RefCells inside a BTreeSet.
Also, often it is possible to just check or (soft-)enforce invariants. You could hand out a guard object with an API that only allows immutable read access to list elements plus re-ordering, e.g. by providing some kind of swap method that takes two indices and swaps them. The guard object would also call fixup on drop. The fixup call would not be strictly enforced this way, since someone could mem::forget the guard, but as long as a missed drop cannot cause memory unsafety, this should be enough to prevent accidental bugs. A sketch follows.
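Something like this, reusing the hypothetical Partition from the earlier sketch:

```rust
struct Partition {
    elems: Vec<usize>,
    counters: Vec<usize>, // derived from `elems`
}

impl Partition {
    fn fixup(&mut self) { /* recompute `counters` from `elems` */ }

    /// Hand out a guard that only permits reads and reordering.
    fn reorder(&mut self) -> ReorderGuard<'_> {
        ReorderGuard { partition: self }
    }
}

struct ReorderGuard<'a> {
    partition: &'a mut Partition,
}

impl ReorderGuard<'_> {
    /// Immutable read access to an element.
    fn get(&self, i: usize) -> usize {
        self.partition.elems[i]
    }

    /// The only mutation allowed: swapping two elements.
    fn swap(&mut self, i: usize, j: usize) {
        self.partition.elems.swap(i, j);
    }
}

impl Drop for ReorderGuard<'_> {
    fn drop(&mut self) {
        // Soft-enforced: runs unless someone mem::forgets the guard.
        self.partition.fixup();
    }
}
```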
It seems that unsafe has a somewhat unclear focus. In most Rust documentation, it is used directly in relation to memory unsafety. However, in actual use it often marks any function that makes certain unenforced assumptions. The conflation of the two might be worth clarifying.
For example, str::from_utf8_unchecked is an unsafe function not for memory safety purposes, but to maintain its type invariant. (This invariant, in turn, is used to guarantee memory safety elsewhere.) The ability to demarcate regions that may break a specific invariant seems to me quite valuable. (Moreover, unsafe seems altogether too strong for what could be the more restrictive unsafe(utf8).)
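A concrete illustration (the std calls are real; the comments are my gloss):

```rust
fn main() {
    let bytes = b"hello";
    // Checked: validates UTF-8 at runtime.
    let s = std::str::from_utf8(bytes).unwrap();
    // Unchecked: the caller promises the bytes are valid UTF-8.
    // That promise upholds str's type invariant, which other unsafe
    // code relies on -- so violating it can cause UB far from here.
    let s2 = unsafe { std::str::from_utf8_unchecked(bytes) };
    assert_eq!(s, s2);
}
```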
(Some of) the value of unsafe comes from its performance implications – many languages may provide memory safety, but at the cost of frequent run-time checks. Rust provides a method to offload these checks to the user, enabling more flexible and efficient guarantees. This seems to be what the OP is looking for, rather than a fixed pattern.
The point is that memory unsafety isn't the only class of bugs worth preventing.
Given that standard unsafe already provides enough power to do what get_a_unchecked does manually (and so break any invariants), I'm not convinced that it shouldn't imply unsafe(disjoint).
That transitive relationship means that the type invariant is a memory safety issue. The fact that it won't immediately break is irrelevant. You could say the same for Vec::set_len -- it's harmless in itself, but carries a large burden in the future.
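The sound-usage pattern for set_len shows where that burden lives (a standard sketch using real std APIs):

```rust
fn main() {
    let mut v: Vec<u8> = Vec::with_capacity(4);
    let p = v.as_mut_ptr();
    unsafe {
        // Initialize every element *before* claiming it exists.
        for i in 0..4 {
            p.add(i).write(i as u8);
        }
        // The call itself just stores a length -- harmless in itself.
        // Doing it before the writes above would hand out
        // uninitialized memory to later readers: the future burden.
        v.set_len(4);
    }
    assert_eq!(v, [0, 1, 2, 3]);
}
```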
I'm still confused as to what this unsafe tagging is supposed to do. Is there anything, other than moving the safety documentation from the doc attribute into the code, that's supposed to be happening?
The RefCells-in-a-BTreeSet case is a good example -- this is the kind of situation I'm considering, and Rust doesn't consider it "unsafe". I am interested in the idea of being able to construct an explicit "let me be unsafe with the data structure" object, which would still give me a way of clearly encapsulating and finding places where behaviour that is unsafe (to the data structure) is occurring.
Does your version of unsafe code have the potential to lead to exploitable security vulnerabilities? If not, in my opinion you are talking about unsound code, not unsafe code.
Those of us with a background in cybersecurity use the unsafe keyword to focus our code reviews. Your effort to extend the meaning of that word does us a disservice and would make our job considerably more difficult, as we would have to discern that your code was only a soundness hazard, not a security hazard. Please choose a different word to delimit your areas of concern.
I’ve often seen the adjective “unsound” used specifically to describe (non-unsafe fn) APIs that, when called from potentially malicious but unsafe-free Rust code (and without exploiting compiler bugs), can lead to memory unsafety, i.e. UB.
You apparently use it differently here.
Just to clear up confusion for people who are used to this word being used like that in an unsafe-Rust context.
I'd be happy to use a different word (and edit it into my prior post), as long as that word is not unsafe. Many programs have logical invariants that need to be maintained for them to function correctly. In Rust code, violation of those invariants cannot lead to security vulnerabilities unless the code also uses unsafe.
In the Rust standard library, e.g. HashMap and BTreeMap declare certain ways to interact with their API a “logic error”. With that kind of phrasing, code that e.g. constructs a BTreeSet<RefCell<i32>>, and then starts modifying the values in the set in a way that changes their relative order before using the BTreeMap API again “has a logic error”, I guess?
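Concretely, that scenario looks like this (note there is no unsafe anywhere in the snippet):

```rust
use std::cell::RefCell;
use std::collections::BTreeSet;

fn main() {
    let set = BTreeSet::from([RefCell::new(1), RefCell::new(2)]);
    // Interior mutability lets us change a key in place, inverting
    // the order the tree was built on -- without any `unsafe`.
    *set.iter().next().unwrap().borrow_mut() = 10;
    // The set is now internally inconsistent: lookups and iteration
    // may return wrong (but memory-safe) answers from here on.
    println!("{:?}", set.contains(&RefCell::new(2)));
}
```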
I disagree: correctness bugs can be security vulnerabilities; they just aren't memory bugs. A good example is an SSH server where, if the authentication function gives the wrong answer, that is usually directly exploitable.
Separately, that "has a logic error" phrasing in HashMap's docs is far too vague. Can it panic? Leak? It should at least make clear that every future call on that object can return invalid answers.
Yes, I, too, would be interested in what that entails. I would intuitively assume that a type like HashMap, when it goes into an inconsistent state, can cause panics and arbitrary (safely obtainable) return values on any method call, but not necessarily leak memory. (Of course it would be allowed to; but the standard library seems to usually try to limit memory leaks to only come from other memory leaks and from explicit leaking APIs, e.g. mem::forget, Box::leak, and reference-counting cycles.) It should probably be documented better, though.
You seem to be interpreting unsafe as insecure, but I would consider those to have distinct meanings. Limiting unsafe to points of insecurity would also make the keyword something of a misnomer.
unsafe in Rust is a keyword that informs the rustc compiler that the programmer is assuming part of the proof task that Rust's type system and borrow checker usually perform on their own to verify the memory and thread safety of the submitted program. In what way do these proposed "custom variants of unsafe" assume part of that proof task?
I think transitivity oversimplifies things. Following it to its logical conclusion, any broken invariant could be used to cause memory unsafety, at which point the term loses meaning.
I think the more important distinction is which invariants are directly affected: from_utf8_unchecked only directly affects the UTF-8 invariant, while set_len directly affects the bounds invariant, which is a memory-safety invariant.
We don't have to follow it that far. If the invariant is assumed by any unsafe code implementing the type, that makes it a memory safety invariant. But an invariant like BigUint keeping its most-significant digit nonzero might only lead to logic errors when violated.
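A hedged sketch of that kind of logic-only invariant (a hypothetical type, not the real BigUint crate's API):

```rust
/// Little-endian digits; invariant: the last (most-significant)
/// digit is nonzero, so each value has exactly one representation.
struct BigUint {
    digits: Vec<u32>,
}

impl BigUint {
    /// Safe constructor: normalizes to restore the invariant.
    fn new(mut digits: Vec<u32>) -> Self {
        while digits.last() == Some(&0) {
            digits.pop();
        }
        BigUint { digits }
    }

    /// Trusting constructor: caller promises `digits` is normalized.
    /// Breaking that promise can't cause UB, only wrong answers below.
    fn from_normalized(digits: Vec<u32>) -> Self {
        debug_assert_ne!(digits.last(), Some(&0));
        BigUint { digits }
    }

    /// Relies on normalization: zero is exactly the empty digit list.
    fn is_zero(&self) -> bool {
        self.digits.is_empty()
    }
}
```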
I can see that use of the term unsafe is upsetting to some people, and I understand that. I think unsound might be a better choice.
I would think security developers might like a standard way for library authors to clearly mark those functions which can be used to cause unsound behaviour.