Whenever “undefined behaviour” gets mentioned I feel the need to reference this paper:
What every compiler writer should know about programmers
(subtitle: “Optimization” based on undefined behaviour hurts performance)
I agree with the sentiment, however...
optimizations that assume that undefined behaviour does not exist are a bad idea not just for security, but also for the performance of non-benchmark programs.
That is literally the only reason undefined behavior even exists. (It could be made implementation-defined otherwise).
The fix for the problem stated in the paper is to define more behavior. Not to change what the term "undefined behavior" means.
I mean from a UB checker's perspective, or more specifically that the language definition itself breaks UB checking. That’s because the pigeonhole principle makes it impossible to tell the memory representation of pointer_a apart from that of pointer_b + offset, since the offset has the same number of bits (usize) as a pointer.
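To make the pigeonhole argument concrete, here is a minimal sketch that uses plain integers in place of real pointers. The helper names `offset_reaching` and `apply_offset` are made up for illustration. It only shows that for any two addresses there is a usize offset that takes one to the other, so the resulting integer by itself cannot say which allocation it was derived from.

```rust
/// Hypothetical helper: the offset that turns `base` into `target`
/// under wrapping usize arithmetic.
fn offset_reaching(target: usize, base: usize) -> usize {
    target.wrapping_sub(base)
}

/// Hypothetical helper: `base + offset`, computed purely on integers.
fn apply_offset(base: usize, offset: usize) -> usize {
    base.wrapping_add(offset)
}
```

For any `pointer_a` and `pointer_b`, `apply_offset(pointer_b, offset_reaching(pointer_a, pointer_b))` equals `pointer_a`, which is the ambiguity a checker without extra state cannot resolve.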
But maybe the language definition can be stretched a bit: make usize 128-bit (I think Rust still keeps that possibility open by forbidding u64::from(usize)) but keep offsets/indexes de facto limited to 64 bits (even though an index is still a usize, it has to be <= u64::max_value()). This way casts to and from 128-bit integers will be lossless.
Making usize 128-bit for the purpose of UB checking will also catch tons of programs that do ptr as u64 as *const (which is another kind of error, right?)
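A rough sketch of why a 128-bit usize would expose such casts, assuming addresses are modeled as plain `u128` values (no real pointers involved, and both function names are hypothetical): squeezing the value through a `u64` silently drops the upper 64 bits, so any address that uses them no longer round-trips.

```rust
/// Model of a 128-bit address cast to u64 and back, as in `ptr as u64 as *const _`.
fn through_u64(addr: u128) -> u128 {
    (addr as u64) as u128
}

/// True if the address is unchanged by the detour through u64.
fn survives_u64_roundtrip(addr: u128) -> bool {
    through_u64(addr) == addr
}
```

An address that fits in 64 bits survives, while one with any of the upper bits set comes back as a different value, which a checker could flag.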
It doesn't break UB checking. It just requires extra state for UB checking. That is pretty common.
Rust will need to support usize == u128 anyway, since one of the defined base instruction sets for RISC-V, RV128I, provides 128-bit addressing. LLVM does not yet support that option, which presumably is why Rust's recent 128-bit support in 1.26.0 did not include usize == u128. Here (Edit 2) is a list of files where u128 support in 1.26.0 was inadequate to cover usize == u128.
Yeah, but then you'll need usize == u256 to support a u128 address of the allocation and a u128 offset within the allocation.
If you only want to track pointer usage in normal, well-behaved programs then it might be OK, but I think it’s not possible to use metadata to track all valid Rust programs.
First, Rust programs are allowed to perform I/O in ways that don’t allow for any extra metadata. I can take a pointer, print it out on a piece of paper, then type it back on my keyboard, read that and cast it back to a pointer, and (if I don’t make a typo) it won’t be UB.
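The paper-and-keyboard round trip can be sketched as follows, with a `String` standing in for the piece of paper. The helper names `pointer_to_text` and `text_to_pointer` are hypothetical, and whether dereferencing the result is defined behaviour is exactly what this thread is debating; the sketch only shows that no metadata can travel with the text.

```rust
/// Hypothetical helper: write the pointer's address out as decimal text.
fn pointer_to_text(p: *const i32) -> String {
    (p as usize).to_string()
}

/// Hypothetical helper: parse the text back into an integer and cast it
/// to a pointer. Returns None if the text is not a valid usize (a "typo").
fn text_to_pointer(text: &str) -> Option<*const i32> {
    text.trim()
        .parse::<usize>()
        .ok()
        .map(|addr| addr as *const i32)
}
```

Feeding `pointer_to_text(p)` into `text_to_pointer` yields a pointer that compares equal to `p`, yet everything the program kept between the two calls is a sequence of digits.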
Even if you limit analysis to programs that don’t do I/O, the processing of integers can be arbitrarily complex.
For example, I could split my integer-from-pointer into two halves, and then search through halves of many other integers-from-pointers until I find two that match each half, and reconstruct my original pointer from halves of two technically unrelated pointers. Even if you tracked where every pointer-flavored bit originated from, in the end you’d see a pointer created from a mishmash of “wrong” bits. Of course nobody would really write such a program, but if you’re designing a theoretical model that describes how UB works, then a metadata-based model doesn’t correctly describe it in such a case.
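Here is a small sketch of that thought experiment on plain 64-bit integers (assumed to stand in for integers-from-pointers; the names `split`, `join` and `rebuild_from_others` are made up for this example). It rebuilds the target value using only bits taken from other values.

```rust
const HALF_BITS: u32 = u64::BITS / 2;

/// Split an address-sized integer into its (high, low) halves.
fn split(addr: u64) -> (u32, u32) {
    ((addr >> HALF_BITS) as u32, addr as u32)
}

/// Join two halves back into one integer.
fn join(high: u32, low: u32) -> u64 {
    ((high as u64) << HALF_BITS) | (low as u64)
}

/// Rebuild `target` using only halves taken from `others`:
/// one donor supplies the high half, another supplies the low half.
fn rebuild_from_others(target: u64, others: &[u64]) -> Option<u64> {
    let (high, low) = split(target);
    let high_donor = others.iter().find(|&&o| split(o).0 == high)?;
    let low_donor = others.iter().find(|&&o| split(o).1 == low)?;
    Some(join(split(*high_donor).0, split(*low_donor).1))
}
```

The returned integer is bit-for-bit equal to `target`, but every bit in it was copied from some other value, so per-bit origin tracking would attribute it to the wrong sources.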
No. u256 is not required. What you have described is a two-component structure where the address is a u128 and the offset is either a u64, as specified in the post to which I was replying, or perhaps a second u128. In either case, it is not a one-component unsigned value but rather a two-component structure, similar to the "fat pointers" that abound in Rust for slices, vtables, etc.
Saying that a u128 address requires u256 is akin to saying that a u64 address requires u128. Rust has long had support for u64 addresses but only recently added support for u128 integers.
I'm pretty sure that's not allowed, as LLVM will optimize pointers assuming that you're not doing that: LLVM Language Reference Manual — LLVM 18.0.0git documentation
It's generally speaking unclear, but if this was done through integer casts, then my interpretation of LLVM's documentation and behavior is that this is allowed. The compiler anyway cannot track where that integer is traveling. Maybe it gets signed, sent over the network, sent back and used after verifying the signature? I could imagine that actually being useful.
Doesn’t dbus rely on something like that?
Note that the C standard’s committee is currently concerned with the same question, see http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2263.htm
And related to that http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2221.htm and http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2222.htm
Although I have not followed it actively, I believe they are currently considering something very much like @RalfJung’s simple memory model (aka the CompCert memory model).
A Simple Pointer Model
struct Pointer {
    alloc_id: usize,
    offset: isize,
}
This is more complex than just a value. Even with just a value, an extra table could be used to look up alloc_id with a less-than comparison. As you mention, it (alloc_id) must avoid a one-past-the-end address causing an overlap. All the validity work is then functional rather than typed. (In the end it probably does not really matter.)
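A minimal sketch of that table lookup, under the assumption that allocations are kept sorted by base address and do not overlap. The `Allocation` record and `lookup_alloc_id` are hypothetical names, and the alloc_id here is simply the index into the table.

```rust
/// Hypothetical allocation record covering addresses [base, base + size).
#[derive(Clone, Copy, Debug)]
struct Allocation {
    base: usize,
    size: usize,
}

/// Find the alloc_id (index into `table`) for `addr` by comparison.
/// Assumes `table` is sorted by `base` and allocations do not overlap.
fn lookup_alloc_id(table: &[Allocation], addr: usize) -> Option<usize> {
    // Index of the first allocation whose base is greater than `addr`.
    let idx = table.partition_point(|a| a.base <= addr);
    // The candidate is the allocation just before that one, if any.
    let alloc_id = idx.checked_sub(1)?;
    let alloc = table[alloc_id];
    // Accept one-past-the-end addresses as well (offset == size).
    if addr - alloc.base <= alloc.size {
        Some(alloc_id)
    } else {
        None
    }
}
```

When two allocations are adjacent, the one-past-the-end address of the first equals the base of the second, and this lookup always answers with the second. That is the overlap mentioned above, and it is why a plain value plus a table carries less information than an explicit alloc_id.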
Is this or should this be UB? First is aliasing, which Cell can fix. Second is how strict are lexical lifetimes, should rust be disallowing any potential optimisation. ((re)Ordering also comes into play but probably covered by aliasing.)
My thoughts are along the lines of;
Stack being reused such as tail recursion optimisation.
a getting mutably written as final act then read into another variable, but this could just occur in registers and not be written back to stack. (Thinking vaguely of inline functions still have never case used by pointers.)
I'm in agreement with @notriddle that pointers are for FFI and so need to meet expectations that brings. (Not so with references unless used in-conjunction with pointers.)
Actually, just looking at the DRAM or SRAM interface from a hardware point of view, there are at least three states for the bits in the RAM; 0, 1 and "somewhere in between". The memory controller on the CPU will decide what "somewhere in between" represents, and it does not have to be consistent about it (e.g. it can treat the same "value" from the RAM as 0 for some purposes, and 1 in other cases).
So, there's effectively an undefined state for RAM, where the hardware cannot tell what bit was stored, and resolves this difficulty by choosing a value at random based on what it finds simplest at that moment.
Note that the C standard’s committee is currently concerned with the same question, see http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2263.htm
Thanks for all these links! I wasn't aware of all of them when I wrote my post, but have been pointed at them multiple times now. This is great.
Now I just need to find some time to read all that stuff...
Thanks again @RalfJung! I'm really excited to see in this area. Your posts continue to give me a healthy mind stretching.
(working title
– in the next post.
I had trouble parsing this sentence. I think you closed a parenthesis with a smiley face
I think you closed a parenthesis with a smiley face
Yay indeed, I like doing that.
Another related blog post by John Regehr: https://blog.regehr.org/archives/1621
Congrats on the paper!