Pointers Are Complicated, or: What's in a Byte?

mvduin · July 27, 2018, 6:31am

Whenever “undefined behaviour” gets mentioned I feel the need to reference this paper:

What every compiler writer should know about programmers
(subtitle: “Optimization” based on undefined behaviour hurts performance)

RalfJung · July 27, 2018, 10:44am

I agree with the sentiment, however...

optimizations that assume that undefined behaviour does not exist are a bad idea not just for security, but also for the performance of non-benchmark programs.

That is literally the only reason undefined behavior even exists. (It could be made implementation-defined otherwise).

The fix for the problem stated in the paper is to define more behavior. Not to change what the term "undefined behavior" means.

kornel · July 27, 2018, 4:42pm

I mean from a UB checker's perspective, or more specifically that the language definition itself breaks UB checking. That’s because the pigeonhole principle makes it impossible to tell apart the memory representation of pointer_a vs pointer_b + offset, since the offset has the same number of bits (usize) as a pointer.

But maybe the language definition can be stretched a bit. Suppose you make usize 128-bit (I think Rust still keeps that possibility open by forbidding u64::from(usize)) but keep offsets/indexes de facto limited to 64 bits (even though an index is still usize, it has to be <= u64::max_value()). This way casts to and from 128-bit integers will be lossless.

Making usize 128-bit for the purpose of UB checking will also catch tons of programs that do ptr as u64 as *const (which is another kind of error, right?)

RalfJung · July 27, 2018, 6:00pm

It doesn't break UB checking. It just requires extra state for UB checking. That is pretty common.

Tom-Phinney · July 27, 2018, 6:09pm

Rust will need to support usize == u128 anyway, since one of the defined base instruction sets for RISC-V, RV128I, provides 128-bit addressing. LLVM does not yet support that option, which presumably is why Rust's recent 128-bit support in 1.26.0 did not include usize == u128. Here (Edit 2) is a list of files where u128 support in 1.26.0 was inadequate to cover usize == u128.

gbutler · July 27, 2018, 7:53pm

Yeah, but then you'll need usize == u256 to support a u128 address of the allocation and a u128 offset within the allocation.

kornel · July 27, 2018, 10:03pm

If you only want to track pointer usage in normal, well-behaved programs then it might be OK, but I think it’s not possible to use metadata to track all valid Rust programs.

First, Rust programs are allowed to perform I/O in ways that don’t allow for any extra metadata. I can take a pointer, print it out on a piece of paper, then type it back on my keyboard, read that and cast it back to a pointer, and, if I don’t make a typo, it won’t be UB.
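As a rough illustration of the paper-and-keyboard scenario, here is a minimal sketch in which a `String` stands in for the piece of paper. The helper names `pointer_to_text`, `text_to_pointer`, and `read_via_text` are made up for this example. Whether the final read is defined behaviour is the very question under discussion here, so treat this as a demonstration of the scenario, not as a ruling on it.

```rust
/// Turn a pointer into text, as if printing it on paper.
/// The `as usize` cast keeps only the address; no metadata survives.
fn pointer_to_text(p: *const i32) -> String {
    format!("{:x}", p as usize)
}

/// Parse the text back into a pointer, as if it were typed on a keyboard.
/// Returns None if the text is not a valid hexadecimal address.
fn text_to_pointer(s: &str) -> Option<*const i32> {
    let addr = usize::from_str_radix(s.trim(), 16).ok()?;
    Some(addr as *const i32)
}

/// Round-trip a pointer to `value` through text, then read through it.
fn read_via_text(value: &i32) -> i32 {
    let text = pointer_to_text(value as *const i32);
    let q = text_to_pointer(&text).expect("text came from pointer_to_text");
    // `value` is still live here, so the address is valid to read.
    // The open question is only whether the model accepts `q`.
    unsafe { *q }
}
```

A checker that only attaches metadata to values inside the program has nothing to attach to `text` once it leaves the process, which is the difficulty described above.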

Even if you limit analysis to programs that don’t do I/O, the processing of integers can be arbitrarily complex.

For example, I could split my integer-from-pointer into two halves, and then search through halves of many other integers-from-pointers until I find two that match each half, and reconstruct my original pointer from halves of two technically unrelated pointers. Even if you tracked where every pointer-flavored bit originated from, in the end you’d see a pointer created from a mishmash of “wrong” bits. Of course nobody would really write such a program, but if you’re designing a theoretical model that describes how UB works, then a metadata-based model doesn’t correctly describe it in such a case.
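The halves construction can be sketched on plain integers. The function names `halves` and `rebuild_from_donors` are hypothetical, and the code deliberately stops at the integer: casting the result back to a pointer and dereferencing it is the step whose status is in question.

```rust
/// Split an address into (high half, low half).
fn halves(addr: usize) -> (usize, usize) {
    let half_bits = usize::BITS / 2;
    (addr >> half_bits, addr & ((1usize << half_bits) - 1))
}

/// Rebuild `target` using only halves found among `donors`.
/// Returns None if no donor supplies a matching high or low half.
fn rebuild_from_donors(target: usize, donors: &[usize]) -> Option<usize> {
    let (hi, lo) = halves(target);
    // Find one donor whose high half matches and one whose low half matches.
    let hi_src = donors.iter().copied().find(|&d| halves(d).0 == hi)?;
    let lo_src = donors.iter().copied().find(|&d| halves(d).1 == lo)?;
    let half_bits = usize::BITS / 2;
    // Every bit of the result comes from a donor, none from `target`.
    Some((halves(hi_src).0 << half_bits) | halves(lo_src).1)
}
```

The returned integer equals the original address even though, bit for bit, it was assembled from other values, so per-bit origin tracking would attribute it to the "wrong" allocations.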

Tom-Phinney · July 27, 2018, 10:18pm

No. u256 is not required. What you have described is a two-component structure where the address is a u128 and the offset is either a u64, as specified in the post to which I was replying, or perhaps a second u128. In either case, it is not a one-component unsigned value but rather a two-component structure, similar to the "fat pointers" that abound in Rust for slices, vtables, etc.
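To make the comparison with fat pointers concrete, here is a small sketch. `TwoPartPointer` is a hypothetical type written only to mirror the address-plus-offset structure described above; the two helper functions measure how many machine words Rust's existing references occupy on the current target.

```rust
use std::mem::size_of;

/// Hypothetical two-component pointer: a 128-bit address plus a
/// 64-bit offset, stored as two fields rather than one wide integer.
#[allow(dead_code)]
struct TwoPartPointer {
    address: u128,
    offset: u64,
}

/// Machine words in a thin reference (address only).
fn thin_ref_words() -> usize {
    size_of::<&u8>() / size_of::<usize>()
}

/// Machine words in a slice reference (address plus length).
fn slice_ref_words() -> usize {
    size_of::<&[u8]>() / size_of::<usize>()
}
```

On current targets a slice reference takes two words while `usize` stays one word wide, which is the sense in which a second component does not force a wider integer type.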

Saying that a u128 address requires u256 is akin to saying that a u64 address requires u128. Rust has long had support for u64 addresses but only recently added support for u128 integers.

scottmcm · July 27, 2018, 10:25pm

I'm pretty sure that's not allowed, as LLVM will optimize pointers assuming that you're not doing that: LLVM Language Reference Manual — LLVM 18.0.0git documentation

RalfJung · July 27, 2018, 11:00pm

Generally speaking it's unclear, but if this was done through integer casts, then my interpretation of LLVM's documentation and behavior is that this is allowed. The compiler cannot track where that integer is traveling anyway. Maybe it gets signed, sent over the network, sent back and used after verifying the signature? I could imagine that actually being useful.

Soni · July 27, 2018, 11:46pm

Doesn’t dbus rely on something like that?

robbert · July 31, 2018, 9:05am

Note that the C standard’s committee is currently concerned with the same question, see http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2263.htm

And related to that http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2221.htm and http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2222.htm

Although I have not followed it actively, I believe they are currently considering something very much like @RalfJung’s simple memory model (aka the CompCert memory model).

jonh · July 31, 2018, 1:16pm

A Simple Pointer Model

struct Pointer {
    alloc_id: usize,
    offset: isize,
}

This is more complex than just a value. Even with just a value, an extra table could then be used to look up alloc_id by a less-than comparison. As you mention, it (alloc_id) must avoid one-past-the-end causing an overlap. All the validity work is then functional rather than typed. (In the end it probably does not really matter.)
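One way to read the table idea is sketched below. `AllocTable` and its methods are hypothetical names, and the layout (an ordered map from base address to size and id) is an assumption made for illustration, not something taken from the post being quoted.

```rust
use std::collections::BTreeMap;

/// Hypothetical side table: base address -> (size in bytes, alloc_id).
struct AllocTable {
    by_base: BTreeMap<usize, (usize, usize)>,
}

impl AllocTable {
    fn new() -> Self {
        AllocTable { by_base: BTreeMap::new() }
    }

    /// Record an allocation of `size` bytes starting at `base`.
    fn insert(&mut self, base: usize, size: usize, alloc_id: usize) {
        self.by_base.insert(base, (size, alloc_id));
    }

    /// Map a bare address to (alloc_id, offset), if it lies inside
    /// a recorded allocation.
    fn lookup(&self, addr: usize) -> Option<(usize, isize)> {
        // Greatest base <= addr: the ordered comparison does the search.
        let (&base, &(size, alloc_id)) = self.by_base.range(..=addr).next_back()?;
        let offset = addr - base;
        // Only addresses strictly inside count. A one-past-the-end
        // address is rejected because it may equal the next base.
        if offset < size {
            Some((alloc_id, offset as isize))
        } else {
            None
        }
    }
}
```

With allocations at 100 (8 bytes) and 108 (4 bytes), address 108 resolves to the second allocation at offset 0, not to one past the end of the first. That is the overlap the one-past-the-end remark is about, and a plain address cannot distinguish the two cases.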

notriddle:

let a = 1;
let b = &a as *const i32;
let c = b as usize;
let d = c as *const i32;
assert_eq!(unsafe { *d }, 1);
The memory model must allow you to round-trip from pointer to integer and back again. Otherwise, you wouldn’t be able to use byte operations like libc::memcpy to move pointers around.
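
The memcpy case mentioned in the quote can be sketched with `std::ptr::copy_nonoverlapping`, which copies raw bytes much as `libc::memcpy` does. The helper name `copy_pointer_bytewise` is made up for this example.

```rust
use std::mem::size_of;
use std::ptr;

/// Copy a pointer value byte by byte, the way memcpy would,
/// and return the copy.
fn copy_pointer_bytewise(src: *const i32) -> *const i32 {
    let mut dst: *const i32 = ptr::null();
    unsafe {
        // View both pointer variables as plain bytes and copy them.
        ptr::copy_nonoverlapping(
            &src as *const *const i32 as *const u8,
            &mut dst as *mut *const i32 as *mut u8,
            size_of::<*const i32>(),
        );
    }
    dst
}
```

The copy is expected to be usable exactly like the original, so whatever a byte holds in the memory model has to carry enough information for the pointer to survive this trip.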

Is this, or should this be, UB? First is aliasing, which Cell can fix. Second is how strict lexical lifetimes are: should Rust be disallowing any potential optimisation? ((Re)ordering also comes into play but is probably covered by aliasing.)

My thoughts are along the lines of; Stack being reused such as tail recursion optimisation. a getting mutably written as final act then read into another variable, but this could just occur in registers and not be written back to stack. (Thinking vaguely of inline functions still have never case used by pointers.)

I'm in agreement with @notriddle that pointers are for FFI and so need to meet the expectations that brings. (Not so with references unless used in conjunction with pointers.)

farnz · August 1, 2018, 10:28am

Actually, just looking at the DRAM or SRAM interface from a hardware point of view, there are at least three states for the bits in the RAM; 0, 1 and "somewhere in between". The memory controller on the CPU will decide what "somewhere in between" represents, and it does not have to be consistent about it (e.g. it can treat the same "value" from the RAM as 0 for some purposes, and 1 in other cases).

So, there's effectively an undefined state for RAM, where the hardware cannot tell what bit was stored, and resolves this difficulty by choosing a value at random based on what it finds simplest at that moment.

RalfJung · August 3, 2018, 10:23am

Thanks for all these links! I wasn't aware of all of them when I wrote my post, but have been pointed at them multiple times now. This is great. Now I just need to find some time to read all that stuff...

mark-i-m · August 4, 2018, 1:30am

Thanks again @RalfJung! I'm really excited to see work in this area. Your posts continue to give me a healthy mind stretching.

(working title – in the next post.

I had trouble parsing this sentence. I think you closed a parenthesis with a smiley face

RalfJung · August 4, 2018, 7:27am

Yay indeed, I like doing that.

RalfJung · September 19, 2018, 12:41pm

Another related blog post by John Regehr: https://blog.regehr.org/archives/1621

bgeron · September 19, 2018, 2:48pm

Congrats on the paper!

system · March 25, 2019, 8:30am

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.
