[Pre-RFC] usize is not size_t

InfernoDeity · October 6, 2021, 11:14pm

I still think there's a problem of confusion when there is more than one kind of address.

jrtc27 · October 6, 2021, 11:25pm

Everyone should just forget SCTAG exists; as a purist, it's an awful idea, shouldn't exist and should never be used, and as I've said before, firmware can stop even the kernel from using it (as can a hypervisor). The reason it's even in the architecture also isn't for materialising arbitrary capabilities in random device drivers, it's for fast swapping back in of pages, and possibly fast derivation of capabilities at process startup (akin to what already needs to be done to relocate position-independent binaries), both of which have safe alternatives, but those alternatives may not be quite right in Morello as it's based on an older version of CHERI when it comes to those instructions.

The way it's actually meant to work is that you go request the capability from a higher authority; e.g. in an OS kernel you don't poke that address, you go ask the kernel memory allocator to allocate a spare page of virtual address space, which gives you the capability to it, and you then go ask whatever deals with the page tables to map that physical address there (which is the abstraction that already necessarily exists due to using virtual memory), or, in a bare-metal environment, you probably just have a lightweight trusted component that holds onto a capability covering the whole address space, and you can hand it a physical address you want to touch because you're the device driver, and it hands back a capability to it after checking that you should indeed be allowed to access that bit of memory.
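As a rough model of the bare-metal case described above (purely illustrative: `Capability`, `RootAuthority`, and `request` are hypothetical names, and a real CHERI capability is a hardware-enforced value rather than a plain struct), the driver asks a trusted component for access instead of conjuring a pointer from an integer:

```rust
/// Hypothetical model of a capability: a bounded window of address space.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Capability {
    pub base: usize,
    pub len: usize,
}

/// Hypothetical trusted component that holds authority over the whole
/// address space and decides which ranges it is willing to hand out.
pub struct RootAuthority {
    /// `(base, len)` regions that device drivers may request.
    device_regions: Vec<(usize, usize)>,
}

impl RootAuthority {
    pub fn new(device_regions: Vec<(usize, usize)>) -> Self {
        Self { device_regions }
    }

    /// Returns a capability for `[base, base + len)` only if that range lies
    /// entirely inside one permitted region; otherwise returns `None`.
    pub fn request(&self, base: usize, len: usize) -> Option<Capability> {
        // Reject ranges that wrap around the address space.
        let end = base.checked_add(len)?;
        let permitted = self.device_regions.iter().any(|&(region_base, region_len)| {
            base >= region_base && end <= region_base.saturating_add(region_len)
        });
        permitted.then_some(Capability { base, len })
    }
}
```

The point of the shape is that the check lives in one trusted place, and everything else only ever holds what it was explicitly given.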

jacobbramley · October 7, 2021, 9:08am

That, and it's not even available in Morello's EL0 (user space), so it's irrelevant to a large proportion of deployed Rust code.

RalfJung · October 27, 2021, 6:38pm

There could be a globally known static provenance for memory that is definitely disjoint from any address that would ever be provided by operations that are part of the language (so disjoint from stack, heap, globals, everything that might be mentioned in the spec).

Then you are free to cast statically assigned addresses to pointers using that provenance (implicitly making a promise that that pointer will never be used for any address that would be valid with any other provenance).

Such a NON_LANGUAGE_PROVENANCE or OUTSIDE_WORLD_PROVENANCE or whatever you want to call it would be rather easy to support; the mentioned restriction makes it much less painful than a general int-to-ptr. (If you also take into account Stacked Borrows, things become more interesting, but still very well-behaved; references to this non-language memory would still be subject to aliasing rules but all the tags of these references would "derive from" the NON_LANGUAGE_PROVENANCE root. The crucial bit is that the language never has to guess which provenance a pointer should have.)

InfernoDeity · October 27, 2021, 11:00pm

Given there hasn't been significant discussion on this for ~1 month, I'm going to proceed with a proposal. I assume this has to go through T-lang, so I'll file an MCP (or is it an Initiative Proposal now? Is there even a difference?) for supporting size_t != uintptr_t generally, and that should handle figuring out exactly how to proceed. LMK if you'd like to be initially involved with that.

H2CO3 · October 28, 2021, 10:58am

I agree in principle, but this would lead to a massive false positive rate on the overwhelming majority of targets where usize is uptr.

josh · October 28, 2021, 11:17am

It wouldn't be a false positive; that's the point. If it would be too disruptive to make the types different, then we shouldn't make the types different.

I don't want to make the types almost always the same, making it almost always work to use usize, and have it only fail on obscure platforms with absolutely no indication otherwise on any other platform.

H2CO3 · October 28, 2021, 12:16pm

This.

InfernoDeity · October 28, 2021, 1:52pm

No matter what we do, we are going to have to weigh the disruption against the benefits of separating them, and of allowing platforms that separate them.

josh · October 28, 2021, 4:45pm

I agree. But we shouldn't discount the disruption by making it invisible to other platforms.

tcsc · October 30, 2021, 8:10pm

I guess I never finished the comment about this before, and I don't see these points being made already, but the search function here sucks so who knows...

I don't think this would be easy to lint — complaining about ptr as usize is not enough.

I've written code many times which relies on size_of::<usize>() and size_of::<*const T>() being the same, but almost none of this code does pointer-to-integer casts (or vice versa) — typically, this takes the form of computing an offset using size_of::<usize>(), and passing that offset to ptr::{add, sub, offset} (or their wrapping variety).

That is, consider a case like the following (adapted from an unfinished project of mine):

#[repr(C)]
struct Header {
    next: Option<NonNull<Header>>,
    size: usize,
    data: [u8; 0],
}

// omitted, but present in the real code:
// const HEADER_NEXT_OFFSET: usize = size_of::<usize>() * 0;
// const HEADER_SIZE_OFFSET: usize = size_of::<usize>() * 1;
const HEADER_DATA_OFFSET: usize = size_of::<usize>() * 2;

unsafe fn header_to_data(head: *mut Header) -> *mut () {
    head.cast::<u8>()
        .add(HEADER_DATA_OFFSET)
        .cast::<()>()
}

unsafe fn data_to_header(data: *mut ()) -> *mut Header {
    data.cast::<u8>()
        .sub(HEADER_DATA_OFFSET)
        .cast::<Header>()
}

Anyway, I don't really know how you'd write a lint that detects this. I think it would be very hard, if it's even possible.
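It is not a lint, but code like this can at least be made to fail loudly. A minimal sketch (an illustration added here, not part of the real project) that turns the hidden assumptions into compile-time assertions, so a target where pointers and usize differ in size stops the build instead of silently computing the wrong offset:

```rust
use std::mem::size_of;
use std::ptr::NonNull;

#[repr(C)]
pub struct Header {
    pub next: Option<NonNull<Header>>,
    pub size: usize,
    pub data: [u8; 0],
}

pub const HEADER_DATA_OFFSET: usize = size_of::<usize>() * 2;

// Compilation fails on any target where these assumptions do not hold.
// Assumption 1: a pointer occupies exactly one `usize`.
const _: () = assert!(size_of::<*const Header>() == size_of::<usize>());
// Assumption 2: the two leading fields fill the struct with no padding,
// so the zero-sized `data` field sits at HEADER_DATA_OFFSET.
const _: () = assert!(size_of::<Header>() == HEADER_DATA_OFFSET);
```

This only protects code whose author thought to add the assertions, which is exactly the code least likely to need protecting.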

(You might argue that this code doesn't need to be written this way, and maybe[^1] that's true? But also code like this definitely exists — and not all of it will get maintenance updates, nor should it need to, given that we promised this was correct...)

I also suspect this kind of thing might be fairly common, maybe even as common as ptr as usize/a_usize as *const T... in my own code at least, the kind of offsetting which is susceptible to this issue is decidedly more common than casting between pointers and ints[^2]. As a result, breaking the promises made in the RFC (rfcs/0544-rename-int-uint.md at master · rust-lang/rfcs · GitHub) worries me a lot.

Comparatively, I'm a lot less worried about breaking FFI code that assumes size_t and usize are the same, even though I've done a fair bit[^3] of FFI code, almost all of which either uses --size_t-is-usize, or is a handwritten binding where I translated size_t to usize. I'm less worried about this because:

My experience is that people tend to be fairly cautious about assuming code is portable to new platforms, especially C/FFI code. Conversely, people tend not to be cautious about rustc updates (we make all those stability promises, after all).
Of the two examples most frequently cited (CHERI and W65), only one of them (CHERI) seems likely to be supported by typical C libraries, and that one is already quite weird.

Even ignoring this, I suspect a lot of C libraries won't work out of the box on these targets, either because they assume size_t/uintptr_t are the same, or just because it's a new target, and often this takes at least some degree of porting (due to all the implementation-defined behavior, especially for a "weirder" target like CHERI).

That said, I'm not unconcerned about the FFI code — in particular, system APIs are concerning (often people forgo a libc dep and just write a one-off extern "C" { ... }), and this code seems highly likely to break.
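For concreteness, a one-off binding of the kind meant here might look like the sketch below. C declares `size_t strlen(const char *s)`, and the binding simply writes usize where C says size_t. The wrapper name `c_string_len` is made up for the example.

```rust
use std::ffi::{c_char, CStr};

// Hand-written binding with no libc crate involved. `usize` stands in for
// C's `size_t`, which is the assumption that would break if the two types
// were ever allowed to differ.
// Written as `unsafe extern` so it also compiles in the 2024 edition
// (needs Rust 1.82+); older code would say plain `extern "C"`.
unsafe extern "C" {
    fn strlen(s: *const c_char) -> usize;
}

/// Length of a C string in bytes, excluding the NUL terminator.
pub fn c_string_len(s: &CStr) -> usize {
    // SAFETY: `CStr` guarantees a valid, NUL-terminated buffer.
    unsafe { strlen(s.as_ptr()) }
}
```

Nothing in this file records that usize was chosen because it matched size_t, so there is little for a tool to flag.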

Not to mention, I suspect it will push more people to not use --size_t-is-usize, which is unfortunate because it leads to pregenerated bindings which assume size_t is unsigned long because it happens to be on unix (e.g. the reason why the sdl2 crate's pregenerated bindings are partially broken on Windows)... But I guess that's more of a standing bindgen flaw, rather than something this issue should be directly worried about.

Anyway, sorry for writing a novel about this, I have... A lot of thoughts.

(lets imagine this supports footnotes...)

[^1]: While header_to_data can probably use addr_of_mut! these days, data_to_header cannot. That said, this example can probably use size_of::<Header>() rather than HEADER_DATA_OFFSET, but many cannot. Also, people coming from C will likely use an offset rather than a size here out of habit; in C, when emulating flexible array members, you often use 1 for the array length.
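A sketch of what that footnote describes, under the assumption that the struct has no trailing padding (true when pointers and usize are the same size, not guaranteed otherwise):

```rust
use std::mem::size_of;
use std::ptr::{self, NonNull};

#[repr(C)]
pub struct Header {
    pub next: Option<NonNull<Header>>,
    pub size: usize,
    pub data: [u8; 0],
}

/// # Safety
/// `head` must point to a live, properly aligned `Header`.
pub unsafe fn header_to_data(head: *mut Header) -> *mut () {
    // Takes the field's address directly, without creating a reference
    // and without any hand-computed offset.
    unsafe { ptr::addr_of_mut!((*head).data).cast::<()>() }
}

/// # Safety
/// `data` must have been produced by `header_to_data` on a live `Header`.
pub unsafe fn data_to_header(data: *mut ()) -> *mut Header {
    // Correct only while `data` sits at `size_of::<Header>()`, i.e. while
    // the struct has no trailing padding after the zero-sized field.
    unsafe { data.cast::<u8>().sub(size_of::<Header>()).cast::<Header>() }
}
```

On a target with 16-byte pointers and an 8-byte usize, the struct would presumably be padded out to 32 bytes while data sits at 24, so the size-based version would be wrong there too.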

While this code isn't susceptible to the issue I describe, here's an example of code where I used the offset for address arithmetic, even though the size would have been correct: Storing an object as &Header, but reading the data past the end of the header · Issue #256 · rust-lang/unsafe-code-guidelines · GitHub

[^2]: Partially because provenance is spooky, but also it just doesn't come up unless I'm doing pointer tagging or whatever.

[^3]: Among others, libsqlite3-sys, imgui-sys, (in the past) much of the FFI code used on the mobile firefoxes...

jrtc27 · November 7, 2021, 6:03pm

tcsc:

I guess I never finished the comment about this before, and I don't see these points being made already, but the search function here sucks so who knows...

I don't think this would be easy to lint — complaining about ptr as usize is not enough.

I've written code many times which relies on size_of::<usize>() and size_of::<*const T>() being the same, but almost none of this code does pointer-to-integer casts (or vice versa) — typically, this takes the form of computing an offset using size_of::<usize>(), and passing that offset to ptr::{add, sub, offset} (or their wrapping variety).

That is, consider a case like the following (adapted from an unfinished project of mine):

#[repr(C)]
struct Header {
    next: Option<NonNull<Header>>,
    size: usize,
    data: [u8; 0],
}

// omitted, but present in the real code:
// const HEADER_NEXT_OFFSET: usize = size_of::<usize>() * 0;
// const HEADER_SIZE_OFFSET: usize = size_of::<usize>() * 1;
const HEADER_DATA_OFFSET: usize = size_of::<usize>() * 2;

unsafe fn header_to_data(head: *mut Header) -> *mut () {
    head.cast::<u8>()
        .add(HEADER_DATA_OFFSET)
        .cast::<()>()
}

unsafe fn data_to_header(data: *mut ()) -> *mut Header {
    data.cast::<u8>()
        .sub(HEADER_DATA_OFFSET)
        .cast::<Header>()
}

Anyway, I don't really know how you'd write a lint that detects this. I think it would be very hard, if it's even possible.

(You might argue that this code doesn't need to be written this way, and maybe[^1] that's true? But also code like this definitely exists — and not all of it will get maintenance updates, nor should it need to, given that we promised this was correct...)

Honestly, you really shouldn't write that code, because it embeds all manner of knowledge about the ABI that, whilst true of the specific targets you have checked it against, has absolutely no guarantee of being true on other targets in future, whether CHERI or not (e.g. I believe that even if next and size have the same representation, there's nothing in the C standard saying that there can't be padding between them). This problem was solved decades ago in C by stddef.h's offsetof, so if Rust forces you to write assumption-laden code like yours then that's a sorry state of affairs. That's also an easy problem to solve though: just add an offsetof to Rust and move on.
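For reference, Rust did later gain such a facility: `core::mem::offset_of!` has been stable since Rust 1.77. A sketch of the quoted example rewritten on top of it, so the offset comes from the compiler's layout rather than from arithmetic on size_of::<usize>():

```rust
use std::mem::offset_of;
use std::ptr::NonNull;

#[repr(C)]
pub struct Header {
    pub next: Option<NonNull<Header>>,
    pub size: usize,
    pub data: [u8; 0],
}

/// Byte offset of `data` as laid out by the compiler, padding included.
pub const HEADER_DATA_OFFSET: usize = offset_of!(Header, data);

/// # Safety
/// `head` must point to a live, properly aligned `Header`.
pub unsafe fn header_to_data(head: *mut Header) -> *mut () {
    unsafe { head.cast::<u8>().add(HEADER_DATA_OFFSET).cast::<()>() }
}

/// # Safety
/// `data` must have been produced by `header_to_data` on a live `Header`.
pub unsafe fn data_to_header(data: *mut ()) -> *mut Header {
    unsafe { data.cast::<u8>().sub(HEADER_DATA_OFFSET).cast::<Header>() }
}
```

Both directions share one constant, so they stay consistent with each other and with the real layout on any target.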

It's not like we have a choice. The language designers decided to conflate two concepts that C had kept separate and now we have to rectify that short-sightedness somehow. (There's a lesson in here for any budding language designers...)

It's 2021, just use a flexible array, and if your compiler doesn't support C99 then get a better compiler.

jrtc27 · November 7, 2021, 6:06pm

tcsc:

Comparatively, I'm a lot less worried about breaking FFI code that assumes size_t and usize are the same, even though I've done a fair bit[^3] of FFI code, almost all of which either uses --size_t-is-usize, or is a handwritten binding where I translated size_t to usize. I'm less worried about this because:

My experience is that people tend to be fairly cautious about assuming code is portable to new platforms, especially C/FFI code. Conversely, people tend not to be cautious about rustc updates (we make all those stability promises, after all).

Of the two examples most frequently cited (CHERI and W65), only one of them (CHERI) seems likely to be supported by typical C libraries, and that one is already quite weird.

Even ignoring this, I suspect a lot of C libraries won't work out of the box on these targets, either because they assume size_t/uintptr_t are the same, or just because it's a new target, and often this takes at least some degree of porting (due to all the implementation-defined behavior, especially for a "weirder" target like CHERI).

That said, I'm not unconcerned about the FFI code — in particular, system APIs are concerning (often people forgo a libc dep and just write a one-off extern "C" { ... } ), and this code seems highly likely to break.

Not to mention, I suspect it will push more people to not use --size_t-is-usize, which is unfortunate because it leads to pregenerated bindings which assume size_t is unsigned long because it happens to be on unix (e.g. the reason why the sdl2 crate's pregenerated bindings are partially broken on Windows)... But I guess that's more of a standing bindgen flaw, rather than something this issue should be directly worried about.

Well, I personally think changing isize/usize to not be size_t would be a bad idea; most of the time people really do want a size_t-like thing when they use it, not a uintptr_t. So adding iptr/uptr as distinct from isize/usize would keep --size_t-is-usize working along with anything that assumes size_t is usize, only requiring the places that cast pointers to usize to change.
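To illustrate which code would be affected by that split, here is a sketch with `UPtr` as a hypothetical stand-in for the proposed uptr (no such type exists in std; on today's targets it can simply wrap a usize). The helper names are made up for the example.

```rust
/// Hypothetical stand-in for the proposed `uptr`: an integer able to hold a
/// whole pointer. On conventional targets that is the same width as `usize`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct UPtr(usize);

/// Pointer-to-integer code, such as pointer tagging, is the part that would
/// have to move from `usize` to `uptr`.
pub fn tag_low_bit<T>(p: *mut T) -> UPtr {
    UPtr(p as usize | 1)
}

/// Clears the tag bit and turns the integer back into a pointer.
pub fn untag<T>(bits: UPtr) -> *mut T {
    (bits.0 & !1) as *mut T
}

/// Sizes, lengths, and indices would stay `usize`, matching C's `size_t`.
pub fn payload_bytes<T>(items: &[T]) -> usize {
    std::mem::size_of_val(items)
}
```

Only the first two functions would need touching; the third, and any binding that maps size_t to usize, would be left alone.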

bjorn3 · November 7, 2021, 8:26pm

#[repr(C)] is already not exactly the same as what C does. For example, for unions on MSVC(?) it gives a slightly different union size when one field is an SSE vector or something like that. I can't remember what exactly it was. #[repr(C)] in practice really is "keep field order and insert padding as necessary for correct alignment". Changing it to always match C will break a lot of code, including all POD crates and implementations, as now suddenly the memory layout no longer matches what was expected. Adding padding also immediately makes them unsound, as padding bytes are uninitialized and the POD crates would now provide a "safe" way to access them by transmuting to raw bytes, thus causing UB.
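A small illustration of the padding concern, assuming a typical target where u32 is 4-byte aligned (the struct and constant names are invented for the example):

```rust
use std::mem::size_of;

#[repr(C)]
pub struct Padded {
    pub tag: u8, // offset 0
    // 3 bytes of padding are inserted here so `value` is 4-byte aligned.
    pub value: u32, // offset 4
}

/// Number of bytes in `Padded` that belong to no field.
pub const PADDING_BYTES: usize = size_of::<Padded>() - size_of::<u8>() - size_of::<u32>();
```

A POD-style crate can only soundly expose a type as raw bytes when this number is zero, which is why a layout change that introduces padding into an existing type would be a soundness problem and not just a size change.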

That won't solve the problem for code that needs a specific memory layout for e.g. zero-copy deserialization, like POD implementations and the rkyv crate.

InfernoDeity · November 7, 2021, 8:29pm

Meanwhile, I'm sure more than one person/crate relies on it being actually C layout... Wow, Rust has a great track record with correctly naming stuff used for FFI.

bjorn3 · November 7, 2021, 8:57pm

github.com/rust-lang/rust

repr(C) is unsound on MSVC targets

opened 04:06PM - 11 Feb 21 UTC

mahkoh

O-windows A-ffi P-medium T-lang O-windows-msvc I-unsound C-bug

Consider

```rust
#![allow(dead_code)]

use std::mem;

#[no_mangle]
pub fn sizeof_empty_struct_1() -> usize {
    #[repr(C)]
    struct EmptyS1 {
        f: [i64; 0],
    }

    // Expected: 4
    // Actual: 0
    mem::size_of::<EmptyS1>()
}

#[no_mangle]
pub fn sizeof_empty_struct_2() -> usize {
    #[repr(C, align(8))]
    struct X {
        i: i32,
    }

    #[repr(C)]
    struct EmptyS2 {
        x: [X; 0],
    }

    // Expected: 8
    // Actual: 0
    mem::size_of::<EmptyS2>()
}

#[no_mangle]
pub fn sizeof_enum() -> usize {
    #[repr(C)]
    enum E {
        A = 1111111111111111111,
    }

    // Expected: 4
    // Actual: 8
    mem::size_of::<E>()
}

#[no_mangle]
pub fn sizeof_empty_union_1() -> usize {
    #[repr(C)]
    union EmptyU1 {
        f: [i8; 0],
    }

    // Expected: 1
    // Actual: 0
    mem::size_of::<EmptyU1>()
}

#[no_mangle]
pub fn sizeof_empty_union_2() -> usize {
    #[repr(C)]
    union EmptyU2 {
        f: [i64; 0],
    }

    // Expected: 8
    // Actual: 0
    mem::size_of::<EmptyU2>()
}
```

and the corresponding MSVC output: https://godbolt.org/z/csv4qc

The behavior of MSVC is described here as far as it is known to me: https://github.com/mahkoh/repr-c/blob/a04e931b67eed500aea672587492bd7335ea549d/repc/impl/src/builder/msvc.rs#L215-L236

InfernoDeity · November 7, 2021, 9:13pm

Yes I am aware of this. I'm just commenting on the fact that clearly rust has the best naming scheme for everything that crosses FFI boundaries.

CAD97 · November 7, 2021, 9:13pm

I believe what you're remembering is the following two issues:

github.com/rust-lang/rust

repr(C) is unsound on MSVC targets

opened 04:06PM - 11 Feb 21 UTC

mahkoh

O-windows A-ffi P-medium T-lang O-windows-msvc I-unsound C-bug

(repr(C) with a single field of zero-length array is allowed in MSVC with different behavior than that of rustc and repr(int) C-style enums with overflowing discriminants behave differently in MSVC than rustc)

github.com/rust-lang/rust

repr(C, align) is unsound on enums

opened 03:04PM - 14 Feb 21 UTC

mahkoh

C-bug needs-triage-legacy

repr(align) on enums is implemented as described in the reference:

> The align modifier can also be applied on an enum. When it is, the effect on the enum's alignment is the same as if the enum was wrapped in a newtype struct with the same align modifier.

But this is not how aligned enums work in Clang, GCC, and MSVC:

- GCC ignores alignment requests on enums.
- Clang sets the alignment to the requested alignment even if this decreases the alignment. The size is unaffected.
- MSVC increases the alignment to the requested alignment. The size is unaffected.

(repr(C, align) (plus repr(packed)?) doesn't match #pragma align)

T-lang did say that the intent of #[repr(C)] is in fact to match "platform C" behavior as much as possible, and not to be #[repr(trivial)] (which it is often used as today). #81996 (comment) #81996 (comment) #81996 (comment) #81996 (comment)

Unfortunately, that thread got rather adversarial due to participants with different backgrounds failing to communicate and talking past each other (a big part of which was me).

The last update from T-lang can be summarized as roughly

We would like to fix #[repr(C)] to always match the platform C ABI, given a C-compatible struct definition.
Or if not possible due to existing communicated/de facto guarantees, at least lint when a mismatch is possible on major platforms.
The addition of #[repr(inorder)] or #[repr(trivial)] or similar for the current #[repr(C)] is likely to get support if #[repr(C)] needs to change to match the platform ABI.

josh · November 8, 2021, 8:45am

To the extent it isn't, I think that's a bug we should fix, even if that means edition boundaries or other transition measures.

bjorn3 · November 8, 2021, 9:19am

It makes sense to fix this. I think it should be fixed on an edition boundary though, as there is too much code out there that assumes #[repr(C)] means #[repr(trivial)]. Even the standard library depends on it through object, which is used by backtrace-rs. Object uses it to allow zero-copy deserialization of object file data structures.
