Rolling out (or unrolling) struct field reorderings

So in PR 37429, @camlorn made the excellent change to reorder struct fields. Unfortunately – and not unexpectedly – making this change broke various bits of unsafe code, notably in Servo but perhaps elsewhere, which was inadvertently relying on the layout of structs. (@eddyb has been doing some intense investigation and may have found the source of the problem, but maybe not.)

I wanted to discuss two things:

  • What is the proper way (in general) for us to roll out changes of this kind?
  • What should we do about this particular change?

My feeling is that when we have a change that we believe may cause existing code to stop working – even if that code is relying on things it should not be relying on – we should try to phase it in. At minimum, we could provide an opt-in (i.e., -Z reorder-struct-fields) and do some sort of PSA announcing that this change will be coming to nightly in one month’s time (or whatever), so that people can do some advance testing.

With respect to this change, I think we should roll it back immediately on beta (which is uncontroversial, I think) as well as master. We should then add it back ASAP into master with an opt-in flag. We can then make an announcement and turn it on my by default in a month or so. In other words, we should try to follow this procedure. =)

Ideally, the flag would not be boolean, but would have some sort of filter (e.g., on paths) so that when a problem is found, people can do binary search to figure out which structs are at fault. This seems to me to be an important point in practice but not so much in theory.

We discussed this some on #rustc, but no clear conclusion was reached. @camlorn is preparing a PR for beta at least. I thought maybe it’d be good to have the discussion somewhat more broadly, hence this internals post.

5 Likes

That pr is now up here.

I think we should revert the change too. We have high standards about preventing breakage for our users, and they have high expectations, so we must stay vigilant.

As with any potential breakage, it is important to understand the impact before making the change. We can’t just hope things will turn out ok.

Making it opt-in for now seems fine, but I would like to do more extensive ecosystem testing before turning it on. Here are two things we can do:

  • Run the test suite of every crate in crates.io looking for breakage
  • Do some global analysis of every crate to find users that are not using repr properly for FFI.

Are there other common failure modes we can look for? Can we identify bad transmutes?

Even after this change is rolled out, future changes to the reordering algorithm, or ports to platforms with different alignment constraints and thus different resulting orderings, may cause additional breakage. To mitigate this, I suggest adding a compiler flag to randomize struct field ordering for testing purposes.

10 Likes

There's already a warning -- can you do a crater run with -D improper_ctypes?

+1 to making this opt-in, and to @comex’s suggestion. We should at least gate making this opt-out by having crater run with that setting on. It also be worthwhile adding a lint that warns if a struct without a #[repr(...)] is used in an unsafe cast, or passed to an extern function. If we don’t, this would be a great source of fun for future underhanded rust contests.

7 Likes

So an update on the Servo situation: bisecting the source of the problem was easy, and only took about an hour of builds.

The main tool was sed -E -i 's/\b(pub )?struct /#[repr(C)]\0/' .cargo/**.rs, on a fresh .cargo (Servo has a separated one because it uses a custom rustc/cargo version pair).

After that, I had to fix some macros (where the sed edited the macro input matching) and I had a working build.

With a full copy of the unchanged and modified .cargo directories, all I had to do is copy sub-directories from either the “bad” (original) or “good” (modified) until I was left with tt_os2.rs from the freetype crate.

Apparently @nox already found that case 8 hours earlier (but the initial testing didn’t validate the fix) and by the time @nikomatsakis made this post, Servo’s rustup had already succeeded.

In conclusion, I think bugs from this aren’t hard to track down even without compiler support, but if we add support it should be easier to use for bisecting the cause rather than letting people defer fixing the original problems.

While improper_ctypes is useful, @nox found this through some cast-related search, and I’d like to hear more about that, maybe we can automate it. Taint tracking at the MIR level is not out of the question, we have enough parts to make it work, at least as a lint with semi-decent signal-to-noise ratio, and we’d weed out false positives by hand.

Randomization is a good idea, and we can use it in conjunction with this. The way the “optimal packing” field reordering works is it sorts the fields by alignment, effectively clustering them. We can then shuffle each cluster without wasting memory due to padding and enable that by default, with opt-in full randomization for testing, preferably with that filter thing, say something like this:

# Bisect for structs in some_crate based on the first letter of their name.
RUSTFLAGS=-Zinject-struct-attr='some_crate::.*\b[A-M]\w* repr(C)'
# Give fully randomized field order to any non-repr(C) struct.
RUSTFLAGS=-Zinject-struct-attr='.* repr(random_default)'

(This design for the compiler flag could be helpful in other situations as well, at an extreme it could be combined with an injected plugin that processes the injected attribute to handle more complex logic)

4 Likes

I propose we use optimization fuel for this case: you can read about it at Debugging compilers with optimization fuel. IMO this is much better than path filtering.

Concretely, fuel is decremented in Struct::new if optimize is true and field reordering is done. When there is no fuel, don’t do field reordering any more.

2 Likes

That doesn’t sound like it’s deterministic, although you could maybe do it at HIR lowering time, for injecting the attribute? The problem is that the compiler would have to atomically mutate the “fuel” across all crates, otherwise this can’t be used for bisecting (in Servo’s 300 crates, for example).

Since I’m the one who will be coding this, how hard are such flags? And what happens to std? Do we force std to always be reordered?

I think this would need to have the compiler drag in regex?

We should also make the lints for this kind of thing be on by default, I think, or maybe just add one that only catches the by-value cases. I know there are legitimate reasons to pass pointers to Rust structs, so I suppose we wouldn’t want to warn on that one.

The short version is that IMO we shouldn’t add flags unless they’re more useful than my application of sed. We can totally use regex after the makefiles are removed, but that’s not here yet.

I haven’t heard of more breakage other than Servo, but I’m waiting for @nox to share how the independent fix was found for that, since it wasn’t bisecting from what I’ve heard, and it might be automatable. Something about pointer casts.

I think we should not make this opt-in, but rather opt-out. My main concern is that opt-in won't actually help many people find and fix bugs, since people won't even realize that there is a problem. Even with a well-advertized PSA, I expect the majority of crate-writers, even those writing unsafe code, to not put in the effort to specifically try toggling the flag. We'll have the exact same problem when we want to make the field reordering default.

Opt-out breaks programs early, which is good because it allows people to notice that things are wrong, but also allows for actionable advice - if your program is acting strangely, try disabling field reordering and see if that fixes it. If so, you're missing a #[repr(C)]. Opt-out is both less work (since only the crates that actually have problems will need to test) and more complete (since it will more clearly tell people that there is a problem).

I'd prefer if the default setting of the compiler was deterministic.

This was something we considered initially. It does seem possible, with some caveats. The standard library, in particular, will be built with the default. But I think that if you did RUSTFLAGS="-Z do-not-reorder-structs" or something, you would at least build all other crates in a deterministic way (as eddyb points out, it would be better still if we could have some way for you to select which crates -- or even which modules -- will disable reordering).

(I think the way this opt-out flag would work is to basically force the #[repr] to be #[repr(C)] for all structs that are not otherwise labeled.)

I do agree that in practice many people will not know to try with the flag or will not bother, even if it is opt-in. I guess it seems good (however) to have given people the chance, at minimum.

For "deterministic builds" reasons, or more than that? The random seed could be based on a hash of something, so the order only changes when the code changes.

I think randomization is great for fuzzing, but not something to do all the time. I'd rather the default was just a stable sort by alignment, so it's something you can still reason about.

Mostly I care about this for debugging purposes. Even though memory safety makes heisenbugs a rare thing, they can still happen in unsafe code. I don't want any random shuffling when I'm trying to figure something out, not even "when the code changes" since that could be me adding a print or something, trying to track down the bug.

6 Likes

Random structure reordering can have surprising performance impact, with limited visibility into what’s a timebomb. I think it’s a valuable tool but should not be the default.

3 Likes

This is cool!

OK, I propose we do the following then:

  • Disable the change on master/beta
  • Re-introduce on master, with opt-in using an “optimization fuel” style:
    • -Z reorder-fields=all
    • -Z reorder-fields=N // convert the first N structs to “optimized” repr, which will optimize if possible
  • Make a PSA
  • After one cycle (*), convert master to use “opt out” style
    • same as before, basically, but reordering defaults to all instead of 0

(*) We could potentially do this faster, e.g. on Jan 15 or something.

1 Like

This is reasonable, as long as the optimization fuel option prints the last struct it optimized somewhere. This is a nice solution, as long as it interacts well with Cargo.

I’m not convinced this just works, not without additional metadata. Is anyone else convinced that it just works? I thought we recomputed type layout on already compiled crates, which would necessitate us knowing if the type in the other crate is optimized or not.

I absolutely think that this should be faster than the 12-week cycle. This is an optimization, after all.

I’m also thinking that we probably don’t need a new process. It’s hard for me to come up with other optimizations that would break things like this one does, and something like an RFC seems pretty heavyweight.

If we want to change layout more in future, I think we should add and/or enable a lint which detects all transmutes that are not one of the 3 or so cases mentioned in the book. I want to do an RFC for this, I think, but haven’t gotten around to it yet.