Consider running crater for a one-year-old snapshot of crates.io

Today I was bitten by the Rust 1.80.0 vs. the time crate issue, which I can't easily work around, and it got me thinking about what extra mitigations we could put in place to prevent this sort of thing from happening in the future. Apologies if this has been suggested before!

What if, in addition to normal crater runs, we also gated a release on an additional run against a recent-past snapshot of crates.io? That is, once a year we take a snapshot of all of crates.io, freeze it, use it as the basis for such "old crater" tests for the following year, and then rotate in a fresh snapshot?

One of the failure modes with time was that, although the code was fixed to compile, the fix was recent and hadn't quite percolated through the ecosystem. Without judging whether this level of breakage is OK, the impact clearly would have been much smaller if the ecosystem had had a year or so to upgrade naturally.

And it feels like trying to compile all the code from the past year is a decent proxy for "does this break loads of recent code?".
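For concreteness, here is a minimal sketch of how such a frozen snapshot could be derived from publish timestamps (std-only; the record type is illustrative rather than the real crates.io db-dump schema, and the dates are made up):

```rust
use std::collections::HashMap;

/// One published crate version, roughly as it might appear in a crates.io
/// database dump. The field names are illustrative, not the real schema.
struct PublishedVersion {
    name: String,
    version: String,
    published_at: String, // ISO-8601, e.g. "2023-07-14T09:30:00Z"
}

/// Keep, for every crate, the newest version published on or before `cutoff`.
/// ISO-8601 timestamps compare correctly as plain strings, so no date parsing
/// is needed for this sketch.
fn snapshot_at(
    versions: Vec<PublishedVersion>,
    cutoff: &str,
) -> HashMap<String, PublishedVersion> {
    let mut snapshot: HashMap<String, PublishedVersion> = HashMap::new();
    for v in versions {
        if v.published_at.as_str() > cutoff {
            continue; // published after the snapshot date: not part of "old crater"
        }
        let newer = snapshot
            .get(&v.name)
            .map_or(true, |existing| existing.published_at < v.published_at);
        if newer {
            snapshot.insert(v.name.clone(), v);
        }
    }
    snapshot
}

fn main() {
    // Dates are made up for the example.
    let versions = vec![
        PublishedVersion {
            name: "time".into(),
            version: "0.3.20".into(),
            published_at: "2023-01-26T00:00:00Z".into(),
        },
        PublishedVersion {
            name: "time".into(),
            version: "0.3.35".into(),
            published_at: "2024-05-01T00:00:00Z".into(),
        },
    ];
    // The frozen snapshot only ever sees time 0.3.20, even after 0.3.35 ships.
    let old = snapshot_at(versions, "2023-08-01T00:00:00Z");
    println!("{}", old["time"].version);
}
```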

2 Likes

A problem with that as described is that the snapshot resets once a year. You would ideally like some sort of sliding-window approach: for example, take a snapshot every 6 weeks (to match the release schedule) and use the one closest to one year old when you need it.
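A minimal sketch of that selection step, using made-up names and plain Unix timestamps (nothing here is an actual crater interface):

```rust
/// Given the Unix timestamps (in seconds) of archived crates.io snapshots,
/// pick the one whose age is closest to one year at `now`.
fn closest_to_one_year_old(snapshots: &[u64], now: u64) -> Option<u64> {
    const ONE_YEAR_SECS: u64 = 365 * 24 * 60 * 60;
    snapshots
        .iter()
        .copied()
        .filter(|&taken_at| taken_at <= now)
        .min_by_key(|&taken_at| {
            let age = now - taken_at;
            age.abs_diff(ONE_YEAR_SECS)
        })
}

fn main() {
    // Snapshots taken every 6 weeks (in seconds), relative to t = 0.
    let six_weeks = 6 * 7 * 24 * 60 * 60;
    let snapshots: Vec<u64> = (0..12).map(|i| i * six_weeks).collect();
    let now = 15 * six_weeks;
    // Picks the snapshot whose age is nearest to 365 days.
    println!("{:?}", closest_to_one_year_old(&snapshots, now));
}
```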

Another problem is the resources needed to run this: if it happens on every crater run, you roughly double the compute required. You could dedupe crate versions that haven't changed in the last year, but there would still be a significant amount extra. Perhaps this should only be done for the beta crater runs?
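The dedupe could be as simple as a set difference over (name, version) pairs; a hypothetical sketch:

```rust
use std::collections::HashSet;

/// Crate versions identified by (name, version) pairs; a real implementation
/// would use crater's own identifiers, this is just a sketch.
type CrateVersion = (String, String);

/// Only the crate versions unique to the old snapshot need extra builds;
/// everything else is already covered by the regular crater run.
fn extra_work(
    old_snapshot: &HashSet<CrateVersion>,
    current_run: &HashSet<CrateVersion>,
) -> HashSet<CrateVersion> {
    old_snapshot.difference(current_run).cloned().collect()
}

fn main() {
    let old = HashSet::from([
        ("time".to_string(), "0.3.20".to_string()),
        ("serde".to_string(), "1.0.160".to_string()),
    ]);
    let current = HashSet::from([("serde".to_string(), "1.0.160".to_string())]);
    // Only time 0.3.20 needs an extra build for the "old crater" pass.
    println!("{:?}", extra_work(&old, &current));
}
```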

Third, there were a lot of discussions around that time about solving the underlying inference breakage instead, though most of those seem to have been forgotten; I haven't seen any concrete RFCs come out of them. I'm on my phone, so you will have to search for those discussions (here and on Zulip) yourself.

1 Like

Better yet, just make it an option that can be chosen by whoever initiates the run.

2 Likes

I've personally thought it would be an interesting experiment to run crater over a snapshot of crates.io from shortly after the 1.0 release. You could compare what fraction of crates compile with the latest stable versus rustc 1.0. I bet there are a lot more broken crates than many people realize! If so, it would also make for an interesting case study on what sort of patterns library authors should avoid to reduce the likelihood of future breakage.

2 Likes

I would go further and say we should regularly run crater on every crate version ever published. It "just" costs compute/time.

1 Like

I wouldn't necessarily go all the way to promising that Rust will never break any code ever. That is not how we have operated so far: tonnes of code from the early 1.x days no longer builds, and that seems mostly fine.

So I think it might be OK to invoke "the stability guarantee doesn't cover 100% of type inference" from time to time. What I would like to do, though, is reduce the amount of human judgement that goes into deciding whether a particular minor breaking change is minor enough or not.

In other words, I would be happy with either:

A) We just never ever "break the userspace". There's a hard crater check for that.

B) We sometimes, ahem, bend the user-space a bit here and there, if the practical impact is very small. While there's some element of human judgement involved in these decisions, there's a hard crater check that the practical impact is small, which works "retroactively" (tests some pre-agreed old slice of the ecosystem).

I don't have an informed opinion on whether A or B is better. I'd trust the T-libs+leadership council on this one, as it's a pretty major decision.

What I have concerns about is C, which I think most closely matches the current practice:

We sometimes bend user-space, and this ultimately hinges on human judgement. Crater runs measure the blast radius, it's up to T-libs to decide whether it is small enough, and, crucially, there's no automated check that the impact wasn't significantly misjudged.

I'm not proposing using the results of such a comprehensive crater run for blocking changes, but rather to get a more accurate view of the blast radius. If 10% of Rust-1.0-era crates are broken, that might be fine, or it might be too much, but today we don't know. I want this as a mechanism to enhance what we already do. Crater runs against the latest crate versions and their dependencies are a great check to have, but I fear they are insufficient to determine the "real" blast radius of a change.

I want to be able to say in Rust 1.120, "this project from 2023 is N% likely to work". Even better, have cargo know "given the dependencies of this project, we know for a fact that it won't work". Think of it as a "computed M(aximum)SRV". As a tangential issue, we're similarly blind to the codebases of orgs with private repos, but that is IMO a more tractable issue with a separate solution.
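To make the "computed maximum SRV" idea concrete: if comprehensive crater data gave us, for each crate version, the newest rustc release known to still build it, cargo could take the minimum over a project's dependency tree. A hypothetical sketch (the names and numbers are illustrative, not real data):

```rust
use std::collections::HashMap;

/// Rust versions modeled as (major, minor) pairs for simplicity.
type RustVersion = (u32, u32);

/// Hypothetical knowledge base lookup: for each (crate, version) in the
/// dependency tree, the newest rustc release known to still build it.
/// The project as a whole can only go as far as its most restrictive dependency.
fn computed_maximum_srv<'a>(
    dependencies: &[(&'a str, &'a str)],
    known_max: &HashMap<(&'a str, &'a str), RustVersion>,
) -> Option<RustVersion> {
    dependencies
        .iter()
        .filter_map(|dep| known_max.get(dep).copied())
        .min()
}

fn main() {
    let mut known_max = HashMap::new();
    // Illustrative entries, not real crater results.
    known_max.insert(("time", "0.3.20"), (1, 79)); // known to break on 1.80
    known_max.insert(("serde", "1.0.160"), (1, 99)); // no known breakage (sentinel)
    let deps = [("time", "0.3.20"), ("serde", "1.0.160")];
    // This hypothetical 2023 project is known not to build past Rust 1.79.
    println!("{:?}", computed_maximum_srv(&deps, &known_max));
}
```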

4 Likes

Rust 1.0 crate versions don't really matter though. At all. Either they are still the most recent version (in which case a normal crater run will pick them up), or they are no longer relevant at all.

I really don't follow your argument as to why we should care about that. If the old version is still used by other modern crates, that will show up in crater. If the old version is unused, it doesn't matter.

Looking one year back is enough: that covers the slow-to-update / blocked-on-upgrading scenario for code that crater doesn't see (proprietary or otherwise not on crates.io). It is a proxy measurement, but it seems like a fairly reasonable one.

Looking much further back than a year, what we have a proxy measurement for is people who aren't even making an effort to upgrade their dependencies. And that shouldn't hold Rust back.

If you are considering the LTS angle (CentOS, RHEL, etc.), then my question becomes: why would you use a new compiler with ancient crates? Such a distro will package an old compiler and old crates.

It gives us a lot more data to go on, though; if I have to go back to 2017 to find any crates that are broken by a "breaking change", that's a very different statement in terms of blast radius to "all crates released before 2024 are broken".

Similarly, there's a huge difference in impact between having to upgrade to serde 1.0.200 or later and having to upgrade to git2 0.18.0 or later; serde 1.0.0 is over 7 years old, so your code has to be truly ancient to depend on pre-1.0 serde, while git2 0.18.0 is only just over a year old, and you may well not yet have caught up on the changes between 0.17 and 0.18 (or 0.19, if you're going straight to latest).

And this sort of information helps with decision making; if the breakage is all in crates like serde that have been on the same major version for 5+ years, and we're breaking only versions older than 2 years, eh, no biggie. If we've broken every -sys FFI crate over 15 months old, that's a big deal and needs more thought.

There is a decay in the usefulness of the results. Breaking -sys crates older than 24 or 36 months is less of an issue than breaking ones only 15 months old. The older the results, the less they should matter.

People will likely be able to argue about how much the usefulness decays with age for a very long time. You could have two snapshots (1 year and 2 years) and weigh their relative importance. I'm arguing, however, that going all the way back to 1.0 is fairly useless and a waste of resources (for the reasons I stated in my previous comment).

It is kind of like with retro computing (which is a hobby of mine). At some point it just doesn't make sense to run the latest Linux kernel on a Pentium 3, nor to run an OS from the 90s on a modern computer (outside of an emulator).

I believe the same concept applies to mixing software versions in a single OS. If for some reason I want to build software from 1996 (which has happened), it is okay if I have to put in some legwork to get it compiling on modern toolchains and OSes (instead of using emulation and period-accurate toolchains).

Rust already does an amazing job on backward compatibility, much better than C or (especially) C++. But at some point that shouldn't hold back new features any more. You need a balance.

At work we also test and ensure that our software can be built on more modern systems than the ones we actually target and build releases for. This lets us use newer tooling to find more bugs than we otherwise would (since it is mostly C++, that is very useful: newer clang-tidy, UBSan, ASan, etc.). My experience is that this is usually not difficult to do; with the recent release of Ubuntu 24.04 we are actually targeting two different LTSes right now, which is a big span (two years), plus an embedded Yocto sysroot which is even older.

So people complaining about being unable to upgrade (at least for longer than a few weeks to maybe a month) have some other issue in their code base: technical debt, dependence on unmaintained third-party code, insufficient automated testing, etc. They should get on top of that first. The question then becomes: for how long should the rest of the world humor them? I think a year is plenty.

1 Like

If you have a C or C++ codebase and "aren't even making an effort to upgrade [its] dependencies", the codebase will probably still compile after many years, or (especially for C) even decades. Not always, but often.

That is also usually the case with Rust, but it seems like the time change had pretty massive fallout. It doesn't help that Rust projects are more likely to vendor large numbers of third-party dependencies in the first place.

Sometimes a project hasn't been maintained for a few years but still works fine (if you can compile it). Sometimes you specifically want to test against an old version of a codebase (such as for bisection). Either way, it's important that the code is still able to build, unchanged.

If we only cared about building the latest versions of projects under constant active maintenance, then periodic breaking changes would be acceptable and there would be no need for stability promises in the first place.

6 Likes

Thanks for the clarification! Indeed, it seems there are two separate, but related, ideas here:

  • Run crater for old stuff to get more data
  • Use old data to install some sort of automated safety-net check to double-check human judgement

I am emphasizing the second component here because, it seems to me, we were actually aware of the breakage in the current situation --- the crater run on current crates flagged time, and there was a PR to fix it. So the abstract knowledge that "old time breaks" was there; the new version of time was itself a result of a crater run.

This is in contrast to a hypothetical situation where it just happened that the recently released version of time passed crater, but a somewhat older version didn't.

2 Likes

On the other hand, the more information we gather, the easier it becomes to spot patterns that indicate that we need to reconsider a breaking change. It's easy to dismiss a build failure of all versions of time before 0.3.35 as "just one crate, already fixed"; it's a lot harder to dismiss the issue if 90% of crates over 3 years old are broken, because the extra information is that this pattern used to be very common (even if time is the only crate that still has the problem pattern), and thus is more likely to be present in "hidden" codebases (like proprietary ones).
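A sketch of the kind of aggregation that would surface such a pattern, bucketing hypothetical crater results by the year a crate version was published (the data and field names are made up):

```rust
use std::collections::BTreeMap;

/// One row of hypothetical crater output: the year the crate version was
/// published and whether it failed to build with the candidate compiler.
struct ResultRow {
    publish_year: u32,
    broken: bool,
}

/// Percentage of broken crate versions, grouped by publish year.
fn breakage_by_year(results: &[ResultRow]) -> BTreeMap<u32, f64> {
    let mut totals: BTreeMap<u32, (u32, u32)> = BTreeMap::new(); // (broken, total)
    for row in results {
        let entry = totals.entry(row.publish_year).or_insert((0, 0));
        entry.0 += row.broken as u32;
        entry.1 += 1;
    }
    totals
        .into_iter()
        .map(|(year, (broken, total))| (year, 100.0 * broken as f64 / total as f64))
        .collect()
}

fn main() {
    let results = vec![
        ResultRow { publish_year: 2021, broken: true },
        ResultRow { publish_year: 2021, broken: true },
        ResultRow { publish_year: 2024, broken: false },
    ];
    // A spike in old crates hints at a pattern that used to be common.
    for (year, pct) in breakage_by_year(&results) {
        println!("{year}: {pct:.1}% broken");
    }
}
```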

3 Likes