The case for a new release channel: testing

jonhoo · April 2, 2021, 11:40pm

Currently, Rust has three release channels: stable, beta, and nightly. Stable's role is obvious, but beta and nightly each serve distinct use-cases:

Beta: Give the next stable release a try before it becomes stable to catch stable regressions.
Nightly: 1) Expose nightly features so users can test them out if they need them — this allows gathering evidence for an eventual stabilization. 2) Expose the latest compiler changes, whether those are new stabilizations, compiler improvements, standard library changes, etc. — this allows changes to bake with a small subset of users (those who use nightly) before they land for a wider audience (like on beta).

I believe there is an important use-case that is not addressed by the existing channels, and which would benefit both Rust itself and its users: wide-scale testing of specific unstable features. By way of example, consider a company that wishes to contribute an aardvark feature to Rust. The company's engineers submit an implementation PR (and maybe an RFC) which lands under the aardvark feature flag. At this point, the Rust maintainers (rightly) want evidence for the feature's correctness, as well as experience data for how well this implementation of the feature works in practice. They want to know if the feature is broken, or if the API is janky, restrictive, or too hard to use, so that they can make an evidence-backed decision as to whether the feature should eventually be stabilized in its current form. The company engineers want to help by testing the new feature internally, which would provide both a large and diversified sample of real-world use.

Here, a problem arises: how do the developers test this feature internally? They essentially have three options:

1. Make the nightly compiler available internally.
2. Wait for the next beta release and set RUSTC_BOOTSTRAP=1 in their build environment.
3. Hope that the feature will stabilize without their involvement.

These options are all bad. The last option (3) means little evidence will be provided for stabilization, which means either the stabilization goes ahead with scant supporting data, or the stabilization falls through due to lack of evidence, which may mean Rust misses out on an important feature. The middle option (2) takes advantage of an obscure hack that is really not intended for public use, and whose use the Rust maintainers discourage. In particular, RUSTC_BOOTSTRAP violates the expectation that the stable compiler (and beta, since it'll become stable) is stable and thus does not allow unstable features, and is mostly a historical relic that should arguably be removed. The first option (1) would mean also opting into very recent code changes that haven't undergone as much battle-testing as a stable release, and which may have as-yet undiscovered issues that can only be fixed by upgrading to the next nightly, which may in turn have similar issues that require another upgrade, and so on. While Rust's nightly channel is fairly stable in practice, it's likely not going to fly with the company's internal security team, or with the engineers' desire to not be regularly paged for nightly breakages.

The issue that arises here is that nightly bundles both "new code" and "unstable features" into a single release channel. There is no way to get the latter without the former, yet the latter alone is what the company engineers would need in order to test out the unstable feature they contributed in a reasonable way (likely combined with -Zallow-features=aardvark to ensure only that feature is used).

And so we get to my suggestion for a new release channel. Let's call it testing for the sake of exposition. I propose testing should have the following properties:

It should allow unstable features. Without this, this release channel wouldn't serve much of a purpose.
It should be identical to the current beta. This ensures that the code that lands in the release is fairly stable, while not forcing the company engineers to wait ~9 weeks to be able to test their new feature. It also means that fixes will be backported into this release channel when they are backported into the beta, which is likely to make large users more willing to adopt it. Hooking this off of beta also reduces the cost of maintaining the channel, since no separate release train is needed.
It should require -Zallow-features. This is probably somewhat controversial. I believe this channel should be specifically for testing particular unstable features, not a general replacement for nightly. Users who are actively working on a feature, or who are relying on a large number of features, should continue using nightly — that way, we keep bug reports and such about nightly features current in the majority of cases. It also encourages "good use" of testing for adopters who may not already have been aware of -Zallow-features — it is unlikely that they intended to blanket allow all current unstable features, whatever they are.
The list of available features should be curated and time-limited. That is, there should be an explicit process (probably rfcbot) for making a given unstable feature accessible on testing. This goes further to ensure that testing does not become a replacement for nightly, and that only features that are deemed "ready to test" are actually tested. It also reduces the possibility for weird interactions among unstable features since the set is controlled. The set of available features should be time-limited so that a feature does not live in testing forever, and to put pressure on a decision ultimately being made. Josh proposed eight weeks, which I like. That is a maximum time; a feature can be pulled and stabilized sooner. I think there should be a recommendation to delay removal from testing until a stabilized feature is available on beta, but that's secondary.
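
To make the combined effect of these properties concrete, here is a minimal sketch of the gating rule they imply: a feature is usable on testing only if -Zallow-features was passed, the feature is named in it, and the feature is in the curated set. The `Gate` enum and `check_feature` function are hypothetical names for illustration only and do not correspond to anything in rustc.

```rust
use std::collections::HashSet;

/// Outcome of asking a hypothetical `testing` compiler to enable a feature.
#[derive(Debug, PartialEq)]
enum Gate {
    /// The feature may be used.
    Allowed,
    /// `-Zallow-features` was not passed at all.
    MissingAllowList,
    /// The feature was not named in `-Zallow-features`.
    NotRequested,
    /// The feature is not in the curated testing set.
    NotCurated,
}

/// `curated` is the (hypothetical) set of features approved for testing.
/// `allow_list` is the parsed value of `-Zallow-features`, if it was passed.
fn check_feature(
    feature: &str,
    curated: &HashSet<&str>,
    allow_list: Option<&HashSet<&str>>,
) -> Gate {
    match allow_list {
        None => Gate::MissingAllowList,
        Some(allowed) if !allowed.contains(feature) => Gate::NotRequested,
        Some(_) if !curated.contains(feature) => Gate::NotCurated,
        Some(_) => Gate::Allowed,
    }
}

fn main() {
    // Assumed state: only `aardvark` has been approved for testing.
    let curated = HashSet::from(["aardvark"]);
    // The user passed something like `-Zallow-features=aardvark,zebra`.
    let allow = HashSet::from(["aardvark", "zebra"]);

    println!("{:?}", check_feature("aardvark", &curated, Some(&allow)));
    println!("{:?}", check_feature("zebra", &curated, Some(&allow)));
    println!("{:?}", check_feature("aardvark", &curated, None));
}
```

Checking the allow-list before the curated set is a design choice of this sketch; it means a user who forgot the flag is told about the flag first, which matches the goal of nudging adopters toward -Zallow-features.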

This is not the first time this has been proposed in one form or another. Some related reading:

skade · April 6, 2021, 8:47pm

Hm, at a quick glance, I think it's an interesting thing, but there are multiple things that spring out to me:

The motivation here is for one company to be both implementing and internally validating a feature. That brings a whole organisational conundrum with it that should not be underestimated, the simplest of which is trust in data. Also, if the feature cannot be tested on a public codebase, how is QA and regression testing to be done? Indeed, I know that there are multiple sketches on how to make such information exchange happen and the takeaway was always that this is a ton of work.
Feature development should generally be done in ways that multiple organisations and individuals can collaborate. The whole idea of "testing internally at a company" is an antithesis to that. In the rare case where this is indeed impossible, I would assume the company to be large enough to conduct their own builds for testing.
I do understand the issue though that there's a balance between the development of nightly and something that can be handed out to be found. Beta is already a little in scope there. Currently, I'd rather check what beta can do than opening up new channels.

That being said, I want to highlight that the mechanism you outline mostly works for actual language features. We had phases where e.g. one day of the week saw a special nightly release with certain backend features activated.

In summary, I agree there is something to improve, I'm not entirely convinced of the solution.

jethrogb · April 6, 2021, 8:52pm

Speaking from experience with the exact use case described, I can say that just using nightly has generally worked out fine. Nightly is rarely so broken it's unusable. I'd even go out on a limb and say at least 90% of all nightlies are perfectly fine to use in a production environment.

To test a new feature, you don't need to update nightly daily. Just pick one that has the feature you need and you know works.

jonhoo · April 6, 2021, 9:02pm

That's a good point, though I think it perhaps takes my example use-case slightly too literally. It's not that the company in question would be the only tester of the feature — it's a public, unstable feature after all. It's more that the interest of the company provides a venue in which to bake the change at a large scale in a shorter time-period. That should certainly not be the only basis for stabilization, but I do think it's very valuable input to the stabilization decision that currently doesn't happen.

Your point about regression testing is a good one, but is, I think, orthogonal to this. We want regression testing for Rust across private code bases independently of unstable features. The proposal here is more along the lines of "how can we enable large codebases to be comfortable in testing out unstable features?" than "how can we test stable changes on private code".

I'm not sure I follow this point. I'm not proposing that any of the development happen in private — the feature would be developed in the open through PRs and RFCs just as any other feature. The case I'm highlighting is more that once a feature has been developed (whether by the company that wants to test it or someone else), there isn't currently a "comfortable" way to test that feature at scale, which means that such testing doesn't happen, which means less supporting evidence is available for the stabilization decision. While it's true that an organization could choose to test a feature internally and not release any data on that experience, that's already the status quo. The hope here is to make it "safer" to test an unstable feature so that more organizations may be willing to do so, which should (hopefully) result in more testing feedback.

I actually think beta should specifically not be used for this. Beta should be exactly the next stable, with no "additional" features, because otherwise it's not really a testing ground for the next stable, it's "the next stable plus extra bits". I think @jyn514 has had some thoughts on this in the past too.

Yes, this is specifically for language features. I'm not sure how special nightly releases would address the use-case I'm highlighting since it would still be a nightly release.

While that may often be true, I don't think it's an argument that's likely to fly in the general case. Nightly releases inherently have less baking time than other releases, and thus are more likely to have issues, which is inherently added risk. You're right that you can pick one that happens to (seemingly) be okay, but now you need to have a process for searching for nightly candidates, including when to upgrade from one to the next. Having the test releases follow the normal release train is, I think, much more likely to make such a release seem "safe" from an operational perspective.

josh · April 6, 2021, 9:25pm

I'm not entirely sure if we should do this. But if we do, I think such a channel should be limited to specific features, rather than allowing arbitrary features via -Zallow-features. The purpose of this channel should not be for long-term use by people who want unstable features on stable Rust, even if they list the specific features they're using. The purpose of this channel should be to test such features and provide experience reports. As such, I think it'd be appropriate to have an approval process, along the lines of "we want to enable testing of #![feature(xyz)] on the testing channel, for X amount of time, to gather experience reports; at the end of that time (or sooner if we get enough reports), we'll either propose stabilization of that feature, or disable that feature in the testing channel".

jonhoo · April 6, 2021, 10:00pm

I'm torn here. While I agree that this would be nice, I also worry that maintaining this set of features would be fairly onerous due to its manual nature. Do we require an FCP for adding to the set? How do we automate removal from the set so that it doesn't just grow stale over time?

scottmcm · April 6, 2021, 10:01pm

I wonder about the discovery story for this in general. How does one know what to test?

From that perspective, the "searching for candidates" problem might disappear, as the "call for testing" would suggest the nightly to use. Waiting 3 weeks (on average) for a testing build to come out doesn't seem necessarily better -- especially if the testing quickly finds a problem that needs a new build.

jonhoo · April 6, 2021, 10:08pm

Yeah, there's a tension here between "I want the latest version of this feature" and "I want the stable version of everything else". I suppose I'm thinking of this less as "we're going to do a call for testing for feature X" and more as "I really want this feature, I want a way to test-deploy it internally so I can give feedback". I think the former is tough to pull off in practice because it requires an active driver for each feature regardless of whether there's interest in it, which I think realistically won't happen. Basically, I feel like we want the process here to support "on-demand testing" rather than "solicited testing".

josh · April 6, 2021, 10:08pm

The point of the channel would be to test specific features that need more widespread testing. The primary work there is getting the widespread testing; turning features on or off will be a relatively small part of maintenance and curation there. Every feature enabled there could have a specified end date, and we'd expect the people most interested in stabilizing a specific feature to pay close attention to whether there are enough experience reports by that time.

jonhoo · April 6, 2021, 10:12pm

So the idea would be that if there's a feature I want to help test, I would first need to advocate for it to be added to the "testing set", and then wait for that to happen? I think that makes sense for very large features that we really want a "call for testers" for, but doesn't work quite as well for features that aren't necessarily useful for many users, but rather for few but large users. The example I'm thinking of here is something like cargo's patch-in-config, which is probably not that useful for most individual users of Rust, but is likely very useful for many large organizations using Rust in the context of their internal build systems.

josh · April 6, 2021, 10:19pm

Yes.

I think testing could support either use case; either way, I think we'd get useful experience reports.

jonhoo · April 6, 2021, 11:08pm

Yup, I agree. I like it if we can find a good process for adding/removing features from the set. Some questions:

Getting a feature added to the testing set should require approval, but what kind? An FCP seems like overkill, but single-maintainer may be too little? Maybe two is a good number?
Features should only be approved for a certain amount of time, but how long? We could always cut the period short if enough evidence is gathered, so maybe err on the "long" side. But at the same time, if it's too long, we'll probably end up accumulating a decently large set of features over time, which likely reduces the set's usefulness. As a starting point, how about six weeks to match a release cycle?
Once the time limit for a testing feature expires, how is the feature removed from the set? Ideally this process is automated, or at least a reminder to remove is automatically written to the tracking issue by some kind of bot. Is this something that can easily be added to the existing bot infrastructure?
A "nice to have" bordering on "important" is to plan for the continuity of a feature if the decision is "stabilize". That is, if we decide to stabilize a feature, we probably want to make sure that there is always some non-nightly that allows the use of the feature. So, the feature should not be removed from the testing set until the next beta is released. Otherwise, users of the feature may have to revert changes they've deployed in the gap between testing and beta, which seems unfortunate. Thoughts?

josh · April 6, 2021, 11:59pm

(Disclaimer: all of this is focusing on the question of "how" and ignoring the question of "if"; this shouldn't be taken as agreement yet.)

I'd suggest using the normal rfcbot process, just without paying attention to the usual 10-day delay after consensus. For instance, this is a tool we can apply during a regular weekly meeting, in which case if there are enough people present, we'd FCP and check the boxes for everyone present, and it'd immediately go into FCP, at which point we proceed.
It'll depend on the feature and how much usage we expect it to get. Generally, I'd guess either 6 weeks or 12 weeks would work, for any feature that actually requires changing Rust code to take advantage of it. For features that just involve passing an option and seeing the result, as little as 2 weeks may suffice. That said, given your point about continuity, we may want an amount of time that'll be slightly longer than the time until the next release, so something like 8 weeks might be preferable.
One approach would be for the feature-enablement configuration to be in a file together with dates, and CI could compile in only those features whose dates haven't passed. That way, if we forget to prune an entry, it's still disabled on time. That also makes it easy for the scripts that generate meeting agendas to add near-future dates to the agenda.
I'd hesitate to push that too hard. I understand the desire for continuity, but anyone using the testing channel is helping with an experiment, and needs to understand that they may need to roll back any changes they make. For that matter, we may end up changing or evolving a feature on the basis of experience reports.

We should also have a loose upper bound on the number of features we're testing concurrently.
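
As a rough illustration of the date-based configuration idea above, the following sketch keeps each feature next to an expiry date and filters out entries whose date has passed, so a forgotten entry still gets disabled on time. The file format (`feature_name YYYY-MM-DD` per line), the `Entry` type, and the function names are all assumptions made for this example; nothing like this exists in the Rust build system as far as this thread establishes.

```rust
/// One entry of a hypothetical feature-enablement file: a feature name and
/// the last day (ISO 8601, YYYY-MM-DD) on which it is enabled on `testing`.
#[derive(Debug, PartialEq)]
struct Entry {
    feature: String,
    expires: String,
}

/// Parse lines of the form `feature_name YYYY-MM-DD`.
/// Blank lines and `#` comments are skipped; lines missing a field are
/// silently dropped in this sketch (a real tool would likely report them).
fn parse(config: &str) -> Vec<Entry> {
    config
        .lines()
        .map(str::trim)
        .filter(|l| !l.is_empty() && !l.starts_with('#'))
        .filter_map(|l| {
            let mut parts = l.split_whitespace();
            Some(Entry {
                feature: parts.next()?.to_string(),
                expires: parts.next()?.to_string(),
            })
        })
        .collect()
}

/// Features whose expiry date has not passed as of `today`.
/// ISO 8601 dates sort lexicographically, so string comparison suffices.
fn active<'a>(entries: &'a [Entry], today: &str) -> Vec<&'a str> {
    entries
        .iter()
        .filter(|e| e.expires.as_str() >= today)
        .map(|e| e.feature.as_str())
        .collect()
}

fn main() {
    let config = "\
# feature      expires
aardvark       2021-06-01
zebra          2021-04-01
";
    let entries = parse(config);
    // Only features that are still within their testing window are listed.
    for feature in active(&entries, "2021-04-07") {
        println!("{}", feature);
    }
}
```

The same parsed list could feed the meeting-agenda scripts mentioned above, for example by selecting entries whose date falls within the next couple of weeks.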

jonhoo · April 7, 2021, 12:16am

Ack.

That's a good point. In that case I think FCP without the 10d delay is a good way to go so it can re-use existing mechanisms.
I suppose the natural follow-up here is whether there should be different time periods for different features, and how those periods are decided. It feels easier to pick one and say that all changes have the same maximum testing time, and then acknowledge that features can be stabilized sooner than the maximum testing time if sufficient evidence is present. I like your proposal of 8w.
That's a super interesting proposal. I suppose this would then need to be some kind of build script, but I don't have enough insight into the current build process for rustc/cargo to say how easy/hard that would be to implement.
Yeah, this one is also a balancing act. The reason I propose it is that I think it'd reduce the friction if a feature is stabilized. You're completely right that anyone testing it would need to be prepared to roll back or adapt to changes, but if the decision is to stabilize, I think it'd be worthwhile to say that "the testing window will be extended to the next beta release date".

Returning to the question of whether something like this should happen, I think the primary question is "why a new channel?", which then deconstructs into:

Why not unstable features on stable/beta? This is already possible with RUSTC_BOOTSTRAP, but it's probably not something we want to encourage. Users should be able to rely on the stability guarantees of stable, which immediately goes out the window if there's an escape hatch to get unstable features there anyway. Arguably, RUSTC_BOOTSTRAP should be removed and bootstrapping should happen with testing. As for "why not beta", the argument is that beta should be exactly equal to the stable that is to come. If that's not the case, we end up conflating the role of beta and undermining users' willingness to run beta since it means also opting out of the stability guarantees.
Why not use nightly? As I've tried to articulate earlier in this thread, nightly means opting into too many other things at the same time, such as changes that landed yesterday and thus haven't baked for very long. There is also the concern that as the rate of change increases for Rust with increasing adoption, it'll become increasingly hard to find a nightly that has no known problems. In some sense, that's what releases are for — finding a slice in time where everything works correctly, and where even if they're discovered not to, to commit to backport changes to restore correctness. It feels unfortunate to place that burden on testers.

scottmcm · April 7, 2021, 7:20am

I think I'm seeing this one differently. From things like const generics, we've seen features that end up needing extra warnings to discourage people from trying them out when they're not ready yet. So we seem to get on-demand testing with existing nightly mechanisms.

Whereas with the -- very good -- success of getting people off using nightly regularly, I feel like solicited testing is more of the problem. We're getting more and more things that are more "nice, but not essential" things that mean people are unlikely to move to nightly to try it out. It makes me wish for "please try these out" calls to get more experience reports.

(But there's probably a bunch of stuff in the middle too.)

pietroalbini · April 7, 2021, 9:12am

I'm wondering about the choice to base testing off of beta. You mentioned that one of the disadvantages of nightly is its lower QA, and that might not be acceptable inside companies.

I'd argue that beta is not suitable either. The beta channel has a reliability story that drastically changes over the course of six weeks: at the start of a cycle it's as reliable as just picking the nightly of the day of the beta cutoff, then as the weeks go by it becomes more and more reliable as backports are landed. Near the end of the release cycle it's practically as reliable as stable, and finally when a new cycle starts it reverts back to be as reliable as nightly.

I'm also wondering whether you'd want fixes to tested features to be backported to testing (and thus beta). I would see no problem in backporting the changes to testing if it wasn't tied to beta, but landing backports for unstable features on beta makes me a bit hesitant.

InfernoDeity · April 7, 2021, 1:04pm

The issue is that the rustc built during a bootstrapping stage is responsible for compiling its own standard library (that necessarily uses many unstable features). Removing RUSTC_BOOTSTRAP would have to be reconciled with the fact that, currently, when building stable rustc, the final standard library is built by stable. I don't know enough about the rustc bootstrap process to comment on challenges with building the standard library using a separate testing channel rustc built before building the stable compiler, but I can assume that similar issues would have presented themselves if the same was tried using the existing nightly channel.

RUSTC_BOOTSTRAP should cease to exist wrt. users and user crates imo, but I don't think it's feasible to eliminate it entirely (except by providing a new mechanism for stdlib use).

jonhoo · April 7, 2021, 7:13pm

Oh, yeah, that's a very good point. I suspect there's a difference between "features wanted by many individual users" and "features wanted by few but large users", and the latter is maybe what I'm getting at. Take, for example, the "patch section in .cargo/config" feature — that's unlikely to be useful or even interesting to most individual users, but it's likely to be very useful and important to "large" users that, say, integrate with other build systems. That's not to say the feature is more or less important overall, but more to indicate that different features have different target user populations, and those user populations in turn have different requirements for how they use Rust.

To continue the example, individual Rust users are probably more willing (arguably too willing?) to jump on nightly to get features they really want (like const generics), which means that those features are likely to get a lot of testing right off the bat. Features that aren't interesting to individual users but are still important for many users indirectly through larger orgs, on the other hand, don't get that surge of testing, because such users generally can't jump on nightly, and thus can't test. The voices of the "real" user base behind such features are also hard to get at, since they may not know that the pain they're feeling in their developer experience with the org build tools stems from a particular upstream feature.

All that said, I agree with you (all) that having a specific list of features that are available on a hypothetical testing channel is a good idea. Both because it discourages "just use testing" and because it means we don't expose things that aren't quite ready for wide-scale testing just yet.

You're right, and I don't have a good answer for this. There's a balancing act here between wanting to not have to wait months after a feature lands on nightly to test it, and wanting the stability that comes with longer baking times. I chose beta for the proposal mainly because it doesn't impose a 9 week (6w + 1/2 6w in expectation) wait time, but also at least gets backports (more on that below). I think if testing forked off of stable instead, that wouldn't necessarily diminish testing, but it would sort of artificially delay how long it takes to gather evidence, and some testers might "lose steam" because they have to sit idle for two months.
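
The "6w + 1/2 6w" arithmetic can be spelled out with a small sketch. It assumes a change lands on nightly at a uniformly random point in the six-week cycle, so the wait for the next beta branch averages half a cycle, and each further promotion adds one full cycle. The function name and its parameter are made up for illustration.

```rust
/// Length of one Rust release cycle, in weeks.
const CYCLE_WEEKS: f64 = 6.0;

/// Expected wait, in weeks, from a change landing on nightly until it reaches
/// a channel that is `promotions_after_beta` release trains past the next
/// beta branch (0 = beta, 1 = stable).
///
/// Assumption: the change lands at a uniformly random point in the cycle,
/// so the wait until the next branch point averages half a cycle.
fn expected_wait_weeks(promotions_after_beta: u32) -> f64 {
    CYCLE_WEEKS / 2.0 + CYCLE_WEEKS * f64::from(promotions_after_beta)
}

fn main() {
    println!("beta-based testing: {} weeks", expected_wait_weeks(0));
    println!("stable-based testing: {} weeks", expected_wait_weeks(1));
}
```

Under these assumptions a beta-based testing channel has an expected wait of 3 weeks, against 9 weeks for one forked off of stable.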

There is the option to say that testing gets cut from beta once it's at least X weeks old (and on any subsequent beta releases), which might help mitigate some of the concern, though that is also something users could opt into themselves by choosing when they update their testing (but not sure if that's better).

I think there should be no guarantee that fixes to unstable features get backported to beta — it's too onerous a requirement I think. Maybe there could be an option for people to contribute such backports, but I don't think it should be the general expectation. Of course, this then raises the issue that users may have to wait a while to test feature fixes on testing, but I think this is a fundamental choice between backporting nightly fixes to beta (and thus testing) and having testing come with some amount of stability guarantees. Either testing draws eagerly from nightly or fixes are backported — there isn't really an in-between (I don't think). And I think "draw from nightly" would be too much of a repellent to make testing useful in the first place.

I don't have a good proposal here either. Specifically, I don't know how we can have a way to opt out of the stability guarantees for a stable compiler but just for building stdlib/the compiler. I wish I did. We could require that the Rust build process build two versions of the compiler that are identical except for a flag for whether unstable features should be allowed, and then use the version with the flag set to compile the stdlib and then ship it with the other version (that does not have the flag set). But it'd be painful and I'm sure I'm missing something.

I think that discussion is somewhat orthogonal to the proposal though. Ultimately, we probably do not want to encourage the use of RUSTC_BOOTSTRAP in any form outside of building stdlib, which then lands us back into this proposal.

tcsc · April 8, 2021, 9:32am

So, I think probably more people use nightly as their daily driver than you might expect. This is true for large corps. too (as of the unofficial "enterprise in Rust" meeting at RustConf 2019, it was anyway), although we probably should strive to lower the amount of this (both for users and companies), and perhaps this is a good way to do so.

I also will note that I agree that this only makes sense as a way to gather feedback on a small set of "greenlit" features. If it can use any feature, it's basically nightly without the name nightly.

That said, I think it would be useful as a tool that project groups needing wider ecosystem feedback could use.

One case that's immediately obvious to me, as it's a group I participate in, is the portable SIMD work. Once things are further along, getting it on a testing channel and asking crates with SIMD-accelerated functionality to try porting their SIMD routines to it would be very valuable to uncover any API holes, pain points, or confusion that could help guide some of the doc.
Another interesting example might be "custom test frameworks", chosen mainly because it occupies a strange place where it would almost certainly be widely used if it were stable, but is almost never used because it's a nightly feature.

(Note: As it stands IMO this feature needs a project group to help before it can stabilize — there's many issues to untangle, not all of them technical, so testing doesn't make sense for it as it stands, however after all that, putting it on a testing channel for broader feedback seems like it might help kick the tires a great deal)

Admittedly, for these cases, none of this does that much that we don't accomplish by writing a blog post asking for usage feedback of what's on nightly...

(Aside from, of course, allowing corporations which have outright banned nightly usage to sidestep this, which seems, well, I guess it's fine, since it's a substantially limited nightly-like — maybe I'm being too cynical here though — especially because I mostly like the idea)

> From things like const generics, we've seen features that end up needing extra warnings to discourage people from trying them out when they're not ready yet. So we seem to get on-demand testing with existing nightly mechanisms.

I think this is going to be true for some features, but not for others. One thing about const generics is that it existed in the same shape as unstable for a long time, and shipped using a subset of that shape.

It also is one of the first things you hit as a limitation for rust generics if you are expecting C++ templates. The second there is probably specialization, and my gut is that specialization/min_specialization will see the same thing as you describe.

Thing is, I don't really know that this applies more widely. That is, for some features they'll benefit from the "spotlight" of being on testing, for some they probably won't since people will use nightly just to test them.

I mentioned test frameworks as an example of a major feature that I think people would use if stable, but won't rush to nightly just to use. I strongly suspect there are others, but... am unsure which they might be; most of the things in that "would use, but won't use nightly for" category I can think of are small nice-to-haves.

Probably any new types of proc macro (have we stabilized all of the places where you can use them?) would fit here as well, but I don't pay attention closely to the macro ecosystem.

jyn514 · April 8, 2021, 9:30pm

> RUSTC_BOOTSTRAP should cease to exist wrt. users and user crates imo, but I don't think it's feasible to eliminate it entirely (except by providing a new mechanism for stdlib use).

This is a can of worms I don't think we should open here. Even the absolute smallest version of this still took two years and lots of heated argument to push through: rust-lang/cargo issue #7088, "Forbid setting `RUSTC_BOOTSTRAP` from a build script".

That said I agree it isn't feasible to get rid of RUSTC_BOOTSTRAP for the compiler itself, since otherwise there's no way to build the stable compiler with stable.

> I also will note that I agree that this only makes sense as a way to gather feedback on a small set of "greenlit" features. If it can use any feature, it's basically nightly without the name nightly.

Big +1: about half the rustdoc flags only exist for docs.rs and I would really hate for people to start using them and getting mad if we removed them.

> I mentioned test frameworks as an example of a major feature that I think people would use if stable, but won't rush to nightly just to use. I strongly suspect there are others, but... am unsure which they might be; most of the things in that "would use, but won't use nightly for" category I can think of are small nice-to-haves.

Most library features I expect - things like impl Iterator for [T; N] and doc(cfg).

Related topics

Topic	Replies	Views	Activity
Getting more testing of unstable features	40	4394	March 25, 2019
Allow unstable features on beta?	36	8636	March 25, 2019
New release train request: Unstable	2	1240	March 25, 2019
Idea: Semi-stabilization language design	37	4537	July 7, 2019
Blog Post: Stability without stressing the !@#! out policy	14	1673	March 14, 2025