Distributing x86-64 tier 1 with host tools v1–3 each

As a soft roll-out for distributing newer-than-v1 x86-64 tier 1 tools, the list could be extended by providing v1, v2, and v3 of each of the existing targets.

How?

First of all, the target used for building the compiler wouldn't be tied to the target the compiler then compiles for by default. Bumping the default compilation target is a separate discussion. The suggestion below assumes that, regardless of the target level the toolchain was built with, it would keep the agreed-on default target.

Currently, the following x86-64 tier 1 with host tools targets are available:

  • x86_64-pc-windows-gnu
  • x86_64-pc-windows-msvc
  • x86_64-unknown-linux-gnu

If implemented, the current targets would effectively be multiplied by three:

  • x86_64-pc-windows-gnu (v1, v2, v3)
  • x86_64-pc-windows-msvc (v1, v2, v3)
  • x86_64-unknown-linux-gnu (v1, v2, v3)

Rustup (itself deployed as v1 only) would have a dynamic run-time check for the capabilities of the computer it is run on. Specifically, when downloaded and run, it would check whether v3 or v2 is supported, defaulting to v1 if not. Then, it would install the highest-level supported toolchain – manually overridable by the user. If we'd like to be even more careful, this could wait for n months/releases, while the toolchains would already be available and opt-in.
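As a minimal sketch of what that check could look like (the feature lists follow the x86-64-v2/v3 definitions from the psABI; the `MicroarchLevel` type and `detect_level` function are illustrative names, not actual rustup code):

```rust
/// Illustrative only – not actual rustup code.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum MicroarchLevel {
    V1,
    V2,
    V3,
}

#[cfg(target_arch = "x86_64")]
fn detect_level() -> MicroarchLevel {
    use std::arch::is_x86_feature_detected;

    // x86-64-v2: CMPXCHG16B, LAHF/SAHF, POPCNT, SSE3, SSSE3, SSE4.1, SSE4.2.
    // (LAHF/SAHF has no `is_x86_feature_detected!` token; it is assumed
    // present when the rest of the v2 set is.)
    let v2 = is_x86_feature_detected!("cmpxchg16b")
        && is_x86_feature_detected!("popcnt")
        && is_x86_feature_detected!("sse3")
        && is_x86_feature_detected!("ssse3")
        && is_x86_feature_detected!("sse4.1")
        && is_x86_feature_detected!("sse4.2");

    // x86-64-v3 adds AVX, AVX2, BMI1, BMI2, F16C, FMA, LZCNT, MOVBE, XSAVE.
    let v3 = v2
        && is_x86_feature_detected!("avx")
        && is_x86_feature_detected!("avx2")
        && is_x86_feature_detected!("bmi1")
        && is_x86_feature_detected!("bmi2")
        && is_x86_feature_detected!("f16c")
        && is_x86_feature_detected!("fma")
        && is_x86_feature_detected!("lzcnt")
        && is_x86_feature_detected!("movbe")
        && is_x86_feature_detected!("xsave");

    if v3 {
        MicroarchLevel::V3
    } else if v2 {
        MicroarchLevel::V2
    } else {
        MicroarchLevel::V1
    }
}
```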

Why?

A performance improvement of ~1–3 % from using a v3 toolchain would become immediately available for many developers. Moreover, beginning the distribution of the newer toolchains would help iron out any possible bugs (testing in production, yay!) and maybe shift the implementation of new compiler optimizations towards changes that happen to favor the newer architectures.

We could also gather statistics on users' microarchitecture level preferences from the actually downloaded toolchains, which brings us to the next topic.

Anyway, it seems to me that taking the step towards using the newer microarchitecture levels by default is bound to happen at some point. Is there a good reason to keep waiting?

What about users?

As for the demographic of gaming enthusiasts, according to the Steam Hardware Survey, almost everyone seems to have at least a v2-capable computer:

  • CMPXCHG16B: 99.94 %
  • LAHF / SAHF: 99.94 %
  • SSE4.1: 99.82 %
  • SSE4.2: 99.78 %
  • SSSE3: 99.86 %

Currently, AVX2 (required for v3) prevalence is at 95.03 %. The demographic of developers is most likely a bit different, but I don't have any good data at hand. Anyway, I'd wager that no significant number of developers is doing Rust development on a 20-year-old computer – mostly because the compile times would probably not be pleasant.

If the subsequent shift in use led to different bugs being found and fixed, and to different kinds of optimizations, v1 users could be left behind. However, I guess that's going to happen at some point anyway.

Downsides

  • Effort.
  • Maybe some cloud/bandwidth costs due to hosting more toolchains.

There's a building-and-testing downside, too: we'd be tripling the number of almost-3-hour full release builds in CI. And if we want to call it tier 1, we actually have to test the artifacts, not just build them.

Also, which one would we use as the rustc-perf target? How would we decide whether to take a change that makes -v3 faster but -v2 slower?


Musing: I feel like there are other things that might make sense to roll into this kind of a split, like the "do you want the different allocator that's 30% faster but uses 15% more memory?"

Maybe there's space for a "using this on a big fast fairly-new machine" host tools set and a "look, I need it to work even though things are pretty constrained here" set.

5 Likes

It would perhaps be interesting to see how large the difference is, and if there are particular functions that are extra hot. If it is just a few, perhaps runtime dispatch on target features would be an option.
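For a handful of hot functions, that can already be hand-rolled on stable Rust; a minimal sketch (`sum_f32` and its variants are made-up examples, not rustc internals):

```rust
#[cfg(target_arch = "x86_64")]
pub fn sum_f32(xs: &[f32]) -> f32 {
    // Compiled with AVX2 enabled, so LLVM is free to vectorize the body
    // accordingly; `unsafe` because calling it on a CPU without AVX2
    // would be undefined behaviour.
    #[target_feature(enable = "avx2")]
    unsafe fn sum_avx2(xs: &[f32]) -> f32 {
        xs.iter().sum()
    }

    fn sum_generic(xs: &[f32]) -> f32 {
        xs.iter().sum()
    }

    if std::arch::is_x86_feature_detected!("avx2") {
        // SAFETY: AVX2 support was just verified at runtime.
        unsafe { sum_avx2(xs) }
    } else {
        sum_generic(xs)
    }
}
```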

Runtime dispatch could also be improved. GCC supports function multiversioning that is resolved by the dynamic linker at load time, and I believe LLVM supports this too. It isn't clear to me why Rust doesn't support this, as it would avoid the overhead of checking the atomic holding the detected CPU type.
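To make that overhead concrete, here is roughly what the non-ifunc pattern looks like: the chosen implementation is cached behind an atomic (a OnceLock in this sketch) that every call still has to load. All names are illustrative:

```rust
#[cfg(target_arch = "x86_64")]
mod cached_dispatch {
    use std::sync::OnceLock;

    // Stand-in implementations; imagine the first being built with AVX2.
    fn frobnicate_avx2(x: u64) -> u64 { x.rotate_left(7) }
    fn frobnicate_generic(x: u64) -> u64 { x.rotate_left(7) }

    static IMPL: OnceLock<fn(u64) -> u64> = OnceLock::new();

    pub fn frobnicate(x: u64) -> u64 {
        // The feature check runs once, but every call still pays the
        // OnceLock's atomic load; an ifunc resolved by ld.so binds the
        // symbol at load time and skips this entirely.
        let f = *IMPL.get_or_init(|| {
            if std::arch::is_x86_feature_detected!("avx2") {
                frobnicate_avx2
            } else {
                frobnicate_generic
            }
        });
        f(x)
    }
}
```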

Using a solution similar to the static-keys crate (runtime patching of jumps at start time, inspired by how the Linux kernel does this) should also work, though using ld.so as mentioned above seems like the ideal approach.

4 Likes

The Docker image count for these platforms would also be tripled.

Tbh I don't think shipping v3 is particularly worth the tradeoffs. v2 is widely supported, and at this point the chips that lack support are very old. Whereas even many modern chips do not support v3.

Such niche usage could be supported by someone building Rust out of tree, but I'm skeptical that it's worth the effort for the Rust project itself.

That uses ifunc, right? Ifunc is only supported by glibc.

Yes it uses IFUNCs. But it would still be nice to offer this on glibc, which most distros use. And maybe if there are more users it puts pressure on musl to add support too.

Not that likely I think:

If anything, exclusion of IFUNC is more definite now than in 2014. They keep showing up as vectors for things to break or even for disguising backdoors, and none of the prior reasons for excluding it are really resolvable, nor does it have any performance value over doing things portably with function pointers.

And that still leaves Windows, macOS and all the BSDs.

3 Likes

That is unfortunate.

Yes, but there are several optimisations Rust only does on certain platforms, such as PGO, BOLT, etc. Also, Windows already has a newer baseline, as the platform owner forces a move forward. The same goes for macOS. So those platforms are not relevant here. And the BSDs are a small minority of users. There are many obscure platforms other than those as well. What about Redox or Haiku?

PGO and BOLT are only done on some targets for rustc itself; you are free to use them on other platforms for your own executables. Multiversioning, however, would be a language addition, and so far we don't have any language features that are not supported on all targets. We didn't stabilize 128-bit integers until LLVM supported them fine on all targets, and f16/f128 are not yet stable either, due to LLVM not supporting them on many architectures. For tail calls (which are unstable too), people have been arguing about how they aren't supported yet by LLVM on many architectures. We don't expose on stable the #[thread_local] attribute that thread_local! {} internally uses, as some targets need libc-assisted emulation. Even for targets without hardware float support, we still expose floats and just use software emulation. So at best we could use ifunc internally as an optimization for multiversioning on supported targets, but that would be an unclear performance pitfall on all non-glibc targets whenever you try to version based on a feature that is not mandatory on the target (e.g. AVX10 on Windows).

1 Like

I'm not sure aiming for the least common denominator makes sense. And there are plenty of features in std only supported on some targets: Windows doesn't have file descriptors, for example. And querying for supported target features is only in std, not in core.

But what would make more sense would be exposing generic multiversioning for compile flags, which can then be implemented in a target-specific way, just like for thread locals. It could use IFUNCs when available, self-modifying code, .init_array, etc. There is probably something on Windows too that could be used as a basis for this.
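For the .init_array flavour, a rough sketch of what the generated code could do on Linux/ELF (all names here are made up; this is not an existing rustc or std mechanism):

```rust
// Linux/ELF and x86_64 assumed throughout.
#[cfg(all(target_os = "linux", target_arch = "x86_64"))]
mod init_array_dispatch {
    // Stand-in variants; imagine `copy_v3` compiled with the v3 feature set.
    fn copy_generic(dst: &mut [u8], src: &[u8]) {
        dst.copy_from_slice(src);
    }
    fn copy_v3(dst: &mut [u8], src: &[u8]) {
        dst.copy_from_slice(src);
    }

    static mut COPY_IMPL: fn(&mut [u8], &[u8]) = copy_generic;

    extern "C" fn resolve() {
        // SAFETY: .init_array runs before main() and before any user
        // threads exist, so this unsynchronized write cannot race.
        unsafe {
            COPY_IMPL = if std::arch::is_x86_feature_detected!("avx2") {
                copy_v3
            } else {
                copy_generic
            };
        }
    }

    // The loader calls every function pointer placed in .init_array at
    // startup; `#[used]` keeps the static from being optimized away.
    #[used]
    #[link_section = ".init_array"]
    static RESOLVE: extern "C" fn() = resolve;

    pub fn fast_copy(dst: &mut [u8], src: &[u8]) {
        // SAFETY: COPY_IMPL is only ever written before main().
        let f = unsafe { COPY_IMPL };
        f(dst, src);
    }
}
```

Calls then go through a plain function pointer with no per-call feature check.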

That is a library feature, not a language feature. And in particular one only available in libstd. libstd doesn't strictly target the common denominator; it merely tries to abstract over platform differences where possible, while still exposing a fair amount of platform-specific APIs.

That reminds me. Chromium has a global policy against any .init_array usage. They probably don't want ifunc either for the same reasons.

That doesn't seem to be a Rust issue but a Chromium issue? But sure, you could have a -Cmultiversioning=slow

That's true. Would it be out of the question to obtain more parallel CI capacity? Then again, if money (whose? The Foundation's? An as-of-yet-unknown sponsor's?) comes into play, it shouldn't be spent on a whim...

I suppose the rustc-perf target could be decided separately. It would be simpler not to have multiple targets for it, but I think the needle could be moved towards the newer architectures to have a perf "loss function" amenable to the newer ones. (I.e. a process where changes improving performance or avoiding regressions would benefit newer hardware.)

Personally, I'd just pick v2 for now. Performance improvements benefiting v3 while slowing down v2 are maybe something that'll just have to wait until we decide to move on to v3.

Exploring this space further sounds like a great idea to me. Even if the special-but-useful-enough target flavours were opt-in, many Rust users could be made happy, x86-64 being as popular as it is. And the distinction could definitely be something other than v1/v2/v3.

1 Like

An alternative soft roll-out could be to move the tier 1 with host tools targets to v2, but demote v1 to tier 2 with host tools and give v3 the same treatment for the time being. This would somewhat decrease the CI/test burden at least.

1 Like

I don't think the benefits of 1–3% faster builds outweigh the downsides of having to support the more complex infrastructure and builds, at least in my opinion. I don't really see any benefit beyond the perf; I don't think there is a lack of testing, or a benefit for compiler optimizations geared towards newer architectures, that would matter that much here. I could certainly see a benefit in having standard libraries for different baselines available, but I see that as something best solved by build-std.
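For the build-std route, something along these lines already works on nightly today (the flag spellings are from current nightly cargo and may change):

```sh
# Rebuild std from source with the x86-64-v3 baseline (nightly only).
RUSTFLAGS="-C target-cpu=x86-64-v3" \
    cargo +nightly build --release \
    -Z build-std=std,panic_abort \
    --target x86_64-unknown-linux-gnu
```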

5 Likes

I think I'm inclined to concede on this point, taken on its own.

I suppose the point I'm most interested in is the hypothesis that, given years of optimization and perf results, the target platform where those happen would benefit the most, sometimes yielding wins on that platform that would have been regressions on the older target (or any other target, for that matter). And I can't come up with any reasonable way to experiment on this.


Interestingly, Ubuntu just recently started distributing x86-64 microarchitecture variants.

1 Like

Phoronix did a benchmark of this: https://www.phoronix.com/news/Ubuntu-Server-25.10-amd64v3

The results don't seem that impressive. Given the extra CI cost of building everything twice, I would probably consider a more targeted approach if I were in charge: glibc and other commonly used libraries, media codecs, and language runtimes (Python, Perl, etc.) as a first guess, but then experiment and see where it gives the most bang for the buck.

1 Like