Policies around default CPU architecture


#1

I’ve recently been writing a bunch of performance-sensitive float code and was surprised to see that rustc will not emit SSE4.1 instructions by default on macOS. As a result, very common operations like floor, ceil, and round become library calls, which obviously hurts performance. Even though the minimum supported microarchitecture can be changed with RUSTFLAGS, this is a library, and I’m concerned that downstream users will not get optimal performance by default.
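
As a minimal illustration (CPU names here are LLVM’s target-cpu values; penryn is the first Intel level with SSE4.1):

// With the default baseline (core2 on x86_64-apple-darwin), this lowers to
// a call into the system math library; built with e.g.
// RUSTFLAGS="-Ctarget-cpu=penryn" it becomes a single roundss instruction.
pub fn floor32(x: f32) -> f32 {
    x.floor()
}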

The relevant code seems to want to emit code for a Core 2 by default. This means that the compiler ensures that the code will work on Conroe or Merom, both of which are 65 nm CPUs manufactured in 2006. This CPU spec basically dates back to the start of the Rust project.

I’m aware that the decision has been made not to use cpu=native, even with cargo install, and I’m OK with that. But do we have a policy for bumping the default minimum supported CPU over time? Never moving the default at all seems overly conservative to me.

If there isn’t any such policy yet, then as a straw man proposal, I suggest setting the default to the newest CPU baseline that still covers the overwhelming majority of the Rust userbase: say, 99%. Thoughts?


#2

I’d be wary about such a move. Rust tries to be a systems language and that probably also means supporting unusual setups. If we want people to migrate from using C to using Rust, we can’t tell them „But you can’t use too old/low-power/special hardware, sorry“.

So while it probably makes sense to bump the default emitted CPU instructions (the default need not be the minimum possible), it would still make sense to me to support the old ones somehow ‒ have an x86_64-unknown-linux-gnu-old target, a cpu=minimal setting, or something along those lines. Then if someone wants to support 30-year-old hardware, they still can, with some reasonable amount of reconfiguration.


#3

I’m not saying that we drop support for older CPUs entirely. I’m simply suggesting changing the default.


#4

Sorry, that was just a misunderstanding on my side. If there’s some way to ensure the „support for old“ does not entirely rot (maybe running CI with cpu=minimal or so), then having a way to move forward looks fine to me.


#5

In principle, I love the idea, because I’d love to see more efficient instructions being used when possible for obvious reasons.

I do think some attention needs to be given to failure modes here. One of the motivations for adding vendor intrinsics to std was the ability to annotate functions with a specific target feature, and thus build and ship truly portable binaries. Previously, in order to take advantage of SIMD optimizations, I was shipping x86_64 binaries that assumed the target CPU supported SSSE3 instructions, which were first made available in 2006. I actually wound up with several bug reports from folks who downloaded the binary but couldn’t use it:

I was quite surprised there were so many folks still using x86_64 CPUs that didn’t support SSSE3. (Although, I only just recently retired my last CPU that doesn’t support SSSE3. I do still have some machines in active use that don’t support AVX, like my laptop.)
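
For reference, here’s a minimal sketch of the pattern std::arch enables (x86_64 only; the function names are illustrative): compile the fast path with the feature enabled, but only call it after a runtime check, so the same binary still runs on CPUs without SSSE3.

#[cfg(target_arch = "x86_64")]
fn abs_i8x16(bytes: [i8; 16]) -> [i8; 16] {
    #[target_feature(enable = "ssse3")]
    unsafe fn ssse3_impl(bytes: [i8; 16]) -> [i8; 16] {
        use std::arch::x86_64::*;
        let v = _mm_loadu_si128(bytes.as_ptr() as *const __m128i);
        let abs = _mm_abs_epi8(v); // pabsb, an SSSE3 instruction
        let mut out = [0i8; 16];
        _mm_storeu_si128(out.as_mut_ptr() as *mut __m128i, abs);
        out
    }
    if is_x86_feature_detected!("ssse3") {
        unsafe { ssse3_impl(bytes) } // sound: feature presence just checked
    } else {
        let mut out = bytes;
        for b in &mut out {
            *b = b.wrapping_abs(); // matches pabsb: abs(-128) stays -128
        }
        out
    }
}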

So in terms of failure modes, I think there are two angles:

  • What are the failure modes for Rust programmers compiling a binary that they believe is portable to all CPUs of the given target? If we raise the default CPU, it no longer will be.
  • What are the failure modes for users of Rust programs who try to run a binary that uses instructions their CPU doesn’t support? A SIGILL, a core dump, or some other seemingly mysterious error doesn’t seem great.


#6

  • I’d love a clear, easy way to specify the CPU per workspace: [profile.release] arch = "native" or [profile.release.x86_64] sse4 = true, or something like that. RUSTFLAGS is awkward to use and too easy to forget.

  • For macOS specifically, it’s relatively easy to know the baseline CPU requirement: the macOS SDK/Clang has the concept of a “macOS deployment target”, and each macOS version has a specific minimum CPU requirement. (BTW, Rust/Cargo should be more aware of the macOS deployment target too, because the default is to support only the current OS version, so despite best efforts on the Rust side, the linker ruins portability anyway!)


#7

If you think the current defaults for x86_64-apple-darwin are bad, you should check the defaults for i686-apple-darwin; see https://github.com/rust-lang/rust/issues/53423 . The oldest Apple 32-bit x86 machines used Core CPUs with SSE3, but the default CPU doesn’t even support MMX. That makes the default for i686-apple-darwin a CPU roughly ten years older than the first CPU the target ever actually ran on: true backwards compatibility with impossible hardware.

FWIW, I would be OK with adding an x86_64-apple-darwin-sse4 target and making that the default tier 1 target on Darwin, but for backwards-compatibility purposes I’d prefer that x86_64-apple-darwin stay the same (maybe moving it to tier 2). An alternative would be to change the x86_64-apple-darwin defaults to use SSE4.2, as long as a crater run doesn’t break, and to provide a different target that uses the current defaults, or something like that. I am worried that changing the defaults might break some people’s code or workflows.


#8

To bring some data to this discussion, here are Steam stats:

Share  | Feature
-------+---------------
100%   | SSE2
99.99% | SSE3
99.94% | LAHF / SAHF
99.90% | CMPXCHG16B
96.96% | SSSE3
96.86% | FCMOV
95.34% | SSE4.1
94.01% | SSE4.2
87.00% | AVX
86.42% | AES
58.42% | HyperThreading
16.75% | SSE4a

For macOS I know of the Adium stats. It’s a project with declining popularity, so the stats show the long tail of old Macs. Still, they show that 99.78% of Mac CPUs are 64-bit.


#9

@kornel Good data, but it’s worth noting that the population of Steam users is not fully representative of the computers people run Rust programs on – in particular, almost nobody runs Steam on servers. For example, one of the issue reports burntsushi linked was from someone running on Azure, apparently on some strange AMD Opteron processor that doesn’t support SSSE3 – in 2017!


#10

I was already bitten by ‘i686’ not actually meaning i686.


#11

YES PLEASE

I mean, I can set a workspace to target an entirely different CPU easily (which is great) and this seems like it should be at least as easy.


#12

More broadly, a more powerful cfg mechanism (perhaps not a normal attribute, or even an “attribute” at all, if attributes turn out not to be the most flexible way to do it) would solve pretty much all problems along these lines, especially if rustc itself defined a larger, more granular set of built-in defines / environment variables / whatever you want to call them, based on the specific architecture and platform it’s building for.

I do agree that there’s a ton of stuff you should be able to do in Cargo.toml that you currently cannot, though.

I’ve never really understood why there isn’t an “extra flags” field or something like that where you can just pass literally any rustc flag you want. It would be infinitely more ergonomic than the whole cargo rustc -- -Z whatever shenanigans.


#13

You can use the rustflags field in .cargo/config for this:

https://doc.rust-lang.org/cargo/reference/config.html#configuration-keys

You can stick the configuration in your repository if you want it scoped to a specific project.
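
For example, a project-local .cargo/config along these lines (the target and CPU names are just an example) applies the flags to every build of that target:

[target.x86_64-apple-darwin]
rustflags = ["-Ctarget-cpu=penryn"]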


#14

FWIW, a similar thing happened for armv7 targets: additional targets were added with extra features enabled.


#15

I was aware of that, actually. It’s not really the same thing though.

Basically, what I was getting at is this: the only reason I can think of that a dedicated field in Cargo.toml (one Cargo is of course aware of) wasn’t added from the get-go is that the use of crates as something to be put online and downloaded by others was prioritized above everything else, without considering that Cargo.toml is the only thing available to Rust programmers, now or ever, that could even semi-accurately be described as a canonical project-file format.

As far as I’m concerned, the general concept of editing a Cargo.toml (or a build.rs!) for a given crate after downloading it, in order to fit your specific use case, should be seen as totally reasonable and normal.

In practice, the original author of crate XYZ is not always universally correct about what they’ve put in there. Not even close, especially with regard to pretty much everything relating to cross-platform compatibility.


#16

I suggest setting the default to the newest CPU baseline that still covers the overwhelming majority of the Rust userbase: say, 99%. Thoughts?

I think 99% is the right choice, but do realize that it means you can’t use SSE4.1 (oh no), nor even SSSE3.

Maybe SSE4.1 reaches 99% on macOS alone? Possibly. I don’t know.


#17

It’s worth noting that that process is unfinished, because I failed to add a CI job to actually generate the glibc Linux release artifacts.

While I think adding NEON-enabled targets is the right step at this point in time (no SIMD vs. baseline 128-bit SIMD is a pretty big deal), the approach of adding targets won’t scale for SSE and AVX levels. As mentioned in my Rust 2019 post, I think the way forward is to put Xargo functionality into Cargo and compile the standard library with user settings like all other crates (hopefully with some caching).

As for what defaults make sense: if my cpuid criteria were correctly researched, then as of June 2018, 20% of the Firefox x86/x86_64 release population was still on the kind of x86/x86_64 CPU where there’s a substantial performance disparity between aligned SIMD loads/stores and unaligned SIMD loads/stores applied to actually-aligned addresses. IIRC, performant unaligned SIMD loads and stores roughly correlate with SSE4.2 availability. So the landscape is very different when considering an application run by basically everyone, like a Web browser, than when considering gaming systems (the Steam numbers).

(Specifically, the criteria that matched the other 80% was: (cpu_family >= 0x15) OR (cpu_family == 0x6 AND cpu_model >= 0x1A AND cpu_model != 0x1C AND cpu_model != 0x36 AND cpu_model != 0x35 AND cpu_model != 0x27 AND cpu_model != 0x26).)
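
Expressed as code, that predicate looks like this (a sketch; family and model are the combined “display” family/model values from CPUID leaf 1):

// Matches the ~80% of the population with performant unaligned SIMD
// loads/stores, per the criteria quoted above.
fn fast_unaligned_simd(family: u32, model: u32) -> bool {
    family >= 0x15
        || (family == 0x6
            && model >= 0x1A
            && model != 0x1C
            && model != 0x36
            && model != 0x35
            && model != 0x27
            && model != 0x26)
}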

Of course, separating x86 and x86_64 could give different numbers, which might justify changing the baseline SSE level on x86_64. (Meanwhile, bringing the baseline up to SSE2 on 32-bit x86 is still a new thing: Fedora is moving to it now, and in Ubuntu’s case, dropping x86 install support seems less controversial than compiling 32-bit packages in SSE2 mode!)

In any case, it seems to me that making cargo install default to native should be the first step, even if it means that people will routinely run the compiler with options that aren’t the ones tested in CI.


#18

I’m trying to support AES-NI in my crates. So far this has involved plastering my documentation with information about how to configure RUSTFLAGS appropriately, either through the environment or ~/.cargo/config. I agree with @kornel that “RUSTFLAGS is awkward to use and too easy to forget”, and the per-target namespace mechanism suggested above would be much, much nicer.
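
The documentation in question boils down to asking users for something like this (a sketch; the exact feature list depends on the crate and target):

# .cargo/config: applies to every build in this project
[build]
rustflags = ["-Ctarget-feature=+aes,+ssse3"]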

Some recent discussion on this on the crate in question:

One of my users suggested a default I’d love to be able to provide:

I think ideally the default should be “just work as fast as possible everywhere”

…but presently this requires anyone compiling my crate to configure target-feature manually, and there’s no way to provide a configuration that automatically builds the fastest backend available for each architecture and selects among them (using e.g. runtime CPU feature detection where necessary).

So instead I’m using a more draconian mechanism to try to force the choice: users must either configure target-feature or enable a cargo feature that selects a non-hardware-accelerated fallback (so as to avoid the “easy to forget” part of “RUSTFLAGS is awkward to use and too easy to forget”), and it’s just leading to confusion.
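
Concretely, the draconian mechanism amounts to something like this (a hypothetical sketch; force-soft is an illustrative cargo feature name, not necessarily the crate’s real one):

// Refuse to build unless AES-NI was enabled at compile time or the user
// explicitly opted into a software fallback via a cargo feature.
#[cfg(not(any(target_feature = "aes", feature = "force-soft")))]
compile_error!(
    "enable AES-NI with RUSTFLAGS=\"-Ctarget-feature=+aes\", \
     or enable the `force-soft` cargo feature to use a software fallback"
);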


#19

I’d really, really like to see this happen in 2019. It would also allow you to LTO the standard library, compile it for minimal size instead of speed, etc.


#20

Amusingly, this thread motivated me to go look at putting something in my ~/.cargo/config to use the native CPU on my laptop. I wanted it on by default, so that cargo install would use it for all the great little Rust utilities I use. I looked at the documentation and decided that, in order not to clash with embedded builds for other targets, I should add it to a section for the native target rather than blanket-enable it for everything.

I worked out what I needed, then opened the file to edit that in. Guess what I found already there?

[target.x86_64-unknown-linux-gnu]
rustflags = ["-Ctarget-cpu=native"]

Clearly, I did this once before :roll_eyes: and what I need to remember to do is override it in the workspace on those (rare) occasions where I’m building a binary to run elsewhere.

If there’s a point to this random anecdote, it’s that configuration hidden elsewhere is easily forgotten. Exposing this neatly in Cargo.toml makes it discoverable in the place people are likely to look.