Slower code with "-C target-cpu=native"

I have an AMD FX-8320. Every Rust program runs about 1.5-2x slower when I compile with -C target-cpu=native, both on stable 1.65 and on nightly. rustc --print=target-cpus says:

Available CPUs for this target:
    native         - Select the CPU of the current host (currently bdver2).
    ...

Did rustc detect wrong CPU? Nope, cat /proc/cpuinfo confirms bdver2:

processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 21
model           : 2
model name      : AMD FX(tm)-8320 Eight-Core Processor
stepping        : 0
microcode       : 0x6000852
cpu MHz         : 3296.080
cache size      : 2048 KB
physical id     : 0
siblings        : 8
core id         : 0
cpu cores       : 4
apicid          : 16
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 popcnt aes xsave avx f16c lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs xop skinit wdt fma4 tce nodeid_msr tbm topoext perfctr_core perfctr_nb cpb hw_pstate ssbd ibpb vmmcall bmi1 arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold
bugs            : fxsave_leak sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed
bogomips        : 7032.14
TLB size        : 1536 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 48 bits physical, 48 bits virtual
power management: ts ttp tm 100mhzsteps hwpstate cpb eff_freq_ro

CPU family 21 is Piledriver, which is indeed Bulldozer version 2. Is something wrong with the set of target features that rustc passes to its LLVM backend? How do I debug this?

Well, how does the generated assembly change? Try to find a very short program that exhibits the slowdown; that will make it easier to spot the change that matters.
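
For instance, one quick way to diff the generated assembly between the two builds (a sketch; the file names here are arbitrary):

rustc -O --emit asm=generic.s main.rs
rustc -O -C target-cpu=native --emit asm=native.s main.rs
diff -u generic.s native.s | less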

diff -u <(rustc +nightly --print cfg) <(rustc +nightly -Ctarget-cpu=bdver2 --print cfg) shows that the following target features get enabled:

target_feature="aes"
target_feature="avx"
target_feature="bmi1"
target_feature="cmpxchg16b"
target_feature="f16c"
target_feature="fma"
target_feature="lzcnt"
target_feature="pclmulqdq"
target_feature="popcnt"
target_feature="sse3"
target_feature="sse4.1"
target_feature="sse4.2"
target_feature="sse4a"
target_feature="ssse3"
target_feature="tbm"
target_feature="xsave"

You can check whether enabling them with something like -Ctarget-feature=+aes,+avx,+bmi1,+cmpxchg16b,+f16c,+fma,+lzcnt,+pclmulqdq,+popcnt,+sse3,+sse4.1,+sse4.2,+sse4a,+ssse3,+tbm,+xsave reproduces the slowdown. If it doesn't, the slowdown is likely caused by a change in the instruction scheduling heuristics being used. You can verify that by using -Ztune-cpu=bdver2 on nightly, without -Ctarget-cpu or -Ctarget-feature.
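
For example (a sketch; -Ztune-cpu is unstable, so this needs the nightly toolchain):

RUSTFLAGS="-Ztune-cpu=bdver2" cargo +nightly run --release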

Thanks! It turns out the culprit is AVX: when compiled with -C target-cpu=native -C target-feature=-avx, programs perform as well as with no flags or with -C target-cpu=generic, and they also run slower when compiled with just -C target-feature=+avx. I was quite surprised to learn this. So either AVX is slow on my CPU (unlikely) or, more likely, LLVM decided to use AVX in places where it brings no benefit, like unrolling small loops.
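
For anyone following along, the comparison boils down to something like this (a sketch; assumes a Cargo binary crate and perf installed):

RUSTFLAGS="-C target-cpu=native" perf stat cargo run --release
RUSTFLAGS="-C target-cpu=native -C target-feature=-avx" perf stat cargo run --release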

If the issue is related to AVX, are you hitting the thermal limit of your cooling system? I think I read somewhere that AVX-512 had a very high chance of throttling your CPU because of the heat it generates.

Not at all in this case. First, the programs I tested ran for less than 5 seconds. Second, nothing else was running at the time, and the CPU temperature was 35°C according to lm_sensors.

In https://www.agner.org/optimize/blog/read.php?i=285#285 Agner Fog made this observation about the Piledriver architecture: 256-bit stores are exceptionally slow on it, with a throughput of roughly one 256-bit store per 17-20 clock cycles, so 256-bit memory writes should be avoided on this CPU.

LLVM doesn't seem to be aware of that, as it happily uses AVX if it's enabled.

I think it's usually the power budget/downclocking that's the issue, not the temperatures. The libx265 developers ran into this when they added AVX(2? 512?) support (there was a VideoLAN Developer Days talk on this; I don't know whether it was recorded). This gist has some good details. But this seems like a red herring given the info @quaternic shared.

Incredible. Thank you so much; now I know never to use AVX on this architecture. So unlikely things do happen. I wonder why AMD put this garbage of a circuit into Piledriver in the first place if it's even slower than the older stuff :face_with_raised_eyebrow:

Compatibility with programs requiring AVX, or marketing reasons, I would guess?

Probably the latter :slight_smile: If a capitalist can legally sell you turds packaged as candy, he will do it.

Thanks to @quaternic, I did some actual measurements. Brace yourselves.

Benchmark:

#![feature(bench_black_box)]
const X: [u8; 128] = *b"Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nulla efficitur vehicula libero et rutrum. Aenean ut massa tempus cras.";

fn main() {
    let cell = std::cell::UnsafeCell::new([0; 128]);
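    // black_box hides the buffer behind an opaque call, so the stores below can't be optimized away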
    std::hint::black_box(&cell);
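    // copy the 128 bytes over and over; the infinite loop is stopped manually while perf collects stats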
    loop {
        unsafe {
            *cell.get() = X;
        }
    }
}

First, let's see how the olden way of loop unrolling goes on my lovely FX.

perf stat cargo +nightly run --release

 12,32 │80:   movaps %xmm0,0x80(%rsp)
 12,72 │      movaps %xmm1,0x70(%rsp)
 12,32 │      movaps %xmm2,0x60(%rsp)
 12,90 │      movaps %xmm3,0x50(%rsp)
 12,35 │      movaps %xmm4,0x40(%rsp)
 12,48 │      movaps %xmm5,0x30(%rsp)
 12,22 │      movaps %xmm6,0x20(%rsp)
  6,56 │      movaps %xmm7,0x10(%rsp)
  6,12 │    ↑ jmp    80

Observe insn per cycle.

 Performance counter stats for 'cargo +nightly run --release':

         21 022,33 msec task-clock:u              #    1,000 CPUs utilized          
                 0      context-switches:u        #    0,000 /sec                   
                 0      cpu-migrations:u          #    0,000 /sec                   
             1 408      page-faults:u             #   66,976 /sec                   
    82 724 942 673      cycles:u                  #    3,935 GHz                    
       174 922 469      stalled-cycles-frontend:u #    0,21% frontend cycles idle   
        14 044 782      stalled-cycles-backend:u  #    0,02% backend cycles idle    
    92 907 312 650      instructions:u            #    1,12  insn per cycle         
                                                  #    0,00  stalled cycles per insn
    10 329 314 731      branches:u                #  491,350 M/sec                  
           427 305      branch-misses:u           #    0,00% of all branches        

      21,032053361 seconds time elapsed

      20,985561000 seconds user
       0,010564000 seconds sys

Impressive. Very nice. Let's see Piledriver's AVX.

RUSTFLAGS="-C target-feature=+avx" perf stat cargo +nightly run --release

  2,83 │60:   vmovaps %ymm0,0x80(%rsp)
 23,55 │      vmovaps %ymm1,0x60(%rsp)
 24,70 │      vmovaps %ymm2,0x40(%rsp)
 24,38 │      vmovaps %ymm3,0x20(%rsp)
 24,55 │    ↑ jmp     60
 Performance counter stats for 'cargo +nightly run --release':

         20 595,21 msec task-clock:u              #    1,000 CPUs utilized          
                 0      context-switches:u        #    0,000 /sec                   
                 0      cpu-migrations:u          #    0,000 /sec                   
             1 410      page-faults:u             #   68,463 /sec                   
    81 007 957 270      cycles:u                  #    3,933 GHz                    
       196 009 605      stalled-cycles-frontend:u #    0,24% frontend cycles idle   
    70 843 823 579      stalled-cycles-backend:u  #   87,45% backend cycles idle    
     5 591 794 648      instructions:u            #    0,07  insn per cycle         
                                                  #   12,67  stalled cycles per insn
     1 119 522 542      branches:u                #   54,358 M/sec                  
           422 744      branch-misses:u           #    0,04% of all branches        

      20,604683023 seconds time elapsed

      20,556737000 seconds user
       0,011829000 seconds sys

Look at that subtle CPU stalling... The wasteful slowness of it. Oh my god, it's even more than twelve.

It's a relatively well-known issue: using AVX instructions in code with a low density of such instructions results in lower performance than equivalent code that doesn't use them. The Cloudflare blog post on the topic covers it quite well.

Unfortunately, LLVM is not smart enough to take this fact into consideration during codegen, so it will happily insert several AVX instructions into mostly non-SIMD code, which in practice reduces performance rather than improving it.

0,07 insn per cycle is not due to downclocking. And the "power license" stuff is an Intel issue.


vmovaps %ymm0,0x80(%rsp)

As previous posts mentioned, Bulldozer and Piledriver have an AVX implementation that internally splits 256-bit instructions into two 128-bit µops, so it's better to use XMM registers than YMM registers. There's an LLVM argument to control that.

Try adding -Cllvm-args=-force-vector-width=128. According to #53312, prefer-vector-width isn't available because it's a function attribute, not a global flag.

But this should really be fixed in LLVM so that it does this by default.

Note that raw latency and throughput aren't the only concerns. Besides being able to run programs that use AVX-512, it's also possible that doing a computation with AVX-512 is more power-efficient than doing the same computation with narrower registers.

It'll be interesting to see how AMD's new chips, which are also marketed on their AVX-512 support (supposedly for accelerating ML tasks), perform.

Piledriver does not support AVX-512 or even AVX2, just the original AVX.

Ah, my bad, I got some wires crossed. Still, the same applies.

Still no effect, but now it emits a "remark":

$ RUSTFLAGS="-C target-feature=+avx -C llvm-args=-force-vector-width=128" perf stat cargo +nightly run --release
remark: <unknown>:0:0: loop not vectorized: call instruction cannot be vectorized

The assembly still looks the same, and so do the perf stats.

The LLVM IR for that example contains a call to memcpy, and some LLVM optimization then replaces it with LLVM's own inline expansion rather than a call to the libc function. I guess that expansion doesn't honor force-vector-width, ignores the cost model, and just blindly emits 256-bit moves when AVX is enabled, even though llvm-mca knows they're costly on this CPU.
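
If you want to see it yourself, something like this should show the memcpy intrinsic in the emitted IR (a sketch; the exact .ll path depends on the crate):

cargo +nightly rustc --release -- --emit=llvm-ir
grep llvm.memcpy target/release/deps/*.ll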

Seems like an LLVM bug to me.

Note that llvm-mca doesn't know that the unaligned 256-bit move (vmovups) is just as slow, or even slower. Thanks to @burjui for measuring it; the benchmark indicated each move taking ~25 cycles instead of the 18 for the aligned ones. That's probably affected by the actual alignment at runtime, but we didn't test for that. This is just a guess, but the numbers would approximately line up if vmovups effectively costs as much as one or two vmovaps, depending on whether a cache line is crossed.

So it could also be that LLVM initially emits vmovups because that looks efficient according to its model, and some later optimization replaces those with the aligned version where possible.
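
For reference, here's one way to ask llvm-mca what it thinks of both kinds of store on this CPU (a sketch; llvm-mca ships with LLVM, and its output is a model rather than a measurement):

cat > stores.s <<'EOF'
vmovaps %ymm0, 0x80(%rsp)
vmovups %ymm0, 0x80(%rsp)
EOF
llvm-mca -mcpu=bdver2 stores.s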