Initializing large arrays with non-zero values fails to optimize

gorilskij · May 4, 2021, 12:14pm

I was playing around with compiler explorer and I found some very strange optimizing behavior around unused arrays.

The code looks like this:

pub fn weird() {
    let _unused = [1_u32; 881];
}

For arrays with sizes below 881, this generates the expected assembly, namely a single ret, but for sizes >= 881, there is some assembly being generated. This happens at different array sizes for different types (it never happens with u8) and in all cases allocating an array of 0s is always optimized away.

This is a table of maximum sizes of arrays that are optimized away for a few types:

type	max. size
`*16`	2400
`*32`	880
`*64`	880
`*128`	297
`&()`	900
`&str`	148
`(u{8, 16, 32, 128}, u{8, 16, 32, 128})`	148
`(u64, u64)`	297
`(u8, u8, u8)`	99
`(u64, u64, u64)`	148

It's clear that the optimization depends at least somewhat on the size of the type but the numbers are mysterious to me and in any case the array should be completely optimized away in every situation. I should note that all this data relates to -C opt-level=3.

Interestingly, if the array is initialized with 0s and then filled with 1s in a loop, the optimized acts correctly and ignores the whole thing, the problem really only occurs when using the array literal initialization with Copy types.

Mokuz · May 4, 2021, 1:10pm

pub fn weird() {
    let _unused = [(1u64, 2u64, 3u64); 881];
}

pub fn loopp() {
    let mut arr: [(u64, u64, u64); 881] = unsafe { core::mem::uninitialized() };

    for i in &mut arr {
        *i = (1, 2, 3);
    }
}

if you look at the assembly generated for both of these functions, you'll see that they are equal, giving us insight to what's actually happening

example::loopp:
        mov     eax, 832
        test    rax, rax
        je      .LBB0_2
.LBB0_3:
        add     rax, -64
        test    rax, rax
        jne     .LBB0_3
.LBB0_2:
        ret

using rustc 1.51.0 -Copt-level=3 -C target-cpu=native

EDIT: this breaks everything

pub fn ext(mut arr: [(u64, u64, u64); 881]) {
    for i in &mut arr {
        *i = (1, 2, 3);
    }
}

.LCPI0_0:
        .quad   3
        .quad   1
        .quad   2
        .quad   3
.LCPI0_1:
        .quad   1
        .quad   2
        .quad   3
        .quad   1
        .quad   2
        .quad   3
        .quad   1
        .quad   2
.LCPI0_2:
        .quad   1
        .quad   2
example::ext:
        xor     eax, eax
        vmovaps ymm0, ymmword ptr [rip + .LCPI0_0]
        vmovaps zmm1, zmmword ptr [rip + .LCPI0_1]
.LBB0_1:
        vmovups ymmword ptr [rdi + rax + 64], ymm0
        vmovups zmmword ptr [rdi + rax], zmm1
        vmovups ymmword ptr [rdi + rax + 160], ymm0
        vmovups zmmword ptr [rdi + rax + 96], zmm1
        vmovups ymmword ptr [rdi + rax + 256], ymm0
        vmovups zmmword ptr [rdi + rax + 192], zmm1
        vmovups ymmword ptr [rdi + rax + 352], ymm0
        vmovups zmmword ptr [rdi + rax + 288], zmm1
        vmovups ymmword ptr [rdi + rax + 448], ymm0
        vmovups zmmword ptr [rdi + rax + 384], zmm1
        vmovups ymmword ptr [rdi + rax + 544], ymm0
        vmovups zmmword ptr [rdi + rax + 480], zmm1
        vmovups ymmword ptr [rdi + rax + 640], ymm0
        vmovups zmmword ptr [rdi + rax + 576], zmm1
        vmovups ymmword ptr [rdi + rax + 736], ymm0
        vmovups zmmword ptr [rdi + rax + 672], zmm1
        vmovups ymmword ptr [rdi + rax + 832], ymm0
        vmovups zmmword ptr [rdi + rax + 768], zmm1
        vmovups ymmword ptr [rdi + rax + 928], ymm0
        vmovups zmmword ptr [rdi + rax + 864], zmm1
        vmovups ymmword ptr [rdi + rax + 1024], ymm0
        vmovups zmmword ptr [rdi + rax + 960], zmm1
        vmovups ymmword ptr [rdi + rax + 1120], ymm0
        vmovups zmmword ptr [rdi + rax + 1056], zmm1
        cmp     rax, 19968
        je      .LBB0_2
        vmovups ymmword ptr [rdi + rax + 1216], ymm0
        vmovups zmmword ptr [rdi + rax + 1152], zmm1
        vmovups ymmword ptr [rdi + rax + 1312], ymm0
        vmovups zmmword ptr [rdi + rax + 1248], zmm1
        vmovups ymmword ptr [rdi + rax + 1408], ymm0
        vmovups zmmword ptr [rdi + rax + 1344], zmm1
        vmovups ymmword ptr [rdi + rax + 1504], ymm0
        vmovups zmmword ptr [rdi + rax + 1440], zmm1
        add     rax, 1536
        jmp     .LBB0_1
.LBB0_2:
        vmovaps xmm0, xmmword ptr [rip + .LCPI0_2]
        vmovups xmmword ptr [rdi + 21120], xmm0
        mov     qword ptr [rdi + 21136], 3
        vzeroupper
        ret

gorilskij · May 4, 2021, 2:18pm

I'm very much a beginner when it comes to assembly so I'm not sure I understand what's going on. But it seems to me there should be a much earlier step still within rustc that removes unused values...

steffahn · May 4, 2021, 2:32pm

For the record, that’s UB.

mjbshaw · May 4, 2021, 5:12pm

Since it's not clear from the OP, the pub fn weird() { let _unused = [1_u32; 881]; } will generate the following asm, even with -O:

example::weird:
        mov     eax, 880
.LBB0_1:
        add     rax, -40
        jne     .LBB0_1
        ret

PoignardAzur · May 5, 2021, 10:55am

That's so weird. If the assembly were modifying stack data, I'd get why LLVM fails to optimize it away, but in this case it's just bumping a register and then returning.

It feels like LLVM should be trivially able to say "nope, we're only changing a local and not doing anything with it, this function returns void, throw it all away".

Mhh... dumb thought, but I wonder if LLVM was able to optimize this until recently and the recent removal of the forward progress guarantee inhibited optimizations that would have elided this. (not saying the forward progress guarantee is needed to optimize this code; just that LLVM might have an optimization pass that would have solved this, but was classified under "only works if this function has forward progress guanratee")

steffahn · May 5, 2021, 11:53am

At least my gut feeling is telling me that the only problem that LLVM could have with this code is that it's unable to figure out that it terminates. Since forward progress guarantee is about asserting that certain kinds of code will always terminate, it might very well be related to that. Especially with the focus of LLVM on languages like C++ that do come with a forward progress guarantee requirement, it wouldn't be surprising to me if the lack of other means of termination checking heuristics (or the lack of whatever else is missing in this failed optimization) that could apply in this example is simply due to that it has never been a problem.

gorilskij · May 5, 2021, 12:09pm

Sometimes (especially with u128) it generates quite a bit more assembly, try [1_u128; 300] for example.

gorilskij · May 5, 2021, 12:14pm

I see some assembly generated with basically all versions of rust, from 1.0.0 to nightly. Does rustc not do any optimization of unused values?

mjbshaw · May 5, 2021, 12:57pm

rustc is infamous for not doing any optimization and giving LLVM a crap ton of IR to figure out.

My guess is that it's an issue with the order of optimizations done in LLVM. FWIW these loops terminate, so I don't think this is a forward progress issue.

steffahn · May 5, 2021, 1:10pm

Well… well… the purpose of forward progress isn’t to turn non-terminating code without side effects into UB. That would be a weird motivation. Instead the purpose is to simplify the optimization of side-effect free code that is terminating. The simplification is that such code can be optimized away without the compiler needing to actually figure out whether the code terminates. Which is great, since figuring out that code terminates is hard, and impossible in general.

In a language with a requirement for forward progress, the compiler won’t try to actively find non-terminating code and “turn it into UB”, instead it will optimize terminating code better and as a side-effect it will also optimize some non-terminating code “incorrectly” (except it’s not incorrect when it’s UB, that’s the whole trick of UB). This means the fact that the loop terminates is not an indication for the fact that forward progress can’t be relevant here.

I’m not saying that this specific code example optimizes badly because of a forward progress issue, I’m just saying that it could be related to forward progress.

mcy · May 7, 2021, 8:59pm

The IR emitted by rustc (for the OP's snippet) is

; Function Attrs: nofree nounwind nonlazybind uwtable writeonly
define void @_ZN10playground5weird17h28e31b3814ad431aE() unnamed_addr #0 {
start:
  br label %vector.body

vector.body:                                      ; preds = %vector.body, %start
  %index = phi i64 [ 0, %start ], [ %index.next.4, %vector.body ]
  %index.next.4 = add nuw nsw i64 %index, 40
  %0 = icmp eq i64 %index.next.4, 880
  br i1 %0, label %repeat_loop_body, label %vector.body, !llvm.loop !2

repeat_loop_body:                                 ; preds = %vector.body
  ret void
}

(This is rustc -Copt-level=3 --emit=llvm-ir)

Meanwhile, for identical C++

void Weird() {
  std::array<int, 881> arr;
  arr.fill(1)
}

when passed through clang++ -O3 -S -emit-llvm we get

; Function Attrs: nounwind uwtable writeonly
define dso_local void @_Z5Weirdv() local_unnamed_addr #0 {
  br label %1

1:                                                ; preds = %1, %0
  %2 = phi i64 [ 0, %0 ], [ %3, %1 ]
  %3 = add nuw nsw i64 %2, 40
  %4 = icmp eq i64 %3, 880 
  br i1 %4, label %5, label %1, !llvm.loop !2

5:                                                ; preds = %1
  ret void
}

If I ask Clang and Rustc to emit a .S file, I get, byte for byte,

weird:
  movl  $880, %eax
.LBB0_1:
  addq  $-40, %rax
  jne .LBB0_1
  retq

This looks like an LLVM issue, interestingly.

PoignardAzur · May 17, 2021, 7:09am

I tested some variations on goldbolt:

void weird_2() {
  int array[881];
  int i;
  for (i = 0; i < 881; ++i) {
      array[i] = 1;
  }
}

void weird_3() {
  int array[881];
  int* p;
  for (p = array; p < array + 881; ++p) {
      *p = 1;
  }
}

Some takeaways:

LLVM can't optimize some versions at -O2, even though it logically should.
GCC can optimize all variations I tried from -O2 onwards.
No notable difference between -O2 and -O3 in the cases I tested (this is roughly what I expected; I'm told there isn't much difference between the two levels).
The above snippet is in C++, where the forward progress guarantee still applies; and LLVM fails to optimize it even in previous versions, so my hypothesis that the problem came from the recent remove of the forward-progress guarantee has been falsified.

I wonder if there's some trivial fix the Rust compiler could do to tell LLVM "no, don't worry, this loop always terminates, feel free to elide it". Like an annotation on the break or something; I don't know if LLVM has that kind of annotation though.

system · August 15, 2021, 7:10am

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Optional memory alignments for arrays and vectors language design	3	1334	March 25, 2019
`NonMaxUsize` and niche value optimisation libs	20	1408	January 9, 2024
Mem::uninitialized, `!` and trap representations language design	56	6486	March 25, 2019
Const generic array sizes as its own mini-stabilization? libs	5	625	June 1, 2021
Unusual missed optimization for signed integers modulo `1<<k`? compiler	14	1421	December 26, 2022

Initializing large arrays with non-zero values fails to optimize

Related Topics