Initializing large arrays with non-zero values fails to optimize

I was playing around with compiler explorer and I found some very strange optimizing behavior around unused arrays.

The code looks like this:

pub fn weird() {
    let _unused = [1_u32; 881];
}

For arrays with sizes below 881, this generates the expected assembly, namely a single ret, but for sizes >= 881, there is some assembly being generated. This happens at different array sizes for different types (it never happens with u8) and in all cases allocating an array of 0s is always optimized away.

This is a table of maximum sizes of arrays that are optimized away for a few types:

type max. size
*16 2400
*32 880
*64 880
*128 297
&() 900
&str 148
(u{8, 16, 32, 128}, u{8, 16, 32, 128}) 148
(u64, u64) 297
(u8, u8, u8) 99
(u64, u64, u64) 148

It's clear that the optimization depends at least somewhat on the size of the type but the numbers are mysterious to me and in any case the array should be completely optimized away in every situation. I should note that all this data relates to -C opt-level=3.

Interestingly, if the array is initialized with 0s and then filled with 1s in a loop, the optimized acts correctly and ignores the whole thing, the problem really only occurs when using the array literal initialization with Copy types.

pub fn weird() {
    let _unused = [(1u64, 2u64, 3u64); 881];
}

pub fn loopp() {
    let mut arr: [(u64, u64, u64); 881] = unsafe { core::mem::uninitialized() };

    for i in &mut arr {
        *i = (1, 2, 3);
    }
}

if you look at the assembly generated for both of these functions, you'll see that they are equal, giving us insight to what's actually happening

example::loopp:
        mov     eax, 832
        test    rax, rax
        je      .LBB0_2
.LBB0_3:
        add     rax, -64
        test    rax, rax
        jne     .LBB0_3
.LBB0_2:
        ret

using rustc 1.51.0 -Copt-level=3 -C target-cpu=native

EDIT: this breaks everything

pub fn ext(mut arr: [(u64, u64, u64); 881]) {
    for i in &mut arr {
        *i = (1, 2, 3);
    }
}
.LCPI0_0:
        .quad   3
        .quad   1
        .quad   2
        .quad   3
.LCPI0_1:
        .quad   1
        .quad   2
        .quad   3
        .quad   1
        .quad   2
        .quad   3
        .quad   1
        .quad   2
.LCPI0_2:
        .quad   1
        .quad   2
example::ext:
        xor     eax, eax
        vmovaps ymm0, ymmword ptr [rip + .LCPI0_0]
        vmovaps zmm1, zmmword ptr [rip + .LCPI0_1]
.LBB0_1:
        vmovups ymmword ptr [rdi + rax + 64], ymm0
        vmovups zmmword ptr [rdi + rax], zmm1
        vmovups ymmword ptr [rdi + rax + 160], ymm0
        vmovups zmmword ptr [rdi + rax + 96], zmm1
        vmovups ymmword ptr [rdi + rax + 256], ymm0
        vmovups zmmword ptr [rdi + rax + 192], zmm1
        vmovups ymmword ptr [rdi + rax + 352], ymm0
        vmovups zmmword ptr [rdi + rax + 288], zmm1
        vmovups ymmword ptr [rdi + rax + 448], ymm0
        vmovups zmmword ptr [rdi + rax + 384], zmm1
        vmovups ymmword ptr [rdi + rax + 544], ymm0
        vmovups zmmword ptr [rdi + rax + 480], zmm1
        vmovups ymmword ptr [rdi + rax + 640], ymm0
        vmovups zmmword ptr [rdi + rax + 576], zmm1
        vmovups ymmword ptr [rdi + rax + 736], ymm0
        vmovups zmmword ptr [rdi + rax + 672], zmm1
        vmovups ymmword ptr [rdi + rax + 832], ymm0
        vmovups zmmword ptr [rdi + rax + 768], zmm1
        vmovups ymmword ptr [rdi + rax + 928], ymm0
        vmovups zmmword ptr [rdi + rax + 864], zmm1
        vmovups ymmword ptr [rdi + rax + 1024], ymm0
        vmovups zmmword ptr [rdi + rax + 960], zmm1
        vmovups ymmword ptr [rdi + rax + 1120], ymm0
        vmovups zmmword ptr [rdi + rax + 1056], zmm1
        cmp     rax, 19968
        je      .LBB0_2
        vmovups ymmword ptr [rdi + rax + 1216], ymm0
        vmovups zmmword ptr [rdi + rax + 1152], zmm1
        vmovups ymmword ptr [rdi + rax + 1312], ymm0
        vmovups zmmword ptr [rdi + rax + 1248], zmm1
        vmovups ymmword ptr [rdi + rax + 1408], ymm0
        vmovups zmmword ptr [rdi + rax + 1344], zmm1
        vmovups ymmword ptr [rdi + rax + 1504], ymm0
        vmovups zmmword ptr [rdi + rax + 1440], zmm1
        add     rax, 1536
        jmp     .LBB0_1
.LBB0_2:
        vmovaps xmm0, xmmword ptr [rip + .LCPI0_2]
        vmovups xmmword ptr [rdi + 21120], xmm0
        mov     qword ptr [rdi + 21136], 3
        vzeroupper
        ret

I'm very much a beginner when it comes to assembly so I'm not sure I understand what's going on. But it seems to me there should be a much earlier step still within rustc that removes unused values...

For the record, that’s UB.

3 Likes

Since it's not clear from the OP, the pub fn weird() { let _unused = [1_u32; 881]; } will generate the following asm, even with -O:

example::weird:
        mov     eax, 880
.LBB0_1:
        add     rax, -40
        jne     .LBB0_1
        ret
1 Like

That's so weird. If the assembly were modifying stack data, I'd get why LLVM fails to optimize it away, but in this case it's just bumping a register and then returning.

It feels like LLVM should be trivially able to say "nope, we're only changing a local and not doing anything with it, this function returns void, throw it all away".

Mhh... dumb thought, but I wonder if LLVM was able to optimize this until recently and the recent removal of the forward progress guarantee inhibited optimizations that would have elided this. (not saying the forward progress guarantee is needed to optimize this code; just that LLVM might have an optimization pass that would have solved this, but was classified under "only works if this function has forward progress guanratee")

At least my gut feeling is telling me that the only problem that LLVM could have with this code is that it's unable to figure out that it terminates. Since forward progress guarantee is about asserting that certain kinds of code will always terminate, it might very well be related to that. Especially with the focus of LLVM on languages like C++ that do come with a forward progress guarantee requirement, it wouldn't be surprising to me if the lack of other means of termination checking heuristics (or the lack of whatever else is missing in this failed optimization) that could apply in this example is simply due to that it has never been a problem.

Sometimes (especially with u128) it generates quite a bit more assembly, try [1_u128; 300] for example.

I see some assembly generated with basically all versions of rust, from 1.0.0 to nightly. Does rustc not do any optimization of unused values?

rustc is infamous for not doing any optimization and giving LLVM a crap ton of IR to figure out.

My guess is that it's an issue with the order of optimizations done in LLVM. FWIW these loops terminate, so I don't think this is a forward progress issue.

Well… well… the purpose of forward progress isn’t to turn non-terminating code without side effects into UB. That would be a weird motivation. Instead the purpose is to simplify the optimization of side-effect free code that is terminating. The simplification is that such code can be optimized away without the compiler needing to actually figure out whether the code terminates. Which is great, since figuring out that code terminates is hard, and impossible in general.

In a language with a requirement for forward progress, the compiler won’t try to actively find non-terminating code and “turn it into UB”, instead it will optimize terminating code better and as a side-effect it will also optimize some non-terminating code “incorrectly” (except it’s not incorrect when it’s UB, that’s the whole trick of UB). This means the fact that the loop terminates is not an indication for the fact that forward progress can’t be relevant here.

I’m not saying that this specific code example optimizes badly because of a forward progress issue, I’m just saying that it could be related to forward progress.

1 Like