Bad optimization for contiguous memory access with a redundant early check

Goal: Get a u64 from 8 contiguous elements of a &[u8], and check its value

(All of the following code can be checked in the playground; even worse u128 examples are available there as well.)

Good one

To avoid unaligned access, I use the following code:

fn u64_fetch1(buf: &[u8], pos: usize) -> bool {
    let Some([b0, b1, b2, b3, b4, b5, b6, b7]) = buf.get(pos..(pos + 8)) else {
        return false;
    };
    let target = u64::from_le_bytes([*b0, *b1, *b2, *b3, *b4, *b5, *b6, *b7]);
    target == 0x1234567812345678
}

The optimization works fine (built with the release profile for x86-64):

u64_fetch1:
	cmp	rdx, -8
	setae	al
	lea	rcx, [rdx + 8]
	cmp	rcx, rsi
	seta	cl
	or	cl, al
	je	.LBB2_2
	xor	eax, eax
	ret

.LBB2_2:
	movabs	rax, 1311768465173141112
	cmp	qword ptr [rdi + rdx], rax
	sete	al
	ret

There is only one memory access with qword width.

Bad one

However, if I add a redundant check in the code:

fn u64_fetch2(buf: &[u8], pos: usize) -> bool {
    let Some(b0) = buf.get(pos) else {
        return false;
    };
    // This one is redundant
    if *b0 != 0x78 {
        return false;
    }
    let Some([b0, b1, b2, b3, b4, b5, b6, b7]) = buf.get(pos..(pos + 8)) else {
        return false;
    };
    let target = u64::from_le_bytes([*b0, *b1, *b2, *b3, *b4, *b5, *b6, *b7]);
    target == 0x1234567812345678
}

And the generated assembly is

u64_fetch2:
	cmp	rdx, rsi
	jae	.LBB3_2
	cmp	byte ptr [rdi + rdx], 120
	jne	.LBB3_2
	cmp	rdx, -8
	setae	al
	lea	rcx, [rdx + 8]
	cmp	rcx, rsi
	seta	cl
	or	cl, al
	je	.LBB3_5

.LBB3_2:
	xor	eax, eax
	ret

.LBB3_5:
	movzx	eax, byte ptr [rdi + rdx + 1]
	movzx	ecx, byte ptr [rdi + rdx + 2]
	mov	esi, dword ptr [rdi + rdx + 4]
	shl	rsi, 32
	movzx	edx, byte ptr [rdi + rdx + 3]
	shl	edx, 24
	shl	ecx, 16
	shl	eax, 8
	or	eax, ecx
	or	eax, edx
	or	rax, rsi
	movabs	rcx, 1311768465173140992
	cmp	rax, rcx
	sete	al
	ret

There are five memory accesses in total: four byte-width accesses and one dword-width access.

In fact, the generated code above shows that the optimizer does know the check is redundant: after checking the first byte, it never re-reads it; the remaining seven bytes are loaded (three single bytes plus one dword) and compared against the constant with its low byte cleared.

Does the optimizer assume that a non-volatile access to immutable memory has side effects? If not, why doesn't it remove the first check so that only one memory access remains?

What happens if you move the b0 check after the let/else and remove the first redundant check and buf.get? (even if you have the check, you should probably avoid calling get twice if you can, as that may be what’s confusing the compiler, even though after inlining it shouldn’t really be an issue :person_shrugging:).

Also, what happens if you include the test in the match? (let [b0 @ 0x78, etc. Also, not directly relevant to your question, but note that you can also give a name for the whole array to avoid repetition when passing to from_bytes)

What happens if you move the b0 check after the let/else and remove the first redundant check and buf.get?

Rust code:

fn u64_fetch3(buf: &[u8], pos: usize) -> bool {
    let Some([b0, b1, b2, b3, b4, b5, b6, b7]) = buf.get(pos..(pos + 8)) else {
        return false;
    };
    if *b0 != 0x78 {
        return false;
    }
    let target = u64::from_le_bytes([*b0, *b1, *b2, *b3, *b4, *b5, *b6, *b7]);
    target == 0x1234567812345678
}

what happens if you include the test in the match?

Rust code:

fn u64_fetch4(buf: &[u8], pos: usize) -> bool {
    let Some([b0 @ 0x78, b1, b2, b3, b4, b5, b6, b7]) = buf.get(pos..(pos + 8)) else {
        return false;
    };
    let target = u64::from_le_bytes([*b0, *b1, *b2, *b3, *b4, *b5, *b6, *b7]);
    target == 0x1234567812345678
}

Both of the above Rust snippets generate the same assembly:

u64_fetch3:
	cmp	rdx, -8
	setae	al
	lea	rcx, [rdx + 8]
	cmp	rcx, rsi
	seta	cl
	or	cl, al
	jne	.LBB4_1
	cmp	byte ptr [rdi + rdx], 120
	jne	.LBB4_1
	movzx	eax, byte ptr [rdi + rdx + 1]
	movzx	ecx, byte ptr [rdi + rdx + 2]
	mov	esi, dword ptr [rdi + rdx + 4]
	shl	rsi, 32
	movzx	edx, byte ptr [rdi + rdx + 3]
	shl	edx, 24
	shl	ecx, 16
	shl	eax, 8
	or	eax, ecx
	or	eax, edx
	or	rax, rsi
	movabs	rcx, 1311768465173140992
	cmp	rax, rcx
	sete	al
	ret

Still five memory accesses.

you can also give a name for the whole array to avoid repetition when passing to from_bytes

Thank you very much for this suggestion! But I rarely use this feature; could you give me an example of how to write such code?
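For reference, a sketch of what such a pattern could look like (hypothetical name `u64_fetch5`; it uses `first_chunk` so the named binding already has the array type `&[u8; 8]` that `from_le_bytes` wants, and I have not checked the assembly it produces):

```rust
fn u64_fetch5(buf: &[u8], pos: usize) -> bool {
    // `arr @ [0x78, ..]` names the whole 8-byte array while also
    // testing its first byte, so the eight bytes are not spelled twice.
    let Some(arr @ [0x78, ..]) = buf.get(pos..).and_then(|s| s.first_chunk::<8>()) else {
        return false;
    };
    u64::from_le_bytes(*arr) == 0x1234567812345678
}
```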

I would suggest you stop using byte-at-a-time code, both to save yourself typing (especially in those u128 versions) and because loading the whole array at once makes it more obvious to LLVM what you want.

For example, both of these have the cmp qword you're looking for:

#[unsafe(no_mangle)]
fn u64_fetch1_simpler(buf: &[u8], pos: usize) -> bool {
    if let Some(after_pos) = buf.get(pos..)
        && let Some(array) = after_pos.first_chunk()
    {
        let target = u64::from_le_bytes(*array);
        target == 0x1234567812345678
    } else {
        false
    }
}

#[unsafe(no_mangle)]
fn u64_fetch2_simpler(buf: &[u8], pos: usize) -> bool {
    let Some(b0) = buf.get(pos) else {
        return false;
    };
    // This one is redundant
    if *b0 != 0x78 {
        return false;
    }
    if let Some(after_pos) = buf.get(pos..)
        && let Some(array) = after_pos.first_chunk()
    {
        let target = u64::from_le_bytes(*array);
        target == 0x1234567812345678
    } else {
        false
    }
}

https://rust.godbolt.org/z/aMhezxxGa


Thank you for this suggestion! I will definitely use this :heart:

Or in a more functional style:

buf.get(pos..)
    // Alternatively:
    // .and_then(<[u8]>::first_chunk)
    .and_then(|buf| buf.first_chunk())
    .map(|&array| u64::from_le_bytes(array) == 0x1234567812345678)
    .unwrap_or(false)

Or in nightly, a one-liner:

Some(0x1234567812345678) == try { u64::from_le_bytes(*buf.get(pos..)?.first_chunk()?) }

https://rust.godbolt.org/z/reT59eYxe
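On stable, roughly the same shape can be had with an immediately invoked closure; `?` on `Option` inside the closure plays the role of the `try` block (a sketch, with a hypothetical name):

```rust
fn u64_fetch_stable(buf: &[u8], pos: usize) -> bool {
    // The closure returns Option<u64>, so `?` can bail out early,
    // much like the nightly `try` block.
    Some(0x1234567812345678)
        == (|| Some(u64::from_le_bytes(*buf.get(pos..)?.first_chunk()?)))()
}
```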


IMO it's significantly less readable (i.e. requires more time from new readers to understand) and "smart" in a bad way. It certainly would not pass my code review. :slight_smile:
