More musings about slices and arrays

This isn't a serious proposal. I've written a little test code:

#![feature(slice_fill)]

pub fn foo1(a: &mut [u8]) {
    for i in 0_u8 .. 10 {
        a[usize::from(i)] = 0;
    }
}

pub fn foo2(a: &mut [u8]) {
    for i in (0_u8 .. 10).rev() {
        a[usize::from(i)] = 0;
    }
}

pub fn foo3(a: &mut [u8]) {
    a[.. 10].fill(0);
}

pub fn foo4(a: &mut [u8; 10]) {
    for i in 0_u8 .. 10 {
        a[usize::from(i)] = 0;
    }
}

The asm produced:

foo1:
    push    rax
    test    rsi, rsi
    je      .LBB5_1
    mov     byte ptr [rdi], 0
    cmp     rsi, 1
    je      .LBB5_3
    mov     byte ptr [rdi + 1], 0
    cmp     rsi, 2
    je      .LBB5_6
    mov     byte ptr [rdi + 2], 0
    cmp     rsi, 3
    je      .LBB5_8
    mov     byte ptr [rdi + 3], 0
    cmp     rsi, 4
    je      .LBB5_10
    mov     byte ptr [rdi + 4], 0
    cmp     rsi, 5
    je      .LBB5_12
    mov     byte ptr [rdi + 5], 0
    cmp     rsi, 6
    je      .LBB5_14
    mov     byte ptr [rdi + 6], 0
    cmp     rsi, 7
    je      .LBB5_16
    mov     byte ptr [rdi + 7], 0
    cmp     rsi, 8
    je      .LBB5_18
    mov     byte ptr [rdi + 8], 0
    cmp     rsi, 9
    je      .LBB5_20
    mov     byte ptr [rdi + 9], 0
    pop     rax
    ret
.LBB5_1:
    xor     edi, edi
    jmp     .LBB5_4
.LBB5_3:
    mov     edi, 1
    jmp     .LBB5_4
.LBB5_6:
    mov     edi, 2
    jmp     .LBB5_4
.LBB5_8:
    mov     edi, 3
    jmp     .LBB5_4
.LBB5_10:
    mov     edi, 4
    jmp     .LBB5_4
.LBB5_12:
    mov     edi, 5
    jmp     .LBB5_4
.LBB5_14:
    mov     edi, 6
    jmp     .LBB5_4
.LBB5_16:
    mov     edi, 7
    jmp     .LBB5_4
.LBB5_18:
    mov     edi, 8
    jmp     .LBB5_4
.LBB5_20:
    mov     edi, 9
.LBB5_4:
    lea     rdx, [rip + .L__unnamed_4]
    call    qword ptr [rip + core::panicking::panic_bounds_check@GOTPCREL]
    ud2


foo2:
	push    rax
	cmp     rsi, 9
	jbe     .LBB1_1
	mov     word ptr [rdi + 8], 0
	mov     qword ptr [rdi], 0
	pop     rax
	ret
.LBB1_1:
	lea     rdx, [rip + .L__unnamed_2]
	mov     edi, 9
	call    qword ptr [rip + core::panicking::panic_bounds_check@GOTPCREL]
	ud2


foo3:
    push    rax
    cmp     rsi, 9
    jbe     .LBB6_1
    mov     word ptr [rdi + 8], 0
    mov     qword ptr [rdi], 0
    pop     rax
    ret
.LBB6_1:
    lea     rdx, [rip + .L__unnamed_5]
    mov     edi, 10
    call    qword ptr [rip + core::slice::index::slice_end_index_len_fail@GOTPCREL]
    ud2


foo4:
    mov     word ptr [rdi + 8], 0
    mov     qword ptr [rdi], 0
    ret

The asm of foo1 is kind of terrible because Rust semantics requires the loop to perform every write before the first out-of-bounds index, and then panic with exactly that index. foo2 is more reasonable (and has the same asm as foo3) because here the programmer accesses the array section (or slice) starting from its end.

foo4 is the best because the compiler knows statically that an out-of-bounds access cannot happen.

Another possibility is:

pub fn foo5(a: &mut [u8; 10..]) {
    for i in 0_u8 .. 10 {
        a[usize::from(i)] = 0;
    }
}

The semantics of that code: foo5 is a regular function; no const generics are involved. foo5 receives a pointer to a plus the length of a, just like foo1, but the compiler knows that the length of a is >= 10, so no bounds tests appear inside foo5 and its asm is the same as foo4's: it contains no tests.

At the call site you can't pass just any slice to foo5: the compiler inserts an automatic length test when you give it a run-time slice, but omits the test when you give it an array of length >= 10. (So there is an implicit cast. I am not sure this is a good idea for Rust, where implicit casts are very rare.)

If inside foo5 you try to index a past length 10, or with an index whose value is unknown at compile time, the compiler inserts a run-time test just as it does for regular slices. It omits the run-time test only when you access a with indices statically known to be < 10 (as here).

Note that &[u8; 1..] would mean a non-empty slice. I run into several situations where I don't want an empty input slice/array.
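
As a point of comparison (an example of my own, not from the proposal), the "non-empty input" invariant can be encoded in today's Rust by splitting off the first element, at the cost of some ergonomics:

```rust
// Hypothetical example: demanding a non-empty input without a &[u8; 1..] type,
// by making the caller's failure mode explicit via split_first.
pub fn max_byte(a: &[u8]) -> u8 {
    let (&first, rest) = a.split_first().expect("input must be non-empty");
    // After the split, the fold needs no further emptiness handling.
    rest.iter().copied().fold(first, u8::max)
}
```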

I can think of various situations (parsing, etc.) where this kind of array could be useful to remove some bounds tests and speed up code, but in practice having too many kinds of slices and arrays is unacceptable for a reasonable systems language... So this isn't a serious proposal.


Another approach, preserving the semantics of foo1, would be to make the common case take a special code path, most easily achieved by just duplicating the code in a macro (I'm honestly slightly surprised that replacing e with if cond { e } else { e }, where cond is side-effect-free, makes such a difference in the generated assembly):

macro_rules! case_distinction{
    ($cond:expr, $e:expr) => {
        if $cond { $e } else { $e }
    }
}


pub fn foo1(a: &mut [u8]) {
    for i in 0_u8 .. 10 {
        a[usize::from(i)] = 0;
    }
}

pub fn foo2(a: &mut [u8]) {
    case_distinction!(a.len() >= 10, {
        for i in 0_u8 .. 10 {
            a[usize::from(i)] = 0;
        }
    });
}

See how foo2 gets a fast path at the beginning:

example::foo1:
        push    rax
        test    rsi, rsi
        je      .LBB0_1
        mov     byte ptr [rdi], 0
        cmp     rsi, 1
        je      .LBB0_3
        mov     byte ptr [rdi + 1], 0
        cmp     rsi, 2
        je      .LBB0_6
        mov     byte ptr [rdi + 2], 0
        cmp     rsi, 3
        je      .LBB0_8
        mov     byte ptr [rdi + 3], 0
        cmp     rsi, 4
        je      .LBB0_10
        mov     byte ptr [rdi + 4], 0
        cmp     rsi, 5
        je      .LBB0_12
        mov     byte ptr [rdi + 5], 0
        cmp     rsi, 6
        je      .LBB0_14
        mov     byte ptr [rdi + 6], 0
        cmp     rsi, 7
        je      .LBB0_16
        mov     byte ptr [rdi + 7], 0
        cmp     rsi, 8
        je      .LBB0_18
        mov     byte ptr [rdi + 8], 0
        cmp     rsi, 9
        je      .LBB0_20
        mov     byte ptr [rdi + 9], 0
        pop     rax
        ret
.LBB0_1:
        xor     edi, edi
        jmp     .LBB0_4
.LBB0_3:
        mov     edi, 1
        jmp     .LBB0_4
.LBB0_6:
        mov     edi, 2
        jmp     .LBB0_4
.LBB0_8:
        mov     edi, 3
        jmp     .LBB0_4
.LBB0_10:
        mov     edi, 4
        jmp     .LBB0_4
.LBB0_12:
        mov     edi, 5
        jmp     .LBB0_4
.LBB0_14:
        mov     edi, 6
        jmp     .LBB0_4
.LBB0_16:
        mov     edi, 7
        jmp     .LBB0_4
.LBB0_18:
        mov     edi, 8
        jmp     .LBB0_4
.LBB0_20:
        mov     edi, 9
.LBB0_4:
        lea     rdx, [rip + .L__unnamed_1]
        call    qword ptr [rip + core::panicking::panic_bounds_check@GOTPCREL]
        ud2

example::foo2:
        push    rax
        cmp     rsi, 9
        jbe     .LBB1_1
        mov     qword ptr [rdi], 0
        mov     word ptr [rdi + 8], 0
        pop     rax
        ret
.LBB1_1:
        test    rsi, rsi
        jne     .LBB1_4
        xor     eax, eax
        jmp     .LBB1_13
.LBB1_4:
        mov     byte ptr [rdi], 0
        mov     eax, 1
        cmp     rsi, 1
        je      .LBB1_13
        mov     byte ptr [rdi + 1], 0
        mov     eax, 2
        cmp     rsi, 2
        je      .LBB1_13
        mov     byte ptr [rdi + 2], 0
        mov     eax, 3
        cmp     rsi, 3
        je      .LBB1_13
        mov     byte ptr [rdi + 3], 0
        mov     eax, 4
        cmp     rsi, 4
        je      .LBB1_13
        mov     byte ptr [rdi + 4], 0
        mov     eax, 5
        cmp     rsi, 5
        je      .LBB1_13
        mov     byte ptr [rdi + 5], 0
        mov     eax, 6
        cmp     rsi, 6
        je      .LBB1_13
        mov     byte ptr [rdi + 6], 0
        mov     eax, 7
        cmp     rsi, 7
        je      .LBB1_13
        mov     byte ptr [rdi + 7], 0
        mov     eax, 8
        cmp     rsi, 8
        je      .LBB1_13
        mov     byte ptr [rdi + 8], 0
        mov     eax, 9
.LBB1_13:
        lea     rdx, [rip + .L__unnamed_2]
        mov     rdi, rax
        call    qword ptr [rip + core::panicking::panic_bounds_check@GOTPCREL]
        ud2

.L__unnamed_3:
        .ascii  "./example.rs"

.L__unnamed_1:
        .quad   .L__unnamed_3
        .asciz  "\f\000\000\000\000\000\000\000\n\000\000\000\t\000\000"

.L__unnamed_2:
        .quad   .L__unnamed_3
        .asciz  "\f\000\000\000\000\000\000\000\021\000\000\000\r\000\000"

Another option (if you're okay with panicking before entering the loop instead of partway through it, though that does modify the behavior of the program) would be to have rustc rewrite foo1 to be something like:

pub fn foo1(a: &mut [u8]) {
    // Induce a panic early, if necessary. Rewrite metadata so the line
    // info in the panic looks like it comes from below.
    if a.len() < 10 { a[a.len()]; }
    for i in 0_u8 .. 10 {
        a[usize::from(i)] = 0;
    }
}

The generated assembly is pretty decent:

example::foo1:
        push    rax
        cmp     rsi, 10
        jb      .LBB0_1
        mov     qword ptr [rdi], 0
        mov     word ptr [rdi + 8], 0
        pop     rax
        ret
.LBB0_1:
        lea     rdx, [rip + .L__unnamed_1]
        mov     rdi, rsi
        call    qword ptr [rip + core::panicking::panic_bounds_check@GOTPCREL]
        ud2

.L__unnamed_2:
        .ascii  "./example.rs"

.L__unnamed_1:
        .quad   .L__unnamed_2
        .asciz  "\f\000\000\000\000\000\000\000\004\000\000\000\030\000\000"

As just an observer, I like an idea similar to foo5, but with Rust requiring the programmer to manually add the check before the call.

I think this is about the same as a simpler assert!(a.len() >= 10);
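
Spelled out (a sketch, with a function name of my choosing), the assert-based version looks like:

```rust
pub fn foo_assert(a: &mut [u8]) {
    // One up-front check; LLVM can then prove that every index below is in
    // bounds and elide the ten per-iteration tests.
    assert!(a.len() >= 10);
    for i in 0..10 {
        a[i] = 0;
    }
}
```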

The main difference between such an assert and foo5 is that foo5's test is done at the call site instead, and there it is sometimes unnecessary.

I've never used something like that; interesting. I think adding a regular assert is simpler.

Yes, for Rust that's a reasonable alternative design.

Prior art in C11:

void Foo(uint8_t a[static 10]) {
  for (uint8_t i = 0; i < 10; ++i) {
    a[i] = 0;
  }
}

The compiler is allowed to assume that a points to an array of at least 10 elements. This is basically useless given C's other semantics, but it does allow the compiler to emit some mildly intelligent warnings.

In particular, void Foo(uint8_t a[static 1]) requires that a be non-NULL.


Consider what happens when converting an array to a slice: the compile-time length is moved into run-time state. In this sense, what you're proposing is syntax for a more customized tag type for slices, a subtype of the usual usize used for lengths. You can play with the potential optimizer/performance gains using the index_ext crate, although it can of course only offload the index checks, not make more niches available in the representation of the imagined &[u8; 10..]. It's not the most ergonomic; that's something a builtin could address.

use index_ext::tag::{Constant, ConstantSource, Mut};

pub fn foo1(mut a: Mut<u8, Constant<Min10>>) {
    let idx = Constant::<Min10>::EXACT_SIZE.into_len().range_to_self();
    for i in a.get_safe_mut(idx) {
        *i = 0;
    }
}

// Some overhead we have since it's not builtin..
pub struct Min10;
impl ConstantSource for Min10 {
    const LEN: usize = 10;
}

Here's the generated assembly (on a playground). I think this is as small as it can go :)

playground::foo1:
	mov	word ptr [rdi + 8], 0
	mov	qword ptr [rdi], 0
	ret

Length must be checked by the caller:

    let mut buffer = [0u8; 12];
    let buffer = Mut::new(&mut buffer[..], Constant::<Min10>::EXACT_SIZE)
        .expect("10 is smaller than 12");
    foo1(buffer);

True. I came to my version of the code by thinking that such an optimization should, in principle, be something that rustc is allowed to do by itself on the original code. Then I thought that a feature for hinting at the existence of a precondition that allows good optimization might be useful (in case you want to make sure it happens), and finally I concluded (after testing) that a simple if-else can serve as such a hint. Of course, once you notice this and don't actually care about any values being set to 0 in the panicking case, the assert is the better solution.

Maybe, to attempt formulating an optimization rule: for code containing many (side-effect-free) panicking checks (with relatively little code between them), where one check succeeding implies that all the other checks succeed, that one check can be pulled to the front and the code compiled twice, as long as there isn't too much code.


I'll note that re-slicing -- even to the length that a slice already has -- is fairly commonly done in Rust to help LLVM elide bounds checks. (Especially if you're iterating over multiple slices; re-slicing both with the same local variable makes it obvious to LLVM that iterating up to that length will be in-bounds for both slices.)
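
A minimal sketch of the multi-slice case (names mine):

```rust
pub fn add_into(dst: &mut [u8], src: &[u8]) {
    let n = dst.len().min(src.len());
    // Re-slice both through the same local length; LLVM can now see that
    // every i < n is in bounds for both slices, so the loop body has no checks.
    let (dst, src) = (&mut dst[..n], &src[..n]);
    for i in 0..n {
        dst[i] = dst[i].wrapping_add(src[i]);
    }
}
```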

If I wanted to do this with indexing, I would write it as
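
The code block seems to have been lost from this post; based on the re-slicing point above, it was presumably something along these lines (my reconstruction):

```rust
pub fn foo(a: &mut [u8]) {
    // One bounds check at the re-slice; the indexing inside the loop then
    // compiles without any per-element checks.
    let a = &mut a[..10];
    for i in 0..10 {
        a[i] = 0;
    }
}
```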

Also, if you really need "make it checked in the caller, where it's obvious" and the function is more complicated than this so you don't want to inline the whole thing, you can do a transformation like this so that the assert gets inlined but the body of the function doesn't:
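
This code block also appears to be missing; judging from the follow-up discussion (the nit about the unsafe inner function), it was presumably close to the foo6 shown later, roughly:

```rust
#[inline]
pub fn foo(a: &mut [u8]) {
    // Reconstruction: the caller must uphold len >= 10; the hint lets LLVM
    // drop the loop's bounds checks inside the out-of-line body.
    unsafe fn inner(a: &mut [u8]) {
        if a.len() < 10 {
            std::hint::unreachable_unchecked()
        }
        for i in 0..10 {
            a[i] = 0;
        }
    }
    // The assert gets inlined into the caller; inner's body does not.
    assert!(a.len() >= 10);
    unsafe { inner(a) }
}
```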

But really, I think the actual answer for this specific case is that if you need a 10-item prefix, take a &mut [u8; 10], and make the caller give you that. Right now it needs [..10].try_into().unwrap(), but that panic will be trivially removed by LLVM, and eventually we'll have bounds on const generics so that if you're starting with a [u8; 100] you can get a [u8; 10] prefix infallibly.
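
A sketch of that pattern as it works today (names mine):

```rust
use std::convert::TryInto;

pub fn fill10(a: &mut [u8; 10]) {
    // The length is in the type, so there are no runtime checks here at all.
    a.fill(0);
}

pub fn caller(buf: &mut [u8]) {
    // [..10] does the single length check; the unwrap's panic path is then
    // trivially removed because the slice is known to have length exactly 10.
    fill10((&mut buf[..10]).try_into().unwrap());
}
```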


EDIT: A simple example of this: x[1] + x[0] generates fewer bounds checks than x[0] + x[1], because of the order in which the accesses happen and the fact that the bounds error must report the failing index. So this is a great example of how adding an assert! at the top of a method can actually make it faster: assert!(x.len() >= 3); x[0] + x[1] + x[2] gets much nicer codegen than the version without the assert. So if you're doing a bunch of indexing, I highly recommend adding an assert at the top when you have something reasonable to check. It means a nicer error for a bad caller and usually better codegen.
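
As a sketch (function name mine):

```rust
pub fn sum3(x: &[u8]) -> u8 {
    // Without this assert, each index keeps its own check, in source order,
    // because the panic message must report the first failing index.
    // With it, LLVM folds the three checks into the single comparison above.
    assert!(x.len() >= 3);
    x[0] + x[1] + x[2]
}
```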


Oh wow, this is even worse than the other case, since it isn't even about preserving partial side effects or avoiding code duplication. The only reason this is not properly optimized is that index 0 or 1 being out of bounds generates different error messages, and the compiler seems unwilling to make the successful case faster at the cost of a slower panicking case.

Of course my macro, used like this:

pub fn foo(x: &[u8]) -> u8 {
    x[0] + x[1]
}
// vs
pub fn bar(x: &[u8]) -> u8 {
    case_distinction!(x.len() >= 2, x[0] + x[1])
}

creates way better code, too. In this case the generated assembly is not even longer; the only difference is that on panic there are always two comparisons, whereas foo needs just one comparison for the case of x being empty. The optimization I describe in my previous post would of course apply here, too.

Nit on the example: no need to make the entire inner function (and its call) unsafe; it's enough to call unreachable_unchecked in an unsafe block.

But I agree that the right way is to ask for the slice you actually need, if you want the caller to make the check, and assert! otherwise.

In essence, using the case_distinction! macro where the code is known to panic in the other case amounts to fancy debug prints, which isn't necessarily what you want.

I did it the way I did because I'm not a fan of safe functions that can cause UB if mis-called, even when they're inner functions like that.

In more than an example I'd probably set it up something like this:

#[inline]
pub fn foo6(a: &mut [u8]) {
    #[forbid(unsafe_op_in_unsafe_fn)]
    unsafe fn inner(a: &mut [u8]) {
        if a.len() < 10 { unsafe { std::hint::unreachable_unchecked() } }
        for i in 0_u8 .. 10 {
            a[usize::from(i)] = 0;
        }
    }

    unsafe { inner(&mut a[..10]) }
}

So that I can't accidentally put an unintentional unsafe call in there.

Ok, that's fair :)