Tensors static typing

I see 2 conditional jumps (ja, jump if above) to .LBB3_3 and .LBB3_4, and they call slice_index_order_fail and slice_end_index_len_fail respectively. Those calls are followed by ud2, which is the same instruction emitted by unreachable_unchecked, i.e. the program must either panic (inside the called functions) or abort. In conclusion, the call to unwrap hasn't been removed, but inlined. That's an important difference.

A never-inlined function can never be optimized in the way you want it to, because you explicitly tell the compiler not to look outside the function to optimize what's inside the function¹.

¹ Except maybe for profile-guided optimization, which I'm unfamiliar with, but that wouldn't change the function semantically, IIRC.

Are you sure a failed unwrap doesn't call core::result::unwrap_failed instead?

I used #[inline(never)] because that function doesn't need any information from outside it to remove the final unwrap.

What your program does:

  1. Create a vector with 34 values
  2. Set i to 0
  3. Loop while i < 30
    1. Call take_array_from_mut with a reference to your vector, i and the constant 3 for N
    2. Print the returned value
    3. Increment i

When looking at your program as a whole, you're right to assume that there will never be an out-of-bounds access and the conditional jumps can be removed. However, when looking only at the function signature, the compiler sees data as a reference to any slice, i.e. it could have any length between 0 and isize::MAX, and it sees start as any usize, i.e. a value between 0 and 2⁶⁴-1 (on 64-bit systems). You can easily construct an example where indexing with an arbitrary value into an arbitrary slice causes an out-of-bounds memory access. In conclusion, the function by itself does not have enough information to optimize away the panics caused by unwrapping.
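For concreteness, here is a plausible source-level reconstruction of the situation being described. The original code isn't quoted at this point in the thread, so the exact names, element type and literals below are assumptions:

use std::convert::TryInto;

// Reconstruction of the function under discussion, marked #[inline(never)]
// as mentioned above, so the optimizer only sees its own signature.
#[inline(never)]
fn take_array_from_mut<T, const N: usize>(data: &mut [T], start: usize) -> &mut [T; N] {
    (&mut data[start..start + N]).try_into().unwrap()
}

fn main() {
    // A vector with 34 values, indexed at offsets 0..30 with N = 3.
    let mut v: Vec<u32> = (0..34).collect();
    for i in 0..30 {
        let a: &mut [u32; 3] = take_array_from_mut(&mut v, i);
        println!("{:?}", a);
    }
}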

You could look at the unoptimized code and see if there are calls to slice_index_order_fail and slice_end_index_len_fail. If they're still there and you see the call to unwrap, you could be right. In that case, post the ASM output here.

take_array_from_mut:
	sub     rsp, 88
	mov     rax, rdx
	add     rax, 3
	setb    cl
	test    cl, 1
	mov     qword ptr [rsp + 64], rdi
	mov     qword ptr [rsp + 56], rsi
	mov     qword ptr [rsp + 48], rdx
	mov     qword ptr [rsp + 40], rax
	jne     .LBB196_5
	lea     rax, [rip + .L__unnamed_9]
	mov     rcx, qword ptr [rsp + 48]
	mov     qword ptr [rsp + 72], rcx
	mov     rdx, qword ptr [rsp + 40]
	mov     qword ptr [rsp + 80], rdx
	mov     rdx, qword ptr [rsp + 72]
	mov     rcx, qword ptr [rsp + 80]
	mov     rdi, qword ptr [rsp + 64]
	mov     rsi, qword ptr [rsp + 56]
	mov     r8, rax
	call    qword ptr [rip + core::slice::index::<impl core::ops::index::IndexMut<I> for [T]>::index_mut@GOTPCREL]
	mov     qword ptr [rsp + 32], rax
	mov     qword ptr [rsp + 24], rdx
	mov     rdi, qword ptr [rsp + 32]
	mov     rsi, qword ptr [rsp + 24]
	call    qword ptr [rip + <T as core::convert::TryInto<U>>::try_into@GOTPCREL]
	mov     qword ptr [rsp + 16], rax
	lea     rax, [rip + .L__unnamed_10]
	mov     rdi, qword ptr [rsp + 16]
	mov     rsi, rax
	call    qword ptr [rip + core::result::Result<T,E>::unwrap@GOTPCREL]
	mov     qword ptr [rsp + 8], rax
	mov     rax, qword ptr [rsp + 8]
	add     rsp, 88
	ret
.LBB196_5:
	lea     rdi, [rip + str.1]
	lea     rdx, [rip + .L__unnamed_11]
	mov     rax, qword ptr [rip + core::panicking::panic@GOTPCREL]
	mov     esi, 28
	call    rax
	ud2

In conclusion, the function by itself does not have enough information to optimize away the panics caused by unwrapping.

I think you're confusing the slice bounds check panics with the unwrap panic. The first is still present in the isolated (never-inline) function, but LLVM is able to remove the second one: once you have taken a [start .. start + N] slice it's always of length N, and LLVM understands this, removing the unwrap branching to panic.

use std::convert::TryInto;

fn foo(a: &mut [u32], i: usize) -> &mut [u32; 3] {
    (&mut a[i + 1 .. i + 4]).try_into().unwrap()
}

This can be solved with const generics by reordering the expressions so that the length 3 is a separate compile-time constant, not mixed in with the dynamic indices. Amazingly, type deduction works.

#![feature(array_chunks)]
fn foo(a: &mut [u32], i: usize) -> &mut [u32; 3] {
    a[i + 1..].array_chunks_mut().next().unwrap()
}

It doesn't require compiler intervention, so I've made it into a crate. I find it more readable.

use index_ext::array::Prefix;
fn foo(a: &mut [u32], i: usize) -> &mut [u32; 3] {
    &mut a[i + 1..][Prefix] 
}

[EDIT: somehow, the thread hadn't shown me a lot of the more recent posts when I made this post, so apologies for repeating some things others have said]

Interesting example… We have both conciseness issues and performance issues to think about here.

I had a look at this in Godbolt [Edit: forgot to include the Godbolt link]. Note that the assembly for foo and foo2 is identical (i.e. the runtime check for the try_into was already optimized out), but both versions still have a runtime check for the ordering of the two indices, which can't be optimized out in the language as-is, because i + 4 can do wrapping overflow in release builds.

Maybe what you want is a function like (don't mind the awkward implementation):

use std::convert::TryInto;

// Written as if it were an inherent method in the standard library;
// outside of std this would need to live in an extension trait.
impl<T> [T] {
    pub fn get_array_mut<const LEN: usize>(&mut self, i: usize) -> &mut [T; LEN] {
        // Panic if the requested window [i, i + LEN) doesn't fit in the slice.
        if i.checked_add(LEN).unwrap() > self.len() { panic!() }
        unsafe {
            // The raw sub-slice is exactly LEN long, so the conversion can't fail.
            match std::slice::from_raw_parts_mut(self.as_mut_ptr().add(i), LEN).try_into() {
                Ok(b) => b,
                Err(_) => std::hint::unreachable_unchecked(),
            }
        }
    }
}

I've used this to implement foo3 in my godbolt link, and it indeed appears to save one cmp instruction, despite the awkward double-check on the first line.
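(The actual foo3 from that Godbolt link isn't quoted here; a hypothetical version along those lines might look like the sketch below. It assumes get_array_mut is actually reachable, e.g. through an extension trait, since inherent impls on [T] only work inside the standard library.)

// Hypothetical foo3: LEN = 3 is deduced from the return type, and the
// window starts at i + 1, matching foo above.
fn foo3(a: &mut [u32], i: usize) -> &mut [u32; 3] {
    a.get_array_mut(i + 1)
}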

Once the min_const_generics feature stabilizes, it should be straightforward to implement a clean version of this in the standard library.

I also have PR #76635 open, which would allow writing that as .as_chunks().0[0]; that's shorter, if not really more obvious.
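A sketch of what that could look like for the mutable case, assuming the API lands with the names it has on current nightly (slice_as_chunks feature); the PR itself isn't quoted here, and foo4 is just an illustrative name:

#![feature(slice_as_chunks)]
fn foo4(a: &mut [u32], i: usize) -> &mut [u32; 3] {
    // as_chunks_mut::<3>() splits the slice into whole [u32; 3] chunks plus a
    // remainder; indexing the first chunk panics if fewer than 3 elements remain.
    &mut a[i + 1..].as_chunks_mut().0[0]
}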

You could consider a PR to make it just a[i+1..].prefix_array_mut().unwrap() or whatever.

That said, I'm not sure that any of these are much better than the [..3].try_into().unwrap() versions -- they're just moving around the bounds checks and panics.


Yes, but it is compile-time dependent typing, something akin to Dlang.

Compile-time dependent typing is unfortunately no easier to decide than runtime dependent typing when tied to specialization. It doesn't suffice to evaluate the bounds of each trait method when more than one method applies to the input. In that case you need to solve a constraint system, which may have exponential complexity, and the user doesn't always see this immediately and wonders why the compiler is so slow.

For the above reason, Dlang doesn't support specialization over value expressions other than simple membership expressions (like trait bounds, for instance).

The other point is: what do you do if your vector may be of length 3 or 4, or if its length depends non-deterministically on external events, requiring compile-time parametrized vector lengths?

Hi, wouldn't const generics allow you to write the function signatures you desired?
Provided all the sizes are known at compile time.

I don't know how fast this is in practice.

I already came up with an idea to solve the speed problem here. Put in an extra compiler flag that specifies how long the compiler can take to execute the SAT solver; once the time is up, no more SAT solving. Instead, you emit less optimized code that is known to be correct, but possibly slower than fully optimized code.

Another problem we could face is operator overloading, e.g. a heavy product computation of two mathematical structures implemented with a while loop that unintentionally leaves open the possibility of looping forever, prolonging compilation time indefinitely.

The main problem is: even if you can see your own macros, others cannot view them. You compile your lib with calls to a const-generic-constrained function which are all finitely solvable. Another guy calls your lib with a different const value, breaking the compilation. This guy may not understand why it doesn't work and may not be able to view your macro, as it is enclosed in a private module.

Turning to runtime checking doesn't help either in the case of specialization, unless you change the dispatch strategy there.

But the compiler still has to execute my macros, even if they are private to my module, correct? That means that the compiler can still emit warnings or errors when the time limit is exceeded, correct? I agree though that even in that case trying to figure out why your own code is causing the compiler to fail is still going to be a real headache...

I'll have to trust you on that; I don't know enough to comment on this one way or another.

Knowing all sizes is too strict in practice. In most cases, we want our algorithm to process images with adaptive sizes and a fixed number of channels. Also, it's not possible to define compile-time restrictions like IsOdd or IsPrime for const generics. It is still far from enabling a truly static API.

In contrast to adding static checkers to a dynamically typed language, Rust goes in the reverse direction of gradual typing. It relaxes type checks when some information is missing at compile time and leaves it to runtime checkers. Though dependent typing could be somewhat of a solution, we cannot have it because it violates Rust's design principles, and we don't want type checkers at runtime. Instead, we can combine a Dyn type with Rust's generics. That's the way we can achieve a weak form of gradual typing. The code below illustrates the basic idea.

use std::ops::Add;
use typenum::UTerm; // a type-level number standing for zero

struct Dyn(usize);

impl Add<UTerm> for Dyn {
    type Output = Dyn;
    fn add(self, _rhs: UTerm) -> Self::Output {
        /* runtime checker goes here */
        self
    }
}

Towards building the static tensor API, the core issues can be arranged into two threads.

  • How to encode complex logic in types. For example, can we statically compute the dimensions of the tensor product of two tensors?
  • How to dispatch between static checkers and runtime checkers (ideally without runtime overhead).

The first issue can be somewhat solved by type operators. Those interested can look at my typ crate to see how it's possible in Rust.
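As a minimal illustration of the kind of type-level computation meant here (using typenum's operator aliases directly, not the typ crate's own syntax):

use typenum::{Prod, Unsigned, U3, U4};

// The number of elements of the tensor (outer) product of a length-3 and a
// length-4 vector, computed entirely at the type level.
type OuterLen = Prod<U3, U4>; // = U12

fn main() {
    assert_eq!(OuterLen::USIZE, 12);
}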

For the second issue, we could insert runtime checkers by dispatching UTerm + UTerm and UTerm + Dyn into different impl blocks, and add checkers accordingly. I made an attempt in my type-vec crate. However, once the types become more complex, writing it this way becomes a tedious process. I would rather think of an approach where we insert runtime provers only when needed in the middle of complex type operations.
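A standalone sketch of that dispatch idea, using typenum directly; the names here are illustrative and not the actual type-vec API:

use std::ops::Add;
use typenum::{Sum, Unsigned, U2, U3};

struct Dyn(usize);

// Static + static: typenum resolves the sum at the type level, so no
// runtime check is involved at all.
type Five = Sum<U2, U3>;

// Dynamic + static: the static operand is lowered to a value and a
// runtime checker (here just a checked_add) is inserted.
impl<N: Unsigned> Add<N> for Dyn {
    type Output = Dyn;
    fn add(self, _rhs: N) -> Dyn {
        Dyn(self.0.checked_add(N::USIZE).expect("length overflow"))
    }
}

fn main() {
    assert_eq!(Five::USIZE, 5);     // decided at compile time
    let d = Dyn(4) + U3::default(); // checked at run time
    assert_eq!(d.0, 7);
}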

Also, there is a thread in tch-rs talking about static tensor types.

Interesting

I think that this will be possible someday with const generics, when the logic only involves addition and multiplication.

I, too, think that this will happen someday with const generics, but there are other problems incurred by const generics and dispatching; see this thread.
