Optimization opportunity for `Box<dyn Fn>`, which is done for `&dyn Fn` (and regression in nightly?)

Let's consider three ways of calculating the sum of a Vec<i32>.

simple

#[no_mangle]
pub fn simple_sum_vec(v: &Vec<i32>) -> i32 {
    v.iter().fold(0i32, |a,b| simple_sum(a, *b))
}

fn simple_sum(a: i32, b: i32) -> i32 {
    a + b
}

curried

Using a sum function implemented as a curried function. (Trust me, this isn't just me complicating things for the sake of it.)

#[no_mangle]
pub fn curried_sum_vec(v: &Vec<i32>) -> i32 {
    v.iter().fold(0i32, |a,b| curried_sum(a, *b))
}

fn curried_sum(a: i32, b: i32) -> i32 {
    fn_curry(a)(b)
}

// This could return `impl Fn` and everything would be fine, but `impl Fn` only works for one curried argument.
// It uses Box<dyn Fn> to demonstrate the problem that any curried function of 3+ arguments would have.
fn fn_curry(a: i32) -> Box<dyn Fn(i32) -> i32> {
    Box::new(move |b| a + b)
}

curried with continuations

Using the curried function rewritten as small-step operational semantics, thus avoiding Box but still using dyn Fn:

#[no_mangle]
pub fn continuation_sum_vec(v: &Vec<i32>) -> i32 {
    v.iter().fold(0i32, |a,b| continuation_sum(a, *b))
}

fn continuation_sum(a: i32, b: i32) -> i32 {
    fn_continuation(a, &|after_a| /*_*/ {
        after_a(b)
    })
}

fn fn_continuation(
    a: i32,
    after_a: &dyn Fn(
        &dyn Fn(/*b*/ i32) -> i32, //
    ) -> i32,
) -> i32 {
    after_a(&move |b| /*-> i32 */ {
        a + b
    })
}

The fold helper functions simple_sum, curried_sum and continuation_sum compile to the same assembly, modulo the __rust_no_alloc_shim_is_unstable access in curried_sum (see Missed optimization/perf oddity with allocations · Issue #128854 · rust-lang/rust · GitHub). No dyn calls are left.

The simple_sum_vec and continuation_sum_vec compile to the same assembly code under the 1.84.0 compiler.
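
Presumably that's because everything is stack-allocated, so the compiler can inline and devirtualize the whole chain and the continuation version reduces step by step to plain addition (a sketch of the reduction):

continuation_sum(a, b)
  = fn_continuation(a, &|after_a| after_a(b))
  = (|after_a| after_a(b))(&|x| a + x)  // inline fn_continuation
  = (|x| a + x)(b)                      // apply the outer closure
  = a + b                               // apply the inner closure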

The curried_sum_vec compiles differently and runs much slower (~8x) on my machine.
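
A minimal timing sketch of this kind of comparison (not the exact harness; the vector size and iteration count here are arbitrary):

// Reuses the three #[no_mangle] functions above.
// black_box keeps the calls and their results from being optimized away.
use std::hint::black_box;
use std::time::Instant;

fn main() {
    let v: Vec<i32> = (0..1_000_000).collect();
    let cases: [(&str, fn(&Vec<i32>) -> i32); 3] = [
        ("simple", simple_sum_vec),
        ("curried", curried_sum_vec),
        ("continuation", continuation_sum_vec),
    ];
    for (name, f) in cases {
        let start = Instant::now();
        for _ in 0..100 {
            black_box(f(black_box(&v)));
        }
        println!("{name}: {:?}", start.elapsed());
    }
}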

Question 1

Can this be considered something a compiler would want to optimize?

I would very, very much prefer "curried" to "continuations" if I can't use the "simple" way. The "continuations" style is much harder to write, especially for functions with more arguments (4-arg example).

Question 2

I wanted to see if Missed optimization/perf oddity with allocations · Issue #128854 · rust-lang/rust · GitHub would help, so I built rustc locally (off commit 8c61cd4d): once unmodified and once patched.

I am probably doing something wrong, but simple_sum_vec and continuation_sum_vec perform the same with my local rustc builds (both unmodified and patched). However, curried_sum_vec becomes ~500x slower than the baselines (again with both builds). I see the same results with cargo +nightly bench.

I can totally understand that the optimization opportunity (question 1) is not that interesting.

I do hope that such a significant alleged performance regression in code generated with nightly (question 2) is more interesting. Is it not? Or am I simply doing something wrong?

I suspected something was wrong with my benchmark. Bisecting with cargo bisect-rustc led me to the commit Auto merge of #132542 - RalfJung:const_panic, r=tgross35 · rust-lang/rust@e3a918e · GitHub. This is not a change in, say, black_box; it's a change in primitive arithmetic. So maybe it's a real regression? I filed a bug report.

I doubt that the Box example can reasonably be optimized. Box is just too complex: there's a memory allocation, multiple checks (including possible panics), and a deallocation happening in downstream code, including potential panic paths if you're dealing with arbitrary closures (or with functions complicated enough that they aren't inlined, so the compiler can't see the absence of panics).

The &dyn Fn examples have none of that: all closures are stack-allocated, and all pointers are non-owning, so of course it optimizes better. Still, I would expect performance cliffs in more complex cases, where the compiler wouldn't be able to devirtualize the calls.
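
A minimal sketch of the kind of cliff meant here: once the &dyn Fn is handed to a function the optimizer won't look through, the call typically stays virtual:

// opaque_apply is compiled once for all callers, so inside it the concrete
// closure type is unknown and f(x) typically remains an indirect call.
#[inline(never)]
fn opaque_apply(f: &dyn Fn(i32) -> i32, x: i32) -> i32 {
    f(x)
}

pub fn sum_with_indirection(v: &Vec<i32>) -> i32 {
    v.iter().fold(0i32, |a, b| opaque_apply(&|x| a + x, *b))
}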

Would you elaborate on that comment? I don't see what code you have in mind. impl Fn seems to work just fine. It was introduced precisely for the kind of code that you write: simple combinators which must be fully inlined and optimized to straight-line code, without any indirection.

That said, I wonder what problem you're trying to solve. Are you just trying to write Scheme in Rust? In that case I'd suggest you stop. Rust is not a functional language: it doesn't have a GC, doesn't optimize tail calls, doesn't have dedicated closure types, and has too many ownership modes to make writing this kind of code worthwhile. You're gonna hit ownership errors, you'll need the Fn/FnMut/FnOnce distinction, and everything will go downhill.

First of all, thank you @afetisov for your response!

It makes sense that the Box case is significantly more complicated. I guess we can easily agree that it would be nice to optimize if possible, but also that it's likely not trivial.

// This could return impl Fn and everything would be fine, but impl Fn works only for one argument.

Would you elaborate on that comment?

Sure!

I mean I can't write

fn foo() -> impl Fn(i32) -> impl Fn(i32) -> i32 { ... }

The first impl is obviously possible, while the second one is disallowed (for good reasons: on its own it's not a type).

That said, I wonder what is the problem you're trying to solve it? Are you just trying to write Scheme in Rust?

It might be that what I am trying to do is as bad as writing Scheme in Rust, but at least that is not the goal 🙂

I want to take a "Rust function signature" and convert it to a "SQL function signature". For that I want to know which code patterns I can expect the compiler to optimize and which I should really avoid. If you want the full and rather lengthy context, see Simple Functions · Issue #12635 · apache/datafusion · GitHub and the associated PR. I guess discussing that here would quickly go off topic for this thread, which is why I didn't bring the full picture into view.

With nightly #![feature(impl_trait_in_fn_trait_return)], you can.
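
A sketch of what that looks like with the feature gate (untested here, assuming current nightly behavior):

#![feature(impl_trait_in_fn_trait_return)]

fn adder() -> impl Fn(i32) -> impl Fn(i32) -> i32 {
    move |a| move |b| a + b
}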

On stable, you can write an equivalent signature with a helper trait:

trait FnImplFn: Fn(i32) -> Self::Ret {
    // The associated type stands in for the nested `impl Fn` return type.
    type Ret: Fn(i32) -> i32;
}

// Blanket impl: any closure returning another closure qualifies.
impl<F, R> FnImplFn for F
where
    F: Fn(i32) -> R,
    R: Fn(i32) -> i32,
{
    type Ret = R;
}

fn adder() -> impl FnImplFn {
    move |a| move |b| a + b
}
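
A quick usage sketch: the nested calls work on stable and the closures stay unboxed:

fn main() {
    let add = adder();
    // Calls go through the Fn supertrait; add(1) yields an impl Fn(i32) -> i32.
    assert_eq!(add(1)(2), 3);
}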