Pre-RFC: BinaryHeap Flexibility

I've implemented this using @ExpHP's method because I want to run the code on a stable compiler.

Almost all of the test code, except the parts that rely on unstable features, works in a backward-compatible way.

While coding, Clone was the difficult part, but Rust 1.26 added Clone for closures, so it compiles on 1.26!

The is_sorted crate needs callables representing binary operations, and it takes a very different approach from the one suggested here.

pub mod ordering {
    use core::cmp::Ordering;

    // Requires the unstable `unboxed_closures` and `fn_traits` features.

    /// Equivalent to `PartialOrd::partial_cmp(a, b)`
    #[allow(non_camel_case_types)]
    pub struct less_than();

    impl<'a, 'b, T: PartialOrd> FnOnce<(&'a T, &'b T)> for less_than {
        type Output = Option<Ordering>;
        extern "rust-call" fn call_once(
            self, arg: (&'a T, &'b T),
        ) -> Self::Output {
            arg.0.partial_cmp(arg.1)
        }
    }

    impl<'a, 'b, T: PartialOrd> FnMut<(&'a T, &'b T)> for less_than {
        extern "rust-call" fn call_mut(
            &mut self, arg: (&'a T, &'b T),
        ) -> Self::Output {
            arg.0.partial_cmp(arg.1)
        }
    }

    // less_equal, greater_than, greater_equal, equal, not_equal
}

It then defines some consts to make these easier to use:

pub const less_than: ordering::less_than = ordering::less_than();
// less_equal, greater_than, greater_equal, equal, not_equal

This way you can just write:

type MaxHeap<T> = BinaryHeap<T, ordering::greater_than>;

or directly use them, for example, with sort_by:

foo.sort_by(greater_than);  // works because greater_than is a `const`

One thing these types allow is specializing impls for different comparators. The is_sorted crate specializes its is_sorted_... iterator algorithms for {less,greater}_than to achieve up to 10x speed-ups using std::arch intrinsics. These speed-ups are not possible if the user passes a lambda to the is_sorted functions, because we can't know what the lambda does, and there is no way to specialize on that.
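To illustrate the idea, here is a minimal sketch (not the is_sorted crate's actual mechanism; the Comparator trait and LessThan type are hypothetical names): when the comparator is a named ZST, the algorithm knows at compile time what it does and can provide a fast path that an opaque closure could never trigger:

use std::cmp::Ordering;

trait Comparator<T> {
    fn compare(&self, a: &T, b: &T) -> Ordering;

    // Generic element-by-element fallback for arbitrary comparators.
    fn is_sorted(&self, slice: &[T]) -> bool {
        slice.windows(2)
            .all(|w| self.compare(&w[0], &w[1]) != Ordering::Greater)
    }
}

// A named ZST: "what the callable does" is known statically.
struct LessThan;

impl Comparator<u32> for LessThan {
    fn compare(&self, a: &u32, b: &u32) -> Ordering {
        a.cmp(b)
    }

    // Override point: for u32 this could dispatch to std::arch intrinsics.
    fn is_sorted(&self, slice: &[u32]) -> bool {
        slice.windows(2).all(|w| w[0] <= w[1])
    }
}

fn main() {
    assert!(LessThan.is_sorted(&[1u32, 2, 3]));
    assert!(!LessThan.is_sorted(&[3u32, 2, 1]));
}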

I think these types and consts are sufficiently generally useful to deserve an RFC of their own. Once an RFC about these is merged, customizable BinaryHeaps can be proposed in subsequent RFCs.

I also think that trying to do both things in the same RFC would be unproductive, because it would mix the discussion of two completely unrelated concerns: being able to denote certain types of operations at the type level (e.g. less_than, add, …) so that one can build on them (customize data structures, specialize algorithms, etc.), and how to make a particular data structure more generic.

The C++ <functional> header added a bunch of these callables to C++ in C++11: https://en.cppreference.com/w/cpp/utility/functional The main mistake they made was not making these callables generic over their arguments, but they fixed that in C++14.

@gnzlbg

My comments on the less_than approach are:

  • a function named less_than() traditionally returns bool, not an Ordering, so the name is confusing.
  • the struct name should be LessThan.
  • the const name should be LESS_THAN.
  • using the same less_than name for different things (the struct and the const) is confusing.

For RFC purposes, both the above less_than approach and my Compare<T> trait are too controversial to be widely accepted. I think the FnMut approach feels more natural and would make people more comfortable. But to use FnMut elegantly, we might need extra unstable features…
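For concreteness, here is a minimal sketch of what I mean by the FnMut approach (BinaryHeapBy and its API are hypothetical names, not a concrete proposal): the heap stores its comparator as a field and is monomorphized per closure type:

use std::cmp::Ordering;

struct BinaryHeapBy<T, C: FnMut(&T, &T) -> Ordering> {
    data: Vec<T>,
    cmp: C,
}

impl<T, C: FnMut(&T, &T) -> Ordering> BinaryHeapBy<T, C> {
    fn new_by(cmp: C) -> Self {
        BinaryHeapBy { data: Vec::new(), cmp }
    }

    // Sift the new element up until the heap property holds.
    fn push(&mut self, item: T) {
        self.data.push(item);
        let mut i = self.data.len() - 1;
        while i > 0 {
            let parent = (i - 1) / 2;
            if (self.cmp)(&self.data[i], &self.data[parent]) == Ordering::Greater {
                self.data.swap(i, parent);
                i = parent;
            } else {
                break;
            }
        }
    }
}

fn main() {
    // A max-heap using the default ordering, passed as a closure:
    let mut heap = BinaryHeapBy::new_by(|a: &u32, b: &u32| a.cmp(b));
    heap.push(3);
    heap.push(1);
    heap.push(2);
    assert_eq!(heap.data[0], 3); // the root is the maximum
}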

As a short-term solution, I published an external crate for those who need various heaps NOW (i.e., me). An RFC may come later. Please note that I'm not familiar with the unstable features and won't (can't) write such an RFC for the time being.

Don't functions already have distinct types, though? They can't be named right now, but typeof or abstract type ought to make it possible in the future, so you could do something like impl Foo for typeof(<u32 as Ord>::cmp) { ... } or abstract type U32Cmp; fn _dummy() -> U32Cmp { <u32 as Ord>::cmp } impl Foo for U32Cmp { ... }.


The problem is not the type of the function, but that two functions with the same type can do different things.

I can wrap Ord::cmp in a lambda, and coerce that to a function pointer a. I can also coerce Ord::cmp to a function pointer b with the same type as a. What I can’t do is tell what a and b are doing. They are both doing the same thing, just calling Ord::cmp, but there is no way to specialize on “what a function of a certain type actually does”.

I can specialize on the type of the function, but what should I implement there if I don’t know what the function is doing? For example, I can write a lambda that coerces to the same type as a and b but does Ord::cmp(x,y).reverse(). Specializing on the type of a function is not enough.
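A minimal demonstration of the problem (nothing beyond stable std): all three callables below coerce to the same function-pointer type, yet one of them reverses the order:

use std::cmp::Ordering;

fn main() {
    // All three have the same function-pointer type...
    let a: fn(&u32, &u32) -> Ordering = |x, y| Ord::cmp(x, y);
    let b: fn(&u32, &u32) -> Ordering = <u32 as Ord>::cmp;
    let c: fn(&u32, &u32) -> Ordering = |x, y| Ord::cmp(x, y).reverse();

    // ...but nothing at the type level says that `a` and `b` agree
    // while `c` does the opposite.
    assert_eq!(a(&1, &2), b(&1, &2));
    assert_eq!(c(&1, &2), b(&1, &2).reverse());
}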

Thanks for the style nitpicks.

Using Max and Min is an alternative, but I'd like to be able to use PartialOrd as well.

Here is what I learned from your comments:

  1. A closure in Rust is not a function pointer unless you put it in something like Box.

  2. When you create a closure, the compiler automatically defines a struct which implements the Fn/FnMut/FnOnce family of traits.

  3. So a closure is actually a struct! If it does not capture any variables, it is a zero-sized type (ZST).

  4. You could theoretically put a closure in the type parameter of a generic argument. By doing that, the implementation is monomorphized for each given closure (see the sketch after this list).

  5. The above is difficult in today's Rust because closure structs currently don't have their own names!

  6. Eventually, Rust will have features like typeof and/or abstract type. They will work as a substitute for the closure struct's name.
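A small sketch of points 3 and 4 (plain stable Rust; call_twice and double are names I made up):

use std::mem::size_of_val;

// Monomorphized separately for each closure type `F` that is passed in.
fn call_twice<F: Fn(u32) -> u32>(f: F, x: u32) -> u32 {
    f(f(x))
}

fn main() {
    let double = |x| x * 2;              // captures nothing...
    assert_eq!(size_of_val(&double), 0); // ...so it is a ZST
    assert_eq!(call_twice(double, 3), 12);
}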

That's only true for non-ZST types. So long as you don't have an fn type or a stateful closure, the same type will do the same thing.

Demonstration that u32::gt and u32::lt are ZSTs and distinguishable at the type level: Rust Playground

use std::any::Any;
use std::mem::size_of;

fn assert_same_type_and_zsts<T: Any, U: Any>(x: T, y: U) {
    assert_eq!(size_of::<T>(), 0);
    assert_eq!(size_of::<U>(), 0);
    // `get_type_id` was the unstable precursor of today's `Any::type_id`.
    assert_eq!(x.get_type_id(), y.get_type_id());
}

fn main() {
    assert_same_type_and_zsts(u32::lt, u32::gt);
}
thread 'main' panicked at 'assertion failed: `(left == right)`
  left: `TypeId { t: 15723412695625198684 }`,
 right: `TypeId { t: 3437632669508324879 }`', src/main.rs:8:5

(The fact that these are distinct ZSTs, not just both fn(&u32, &u32) -> bool, is core to the performance of things like sort that monomorphize based on an impl Fn of some kind.)


@scottmcm

we still have the problem that two functions, with different types, can do the same thing. And the user can define these, and we can't specialize for them.

I still don't see how this allows us to do anything useful. Either we specialize for Ord::cmp, or for a callable type, or… but if the user passes us a different type that just calls Ord::cmp, the specialization that we provided won't trigger, because it has a different type.

Or in other words, we can’t specialize for any callable type that is semantically equivalent to Ord::cmp.


Then there is the issue of how we specialize for Ord::cmp(a,b).reverse(). Is there something better than just defining our own types that the user must use for the specializations to trigger?
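One possible answer, using the same hypothetical Comparator trait as in the sketch above (again, these names are mine, not an existing API): make "the reverse of a known comparator" itself a nameable type, so specializations can key on it too:

use std::cmp::Ordering;

// Same hypothetical trait as in the earlier sketch.
trait Comparator<T> {
    fn compare(&self, a: &T, b: &T) -> Ordering;
}

// An adapter type: the reverse of a known comparator is itself nameable,
// so fast paths can also be provided for e.g. Reversed<LessThan>.
struct Reversed<C>(C);

impl<T, C: Comparator<T>> Comparator<T> for Reversed<C> {
    fn compare(&self, a: &T, b: &T) -> Ordering {
        self.0.compare(a, b).reverse()
    }
}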

What specializations are you worried about not being able to apply?

Monomorphization is done, so inlining optimization is done, as has been said multiple times. That leaves default impl specializations, of which there are none today, obviously (as this generic parameter isn't there and specialization isn't stable).


What specializations are you worried about not being able to apply?

I am already specializing is_sorted_by for {i,u}{8,16,32,64} and f{32,64} for Ord::cmp, Ord::cmp(a, b).reverse(), PartialOrd::partial_cmp(a, b).unwrap(), and PartialOrd::partial_cmp(a, b).unwrap().reverse().

These specializations are used by is_sorted as well, behind the scenes.

This delivers ~2x to ~10x speed-ups in some cases, and ~3-6x speed-ups in most cases:

On my laptop (2012 Intel Core i5, AVX, no AVX2) for slices (_baseline is with specialization disabled):

test run_gt_16i_baseline  ... bench:   1,056,679 ns/iter (+/- 143,130)
test run_gt_16i_is_sorted ... bench:     297,737 ns/iter (+/- 73,536)
test run_gt_16u_baseline  ... bench:   1,048,473 ns/iter (+/- 151,361)
test run_gt_16u_is_sorted ... bench:     330,464 ns/iter (+/- 57,180)
test run_gt_32f_baseline  ... bench:   5,859,745 ns/iter (+/- 643,377)
test run_gt_32f_is_sorted ... bench:     665,421 ns/iter (+/- 116,844)
test run_gt_32i_baseline  ... bench:   1,089,292 ns/iter (+/- 124,127)
test run_gt_32i_is_sorted ... bench:     627,826 ns/iter (+/- 85,923)
test run_gt_32u_baseline  ... bench:   1,108,998 ns/iter (+/- 163,938)
test run_gt_32u_is_sorted ... bench:     702,923 ns/iter (+/- 112,332)
test run_gt_64f_baseline  ... bench:   3,824,177 ns/iter (+/- 492,995)
test run_gt_64f_is_sorted ... bench:   1,364,920 ns/iter (+/- 197,156)
test run_gt_64i_baseline  ... bench:   1,310,525 ns/iter (+/- 245,387)
test run_gt_64i_is_sorted ... bench:   1,301,780 ns/iter (+/- 367,669)
test run_gt_64u_baseline  ... bench:   1,313,168 ns/iter (+/- 169,762)
test run_gt_64u_is_sorted ... bench:   1,300,316 ns/iter (+/- 209,528)
test run_gt_8i_baseline   ... bench:   2,010,967 ns/iter (+/- 175,342)
test run_gt_8i_is_sorted  ... bench:     303,082 ns/iter (+/- 68,407)
test run_gt_8u_baseline   ... bench:   2,029,840 ns/iter (+/- 300,009)
test run_gt_8u_is_sorted  ... bench:     337,869 ns/iter (+/- 114,609)
test run_lt_16i_baseline  ... bench:   1,037,078 ns/iter (+/- 104,067)
test run_lt_16i_is_sorted ... bench:     301,939 ns/iter (+/- 89,317)
test run_lt_16u_baseline  ... bench:   1,031,273 ns/iter (+/- 115,523)
test run_lt_16u_is_sorted ... bench:     334,392 ns/iter (+/- 107,146)
test run_lt_32f_baseline  ... bench:   3,201,105 ns/iter (+/- 257,790)
test run_lt_32f_is_sorted ... bench:     681,012 ns/iter (+/- 264,631)
test run_lt_32i_baseline  ... bench:   1,140,243 ns/iter (+/- 587,907)
test run_lt_32i_is_sorted ... bench:     789,255 ns/iter (+/- 890,137)
test run_lt_32u_baseline  ... bench:   1,192,369 ns/iter (+/- 615,291)
test run_lt_32u_is_sorted ... bench:     747,146 ns/iter (+/- 233,472)
test run_lt_64f_baseline  ... bench:   3,635,677 ns/iter (+/- 3,049,427)
test run_lt_64f_is_sorted ... bench:   1,565,629 ns/iter (+/- 521,948)
test run_lt_64i_baseline  ... bench:   1,321,831 ns/iter (+/- 265,014)
test run_lt_64i_is_sorted ... bench:   1,478,014 ns/iter (+/- 412,410)
test run_lt_64u_baseline  ... bench:   1,428,323 ns/iter (+/- 413,818)
test run_lt_64u_is_sorted ... bench:   1,313,273 ns/iter (+/- 223,336)
test run_lt_8i_baseline   ... bench:   2,000,606 ns/iter (+/- 253,341)
test run_lt_8i_is_sorted  ... bench:     299,591 ns/iter (+/- 45,478)
test run_lt_8u_baseline   ... bench:   1,989,555 ns/iter (+/- 247,432)
test run_lt_8u_is_sorted  ... bench:     327,004 ns/iter (+/- 74,697)

On a Xeon(R) CPU E5-2695 v3 @ 2.30GHz:

test run_gt_16i_baseline  ... bench:     530,724 ns/iter (+/- 9,966)
test run_gt_16i_is_sorted ... bench:     164,685 ns/iter (+/- 2,791)
test run_gt_16u_baseline  ... bench:     530,724 ns/iter (+/- 11,917)
test run_gt_16u_is_sorted ... bench:     166,817 ns/iter (+/- 2,649)
test run_gt_32f_baseline  ... bench:   3,870,192 ns/iter (+/- 58,284)
test run_gt_32f_is_sorted ... bench:     270,014 ns/iter (+/- 3,011)
test run_gt_32i_baseline  ... bench:     530,887 ns/iter (+/- 2,091)
test run_gt_32i_is_sorted ... bench:     329,381 ns/iter (+/- 5,795)
test run_gt_32u_baseline  ... bench:     530,927 ns/iter (+/- 19,677)
test run_gt_32u_is_sorted ... bench:     333,764 ns/iter (+/- 6,847)
test run_gt_64f_baseline  ... bench:   2,180,730 ns/iter (+/- 52,635)
test run_gt_64f_is_sorted ... bench:     538,037 ns/iter (+/- 8,761)
test run_gt_64i_baseline  ... bench:     691,102 ns/iter (+/- 14,588)
test run_gt_64i_is_sorted ... bench:     628,104 ns/iter (+/- 15,375)
test run_gt_64u_baseline  ... bench:     690,482 ns/iter (+/- 16,620)
test run_gt_64u_is_sorted ... bench:     690,421 ns/iter (+/- 15,802)
test run_gt_8i_baseline   ... bench:     909,540 ns/iter (+/- 16,632)
test run_gt_8i_is_sorted  ... bench:     136,164 ns/iter (+/- 1,880)
test run_gt_8u_baseline   ... bench:     909,536 ns/iter (+/- 14,981)
test run_gt_8u_is_sorted  ... bench:     140,761 ns/iter (+/- 2,443)
test run_lt_16i_baseline  ... bench:     530,707 ns/iter (+/- 2,364)
test run_lt_16i_is_sorted ... bench:     160,733 ns/iter (+/- 1,780)
test run_lt_16u_baseline  ... bench:     530,713 ns/iter (+/- 9,249)
test run_lt_16u_is_sorted ... bench:     168,386 ns/iter (+/- 2,356)
test run_lt_32f_baseline  ... bench:   1,529,191 ns/iter (+/- 11,369)
test run_lt_32f_is_sorted ... bench:     269,269 ns/iter (+/- 3,844)
test run_lt_32i_baseline  ... bench:     530,920 ns/iter (+/- 18,349)
test run_lt_32i_is_sorted ... bench:     327,124 ns/iter (+/- 5,909)
test run_lt_32u_baseline  ... bench:     530,893 ns/iter (+/- 749)
test run_lt_32u_is_sorted ... bench:     338,442 ns/iter (+/- 3,649)
test run_lt_64f_baseline  ... bench:   1,537,085 ns/iter (+/- 22,255)
test run_lt_64f_is_sorted ... bench:     537,958 ns/iter (+/- 15,249)
test run_lt_64i_baseline  ... bench:     690,423 ns/iter (+/- 21,300)
test run_lt_64i_is_sorted ... bench:     635,785 ns/iter (+/- 17,912)
test run_lt_64u_baseline  ... bench:     690,270 ns/iter (+/- 15,633)
test run_lt_64u_is_sorted ... bench:     690,380 ns/iter (+/- 13,587)
test run_lt_8i_baseline   ... bench:     909,537 ns/iter (+/- 13,486)
test run_lt_8i_is_sorted  ... bench:     131,389 ns/iter (+/- 1,350)
test run_lt_8u_baseline   ... bench:     909,535 ns/iter (+/- 20,730)
test run_lt_8u_is_sorted  ... bench:     140,781 ns/iter (+/- 1,980)

@LukasKalbertodt made some comparisons of this against the std facilities here:

So even if specialization isn't stable, I'd like this to not be a "solution that only works for BinaryHeap", but something that works for me too, given that both things are 99% similar.

You have the right idea here. To refine things slightly:

  1. A closure in Rust is not a function pointer unless you put it in something like Box.

Note that "function pointer" in Rust generally refers to the fn types (not to be confused with the Fn trait), like fn(i32, i32) -> i32. Functions (and ZST closures) can coerce to their corresponding fn type.
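For instance (a small stable-Rust illustration; apply and add are names I made up):

// `apply` takes a fn *pointer*, not a generic `F: Fn(i32, i32) -> i32`.
fn apply(f: fn(i32, i32) -> i32, x: i32, y: i32) -> i32 {
    f(x, y)
}

fn add(x: i32, y: i32) -> i32 {
    x + y
}

fn main() {
    assert_eq!(apply(add, 1, 2), 3);           // a named function coerces
    assert_eq!(apply(|x, y| x * y, 3, 4), 12); // a ZST closure coerces too
}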

  4. You could theoretically put a closure in the type parameter of a generic argument. By doing that, the implementation is monomorphized for each given closure.

This is more than "theoretically" -- it's how most, if not all, things that take closures work in the standard library.

  6. Eventually, Rust will have features like typeof and/or abstract type. They will work as a substitute for the closure struct's name.

typeof is a reserved keyword, but AFAIK there's no accepted RFC for how it should work, so it may never exist. But abstract type is expected to exist, though its precise syntax may change.


Sure, but I don't see how that's any different with any of the possibilities being discussed, because all of them include a version where someone just uses a closure or where they can implement the trait themselves, and thus not get the optimization.

cc @nikomatsakis since he mentioned on Gitter that a way to specialize certain algorithms in rayon for certain operations (like sorting in increasing or decreasing order) would be useful.

Thanks for making this. Something like this is exactly what is needed for stateful comparisons.

There are a few things I noticed that might help improve the implementation:

  1. I don't think you need T and Compare<T> to implement Clone unconditionally. If you remove Clone from most impl blocks, except for the ones that actually require it, I think you can still use .clone() when it is available. In particular, if a closure captures a mutable reference, I do not think it can be cloned.

  2. The use of impl Compare<T> as a return type is good for free-standing functions, but it does not help traits with functions that return a BinaryHeap. Those will still need to define a new struct that implements Compare<T> instead of returning an unboxed closure. Unfortunately, there is no public constructor that can create a new BinaryHeap from an arbitrary instance of a type that implements Compare<T>. I think adding such constructors would be very helpful.

  3. I encourage you to add from_min, from_by, etc., and the corresponding versions that take an Iterator; that is the only way to build the heap efficiently. When you are using a custom comparison function, right now the only way to populate the heap is with extend() (or, equivalently, element-by-element insertion), which is O(n*log(n)) instead of O(n). (A sketch of the O(n) construction follows this list.)
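To make the complexity point concrete, here is a rough sketch of such a bulk constructor (from_iter_by is a hypothetical name, and a real version would return the heap type rather than a bare Vec): collect the elements first, then run the classic bottom-up sift-down heapify, which is O(n) overall:

// `gt(a, b)` means "a has higher priority than b" (max-heap order).
fn from_iter_by<T, I, F>(iter: I, mut gt: F) -> Vec<T>
where
    I: IntoIterator<Item = T>,
    F: FnMut(&T, &T) -> bool,
{
    let mut heap: Vec<T> = iter.into_iter().collect();
    let n = heap.len();
    // Bottom-up heapify: sift down every non-leaf node. This is O(n)
    // in total, instead of n pushes at O(log n) each.
    for start in (0..n / 2).rev() {
        let mut parent = start;
        loop {
            let left = 2 * parent + 1;
            if left >= n {
                break;
            }
            // Pick the higher-priority child.
            let mut child = left;
            if left + 1 < n && gt(&heap[left + 1], &heap[left]) {
                child = left + 1;
            }
            if gt(&heap[child], &heap[parent]) {
                heap.swap(parent, child);
                parent = child;
            } else {
                break;
            }
        }
    }
    heap
}

fn main() {
    let heap = from_iter_by(vec![2, 9, 4, 7, 5], |a: &i32, b: &i32| a > b);
    assert_eq!(heap[0], 9); // the root is the maximum
}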

Thanks and good luck with your proposal!
