What would be your opinion on including an associated list in std::collection?

I've had a discussion over at Rust community discord about what could theoretically be the Dictionary type, and realized that one of the options, that is probably the best for very small data sizes and has an advantage of only requiring PartialEq Eq for keys, is not included in the standard library.

What do you mean by "associated list", what would the API look like, and what are the benefits over the existing collections?

Basically a wrapper over Vec<(K, V)>, with a map-like API for insertion and removal. Benefit over both is not requiring either Hash nor Eq Cmp, and lack of indirection to the data. Tradeoff being, lookup and insertion are O(n) in size.

How would it work without Eq? Do you mean Ord?

Why would it only require PartialEq? Wouldn't that allow it to e.g. fill up with many NaN keys?

1 Like

There's lots of possible spots in the trade-off spectrum, but not all of them need to be in alloc.

I think a FlatMap<K, V> being in a crate -- the same way a SmallVec<T> is just in a crate -- is fine.

Especially since .iter().find(…) works pretty well even without a wrapper.

3 Likes

I think the term VecMap is a little better because “dictionaries” in a lot of languages are our HashMap.

There is a crate that provides this vecmap - Rust

I have definitely used this in the past, more lightweight than a HashMap and less indirection than a BTree. It wouldn’t be bad to have something easy to drop into place.

That being said, the implementation is so basic I’m not sure it merits a place in std. For Ord things, binary_search_by_key already gives you the position to insert if lookup fails. For !Ord things, .iter().find(…) works. It’s trivial to write a few lines of code wrapper around either of these things to suite your application’s needs, without std needing to be opinionated about to Ord or not to Ord.

Maybe an example of these patterns in the book somewhere is a better fit than a new data type? Or maybe we could add methods to a impl Vec<(K, V)> without adding a new type, such as get_or_insert.

6 Likes

As a previous art in adding methods to Vec, there's assoc

For another example implementation to demonstrate different trade offs: https://github.com/clap-rs/clap/blob/master/clap_builder/src/util/flat_map.rs

This is a (Vec<K>, Vec<V>)

  • Better cache locality for key lookups
  • Smaller binary size because you are more likely to share monomorphized Vec code
7 Likes

There's also linear-map which (according to Cargo) is more popular, but old and unmaintained (thus e.g. fn Entry::or_default is missing).

Something I have used, if a little niche. However, it's simple and a decent choice of "map" type when the length is known to be very small, so... either including or not including seems fine IMO.

1 Like

And on another axis this gets very close to various DataFrame/Series implementations.

Nitpick: the term for this in Lisp is "association list" or "pairlist", not "associated list". And I think "associated list" sounds too much like "associated type" for clarity.

"The key type only needs to be Eq, not Ord or Hash" seems like a solid reason to add this to the stdlib: if you know you need a map where the keys only have to be Eq, but you don't know what this is usually called, and you have to hunt for something in crates.io, it could be quite hard to find. I don't know if it's a solid enough reason, especially considering the Vec<(K,V)> vs (Vec<K>, Vec<V>) tradeoffs.

I will note that it's also possible to make a variant which uses a single allocation while holding keys and values in a more optimal way. Essentially Box<([K; n], [V; n])>

1 Like

Im not too sure it's actually more optimal, having worked on and benchmarked a similarly laid out container, my gut tells me it will be within margin of more naive 2 Vec approach. While complexity of bookkeeping forsuch an allocation is not insignificant.

What bookkeeping? All you have to do is store two pointers, one capacity, and one length which is practically the same as the (Vec<K>, Vec<V>) case. All of the tricky stuff you only have to deal with on reallocation.

That said, I don't expect it to be much faster unless you keep the whole structure within a single cache line or something.

Capacity is a bit tricky to decide. Also keys and values could easily have different alignment requirements.

1 Like

It's not that big of a deal. You can do

use std::alloc::{Layout, LayoutError};
pub fn layout<K, V>(cap: usize) -> Result<(Layout, usize), LayoutError> {
    let a = Layout::array::<K>(cap)?;
    a.extend(Layout::array::<V>(cap)?)
}

to get the allocation size & alignment needed and the offset to the array of values.

But this ignores possible over-allocation.

Why? The capacity can just be treated like length. Same for both component arrays.

What do you mean by over-allocation? Rust allocators don't return the size of the actual allocation, so all you can assume is that it's at least the size you asked for.

Unstable Allocator trait in fact does return length.