Traits for common data structures

Collections, JSON, Numbers etc. These common concepts implemented in a multitude of ways. Sometimes, the library you want to use don't use the same implementation as you, and it'd force you into various workarounds(I've stumbled on this, and dug from that rabbit hole all the way to here). There has been repeated attempts at making traits for them, of which none made it into std, presumably because Rust was not ready for it.

Now that GATs are closer to arriving, I would think it's good time to reconsider, but so far I couldn't find any such movement. I'd like to start working on this. What do you think?

4 Likes

Shouldn't it be possible for the library to switch to a trait describing its functionality? Or am I missing what you are getting at.

And example would help, at least me, understand what sort of traits you want.

Experience shows that arbitrary collections cease to be abstracted over usefully, as soon as you actually want to query the data structure, because that is very specific to the concrete type of the data structure. Basically the only useful abstraction is traversing all elements, which the Iterator trait already provides.

Numeric types already correspond to a handful of useful traits in num, while JSON-as-a-value-tree is basically monopolized by serde_json::Value. What value would additional traits add to these types?

7 Likes

There are others, but they get complicated quickly. Relational algebra, for example, does a reasonable job at abstracting over all the different strategies RDBMSs use to actually store data on disk, while retaining the query performance of each.

Translating that into Rust's type system is no easy task, however.

3 Likes

I do frequently want some kind of way of abstracting over HashMap::get though. I think some extractions might be hard to model, but I think we can probably look to see if something like this is possible:

trait Mapping<K: Sized, V: Sized> {
  fn get(&self, k: K) -> Option<&V>;
  fn get_mut(&mut self, k: K) -> Option<&mut V>;
  fn remove(&mut self, k: K) -> Option<V>;
}

impl<'a, K, Q, V> Mapping<&'a Q, V> for HashMap<K, V>
where
  Q + Hash + Eq + ?Sized,
  K: Borrow<Q>,
{ ... }

impl<K, Q, V> Mapping<&'a Q, V> for BTreeMap<K, V>
where
  Q: Ord + ?Sized,
  K: Borrow<Q>,
{ ... }

impl<T> Mapping<usize, T> for Vec<T> { ... }

You would be forced to use Mapping with a HRTB, such as x: impl for<'a> Mapping<&'a str, String>... yuck.

Sure, but the interface and the semantics of a relation is reasonably homogeneous regardless of its representation. It is by definition a set of tuples with operations such as projection, restriction, and Cartesian product. These can be implemented whether e.g. the backing set structure is a HashSet or a BTreeSet. In contrast, I can't imagine why or how anyone would want to abstract over any possible collection. For example, what's common in a red-black tree, a skip list, a hash set, a directed multigraph, and a sparse matrix? Unless one specifies the high-level semantics of what a collection is being used for, I don't see what common interface would it even be possible to define, let alone define in a useful manner.

1 Like

K, V are probably more naturally associated types on the trait there. After all, a HashMap<K, V> implies one specific key type (K), so it can be an associated type on the trait. A common pattern but it can have exceptions.

I think traitifying HashMap::get should be straightforward, but it's a bit puzzling that collection traits are not more widespread on crates.io.

Before it has been said that we'd might need the GAT feature before we could have really good collection traits (for example to be able to include .iter() etc in the collection trait itself).

The benefit of having a common interface is that a collection type can be swapped for a different one with minimal refactoring costs. That, in turn, means that the collection performance can be tuned after the program is complete enough to observe real-world load.

In the general case, a query is a request to visit all records in the collection that satisfy some predicate P, and each collection type optimizes for different predicates. For a common interface to be useful, the collections need to be able to inspect the predicate and choose the appropriate strategy, preferably at compile time.

The obvious choice for specifying these predicates, a closure, doesn’t provide the introspection necessary for the interface to be efficient— Some other formulation is required. It needs to be general enough to support a wide variety of predicates that are actually useful, but regular enought for collections to make good choices based on each particular predicate.

As you may be aware, I’ve been working on something like this for my Master’s thesis. The strategy I’ve been working with is a bit cumbersome but shows some promise. The filter portion of each query is a conjunction of predicate types, represented as an HList. Each collection type can search the list of predicates for terms that identify an easily-iterable subset of its records which contains all of the query results. These are then post-filtered by all of the predicate terms, which gives (unoptimized) support for extension predicate types.

2 Likes

Maybe I don't quite understand what the distinction is. But "a query is a request to visit all records ... satisfy some predicate" really sounds like an iterator + filter.

1 Like

That shouldn't be the case if the collection is any more specialized than an unsorted list. For instance asking a Trie for items starting with "q".

Why would that not work with iterators?

Would the Iterator that you are thinking about change the sequence of iteration based on a predicate?

2 Likes

While I agree some abstractions tend to be there for the sake of it, I didn't mean to make a trait for collections in general. I meant Lists, Maps, and Sets.

I don't find having de facto standard to be good enough long-term, especially without an abstract interface. If you're not making replacement practical, it's not much better than actually having it in std.

3 Likes

Having HashMap and BTreeMap mostly-interchangeable would be nice. C++ developers tend to use BTreeMap-equivalent by default, Java developers tend to use HashMap-equivalent by default, filesystems generally use BTreeMap-equivalents, Lua uses intrusive HashMap-equivalents, etc etc. HashMaps and BTreeMaps have different tradeoffs but it doesn't seem to matter in the general case, or else there wouldn't be such a huge variation in what different developers default to.

In Rust tho you're forced to duplicate your whole API surface and whatnot if you wanna support both. Personally we just throw a BTreeMap at things but we know ppl who swear by HashMaps and the impedance mismatch sticks out like a sore thumb when they wanna use our APIs. Would be nice to solve this some day.

4 Likes

You are not forced to "duplicate the API surface." You can create a trait and expose an API in terms of the trait.

4 Likes

Lists, (Associative) Maps, and Sets are already quite broad.

What are the required list operations, for instance? Probably Iteration and Push/Pop. Extend? Reverse? What about random access patterns? Cursors? Allocation reuse? If you don't have all of these features, you start to suffer pretty significant performance degradation if you hobble your API to a limited List trait. Or alternatively, you have an exponential combination of features spread out over a dozen traits to support all conceivable use cases.

Or you could just define a trait that suits your specific use case, which doesn't really require any brilliant strokes of foresight. (And if you need to redesign it, you don't break the whole world.)

And it gets even worse for things like Maps and Sets, because their complexity leads to a larger number of specialized feature sets.

2 Likes

In summary: basically everyone agrees that having "perfect" collection traits would be great; it's good to be able to abstract over your data store.

But where people disagree is what said collection traits would want to express, and whether an abstract interface can be made which is both useful and general enough for std.

Basically: proposing collection traits needs to come with the proposer showing what they actually want the traits to look like. Use whatever magic :sparkles: future :sparkles: features you need to, but to have a productive discussion, we need to have a concrete API to talk about.

8 Likes

I get that, but this does not work with all collections, because some collections have radically different interfaces. Some of the difference can be attributed to differing taste in API design, but some of it is intrinsic and unavoidable, rooted in the representation of the data structure.

To stick with the same example, it's immediately obvious how/why one would want to swap out a HashMap for a BTreeMap, because they are used for the same thing: key-value lookup. (However, even in the case of these two types, the differing Ord vs Hash bounds pose a problem.)

On the other hand, it's not at all obvious how I would swap out a Vec<T> for a LinkedList<T>. Or even a HashMap<K, V> for a LinkedList<(K, V)>. Sure, I could use a linked list of pairs to implement a map with terrible performance, but how would I generalize the lookup? The linked list type intentionally does not provide a "get value for key" method, because it was not designed for that, and it can't do that efficiently. At this point, the only common property I could rely on is there being a sequence of key-value pairs, which Iterator<Item=(K, V)> already captures.

I'm curious how this works out in the end. I was going to mention the exact problem of not being able to abstract over the differences efficiently – to circle back to my LinkedList<(K, V)> example, this type could also provide a get(&Borrow<K>) -> &V method much like HashMap, and std could put these two methods behind a common trait.

However, I think this sort of generalization would be explicitly undesirable and quite dangerous, because it doesn't only allow swapping a less efficient representation for a more efficient one – it permits the opposite, too. I wouldn't want my code to become, say, 1000 times slower just because I did not get a slap on my wrist from the compiler for exchanging a data structure with a completely different one.

That sounds more reasonable, indeed. In this case, what remains is the more concrete but related question of abstracting over different trait bounds.

I am becoming more and more convinced over time that small, specialized traits (i.e. abstractions that you write for yourself) are an almost universally better solution to this problem, because trying to unify all use cases upfront in the standard library (which is impossible to amend after the fact) is very hard, whereas defining an application-specific (probably private) trait is trivial and trivially extensible.

2 Likes

But why would you make that change? Presumably, it's because you were looking for a performance improvement somewhere else in the program that also uses this collection. If your collection will only ever need to support one shape of query, there's not much benefit to using a cumbersome generalization.

Personally, I want the compiler to enforce correctness and allow writing performant code. Choosing the tradeoffs between performance and maintainability is part of my job as a programmer. Of course replacing one data structure with another has consequences-- If it didn't, there'd be no need to offer so many choices. But because the number of choices is so large, I'd like the ability to experiment with different ones to discover what's best for my application.

The actual situation I've been trying to improve is where your records have several different potential keys, so that you can replace something like a HashMap<A, Vec<T>> with BTreeMap<B, Vec<T>> (given T:AsRef<A> + T:AsRef<B>). Or something more exotic that indexes both A and B at the same time. This is important because it's not always obvious at the start of development which of the potential keys will be more important to optimize. You might only get that information through user complaints about performance issues, after the program is mostly complete.

In Codd's original relational algebra paper, he describes several data models for an inventory-management problem and challenges the reader to write a particular query against these different models. I did that, with both Rust's standard collections and my prototype library. The different models have widely-varying performance characteristics for this query, as you'd expect (several orders of magnitude). Over a wide range of input sizes, my versions consistently take 2-3x longer to run than the corresponding std version.

This slowdown buys the ability to switch to a different model with no change in the query code. This means that you can experiment with different models to find the right one. As the first model you try is likely to be suboptimal, this ability to experiment will produce faster code than you'd likely end up when writing the queries by hand.


This is a complicated enough system that I doubt it will end up in the standard library in anything like its current form, but I hope it can find a place in the wider ecosystem. After we have more experience with this sort of thing in the wild, we may discover some useful patterns that could be brought to std.

2 Likes

They produce equivalent results, but the iterator+filter approach always requires a scan that inspects every item in the collection individually. That negates the advantage of using a collection type that can do things like efficient range queries.

The goal of this abstraction is to allow these sorts of specialized collections to use some property of the filter to restrict which elements the iterator visits.