Could collections hypothetically store keys and values inline?

Speaking only with respect to the default allocator:

Vec<T> and String document that they never use any sort of "small optimization" where some [T] or str data is stored inline in the main Vec or String allocation. They also do not have unique/noalias semantics like Box<T> or &mut T do; although this is technically not documented, because people already rely on this property, it would be considered a breaking change to add noalias semantics to them: https://github.com/rust-lang/rfcs/pull/3712#issuecomment-3715013712

These guarantees (and other guarantees of the language) imply that self-referential structs can soundly use Vec<T>s or Strings as backing data for self-references to the associated [T] or str data; moving a Vec<T> or String does not invalidate references (whether & or &mut) to their heap data.

The data structures in std::collections, however, do not explicitly document that they never store values (or, if applicable, keys) inline; AFAICT, this is technically "just" an undocumented implementation detail of std. Should I start making an RFC requesting this guarantee?

Or is there some reason I'm missing that already prevents std::collections data structures from storing values and (if applicable) keys inline, even hypothetically? In other words, can unsafe code stably and soundly use e.g. a HashMap as backing data for self-references to keys and values?

That's a libs-api question, not a UCG question (so you put this in the wrong section of this forum). And the answer is no, you can't rely on anything that's not explicitly documented.

Alright, recategorized. And at some point in the next two weeks, I'll make an RFC, then.

An RFC seems like overkill. The usual process for libs API changes these days is an ACP. It might also make sense to get some vibes on Zulip.

1 Like

Why would you want to make that guarantee? I could absolutely see the case for HashMap using specialisation in the future to store small keys and values inline. This is not something that is realistic in Rust today given the poor state of specialisation, but a HashMap<u32, u32> could absolutely benefit from inline storage.

Additionally whatever the backing store is, it could get reallocated, so even a guarantee of not inlining doesn't give you address stability.

It gives you address stability so long as you don’t mutate the HashMap.

In particular, the following three operations (in any quantity and order) wouldn’t invalidate references to heap data: moves (by a no-inlining + no-noalias guarantee), operations on a &HashMap (otherwise, stable behavior would break), coercions (basically just a combination of moves and immutable operations, listing this separately just since it feels like a bit of an edge case).

2 Likes

It doesn't seem that useful in practice, so why not give more freedom for future changes to improve the implementation? Hashbrown wasn't the first implementation in std, and we don't know that it will be the last.

Yes, it's substantially less useful. I just doubt that we'd ever use inline optimizations for std::collections; if I'm underestimating that possibility, that's fine too.

String would also benefit from small string optimization, see for example compact_str. The stdlib precluded this optimization however.

Indeed, and given how often I use compact_str, I think std made the wrong choice.

I also think that the stdlib making the promise was wrong, but for a different reason – it precludes optimisations as a result of, e.g., getting the collection types to cooperate more closely with the allocator.

For example, storing the capacity in Vec seems like double-storing to me. Most modern allocators are a combination of a slab allocator for small allocations, and a buddy allocator for large allocations (split up variously between the runtime libraries and operating system); all the high-performance allocator designs I've looked at recently seem to be converging on that design. For a slab allocator, allocations that have similar addresses have the same size, so it is more efficient to store the allocation size (= capacity) once per page rather than once per allocation. Buddy allocators might theoretically benefit from storing the size separately, but for large allocaitons, getting the allocator to track the size has very little cost relative to the cost of the allocation itself.

In cases where the program has control over its virtual memory layout (which I think is true on Linux/Windows/Mac OS, although not always on embedded platforms), and where the virtual address space is much larger than the physical address space (true on most 64-bit platforms with virtual memory), you can go even further and just embed the capacity directly into bits of the pointer by allocating it an appropriate address. This makes reading the capacity potentially faster than with the pointer/capacity/length layout, because you don't actually have to read the capacity from memory, just use a few masks and shifts.

(This discussion got me thinking about other things, like "so what if you get the allocator to always allocate at addresses which aren't valid UTF-8?" – I decided that this wouldn't be costly to do on 64-bit platforms but probably wouldn't be very helpful for efficient strings.)

In any case, it seems wrong to me to preclude potential optimisations like Vec and String have done; some may be good ideas and some may be bad ideas, but it is very likely that at least some might be beneficial.

2 Likes

You can do this in your own code, and build a CompactVec that is two words rather than three (if you are okay with small vecs only, or with capacities that must be power of two, or other restriction like that, it could even be just one word). Then, whenever you need to actually use the Vec, you build it on the stack and call any method. This is sound exactly because the stdlib documented the layout of a Vec (you just have to use ownership to prevent doing this twice).

Maybe that won't be faster but if you have a large number of Vecs stored somewhere, it might make sense, idk.

1 Like

In my experience, having a large number of small hash maps is very rare—in part because a hash map is a bad implementation for tiny sets/maps anyway (faster to do linear search over an array than hash stuff).

By comparison, "a bunch of short strings" is a common use case. I think it is reasonable to say that if the Rust standard library is willing to give up small string optimizations for pointer stability guarantees, giving up a purely hypothetical[1] small hashmap optimization is also worth it.

Even if you take the position that making this choice with String was a mistake (which I find to be a relatively compelling case), that choice has been made and having the same consistency and guarantees in other places at negligible sacrifice seems to make sense to me.


  1. To my knowledge, nobody has bothered to implement this: compact_str exists, compact_map does not, to my knowledge. This is not an optimization C++ bothers to make, nor is it an optimization that Google's C++ hashmap implementation which Hashbrown is based off makes afaik. If you are aware of any case—in any language—where somebody actually does a small hashmap optimization like this, I would be curious to see it. ↩︎

Speed aside, you can't create a &Vec or &mut Vec without literally having a valid Vec in memory and if you have a valid Vec in memory then which frees the allocation when dropped and you need to do CompactVec::into() and then CompactVec::from()—which requires ownership and so requires ugly swapping if you're, for example, working with an array of CompactVecs.

For example, what would have been array_of_arrays[i].push(foo) becomes

{
    let mut temp: Vec<_> = std::mem::take(&mut array_of_arrays[i]).into();
    temp.push(foo);
    array_of_arrays[i] = temp.into();
}

and similarly at every other place of usage. The ergonomics of that are sufficiently bad that I don't think anyone is going to do this outside of situations with very strict memory constraints.

1 Like

Responding to your footnote: such an implementation does exist. small-map gets a few hundred downloads per day.

1 Like

Oh interesting, though I do think this is a somewhat different thing.

This is not a “small map optimization” where if there are few enough key-value pairs, they get stored in the pointers and other stuff you need on the stack anyway (which is what compact_str in Rust and “small string optimizations” in C++ do); rather, it specifically reserves space for some number of key-value pairs on the stack.

Reserving space on the stack is a different data type which you’re not precluded from creating by guaranteeing that when you don’t allocate that space everything will be stored on the heap,

As a note, small {collection} optimization is less necessary in Rust than it is in a language like C++. C++ has default copy semantics, so passing around std::string as a function argument is safe (protecting you from shared mutation and/or use-after-move errors) but results in a lot of extra string copies. When you have a bunch of copies and most strings are ≤22 bytes, thus fitting in the stack space necessary for the string already (on 64bit targets), you get a huge benefit from making those string copies into trivial copies instead of heap duplication.

But in Rust, &str existed from day one, gets used by default instead of passing String by value, and does not suffer from requiring developer enforcement of lifetimes. Furthermore, when String is passed by value because indefinite ownership is required, it's an ownership transferring trivial move, and duplicating the heap memory requires a .clone() call[1].

Even comparing to a runtime for a scripting language like JavaScript or Lua, there roughly everything is a hashtable, so optimizing simple objects to not use the full dynamic machinery is a large cross-cutting performance improvement. But in Rust, those simple types won't be maps, they'll be their own static types, so collection types don't have to be optimized for the tiny use case.

It would be great if we could have e.g. Vec<T, SmallStorage<23>> and such so those cases that do benefit from small {collection} optimization can do so without abandoning the standard library's well tested API.

While this is a minor storage optimization, note that this likely defeats most peephole (i.e. LLVM) optimization, as now optimization can't easily track and know capacity information. Since the allocator is dynamically linked, its reported capacity might even change between queries, as far as the optimizations know!

Furthermore, Rust has made a conscious (and imho correct) choice for global general purpose de/allocation to always provide the size/align of the allocated slot. Any cooperation with the allocator would need to happen with some Vec<T, A> which is not Vec<T, Global>, and the guarantee only applies to the latter instantiation.

It is a form of it, yes. When you exceed the allocated inline space, all of the map is transferred to the heap. The Rust enum overlaps the Heap and Inline case, using an invalid value of one to mark when the other is in use so the size equals that of the larger variant.

Because Rust is statically typed, types must have a fixed stack size. HashMap is currently 6Ă—usize. You'll be hard-pressed to fit more than one key/value pair in that space. If you have a use case where most of your maps only have one int-string entry... maybe you shouldn't be using the general purpose stdlib HashMap.

As for layout optimization within the heap, hashbrown (the backing hashtable implementation for std) implements the SwissTable implementation, storing roughly Box<dyn<N: usize> ([(K, V); N], [CtrlByte; N])>. Rust was built with a default capability for trivial destructive moves, so there's nowhere near the default present in e.g. the C++ STL of defensively boxing things to provide address stability and guarantee cheap relocation (of the pointer in the collection).


  1. Or also whatever the final form of “use closures” ends up being. Think move closures except clone-capturing instead of move-capturing any handle-like (e.g. reference-counting) captures. ↩︎

2 Likes

Yes, sure, it is a form of optimization for small maps specifically. But it is not a direct analog of the "small string optimization" implemented by C++ and compact_str. It is a different way of optimizing for small maps by putting them on the stack[1] and so while you could reasonably call it a "small map optimization" in a vacuum, it is not the same thing and not what @Vorpal was originally referring to.

The direct analog is specifically storing things in those 6 usize-shaped values, which as you point out, wouldn't be very useful, which was my point.

Also, this is further besides the point, but static typing has little to do with requiring types to have a fixed size on the stack. The only major language I'm aware of opting out of fixed-sized types on the stack as a matter of policy is Swift, which is statically typed and uses alloca heavily in its implementation of dynamically-linked generic types and such (How Swift Achieved Dynamic Linking Where Rust Couldn't - Faultlore is a very cool article on the topic, for anyone unfamiliar).

I think it's reasonable that Rust defaults to the position that everything must have a static stack size,[2] but it's not a fundamental property of static typing, nor does dynamic typing make it easier.


  1. It doesn't seem like a particularly good implementation either—if your hash map is 16 items you should probably be doing linear search over an array. It also includes a 48 byte overhead over storing key-value pairs in an array and doing linear search, which seems fairly significant if you're optimizing for the case of small maps. ↩︎

  2. Though maybe having more support for dynamically-sized objects would be a benefit to the language. It would be nice to be able to place dynamically-sized objects on the stack and to be able to express things like dyn<T: Trait> Vec<T> where all the elements have the same underlying type (and thus stride) and so you can store them contiguously, only storing their vtable pointer once with the Vec and avoiding a bunch of unnecessary allocations, compared to something like Vec<Box<dyn T>>. ↩︎

Nothing stops a hash map implementation from doing linear search (i.e. using a single bucket) for tiny sets of keys.

This is true, you could do this, but if you have fixed inline size for structs and you're not doing "small map optimization" then this is somewhere in the 48 to 24 bytes of wasted space (depending on what exactly you know) per map. Further, if you merely do things like putting all the keys and values in a single bucket, your code is liable to still end up pointlessly hashing the key.

However, this is all besides the point, which is that regardless of whether this is a thing that you could do (yes, though presumably with at least a bit of overhead), I don't believe any notable implementation of a hash map actually does (afaict, even JavaScript—which is the most optimized of the "everything's a hashmap" dynamic languages and has JIT-compilation for maps with consistent key sets—doesn't seem to have anything in particular for handling small maps in general).

Sure, in some sense this is a self-perpetuating problem—if hash maps weren't suboptimal for small sets then people would be used to using them for that in performant code and there'd be pressure to do these optimizations—but it really doesn't seem worth the complexity when people can just choose the appropriate data structure for their use case. If somebody should be trailblazing this optimization, it's languages like Go or Python or JavaScript, not Rust.

Rust has much worse performance footguns for much more common use-cases than "don't use a HashMap when the expect size is 4 pairs if you care about performance" and expects programmers to be knowledgable enough to avoid them.

For example, Rust is the only major programming language I'm aware of not to buffer file IO in its default standard library implementation (even C's fread/fwrite are buffered by default!). I don't think this is a flaw in Rust, it's just part of a philosophy that making the default behavior simple and predictable is more important than making it optimized for all plausible use cases.