Currently, rustc spends 10% of no-opt compile time creating and destroying Substs structs. More specifically, Substs structs are interned, but each one contains two Vecs, which have to be allocated and then freed whenever an already-interned Substs is re-interned, so I would like some way to intern a Substs without allocating.
This could be done relatively easily with find_equiv: create a variant of Substs whose Vecs are allocated on the stack, and look that up. If it isn’t found, create a real Substs and intern it.
However, the core team thinks find_equiv isn’t a good API for a hash table. I think a good solution would be to add a HashInterner. I would like to reuse parts of std::collections::hash for it, so it should probably live in std.
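The key property wanted here (look up by an equivalent borrowed key so a hit costs no allocation, and only allocate on a miss) can already be approximated in safe Rust via the Borrow trait. Here is a minimal sketch of that pattern using str as a stand-in for Substs; the Interner type and its intern method are illustrative names, not the proposed HashInterner API:

```rust
use std::collections::HashSet;

// Minimal interner sketch: look up by a borrowed key (&str) so a hit
// performs no allocation; only a miss allocates the owned Box<str>.
struct Interner {
    set: HashSet<Box<str>>,
}

impl Interner {
    fn new() -> Self {
        Interner { set: HashSet::new() }
    }

    fn intern(&mut self, s: &str) -> &str {
        // Because Box<str>: Borrow<str>, `contains` and `get` can hash
        // the &str directly, without first building an owned key.
        if !self.set.contains(s) {
            self.set.insert(s.into());
        }
        self.set.get(s).unwrap()
    }
}

fn main() {
    let mut interner = Interner::new();
    let a = interner.intern("Substs") as *const str;
    let b = interner.intern("Substs") as *const str;
    // The second call hit the cached entry, so both point at one allocation.
    assert_eq!(a, b);
    println!("interned ok");
}
```

For Substs itself this is harder, because the borrowed view would be a pair of slices rather than a single &str, which is exactly where Borrow falls short and something like find_equiv or a dedicated HashInterner would come in.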
The intent here is that you could reuse the dead-but-memory-backed objects rather than repeatedly dropping and re-allocating them. It is a different solution to the problem you actually have: that churning the allocator is expensive.
As I understand your post, repeatedly acquiring and releasing new “vectors of a bunch of different sizes” from the allocator is what makes your code slow at the moment. If, rather than asking the allocator for two fresh vectors for your Substs, you take two empty but already-allocated vectors from a Recycler, use them exactly as if you had allocated them fresh, and return them to the Recycler rather than letting them drop, your code may go faster.
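A minimal sketch of such a Recycler (a hypothetical name and API, not an existing library): emptied vectors keep their heap capacity, so handing one back out skips the allocator entirely.

```rust
// Hypothetical Recycler: holds emptied Vecs so their backing buffers
// are reused instead of being returned to the allocator.
struct Recycler<T> {
    pool: Vec<Vec<T>>,
}

impl<T> Recycler<T> {
    fn new() -> Self {
        Recycler { pool: Vec::new() }
    }

    // Hand out an empty vector, reusing a pooled allocation if one exists.
    fn take(&mut self) -> Vec<T> {
        self.pool.pop().unwrap_or_else(Vec::new)
    }

    // Return a vector to the pool instead of letting it drop.
    fn give(&mut self, mut v: Vec<T>) {
        v.clear(); // drops the elements but keeps the backing buffer
        self.pool.push(v);
    }
}

fn main() {
    let mut r: Recycler<u32> = Recycler::new();
    let mut v = r.take();
    v.extend([1, 2, 3]);
    let cap = v.capacity();
    r.give(v);
    let v2 = r.take(); // same buffer back: length 0, capacity preserved
    assert!(v2.is_empty() && v2.capacity() >= cap);
    println!("recycled ok");
}
```

One wrinkle for the Substs case is that vectors of different sizes come back with different capacities, so a pooled vector may still need to grow; the win is only that grow-from-capacity is cheaper than allocate-from-nothing.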
Incidentally, this sort of situation is where an inline allocator could shine. Since Rust has sized deallocation, both allocation and deallocation could in theory be specialized for object sizes known at compile time, and each could turn into a handful of instructions in the fast path. (Especially if you had constant offsets from %fs/%gs, but I guess that’s just a fantasy in code that can be called from C.)
Though I suppose even that would find it hard to compete with stack allocation.
Sorry, I wasn’t clear; I just meant an allocator whose code gets inlined (the fast path, anyway). Like some modern mallocs, it would probably give each size class its own free list (singly linked?), cached in thread-local storage, so there would be no need for free-slot merging, synchronization, etc. in that path; it would just be popping and pushing an entry on the appropriate TLS linked list.
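To make the fast path concrete, here is a toy model of that design in safe Rust: one thread-local free list for a single size class, where alloc and free are just a pop and a push with no synchronization or coalescing. A Vec of boxed fixed-size blocks stands in for the singly linked free list, and the 64-byte size class is an arbitrary choice for illustration; a real inline allocator would work on raw memory.

```rust
use std::cell::RefCell;

// One size class, chosen arbitrarily for this sketch.
const BLOCK: usize = 64;

thread_local! {
    // Per-thread free list for this size class. A Vec used as a stack
    // stands in for the singly linked list a real allocator would use.
    static FREE_LIST: RefCell<Vec<Box<[u8; BLOCK]>>> = RefCell::new(Vec::new());
}

fn alloc_block() -> Box<[u8; BLOCK]> {
    // Fast path: pop from the thread-local free list.
    // Slow path: fall through to the global allocator.
    FREE_LIST
        .with(|fl| fl.borrow_mut().pop())
        .unwrap_or_else(|| Box::new([0u8; BLOCK]))
}

fn free_block(b: Box<[u8; BLOCK]>) {
    // Push onto the thread-local list instead of releasing the memory.
    FREE_LIST.with(|fl| fl.borrow_mut().push(b));
}

fn main() {
    let a = alloc_block();
    let addr = &*a as *const _ as usize;
    free_block(a);
    let b = alloc_block(); // fast path: gets back the block just freed
    assert_eq!(&*b as *const _ as usize, addr);
    println!("fast path ok");
}
```

Because everything is thread-local, there is no contention on the fast path; the costs a real implementation would still have to deal with (list growth, cross-thread frees, returning memory to the OS) all live on the slow path.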
I don’t know if anyone has tried this in C or C++; maybe there’s some dealbreaker I haven’t thought of.