The fact that rustc zeros structures on drop is a serious performance regression, and it makes rust unnecessarily difficult to deal with for performance-sensitive work. In this post, I will show you why, and propose that drop reform happen before 1.0.
I have been spending the past couple of weeks working on a zero-copy fork of html5ever. The core data structure is an Iobuf, which looks something like this:
    // 24 bytes on x86_64. Same size as `Vec`.
    struct Iobuf {
        buf: *mut u8,
        lo_min: u32,
        lo: u32,
        hi: u32,
        hi_max: u32,
    }
where lo_min and hi_max specify the window over bytes accessible to the Iobuf, and lo and hi specify the window that we’re currently dealing with for the task at hand. In this case, that task is parsing.
Iobuf has a Drop implementation associated with it, which deals with non-atomically refcounting the buffer. This implies that Iobufs are cheap to clone (a 24-byte memcpy, a pointer chase, and an inc) and cheap to destroy (a pointer chase and a dec).
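To make the clone and drop costs concrete, here is a minimal sketch of a non-atomically refcounted buffer, written in present-day Rust rather than the dialect this post was written against. It is not the real Iobuf: the `Shared` header, `from_vec`, `refcount`, and `as_slice` are names assumed purely for illustration, and the real layout (a bare `*mut u8`) presumably avoids the extra `Box` indirection used here.

```rust
use std::cell::Cell;
use std::ptr::NonNull;

// Hypothetical heap header: a non-atomic refcount stored next to the bytes.
struct Shared {
    refcount: Cell<usize>,
    data: Box<[u8]>,
}

// One pointer plus four u32 window fields: 24 bytes on a 64-bit target.
struct Iobuf {
    shared: NonNull<Shared>,
    lo_min: u32,
    lo: u32,
    hi: u32,
    hi_max: u32,
}

impl Iobuf {
    fn from_vec(bytes: Vec<u8>) -> Iobuf {
        assert!(bytes.len() <= u32::MAX as usize, "buffer too large for u32 windows");
        let len = bytes.len() as u32;
        let shared = Box::new(Shared {
            refcount: Cell::new(1),
            data: bytes.into_boxed_slice(),
        });
        Iobuf {
            shared: NonNull::from(Box::leak(shared)),
            lo_min: 0,
            lo: 0,
            hi: len,
            hi_max: len,
        }
    }

    fn refcount(&self) -> usize {
        // SAFETY: `shared` stays allocated while any Iobuf refers to it.
        unsafe { self.shared.as_ref().refcount.get() }
    }

    fn as_slice(&self) -> &[u8] {
        // SAFETY: as above.
        let shared = unsafe { self.shared.as_ref() };
        &shared.data[self.lo as usize..self.hi as usize]
    }
}

impl Clone for Iobuf {
    // Copy the fields, chase the pointer, increment the count.
    fn clone(&self) -> Iobuf {
        // SAFETY: as above.
        let shared = unsafe { self.shared.as_ref() };
        shared.refcount.set(shared.refcount.get() + 1);
        Iobuf {
            shared: self.shared,
            lo_min: self.lo_min,
            lo: self.lo,
            hi: self.hi,
            hi_max: self.hi_max,
        }
    }
}

impl Drop for Iobuf {
    // Chase the pointer, decrement the count, free on the last reference.
    fn drop(&mut self) {
        let remaining = unsafe {
            let shared = self.shared.as_ref();
            shared.refcount.set(shared.refcount.get() - 1);
            shared.refcount.get()
        };
        if remaining == 0 {
            // SAFETY: this was the last reference, and the pointer came from Box::leak.
            unsafe { drop(Box::from_raw(self.shared.as_ptr())) };
        }
    }
}
```

Because the count lives in a `Cell` behind a `NonNull`, the type is neither `Send` nor `Sync`, which is what makes the non-atomic increment and decrement acceptable in this sketch.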
html5ever-zerocopy is structured to receive input one Iobuf at a time. It then, in pseudocode:
1. Clones the first Iobuf in the queue onto the stack.
2. Resizes that buffer to cover a single utf-8 char.
3. Advances the original buffer (the one that was cloned) so that it no longer contains that char.
   3a. Drops that buffer if it’s now empty.
4. Returns the buffer representing a single char.
5. Decides which callback to call based on the current state and that character, and calls it.
6. Goto 1 until out of buffers.
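The loop above can be sketched roughly as follows, in present-day Rust. This is an illustration under assumptions, not html5ever-zerocopy’s code: `Buf` is a hypothetical stand-in for Iobuf built on `Rc<[u8]>`, `next_char` and `drain` are invented names, and the sketch assumes every queued buffer is non-empty, valid UTF-8, and that no character is split across two buffers.

```rust
use std::collections::VecDeque;
use std::rc::Rc;

// Hypothetical stand-in for Iobuf: shared bytes plus a [lo, hi) window.
#[derive(Clone)]
struct Buf {
    bytes: Rc<[u8]>,
    lo: usize,
    hi: usize,
}

impl Buf {
    fn new(s: &str) -> Buf {
        Buf { bytes: Rc::from(s.as_bytes()), lo: 0, hi: s.len() }
    }

    fn is_empty(&self) -> bool {
        self.lo == self.hi
    }

    fn as_str(&self) -> &str {
        std::str::from_utf8(&self.bytes[self.lo..self.hi]).expect("window is valid UTF-8")
    }
}

// Byte length of the UTF-8 sequence introduced by `first`.
// Assumes `first` is a lead byte, not a continuation byte.
fn utf8_len(first: u8) -> usize {
    match first {
        0x00..=0x7F => 1,
        0xC0..=0xDF => 2,
        0xE0..=0xEF => 3,
        _ => 4,
    }
}

// Steps 1 to 4: split one char off the front of the queue.
fn next_char(queue: &mut VecDeque<Buf>) -> Option<Buf> {
    let front = queue.front_mut()?;
    let mut ch = front.clone(); // 1. clone the first buffer
    let len = utf8_len(front.bytes[front.lo]);
    ch.hi = ch.lo + len; // 2. shrink the clone to a single char
    front.lo += len; // 3. advance the original past that char
    if front.is_empty() {
        queue.pop_front(); // 3a. drop the original once it is empty
    }
    Some(ch) // 4. return the one-char buffer
}

// Steps 5 and 6: loop until the queue is exhausted. Collecting the chars
// stands in for dispatching to a state-dependent callback.
fn drain(queue: &mut VecDeque<Buf>) -> Vec<String> {
    let mut out = Vec::new();
    while let Some(ch) = next_char(queue) {
        out.push(ch.as_str().to_string());
    }
    out
}
```

Every iteration clones and later drops a small refcounted handle, which is why the per-value cost of moving and dropping matters so much for this design.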
This design is sound in a language with non-zeroing drops, but is totally inadequate in rust. Here’s my issue in the codegen of a single function:
    fn process_token(&mut self, t: Token) { // Token is big. Maybe 128 bytes?
        if self.opts.profile {
            let (_, dt) = time!(self.sink.process_token(t));
            self.time_in_sink += dt;
        } else {
            self.sink.process_token(t);
        }
    }
The codegen for this looks something like:
    fn process_token(*mut self, t: [u8, ..128]) {
        if self.opts.profile {
            let t_stack: [u8, ..128];
            memcpy(t_stack, t, 128);
            memset(t, 0, 128);
            // start timer
            call self.sink.process_token // expects its `t` to be on the stack
            // stop timer
            self->time_in_sink += dt;
            if t.drop_flag {
                drop_glue_Token(&mut t)
            }
        } else {
            let t_stack: [u8, ..128];
            memcpy(t_stack, t, 128);
            memset(t, 0, 128);
            call self.sink.process_token // expects its `t` to be on the stack
            if t.drop_flag {
                drop_glue_Token(&mut t)
            }
        }
    }
That little prologue of memcpy + memset shows up every. single. time. I pass Token by-value to a function. Whoever calls process_token has to do it, too! There’s also an unnecessary memcpy when returning, but that’s just because rustc doesn’t have NRVO yet.
These are not theoretical concerns. The constant memcpy + memset is the single biggest source of slowdown in html5ever-zerocopy at the moment. Copying a few hundred bytes for every token (themselves usually less than a hundred bytes in length) is slower than just mallocing and passing around boxes. That’s terrifying.
Another issue with the current rustc implementation of Drop is that the main parsing function uses an order of magnitude more stack than necessary. All of this comes from stack-allocating < 128 bytes in each arm of the outer match (all mutually disjoint in lifetimes), where the structure had drop glue that would never need to run (it is consumed in each arm of the sub-match), which rustc wasn’t smart enough to detect. Walking around tens of kilobytes of stack on every state change was a huge waste of precious L1d, so I had to move from the “giant embedded match” design to the “giant match which calls functions marked #[inline(never)]” design.
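For readers who haven’t seen the shape of that workaround, here is a minimal sketch in present-day Rust. The two states and the counters are hypothetical and far simpler than a real HTML tokenizer; the only point is that each arm of the outer match is a call to an out-of-line function, so each arm’s locals live in that function’s own frame instead of being summed into the caller’s.

```rust
// Hypothetical, heavily simplified tokenizer states.
#[derive(Clone, Copy, Debug, PartialEq)]
enum State {
    Data,
    TagOpen,
}

struct Tokenizer {
    state: State,
    tags_opened: usize,
    text_bytes: usize,
}

impl Tokenizer {
    fn new() -> Tokenizer {
        Tokenizer { state: State::Data, tags_opened: 0, text_bytes: 0 }
    }

    // The outer match only dispatches, so its own frame stays small.
    fn step(&mut self, c: char) {
        match self.state {
            State::Data => self.step_data(c),
            State::TagOpen => self.step_tag_open(c),
        }
    }

    // Kept out of line so its temporaries are not merged into `step`'s frame.
    #[inline(never)]
    fn step_data(&mut self, c: char) {
        if c == '<' {
            self.tags_opened += 1;
            self.state = State::TagOpen;
        } else {
            self.text_bytes += c.len_utf8();
        }
    }

    #[inline(never)]
    fn step_tag_open(&mut self, c: char) {
        if c == '>' {
            self.state = State::Data;
        }
    }

    fn feed(&mut self, input: &str) {
        for c in input.chars() {
            self.step(c);
        }
    }
}
```

The cost is an extra call per state change and code that is harder to read as a single state machine.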
I’ve done my best to eliminate a few of these memcpys with a custom Option type (which doesn’t zero the payload when set to None, and whose take is just a ptr::read) and by taking care never to return structures larger than a word or two. These transformations are making my code look less and less like C/C++ and more and more like my allocation-free OCaml. It’s ugly, and should be unnecessary.
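A custom Option along those lines might look like the sketch below. It is an assumption-laden illustration, not the type from html5ever-zerocopy: the name `RawOption` is invented, and it leans on `std::mem::MaybeUninit`, which did not exist when this post was written. Setting it to None only flips a flag, and `take` is a single `ptr::read` of the payload.

```rust
use std::mem::MaybeUninit;
use std::ptr;

// Hypothetical Option replacement: a flag plus possibly-uninitialized storage.
struct RawOption<T> {
    is_some: bool,
    payload: MaybeUninit<T>,
}

impl<T> RawOption<T> {
    fn none() -> Self {
        RawOption { is_some: false, payload: MaybeUninit::uninit() }
    }

    fn some(value: T) -> Self {
        RawOption { is_some: true, payload: MaybeUninit::new(value) }
    }

    // Moves the payload out without touching the bytes left behind.
    fn take(&mut self) -> Option<T> {
        if !self.is_some {
            return None;
        }
        self.is_some = false;
        // SAFETY: the flag was set, so the payload is initialized, and clearing
        // the flag guarantees it is never read or dropped a second time.
        Some(unsafe { ptr::read(self.payload.as_ptr()) })
    }
}

impl<T> Drop for RawOption<T> {
    fn drop(&mut self) {
        if self.is_some {
            // SAFETY: the flag says the payload is still initialized and owned here.
            unsafe { ptr::drop_in_place(self.payload.as_mut_ptr()) }
        }
    }
}
```

For simplicity this `take` returns a standard `Option<T>`, which still moves the payload by value; a version that follows the “never return large structures” rule would hand the payload out some other way.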
process_token is still slow, because I’m morally against restructuring my code around the fact that “pass by value in rust is slow”. That’s horrible. It shouldn’t be slow.
Please schedule Drop reform for 1.0.