The fact that rustc zeroes structures on drop is a serious performance regression, and it makes Rust unnecessarily difficult to use for performance-sensitive work. In this post, I will show you why, and propose that drop reform happen before 1.0.
I have spent the past couple of weeks working on a zero-copy fork of html5ever. The core data structure used is an Iobuf, which looks something like this:
    // 24 bytes on x86_64. Same size as `Vec`.
    struct Iobuf {
        buf: *mut u8,
        lo_min: u32,
        lo: u32,
        hi: u32,
        hi_max: u32,
    }
where `lo_min` and `hi_max` specify the window of bytes accessible to the Iobuf, and `lo` and `hi` specify the window that we're currently dealing with for the task at hand. In this case, that task is parsing.
`Iobuf` has a `Drop` implementation associated with it, which deals with non-atomically refcounting the buffer. This means that Iobufs are cheap to clone (a 24-byte `memcpy`, a pointer chase, and an `inc`) and cheap to destroy (a pointer chase and a `dec`).
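The clone and drop costs described above can be sketched as follows. This is only a minimal illustration written against current stable Rust: the `Header` type, the `hdr` field (standing in for `buf`), and the `new`, `header`, `refcount` and `as_slice` helpers are assumptions made for the example, not the real Iobuf layout or API.

```rust
use std::cell::Cell;

// Hypothetical heap header; the real Iobuf layout is not shown in the post.
struct Header {
    refcount: Cell<usize>, // plain (non-atomic) counter
    data: Vec<u8>,
}

// The raw pointer makes this type !Send and !Sync, which is what makes a
// non-atomic refcount acceptable.
#[allow(dead_code)]
struct Iobuf {
    hdr: *mut Header, // stands in for `buf: *mut u8`
    lo_min: u32,
    lo: u32,
    hi: u32,
    hi_max: u32,
}

impl Iobuf {
    fn new(bytes: &[u8]) -> Iobuf {
        assert!(bytes.len() <= u32::MAX as usize);
        let len = bytes.len() as u32;
        let hdr = Box::into_raw(Box::new(Header {
            refcount: Cell::new(1),
            data: bytes.to_vec(),
        }));
        Iobuf { hdr, lo_min: 0, lo: 0, hi: len, hi_max: len }
    }

    fn header(&self) -> &Header {
        // Safety: `hdr` stays valid while at least one Iobuf points at it.
        unsafe { &*self.hdr }
    }

    fn refcount(&self) -> usize {
        self.header().refcount.get()
    }

    fn as_slice(&self) -> &[u8] {
        &self.header().data[self.lo as usize..self.hi as usize]
    }
}

impl Clone for Iobuf {
    fn clone(&self) -> Iobuf {
        let rc = &self.header().refcount; // the pointer chase
        rc.set(rc.get() + 1); // the `inc`
        // Copying the five fields is the small memcpy.
        Iobuf { hdr: self.hdr, lo_min: self.lo_min, lo: self.lo, hi: self.hi, hi_max: self.hi_max }
    }
}

impl Drop for Iobuf {
    fn drop(&mut self) {
        let remaining = {
            let rc = &self.header().refcount;
            rc.set(rc.get() - 1); // the `dec`
            rc.get()
        };
        if remaining == 0 {
            // Safety: this was the last handle, and `hdr` came from `Box::into_raw`.
            unsafe { drop(Box::from_raw(self.hdr)) };
        }
    }
}

fn main() {
    let a = Iobuf::new(b"hello");
    let b = a.clone();
    assert_eq!(a.refcount(), 2);
    drop(b);
    assert_eq!(a.refcount(), 1);
    assert_eq!(a.as_slice(), &b"hello"[..]);
}
```

Neither clone nor drop touches the bytes themselves; only the handle and the counter are involved.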
html5ever-zerocopy is structured to receive input one Iobuf at a time. It then, in pseudocode:
1. Clones the first Iobuf in the queue onto the stack.
2. Resizes that clone to cover a single utf-8 char.
3. Advances the original buffer (the one that was cloned) so it no longer contains that char.
   3a. Drops that buffer if it's now empty.
4. Returns the buffer representing a single char.
5. Decides which callback to call based on the current state and that character, and calls it.
6. Goto 1 until out of buffers.
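The steps above can be sketched roughly like this. It is a simplified model rather than the html5ever-zerocopy code: `Window` is a hypothetical stand-in for Iobuf built on `Rc<[u8]>`, and the sketch assumes valid UTF-8, no empty buffers in the queue, and no character split across two buffers. The callback dispatch is replaced by collecting the characters.

```rust
use std::collections::VecDeque;
use std::rc::Rc;

// Hypothetical stand-in for Iobuf: a refcounted buffer plus a window into it.
#[derive(Clone)]
struct Window {
    buf: Rc<[u8]>,
    lo: usize,
    hi: usize,
}

impl Window {
    fn new(s: &str) -> Window {
        Window { buf: Rc::from(s.as_bytes()), lo: 0, hi: s.len() }
    }
    fn bytes(&self) -> &[u8] {
        &self.buf[self.lo..self.hi]
    }
    fn is_empty(&self) -> bool {
        self.lo == self.hi
    }
}

// Length of a UTF-8 sequence, read from its leading byte (assumes valid input).
fn utf8_len(first: u8) -> usize {
    match first {
        0x00..=0x7F => 1,
        0xC0..=0xDF => 2,
        0xE0..=0xEF => 3,
        _ => 4,
    }
}

fn next_char(queue: &mut VecDeque<Window>) -> Option<Window> {
    let front = queue.front_mut()?;
    let mut ch = front.clone(); // step 1: clone the first buffer
    let n = utf8_len(front.bytes()[0]);
    ch.hi = ch.lo + n; // step 2: shrink the clone to one char
    front.lo += n; // step 3: advance the original past that char
    let now_empty = front.is_empty();
    if now_empty {
        queue.pop_front(); // step 3a: drop the exhausted buffer
    }
    Some(ch) // step 4: return the one-char buffer
}

// Steps 5 and 6: a real tokenizer would dispatch on its state here;
// this sketch just collects the characters.
fn collect_chars(inputs: &[&str]) -> Vec<String> {
    let mut queue: VecDeque<Window> =
        inputs.iter().filter(|s| !s.is_empty()).map(|s| Window::new(s)).collect();
    let mut out = Vec::new();
    while let Some(ch) = next_char(&mut queue) {
        out.push(std::str::from_utf8(ch.bytes()).unwrap().to_string());
    }
    out
}

fn main() {
    assert_eq!(collect_chars(&["hé", "!"]), vec!["h", "é", "!"]);
}
```

Every iteration creates and destroys one short-lived handle, which is why the cost of moving and dropping those handles matters so much for this design.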
This design is sound in a language with non-zeroing drops, but is totally inadequate in Rust. Here's my issue in the codegen of a single function:
    fn process_token(&mut self, t: Token) { // Token is big. Maybe 128 bytes?
        if self.opts.profile {
            let (_, dt) = time!(self.sink.process_token(t));
            self.time_in_sink += dt;
        } else {
            self.sink.process_token(t);
        }
    }
The codegen for this looks something like:
    fn process_token(*mut self, t: [u8, ..128]) {
        if self.opts.profile {
            let t_stack: [u8, ..128];
            memcpy(t_stack, t, 128);
            memset(t, 0, 128);
            // start timer
            call self.sink.process_token // expects its `t` to be on the stack
            // stop timer
            self->time_in_sink += dt;
            if t.drop_flag {
                drop_glue_Token(&mut t)
            }
        } else {
            let t_stack: [u8, ..128];
            memcpy(t_stack, t, 128);
            memset(t, 0, 128);
            call self.sink.process_token // expects its `t` to be on the stack
            if t.drop_flag {
                drop_glue_Token(&mut t)
            }
        }
    }
That little prologue of memcpy + memset shows up every. single. time. I pass `Token` by value to a function. Whoever calls `process_token` has to do it, too! There's also an unnecessary memcpy when returning, but that's just because rustc doesn't have NRVO yet.
These are not theoretical concerns. The constant memcpy + memset is the single biggest source of slowdown in html5ever-zerocopy at the moment. Copying a few hundred bytes for every token (themselves usually less than a hundred bytes in length) is slower than just `malloc`ing and passing around boxes. That's terrifying.
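To make the comparison with boxes concrete, here is a small sketch of the two calling styles. The `Token` below is a hypothetical 128-byte stand-in (the post only estimates the real size), and `by_value` and `boxed` are made-up names. It shows only what gets moved in each case, not how fast either one is.

```rust
use std::mem::size_of;

// Hypothetical stand-in: the real Token is only described as "maybe 128 bytes".
struct Token {
    payload: [u8; 128],
}

// By value: the whole 128-byte struct is moved into the callee.
fn by_value(t: Token) -> usize {
    t.payload.iter().map(|&b| b as usize).sum()
}

// Boxed: only a pointer-sized handle is moved; the payload stays on the heap.
fn boxed(t: Box<Token>) -> usize {
    t.payload.iter().map(|&b| b as usize).sum()
}

fn main() {
    assert_eq!(size_of::<Token>(), 128);
    assert_eq!(size_of::<Box<Token>>(), size_of::<usize>());

    assert_eq!(by_value(Token { payload: [1; 128] }), 128);
    assert_eq!(boxed(Box::new(Token { payload: [1; 128] })), 128);
}
```

Boxing pays for one allocation per token, but every later hand-off moves a single word instead of the full payload.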
Another issue with the current rustc implementation of Drop is that the main parsing function uses an order of magnitude more stack than necessary. All of this comes from stack-allocating < 128 bytes in each arm of the outer match (all mutually disjoint in lifetimes), where the structure had drop glue that would never need to run (it is consumed in each arm of the sub-match), but rustc wasn't smart enough to detect that. Walking around tens of kilobytes of stack on every state change was a huge waste of precious L1d, so I had to move from the "giant embedded match" design to a "giant match which calls functions marked `#[inline(never)]`" design.
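The shape of that restructuring might look something like the sketch below. The two states, the `Tokenizer` fields and the `run` helper are invented for illustration and are far simpler than a real HTML tokenizer. The point is only that each arm's body lives in its own function, so its temporaries occupy a frame that exists only while that arm runs.

```rust
// Illustrative states only; a real tokenizer has many more.
#[derive(Clone, Copy)]
enum State {
    Data,
    TagOpen,
}

struct Tokenizer {
    state: State,
    text: String,
    tag: String,
}

impl Tokenizer {
    // The outer match stays tiny: it only picks which function to call.
    fn step(&mut self, c: char) {
        match self.state {
            State::Data => self.step_data(c),
            State::TagOpen => self.step_tag_open(c),
        }
    }

    // `#[inline(never)]` asks the compiler to keep this body, and any locals
    // it needs, out of `step`'s stack frame.
    #[inline(never)]
    fn step_data(&mut self, c: char) {
        if c == '<' {
            self.state = State::TagOpen;
        } else {
            self.text.push(c);
        }
    }

    #[inline(never)]
    fn step_tag_open(&mut self, c: char) {
        if c == '>' {
            self.state = State::Data;
        } else {
            self.tag.push(c);
        }
    }
}

// Hypothetical driver: feeds every char through the state machine.
fn run(input: &str) -> (String, String) {
    let mut t = Tokenizer { state: State::Data, text: String::new(), tag: String::new() };
    for c in input.chars() {
        t.step(c);
    }
    (t.text, t.tag)
}

fn main() {
    assert_eq!(run("a<b>c"), ("ac".to_string(), "b".to_string()));
}
```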
I've done my best to eliminate a few of these `memcpy`s with a custom `Option` type (which doesn't zero the payload when set to `None`, and whose `take` is just `ptr::read`) and by taking care never to return structures larger than a word or two. These transformations are making my code look less and less like C/C++ and more and more like my allocation-free OCaml. It's ugly, and should be unnecessary.
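A custom option along those lines could be sketched as below. This is not the type used in html5ever-zerocopy: `RawOption` and its methods are hypothetical names, and the sketch is written against current stable Rust, using `MaybeUninit` for the payload storage.

```rust
use std::mem::MaybeUninit;
use std::ptr;

// Hypothetical option type: a flag plus payload storage that is never zeroed.
struct RawOption<T> {
    present: bool,
    value: MaybeUninit<T>,
}

impl<T> RawOption<T> {
    fn none() -> Self {
        RawOption { present: false, value: MaybeUninit::uninit() }
    }

    fn some(v: T) -> Self {
        RawOption { present: true, value: MaybeUninit::new(v) }
    }

    fn is_some(&self) -> bool {
        self.present
    }

    // Moves the payload out with a bare `ptr::read`; only the flag is updated,
    // and the old payload bytes are left untouched.
    fn take(&mut self) -> Option<T> {
        if !self.present {
            return None;
        }
        self.present = false;
        // Safety: the flag said the payload was initialized, and clearing the
        // flag first guarantees it is never read or dropped a second time.
        Some(unsafe { ptr::read(self.value.as_ptr()) })
    }
}

impl<T> Drop for RawOption<T> {
    fn drop(&mut self) {
        if self.present {
            // Safety: the payload is initialized and has not been moved out.
            unsafe { ptr::drop_in_place(self.value.as_mut_ptr()) };
        }
    }
}

fn main() {
    let mut slot = RawOption::some(String::from("token"));
    assert!(slot.is_some());
    assert_eq!(slot.take(), Some(String::from("token")));
    assert!(!slot.is_some());
    assert_eq!(slot.take(), None);

    let empty: RawOption<String> = RawOption::none();
    assert!(!empty.is_some());
}
```

The flag is the only state that changes on `take`, so emptying the slot costs one store plus the move of the payload itself.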
`process_token` is still slow, because I'm morally against restructuring my code around the fact that "pass by value in Rust is slow". That's horrible. It shouldn't be slow.
Please schedule Drop reform for 1.0.