The sad state of zero-on-drop

The fact that rustc zeroes structures on drop is a serious performance problem, and it makes Rust unnecessarily difficult to use for performance-sensitive work. In this post, I will show you why, and propose that drop reform happen before 1.0.

I have been spending the past couple of weeks working on a zero-copy fork of html5ever. The core data structure used is an Iobuf, which looks something like this:

// 24 bytes on x86_64. Same size as `Vec`.
struct Iobuf {
  buf:    *mut u8,
  lo_min: u32,
  lo:     u32,
  hi:     u32,
  hi_max: u32,
}

where lo_min and hi_max specify the window of bytes accessible to the Iobuf, and lo and hi specify the window that we’re currently dealing with for the task at hand. In this case, that task is parsing.

Iobuf has a Drop implementation associated with it, which deals with non-atomically refcounting the buffer. This implies that Iobufs are cheap to clone (a 24-byte memcpy, a pointer chase, and an inc) and cheap to destroy (a pointer chase and a dec).
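To make the cost model concrete, here is a minimal sketch of what that Clone/Drop pair might look like. This is an illustration only, not the real Iobuf: I assume the non-atomic refcount lives in a separate heap word, and I elide the `buf`/`lo_min`/`hi_max` fields and the actual buffer free.

```rust
// Hypothetical layout, for illustration: the shared, non-atomic refcount
// lives behind a raw pointer; the real Iobuf also carries buf/lo_min/hi_max.
struct Iobuf {
    refcount: *mut usize, // refcount for the backing buffer
    lo: u32,
    hi: u32,
}

impl Clone for Iobuf {
    // Cheap: a small memcpy, a pointer chase, and an inc.
    fn clone(&self) -> Iobuf {
        unsafe { *self.refcount += 1; }
        Iobuf { refcount: self.refcount, lo: self.lo, hi: self.hi }
    }
}

impl Drop for Iobuf {
    // Cheap: a pointer chase and a dec.
    fn drop(&mut self) {
        unsafe {
            *self.refcount -= 1;
            if *self.refcount == 0 {
                // The last owner would free the backing allocation here.
            }
        }
    }
}

fn main() {
    let rc = Box::into_raw(Box::new(1usize)); // one initial owner
    let a = Iobuf { refcount: rc, lo: 0, hi: 16 };
    let b = a.clone();
    assert_eq!(unsafe { *rc }, 2); // clone bumped the count
    drop(b);
    assert_eq!(unsafe { *rc }, 1); // drop decremented it
    drop(a);
    unsafe { drop(Box::from_raw(rc)); } // reclaim the refcount word itself
}
```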

html5ever-zerocopy is structured to receive input one Iobuf at a time. It then, in pseudocode:

  1. Clones the first Iobuf in the queue onto the stack.
  2. Resizes that buffer to cover a single utf-8 char.
  3. Advances the original, cloned buffer to no longer contain that char.
    3a. Drops that buffer if it’s now empty.
  4. Returns the buffer representing a single char.
  5. Decides which callback to call based on the current state and that character, and calls it.
  6. Goto 1 until out of buffers.
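The steps above can be sketched in miniature. In this hypothetical model, each queued Iobuf is reduced to a (lo, hi) window into one shared byte slice, with no refcounting, so only the windowing logic survives:

```rust
// Hypothetical, simplified model of steps 1-4: an "Iobuf" is just a
// (lo, hi) window into a shared byte slice.
fn next_char_window(queue: &mut Vec<(usize, usize)>, bytes: &[u8]) -> Option<(usize, usize)> {
    // 1. Clone the first buffer in the queue onto the stack.
    let (lo, hi) = *queue.first()?;
    // 2. Resize that clone to cover a single utf-8 char.
    let s = std::str::from_utf8(&bytes[lo..hi]).ok()?;
    let len = s.chars().next()?.len_utf8();
    let one_char = (lo, lo + len);
    // 3. Advance the original buffer to no longer contain that char...
    queue[0].0 = lo + len;
    // 3a. ...and drop it if it's now empty.
    if queue[0].0 == queue[0].1 {
        queue.remove(0);
    }
    // 4. Return the window representing a single char.
    Some(one_char)
}

fn main() {
    let bytes = "hé".as_bytes(); // 'h' is 1 byte, 'é' is 2
    let mut queue = vec![(0, bytes.len())];
    assert_eq!(next_char_window(&mut queue, bytes), Some((0, 1)));
    assert_eq!(next_char_window(&mut queue, bytes), Some((1, 3)));
    assert!(queue.is_empty()); // 3a fired: the emptied buffer was dropped
}
```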

This design is sound in a language with non-zeroing drops, but is totally inadequate in Rust. Here’s my issue, shown in the codegen of a single function:

fn process_token(&mut self, t: Token) { // Token is big. Maybe 128 bytes?
    if self.opts.profile {
        let (_, dt) = time!(self.sink.process_token(t));
        self.time_in_sink += dt;
    } else {
        self.sink.process_token(t);
    }
}

The codegen for this looks something like:

fn process_token(*mut self, t: [u8, ..128]) {
    if self.opts.profile {
        let t_stack: [u8, ..128];
        memcpy(t_stack, t, 128);
        memset(t, 0, 128);
        // start timer
        call self.sink.process_token // expects its `t` to be on the stack
        // stop timer
        self->time_in_sink += dt;
        if t.drop_flag {
            drop_glue_Token(&mut t)
        }
    } else {
        let t_stack: [u8, ..128];
        memcpy(t_stack, t, 128);
        memset(t, 0, 128);
        call self.sink.process_token // expects its `t` to be on the stack
        if t.drop_flag {
            drop_glue_Token(&mut t)
        }
    }
}

That little prologue of memcpy + memzero shows up every. single. time. I pass Token by-value to a function. Whoever calls process_token has to do it, too! There’s also an unnecessary memcpy when returning, but that’s just because rustc doesn’t have NRVO yet.

These are not theoretical concerns. The constant memcpy + memset is the single biggest source of slowdown in html5ever-zerocopy at the moment. Copying a few hundred bytes for every token (themselves usually less than a hundred bytes in length) is slower than just mallocing and passing around boxes. That’s terrifying.

Another issue with the current rustc implementation of Drop is that the main parsing function uses an order of magnitude more stack than necessary. All of this comes from stack-allocating < 128 bytes in each arm of the outer match (all mutually disjoint in lifetime), where each structure had drop glue that would never actually need to run (it was consumed in every arm of the sub-match), but rustc wasn’t smart enough to detect that. Walking over tens of kilobytes of stack on every state change was a huge waste of precious L1d, so I had to move from the “giant embedded match” design to a “giant match which calls functions marked #[inline(never)]” design.
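For illustration, here is a toy sketch of that restructuring, with made-up state names and scratch sizes: each big match arm becomes an #[inline(never)] function, so its large, short-lived locals occupy stack only while that arm actually runs, rather than all arms’ locals inflating the one giant frame.

```rust
// Hypothetical two-state machine standing in for the tokenizer's match.
#[derive(Debug, PartialEq)]
enum State { Data, TagOpen }

#[inline(never)]
fn run_data(out: &mut Vec<u8>) -> State {
    let scratch = [b'd'; 96]; // arm-local scratch, freed when this frame pops
    out.push(scratch[0]);
    State::TagOpen
}

#[inline(never)]
fn run_tag_open(out: &mut Vec<u8>) -> State {
    let scratch = [b't'; 96];
    out.push(scratch[0]);
    State::Data
}

// The outer match now only dispatches; it allocates none of the scratch.
fn step(state: State, out: &mut Vec<u8>) -> State {
    match state {
        State::Data => run_data(out),
        State::TagOpen => run_tag_open(out),
    }
}

fn main() {
    let mut out = Vec::new();
    let s = step(State::Data, &mut out);
    let s = step(s, &mut out);
    assert_eq!(s, State::Data);
    assert_eq!(out, vec![b'd', b't']);
}
```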

I’ve done my best to eliminate a few of these memcpys with a custom Option type (one which doesn’t zero the payload when set to None, and whose take is just a ptr::read), and by taking care to never return structures larger than a word or two. These transformations are making my code look less and less like C/C++ and more and more like my allocation-free OCaml. It’s ugly, and it should be unnecessary.
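A non-zeroing option along those lines might be sketched as follows in today’s Rust, using ManuallyDrop to suppress the automatic drop glue; the type and method names here are hypothetical, and the original presumably used the unsafe idioms of the day instead. The key point is that take is a raw ptr::read plus a flag flip, and the old payload bytes are never written back to.

```rust
use std::mem::ManuallyDrop;
use std::ptr;

// Hypothetical non-zeroing option: a discriminant flag plus a payload whose
// drop glue is suppressed; `take` never memsets the vacated payload.
struct RawOption<T> {
    present: bool,
    value: ManuallyDrop<T>,
}

impl<T> RawOption<T> {
    fn some(v: T) -> RawOption<T> {
        RawOption { present: true, value: ManuallyDrop::new(v) }
    }
    fn take(&mut self) -> Option<T> {
        if self.present {
            self.present = false;
            // Move the payload out without touching the old bytes.
            Some(unsafe { ptr::read(&*self.value) })
        } else {
            None
        }
    }
}

impl<T> Drop for RawOption<T> {
    fn drop(&mut self) {
        // Only run the payload's destructor if it was never taken.
        if self.present {
            unsafe { ManuallyDrop::drop(&mut self.value) }
        }
    }
}

fn main() {
    let mut o = RawOption::some(String::from("token"));
    assert_eq!(o.take().as_deref(), Some("token"));
    assert!(o.take().is_none()); // flag is clear: no double-read, no double-free
}
```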

process_token is still slow, because I’m morally against restructuring my code around the fact that “pass by value in rust is slow”. That’s horrible. It shouldn’t be slow.

Please schedule Drop reform for 1.0.


Is it related to the eager/late/manual drop RFCs?

If somebody would like to work on optimizing drops I imagine @pnkfelix can point you in the right direction.

There is progress, for example PR#18234

“In progress” doesn’t mean “required for 1.0”, which is what I hope to change with this post.

It’s not totally clear to me why “zero-on-drop can cause bad perf” would make it a 1.0 blocker. What’s the negative impact of it rolling out in 1.1 or whatever a month or two after 1.0? Adding more stuff to 1.0 increases the chance that 1.0 slips, and you end up with 1.0 with drop optimizations coming out at the same time that 1.1 would have been coming out with drop optimizations.


As far as I know, the most relevant issue, in terms of backwards compatibility for 1.0 (not as-implemented performance), is whether we can support dynamic drop (either in its current form or in a hypothetical "idealized" form) at the same time as removing zeroing-on-drop.

The RFC that I see as most likely to be approved for 1.0 is RFC PR 320, "Non-zeroing dynamic drops":

But there are a few gotchas that I still need to incorporate into that RFC. The handling of moving out of arrays is one such gotcha (see RFC comments). Another one that we haven't decided how to finalize is this: http://is.gd/XXBi78

Anyway, we do need to make sure the language semantics are compatible with non-zeroing drop for 1.0. We have generally been assuming that this would not necessitate actually implementing non-zeroing drop before 1.0, but as more examples like the ones in the previous paragraph arise, that assumption becomes increasingly questionable.

@nikomatsakis and I have been following up on some of the details here.

It makes it impossible for code which uses destructors to get on-par-with-C++ performance without restructuring in horrible ways. Also, most code relying on #[unsafe_no_drop_flag] today will subtly break by leaking memory when drop reform is implemented.

I agree with @sfackler. The threshold for 1.0 is “is it worth making everyone who is waiting for an API-stable Rust wait longer?” Rust 1.0 is fundamentally about making the people who are saying “I’d like to use Rust, but I want to wait until it’s stable so I don’t have to update my code” happy. The intersection of the set of people who want stability and the set of people who would happily use Rust even with codegen issues relating to destructor performance is likely quite large.


Fair.

Can it at least be high priority after 1.0?

Sure, it’s annoyed me a lot too.


Semi-related Java factoid: HotSpot creator Cliff Click claims managed language runtimes waste up to a third of the host’s memory bandwidth zeroing out objects.
