Extend io::BufRead to read multiple lines at once

A common task is to read newline-delimited source from io. io::BufRead provides a method lines(self) -> impl Iterator<Item=io::Result<String>> to do just that. While the semantics of returning individually allocated strings - one per line of input - is very convenient when using iterator-adaptors, the performance suffers considerably if millions of lines are read from the underlying reader. In my case:

  • The added overhead of allocating (usually small) strings hurts performance because we do an allocation on every single line of input. Besides: String is 24 bytes, a line is not much larger than this.
  • UTF8-decoding is done inline, which means the thread doing the io is also the thread doing the decoding. This limits the options for multithreading.
  • As the basic unit of iteration is String, using channels or other locking primitives locks on very small amount of data, adding to the overhead (this can be relieved in part by buffering multiple strings, yet the problem still persists to a large degree)

In my example runtime was around 9 seconds with ~140% cpu. Profiling shows that runtime is quenched by small buffer copies and utf-decoding in the io-thread.

I’ve written a small gadget that extends BufRead with two new methods. It reads the entire buffer from the underlying BufRead and does one additional read up to the next newline. This allows reading newline-delimited chunks, passing those buffers to other threads and doing the utf8-decoding there.

Plugging this into my use-case and doing very minimal changes on the threads receiving input from the io-reader, my runtime is now close to one second with >650% cpu.

Benchmarks show that reading a Vec<u8> using chunked-lines is ~3x faster than using BufRead::lines().

Since this is such an easy performance fix for a lot of use cases where line-delimited io is used, I put it up here for consideration :slight_smile:

Here is the synthetic benchmark, comparing iterator of String to comparing sub-iterators of &str on a larger buffer.

1 Like

Do you have any guesses at what fraction of the performance cost is due to this part specifically? Because this one we know can do better once we get GATs and thus streaming iterators, so that a new method can return &str to the inside of the iterator and thus not need to re-do the allocations.

In my use-case the lines get send to a crossbeam::channel, so its a new allocation every single time. I can’t tell what the exact runtime-fraction on the allocation itself is, the problem is the high-paced handoff via the channel.

The chunked-iterator’s semantic are “read one buffer’s worth of data and then up to the next newline”, which is a much bigger chunk of data, it’s size usually unde the control of the BufRead's creator; if the next newline is not too far aways, we can do this in one allocation. My io-thread is now doing io all the time, because it does not get stuck feeding a channel.

My thought is that is an easy performance trap, even without the threading, which BufRead can easily solve. The gist being "if you just need lines, use .lines(). If you need lots and lots and lots of lines, use .chunks().lines()":

This is not to diminish what you’ve done, but it seems like a shame if we can’t make stream.lines() be the fastest way to process a file line by line no matter how big it is.

2 Likes