A common task is to read newline-delimited input from an io source. `io::BufRead` provides a method `lines(self) -> impl Iterator<Item=io::Result<String>>` to do just that. While returning individually allocated strings, one per line of input, is very convenient when using iterator adaptors, performance suffers considerably if millions of lines are read from the underlying reader. In my case:
- The added overhead of allocating (usually small) strings hurts performance because we do an allocation on every single line of input. Besides, a `String` is itself 24 bytes (on 64-bit targets), and a line is not much larger than this.
- UTF-8 decoding is done inline, which means the thread doing the io is also the thread doing the decoding. This limits the options for multithreading.
- As the basic unit of iteration is `String`, using channels or other locking primitives locks on very small amounts of data, adding to the overhead (this can be relieved in part by buffering multiple strings, yet the problem still persists to a large degree).
In my example, runtime was around 9 seconds at ~140% CPU. Profiling shows that the runtime is dominated by small buffer copies and UTF-8 decoding in the io thread.
I’ve written a small gadget that extends `BufRead` with two new methods. It reads the entire buffer from the underlying `BufRead` and does one additional read up to the next newline. This allows reading newline-delimited chunks, passing those buffers to other threads, and doing the UTF-8 decoding there.
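To make the idea concrete, here is a minimal sketch of the technique, not the actual implementation. The trait name `ChunkedLines`, the method `next_chunk`, and the helper `count_lines_threaded` are all hypothetical names chosen for illustration; only `fill_buf`, `consume`, and `read_until` are real `BufRead` methods. The worker uses lossy decoding purely to keep the example short.

```rust
use std::io::{self, BufRead};
use std::sync::mpsc;
use std::thread;

/// Hypothetical extension trait; names are illustrative only.
pub trait ChunkedLines: BufRead {
    /// Returns one chunk that ends on a line boundary (or at EOF),
    /// or `Ok(None)` once the reader is exhausted.
    fn next_chunk(&mut self) -> io::Result<Option<Vec<u8>>> {
        // Copy everything the reader currently has buffered in one go.
        let mut chunk = self.fill_buf()?.to_vec();
        if chunk.is_empty() {
            return Ok(None); // EOF
        }
        let n = chunk.len();
        self.consume(n);
        // One additional read up to the next newline, so that no line
        // is split across two chunks.
        if chunk.last() != Some(&b'\n') {
            self.read_until(b'\n', &mut chunk)?;
        }
        Ok(Some(chunk))
    }
}

impl<R: BufRead + ?Sized> ChunkedLines for R {}

/// Hypothetical consumer: the io thread only ships raw bytes, and a
/// worker thread does the UTF-8 decoding and the per-line work.
pub fn count_lines_threaded<R: BufRead>(mut reader: R) -> io::Result<usize> {
    let (tx, rx) = mpsc::channel::<Vec<u8>>();
    let worker = thread::spawn(move || {
        rx.iter()
            // Lossy decoding is an assumption made to keep the sketch short.
            .map(|chunk| String::from_utf8_lossy(&chunk).lines().count())
            .sum::<usize>()
    });
    while let Some(chunk) = reader.next_chunk()? {
        if tx.send(chunk).is_err() {
            break; // worker is gone
        }
    }
    drop(tx); // close the channel so the worker's iterator ends
    Ok(worker.join().expect("worker thread panicked"))
}
```

Because every chunk ends on a newline byte (or at EOF), and `\n` never occurs inside a multi-byte UTF-8 sequence, each chunk can be decoded independently on the receiving thread. The channel then carries one message per buffer rather than one per line.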
Plugging this into my use case, with very minimal changes to the threads receiving input from the io reader, my runtime is now close to one second at >650% CPU.
Benchmarks show that reading a `Vec<u8>` using chunked-lines is ~3x faster than using `BufRead::lines()`.
Since this is such an easy performance fix for a lot of use cases where line-delimited io is used, I put it up here for consideration.
Here is the synthetic benchmark, comparing an iterator of `String` to sub-iterators of `&str` on a larger buffer.