Since tokio
seems set to replace the way we do IO in Rust, I think we should
take the time to reflect on the mistakes of the past and see if we can use this
as an opportunity to correct them.
To me, there is one thing in particular that sticks out as a mistake -
io::Error
is a horrible error type that is extremely difficult to handle
correctly, and it spreads horrible error-handling through anything it touches.
To justify this assertion, let’s look at Rust when error-handling is done right.
Error handling in Rust - the ideal
Rust’s enum
s offer a perfect solution for describing the return types of
functions that can fail. More generally, if a function can result in any of n
different kinds of thing, you should use an enum with n
variants for its
return type. For example, take HashMap's get method (signature simplified):
fn get(&self, key: &K) -> Option<&V>
When we try to get something from a hash map, there are two different things
that can result. Either we have an item corresponding to the given key (and we
return Some(&item)), or we don’t (and we return None). The Option<&V>
return type faithfully represents the semantics of the function and forces us
to handle all, and only, those possibilities that can result from calling it.
To appreciate how beautiful this is, have a look at this method in action:
match hash_map_of_ints.get(key) {
    Some(23) => ..,
    Some(..) => ..,
    None => ..,
};
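For anyone who wants to compile that, here is a minimal sketch of the same three-way match. The `describe` function, the `ages` map and the returned labels are names I made up for illustration; only `HashMap::get` and `Option` come from the standard library.

```rust
use std::collections::HashMap;

/// Describes what a lookup found. `describe` and its labels are illustrative only.
fn describe(ages: &HashMap<&str, i32>, name: &str) -> String {
    match ages.get(name) {
        // The key is present and its value is exactly 23.
        Some(23) => String::from("exactly 23"),
        // The key is present with any other value.
        Some(other) => format!("some other age: {}", other),
        // The key is absent; leaving this arm out is a compile error.
        None => String::from("no entry"),
    }
}
```

Deleting any one of the three arms makes the compiler reject the match as non-exhaustive, which is exactly the property being praised here.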
Now contrast this with JavaScript:
switch (hash_map_of_ints[key]) {
    case 23: ...
    case "lol i'm a string": ...
}
JavaScript doesn’t force us to consider the possibility that the map doesn’t contain the key. It also allows us to handle absurd cases such as a map of ints containing a string. This is called dynamic typing, and it sucks.
Another Rust type that does error handling properly is SyncSender.
Suppose I try to send some value val on a SyncSender using try_send. There are
three things this can return:
Ok(())
Err(TrySendError::Full(val))
Err(TrySendError::Disconnected(val))
Note that, even though this is a two-tiered enum where the error is a separate
type, the return type still manages to capture the semantics of the function.
In the event that sending fails, it could either be because the channel is full
or because it is disconnected. These correspond precisely to the two Err
possibilities that we’re given.
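As a rough sketch of what handling all three results looks like in practice: the `SendOutcome` enum and the `send_or_report` function below are hypothetical names of mine, while `SyncSender::try_send` and `TrySendError` are the real standard library items.

```rust
use std::sync::mpsc::{SyncSender, TrySendError};

/// What the caller decided to do after attempting a send (hypothetical type).
#[derive(Debug, PartialEq)]
enum SendOutcome {
    Sent,
    RetryLater(u32),
    GiveUp(u32),
}

fn send_or_report(tx: &SyncSender<u32>, val: u32) -> SendOutcome {
    match tx.try_send(val) {
        Ok(()) => SendOutcome::Sent,
        // The channel is only temporarily full; we get the value back to retry.
        Err(TrySendError::Full(val)) => SendOutcome::RetryLater(val),
        // The receiver is gone for good; retrying would be pointless.
        Err(TrySendError::Disconnected(val)) => SendOutcome::GiveUp(val),
    }
}
```

Both error variants hand the unsent value back, so neither branch loses data, and there is no catch-all arm because there is nothing left to catch.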
It would rarely make sense to ignore the value of a TrySendError. Whether the
channel is temporarily full or permanently disconnected is usually going to
determine what we do next, so most code will match on this error and respond
accordingly. Horrible code, however, might just stick a ? on the Result and
send the TrySendError up the stack, where all the higher-level code with no
insight into the lower-level channel-based implementation will be completely
unable to make sense of it. Thankfully, the design of TrySendError guides
users away from doing this.
The truism that bears repeating here is that exceptions are not exceptional. If
a function call can fail in n different ways then you should generally have
n different branches to handle those cases. It only makes sense to forward an
error if something higher-level is expected to be able to handle it, or if it’s
critical enough that it needs to kill the whole program or be reported to the
user.
Unfortunately, types like HashMap
and SyncSender
are the exceptions to the
rule…
Useless error types - Rust in practice
Now that we’ve had a look at how error handling can be done right, let’s look at how it’s usually done instead.
io::Error is, in some sense, the canonical useless error type in Rust. It’s
canonical in the sense that it is both the most common example of, and a major
cause of, Rust’s useless error types.
One function that uses io::Error is File::open. Let’s see it in action:
let file = match File::open(path) {
    Ok(file) => file,
    Err(e) => ... // what goes here?
};
The problem is: how the hell are we meant to handle this error? The kinds of
error that io::Error can report (the variants of io::ErrorKind) include things
like ConnectionRefused and InvalidInput, which look clearly nonsensical in this
context. Does that mean it’s safe to use unreachable!() on them? Are you brave
enough to do that? Others, like NotFound, clearly can occur and we can
explicitly handle them, but how do you know once you’ve handled all the cases
that are actually possible? The fact that this question makes sense shows that
io::Error is not doing its job as a type.
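To make the complaint concrete, here is a small sketch of what "handling" this error tends to look like. `OpenOutcome` and `try_open` are names I invented for the example; `File::open`, `io::Error::kind` and `ErrorKind` are the real standard library API.

```rust
use std::fs::File;
use std::io::ErrorKind;
use std::path::Path;

/// Hypothetical summary of what the caller could work out about a failed open.
#[derive(Debug, PartialEq)]
enum OpenOutcome {
    Opened,
    Missing,
    Forbidden,
    Unknown(ErrorKind),
}

fn try_open(path: &Path) -> OpenOutcome {
    match File::open(path) {
        Ok(_file) => OpenOutcome::Opened,
        Err(e) => match e.kind() {
            ErrorKind::NotFound => OpenOutcome::Missing,
            ErrorKind::PermissionDenied => OpenOutcome::Forbidden,
            // ErrorKind is #[non_exhaustive], so the compiler insists on a
            // catch-all arm; nothing says which kinds can really reach it.
            other => OpenOutcome::Unknown(other),
        },
    }
}
```

The catch-all arm is the whole problem in miniature: the compiler requires it, but the type gives no hint about whether it is dead code or the most important branch in the function.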
In practice what nearly all Rust code does is treat io::Error as opaque and
throw it up the stack. Even if people check for specific cases (which they
usually don’t) they still propagate an io::Error in order to handle any mystery
error cases that they didn’t anticipate. Often this means wrapping io::Error
in some other error type with a variant called Io or something. When this happens
it spreads the disease of opaque non-handle-able-ness into this wrapper type as
well and, at this point, most crate authors give up and use this single opaque
type throughout their library. The alternative is to have a plethora of error
types for different functions, but since they would all be somewhat opaque and
un-handle-able, it’s hardly worth the effort to not just combine them into a
single type. While it’s true that these types usually allow you to manually
test for the relevant error conditions, they don’t tell you which conditions
these are, the type system doesn’t force you to handle them, and the type
system does force you to either plumb around errors that can never occur or add
a lot of risky panics to your code. When it turns out there really are error conditions
that the programmer didn’t anticipate, these can often crash the entire program as
they get propagated all the way back up to main (since nowhere else in the
stack knew how to handle them either).
In many cases though it’s not even true that you can test for the error
conditions you need. Look at the signature of tokio::io::copy:
pub fn copy<R, W>(reader: R, writer: W) -> Copy<R, W>
where
    R: AsyncRead,
    W: AsyncWrite;
// Copy<R, W> implements Future<Item = _, Error = io::Error>
This function doesn’t give you any way to tell whether it was the reader or
the writer that failed. Because both the reader and the writer throw the
same useless, opaque error, the tokio
authors decided to just squash these
errors together, throwing away important and relevant information in the
process. This would have been less inviting, or even impossible, if the reader and
writer had more specific error types. Again, this shows how opaque error
handling tends to spread like a virus.
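For comparison, here is a sketch of a copy function that keeps the two failure sources apart. It is a synchronous stand-in built on std::io's Read and Write rather than tokio's async traits, and `CopyError` and `copy_tracking_source` are hypothetical names, not anything tokio provides.

```rust
use std::io::{self, Read, Write};

/// Hypothetical error type that records which side of the copy failed.
#[derive(Debug)]
enum CopyError {
    Read(io::Error),
    Write(io::Error),
}

/// Synchronous stand-in for an async copy; returns the number of bytes copied.
fn copy_tracking_source<R: Read, W: Write>(
    mut reader: R,
    mut writer: W,
) -> Result<u64, CopyError> {
    let mut buf = [0u8; 4096];
    let mut total = 0u64;
    loop {
        let n = match reader.read(&mut buf) {
            // A zero-length read means the reader is exhausted.
            Ok(0) => return Ok(total),
            Ok(n) => n,
            // Interrupted reads are conventionally retried.
            Err(ref e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(CopyError::Read(e)),
        };
        // Tag write failures separately so the caller can tell them apart.
        writer.write_all(&buf[..n]).map_err(CopyError::Write)?;
        total += n as u64;
    }
}
```

Even with io::Error still sitting inside each variant, a caller can now at least decide whether to reconnect the source or the sink.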
io::Error is worse than just opaque though. In some cases it doesn’t even do
its job as a programming abstraction. For example, the whole point of the
standard library networking APIs is to provide a portable abstraction over the
networking APIs of the various platforms it runs on. But there are cases where
the same error can cause different io::Errors to be returned on different
platforms, so that platform-specific code (and platform-specific programmer
knowledge) is needed for even the simplest error handling.
tokio
can and should strive to do better than this.
Sane error handling for tokio
Earlier I used File::open
as an example of poor error design. In a sane
world, what might the error type of File::open
look like?
What follows is my first approximation of an answer. Note that this was put
together by just looking at the docs for open(2)
and CreateFileW
on Linux and
Windows only. I’m aware that these docs are likely to be incomplete and so some
cases are bound to be missing, and that tokio
needs to support other
operating systems as well. The idea here is to have a starting point for the
sake of argument.
// Proposed signature for File::open:
// pub fn open<P: AsRef<Path>>(path: P) -> Result<File, OpenFileError>;
use failure::Fail; // trait and derive macro from the failure crate
// Top level: the two broad ways that opening a file can fail.
#[derive(Debug, Fail)]
pub enum OpenFileError {
    #[fail(display = "{}", _0)]
    FileAccess(FileAccessError),
    #[fail(display = "{}", _0)]
    ResourceLimit(ResourceLimitError),
}
#[derive(Debug, Fail)]
pub enum FileAccessError {
    #[fail(display = "{}", _0)]
    NotFound(NotFoundError),
    #[fail(display = "{}", _0)]
    FileUnreadable(FileUnreadableError),
}
// Platform-specific detail lives in the leaf types below.
#[cfg(unix)]
#[derive(Debug, Fail)]
pub enum NotFoundError {
    #[fail(display = "is a directory")]
    IsADirectory,
    #[fail(display = "too many symbolic links")]
    TooManySymbolicLinks,
    #[fail(display = "filename too long")]
    NameTooLong,
    #[fail(display = "parent directory is not a directory")]
    NotADirectory,
}
#[cfg(windows)]
#[derive(Debug, Fail)]
#[fail(display = "file not found")]
pub struct NotFoundError;
#[derive(Debug, Fail)]
pub enum FileUnreadableError {
    #[fail(display = "permission denied")]
    PermissionDenied,
    #[fail(display = "{}", _0)]
    Other(OtherFileUnreadableError),
}
#[cfg(unix)]
#[derive(Debug, Fail)]
pub enum OtherFileUnreadableError {
    #[fail(display = "no such device")]
    NoDevice,
}
#[cfg(windows)]
#[derive(Debug, Fail)]
pub enum OtherFileUnreadableError {
    #[fail(display = "sharing violation")]
    SharingViolation,
}
#[derive(Debug, Fail)]
pub enum ResourceLimitError {
    #[fail(display = "out of memory")]
    OutOfMemory,
    #[fail(display = "{}", _0)]
    FileDescriptorLimit(FileDescriptorLimitError),
}
#[cfg(unix)]
#[derive(Debug, Fail)]
pub enum FileDescriptorLimitError {
    #[fail(display = "process file descriptor limit hit")]
    ProcessLimitHit,
    #[fail(display = "system file descriptor limit hit")]
    SystemLimitHit,
}
#[cfg(windows)]
#[derive(Debug, Fail)]
#[fail(display = "process file handle limit hit")]
pub struct FileDescriptorLimitError;
The first thing to note is that there are, broadly, two ways opening a file can fail. Either the file can’t be accessed on the filesystem, or we’ve hit a system resource limit. This difference is important. In the former case we probably want to notify the user; in the latter we need to crash or shed load.
By presenting the user with just two variants we can encourage them to think about how to handle this error and not just mindlessly propagate it. Since resource limit problems are common to lots of IO operations this even enables us to come up with generic ways of handling them. For example, we could write a function that retries operations with exponential back-off waiting for file descriptors to become available:
use std::future::Future;
use std::time::Duration;
// Implemented by any error that may turn out to be a resource limit error.
trait MaybeResourceLimitError {
    type OtherError;
    fn try_into_resource_limit_error(self) -> Result<ResourceLimitError, Self::OtherError>;
}
impl MaybeResourceLimitError for OpenFileError {
    type OtherError = FileAccessError;
    fn try_into_resource_limit_error(self) -> Result<ResourceLimitError, FileAccessError> {
        match self {
            OpenFileError::FileAccess(e) => Err(e),
            OpenFileError::ResourceLimit(e) => Ok(e),
        }
    }
}
// Retries `f` with exponential back-off for as long as it fails on a file
// descriptor limit. Any other error is returned with its narrower type.
async fn retry_waiting_for_available_file_descriptors<T, E, F, Fut>(mut f: F)
    -> Result<T, <E as MaybeResourceLimitError>::OtherError>
where
    F: FnMut() -> Fut,
    Fut: Future<Output = Result<T, E>>,
    E: MaybeResourceLimitError,
{
    let mut duration = Duration::from_millis(1);
    loop {
        match f().await {
            Ok(t) => return Ok(t),
            Err(e) => match e.try_into_resource_limit_error() {
                Ok(ResourceLimitError::OutOfMemory) => panic!("out of memory"),
                Ok(ResourceLimitError::FileDescriptorLimit(..)) => {
                    tokio::time::sleep(duration).await;
                    duration *= 2;
                },
                Err(e) => return Err(e),
            }
        }
    }
}
This is an example of how the structure of OpenFileError allows us to
de-structure it and extract a more specific error once we handle a specific
case, and to do so in a way that is generic and applicable to other errors.
If the file cannot be accessed on the filesystem, FileAccessError again
subdivides the error into two possible cases: either no file could be found at
that location or the file is unreadable. If we want to know in what sense no
file could be found at that location, or why the file is unreadable, we can
drill down even further into NotFoundError or FileUnreadableError, though
at this point we’re getting into operating-system-specific errors.
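Here is a sketch of what that drill-down could look like from the caller's side. To keep it short and self-contained I've reduced the leaf types to unit structs and dropped the failure derives, so these are simplified stand-ins for the proposed types; the `Action` enum and `decide` function are purely hypothetical caller policy.

```rust
// Simplified stand-ins for the proposed types; the leaves are unit structs
// here purely to keep the sketch short.
#[derive(Debug)]
struct NotFoundError;
#[derive(Debug)]
struct FileUnreadableError;
#[derive(Debug)]
struct ResourceLimitError;

#[derive(Debug)]
enum FileAccessError {
    NotFound(NotFoundError),
    FileUnreadable(FileUnreadableError),
}

#[derive(Debug)]
enum OpenFileError {
    FileAccess(FileAccessError),
    ResourceLimit(ResourceLimitError),
}

/// Hypothetical caller-side policy, chosen only for illustration.
#[derive(Debug, PartialEq)]
enum Action {
    CreateDefaultConfig,
    TellUser,
    ShedLoad,
}

fn decide(err: OpenFileError) -> Action {
    match err {
        // Stop at the top level when the broad category is all we need.
        OpenFileError::ResourceLimit(_) => Action::ShedLoad,
        // Drill one level down only where the distinction matters to us.
        OpenFileError::FileAccess(FileAccessError::NotFound(_)) => Action::CreateDefaultConfig,
        OpenFileError::FileAccess(FileAccessError::FileUnreadable(_)) => Action::TellUser,
    }
}
```

There is no wildcard arm, and the platform-specific leaves are never inspected, so this code would stay portable however those leaves are defined on each OS.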
Generally we should try to make available all the error information provided by the operating system. In cases where OSes differ in the level of information they provide we should bury this information behind a least-common-denominator error type so that people only need to write platform-specific code when they actually care about platform-specific behaviour.
As I’ve said, the error spec given above is definitely incomplete. For as long
as this is the case tokio
will need to panic and request a bug report
whenever it encounters an OS error code that it doesn’t expect. While these
holes get plugged, and as support for other OSes gets added, the error spec
will expand and evolve. By stratifying the errors into lots of different levels
by specificity we can keep these changes away from the top-level errors that
people are likely to use in practice. As such, we should be able to start with
something less-than-perfect and evolve it while causing minimal damage to the
tokio
ecosystem (and whatever we start with is bound to be better than what
we currently have). In this way we can finally replace the god-awful,
non-portable, operating system C API integer error codes with something sane,
and by doing so create a knock-on effect that causes saner error handling to
spread throughout the rest of the Rust ecosystem.
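As a very rough sketch of the "panic and request a bug report" idea: the function below classifies a handful of raw Linux errno values into the two top-level categories and panics on anything it doesn't recognise. The `OpenFailure` enum and `classify_open_errno` are hypothetical, the errno numbers are hardcoded Linux values so the example needs no external crate, and a real implementation would cover far more codes.

```rust
/// Top-level classification only; a stand-in for the proposed OpenFileError.
#[derive(Debug, PartialEq)]
enum OpenFailure {
    FileAccess,
    ResourceLimit,
}

// Linux errno values, hardcoded here as an assumption of this sketch.
const ENOENT: i32 = 2; // no such file or directory
const ENOMEM: i32 = 12; // out of memory
const EACCES: i32 = 13; // permission denied
const ENFILE: i32 = 23; // system-wide open file limit hit
const EMFILE: i32 = 24; // per-process open file limit hit

fn classify_open_errno(code: i32) -> OpenFailure {
    match code {
        ENOENT | EACCES => OpenFailure::FileAccess,
        ENOMEM | ENFILE | EMFILE => OpenFailure::ResourceLimit,
        // A hole in the error spec: fail loudly so that it gets reported.
        other => panic!(
            "unexpected OS error code {} from open; please file a bug report",
            other
        ),
    }
}
```

As the spec fills in, new codes move out of the panic arm and into one of the existing categories, so callers matching on the top level would not need to change.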
I expect identifying and classifying all these possible error conditions to be
an annoying, laborious, and never-ending task. However, by taking on the
responsibility of this task the tokio authors can spare every other user of
tokio from partially and haphazardly going through the same process, or
worse, not going through it and instead writing buggy code and exporting their
own useless error types.
I urge the Rust/tokio devs to consider this. io::Error is a wart. Let’s take
this one opportunity we have to freeze it off.