Pre-RFC: String from ASCII (not allowing UTF-8)

I would appreciate feedback for this idea. It's not really anything big but it could potentially be helpful. Some explanations of things may not be quite complete yet.

  • Feature Name: string_from_ascii
  • Start Date: Pre-RFC 05/02/21

Summary

Create the methods String::from_ascii and str::from_ascii{,_mut} which create a String or a str (respectively) from bytes containing only valid ASCII characters.

Motivation

UTF-8 is helpful, but in some cases we don't need it. When parsing strings from bytes, there are times when you do not need the extra complexity of validating UTF-8. While String::from_utf8_unchecked and str::from_utf8_unchecked exist, they do not check that the bytes are valid ASCII. Thus, other than the user doing their own ASCII validation and then calling the unsafe variants, there is no way to turn an ASCII bytestring into a string without potentially allowing non-ASCII UTF-8 characters, which some domains do not want.

Guide-level explanation

These explanations are fairly similar to str::from_utf8 and String::from_utf8, as they are very similar functions.

core::str::from_ascii

Converts a slice of bytes to a string slice, validating whether or not they are valid ASCII. As all ASCII is valid UTF-8, this string will be valid UTF-8 and thus a valid string slice (&str). Do note, however, that not all UTF-8 is valid ASCII.

If you are sure that a byte slice is valid UTF-8, and you don't need to validate that it is also valid ASCII, there is an unsafe version of from_utf8, from_utf8_unchecked, which has the same behavior but skips the check. If you need to accept any valid UTF-8 rather than only ASCII, use from_utf8.

If you need a String instead of a &str, consider String::from_ascii.

Because you can stack-allocate a [u8; N], and you can take a &[u8] of it, this function is one way to have a stack-allocated string. There is an example of this in the examples section below.

Errors

Returns Err if the slice is not ASCII with a description as to why the provided slice is not ASCII.

Examples

Basic usage:

use std::str;

// some bytes, in a vector
let hello = vec![104, 101, 108, 108, 111];

// We know these bytes are valid ASCII, so just use `unwrap()`.
let hello = str::from_ascii(&hello).unwrap();

assert_eq!("hello", hello);

Incorrect ASCII, but valid UTF-8:

use std::str;

// some bytes, in a vector
let sparkle_heart = vec![240, 159, 146, 150];

// We know these bytes are valid UTF-8, but they are not valid ASCII.
assert!(str::from_ascii(&sparkle_heart).is_err());

Incorrect ASCII as well as UTF-8:

use std::str;

// some invalid bytes, in a vector
let sparkle_heart = vec![0, 159, 146, 150];

assert!(str::from_ascii(&sparkle_heart).is_err());

See the docs for AsciiError for more details on the kinds of errors that can be returned.

A "stack allocated string":

use std::str;

// some bytes, in a stack-allocated array
let hello = [104, 101, 108, 108, 111];

// We know these bytes are valid, so just use `unwrap()`.
let hello = str::from_ascii(&hello).unwrap();

assert_eq!("hello", hello);

TODO: String::from_ascii, core::str::AsciiError, alloc::string::FromAsciiError

Reference-level explanation

These new functions can simply call the existing from_utf8_unchecked functions, after having checked that every byte in the slice is less than or equal to 127 (the limit of US-ASCII).
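As a rough illustration of the approach described above, here is a minimal sketch written as a free function rather than an inherent method. The `AsciiError` type, its `valid_up_to` field, and the accessor are assumptions made for this example only (the proposal leaves the error type as a TODO); the only existing APIs used are `u8::is_ascii` and `str::from_utf8_unchecked`.

```rust
use std::fmt;
use std::str;

/// Hypothetical stand-in for the proposed `AsciiError`; not an existing std type.
#[derive(Debug, PartialEq, Eq)]
pub struct AsciiError {
    /// Index of the first byte that is greater than 127.
    valid_up_to: usize,
}

impl AsciiError {
    /// Number of leading bytes that were valid ASCII (assumed accessor,
    /// modelled loosely on `Utf8Error::valid_up_to`).
    pub fn valid_up_to(&self) -> usize {
        self.valid_up_to
    }
}

impl fmt::Display for AsciiError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "non-ASCII byte at index {}", self.valid_up_to)
    }
}

/// Sketch of `str::from_ascii`: check every byte, then skip UTF-8 validation.
pub fn from_ascii(bytes: &[u8]) -> Result<&str, AsciiError> {
    match bytes.iter().position(|b| !b.is_ascii()) {
        Some(index) => Err(AsciiError { valid_up_to: index }),
        // SAFETY: every byte is <= 127, and all ASCII is valid UTF-8.
        None => Ok(unsafe { str::from_utf8_unchecked(bytes) }),
    }
}
```

With this sketch, `from_ascii(&[104, 101, 108, 108, 111])` yields `Ok("hello")`, while the sparkle-heart bytes from the earlier examples are rejected at index 0. A real implementation would probably use a vectorised ASCII check rather than a byte-by-byte scan.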

Drawbacks

Checking that characters are valid ASCII is already simple, and it may just be possible for users to do this themselves, then calling from_utf8_unchecked to prevent further overhead.
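For comparison, this is roughly what the do-it-yourself route looks like today. Both helper names are hypothetical and exist only for this example; they rely on the existing `[u8]::is_ascii`, `String::from_utf8_unchecked`, and `str::from_utf8`.

```rust
/// Hypothetical user-side helper: takes ownership of the bytes and hands
/// them back unchanged if any byte is not ASCII.
pub fn ascii_to_string(bytes: Vec<u8>) -> Result<String, Vec<u8>> {
    if bytes.is_ascii() {
        // SAFETY: the check above guarantees every byte is <= 127,
        // and all ASCII is valid UTF-8.
        Ok(unsafe { String::from_utf8_unchecked(bytes) })
    } else {
        Err(bytes)
    }
}

/// Hypothetical fully safe variant: pays for a second (UTF-8) validation
/// pass in exchange for avoiding `unsafe`.
pub fn ascii_to_str_safe(bytes: &[u8]) -> Option<&str> {
    if bytes.is_ascii() {
        std::str::from_utf8(bytes).ok()
    } else {
        None
    }
}
```

The first helper shows the `unsafe` block that each caller currently has to write and justify; the second avoids `unsafe` at the cost of validating the input twice.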

Rationale and alternatives

This design is relatively simple to implement compared to something like the ascii crate, which has string types which are only ASCII. Not doing this would probably have low impact, but it would make code for things such as binary formats (which sometimes only support ASCII byte strings) more complex (and/or more unsafe) in order to be more efficient, when the code to do so could just be abstracted out into the standard library quite easily.

Prior art

Using an example string of ABCD ([65, 66, 67, 68]), the following languages are just a few popular languages which have this feature:

  • Java: new String(new byte[] {65, 66, 67, 68}, "US-ASCII");. In the case of an invalid character, U+FFFD will be added.
  • Python 3: bytes([65, 66, 67, 68]).decode('ascii'). In the case of an invalid character, an exception is raised.
  • NodeJS: Buffer.from(new Uint8Array([65, 66, 67, 68])).toString('ascii'). Bytes are interpreted modulo 128.
  • C/C++ do not have natively supported UTF-8 string types like Rust does. Instead, they just use byte strings which are usually interpreted as only ASCII.

Unresolved questions

  • Should these functions be in alloc::string::String and core::str or inside of core::ascii?
  • Should there be a str::from_ascii_unchecked, as it does the same thing as str::from_utf8_unchecked?
  • Should we create our own versions of core::str::Utf8Error and alloc::string::FromUtf8Error, or simply use the existing ones?

Future possibilities

If this RFC is merged and these functions are added as part of alloc::string::String and core::str, it may be worth renaming core::str::from_utf8_unchecked to something like core::str::from_bytes_unchecked. It may also be worth converting core::str::Utf8Error and alloc::string::FromUtf8Error into a general FromBytesError type (although there would have to be one in core and another in alloc, because FromUtf8Error has methods dealing with alloc::vec::Vec).

My instinct here is that if it's important that it's just ASCII, then it might as well use an AsciiStr so you can know that on the consumption side. And if it's not important on the consumption side, this feels like a very small improvement -- UTF-8 validation tends to already have a fast path for ASCII, so it's not clear to me that this would even be materially faster in the normal case.


Seconding this. I'd rather we just include the ascii crate in core if anything.

I don't think I'm quite convinced by the motivation. Or at least, the section doesn't seem complete. Is this about performance? If so, then I would expect to see benchmarks. And also an exploration of why this performance problem cannot be fixed in existing APIs.


A minor note, I think your first and fourth examples have a typo: they both define hello but refer to sparkle_heart in the call to from_ascii.

Yeah, on further thought I guess this isn't really that helpful.