Pre-RFC: String from ASCII (rejecting non-ASCII UTF-8)

I would appreciate feedback on this idea. It's not a big feature, but it could be helpful. Some of the explanations may not be quite complete yet.

  • Feature Name: string_from_ascii
  • Start Date: Pre-RFC 05/02/21

Summary

Create the methods String::from_ascii and str::from_ascii{,_mut}, which create a String or a string slice (&str / &mut str, respectively) from bytes containing only valid ASCII characters.

Motivation

UTF-8 is helpful, but in some cases we don't need it. When parsing strings from bytes, there are times when you do not need the extra complexity of validating UTF-8. While String::from_utf8_unchecked and str::from_utf8_unchecked exist, they do not check that the input is valid ASCII. Thus, short of doing their own ASCII validation and then calling the unsafe variants, users have no way to turn an ASCII byte string into a string without potentially allowing non-ASCII UTF-8 characters, which may be unacceptable in some domains.

Guide-level explanation

These explanations closely follow the documentation for str::from_utf8 and String::from_utf8, as the proposed functions behave very similarly.

core::str::from_ascii

Converts a slice of bytes to a string slice after validating that the bytes are valid ASCII. As all ASCII is valid UTF-8, the result is valid UTF-8 and thus a valid string slice (&str). Note, however, that not all UTF-8 is valid ASCII.

If you are sure that the byte slice is already valid UTF-8 and you do not need ASCII validation, there is an unsafe counterpart, from_utf8_unchecked, which has the same behavior but skips the check. If you need to check that a byte slice is valid UTF-8 rather than only valid ASCII, use from_utf8.

If you need a String instead of a &str, consider String::from_ascii.

Because you can stack-allocate a [u8; N] and take a &[u8] of it, this function is one way to obtain a stack-allocated string. There is an example of this in the examples section below.

Errors

Returns Err if the slice is not valid ASCII, with a description of why the provided slice is not ASCII.

Examples

Basic usage:

use std::str;

// some bytes, in a vector
let hello = vec![104, 101, 108, 108, 111];

// We know these bytes are valid ASCII, so just use `unwrap()`.
let hello = str::from_ascii(&hello).unwrap();

assert_eq!("hello", hello);

Incorrect ASCII, but valid UTF-8:

use std::str;

// some bytes, in a vector
let sparkle_heart = vec![240, 159, 146, 150];

// We know these bytes are valid UTF-8, but they are not valid ASCII.
assert!(str::from_ascii(&sparkle_heart).is_err());

Incorrect ASCII as well as UTF-8:

use std::str;

// some invalid bytes, in a vector
let sparkle_heart = vec![0, 159, 146, 150];

assert!(str::from_ascii(&sparkle_heart).is_err());

See the docs for AsciiError for more details on the kinds of errors that can be returned.

A "stack allocated string":

use std::str;

// some bytes, in a stack-allocated array
let hello = [104, 101, 108, 108, 111];

// We know these bytes are valid ASCII, so just use `unwrap()`.
let hello = str::from_ascii(&hello).unwrap();

assert_eq!("hello", hello);

TODO: String::from_ascii, core::str::AsciiError, alloc::string::FromAsciiError

Reference-level explanation

These new functions can simply call the existing from_utf8_unchecked functions after checking that every byte in the slice is at most 127 (the upper bound of US-ASCII).
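
As a minimal sketch, str::from_ascii could look like the following. The AsciiError type here is only a placeholder, since its exact shape is left as an unresolved question below:

use std::str;

// Placeholder error type; its exact shape is an unresolved question.
#[derive(Debug)]
pub struct AsciiError {
    pub valid_up_to: usize,
}

pub fn from_ascii(v: &[u8]) -> Result<&str, AsciiError> {
    match v.iter().position(|&b| b > 127) {
        // SAFETY: every byte is at most 127, so the slice is valid
        // (single-byte) UTF-8.
        None => Ok(unsafe { str::from_utf8_unchecked(v) }),
        Some(i) => Err(AsciiError { valid_up_to: i }),
    }
}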

Drawbacks

Checking that bytes are valid ASCII is already simple, so users could just do the check themselves and then call from_utf8_unchecked to avoid any further overhead, as in the sketch below.
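
Concretely, a user can already write something like this today using only existing APIs:

use std::str;

/// Returns Some(&str) only if `bytes` is pure ASCII.
fn ascii_str(bytes: &[u8]) -> Option<&str> {
    if bytes.is_ascii() {
        // SAFETY: ASCII is a subset of UTF-8, so the bytes are valid UTF-8.
        Some(unsafe { str::from_utf8_unchecked(bytes) })
    } else {
        None
    }
}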

Rationale and alternatives

This design is relatively simple to implement compared to something like the ascii crate, which provides string types that are guaranteed to be ASCII-only. Not doing this would probably have low impact, but it would make code for things such as binary formats (which sometimes only support ASCII byte strings) more complex and/or more unsafe in the name of efficiency, when that code could easily be abstracted into the standard library.

Prior art

Using an example string of ABCD ([65, 66, 67, 68]), the following are a few popular languages that have this feature (a sketch of the proposed Rust equivalent follows the list):

  • Java: new String(new byte[] {65, 66, 67, 68}, "US-ASCII"). Invalid characters are replaced with U+FFFD.
  • Python 3: bytes([65, 66, 67, 68]).decode('ascii'). Invalid characters raise an exception.
  • Node.js: Buffer.from(new Uint8Array([65, 66, 67, 68])).toString('ascii'). Bytes are interpreted modulo 128.
  • C/C++ do not have natively supported UTF-8 string types like Rust does. Instead, they use byte strings, which are usually interpreted as plain ASCII.
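
For comparison, here is how the proposed Rust API (which does not exist yet) would handle the same bytes:

use std::str;

// Hypothetical: `str::from_ascii` as proposed above.
let abcd = str::from_ascii(&[65, 66, 67, 68]).unwrap();
assert_eq!("ABCD", abcd);

// Invalid bytes produce an error, rather than U+FFFD substitution
// (Java) or modulo-128 wrapping (Node.js).
assert!(str::from_ascii(&[65, 200]).is_err());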

Unresolved questions

  • Should these functions be in alloc::string::String and core::str or inside of core::ascii?
  • Should there be a str::from_ascii_unchecked, given that it would do the same thing as str::from_utf8_unchecked?
  • Should we create our own versions of core::str::Utf8Error and alloc::string::FromUtf8Error, or simply use the existing ones?

Future possibilities

If this RFC is merged and these functions are added to alloc::string::String and core::str, it may be worth considering renaming core::str::from_utf8_unchecked to something like core::str::from_bytes_unchecked. core::str::Utf8Error and alloc::string::FromUtf8Error could likewise be folded into a general FromBytesError type (although there would have to be one in core and another in alloc, since FromUtf8Error has methods dealing with alloc::vec::Vec).

My instinct here is that if it's important that it's just ASCII, then it might as well use an AsciiStr so you can know that on the consumption side. And if it's not important on the consumption side, this feels like a very small improvement -- UTF-8 validation tends to already have a fast path for ASCII, so it's not clear to me that this would even be materially faster in the normal case.

Seconding this. I'd rather we just include the ascii crate in core if anything.

I don't think I'm quite convinced by the motivation. Or at least, the section doesn't seem complete. Is this about performance? If so, then I would expect to see benchmarks. And also an exploration of why this performance problem cannot be fixed in existing APIs.

A minor note, your first and fourth examples originally had a typo: they both defined hello but referred to sparkle_heart in the call to from_ascii.

Yeah, on further thought I guess this isn't really that helpful.

The motivation I see is for cases where one needs to guarantee ASCII validity, not as much for speeding up UTF-8 validation:

fn new(bytes: Vec<u8>) -> Result<Self, ()> {
    if !bytes.is_ascii() {
        Err(())
    } else {
        // `bytes` is known to be ASCII here, yet `from_utf8` still
        // performs a second, redundant validation pass.
        Ok(Self(String::from_utf8(bytes).unwrap()))
    }
}

Using from_utf8_unchecked here would let us skip the unnecessary UTF-8 validation entirely, as in the sketch below. For example, the http-types crate does this when creating HTTP header names/values. Given that this is pretty common, I think it is reasonable to include String::from_ascii in the standard library.
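
For comparison, a sketch of the same constructor with the redundant pass skipped (Self stands for whatever newtype wraps the string):

fn new(bytes: Vec<u8>) -> Result<Self, ()> {
    if bytes.is_ascii() {
        // SAFETY: ASCII is a strict subset of UTF-8, so bytes that pass
        // `is_ascii` are always valid UTF-8.
        Ok(Self(unsafe { String::from_utf8_unchecked(bytes) }))
    } else {
        Err(())
    }
}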

Can you say more about why you think this is common? Like, I do not accept that as a given at all. I don't think I've once had to do it.

I think this still triggers the "but what if you want to keep that ascii-ness knowledge"?

If headers are always ASCII, then keeping them in AsciiStrings instead -- with infallible O(1) conversion to String if needed -- seems helpful for being able to re-use a name later without needing to re-validate that it's ASCII.
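
As a sketch of the kind of type being suggested (this AsciiString is hypothetical, modeled loosely on the ascii crate; the point is the infallible conversion at the end):

struct AsciiString(String);

impl AsciiString {
    fn from_bytes(bytes: Vec<u8>) -> Result<AsciiString, Vec<u8>> {
        if bytes.is_ascii() {
            // SAFETY: ASCII is a subset of UTF-8.
            Ok(AsciiString(unsafe { String::from_utf8_unchecked(bytes) }))
        } else {
            Err(bytes)
        }
    }
}

// The invariant travels with the type, so conversion to String is
// infallible and O(1): no re-validation needed downstream.
impl From<AsciiString> for String {
    fn from(s: AsciiString) -> String {
        s.0
    }
}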

You already guarantee ASCII validity on creation; why would you have to re-validate it?

Because you have a String, so if you pass that to somebody they don't know it's only ASCII.

The UTF-8 validation has a fast path for ASCII, and you already said you want the input validated as ASCII, so I don't think there is any performance overhead here. It would be more explicit in the code as to what form of data you're parsing, but that's about it.

String as a type doesn't guarantee it remains ASCII. It could be invalidated at any time with string.push('☠').

Yes, in some cases. But in others, after parsing those operations are not exposed. I guess maybe this is a niche thing that I ran into, and is not so common after all.

The UTF-8 validation has a fast path for ASCII, and you already said you want the input validated as ASCII, so I don't think there is any performance overhead here.

Of course there is performance overhead. I have to validate it as ASCII, and then (without unsafe) validate it again as UTF-8 to convert it to a string. String::from_utf8(bytes) doesn't guarantee that bytes is valid ASCII, because ASCII is strictly a subset of UTF-8.

Why not use something like the ascii crate?
