I would appreciate feedback for this idea. It's not really anything big but it could potentially be helpful. Some explanations of things may not be quite complete yet.
- Feature Name: string_from_ascii
- Start Date: Pre-RFC 05/02/21
Summary
Create the methods String::from_ascii and str::from_ascii{,_mut}, which create a String or a str (respectively) from bytes containing only valid ASCII characters.
Motivation
UTF-8 is helpful, but some cases do not need it. When parsing strings from bytes, there are times when the extra complexity of validating UTF-8 is unnecessary. While String::from_utf8_unchecked and str::from_utf8_unchecked exist, they do not check that the input is valid ASCII. Thus, other than doing their own ASCII validation and then calling the unsafe variants, users have no way to turn an ASCII byte string into a UTF-8 string without potentially allowing non-ASCII characters, which may not be acceptable in some domains.
Guide-level explanation
These explanations closely follow those of str::from_utf8 and String::from_utf8, as the functions are very similar.
core::str::from_ascii
Converts a slice of bytes to a string slice, checking that every byte is valid ASCII. As all ASCII is valid UTF-8, the result is valid UTF-8 and thus a valid string slice (&str). Do note, however, that not all UTF-8 is valid ASCII.
If you are sure that a byte slice is valid UTF-8, and you don't need to validate if the string is valid ASCII, there is an unsafe version of from_utf8, from_utf8_unchecked, which has the same behavior but skips the check. If you need to check that a byte slice is valid UTF-8 and not just valid ASCII, use from_utf8.
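To make the difference between the two checks concrete, here is a small sketch using only functions that already exist in the standard library (str::from_utf8 and the is_ascii method on byte slices). The helper name is_valid_utf8 is made up for this illustration.

```rust
use std::str;

/// Hypothetical helper for this illustration: true if `bytes` passes the
/// check that `str::from_utf8` performs.
fn is_valid_utf8(bytes: &[u8]) -> bool {
    str::from_utf8(bytes).is_ok()
}

fn main() {
    let hello = [104u8, 101, 108, 108, 111];
    let sparkle_heart = [240u8, 159, 146, 150];
    let invalid = [0u8, 159, 146, 150];

    // ASCII bytes pass both checks.
    assert!(hello.is_ascii());
    assert!(is_valid_utf8(&hello));

    // A multi-byte UTF-8 sequence is valid UTF-8 but not ASCII.
    assert!(!sparkle_heart.is_ascii());
    assert!(is_valid_utf8(&sparkle_heart));

    // A stray continuation byte fails both checks.
    assert!(!invalid.is_ascii());
    assert!(!is_valid_utf8(&invalid));
}
```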
If you need a String instead of a &str, consider String::from_ascii.
Because you can stack-allocate a [u8; N], and you can take a &[u8] of it, this function is one way to have a stack-allocated string. There is an example of this in the examples section below.
Errors
Returns Err if the slice is not ASCII, with a description of why the provided slice is not ASCII.
Examples
Basic usage:
use std::str;
// some bytes, in a vector
let hello = vec![104, 101, 108, 108, 111];
// We know these bytes are valid ASCII, so just use `unwrap()`.
let hello = str::from_ascii(&hello).unwrap();
assert_eq!("hello", hello);
Incorrect ASCII, but valid UTF-8:
use std::str;
// some bytes, in a vector
let sparkle_heart = vec![240, 159, 146, 150];
// We know these bytes are valid UTF-8, but they are not valid ASCII.
assert!(str::from_ascii(&sparkle_heart).is_err());
Incorrect ASCII as well as UTF-8:
use std::str;
// some invalid bytes, in a vector
let sparkle_heart = vec![0, 159, 146, 150];
assert!(str::from_ascii(&sparkle_heart).is_err());
See the docs for AsciiError for more details on the kinds of errors that can be returned.
A "stack allocated string":
use std::str;
// some bytes, in a stack-allocated array
let hello = [104, 101, 108, 108, 111];
// We know these bytes are valid, so just use `unwrap()`.
let hello = str::from_ascii(&hello).unwrap();
assert_eq!("hello", hello);
TODO: String::from_ascii, core::str::AsciiError, alloc::string::FromAsciiError
Reference-level explanation
These new functions can simply call the existing from_utf8_unchecked functions, after having checked that every byte in the slice is less than or equal to 127 (the limit of US-ASCII).
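A minimal sketch of that approach for the borrowed variants might look like the following. The shape of AsciiError is not specified by this proposal; the valid_up_to field and accessor here are assumptions borrowed from how Utf8Error works, and the functions are written as free functions purely for illustration.

```rust
use std::fmt;
use std::str;

/// Hypothetical error type; its fields are an assumption, not part of the proposal.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct AsciiError {
    valid_up_to: usize,
}

impl AsciiError {
    /// Index of the first byte that is not ASCII.
    pub fn valid_up_to(&self) -> usize {
        self.valid_up_to
    }
}

impl fmt::Display for AsciiError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "non-ASCII byte at index {}", self.valid_up_to)
    }
}

impl std::error::Error for AsciiError {}

/// Sketch of the proposed `str::from_ascii`.
pub fn from_ascii(bytes: &[u8]) -> Result<&str, AsciiError> {
    // Find the first byte above 127, if any.
    let first_bad = bytes.iter().position(|&b| b > 127);
    match first_bad {
        Some(valid_up_to) => Err(AsciiError { valid_up_to }),
        // SAFETY: every byte is <= 127, and all ASCII is valid UTF-8.
        None => Ok(unsafe { str::from_utf8_unchecked(bytes) }),
    }
}

/// Sketch of the proposed `str::from_ascii_mut`.
pub fn from_ascii_mut(bytes: &mut [u8]) -> Result<&mut str, AsciiError> {
    let first_bad = bytes.iter().position(|&b| b > 127);
    match first_bad {
        Some(valid_up_to) => Err(AsciiError { valid_up_to }),
        // SAFETY: same reasoning as in `from_ascii`.
        None => Ok(unsafe { str::from_utf8_unchecked_mut(bytes) }),
    }
}

fn main() {
    assert_eq!(from_ascii(&[104, 101, 108, 108, 111]), Ok("hello"));

    // The first two bytes are ASCII; the byte at index 2 is not.
    let err = from_ascii(&[104, 105, 240, 159]).unwrap_err();
    assert_eq!(err.valid_up_to(), 2);

    // The mutable variant allows in-place edits through `&mut str`.
    let mut buf = *b"abc";
    from_ascii_mut(&mut buf).unwrap().make_ascii_uppercase();
    assert_eq!(&buf, b"ABC");
}
```

Reporting the index of the first offending byte is only one possible design; a real implementation could equally use the existing is_ascii method on byte slices and return an error with no position.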
Drawbacks
Checking that bytes are valid ASCII is already simple, so users can do this themselves and then call from_utf8_unchecked to avoid further overhead.
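For reference, this is roughly what that do-it-yourself version looks like today for an owned buffer, using the existing is_ascii method on byte slices and String::from_utf8_unchecked. The helper name ascii_string and its Option return type are invented for this sketch.

```rust
/// Hypothetical helper: the manual check plus unsafe conversion that users
/// can already write without any new standard library API.
fn ascii_string(bytes: Vec<u8>) -> Option<String> {
    if bytes.is_ascii() {
        // SAFETY: `is_ascii` confirmed every byte is <= 127, and all ASCII
        // is valid UTF-8.
        Some(unsafe { String::from_utf8_unchecked(bytes) })
    } else {
        None
    }
}

fn main() {
    assert_eq!(ascii_string(vec![65, 66, 67, 68]), Some(String::from("ABCD")));
    // Valid UTF-8, but rejected because it is not ASCII.
    assert_eq!(ascii_string(vec![240, 159, 146, 150]), None);
}
```

It is short, but every caller has to write and justify the unsafe block themselves.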
Rationale and alternatives
This design is relatively simple to implement compared to something like the ascii crate, which has string types that are ASCII-only. Not doing this would probably have low impact, but it would force code for things such as binary formats (which sometimes only support ASCII byte strings) to become more complex (and/or more unsafe) in order to be efficient, when that code could quite easily be abstracted out into the standard library.
Prior art
Using an example string of ABCD ([65, 66, 67, 68]), the following are just a few popular languages which have this feature:
- Java: new String(new byte[] {65, 66, 67, 68}, "US-ASCII"). In the case of an invalid character, U+FFFD will be added.
- Python 3: bytes([65, 66, 67, 68]).decode('ascii'). In the case of an invalid character, an exception is raised.
- NodeJS: Buffer.from(new Uint8Array([65, 66, 67, 68])).toString('ascii'). Bytes are interpreted modulo 128.
- C/C++ do not have natively supported UTF-8 string types like Rust does. Instead, they just use byte strings which are usually interpreted as only ASCII.
Unresolved questions
- Should these functions be in alloc::string::String and core::str, or inside of core::ascii?
- Should there be a str::from_ascii_unchecked, as it does the same thing as str::from_utf8_unchecked?
- Should we create our own versions of core::str::Utf8Error and alloc::string::FromUtf8Error, or simply use the existing ones?
Future possibilities
If this RFC is merged and these functions are added as part of alloc::string::String and core::str, it may be considered that core::str::from_utf8_unchecked should be renamed to something like core::str::from_bytes_unchecked. It may also be considered that core::str::Utf8Error and std::string::FromUtf8Error could be converted into a general FromBytesError type (although there would have to be one in core and another in alloc due to FromUtf8Error having methods dealing with alloc::vec::Vecs).