[Pre-RFC] ASCII Type & Literals

This RFC builds off of ACP#179, which just proposed the ascii::Char type.

Summary

Add an Ascii (name subject to bikeshedding) type, representing a valid ASCII character (0x00-0x7F).
Add a'.', a"...", and ar#"..."# ASCII literals.

ASCII string slices are type &[Ascii], and owned ASCII strings are type Vec<Ascii> (or Box<[Ascii]>).
ASCII string literals (a"") are type &'static [Ascii; N].

Motivation

See ACP#179.

Sometimes, you want to work with bytes that you know are valid ASCII, and you want to avoid littering your code with unsafe from_utf8_unchecked conversions, or .unwrap() calls.

  • Avoiding "string".as_ascii().unwrap()
  • TODO

Guide-level explanation

TODO

Reference-level explanation

Ascii Type

(already implemented)

// core::ascii

#[derive(Copy, Clone, Eq, PartialEq, Ord, PartialOrd, Hash)]
#[rustc_layout_scalar_valid_range_start(0)]
#[rustc_layout_scalar_valid_range_end(128)]
#[repr(transparent)]
pub struct Ascii(u8);

Guarantees

The Ascii type is guaranteed to have the same size, align, and ABI as u8.
The Ascii type is guaranteed to be in the range 0..=127 (0x00-0x7F); producing a value in the range 128..=255 is undefined behavior.

The [Ascii] type is guaranteed to have the same layout/ABI as str and [u8].
The [Ascii] type is always valid UTF-8.

Matching

The compiler allows exhaustive matching on Ascii.

match ascii {
    a'\0'..=a'\x7F' => println!("yay"),
}

Conversions

Safe conversions from Ascii types to strings, chars, and bytes are provided.
These conversions are zero-cost.

  • Ascii -> char
  • Ascii -> u8
  • [Ascii] -> str / [u8]
  • [Ascii; N] -> [u8; N]
  • Box<[Ascii]> -> Box<str> / Box<[u8]>
  • Vec<Ascii> -> String / Vec<u8>

&mut [Ascii] -> &mut str / &mut [u8] is unsafe (just like str::as_bytes_mut).
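As a sketch of why the shared conversions can be zero-cost, the proposed type can be modeled as a plain newtype (the real type would use the compiler attributes shown earlier to enforce the 0..=127 range; the `new` and `as_str` names here are illustrative, not part of the proposal):

```rust
#[derive(Copy, Clone)]
#[repr(transparent)]
pub struct Ascii(u8);

impl Ascii {
    /// Checked constructor upholding the 0..=127 invariant.
    pub fn new(b: u8) -> Option<Ascii> {
        if b.is_ascii() { Some(Ascii(b)) } else { None }
    }
}

/// Reinterpret an ASCII slice as a string slice without copying.
pub fn as_str(s: &[Ascii]) -> &str {
    // SAFETY: `Ascii` is `repr(transparent)` over `u8`, and every value is
    // in 0..=127, so the bytes are always valid (single-byte) UTF-8.
    unsafe {
        std::str::from_utf8_unchecked(std::slice::from_raw_parts(
            s.as_ptr().cast::<u8>(),
            s.len(),
        ))
    }
}
```

The `unsafe` is contained inside the conversion: callers get a safe, allocation-free view, which is exactly the zero-cost claim above.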

Checked and unchecked conversions from strings, chars, and bytes to Ascii types are provided.
The checked conversions only incur the cost of an is_ascii check, and the unchecked conversions are zero-cost (but unsafe).

  • char -> Ascii
  • u8 -> Ascii
  • str / [u8] -> [Ascii]
  • [u8; N] -> [Ascii; N]
  • Box<str> / Box<[u8]> -> Box<[Ascii]>
  • String / Vec<u8> -> Vec<Ascii>
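A minimal sketch of one such checked conversion, again modeling `Ascii` as a newtype (the function name is illustrative): the only work is one `is_ascii` scan, after which the slice is reinterpreted in place.

```rust
#[derive(Copy, Clone, Debug, PartialEq)]
#[repr(transparent)]
pub struct Ascii(u8);

/// Checked `&[u8] -> &[Ascii]`: costs one `is_ascii` scan, no copy.
pub fn bytes_as_ascii(bytes: &[u8]) -> Option<&[Ascii]> {
    if bytes.is_ascii() {
        // SAFETY: `Ascii` is `repr(transparent)` over `u8`, and every
        // byte was just verified to be in 0..=127.
        Some(unsafe {
            std::slice::from_raw_parts(bytes.as_ptr().cast::<Ascii>(), bytes.len())
        })
    } else {
        None
    }
}
```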

Methods

https://github.com/rust-lang/rust/issues/110998#issuecomment-1836101837

Trait Impls

Formatting

Ascii implements Debug. Behavior matches that of char, except with \x hex escapes instead of unicode escapes for non-printable characters. This is already implemented.

Ascii implements Display. Behavior matches that of char and str.

Ascii implements Octal, LowerHex, UpperHex, and Binary. Behavior matches that of u8. ((Is this correct?))

Formatting for &[Ascii] is an unresolved question.

Associated Constants

Associated constants are provided for all 128 ASCII characters. (On nightly, this is currently implemented as an enum with 128 variants; an enum-based design remains possible.)

Additionally, MIN and MAX constants are provided (0x00 NUL and 0x7F DEL, respectively).

Ascii Literals

Three new literal types are added:

  • ASCII Character: a'A' -> Ascii
  • ASCII String: a"123456789" -> &'static [Ascii; N]
  • Raw ASCII String: ar#"raw ascii literal "hi" \ :)"# -> &'static [Ascii; N]

a'.' and a"..." literals accept Quote and ASCII escape codes. Raw string literals do not accept escape codes.

The following entries are added to the reference page on tokens:

|                  | Example             | # sets | Characters | Escapes       |
|------------------|---------------------|--------|------------|---------------|
| ASCII character  | a'H'                | 0      | All ASCII  | Quote & ASCII |
| ASCII string     | a"hello"            | 0      | All ASCII  | Quote & ASCII |
| Raw ASCII string | ar#"hello"#         | <256   | All ASCII  | N/A           |

Interaction with string-related macros:

  • The concat! macro will accept these literals. They will be treated as regular chars/strings.
  • The format_args! macro will not accept ASCII string literals as the format string.

Drawbacks

  • More complexity
  • ?

Rationale and alternatives

Prior art

Unresolved questions

Future possibilities

  • include_ascii! and other ASCII specific versions of string macros (proposed here).

Seems like an easy win to me if the additional bits are worth the complexity. How about include_ascii!?


I don't see a reason to have literals when the full character set is representable with normal string literals (c and b strings are necessary because they're strict supersets of UTF-8 rather than subsets). Instead, just use the const conversion functions that already exist.

A single ascii! macro could convert both char and string literals:

macro_rules! ascii {
    ($a:literal) => {{
        match ($a).as_ascii() {
            Some(a) => a,
            None => panic!("invalid ASCII literal"),
        }
    }}
}

I have been making changes to winnow, a parser combinator library, and I really dislike how sloppy I am with matching u8 / char with &[u8] and &str. I was thinking it'd be really nice to have an AsciiChar with either variants or literal syntax so users could ergonomically construct them and use them in my API in a more type-strict manner than what I do today.

Ascii strings and string literals could be interesting but are less important for my use case.


A const-based approach suffers from post-monomorphization error issues (can anyone from the lang team confirm this?). Plus, this is something that is quite easy for the compiler to check. A potential macro would need to check validity as part of the macro expansion (i.e., it would need to be a built-in macro). I don't know if there is precedent for macros transforming literals like this, and the syntax wouldn't match other string literal syntax in the language (like c and b strings). Plus, we have string literal prefixes reserved for things like this.

I'll admit that the case for ASCII strings is stronger than that of ASCII chars, but I believe that if we are going to include one, we should include both for consistency (and to avoid new users wondering why they can do a"" but not a'').


Given that the motivation is essentially ergonomics, perhaps the following could be cleared up:

  1. Why won't newtyping String / char do the job?
  2. Assuming that newtyping as an approach is feasible, why does this need to live in stdlib rather than on crates.io?

These comments outline why a dedicated type isn't ideal:


Assuming this includes &mut [Ascii] -> &mut str these can't both be true. Is there really a requirement that Ascii has a validity invariant on the wrapped byte rather than just a safety invariant?

EDIT: Actually, that can't be a safe conversion since it allows putting in multi-byte utf8 characters, but similar to having an unsafe &mut str -> &mut [u8] it could be useful as an unsafe conversion.

Good catch! &mut str -> &mut [Ascii] (checked and unchecked) should(?) be safe because ASCII is a subset of UTF-8: provided the string is all-ASCII to begin with, writes through &mut [Ascii] can only store ASCII bytes, so the string can never become invalid UTF-8.
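The argument above can be sketched in code, again modeling `Ascii` as a plain newtype (the function name is illustrative): the conversion is safe and fallible, with the one-time `is_ascii` check guarding the reinterpretation.

```rust
#[derive(Copy, Clone)]
#[repr(transparent)]
pub struct Ascii(u8);

impl Ascii {
    /// Checked constructor upholding the 0..=127 invariant.
    pub fn new(b: u8) -> Option<Ascii> {
        if b.is_ascii() { Some(Ascii(b)) } else { None }
    }
}

/// Checked `&mut str -> &mut [Ascii]`: safe, because any write through
/// the result stores an in-range byte and so keeps the string valid UTF-8.
pub fn str_as_ascii_mut(s: &mut str) -> Option<&mut [Ascii]> {
    if s.is_ascii() {
        let len = s.len();
        let ptr = s.as_mut_ptr();
        // SAFETY: the string is all-ASCII, `Ascii` is `repr(transparent)`
        // over `u8`, and the 0..=127 invariant means no write through the
        // slice can introduce a byte that breaks UTF-8 validity.
        Some(unsafe { std::slice::from_raw_parts_mut(ptr.cast::<Ascii>(), len) })
    } else {
        None
    }
}
```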

We do have str::as_bytes_mut (unsafe only), so a similar unsafe only method to go from &mut [Ascii] to &mut str / &mut [u8] should™ be ok.

I'll add a note to the RFC text.

Also, enforcing 0..=127 as a validity (type-level) invariant allows niche optimization for types like Option<Ascii>, and matches the behavior of other types like NonZero*.
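This is the same mechanism that already makes Option<NonZeroU8> byte-sized today; the illustrative assertions below show the niche that a validity invariant buys, compared to plain u8:

```rust
use std::mem::size_of;
use std::num::NonZeroU8;

fn main() {
    // NonZeroU8's validity invariant gives Option a niche to use...
    assert_eq!(size_of::<Option<NonZeroU8>>(), size_of::<u8>());
    // ...while plain u8 has no forbidden values, so Option needs a tag byte.
    assert_eq!(size_of::<Option<u8>>(), 2);
}
```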


Adding exhaustive matching on the type would be amazing. I do wish there were a way that could be done more generally -- it's already technically possible for the one in nightly because it's an enum, though I wish that one could have ranges in patterns to make that far more useful. But since c"strings" have set the pattern here, I can imagine that a"strings" existing would probably be straightforward enough.


One thing I've been pondering is whether there should be an AsciiStr type of some sort. Not for any different invariants, but because Debug and Display for &ascii::Str would be different from &[ascii::Char].


Could we get libs team input on if specializing Debug and implementing Display for &[Ascii] would be possible? Is there precedent for specializing formatting traits?

EDIT: A quick search shows the only(?) formatting specialization is for Zip with the ZipFmt trait, but that's because of safety invariants.

Worth noting that &mut str -> &mut [Ascii] is safe but must be checked and fallible (for non-ASCII characters)


Thoughts on AsciiStr being DerefMut<Target = [AsciiChar]>?

Absolutely it would be, I think. If we're going to have it, it would be used very sparingly, and we definitely wouldn't want to force all the methods (like .make_lowercase()) to be written twice.