[Pre-RFC] ASCII Type & Literals

This RFC builds off of ACP#179, which just proposed the ascii::Char type.

Summary

Add an Ascii (name subject to bikeshedding) type, representing a valid ASCII character (0x00-0x7F).
Add a'.', a"...", and ar#"..."# ASCII literals.

ASCII string slices are type &[Ascii], and owned ASCII strings are type Vec<Ascii> (or Box<[Ascii]>).
ASCII string literals (a"") are type &'static [Ascii; N].

Motivation

See ACP#179.

Sometimes, you want to work with bytes that you know are valid ASCII, and you want to avoid littering your code with unsafe from_utf8_unchecked conversions, or .unwrap() calls.

  • Avoiding "string".as_ascii().unwrap()
  • TODO

Guide-level explanation

TODO

Reference-level explanation

Ascii Type

(already implemented)

// core::ascii

#[derive(Copy, Clone, Eq, PartialEq, Ord, PartialOrd, Hash)]
#[rustc_layout_scalar_valid_range_start(0)]
#[rustc_layout_scalar_valid_range_end(128)]
#[repr(transparent)]
pub struct Ascii(u8);

Guarantees

The Ascii type is guaranteed to have the same size, align, and ABI as u8.
The Ascii type is guaranteed to be in the range 0..=127 (0x00-0x7F); producing a value in the range 128..=255 is undefined behavior.

The [Ascii] type is guaranteed to have the same layout/ABI as str and [u8].
The [Ascii] type is always valid UTF-8.

Matching

The compiler allows exhaustive matching on Ascii.

match ascii {
    a'\0'..=a'\x7F' => println!("yay"),
}

Conversions

Safe conversions from Ascii types to strings, chars, and bytes are provided.
These conversions are zero-cost.

  • Ascii -> char
  • Ascii -> u8
  • [Ascii] -> str / [u8]
  • [Ascii; N] -> [u8; N]
  • Box<[Ascii]> -> Box<str> / Box<[u8]>
  • Vec<Ascii> -> String / Vec<u8>

&mut [Ascii] -> &mut str / &mut [u8] is unsafe (just like str::as_bytes_mut).
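As a sketch of why the shared conversions can be zero-cost, the proposed type can be modeled as a plain newtype (the real type would use the compiler attributes shown earlier to enforce the 0..=127 range; the `new` and `as_str` names here are illustrative, not part of the proposal):

```rust
#[derive(Copy, Clone)]
#[repr(transparent)]
pub struct Ascii(u8);

impl Ascii {
    /// Checked constructor upholding the 0..=127 invariant.
    pub fn new(b: u8) -> Option<Ascii> {
        if b.is_ascii() { Some(Ascii(b)) } else { None }
    }
}

/// Reinterpret an ASCII slice as a string slice without copying.
pub fn as_str(s: &[Ascii]) -> &str {
    // SAFETY: `Ascii` is `repr(transparent)` over `u8`, and every value is
    // in 0..=127, so the bytes are always valid (single-byte) UTF-8.
    unsafe {
        std::str::from_utf8_unchecked(std::slice::from_raw_parts(
            s.as_ptr().cast::<u8>(),
            s.len(),
        ))
    }
}
```

The `unsafe` is contained inside the conversion: callers get a safe, allocation-free view, which is exactly the zero-cost claim above.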

Checked and unchecked conversions from strings, chars, and bytes to Ascii types are provided.
The checked conversions only incur the cost of an is_ascii check, and the unchecked conversions are zero-cost (but unsafe).

  • char -> Ascii
  • u8 -> Ascii
  • str / [u8] -> [Ascii]
  • [u8; N] -> [Ascii; N]
  • Box<str> / Box<[u8]> -> Box<[Ascii]>
  • String / Vec<u8> -> Vec<Ascii>
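A minimal sketch of one such checked conversion, again modeling `Ascii` as a newtype (the function name is illustrative): the only work is one `is_ascii` scan, after which the slice is reinterpreted in place.

```rust
#[derive(Copy, Clone, Debug, PartialEq)]
#[repr(transparent)]
pub struct Ascii(u8);

/// Checked `&[u8] -> &[Ascii]`: costs one `is_ascii` scan, no copy.
pub fn bytes_as_ascii(bytes: &[u8]) -> Option<&[Ascii]> {
    if bytes.is_ascii() {
        // SAFETY: `Ascii` is `repr(transparent)` over `u8`, and every
        // byte was just verified to be in 0..=127.
        Some(unsafe {
            std::slice::from_raw_parts(bytes.as_ptr().cast::<Ascii>(), bytes.len())
        })
    } else {
        None
    }
}
```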

Methods

https://github.com/rust-lang/rust/issues/110998#issuecomment-1836101837

Trait Impls

Formatting

Ascii implements Debug. Behavior matches that of char, except with \x hex escapes instead of unicode escapes for non-printable characters. This is already implemented.

Ascii implements Display. Behavior matches that of char and str.

Ascii implements Octal, LowerHex, UpperHex, and Binary. Behavior matches that of u8. ((Is this correct?))

Formatting for &[Ascii] is an unresolved question.

Associated Constants

Associated constants are provided for all 128 ASCII characters. (On nightly, this is currently implemented as an enum with 128 variants; an enum-based design remains possible.)

Additionally, MIN and MAX constants are provided (0x00 NUL and 0x7F DEL, respectively).

Ascii Literals

Three new literal types are added:

  • ASCII Character: a'A' -> Ascii
  • ASCII String: a"123456789" -> &'static [Ascii; N]
  • Raw ASCII String: ar#"raw ascii literal "hi" \ :)"# -> &'static [Ascii; N]

a'.' and a"..." literals accept Quote and ASCII escape codes. Raw string literals do not accept escape codes.

The following entries are added to the reference page on tokens:

|                  | Example             | # sets | Characters | Escapes       |
|------------------|---------------------|--------|------------|---------------|
| ASCII character  | a'H'                | 0      | All ASCII  | Quote & ASCII |
| ASCII string     | a"hello"            | 0      | All ASCII  | Quote & ASCII |
| Raw ASCII string | ar#"hello"#         | <256   | All ASCII  | N/A           |

Interaction with string-related macros:

  • The concat! macro will accept these literals. They will be treated as regular chars/strings.
  • The format_args! macro will not accept ASCII string literals as the format string.

Drawbacks

  • More complexity
  • ?

Rationale and alternatives

Prior art

Unresolved questions

Future possibilities

  • include_ascii! and other ASCII specific versions of string macros (proposed here).

Seems like an easy win to me if the additional bits are worth the complexity. How about include_ascii!?


I don't see a reason to have literals when the full character set is representable with normal string literals (c and b strings are necessary because they're strict supersets of UTF-8 rather than subsets). Instead, just use the const conversion functions that already exist.

A single ascii! macro could convert both char and string literals:

macro_rules! ascii {
    ($a:literal) => {{
        match ($a).as_ascii() {
            Some(a) => a,
            None => panic!("invalid ASCII literal"),
        }
    }}
}

I have been making changes to winnow, a parser combinator library, and I really dislike how sloppy I am with matching u8 / char with &[u8] and &str. I was thinking it'd be really nice to have an AsciiChar with either variants or literal syntax so users could ergonomically construct them and use them in my API in a more type-strict manner than what I do today.

Ascii strings and string literals could be interesting but are less important for my use case.


A const-based approach suffers from post-monomorphization error issues (can anyone from the lang team confirm this?). Plus, this is something that is quite easy for the compiler to check. A potential macro would need to check validity as part of the macro expansion (i.e., it would need to be a built-in macro). I don't know if there is precedent for macros transforming literals like this, and the syntax wouldn't match other string literal syntax in the language (like c and b strings). Plus, we have string literal prefixes reserved for things like this.

I'll admit that the case for ASCII strings is stronger than that of ASCII chars, but I believe that if we are going to include one, we should include both for consistency (and to avoid new users wondering why they can do a"" but not a'').


Given that the motivation is essentially ergonomics, perhaps the following could be cleared up:

  1. Why won't newtyping String / char do the job?
  2. Assuming that newtyping as an approach is feasible, why does this need to live in stdlib rather than on crates.io?

These comments outline why a dedicated type isn't ideal:


Assuming this includes &mut [Ascii] -> &mut str these can't both be true. Is there really a requirement that Ascii has a validity invariant on the wrapped byte rather than just a safety invariant?

EDIT: Actually, that can't be a safe conversion since it allows putting in multi-byte utf8 characters, but similar to having an unsafe &mut str -> &mut [u8] it could be useful as an unsafe conversion.

Good catch! &mut str -> &mut [Ascii] (checked and unchecked) should(?) be safe because ASCII is a subset of UTF-8: provided the string is all-ASCII to begin with, writes through &mut [Ascii] can only store ASCII bytes, so the string can never become invalid UTF-8.
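The argument above can be sketched in code, again modeling `Ascii` as a plain newtype (the function name is illustrative): the conversion is safe and fallible, with the one-time `is_ascii` check guarding the reinterpretation.

```rust
#[derive(Copy, Clone)]
#[repr(transparent)]
pub struct Ascii(u8);

impl Ascii {
    /// Checked constructor upholding the 0..=127 invariant.
    pub fn new(b: u8) -> Option<Ascii> {
        if b.is_ascii() { Some(Ascii(b)) } else { None }
    }
}

/// Checked `&mut str -> &mut [Ascii]`: safe, because any write through
/// the result stores an in-range byte and so keeps the string valid UTF-8.
pub fn str_as_ascii_mut(s: &mut str) -> Option<&mut [Ascii]> {
    if s.is_ascii() {
        let len = s.len();
        let ptr = s.as_mut_ptr();
        // SAFETY: the string is all-ASCII, `Ascii` is `repr(transparent)`
        // over `u8`, and the 0..=127 invariant means no write through the
        // slice can introduce a byte that breaks UTF-8 validity.
        Some(unsafe { std::slice::from_raw_parts_mut(ptr.cast::<Ascii>(), len) })
    } else {
        None
    }
}
```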

We do have str::as_bytes_mut (unsafe only), so a similar unsafe only method to go from &mut [Ascii] to &mut str / &mut [u8] should™ be ok.

I'll add a note to the RFC text.

Also, enforcing 0..=127 as a validity (type-level) invariant allows niche optimization for types like Option<Ascii>, and matches the behavior of other types like NonZero*.
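This is the same mechanism that already makes Option<NonZeroU8> byte-sized today; the illustrative assertions below show the niche that a validity invariant buys, compared to plain u8:

```rust
use std::mem::size_of;
use std::num::NonZeroU8;

fn main() {
    // NonZeroU8's validity invariant gives Option a niche to use...
    assert_eq!(size_of::<Option<NonZeroU8>>(), size_of::<u8>());
    // ...while plain u8 has no forbidden values, so Option needs a tag byte.
    assert_eq!(size_of::<Option<u8>>(), 2);
}
```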


Adding exhaustive matching on the type would be amazing. I do wish there were a way that could be done more generally -- it's already technically possible for the one in nightly because it's an enum, though I wish that one could have ranges in patterns to make that far more useful. But since c"strings" have set the pattern here, I can imagine that a"strings" existing would probably be straightforward enough.


One thing I've been pondering is whether there should be an AsciiStr type of some sort. Not for any different invariants, but because Debug and Display for &ascii::Str would be different from &[ascii::Char].


Could we get libs team input on if specializing Debug and implementing Display for &[Ascii] would be possible? Is there precedent for specializing formatting traits?

EDIT: A quick search shows the only(?) formatting specialization is for Zip with the ZipFmt trait, but that's because of safety invariants.

Worth noting that &mut str -> &mut [Ascii] is safe but must be checked and fallible (for non-ASCII characters)


Thoughts on AsciiStr being DerefMut<Target = [AsciiChar]>?

Absolutely it would be, I think. If we're going to have it, it would be used very sparingly, and we definitely wouldn't want to force all the methods (like .make_lowercase()) to be written twice.