This is my new favorite thing. Thank you.
The following is probably insane:
I wonder how bad of an idea it would be to change String to be String<E: Encoding = Utf8>. I don't believe this would be a breaking change, though it would be absolutely awful to deal with making str generic in the encoding, since it's stuck as a magic builtin for obnoxious historical reasons.
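To make the default-parameter idea concrete, here is a rough sketch. `MyString` is a made-up stand-in (the real String is not generic), and the `Encoding` trait here is a placeholder with nothing but the code unit type. All it shows is that code which writes the bare type name keeps meaning the Utf8 version; it says nothing about what this would do to type inference in existing code.

```rust
use std::marker::PhantomData;

// Placeholder trait for this sketch: just the code unit type.
trait Encoding {
    type CodeUnit;
}

struct Utf8;
impl Encoding for Utf8 {
    type CodeUnit = u8;
}

struct Utf16;
impl Encoding for Utf16 {
    type CodeUnit = u16;
}

// Hypothetical stand-in for `String`, generic over the encoding,
// defaulting to `Utf8` when no parameter is written.
struct MyString<E: Encoding = Utf8> {
    units: Vec<E::CodeUnit>,
    _encoding: PhantomData<E>,
}

impl<E: Encoding> MyString<E> {
    // Wraps the code units as-is; no validation in this sketch.
    fn from_units(units: Vec<E::CodeUnit>) -> Self {
        MyString { units, _encoding: PhantomData }
    }

    // Length in code units, not in code points.
    fn len(&self) -> usize {
        self.units.len()
    }
}

// A signature that names just `MyString` means `MyString<Utf8>`.
fn byte_len(s: &MyString) -> usize {
    s.len()
}
```

So `let s: MyString = MyString::from_units(b"hi".to_vec());` picks Utf8 from the annotation, while the other encoding has to be spelled out as `MyString<Utf16>`.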
I'm going to go ahead and work out the Encoding trait, mostly to convince myself this is insane.
/// Represents an encoding of Unicode.
trait Encoding {
    /// The type representing a code unit for this
    /// encoding. For UTF-8, this is a `u8`.
    type CodeUnit;

    /// Wrap a stream of bytes encoding valid UTF-8
    /// into a stream of this encoding's code units.
    fn from_utf8(stream: impl Iterator<Item = u8>)
        -> impl Iterator<Item = Self::CodeUnit>;

    /// Wrap a stream of this encoding's code units into a stream
    /// of UTF-8 bytes.
    fn to_utf8(stream: impl Iterator<Item = Self::CodeUnit>)
        -> impl Iterator<Item = u8>;

    /// Wrap a stream of Unicode codepoints into a stream of
    /// this encoding's code units.
    fn from_codepoints(stream: impl Iterator<Item = char>)
        -> impl Iterator<Item = Self::CodeUnit>;

    /// Wrap a stream of this encoding's code units into a stream
    /// of Unicode codepoints.
    fn to_codepoints(stream: impl Iterator<Item = Self::CodeUnit>)
        -> impl Iterator<Item = char>;

    /// Validate that this stream is, in fact, a stream using this encoding.
    fn validate(stream: impl Iterator<Item = Self::CodeUnit>) -> bool;
}

struct Utf8;

impl Encoding for Utf8 {
    type CodeUnit = u8;
    // ..
}
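For a sense of what a non-UTF-8 impl could look like, here is a minimal sketch for a hypothetical `Utf16` type. To keep it short and self-contained it repeats a reduced copy of the trait with only the codepoint methods and `validate` (the UTF-8 conversions are left out), and it leans on the standard library's `char::encode_utf16` and `std::char::decode_utf16`. Replacing unpaired surrogates with U+FFFD in `to_codepoints` is my own assumption; the trait as written doesn't say what should happen on invalid input.

```rust
use std::char::decode_utf16;

/// Reduced copy of the trait above, so this block compiles on its own.
trait Encoding {
    type CodeUnit;

    fn from_codepoints(stream: impl Iterator<Item = char>)
        -> impl Iterator<Item = Self::CodeUnit>;

    fn to_codepoints(stream: impl Iterator<Item = Self::CodeUnit>)
        -> impl Iterator<Item = char>;

    fn validate(stream: impl Iterator<Item = Self::CodeUnit>) -> bool;
}

/// Hypothetical marker type for UTF-16; not part of the original sketch.
struct Utf16;

impl Encoding for Utf16 {
    type CodeUnit = u16;

    fn from_codepoints(stream: impl Iterator<Item = char>)
        -> impl Iterator<Item = Self::CodeUnit>
    {
        stream.flat_map(|c| {
            // A char encodes to one or two UTF-16 code units.
            let mut buf = [0u16; 2];
            let len = c.encode_utf16(&mut buf).len();
            IntoIterator::into_iter(buf).take(len)
        })
    }

    fn to_codepoints(stream: impl Iterator<Item = Self::CodeUnit>)
        -> impl Iterator<Item = char>
    {
        // Assumption: unpaired surrogates become U+FFFD instead of failing.
        decode_utf16(stream).map(|r| r.unwrap_or(char::REPLACEMENT_CHARACTER))
    }

    fn validate(stream: impl Iterator<Item = Self::CodeUnit>) -> bool {
        // Valid means every surrogate is part of a well-formed pair.
        decode_utf16(stream).all(|r| r.is_ok())
    }
}
```

With this, `Utf16::from_codepoints("a\u{1D11E}".chars())` yields the code units 0x0061, 0xD834, 0xDD1E, and `Utf16::validate` rejects a lone 0xD834. Note that return-position `impl Trait` in trait methods needs Rust 1.75 or later.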
OK, I think I might have accidentally written down something reasonable. I really, really don't actually think String should be generic in encoding; making it this easy to reach for alternate encodings is going to just confuse people, instead of driving use towards the One True Encoding…
Edit: apparently I reinvented C++'s std::basic_string by accident? Probably worth disregarding this whole post.