Strings and UTF-8

Nokel81 · December 18, 2018, 10:11pm

I was wondering why the string documentation specifically mentions that they are UTF-8 encoded. Is there a specific reason why an encoding method was locked down?

mcy · December 18, 2018, 10:17pm

UTF-8 is layout-compatible with ASCII (as opposed to, say, UTF-16 in Java and JavaScript) and (to my knowledge) the single most common encoding out there. There’s nothing particularly special about the string types; if you really need to handle something weird like latin-1, I’m sure someone’s sorted out a crate for it.
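
A minimal sketch of the ASCII-compatibility point, using only standard-library calls. The helper names `utf8_bytes` and `utf16_units` are made up for illustration.

```rust
/// The UTF-8 bytes of `s` (hypothetical helper).
fn utf8_bytes(s: &str) -> Vec<u8> {
    s.as_bytes().to_vec()
}

/// The UTF-16 code units of `s` (hypothetical helper).
fn utf16_units(s: &str) -> Vec<u16> {
    s.encode_utf16().collect()
}

fn main() {
    // Pure-ASCII text has exactly the same bytes in ASCII and in UTF-8.
    assert_eq!(utf8_bytes("Rust"), b"Rust".to_vec());

    // UTF-16 stores the same text as 16-bit code units, so the in-memory
    // layout differs from ASCII even though the numeric values match.
    let units = utf16_units("Rust");
    assert_eq!(units.len(), 4);
    assert_eq!(units.len() * std::mem::size_of::<u16>(), 8);
}
```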

Tom-Phinney · December 18, 2018, 10:26pm

The Rust ecosystem is built around UTF-8. Any other default encoding would create compatibility problems.

notriddle · December 18, 2018, 10:42pm

It means you can convert a UTF-8 encoded byte-blob to a str without having to copy it, guaranteed.
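
As a rough illustration of the no-copy conversion: `std::str::from_utf8` validates the bytes and then hands back a `&str` that borrows the same memory. The helper name `view_as_str` is an assumption for this sketch.

```rust
use std::str;

/// Borrow `bytes` as a `&str` if they are valid UTF-8 (hypothetical helper).
/// The bytes are scanned for validity but never copied.
fn view_as_str(bytes: &[u8]) -> Option<&str> {
    str::from_utf8(bytes).ok()
}

fn main() {
    let blob: Vec<u8> = vec![0x68, 0x69, 0xC3, 0xA9]; // "hi" followed by U+00E9
    let s = view_as_str(&blob).expect("blob is valid UTF-8");
    assert_eq!(s, "hi\u{E9}");

    // The &str points at the very same memory as the byte blob.
    assert_eq!(s.as_ptr(), blob.as_ptr());

    // Invalid UTF-8 is rejected rather than silently converted.
    assert!(view_as_str(&[0xFF]).is_none());
}
```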

H2CO3 · December 18, 2018, 11:27pm

UTF-8 is basically the best option we have for string encoding today. There are many reasons, so I won't repeat them here, because there's a great piece of writing about all the arguments as to why everyone should use UTF-8 (by default and wherever possible): https://utf8everywhere.org/

burntsushi · December 18, 2018, 11:58pm

See also this document by Mark Davis, which surveys different string models and their trade-offs: https://docs.google.com/document/d/1wuzzMOvKOJw93SWZAqoim1VUl9mloUxE0W6Ki_G23tw

scottmcm · December 19, 2018, 9:14am

Because it affects the possible APIs; most importantly which ones can be no-fail and non-reallocating.
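
One way to see this, sketched with standard-library calls only: because a `String` is known to hold UTF-8, going to bytes cannot fail and keeps the buffer, while going from bytes must validate but can still reuse the allocation on success. The helper name `round_trip_keeps_buffer` is hypothetical.

```rust
/// Returns true if String -> Vec<u8> -> String kept the same heap buffer
/// (hypothetical helper).
fn round_trip_keeps_buffer(s: String) -> bool {
    let before = s.as_ptr();

    // Infallible and non-reallocating: the String already is UTF-8 bytes.
    let bytes: Vec<u8> = s.into_bytes();
    let same_after_into = bytes.as_ptr() == before;

    // Fallible, because arbitrary bytes must be validated first.
    // On success the existing buffer is reused.
    match String::from_utf8(bytes) {
        Ok(back) => same_after_into && back.as_ptr() == before,
        Err(_) => false,
    }
}

fn main() {
    assert!(round_trip_keeps_buffer(String::from("h\u{E9}llo")));

    // Bytes that are not UTF-8 are rejected by the fallible direction.
    assert!(String::from_utf8(vec![0xFF]).is_err());
}
```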

josh · December 19, 2018, 6:19pm

Without an encoding specified, you don't have a string, you have bytes. If you want bytes, you can always use [u8] or Vec<u8>. str and String exist to contain Unicode characters, not just bytes. Of the ways to encode Unicode characters, UTF-8 provides a far more efficient encoding than UCS-4.
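
A small size comparison to make the efficiency claim concrete. This assumes UCS-4 means four bytes per code point; the helper names are made up for the sketch.

```rust
/// Bytes needed to store `s` as UTF-8 (hypothetical helper).
fn utf8_len(s: &str) -> usize {
    s.len()
}

/// Bytes needed to store `s` as UCS-4, at four bytes per code point
/// (hypothetical helper).
fn ucs4_len(s: &str) -> usize {
    s.chars().count() * 4
}

fn main() {
    // ASCII text: one byte per character in UTF-8, four in UCS-4.
    assert_eq!(utf8_len("hello"), 5);
    assert_eq!(ucs4_len("hello"), 20);

    // Three CJK characters: three bytes each in UTF-8, still four in UCS-4.
    let cjk = "\u{65E5}\u{672C}\u{8A9E}";
    assert_eq!(utf8_len(cjk), 9);
    assert_eq!(ucs4_len(cjk), 12);
}
```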

burntsushi · December 19, 2018, 6:37pm

FWIW, I’ve been working on a new string library that mirrors Rust’s String/&str types, except instead of “strings are guaranteed to be valid UTF-8,” they are instead “conventionally UTF-8.” The motivation of the library is to address a very common pain point that I’ve come across over the years where I want to treat bytes as primarily UTF-8 without doing an upfront UTF-8 validation check. Most of the APIs remain the same, with the exception that the Unicode replacement codepoint features more prominently in operations that are only defined on Unicode codepoints (e.g., UTF-8 decoding, grapheme clusters and so on). In particular, the Vec<u8>/&[u8] types are insufficient to address this use case, since they lack a lot of the common operations one associates with strings.
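
To give a feel for what "the replacement codepoint features more prominently" can mean, here is a sketch using the standard library's `String::from_utf8_lossy`. This is not the API of the library described above, only a nearby std analogue; the helper name `lossy` is hypothetical.

```rust
use std::borrow::Cow;

/// Decode bytes as UTF-8, replacing invalid sequences with U+FFFD
/// (hypothetical helper around a std call).
fn lossy(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}

fn main() {
    // 0xFF can never appear in valid UTF-8, so it becomes U+FFFD.
    assert_eq!(lossy(b"foo\xFFbar"), "foo\u{FFFD}bar");

    // Valid input passes through unchanged, and std does not even allocate.
    assert!(matches!(String::from_utf8_lossy(b"ok"), Cow::Borrowed("ok")));
}
```

Note that this std call decodes the whole buffer up front, which is exactly the kind of eager step a "conventionally UTF-8" type would try to avoid.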

josh · December 19, 2018, 7:05pm

That sounds like a type I’d love to see in the standard library, eventually.

mcy · December 20, 2018, 1:05am

This is my new favorite thing. Thank you.

The following is probably insane:

I wonder how bad of an idea it would be to change String to be String<E: Encoding = Utf8>. I don't believe this would be a breaking change, though it would be absolutely awful to deal with making str generic in the encoding, since it's stuck as a magic builtin for obnoxious historical reasons.

I'm going to go ahead and work out the Encoding trait, mostly to convince myself this is insane.

/// Represents an encoding of Unicode.
trait Encoding {
  /// The type representing a code unit for this
  /// encoding. For UTF-8, this is a `u8`.
  type CodeUnit;

  /// Wrap a stream of bytes encoding valid UTF-8
  /// into a stream of this encoding's code units.
  fn from_utf8(stream: impl Iterator<Item = u8>)
    -> impl Iterator<Item = Self::CodeUnit>;

  /// Wrap a stream using this encoding into a stream of UTF-8 bytes.
  fn to_utf8(stream: impl Iterator<Item = Self::CodeUnit>)
    -> impl Iterator<Item = u8>;
  
  /// Wrap a stream of Unicode codepoints into a stream encoding
  /// using this encoding. 
  fn from_codepoints(stream: impl Iterator<Item = char>)
    -> impl Iterator<Item = Self::CodeUnit>;
  
  /// Wrap a stream using this encoding into a stream of Unicode
  /// codepoints.
  fn to_codepoints(stream: impl Iterator<Item = Self::CodeUnit>)
    -> impl Iterator<Item = char>;
  
  /// Validate that this stream is, in fact, a stream using this encoding.
  fn validate(stream: impl Iterator<Item = Self::CodeUnit>) -> bool;
}

struct Utf8;
impl Encoding for Utf8 {
  type CodeUnit = u8;
  // ..
}

Ok I think I might have accidentally written down something reasonable. I really really don't actually think String should be generic in encoding; making it this easy to reach for alternate encodings is going to just confuse people, instead of driving use towards the One True Encoding...

Edit: apparently I reinvented std::basic_string by accident? Probably worth disregarding this whole post.
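
For anyone who wants to try the idea, here is a reduced, runnable sketch of such a trait. It is a simplification and not the trait from this post: it works on whole buffers rather than iterators, and every name in it (`encode`, `decode`, `Utf16`) is an assumption made for illustration.

```rust
/// Simplified stand-in for an encoding trait (hypothetical).
trait Encoding {
    /// The code unit type, e.g. `u8` for UTF-8.
    type CodeUnit;

    /// Encode Unicode text into this encoding's code units.
    fn encode(text: &str) -> Vec<Self::CodeUnit>;

    /// Decode code units, returning `None` if they are not valid.
    fn decode(units: &[Self::CodeUnit]) -> Option<String>;

    /// Check validity by attempting a decode.
    fn validate(units: &[Self::CodeUnit]) -> bool {
        Self::decode(units).is_some()
    }
}

struct Utf8;
impl Encoding for Utf8 {
    type CodeUnit = u8;
    fn encode(text: &str) -> Vec<u8> {
        text.as_bytes().to_vec()
    }
    fn decode(units: &[u8]) -> Option<String> {
        std::str::from_utf8(units).ok().map(str::to_owned)
    }
}

struct Utf16;
impl Encoding for Utf16 {
    type CodeUnit = u16;
    fn encode(text: &str) -> Vec<u16> {
        text.encode_utf16().collect()
    }
    fn decode(units: &[u16]) -> Option<String> {
        String::from_utf16(units).ok()
    }
}

fn main() {
    let text = "h\u{E9}llo";
    assert_eq!(Utf8::decode(&Utf8::encode(text)).as_deref(), Some(text));
    assert_eq!(Utf16::decode(&Utf16::encode(text)).as_deref(), Some(text));

    // 0xFF is never valid UTF-8; a lone surrogate is never valid UTF-16.
    assert!(!Utf8::validate(&[0xFF]));
    assert!(!Utf16::validate(&[0xD800]));
}
```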

luben · December 20, 2018, 2:21am

@burntsushi, do you intend to make the storage pluggable à la string crate? I think there is definitely a need for more flexible strings in both axes of correctness guarantees and storage.

burntsushi · December 20, 2018, 2:44am

No. Its internal representation is Vec<u8>/&[u8].

hsivonen · December 20, 2018, 3:15pm

Do you have examples now that make_ascii_lowercase(), etc., exist?

burntsushi · December 20, 2018, 3:38pm

If you take the APIs of String/&str and subtract the APIs of Vec<u8>/&[u8], then I think whatever you have left over would be the starting set of examples: chars (Unicode aware), replace, to_lowercase (Unicode aware), to_uppercase (Unicode aware), the various split routines (Unicode aware), lines, find, rfind, the various trim routines, and so on. Then there are the various ecosystem Unicode crates, such as detection of various boundaries (lines, grapheme clusters, words, sentences). All of this stuff only works on strings that are guaranteed to be valid UTF-8, but all of these operations can be implemented for strings that are only conventionally UTF-8 by making choices about what to do in the presence of invalid UTF-8.

hsivonen · December 20, 2018, 4:38pm

I think it provides significant ecosystem value that crates can focus on the guaranteed-valid case and don't need to also implement the potentially-invalid case.

That said, &[u8] could pretty easily provide an iterator that yields chars from potentially-invalid UTF-8 with WHATWG Encoding Standard-compliant U+FFFD generation, as well as line splitting based on LF and CRLF.
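
A sketch of what such char decoding could look like on top of today's std, assuming it is acceptable to collect into a `Vec<char>` instead of writing a real iterator. The function name `lossy_chars` is hypothetical. It follows the resume-after-error pattern described in the `Utf8Error` docs, and I have not checked it against the WHATWG standard case by case.

```rust
use std::str;

/// Decode possibly-invalid UTF-8, emitting U+FFFD for each invalid
/// sequence (hypothetical helper).
fn lossy_chars(mut bytes: &[u8]) -> Vec<char> {
    let mut out = Vec::new();
    loop {
        match str::from_utf8(bytes) {
            Ok(valid) => {
                out.extend(valid.chars());
                return out;
            }
            Err(err) => {
                let (valid, rest) = bytes.split_at(err.valid_up_to());
                // This prefix was just reported as valid, so this cannot fail.
                out.extend(str::from_utf8(valid).unwrap().chars());
                out.push('\u{FFFD}');
                match err.error_len() {
                    // Skip the invalid sequence and keep decoding.
                    Some(len) => bytes = &rest[len..],
                    // The input ended in the middle of a character.
                    None => return out,
                }
            }
        }
    }
}

fn main() {
    assert_eq!(lossy_chars(b"a\xFFb"), vec!['a', '\u{FFFD}', 'b']);
    // A truncated three-byte sequence at the end yields one U+FFFD.
    assert_eq!(lossy_chars(b"ab\xE2\x82"), vec!['a', 'b', '\u{FFFD}']);
}
```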

H2CO3 · December 20, 2018, 5:17pm

To part 1, you’re welcome, and to part 2, please don’t.

That said, there’s absolutely no doubt on my part that we need to handle many other sorts of string encodings, but the reason why subtyping or generics don’t feel like the right solution is that many encodings are just so different (from UTF-8 and from each other), with different typical/idiomatic use cases, wildly varying performance characteristics, etc., that bringing them under the same name sounds like asking for trouble.

This is one of the very rare cases where I think “overgeneralization” could be considered a real problem, and we should instead consider making UTF16String, Latin1String, etc., most probably in small, pluggable, external (non-std) crates. A trait describing the true/sensible intersection of the sets of common operations couldn’t hurt either (e.g. iteration over Unicode code points or grapheme clusters), but it would indeed be a pain to make String retroactively generic.

mcy · December 20, 2018, 5:20pm

You and me, too.

mcy · December 20, 2018, 5:27pm

I don't think it would be totally insane to do the C++ std::basic_string thing (i.e., std::string::EncodedString for us?) and make type String = EncodedString<Utf8>. Since EncodedString would consist of a Vec of code units, we keep the "String is a byte vec" guarantee. What I want to avoid is having to reimplement all of the string methods every time.
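
Roughly what I have in mind, as a self-contained sketch. Everything here is hypothetical: the reduced `Encoding` trait, the `EncodedString` type, and the alias names (I use `Utf8String` instead of `String` so the sketch does not shadow std). The point is only that methods are written once in the generic impl.

```rust
use std::marker::PhantomData;

/// Reduced encoding trait for this sketch (hypothetical).
trait Encoding {
    type CodeUnit;
    fn encode(text: &str) -> Vec<Self::CodeUnit>;
    fn decode(units: &[Self::CodeUnit]) -> Option<String>;
}

struct Utf8;
impl Encoding for Utf8 {
    type CodeUnit = u8;
    fn encode(text: &str) -> Vec<u8> { text.as_bytes().to_vec() }
    fn decode(units: &[u8]) -> Option<String> {
        std::str::from_utf8(units).ok().map(str::to_owned)
    }
}

struct Utf16;
impl Encoding for Utf16 {
    type CodeUnit = u16;
    fn encode(text: &str) -> Vec<u16> { text.encode_utf16().collect() }
    fn decode(units: &[u16]) -> Option<String> { String::from_utf16(units).ok() }
}

/// A string stored as a Vec of code units in encoding `E`.
struct EncodedString<E: Encoding> {
    units: Vec<E::CodeUnit>,
    // Marker for the encoding; carries no data.
    _encoding: PhantomData<E>,
}

// Written once, shared by every encoding.
impl<E: Encoding> EncodedString<E> {
    fn from_text(text: &str) -> Self {
        Self { units: E::encode(text), _encoding: PhantomData }
    }
    fn code_units(&self) -> &[E::CodeUnit] {
        &self.units
    }
    fn char_count(&self) -> usize {
        // Always valid here, since the units came from `E::encode`.
        E::decode(&self.units).map_or(0, |s| s.chars().count())
    }
}

type Utf8String = EncodedString<Utf8>;
type Utf16String = EncodedString<Utf16>;

fn main() {
    let a = Utf8String::from_text("h\u{E9}llo");
    let b = Utf16String::from_text("h\u{E9}llo");
    assert_eq!(a.code_units().len(), 6); // U+00E9 takes two bytes in UTF-8
    assert_eq!(b.code_units().len(), 5); // one u16 per character here
    assert_eq!(a.char_count(), b.char_count());
}
```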

burntsushi · December 20, 2018, 6:07pm

Yes... I agree. My crate is for the case where it is inconvenient or sub-optimal to require valid UTF-8 because the world we live in is not valid UTF-8 100% of the time. (For example, ripgrep, or any other tool that needs or wants to read the contents of files on a system that does not guarantee anything about the encoding of said files.)
