Strings and UTF-8


#1

I was wondering why the string documentation specifically mentions that they are UTF-8 encoded. Is there a specific reason why an encoding method was locked down?


#2

UTF-8 is layout-compatible with ASCII (as opposed to, say, UTF-16 in Java and JavaScript) and (to my knowledge) the single most common encoding out there. There’s nothing particularly special about the string types; if you really need to handle something weird like latin-1, I’m sure someone’s sorted out a crate for it.


#3

The Rust ecosystem is built around UTF-8. Any other default encoding would create compatibility problems.


#4

It means you can convert a UTF-8 encoded byte-blob to a str without having to copy it, guaranteed.
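
A minimal sketch of that guarantee, using only the standard library. The helper names `view_as_str` and `into_string` are made up for illustration; the calls they wrap (`std::str::from_utf8` and `String::from_utf8`) validate the bytes in place and hand back the same memory rather than a copy.

```rust
use std::str;

/// Borrow `bytes` as a `&str` if they are valid UTF-8.
/// Validation scans the bytes once; no new buffer is allocated.
fn view_as_str(bytes: &[u8]) -> Option<&str> {
    str::from_utf8(bytes).ok()
}

/// Take ownership of a byte vector as a `String`, reusing its allocation.
fn into_string(bytes: Vec<u8>) -> Option<String> {
    String::from_utf8(bytes).ok()
}
```

Comparing `as_ptr()` before and after either call is an easy way to confirm that the string points at the original buffer.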


#5

UTF-8 is basically the best option we have for string encoding today. There are many reasons, so I won’t repeat them here, because there’s a great piece of writing about all the arguments as to why everyone should use UTF-8 (by default and wherever possible): https://utf8everywhere.org/


#6

See also this document by Mark Davis, which surveys different string models and their trade-offs: https://docs.google.com/document/d/1wuzzMOvKOJw93SWZAqoim1VUl9mloUxE0W6Ki_G23tw


#7

Because it affects the possible APIs; most importantly which ones can be no-fail and non-reallocating.


#8

Without an encoding specified, you don’t have a string, you have bytes. If you want bytes, you can always use [u8] or Vec<u8>. str and String exist to contain Unicode characters, not just bytes. Of the ways to encode Unicode characters, UTF-8 provides a far more efficient encoding than UCS-4.
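
To put rough numbers on the size difference, here is a small sketch. The function names `utf8_len` and `ucs4_len` are hypothetical; the only assumption is that UCS-4 spends a fixed four bytes per code point, which is also the size of a Rust `char`.

```rust
use std::mem::size_of;

/// Bytes needed to store `s` as UTF-8 (this is how `str` already stores it).
fn utf8_len(s: &str) -> usize {
    s.len()
}

/// Bytes needed to store `s` as UCS-4: four bytes per code point.
fn ucs4_len(s: &str) -> usize {
    s.chars().count() * size_of::<char>()
}
```

For ASCII-heavy text UTF-8 is four times smaller; for CJK text, where most code points take three bytes in UTF-8, the gap narrows to about 25%.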


#9

FWIW, I’ve been working on a new string library that mirrors Rust’s String/&str types, except that instead of “strings are guaranteed to be valid UTF-8,” they are “conventionally UTF-8.” The motivation of the library is to address a very common pain point that I’ve come across over the years where I want to treat bytes as primarily UTF-8 without doing an upfront UTF-8 validation check. Most of the APIs remain the same, with the exception that the Unicode replacement codepoint features more prominently in operations that are only defined on Unicode codepoints (e.g., UTF-8 decoding, grapheme clusters and so on). In particular, the Vec<u8>/&[u8] types are insufficient to address this use case, since they lack a lot of the common operations one associates with strings.
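
To make the pain point concrete, here is a sketch using only the standard library, not the library described above. `lossy_uppercase` is a made-up helper. Note that `String::from_utf8_lossy` still scans the whole input up front and allocates when it finds invalid bytes, so this only illustrates the replacement-codepoint semantics, not the "no upfront validation" goal.

```rust
/// Uppercase bytes that are only "conventionally" UTF-8: invalid
/// sequences become U+FFFD instead of causing an error.
fn lossy_uppercase(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).to_uppercase()
}
```

With strict `std::str::from_utf8`, a single stray `0xFF` byte would make the whole input unusable as a string; here it degrades to one replacement character and the rest of the text is processed normally.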


#10

That sounds like a type I’d love to see in the standard library, eventually.


#11

This is my new favorite thing. Thank you.


The following is probably insane:

I wonder how bad of an idea it would be to change String to be String<E: Encoding = Utf8>. I don’t believe this would be a breaking change, though it would be absolutely awful to deal with making str generic in the encoding, since it’s stuck as a magic builtin for obnoxious historical reasons.

I’m going to go ahead and work out the Encoding trait, mostly to convince myself this is insane.

/// Represents an encoding of Unicode.
trait Encoding {
  /// The type representing a code unit for this
  /// encoding. For UTF-8, this is a `u8`.
  type CodeUnit;

  /// Wrap a stream of bytes encoding valid UTF-8
  /// into a stream of this encoding's code units.
  fn from_utf8(stream: impl Iterator<Item = u8>)
    -> impl Iterator<Item = Self::CodeUnit>;

  /// Wrap a stream using this encoding into a stream of UTF-8 bytes.
  fn to_utf8(stream: impl Iterator<Item = Self::CodeUnit>)
    -> impl Iterator<Item = u8>;
  
  /// Wrap a stream of Unicode codepoints into a stream encoding
  /// using this encoding. 
  fn from_codepoints(stream: impl Iterator<Item = char>)
    -> impl Iterator<Item = Self::CodeUnit>;
  
  /// Wrap a stream using this encoding into a stream of Unicode
  /// codepoints.
  fn to_codepoints(stream: impl Iterator<Item = Self::CodeUnit>)
    -> impl Iterator<Item = char>;
  
  /// Validate that this stream is, in fact, a stream using this encoding.
  fn validate(stream: impl Iterator<Item = Self::CodeUnit>) -> bool;
}

struct Utf8;
impl Encoding for Utf8 {
  type CodeUnit = u8;
  // ..
}

Ok I think I might have accidentally written down something reasonable. I really really don’t actually think String should be generic in encoding; making it this easy to reach for alternate encodings is going to just confuse people, instead of driving use towards the One True Encoding…

Edit: apparently I reinvented std::basic_string by accident? Probably worth disregarding this whole post.


#12

@burntsushi, do you intend to make the storage pluggable à la the string crate? I think there is definitely a need for more flexible strings on both axes: correctness guarantees and storage.


#13

No. Its internal representation is Vec<u8>/&[u8].


#14

Do you have examples now that make_ascii_lowercase(), etc., exist?


#15

If you take the APIs of String/&str and subtract the APIs of Vec<u8>/&[u8], then I think whatever you have leftover would be the starting set of examples. e.g., chars (Unicode aware), replace, to_lowercase (Unicode aware), to_uppercase (Unicode aware), the various split routines (Unicode aware), lines, find, rfind, the various trim routines, and so on. Then there are the various ecosystem Unicode crates, such as detection of various boundaries (lines, grapheme clusters, words, sentences). All of this stuff only works on strings that are guaranteed to be valid UTF-8, but all of these operations can be implemented for strings that are only conventionally UTF-8 by making choices about what to do in the presence of invalid UTF-8.
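
As one example of such an operation, here is a sketch of a substring search that works directly on bytes. `find_bytes` is a hypothetical helper, not an API of any crate mentioned here. Because UTF-8 is self-synchronizing, a byte-for-byte match returns the same offsets `str::find` would on valid input, and nothing has to be decided about invalid bytes at all.

```rust
/// Find the first occurrence of `needle` in `haystack`, byte for byte.
/// Neither argument has to be valid UTF-8.
fn find_bytes(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    if needle.is_empty() {
        // Same convention as `str::find("")`; also avoids `windows(0)`,
        // which panics.
        return Some(0);
    }
    haystack.windows(needle.len()).position(|w| w == needle)
}
```

This naive version is quadratic in the worst case; it is meant to show the API shape rather than a production search routine.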


#16

I think it provides significant ecosystem value that crates can focus on the guaranteed-valid case and don’t need to also implement the potentially-invalid case.

That said, &[u8] could pretty easily provide an iterator that yields chars from potentially-invalid UTF-8 with WHATWG Encoding Standard-compliant U+FFFD generation and splitting to lines based on LF and CRLF.
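
A rough sketch of the decoding half of that idea, written as an eager function rather than a lazy iterator. `decode_lossy` is a made-up name. It relies on `Utf8Error::valid_up_to` and `Utf8Error::error_len` from the standard library; as far as I know, replacing each reported invalid sequence with a single U+FFFD is the same policy `String::from_utf8_lossy` follows, but treat the WHATWG equivalence as an assumption rather than something this snippet proves.

```rust
use std::str;

/// Decode `bytes` as UTF-8, emitting U+FFFD for each invalid sequence.
fn decode_lossy(mut bytes: &[u8]) -> Vec<char> {
    let mut out = Vec::new();
    loop {
        match str::from_utf8(bytes) {
            Ok(valid) => {
                out.extend(valid.chars());
                return out;
            }
            Err(err) => {
                let (valid, rest) = bytes.split_at(err.valid_up_to());
                // The prefix was just reported as valid, so this cannot fail.
                out.extend(str::from_utf8(valid).unwrap().chars());
                out.push('\u{FFFD}');
                match err.error_len() {
                    // Skip the invalid sequence and keep decoding.
                    Some(n) => bytes = &rest[n..],
                    // `None` means the input ended in the middle of a sequence.
                    None => return out,
                }
            }
        }
    }
}
```

Line splitting could then be layered on top, for example by collecting the chars into a `String` and calling `lines()`, which already handles both LF and CRLF.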


#17

To part 1, you’re welcome, and to part 2, please don’t :smiley:

That said, there’s absolutely no doubt on my part that we need to handle many other sorts of string encodings, but the reason why subtyping or generics don’t feel like the right solution is that many encodings are just so different (from UTF-8 and from each other), with different typical/idiomatic use cases, wildly varying performance characteristics, etc., that bringing them under the same name sounds like asking for trouble.

This is one of the very rare cases where I think “overgeneralization” could be considered a real problem, and we should instead consider making UTF16String, Latin1String, etc., most probably in small, pluggable, external (non-std) crates. A trait describing the true/sensible intersection of the sets of common operations couldn’t hurt either (e.g. iteration over Unicode code points or grapheme clusters), but it would indeed be a pain to make String retroactively generic.


#18

You and me, too.


#19

I don’t think it would be totally insane to do the C++ std::basic_string thing (i.e. std::string::EncodedString for us?) and make type String = EncodedString<Utf8>. Since EncodedString would consist of a Vec of code units, we keep the “String is a byte vec” guarantee. What I want to avoid is having to reimplement all of the string methods every time.
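
For the sake of discussion, here is roughly what that shape might look like. Everything in it is hypothetical: the `Encoding` trait is a cut-down, per-character variant rather than the stream-based one sketched earlier in the thread, and the alias is called `Utf8String` so it does not shadow the real `String`.

```rust
use std::marker::PhantomData;

/// Hypothetical trait: names an encoding's code unit type and how to
/// encode a single code point.
trait Encoding {
    type CodeUnit: Copy;
    /// Append the code units for `c` to `out`.
    fn encode(c: char, out: &mut Vec<Self::CodeUnit>);
}

struct Utf8;
struct Utf16;

impl Encoding for Utf8 {
    type CodeUnit = u8;
    fn encode(c: char, out: &mut Vec<u8>) {
        let mut buf = [0u8; 4];
        out.extend_from_slice(c.encode_utf8(&mut buf).as_bytes());
    }
}

impl Encoding for Utf16 {
    type CodeUnit = u16;
    fn encode(c: char, out: &mut Vec<u16>) {
        let mut buf = [0u16; 2];
        out.extend_from_slice(c.encode_utf16(&mut buf));
    }
}

/// A vector of code units tagged with its encoding at the type level.
struct EncodedString<E: Encoding> {
    units: Vec<E::CodeUnit>,
    _encoding: PhantomData<E>,
}

impl<E: Encoding> EncodedString<E> {
    /// Written once, shared by every encoding.
    fn from_chars(chars: impl Iterator<Item = char>) -> Self {
        let mut units = Vec::new();
        for c in chars {
            E::encode(c, &mut units);
        }
        EncodedString { units, _encoding: PhantomData }
    }

    fn code_units(&self) -> &[E::CodeUnit] {
        &self.units
    }
}

/// With UTF-8 plugged in, the storage is a plain byte vector.
type Utf8String = EncodedString<Utf8>;
```

Methods like `from_chars` are written once against the trait, which is the "don't reimplement everything" part; whether that generality is worth the confusion is exactly what the rest of the thread argues about.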


#20

Yes… I agree. My crate is for the case where it is inconvenient or sub-optimal to require valid UTF-8 because the world we live in is not valid UTF-8 100% of the time. (For example, ripgrep, or any other tool that needs or wants to read the contents of files on a system that does not guarantee anything about the encoding of said files.)