Too many words on a `from_utf8_lossy` variant that takes a `Vec<u8>`

I've often wanted a variant of String::from_utf8_lossy which goes from Vec<u8> to String, instead of from &'a [u8] to Cow<'a, str>. This comes up when:

  • I want a String and not a Cow as output.
  • I have no use for the Vec beyond turning it into a string.

(An example of where I've wanted this: getting a String from Command::output's stdout/stderr buffers. But really, any other API that produces a "probably-UTF-8" Vec<u8> qualifies.)

As it is, if you're in this position, you generally call to_string/to_owned on the result of String::from_utf8_lossy, which ends up reallocating and copying the input in the case where it was already valid UTF-8. That is, this penalizes the already-UTF-8 case (and I assert with zero evidence, or intent to provide evidence: this is probably the more common case).
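To make that cost concrete, here's a minimal demonstration of the current workaround (variable names are mine, for illustration):

```rust
use std::borrow::Cow;

fn main() {
    let bytes: Vec<u8> = b"hello".to_vec();
    // Already-valid UTF-8 comes back as Cow::Borrowed...
    let cow = String::from_utf8_lossy(&bytes);
    assert!(matches!(cow, Cow::Borrowed(_)));
    // ...so `.into_owned()` allocates a fresh String and copies the
    // bytes, even though `bytes` is about to be dropped and could in
    // principle have been reused as the String's buffer.
    let s: String = cow.into_owned();
    assert_eq!(s, "hello");
}
```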

Anyway, there are basically three options here:

1. Add a new function to String

This seems like a good call (at least when compared to option 2) for several reasons, such as:

  • Keeping API surface area down
  • Simpler implementation
  • No chance of breakage due to new type inference failures (either for types which are Deref<Target=[u8]> or for &[u8; N], more on this later).
  • Easier to document
  • Same location as the other String::from_utf8* functions.

But has the big downside of: I have no idea what to call it.

My best idea is from_utf8_vec_lossy, and I've put an implementation of it below so that we're all on the same page in regards to what I'm talking about.

impl String {
    // ...
    #[inline]
    pub fn from_utf8_vec_lossy(vec: Vec<u8>) -> String {
        if let Cow::Owned(string) = String::from_utf8_lossy(&vec) {
            // See footnote 1 if your immediate thought is: "we should be reusing vec's memory in this case"
            string
        } else {
            // SAFETY: `String::from_utf8_lossy`'s contract ensures that if
            // it returns a `Cow::Borrowed`, it is identical to the input.
            unsafe { String::from_utf8_unchecked(vec) }
        }
    }
    // ...
}

I think the API suffers from having a very clunky name and being hard to discover. That said, if people think it's fine as is, I could have a PR up for adding it (as an unstable libs addition) pretty quickly.
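For reference, the same logic can be written as a standalone free function (substituting the safe `String::from_utf8` for the `unsafe` call, so the sketch compiles outside of std):

```rust
use std::borrow::Cow;

// Standalone sketch of the proposed `String::from_utf8_vec_lossy`,
// written as a free function so it compiles outside of std.
fn from_utf8_vec_lossy(vec: Vec<u8>) -> String {
    match String::from_utf8_lossy(&vec) {
        // Input contained invalid sequences; take the freshly-built String.
        Cow::Owned(string) => string,
        // Input was valid UTF-8: reuse the Vec's allocation. The safe
        // `from_utf8` can't fail here, since `from_utf8_lossy` just
        // validated these exact bytes.
        Cow::Borrowed(_) => String::from_utf8(vec).unwrap(),
    }
}

fn main() {
    assert_eq!(from_utf8_vec_lossy(b"hello".to_vec()), "hello");
    // 0xFF is never valid UTF-8, so it becomes U+FFFD.
    assert_eq!(from_utf8_vec_lossy(vec![0x68, 0x69, 0xFF]), "hi\u{fffd}");
}
```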

2. Make String::from_utf8_lossy generic, and able to accept both &[u8] and Vec<u8>.

I don't really love this approach since it complicates the API. I also wasn't able to figure out a way to do it that avoids type inference failures in all cases.

This would look like:

impl String {
    #[inline]
    pub fn from_utf8_lossy<I: IntoUtf8Lossy>(bytes: I) -> I::String {
        bytes.into_utf8_lossy()
    }
}

#[unstable(for a long time or forever idc)]
pub trait IntoUtf8Lossy {
    type String;
    fn into_utf8_lossy(self) -> Self::String;
}

impl IntoUtf8Lossy for Vec<u8> {
    type String = String;
    fn into_utf8_lossy(self) -> Self::String { ... }
}

impl<'a> IntoUtf8Lossy for &'a [u8] {
    type String = Cow<'a, str>;
    fn into_utf8_lossy(self) -> Self::String { ... }
}
// More impls (&Vec<u8>, &Box<[u8]>, &[u8; N], ...) to try to keep type
// inference working in as many cases as possible...

See https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=b2c6f6bb639ee42f29bd5bc95785afc4 for a more complete example I wrote when experimenting with this.

The problem I mentioned about type inference is: the current version works for any &T where T: Deref<Target = [u8]>, as well as for any &[u8; N] (surprisingly, &[u8; N] => &[u8] doesn't happen via Deref; it's an unsized coercion. TIL). Either of these could be preserved via a blanket impl (either impl<'a, T: Deref<Target=[u8]>> for &'a T or impl<'a, const N: usize> for &'a [u8; N]), but AFAIK not both (idk, maybe specialization could make it happen; I tried and failed).
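A quick demonstration of the two coercions the current non-generic signature relies on (the helper name here is mine, for illustration):

```rust
// Stand-in for any function taking a concrete &[u8], like the current
// String::from_utf8_lossy.
fn len_of(bytes: &[u8]) -> usize {
    bytes.len()
}

fn main() {
    let v: Vec<u8> = vec![1, 2, 3];
    // &Vec<u8> -> &[u8] happens via Deref coercion...
    assert_eq!(len_of(&v), 3);
    // ...while &[u8; 3] -> &[u8] is an unsized coercion, not Deref.
    assert_eq!(len_of(&[1u8, 2, 3]), 3);
    // A generic fn len_of<I: SomeTrait>(bytes: I) would accept neither
    // call without explicit impls, since coercions don't apply when the
    // parameter type is a type variable.
}
```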

This falls under the category of inference failures caused by generalization to generics (https://rust-lang.github.io/rfcs/1105-api-evolution.html#minor-change-generalizing-to-generics-1), which are considered minor changes (so long as the bustage is sufficiently rare, which... idk).

Even ignoring this, though, this API is a lot more complicated than "just add a new function", even if IntoUtf8Lossy would not really be a public trait. It seems like a lot of trouble either way, but I could be convinced to try and make this work.

That said, I've mostly mentioned it because it's a semi-obvious solution that I suspected would get suggested if I just asked what to name the function from option 1. The approach worked better than I expected, but it's still a lot of hassle.

3. Add a new function to Vec<u8>

I think this probably has the fewest downsides (it has all the benefits I mentioned for case 1 except for being with the other from_utf8s, and then it also doesn't have the "what to name it" issue).

That said, I initially discounted it (and promptly forgot about it) because, AFAICT, we don't have any impl blocks for Vec<ConcreteType> currently, and I didn't know if that's a deliberate choice.

impl Vec<u8> {
    #[inline]
    pub fn into_utf8_lossy(self) -> String {
        if let Cow::Owned(string) = String::from_utf8_lossy(&self) {
            // see footnote 1 about reusing memory.
            string
        } else {
            // SAFETY: `String::from_utf8_lossy`'s contract ensures that if
            // it returns a `Cow::Borrowed`, it is identical to the input.
            unsafe { String::from_utf8_unchecked(self) }
        }
    }
}

That said, I mostly forgot that this was an option when writing this, and wouldn't have written as much about all the tradeoffs above had I realized it was available.

This still has discoverability issues, and perhaps Vec::<u8>::into_string_lossy would be a better name (I like that it makes the input encoding clear, but it's not entirely consistent with the current naming scheme :frowning:... maybe I'm talking myself out of this one being the best option). Still, it seems like the best choice, unless there's a better name for option 1 that I'm just not thinking of.

So, I'm open to thoughts. Now that I've written all this, I might as well post this overthinking of the problem space, in order to head off any bikeshedding that might occur if I submitted a PR adding Vec::<u8>::into_utf8_lossy without discussing it first.

Prior art

  • https://docs.rs/bstr/0.2/bstr/trait.ByteVec.html#method.into_string_lossy, although I think this still allocates in the success case, which might be because it's unclear whether it's okay for an external crate to rely on the safety invariant I cite in the // SAFETY comment (though it feels like it would be, so maybe it's just a missed optimization).
  • Probably more...

Footnote: Reusing memory in the invalid case

In theory we could reuse the vec's memory for invalid inputs, but doing so naively would take O(input.len() * replacements_needed) time, since invalid sequences can (and likely will) have different lengths than the UTF-8 encoding of the replacement character. That means that unless it's done carefully, this is a lot worse than just allocating.
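A quick check of the length mismatch that makes the naive in-place approach quadratic:

```rust
fn main() {
    // The replacement character U+FFFD is 3 bytes in UTF-8...
    assert_eq!("\u{fffd}".len(), 3);
    // ...while an invalid sequence can be as short as one byte (e.g. a
    // lone 0xFF), so replacing in place has to shift the rest of the
    // buffer on each replacement: O(len * replacements) in the worst
    // case, rather than O(len).
    let lossy = String::from_utf8_lossy(&[0xFF, b'a', b'b']);
    assert_eq!(lossy, "\u{fffd}ab");
    assert_eq!(lossy.len(), 5); // 3-byte replacement + 2 ASCII bytes
}
```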

So, yeah. I'm mentioning it only because it's been brought up to me twice, once as a benefit to this API (IMO not really, as mentioned), and once as a reason this API was a bad idea (and I agree doing this would be a bad idea, but that's not what I'm suggesting).

Anyway, all the variants I'm suggesting permit doing this someday, if it's deemed worthwhile (possibly based on heuristics like: is the invalid sequence at, or close enough to, the end of the Vec<u8>? Is the invalid sequence the same length as "\u{fffd}"? Etc.).

And so, it's irrelevant I think.


Maybe I missed it, is the discussion happening elsewhere?

No, I think I just wrote too much to be easily consumed :pensive:

@tcsc There's an issue for this to be added: