Too many words on a `from_utf8_lossy` variant that takes a `Vec<u8>`

I've often wanted a variant of String::from_utf8_lossy which goes from Vec<u8> to String, instead of from &'a [u8] to Cow<'a, str>. This comes up when:

  • I want a String and not a Cow as output.
  • I have no use for the Vec beyond turning it into a string.

(An example of where I've wanted this: getting a String from Command::output's stdout/stderr buffers. But really, any other API that produces a "probably-UTF-8" Vec<u8> qualifies.)

As it is, if you're in this position, you generally call to_string/to_owned on the result of String::from_utf8_lossy, which ends up reallocating and copying the input in the case where it was already valid UTF-8. That is, this penalizes the already-UTF-8 case (and I assert with zero evidence, or intent to provide evidence: this is probably the more common case).
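To make that cost concrete, here's a minimal demonstration of the current workaround (variable names are mine, for illustration):

```rust
use std::borrow::Cow;

fn main() {
    let bytes: Vec<u8> = b"hello".to_vec();
    // Already-valid UTF-8 comes back as Cow::Borrowed...
    let cow = String::from_utf8_lossy(&bytes);
    assert!(matches!(cow, Cow::Borrowed(_)));
    // ...so `.into_owned()` allocates a fresh String and copies the
    // bytes, even though `bytes` is about to be dropped and could in
    // principle have been reused as the String's buffer.
    let s: String = cow.into_owned();
    assert_eq!(s, "hello");
}
```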

Anyway, there are basically three options here:

1. Add a new function to String

This seems like a good call (at least when compared to option 2) for several reasons, such as:

  • Keeping API surface area down
  • Simpler implementation
  • No chance of breakage due to new type inference failures (either for types which are Deref<Target=[u8]> or for &[u8; N], more on this later).
  • Easier to document
  • Same location as the other String::from_utf8* functions.

But has the big downside of: I have no idea what to call it.

My best idea is from_utf8_vec_lossy, and I've put an implementation of it below so that we're all on the same page in regards to what I'm talking about.

impl String {
    // ...
    #[inline]
    pub fn from_utf8_vec_lossy(vec: Vec<u8>) -> String {
        if let Cow::Owned(string) = String::from_utf8_lossy(&vec) {
            // See footnote 1 if your immediate thought is: "we should be reusing vec's memory in this case"
            string
        } else {
            // SAFETY: `String::from_utf8_lossy`'s contract ensures that if
            // it returns a `Cow::Borrowed`, it is identical to the input.
            unsafe { String::from_utf8_unchecked(vec) }
        }
    }
    // ...
}

I think the API suffers from having a very clunky name and being hard to discover. That said, if people think it's fine as is, I could have a PR up for adding it (as an unstable libs addition) pretty quickly.
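For reference, the same logic can be written as a standalone free function (substituting the safe `String::from_utf8` for the `unsafe` call, so the sketch compiles outside of std):

```rust
use std::borrow::Cow;

// Standalone sketch of the proposed `String::from_utf8_vec_lossy`,
// written as a free function so it compiles outside of std.
fn from_utf8_vec_lossy(vec: Vec<u8>) -> String {
    match String::from_utf8_lossy(&vec) {
        // Input contained invalid sequences; take the freshly-built String.
        Cow::Owned(string) => string,
        // Input was valid UTF-8: reuse the Vec's allocation. The safe
        // `from_utf8` can't fail here, since `from_utf8_lossy` just
        // validated these exact bytes.
        Cow::Borrowed(_) => String::from_utf8(vec).unwrap(),
    }
}

fn main() {
    assert_eq!(from_utf8_vec_lossy(b"hello".to_vec()), "hello");
    // 0xFF is never valid UTF-8, so it becomes U+FFFD.
    assert_eq!(from_utf8_vec_lossy(vec![0x68, 0x69, 0xFF]), "hi\u{fffd}");
}
```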

2. Make String::from_utf8_lossy generic, and able to accept both &[u8] and Vec<u8>.

I don't really love this approach since it complicates the API. I also wasn't able to figure out a way to do it that avoids type inference failures in all cases.

This would look like:

impl String {
    #[inline]
    pub fn from_utf8_lossy<I: IntoUtf8Lossy>(bytes: I) -> I::String {
        bytes.into_utf8_lossy()
    }
}

#[unstable(for a long time or forever idc)]
pub trait IntoUtf8Lossy {
    type String;
    fn into_utf8_lossy(self) -> Self::String;
}

impl IntoUtf8Lossy for Vec<u8> {
    type String = String;
    fn into_utf8_lossy(self) -> Self::String { ... }
}

impl<'a> IntoUtf8Lossy for &'a [u8] {
    type String = Cow<'a, str>;
    fn into_utf8_lossy(self) -> Self::String { ... }
}
// More impls (&Vec<u8>, &Box<[u8]>, &[u8; N], ...) to try to keep type
// inference working in as many cases as possible...

See https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=b2c6f6bb639ee42f29bd5bc95785afc4 for a more complete example I wrote when experimenting with this.

The problem I mentioned about type inference is: the current version works for any &T where T: Deref<Target = [u8]>, as well as for any &[u8; N] (surprisingly, &[u8; N] => &[u8] doesn't happen via Deref; it's an unsized coercion. TIL). Either of these could be preserved via a blanket impl (either impl<'a, T: Deref<Target=[u8]>> for &'a T or impl<'a, const N: usize> for &'a [u8; N]), but AFAIK not both (idk, maybe specialization could make it happen; I tried and failed).
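A quick demonstration of the two coercions the current non-generic signature relies on (the helper name here is mine, for illustration):

```rust
// Stand-in for any function taking a concrete &[u8], like the current
// String::from_utf8_lossy.
fn len_of(bytes: &[u8]) -> usize {
    bytes.len()
}

fn main() {
    let v: Vec<u8> = vec![1, 2, 3];
    // &Vec<u8> -> &[u8] happens via Deref coercion...
    assert_eq!(len_of(&v), 3);
    // ...while &[u8; 3] -> &[u8] is an unsized coercion, not Deref.
    assert_eq!(len_of(&[1u8, 2, 3]), 3);
    // A generic fn len_of<I: SomeTrait>(bytes: I) would accept neither
    // call without explicit impls, since coercions don't apply when the
    // parameter type is a type variable.
}
```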

This falls under the category of inference failures caused by generalization to generics (https://rust-lang.github.io/rfcs/1105-api-evolution.html#minor-change-generalizing-to-generics-1), which are considered minor changes (so long as the bustage is sufficiently rare, which... idk).

Even ignoring this, though, this API is a lot more complicated than "just add a new function", even if IntoUtf8Lossy would not really be a public trait. It seems like a lot of trouble either way, but I could be convinced to try and make this work.

That said, I've mostly mentioned it because it's a semi-obvious solution that I suspected would get suggested if I just asked what to name the function from option 1. The approach worked better than I expected, but it's still a lot of hassle.

3. Add a new function to Vec<u8>

I think this probably has the fewest downsides (it has all the benefits I mentioned for case 1 except for being with the other from_utf8s, and then it also doesn't have the "what to name it" issue).

That said, I initially discounted it (and promptly forgot about it) because, AFAICT, we don't have any impl blocks for Vec<ConcreteType> currently, and I didn't know if that's a deliberate choice.

impl Vec<u8> {
    #[inline]
    pub fn into_utf8_lossy(self) -> String {
        if let Cow::Owned(string) = String::from_utf8_lossy(&self) {
            // see footnote 1 about reusing memory.
            string
        } else {
            // SAFETY: `String::from_utf8_lossy`'s contract ensures that if
            // it returns a `Cow::Borrowed`, it is identical to the input.
            unsafe { String::from_utf8_unchecked(self) }
        }
    }
}

That said, I mostly forgot that this was an option when writing this, and wouldn't have written as much about all the tradeoffs above had I realized it was available.

This still has discoverability issues, and perhaps Vec::<u8>::into_string_lossy would be a better name (I like that it makes the input encoding clear, but it's not entirely consistent with the current naming scheme :frowning:... maybe I'm talking myself out of this one being the best option). Still, it seems like the best choice, unless there's a better name for option 1 that I'm just not thinking of.

So, I'm open to thoughts. Now that I've written all this, I might as well post this overthinking of the problem space, in order to head off any bikeshedding that might occur if I submitted a PR adding Vec::<u8>::into_utf8_lossy without discussing it first.

Prior art

  • https://docs.rs/bstr/0.2/bstr/trait.ByteVec.html#method.into_string_lossy, although I think this still allocates in the success case, which might be because it's unclear whether it's okay for an external crate to rely on the safety invariant I cite in the // SAFETY comment (though it feels like it would be, so maybe it's just a missed optimization).
  • Probably more...

Footnote: Reusing memory in the invalid case

In theory we could reuse the vec's memory for invalid inputs, but doing so naively would take O(input.len() * replacements_needed) time, since invalid sequences can (and likely will) have different lengths than the UTF-8 encoding of the replacement character. That means that unless it's done carefully, this is a lot worse than just allocating.
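A quick check of the length mismatch that makes the naive in-place approach quadratic:

```rust
fn main() {
    // The replacement character U+FFFD is 3 bytes in UTF-8...
    assert_eq!("\u{fffd}".len(), 3);
    // ...while an invalid sequence can be as short as one byte (e.g. a
    // lone 0xFF), so replacing in place has to shift the rest of the
    // buffer on each replacement: O(len * replacements) in the worst
    // case, rather than O(len).
    let lossy = String::from_utf8_lossy(&[0xFF, b'a', b'b']);
    assert_eq!(lossy, "\u{fffd}ab");
    assert_eq!(lossy.len(), 5); // 3-byte replacement + 2 ASCII bytes
}
```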

So, yeah. I'm mentioning it only because it's been brought up to me twice, once as a benefit to this API (IMO not really, as mentioned), and once as a reason this API was a bad idea (and I agree doing this would be a bad idea, but that's not what I'm suggesting).

Anyway, all the variants I'm suggesting permit doing this someday, if it's deemed worthwhile (possibly based on heuristics like: is the invalid sequence at, or close enough to, the end of the Vec<u8>? Is the invalid sequence the same length as "\u{fffd}"? Etc.).

And so, it's irrelevant I think.


Maybe I missed it, is the discussion happening elsewhere?

No, I think I just wrote too much to be easily consumed :pensive:

@tcsc There's an issue for this to be added: