I've often wanted a variant of `String::from_utf8_lossy` which goes from `Vec<u8>` to `String`, instead of from `&'a [u8]` to `Cow<'a, str>`. This happens when:
- I want a `String` and not a `Cow` as output.
- I have no use for the `Vec` beyond turning it into a string.

(An example case where I've wanted this: getting a `String` from `Command::output`'s stdout/stderr buffers. But really, any other API that produces a "probably-UTF-8" `Vec<u8>` qualifies.)
As it is, if you're in this position, you generally call `to_string`/`to_owned` on the result of `String::from_utf8_lossy`, which ends up reallocating and copying the input in the case where it was already valid UTF-8. That is, this penalizes the already-UTF-8 case (and I assert, with zero evidence or intent to provide evidence, that this is probably the more common case).
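To make the extra copy concrete, here's a minimal sketch of the status-quo workaround (a standalone illustration, not the proposed API):

```rust
use std::borrow::Cow;

fn main() {
    let bytes: Vec<u8> = b"hello".to_vec();
    // Status quo: borrow, convert lossily, then clone into an owned String.
    let cow: Cow<str> = String::from_utf8_lossy(&bytes);
    // `bytes` was already valid UTF-8, so this is a Cow::Borrowed...
    assert!(matches!(cow, Cow::Borrowed(_)));
    // ...and `into_owned` therefore allocates a fresh String and copies,
    // even though `bytes`'s buffer could in principle have been reused.
    let s: String = cow.into_owned();
    assert_eq!(s, "hello");
}
```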
Anyway, there are basically three options here:
1. Add a new function to `String`
This seems like a good call (at least when compared to option 2) for several reasons, such as:
- Keeping API surface area down.
- Simpler implementation.
- No chance of breakage due to new type inference failures (either for types which are `Deref<Target = [u8]>` or for `&[u8; N]`; more on this later).
- Easier to document.
- Same location as the other `String::from_utf8*` functions.
But it has one big downside: I have no idea what to call it. My best idea is `from_utf8_vec_lossy`, and I've put an implementation of it below so that we're all on the same page about what I'm talking about.
```rust
impl String {
    // ...
    #[inline]
    pub fn from_utf8_vec_lossy(vec: Vec<u8>) -> String {
        if let Cow::Owned(string) = String::from_utf8_lossy(&vec) {
            // See the footnote below if your immediate thought is:
            // "we should be reusing vec's memory in this case".
            string
        } else {
            // SAFETY: `String::from_utf8_lossy`'s contract ensures that if
            // it returns a `Cow::Borrowed`, it is identical to the input.
            unsafe { String::from_utf8_unchecked(vec) }
        }
    }
    // ...
}
```
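For reference, an equivalent free-function sketch that compiles outside of std and avoids `unsafe` by going through `String::from_utf8`'s error value (illustrative only; the name is the one proposed above):

```rust
// Same behavior as the proposed function, but written safely:
// try the cheap in-place conversion first, and fall back to the
// allocating lossy path only when the bytes weren't valid UTF-8.
fn from_utf8_vec_lossy(vec: Vec<u8>) -> String {
    match String::from_utf8(vec) {
        // Valid UTF-8: the Vec's allocation is reused, no copy.
        Ok(string) => string,
        // Invalid: fall back to the existing lossy conversion.
        Err(err) => String::from_utf8_lossy(err.as_bytes()).into_owned(),
    }
}

fn main() {
    assert_eq!(from_utf8_vec_lossy(b"hello".to_vec()), "hello");
    // 0xFF is never valid UTF-8 and becomes U+FFFD.
    assert_eq!(from_utf8_vec_lossy(vec![0xFF, b'a']), "\u{FFFD}a");
}
```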
I think the API suffers from having a very clunky name and being hard to discover. That said, if people think it's fine as is, I could have a PR up for adding it (as an unstable libs addition) pretty quickly.
2. Make `String::from_utf8_lossy` generic, and able to accept both `&[u8]` and `Vec<u8>`.
I don't really love this approach since it complicates the API. I also wasn't able to figure out a way to do it that avoids type inference failures in all cases.
This would look like:
```rust
impl String {
    #[inline]
    pub fn from_utf8_lossy<I: IntoUtf8Lossy>(bytes: I) -> I::String {
        bytes.into_utf8_lossy()
    }
}

#[unstable(for a long time or forever idc)]
pub trait IntoUtf8Lossy {
    type String;
    fn into_utf8_lossy(self) -> Self::String;
}

impl IntoUtf8Lossy for Vec<u8> {
    type String = String;
    fn into_utf8_lossy(self) -> Self::String { ... }
}

impl<'a> IntoUtf8Lossy for &'a [u8] {
    type String = Cow<'a, str>;
    fn into_utf8_lossy(self) -> Self::String { ... }
}

// More impls (&Vec<u8>, &Box<[u8]>, &[u8; N], ...) to try to keep type
// inference working in as many cases as possible...
```
See https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=b2c6f6bb639ee42f29bd5bc95785afc4 for a more complete example I wrote when experimenting with this.
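A compiling, self-contained version of the sketch (the trait and the generic entry point are hypothetical; the entry point stands in for the generalized `String::from_utf8_lossy`):

```rust
use std::borrow::Cow;

trait IntoUtf8Lossy {
    type String;
    fn into_utf8_lossy(self) -> Self::String;
}

impl IntoUtf8Lossy for Vec<u8> {
    type String = String;
    fn into_utf8_lossy(self) -> String {
        match String::from_utf8(self) {
            Ok(s) => s, // already valid: reuse the allocation
            Err(e) => String::from_utf8_lossy(e.as_bytes()).into_owned(),
        }
    }
}

impl<'a> IntoUtf8Lossy for &'a [u8] {
    type String = Cow<'a, str>;
    fn into_utf8_lossy(self) -> Cow<'a, str> {
        String::from_utf8_lossy(self)
    }
}

// Stand-in for the generalized `String::from_utf8_lossy`.
fn from_utf8_lossy_generic<I: IntoUtf8Lossy>(bytes: I) -> I::String {
    bytes.into_utf8_lossy()
}

fn main() {
    // Vec<u8> in, String out:
    let s: String = from_utf8_lossy_generic(vec![0xFF, b'a']);
    assert_eq!(s, "\u{FFFD}a");
    // &[u8] in, Cow<str> out (borrowed, since the input was valid):
    let c: Cow<str> = from_utf8_lossy_generic(&b"ok"[..]);
    assert!(matches!(c, Cow::Borrowed("ok")));
}
```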
The problem I mentioned about type inference is: the current version will work for any `&T` where `T: Deref<Target = [u8]>`, as well as any `&[u8; N]` (surprisingly, `&[u8; N]` to `&[u8]` doesn't happen via `Deref`; TIL). Either of these could be solved via a blanket impl (either `impl<'a, T: Deref<Target = [u8]>> IntoUtf8Lossy for &'a T` or `impl<'a, const N: usize> IntoUtf8Lossy for &'a [u8; N]`), but AFAIK both cannot coexist (idk, maybe specialization can make it happen; I tried and failed).
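A quick demonstration of that `Deref` observation (the helper here is just for illustration): array-to-slice conversion is a separate unsized coercion, not a `Deref` impl, so `[u8; N]` fails a `Deref<Target = [u8]>` bound that `Vec<u8>` and `Box<[u8]>` satisfy.

```rust
use std::ops::Deref;

// Accepts anything that derefs to a byte slice: Vec<u8>, Box<[u8]>, etc.
fn via_deref<T: Deref<Target = [u8]>>(t: &T) -> &[u8] {
    t // deref coercion: &T -> &[u8]
}

fn main() {
    let v = vec![1u8, 2];
    assert_eq!(via_deref(&v), &[1u8, 2][..]);
    let b: Box<[u8]> = Box::new([1u8, 2]); // unsized coercion at the `let`
    assert_eq!(via_deref(&b), &[1u8, 2][..]);
    // via_deref(&[1u8, 2]) would NOT compile: `[u8; 2]: Deref` doesn't
    // hold. `&[u8; 2] -> &[u8]` is an unsized coercion instead:
    let slice: &[u8] = &[1u8, 2];
    assert_eq!(slice, &[1u8, 2][..]);
}
```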
This falls under the category of inference failures caused by generalization to generics (https://rust-lang.github.io/rfcs/1105-api-evolution.html#minor-change-generalizing-to-generics-1), which are considered minor changes (so long as the bustage is sufficiently rare, which... idk).
Even ignoring this, though, this API is a lot more complicated than the "just add a new function" option, even if `IntoUtf8Lossy` would not really be a public trait... It seems like a lot of trouble either way, but I could be convinced to try to make this work. That said, I've mostly mentioned it because it's a semi-obvious solution that I suspected would get suggested if I asked what to name the first thing. The approach worked better than I expected, but it's still a lot of hassle.
3. Add a new function to `Vec<u8>`
I think this probably has the fewest downsides (it has all the benefits I mentioned for option 1 except for living alongside the other `from_utf8*` functions, and it also doesn't have the "what to name it" issue). That said, I initially discounted it (and promptly totally forgot about it) because AFAICT we don't currently have any impl blocks on `Vec<ConcreteType>`, and I didn't know if that's a deliberate choice.
```rust
impl Vec<u8> {
    #[inline]
    pub fn into_utf8_lossy(self) -> String {
        if let Cow::Owned(string) = String::from_utf8_lossy(&self) {
            // See the footnote below about reusing memory.
            string
        } else {
            // SAFETY: `String::from_utf8_lossy`'s contract ensures that if
            // it returns a `Cow::Borrowed`, it is identical to the input.
            unsafe { String::from_utf8_unchecked(self) }
        }
    }
}
```
That said, I mostly forgot that this was an option when writing this, and wouldn't have written as much about all the tradeoffs here had I realized it was available.
This still has discoverability issues, and perhaps `Vec::<u8>::into_string_lossy` would be a better name (I like that it makes the encoding clear, but it's not entirely consistent with the current naming scheme... maybe I'm talking myself out of this one being the best option...). But it seems like the best choice, unless there's a better name for option 1 that I'm just not thinking of.
So, I'm open to thoughts, and now that I've written this I might as well post this overthinking of the problem space, in order to head off any bikeshedding that might occur if I submitted a PR adding a `Vec::<u8>::into_utf8_lossy` without discussing it first.
Prior art
- https://docs.rs/bstr/0.2/bstr/trait.ByteVec.html#method.into_string_lossy, although I think this still allocates in the success case, which might be because it's unclear whether it's okay for an external crate to rely on the safety invariant I cite in the `// SAFETY` comment (although it feels like it would be, so maybe it's just a missed optimization).
- Probably more...
Footnote: Reusing memory in the invalid case
In theory we could reuse the vec's memory for invalid inputs, but naively doing so would take `O(input.len() * replacements_needed)` time, as invalid sequences can (and likely will) have a different length than the UTF-8 encoding of the replacement character. That means unless it's done carefully, this is a lot worse than just allocating.
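The size mismatch driving that bound: a one-byte invalid sequence grows to three bytes, so a naive in-place fix-up has to shift the whole tail of the buffer for each replacement.

```rust
fn main() {
    // U+FFFD REPLACEMENT CHARACTER encodes to 3 bytes in UTF-8...
    assert_eq!("\u{FFFD}".len(), 3);
    // ...while an invalid sequence can be a single byte, so each
    // replacement changes the buffer's length:
    let fixed = String::from_utf8_lossy(&[0xFF]);
    assert_eq!(fixed.len(), 3);
    assert_eq!(fixed, "\u{FFFD}");
}
```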
So, yeah. I'm mentioning it only because it's been brought up to me twice: once as a benefit of this API (IMO not really, as mentioned), and once as a reason this API was a bad idea (I agree doing this would be a bad idea, but it's not what I'm suggesting).
Anyway, all the variants I'm suggesting permit doing this someday, if it's deemed worthwhile (possibly based on heuristics like: is the invalid sequence at the end of the `Vec<u8>` (or close enough)? Is the invalid sequence the same length as `"\u{fffd}"`? Etc.).
And so, I think it's irrelevant.