Micro RFC: `String::from_utf8_with` for different handling of invalid UTF-8

Currently, there is only one way to handle invalid UTF-8: String::from_utf8_lossy and {OsStr, CStr, Path}::to_string_lossy, which replace any invalid sequence with the replacement character (U+FFFD, �). However, it is sometimes useful to do something else with invalid bytes: either drop them completely, or display them in some recoverable way.

This is mainly useful for C/OS interfaces, as well as WebAssembly (JavaScript strings can contain unpaired surrogates, i.e. invalid UTF-16). However, it is also valuable while debugging to be able to see exactly what a mostly-UTF-8 byte slice contains. Something like this has been indirectly asked for before, e.g. the Stack Overflow question "Is it possible to decode bytes to UTF-8, converting errors to escape sequences in Rust?"

Python's codecs module provides a good reference implementation (see its `ignore`, `replace`, and `backslashreplace` error handlers).

This proposal would add a new associated function to accompany any *_lossy string-related methods:

String::from_utf8_with<H: EncodingHandler>(v: &[u8], handler: H) -> Cow<'_, str>;
String::from_utf16_with<H: EncodingHandler>(v: &[u16], handler: H) -> String;
OsStr::to_string_with<H: EncodingHandler>(&self, handler: H) -> Cow<'_, str>;
CStr::to_string_with<H: EncodingHandler>(&self, handler: H) -> Cow<'_, str>;
Path::to_string_with<H: EncodingHandler>(&self, handler: H) -> Cow<'_, str>;

(Note that from_utf16_with takes &[u16] and, like from_utf16_lossy, must return an owned String, since UTF-16 input can never be borrowed as a str.)

All of these would take a handler describing how to deal with invalid sequences; an enum provides some commonly used defaults:

#[non_exhaustive]
enum Handler {
    /// Remove invalid unicode
    Ignore,
    /// Replace invalid unicode with �.
    /// This is the current behavior of `from_utf8_lossy`
    Replace,
    /// Replace invalid unicode with e.g. "\x9f". This is the equivalent of Python's
    /// `backslashreplace`
    HexEscape, // BackslashReplace would also work
    /// Exact result TBD, some form of "\u{dcf0}", or "\udcf0"
    /// This version may not be as useful
    SurrogateEscape,
}

impl EncodingHandler for Handler {}

/// Allow user-implemented closures; best signature TBD. The byte
/// slice would be the invalid chunk, and the closure should return
/// the intended replacement.
impl<F> EncodingHandler for F where F: for<'a> Fn(&'a [u8]) -> Cow<'a, str> {}

Implementation is fairly trivial: it uses the same basis as from_utf8_lossy, with slight adjustments to what is done with invalid byte ranges. Naming is of course not final.
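For illustration, here is a rough sketch of what the `HexEscape` behavior could look like on stable Rust today, built on `std::str::from_utf8` and `Utf8Error`'s `valid_up_to`/`error_len` (the function name is hypothetical, not proposed API):

```rust
use std::borrow::Cow;

// Hypothetical sketch of the proposed `HexEscape` handler behavior.
fn from_utf8_hex_escape(mut v: &[u8]) -> Cow<'_, str> {
    // Fast path: fully valid input borrows without allocating.
    if let Ok(s) = std::str::from_utf8(v) {
        return Cow::Borrowed(s);
    }
    let mut out = String::with_capacity(v.len());
    loop {
        match std::str::from_utf8(v) {
            Ok(s) => {
                out.push_str(s);
                break;
            }
            Err(e) => {
                // Copy the valid prefix, then escape the invalid run.
                let (valid, rest) = v.split_at(e.valid_up_to());
                out.push_str(std::str::from_utf8(valid).unwrap());
                // `error_len()` is None only for a truncated sequence
                // at the end of the input.
                let bad_len = e.error_len().unwrap_or(rest.len());
                for b in &rest[..bad_len] {
                    out.push_str(&format!("\\x{b:02x}"));
                }
                v = &rest[bad_len..];
            }
        }
    }
    Cow::Owned(out)
}
```

A real implementation would share the existing lossy-conversion machinery rather than re-running validation from each error point, but the control flow is the same.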

So, what are thoughts about an official RFC to add these methods?

Is there a reason not to take a closure that could "map" the invalid characters instead of using an enum?


These days, this should be submitted as an API Change Proposal (ACP) rather than an RFC.


I would rather see a function like this added to std: `bstr::decode_utf8`.

Then you can pretty easily implement whatever semantics you want.

(I believe you can use std::str::from_utf8 to implement a decode_utf8 function today, but it's pretty awkward.)
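For reference, a stand-in for `bstr::decode_utf8` can indeed be built on `std::str::from_utf8`, and it is about as awkward as described. A sketch (not bstr's actual implementation):

```rust
// Rough stand-in for `bstr::decode_utf8`: returns the first codepoint
// (None if the leading bytes are invalid) and the bytes consumed.
fn decode_utf8(slice: &[u8]) -> (Option<char>, usize) {
    if slice.is_empty() {
        return (None, 0);
    }
    // A UTF-8 sequence is at most 4 bytes, so only that prefix matters.
    let prefix = &slice[..slice.len().min(4)];
    match std::str::from_utf8(prefix) {
        Ok(s) => {
            let ch = s.chars().next().unwrap();
            (Some(ch), ch.len_utf8())
        }
        // The first character is valid even though a later one is not.
        Err(e) if e.valid_up_to() > 0 => {
            let ch = std::str::from_utf8(&prefix[..e.valid_up_to()])
                .unwrap()
                .chars()
                .next()
                .unwrap();
            (Some(ch), ch.len_utf8())
        }
        // Invalid from byte 0: skip the maximal invalid subsequence,
        // or everything we have if the input is merely truncated.
        Err(e) => (None, e.error_len().unwrap_or(prefix.len())),
    }
}
```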


For String we could add something like valid_prefix() and remainder() to FromUtf8Error. This could also be helpful in other contexts, e.g. when parsing something that's partially UTF-8, partially binary data.

I think we get that by virtue of Utf8Error, which you can in turn get from FromUtf8Error::utf8_error.
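As a sketch of how that split already falls out of `FromUtf8Error` on stable Rust (the helper name `valid_prefix_and_remainder` is made up here, not an existing API):

```rust
// Split a byte vector into its valid-UTF-8 prefix and the remaining
// bytes, using only stable `FromUtf8Error` accessors.
fn valid_prefix_and_remainder(v: Vec<u8>) -> (String, Vec<u8>) {
    match String::from_utf8(v) {
        Ok(s) => (s, Vec::new()),
        Err(e) => {
            // `utf8_error` reports where validity ends; `into_bytes`
            // gives the original buffer back.
            let valid_up_to = e.utf8_error().valid_up_to();
            let mut bytes = e.into_bytes();
            let remainder = bytes.split_off(valid_up_to);
            // `bytes` now holds only the valid prefix, so this cannot fail.
            (String::from_utf8(bytes).unwrap(), remainder)
        }
    }
}
```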

Does the (unstable) Utf8Chunks API enable what you want? That appears to be how the lossy conversion is currently implemented.


Alternatively, maybe a decode function that accepts a closure that deals with the invalid bytes?

How much does that design — yielding (at most) one char at a time, with iteration controlled by the caller — limit the ability to apply SIMD? (I ask as someone who is not familiar with SIMD but recalls that you are.)

I imagine it would inhibit SIMD or pretty much any other throughput related optimizations. The API is meant for convenience when you just want to pluck a codepoint off of the beginning of a byte slice. In that sense, it is quite versatile.

bstr does still use SIMD for UTF-8 decoding in APIs that permit it. (Such as ByteSlice::is_utf8.)

Instead of a non-exhaustive enum for behaviors, could this be captured in a trait and types implementing it?

Is there a reason not to take a closure that could "map" the invalid characters instead of using an enum?

Alternatively, maybe a decode function that accepts a closure that deals with the invalid bytes?

Instead of a non-exhaustive enum for behaviors, could this be captured in a trait and types implementing it?

These are all good points - I updated the sample to be trait-based so the user can write their own closure. The reason for providing the enum at all was just to provide some commonly used defaults.

These days, this should be submitted as API Change Proposal rather than an RFC.

Thanks for the clarification, I suppose it's still good to collect feedback here before actually proposing something.

I would rather see a function like this added to std: `bstr::decode_utf8`.

Then you can pretty easily implement whatever semantics you want.

That is also not a bad idea in any case, and a good option if this isn't seen as fit for std. I don't think it's a far stretch for std though, since the surface area is fairly minimal and the use cases aren't uncommon (e.g., wanting to display a path without �, or wanting to retain some information about which bytes are invalid).

(I believe it's true you can use std::str::from_utf8 to actually implement a decode_utf8 function today, but it's pretty awkward.)

That's exactly what I did for our regex wasm thingy, but it's clunky (there's room for cleanup, but I don't think that much). That's what led me to write this up :)

I think if you're adding traits and what not, then the API surface is not something I personally would consider minimal. (In case you didn't know, I am on libs-api.) For something like that, my first response to any such proposal would be to prototype it in a crate first and get people to use it. Because there's no specific reason it needs to live in std.

The nice thing about bstr::decode_utf8 is that it's extremely flexible and basically lets anyone implement whatever they want at the very lowest layer. That's what I would consider minimal. :)

Does the (unstable) Utf8Chunks API enable what you want? That appears to be how the lossy conversion is currently implemented.

It does allow for the implementation! At least something less clunky than what I linked in this reply. There may be a better place for this API within/using Utf8Chunks somehow, but adding some default implementations for the most common cases is what I'm hoping for.
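As a side note, Utf8Chunks has since been stabilized (as `<[u8]>::utf8_chunks()`, in Rust 1.79), so the common cases are already expressible on a recent toolchain. For example, the proposed `Ignore` behavior (the function name is illustrative, not proposed API):

```rust
// `Ignore` semantics via the stabilized Utf8Chunks iterator
// (requires Rust 1.79 or later).
fn from_utf8_ignore(v: &[u8]) -> String {
    let mut out = String::with_capacity(v.len());
    for chunk in v.utf8_chunks() {
        // Keep each valid run; `chunk.invalid()` would yield the
        // ill-formed byte run, which `Ignore` simply drops.
        out.push_str(chunk.valid());
    }
    out
}
```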

I think if you're adding traits and what not, then the API surface is not something I personally would consider minimal. (In case you didn't know, I am on libs-api.) For something like that, my first response to any such proposal would be to prototype it in a crate first and get people to use it. Because there's no specific reason it needs to live in std.

The nice thing about bstr::decode_utf8 is that it's extremely flexible and basically lets anyone implement whatever they want at the very lowest layer. That's what I would consider minimal. :)

I suppose I meant more so that the added maintenance area is fairly minimal, since it would consume the current implementation of from_utf8_lossy.

Is your thought for bstr something like a decode_utf8_with that returns (Result<char, u8>, usize) or Result<(char, usize), u8>, to parallel decode_utf8 but allow special handling of the invalid byte?

I'm not sure what you mean exactly. bstr::decode_utf8, as it currently exists, should be able to handle all of the use cases you've outlined. I don't know why a decode_utf8_with would be needed.

Ah, I just misunderstood the quote below - I thought you were suggesting somehow adding the function from this proposal to the bstr crate. But this makes more sense - I'll write up an ACP that goes this route.

I would rather see a function like this added to std: `bstr::decode_utf8`.

Leaving aside the question of whether this belongs in the stdlib (now or eventually), I have some feedback on the API:

  • Instead of having an enum, and a callback function signature, both of which implement a trait, wouldn't it be simpler to have a callback function signature, plus a module (if in the stdlib, something like std::str::encoding_errors perhaps) containing a bunch of free functions with that signature?

  • The callback function signature you suggested doesn't seem right to me. Shouldn't it be something more like this?

    /// Callback function to handle invalid byte sequences encountered
    /// when converting an external representation to Unicode.  The first
    /// argument is a subslice of the byte sequence being converted, with
    /// index 0 the first byte of an invalid sequence.  The second argument
    /// may be called zero or more times to append characters to the string
    /// under construction.  Returns the number of bytes consumed from the
    /// input byte sequence; this should advance the conversion operation
    /// to the beginning of the next valid character.
    type ToUnicodeErrorHandler<'a> =
        dyn Fn(&'a [u8], &mut dyn FnMut(char)) -> NonZeroUsize;
    
  • Naming bikeshed: I think ..._lossy_with is not a very helpful name suffix and suggest ..._with_recovery instead.

  • Instead of having an enum, and a callback function signature, both of which implement a trait, wouldn't it be simpler to have a callback function signature, plus a module (if in the stdlib, something like std::str::encoding_errors perhaps) containing a bunch of free functions with that signature?

That's also a possibility, and it is more or less what the enum would have to desugar to. I just couldn't think of anywhere in std that uses this kind of module, so an enum seemed like a more concise way of namespacing.

The callback function signature you suggested doesn't seem right to me. Shouldn't it be something more like this?

type ToUnicodeErrorHandler<'a> =
    dyn Fn(&'a [u8], &mut dyn FnMut(char)) -> NonZeroUsize;

This seems a bit like the inverse of bstr::decode_utf8, and it would be nice to avoid copying. Perhaps something taking impl Write would be more idiomatic? That wouldn't allow for non-allocating implementations, though...

As a nit, I'd think it would be better if the &[u8] points just to the invalid chunk rather than the entire slice, so the user doesn't have to use other utf8 logic to determine where the invalid slice ends.

Hm... that does remind me that there could likely be a fn(&mut [u8]) -> usize version to work with &mut [u8], which would allow for Ignore or same-size replacements without allocating.
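A sketch of that in-place idea, assuming each invalid byte is overwritten with a same-size placeholder such as b'?' (the function name and placeholder choice are illustrative only):

```rust
// In-place, non-allocating repair: overwrite every invalid byte with
// b'?' and return how many bytes were replaced. Afterwards the buffer
// is guaranteed to be valid UTF-8.
fn sanitize_in_place(buf: &mut [u8]) -> usize {
    let mut i = 0;
    let mut replaced = 0;
    while i < buf.len() {
        match std::str::from_utf8(&buf[i..]) {
            Ok(_) => break, // the rest is valid
            Err(e) => {
                let bad_start = i + e.valid_up_to();
                // A truncated trailing sequence has no `error_len`;
                // treat everything remaining as invalid.
                let bad_len = e.error_len().unwrap_or(buf.len() - bad_start);
                for b in &mut buf[bad_start..bad_start + bad_len] {
                    *b = b'?';
                    replaced += 1;
                }
                i = bad_start + bad_len;
            }
        }
    }
    replaced
}
```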

  • Naming bikeshed: I think ..._lossy_with is not a very helpful name suffix and suggest ..._with_recovery instead.

Agreed, especially since they're not always actually lossy. Maybe _with is enough, since it's likely easy to infer. I'll update the main post.