Currently, there is only one way to handle invalid UTF-8 without rejecting the input outright: String::from_utf8_lossy and {OsStr, CStr, Path}::to_string_lossy, which replace any invalid sequence with the replacement character � (U+FFFD). However, it is sometimes useful to be able to do something else with invalid bytes: either hide them completely, or display them in some usable way.
This is mainly useful for C/OS interfaces, as well as WASM (JavaScript strings can contain invalid UTF-16). It is also valuable while debugging to be able to see exactly what a mostly-UTF-8 byte slice contains. Something like this has been indirectly asked for before, in the Stack Overflow question "Is it possible to decode bytes to UTF-8, converting errors to escape sequences in Rust?"
Python provides a good reference implementation in the error handlers of its codecs module (see "codecs: Codec registry and base classes" in the Python 3.11.2 documentation).
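For comparison, here is a small sketch of what the existing lossy conversions do today. It only uses functions that already exist in std; the sample bytes are made up for illustration.

```rust
use std::borrow::Cow;
use std::ffi::CStr;

fn main() {
    // 0x9f is a lone continuation byte, so it is not valid UTF-8 on its own.
    let bytes = b"ab\x9fcd";

    // The invalid byte becomes U+FFFD; its original value is lost.
    let lossy = String::from_utf8_lossy(bytes);
    assert_eq!(lossy, "ab\u{FFFD}cd");

    // Valid input is borrowed rather than copied.
    assert!(matches!(String::from_utf8_lossy(b"abcd"), Cow::Borrowed("abcd")));

    // CStr::to_string_lossy behaves the same way.
    let c = CStr::from_bytes_with_nul(b"ab\x9fcd\0").unwrap();
    assert_eq!(c.to_string_lossy(), "ab\u{FFFD}cd");

    println!("{lossy}");
}
```

After the conversion there is no way to tell which byte was replaced, which is the gap this proposal is about.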
This proposal would add a new associated function to accompany each of the string-related *_lossy methods:
String::from_utf8_with<H: EncodingHandler>(v: &[u8], handler: H) -> Cow<'_, str>;
String::from_utf16_with<H: EncodingHandler>(v: &[u16], handler: H) -> Cow<'_, str>;
OsStr::to_string_with<H: EncodingHandler>(&self, handler: H) -> Cow<'_, str>;
CStr::to_string_with<H: EncodingHandler>(&self, handler: H) -> Cow<'_, str>;
Path::to_string_with<H: EncodingHandler>(&self, handler: H) -> Cow<'_, str>;
Each of these would take a handler that describes how to deal with invalid input: either a variant of the following enum, or a user-supplied function:
#[non_exhaustive]
enum Handler {
    /// Remove invalid unicode
    Ignore,
    /// Replace invalid unicode with �.
    /// This is the current behavior of `from_utf8_lossy`
    Replace,
    /// Replace invalid unicode with e.g. "\x9f". This is the equivalent of Python's
    /// `backslashreplace`
    HexEscape, // BackslashReplace would also work
    /// Exact result TBD, some form of "\u{dcf0}", or "\udcf0"
    /// This version may not be as useful
    SurrogateEscape,
}

impl EncodingHandler for Handler {}

/// Allow user-implemented function, best signature TBD. The byte
/// slice would be the invalid chunk, and it should return the intended
/// replacement.
impl<'a, F> EncodingHandler for F where F: Fn(&'a [u8]) -> Cow<'a, str> {}
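To make the function-handler form concrete, here is a minimal sketch of how it could behave. The free function from_utf8_with_fn is a hypothetical stand-in, not an existing std API, and the percent-style escape in the usage is just one example of what a caller might choose. It assumes the handler receives each invalid chunk and returns its replacement, as described in the doc comment above.

```rust
use std::borrow::Cow;
use std::str;

/// Hypothetical stand-in for the closure form of the proposed API.
fn from_utf8_with_fn<'a, F>(input: &'a [u8], handler: F) -> Cow<'a, str>
where
    F: Fn(&'a [u8]) -> Cow<'a, str>,
{
    // Fully valid input is borrowed, as with `from_utf8_lossy`.
    if let Ok(valid) = str::from_utf8(input) {
        return Cow::Borrowed(valid);
    }

    let mut out = String::with_capacity(input.len());
    let mut rest = input;
    loop {
        match str::from_utf8(rest) {
            Ok(valid) => {
                out.push_str(valid);
                return Cow::Owned(out);
            }
            Err(err) => {
                let (valid, after) = rest.split_at(err.valid_up_to());
                // This prefix was just reported as valid, so this cannot fail.
                out.push_str(str::from_utf8(valid).unwrap());
                // `error_len` is `None` when the input ends mid-sequence.
                let bad_len = err.error_len().unwrap_or(after.len());
                let (bad, tail) = after.split_at(bad_len);
                // The caller decides what the invalid chunk turns into.
                out.push_str(&handler(bad));
                rest = tail;
            }
        }
    }
}

fn main() {
    // Example handler: percent-escape every invalid byte.
    let escaped = from_utf8_with_fn(b"ab\x9fcd", |bad| {
        Cow::Owned(bad.iter().map(|b| format!("%{:02X}", b)).collect::<String>())
    });
    assert_eq!(escaped, "ab%9Fcd");

    // A handler can also return a borrowed constant.
    let masked = from_utf8_with_fn(b"ab\x9fcd", |_| Cow::Borrowed("?"));
    assert_eq!(masked, "ab?cd");
}
```

Returning Cow from the handler lets constant replacements avoid an allocation, which is presumably why the proposed signature uses it.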
The implementation is fairly trivial: it uses the same basis as from_utf8_lossy, with slight adjustments to how invalid byte ranges are handled. Naming is of course non-final.
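As a rough illustration of how small the change is, the sketch below walks the input with the slice method utf8_chunks (stable since Rust 1.79) and only varies what is emitted for each invalid chunk. The free function from_utf8_with and this copy of Handler are hypothetical, and the SurrogateEscape output is just one possible reading of the "exact result TBD" note above.

```rust
use std::borrow::Cow;
use std::fmt::Write;

/// Mirrors the enum from the proposal; not an existing std type.
#[derive(Clone, Copy, Debug)]
pub enum Handler {
    Ignore,
    Replace,
    HexEscape,
    SurrogateEscape,
}

/// Hypothetical free-function version of `String::from_utf8_with`.
pub fn from_utf8_with(v: &[u8], handler: Handler) -> Cow<'_, str> {
    // Valid input is borrowed unchanged.
    if let Ok(s) = std::str::from_utf8(v) {
        return Cow::Borrowed(s);
    }

    let mut out = String::with_capacity(v.len());
    for chunk in v.utf8_chunks() {
        out.push_str(chunk.valid());
        let bad = chunk.invalid();
        if bad.is_empty() {
            continue;
        }
        match handler {
            // Drop the invalid bytes entirely.
            Handler::Ignore => {}
            // One U+FFFD per invalid chunk.
            Handler::Replace => out.push('\u{FFFD}'),
            // Each invalid byte becomes a textual escape such as \x9f.
            Handler::HexEscape => {
                for b in bad {
                    write!(out, "\\x{:02x}", b).unwrap();
                }
            }
            // Assumed form: the byte's Python-style surrogate code point
            // (0xDC00 + byte), written as text such as \u{dc9f}.
            Handler::SurrogateEscape => {
                for b in bad {
                    write!(out, "\\u{{dc{:02x}}}", b).unwrap();
                }
            }
        }
    }
    Cow::Owned(out)
}

fn main() {
    let bytes = b"ab\x9fcd";
    assert_eq!(from_utf8_with(bytes, Handler::Ignore), "abcd");
    assert_eq!(from_utf8_with(bytes, Handler::Replace), String::from_utf8_lossy(bytes));
    assert_eq!(from_utf8_with(bytes, Handler::HexEscape), "ab\\x9fcd");
    assert_eq!(from_utf8_with(bytes, Handler::SurrogateEscape), "ab\\u{dc9f}cd");
}
```

Because a Rust str cannot hold real surrogate code points, this sketch writes them out as text; a real design would have to settle that detail.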
So, what are your thoughts on an official RFC to add these methods?