`into_string_lossy` methods for `CString` and `OsString`?

The borrowed "FFI string" types CStr and OsStr have a pair of methods to convert them into UTF-8 strings, to_str and to_string_lossy. Both of them can borrow the original data; the latter returns a Cow in case loss occurs and the string must be modified.

I noticed that the owned FFI strings CString and OsString are missing the same pair of functions to convert them into an owned String. There is into_string, which is the owned counterpart of to_str, but there is no owned equivalent of to_string_lossy. Yes, these two types have all the methods of CStr and OsStr (they deref to the borrowed types), but that means you are creating a borrowed str from an owned FFI string. To get an owned String you have to use .to_string_lossy().into_owned(), which clones the data, so you end up with both the original string allocation and a new String.

What about adding into_string_lossy to do this? It would lossily convert an owned FFI string to an owned String. If the conversion is lossless, it just takes ownership of the internal Vec<u8> and puts that in the String, so it's not allocating. If the conversion is lossy, then it could still reuse the original allocation if the changes can be made in place, and would only make a new allocation if that isn't possible. In practice, the conversion will usually be lossless, making it more efficient than .to_string_lossy().into_owned().
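As a rough sketch of the semantics I have in mind, here is the CString case written as a free function. The name cstring_into_string_lossy is made up for illustration, since nothing like it exists in std today. It only demonstrates the cheap lossless path: in the lossy case it simply falls back to a fresh allocation rather than attempting any in-place repair.

```rust
use std::ffi::CString;

/// Hypothetical stand-in for the proposed `CString::into_string_lossy`.
fn cstring_into_string_lossy(s: CString) -> String {
    // `into_bytes` hands over the buffer without the trailing NUL byte.
    let bytes = s.into_bytes();
    match String::from_utf8(bytes) {
        // Lossless case: the String takes over the existing buffer.
        Ok(string) => string,
        // Lossy case: build a repaired copy from the rejected bytes.
        // (Assumption of this sketch: no attempt at in-place repair.)
        Err(err) => String::from_utf8_lossy(err.as_bytes()).into_owned(),
    }
}
```

A real implementation inside std could presumably do better in the lossy branch, but the lossless branch is the one that matters most in practice.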

One place where this would be useful is functions that query the OS and give you an OsString, like std::env::var_os. I want a String here, but generally with a lossy conversion rather than a fallible one, so std::env::var is not suitable.
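To make the use case concrete, this is roughly the helper I end up writing today. Both function names (os_string_lossy and env_var_lossy) are my own and not part of std; the sketch assumes you are fine with a copy whenever the value is not valid Unicode.

```rust
use std::env;
use std::ffi::OsString;

/// Hypothetical helper: lossy OsString -> String that keeps the
/// original allocation when the contents are already valid Unicode.
fn os_string_lossy(value: OsString) -> String {
    match value.into_string() {
        Ok(s) => s,
        // `OsString::into_string` returns the original OsString on failure.
        Err(original) => original.to_string_lossy().into_owned(),
    }
}

/// Hypothetical helper: like `std::env::var`, but never fails on
/// non-Unicode values; returns None only when the variable is unset.
fn env_var_lossy(key: &str) -> Option<String> {
    env::var_os(key).map(os_string_lossy)
}
```

With an into_string_lossy method, the first helper would collapse into a single call.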


For CString, you can accomplish the ownership transfer with the admittedly clunky .into_string().unwrap_or_else(|e| e.into_cstring().to_string_lossy().into_owned()). (For OsString the error value is already the original OsString, so the into_cstring step drops out.) This does a copy in the lossy case, but arguably you will usually want to do so, because shifting the buffer after the replacement splice is as expensive as, if not more expensive than, a full buffer copy to the fixed string.
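Spelled out as a function for CString, the workaround looks something like this. The function name is hypothetical; into_string, IntoStringError::into_cstring and to_string_lossy are the existing std methods.

```rust
use std::ffi::CString;

/// Hypothetical wrapper around the existing-API workaround.
fn cstring_to_string_lossy_owned(s: CString) -> String {
    s.into_string().unwrap_or_else(|e| {
        // `IntoStringError::into_cstring` gives the original CString back,
        // so the lossy fallback copies only when the fast path failed.
        e.into_cstring().to_string_lossy().into_owned()
    })
}
```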

If I had to guess, the function's absence is a casualty of "don't consume ownership you aren't going to use" API design. A "better" consumer in that school of thought might instead do the conversion as something like

match s.to_str() {
    Ok(_) => s.into_string().unwrap(),
    Err(_) => s.to_string_lossy().into_owned(),
}

to avoid unnecessary ownership transfer... but since that leaves drop-flag-reliant, inaccessible state consuming stack space for no reason, it's not really an ideal thing to write, even ignoring the likely repeated UTF-8 check.

So tldr I'm +1 on this probably fitting right in.


As a bonus, into_string_lossy can be done in a single pass, but all the others require two passes, which can hurt if the non-Unicode data is at the end of the string. (That's not an inherent limitation, I guess, just a consequence of the shape of the existing API.) This is unlikely to matter for the lengths of strings that usually end up in CString and OsString, but even so.
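One way to picture the single-pass version, sketched on a plain Vec<u8>: validate once, and on failure resume the lossy decoding from the point where validation stopped instead of starting over. The function name is made up, and using unsafe to skip revalidating the prefix is my own choice for the sketch, not something std prescribes.

```rust
/// Hypothetical helper: lossy conversion that never rescans the valid prefix.
fn bytes_into_string_lossy(bytes: Vec<u8>) -> String {
    match String::from_utf8(bytes) {
        // Fully valid: one pass, and the buffer is reused.
        Ok(s) => s,
        Err(err) => {
            // Everything before this index has already been validated.
            let valid_up_to = err.utf8_error().valid_up_to();
            let mut bytes = err.into_bytes();
            // Move the unvalidated tail out; `bytes` keeps the valid prefix.
            let tail = bytes.split_off(valid_up_to);
            // SAFETY: `valid_up_to` is the length of a prefix that
            // `String::from_utf8` has just confirmed to be valid UTF-8.
            let mut out = unsafe { String::from_utf8_unchecked(bytes) };
            // Only the tail is scanned here, with U+FFFD replacements.
            out.push_str(&String::from_utf8_lossy(&tail));
            out
        }
    }
}
```

Each byte is validated only once this way, although the tail is still copied into a temporary buffer and the push may reallocate.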


This would also be beneficial for code size, since the multi-step chains repeat the allocation and data copying code.