PreRFC: trait converting functions for OsString

Given that env::var_os produces a std::ffi::OsString (as it should), it would be nice to be able to write functions that rely on an implementation of TryFrom<OsString>.

Now, for most types this would be best done by hand, but I think a simpler stopgap could be to have impl TryFrom<OsString> for String, since, if I remember correctly, from and try_from can be executed several times.

TryFrom::try_from takes the argument by value, so that exact impl would be only usable once per OsString. That said, it also matches the signature of OsString::into_string, so is a sensible impl.

impl TryFrom<OsString> for String {
    type Error = OsString;
    fn try_from(value: OsString) -> Result<String, OsString> {
        value.into_string()
    }
}
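
A quick sketch of the behavior that impl would delegate to: into_string() returns the String on success, or hands the OsString back as the error on failure.

```rust
use std::ffi::OsString;

fn main() {
    // Valid Unicode converts cleanly; an invalid OsString would
    // come back as the Err value, matching the proposed impl.
    let ok = OsString::from("hello").into_string();
    assert_eq!(ok, Ok(String::from("hello")));
}
```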

Makes sense, this probably can be just a PR.

However, it does seem weird that OsString cannot be converted to/from String directly (or by checking to see if it is already valid). Was that intentional?

String is not platform-specific, whereas OsString inherently is. Perhaps that is the reason.

Yes, I guess there was a desire not to have to implement the conversion per platform (for some definition of platform).

That's what into_string() does. On Unix, an OsString can be an arbitrary sequence of bytes, which is not necessarily valid UTF-8. On Windows, an OsString is a WTF-8 encoding of an arbitrary sequence of 16-bit integers, which also may not be valid UTF-8.

That isn't what I get from either the documentation or the method signature. My reading is that into_string() merely checks that it is a valid std::string::String and nothing more; it doesn't do any translation between WTF-8 and UTF-8, for instance.

WTF-8 is a strict superset of UTF-8, so with valid Unicode there's no conversion necessary. When calling Windows APIs, there is a conversion to u16 integers; any string results from the API are then immediately converted to WTF-8.

I didn't say it did any translation? I'm explaining why "However, it does seem weird that OsString cannot be converted to/from String directly" isn't possible: String is a strictly more restrictive type than OsString.

That is something that I did not know. However, the docs of std::ffi::OsString say that on Windows it is composed of 16-bit values that may be interpreted as UTF-16 where applicable.

However, from this Stack Overflow post, it seems to be the case that UTF-16 is not a superset of UTF-8.

This is the main reason for my confusion, since it seems that if the UTF-16 data is not UTF-8, then there is no standard way to convert it to a UTF-8 string.

To be clear there are four different encodings here:

  • Windows Strings
  • UTF-16
  • UTF-8
  • WTF-8

UTF-16 and UTF-8 are different encodings of the same Unicode standard, so you can always convert between UTF-16 and UTF-8 without losing anything.

Windows strings are not Unicode. They are an array of u16 integers which may or may not be valid UTF-16.

WTF-8 is not Unicode. It's an array of u8 integers that may or may not be valid UTF-8.
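
A small illustration of that lossless round trip, using only the standard library:

```rust
fn main() {
    // Recode UTF-8 -> UTF-16 -> UTF-8; nothing is lost.
    let original = "héllo 🦀";
    let utf16: Vec<u16> = original.encode_utf16().collect();
    let roundtripped = String::from_utf16(&utf16).unwrap();
    assert_eq!(roundtripped, original);
}
```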
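
To make the Windows-string case concrete: an unpaired surrogate is a perfectly legal u16 sequence as far as Windows is concerned, but it is not valid UTF-16, so strict decoding fails.

```rust
fn main() {
    // 0xD800 is an unpaired high surrogate: legal in a Windows
    // string, but not valid UTF-16, so decoding to String fails.
    let not_utf16: Vec<u16> = vec![0x0068, 0xD800, 0x0069];
    assert!(String::from_utf16(&not_utf16).is_err());
    // Lossy decoding substitutes U+FFFD for the bad unit.
    assert_eq!(String::from_utf16_lossy(&not_utf16), "h\u{FFFD}i");
}
```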


Yes, that is what my understanding is. My quandary is that there doesn't seem to be a way to convert a Windows string (as an OsString) to a valid UTF-8 Rust String, even if it is "assume UTF-16, then convert".

There is no way to interpret a UTF-16 string as a UTF-8 string, because the bit patterns in memory are different. However, every UTF-16 string can be recoded as a UTF-8 string, and vice versa.

That is exactly what OsString::into_string is. If the internal WTF-8 happens to be UTF-8, then you get a String back. If the internal WTF-8 is not valid UTF-8 (which means the original Windows string contained invalid UTF-16), then you get the OsString back. At that point, your only recourse is to lossily decode the OS string or otherwise convert the internal WTF-8 back to the corresponding 16-bit integers and do what you need there.
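
To make the failure case and the lossy recourse concrete, here is a Unix-only sketch; the Windows equivalent would build the OsString from invalid UTF-16 via OsStringExt::from_wide instead.

```rust
use std::ffi::OsString;
// Unix-only: lets us build an OsString from arbitrary bytes.
use std::os::unix::ffi::OsStringExt;

fn main() {
    // 0xFF is not valid UTF-8, so into_string() fails and
    // returns the original OsString as the Err value.
    let os = OsString::from_vec(vec![b'f', b'o', 0xFF]);
    let err = os.into_string().unwrap_err();
    // The lossy recourse replaces the bad byte with U+FFFD.
    assert_eq!(err.to_string_lossy(), "fo\u{FFFD}");
}
```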

I think there might be some confusion here, so if this doesn't clear things up, then I'd recommend posting a concrete example that shows the conversion you want.


Here are some wrapper functions for each conversion between these types.

// Assumes strings are valid Unicode; the unwraps below panic otherwise.

use std::ffi::OsString;
// These extension traits only exist on Windows.
use std::os::windows::ffi::{OsStringExt, OsStrExt};

fn rust_to_windows(s: String) -> Vec<u16> {
    s.encode_utf16().collect()
}
fn rust_to_osstring(s: String) -> OsString {
    s.into()
}

fn osstring_to_rust(s: OsString) -> String {
    s.into_string().unwrap()
}
fn osstring_to_windows(s: OsString) -> Vec<u16> {
    s.encode_wide().collect()
}

fn windows_to_osstring(s: Vec<u16>) -> OsString {
    OsString::from_wide(&s)
}
fn windows_to_rust(s: Vec<u16>) -> String {
    String::from_utf16(&s).unwrap()
}

Notice that the wrapper functions are simply calling built-in Rust functions.

Yes, I know; that is why I was talking about "converting", not "interpreting".

I agree that there was confusion. And this reply has cleared most of it, so thank you.

My last question is that https://simonsapin.github.io/wtf-8/ seems to imply that if a string is UTF-16 compliant, it may not be WTF-8...

I think you are mixing up what "encoding" means. UTF-8, WTF-8 and UTF-16 are all encodings. UTF-8 and UTF-16 are ways of representing Unicode code points in a raw byte representation. UTF-8 uses between 1 and 4 bytes for each codepoint. UTF-16 uses either 2 bytes or 4 bytes for each codepoint.
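
Those size differences are easy to observe directly; for example, a 2-byte UTF-8 character fits in one UTF-16 unit, while an emoji takes 4 bytes in UTF-8 and two units in UTF-16.

```rust
fn main() {
    // 'é' (U+00E9): 2 bytes in UTF-8, one 16-bit unit in UTF-16.
    assert_eq!("é".len(), 2);
    assert_eq!("é".encode_utf16().count(), 1);
    // '🦀' (U+1F980): 4 bytes in UTF-8, a surrogate pair in UTF-16.
    assert_eq!("🦀".len(), 4);
    assert_eq!("🦀".encode_utf16().count(), 2);
}
```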

UTF-8 cannot represent all file paths on UNIX because file paths are an arbitrary sequence of bytes (without interior NUL bytes). Similarly, UTF-16 cannot represent all file paths on Windows because file paths are an arbitrary sequence of 16-bit integers.

WTF-8 was invented to bridge the gap such that file paths on Windows can be non-lossily roundtripped between Rust's OsString and PathBuf types. WTF-8 was chosen so that Rust String types could be used in a zero cost fashion to manipulate file paths. (e.g., file_path.join("foo").)

When a PathBuf is used on Windows to interact with the operating system, it is converted to an OsString and then transcoded from its internal WTF-8 encoding (a strict superset of UTF-8) to a sequence of 16-bit integers that Windows expects.

When you have a PathBuf or an OsString in Rust in memory, its internal representation in memory uses WTF-8. When you actually go to use it for anything outside of Rust, it is first transcoded to 16-bit integers. Since WTF-8 is a strict superset of UTF-8 and can otherwise encode all possible sequences of 16-bit integers, it follows that 1) OsStrings can be manipulated in a zero cost fashion with Rust's guaranteed valid UTF-8 encoded String types and that 2) all possible file paths on Windows can be correctly represented and roundtripped by Rust. The cost is that, at least on Windows, all file paths must be transcoded to and from WTF-8 and its sequence of arbitrary 16-bit integers.

If a file path on Windows contains valid UTF-16, then its corresponding WTF-8 representation is guaranteed to be valid UTF-8. The WTF-8 representation is only invalid UTF-8 in precisely the case where the file path on Windows is not valid UTF-16. In which case, you get a WTF-8 encoded string that cannot be converted to UTF-8 without error or replacement or omission.
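
A portable way to see why WTF-8 is needed at all: surrogate code points are not Unicode scalar values, so UTF-8 cannot encode an unpaired one, while a properly paired surrogate decodes fine from UTF-16.

```rust
fn main() {
    // Surrogates (0xD800..=0xDFFF) are not Unicode scalar values,
    // so UTF-8 cannot encode them; WTF-8 exists to carry unpaired
    // surrogates from invalid UTF-16 into a byte encoding anyway.
    assert!(char::from_u32(0xD800).is_none());
    // A correctly paired surrogate decodes fine from UTF-16:
    let crab: Vec<u16> = vec![0xD83E, 0xDD80]; // U+1F980
    assert_eq!(String::from_utf16(&crab).unwrap(), "🦀");
}
```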


Thanks for the clarification. I think the lack of knowledge of std::os::windows::ffi was also contributing to my misunderstanding.