Given that env::var_os produces a std::ffi::OsString (as it should), it would be nice to be able to write functions that rely on an implementation of TryFrom<OsString>.
Now, for most types this would be best done by hand, but I think a simpler stopgap could be to have impl TryFrom<OsString> for String, since, if I remember correctly, from and try_from can be executed several times.
TryFrom::try_from takes the argument by value, so that exact impl would only be usable once per OsString. That said, it also matches the signature of OsString::into_string, so it is a sensible impl:
impl TryFrom<OsString> for String {
    type Error = OsString;

    fn try_from(value: OsString) -> Result<String, OsString> {
        value.into_string()
    }
}
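If that impl existed, generic code bounded on TryFrom<OsString> would then work for String as well. A hypothetical sketch (get_env is a made-up helper, and the commented call only compiles with the proposed impl in place):

use std::env;
use std::ffi::OsString;

// Hypothetical helper: reads an environment variable and converts it
// to any type that can be fallibly built from an OsString.
fn get_env<T: TryFrom<OsString>>(key: &str) -> Option<T> {
    env::var_os(key).and_then(|value| T::try_from(value).ok())
}

// With the proposed impl, T could be inferred as String:
// let home: Option<String> = get_env("HOME");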
However, it does seem weird that OsString cannot be converted to/from String directly (or by checking to see if it is already valid). Was that intentional?
That's what into_string() does. On Unix, an OsString can be an arbitrary sequence of bytes, which is not necessarily valid UTF-8. On Windows, an OsString is a WTF-8 encoding of an arbitrary sequence of 16-bit integers, which also may not be valid UTF-8.
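For example, on Unix (a quick sketch; from_vec is the Unix-only OsStringExt constructor, so this compiles only on Unix):

use std::ffi::OsString;
use std::os::unix::ffi::OsStringExt;

fn main() {
    // Valid UTF-8 converts cleanly.
    let ok = OsString::from("hello");
    assert_eq!(ok.into_string().unwrap(), "hello");

    // 0x80 is a legal byte in a Unix OsString but not valid UTF-8,
    // so into_string hands the original OsString back as the error.
    let bad = OsString::from_vec(vec![b'f', b'o', 0x80]);
    assert!(bad.into_string().is_err());
}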
That isn't what I get from either the documentation or the method signature. I get that into_string() merely checks that the contents are a valid std::string::String and nothing more. It doesn't do any translation between WTF-8 and UTF-8, for instance.
WTF-8 is a strict superset of UTF-8. With valid Unicode, there's no conversion necessary. When calling Windows APIs, there is a conversion to u16 integers. Any string results from the API are then immediately converted to WTF-8.
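Concretely, that conversion is observable through the Windows-only extension traits (a sketch; compiles only on Windows):

use std::ffi::OsString;
use std::os::windows::ffi::{OsStrExt, OsStringExt};

fn main() {
    let os = OsString::from("hi");
    // Transcode the internal WTF-8 to the u16 sequence Windows expects...
    let wide: Vec<u16> = os.encode_wide().collect();
    // ...and back, as happens with strings returned from Windows APIs.
    let round_trip = OsString::from_wide(&wide);
    assert_eq!(os, round_trip);
}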
I didn't say it did any translation? I'm explaining why "However, it does seem weird that OsString cannot be converted to/from String directly" isn't possible: String is a strictly more restrictive type than OsString.
That is something that I did not know. However, the docs of std::ffi::OsString say that on Windows it is composed of 16-bit values that may be interpreted as UTF-16 where applicable.
However, from this Stack Overflow post, it seems to be the case that UTF-16 is not a superset of UTF-8.
This is the main reason for my confusion, since it seems that if the UTF-16 data is not UTF-8 then there is no standard way to convert it to a UTF-8 string.
Yes, that is my understanding as well. My quandary is that there doesn't seem to be a way to convert a Windows string (as an OsString) to a valid UTF-8 Rust string, even if it is "assume UTF-16, then convert".
There is no way to interpret a UTF-16 string as a UTF-8 string, because the bit patterns in memory are different. However, every UTF-16 string can be recoded as a UTF-8 string, and vice versa.
I think there might be some confusion here, so if this doesn't clear things up, then I'd recommend posting a concrete example that shows the conversion you want.
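For instance, recoding (not reinterpreting) round-trips cleanly between the two encodings (a minimal sketch):

fn main() {
    let s = "héllo"; // Rust string literals are UTF-8
    let utf16: Vec<u16> = s.encode_utf16().collect();
    // from_utf16 fails only on unpaired surrogates, so this round-trips.
    let back = String::from_utf16(&utf16).unwrap();
    assert_eq!(s, back);
    // Same text, different bit patterns: 6 bytes in UTF-8, 10 in UTF-16.
    assert_eq!(s.len(), 6);
    assert_eq!(utf16.len() * 2, 10);
}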
I think you are mixing up what "encoding" means. UTF-8, WTF-8 and UTF-16 are all encodings. UTF-8 and UTF-16 are ways of representing Unicode code points as raw bytes. UTF-8 uses between 1 and 4 bytes for each codepoint. UTF-16 uses either 2 bytes or 4 bytes for each codepoint.
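You can see the per-codepoint sizes directly (a small sketch using char::len_utf8 and char::len_utf16):

fn main() {
    for ch in ['A', 'é', '€', '😀'] {
        // len_utf8 is 1-4 bytes; len_utf16 counts 16-bit code units (1 or 2).
        println!("{ch}: {} UTF-8 bytes, {} UTF-16 bytes",
                 ch.len_utf8(), ch.len_utf16() * 2);
    }
}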
UTF-8 cannot represent all file paths on Unix because file paths are an arbitrary sequence of bytes (without interior NUL bytes). Similarly, UTF-16 cannot represent all file paths on Windows because file paths are an arbitrary sequence of 16-bit integers.
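As a concrete Windows example (a sketch using the Windows-only from_wide constructor; compiles only on Windows):

use std::ffi::OsString;
use std::os::windows::ffi::OsStringExt;

fn main() {
    // 0xD800 is an unpaired surrogate: a legal 16-bit value in a Windows
    // file name, but not valid UTF-16, so there is no UTF-8 equivalent.
    let os = OsString::from_wide(&[0x66, 0xD800, 0x6F]);
    assert!(os.to_str().is_none());
}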
WTF-8 was invented to bridge the gap such that file paths on Windows can be non-lossily roundtripped between Rust's OsString and PathBuf types. WTF-8 was chosen so that Rust String types could be used in a zero-cost fashion to manipulate file paths (e.g., file_path.join("foo")).
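That zero-cost property is what lets a &str participate directly in path operations (a minimal sketch):

use std::path::PathBuf;

fn main() {
    let file_path = PathBuf::from(r"C:\data");
    // join accepts a &str directly: since WTF-8 is a superset of UTF-8,
    // the UTF-8 bytes are spliced in without any transcoding.
    let joined = file_path.join("foo");
    println!("{}", joined.display());
}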
When a PathBuf is used on Windows to interact with the operating system, it is converted to an OsString and then transcoded from its internal WTF-8 encoding (a strict superset of UTF-8) to a sequence of 16-bit integers that Windows expects.
When you have a PathBuf or an OsString in Rust, its in-memory representation uses WTF-8. When you actually go to use it for anything outside of Rust, it is first transcoded to 16-bit integers. Since WTF-8 is a strict superset of UTF-8 and can otherwise encode all possible sequences of 16-bit integers, it follows that 1) OsStrings can be manipulated in a zero-cost fashion with Rust's guaranteed-valid-UTF-8 String type, and 2) all possible file paths on Windows can be correctly represented and roundtripped by Rust. The cost is that, at least on Windows, all file paths must be transcoded between WTF-8 and the arbitrary sequence of 16-bit integers that Windows uses.
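Point 1 is visible in safe code today (a small sketch):

use std::ffi::OsString;

fn main() {
    let mut os = OsString::from("report");
    // Appending UTF-8 text goes straight into the WTF-8 buffer;
    // no transcoding happens until the OS boundary is crossed.
    os.push(".txt");
    assert_eq!(os, OsString::from("report.txt"));
}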
If a file path on Windows contains valid UTF-16, then its corresponding WTF-8 representation is guaranteed to be valid UTF-8. The WTF-8 representation is invalid UTF-8 precisely when the file path on Windows is not valid UTF-16, in which case you get a WTF-8 encoded string that cannot be converted to UTF-8 without an error, replacement, or omission.
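That last case is exactly what the error and replacement paths look like in practice (a sketch; Windows-only because of from_wide):

use std::ffi::OsString;
use std::os::windows::ffi::OsStringExt;

fn main() {
    // Unpaired surrogate: representable in WTF-8, but not valid UTF-16,
    // so its WTF-8 bytes are not valid UTF-8 either.
    let os = OsString::from_wide(&[0x68, 0x69, 0xD800]);
    assert!(os.clone().into_string().is_err()); // conversion errors out...
    assert!(os.to_string_lossy().contains('\u{FFFD}')); // ...or replaces
}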