OsStr, WTF8, as_bytes, and to_string_unchecked

I was hoping for an unsafe fn(&OsStr) -> &str API (with the safety invariant that the OsStr contain valid UTF-8), and it came up in discussion on Twitter with kennytm that part of the reason this doesn't exist is that it has been argued it would expose the underlying WTF-8 encoding of OsStr, which is meant to be an implementation detail. This is also why you can't have OsStr::as_bytes in a cross-platform way (only on Unix).

I have to admit I don't see the reasoning in this, because we already have OsStr::to_str(&self) -> Option<&str>. This method requires that, for UTF-8-compatible data, the OsStr already be in UTF-8 encoding; therefore it requires that the format of OsStr on all platforms be a superset of UTF-8. It seems to me that being a superset of UTF-8 is the pertinent fact about WTF-8 that we might want to keep private (that it is that, instead of a superset of UTF-16 like UCS-2).

So, my belief is that we already expose the only thing we would really care about hiding about WTF-8. I also note that AFAIK, we've been using WTF-8 unchanged since before 1.0 and have never had any desire to change the encoding.

So I'd propose accepting that WTF-8 is our representation on Windows, adding methods to OsStr and OsString to convert them to their byte representation, and adding methods to unsafely convert them to strings without checking their UTF-8 well-formedness. (I think the latter should be done regardless of whether we add access to the byte representation, because the invariant for safely using that API is that you don't convert non-UTF-8 OsStrs.)
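To make the proposal concrete, here's a rough sketch of what those additions could look like. The trait and method names (OsStrRaw, raw_bytes, to_str_unchecked) are invented for illustration, and the impl shown is Unix-only, delegating to the existing platform-specific as_bytes; none of this is an actual std API.

```rust
use std::ffi::OsStr;

// Hypothetical sketch of the proposed additions; names are illustrative.
trait OsStrRaw {
    /// The raw byte representation: WTF-8 on Windows, arbitrary bytes on Unix.
    fn raw_bytes(&self) -> &[u8];
    /// Safety: the OsStr must hold valid UTF-8.
    unsafe fn to_str_unchecked(&self) -> &str;
}

#[cfg(unix)]
impl OsStrRaw for OsStr {
    fn raw_bytes(&self) -> &[u8] {
        use std::os::unix::ffi::OsStrExt;
        self.as_bytes()
    }
    unsafe fn to_str_unchecked(&self) -> &str {
        unsafe { std::str::from_utf8_unchecked(self.raw_bytes()) }
    }
}

#[cfg(unix)]
fn main() {
    let s = OsStr::new("hello");
    assert_eq!(s.raw_bytes(), &b"hello"[..]);
    // Sound only because "hello" is known to be valid UTF-8.
    assert_eq!(unsafe { s.to_str_unchecked() }, "hello");
}
```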

I guess one concern is that as_bytes would return a different value on different platforms. One option would be to add a Windows version of OsStrExt containing as_bytes, like the Unix version does. But this would be far from the only std API with platform-divergent behavior (e.g. the native-endianness methods, the specific error codes returned by IO operations, etc.).


I think that we have, but I don't think the RFC has been implemented yet? https://github.com/rust-lang/rfcs/pull/2295 There is also some pertinent discussion in that RFC thread between myself and @kennytm.

I kind of like this and personally have been weakly in favor of exposing WTF-8 for many years now. Although, @SimonSapin has brought up the important point that if I had gotten my way when I wanted it, then the aforementioned RFC might not have been possible. (It's been a while since I've read the RFC and I've forgotten the particulars at this point.)

In particular, I'd very much want the raw WTF-8 for regex and glob matching. Windows pays a performance penalty for this today.

See also:

As far as I understand it, the main arguments against exposing the WTF-8 representation are:

  1. It increases the chances that WTF-8 gets used somewhere in some persistent format. For example, if WTF-8 were exposed, it would be awfully convenient to rely on it in a Serde Serialize implementation. (It's not completely obvious to me that this is a bad thing in and of itself, but I can see how WTF-8's scope being increased would be A Bad Thing.)
  2. It removes some flexibility for the internal implementation.

Related problem that would also benefit from exposing OsStr bytes on Windows:

I'm not sure this would actually solve this problem. If you're trying to pass paths to C functions in Windows, you should be using WideString, not CString (and you should be using C functions that accept it). If creating a CString was easier, it would likely lead to more incorrect use.

I know it doesn't solve the problem on Windows, because that problem is fundamentally impossible to solve on Windows, and solving it there was not the goal. The problem is that the non-portability of .as_bytes() causes problems on Unix. The lack of an easy conversion to CString makes authors use the lazy workaround of .to_str().unwrap() instead, and that's incorrect on all platforms.

Making as_bytes() available on all platforms would keep Windows as broken as always (not any worse though), but would fix Path to CString conversion on non-Windows platforms.
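As a sketch of what that fix looks like on the Unix side today: the byte-level conversion goes through os::unix::ffi::OsStrExt, and succeeds on paths that the lossy .to_str().unwrap() shortcut rejects. (The function name path_to_cstring is invented for illustration; this is Unix-only code, which is exactly the portability problem being discussed.)

```rust
use std::ffi::CString;
use std::path::Path;

// Lossless Unix-side Path -> CString conversion, via the raw bytes.
#[cfg(unix)]
fn path_to_cstring(path: &Path) -> Option<CString> {
    use std::os::unix::ffi::OsStrExt;
    CString::new(path.as_os_str().as_bytes()).ok()
}

#[cfg(unix)]
fn main() {
    use std::ffi::OsStr;
    use std::os::unix::ffi::OsStrExt;
    // A path containing the non-UTF-8 byte 0xFF: legal on Unix filesystems.
    let path = Path::new(OsStr::from_bytes(b"/tmp/\xff"));
    assert!(path.to_str().is_none());         // the .to_str().unwrap() shortcut would panic here
    assert!(path_to_cstring(path).is_some()); // the byte-level conversion works
}
```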

I think there's more to it than being an implementation detail in Rust. The WTF-8 standard explicitly states it shouldn't be exposed to the system. The current situation makes it hard to accidentally violate those constraints; I would be concerned that this API would make it easier to make that mistake.

WTF-8 is a hack intended to be used internally in self-contained systems with components that need to support potentially ill-formed UTF-16 for legacy reasons.

Any WTF-8 data must be converted to a Unicode encoding at the system’s boundary before being emitted. UTF-8 is recommended. WTF-8 must not be used to represent text in a file format or for transmission over the Internet.

Since the WTF-8 spec is an unofficial spec maintained by a libs team member, developed as part of designing the OsString abstraction, it seems a bit far-fetched to treat those statements as meaningfully normative. They essentially encode Rust's current policy (that our OsString's representation on Windows is formally unspecified).


Do you have any examples of code that has run into this issue? It would be interesting to see what the code ends up doing with those CStrings.

  • On Windows, are they passed to APIs that expect UTF-8, or APIs that expect legacy narrow strings?
  • What would be the practical consequences if the code were to pass WTF-8 that is not valid UTF-8?

Well, I think it's meaningful depending on the nature of the change being proposed; WTF-8 does have users other than Rust, so I would want to be at least a little cautious about changing the spec. If the format is just being specified on the Rust side, I think you're right.


Yes, rustc for example fails to pass exact bytes to LLVM on Unix: PathBuf to CString


I don't have any desire to change the spec, though I think those clauses are a bit silly and outside of the spec's proper purview.

There seem to be two problems:

  1. libc expects paths to be representable as char* on all platforms. Projects that interop with libc in some way (e.g. by passing paths to LLVM) need to be able to perform the "best" conversion they can, despite this being inherently broken.

    I think the best option to solve this problem would be to have an extension trait in the libc crate for this conversion, which uses WideCharToMultiByte on Windows and as_bytes on Unix.

  2. Some operations (e.g. regex matching) don't care about the specific encoding being used, as long as the encoding meets some basic requirements.

    I see two approaches to solving this:

    1. Expose WTF-8 in a platform-specific way, and commit to that being our encoding on Windows. The regex crate would need to paper over the platform specifics.

    2. Define a set of properties all platform-specific encodings should follow, and then add a platform-agnostic method to expose the raw bytes without reference to the encoding. The regex crate could then operate over the raw bytes.
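The second approach can be sketched as follows. If every platform encoding is guaranteed to be a superset of UTF-8 (true for raw Unix bytes and for WTF-8), then a UTF-8 needle can be matched against the raw bytes without knowing the exact encoding. The function name os_str_contains is invented, and as_bytes here is a Unix-only stand-in for the hypothetical platform-agnostic raw-bytes accessor.

```rust
use std::ffi::OsStr;

// Encoding-agnostic substring search over the raw bytes of an OsStr,
// relying only on the "superset of UTF-8" property of the encoding.
#[cfg(unix)]
fn os_str_contains(haystack: &OsStr, needle: &str) -> bool {
    use std::os::unix::ffi::OsStrExt;
    let (hay, pat) = (haystack.as_bytes(), needle.as_bytes());
    // Naive byte-window search; a real implementation would use memmem.
    !pat.is_empty() && hay.windows(pat.len()).any(|w| w == pat)
}

#[cfg(unix)]
fn main() {
    let p = OsStr::new("/var/log/syslog");
    assert!(os_str_contains(p, "log"));
    assert!(!os_str_contains(p, "journal"));
}
```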


Running with this example, I took a quick look at how LLVM opens files on Windows. It doesn't use the native narrow-char APIs; instead, it uses MultiByteToWideChar to convert the path from UTF-8 to UTF-16 (hardcoding the encoding choices), then passes the UTF-16 string to CreateFileW. So, in this particular case, if I'm not misreading the code:

  • It's correct to pass UTF-8 to LLVM; paths containing non-ASCII characters do work correctly as long as they don't have unpaired surrogates. This would break if we instead converted to the active code page.
  • Unsurprisingly, paths with unpaired surrogates don't work and cannot be made to work.
  • If we passed such paths as WTF-8, I assume MultiByteToWideChar would fail, causing the open operation to fail. No harm, but no benefit either.

LLVM is just one example, but I've seen other C++ projects use a similar approach. And of course that's how Rust's libstd works, so any Rust crates that export a C API involving paths probably also expect UTF-8. On the other hand, I'm sure there are also libraries that pass paths to native narrow-char APIs and hence expect the active code page.

So we really need two "best possible conversion" APIs: one would use UTF-8 on Windows, while the other would use the active code page. Both would be no-ops on Unix (i.e. they would work even if the path isn't valid UTF-8).

That leaves the question of whether the UTF-8 variant, when handling unpaired surrogates on Windows, should give you WTF-8 or just fail. Both options seem sort of… equally bad.


For reference, how MultiByteToWideChar behaves depends on whether the MB_ERR_INVALID_CHARS flag was set. If it's not set, then invalid characters get replaced by the Unicode replacement character (U+FFFD). The worst-case scenario would be, for example, that the library you call opens a different file than the one you expect. This could potentially be part of a malicious exploit chain if the circumstances are right.

For what it's worth, I do want to emphasise how incredibly rare it is for the OS to give out unpaired surrogates. Sure, the kernel doesn't validate UTF-16 wide strings, but that string had to have been given to the kernel somehow. Either Windows handles the decoding/encoding of "narrow" strings (UTF-8 or otherwise), or the program knowingly does its own translation from input (keyboard strokes, files, network streams, hard-coded strings, etc.) to wide strings.

So there are very, very few ways to accidentally pass in an invalid string. And if it does ever happen, the user would likely want to know about the corruption so they can fix it (or figure out which malicious actor is responsible for it).


That sounds good to me.

They both could return Option or Result to handle unrepresentable characters. This way WTF-8 wouldn't need to be exposed to the world.

The UTF-8-or-Unix-bytes variant would be quite safe to use, even if people unconditionally called .unwrap() on it; it wouldn't be any worse than .to_str().unwrap().
