PathBuf to CString

Conversion from a Path to a C-compatible string is not easy in libstd.

Because the conversion is not well-defined on Windows, access to a path's raw bytes is hidden behind cfg(unix). However, that means that in user code the correct conversion code for Unix also has to be behind cfg(unix), and requires writing a fallback implementation for other platforms. That is a lot to ask for such a minor detail, so people end up writing CString::new(path.to_str().unwrap()), which is the worst option of all.

I suggest adding TryFrom which preserves bytes on Unix, and falls back to UTF-8 on Windows.

On Windows there's no right way to convert a Unicode path to bytes. ANSI code pages could be useful in some cases, but they're a lossy legacy encoding and make the conversion environment-specific. OTOH sticking to UTF-8 wouldn't be worse than the current practice of .to_str().unwrap().
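
To illustrate, a conversion along those lines could look roughly like the sketch below (path_to_cstring is a made-up name for illustration, not a proposed std API):

use std::ffi::CString;
use std::path::Path;

// Minimal sketch of the suggested behaviour: preserve the raw bytes on Unix,
// fall back to requiring UTF-8 elsewhere.
fn path_to_cstring(path: &Path) -> Option<CString> {
    #[cfg(unix)]
    {
        use std::os::unix::ffi::OsStrExt;
        // On Unix a path is an arbitrary byte sequence; keep it as-is.
        CString::new(path.as_os_str().as_bytes()).ok()
    }
    #[cfg(not(unix))]
    {
        // No canonical byte encoding exists here; accept only valid Unicode
        // and encode it as UTF-8.
        CString::new(path.to_str()?).ok()
    }
}

A TryFrom<&Path> for CString impl could wrap the same logic, with the error type distinguishing "contains an interior NUL" from "not representable".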

13 Likes

The lack of conversion from Path to CString is really inconvenient, so I would also like to see a solution in std or in a well-established crate. However, I don't think defaulting to UTF-8 on Windows is a good solution. This is a problematic conversion, so it's better to force users to make an explicit decision about the approach.

Ideally, when passing a path to a library with a C interface, you need to examine the library's documentation or source code to figure out which encoding it expects to receive. For example, some libraries explicitly declare that input must be UTF-8, and some libraries need the path to be converted to the current ANSI code page or OEM code page because they just call fopen with it. I think it makes sense to have separate functions for these cases (and maybe there are other common cases). It seems that the Windows API provides functions for this kind of conversion, so hopefully it can be done without dealing with all possible encodings directly.
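
To sketch what such explicit, per-encoding conversions might look like (all of the names below are hypothetical; none of them exist in std today):

use std::ffi::CString;

// Hypothetical extension trait for Path: each method names the encoding it
// produces, so the caller has to make an explicit choice.
trait PathEncodeExt {
    /// Encode as UTF-8; fails if the path is not valid Unicode.
    fn to_cstring_utf8(&self) -> Option<CString>;

    /// Windows only: encode with the active ANSI code page (what a plain
    /// fopen call would expect); fails if the conversion would be lossy.
    #[cfg(windows)]
    fn to_cstring_ansi(&self) -> Option<CString>;

    /// Windows only: encode with the OEM code page.
    #[cfg(windows)]
    fn to_cstring_oem(&self) -> Option<CString>;
}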

3 Likes

If there was no solution to this in libstd, then there would be some motivation to use an external crate. But .to_str().unwrap() is an easy-but-incorrect way out, which makes it harder to convince people to use a crate for this.

It would be nice if Clippy could detect the CString::new(path.to_str().unwrap()) antipattern, but AFAIK Rust has a policy of not picking favorite crates, so Clippy couldn't suggest a fix unless the replacement lives in the standard library.

If TryInto shouldn't be that opinionated about encoding, how about adding a method on Path that's explicit about encodings? e.g. path.to_bytes_which_may_or_may_not_be_utf8()

2 Likes

I had exactly the OP's question when writing some basic bindings for a C library. I don't know anything about Windows (and don't care to invest more than minimal effort in it tbh).

What should I do for the #[cfg(windows)] case in a Rust function that takes a Path and needs to call a C function that takes a char* filename? Let's say for simplicity that the C function (eventually) just calls fopen() with that char* value.

One possibility for Windows is to convert the path to its short path form (aka the 8.3 filename) if possible. I believe the short path can always be encoded in the current code page, and therefore has a guaranteed CString representation that should work for all reasonably implemented C code (read: I haven’t personally encountered anything that doesn’t).

Note that the short path is not always available (it can be disabled by the user), so a fallback is still needed. In which case there is really no right answer; a C API that takes a path as char * on Windows is broken by design, and it’s impossible to tell what encoding is correct without reading the actual implementation. UTF-8 is as good a guess as any.
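
For illustration, querying the short path could look roughly like this (GetShortPathNameW is a real Win32 function; the helper name and error handling are just a sketch, and the result still needs a separate conversion to bytes, e.g. via the ANSI code page):

#[cfg(windows)]
fn to_short_path_utf16(path: &std::path::Path) -> Option<Vec<u16>> {
    use std::os::windows::ffi::OsStrExt;
    use std::ptr;

    // Raw binding for the sketch; real code would use the winapi or
    // windows-sys crate instead.
    #[link(name = "kernel32")]
    extern "system" {
        fn GetShortPathNameW(long: *const u16, short: *mut u16, cch: u32) -> u32;
    }

    // Win32 wide-string APIs expect a NUL-terminated buffer.
    let wide: Vec<u16> = path.as_os_str().encode_wide().chain(Some(0)).collect();
    unsafe {
        // First call with an empty buffer to learn the required length
        // (including the trailing NUL).
        let needed = GetShortPathNameW(wide.as_ptr(), ptr::null_mut(), 0);
        if needed == 0 {
            return None; // no short name available (e.g. disabled on this volume)
        }
        let mut buf = vec![0u16; needed as usize];
        let written = GetShortPathNameW(wide.as_ptr(), buf.as_mut_ptr(), needed);
        if written == 0 || written >= needed {
            return None;
        }
        buf.truncate(written as usize); // drop the trailing NUL
        Some(buf)
    }
}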

Could this be solved by creating a new community-maintained Clippy lint that would suggest non-std crates, while not being "official"?

Even on Windows, it would sometimes be nice to get the underlying (WTF-8) bytes, if only to interop with external code that uses WTF-8 instead of UCS-2 encoding. At the moment the only way to do that is to use encode_wide and then convert that back to WTF-8, which is quite wasteful.

Currently the underlying WTF-8 format is just an implementation detail and isn't exposed through any public interface. WTF-8 is not a standardised encoding but a well-known hack for round-tripping potentially ill-formed UTF-16. Should we, the Rust language, define and maintain a sort-of text encoding named WTF-8?

WTF-8 does have a specification, at least.

Caveat: The specification requires that:

Any WTF-8 data must be converted to a Unicode encoding at the system’s boundary before being emitted. UTF-8 is recommended. WTF-8 must not be used to represent text in a file format or for transmission over the Internet.

However, it's not clear how this is possible in general, because WTF-8 encodes a superset of Unicode: unpaired surrogates have no valid representation in UTF-8 or any other Unicode encoding.

There are some cases that would benefit from this even if they don't send the WTF-8 bytes to external code. For example, if you use serde for IPC, then sending a Path or OsStr between processes is currently much less efficient on Windows than on non-Windows platforms.

3 Likes

WTF-8 is supposed to be an internal representation, not an encoding for interchange. I don't think it's commonly used, so libraries expecting char * are very unlikely to make proper use of it. Windows itself certainly can't. That makes exposing it more of a new feature for Rust than compatibility with C, so I'd say it's out of scope for CString.

1 Like

What about simply failing on Windows? If there is no correct behavior, that would ensure that code using TryFrom works where the result is well defined.

Yeah, I suppose failing on any non-ASCII chars would be the most correct (I hope we can ignore EBCDIC :))

I shall add that this problem exists in rustc itself. LLVM takes char* for paths, and rustc uses things like:

CString::new(format!("{}", path.display())).unwrap()
CString::new(path_buf.to_string_lossy().as_bytes()).unwrap()

which is incorrect on all platforms. It's unfixable on Windows, but it's unnecessarily broken on Unix.

2 Likes

The correct behavior on Windows is probably to convert the (wide) OsString to a multibyte string using WideCharToMultiByte with the active code page and check that lpUsedDefaultChar was not set. But this really depends on the C library. Using char* on Windows to refer to paths is not really portable. See notes regarding this in Microsoft's fopen documentation.
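
For illustration, the check described above could look roughly like the sketch below (the raw extern declaration stands in for the winapi or windows-sys bindings, and the helper name is made up):

#[cfg(windows)]
fn os_str_to_ansi_cstring(s: &std::ffi::OsStr) -> Option<std::ffi::CString> {
    use std::ffi::CString;
    use std::os::windows::ffi::OsStrExt;
    use std::ptr;

    #[link(name = "kernel32")]
    extern "system" {
        fn WideCharToMultiByte(
            code_page: u32,
            flags: u32,
            wide_str: *const u16,
            wide_len: i32,
            multi_byte_str: *mut u8,
            multi_byte_len: i32,
            default_char: *const u8,
            used_default_char: *mut i32,
        ) -> i32;
    }

    const CP_ACP: u32 = 0; // the process's active ANSI code page

    let wide: Vec<u16> = s.encode_wide().collect();
    if wide.is_empty() {
        return CString::new("").ok();
    }
    unsafe {
        // First call: query the required buffer size in bytes.
        let len = WideCharToMultiByte(
            CP_ACP, 0, wide.as_ptr(), wide.len() as i32,
            ptr::null_mut(), 0, ptr::null(), ptr::null_mut(),
        );
        if len <= 0 {
            return None;
        }
        let mut buf = vec![0u8; len as usize];
        let mut used_default = 0i32;
        let written = WideCharToMultiByte(
            CP_ACP, 0, wide.as_ptr(), wide.len() as i32,
            buf.as_mut_ptr(), len, ptr::null(), &mut used_default,
        );
        if written <= 0 || used_default != 0 {
            // Conversion failed or was lossy: refuse rather than mangle the path.
            return None;
        }
        CString::new(buf).ok()
    }
}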

Microsoft is working on the portability issues: recent versions of Windows 10 support a flag that makes the system ANSI APIs consume and emit UTF-8.

https://docs.microsoft.com/en-us/windows/uwp/design/globalizing/use-utf8-code-page

The page says this will be the default going forward for new APIs as well, though of course support for that in third-party libraries will be spotty, and it doesn't help older OS releases, so care will still have to be taken.

1 Like

Has to start somewhere; perhaps one day, when the minimum supported version of Windows is at least that version of Windows 10, we can just support UTF-8 for system APIs, and stop dealing with wide characters entirely except for conversions for FFI.

1 Like

It's interesting that all those CString::new calls also have .unwrap in them, which I think can never fail (as a path can't have internal null bytes).

Should we add

// a proposed addition to std::os::unix::ffi::OsStrExt
pub trait OsStrExt {
    fn to_cstring(&self) -> CString;
}

?

This obviously doesn't solve the cross-platform issue, but I am somewhat skeptical about that anyway: it seems that Windows "OS strings" are composed of wide chars, and C strings of chars, and those seem fundamentally incompatible.

On Windows, OsString internally stores the data as WTF-8. This is UTF-8 compatible except for also representing unmatched surrogates. CString allows any byte sequence that doesn't contain a null, so storing WTF-8 in a CString is technically valid.

I mean, it doesn't matter what internal format we use. The Windows API uses wchar_t *, the C API uses char *; they are just not interconvertible.

Wait, but Rust Paths can be. They are essentially just a bunch of bytes, no? Interpreting them as anything else depends on the particular function being called and the platform it is being called on.

1 Like