PathBuf to CString

Conversion from a Path to a C-compatible string is not easy in libstd.

Because the conversion is not well-defined on Windows, access to a path's raw bytes is hidden behind cfg(unix). However, that means that in user code the correct conversion code for Unix also has to be behind cfg(unix), and requires writing a fallback implementation for other platforms. That is a lot to ask for such a minor detail, so people end up writing CString::new(path.to_str().unwrap()), which is the worst option of all.

I suggest adding TryFrom which preserves bytes on Unix, and falls back to UTF-8 on Windows.

On Windows there's no right way to convert a Unicode path to bytes. ANSI code pages could be useful in some cases, but they're a lossy legacy encoding and make the conversion environment-specific. OTOH sticking to UTF-8 wouldn't be worse than the current practice of .to_str().unwrap().
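
To illustrate, a conversion along those lines could look roughly like the sketch below (path_to_cstring is a made-up name for illustration, not a proposed std API):

use std::ffi::CString;
use std::path::Path;

// Minimal sketch of the suggested behaviour: preserve the raw bytes on Unix,
// fall back to requiring UTF-8 elsewhere.
fn path_to_cstring(path: &Path) -> Option<CString> {
    #[cfg(unix)]
    {
        use std::os::unix::ffi::OsStrExt;
        // On Unix a path is an arbitrary byte sequence; keep it as-is.
        CString::new(path.as_os_str().as_bytes()).ok()
    }
    #[cfg(not(unix))]
    {
        // No canonical byte encoding exists here; accept only valid Unicode
        // and encode it as UTF-8.
        CString::new(path.to_str()?).ok()
    }
}

A TryFrom<&Path> for CString impl could wrap the same logic, with the error type distinguishing "contains an interior NUL" from "not representable".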

13 Likes

The lack of conversion from Path to CString is really inconvenient, so I would also like to see a solution in std or in a well-established crate. However, I don't think defaulting to UTF-8 on Windows is a good solution. This is a problematic conversion, so it's better to force users to make an explicit decision about the approach.

Ideally, when passing a path to a library with a C interface, you need to examine the library's documentation or source code to figure out which encoding it expects to receive. For example, some libraries explicitly declare that input must be UTF-8, and some libraries need the path to be converted to the current ANSI code page or OEM code page because they just call fopen with it. I think it makes sense to have separate functions for these cases (and maybe there are other common cases). It seems that the Windows API provides functions for this kind of conversion, so hopefully it can be done without dealing with all possible encodings directly.
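
To sketch what such explicit, per-encoding conversions might look like (all of the names below are hypothetical; none of them exist in std today):

use std::ffi::CString;

// Hypothetical extension trait for Path: each method names the encoding it
// produces, so the caller has to make an explicit choice.
trait PathEncodeExt {
    /// Encode as UTF-8; fails if the path is not valid Unicode.
    fn to_cstring_utf8(&self) -> Option<CString>;

    /// Windows only: encode with the active ANSI code page (what a plain
    /// fopen call would expect); fails if the conversion would be lossy.
    #[cfg(windows)]
    fn to_cstring_ansi(&self) -> Option<CString>;

    /// Windows only: encode with the OEM code page.
    #[cfg(windows)]
    fn to_cstring_oem(&self) -> Option<CString>;
}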

3 Likes

If there was no solution to this in libstd, then there would be some motivation to use an external crate. But .to_str().unwrap() is an easy-but-incorrect way out, which makes it harder to convince people to use a crate for this.

It would be nice if Clippy could detect the CString::new(path.to_str().unwrap()) antipattern, but AFAIK Rust has a policy of not picking favorite crates, so Clippy couldn't suggest a fix unless the replacement lives in the standard library.

If TryInto shouldn't be that opinionated about encoding, how about adding a method on Path that's explicit about encodings? e.g. path.to_bytes_which_may_or_may_not_be_utf8()

2 Likes

I had exactly the OP's question when writing some basic bindings for a C library. I don't know anything about Windows (and don't care to invest more than minimal effort in it tbh).

What should I do for the #[cfg(windows)] case in a Rust function that takes a Path and needs to call a C function that takes a char* filename? Let's say for simplicity that the C function (eventually) just calls fopen() with that char* value.

One possibility for Windows is to convert the path to its short path form (aka the 8.3 filename) if possible. I believe the short path can always be encoded in the current code page, and therefore has a guaranteed CString representation that should work for all reasonably implemented C code (read: I haven’t personally encountered anything that doesn’t).

Note that the short path is not always available (it can be disabled by the user), so a fallback is still needed. In which case there is really no right answer; a C API that takes a path as char * on Windows is broken by design, and it’s impossible to tell what encoding is correct without reading the actual implementation. UTF-8 is as good a guess as any.
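
For illustration, querying the short path could look roughly like this (GetShortPathNameW is a real Win32 function; the helper name and error handling are just a sketch, and the result still needs a separate conversion to bytes, e.g. via the ANSI code page):

#[cfg(windows)]
fn to_short_path_utf16(path: &std::path::Path) -> Option<Vec<u16>> {
    use std::os::windows::ffi::OsStrExt;
    use std::ptr;

    // Raw binding for the sketch; real code would use the winapi or
    // windows-sys crate instead.
    #[link(name = "kernel32")]
    extern "system" {
        fn GetShortPathNameW(long: *const u16, short: *mut u16, cch: u32) -> u32;
    }

    // Win32 wide-string APIs expect a NUL-terminated buffer.
    let wide: Vec<u16> = path.as_os_str().encode_wide().chain(Some(0)).collect();
    unsafe {
        // First call with an empty buffer to learn the required length
        // (including the trailing NUL).
        let needed = GetShortPathNameW(wide.as_ptr(), ptr::null_mut(), 0);
        if needed == 0 {
            return None; // no short name available (e.g. disabled on this volume)
        }
        let mut buf = vec![0u16; needed as usize];
        let written = GetShortPathNameW(wide.as_ptr(), buf.as_mut_ptr(), needed);
        if written == 0 || written >= needed {
            return None;
        }
        buf.truncate(written as usize); // drop the trailing NUL
        Some(buf)
    }
}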

Could this be solved by creating a new community-maintained Clippy lint that would suggest non-std crates, while not being "official"?

Even on Windows, it would sometimes be nice to get the underlying (WTF-8) bytes, if only to interop with external code that uses WTF-8 instead of UCS-2 encoding. At the moment the only way to do that is to use encode_wide and then convert that back to WTF-8, which is quite wasteful.

Currently the underlying WTF-8 format is just an implementation detail and isn't exposed through any public interface. WTF-8 is not a standardised encoding but a well-known hack for round-tripping potentially ill-formed UTF-16. Should we, the Rust language, define and maintain a sort-of text encoding named WTF-8?

WTF-8 does have a specification, at least.

Caveat: The specification requires that:

Any WTF-8 data must be converted to a Unicode encoding at the system’s boundary before being emitted. UTF-8 is recommended. WTF-8 must not be used to represent text in a file format or for transmission over the Internet.

However, it's not clear how this is possible in general, because WTF-8 encodes a superset of Unicode: unpaired surrogates have no valid representation in UTF-8 or any other Unicode encoding.

There are some cases that would benefit from this even if they don't send the WTF-8 bytes to external code. For example, if you use serde for IPC, then sending a Path or OsStr between processes is currently much less efficient on Windows than on non-Windows platforms.

3 Likes

WTF-8 is supposed to be an internal representation, not an encoding for interchange. I don't think it's commonly used, so libraries expecting char * are very unlikely to make proper use of it. Windows itself certainly can't. That makes exposing it more of a new feature for Rust than compatibility with C, so I'd say it's out of scope for CString.

1 Like

What about simply failing on Windows? If there is no correct behavior, that would ensure that code using TryFrom works where the result is well defined.

Yeah, I suppose failing on any non-ASCII chars would be the most correct (I hope we can ignore EBCDIC :))

I shall add that this problem exists in rustc itself. LLVM takes char* for paths, and rustc uses things like:

CString::new(format!("{}", path.display())).unwrap()
CString::new(path_buf.to_string_lossy().as_bytes()).unwrap()

which is incorrect on all platforms. It's unfixable on Windows, but it's unnecessarily broken on Unix.

2 Likes

The correct behavior on Windows is probably to convert the (wide) OsString to a multibyte string using WideCharToMultiByte with the active code page and check that lpUsedDefaultChar was not set. But this really depends on the C library. Using char* on Windows to refer to paths is not really portable. See notes regarding this in Microsoft's fopen documentation.
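
For illustration, the check described above could look roughly like the sketch below (the raw extern declaration stands in for the winapi or windows-sys bindings, and the helper name is made up):

#[cfg(windows)]
fn os_str_to_ansi_cstring(s: &std::ffi::OsStr) -> Option<std::ffi::CString> {
    use std::ffi::CString;
    use std::os::windows::ffi::OsStrExt;
    use std::ptr;

    #[link(name = "kernel32")]
    extern "system" {
        fn WideCharToMultiByte(
            code_page: u32,
            flags: u32,
            wide_str: *const u16,
            wide_len: i32,
            multi_byte_str: *mut u8,
            multi_byte_len: i32,
            default_char: *const u8,
            used_default_char: *mut i32,
        ) -> i32;
    }

    const CP_ACP: u32 = 0; // the process's active ANSI code page

    let wide: Vec<u16> = s.encode_wide().collect();
    if wide.is_empty() {
        return CString::new("").ok();
    }
    unsafe {
        // First call: query the required buffer size in bytes.
        let len = WideCharToMultiByte(
            CP_ACP, 0, wide.as_ptr(), wide.len() as i32,
            ptr::null_mut(), 0, ptr::null(), ptr::null_mut(),
        );
        if len <= 0 {
            return None;
        }
        let mut buf = vec![0u8; len as usize];
        let mut used_default = 0i32;
        let written = WideCharToMultiByte(
            CP_ACP, 0, wide.as_ptr(), wide.len() as i32,
            buf.as_mut_ptr(), len, ptr::null(), &mut used_default,
        );
        if written <= 0 || used_default != 0 {
            // Conversion failed or was lossy: refuse rather than mangle the path.
            return None;
        }
        CString::new(buf).ok()
    }
}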

Microsoft is working on the portability issues: recent versions of Windows 10 support a flag that makes the system ANSI APIs consume and emit UTF-8.

https://docs.microsoft.com/en-us/windows/uwp/design/globalizing/use-utf8-code-page

The page says this will be the default going forward for new APIs as well, though of course support for that in third-party libraries will be spotty, and it doesn't help older OS releases, so care will still have to be taken.

1 Like

Has to start somewhere; perhaps one day, when the minimum supported version of Windows is at least that version of Windows 10, we can just support UTF-8 for system APIs, and stop dealing with wide characters entirely except for conversions for FFI.

1 Like

It's interesting that all those CString::new calls also have .unwrap in them, which I think can never fail (as a path can't have internal null bytes).

Should we add

// a proposed addition to std::os::unix::ffi::OsStrExt
pub trait OsStrExt {
    fn to_cstring(&self) -> CString;
}

?

This obviously doesn't solve the cross-platform issue, but I am somewhat skeptical about that anyway: it seems that Windows "OS strings" are composed of wide chars, and C strings of chars, and those seem fundamentally incompatible.

On Windows, OsString internally stores the data as WTF-8. This is UTF-8 compatible except for also representing unmatched surrogates. CString allows any byte sequence that doesn't contain a null, so storing WTF-8 in a CString is technically valid.

I mean, it doesn't matter what internal format we use. The Windows API uses wchar_t *, the C API uses char *; they are just not interconvertible.

Wait, but Rust Paths can be. They are essentially just a bunch of bytes, no? Interpreting them as anything else depends on the particular function being called and the platform it is being called on.

1 Like