Make std::os::unix::ffi::OsStrExt cross-platform

AFAIK the OsStr-related APIs have already exposed enough that the implementation is locked to WTF-8 forever.

OsStr couldn’t be implemented as stored UCS-2/UTF-16, or even as an arbitrary 8-bit codepage, because str is explicitly UTF-8 and has a non-allocating, infallible AsRef<OsStr> impl (and there are more APIs like that).
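For illustration, this conversion already exists in std, allocates nothing, and cannot fail, which is exactly what pins the representation to a superset of UTF-8 (a minimal demonstration, not new API):

    use std::ffi::OsStr;

    // &str -> &OsStr reuses the bytes as-is: no allocation, no error path,
    // so OsStr's encoding must contain UTF-8 verbatim.
    fn to_os(s: &str) -> &OsStr {
        s.as_ref()
    }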

So given that WTF-8 is the only possible implementation of OsStr's interface, can we drop the platform-dependence charade, and make as_bytes() cross-platform?

I wouldn’t mind making encode_wide() cross-platform too, because it’s easier to run tests/CI on non-Windows even if the code is meant for Windows.


I would be in favor of something like this. A useful property of WTF-8 is that it is ASCII compatible, which can be used to speed up certain kinds of code, and I do that here: https://github.com/BurntSushi/ripgrep/blob/2c84825ccbe2d42e0648b56a7d3289c5f5bcb9e5/globset/src/pathutil.rs#L58-L84

Providing as_bytes would remove this use of unsafe and the lengthy comment attempting to rationalize such shenanigans.
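For example, here is a minimal sketch of the kind of check this enables; it compiles only on Unix today via OsStrExt, and the point of the proposal is that the same code could be portable (ends_with_ascii_ci is an illustrative name, not an existing API):

    use std::ffi::OsStr;
    use std::os::unix::ffi::OsStrExt; // a cross-platform as_bytes() would remove this

    // WTF-8, like UTF-8, never uses bytes below 0x80 inside a multi-byte
    // sequence, so a byte-wise ASCII comparison is sound.
    fn ends_with_ascii_ci(name: &OsStr, suffix: &[u8]) -> bool {
        let bytes = name.as_bytes();
        bytes.len() >= suffix.len()
            && bytes[bytes.len() - suffix.len()..].eq_ignore_ascii_case(suffix)
    }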

An alternative approach here is to beef up the OsStr APIs to provide string search functionality. But there are limits to this. For example, what happens when you want to run a regex over an OsStr? For that to work, you need to expose the byte level representation. This isn’t a strange use case either. It’s actually really common: glob matching.

cc @SimonSapin

It is possible if you try hard enough :stuck_out_tongue:, e.g. represent it as something like

union {
    as_utf8: str,
    as_utf16: struct {
        magic: u16 = 0xfeff,  // <- no UTF-8 can contain \xfe or \xff
        utf16: [u16],
    },
}

(Not implying this is a good choice, just that you can’t conclude that WTF-8 is the “only possible implementation”.)


The reason WTF-8 bytes are not exposed is not in case we change the internal representation later, it’s because anything one might do with them (except constructing another OsStr) is almost certainly wrong. WTF-8 is a hack that we made up for a very specific purpose. I’d be very sad if it accidentally becomes the de-facto encoding of some protocol because someone copied WTF-8 bytes from an OsStr into a file or a socket without realizing it’s not UTF-8. I’d very much prefer we don’t allow WTF-8 bytes to “leak” out of OsStr. (Though of course someone determined enough can always use transmute, but that’s less likely to be accidental.)

@kornel, what would you want to use this for?

As to encode_wide, what would it do on platforms where OsStr contains arbitrary bytes?


I’m trying to check whether a UNC path is a valid legacy Windows path, so I need to check for reserved filenames (case-insensitively) and reserved characters. These only matter within the ASCII subset, so I’m doing horrible things like:

    #[cfg(unix)]
    use std::os::unix::ffi::OsStrExt; // for as_bytes()
    #[cfg(windows)]
    use std::os::windows::ffi::OsStrExt; // for encode_wide()
    #[cfg(unix)]
    let char_ish_iterator = file_name.as_bytes().iter().map(|&b| b as u16);
    #[cfg(windows)]
    let char_ish_iterator = file_name.encode_wide();

and I want this to work on macOS (my dev machine) and Linux too (my CI).
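For context, a hypothetical continuation of the snippet above, with the COM/trailing-space tests standing in for the full legacy reserved-name rules:

    // Sketch: compare the first three "char-ish" units against "COM"
    // (case-insensitively) and reject a trailing space; the u < 128
    // guard keeps non-ASCII units from aliasing ASCII letters.
    let units: Vec<u16> = char_ish_iterator.collect();
    let starts_with_com = units.len() >= 3
        && units[..3]
            .iter()
            .zip(b"COM")
            .all(|(&u, &c)| u < 128 && (u as u8).eq_ignore_ascii_case(&c));
    let ends_with_space = units.last() == Some(&u16::from(b' '));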

If you’re only interested in the ASCII range, would to_string_lossy work?
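For instance, a minimal sketch of that route, with the COM/trailing-space checks as stand-ins for the real rules:

    use std::ffi::OsStr;

    // to_string_lossy only replaces non-ASCII sequences with U+FFFD,
    // so checks confined to the ASCII range still hold.
    fn looks_reserved(name: &OsStr) -> bool {
        let s = name.to_string_lossy();
        s.get(..3).map_or(false, |p| p.eq_ignore_ascii_case("COM"))
            || s.ends_with(' ')
    }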

It does work, but it feels like quite a heavy approach just to check whether a string starts with “COM” or ends with a space.


So what is really needed is a native .starts_with()/.ends_with()/.contains() method, not the ability to extract the underlying WTF-8/UCS-2 buffer.

We have been adding string-like APIs to OsStr and OsString, for example https://github.com/rust-lang/rfcs/pull/1307. So it definitely makes sense to add more. (I also recall work to make the Pattern trait support OsStr too, but I can’t find it again.)
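As a rough sketch, such a method could be built today from the per-platform extension traits; os_starts_with is an illustrative name, the sketch covers Unix and Windows only, and a real std method would avoid the allocation on the Windows side:

    use std::ffi::OsStr;

    fn os_starts_with(haystack: &OsStr, needle: &OsStr) -> bool {
        #[cfg(unix)]
        {
            use std::os::unix::ffi::OsStrExt;
            haystack.as_bytes().starts_with(needle.as_bytes())
        }
        #[cfg(windows)]
        {
            use std::os::windows::ffi::OsStrExt;
            let h: Vec<u16> = haystack.encode_wide().collect();
            let n: Vec<u16> = needle.encode_wide().collect();
            h.starts_with(&n)
        }
    }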


Regarding patterns.

I have been exploring and working on “constant space” substring matching algorithms recently (constant space ≈ no allocations, so libcore-compatible). It’s full of trade-offs: we could even make do with a brute-force search for most purposes, but the “complicated” linear-time algorithm seems to have OK speed and no pathological worst cases.

I’m not sure if it’s too academic, but there’s something attractive about being able to provide substring search over arbitrary &[T] where T: Eq. Specialization allows picking a smarter algorithm for T: Ord, and another for T = u8/i8. What’s not as simple is choosing an allocating algorithm depending on whether allocation is available.
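As a baseline, the brute-force version of such a search is a few lines of constant-space code; the linear-time algorithm mentioned above would replace the body while keeping the same shape:

    // Brute-force substring search over any T: Eq.
    // Constant space (no allocation), O(n*m) worst case.
    fn find_substring<T: Eq>(haystack: &[T], needle: &[T]) -> Option<usize> {
        if needle.is_empty() {
            return Some(0);
        }
        haystack.windows(needle.len()).position(|w| w == needle)
    }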

I see the point about preventing leakage of WTF-8 to the outside world. In that case addition of helper functions is an OK alternative.
