Make std::os::unix::ffi::OsStrExt cross-platform

AFAIK the OsStr-related APIs have already exposed enough that the implementation is locked to WTF-8 forever.

OsStr couldn’t be implemented as stored UCS-2/UTF-16, or even as an arbitrary 8-bit codepage, because str is explicitly UTF-8 and has a non-allocating, infallible AsRef<OsStr> impl (and there are more APIs like that).
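For illustration, this conversion already exists in std, allocates nothing, and cannot fail, which is exactly what pins the representation to a superset of UTF-8 (a minimal demonstration, not new API):

    use std::ffi::OsStr;

    // &str -> &OsStr reuses the bytes as-is: no allocation, no error path,
    // so OsStr's encoding must contain UTF-8 verbatim.
    fn to_os(s: &str) -> &OsStr {
        s.as_ref()
    }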

So given that WTF-8 is the only possible implementation of OsStr's interface, can we drop the platform-dependence charade, and make as_bytes() cross-platform?

I wouldn’t mind making encode_wide() cross-platform too, because it’s easier to run tests/CI on non-Windows even if the code is meant for Windows.


I would be in favor of something like this. A useful property of WTF-8 is that it is ASCII compatible, which can be used to speed up certain kinds of code, and I do that here: https://github.com/BurntSushi/ripgrep/blob/2c84825ccbe2d42e0648b56a7d3289c5f5bcb9e5/globset/src/pathutil.rs#L58-L84

Providing as_bytes would remove this use of unsafe and the lengthy comment attempting to rationalize such shenanigans.
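For example, here is a minimal sketch of the kind of check this enables; it compiles only on Unix today via OsStrExt, and the point of the proposal is that the same code could be portable (ends_with_ascii_ci is an illustrative name, not an existing API):

    use std::ffi::OsStr;
    use std::os::unix::ffi::OsStrExt; // a cross-platform as_bytes() would remove this

    // WTF-8, like UTF-8, never uses bytes below 0x80 inside a multi-byte
    // sequence, so a byte-wise ASCII comparison is sound.
    fn ends_with_ascii_ci(name: &OsStr, suffix: &[u8]) -> bool {
        let bytes = name.as_bytes();
        bytes.len() >= suffix.len()
            && bytes[bytes.len() - suffix.len()..].eq_ignore_ascii_case(suffix)
    }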

An alternative approach here is to beef up the OsStr APIs to provide string search functionality. But there are limits to this. For example, what happens when you want to run a regex over an OsStr? For that to work, you need to expose the byte level representation. This isn’t a strange use case either. It’s actually really common: glob matching.

cc @SimonSapin

It is possible if you try hard enough :stuck_out_tongue:, e.g. represent it as something like

union {
    as_utf8: str,
    as_utf16: struct {
        magic: u16 = 0xfeff,  // <- no UTF-8 can contain \xfe or \xff
        utf16: [u16],
    },
}

(Not implying this is a good choice, just that you can’t conclude that WTF-8 is the “only possible implementation”.)


The reason WTF-8 bytes are not exposed is not in case we change the internal representation later, it’s because anything one might do with them (except constructing another OsStr) is almost certainly wrong. WTF-8 is a hack that we made up for a very specific purpose. I’d be very sad if it accidentally becomes the de-facto encoding of some protocol because someone copied WTF-8 bytes from an OsStr into a file or a socket without realizing it’s not UTF-8. I’d very much prefer we don’t allow WTF-8 bytes to “leak” out of OsStr. (Though of course someone determined enough can always use transmute, but that’s less likely to be accidental.)

@kornel, what would you want to use this for?

As to encode_wide, what would it do on platforms where OsStr contains arbitrary bytes?


I’m trying to check whether a UNC path is a valid legacy Windows path, so I need to check for reserved filenames (case-insensitively) and reserved characters. These only matter within the ASCII subset, so I’m doing horrible things like:

    #[cfg(unix)]
    use std::os::unix::ffi::OsStrExt; // for as_bytes()
    #[cfg(windows)]
    use std::os::windows::ffi::OsStrExt; // for encode_wide()
    #[cfg(unix)]
    let char_ish_iterator = file_name.as_bytes().iter().map(|&b| b as u16);
    #[cfg(windows)]
    let char_ish_iterator = file_name.encode_wide();

and I want this to work on macOS (my dev machine) and Linux too (my CI).
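For context, a hypothetical continuation of the snippet above, with the COM/trailing-space tests standing in for the full legacy reserved-name rules:

    // Sketch: compare the first three "char-ish" units against "COM"
    // (case-insensitively) and reject a trailing space; the u < 128
    // guard keeps non-ASCII units from aliasing ASCII letters.
    let units: Vec<u16> = char_ish_iterator.collect();
    let starts_with_com = units.len() >= 3
        && units[..3]
            .iter()
            .zip(b"COM")
            .all(|(&u, &c)| u < 128 && (u as u8).eq_ignore_ascii_case(&c));
    let ends_with_space = units.last() == Some(&u16::from(b' '));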

If you’re only interested in the ASCII range, would to_string_lossy work?
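For instance, a minimal sketch of that route, with the COM/trailing-space checks as stand-ins for the real rules:

    use std::ffi::OsStr;

    // to_string_lossy only replaces non-ASCII sequences with U+FFFD,
    // so checks confined to the ASCII range still hold.
    fn looks_reserved(name: &OsStr) -> bool {
        let s = name.to_string_lossy();
        s.get(..3).map_or(false, |p| p.eq_ignore_ascii_case("COM"))
            || s.ends_with(' ')
    }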

It does work, but it feels like quite a heavy approach just to check whether a string starts with “COM” or ends with a space.


So what is really needed is a native .starts_with()/.ends_with()/.contains() method, not the ability to extract the underlying WTF-8/UCS-2 buffer.

We have been adding string-like APIs to OsStr and OsString, for example https://github.com/rust-lang/rfcs/pull/1307. So it definitely makes sense to add more. (I also recall work to make the Pattern trait support OsStr too, but I can’t find it again.)
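As a rough sketch, such a method could be built today from the per-platform extension traits; os_starts_with is an illustrative name, the sketch covers Unix and Windows only, and a real std method would avoid the allocation on the Windows side:

    use std::ffi::OsStr;

    fn os_starts_with(haystack: &OsStr, needle: &OsStr) -> bool {
        #[cfg(unix)]
        {
            use std::os::unix::ffi::OsStrExt;
            haystack.as_bytes().starts_with(needle.as_bytes())
        }
        #[cfg(windows)]
        {
            use std::os::windows::ffi::OsStrExt;
            let h: Vec<u16> = haystack.encode_wide().collect();
            let n: Vec<u16> = needle.encode_wide().collect();
            h.starts_with(&n)
        }
    }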


Regarding patterns.

I have been exploring and working on “constant space” substring matching algorithms recently (constant space ≈ no allocations, so libcore-compatible). It’s full of trade-offs: we could even make do with a brute-force search for most purposes, but the “complicated” linear-time algorithm seems to have OK speed and no pathological worst cases.

I’m not sure if it’s too academic, but there’s something attractive about being able to provide substring search over arbitrary &[T] where T: Eq. Specialization allows picking a smarter algorithm for T: Ord, and another for T = u8/i8. What’s not as simple is choosing an allocating algorithm depending on whether allocation is available.
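As a baseline, the brute-force version of such a search is a few lines of constant-space code; the linear-time algorithm mentioned above would replace the body while keeping the same shape:

    // Brute-force substring search over any T: Eq.
    // Constant space (no allocation), O(n*m) worst case.
    fn find_substring<T: Eq>(haystack: &[T], needle: &[T]) -> Option<usize> {
        if needle.is_empty() {
            return Some(0);
        }
        haystack.windows(needle.len()).position(|w| w == needle)
    }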

I see the point about preventing leakage of WTF-8 to the outside world. In that case addition of helper functions is an OK alternative.
