AFAIK the OsStr-related APIs have already exposed and locked the implementation to WTF-8 forever.
OsStr couldn’t be implemented as UCS-2/UTF-16, or even an arbitrary 8-bit codepage, because str is explicitly UTF-8 and implements a non-allocating, infallible AsRef<OsStr> (and there are more APIs like that).
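A minimal sketch of why that conversion pins the representation: since `&str → &OsStr` is infallible and allocation-free, OsStr's encoding must be a strict superset of UTF-8 (raw bytes on Unix, WTF-8 on Windows).

```rust
use std::ffi::OsStr;

fn main() {
    let s: &str = "héllo";
    // `str`'s `AsRef<OsStr>` impl cannot fail and cannot allocate,
    // so the bytes must be reinterpreted as-is: no re-encoding to
    // UTF-16 or a codepage is possible here.
    let os: &OsStr = s.as_ref();
    assert_eq!(os.len(), s.len()); // same byte length on every platform
}
```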
So given that WTF-8 is the only possible implementation of OsStr's interface, can we drop the platform-dependence charade, and make as_bytes() cross-platform?
I wouldn’t mind making encode_wide() cross-platform too, because it’s easier to run tests/CI on non-Windows even if the code is meant for Windows.
Providing as_bytes would remove this use of unsafe and the lengthy comment attempting to rationalize such shenanigans.
An alternative approach here is to beef up the OsStr APIs to provide string search functionality. But there are limits to this. For example, what happens when you want to run a regex over an OsStr? For that to work, you need to expose the byte level representation. This isn’t a strange use case either. It’s actually really common: glob matching.
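On Unix this byte-level access already exists via the platform-specific `OsStrExt`; the sketch below does a trivial `*.txt`-style suffix match over the raw bytes, which is the kind of operation a real glob or regex engine would need on every platform (Unix-only here, precisely because `as_bytes` isn't cross-platform today):

```rust
#[cfg(unix)]
fn main() {
    use std::ffi::OsStr;
    use std::os::unix::ffi::OsStrExt; // Unix-only today; the proposal is
                                      // to make byte access cross-platform.
    let name = OsStr::new("notes.txt");
    // A trivial "*.txt" glob: plain byte-level suffix comparison.
    assert!(name.as_bytes().ends_with(b".txt"));
}

#[cfg(not(unix))]
fn main() {}
```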
The reason WTF-8 bytes are not exposed is not in case we change the internal representation later, it’s because anything one might do with them (except constructing another OsStr) is almost certainly wrong. WTF-8 is a hack that we made up for a very specific purpose. I’d be very sad if it accidentally becomes the de-facto encoding of some protocol because someone copied WTF-8 bytes from an OsStr into a file or a socket without realizing it’s not UTF-8. I’d very much prefer we don’t allow WTF-8 bytes to “leak” out of OsStr. (Though of course someone determined enough can always use transmute, but that’s less likely to be accidental.)
I’m trying to check whether a UNC path is a valid legacy Windows path, so I need to check for reserved filenames (case-insensitive) and reserved characters. These happen to be interesting only in the ASCII subset, so I’m doing horrible things like:
```rust
// The extension traits live in platform-specific modules, so even the
// imports have to be cfg'd.
#[cfg(unix)]
use std::os::unix::ffi::OsStrExt;
#[cfg(windows)]
use std::os::windows::ffi::OsStrExt;

#[cfg(unix)]
let char_ish_iterator = file_name.as_bytes().iter().map(|&b| b as u16);
#[cfg(windows)]
let char_ish_iterator = file_name.encode_wide();
```
and I want this to work on macOS (my dev machine) and Linux too (my CI).
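For concreteness, here is a hedged sketch of the kind of check that iterator feeds into; the reserved-name list and the stem-splitting are simplified illustrations (the real list also includes COM1–COM9, LPT1–LPT9, etc.), and `is_reserved_stem` is a hypothetical helper, not an API:

```rust
// Case-insensitive check of a filename stem against legacy reserved
// device names, consuming any iterator of UTF-16-ish code units.
fn is_reserved_stem(char_ish: impl Iterator<Item = u16>) -> bool {
    const RESERVED: &[&[u8]] = &[b"CON", b"PRN", b"AUX", b"NUL"]; // illustrative subset
    // Take the units before the first '.', uppercasing ASCII letters.
    let stem: Vec<u16> = char_ish
        .take_while(|&u| u != u16::from(b'.'))
        .map(|u| if (u16::from(b'a')..=u16::from(b'z')).contains(&u) { u - 32 } else { u })
        .collect();
    RESERVED.iter().any(|r| {
        r.len() == stem.len() && r.iter().zip(&stem).all(|(&b, &u)| u16::from(b) == u)
    })
}

fn main() {
    // encode_utf16() stands in for the platform-specific iterator above.
    assert!(is_reserved_stem("con.txt".encode_utf16()));
    assert!(!is_reserved_stem("config.txt".encode_utf16()));
}
```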
We have been adding string-like APIs to OsStr and OsString, for example https://github.com/rust-lang/rfcs/pull/1307. So it definitely makes sense to add more. (I also recall work to make the Pattern trait support OsStr too, but I can’t find it again.)
I have been exploring and working on “constant space” substring matching algorithms recently (constant space ≈ no allocations, so libcore-compatible). It’s full of trade-offs, and we could even make do with a brute-force search for most purposes, but the “complicated” linear-time algorithm seems to have OK speed, and it doesn’t have pathological worst cases.
I’m not sure if it’s too academic, but there’s something attractive about being able to provide arbitrary &[T] where T: Eq substring search. Specialization allows picking a smarter algorithm for T: Ord and for T = u8/i8 respectively. What’s not as simple is picking an allocating algorithm depending on whether allocation is available.
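As a baseline for those trade-offs, the brute-force version really is trivially generic and constant-space; a sketch (this is the naive O(n·m) search, not the linear-time algorithm mentioned above):

```rust
// Brute-force substring search over any `&[T]` with `T: Eq`.
// Constant space; O(haystack.len() * needle.len()) worst case.
fn find_sub<T: Eq>(haystack: &[T], needle: &[T]) -> Option<usize> {
    if needle.is_empty() {
        return Some(0); // the empty needle matches at the start
    }
    // `windows` yields nothing when the needle is longer than the haystack.
    haystack.windows(needle.len()).position(|w| w == needle)
}

fn main() {
    assert_eq!(find_sub(b"hello world", b"world"), Some(6));
    assert_eq!(find_sub(&[1u32, 2, 3, 4], &[2, 3]), Some(1));
    assert_eq!(find_sub(b"abc", b"zzz"), None);
}
```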