I want to explore ways to make OsStr seem slightly less opaque on Windows. Currently the only ways to interact with an OsStr are:
- using one of the ascii methods
- creating a str (which requires either valid Unicode or lossily converting to String)
- using Path methods
- re-encoding as UTF-16 using encode_wide()
Of these, encode_wide() is the most versatile for arbitrary handling of OS strings with minimal overhead because it doesn't require the std to have implemented the specific method and neither does it need to check for valid Unicode or convert to a String. However, it's not that great to work with. It only allows forward iteration and using UTF-16 isn't very helpful when most of the Rust ecosystem is geared towards UTF-8. Even broken UTF-8 tends to be easier to work with as there are plenty of crates for that.
To that end I'd like to explore some ways to expose the underlying representation.
edit: I've updated this post to better reflect the current ideas being discussed.
Expose the raw bytes
On *nix platforms, OsStr has an as_bytes method to expose the underlying bytes. We could do the same thing for Windows so that OsStr is no longer opaque at all. Other crates could more effectively extend OsStr almost as though they were part of the standard library.
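To illustrate what byte access already allows on *nix, here is a minimal sketch. The helper split_once_byte is hypothetical (it is not part of std); it only relies on the existing as_bytes and from_bytes methods from std::os::unix::ffi::OsStrExt, so it compiles on Unix targets only.

```rust
use std::ffi::OsStr;
use std::os::unix::ffi::OsStrExt; // Unix-only extension trait

/// Hypothetical helper (not in std): split an OsStr at the first
/// occurrence of `delim`, working directly on the underlying bytes.
/// No Unicode validation and no allocation is needed.
fn split_once_byte(s: &OsStr, delim: u8) -> Option<(&OsStr, &OsStr)> {
    let bytes = s.as_bytes();
    // Find the first byte equal to the delimiter, if any.
    let pos = bytes.iter().position(|&b| b == delim)?;
    Some((
        OsStr::from_bytes(&bytes[..pos]),
        OsStr::from_bytes(&bytes[pos + 1..]),
    ))
}
```

For example, split_once_byte(OsStr::new("key=value"), b'=') yields the pair ("key", "value"), and it works the same way when either half contains bytes that are not valid UTF-8. A Windows equivalent would need the as_bytes-style access discussed here.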
However, a Windows as_bytes method would mean publicly stabilizing the internal encoding (currently WTF-8). Note that there are some crates that already rely on this encoding, even though they technically shouldn't.
Allow iterating, indexing and splitting an OsStr
If keeping the internal encoding a secret is deemed necessary, we could still allow some operations.
For example, a char_indices method could be used to find an index (for the sake of simplicity I'm using generic returns here):
// Return an iterator over an index and Some(char) or None (if there is an unpaired surrogate)
fn char_indices(&self) -> impl Iterator<Item = (usize, Option<char>)>
// Alternatively, instead of Option<char> it could return an enum that is either a char or the
// value of an unpaired surrogate (the inline enum below is pseudocode, not valid Rust)
fn char_indices(&self) -> impl Iterator<Item = (usize, enum {char, u16})>
These indices could then be used to slice the OsStr via get(range), split_at(), splitn(), etc.
However, this is less versatile than allowing direct access to the bytes. It means having to iterate char by char to find usable indices.
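As a rough sketch of what the enum-returning variant of char_indices could yield, the following works on a slice of UTF-16 code units (such as the collected output of encode_wide()) rather than on OsStr itself. The names CharOrSurrogate and char_indices are assumptions for illustration, and the indices here count u16 units, whereas a real method would presumably index into the internal representation.

```rust
/// Hypothetical item type: either a decoded char or the raw value
/// of an unpaired surrogate.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CharOrSurrogate {
    Char(char),
    Surrogate(u16),
}

/// Hypothetical stand-in for the proposed method, operating on UTF-16
/// code units. Yields (index of the first unit, decoded item).
fn char_indices(units: &[u16]) -> impl Iterator<Item = (usize, CharOrSurrogate)> + '_ {
    let mut index = 0;
    std::char::decode_utf16(units.iter().copied()).map(move |result| {
        let start = index;
        match result {
            Ok(c) => {
                // A char takes one or two UTF-16 units.
                index += c.len_utf16();
                (start, CharOrSurrogate::Char(c))
            }
            Err(e) => {
                // An unpaired surrogate always takes exactly one unit.
                index += 1;
                (start, CharOrSurrogate::Surrogate(e.unpaired_surrogate()))
            }
        }
    })
}
```

A caller could scan this iterator for a delimiter char and use the reported index with something like split_at(), which is exactly the char-by-char search that direct byte access would avoid.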