I want to explore ways to make OsStr slightly less opaque on Windows. Currently the only ways to interact with an OsStr are:
- using one of the ASCII methods
- creating a str (which requires either valid Unicode or lossily converting to String)
- using Path methods
- re-encoding as UTF-16 using encode_wide()
Of these, encode_wide() is the most versatile for arbitrary handling of OS strings with minimal overhead, because it doesn't require std to have implemented the specific method you need, nor does it need to check for valid Unicode or convert to a String. However, it's not that great to work with. It only allows forward iteration, and UTF-16 isn't very helpful when most of the Rust ecosystem is geared towards UTF-8. Even broken UTF-8 tends to be easier to work with, as there are plenty of crates for that.
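To illustrate the friction, here is a rough sketch of splitting off an extension using only UTF-16 code units. The names split_extension_wide and split_extension are made up for this example. The core helper takes any iterator of u16 so it can be tried on any platform; on Windows the input would come from encode_wide().

```rust
/// Hypothetical helper: split a stream of UTF-16 code units at the last '.'.
/// On Windows the input would be `os_str.encode_wide()`.
fn split_extension_wide(units: impl Iterator<Item = u16>) -> (Vec<u16>, Option<Vec<u16>>) {
    // The iterator only goes forward, so buffer everything before
    // searching from the end.
    let buf: Vec<u16> = units.collect();
    let dot = buf.iter().rposition(|&u| u == u16::from(b'.'));
    match dot {
        Some(dot) => (buf[..dot].to_vec(), Some(buf[dot + 1..].to_vec())),
        None => (buf, None),
    }
}

/// Windows-only wrapper (also hypothetical): every piece has to be
/// re-encoded into a freshly allocated OsString.
#[cfg(windows)]
fn split_extension(s: &std::ffi::OsStr) -> (std::ffi::OsString, Option<std::ffi::OsString>) {
    use std::ffi::OsString;
    use std::os::windows::ffi::{OsStrExt, OsStringExt};

    let (stem, ext) = split_extension_wide(s.encode_wide());
    (
        OsString::from_wide(&stem),
        ext.map(|e| OsString::from_wide(&e)),
    )
}
```

Note that nothing here can borrow from the original OsStr: the result is always a set of new allocations, which is part of what makes this route awkward.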
To that end I'd like to explore some ways to expose the underlying representation.
edit: I've updated this post to better reflect the current ideas being discussed.
Expose the raw bytes
On *nix platforms, OsStr has an as_bytes method (via the OsStrExt extension trait) to expose the underlying bytes. We could do the same thing for Windows so that OsStr is no longer opaque at all. Other crates could then extend OsStr more effectively, almost as though they were part of the standard library.
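For comparison, this is roughly what the *nix side already allows today. It is only a sketch: split_once_byte is a name invented for this example, and it relies on the existing std::os::unix::ffi::OsStrExt methods as_bytes and from_bytes.

```rust
use std::ffi::OsStr;

/// Hypothetical helper: split an OsStr at the first occurrence of `sep`,
/// borrowing both halves from the original string (Unix only).
#[cfg(unix)]
fn split_once_byte(s: &OsStr, sep: u8) -> Option<(&OsStr, &OsStr)> {
    use std::os::unix::ffi::OsStrExt;

    let bytes = s.as_bytes();
    let pos = bytes.iter().position(|&b| b == sep)?;
    // No validation or allocation is needed: any byte slice is a valid
    // OsStr on Unix.
    Some((
        OsStr::from_bytes(&bytes[..pos]),
        OsStr::from_bytes(&bytes[pos + 1..]),
    ))
}
```

An equivalent on Windows would need the raw bytes to be exposed in the same way, which is where the encoding question comes in.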
However, a Windows as_bytes method would mean publicly stabilizing the internal encoding (currently WTF-8). Note that there are some crates that already rely on this encoding, even though they technically shouldn't.
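To make concrete what stabilizing WTF-8 would expose: an unpaired surrogate is stored as a three-byte sequence using the usual UTF-8 bit layout, which strict UTF-8 decoders reject. The function below is a hypothetical illustration of that layout and is not how std implements it.

```rust
/// Hypothetical illustration: encode one unpaired surrogate
/// (0xD800..=0xDFFF) as the three bytes WTF-8 uses for it.
fn encode_surrogate_wtf8(surrogate: u16) -> [u8; 3] {
    assert!((0xD800..=0xDFFF).contains(&surrogate));
    let cp = u32::from(surrogate);
    [
        0xE0 | (cp >> 12) as u8,          // leading byte: 1110xxxx
        0x80 | ((cp >> 6) & 0x3F) as u8,  // continuation: 10xxxxxx
        0x80 | (cp & 0x3F) as u8,         // continuation: 10xxxxxx
    ]
}
```

So callers of a Windows as_bytes could not assume the bytes are valid UTF-8; for instance, std::str::from_utf8 returns an error for these three bytes.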
Allow iterating, indexing and splitting an OsStr
If keeping the internal encoding a secret is deemed necessary, we could still allow some operations.
For example, a char_indices method could be used to find an index (for the sake of simplicity I'm using generic returns here):
// Return an iterator over an index and Some(char) or None (if there is an unpaired surrogate)
fn char_indices(&self) -> impl Iterator<Item = (usize, Option<char>)>
// Alternatively, instead of Option<char> it could return an enum that is either a char or the
// value of an unpaired surrogate (pseudo-syntax: the enum would need a real name)
fn char_indices(&self) -> impl Iterator<Item = (usize, enum {char, u16})>
These indices could then be used to slice the OsStr via get(range), split_at(), splitn(), etc.
However, this is less versatile than allowing direct access to the bytes. It means having to iterate char by char to find usable indices.
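As a rough feel for the semantics, the enum variant of char_indices can be emulated over UTF-16 code units with char::decode_utf16. This is only a sketch under stated assumptions: CharOrSurrogate and char_indices_wide are invented names, and the indices here count UTF-16 units, whereas a real method would hand out indices into the (still private) internal encoding.

```rust
use std::char::decode_utf16;

/// Hypothetical stand-in for the "char or unpaired surrogate" enum.
#[derive(Debug, Clone, Copy, PartialEq)]
enum CharOrSurrogate {
    Char(char),
    Surrogate(u16),
}

/// Hypothetical emulation of `char_indices` over UTF-16 code units.
/// Each entry is the starting unit offset plus the decoded item.
fn char_indices_wide(units: &[u16]) -> Vec<(usize, CharOrSurrogate)> {
    let mut index = 0;
    decode_utf16(units.iter().copied())
        .map(|decoded| {
            let start = index;
            let item = match decoded {
                Ok(c) => {
                    // A char takes one or two UTF-16 units.
                    index += c.len_utf16();
                    CharOrSurrogate::Char(c)
                }
                Err(e) => {
                    // An unpaired surrogate is always a single unit.
                    index += 1;
                    CharOrSurrogate::Surrogate(e.unpaired_surrogate())
                }
            };
            (start, item)
        })
        .collect()
}
```

The returned offsets always fall on boundaries, so they are safe to slice with, but finding one still means walking the string from the start.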