Making `OsStr` less opaque on Windows

I want to explore ways to make OsStr slightly less opaque on Windows. Currently the only ways to interact with an OsStr are:

  • using one of the ASCII methods
  • creating a str (which requires either valid Unicode or a lossy conversion to a String)
  • using Path methods
  • re-encoding as UTF-16 using encode_wide()

Of these, encode_wide() is the most versatile for arbitrary handling of OS strings with minimal overhead: it doesn't depend on std having implemented a specific method, and it doesn't need to check for valid Unicode or convert to a String. However, it's not great to work with. It only allows forward iteration, and UTF-16 isn't very helpful when most of the Rust ecosystem is geared towards UTF-8. Even broken UTF-8 tends to be easier to work with, as there are plenty of crates for that.
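As a rough illustration of the options listed above, here is a minimal sketch using only methods that exist in std today. The helper names (is_readme, extension_lossy, wide_units) are made up for this example, and the last one only compiles on Windows.

```rust
use std::ffi::OsStr;
use std::path::Path;

/// ASCII methods: compare without caring about ASCII case.
fn is_readme(name: &OsStr) -> bool {
    name.eq_ignore_ascii_case("readme.md")
}

/// Path methods plus a lossy conversion: any unpaired surrogate in the
/// extension is replaced with U+FFFD, so the result is for display only.
fn extension_lossy(name: &OsStr) -> Option<String> {
    Path::new(name)
        .extension()
        .map(|ext| ext.to_string_lossy().into_owned())
}

/// Re-encoding as UTF-16: forward iteration only, and Windows only.
#[cfg(windows)]
fn wide_units(name: &OsStr) -> Vec<u16> {
    use std::os::windows::ffi::OsStrExt;
    name.encode_wide().collect()
}
```

Anything beyond this (searching, splitting at an arbitrary position, matching) has to go through one of these routes first.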

To that end I'd like to explore some ways to expose the underlying representation.

edit: I've updated this post to better reflect the current ideas being discussed.

Expose the raw bytes

[suggested by josh]

On *nix platforms, OsStr has an as_bytes method to expose the underlying bytes. We could do the same thing for Windows so that OsStr is no longer opaque at all. Other crates could more effectively extend OsStr almost as though they were part of the standard library.
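To show what that buys you on *nix today, here is a small sketch that splits a `key=value` argument at the first `=`. The function name split_key_value is hypothetical. The Unix version works directly on the bytes, so it also handles invalid UTF-8; the fallback for other platforms has to go through to_str() and therefore gives up on anything that isn't valid Unicode.

```rust
use std::ffi::OsStr;

/// Unix: split on the raw bytes, no Unicode validation needed.
#[cfg(unix)]
fn split_key_value(arg: &OsStr) -> Option<(&OsStr, &OsStr)> {
    use std::os::unix::ffi::OsStrExt;
    let bytes = arg.as_bytes();
    let eq = bytes.iter().position(|&b| b == b'=')?;
    Some((
        OsStr::from_bytes(&bytes[..eq]),
        OsStr::from_bytes(&bytes[eq + 1..]),
    ))
}

/// Elsewhere: without byte access we can only handle valid Unicode.
#[cfg(not(unix))]
fn split_key_value(arg: &OsStr) -> Option<(&OsStr, &OsStr)> {
    let s = arg.to_str()?;
    let (key, value) = s.split_once('=')?;
    Some((OsStr::new(key), OsStr::new(value)))
}
```

With a Windows as_bytes equivalent, the first version could in principle be shared across platforms.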

However, a Windows as_bytes method would mean publicly stabilizing the internal encoding (currently WTF-8). Note that there are some crates that already rely on this encoding, even though they technically shouldn't.

Allow iterating, indexing and splitting an OsStr

[suggested by kornel]

If keeping the internal encoding a secret is deemed necessary, we could still allow some operations.

For example, a char_indices method could be used to find an index (for the sake of simplicity I'm using generic returns here):

// Return an iterator over an index and Some(char), or None if there is an unpaired surrogate
fn char_indices(&self) -> impl Iterator<Item = (usize, Option<char>)>

// Alternatively, instead of Option<char> it could return an enum that is either a char or the
// value of an unpaired surrogate (the anonymous enum here is shorthand, not valid Rust)
fn char_indices(&self) -> impl Iterator<Item = (usize, enum {char, u16})>

These indices could then be used to slice the OsStr via get(range), split_at(), splitn(), etc.

However, this is less versatile than allowing direct access to the bytes. It means having to iterate char by char to find usable indices.
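To make the proposed semantics concrete, here is a sketch of such an iterator over a plain UTF-16 slice rather than an OsStr (OsStr doesn't expose its contents, so this is only a model). The name wide_char_indices is hypothetical, Result<char, u16> stands in for the char-or-surrogate enum, and the indices here count u16 units, whereas a real OsStr method would presumably use offsets into its internal encoding.

```rust
/// Yields (index, Ok(char)) for well-formed text and
/// (index, Err(unit)) for each unpaired surrogate.
fn wide_char_indices(
    units: &[u16],
) -> impl Iterator<Item = (usize, Result<char, u16>)> + '_ {
    let mut index = 0;
    std::char::decode_utf16(units.iter().copied()).map(move |decoded| {
        let start = index;
        match decoded {
            Ok(c) => {
                // A char takes one or two UTF-16 units.
                index += c.len_utf16();
                (start, Ok(c))
            }
            Err(e) => {
                // An unpaired surrogate is always a single unit.
                index += 1;
                (start, Err(e.unpaired_surrogate()))
            }
        }
    })
}
```

A caller would scan for the char it cares about, remember the index, and then slice at that index.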

I'd like to propose the alternative that we just expose the byte-based underlying nature on all platforms, rather than trying to hide that.

People already widely rely on that, to the point that I don't think we could reasonably change it. We could go ahead and expose it, for convenience, and make OsStr and OsString as close as we can reasonably get them to BStr and BString.

(In an ideal world, I wish we could have BStr and BString, and separately have "OS string" types that weren't byte-based and matched the underlying platform, so that on Windows we didn't have to convert back-and-forth between WTF-8 and UTF-16. But our current stable API surface doesn't allow for that. There are some other improvements we might be able to make in that vein, though.)

I would love that! But from reading earlier conversations it seemed that there was considerable reluctance to making it public. Perhaps time and experience has changed that. If so I'd very much welcome it.

Thinking further on this, would it make sense for OsStr to then be marked #[repr(transparent)], since we would be guaranteeing that it's a thin wrapper around its bytes?

It's something I'd like to see too.

The problem is that if you work with OsStr, it's because you specifically care about preserving the broken surrogates, so all the functionality has to support them losslessly. Otherwise you can call to_string_lossy to make all the problems disappear.

If you expose impl Iterator<Item=Option<u8>>, it's not materially different from to_string_lossy, you just get None instead of the replacement character.

Iterating over Result<char, u16> (I assume you've meant char here, not u8) is not especially useful, because with such an iterator you can't do anything useful other than build a string, and given the constraints, the only encodings you can use are WTF-8 or UCS-2, so you're back to square one.


I think char_indices() would be useful. It could iterate over (usize, Option<char>) (or (usize, enum {char, u16})). In that case you wouldn't iterate to keep the chars, but only to compare them and make a note of the indices for later slicing.

And of course it needs to support get(range), split_at(), splitn(), etc.

Seconding BStr/BString. The bstr crate is wonderful and I would love to use it in more places. The only problem is that it is, of course, another dependency that isn't strictly necessary.

OsStr being bytes is already implied by the various AsRef implementations (e.g. str: AsRef<OsStr>, and str is documented as being a slice of UTF-8 bytes). I.e. OsStr is a superset of UTF-8. Whether or not WTF-8 is exposed, giving access to at least the UTF-8 spans seems reasonable.

Having access to the original UTF8-violating portion of an OsStr (and not just an error or the replacement character) is useful when you wish to display a path, etc., back to the user in some way that isn't either lossy or writing the OsStr directly (bypassing Display), e.g. backslash escapes or the like. (Though perhaps this isn't so much a concern on Windows?)
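As a sketch of what that could look like given byte access, the function below walks a byte slice, copies the valid UTF-8 spans as-is and backslash-escapes everything else. The name escape_invalid_utf8 and the `\xNN` format are just choices for this example. It assumes the bytes come from something like the Unix as_bytes(), and it does not escape literal backslashes, so the output is readable but not unambiguously reversible.

```rust
/// Renders valid UTF-8 spans verbatim and invalid bytes as `\xNN`.
fn escape_invalid_utf8(mut bytes: &[u8]) -> String {
    let mut out = String::new();
    while !bytes.is_empty() {
        match std::str::from_utf8(bytes) {
            Ok(valid) => {
                out.push_str(valid);
                break;
            }
            Err(err) => {
                // Everything before `valid_up_to` is known-good UTF-8.
                let (valid, rest) = bytes.split_at(err.valid_up_to());
                out.push_str(std::str::from_utf8(valid).expect("checked by valid_up_to"));
                // `error_len` is None when the input ends mid-sequence.
                let bad_len = err.error_len().unwrap_or(rest.len());
                for byte in &rest[..bad_len] {
                    out.push_str(&format!("\\x{:02X}", byte));
                }
                bytes = &rest[bad_len..];
            }
        }
    }
    out
}
```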

I think OsStr could be implemented as an enum, with one variant containing a &str for the AsRef impl and another variant containing the actual encoding used by the OS, for performance. In addition, OsStr can't be a superset of UTF-8 on every OS. For example, the OS might use a character encoding more limited than Unicode, like EBCDIC, which is used on IBM's z/OS.

impl AsRef<OsStr> for str requires that OsStr is a superset of UTF-8. Platforms where this is not true of the OS encoding must return an error at the OS interface rather than fail to convert to an OsStr.

I personally have long wanted access to the raw byte representation of an OsStr. Without it, it is difficult to perform operations such as regex or glob matching on OsStrs without incurring a cost. We can do it on Unix of course, but not on Windows.
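For a sense of why byte access matters here, below is a minimal sketch of a glob matcher that works on raw bytes and so never needs to validate or convert the input. It only supports `*`, the name glob_match is made up for this example, and a real matcher (or a regex engine in bytes mode) would do far more.

```rust
/// Matches `text` against `pattern`, where `*` matches any run of bytes.
fn glob_match(pattern: &[u8], text: &[u8]) -> bool {
    let (mut p, mut t) = (0, 0);
    // Position to resume from after the most recent `*`.
    let mut backtrack: Option<(usize, usize)> = None;
    while t < text.len() {
        if p < pattern.len() && pattern[p] == b'*' {
            backtrack = Some((p + 1, t));
            p += 1;
        } else if p < pattern.len() && pattern[p] == text[t] {
            p += 1;
            t += 1;
        } else if let Some((bp, bt)) = backtrack {
            // Let the last `*` swallow one more byte and retry.
            p = bp;
            t = bt + 1;
            backtrack = Some((bp, bt + 1));
        } else {
            return false;
        }
    }
    // Only trailing `*`s may remain in the pattern.
    pattern[p..].iter().all(|&b| b == b'*')
}
```

On Unix this can be fed `path.as_os_str().as_bytes()` directly, invalid UTF-8 included; on Windows there is currently nothing equivalent to pass in.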

With that said, so long as we keep the raw byte representation an implementation detail, we have the freedom to change it. AIUI, the OsStr Pattern RFC has not been implemented yet, but it does want us to move to OMG-WTF-8 in order to deal with split surrogates elegantly. If we exposed the raw WTF-8 representation, we might not be able to make this change. I say "might," because we could say something like, "this as_bytes() method exposes the representation of an OsStr, but there are no guarantees about it and it may change at any point." However, I think this is likely an optimistic phrasing. As soon as many folks rely on the specific representation, it will be difficult to change it in practice.

To be clear, I am not necessarily opposed to exposing things here. I just want us to take everything into account and I don't think the aforementioned RFC has been mentioned yet.

Hm, I can see the argument for stabilizing without guaranteeing the specific encoding (other than that it's some superset of UTF-8). Ultimately there's nothing interesting anyone can do with unpaired surrogates other than treat them as a kind of black box. Matching directly on them doesn't make sense other than for diagnostic purposes; splitting an OsStr in the middle of a well-formed surrogate pair is not a good idea; they should not be used when creating new files; etc.

Code written for *nix platforms should already be able to handle invalid UTF-8 in the input. If the result of adding as_bytes for Windows is that more code can be reused between platforms, then it's an improvement, especially as invalid UTF-16 is rare on Windows, so those code paths may not have the same real-world testing as the equivalent Unix code.


I don't mean to discount the issues with exposing an internal encoding. I'm just trying to think through the arguments.

@bjorn3 Such an implementation wouldn't handle converting the OS representation to str without allocation when it's valid Unicode, which we have stable APIs for.
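To spell out the problem, here is a toy model of the enum representation. SketchOsStr and its methods are hypothetical and only mirror the shape of the idea: when the data sits in the Wide variant, there is no UTF-8 buffer to borrow from, so a to_str()-style method can't return a &str even if the contents are valid Unicode.

```rust
/// Toy model: either borrowed UTF-8 or borrowed UTF-16.
enum SketchOsStr<'a> {
    Utf8(&'a str),
    Wide(&'a [u16]),
}

impl<'a> SketchOsStr<'a> {
    /// Borrowing only works for the UTF-8 variant.
    fn to_str(&self) -> Option<&'a str> {
        match *self {
            SketchOsStr::Utf8(s) => Some(s),
            // Valid or not, there are no UTF-8 bytes to point at.
            SketchOsStr::Wide(_) => None,
        }
    }

    /// Getting text out of the wide variant needs a new allocation.
    fn to_string_alloc(&self) -> Option<String> {
        match *self {
            SketchOsStr::Utf8(s) => Some(s.to_owned()),
            SketchOsStr::Wide(units) => String::from_utf16(units).ok(),
        }
    }
}
```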

I agree that we have to nail down the encoding before stably exposing the bytes. I think we could make a decision about WTF-8 vs OMG-WTF-8 before we do that. (I personally think we should stick with WTF-8 so that there's zero translation overhead for the common case of UTF-8. Also, the tradeoffs may be different now that we've decided to keep the pattern trait sealed and used only within std.)

Also, as mentioned above, we already have crates that are making assumptions about the bytes, and using them. We might break those if we changed the encoding, and while we would be allowed to do so, that doesn't mean we should.

Note: I've updated the OP to (hopefully) better reflect what's being discussed. You can see the original post via the edit history.

I think we should simply use Unix OsStr representation (Vec<u8>) on Windows too. Because non-allocating conversion &str -> &OsStr is already stable, the internal representation of OsStr has to be a superset of UTF-8, so any sort of BSTR-like representation is right out.
The only problem with this approach would be a compatible implementation of std::os::windows::ffi::OsStrExt::encode_wide(), i.e. "What do we do when the OsStr contains invalid UTF-8?". IMO, something like Python's surrogateescape encoding would be entirely fine to handle this.
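A rough sketch of that idea, working on a plain byte slice: valid UTF-8 is transcoded to UTF-16 as usual, and each invalid byte 0xNN becomes the lone surrogate 0xDCNN, which is how Python's surrogateescape handler maps undecodable bytes. The function name is hypothetical, and this doesn't address how unpaired surrogates coming from Windows would be stored as bytes in the first place.

```rust
/// Transcodes bytes to UTF-16, mapping each invalid byte to 0xDC00 + byte.
fn encode_wide_surrogateescape(mut bytes: &[u8]) -> Vec<u16> {
    let mut out = Vec::new();
    loop {
        match std::str::from_utf8(bytes) {
            Ok(valid) => {
                out.extend(valid.encode_utf16());
                return out;
            }
            Err(err) => {
                let (valid, rest) = bytes.split_at(err.valid_up_to());
                out.extend(
                    std::str::from_utf8(valid)
                        .expect("checked by valid_up_to")
                        .encode_utf16(),
                );
                // Invalid bytes are always >= 0x80, so these land in
                // the low-surrogate range 0xDC80..=0xDCFF.
                let bad_len = err.error_len().unwrap_or(rest.len());
                out.extend(rest[..bad_len].iter().map(|&b| 0xDC00 + u16::from(b)));
                bytes = &rest[bad_len..];
            }
        }
    }
}
```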