Why isn't the `into_string` method available directly on `PathBuf`, even though it already exists on `OsString`?

It's just annoying to have to convert a PathBuf to OsString before converting it to String.

You can use:

  • PathBuf::to_str() to access a PathBuf as an Option<&str>
    • It's an Option because a system path can contain non-UTF-8 data and &str must be valid UTF-8.
    • If you want to handle non-UTF-8 strings, use OsString directly.
  • PathBuf::to_string_lossy()
    • Use this when you want the path for, e.g., printing and don't care that the text you get out is an exact match (invalid characters won't be printed properly anyway).
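
A minimal sketch of the two borrowed accessors listed above. The path literal and the `describe` helper are made up for illustration and are not part of std.

```rust
use std::borrow::Cow;
use std::path::{Path, PathBuf};

/// Hypothetical helper: render a path as text and note which accessor was used.
fn describe(path: &Path) -> String {
    match path.to_str() {
        // Valid UTF-8: to_str() hands back a borrowed &str.
        Some(s) => format!("utf8: {}", s),
        // Not valid UTF-8: fall back to a lossy rendering, for display only.
        None => format!("lossy: {}", path.to_string_lossy()),
    }
}

fn main() {
    let path = PathBuf::from("/hello/world");
    // to_str() is None for paths that are not valid UTF-8.
    let strict: Option<&str> = path.to_str();
    // to_string_lossy() returns a Cow, borrowed when nothing had to be replaced.
    let lossy: Cow<'_, str> = path.to_string_lossy();
    println!("{:?} / {} / {}", strict, lossy, describe(&path));
}
```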

Take a look at OsString documentation because it explains this in more detail.

PathBuf::into_string() wouldn't be much more useful: with the two methods above you still need two method calls. OsString::into_string() returns a Result<String, OsString>, where Err covers the non-UTF-8 case anyway. It's much more cost-effective to check the encoding (with to_str) before copying an invalid string you won't be able to use later, which I guess is why into_string doesn't exist.

Not a real solution, I'm afraid.

I need into_string because it must be into_string:

  • I need an owned string.
  • I don't want to clone; both methods you suggest will clone.
  • I need to error immediately on non-UTF-8 input, with no unnecessary work.

And given that the internal data of a PathBuf is just OsString, a PathBuf::into_string should be zero-cost.
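
For reference, the two-step conversion that exists today can be sketched as follows. `path_to_string` is a hypothetical helper name, not a std API.

```rust
use std::ffi::OsString;
use std::path::PathBuf;

/// Hypothetical helper: consume a PathBuf and return an owned String,
/// or hand back the original OsString if it is not valid UTF-8.
fn path_to_string(path: PathBuf) -> Result<String, OsString> {
    // Both methods take `self` by value, so no explicit clone is involved.
    path.into_os_string().into_string()
}

fn main() {
    let path = PathBuf::from("/hello/world");
    match path_to_string(path) {
        Ok(s) => println!("owned string: {}", s),
        // The Err variant carries the untouched OsString back to the caller.
        Err(raw) => eprintln!("not valid UTF-8: {:?}", raw),
    }
}
```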

For small additions like this, you can submit an API change proposal.

I don't think there would be any major objections to

impl PathBuf {
    pub fn into_string(self) -> Result<String, PathBuf> {
        self.into_os_string().into_string().map_err(Into::into)
    }
}

as it's just a small helper to make already possible functionality easier to access. There may be some objection that this isn't an operation you "should" be doing (or that you "should" be using camino for known-utf8 paths) and that the longer spelling pushes you towards doing the "right" thing more often, but it's already trivial to panic on invalid Unicode paths without one more Result returning method.

Thank you for pointing me to camino, I will consider using it in my project. But for simple code with minimal dependencies, an into_string method is tremendously helpful.

I don't buy the argument that this isn't what we should be doing, since more expensive APIs (such as to_str or to_string_lossy) are currently more convenient to call.

more expensive APIs (such as to_str or to_string_lossy) are currently more convenient to call.

OsString::into_string() does clone: playground.

All 3 paths do basically the same amount of allocation and copying and just allow handling the error at different points. OsString and String have different guarantees about stored data and internal representation which means the data has to be moved to turn one into another. The only difference is into_* does explicit drop of self.

Use path_buf.to_str().unwrap().to_string() and create a utility trait extending PathBuf if you do it often enough that you'd save a lot of time by calling just a single method. If you're not going to keep the OsString if the operation fails (i.e. you unwrap), I believe to_str might be more clear.
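
A utility trait of that kind might look like the sketch below. The trait name `PathBufExt` and the method name `into_utf8_string` are assumptions chosen for illustration, and the body uses the owned `into_os_string().into_string()` chain from the proposal earlier in the thread rather than `to_str().unwrap().to_string()`.

```rust
use std::path::PathBuf;

/// Hypothetical extension trait; neither the trait nor the method is part of std.
trait PathBufExt {
    fn into_utf8_string(self) -> Result<String, PathBuf>;
}

impl PathBufExt for PathBuf {
    fn into_utf8_string(self) -> Result<String, PathBuf> {
        // On failure, turn the returned OsString back into a PathBuf
        // so the caller keeps the original path.
        self.into_os_string().into_string().map_err(PathBuf::from)
    }
}

fn main() {
    let result = PathBuf::from("/hello/world").into_utf8_string();
    println!("{:?}", result);
}
```

Returning the PathBuf in the error case lets a caller that does not unwrap keep working with the original path.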

This playground is wrong. addr_of!(a) is the address of the variable holding the PathBuf or Result<String, …>, respectively, not the address of the data.

use std::path::PathBuf;

pub fn main() {
    let a = PathBuf::from("/hello/world");
    println!("a: {:p}", a.as_os_str());
    let a = a.into_os_string().into_string();
    println!("b: {:p}", a.as_ref().unwrap().as_str());
}
a: 0x5609d34bf9d0
b: 0x5609d34bf9d0

OsString/OsStr (and PathBuf/Path) do have guarantees about their internal data representation: if their data is a valid Unicode string, then it’s represented in UTF-8, just like String/str, as evidenced by the existence of the to_str methods (for Path[Buf], for OsStr[ing]) returning a borrowed str, i.e. pointing to data that already existed in this format.

pub fn to_str(&self) -> Option<&str>

Though the existence of this method does not by itself prove conclusively that converting between owned String and OsString/PathBuf can happen without cloning, it can, as the fixed playground demonstrates. For OsString: From<String>, this fact is even documented. Of course, the other direction still involves a linear scan to validate all the data, which is also why the method is called to_str, not as_str: it’s not a cheap constant-time operation. So converting OsString or PathBuf to String is not a lot cheaper than copying the data to a new allocation would be, but still, it doesn’t copy. (Converting the other way is cheap though.)
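
The cheap direction, String to PathBuf, can be checked the same way as in the fixed playground. This is a minimal sketch; `string_to_path_ptrs` is a hypothetical helper, and comparing pointers only shows what happens on the implementation it runs on.

```rust
use std::path::PathBuf;

/// Hypothetical helper: returns the data pointer before and after
/// converting a String into a PathBuf.
fn string_to_path_ptrs(s: String) -> (*const u8, *const u8) {
    let before = s.as_ptr();
    // From<String> for PathBuf is documented as not allocating or copying.
    let path = PathBuf::from(s);
    let after = path
        .to_str()
        .expect("built from a String, so it is valid UTF-8")
        .as_ptr();
    (before, after)
}

fn main() {
    let (before, after) = string_to_path_ptrs(String::from("/hello/world"));
    println!("before: {:p}", before);
    println!("after:  {:p}", after);
}
```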

To elaborate:

The byte encoding is an unspecified, platform-specific, self-synchronizing superset of UTF-8. By being a self-synchronizing superset of UTF-8, this encoding is also a superset of 7-bit ASCII.

In practice, this unspecified encoding is currently WTF-8.

To be super clear, that's only true for Windows (and, to reiterate, it's not currently a stable guarantee). Other platforms may or may not have their own UTF-8 superset encoding. On POSIX systems this is usually arbitrary bytes which, for the sake of simplicity, are assumed to be UTF-8-like by default. You can do a manual encoding/decoding using (for example) the C locale or whatever.
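
A Unix-only sketch of the arbitrary-bytes case: the byte sequence and the `non_utf8_path` helper are invented for illustration, and the code relies on `std::os::unix::ffi::OsStringExt`, so it will not compile on Windows.

```rust
use std::ffi::OsString;
use std::os::unix::ffi::OsStringExt; // Unix-only extension trait
use std::path::PathBuf;

/// Hypothetical helper: build a path whose bytes are not valid UTF-8.
fn non_utf8_path() -> PathBuf {
    // 0x80 is a lone continuation byte, which UTF-8 does not allow here.
    PathBuf::from(OsString::from_vec(vec![b'f', b'o', 0x80, b'o']))
}

fn main() {
    let path = non_utf8_path();
    // The strict borrowed accessor rejects the path.
    assert!(path.to_str().is_none());
    // The lossy accessor substitutes U+FFFD for the invalid byte.
    println!("lossy: {}", path.to_string_lossy());
    // The owned conversion fails and returns the original OsString.
    match path.into_os_string().into_string() {
        Ok(s) => println!("valid: {}", s),
        Err(raw) => println!("rejected: {:?}", raw),
    }
}
```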

Btw, Windows itself only really guarantees valid Unicode for paths. That paths may not be validated is an implementation detail.
