[Pre-RFC] Additional path handling utilities

Summary

(Alternatively rendered RFC document here)

Based on feedback gathered by the CLI-WG, we have concluded that several key aspects of path management are missing from the Rust stdlib. Most of this pre-RFC is based on the summary made in this comment.

Motivation

The addition of these functions will have a positive impact on the Rust ecosystem as a whole. It will enable users to write CLI applications more quickly, without having to re-invent the wheel. Additionally, these functions fit neatly into already existing structures and modules.

Guide-level explanation

Additions to the stdlib are made to the fs module and to the path::PathBuf and path::Path types.

Firstly, a fs::normalize function is added that operates similarly to fs::canonicalize. The difference is that it doesn’t actually touch the filesystem. This means that it wouldn’t fail to normalize a path if a directory does not exist.

Non-destructive PathBuf::push

The PathBuf will gain a new, non-destructive append function which will operate similarly to the already existing push. However it will never overwrite the buffer, even if the provided path segment is absolute.

let mut path = PathBuf::from("/usr"); // --> /usr
path.append("/lib");                  // --> /usr/lib
path.push("/etc");                    // --> /etc
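A minimal sketch of how the append behaviour could be emulated today, assuming it is enough to skip any leading root or prefix. The free function `append` is a hypothetical stand-in for the proposed method, not an existing stdlib API:

```rust
use std::path::{Component, Path, PathBuf};

/// Hypothetical stand-in for the proposed `PathBuf::append`: adds `segment`
/// to `buf` without ever replacing what is already there.
fn append(buf: &mut PathBuf, segment: impl AsRef<Path>) {
    for component in segment.as_ref().components() {
        match component {
            // A leading root or Windows prefix is what makes `push` replace
            // the buffer, so this sketch simply skips them.
            Component::Prefix(_) | Component::RootDir => {}
            // Everything else (including `.` and `..`) is pushed unchanged.
            other => buf.push(other.as_os_str()),
        }
    }
}
```

With this helper, appending "/lib" to "/usr" gives /usr/lib, while a later push("/etc") still replaces the whole buffer.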

Non-destructive Path::join

The Path will gain a new function concat which operates similarly to PathBuf::append. It always appends the provided path segments, even if they are absolute (see PathBuf::append).

Path::new("/usr").concat("/lib"); // --> /usr/lib
Path::new("/usr").join("/lib");   // --> /lib

Path::relative_to and Path::absolute

Path would also gain two more functions: relative_to, for computing a path relative to a base one, and absolute, for computing the absolute path of any given relative one. These functions require the current working directory (CWD) to be available, although relative_to would not need it in cases where Path::starts_with is true.
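Since no signatures are given, the following is only a rough sketch of what these two functions might look like, built on the existing `env::current_dir`, `Path::join` and `Path::strip_prefix`. The names, signatures and return types are assumptions, and `relative_to` only covers the case where the path starts with the base:

```rust
use std::env;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical `absolute`: resolve `path` against the current working directory.
fn absolute(path: &Path) -> io::Result<PathBuf> {
    // `join` returns `path` unchanged when it is already absolute.
    Ok(env::current_dir()?.join(path))
}

/// Hypothetical `relative_to`, covering only the case where `path` starts
/// with `base`.
fn relative_to<'a>(path: &'a Path, base: &Path) -> Option<&'a Path> {
    // `strip_prefix` is purely lexical, so no CWD lookup is needed here.
    path.strip_prefix(base).ok()
}
```

The sketch leaves `..` components untouched in `absolute`, and returns `None` from `relative_to` whenever the CWD would be needed.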

Reference-level explanation

While fs::normalize provides the same functional output as fs::canonicalize, its implementation is quite different, as it cannot rely on either realpath (for Unix) or CreateFile (Windows) to do its job.

This means that file path canonicalization needs to be done on its own, or using some other backend implementation.

The additional PathBuf and Path functions are similar enough in design that a lot of code can be shared with PathBuf::push and Path::join respectively. Those functions could even become wrappers around PathBuf::append and Path::concat with different input sanitisation and handling.

Path::relative_to and Path::absolute require the most leg-work. When Path::starts_with is false for the base path given to Path::relative_to, the CWD is required to compute the path difference, so there should be a generalized way of getting the current working directory.

Another function that exposes this could be added to the fs module, but this is entirely optional.

Drawbacks

The functions Path::relative_to and Path::absolute require additional utilities to be available in the fs module and could add unwanted complexity to it. In case this is a problem, these two functions can be excluded, with a crate potentially implementing them, without having any impact on the other additions made.

For the other functions no drawbacks exist that we are aware of at this time.

Rationale and alternatives

While all of these functions could be provided via external crates, we believe that including them in fs, PathBuf and Path respectively will yield the best developer experience in the long run, providing common ground between the applications and libraries that want to use them.

Overall, we expect the inclusion of these functions in the stdlib to have a positive effect on the Rust ecosystem.

6 Likes

Thanks!

Could you specify the expected signatures? I feel it would add some clarity of what is intended with the different functions, even if some of them have parallels in the existing API.

1 Like

I happen to have been looking into path-normalization quite a bit recently, helping out with the path_abs crate.

Firstly, a fs::normalize function is added that operates similarly to fs::canonicalize. The difference is that it doesn’t actually touch the filesystem.

There are a few ways of turning a relative path into an absolute one.

The first is what canonicalize does - double-check every path-component against what's on disk, resolving every symlink, dropping components followed by .., etc. The advantage of this process is that you get the One True Canonical Path to a given item; the disadvantages are that it's expensive, and that it requires every path component to exist—a drag if your program wants to talk about paths that might exist, or that you want to create.

The second is what Go's Clean() function does - completely ignore what's on disk, and just drop every component followed by .. regardless. The advantage of this process is that it's fast; the disadvantage is that it can change the meaning of a path, meaning that it refers to a different file on disk than the original, un-Cleaned path did.

The third is what Python's resolve() does - canonicalize each component until you get ErrorKind::NotFound, then clean the remainder. Unlike canonicalizing, this works for paths that don't exist on disk; unlike cleaning, this is still disk-intensive.

Some people might wonder why Clean() can change the meaning of a path. It happens when a path-component that refers to a symlink is followed by a .. component. If you've got a POSIXy computer around, here's an experiment to try:

mkdir -p /tmp/foo/a /tmp/bar
ln -s /tmp/foo/a /tmp/bar/a
echo expected > /tmp/bar/result
echo gotcha > /tmp/foo/result
cat /tmp/bar/a/../result

You might expect that /tmp/bar/a/../result is the same as /tmp/bar/result and hence these commands print "expected", but that's not what happens! Because /tmp/bar/a is a symlink to /tmp/foo/a, whose .. directory entry points at /tmp/foo, /tmp/bar/a/.. is the same as /tmp/foo. Clean() pretends that /tmp/bar/a/.. is the same as /tmp/bar, which it clearly is not.

And so I have a fourth algorithm for turning a relative path into an absolute one, that I call the "monotonic" algorithm: when you see a .. component, check the previous component. If it exists and it's a symlink, resolve it, but if it doesn't exist or it isn't a symlink, drop it. Path components not followed by .. never need to be checked. Unlike canonicalize(), this algorithm doesn't require the path to exist, and for paths without .. it doesn't touch the disk at all. Unlike Clean(), it always produces a correct path - maybe not the One True Canonical Path, but one that descends more-or-less directly ("monotonically") toward the target.

It sounds like fs::normalize is proposed to match the behaviour of Clean(); I hope that doesn't make it into the standard library, since that algorithm prioritizes speed over correctness, at least in a corner-case. I'd much rather my alternative algorithm be implemented, since it's just as fast in the common case, but prioritizes correctness over speed in that same corner case.

This means that file path canonicalization needs to be done on its own, or using some other backend implementation.

Here's a basic, unoptimized outline:

  • Let res be the current working directory.
  • Call fs::canonicalize() on res. On POSIX this does nothing, but on Windows it puts the path into extended-length path ("verbatim") syntax, which makes life a whole lot simpler in general.
  • For each component of the path to be canonicalized:
    • If the component is a Prefix component, canonicalize it and push it to res, which effectively replaces the existing content. This canonicalization means you'll keep the path in extended-length path syntax, and you'll correctly handle paths like C:foo (relative to the current directory on drive C:).
    • If the component is a RootDir component, do not canonicalize it, and push it to res, which effectively truncates res to its prefix. You might think you need to canonicalize to support paths like \foo (relative to the current drive), but (a) that's actually solved by initializing res to the current working directory, and (b) Windows fails to canonicalize \ when the current working directory uses extended-length path syntax.
    • If the component is a CurrentDir component, you can ignore it.
    • If the component is a Normal component, push it to res. You don't have to worry about special Windows names like CON or LPT because res uses extended-length path syntax.
    • If the component is a ParentDir component, you can choose whether you want to just pop the previous component (the Clean() algorithm) or resolve it (the "monotonic" algorithm).
  • res now contains the absolute path you wanted.

That outline mentions a bunch of Windows-specific concerns, but on POSIX platforms the same logic works unmodified.

Sorry if this is a bit long and rambly, but it's 1AM and apparently I have Strong Opinions about path normalization. :confused:

5 Likes

I'm probably in the minority but I have a couple applications that manipulate paths for what will be written to an empty directory (or even zip file). The paths are usually relative when I'm operating on them and relative to a directory besides CWD. I have my own hacked up "Clean" functions and would be appreciative of a version of "Clean" making it into the standard library.

On a related note, coming from Python's pathlib, I miss the clarity about which functions are "pure" (don't touch the FS) and which are "concrete" (might touch the FS). Would love for us to explicitly specify these. I think this would help distinguish canonicalize from clean.

1 Like

The paths are usually relative when I’m operating on them and relative to a directory besides CWD. I have my own hacked up “Clean” functions and would be appreciative of a version of “Clean” making it into the standard library.

Hmm... I definitely understand why you'd want a collection of relative paths to describe files in an archive, but I don't quite see where you'd use Clean(). Do you build up internal paths with a lot of .. components and then run them all through Clean() to decide what paths to write to the resulting archive?

The OP says "a fs::normalize function is added that operates similarly to fs::canonicalize", which I guess involves turning relative into absolute paths. If you just want to clean up relative paths without making them absolute, that seems reasonable, but probably should be a separate function. For example, cleaning the relative path ../foo should probably return an error since it escapes from the path it's relative to, and the same goes for /foo. However, both of those would be perfectly fine inputs to a "relative to absolute" function.

I see people express these kinds of thoughts, but there are a lot of cases where join's behavior is exactly what people want. I feel I'm not the only one, because it seems every major OOP path library does it this way.

1 Like

Part of it is cleaning up direct user input and how different pieces of user input interact.

The OP says "a fs::normalize function is added that operates similarly to fs::canonicalize", which I guess involves turning relative into absolute paths. If you just want to clean up relative paths without making them absolute, that seems reasonable, but probably should be a separate function. For example, cleaning the relative path ../foo should probably return an error since it escapes from the path it’s relative to, and the same goes for /foo. However, both of those would be perfectly fine inputs to a “relative to absolute” function.

This is one of the reasons I want signatures; I feel they would help clarify intent on these questions. Are errors expected? Is the "CWD" actually a CWD or an input path? etc.

In my mind, there are two reasons:

  • Another way to frame somePath.join(absolutePath) is customCwd.join(absolutePath). It works really well for treating any directory as your CWD without changing global process state. This also by extension works well when you want to force a path under a directory (join and then check starts_with)
  • Simplicity. We can express a lot in the type system but we also need to decide what is worthwhile. There are multiple dimensions to work along (absolute / relative / unknown, file / directory / symlink / unknown, exists / doesn't exist, owned / borrowed). Do we prioritize one dimension or do we make combinations of the dimensions? How do we help the user navigate what the starting point in the API is and how they can move between the different states?
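The "join and then check starts_with" pattern from the first point could look roughly like this. The helper name `resolve_under` is an assumption, and the check is purely lexical, so `..` components and symlinks would need separate handling:

```rust
use std::path::{Path, PathBuf};

/// Treat `base` as a stand-in CWD, then insist the result stays under it.
fn resolve_under(base: &Path, user_path: &Path) -> Option<PathBuf> {
    // An absolute `user_path` replaces `base` here, which the check below catches.
    let joined = base.join(user_path);
    // Lexical check only: `..` components and symlinks are not resolved.
    if joined.starts_with(base) {
        Some(joined)
    } else {
        None
    }
}
```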

With that said, I don't think a one-size-fits-all API is best. I appreciate PathBuf for what it is. I could see you all continuing in the type-heavy path_abs, and I've been considering the idea of a rapid prototyping string/path/buffer API.

1 Like

The current semantics of .join are very useful when you accept a path from the user, and that path may be either relative to something or absolute. For example, in Cargo you can specify various paths (dependencies.foo = { path = "..."}) as either absolute or relative (to Cargo.toml parent dir). To handle this situation, you need exactly the semantics that .join has. Here's the specific code example: cargo/src/cargo/util/toml/mod.rs at f9926c68f6371c51efeb5b71429a0ec2e5deb54d · rust-lang/cargo · GitHub.
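A tiny illustration of that use case. This is not Cargo's actual code, and the helper name `resolve_dep_path` is made up for the example:

```rust
use std::path::{Path, PathBuf};

/// Resolve a user-configured path against the directory holding the manifest.
fn resolve_dep_path(manifest_dir: &Path, configured: &str) -> PathBuf {
    // Relative input is appended to `manifest_dir`; absolute input replaces it.
    manifest_dir.join(configured)
}
```

With a manifest directory of /work/app, "../shared" becomes /work/app/../shared, while "/opt/shared" stays /opt/shared.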

3 Likes

I can't even tell how frustrated I was when I found out that Rust's path.join(abspath) was completely different from other languages. It took me a few hours to understand what was going on.

Agree that the naming could probably be better, and that the fact that this differs from other languages is surprising (I’ve been confused over this myself).

That said, at least in my experience, this is exactly the behavior you want most of the time, when the path signifies some file on the local machine (cases like command-line flags, configuration options, etc).

When I understood how join works, I was like “wow, this is the API I’ve been looking for my whole life, without even knowing about it”.

1 Like

I'm curious, which languages?

Python and C++ behave like Rust.

Go is different, but from the overview of Go that I read, it felt like their path handling was more like string operations / shell operations than a path-oriented data structure.

Thanks for clarifying your intent. Personally, it did come across that way (“non-salvagable”, “join was a mistake”, “hard time to understand why [it] should type checl”). I hope it can be seen why we took it that way.

Also, for completeness, I felt I should remind people that the winner of the 2016 Underhanded Rust Contest leveraged join.

2 Likes

Ruby:

irb(main):002:0> File.join("some","/path","yknow")
=> "some/path/yknow"

NodeJS:

❯ node
> const path = require('path');
undefined
> path.join("some","/path","yknow");
'some/path/yknow'

Thanks!

Those look like they might be in the same vein as Go, more path-string manipulation rather than OOP Path like Python’s pathlib and C++'s path API. Not saying this to dismiss it, but to check my understanding and put them in context of each other.

Ok, so your concern is mostly with the name and not the functionality?

Like @matklad, I love join's functionality, but I can definitely see the argument that it has a misleading name. As another example, somebody experienced with Vec::push() and Vec::pop() could be quite confused by PathBuf::push() and PathBuf::pop(). Not only is PathBuf::pop()'s return-type different, but pushing a value and then immediately popping is not a no-op: if you push "/" then your path is (mostly) cleared.
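A small sketch to make the push-then-pop point concrete, assuming a POSIX-style path. The helper `push_then_pop` exists only for this example:

```rust
use std::path::PathBuf;

/// Pushes `segment` and immediately pops again.
fn push_then_pop(mut path: PathBuf, segment: &str) -> PathBuf {
    path.push(segment);
    // Unlike `Vec::pop`, this returns a bool rather than the removed element.
    path.pop();
    path
}
```

Starting from /usr/lib, pushing "bin" and popping gives back /usr/lib, but pushing "/" and popping leaves just /.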

The Rust docs for Path::join() and PathBuf::push() explicitly call out that they use "adjoining" rules rather than simple concatenation; maybe join should have been adjoin. It's a bit late to change it now, I guess, and you really need two names: one for the Path method and one for the PathBuf method.

Another thing that comes to mind: while I've felt the best justification for join's behaviour is "that's what the kernel does", there are other situations where simple concatenation is the Right Thing. For example, consider code that expands a string like this:

${HOME}/.config/myapp

The obvious implementation would see ${HOME}, expand it as an environment variable, then concatenate the rest of the string. Unfortunately, .push("/.config/myapp") will do the wrong thing here, and it's not immediately obvious how to do it correctly. Off the top of my head, it'd be something like:

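use std::path::{Component, Path, PathBuf};

// Append `other` to `res` one component at a time, never replacing `res`.
fn append(res: &mut PathBuf, other: &Path) -> Result<(), &'static str> {
    for component in other.components() {
        match component {
            // `.` does nothing; a root in the middle of a path is just a
            // repeated delimiter, so it is ignored as well
            Component::CurrentDir | Component::RootDir => (),
            Component::Normal(name) => res.push(name),
            // modify according to taste
            Component::ParentDir => { res.pop(); }
            Component::Prefix(_) => return Err("lolwut"),
        }
    }
    Ok(())
}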

...which doesn't roll off the tongue nearly as easily as .push(), or the .append() method described in the RFC.

As a single type representing the hazy notion of a "path" (which the standard library needs anyway, for use with File::open() and create_dir(), etc.) I think PathBuf is pretty reasonable. I'd definitely like to see an API with separate newtypes for absolute and relative paths, though, and I'd like to try it out to see if it's as pleasant to use as it would be to design. Battle-testing such an API could definitely be done in a third-party crate, though, and the methods proposed in this RFC would make it even easier.

For new APIs or anyone attempting to push for a "rename" (deprecate join, create new fn), another option is merge.

Doesn't that kind of prove the point that it is entirely misnamed and misleading in its functionality? I think that is what many of the posters are getting at (or at least that's how I'm reading the discussion). It seems quite odd to me as well that a method called "join" might "replace" depending on its argument. That is entirely surprising behavior, whether or not it is appropriately documented or desirable in most circumstances.

This is a little different, IMO. This is comparing a join utility function/method, rather than an instance method. Personally, join behaves exactly as I expected it to, probably because I'm most familiar with ruby's Pathname class:

require 'pathname'

p = Pathname.new("some")     # => #<Pathname:some>
p.join("/path", "yknow")     # => #<Pathname:/path/yknow>

For me, operating on an existing PathBuf, I expect join to give me the equivalent to changing around directories in a filesystem, and rely on this behavior to allow a user to specify a path which is either absolute or relative to some other (often non-cwd) directory. This is the relative-to-Cargo.toml behavior.

On the other hand, a hypothetical (variadic) PathBuf::join("some", "/path", "yknow") I'd expect to mirror the ruby/node utility functions.

Just for completeness, there's also this:

println!("{:?}", ["some", "/path", "yknow"].iter().collect::<PathBuf>());
// => "/path/yknow"

I do agree that join as a name is, at best, vague. Also, I think referring to these as "non-destructive" is a little unclear -- I initially thought you meant the operations returned a new PathBuf, rather than altering the existing buffer.

And I guess we could bikeshed these until the end of time, but the difference between append, push, concat and join feels semantically vague. Thanks to Ruby and JavaScript, I'll never really know what to expect from concat. Ruby's concat mutates the receiver in place; JavaScript's returns a newly allocated object. Wheeeee. :smiley:

Finally, I think relying on the current working directory for anything related to generating relative paths is not a good idea. (OTOH, I don't totally understand the reasoning around needing to get the CWD, so I might be missing something.) The relative_path_from method in Ruby's Pathname class has worked well for my use-cases.

2 Likes