[Pre-RFC] Additional path handling utilities


#1

Summary

(Alternatively rendered RFC document here)

As part of feedback gathered by the CLI-WG we have concluded that there are several key aspects of path management missing from the Rust stdlib. Most of this pre-RFC is based on the summary made by this comment.

Motivation

The addition of these functions will have a positive impact on the Rust ecosystem as a whole. It will enable users to write CLI applications more quickly, without having to re-invent the wheel. Additionally these functions fit neatly into already existing structures and modules.

Guide-level explanation

Additions to the stdlib are made to the fs module, the path::PathBuf and path::Path constructs.

Firstly, a fs::normalize function is added that operates similarly to fs::canonicalize. The difference is that it doesn’t actually touch the filesystem. This means that it wouldn’t fail to normalize a path if a directory does not exist.

Non-destructive PathBuf::push

The PathBuf will gain a new, non-destructive append function which will operate similarly to the already existing push. However it will never overwrite the buffer, even if the provided path segment is absolute.

let mut path = PathBuf::from("/usr"); // --> /usr
path.append("/lib");                  // --> /usr/lib
path.push("/etc");                    // --> /etc
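To make the difference concrete, here is a minimal sketch of how append might behave, written as a free function because the method does not exist yet. The name, the signature and the choice to silently drop a leading root or prefix are assumptions made for illustration, not part of the proposal.

```rust
use std::path::{Component, Path, PathBuf};

/// Hypothetical stand-in for the proposed `PathBuf::append`: pushes every
/// component of `segment` except a leading prefix, root or `.`, so the
/// existing buffer is never replaced.
fn append(buf: &mut PathBuf, segment: impl AsRef<Path>) {
    for component in segment.as_ref().components() {
        match component {
            // These are the parts that would make `push` replace the buffer
            // (plus a leading `.`, which carries no information here).
            Component::Prefix(_) | Component::RootDir | Component::CurDir => {}
            // Normal names and `..` are appended unchanged.
            other => buf.push(other.as_os_str()),
        }
    }
}
```

With this helper, appending "/lib" to "/usr" yields /usr/lib on a POSIX system, while the existing push still replaces the buffer.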

Non-destructive Path::join

The Path will gain a new function concat which operates similarly to PathBuf::append. It always appends the provided path segments, even if they are absolute (see PathBuf::append).

Path::new("/usr").concat("/lib"); // --> /usr/lib
Path::new("/usr").join("/lib");   // --> /lib

Path::relative_to and Path::absolute

Path would also gain two more functions: one for computing a path relative to a base path, and one for computing the absolute path of any given relative one. These functions require the current working directory (CWD) to be available, although relative_to would not need it in cases where Path::starts_with is true for the base path.
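The case that does not need the CWD can already be expressed with the existing Path::strip_prefix. The sketch below shows only that case; the free-function form, the parameter order and the Option return type are assumptions, since the signatures are not specified here.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical sketch of `relative_to`, covering only the case where
/// `path.starts_with(base)` holds. Anything else would need the CWD and
/// is reported as `None` here.
fn relative_to(path: &Path, base: &Path) -> Option<PathBuf> {
    // `strip_prefix` succeeds exactly when `base` is a prefix of `path`.
    path.strip_prefix(base).ok().map(Path::to_path_buf)
}
```

For example, /usr/lib/rust relative to /usr gives lib/rust, while /etc relative to /usr gives None in this sketch.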

Reference-level explanation

While fs::normalize provides the same functional output as fs::canonicalize, its implementation is quite different, as it cannot rely on either realpath (for Unix) or CreateFile (Windows) to do its job.

This means that file path canonicalization needs to be done on its own, or using some other backend implementation.

The additional PathBuf and Path functions are similar enough in design that much of their code can be shared with PathBuf::push and Path::join respectively. Those existing functions can then become wrappers around PathBuf::append and Path::concat with different input sanitisation and handling.

Path::relative_to and Path::absolute require the most legwork. When Path::starts_with is false for the base path passed to Path::relative_to, the CWD is needed to compute the path difference, so there should be a generalized way of getting the current working directory.

Another function that exposes this could be added to the fs module, but that is entirely optional.

Drawbacks

The functions Path::relative_to and Path::absolute require additional utilities to be available in the fs module and could add unwanted complexity to it. In case this is a problem, these two functions can be excluded, with a crate potentially implementing them, without having any impact on the other additions made.

For the other functions no drawbacks exist that we are aware of at this time.

Rationale and alternatives

While all of these functions could be provided via external crates, we believe that including them in fs, PathBuf and Path respectively will yield the best developer experience in the long run, providing common ground between the applications and libraries that want to use them.

Overall, we expect the inclusion of these functions in the stdlib to have a positive effect on the Rust ecosystem.


#2

Thanks!

Could you specify the expected signatures? I feel it would add some clarity of what is intended with the different functions, even if some of them have parallels in the existing API.


#3

I happen to have been looking into path-normalization quite a bit recently, helping out with the path_abs crate.

Firstly, a fs::normalize function is added that operates similarly to fs::canonicalize. The difference is that it doesn’t actually touch the filesystem.

There are a few ways of turning a relative path into an absolute one.

The first is what canonicalize does - double-check every path-component against what’s on disk, resolving every symlink, dropping components followed by .., etc. The advantage of this process is that you get the One True Canonical Path to a given item; the disadvantages are that it’s expensive, and that it requires every path component to exist—a drag if your program wants to talk about paths that might exist, or that you want to create.

The second is what Go’s Clean() function does - completely ignore what’s on disk, and just drop every component followed by .. regardless. The advantage of this process is that it’s fast; the disadvantage is that it can change the meaning of a path, meaning that it refers to a different file on disk than the original, un-Cleaned path did.

The third is what Python’s resolve() does - canonicalize each component until you get ErrorKind::NotFound, then clean the remainder. Unlike canonicalizing, this works for paths that don’t exist on disk; unlike cleaning, this is still disk-intensive.

Some people might wonder why Clean() can change the meaning of a path. It happens when a path-component that refers to a symlink is followed by a .. component. If you’ve got a POSIXy computer around, here’s an experiment to try:

mkdir -p /tmp/foo/a /tmp/bar
ln -s /tmp/foo/a /tmp/bar/a
echo expected > /tmp/bar/result
echo gotcha > /tmp/foo/result
cat /tmp/bar/a/../result

You might expect that /tmp/bar/a/../result is the same as /tmp/bar/result and hence these commands print “expected”, but that’s not what happens! Because /tmp/bar/a is a symlink to /tmp/foo/a, whose .. directory entry points at /tmp/foo, /tmp/bar/a/.. is the same as /tmp/foo. Clean() pretends that /tmp/bar/a/.. is the same as /tmp/bar, which it clearly is not.

And so I have a fourth algorithm for turning a relative path into an absolute one, that I call the “monotonic” algorithm: when you see a .. component, check the previous component. If it exists and it’s a symlink, resolve it, but if it doesn’t exist or it isn’t a symlink, drop it. Path components not followed by .. never need to be checked. Unlike canonicalize(), this algorithm doesn’t require the path to exist, and for paths without .. it doesn’t touch the disk at all. Unlike Clean(), it always produces a correct path - maybe not the One True Canonical Path, but one that descends more-or-less directly (“monotonically”) toward the target.
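Here is a rough sketch of that “monotonic” handling of .. in Rust. It assumes the input is already absolute (for example, joined onto the CWD beforehand), and it uses fs::canonicalize as the way to “resolve” a symlink, which is one possible reading of the description rather than a reference implementation. The function name is made up for illustration.

```rust
use std::fs;
use std::io;
use std::path::{Component, Path, PathBuf};

/// Sketch of the "monotonic" algorithm for an absolute `path`: the disk is
/// only consulted when a component is followed by `..`.
fn monotonic(path: &Path) -> io::Result<PathBuf> {
    let mut res = PathBuf::new();
    for component in path.components() {
        match component {
            Component::CurDir => {}
            Component::ParentDir => {
                // Check whether what we have built so far ends in a symlink.
                // A missing entry or any other error counts as "not a symlink".
                let is_symlink = fs::symlink_metadata(&res)
                    .map(|meta| meta.file_type().is_symlink())
                    .unwrap_or(false);
                if is_symlink {
                    // Resolve the symlink so that `..` applies to its target.
                    // Assumption: a dangling symlink is reported as an error.
                    res = fs::canonicalize(&res)?;
                }
                // Drop the (possibly resolved) previous component.
                res.pop();
            }
            // Prefix, root and normal components are pushed without any check.
            other => res.push(other.as_os_str()),
        }
    }
    Ok(res)
}
```

Run against the experiment above, /tmp/bar/a/../result would come out as /tmp/foo/result on a system where /tmp itself is not a symlink, while a path whose components don’t exist is simply cleaned.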

It sounds like fs::normalize is proposed to match the behaviour of Clean(); I hope that doesn’t make it into the standard library, since that algorithm prioritizes speed over correctness, at least in a corner-case. I’d much rather my alternative algorithm be implemented, since it’s just as fast in the common case, but prioritizes correctness over speed in that same corner case.

This means that file path canonicalization needs to be done on its own, or using some other backend implementation.

Here’s a basic, unoptimized outline:

  • Let res be the current working directory.
  • Call fs::canonicalize() on res. On POSIX this does nothing, but on Windows it puts the path into extended-length path (“verbatim”) syntax, which makes life a whole lot simpler in general.
  • For each component of the path to be canonicalized:
    • If the component is a Prefix component, canonicalize it and push it to res, which effectively replaces the existing content. This canonicalization means you’ll keep the path in extended-length path syntax, and you’ll correctly handle paths like C:foo (relative to the current directory on drive C:).
    • If the component is a RootDir component, do not canonicalize it, and push it to res, which effectively truncates res to its prefix. You might think you need to canonicalize to support paths like \foo (relative to the current drive), but (a) that’s actually solved by initializing res to the current working directory, and (b) Windows fails to canonicalize \ when the current working directory uses extended-length path syntax.
    • If the component is a CurrentDir component, you can ignore it.
    • If the component is a Normal component, push it to res. You don’t have to worry about special Windows names like CON or LPT because res uses extended-length path syntax.
    • If the component is a ParentDir component, you can choose whether you want to just pop the previous component (the Clean() algorithm) or resolve it (the “monotonic” algorithm).
  • res now contains the absolute path you wanted.
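The outline above can be sketched roughly as follows, using the Clean()-style choice for ParentDir. The function name is made up, and the Prefix arm simply follows the outline as written; I have only traced the POSIX behaviour by hand, so treat the Windows arms as untested assumptions.

```rust
use std::env;
use std::fs;
use std::io;
use std::path::{Component, Path, PathBuf};

/// Sketch of the outline: build an absolute path from `path`, starting at
/// the canonicalized current working directory.
fn absolutize(path: &Path) -> io::Result<PathBuf> {
    // Let res be the CWD, canonicalized (verbatim syntax on Windows).
    let mut res = fs::canonicalize(env::current_dir()?)?;
    for component in path.components() {
        match component {
            // A prefix such as `C:` is canonicalized; pushing the absolute
            // result replaces the existing content of `res`.
            Component::Prefix(prefix) => res.push(fs::canonicalize(prefix.as_os_str())?),
            // Pushing a bare root truncates `res` to its prefix (if any).
            Component::RootDir => res.push(component.as_os_str()),
            // `.` can be ignored.
            Component::CurDir => {}
            // Ordinary names are appended as they are.
            Component::Normal(name) => res.push(name),
            // Clean()-style choice: drop the previous component without
            // looking at the disk.
            Component::ParentDir => {
                res.pop();
            }
        }
    }
    Ok(res)
}
```

Swapping the ParentDir arm for a symlink check gives the “monotonic” variant; everything else stays the same.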

That outline mentions a bunch of Windows-specific concerns, but on POSIX platforms the same logic works unmodified.

Sorry if this is a bit long and rambly, but it’s 1AM and apparently I have Strong Opinions about path normalization. :confused:


#4

I’m probably in the minority but I have a couple applications that manipulate paths for what will be written to an empty directory (or even zip file). The paths are usually relative when I’m operating on them and relative to a directory besides CWD. I have my own hacked up “Clean” functions and would be appreciative of a version of “Clean” making it into the standard library.

On a related note, coming from Python’s pathlib, I missed the clarity of what are “pure” functions (doesn’t touch FS) and what are “concrete” functions (might touch FS). Would love for us to explicitly specify these. I think this would help clarify canonicalize from clean.


#5

Just chiming in that I think this proposal looks quite nice, the due diligence done to check the status quo of resolving paths is impressive and the way forward looks very promising.

I’m looking forward to adopting many portions of it in a third-party library, which discards the non-salvageable parts of Path/PathBuf (join, mixing absolute and relative paths etc.) and has a stronger focus on cross-platform compatibility.

I think there was some discussion recently on overhauling Rust’s path handling on Windows, does this RFC interact with it?


#6

The paths are usually relative when I’m operating on them and relative to a directory besides CWD. I have my own hacked up “Clean” functions and would be appreciative of a version of “Clean” making it into the standard library.

Hmm… I definitely understand why you’d want a collection of relative paths to describe files in an archive, but I don’t quite see where you’d use Clean(). Do you build up internal paths with a lot of .. components and then run them all through Clean() to decide what paths to write to the resulting archive?

The OP says "a fs::normalize function is added that operates similarly to fs::canonicalize", which I guess involves turning relative into absolute paths. If you just want to clean up relative paths without making them absolute, that seems reasonable, but probably should be a separate function. For example, cleaning the relative path ../foo should probably return an error since it escapes from the path it’s relative to, and the same goes for /foo. However, both of those would be perfectly fine inputs to a “relative to absolute” function.
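As a sketch of what such a separate function might look like (the name is made up, and I’m using Option instead of a proper error type to keep it short):

```rust
use std::path::{Component, Path, PathBuf};

/// Hypothetical lexical clean-up for relative paths only. Returns `None`
/// when the input is absolute or when `..` would escape the directory the
/// path is relative to. It never touches the filesystem.
fn clean_relative(path: &Path) -> Option<PathBuf> {
    let mut cleaned = PathBuf::new();
    for component in path.components() {
        match component {
            // Absolute inputs such as /foo are rejected outright.
            Component::Prefix(_) | Component::RootDir => return None,
            Component::CurDir => {}
            Component::Normal(name) => cleaned.push(name),
            Component::ParentDir => {
                // `pop` returns false when nothing is left to drop, which
                // means the path climbs above its starting point.
                if !cleaned.pop() {
                    return None;
                }
            }
        }
    }
    Some(cleaned)
}
```

Here a/b/../c cleans to a/c, while both ../foo and /foo are rejected.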


#7

I see people express these kinds of thoughts, but there are a lot of cases where join’s behavior is exactly what people want. I feel I’m not the only one because it seems every major Path OOP library does it this way.


#8

Part of it is cleaning up direct user input and how different pieces of user input interact.

The OP says "a fs::normalize function is added that operates similarly to fs::canonicalize", which I guess involves turning relative into absolute paths. If you just want to clean up relative paths without making them absolute, that seems reasonable, but probably should be a separate function. For example, cleaning the relative path ../foo should probably return an error since it escapes from the path it’s relative to, and the same goes for /foo. However, both of those would be perfectly fine inputs to a “relative to absolute” function.

This is one of the reasons I want signatures; I feel they would help clarify intent on these questions. Are errors expected? Is the “CWD” actually a CWD or an input path? etc.


#9

I think the current path handling is a remnant of old times in which languages had poor error handling capabilities/weak typesystems, so these languages just tried to do “something” with the given inputs instead of crashing (like you can still see today in languages like PHP or JS).

Rust being a typed language, I have a hard time to understand why somePath.join(absolutePath) should even type-check. Either the user wants to replace the existing path, or she wants to append to it – having one function with the semantics of either the former or the latter, depending on the presence or absence of a single character at a specific place in a string, feels ill-guided.


#10

In my mind, there are two reasons

  • Another way to frame somePath.join(absolutePath) is customCwd.join(absolutePath). It works really well for treating any directory as your CWD without changing global process state. This also by extension works well when you want to force a path under a directory (join and then check starts_with)
  • Simplicity. We can express a lot in the type system but we also need to decide what is worthwhile. There are multiple dimensions to work along (absolute / relative / unknown, file / directory / symlink / unknown, exists / doesn’t exist, owned / borrowed). Do we prioritize one dimension or do we make combinations of the dimensions? How do we help the user navigate what the starting point in the API is and how they can move between the different states?
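To illustrate the first point, here is a small sketch of the “join and then check starts_with” pattern. The helper name is made up; only join and starts_with are existing std APIs.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper: resolve `user` against `base` as if `base` were the
/// CWD, and accept the result only if it still lies under `base`.
fn resolve_under(base: &Path, user: &Path) -> Option<PathBuf> {
    // Relative input is appended; absolute input replaces `base`.
    let joined = base.join(user);
    // `starts_with` compares whole components and does not touch the disk.
    if joined.starts_with(base) {
        Some(joined)
    } else {
        None
    }
}
```

A relative logs/a.txt lands under the base, an absolute path outside it is rejected, and an absolute path that already points inside the base is accepted as-is. Note that the check is purely lexical, so inputs containing .. would still need cleaning first.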

With that said, I don’t think a one-size-fits-all API is best. I appreciate the PathBuf for what it is. I could see you all continuing in the type-heavy path_abs, and I’ve been considering the idea of a rapid prototyping string/path/buffer API.


#11

I’m really largely unconvinced by those two points.

  • I have read your first point multiple times now, and I still don’t understand what you are trying to accomplish. This makes me even more convinced that it’s not the best idea to have “append one path to another” share the same function with “do something clever for treating any directory as your CWD without changing global process state”. I’m not arguing the cleverness shouldn’t be possible, I just don’t want it to piggy-back on some innocent-looking method.

  • People have engineered exceedingly complex abstractions leveraging every bit of power Rust’s type system provides – both inside and outside the standard library … but at “let’s not pretend it makes sense to append an absolute path to a relative path” we have gone a bridge too far? :wink: Where exactly would you see this huge complexity coming in if you changed e.g. join's return type to an Option?
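For what it’s worth, a join that returns an Option could be sketched like this. It is a hypothetical free function, not an existing or proposed API, and returning None for absolute segments is just one possible design.

```rust
use std::path::{Component, Path, PathBuf};

/// Hypothetical `join` variant that refuses segments which would replace
/// `base` instead of extending it.
fn try_join(base: &Path, segment: &Path) -> Option<PathBuf> {
    match segment.components().next() {
        // A leading prefix (Windows) or root makes `join` discard `base`.
        Some(Component::Prefix(_)) | Some(Component::RootDir) => None,
        // Everything else is a plain append.
        _ => Some(base.join(segment)),
    }
}
```

The caller then has to decide explicitly what an absolute segment should mean, instead of getting a silent replacement.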

I don’t think a one-size-fits-all API is best.

That’s exactly why join was a mistake I want to fix in my library (and probably migrate dirs to it). :slight_smile:


#12

The current semantics of .join are very useful when you accept a path from the user, and that path may be either relative to something or absolute. For example, in Cargo you can specify various paths (dependencies.foo = { path = "..."}) as either absolute or relative (to Cargo.toml parent dir). To handle this situation, you need exactly the semantics that .join has. Here’s the specific code example: https://github.com/rust-lang/cargo/blob/f9926c68f6371c51efeb5b71429a0ec2e5deb54d/src/cargo/util/toml/mod.rs#L1213.
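A simplified illustration of that pattern (this is not the Cargo code linked above; the function name and the example paths are invented):

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper: resolve a user-supplied dependency path against the
/// directory containing the manifest.
fn resolve_dep_path(manifest: &Path, dep: &str) -> PathBuf {
    // Fall back to an empty path if the manifest has no parent directory.
    let root = manifest.parent().unwrap_or_else(|| Path::new(""));
    // `join` appends a relative `dep` and lets an absolute `dep` win.
    root.join(dep)
}
```

A relative "../foo" ends up next to the manifest directory, and an absolute "/opt/foo" is used unchanged, with no branching in the caller.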


#13

Why not write a function that implements this very specific mixture of both a replace and a join function, and give it some sensible name? It would be way more readable.

I’d wager that at least half of the people using join in their code

  • have no idea how it behaves in error cases such as path.join(absolute path),
  • wouldn’t have chosen this function if they knew about this behavior, and
  • are not prepared to handle the function’s behavior in their code.

#14

I can’t even tell you how frustrated I was when I found out that Rust’s path.join(abspath) was completely different from other languages. It took me a few hours to understand what was going on.


#15

Agree that the naming could probably be better, and that the fact that this differs from other languages is surprising (I’ve been confused over this myself).

That said, at least in my experience, this is exactly the behavior you want most of the time, when the path signifies some file on the local machine (cases like command-line flags, configuration options, etc).

Once I understood how join works, I was like “wow, this is the API I’ve been looking for my whole life, without even knowing about it”.


#16

I’m curious, which languages?

Python and C++ behave like Rust.

Go is different, but from the overview of Go that I read, it felt like their path handling was more like string operations / shell operations than a path-oriented data structure.


#17

I think people here are not arguing that multiplyAdd (join) shouldn’t exist or that it isn’t useful, but that it would be nice to also have the more fundamental operations add (append) and multiply (replace), so that users don’t have to reverse engineer the behavior they want out of multiplyAdd – or mistakenly use multiplyAdd where they actually wanted to use add.


#18

Thanks for clarifying your intent. Personally, it did come across that way (“non-salvageable”, “join was a mistake”, “hard time to understand why [it] should type check”). I hope it can be seen why we took it that way.

Also, for completeness, I felt I should remind people that the winner of the 2016 Underhanded Rust Contest leveraged join.


#19

Ruby:

irb(main):002:0> File.join("some","/path","yknow")
=> "some/path/yknow"

NodeJS:

❯ node
> const path = require('path');
undefined
> path.join("some","/path","yknow");
'some/path/yknow'

#20

Thanks!

Those look like they might be in the same vein as Go, more path-string manipulation rather than OOP Path like Python’s pathlib and C++'s path API. Not saying this to dismiss it but to check for understanding and putting them in context of each other.