[Pre-RFC] Additional path handling utilities

gbutler · September 16, 2018, 2:11am

So, why is the method called "join" instead of "cd" or "chdir" or something like that? I really don't think a good justification here for why the method is called "join" is that other languages have used that name. It's just a BAD NAME(tm) that doesn't reflect what the operation is doing (IMHO).

uberjay · September 16, 2018, 2:27am

I almost suggested that, but it doesn't really make sense if you're joining with a non-directory path. I couldn't think of a Good Name™ for it. Not a short one, anyway. join_respecting_absolute?
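For readers who have not run into the behaviour being discussed, here is a minimal sketch, assuming Unix-style paths. `Path::join` is the real standard library method; `append_always` is a hypothetical helper written only for contrast, showing what a purely mechanical join would do instead.

```rust
use std::path::{Component, Path, PathBuf};

/// Hypothetical "mechanical" join (not a std API): always appends `other`
/// to `base`, dropping any prefix or root that would make `other` absolute.
fn append_always(base: &Path, other: &Path) -> PathBuf {
    let mut out = base.to_path_buf();
    for comp in other.components() {
        match comp {
            // Skip the anchor so the result stays underneath `base`.
            Component::Prefix(_) | Component::RootDir => {}
            rest => out.push(rest.as_os_str()),
        }
    }
    out
}

/// The real `Path::join`, wrapped only so both behaviours sit side by side.
/// An absolute `other` replaces `base` entirely.
fn std_join(base: &Path, other: &Path) -> PathBuf {
    base.join(other)
}
```

With these, `std_join(Path::new("/usr"), Path::new("/etc"))` gives `/etc`, while `append_always` on the same inputs gives `/usr/etc`. For a relative second argument such as `lib`, both give `/usr/lib`.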

ExpHP · September 16, 2018, 2:27am

Maybe somebody can revive this PR?

github.com/rust-lang/rust

Adding Path.normalize() method

rust-lang:master ← linclelinkpart5:master

opened 08:39PM - 11 Jan 18 UTC

linclelinkpart5

+245 -0

As per the discussion here: https://github.com/rust-lang/rfcs/issues/2208 I found myself with a need to normalize file paths. By "normalize", I refer to the lexical cleanup of a path, following a similar algorithm as Python's [os.path.normpath()](https://docs.python.org/3/library/os.path.html#os.path.normpath) and Go's [filepath.Clean()](https://golang.org/pkg/path/filepath/#Clean). These in turn follow the methodology outlined in Rob Pike's paper: [Lexical File Names in Plan 9](https://9p.io/sys/doc/lexnames.html). I was able to write a method that works for my use case, and I felt it could be a helpful addition to stdlib! However, I am very new to both Rust and contributing to OSS projects. Please do let me know if there are any changes that could/should be made in those regards.
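To make "lexical cleanup" concrete, here is a rough sketch in the spirit of Go's `filepath.Clean()`. It is not the code from the linked PR; the name `lexical_normalize` and the exact edge-case choices (for example, returning `.` for an empty result) are assumptions made for illustration. It never touches the filesystem, so it does not know about symlinks.

```rust
use std::path::{Component, Path, PathBuf};

/// Hypothetical lexical cleanup: removes `.` components and cancels `..`
/// against a preceding normal component, without consulting the filesystem.
fn lexical_normalize(path: &Path) -> PathBuf {
    let mut out = PathBuf::new();
    for comp in path.components() {
        match comp {
            Component::CurDir => {}
            Component::ParentDir => match out.components().next_back() {
                // `a/..` cancels out.
                Some(Component::Normal(_)) => {
                    out.pop();
                }
                // `/..` is still `/`.
                Some(Component::RootDir) => {}
                // Nothing to cancel (empty, `..`, or a bare prefix): keep the `..`.
                _ => out.push(".."),
            },
            other => out.push(other.as_os_str()),
        }
    }
    if out.as_os_str().is_empty() {
        out.push(".");
    }
    out
}
```

Under these assumptions `a/b/../c` becomes `a/c`, `/../a` becomes `/a`, and `a/../../b` becomes `../b`. If `b` in `a/b/../c` is a symlink, the cleaned path can name a different file than the original, which is the trade-off argued about in the rest of this thread.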

Bah humbug. The semantics of UNIX symlinks are horrible from a user perspective. When I cd into some symlink into my home directory and then cd .., I intend to go home.

Some programs ~~(and I mean the important ones! Like cp!)~~ do things like check to see if a (shell-defined) PWD environment variable exists, join paths from command line arguments to it, and then deliberately normalize them without following symlinks. Why? Because that's what feels right from the user's perspective!

(example redacted after further testing (see edit))

~ $ ls -l rsp2
lrwxrwxrwx 19 lampam 16 Nov  2017 rsp2 -> cpp/other/rust/rsp2

~ $ cd rsp2

~/rsp2 $ cp log.lammps ..

~/rsp2 $ cd ..

~ $ ls -l log.lammps
.rw-r--r-- 186 lampam  5 Sep 10:06 log.lammps

Edit: Well, this is embarrassing

uberjay · September 16, 2018, 2:36am

Huh, no way! I had no idea. It doesn't seem to behave that way for me... I wonder what system you're on?

ExpHP · September 16, 2018, 2:37am

Right, so of course, this varies by implementation.

$ cp --version
cp (GNU coreutils) 8.30
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Torbjorn Granlund, David MacKenzie, and Jim Meyering.

$ uname -a
Linux arch-t430s 4.18.6-arch1-1-ARCH #1 SMP PREEMPT Wed Sep 5 11:54:09 UTC 2018 x86_64 GNU/Linux

uberjay · September 16, 2018, 2:50am

Interesting… I have the same version of cp, and it doesn’t behave that way for me. That does seem convenient, and I’m pretty curious what’s causing the difference in behavior. I am on macOS, although the same thing is happening on a Linux system (with GNU coreutils 8.22). (and I realize this is drifting further off topic, sorry!)

❯ mkdir -p a/b/c; ln -s a/b/c wat; touch a/b/c/yowzers

~
❯ cd wat

~/wat
❯ cp yowzers ..

~/wat
❯ cd ..

~
❯ file yowzers
yowzers: cannot open `yowzers' (No such file or directory)

~
❯ file a/b/yowzers
a/b/yowzers: empty

ExpHP · September 16, 2018, 3:05am

Ah, yes! Do please say things like this, it helps me make my point!

So, I checked the configure script for coreutils, but it didn't look like there were any options for enabling/disabling this. Did you check that your shell defines the environment var PWD? (I'm not sure that they all do?)

Edit: Looking at the source of coreutils now, I don't see any evidence that it actually does what I say. I wonder if Arch Linux has some patch to give it QoL improvements like this.

Screwtapello · September 16, 2018, 3:19am

Computing the relative path from one filesystem location to another fundamentally requires two absolute paths - if you're not talking about two specific locations, then you can't possibly navigate reliably from one to the other. If you're using a library like @anon2808951 talks about, where there's a data-type with the "is an absolute path" invariant, this is easy—you just put the method on that data type. However, if you've only got one data-type for both absolute and relative paths (as is the case with PathBuf), one or both of the inputs might be relative, and you just have to deal with it, so what do you do?

One solution is to return an error if one or both paths are relative, which is simple, predictable behaviour but a bit annoying.

Another solution is to say "relative paths are relative to the current directory", which has some precedent, and seems to be what's being proposed for Path::relative_to().

I'm not familiar with Ruby's relative_path_from(), but the docs mention that either both inputs must be absolute, or both must be relative, which I suspect means it gives the same result as concatenating paths with the current working directory (or any fixed path, like /).
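As an illustration of the "both absolute or both relative" rule, here is a purely lexical sketch. The name `relative_from` and the choice to return `None` for mixed inputs are assumptions, not the proposed `Path::relative_to()` API, and the function ignores symlinks and the current directory entirely.

```rust
use std::path::{Component, Path, PathBuf};

/// Hypothetical helper: the path that leads from `base` to `target`,
/// computed lexically. Returns `None` when one input is absolute and the
/// other relative, or when `base` contains something that cannot be undone
/// lexically (such as `..` or a different prefix).
fn relative_from(target: &Path, base: &Path) -> Option<PathBuf> {
    if target.is_absolute() != base.is_absolute() {
        return None;
    }
    let t: Vec<Component> = target.components().collect();
    let b: Vec<Component> = base.components().collect();
    // Length of the shared leading run of components.
    let common = t.iter().zip(b.iter()).take_while(|(x, y)| x == y).count();

    let mut out = PathBuf::new();
    // Climb out of whatever is left of `base`...
    for comp in &b[common..] {
        match comp {
            Component::Normal(_) => out.push(".."),
            Component::CurDir => {}
            _ => return None,
        }
    }
    // ...then descend into whatever is left of `target`.
    for comp in &t[common..] {
        out.push(comp.as_os_str());
    }
    Some(out)
}
```

For example, going from `/a/d` to `/a/b/c` yields `../b/c`, while mixing `/a` with the relative `b` yields `None`, which corresponds to the "return an error" option above.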

ExpHP · September 16, 2018, 3:22am

Update on this: Sorry for getting your hopes up. I was wrong. Turns out that file was already there in my home directory.

I really coulda sworn, man…

Screwtapello · September 16, 2018, 3:36am

The semantics of symlinks may be subtle and quick to anger, but they are what they are. I don't mind tools designed for human interaction like shells and GUI file-managers doing a bit of magic behind the scenes (bash does this for the cd command by default, you can turn it off with set -P), but I'm uneasy about that kind of "do what I mean" functionality lower in the stack.

Basically, my nightmare scenario is being woken up at 3AM because of a production outage, and the system logs say it read a bogus value from a file at a particular path, and that path doesn't actually exist on disk because it's the "cleaned" version of the path that was really used, and there's no good way to find out what actually happened.

When I'm trying to understand and diagnose a failure, I don't mind "confusing" or "unfriendly" or "legacy" behaviour because I can trace through that if I have to, as long as I have all the details available. I dislike approximations and heuristics, even ones that attempt to be helpful and guiding, because there's a chance they'll throw away the clue I needed to understand what was going on.

ExpHP · September 16, 2018, 3:53am

There’s a time and place for everything. I would consider that for these specific situations:

Paths coming from CLI arguments
When the program itself creates a tree of directories and symlinks in the filesystem, and constructs paths into that tree

cleaning (i.e. logical path manipulation) is the thing that will give you the information you care about. I mean, what if the issue was that the program created a symlink pointing to the wrong directory? It’d take you forever to find that out if it only told you canonicalized paths.

Meanwhile, for these specific situations:

a path contained somewhere in some file (a config file, a source file, whatever)
…actually, yeah, I guess that’s it, but it’s a pretty big one.

canonicalized paths clearly win out.

My point is: A tree can fall in more than one direction.

…

(hmm… bad analogy, sometimes trees are supposed to fall. Maybe “a bridge can collapse from either side”?)

ExpHP · September 16, 2018, 4:03am

Not to mention: The standard library’s fascination with canonical paths (which at some point plagued the path_abs crate to an even greater extent) is terrifying if you ever want to delete something. (which could be a link)

kornel · September 16, 2018, 12:13pm

append/push/join/concat creates lots of names for almost the same thing. That’s kinda confusing.

Also, does any of them provide safety from path traversal vulnerabilities? I need root_dir.safely_join(subpath) to never be able to exit root_dir (joining ../../etc should be an error or join ./etc instead).
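For context, `Path::join` gives no such guarantee: joining `../../etc` keeps the `..` components, and joining `/etc` replaces the root directory outright. Below is a deliberately strict sketch of the "should be an error" variant. `safely_join` is the hypothetical name from the post, written here as a free function, and rejecting every `..` is an assumption chosen for simplicity. It is lexical only, so a symlink inside the root can still lead elsewhere.

```rust
use std::path::{Component, Path, PathBuf};

/// Hypothetical strict join: accepts only plain names (and `.`) in `sub`.
/// Any `..`, root, or prefix component makes the whole join fail.
fn safely_join(root: &Path, sub: &Path) -> Option<PathBuf> {
    let mut out = root.to_path_buf();
    for comp in sub.components() {
        match comp {
            Component::Normal(name) => out.push(name),
            Component::CurDir => {}
            // `..`, `/`, or a Windows prefix could leave `root`.
            _ => return None,
        }
    }
    Some(out)
}
```

With a root of `/srv/www`, the subpath `img/a.png` yields `/srv/www/img/a.png`, while `../../etc` and `/etc` both yield `None`.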

uberjay · September 16, 2018, 5:46pm

Taking a quick survey, I like the names used elsewhere a bit better -- join has the problem that it feels very... uh... mechanical? Maybe because of the conflict with slice join? I have a suspicion the intuitive understanding of path join is that it's like slice join, except the separator is chosen for you. Apparently C# calls their method Combine. (and it has the same behavior as PathBuf.join). But, a function named Combine hints that there may be more going on than just mechanical string concatenation.

I actually like the names of the JavaScript path manipulators:

join - does not interpret input paths as possibly absolute.

resolve - does allow for input absolute path.

Again, "resolve" tells me there's more going on with the operation. It's even described in a totally different way -- it's not that an input path can "reset" or "destroy" the existing path:

The given sequence of paths is processed from right to left, with each subsequent path prepended until an absolute path is constructed. For instance, given the sequence of path segments: /foo, /bar, baz, calling path.resolve('/foo', '/bar', 'baz') would return /bar/baz.
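Since `Path::join` already lets an absolute argument replace what came before, the core of that "resolve" behaviour can be sketched as a left fold over `join`. The function below is an illustration only: `resolve` is not a std API, the current directory is passed in explicitly as an assumption, and unlike Node's version it does not normalize `..` components.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical Node-style resolve: start from `cwd` and join each segment
/// in order, so the last absolute segment wins and later ones append to it.
fn resolve(cwd: &Path, segments: &[&str]) -> PathBuf {
    segments
        .iter()
        .fold(cwd.to_path_buf(), |acc, seg| acc.join(seg))
}
```

Called with a `cwd` of `/home/me` and the segments `/foo`, `/bar`, `baz`, this returns `/bar/baz`, matching the quoted documentation. With only relative segments the result stays under `cwd`.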

uberjay · September 16, 2018, 6:13pm

d'oh!!! That is a bummer! It sounded plausible, but also, perhaps a bit too magical for my tastes.

scottmcm · September 16, 2018, 10:37pm

I'm a big fan of relative_to; I remember discussing it for Boost in C++ (wow, 10 years ago already?).

But it needs more than just the CWD, because it needs to be symlink-aware to work in general.

This seems eminently reasonable to me. It's a place where it's easy to look-before-you-leap if you need to, and you can always explicitly use a cwd-aware function to make a path absolute before hand if you need to.

I wonder if there's a place for some sort of domain separation parameter in paths, or association with a VFS, or something? Because in a sense the difference here is between a LocalPath and an AbstractPath, and maybe there's space for things like an InMyZipFilePath and such. Also, perhaps an AbstractPath isn't based on OsStr, which would mean it can't be a newtype around or method on Path...

Is there something commonly used for the Path component of URLs in rust?

Screwtapello · September 17, 2018, 2:56am

For a CLI program that uses its arguments directly, like grep or find, I wouldn't bother with any cleaning or canonicalisation at all. If the user requests an operation on "foo", it's easiest for them to understand results and responses if they're phrased in identical terms.

As for the directory-tree-as-data-structure case, I came up with my "monotonic" algorithm while I was working on a crate for local http caching, so it's definitely the way I want such a program to work.

That example cuts both ways, though: imagine a symlink pointing to the wrong place, and you get an error message saying "could not read foo/bar/baz: not found" when foo/bar/baz definitely exists and its last-modified time is much older than the error message.

Yeah, a fully-canonicalized path is useful when you care about file contents but less so when you care about the directory structure itself. That said, the "monotonic" algorithm never tries to resolve a symlink at the end of a path (because the end of a path can never be followed by .. or it wouldn't be the end of the path) so I'm not worried about that.

"unsafe" usually means exclusively memory-unsafety, so maybe not those particular names. This is more the "validated" versus "unvalidated" kind of safety, like SQL injection and cross-site-scripting, and (like those problems) the real solution is separate data-types that can't be easily mixed. I think this thread is about extending the (single) PathBuf type rather than designing a new, safer path manipulation API, so I don't think "safe by default" is a practical goal here.

On the other hand, providing the tools to build a safe, higher-level API seems reasonable. How would you feel about an is_relative_descendant(&self) -> bool method that returns true for a path that does not start with a prefix or a root component, and does not escape its prefix with .. components? It would fit nicely with is_absolute() and is_relative(), since on Windows, paths like \foo and C:foo are relative, but not relative descendants.
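One possible reading of that proposal, written as a free function because the method does not exist in std. The depth-counting rule is an assumption about what "does not escape with `..`" should mean, and the check is lexical, so it says nothing about symlinks.

```rust
use std::path::{Component, Path};

/// Hypothetical check: true if `path` has no prefix or root and its `..`
/// components never climb above the directory it starts in.
fn is_relative_descendant(path: &Path) -> bool {
    let mut depth: usize = 0;
    for comp in path.components() {
        match comp {
            Component::Prefix(_) | Component::RootDir => return false,
            Component::CurDir => {}
            Component::ParentDir => {
                if depth == 0 {
                    // This `..` would step outside the starting directory.
                    return false;
                }
                depth -= 1;
            }
            Component::Normal(_) => depth += 1,
        }
    }
    true
}
```

Under this rule `a/b` and `a/../b` pass, while `../a`, `a/../../b`, and `/a` fail. On Windows, `\foo` and `C:foo` would also fail because they begin with a root or prefix component.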

You might look at Python's pathlib API, which has PurePosixPath and PureWindowsPath types, which do not touch the filesystem and can therefore be used on any platform, and PosixPath and WindowsPath types which do touch the filesystem and therefore can only be used on their respective platforms.

Soni · September 17, 2018, 3:57am

I don’t like how Rust paths are just type-punned strings.

most of the overhead is in the filesystem. I feel like you made a mistake by allowing strings to be type-punned into paths.

also, not every system uses string paths. some use string arrays and place no restrictions on filenames (except the NUL byte). it’s impossible to port Rust to those systems, as far as I can tell.

vitiral · September 17, 2018, 5:36am

I think this should mostly be fixed now. Take a look at the new version; @Screwtapello is working on improving the story even more. We now do not require paths to exist and should be able to delete links.

Screwtapello · September 17, 2018, 6:34am

If it’s not too late, I’d like to propose another utility method: head()

let p = Path::new(r"C:\Windows\System32").head();
assert_eq!(p, Path::new(r"C:\"));

let p = Path::new(r"\Users").head();
assert_eq!(p, Path::new(r"\"));

let p = Path::new(r"C:some\other\path").head();
assert_eq!(p, Path::new(r"C:"));

let p = Path::new(r"pure\relative\path").head();
assert_eq!(p, Path::new(r""));

This method returns the “head” of a path; that is, the Prefix and RootDir components (if any). This is useful for implementing “adjoinment” logic (if you want to implement your own alternative to canonicalize()) and could also be useful for Windows programs that want to know whether two paths are on the same drive.

For an absolute path, this is easily accomplished with p.push("/"), but that doesn’t work for a relative path if you want to keep it relative. You can repeatedly call .parent() until it returns None, but that repeats the hard work of segmenting the path into components for each call. You can walk over the components() iterator, stopping when you reach the first non-Prefix, non-RootDir component and collecting into a new PathBuf, but that requires a new allocation.

The .head() method suggests a companion .tail() method that skips Prefix and RootDir components and returns the part of the path composed of Normal, CurrentDir and ParentDir components, but I haven’t thought of a practical use-case for that yet.
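For comparison, two of the workarounds described for head() can be sketched with existing std APIs. Both function names are hypothetical. The first collects leading components into a new PathBuf (the allocating approach); the second walks `.parent()` to the end via `Path::ancestors()`, which borrows from the input but repeats the component parsing.

```rust
use std::path::{Component, Path, PathBuf};

/// Hypothetical allocating version: keep only leading Prefix/RootDir components.
fn head_collected(path: &Path) -> PathBuf {
    path.components()
        .take_while(|c| matches!(c, Component::Prefix(_) | Component::RootDir))
        .collect()
}

/// Hypothetical borrowing version: the last ancestor is what remains once
/// `.parent()` has nothing more to strip.
fn head_by_ancestors(path: &Path) -> &Path {
    // `ancestors()` always yields at least `path` itself.
    path.ancestors().last().unwrap_or(path)
}
```

On a Unix-style path such as `/usr/lib` both return `/`, and on `pure/relative/path` both return the empty path. The Windows cases in the post (`C:\`, `\`, `C:`) depend on Windows path parsing and are not exercised by this sketch.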
