[Pre-RFC] Additional path handling utilities


#41

It’s interesting that in Java, despite having an explicit LinkOption enum, Files.delete(Path) doesn’t accept such an argument, and always opts for deleting the symbolic link.

I think Rust’s documentation on remove_file is insufficient regarding its semantics: https://doc.rust-lang.org/std/fs/fn.remove_file.html

Maybe its behavior could be derived by thinking about the semantics of unlink, but I think the behavior on links should be documented.
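
For what it's worth, here is a small Unix-only sketch of the behavior in question. It assumes remove_file maps to unlink on this platform, so the symlink itself is removed and the target is left alone. The function name remove_symlink_demo and the temp-directory layout are made up for illustration.

```rust
use std::fs;
use std::io;
use std::os::unix::fs::symlink; // Unix-only: this sketch does not compile on Windows

/// Hypothetical demo: creates a file and a symlink to it, calls
/// `remove_file` on the symlink, and reports
/// `(target_still_exists, link_still_exists)`.
fn remove_symlink_demo() -> io::Result<(bool, bool)> {
    // Scratch directory; the name is arbitrary and only used by this sketch.
    let dir = std::env::temp_dir().join(format!("remove_file_demo_{}", std::process::id()));
    let _ = fs::remove_dir_all(&dir); // ignore leftovers from an earlier run
    fs::create_dir_all(&dir)?;

    let target = dir.join("target.txt");
    let link = dir.join("link.txt");
    fs::write(&target, b"data")?;
    symlink(&target, &link)?;

    // Remove via the link path.
    fs::remove_file(&link)?;

    let target_exists = target.exists();
    // symlink_metadata does not follow links, so it tells us about the link itself.
    let link_exists = fs::symlink_metadata(&link).is_ok();

    fs::remove_dir_all(&dir)?;
    Ok((target_exists, link_exists))
}

fn main() -> io::Result<()> {
    let (target_exists, link_exists) = remove_symlink_demo()?;
    println!("target exists: {target_exists}, link exists: {link_exists}");
    Ok(())
}
```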


#42

append/push/join/concat creates lots of names for almost the same thing. That’s kinda confusing.

Also, does any one of them provide safety from path traversal vulnerabilities? I need root_dir.safely_join(subpath) to never be able to exit root_dir (joining ../../etc should either be an error or join ./etc instead).
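
To make the request concrete, a minimal sketch of what such a helper might look like as a free function. safely_join is a hypothetical name, not an existing std API, and this version takes the "should be an error" option by returning None. It is purely lexical: it never touches the filesystem, so a symlink inside root_dir could still lead outside it, and cancelling foo/.. lexically is only correct if foo is not a symlink.

```rust
use std::path::{Component, Path, PathBuf};

/// Hypothetical helper (not part of std): joins `subpath` onto `root`,
/// returning `None` if `subpath` is rooted, has a prefix, or would climb
/// above `root` through `..` components. Purely lexical; symlinks are
/// not considered.
fn safely_join(root: &Path, subpath: &Path) -> Option<PathBuf> {
    let mut joined = root.to_path_buf();
    // Number of components pushed below `root` so far.
    let mut depth: usize = 0;

    for component in subpath.components() {
        match component {
            // `/etc`, `C:\etc`, `C:etc` and friends would replace `root`.
            Component::Prefix(_) | Component::RootDir => return None,
            Component::CurDir => {}
            Component::ParentDir => {
                if depth == 0 {
                    return None; // would step out of `root`
                }
                depth -= 1;
                joined.pop();
            }
            Component::Normal(name) => {
                depth += 1;
                joined.push(name);
            }
        }
    }
    Some(joined)
}

fn main() {
    let root = Path::new("/srv/www");
    println!("{:?}", safely_join(root, Path::new("static/site.css")));
    println!("{:?}", safely_join(root, Path::new("../../etc")));
}
```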


#43

Also does any one of them provide safety from path traversal vulnerabilities?

That’s a very good point! Until now, the path API has felt like it had a very scripting-oriented design, where people are expected to do stuff and then hopefully check later on whether the stuff they did produced the intended result.

I think it would be a useful addition – though I share the concern that good names are hard to find.

Maybe it makes sense to have a safe-by-default approach (ok, that’s a hard sell in the presence of join and push :-D) and prohibit path traversal in append and concat, and have unsafe_append and unsafe_concat in addition?


#44

Taking a quick survey, I like the names used elsewhere a bit better – join has the problem that it feels very… uh… mechanical? Maybe because of the conflict with slice join? I have a suspicion the intuitive understanding of path join is that it’s like slice join, except the separator is chosen for you. Apparently C# calls their method Combine (and it has the same behavior as PathBuf.join). But a function named Combine hints that there may be more going on than just mechanical string concatenation.

I actually like the names of the JavaScript path manipulators:

join - does not interpret input paths as possibly absolute.

resolve - does allow for input absolute path.

Again, “resolve” tells me there’s more going on with the operation. It’s even described in a totally different way – it’s not that an input path can “reset” or “destroy” the existing path:

The given sequence of paths is processed from right to left, with each subsequent path prepended until an absolute path is constructed. For instance, given the sequence of path segments: /foo, /bar, baz, calling path.resolve('/foo', '/bar', 'baz') would return /bar/baz.
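
For comparison, a rough Rust sketch of the same example. The function name resolve is hypothetical. It relies on the fact that PathBuf::push replaces the current path when handed an absolute one, which gives the same answer as the right-to-left description for this input. Unlike the JavaScript function, this sketch does not fall back to the current directory and does not normalize .. components.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical `resolve`: folds the segments left to right with
/// `PathBuf::push`. An absolute segment discards everything before it,
/// so the result starts at the last absolute segment.
/// Assumption: no cwd lookup and no `..` normalization, unlike Node's version.
fn resolve<I, P>(segments: I) -> PathBuf
where
    I: IntoIterator<Item = P>,
    P: AsRef<Path>,
{
    let mut result = PathBuf::new();
    for segment in segments {
        result.push(segment);
    }
    result
}

fn main() {
    let resolved = resolve(["/foo", "/bar", "baz"]);
    assert_eq!(resolved, PathBuf::from("/bar/baz"));
    println!("{}", resolved.display());
}
```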


#45

d’oh!!! That is a bummer! It sounded plausible, but also, perhaps a bit too magical for my tastes.


#46

I’m a big fan of relative_to; I remember discussing it for Boost in C++ (wow, 10 years ago already?).

But it needs more than just the CWD, because it needs to be symlink-aware to work in general.

:+1: This seems eminently reasonable to me. It’s a place where it’s easy to look-before-you-leap if you need to, and you can always explicitly use a cwd-aware function to make a path absolute beforehand if you need to.

I wonder if there’s a place for some sort of domain separation parameter in paths, or association with a VFS, or something? Because in a sense the difference here is between a LocalPath and an AbstractPath, and maybe there’s space for things like an InMyZipFilePath and such. Also, perhaps an AbstractPath isn’t based on OsStr, which would mean it can’t be a newtype around or method on Path…

Is there something commonly used for the Path component of URLs in rust?


#47

For a CLI program that uses its arguments directly, like grep or find, I wouldn’t bother with any cleaning or canonicalisation at all. If the user requests an operation on “foo”, it’s easiest for them to understand results and responses if they’re phrased in identical terms.

As for the directory-tree-as-data-structure case, I came up with my “monotonic” algorithm while I was working on a crate for local http caching, so it’s definitely the way I want such a program to work.

That example cuts both ways, though: imagine a symlink pointing to the wrong place, and you get an error message saying “could not read foo/bar/baz: not found” when foo/bar/baz definitely exists and its last-modified time is much older than the error message.

Yeah, a fully-canonicalized path is useful when you care about file contents but less so when you care about the directory structure itself. That said, the “monotonic” algorithm never tries to resolve a symlink at the end of a path (because the end of a path can never be followed by .. or it wouldn’t be the end of the path) so I’m not worried about that.

“unsafe” usually means exclusively memory-unsafety, so maybe not those particular names. This is more the “validated” versus “unvalidated” kind of safety, like SQL injection and cross-site-scripting, and (like those problems) the real solution is separate data-types that can’t be easily mixed. I think this thread is about extending the (single) PathBuf type rather than designing a new, safer path manipulation API, so I don’t think “safe by default” is a practical goal here.

On the other hand, providing the tools to build a safe, higher-level API seems reasonable. How would you feel about a is_relative_descendant(&self) -> bool method that returns true for a path that does not start with a prefix or a root component, and does not escape its prefix with .. components? It would fit nicely with is_absolute() and is_relative(), since on Windows, paths like \foo and C:foo are relative, but not relative descendants.
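
A minimal sketch of that check, written as a free function rather than a method on Path so it can be tried today. The name comes from the proposal above; the body is only one possible reading of it, and it is lexical only (no symlink awareness).

```rust
use std::path::{Component, Path};

/// Sketch of the proposed check: true if `path` has no prefix or root
/// component and its `..` components never climb above its starting point.
fn is_relative_descendant(path: &Path) -> bool {
    // How many normal components deep we currently are.
    let mut depth: usize = 0;
    for component in path.components() {
        match component {
            Component::Prefix(_) | Component::RootDir => return false,
            Component::CurDir => {}
            Component::ParentDir => match depth.checked_sub(1) {
                Some(d) => depth = d,
                None => return false, // `..` would escape the starting point
            },
            Component::Normal(_) => depth += 1,
        }
    }
    true
}

fn main() {
    for p in ["foo/bar", "foo/../bar", "../foo", "foo/../../bar", "/foo"] {
        println!("{p}: {}", is_relative_descendant(Path::new(p)));
    }
}
```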

You might look at Python’s pathlib API, which has PurePosixPath and PureWindowsPath types, which do not touch the filesystem and can therefore be used on any platform, and PosixPath and WindowsPath types which do touch the filesystem and therefore can only be used on their respective platforms.


#48

I don’t like how Rust paths are just type-punned strings.

most of the overhead is in the filesystem. I feel like you made a mistake by allowing strings to be type-punned into paths.

also, not every system uses string paths. some use string arrays and place no restrictions on filenames (except the NUL byte). it’s impossible to port Rust to those systems, as far as I can tell.


#49

I think this should mostly be fixed now; take a look at the new version. @Screwtapello is working on improving the story even more. We no longer require paths to exist, and we should be able to delete links :slight_smile:


#50

If it’s not too late, I’d like to propose another utility method: head()

let p = Path::new(r"C:\Windows\System32").head();
assert_eq!(p, Path::new(r"C:\"));

let p = Path::new(r"\Users").head();
assert_eq!(p, Path::new(r"\"));

let p = Path::new(r"C:some\other\path").head();
assert_eq!(p, Path::new(r"C:"));

let p = Path::new(r"pure\relative\path").head();
assert_eq!(p, Path::new(r""));

This method returns the “head” of a path; that is, the Prefix and RootDir components (if any). This is useful for implementing “adjoinment” logic (if you want to implement your own alternative to canonicalize()) and could also be useful for Windows programs that want to know whether two paths are on the same drive.

For an absolute path, this is easily accomplished with p.push("/"), but that doesn’t work for a relative path if you want to keep it relative. You can repeatedly call .parent() until it returns None, but that repeats the hard work of segmenting the path into components for each call. You can walk over the components() iterator, stopping when you reach the first non-Prefix, non-RootDir component and collecting into a new PathBuf, but that requires a new allocation.

The .head() method suggests a companion .tail() method that skips Prefix and RootDir components and returns the part of the path composed of Normal, CurrentDir and ParentDir components, but I haven’t thought of a practical use-case for that yet.
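
For reference, a sketch of two of the workarounds described above, written as free functions. Both names are hypothetical; neither is the proposed .head() itself. head_allocating walks components() and collects into a new PathBuf; head_by_ancestors keeps stepping to the parent (via Path::ancestors) and returns the last one as a borrowed slice, at the cost of re-parsing the path on each step.

```rust
use std::path::{Component, Path, PathBuf};

/// Workaround 1: collect the leading Prefix/RootDir components.
/// Allocates a new PathBuf.
fn head_allocating(path: &Path) -> PathBuf {
    path.components()
        .take_while(|c| matches!(c, Component::Prefix(_) | Component::RootDir))
        .collect()
}

/// Workaround 2: follow `.parent()` to the end. `ancestors()` always
/// yields `path` itself first, so `last()` is never `None` in practice.
fn head_by_ancestors(path: &Path) -> &Path {
    path.ancestors().last().unwrap_or(path)
}

fn main() {
    // Unix-style inputs; the Windows examples in the post need a Windows target.
    for p in ["/usr/lib", "pure/relative/path", "../up/one"] {
        let path = Path::new(p);
        println!(
            "{p:?}: allocating = {:?}, ancestors = {:?}",
            head_allocating(path),
            head_by_ancestors(path)
        );
    }
}
```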


#51

It might be worth taking a look at what Boost.Filesystem does here:

https://www.boost.org/doc/libs/1_68_0/libs/filesystem/doc/reference.html#path-decomposition-table


#52

This method returns the “head” of a path; that is, the Prefix and RootDir components (if any). This is useful for implementing “adjoinment” logic (if you want to implement your own alternative to canonicalize()) and could also be useful for Windows programs that want to know whether two paths are on the same drive.

In Python’s pathlib, this is called .anchor.


#53

There is a suggestion in the API Guidelines to offer unchecked versions of functions that are unsafe. There is some question about invariant-safety vs memory-safety so I’ve opened rust-lang-nursery/api-guidelines#179


#54

That’s not the case: PathBuf is a newtype around OsString, which has different guarantees than String.


#55

Due to the &str (or &OsStr at least) to Path and vice-versa AsRef conversions, it’s impossible to have Path be a NULL-terminated array of OsStr in platforms that use such representation.

This also applies to PathBuf.

AsRef = type-punned.


#56

fs::normalize on Windows needs to be implemented using GetFullPathNameW. Doing anything else is liable to get things wrong in corner cases.

Also just a reminder that people should read https://googleprojectzero.blogspot.com/2016/02/the-definitive-guide-on-win32-to-nt.html