Out-of-band crate evaluation for 2017-07-18: same-file

For additional contribution opportunities, see the main libz blitz thread.

This post is a wiki. Feel free to edit it.

Links

Needs your help!

Anything that is not checked off still needs your help! There is no need to sign up or ask for permission - follow any of these links and leave your thoughts:

Guidelines checklist

Legend

  • [y] = guideline is adhered to, no work needed
  • [n] = guideline may need work, see comments nearby
  • [/] = guideline not applicable to this crate

Checklist

Guidelines checklist

This document is collaboratively editable. Pick a few of the guidelines, compare the same-file crate against them, and fill in the checklist with [y] if the crate conforms to the guideline, [n] if the crate does not conform, and [/] if the guideline does not apply to this crate.

  - [n] Crate name is a palindrome (C-PALINDROME)
   - same-file backwards is elif-emas which is not the same as same-file

Cookbook recipes

Cookbook example ideas

Come up with ideas for nice introductory examples of using same-file, possibly in combination with other crates, that would be good to show in the Rust Cookbook. Please leave a comment in that issue with your ideas! You don't necessarily have to write the example code yourself, but PRs are always welcome. One candidate recipe is sketched below.
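
One possible introductory recipe, sketched here purely as an illustration (the paths are made up, and whether this belongs in the Cookbook is up for discussion): use the crate's is_same_file function to check whether two paths refer to the same file before acting on them.

```rust
use same_file::is_same_file;
use std::io;

fn main() -> io::Result<()> {
    // Hypothetical paths; a real recipe might take these from command-line
    // arguments instead.
    if is_same_file("Cargo.toml", "./Cargo.toml")? {
        println!("both paths refer to the same file");
    }
    Ok(())
}
```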

API guideline updates

What lessons can we learn from same-file that will be broadly applicable to other crates? Please leave a comment in that issue with your ideas!

Discussion topics

Anything that's not a concrete crate issue yet. We want to eventually promote any topics here into actionable crate issues below.

What does it mean for two files to be the same?

What does "files are the same" actually mean?

Could the docs clarify this more, without attempting to offer guarantees it can't?

Can the current API be tweaked without breaking?

There have been some potential improvements thrown around for is_same_file. If the public API of same-file is flexible enough to change without breaking, then we don't need to answer all of these questions now.

Well-behaved filesystems

Are there concrete cases where the approach used in same_file will produce false positives/negatives? Are these cases common among similar libraries in other ecosystems? Are there alternative approaches that avoid them?

Could is_same_file return something like an Option<bool> that captures the confidence that the files are the same?

We don't want to be basing behaviour on individual file systems to keep the implementation simple and easy to reason about.

If there are concrete edge cases, could the docs clarify them and point callers in a different direction? Maybe same-file isn't the right crate for their needs.

Optimising is_same_file

Could the is_same_file function be made more efficient by eliminating some system calls?
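
As a rough illustration of the idea (not the crate's actual implementation), here is a minimal Unix-only sketch that compares two paths with a stat call via std's metadata instead of opening them; the function name is made up:

```rust
use std::io;
use std::path::Path;

#[cfg(unix)]
fn same_by_stat<P: AsRef<Path>, Q: AsRef<Path>>(a: P, b: Q) -> io::Result<bool> {
    use std::os::unix::fs::MetadataExt;

    // std::fs::metadata performs a stat (following symlinks) without opening
    // the file, avoiding two open/fstat/close round trips.
    let (ma, mb) = (std::fs::metadata(a)?, std::fs::metadata(b)?);
    // Two paths refer to the same file only if both device and inode match.
    Ok(ma.dev() == mb.dev() && ma.ino() == mb.ino())
}
```

Note that this trades away the protection that holding the files open provides against inode reuse, which is part of the discussion above.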

Crate issues

Issues to file against the same-file crate.

  • docs: Clarify the statement about false positives a little more. It doesn't need too much though
  • docs: Add examples to public methods on Handle
  • docs: Replace try! with ? in examples
  • docs: Add Panic section to methods in unix that call unwrap, or document why the unwrap is infallible
  • docs: Add Errors section to methods that return Results
  • docs: Add hyperlink to source example mentioned in the crate root docs
  • docs: Add hyperlinks to types in docs prose
  • Implement Ord and Hash for Handle
  • Add assert_send / assert_sync helpers and use them to test that Handle is Send and Sync (see the sketch after this list)
  • Add html_root_url attribute set to "https://docs.rs/same-file/1.0.0"
  • Start tracking release notes for version bumps
  • Move dev and ino methods to a platform-specific extension trait
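
For the Send/Sync item above, one common way to write such a test is a pair of generic helpers whose bounds only compile if the type implements the traits; a minimal sketch (the helper names are just a convention, not existing same-file code):

```rust
// Compile-time assertions: these calls only compile if the bounds hold, so
// the test fails at compile time (not at runtime) if Handle stops being
// Send or Sync.
fn assert_send<T: Send>() {}
fn assert_sync<T: Sync>() {}

#[test]
fn handle_is_send_and_sync() {
    assert_send::<same_file::Handle>();
    assert_sync::<same_file::Handle>();
}
```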

How are we tracking?

Pre-review checklist

  • Create evaluation thread based on this template
  • Work with author and issue tracker to identify known blockers
  • Compare crate to guidelines by filling in checklist
  • Record other questions and notes about crate
  • Draft several use case statements to serve as cookbook examples
  • Record recommendations for updated guidelines

Post-review checklist

  • Create new issues and tracking issue on crate's issue tracker
  • Solicit use cases for cookbook examples related to the crate
  • File issues to implement cookbook examples
  • File issues to update guidelines
  • Post all approachable issues to TWiR call for participation thread
  • Update links in coordination thread
2 Likes

A few critical issues:

  1. The crate seems to fail to explain what "being the same file" means. I suppose it could be defined as "if it returns true, modifications made through one file/directory are visible through the other, as if they had been performed on it directly; if it returns false, that may or may not be the case (and usually won't be)". Alternatively the definition could be "has the same inode", although such a concrete definition might be problematic for portability.

  2. This crate is broken on Windows due to missing 128-bit inode support, as described in the source code. The comment claims that "I don't have access to such systems", but trial versions of Windows Server 2012 (and 2016) are available and can be run in a virtual machine to perform testing, if necessary (and are also available on Azure).

  3. Handle must implement Ord and Hash, since that allows finding duplicates among n paths in O(n log n) by sorting or O(n) by hash table, instead of the O(n^2) time required with the current version of the crate, which is unacceptable (see the sketch after this list).

  4. The crate is currently useless according to its specification, which states: "Note that it's possible for this to produce a false positive on some platforms. Namely, this can return true even if the two file paths don't resolve to the same file." Since false negatives are unavoidable (imagine a FUSE filesystem that aliases files), this means that is_same_file, as specified, provides no information whatsoever. Obviously false positives must be made impossible. It's not clear if that statement refers to the Windows 128-bit inode issue or if there is something else.

  5. The current implementation actually opens the paths provided. This is pretty bad, since opening files can have arbitrary side effects (most frequently, hanging the program as the open waits on some sort of event before completing). Instead, it should use stat directly on the path on Unix. On Windows, opening the file might be unavoidable, which means that it needs to be documented that the API will open the files and that this might hang the program or cause arbitrary behavior if used on special filesystems or device paths.

  6. The current Unix implementation returns true if two files both have inode 0. However, inode 0 may perhaps mean "no inode number available", at least in some implementations, so it should perhaps return false instead in such cases. On the other hand, the POSIX standard doesn't seem to allow this, and I haven't been able to find an example of such behavior (but have done very little research). Some investigation would be required. Same thing for Windows.

  7. Same thing for device number 0
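
To make the complexity argument in point 3 concrete, here is a hypothetical sketch of the O(n) duplicate check the poster has in mind, keyed on a (device, inode) pair taken from std's Unix metadata rather than on Handle itself (which does not currently implement Hash); the function is invented for illustration:

```rust
use std::collections::HashSet;
use std::io;
use std::path::Path;

#[cfg(unix)]
fn first_duplicate<'a>(paths: &[&'a Path]) -> io::Result<Option<&'a Path>> {
    use std::os::unix::fs::MetadataExt;

    let mut seen = HashSet::new();
    for &path in paths {
        let md = std::fs::metadata(path)?;
        // insert returns false if the (dev, ino) pair was already present,
        // i.e. this path refers to a file we have already seen.
        if !seen.insert((md.dev(), md.ino())) {
            return Ok(Some(path));
        }
    }
    Ok(None)
}
```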

Also, while the current API and implementation works for files that exist, it doesn't allow figuring out, for instance, whether two paths are symbolic links that resolve to the same non-existent file. I'm not sure how to best handle this: one option would be to add an "is_same_file_or_would_create_same_file" API, or to change the definition of "is_same_file", and then add code that uses path canonicalization.

4 Likes

@jmst Thanks for your feedback. It's clear that the documentation could probably use a bit more clarification. That said, I find some of your framing to be a bit exaggerated. Saying the crate is "broken" on Windows, for example, seems a bit off the mark.

Note that the design of this crate is based on how other popular software does this type of detection. It is at the heart of Java's recursive directory iterator, for example, when checking for file system loops (last time I looked).

Your first proposal seems reasonable to me, but I'd be a little wary of saying much more than it does already. Nebulousness is a feature here, since I found it very hard to get specific guarantees on this matter. Therefore, it doesn't seem right for this crate to make specific guarantees.

This is pretty far from a critical issue given the probability of one actually observing a bug because of it. Moreover, the existence of trial versions does not imply that everyone is aware of said systems.

Great idea!

It's not useless at all. It's used to detect loops in the file system created by symbolic links. It works. Therefore it's not useless.

That's false. The information provided is that, with some high confidence, if is_same_file returns true, then the two given paths refer to the same file. If you have zero tolerance for false positives, then you have enough information to know that you can't use is_same_file. If you can abide some false positives in rare circumstances, then you have enough information to use it. For example, if a false positive occurs during file system loop detection in a recursive directory traversal, then the results of the traversal will miss some file paths that it should have visited. This is unfortunate.

I believe it is the Windows 128-bit issue that caused me to write that statement. It is very low priority to fix because it isn't a problem in practice. (Or more precisely, I'm not aware of it being a problem in practice. If someone was actually experiencing it, then it would certainly raise the priority level!)

No, it's not bad. It's necessary. Otherwise the underlying file system may reuse whatever identifier one is using. Keeping the handle open prevents that from happening.

With that said, the is_same_file function could probably be implemented more efficiently by using stat calls as you say!

Where is this documented?

The original motivation for is-same-file was as a means for detecting file system loops. Such detection doesn't require handling unresolvable symbolic links. Additional APIs seem OK; but I'd like to hear use cases first.

1 Like

If ReFS is using 128-bit ids and doesn't guarantee that the lower 64 bits are unique (which is what seems to be the case), that's a serious issue if the API starts guaranteeing no false positives, since it means that the API guarantees are sometimes broken on Windows.

Granted, it's low probability, but it's still not as low as the collision probability on even the shortest cryptographic hashes and thus not low enough to be discounted (MD5 has 128 bits as opposed to the 64 bits here).

Also, if ReFS 128-bit ids are not cryptographically secure random values, an attacker might be able to generate a collision intentionally, even more easily than the ~2^32-file birthday attack would require.

Well, this would indeed be an issue if a file system exists that generates a new inode id every time open or stat is called, but only if the file is not already open.

Is that a concern on Unix? If so, then there is a tradeoff between using open+fstat vs stat, and I'm not sure how to best proceed (perhaps determine the filesystem type and use stat only on known well-behaved filesystems?). Otherwise, stat seems better since it avoids possible side effects of opening files.

It's a hypothesis: since generating inode numbers might be hard for some virtual filesystems, and just returning a 0 inode number seems a very easy way out, someone might be doing that. But the POSIX standard doesn't seem to allow it, and indeed I think a normal filesystem could have 0-valued inodes. So I'm not sure if this is a concern or not; it would need more knowledge of, or research into, the problem area.

Well, this is a bit of a problem, because "detecting file system loops" (e.g. for a find or grep -r clone) is an application that tolerates neither false positives nor false negatives, since one causes a wrong result and the other causes an infinite loop.

Thus (assuming that the API gets specified based on the "modifications" working and thus with false negatives but not positives) I think the API needs to provide a further guarantee that when both paths point to "well-behaved filesystems" (which should include at least all disk-based ones shipped with the OS), then the result is exact.

Also, perhaps is_same_file could be changed to return an Option, where Some(true) means the files are the same (same inode), Some(false) means they are definitely not the same (well-behaved filesystem, different inode), and None means they may or may not be the same (not a known well-behaved filesystem). In the Handle API this would be reflected by not implementing Eq/Ord and just having PartialEq/PartialOrd, with a == b and a != b both being false in the is_same_file None case.

Or perhaps this could just be an API that returns whether a path is in a well-behaved filesystem (one where paths are duplicates if and only if they have matching inodes, and which as a corollary also has a finite number of inodes).

1 Like

AFAIK, this is true for every symlink loop detector I've ever seen. So if your criticism is "same-file suffers from the same problems that every other implementation suffers from," then OK, I acknowledge that. I think we should probably move on now. And if it's not true, then we should learn from other implementations and adopt their strategies.

I would be happy to be educated on how inodes are assigned in Unix. On Windows at least, the only way a comparison between inode numbers can possibly be valid is if there are open handles to said files at the point of comparison. I don't know for sure whether the same thing is required on Unix, but that's certainly my hypothesis.

I think it is out of scope to base behavior on individual file systems. At least, that is not something I'm interested in maintaining. It sounds like a significant task to me.

Thanks for kicking off some great discussion @jmst and @burntsushi!

I've captured some of the discussion so far in the OP, but this is a wiki, so please feel free to go and update it. I'd like to keep the discussion points in there easy to reduce into issues to file, maybe with a few alternatives to discuss.

There's clearly complexity in trying to offer strong guarantees across platforms. Right now the docs have a note on is_same_file warning that you might get false positives. I think a short discussion either in the crate root docs or in is_same_file would improve that situation enough for me. Even if it just says: read the code for yourself if you care about this.

It seems like you've got some reservations about the implementation, @jmst. That's not a bad thing. If you were evaluating the code for use in your own application (which is a good thing to do), then you might conclude that it's not right for you. That's also not a bad thing. What is a bad thing is promising more than we can deliver, and my concern (without being an expert on filesystems) is that adding a lot of additional complexity might cause us to raise those guarantees beyond what we've actually delivered.

So my conclusion is that a bit of extra documentation to suggest reviewing the implementation would be sufficient. @burntsushi has already peppered the implementation with useful nuggets of commentary.

2 Likes

I've updated the guidelines checklist. It's ~3/4 done.

1 Like

Awesome, thanks @budziq!

hmm it looks like I finished the list. Sorry for spamming.

1 Like

This Stack Overflow question has some details about special inode numbers on Linux: https://stackoverflow.com/questions/2099121/why-do-inode-numbers-start-from-1-and-not-0

I've added some crate issues to the OP based on the issues @budziq raised in the evaluation. Thanks again for jumping on that!

@burntsushi I've also pushed the arbitrary review date back a week (that was when I was planning to make the tracking issue) since I just noticed it's the same day as the libs review of gcc. I thought you might appreciate not having to think about same-file at the same time as that.

1 Like

Here are some notes on inode numbers and checking whether two files are "the same" on Unix. I know rather too much about this thanks to my time working on GCC's preprocessor: the redundant-include detector manages to hit all of the corner cases simultaneously, and when #pragma once is in use, both false positives and false negatives are catastrophic.

The executive summary is that the check currently done by same-file will experience both false positives and false negatives when working with file systems where the "inode numbers" reported by the kernel do not correspond to an actual data structure on disk (notably FAT and HFS+), and when working with network file servers. Errors are, paradoxically, more likely to happen with the Handle API than with the basic is_same_file, because Handle comparisons are more likely to involve comparing inode numbers observed at widely separated times. Holding files open does not help.


Inode numbers were originally an exposed implementation detail of the first several generations of UNIX file systems. (If you haven't already read McKusick's paper A Fast File System for UNIX, you should probably stop reading these notes now, go read that, and come back when you're done.) A directory entry in these file systems was a 2-tuple (name, inode number), and the inode number was just the index of an actual on-disk object, the inode, in an on-disk array. An inode (i is either for "internal" or "index", depending who you ask) contains all of the meta-information about each file apart from its name(s), and points to the disk blocks containing the file's data. (Contrast FAT, where all of the meta-information about a file is embedded in its directory entry. In all generations of FAT, there is no on-disk datum that corresponds to an "inode number.")

Thus, in principle, if two directory entries refer to the same inode number, they must refer to the same file. But even in the days of 4.2BSD, there were two important exceptions:

  • Inode numbers only uniquely identify a file within a file system. You must also compare the device numbers to be sure that two files are the same. (I checked, and same-file does do this, but I mention it anyway, because it's a common mistake.)

  • Inode numbers are reused when files are deleted. If you 'stat' a pathname at two different times and observe the same inode number, it does not mean that the file is the same, because it could have been deleted and replaced, and the old inode number reused. (This is guaranteed not to happen if you hold the file open over the interval, because the "open file description" holds a reference to the inode and prevents it from being reused; but see below. Some newer file systems expose "generation numbers" that can detect this situation, but not all.)

Both of these are sources of false positives: two files are identified as the same, when they aren't.

Now, on a more modern system, you have several additional things to worry about. Most obviously, modern systems will support FAT, HFS+, and other filesystems that don't have inode numbers, even if they are some variation of Unix. When the file you're looking at is on one of those filesystems, the inode number you observe was made up somehow by the kernel, and may be stable only for as long as the filesystem remains mounted, or only as long as the file is open. That's actually more troublesome than inode numbers getting reused when files are deleted and replaced, because it can cause both false positives and false negatives, and because holding the file open will not protect you (the operating system can't stop the user from yanking a USB stick with the filesystem still mounted and then putting a different one into the same slot).

Similar problems come up when network file systems are in use. Depending on a lot of fiddly details, NFS and SMB may or may not give you inode numbers that are stable over a mount-unmount cycle; more importantly, the numbers may or may not be stable if the server crashes and is rebooted. If you hold a file open across a server reboot, you'll get a "stale file handle" error the next time you try to use the file handle, but if you're just doing stat on the pathname and hoping that having the file open will protect you, you're SOL. Again, this can cause both false positives and false negatives.


I am not aware of alternative approaches that will work in general. The current redundant-include detector in GCC actually doesn't use inode numbers, because they were found to be unreliable (particularly on HFS+, if memory serves); it instead checks last modification time, file size, and contents, and it still isn't bulletproof: it can get false negatives (files are thought to be different, when they're the same) in the presence of clock skew between network file server and client.

Specific file systems may provide extended meta-information that includes a long-term stable, unique label for each file, perhaps even one that is at least "almost surely" unique across multiple file systems (e.g. a GUID). I don't actually know of a concrete example, but if you want to find one I would suggest you look at ZFS, BtrFS, and Microsoft and Apple's next-generation filesystems, whose names I do not remember right now.

For the specific case of loop detection when doing a directory tree walk, the Right Thing is to canonicalize and compare pathnames. This can be made 100% reliable, because a directory (unlike a file) can have only one canonical pathname.
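
For reference, here is a minimal sketch of that canonicalize-and-compare approach for a directory walk, using std::fs::canonicalize; the seen set and the recursive walk are illustrative, not code from same-file or walkdir:

```rust
use std::collections::HashSet;
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

fn walk(dir: &Path, seen: &mut HashSet<PathBuf>) -> io::Result<()> {
    // A directory has exactly one canonical pathname, so seeing a repeat
    // means we already visited this directory (e.g. via a symlink loop).
    let canon = fs::canonicalize(dir)?;
    if !seen.insert(canon) {
        return Ok(());
    }
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        // fs::metadata follows symlinks, so symlinked directories are walked
        // too, which is exactly when loop detection matters.
        if fs::metadata(&path)?.is_dir() {
            walk(&path, seen)?;
        }
    }
    Ok(())
}
```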

13 Likes

Thanks for the deep insights @zackw! This is really interesting stuff.

I'd like to frame this discussion a bit: our goal is to release same-file 1.0 with a stable API (and plenty of docs).

I think the big question is: can the current public API support changes to the implementation without breaking? I feel like it could. If the only observable effects of a change are fewer false positives/negatives then that sounds like something we can address post 1.0 as a minor bump.

With that in mind, perhaps we need to do something with the ino method on Handle.

Having said all that, I definitely don't want to get in the way of this valuable discussion! For visibility's sake, post libz blitz, maybe it belongs as an issue on same-file? It's an important topic and I don't think it'll be easy to find again in a few months.

Does anyone have any other thoughts?

Hi everyone!

I'm looking at opening a tracking issue for same-file this week with the feedback raised here. Is there anything else anyone wants to add? Are you all happy enough with the idea of same-file proceeding to 1.0 with its current implementation?

Just one thing:

I would encourage you to remove the ino and dev methods entirely, or at least move them to an unstable trait.

I meant to say this before, but I wanted to check what the docs said first, and docs.rs wasn't working for me at the time, and then I forgot about it. Sorry.

ripgrep uses the ino method as a quick way to determine whether two files could possibly be the same before doing the more expensive check. This is part of a feature that avoids searching a file that ripgrep is redirecting stdout to. That is, if you do rg pattern > output, then ripgrep won't search output because it knows it's redirecting stdout to it. The code for that is here: ripgrep/src/main.rs at 229b8e3b3322177e2a518d87358b3503dfc18cdf · BurntSushi/ripgrep · GitHub --- AFAIK, this can't work if the only way to detect equivalent file paths is by comparing the file paths themselves. But maybe there is a better way to do this that I don't know about?
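
A rough sketch of that pattern (cheap inode comparison first, full handle comparison only on a possible match), assuming the Handle::stdout, Handle::from_path, and ino methods as they exist in same-file 1.0; the function itself is invented and is not ripgrep's actual code:

```rust
use same_file::Handle;
use std::io;
use std::path::Path;

#[cfg(unix)]
fn is_redirected_stdout(path: &Path, stdout: &Handle) -> io::Result<bool> {
    use std::os::unix::fs::MetadataExt;

    // Cheap pre-filter: a single stat. Different inode numbers can never be
    // the same file, so we avoid opening the candidate in the common case.
    if std::fs::metadata(path)?.ino() != stdout.ino() {
        return Ok(false);
    }
    // Possible match: pay for the full comparison (device + inode, with the
    // handles held open).
    Ok(Handle::from_path(path)? == *stdout)
}
```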

I think this sounds OK to me, but I'd really like to figure out why all of the recursive directory iterators I've looked at don't do this before actually committing to this. (And this only applies to walkdir. It doesn't apply to stdout detection I mentioned above.)

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.