Here are some notes on inode numbers and checking whether two files
are “the same” on Unix. I know rather too much about this thanks to
my time working on GCC’s preprocessor: the redundant-include detector
manages to hit all of the corner cases simultaneously, and when
`#pragma once` is in use, both false positives and false negatives
are catastrophic.
The executive summary is that the check currently done by `same-file`
will experience both false positives and false negatives when working
with file systems where the “inode numbers” reported by the kernel do
not correspond to an actual data structure on disk (notably FAT and
HFS+), and when working with network file servers. Errors are,
paradoxically, more likely to happen with the `Handle` API than with
the basic `is_same_file`, because `Handle` comparisons are more likely
to involve comparing inode numbers observed at widely separated
times. Holding files open does not help.
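For concreteness, here is roughly how the two APIs under discussion
get used. I believe this matches the `same_file` crate’s documented
`is_same_file` function and `Handle::from_path` constructor, but
treat the details as an assumption:

```rust
use same_file::{is_same_file, Handle};

fn main() -> std::io::Result<()> {
    // One-shot check: both paths are stat'd at (nearly) the same
    // moment, so the window for inode-number trouble is small.
    if is_same_file("config.h", "include/config.h")? {
        println!("same file");
    }

    // Handle-based check: this Handle records (device, inode) now,
    // and holds the file open...
    let first = Handle::from_path("config.h")?;
    // ...arbitrary time may pass here; on FAT, HFS+, or a network
    // mount, the inode number backing `first` need not stay stable,
    // even though the file is held open...
    let second = Handle::from_path("include/config.h")?;
    // ...so this equality test compares identities that were
    // observed at widely separated times.
    if first == second {
        println!("same file");
    }
    Ok(())
}
```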
Inode numbers were originally an exposed implementation detail of the
first several generations of UNIX file systems. (If you haven’t
already read McKusick’s paper *A Fast File System for UNIX*,
you should probably stop reading these notes now, go read that, and
come back when you’re done.) A directory entry in these file systems
was a 2-tuple (name, inode number), and the inode number was just
the index of an actual on-disk object, the inode, in an on-disk array.
An inode (the “i” is either for “internal” or “index”, depending on who you
ask) contains all of the meta-information about each file apart from
its name(s), and points to the disk blocks containing the file’s data.
(Contrast FAT, where all of the meta-information about a file is
embedded in its directory entry. In all generations of FAT, there is
no on-disk datum that corresponds to an “inode number.”)
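To make that concrete: in Seventh Edition UNIX the on-disk directory
entry was literally this 16-byte pair, sketched here as a Rust struct
(the historical declaration was in C):

```rust
/// Sketch of a classic (pre-FFS) on-disk directory entry: nothing
/// but a name and an index into the volume's on-disk inode array.
#[repr(C)]
struct V7DirEnt {
    d_ino: u16,       // inode number: index of the file's inode on disk
    d_name: [u8; 14], // file name, NUL-padded
}
```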
Thus, in principle, if two directory entries refer to the same inode
number, they must refer to the same file. But even in the days of
4.2BSD, there were two important exceptions:
- Inode numbers only uniquely identify a file within a file system.
  You must also compare the device numbers to be sure that two
  files are the same. (I checked, and `same-file` does do this, but I
  mention it anyway, because it’s a common mistake; see the sketch
  below.)
- Inode numbers are reused when files are deleted. If you `stat` a
  pathname at two different times and observe the same inode number,
  it does not mean that the file is the same, because it could have
  been deleted and replaced, and the old inode number reused. (This
  is guaranteed not to happen if you hold the file open over the
  interval, because the “open file description” holds a reference to
  the inode and prevents it from being reused; but see below.
  Some newer file systems expose “generation numbers” that can
  detect this situation, but not all.)
Both of these are sources of false positives: two files are identified
as the same, when they aren’t.
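To spell out the rule from the first item, here is a minimal sketch
using the standard library’s Unix metadata extension; this shows the
idea, not `same-file`’s actual internals:

```rust
use std::fs;
use std::io;
use std::os::unix::fs::MetadataExt;
use std::path::Path;

/// Two paths name the same file only if *both* the device number
/// and the inode number match. Comparing inodes alone is the
/// classic mistake: inode numbers are only unique per filesystem.
fn probably_same_file(a: &Path, b: &Path) -> io::Result<bool> {
    let ma = fs::metadata(a)?;
    let mb = fs::metadata(b)?;
    Ok(ma.dev() == mb.dev() && ma.ino() == mb.ino())
}
```

Even this is only “probably”: the two metadata reads still happen at
different moments, so the reuse race in the second item applies.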
Now, on a more modern system, you have several additional things to
worry about. Most obviously, modern systems will support FAT, HFS+,
and other filesystems that don’t have inode numbers, even if they are
some variation of Unix. When the file you’re looking at is on one
of those filesystems, the inode number you observe was made up somehow
by the kernel, and may be stable only for as long as the filesystem
remains mounted, or only as long as the file is open. That’s actually
more troublesome than inode numbers getting reused when files are
deleted and replaced, because it can cause both false positives and
false negatives, and because holding the file open will not protect
you (the operating system can’t stop the user from yanking a USB stick
with the filesystem still mounted and then putting a different one
into the same slot).
Similar problems come up when network file systems are in use.
Depending on a lot of fiddly details, NFS and SMB may or may not give
you inode numbers that are stable over a mount-unmount cycle; more
importantly, the numbers may or may not be stable if the server
crashes and is rebooted. If you hold a file open across a server
reboot, you’ll get a “stale file handle” error the next time you try
to use the file handle, but if you’re just doing `stat` on the
pathname and hoping that having the file open will protect you,
you’re SOL. Again, this can cause both false positives and false
negatives.
I am not aware of alternative approaches that will work in general. The
current redundant-include detector in GCC
actually doesn’t use inode numbers, because they were found to be unreliable
(particularly on HFS+, if memory serves); it instead checks last
modification time, file size, and contents. Even so, it isn’t
bulletproof: it can get false negatives (files are thought to be
different when they’re the same) in the presence of clock skew
between the network file server and the client.
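For illustration, here is a rough sketch of that strategy (a
hypothetical helper, not GCC’s actual code), including the metadata
gate that produces the clock-skew false negative:

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Heuristic identity check in the spirit described above: cheap
/// metadata first, full contents only as a tie-breaker. Clock skew
/// between an NFS server and client can make the mtimes differ for
/// what is really the same file, yielding a false negative here.
fn looks_like_same_file(a: &Path, b: &Path) -> io::Result<bool> {
    let (ma, mb) = (fs::metadata(a)?, fs::metadata(b)?);
    if ma.len() != mb.len() || ma.modified()? != mb.modified()? {
        return Ok(false); // size or timestamp disagrees: assume distinct
    }
    Ok(fs::read(a)? == fs::read(b)?) // confirm by byte-for-byte compare
}
```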
Specific file systems may provide extended meta-information that
includes a long-term stable, unique label for each file, perhaps even
one that is at least “almost surely” unique across multiple file
systems (e.g. a GUID). I don’t actually know of a concrete example,
but if you want to find one, I would suggest looking at ZFS, Btrfs,
and Microsoft’s and Apple’s next-generation filesystems (ReFS and
APFS, respectively).
For the specific case of loop detection when doing a directory tree
walk, the Right Thing is to canonicalize and compare pathnames.
This can be made 100% reliable, because a directory (unlike a file)
can have only one canonical pathname.
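A sketch of that approach, using `std::fs::canonicalize` to resolve
each directory to its one canonical pathname before recursing:

```rust
use std::collections::HashSet;
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Walk a directory tree, skipping any directory whose canonical
/// pathname has already been visited; this breaks symlink loops.
fn walk(dir: &Path, seen: &mut HashSet<PathBuf>) -> io::Result<()> {
    let canon = fs::canonicalize(dir)?;
    if !seen.insert(canon) {
        return Ok(()); // already visited: a loop, so stop here
    }
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        if path.is_dir() {
            walk(&path, seen)?;
        }
    }
    Ok(())
}
```

Canonical pathnames make the visited-set keys stable in a way that,
per everything above, inode numbers are not.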