`fs::read_dir` can't read directory entries with very long names

Recently I read the article How (not) to use readdir_r(3) (2012). It describes an issue with the readdir_r function on file systems that allow file names longer than 255 bytes.

From the article:

Here's an AOSP patch: libc: readdir_r smashed the stack. Some engineers point out that FAT32 lets you store 255 UTF-16 characters, Android's struct dirent has a 256-byte d_name field, so if you have the right kind of filename on a FAT32 file system, it won't fit in a struct dirent.

Since std::fs::read_dir uses readdir64_r (code), I was curious to see what happens in Rust. As shown below, fs::read_dir returns a `code: 0` error.

I tested the problem on Linux, but I think it may also happen on other Unix-like systems.

Steps to reproduce

  1. Create a FAT32 image.

    dd if=/dev/zero of=/tmp/vfat bs=1M count=32
    
    mkfs.vfat -F 32 /tmp/vfat
    
    sudo mount -o loop,uid=$UID /tmp/vfat /mnt/disk
    
  2. Create some files whose names need more than 255 bytes to be stored.

    create_file() { ruby -e 'File.open("/mnt/disk/" + ARGV[0] * ARGV[1].to_i, "a")' -- "$@"; }
    
    create_file á 127    # 254 bytes
    create_file é 128    # 256 bytes
    create_file í 129    # 258 bytes
    

    create_file S N creates a file whose name consists of the string S repeated N times.

    Only the first file (á × 127) can be copied to a non-FAT32 file system:

    $ cp -a /mnt/disk/ $(mktemp -d)
    cp: cannot create regular file '/tmp/tmp.ZBXQO63Syn/disk/éééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééé': File name too long
    cp: cannot create regular file '/tmp/tmp.ZBXQO63Syn/disk/ííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííí': File name too long
    

    Other programs, like ls and find, do not have any issue.

  3. Read the contents of the directory with a Rust program.

    // ls.rs
    fn main() {
        for arg in std::env::args_os().skip(1) {
            for f in std::fs::read_dir(arg).unwrap() {
                println!("{:?}", f);
            }
        }
    }
    

    Execute it against the FAT32 image:

    $ rustc -O ls.rs
    
    $ ./ls /mnt/disk/
    Ok(DirEntry("/mnt/disk/ááááááááááááááááááááááááááááááááááááááááááááááááááááááááááááááááááááááááááááááááááááááááááááááááááááááááááááááááááááááááááááááá"))
    Err(Os { code: 0, kind: Uncategorized, message: "Success" })
    

Cause

The problem happens because readdir64_r uses a fixed size for the file name: in the definition of struct dirent64, the d_name field is a 256-byte array.

The struct dirent for macOS uses a 1024-byte array, so I assume that macOS is not affected by this issue.

Other File Systems

The post mentioned above is about FAT32, but according to Wikipedia there are other file systems that allow file names longer than 255 bytes:

| File System | Limit |
| --- | --- |
| APFS | 255 UTF-8 characters |
| GPFS | 255 UTF-8 characters |
| HFS Plus | 255 UTF-16 characters |
| NSS | 256 characters |
| NTFS | 255 UTF-16 characters |
| ReFS | 255 UTF-16 characters |
| Reiser4 | 3976 bytes |
| ReiserFS | 4032 bytes |
| SquashFS | 256 bytes |
| exFAT | 255 UTF-16 characters |

Conclusion

I posted the issue here because I'm not sure if this is actually a bug. The conditions to trigger it are very rare.

However, the current manpage for readdir_r(3) recommends using readdir instead:

It is recommended that applications use readdir(3) instead of readdir_r(). Furthermore, since version 2.24, glibc deprecates readdir_r(). The reasons are as follows:

  • On systems where NAME_MAX is undefined, calling readdir_r() may be unsafe because the interface does not allow the caller to specify the length of the buffer used for the returned directory entry.

  • On some systems, readdir_r() can't read directory entries with very long names. When the glibc implementation encounters such a name, readdir_r() fails with the error ENAMETOOLONG after the final directory entry has been read. On some other systems, readdir_r() may return a success status, but the returned d_name field may not be null terminated or may be truncated.

  • In the current POSIX.1 specification (POSIX.1-2008), readdir(3) is not required to be thread-safe. However, in modern implementations (including the glibc implementation), concurrent calls to readdir(3) that specify different directory streams are thread-safe. Therefore, the use of readdir_r() is generally unnecessary in multithreaded programs. In cases where multiple threads must read from the same directory stream, using readdir(3) with external synchronization is still preferable to the use of readdir_r(), for the reasons given in the points above.

  • It is expected that a future version of POSIX.1 will make readdir_r() obsolete, and require that readdir(3) be thread-safe when concurrently employed on different directory streams.


I think this is a case where we should just go ahead and change fs to use readdir, at least with C libs where it's known to be thread safe.

The alternative would be to cut the C library out of the loop and use open(..., O_DIRECTORY) and getdents, the main problem with which is that getdents is very poorly documented and may not even be an officially supported API on all the OSes Rust cares about.


Yeah -- feel free to tag me as a reviewer if someone wants to make it use readdir on linux-gnu, and any other targets you care to audit for thread safety.


Switching to readdir sounds like it might be the right option for UNIX targets where we know it's thread-safe to do so. But also, I think we should absolutely switch to getdents64 on Linux systems, for performance.


It might not be a bug on the rust side, but it feels like a security issue waiting to happen that will be erroneously blamed on rust... I like that it's being proactively fixed!


Why do you think that would help performance? Would you use a larger buffer, or what?

Anyway, let's consider that in steps, since readdir is a smaller change.

getdents allows specifying the buffer size, while readdir doesn't. I think users can often gain performance by using getdents because they know the right buffer size, but I am not sure whether Rust can. Since std::fs::read_dir doesn't take a buffer size either, Rust can only win by choosing a better default value than the C library, assuming the C library's default is suboptimal. Is that the case? (i.e., is the C library's default widely known to be suboptimal?)


Glibc uses the directory's st_blksize for its buffer allocation, clamped between 32K and 1MB.

Musl uses a static 2K buffer.

In case anyone finds it useful, I did start on this for walkdir: walkdir/mod.rs at 1d7293a5a1ef548ce587a0b08abce5f21571a100 · BurntSushi/walkdir · GitHub

I'm skeptical that direct use of getdents would do much for std, given the API we have. There isn't a lot of opportunity to amortize allocation, and glibc likely already uses buffering internally (as noted by others). I suspect further wins would likely require API design, e.g., by providing access to the buffer used by getdents directly (for example).


The default buffer size can definitely be too low. For example, http://be-n.com/spw/you-can-list-a-million-files-in-a-directory-but-not-with-ls.html required increasing the buffer size to a couple of MB to list all directory entries in a reasonable amount of time. Libstd could offer an extension method on ReadDir that sets the buffer size to help with this, if it were to start using getdents directly.


I wrote an implementation of the std::fs::ReadDir iterator using the getdents64 syscall, with no libc wrappers.

Please note that this is experimental, and there may be bugs either in the implementation or in the benchmark code, so the results should not be conclusive.

The code is available in the test-rust-getdents64 repository.

Main differences from std::fs::ReadDir:

  • readdir64_r locks a mutex in every call.

    I didn't include any synchronization because the calls to getdents64 are protected by a mutable reference, so it should be guaranteed that nothing else is using the file descriptor.

  • This implementation is not affected by the 255-byte limit on file names.

  • The size of the buffer to get data from getdents64 is based on the st_size field of the directory.

Buffer size

As far as I could see, there is no infallible method to get the best size for the buffer.

The field st_blksize does not look very useful. In my tests it is always 4096 (even for directories with thousands of files).

Golang, which uses getdents64 with no libc wrappers, initially used a fixed buffer of 4096 bytes, but it was increased to 8192 because of some issues with CIFS.

For this experimental implementation I tried an approach that minimizes the number of syscalls. The idea is to take the size reported in the st_size field, round it up to the next multiple of BUFSIZ (usually 8 KiB), and clamp the result between BUFSIZ and 1 MiB. In most tests, this approach is able to get all directory entries in a single call to getdents64, with very little wasted memory.

The main issue is with directories with a lot of removed files, because the st_size does not shrink.

For example:

$ mkdir /tmp/x

$ stat -c %s /tmp/x
4096

$ touch /tmp/x/{1..1000}

$ stat -c %s /tmp/x
20480

$ rm /tmp/x/{1..1000}

$ stat -c %s /tmp/x
20480

$ ls /tmp/x | wc -l
0

Even so, this method looks like the best balance.

Benchmarks

The test-rust-getdents64 repository includes a program to compare its implementation with the one available in the standard library.

The test is something like this:

let deadline = Instant::now() + Duration::from_secs(10);
while Instant::now() < deadline {
    let run_start = Instant::now();

    for f in fs::read_dir(path).unwrap() {
        drop(f);
    }

    let elapsed = run_start.elapsed();
    // Compute average, max, min from `elapsed`.
}

The program has a -p flag to print the names of the files. Its output has to be identical to that of ls -1UA in the same directory, and it is used to verify the implementation.

Results

The following table shows the average time to read all entries in a directory.

The first column is the number of files in the directory, from 0 to 2M.

The last column shows how fast this experimental implementation is compared to the standard library.

Average

| N. Files | std | syscall | std / syscall |
| --- | --- | --- | --- |
| 0 | 1 μs | 1 μs | 0.95 |
| 16 | 3 μs | 3 μs | 1.08 |
| 256 | 30 μs | 23 μs | 1.27 |
| 4096 | 487 μs | 386 μs | 1.26 |
| 10000 | 1.26 ms | 989 μs | 1.28 |
| 1000000 | 241 ms | 217 ms | 1.11 |
| 2000000 | 479 ms | 431 ms | 1.11 |

Maximum

| N. Files | std | syscall | std / syscall |
| --- | --- | --- | --- |
| 0 | 3.97 ms | 4.38 ms | 0.90 |
| 16 | 2.98 ms | 2.83 ms | 1.05 |
| 256 | 220 μs | 159 μs | 1.38 |
| 4096 | 679 μs | 768 μs | 0.88 |
| 10000 | 2.55 ms | 1.98 ms | 1.29 |
| 1000000 | 241 ms | 217 ms | 1.11 |
| 2000000 | 479 ms | 431 ms | 1.11 |

Minimum

| N. Files | std | syscall | std / syscall |
| --- | --- | --- | --- |
| 0 | 1 μs | 1 μs | 0.95 |
| 16 | 3 μs | 2 μs | 1.10 |
| 256 | 29 μs | 22 μs | 1.31 |
| 4096 | 481 μs | 376 μs | 1.28 |
| 10000 | 1.24 ms | 973 μs | 1.28 |
| 1000000 | 240 ms | 217 ms | 1.11 |
| 2000000 | 478 ms | 430 ms | 1.11 |

Conclusion

The results here are not conclusive, but I think that they are good as a first step.

My conclusion is:

  1. The standard library should use the readdir function from libc, instead of calling getdents64 directly.

    This is the safest approach, and the performance improvement (even when it is big) does not compensate for the extra work of maintaining a custom readdir implementation.

  2. A crate like walkdir should experiment with a direct call to getdents64.

    During the development of this experiment I used perf(1) to find the bottlenecks, and I saw that memcpy took a lot of time. As @burntsushi said in a previous comment, a different API is needed to reuse the buffer, and that could provide a much bigger improvement.


"Best" depends on your goal. I believe the point of using st_blksize is to optimize how long you're waiting for IO requests, which may be more important than syscall overhead if it's not already cached. If you try to buffer the entire directory list in one syscall, this could be waiting a long time for slow devices, especially with network filesystems.

I suppose this was done without regard to the filesystem cache?

Even if you flushed, I do expect that a more aggressive buffer size will finish sooner for all entries, but it will probably wait longer before it can report the first entry.

Sounds good, and thanks for doing this experiment!

That almost sounds like you are relying on IO safety here?
