`fs::read_dir` can't read directory entries with very long names

Recently I read the article How (not) to use readdir_r(3) (2012). It describes an issue with the readdir_r function on file systems that allow file names longer than 255 bytes.

From the article:

Here's an AOSP patch: libc: readdir_r smashed the stack. Some engineers point out that FAT32 lets you store 255 UTF-16 characters, Android's struct dirent has a 256-byte d_name field, so if you have the right kind of filename on a FAT32 file system, it won't fit in a struct dirent.

Since std::fs::read_dir uses readdir64_r (code), I was curious to see what happens in Rust. As shown below, fs::read_dir returns a `code: 0` error.

I tested the problem on Linux, but I think it may also happen on other Unix-like systems.

Steps to reproduce

  1. Create a FAT32 image.

    dd if=/dev/zero of=/tmp/vfat bs=1M count=32
    
    mkfs.vfat -F 32 /tmp/vfat
    
    sudo mount -o loop,uid=$UID /tmp/vfat /mnt/disk
    
  2. Create some files whose names need more than 255 bytes to be stored.

    create_file() { ruby -e 'File.open("/mnt/disk/" + ARGV[0] * ARGV[1].to_i, "a")' -- "$@"; }
    
    create_file á 127    # 254 bytes
    create_file é 128    # 256 bytes
    create_file í 129    # 258 bytes
    

    create_file S N creates a file whose name consists of the string S repeated N times.

    Only the first file (á × 127) can be copied to a non-FAT32 file system:

    $ cp -a /mnt/disk/ $(mktemp -d)
    cp: cannot create regular file '/tmp/tmp.ZBXQO63Syn/disk/éééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééé': File name too long
    cp: cannot create regular file '/tmp/tmp.ZBXQO63Syn/disk/ííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííí': File name too long
    

    Other programs, like ls and find, do not have any issue.

  3. Read the contents of the directory with a Rust program.

    // ls.rs
    fn main() {
        for arg in std::env::args_os().skip(1) {
            for f in std::fs::read_dir(arg).unwrap() {
                println!("{:?}", f);
            }
        }
    }
    

    Execute it against the FAT32 image:

    $ rustc -O ls.rs
    
    $ ./ls /mnt/disk/
    Ok(DirEntry("/mnt/disk/ááááááááááááááááááááááááááááááááááááááááááááááááááááááááááááááááááááááááááááááááááááááááááááááááááááááááááááááááááááááááááááááá"))
    Err(Os { code: 0, kind: Uncategorized, message: "Success" })
    

Cause

The problem happens because readdir64_r uses a fixed size for the file name: in the definition of struct dirent64, the d_name field is a 256-byte array.

The struct dirent for macOS uses a 1024-byte array, so I assume that macOS is not affected by this issue.

Other File Systems

The post mentioned above is about FAT32, but according to Wikipedia there are other file systems that allow file names longer than 255 bytes:

| File System | Limit |
| --- | --- |
| APFS | 255 UTF-8 characters |
| GPFS | 255 UTF-8 characters |
| HFS Plus | 255 UTF-16 characters |
| NSS | 256 characters |
| NTFS | 255 UTF-16 characters |
| ReFS | 255 UTF-16 characters |
| Reiser4 | 3976 bytes |
| ReiserFS | 4032 bytes |
| SquashFS | 256 bytes |
| exFAT | 255 UTF-16 characters |

Conclusion

I posted the issue here because I'm not sure if this is actually a bug. The conditions to trigger it are very rare.

However, the current manpage for readdir_r(3) recommends using readdir instead:

It is recommended that applications use readdir(3) instead of readdir_r(). Furthermore, since version 2.24, glibc deprecates readdir_r(). The reasons are as follows:

  • On systems where NAME_MAX is undefined, calling readdir_r() may be unsafe because the interface does not allow the caller to specify the length of the buffer used for the returned directory entry.

  • On some systems, readdir_r() can't read directory entries with very long names. When the glibc implementation encounters such a name, readdir_r() fails with the error ENAMETOOLONG after the final directory entry has been read. On some other systems, readdir_r() may return a success status, but the returned d_name field may not be null terminated or may be truncated.

  • In the current POSIX.1 specification (POSIX.1-2008), readdir(3) is not required to be thread-safe. However, in modern implementations (including the glibc implementation), concurrent calls to readdir(3) that specify different directory streams are thread-safe. Therefore, the use of readdir_r() is generally unnecessary in multithreaded programs. In cases where multiple threads must read from the same directory stream, using readdir(3) with external synchronization is still preferable to the use of readdir_r(), for the reasons given in the points above.

  • It is expected that a future version of POSIX.1 will make readdir_r() obsolete, and require that readdir(3) be thread-safe when concurrently employed on different directory streams.


I think this is a case where we should just go ahead and change fs to use readdir, at least with C libs where it's known to be thread safe.

The alternative would be to cut the C library out of the loop and use open(..., O_DIRECTORY) and getdents, the main problem with which is that getdents is very poorly documented and may not even be an officially supported API on all the OSes Rust cares about.


Yeah -- feel free to tag me as a reviewer if someone wants to make it use readdir on linux-gnu, and any other targets you care to audit for thread safety.


Switching to readdir sounds like it might be the right option for UNIX targets where we know it's thread-safe to do so. But also, I think we should absolutely switch to getdents64 on Linux systems, for performance.


It might not be a bug on the rust side, but it feels like a security issue waiting to happen that will be erroneously blamed on rust... I like that it's being proactively fixed!


Why do you think that would help performance? Would you use a larger buffer, or what?

Anyway, let's consider that in steps, since readdir is a smaller change.

getdents allows specifying the buffer size, while readdir doesn't. I think users can often gain performance by using getdents because they know the right buffer size, but I am not sure whether Rust can. Since std::fs::read_dir doesn't take a buffer size either, Rust can only win by choosing a better default value than the C library, assuming the C library's default is suboptimal. Is that the case? (i.e., is the C library's default widely known to be suboptimal?)


Glibc uses the directory's st_blksize for its buffer allocation, clamped between 32K and 1MB.

Musl uses a static 2K buffer.

In case anyone finds it useful, I did start on this for walkdir: walkdir/mod.rs at 1d7293a5a1ef548ce587a0b08abce5f21571a100 · BurntSushi/walkdir · GitHub

I'm skeptical that direct use of getdents would do much for std, given the API we have. There isn't a lot of opportunity to amortize allocation, and glibc likely already uses buffering internally (as noted by others). I suspect further wins would likely require API design, e.g., by providing access to the buffer used by getdents directly (for example).


The default buffer size can definitely be too low. For example, http://be-n.com/spw/you-can-list-a-million-files-in-a-directory-but-not-with-ls.html required increasing the buffer size to a couple of MB to list all directory entries in a reasonable amount of time. Libstd could offer an extension method on ReadDir that sets the buffer size to help with this, if it were to start using getdents directly.


I wrote an implementation of the std::fs::ReadDir iterator using the getdents64 syscall, with no libc wrappers.

Please note that this is experimental, and there may be bugs either in the implementation or in the benchmark code, so the results should not be conclusive.

The code is available in the test-rust-getdents64 repository.

Main differences from std::fs::ReadDir:

  • readdir64_r locks a mutex in every call.

    I didn't include any synchronization because the calls to getdents64 are protected by a mutable reference, so it should be guaranteed that nothing else is using the file descriptor.

  • This implementation is not affected by the 255-byte limit on file names.

  • The size of the buffer to get data from getdents64 is based on the st_size field of the directory.

Buffer size

As far as I could see, there is no infallible method to get the best size for the buffer.

The field st_blksize does not look very useful. In my tests it is always 4096 (even for directories with thousands of files).

Golang, which uses getdents64 with no libc wrappers, initially used a fixed buffer of 4096 bytes, but it was increased to 8192 because of some issues with CIFS.

For this experimental implementation I tried an approach that minimizes the number of syscalls. The idea is to take the size reported in the st_size field, round it up to the next multiple of BUFSIZ (usually 8 KiB), and clamp the result between BUFSIZ and 1 MiB. In most tests, this approach is able to get all directory entries in a single call to getdents64, with very little wasted memory.

The main issue is with directories with a lot of removed files, because the st_size does not shrink.

For example:

$ mkdir /tmp/x

$ stat -c %s /tmp/x
4096

$ touch /tmp/x/{1..1000}

$ stat -c %s /tmp/x
20480

$ rm /tmp/x/{1..1000}

$ stat -c %s /tmp/x
20480

$ ls /tmp/x | wc -l
0

Even so, this method looks like the best balance.

Benchmarks

The test-rust-getdents64 repository includes a program to compare its implementation with the one available in the standard library.

The test is something like this:

let deadline = Instant::now() + Duration::from_secs(10);
while Instant::now() < deadline {
    let run_start = Instant::now();

    for f in fs::read_dir(path).unwrap() {
        drop(f);
    }

    let elapsed = run_start.elapsed();
    // Compute average, max, min from `elapsed`.
}

The program has a -p flag to print the names of the files. Its output has to be identical to that of ls -1UA in the same directory, and it is used to verify the implementation.

Results

The following table shows the average time to read all entries in a directory.

The first column is the number of files in the directory, from 0 to 2M.

The last column shows how fast this experimental implementation is compared to the standard library.

Average

| N. Files | std | syscall | std / syscall |
| --- | --- | --- | --- |
| 0 | 1 μs | 1 μs | 0.95 |
| 16 | 3 μs | 3 μs | 1.08 |
| 256 | 30 μs | 23 μs | 1.27 |
| 4096 | 487 μs | 386 μs | 1.26 |
| 10000 | 1.26 ms | 989 μs | 1.28 |
| 1000000 | 241 ms | 217 ms | 1.11 |
| 2000000 | 479 ms | 431 ms | 1.11 |

Maximum

| N. Files | std | syscall | std / syscall |
| --- | --- | --- | --- |
| 0 | 3.97 ms | 4.38 ms | 0.90 |
| 16 | 2.98 ms | 2.83 ms | 1.05 |
| 256 | 220 μs | 159 μs | 1.38 |
| 4096 | 679 μs | 768 μs | 0.88 |
| 10000 | 2.55 ms | 1.98 ms | 1.29 |
| 1000000 | 241 ms | 217 ms | 1.11 |
| 2000000 | 479 ms | 431 ms | 1.11 |

Minimum

| N. Files | std | syscall | std / syscall |
| --- | --- | --- | --- |
| 0 | 1 μs | 1 μs | 0.95 |
| 16 | 3 μs | 2 μs | 1.10 |
| 256 | 29 μs | 22 μs | 1.31 |
| 4096 | 481 μs | 376 μs | 1.28 |
| 10000 | 1.24 ms | 973 μs | 1.28 |
| 1000000 | 240 ms | 217 ms | 1.11 |
| 2000000 | 478 ms | 430 ms | 1.11 |

Conclusion

The results here are not conclusive, but I think that they are good as a first step.

My conclusion is:

  1. The standard library should use the readdir function from libc, instead of calling getdents64 directly.

    This is the safest approach, and the performance improvement (even when it is big) does not compensate for the extra work of maintaining a custom readdir implementation.

  2. A crate like walkdir should experiment with a direct call to getdents64.

    During the development of this experiment I used perf(1) to find the bottlenecks, and I saw that memcpy took a lot of time. As @burntsushi said in a previous comment, a different API is needed to reuse the buffer, and that could provide a much bigger improvement.


"Best" depends on your goal. I believe the point of using st_blksize is to optimize how long you're waiting for IO requests, which may be more important than syscall overhead if it's not already cached. If you try to buffer the entire directory list in one syscall, this could be waiting a long time for slow devices, especially with network filesystems.

I suppose this was done without regard to the filesystem cache?

Even if you flushed, I do expect that a more aggressive buffer size will finish sooner for all entries, but it will probably wait longer before it can report the first entry.

Sounds good, and thanks for doing this experiment!

That almost sounds like you are relying on IO safety here?
