Not necessarily. Heuristics can limit this to special cases (same device, same size), and an identity check would not have to checksum the whole file at all: a byte-by-byte comparison can bail out on the first mismatch. In fact this is what FIDEDUPERANGE does in the kernel, so we don’t even have to do it in userspace.
Asymptotically you do not need to touch more bytes than a regular copy would. Either it’s a duplicate and you do 1 read on the in-file and 1 read on the out-file, or it’s not a duplicate and you do 1 read and 1 write.
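To illustrate the early bail-out, here is a minimal userspace sketch of such a comparison. The names `read_full` and `same_contents` are made up for this example, and since FIDEDUPERANGE already does the comparison in the kernel, this only shows the cost model: identical files get one read pass each, and differing files stop at the first mismatching chunk.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Read up to n bytes, retrying on short reads and EINTR.
 * Returns the number of bytes read (less than n only at EOF) or -1. */
static ssize_t read_full(int fd, char *buf, size_t n)
{
    size_t got = 0;
    while (got < n) {
        ssize_t r = read(fd, buf + got, n - got);
        if (r < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (r == 0)
            break; /* EOF */
        got += (size_t)r;
    }
    return (ssize_t)got;
}

/* Hypothetical helper: 1 if both files have identical contents,
 * 0 on the first difference, -1 on an open or read error. */
int same_contents(const char *path_a, const char *path_b)
{
    char a[1 << 16], b[1 << 16];
    int result = -1;

    assert(path_a != NULL && path_b != NULL);
    int fa = open(path_a, O_RDONLY);
    int fb = open(path_b, O_RDONLY);
    if (fa >= 0 && fb >= 0) {
        for (;;) {
            ssize_t na = read_full(fa, a, sizeof a);
            ssize_t nb = read_full(fb, b, sizeof b);
            if (na < 0 || nb < 0)
                break; /* I/O error, result stays -1 */
            if (na != nb || memcmp(a, b, (size_t)na) != 0) {
                result = 0; /* bail out on the first mismatching chunk */
                break;
            }
            if (na == 0) {
                result = 1; /* both hit EOF without a difference */
                break;
            }
        }
    }
    if (fa >= 0)
        close(fa);
    if (fb >= 0)
        close(fb);
    return result;
}
```

In the non-duplicate case the extra cost is bounded by the position of the first differing chunk, which is why the same-size precondition matters: it filters out most candidates before any data is read.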
I think a rough outline of the algorithm would be as follows:
- open target file for read-write without O_TRUNC; read-only fallback in case of EACCES
- stat; we already do this
- check same-device and same-length; bail otherwise
- get FIEMAP and check FIEMAP_EXTENT_SHARED flags; bail if target is not shared as we only want to avoid reflink breaking
- attempt FIDEDUPERANGE on first extent, bail on FILE_DEDUPE_RANGE_DIFFERS
on bailout: ftruncate[1] the tail of the file, as we would have done on open anyway.
positive outcome: 2x read, no data writes, disk space saved
negative outcome: 2 extra syscalls and reading 1 extra block, fallback to normal copy
most common outcome: 1 extra syscall (the truncate) as the preconditions are not fulfilled
[1]: This delayed truncate after stat also avoids the footgun of copying a file into itself, because we can check whether the device ID and inode number match and bail out if they do.
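The outline could be sketched roughly as follows on Linux. This is not a finished implementation: `plan_copy`, `first_extent_shared`, `dedupe_prefix` and the `copy_plan` enum are names invented for this example, the destination is assumed to be already open read-write without O_TRUNC (the read-only fallback is left out), and only the first extent is probed. Continuing over the remaining extents after a successful probe is omitted.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <unistd.h>
#include <linux/fiemap.h> /* struct fiemap, FIEMAP_EXTENT_SHARED */
#include <linux/fs.h>     /* FS_IOC_FIEMAP, FIDEDUPERANGE */

enum copy_plan {
    PLAN_ERROR = -1, /* fstat or ftruncate failed */
    PLAN_SAME_FILE,  /* src and dst are the same inode: refuse, nothing truncated */
    PLAN_FALLBACK,   /* dst has been truncated; caller does a normal copy */
    PLAN_PROBE_OK    /* first extent is shared and the kernel found it identical */
};

/* 1 if the first extent of fd starts at offset 0 and is flagged as shared. */
static int first_extent_shared(int fd, uint64_t *len)
{
    struct fiemap *fm = calloc(1, sizeof *fm + sizeof(struct fiemap_extent));
    int shared = 0;
    if (!fm)
        return 0;
    fm->fm_length = FIEMAP_MAX_OFFSET; /* map from offset 0 to EOF */
    fm->fm_extent_count = 1;           /* only the first extent is needed */
    if (ioctl(fd, FS_IOC_FIEMAP, fm) == 0 && fm->fm_mapped_extents == 1) {
        const struct fiemap_extent *e = &fm->fm_extents[0];
        if (e->fe_logical == 0 && (e->fe_flags & FIEMAP_EXTENT_SHARED)) {
            *len = e->fe_length;
            shared = 1;
        }
    }
    free(fm);
    return shared;
}

/* 1 if the kernel reports the first len bytes of both files as identical. */
static int dedupe_prefix(int src_fd, int dst_fd, uint64_t len)
{
    struct file_dedupe_range *r =
        calloc(1, sizeof *r + sizeof(struct file_dedupe_range_info));
    int same = 0;
    if (!r)
        return 0;
    r->src_length = len; /* src_offset and dest_offset stay 0 */
    r->dest_count = 1;
    r->info[0].dest_fd = dst_fd;
    if (ioctl(src_fd, FIDEDUPERANGE, r) == 0)
        same = r->info[0].status == FILE_DEDUPE_RANGE_SAME &&
               r->info[0].bytes_deduped > 0;
    free(r);
    return same;
}

enum copy_plan plan_copy(int src_fd, int dst_fd)
{
    struct stat ss, ds;
    uint64_t len = 0;

    assert(src_fd >= 0 && dst_fd >= 0);
    if (fstat(src_fd, &ss) != 0 || fstat(dst_fd, &ds) != 0)
        return PLAN_ERROR;
    /* Self-copy check, possible only because the truncate is delayed. */
    if (ss.st_dev == ds.st_dev && ss.st_ino == ds.st_ino)
        return PLAN_SAME_FILE;
    /* Cheap preconditions first, then FIEMAP, then the dedupe probe. */
    if (ss.st_dev == ds.st_dev && ss.st_size == ds.st_size && ss.st_size > 0 &&
        first_extent_shared(dst_fd, &len)) {
        if (len > (uint64_t)ss.st_size)
            len = (uint64_t)ss.st_size;
        if (dedupe_prefix(src_fd, dst_fd, len))
            return PLAN_PROBE_OK;
    }
    /* Bailout: the truncate that open(O_TRUNC) would have done anyway. */
    if (ftruncate(dst_fd, 0) != 0)
        return PLAN_ERROR;
    return PLAN_FALLBACK;
}
```

On filesystems without FIEMAP or dedupe support the ioctls simply fail and the sketch takes the fallback path, which matches the "most common outcome" above.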