You will never get an accidental collision because of SHA-1. The likelihood of this happening due to SHA-1 is astronomically small (literally, there may not be enough energy available in our solar system to compile enough programs for this to ever happen). The attacks on SHA-1 do not change that fact, because it works fine when not fed specially-crafted data (and such data is also astronomically unlikely to happen by chance).
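To put a rough number on "astronomically small", here is a minimal sketch of the standard birthday bound for a 160-bit digest. It assumes SHA-1 behaves like a random function on non-adversarial inputs, and the workload size is a made-up figure for illustration only.

```python
from fractions import Fraction

HASH_BITS = 160  # SHA-1 digest size in bits


def collision_probability_bound(num_builds: int, bits: int = HASH_BITS) -> Fraction:
    """Upper bound on the chance that any two of num_builds random digests collide.

    This is the union (birthday) bound: number of pairs divided by 2**bits.
    """
    pairs = num_builds * (num_builds - 1) // 2
    return Fraction(pairs, 2 ** bits)


# Hypothetical workload: 10**18 distinct binaries (an illustrative number only).
bound = collision_probability_bound(10 ** 18)
print(float(bound))  # on the order of 1e-13
```

Even at that absurd scale the bound stays far below anything worth worrying about for accidental collisions.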
Just to satisfy my curiosity, I found some historical context in the form of a wiki page by the original author of build-id and its implementation in GNU ld, Roland McGrath:
In later contemplation, I became dissatisfied with using a checksum on the entire contents of the .debug file. I would like to be able to perform transformations on the DWARF data after the fact (fancy compression et al) and still say "this is the DWARF data for your binary". It feels wrong to have to edit the stripped binary to make it match transformed debug data. (In the abstract, I also think one should always be able to do spurious ELF file layout juggling that doesn't change the semantics of the data.) So I thought about using sha1sum of the loaded segments and phdrs or something. But that's not right. Your main(){} and my main(){} are going to produce the same stripped binaries, but my binaries should get me the source code with my comments in it, and not yours just because it's identical modulo pontification. (It's all about the pontification!) But, the real plan behind using a strong checksum was never actually to compute a checksum from the data ever again after build time, because you really just rely on the comparison of a strongly unique embedded identifier.
[..]
What we really want is a unique build ID. At first I liked the canonical UUID generation for this (128 bits of random or of something time and host-based). But that has the very undesirable property of making for unreproducible builds, where it's between difficult and impossible to start from the same conditions and repeat the procedure of making binaries from all the same constituents to get binaries without gratuitous differences from the original build. Perhaps something like sha1sum of the unstripped file is what we want to use as the basis of a reproducible identifier unique to completely identical builds. But I'd like to specify it explicitly as being a unique identifier good only for matching, not any kind of checksum that can be verified against the contents. (There are external general means for content verification, and I don't think debuginfo association needs to do that.)
It sounds like he was weighing various pros and cons and didn't want to pick a single option. Ultimately, even the first version committed to GNU ld in 2007 already included the three main options supported today: a randomly generated UUID, a hash, or a manually specified ID, with hash being the default. Originally it only supported MD5 hashes, but SHA-1 was added and made the default literally a week later.
When hashing is chosen, the hash includes debug sections (both then and now), so it doesn't support the use case mentioned in the first paragraph of transforming DWARF info.
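As a rough illustration of the trade-off between those options, here is a toy model of the ID styles. The function name make_build_id and the choice to hash one flat byte string are my assumptions; this is not what GNU ld actually feeds into its hash.

```python
import hashlib
import uuid


def make_build_id(style: str, image: bytes = b"") -> bytes:
    """Toy model of the build-id styles discussed above (not GNU ld's algorithm)."""
    if style == "uuid":
        # 128 random bits: unique, but different on every build.
        return uuid.uuid4().bytes
    if style == "md5":
        return hashlib.md5(image).digest()
    if style == "sha1":
        # Content hash: identical input bytes give an identical ID.
        return hashlib.sha1(image).digest()
    if style.startswith("0x"):
        # Manually specified ID, given as a hex string.
        return bytes.fromhex(style[2:])
    raise ValueError(f"unknown build-id style: {style!r}")


image = b"pretend this is the linked output"
print(make_build_id("sha1", image) == make_build_id("sha1", image))  # True
print(make_build_id("uuid") == make_build_id("uuid"))                # False
```

The hash styles are the only ones that are both automatic and reproducible, which matches the reasoning in the quoted wiki page.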
Well, compilers and linkers typically are deterministic. Still, I agree, trying to pick a build ID in advance seems practically impossible to do soundly. You would need to hash the linker itself, plus any libraries the linker uses (e.g. LTO plugin), plus of course any libraries you're linking your binary against, even though locating those on disk is usually the linker's job…
I looked for open source code that manually specifies a build ID, and there isn't much of it. There's a hit from the Go toolchain, which I'm not too familiar with, but I checked: even when forcing the use of an external linker (go build -x -a -ldflags=-linkmode=external hello.go), the standard Go toolchain doesn't actually pass an explicitly specified ID to the linker. gccgo, though, does.
The SVH contains the version string, but not the actual source of rustc. Hashing the rustc executable would both be way too slow and make rustc versions compiled for different hosts produce different outputs even when compiling for the same target.
No, but even just normal linking without using build-id can be slow. I expect build-id hashing to produce a slowdown of at least 10% over using lld as linker.
Indeed! I just checked on my Debian laptop, and I do get a build-id by default there. (The initial “discovery” and post was made from my workstation, running NixOS.)
Thanks for pointing this out. I think that makes the situation even clearer: most users already pay the cost of link-time hashing, and it seems to be unnoticeable, but by default it is only included when building on most mainstream Linux distributions.
As such, I strongly believe it makes even more sense to enable it by default: not only for all the benefits we already discussed, but also so it behaves consistently regardless of which Linux distribution the user runs (and, more generally, whether gcc was built with --enable-linker-build-id).
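For anyone who wants to check what ended up in a binary, the build-id lives in an ELF note with owner name "GNU" and type 3 (NT_GNU_BUILD_ID). Below is a minimal sketch of walking the raw bytes of a note section. It assumes the section contents have already been extracted by some other means, and find_build_id is a name I made up.

```python
import struct

NT_GNU_BUILD_ID = 3  # note type used for the GNU build-id


def _align4(n: int) -> int:
    """Round n up to the next multiple of 4 (note fields are 4-byte aligned)."""
    return (n + 3) & ~3


def find_build_id(note_data: bytes, little_endian: bool = True):
    """Return the build-id as a hex string, or None if no such note is present."""
    fmt = "<III" if little_endian else ">III"
    offset = 0
    while offset + 12 <= len(note_data):
        # Each note starts with three 32-bit words: name size, desc size, type.
        namesz, descsz, ntype = struct.unpack_from(fmt, note_data, offset)
        offset += 12
        name = note_data[offset:offset + namesz]
        offset += _align4(namesz)
        desc = note_data[offset:offset + descsz]
        offset += _align4(descsz)
        if name == b"GNU\x00" and ntype == NT_GNU_BUILD_ID:
            return desc.hex()
    return None


# Hand-built sample note with a 20-byte (SHA-1 sized) descriptor.
sample = struct.pack("<III", 4, 20, NT_GNU_BUILD_ID) + b"GNU\x00" + bytes(range(20))
print(find_build_id(sample))
```

A binary linked without a build-id simply has no such note, so the function returns None.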
Hang on. I think there's something that doesn't make sense here. At least to me.
Rustc uses the local C compiler as a linker sometimes; in that case, it picks up this build id thing.
But this only works if you happen to have a gcc with the right flags? What if I burned my gcc install to the ground and replaced it with Clang?
What would it mean for rustc to enable this "by default", if this is a choice GCC makes?
And stepping back... what problem are we solving here again? You cite some tools that want this. I use a debugger every day and have never had a need for this, and I actively zap notes sections out of my binaries in some cases.
one typically wants build-id when examining an existing binary (or a coredump produced by it) and needing to associate it back to a specific build, so re-running the build with those flags is not possible.
What does this mean? Why are you debugging a random binary you didn't build?
This whole thread sounds to me like "GCC does a thing; can rustc do it too?", without much justification beyond "GCC does it".
There are quite a few reasonable reasons to do this. Most commonly, you have a pre-built binary (such as from a Linux distribution), and a source of debug symbols for that binary, and you want to do some simple debugging along the lines of "attach, watch for condition/signal/etc, thread apply all bt full", in the course of helping to report and/or reproduce a bug.
This is the default on Unix, because it's traditionally the C compiler (driver)'s job to add various crt objects, standard libraries, and search paths to the command line before invoking the linker. (rustc does have some support for adding that stuff itself and invoking the linker directly, which you can use by passing -C linker=ld. But at least on the two systems I tried, it doesn't even work as-is; there are missing search paths so the linker ends up complaining it can't find libraries.)
But this only works if you happen to have a gcc with the right flags? What if I burned my gcc install to the ground and replaced it with Clang?
rustc invokes the C compiler by running the command cc, so it depends on your PATH and what cc points to on your system.
What would it mean for rustc to enable this "by default", if this is a choice GCC makes?
GCC adds --build-id relatively early in the linker command line, but you can ask GCC to pass arbitrary arguments to the linker command line with -Wl, and these will appear later. The linker accepts multiple --build-id arguments and keeps the last one (this is documented). So the effect would be to override GCC's setting.
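The "last one wins" behaviour can be modelled in a few lines. This is only a sketch of the rule as described above, not the linker's real option parser. The helper names are hypothetical, and treating a bare --build-id as sha1 is an assumption that matches GNU ld's documented default.

```python
def expand_wl(driver_args):
    """Expand GCC-style -Wl,a,b driver arguments into linker arguments a, b."""
    linker_args = []
    for arg in driver_args:
        if arg.startswith("-Wl,"):
            linker_args.extend(arg[len("-Wl,"):].split(","))
    return linker_args


def effective_build_id(linker_args):
    """Return the build-id style that takes effect, or None if disabled or absent."""
    style = None
    for arg in linker_args:
        if arg == "--build-id":
            style = "sha1"  # assumption: bare form selects the default hash
        elif arg.startswith("--build-id="):
            style = arg.split("=", 1)[1]  # a later option overrides earlier ones
    return None if style in (None, "none") else style


driver_default = ["--build-id"]  # what a GCC built with --enable-linker-build-id adds
user_args = expand_wl(["-Wl,--build-id=none", "-O2"])
print(effective_build_id(driver_default))              # sha1
print(effective_build_id(driver_default + user_args))  # None
```

The same mechanism works in the other direction: a later --build-id=sha1 would turn it on regardless of what the driver passed earlier.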
Why are you debugging a random binary you didn't build?
I debug things on my system all the time, either because they're not working or because I want to see how they work. Or even if I'm debugging my own program, I can get better stack traces if I have symbols for the system libraries that it loads.
It's helpful when the debug info is split from the executable binary.
This is a very common situation, as one might debug:
a binary that their Linux distro built;
a binary that they built themselves but in another environment, e.g. built in CI/CD, the executable got pushed to someplace where it was run, produced a coredump, and one would like their debugger to automatically fetch the debug info from a central location; this is pretty common in more-mature production environments.
It is also really helpful when multiple builds of the binary exist, as otherwise finding the right version of the debug info can be really ambiguous.
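To make the lookup concrete: debuggers such as GDB locate separate debug files by build-id using a path of the form .build-id/ab/cdef….debug under a debug directory, where "ab" is the first byte of the ID in hex. A minimal sketch of that mapping follows. The function name is mine and the default root is the common /usr/lib/debug location, so treat both as assumptions.

```python
from pathlib import PurePosixPath


def debug_file_path(build_id_hex: str, debug_root: str = "/usr/lib/debug") -> PurePosixPath:
    """Map a hex build-id to the conventional separate-debug-file path."""
    build_id_hex = build_id_hex.lower()
    if len(build_id_hex) < 3 or any(c not in "0123456789abcdef" for c in build_id_hex):
        raise ValueError(f"not a usable hex build-id: {build_id_hex!r}")
    # First two hex digits become a directory, the rest the file name.
    return (
        PurePosixPath(debug_root)
        / ".build-id"
        / build_id_hex[:2]
        / (build_id_hex[2:] + ".debug")
    )


print(debug_file_path("abcdef0123456789"))
# /usr/lib/debug/.build-id/ab/cdef0123456789.debug
```

Because the key is the ID embedded in the binary (or coredump), the right debug info can be found even when several builds of the same program are installed or archived.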
... Over half of my initial post (not counting the console log) was dedicated to an explanation of why build-id is useful? If it's not useful to you, that's great, but that doesn't mean it isn't useful to others.