Emitting build-id when linking ELF binaries

Hi,

I just noticed that Rust doesn't put a build-id by default in ELF metadata (see reproducing example below). Passing RUSTFLAGS="-C link-arg=-Wl,--build-id" is sufficient, but very inconvenient.

I believe this ought to be the default, on relevant platforms, because:

  • one typically wants build-id when examining an existing binary (or a coredump produced by it) and needing to associate it back to a specific build, so re-running the build with those flags is not possible ;
  • many debug tools rely on this to match DWARF debug info, source files, and executables, for instance through elfutils' debuginfod.

$ cargo init --name hello
     Created binary (application) package
$ cargo build
   Compiling hello v0.1.0 (/run/user/1000/tmp.yMON4DX3SM)
    Finished dev [unoptimized + debuginfo] target(s) in 0.22s
$ cargo build --release
   Compiling hello v0.1.0 (/run/user/1000/tmp.yMON4DX3SM)
    Finished release [optimized] target(s) in 0.15s

$ ./target/release/hello
Hello, world!

$ readelf -n target/{debug,release}/hello

File: target/debug/hello

Displaying notes found in: .note.gnu.property
  Owner                Data size 	Description
  GNU                  0x00000010	NT_GNU_PROPERTY_TYPE_0
      Properties: <procesor-specific type 0xc0008002 data: 01 00 00 00 >

Displaying notes found in: .note.ABI-tag
  Owner                Data size 	Description
  GNU                  0x00000010	NT_GNU_ABI_TAG (ABI version tag)
    OS: Linux, ABI: 2.6.32

File: target/release/hello

Displaying notes found in: .note.gnu.property
  Owner                Data size 	Description
  GNU                  0x00000010	NT_GNU_PROPERTY_TYPE_0
      Properties: <procesor-specific type 0xc0008002 data: 01 00 00 00 >

Displaying notes found in: .note.ABI-tag
  Owner                Data size 	Description
  GNU                  0x00000010	NT_GNU_ABI_TAG (ABI version tag)
    OS: Linux, ABI: 2.6.32

$ cargo --version && rustc --version
cargo 1.53.0
rustc 1.53.0

$ RUSTFLAGS="-C link-arg=-Wl,--build-id" cargo build
   Compiling hello v0.1.0 (/run/user/1000/tmp.yMON4DX3SM)
    Finished dev [unoptimized + debuginfo] target(s) in 0.23s

$ readelf -n target/debug/hello
Displaying notes found in: .note.gnu.property
  Owner                Data size 	Description
  GNU                  0x00000010	NT_GNU_PROPERTY_TYPE_0
      Properties: <procesor-specific type 0xc0008002 data: 01 00 00 00 >

Displaying notes found in: .note.gnu.build-id
  Owner                Data size 	Description
  GNU                  0x00000014	NT_GNU_BUILD_ID (unique build ID bitstring)
    Build ID: de95deca03881849c7f6dd208d22022a48a66982

Displaying notes found in: .note.ABI-tag
  Owner                Data size 	Description
  GNU                  0x00000010	NT_GNU_ABI_TAG (ABI version tag)
    OS: Linux, ABI: 2.6.32
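
For anyone who wants to check for the note programmatically rather than by reading readelf output, here is a minimal Rust sketch of how a note section can be scanned. It assumes little-endian notes with 4-byte alignment (as in .note.gnu.build-id and .note.ABI-tag) and takes the raw section bytes as input; `find_build_id`, `read_u32` and `align4` are names made up for this illustration, and extracting the section bytes from the ELF file is left out.

```rust
/// GNU note type for build IDs (NT_GNU_BUILD_ID in elf.h).
const NT_GNU_BUILD_ID: u32 = 3;

/// Round `n` up to the next multiple of 4 (assumed note field alignment).
fn align4(n: usize) -> usize {
    (n + 3) & !3
}

/// Read a little-endian u32 at byte offset `at`.
fn read_u32(buf: &[u8], at: usize) -> u32 {
    u32::from_le_bytes([buf[at], buf[at + 1], buf[at + 2], buf[at + 3]])
}

/// Scan the raw bytes of a note section and return the descriptor of the
/// first GNU build-id note, if there is one.
fn find_build_id(notes: &[u8]) -> Option<&[u8]> {
    let mut off = 0;
    // Each note starts with a 12-byte header: namesz, descsz, type.
    while off + 12 <= notes.len() {
        let namesz = read_u32(notes, off) as usize;
        let descsz = read_u32(notes, off + 4) as usize;
        let ntype = read_u32(notes, off + 8);
        let name_start = off + 12;
        let desc_start = name_start + align4(namesz);
        let desc_end = desc_start + descsz;
        if desc_end > notes.len() {
            return None; // truncated or malformed note
        }
        let name = &notes[name_start..name_start + namesz];
        if ntype == NT_GNU_BUILD_ID && name == b"GNU\0" {
            return Some(&notes[desc_start..desc_end]);
        }
        off = desc_start + align4(descsz);
    }
    None
}

fn main() {
    // A hand-built note: namesz=4, descsz=4, type=3, "GNU\0", then 4 ID bytes.
    let mut note = Vec::new();
    note.extend_from_slice(&4u32.to_le_bytes());
    note.extend_from_slice(&4u32.to_le_bytes());
    note.extend_from_slice(&NT_GNU_BUILD_ID.to_le_bytes());
    note.extend_from_slice(b"GNU\0");
    note.extend_from_slice(&[0xde, 0x95, 0xde, 0xca]);
    let id = find_build_id(&note).expect("note should contain a build-id");
    let hex: String = id.iter().map(|b| format!("{:02x}", b)).collect();
    println!("Build ID: {}", hex);
}
```

In the first two readelf listings above this kind of scan would come back empty, since neither binary carries an NT_GNU_BUILD_ID note.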

AFAIK build id requires either a random number (bad for reproducibility) or hashing the linker inputs (bad for linker performance). Build id isn't necessary for most people AFAIK, so I don't think it should be enabled by default given the disadvantages. I think it only makes much sense to enable it by default when building distro packages, in which case the build wrapper of the distro could enable it automatically.

@bjorn3 The default behavior for --build-id is deterministic and based on a SHA1 hash, though I did not check what GNU ld puts there exactly (I didn't find those details in the documentation and it didn't seem terribly relevant).

You mention slower linking being an issue, but do you have actual test-cases and numbers? SHA1 can be computed at ~1GiB/s, which is orders-of-magnitude faster than the throughput I'd expect from a (non-trivial) linker.
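
To put the ~1GiB/s figure in perspective, here is a back-of-the-envelope estimate in Rust. The throughput is the one quoted above; the output sizes are hypothetical and only meant to show the scale, and `hash_seconds` is a name invented for this sketch.

```rust
/// Estimated seconds needed to hash `bytes` at `bytes_per_sec` throughput.
fn hash_seconds(bytes: u64, bytes_per_sec: f64) -> f64 {
    bytes as f64 / bytes_per_sec
}

fn main() {
    // Assumed SHA1 throughput: about 1 GiB per second, as quoted in the post.
    let gib = (1u64 << 30) as f64;
    // Hypothetical output sizes, in MiB, chosen only to illustrate the scale.
    for &size_mib in &[1u64, 10, 100, 1024] {
        let secs = hash_seconds(size_mib << 20, gib);
        println!("{:>5} MiB -> {:.4} s", size_mib, secs);
    }
}
```

Under that assumption, even a 100 MiB output would cost roughly a tenth of a second of hashing, which is why actual measurements of the linker would be more convincing than intuition either way.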

If there is indeed an impact that's significant and unacceptable, there are some low-hanging speedups:

  • use a faster hash function than SHA1, like BLAKE3 which is almost 7× faster ;
  • construct a tree hash, so each section can be hashed concurrently as it is written out ;
  • more radically (and maybe most sensible?) set build-id to be the hash that represents that specific build in rustc's compilation cache.
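
The tree-hash idea from the second bullet could look roughly like the Rust sketch below. It is only an illustration of the structure: FNV-1a is used as a small non-cryptographic stand-in for SHA1 or BLAKE3, sections are plain byte vectors, and `fnv1a` and `tree_hash` are hypothetical names, not anything an existing linker provides.

```rust
use std::thread;

/// FNV-1a, 64-bit: a tiny non-cryptographic stand-in for SHA1 or BLAKE3.
fn fnv1a(data: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325;
    for &b in data {
        h ^= b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

/// Hash every section on its own thread, then hash the ordered list of
/// per-section digests to obtain a single root value.
fn tree_hash(sections: &[Vec<u8>]) -> u64 {
    let leaves: Vec<u64> = thread::scope(|s| {
        // Spawn all workers first so the sections are hashed concurrently.
        let handles: Vec<_> = sections
            .iter()
            .map(|sec| s.spawn(move || fnv1a(sec)))
            .collect();
        // Join in section order, so the result does not depend on scheduling.
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });
    let mut root_input = Vec::with_capacity(leaves.len() * 8);
    for leaf in &leaves {
        root_input.extend_from_slice(&leaf.to_le_bytes());
    }
    fnv1a(&root_input)
}

fn main() {
    // Hypothetical section contents.
    let sections = vec![b".text bytes".to_vec(), b".rodata bytes".to_vec()];
    println!("root hash: {:016x}", tree_hash(&sections));
}
```

Because the leaves are combined in section order, the root stays deterministic no matter which thread finishes first, which matters for reproducible builds.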

I think I agree we can use the same crate disambiguator hash we use for symbol mangling and TypeId.

However, distros may want something stronger, and that hash depends on Cargo feeding enough distinguishing data via -Cmetadata.

So maybe what we need is the content hash (not sure if that's called SVH, but it at least used to be) - that will change with changes to the source code, and would allow detecting stale split debuginfo even while developing (if there is e.g. a copy/deploy step).

We definitely compute deep hashes of everything for incremental, and I think we still do some amount of hashing for non-incremental but it would be nice to get a confirmation.

cc @wesleywiser @michaelwoerister

Yes, the SVH is a hash that depends on all build inputs.

The SVH is always computed. It is used to ensure that, if dependencies change, dependent crates are recompiled, which prevents an error.

Making sure build-id is deterministic would be one prerequisite; another would be having a stable way to control whether it's on or off, for people trying to build smaller binaries or for people who need more control over what sections get emitted in their binary.

Perhaps it might make sense to provide a build-id option to control it, but also to have the default of that option depend on whether we're building debug information? Without debug information, having the build-id enabled doesn't provide as much value. (And if you enable debug information in release builds, you may benefit from having the build-id as well.)

I absolutely agree build-id should be deterministic, and the proposed solutions align with that: either letting the linker compute build-id by hashing parts of the executable, or deriving it from the SVH.

Also, build-id does not introduce new sections, it is merely additional metadata in the .notes ELF section (which is already emitted by default).

Regarding binary size, there are several facts I think are relevant:

  1. build-id has a one-time (per ELF executable) 40 bytes overhead in the size of the .notes section ;
  2. it wouldn't affect most code-size constrained environments like microcontrollers (which typically use raw images, not ELF) ;
  3. users who care about a few less bytes of binary size are already post-processing their executables, and presumably remove the whole .notes section (which already weighs in at 64 bytes, build-id not included) ;
  4. in any case, one can prevent build-id from being emitted via RUSTFLAGS: the linker arguments passed in that way always appear after the one set by rustc, and the GNU linker itself only honours the last --build-id option, so -Wl,--build-id=none would work.
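
Point 4 relies on a "last option wins" rule. A minimal Rust model of that rule, taking the claim above at face value, could look as follows; it works on linker-level arguments (after any -Wl, prefix has been stripped by the compiler driver), and `effective_build_id` is a hypothetical helper, not part of rustc or any linker.

```rust
/// Return the build-id style that takes effect when several `--build-id`
/// options are present, assuming (as stated in the post) that only the last
/// one is honoured. A bare `--build-id` is reported as "default".
fn effective_build_id<'a>(linker_args: &[&'a str]) -> Option<&'a str> {
    linker_args.iter().rev().copied().find_map(|arg| {
        if arg == "--build-id" {
            Some("default")
        } else {
            arg.strip_prefix("--build-id=")
        }
    })
}

fn main() {
    // Hypothetical command line: an option added by rustc first, then the
    // user's override coming from RUSTFLAGS.
    let args = ["--build-id", "-o", "hello", "--build-id=none"];
    println!("{:?}", effective_build_id(&args));
}
```

With the hypothetical command line in `main`, the user-supplied `none` is the style that ends up in effect.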

All in all, I don't think this warrants making a stabilized control toggle a hard requirement.

I agree that would make sense, even though it seems to me like a very niche use-case (see 2 and 3 above), but I'd consider it a “nice to have” rather than a “must have”.

Given the low overhead and general utility of build-id, I'd be reluctant to make its default depend on whether debug information is built, as users might find that more surprising and unexpected than it being on-by-default.

Thanks for confirming the SVH is always computed and covers the relevant info, and extra thanks to @eddyb for helping me find where it's defined/computed etc. Apparently it is a SipHash-2-4 hash truncated to 64 bits, which I'd rate as a latent concern for the purpose of build-id:

  • Only accidental collisions need to be considered, as users presumably do not interact with untrusted debug info — or rather, if they trust an executable enough to run and debug it, they presumably trust the associated debug info.

  • The collision probability (due to the birthday bound) becomes non-negligible at ~2³² builds; this is more than reasonable for SVH's intended use case (a local cache, scoped to a checked out crate and its dependencies) but maybe a bit close in the case of build-id:

    • Debian's debuginfod instance indexed 3186963 ≃ 2²² distinct build-ids in the stable and development suites alone ;
    • Fedora's indexed 19146900 ≃ 2²⁴ since v32 (released April 2020) ;
    • users whose environment contains builds from multiple sources (multiple distros, 3rd-party software, internal builds, etc.) may combine debug info sources.
  • A later Rust version can transparently change the way build-id is computed.
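
The birthday-bound reasoning above can be made concrete with the usual approximation p ≈ 1 − exp(−n(n−1)/2^(b+1)) for n builds and an idealised b-bit hash. The Rust sketch below evaluates it for the build counts quoted in the list; `collision_probability` is a name chosen for this example, and the formula models an ideal hash, not SipHash specifically.

```rust
/// Approximate probability of at least one accidental collision among
/// `builds` values drawn uniformly from a `bits`-bit space.
fn collision_probability(builds: f64, bits: i32) -> f64 {
    let pairs = builds * (builds - 1.0) / 2.0;
    let x = pairs / 2f64.powi(bits);
    // 1 - exp(-x), computed with exp_m1 to stay accurate for tiny x.
    -(-x).exp_m1()
}

fn main() {
    // Build counts quoted in the post, plus the ~2^32 threshold.
    let cases = [
        ("Debian", 3_186_963.0),
        ("Fedora", 19_146_900.0),
        ("2^32 builds", 4_294_967_296.0),
    ];
    for &(label, n) in &cases {
        println!(
            "{:<12} 64-bit: {:.3e}   128-bit: {:.3e}",
            label,
            collision_probability(n, 64),
            collision_probability(n, 128)
        );
    }
}
```

For the Fedora figure the 64-bit estimate comes out on the order of 10⁻⁵, and it climbs to roughly 0.39 at 2³² builds, which matches the "fine for now, a bit short in the long term" reading.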

TL;DR: Using SVH as build-id would be fine for now, but maybe a bit short in the long term; we might want to switch to a longer hash (for the SVH itself or build-id only) sometime before the carcinization of all software :crab: :smiling_imp:

A bit of history: some hashes (though I didn't check whether it applies to SVH) used to be BLAKE2, before the switch to 128-bit SipHash.

Since we seem to be computing 128 bits, I don't really know that Svh::new(crate_hash.to_smaller_hash()) is important for compiler performance (AFAIK the SVH is only checked once per dependent crate, to ensure it didn't change from under us).
We could try removing that step, and keeping a whole Fingerprint instead.

Thanks, that's pretty informative. :heart:

Yes, as far as I can tell there should be little to no performance impact to making SVH non-truncating, and that seems worth doing if it's easy and doesn't break anything.

I guess that ought to be split out of the build-id MCP, once we get there?

All queries that load metadata from external crates also depend on tcx.crate_hash(). This is necessary to ensure that incremental compilation will invalidate all queries that depend on changed crate metadata.

:+1:

Oh, is that what we replaced the (more intricate) metadata dependency hashing with?

We could keep around the larger hash and "just" ignore some bits when it's being used for that purpose, I suppose (ideally we can benchmark both).

I'm skeptical of using the SVH as a build-id. Two binaries with the same build ID don't just need to be similar, or ABI compatible. They should have identical code and data, or else GDB will silently load the wrong debug symbols, and show you incorrect file/line information and similar.

And this doesn't just happen if you manually conflate the two builds. GDB will look for symbols at a global filesystem path, and even attempt to fetch them over the Internet from a debuginfod server, in both cases using the build-id as the sole identifier of which binary it wants symbols for. Thus build-ids should be truly globally unique identifiers.
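
To illustrate how little besides the build-id goes into that lookup, here is a small Rust sketch of the conventional on-disk layout GDB uses for separate debug files: the first byte of the ID becomes a directory name under .build-id and the remaining bytes become the file name. The debug directory is passed in as a parameter (commonly /usr/lib/debug, but that depends on the system), and `debug_file_path` is a made-up helper name.

```rust
/// Build the path where a separate debug file for `build_id` is expected,
/// following the `<debug-dir>/.build-id/xx/yyyy....debug` convention.
/// Returns None if the ID is too short to be split into the two parts.
fn debug_file_path(debug_dir: &str, build_id: &[u8]) -> Option<String> {
    if build_id.len() < 2 {
        return None;
    }
    let hex: String = build_id.iter().map(|b| format!("{:02x}", b)).collect();
    Some(format!(
        "{}/.build-id/{}/{}.debug",
        debug_dir,
        &hex[..2],
        &hex[2..]
    ))
}

fn main() {
    // The build ID from the readelf output earlier in the thread.
    let id = [
        0xde, 0x95, 0xde, 0xca, 0x03, 0x88, 0x18, 0x49, 0xc7, 0xf6,
        0xdd, 0x20, 0x8d, 0x22, 0x02, 0x2a, 0x48, 0xa6, 0x69, 0x82,
    ];
    println!("{:?}", debug_file_path("/usr/lib/debug", &id));
}
```

Two different binaries that share a build-id would map to the same path, which is exactly the silent mix-up described above.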

Maybe the SVH will change if you change compiler options. But does it change if you make a 1-line patch to rustc and rebuild your binary with the new rustc? Does it change if you upgrade your linker and rebuild? If the answer is no to either, then the SVH is unsuitable as a build-id, since both of those changes could result in a non-identical binary. (I'm not sure what the answer is, but I think it's "no" in both scenarios.)

I doubt there would be a noticeable impact from using the default SHA1 instead, but it would be good to have actual numbers.

At this point SHA1 is pretty broken, so if we're at all concerned (as you suggest) about the effects of hash collisions, the suggestion of using BLAKE3 seems like a better fit.

You're right that changing rustc won't result in a different "stable crate ID" (used by incremental, TypeId, mangling, etc.), let alone SVH ("stable hash of crate contents"), but I am worried that "just not using those existing hashes for this new purpose" will lead to them remaining that way.

You would likely run into unsoundness without much effort if you can change behavior while keeping those hashes unchanged, given what relies on them.

Does it change if you upgrade your linker and rebuild?

That alone pretty much invalidates all of the other discussion.
In fact, I'm increasingly suspicious of the concept of a build ID.

If the primary use case is looking up split debuginfo, why would anything be used, other than a hash of that split debuginfo? (presumably as e.g. an ELF object with just the debug sections)
As a comparison point, DWARF allows content hashes for files, and they're just that, hashes, not some arbitrary "file version ID".

Or in other words, why isn't it a content-addressed store?
It's so close to being one, and if it was one, any kind of distro tooling that would be able to split out debuginfo would also be injecting the resulting hash of that back into the debuginfo-less binary, meaning Rust would need to do nothing special.

But more importantly than anything, it eliminates so many variables in one go, leaving only hash collisions to worry about.
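
As a rough illustration of the content-addressed idea, here is a toy in-memory store in Rust where the lookup key is nothing but a hash of the debuginfo bytes. FNV-1a stands in for a real cryptographic hash, and `DebugStore`, `put` and `get` are hypothetical names; this is not how any existing debuginfo tooling works.

```rust
use std::collections::HashMap;

/// FNV-1a, 64-bit: a small non-cryptographic stand-in for a real content hash.
fn fnv1a(data: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325;
    for &b in data {
        h ^= b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

/// Toy content-addressed store: blobs are keyed by the hash of their bytes.
struct DebugStore {
    blobs: HashMap<u64, Vec<u8>>,
}

impl DebugStore {
    fn new() -> Self {
        DebugStore { blobs: HashMap::new() }
    }

    /// Store split debuginfo and return the key to embed in the stripped binary.
    fn put(&mut self, debuginfo: Vec<u8>) -> u64 {
        let key = fnv1a(&debuginfo);
        self.blobs.insert(key, debuginfo);
        key
    }

    /// Look up debuginfo by the key found in a binary.
    fn get(&self, key: u64) -> Option<&[u8]> {
        self.blobs.get(&key).map(|v| v.as_slice())
    }
}

fn main() {
    let mut store = DebugStore::new();
    // Hypothetical debuginfo contents.
    let key = store.put(b"DWARF sections of hello".to_vec());
    println!("key {:016x} -> {} bytes", key, store.get(key).map_or(0, |b| b.len()));
}
```

The key depends only on the stored bytes, so compiler, linker and flags stop mattering as separate inputs; only hash collisions remain as a failure mode.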

My understanding is that the SVH covered all build inputs, including the Rust compiler itself.

If it doesn't, or does so at too coarse a granularity, that indeed impacts build-id negatively, but it would also impact other users of the SVH (like the compile cache) and I think it should then be changed.

@eddyb If I understand you correctly, the answer is both “yes the SVH doesn't (sufficiently) cover rustc” and “this is a bug in the SVH” ?

That's an excellent point: the debug info and other build artefacts are linker-dependent.
In the absence of a linker that's captured by the SVH (or whatever other ID) and known to be deterministic, it seems setting build-id prior to linking (as required for passing it as a linker parameter) cannot be sound; am I missing something here?

In this case, I agree it's much more sensible to let the linker determine the build-id, so we went full circle and back to the starting point of the conversation: is there truly an unacceptable impact on linking time when making the GNU linker hash inputs to set build-id ?
@bjorn3, as you raised that issue, do you have concrete numbers?

The premise here is that the sources of debug info etc. are (to some extent) trusted, so I was concerned about randomly occurring collisions rather than deliberate attacks (hence the mention of the birthday bound, and reasoning about an idealised fixed-width hash function).

Nevertheless, I agree using SHA1 there is far from ideal; the problem is, SHA1 is what the GNU linker uses, and rustc doesn't ship a linker of its own and relies on whatever is in the user's environment, so we can't easily go and change that.
(I think it would make sense to implement BLAKE3 and a tree hashing mode in commonly-used linkers, but that's far outside the scope of this discussion)

The problem is, build-id is used as a unique id for a many-to-many relationship: for instance, it's possible to look up the executable binary itself based on build-id, not only the DWARF info; that is pretty useful when dissecting a coredump or such, to automatically get the right instance of the binary that produced it.

Another example: debuginfod and its clients also rely on build-id to associate source files to the binary being debugged.
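
The many-to-many relationship shows up directly in the debuginfod web API, where a single build-id is the key for debug info, the executable and source files alike. The Rust sketch below builds the request URLs; the server address in `main` is a placeholder, and `Artifact` and `debuginfod_url` are names invented for this illustration (real clients also deal with escaping and caching, which is left out).

```rust
/// The kinds of artifact a debuginfod server can be asked for.
enum Artifact<'a> {
    Debuginfo,
    Executable,
    /// Absolute path of a source file, as recorded in the debug info.
    Source(&'a str),
}

/// Build the request URL for one artifact of the build identified by
/// `build_id_hex` (lowercase hex, as printed by readelf).
fn debuginfod_url(server: &str, build_id_hex: &str, artifact: Artifact) -> String {
    let base = format!("{}/buildid/{}", server.trim_end_matches('/'), build_id_hex);
    match artifact {
        Artifact::Debuginfo => format!("{}/debuginfo", base),
        Artifact::Executable => format!("{}/executable", base),
        Artifact::Source(path) => format!("{}/source{}", base, path),
    }
}

fn main() {
    // Placeholder server; the build ID is the one from earlier in the thread.
    let server = "https://debuginfod.example.org";
    let id = "de95deca03881849c7f6dd208d22022a48a66982";
    println!("{}", debuginfod_url(server, id, Artifact::Debuginfo));
    println!("{}", debuginfod_url(server, id, Artifact::Executable));
    println!("{}", debuginfod_url(server, id, Artifact::Source("/src/hello/main.rs")));
}
```

All three requests differ only in the suffix, so a hash of the debug info alone could not serve as the key for the other two lookups without changing the existing tooling.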

Agreed, that would be a much better design, but unfortunately that ship sailed when GNU specified build-id and people did the work to integrate that metadata in debug tooling etc.
In this context, I believe it is better (at least in the short term) to align with what existing tooling understands, rather than roll with our own thing and hope to gain traction.

Trying to summarize the discussion here:

  • rustc picking the build-id wouldn't be sound, due to the reliance on an external linker;
  • the remaining deterministic solution is to let the GNU linker pick the build-id by hashing;
  • build-id behaviour can always be overridden via link-arg, but a config flag would be nice to have.

Outstanding concerns:

  • performance of the GNU linker? (still awaiting concrete numbers and/or testcases)
  • @josh, were your other concerns adequately addressed?

As a point of comparison, gcc has a configure-time build option --enable-linker-build-id which makes it pass --build-id to all linker invocations. I don't know how to find out how widespread this is in practice, but I suspect all major distros enable this since they're all using Build IDs for locating debug info.

I'm curious about the results from the original post, however. Given the fact that rustc invokes the system C compiler as the linker for linux-gnu targets, if your distro's C compiler enables build ids by default then binaries produced by rustc should also have them. I tested this on my local system (Ubuntu 20.04, gcc 9.3.0) and it does result in binaries containing a build id.
