Pre-RFC: Cargo SBOM

This is an initial draft of a proposal for getting precise dependency information from Cargo for each compiled output. There are already a number of Cargo extensions (auditable, cyclonedx, bom) that could consume this information rather than reconstructing it themselves.

I'm particularly interested in what information people need for an SBOM and whether this would be sufficient. The SBOM file generated by Cargo is intended as an intermediate format that other tools could process into standardized formats such as CycloneDX or SPDX. Feedback from authors & contributors of existing tools like cargo-cyclonedx would also be very helpful.


Summary

This RFC adds an option to Cargo to emit a Software Bill of Materials (SBOM) alongside compiled artifacts. Similar to how Cargo emits split debug info or "dep-info" (.d) files, this change emits an SBOM in a Cargo-specific format alongside outputs in the target directory. External tooling can consume this Cargo SBOM file and transform it into other SBOM formats such as SPDX or CycloneDX.

Motivation

An SBOM (software bill of materials) is a list of all components and dependencies used to build a piece of software. The two leading SBOM formats being adopted by industry are SPDX and CycloneDX. Both are still evolving and have multiple specification versions & data formats (JSON, XML).

New government initiatives aimed at improving the security of the software supply chain such as the US "Executive Order on Improving the Nation's Cybersecurity" or the EU "Cyber Resilience Act" require a Software Bill of Materials. Generating accurate SBOMs with Cargo is currently difficult because, depending on target selection or activated features, the dependencies may be different.

For workspaces that generate multiple compiled artifacts, each artifact may reference different dependencies. Existing tools (see prior art section) attempt to approximate the correct dependency set; however, precise dependency information for each compiled artifact is difficult to obtain without built-in Cargo support. Generating the SBOM at the same time as the compiled artifact allows precise dependency information to be emitted for each compiled artifact.

Guide-level explanation

The generation of SBOM information is controlled by Cargo's configuration. To enable SBOM generation, set the following:

[build]
sbom = true

If enabled, an SBOM file will be placed next to each compiled artifact of the bin, staticlib, and cdylib crate types in the target directory, with the name <crate_name>.cargo-sbom.json. The SBOM will contain information about the dependencies used to build the compiled artifact. If the performance impact is deemed low enough, this could be enabled by default.
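As a rough illustration of how a consuming tool might locate the file, here is a minimal sketch that derives the SBOM path from an artifact path using the naming scheme above. The helper name `sbom_path_for` is hypothetical and not part of Cargo.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper: the path where the SBOM for `artifact` would be
/// found, assuming the `<crate_name>.cargo-sbom.json` naming from this draft.
fn sbom_path_for(artifact: &Path, crate_name: &str) -> PathBuf {
    // The SBOM sits in the same directory as the compiled artifact.
    let dir = artifact.parent().unwrap_or_else(|| Path::new(""));
    dir.join(format!("{}.cargo-sbom.json", crate_name))
}
```

For an artifact at target/debug/myapp built from the crate myapp, this yields target/debug/myapp.cargo-sbom.json.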

Reference-level explanation

Format

The format will use JSON, but the exact format is not specified in this RFC.

The SBOM will include the following information (if available) for each crate:

  • ID (opaque identifier)
  • Name
  • Version
  • Source (registry / git / etc.)
  • License
  • Checksum
  • Dependencies (list of IDs)
  • Type (normal, build)
  • Activated features

Information about the current build environment:

  • Rust toolchain version
  • RUSTFLAGS
  • Current build profile name
  • Selected profile values

If a crate is used as both a normal dependency and a build dependency, and the two are compiled separately (as happens under feature resolver v2), then separate entries will exist in the dependency tree with the correct activated features listed for each instance.
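One possible in-memory model of the per-crate fields listed above is sketched here. All type and field names are assumptions made for illustration, since the RFC deliberately leaves the exact format unspecified.

```rust
/// Why a crate appears in the graph (the "Type" field above).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DepKind {
    Normal,
    Build,
}

/// One crate entry; optional fields cover the "if available" wording.
#[derive(Debug, Clone, PartialEq, Eq)]
struct SbomCrate {
    id: String, // opaque identifier
    name: String,
    version: String,
    source: Option<String>,
    license: Option<String>,
    checksum: Option<String>,
    dependencies: Vec<String>, // IDs of other entries
    kind: DepKind,
    features: Vec<String>,
}

/// Small constructor that leaves the optional fields empty.
fn entry(id: &str, name: &str, kind: DepKind, features: &[&str]) -> SbomCrate {
    SbomCrate {
        id: id.to_string(),
        name: name.to_string(),
        version: "0.0.0".to_string(),
        source: None,
        license: None,
        checksum: None,
        dependencies: Vec::new(),
        kind,
        features: features.iter().map(|f| f.to_string()).collect(),
    }
}

/// All entries for `name`. More than one entry means the crate was built
/// separately, for example once as a normal and once as a build dependency.
fn entries_for<'a>(graph: &'a [SbomCrate], name: &str) -> Vec<&'a SbomCrate> {
    graph.iter().filter(|c| c.name == name).collect()
}
```

Keying dependencies by opaque ID rather than by name is what lets the same crate appear twice with different feature sets.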

Drawbacks

It introduces yet another SBOM format. However, the format is specifically designed to be used as an intermediate, to be converted to an industry-standard format by external tooling.

Rationale and alternatives

Since there is no consensus on a single SBOM format within the software industry, and existing formats are still evolving, Cargo should not pick an existing SBOM format. If Cargo were to use existing SBOM formats, multiple formats (and multiple versions of each format) would need to be supported. The task of generating a specific SBOM format is best left to external applications or Cargo extensions.

Unfortunately, it's difficult to extract accurate SBOM information with existing options. Using the Cargo.lock file or cargo metadata overincludes dependencies. Additionally, since Cargo has many different commands that produce compiled artifacts (build, test, bench, etc.) and each of these commands takes arguments that can affect the dependency list, it's difficult to ensure that the correct dependency list is used.

Adding an option to cargo metadata to support resolver v2 would help with overinclusion of dependencies, but would still make it difficult to ensure that the exact set of features, command-line arguments, and other options is taken into account.

Another alternative is to extract information by setting the RUSTC_WRAPPER environment variable, then capture feature flags and dependencies via a wrapper tool. This would require the wrapper tool to parse the rustc command line arguments to capture the set of feature flags and referenced dependencies. This approach would prevent other uses of RUSTC_WRAPPER, as well as being potentially fragile.
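To make the fragility concrete, here is a minimal sketch of the parsing such a wrapper would need. It assumes features arrive as `--cfg feature="name"` and dependencies as `--extern name=path`, each split into two arguments. The `Captured` type and `capture` function are hypothetical, and a real wrapper would have to handle more argument shapes than this.

```rust
/// Features and extern crate names recovered from one rustc command line.
#[derive(Debug, Default, PartialEq)]
struct Captured {
    features: Vec<String>,
    externs: Vec<String>,
}

/// Scan rustc arguments for `--cfg feature="..."` and `--extern name=path`.
fn capture(args: &[&str]) -> Captured {
    let mut out = Captured::default();
    let mut it = args.iter();
    while let Some(&arg) = it.next() {
        match arg {
            "--cfg" => {
                if let Some(&value) = it.next() {
                    // Keep only `feature="name"` cfgs; skip e.g. `unix`.
                    if let Some(name) = value
                        .strip_prefix("feature=\"")
                        .and_then(|rest| rest.strip_suffix('"'))
                    {
                        out.features.push(name.to_string());
                    }
                }
            }
            "--extern" => {
                if let Some(&value) = it.next() {
                    // The crate name is everything before the first `=`.
                    let name = value.split('=').next().unwrap_or(value);
                    out.externs.push(name.to_string());
                }
            }
            _ => {}
        }
    }
    out
}
```

Note that this recovers only crate names, not versions or sources, which is part of why the wrapper approach falls short of what the SBOM needs.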

Prior art

  • RFC2801: proposes embedding dependency information directly into the binary. Implemented as the cargo auditable extension.
  • cargo-auditable: Cargo extension that embeds a subset of the information described in this RFC directly into the binary. The JSON format used by this RFC could be based on the cargo-auditable format.
  • cargo-cyclonedx: Cargo extension to generate a CycloneDX SBOM.
  • cargo-bom: Cargo extension to generate a BOM in an ASCII format including license information.
  • cargo build-plan (#5579): provides an option to emit a JSON representation of the commands to execute, without actually running them. This option has poor integration with build.rs and was planned for deletion in 2018.

Unresolved questions

The exact specifics about what will be included in the SBOM and the specific JSON format are subject to change during the implementation of the RFC.

Future possibilities

Industry standard format

If the software industry converges on a single, stable SBOM format, Cargo could directly emit it. The existing SBOM formats are currently changing too much to standardize on a specific format.

Additional fields can be added to the SBOM without a breaking change.

Build scripts

Build scripts could communicate back to Cargo to inject additional dependencies into the SBOM. For example, if a crate builds C code and then links with it, it could emit a message that causes Cargo to read in a file describing the C dependency.

cargo::sbom=<PATH>

Cargo would then include the additional dependency information in the SBOM graph.
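A build script could produce that message along these lines. This is only a sketch: the `cargo::sbom` directive is a future possibility from this draft and is not recognized by Cargo today, and the helper name `sbom_directive` is hypothetical.

```rust
use std::path::Path;

/// Format the directive line proposed above for a file describing a
/// non-Rust dependency. `cargo::sbom` exists only in this proposal.
fn sbom_directive(path: &Path) -> String {
    format!("cargo::sbom={}", path.display())
}
```

In a build.rs, the script would print the returned line to stdout, the same way build scripts emit their other directives to Cargo.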

Embedding dependency information into binaries

The implementation of RFC2801 could be based on the information provided by this RFC. A subset of this information could be embedded directly into the binaries.

Placing files next to built products will run into the same problem that Cargo has with separate debug files: bin and lib products have the same base name. A file like <crate name>.sbom will be ambiguous/overwritten when building a lib + bin package. This is an unresolved problem in Cargo, so I don’t have a good suggestion how to avoid it.

Is there a downside to <artifact name>.<metadata extension>?

Should this be a separate file, or, given that it's a custom format anyway, is embedding it the way cargo-auditable does a better approach? The external tooling to deal with this format can read it from the binary and emit the different standard formats almost as easily as it can read it from a separate file.

Thank you very much for writing this up.

No matter the exact implementation we end up with, the general concept of a file that records what actually happened during a build is useful beyond just SBOMs (e.g. provenance, pedigree, attestations etc.). As such I'm not convinced that "SBOM" is the correct term, but I'd be happy with that as well.

A few comments, most of them minor:

  • I wonder (need to think about it a bit more) if it makes sense to have one metadata file for the whole build pointing at all the "cargo-sbom.json" files to avoid having to do a find -name to gather all of those and then to have to parse the filename to match it to a compiled artifact or something like that (but I do see the comment from @kornel so maybe the whole thing has to change anyway)

  • The SBOM should also contain a subset of items from Cargo.toml:

    • Author
    • Version
    • License
  • (Optional) Timestamp of when it was created

  • And a hash of the dependency would be great... that's a broader discussion though.

In general it'd also be great if we could add a sentence on reproducibility: as in, we'd like to guarantee that the same SBOM file is generated for the same inputs (ideally bit-by-bit).

I hope that helps as initial feedback, I'm sure I'll have more.

All I can say is that this will definitely be required by a huge majority of people building and selling software for end customers in the future.

Also which linker is used, and the version of the linker, and maybe a checksum of the linker.

Also the value of any cargo flags that can affect the output (maybe it's already covered by this above, maybe not)


But more importantly: a single command that can take the sbom and either build the program exactly as it was, or fail if some component couldn't be found (for example, if you used mold and mold isn't in PATH or is a different version, fail to reproduce)

cargo-auditable specifically optimizes for

  • Size
  • Build reproducibility

The intent here is to be a source material without these kinds of constraints and without concern for other policies (e.g. what symbol to put these under).

To clarify, you mean that cargo build -p foo -p bar would reproduce one artifact rather than multiple?

The SBOM should also contain a subset of items from Cargo.toml:

If information can be acquired directly, should the SBOM still provide it?

(Optional) Timestamp of when it was created

Could you expand on the role/value of this?

btw if we did one file, I expect it'd be like --timings and we'd have a timestamp in the file name with a symlink to a more general name

And a hash of the dependency would be great... that's a broader discussion though.

There is some related discussion in rust-lang/cargo issue #12818 ("`cargo metadata`: expose checksum in packages").

In general it'd also be great if we could add a sentence on reproducibility: as in, we'd like to guarantee that the same SBOM file is generated for the same inputs (ideally bit-by-bit).

Wouldn't that run counter to timestamp? If you mean across machines, that could run into other issues as well. This is one of the reasons we were modeling this off of debug info as that might not need to be as reproducible as the actual binary.

First time I've heard of the concept of SBOM. However, I often save the current Cargo.lock with the binary in order to accomplish a similar purpose (logging and tracing the exact dependency version of the binary build). Is it possible to extend Cargo.lock appropriately to accomplish this?

A Cargo.lock includes all potential dependencies. What people need is the exact set of dependencies used for the current build, along with other build-specific metadata. Adding that to the Cargo.lock would mean it would change on every invocation, which would be bad as it's a file intended to be committed to your VCS.

I don't mean extending Cargo.lock itself, but extending its format to make it usable for the BOM.

The exact format is not specified in the RFC. We could base it on:

  • An extension of the cargo-auditable format.
  • An extension of the Cargo.lock format.
  • An extension of the cargo metadata format.

The intent of this RFC is to get consensus on the idea of emitting it as part of the build output and a starting point for what data should be included.

Regardless of what format we use, it should be easy for consumers to parse, which makes JSON-based formats slightly preferable to TOML.

I've created cargo auditable and worked on cargo cyclonedx, so I feel I should chime in.

What information should be recorded

To elaborate on some of the items you listed:

Component hashes. The biggest missing piece. Mandatory for SBOMs in some jurisdictions, currently not recorded by cargo cyclonedx. There is no reasonable way to obtain this information right now. The regulations that require hashes do not specify what these should be hashes of, so that's fun! I would hash the parts that get passed to the linker, because AFAIK there is no reasonable way to hash the source code (unless we add a cargo hash or something that keeps track of all files used to influence the build and hashes them all, which is a whole separate project).

Resolved dependency trees. You have called out feature resolver V2 already. Right now cargo metadata does not expose the reality of Cargo having two (or far more?) resolved dependency trees. Normal and build dependencies have different dependency trees with features from one not affecting the other; and dev-dependencies also exist but don't take part in the build - except when you build benchmarks or tests, in which case they do. And IIRC examples can also have additional dependencies just for them. You'll probably need to record normal and build dependency trees for every binary generated.

Resolved build configuration. If you want to capture the build configuration, e.g. RUSTFLAGS, then you have to collect it from a bunch of different sources. They can be defined as an environment variable, or as a command-line argument to Cargo, or in a project-specific .cargo/config.toml, or in a global .cargo/config.toml. Reimplementing the Cargo configuration algorithm is tedious and error-prone, so it would be great if it could be just dumped at the end. Although the parts that get converted into rustc arguments can just be ignored and the rustc command called for each component could be recorded instead; that results in both less work for the implementation and more detail for the SBOM.
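To give a feel for the resolution logic an external tool would have to duplicate, here is a minimal sketch of picking the effective RUSTFLAGS. It assumes the precedence described in the Cargo reference (CARGO_ENCODED_RUSTFLAGS, then RUSTFLAGS, then target.<triple>.rustflags, then build.rustflags, first match wins). The `FlagSources` type is hypothetical, and the sketch ignores config file discovery, `--config` overrides, and cfg-based target tables.

```rust
/// The places extra rustc flags may come from, already read by the caller.
struct FlagSources<'a> {
    encoded_env: Option<&'a str>,        // CARGO_ENCODED_RUSTFLAGS (0x1f-separated)
    env: Option<&'a str>,                // RUSTFLAGS (space-separated)
    target_config: Option<Vec<&'a str>>, // target.<triple>.rustflags
    build_config: Option<Vec<&'a str>>,  // build.rustflags
}

/// Return the flags from the first source that is set; sources do not merge.
fn effective_rustflags(s: &FlagSources<'_>) -> Vec<String> {
    if let Some(encoded) = s.encoded_env {
        return encoded
            .split('\u{1f}')
            .filter(|f| !f.is_empty())
            .map(str::to_string)
            .collect();
    }
    if let Some(env) = s.env {
        return env.split_whitespace().map(str::to_string).collect();
    }
    if let Some(flags) = &s.target_config {
        return flags.iter().map(|f| f.to_string()).collect();
    }
    s.build_config
        .as_ref()
        .map(|flags| flags.iter().map(|f| f.to_string()).collect())
        .unwrap_or_default()
}
```

Even this simplified version has to know two different separators and a fixed lookup order, which is the kind of detail that is easy to get subtly wrong outside Cargo.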

When should the file be generated?

It is somewhat tricky to make the generated info useful for both cargo auditable and cargo cyclonedx simultaneously. auditable needs to know the build configuration before the final binaries are generated while cyclonedx would much rather know the hash of the final binaries.

The solution I see is generating the file just before invoking the linker, and including the path where the final binary would be placed. This would enable RFC PR #2801 by making the info available before linking starts, and let SBOM tools such as cargo cyclonedx calculate whatever hashes they please instead of restricting them to the ones Cargo emits.

Add a callback?

The out-of-tree cargo auditable will not benefit from this mechanism unless it can register a callback to be called with the generated SBOM file as input, and emit additional flags for rustc to pass to the linker.

An in-tree implementation of RFC PR #2801 maaaybe could work without this? I'm not really sure. Embedding info into the executables requires writing object files, which only rustc knows how to do (cargo auditable uses code copy-pasted from rustc). There is no easy way to teach Cargo to do it without also exposing it to the full complexity of target info within rustc. So there either needs to be a command for rustc to write an object file with the specified contents for a given target + a callback to actually run it before linking, or the whole writing code should be within rustc and then you don't need a callback at all; but that requires rustc to have knowledge of origins of crates (crates.io/git/custom registry/local filesystem) which it doesn't have and shouldn't have. Could we just uplift the logic into rustc and sacrifice origin information? This is feasible if rustc knows at least about crate versions. This also makes it difficult to evolve the embedded info in the future by e.g. recording licenses.

A callback would also give tools such as cargo cyclonedx an opportunity to hash all the constituent parts before they are linked together with the algorithms of their choosing, rather than the one Cargo chose. There still needs to be a post-build step to record the hash of the final binary, but that's not too hard to implement as long as it knows where the binaries are going to be placed.

This is also where anything that would verify the hashes against an older SBOM could run.

No additional callbacks will have to be introduced if the file is already placed by the time RUSTC_WORKSPACE_WRAPPER runs, and the path to the file is passed to it in an environment variable. This is the stage at which cargo auditable currently injects itself into the build.

Something doesn't add up. "Hashing isn't defined, so let's make up our own thing"? Could this be a policy that we leave to the tool in question rather than arbitrarily hashing something and then people finding out no one cares about it?

Note that we do capture mtime for workspace members and I'd love for us to switch that to a hash. We skip this for registry dependencies since they are immutable.

And IIRC examples can also have additional dependencies just for them

Examples can use dev-dependencies but yes, if the build target is an example, then we should include that in the report.

Resolved build configuration. If you want to capture the build configuration, e.g. RUSTFLAGS, then you have to collect it from a bunch of different sources. They can be defined as an environment variable, or as a command-line argument to Cargo, or in a project-specific .cargo/config.toml, or in a global .cargo/config.toml. Reimplementing the Cargo configuration algorithm is tedious and error-prone, so it would be great if it could be just dumped at the end. Although the parts that get converted into rustc arguments can just be ignored and the rustc command called for each component could be recorded instead; that results in both less work for the implementation and more detail for the SBOM.

This Pre-RFC is for a feature that would be implemented within cargo, so there should be little work to capture the passed-in rustc flags because we can capture them after they've already been collected.

The benefit of capturing just the passed in rustc flags (and other cargo configuration) is you get an idea of inputs to the system. If we focused instead on capturing the literal rustc invocation that would make it hard to tell what came from another piece of captured input and what is important for reproducing / auditing things.

cyclonedx would much rather know the hash of the final binaries.

Wanted to double check. Are you saying "cyclonedx will be hashing the final binaries so it doesn't need Cargo's document earlier than that"? Or do you expect cargo to be hashing the binaries? If the latter, I chalk that up with other information in the report: cargo should be responsible for giving tools enough information to generate tool-specific information and shouldn't be responsible for tool-specific information.

This is also where anything that would verify the hashes against an older SBOM could run.

Could you expand on this? I've not seen this brought up as a use case yet.

If this is the case, I highly recommend making the config option a string:

[build]
sbom = "alpha-cargo-1"

I would prefer a json file with a version field to make it extensible over time. In the toml:

[sbom]
dir = "sbom"
format = ""
...

Since debug info is part of our inspiration for this, it makes me wonder if this should similarly be in [profile]. This would let it both be in config and the manifest.

Speaking of configuration, there also must be an environment variable to enable/disable this behavior. In fact, I expect this to be the most common way for this to be enabled/disabled.

If you run cargo cyclonedx to generate an SBOM, it shouldn't require you to edit the Cargo.toml first before you can run the tool (or have the tool edit it itself). You want the SBOM to be created only when the tool runs, not during regular builds; and the best way to accomplish that seems to be an environment variable.

Ditto for cargo auditable.

What I meant is more like "hashing isn't defined, so we get to do the most sensible thing".

Hashing emitted code before linking is the easiest to accomplish and ticks the box of having hashes in an SBOM; but I am not sure how useful that is going to be in practice.

The way the hash should be calculated is not entirely straightforward. This runs into design questions such as whether the hidden directories like .git or files not tracked by a VCS are included or ignored. Both could theoretically influence the build via include_bytes!, proc macros or build scripts. There are also issues around some filesystems being case-sensitive while others aren't, potentially resulting in different sets of files.

This is a big part of why leaving this up to third-party tools is tricky in practice. Reimplementing this hashing in a way that does not diverge with Cargo itself is going to be difficult.

Yes, I agree that giving cargo cyclonedx knowledge of where the final binary will be placed and letting it hash it in whatever way it wishes is best.

I think for registry packages we should take the hash that is stored in the lockfile. For git repos we could take the git commit and for local crates there isn't really a way to find all files that have been used. The build script may read anything. As such for local crates I think we should leave it empty for third party tools to fill it in.
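That policy can be sketched as a small lookup. The `Source` enum and `recorded_hash` function are hypothetical names for illustration, not Cargo's internal types.

```rust
/// Where a package came from, with whatever identifying hash is on hand.
#[derive(Debug, Clone, PartialEq, Eq)]
enum Source {
    Registry { lockfile_checksum: String },
    Git { commit: String },
    Local,
}

/// The hash Cargo would record under the suggestion above: the lockfile
/// checksum for registry packages, the commit for git sources, and nothing
/// for local crates, leaving that slot for third-party tools to fill in.
fn recorded_hash(source: &Source) -> Option<&str> {
    match source {
        Source::Registry { lockfile_checksum } => Some(lockfile_checksum.as_str()),
        Source::Git { commit } => Some(commit.as_str()),
        Source::Local => None,
    }
}
```

Returning an Option keeps "no hash known" distinct from an empty string, so downstream tools can tell which entries they still need to hash themselves.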

We could unconditionally generate the SBOM. We already unconditionally generate a .d file for Make to consume when invoking cargo, and I don't think generating an SBOM will noticeably hurt compile times or disk usage.
