Pre-RFC: Cargo SBOM

When referring to SBOMs, we need to make clear whether we are talking about:

  • The proposed file cargo outputs
  • A third-party file format
  • What is or will be required by regulators.

Regulators or the third-party formats may require something, but that doesn't necessarily put a requirement on what cargo provides directly, so long as it can be obtained indirectly.

I stated above that I live too close to the BSI. My SBOMs must include the author. I bet there are regions/countries where the SBOM must not include the author. Must cargo become location/regulation aware?

I would expect cargo to write a cargo-specific JSON file to the disk.

It's possible to get the author data (for crates that still include it) from cargo metadata once you have the package ids that were used during the build.
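A minimal sketch of that lookup, assuming you already have the package ids from the build. It relies only on the `packages` array of `cargo metadata --format-version 1`, where each entry carries an `id` and an `authors` list. The helper names `load_metadata` and `authors_for` are made up for illustration, and package ids are treated as opaque strings.

```python
import json
import subprocess


def load_metadata(manifest_dir="."):
    """Run `cargo metadata` in the given directory and parse its JSON output."""
    out = subprocess.run(
        ["cargo", "metadata", "--format-version", "1"],
        cwd=manifest_dir,
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return json.loads(out)


def authors_for(metadata, package_ids):
    """Map each requested package id to the authors listed in the metadata.

    Ids that the metadata does not know about are skipped, and crates that
    no longer declare authors simply yield an empty list.
    """
    by_id = {pkg["id"]: pkg for pkg in metadata["packages"]}
    return {
        pid: by_id[pid].get("authors", [])
        for pid in package_ids
        if pid in by_id
    }
```

Something like `authors_for(load_metadata(), ids_from_build)` would then give the author data per package, with an empty list for crates that no longer declare it.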

I notice that the format in the OP currently duplicates other information from the metadata (name, version, source, etc.). Rather than duplicating this information (and deciding which parts of the metadata should be duplicated), should the SBOM instead focus on the data that is build-specific, like the exact dependency graph that was built, the activated features, and the current build environment? (Essentially something similar to --unit-graph.) Maybe it could also record the entire metadata structure, but not try to interleave that data with the build data, leaving it to third-party tooling to query based on the ids.

I said that my SBOM must contain the author. I cannot get it from a side-channel later on. I would expect cargo to write the author into the JSON file.

The BSI guideline refers to RFC 2119, so I think it means an SBOM MUST contain those fields. But the "Ersteller der Komponente" (creator of the component) field in section 5.2.2 says it's a URI which merely SHOULD contain an email address. This is not the same as "Ersteller der SBOM" (creator of the SBOM) in section 5.2.1.

So it's possible that a link to an organization or a project page would be sufficient, especially for open source projects.

Someone else can stick their neck out if they want to be the SBOM author.

Why must cargo generate a final SBOM for exactly your needs rather than having a tool adapting it, along with side channel information, into your final SBOM?

I am not a lawyer! Maybe for each region/state/... I need my own post-processing tool to adapt the SBOM to my legislation.

The point is that Cargo's "SBOM" is not your final SBOM. Cargo physically cannot comply with every possible idea of what the SBOM should contain. What cargo can do is collect the information most likely to be useful or difficult to acquire otherwise (e.g. the exact list of packages which contain runtime-linked CUs) and provide ways (e.g. cargo-metadata) to acquire whatever other information might be wanted for the SBOM, which can then be aggregated and put into the correct format/location by tooling (e.g. cargo-auditable) as part of your distribution packaging workflow (e.g. cargo-dist).

Think of the Cargo produced artifact as an "SBOM fragment". You then combine that fragment with your domain knowledge/requirements to produce the full SBOM.

cargo-publish is not the only step in packaging an executable for distribution. As much as the ideal of a portable single file executable is nice, software more interesting than a CLI tool needs auxiliary resource files which are outside of Cargo's scope to package. cargo-build is the "make compile" step to be integrated into the larger (and project dependent) packaging workflow. And we're constantly working on making that more practical to do.


While emitting the {artifact}.cargo-sbom.json as another artifact of compilation makes sense, extending the cargo-metadata output with the appropriate information also makes a good deal of sense. The crate information is available without compilation, only the build environment is transient. That cargo-metadata isn't able to distinguish between target and host dependency resolution is a limitation that should be fixed.

It'd be nice to see a bit more stated on the alternative/rationale on why making the SBOM as part of the build is preferred to a separate step, assuming that metadata does grow the ability to differentiate between package trees. I do believe that perfect capture of build environment is justification enough, I'd just like to see that laid out a bit more explicitly.

That seems to be the right direction. The RFC should clearly state that, due to local legislation, the creation of the SBOM will be a multi-step process. The first step is cargo dropping the JSON on the disk.

This wouldn't be possible with the future possibility of allowing build scripts to add additional data to be recorded. (I'm not sure whether there's any existing data from build scripts that should be recorded without any explicit changes. Should things like link-args be recorded to know of system dependencies?)

I think you already laid it out pretty nicely. If it's part of the build process, it can't get closer to the ground truth. We're moving more and more into a world where software development is going to be regulated (whether we like it or not, across jurisdictions). This includes things like proving the provenance of the builds, the compilers, etc., and the closer to the source we can get it, the better. I can imagine that compilers, in the future, might even provide attestations of builds directly (e.g. signatures). Rust will never be able to comply with all regulations and I don't think that should be the goal either. But to make it as easy as possible for users to comply, we should help them out as much as we can.

Sure, we can tell users to look at Cargo.toml and Cargo.lock and this new sbom file and piece together what they need and that's what we'll do if needed. As a user I'd be very happy if Rust did as much as possible for me, this would also make a great selling point for Rust and goes nicely with its "security" story.

We really have two SBOM formats now (CycloneDX and SPDX), and tooling for both would need the logic to look at Cargo.toml etc. If one format (the new SBOM format discussed here) is enough for both, it'd lower the implementation burden on everyone producing SBOMs.

I do understand that there is something to be said about keeping this new format minimal but I think there are also good reasons to make it as useful as possible.

If we're not producing a final SBOM, then what makes this important? The number of consumers will probably be relatively low, while the requirements will keep evolving and may well be contradictory. I suspect the amount of data these implementations would gain from a less-than-minimal format will be small compared to the data they need that cargo shouldn't or can't provide, and the complexity of getting it themselves will likely be low. I'm not seeing a huge cost in pushing this onto the consumers of this format, but speculatively adding content that has impedance mismatches with the final output will leave us stuck with that forever. If we also focus on giving you the resources to get the information yourself, you avoid being blocked for 6-12 weeks waiting for the feature to be stable (assuming we can insta-stable it).

Hi everyone! Coming a little late to the party, but I wanted to both summarize what I read above and make some points I think will be helpful.

Summary

  • Cargo's existing commands and data written to disk don't adequately expose dependency information on a per-build basis, which makes constructing a proper SBOM more difficult. The goal is to fix that by exposing the proper information.
  • For now, supporting non-Rust code is out of scope, although it may be addressed in the future.
  • cargo-auditable, cargo-cyclonedx, and other tools exist in this space already and have experience with the technical challenges here, many of which Shnatsel has raised (and linked to issues about).
  • This new Cargo command would probably not embed things in the binary the way cargo-auditable does; this new thing has different goals from cargo-auditable.
  • The UX of this (what config settings are involved, is it settable by an environment variable, what is the output file named and where is it placed) still needs to be ironed out.
  • There's an open question around hashing; what should be hashed, how should it be hashed?

Clarifying Points

  • Cargo, under this proposal, would be producing what I'll call an "SBOM precursor" rather than an SBOM. It will follow a Cargo-specific format and will not be usable for distribution as an SBOM for others to consume. The goal of it would be to provide build-specific data which is hard to get otherwise, and other tools could consume it, plus possibly other information sources, to produce SBOMs in one of the existing or future standard formats (CycloneDX, SPDX, etc.)
    • This also addresses the regulation issue raised above; the intent would be that this is categorically not an SBOM and thus is not appropriately subject to SBOM regulations.
    • If I'm right in my description, it's maybe a good idea not to label things here "SBOM," lest you confuse people about the purpose / quality / readiness for use of the output.
  • I'm involved in the OmniBOR Working Group, where we're trying to define a standard mechanism for hashing software artifacts and their granular dependencies! This seems to fit nicely with the questions around hashing in the conversation above. You can check out our draft spec, and we'd also be happy to talk with you all about the hashing mechanism OmniBOR defines, our thoughts on what's included vs. not, and how OmniBOR might fit here as part of the solution to the hashing questions.
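
To make the "how should it be hashed" question a bit more concrete, here is a minimal sketch of git-style blob hashing. My understanding, which is an assumption here and should be checked against the draft spec, is that OmniBOR identifiers build on this kind of hash. The function name `git_blob_hash` is made up for illustration; whether cargo would hash source files, build outputs, or both is exactly the open question and is not settled by this sketch.

```python
import hashlib


def git_blob_hash(data: bytes, algorithm: str = "sha256") -> str:
    """Hash bytes the way git hashes a blob.

    The digest covers a header of the form b"blob <length>" followed by a
    NUL byte, then the raw content, so it differs from a plain hash of the
    content alone.
    """
    h = hashlib.new(algorithm)
    h.update(b"blob %d\x00" % len(data))
    h.update(data)
    return h.hexdigest()
```

With `algorithm="sha1"` this should match what `git hash-object` reports for the same bytes, which makes it easy to cross-check against an existing tool.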

I understand the desire to keep the new format minimal. But I wonder if there's already an impedance mismatch.

The indispensable information for this cargo-sbom feature is the specific tracking of exactly which features and flags were enabled when compiling a specific binary. Everyone agrees on that.

This sub-discussion seems to be about what additional information, beyond that minimal build-specific information, should be included. It sounds like Eric is suggesting that cargo metadata could still be the source for non-build-specific dependency information.

My question about that is, isn't cargo metadata then delivering an uneasy combination of build-specific and non-build-specific information? Right now, cargo metadata expects to have some combination of features enabled. Would that set have to be exactly the same set of features used to generate the cargo-sbom output? Or would cargo metadata --all-features always be a strict superset of all needed information?

Basically, if we require users to "join" the output of cargo metadata with the output of this sbom feature, we will need to have a very clear description of exactly what cargo metadata command they should run, and when, and with what features, to align correctly with the cargo-sbom.json files. Do we have clarity on what that would be?

If they would have to run cargo metadata with the exact same features used in the build to get the same resolution information, then I would argue that's a good reason to extend the cargo-sbom.json format with that metadata. Because even if it were possible for them to do it, it'd be a lot of difficult coordination.

If on the other hand they could just run cargo metadata --all-features once for the workspace, and join that with any cargo-sbom.json output from any binary, then that might be simple enough to be fine.

From my understanding, most of what we are looking to cargo metadata for shouldn't need a complex interaction of flags. Those would be for the dependency tree which would be in the SBOM. Is there something I'm missing?

I don't know of anything you're missing, and it sounds like you more or less agree that cargo metadata --all-features would have everything that would not be in the SBOM.

But we should validate that, and probably an actual RFC should include a proof-of-concept script to emit some SBOM format from cargo-sbom.json plus all-features.cargo-metadata.json joined together. Providing a working example to validate the concept seems pretty important to make sure it is actually possible.
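
A rough sketch of what the core of such a proof-of-concept might look like. The layout of cargo-sbom.json used here (`{"packages": [{"id": ..., "features": [...]}]}`) is purely hypothetical, since the format is not settled; only the `packages` entries of `cargo metadata --format-version 1` (`id`, `name`, `version`, `license`, `authors`) are existing fields. The function name is made up, and the output is a plain list of dicts rather than real CycloneDX or SPDX.

```python
def join_sbom_with_metadata(sbom, metadata):
    """Enrich build-specific SBOM entries with static package data.

    `sbom` uses a HYPOTHETICAL layout:
        {"packages": [{"id": "<package id>", "features": ["..."]}]}
    `metadata` is the parsed output of `cargo metadata --format-version 1`.

    Returns (components, missing), where `missing` lists the package ids
    from the SBOM that the metadata run did not cover.
    """
    by_id = {pkg["id"]: pkg for pkg in metadata["packages"]}
    components, missing = [], []
    for entry in sbom["packages"]:
        pkg = by_id.get(entry["id"])
        if pkg is None:
            # The join fails for this package: the metadata is not a superset.
            missing.append(entry["id"])
            continue
        components.append({
            "name": pkg["name"],
            "version": pkg["version"],
            "license": pkg.get("license"),
            "authors": pkg.get("authors", []),
            # Build-specific data comes from the SBOM side of the join.
            "features": entry.get("features", []),
        })
    return components, missing
```

If `missing` ever comes back non-empty for metadata produced with --all-features, that would be direct evidence that a single metadata run is not a strict superset of what the build saw.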

Generally something like that would be done post-RFC but pre-stabilization.

FYI the RFC is up