[Pre-RFC] `cargo package` should include fewer files by default

Hi everyone! I was thinking about opening an RFC and wanted to test the waters here a little bit. :slight_smile:

In short, I want to propose that we limit the files that are packaged by cargo package by default before uploading a crate to crates.io. This would not only save bandwidth/resources, but also make reviewing crates as part of a supply chain easier. Since that would be a breaking change, I reckon that the upcoming 2027 Edition would be a great opportunity to include such a change.

A bit of context: At my company, it has become part of our routine to review every Rust dependency that is updated in our code base. This means that we have looked at hundreds of project structures and diffs over the months and from time to time we find crates that include files that aren’t really required. Be it gifs or pngs, pdfs, or maybe binary test data. For some of them, it is worth discussing if they should actually be shipped with the crate, e.g. to locally verify that the tests are running, and that is ok, but some files are truly not required and balloon the size of the crate without any benefits.

Whenever we encounter such cases, we try to open PRs to exclude these files from the shipped crate. Here are some examples for the typical three scenarios:

Optimizing the size of crates saves resources (bandwidth, storage) on both the users’ side and the side of crates.io. The crate-size example linked above is more of an extreme case, yet I believe that many crates exist where such files are included by accident, and that wastes a lot of resources.

On top of that, having more files in a crate also means that reviewing them is more involved, and in the case of binary files, also much harder or impossible to do. Thus reducing the number of shipped files not only saves resources, but also makes it easier to ensure the supply chain safety of the Rust ecosystem by limiting the number of attack vectors and reducing the time needed to review crate updates.

Package managers that include only a limited set of files by default are, for example, pip or the Swift Package Manager. Here the files are limited to mostly python or swift source files, except for the usual suspects like a README or LICENSE file.

For Rust, I would start by proposing the following option: Besides the files that are always required (build.rs, cargo related files), and files we really should include (readme, change log, licenses) we could limit any other files to Rust source files (.rs) and Markdown files (.md). The include = [..] (overwrite the default list) and exclude = [..] behavior could stay the same.

This way, we make sure that any images, binary files, or scripts are included deliberately and not by accident.

Another possible option would be to limit the included files to the typical directories of a Rust project (/src, /tests, /benches), which should reduce the number of unneeded files as well, but in my opinion not as effectively as the first option.

These are not yet 100% technically accurate descriptions since I have only briefly looked at the packaging code, but I think this is enough to convey the general idea here.

In any case, since that would be a breaking change, the upcoming Rust Edition 2027 would offer an opportunity to ship such an adjustment. Since 2027 is not far off, I thought it would be a good point in time to discuss such a change.

Some quick thoughts

  • It would be good to go into specifics on prior art. If Swift includes LICENSE, what about LICENSE-MIT?
  • It would be good to enumerate the current behavior and some of the existing conversations around it in case there are opportunities for these ideas to build off of each other
  • We want a way for users to tell cargo about license files for specified licenses which could get them automatically included when packaged but there is some design work there. I'd recommend this as per-requisite work so we don't have to make assumptions about license files (which we've already shied away from with cargo new).
  • If we start with default cargo locations, what we can instead do is enumerate all build targets and use their parent directories
  • If we go by extension, this will likely miss a lot of test data that people won't notice is missing because they don't run cargo test on the package. I wonder if we should exclude test/bench build targets
  • Generally editions are supposed to have an automated migration path though we have flexed a bit on that in the past.
  • I have a preference for expressing edition changes as changing of defaults for existing settings. For example, Cargo.tomls resolver = "3" changes the default for .cargo/config.tomls resolver.incompatible-rust-version.
  • Excluding test targets would not affect the presence of #[test]s in the lib sources, so it is not a means to avoid publishing non-functional tests. (Though I imagine it is more common for tests which need external data files to be test targets.)

  • Most packages’ tests do not take up significant amounts of disk space (or have any non-code files at all). Keeping tests, when they are cheap, is valuable for “wait, did this published package ever work as intended? If so, on which Rust versions?” kinds of investigations. I also think that it would be valuable if cargo publish verification ran tests to reduce the chances of publishing a broken-as-published package — all the more so if the default file exclusion rules were stricter.

    So, I think that tests which have data files (large or not) should be treated as a non-default situation which you might have to configure the manifest for.

    That is, I’d prefer the following behaviors:

    • Files not known to be likely part of any target’s source code (or licenses, etc) are excluded from the package by default.
    • cargo publish/cargo package runs tests, not just a build.
    • If your tests don’t work under these default conditions, you are encouraged to explicitly configure excluding the relevant tests, both code and data, from the package.
      • Having such explicit configuration available (perhaps something like test.*.publish = false) would also benefit users who would like to exclude all test files, but currently can’t do so because it leaves the manifest with a [[test]] entry missing its crate root source file (which, IIRC, is currently an error whether or not the test is being run).

Another possibility might be to require more explicit inclusion/exclusion:

  • Have default rules for which files should be included (source files, LICENSE*), extended by package.include
  • Have default rules for which files should be excluded (.*), extended by package.exclude
  • Error if any files are found that is not covered by either
1 Like

Certainly it would be bad if we can't use crater any more because most things end up with broken tests.

3 Likes

Another option would be to pick some arbitrary file size cutoff and refuse to publish if any non-source files are larger than that cutoff unless a CLI flag is passed to acknowledge their inclusion as deliberate.

No, thank you. Crates may have more than "the machine I use to publish with the Rust toolchain I happened to publish from" as interesting targets where the tests need to pass. And without build isolation, you still have the "oops, works on my machine" problem. I have our publishes done in a CI job that depends on all relevant configurations testing successfully; I don't think cargo publish is ever going to understand that for anything beyond "needs to work on a Linux"-scoped crates.

If the "the build is sensitive to this file" worked better (I don't have the issues handy, but basically if you use them, Rust's default file collection is completely ignored, so you have to implement it just to add one file to the list), then I think cargo publish could at least rely on build.rs and cargo check reporting. But again, if a file is only used on Windows, a pass on Linux is going to miss it.

It's good to see a lot of feedback already. Thank you all for chiming in!

There are many things to get into, so lets hope I don't forget anything.

TL;DR:

  • I think verifying tests during packaging/before publishing is out of scope for this RFC. I would stay with the current situation where tests are not guaranteed to work as part of the shipped crate.
  • I wouldn't want to fully exclude tests as default behavior, but I think it is reasonable to expect more complex test setups to explicitly include files that are needed for their tests to work.
  • I've looked further into the Swift Package Manager and pip and described their behavior.
  • I think we are already finding some common ground between our expectations/wishes and how other package managers handle this topic by default.
  • There are some open questions regarding migrating to a new edition and how to handle license files.

Verifying shipped tests

First of all we got a discussion going on around verification of shipped tests, also related to crater. I can share a bit more of our experiences in that regard. When we started out with reviewing our supply chain, we discussed with crate maintainers the possibility to remove tests from shipped crates if they had no actual use case for them in mind (we've since moved away from that position). We got some of the same concerns as shared here in this thread (important for crater, needed for verification on local system, etc.), but when we checked the tests in these crates or the last crater runs that had them included, a lot of them where already failing today. Combined with the limitations of architecture dependend code, I'm not sure if test verification is in the scope of the RFC envisioned here. I think it would absolutely be a useful thing to have in the future though.

For this RFC, I would keep the current situation in that there is no guarantee that shipped tests automatically work and crate maintainers need to put some effort in if that is something they deem important. If we import tests based on a glob imports (e.g. everything in /tests/**/*.rs), most of the simpler test setups should just work and thus contribute to crater runs. Crates with more complex test setups would have to check if tests still work and manually include files they want to make available for that (which I would consider the way to go, since maintainers then would have to think about that a bit more deeply). I wouldn't want to "sabotage" crater by excluding tests by default, although that is something that I would appreciate during reviews. :wink:

Specifics about how other package managers handle file imports

I've looked a bit more into the Swift Package Manager (SPM) and pip. For the SPM, I've actually misremembered some things. Since there is no central package repository, the SPM basically downloads whole git repositories at a time and then applies filters afterwards to figure out what the package is. Nevertheless, I think we can draw some inspiration from that filtering.

The docs are not super conclusive on that one, so I dug around in the source code a bit.

  • There are predefined lists of directory names (lines 286 - 292) and file extensions (lines 701 - 865). Basically they include directories and file extensions typical for swift projects.
  • The directory names are used to find the "main part" of the package, if not further specified by the package creator.
  • The file extensions are used to determine which files are part of the package and in which category they belong (mainly source, test, or resources). Files that are not part of this predefined list are ignored if not explicitly included by the package creator (this would include things like binary files or images).
  • Since this selection is relevant for the following build step, license or md files are not handled, so we can't learn anything in that regard.

For pip a lot of information can be found in their documentation. Files that are included are:

  • Module/package directories derived from the project toml (these are mostly the *.py files)
  • A number of accompanying files (C, scripts, data) as explicitly defined in various places
  • Glob patterns for tests (tests/test*.py , test/test*.py)
  • License files as defined in the project toml (license_files). If not defined, all files that match LICEN[CS]E* , COPYING* , NOTICE* , AUTHORS**. This is in addition to a license field where the license needs to be defined as SPDX value.
  • A number of project relevant files
  • README , README.txt , README.rst or README.md

So i think we can see that for both cases that the default list of included files is based on predefined directories/file extensions combined with project relevant paths to look in the right places. Files that don't match predefined extensions can be added by the user if necessary, but are not included by default.

Btw I just wanted to briefly mention the behavior for npm and Go, for completeness sake.

  • npm works closer to cargo in the sense that it includes all the files by default and the user has to explicitly define if files are not supposed to be included.
  • Go seems to use a simpler version of the SPM approach (download source from some repository) in which everything is downloaded all the time, but then the build system figures out what it needs during compilation. So the additional filter step of the SPM is skipped here before handing files to the build system.

Reacting to specific comments

Sorry does that mean the current behavior of cargo or of the other package managers?

I think as seen for the pip example, it would be ok to go with a glob import pattern as well if license-file is not set? This way we could handle the license = "MIT OR Apache-2.0" situation as both license files would be included?

This sounds similar to the pip approach in that it takes the actual relevant target directories. Sounds good to me, yet I don't know how difficult it would be to get that kind of information during the packaging step.

Yes that would be something to think about. Adding a general include = ["**/*.*"] would circumvent current omissions (like files/directories defined in the .gitignore file) I think, so we would have to come up with a more sophisticated approach.

I am not sure if the include field in .cargo/config.toml might work for that purpose or if we even want to utilize it in that way. Something I would need to have a look into. Otherwise I a missing the experience of past edition changes here, so suggestions are welcome. :grin:

That is certainly the most effective option to reduce the number of e.g. images/gifs that are included by accident. I fear that this would miss the smaller scripts or binary files that make the review process of the supply chain harder. For the RFC that I had envisioned, I would hope that we catch these files as well.

In regard to the post by @kpreid, there are many things I agree with or that are present in the other package managers as described above (except for the automatic verification of tests, as written above). Notably

I think throwing an error if files are found that are not covered by any rule goes too far, but maybe a warning would be a good tool to let the user know "there are files that are not included, are you aware of this". Personally I would've gone without any error or warning because I would assume if the crate builds, it can't be too bad. But of course that is something that can be discussed. :slight_smile: