Idea: introduce `project` field to Cargo.toml to make micro-crate designs less scary

In this discussion I've ended up with the following idea. I think it's worth fleshing it out a bit more and getting more eyes on it.

Motivation

Recently the problem of "dependency explosion" has been raised many times. A natural reaction is to push for limiting or even reversing the split of projects into smaller crates (e.g. tokio).

While the small-crate approach indeed has its issues (e.g. in some cases it can increase maintenance cost a bit or make reviews harder, and it makes life harder for Linux package maintainers), it also has undeniable advantages (e.g. in some cases it dramatically reduces the total amount of LoC in user projects, thus making reviews easier, and it opens the door to incremental stabilization, faster API iterations, etc.). So, as usual, there is a golden mean to be found between huge monolithic crates and one-liner "nano" crates, and it will vary between projects.

But in addition to technical merits and demerits, I think there is a significant hidden factor which influences discussions of this problem: the total number of crates is the only metric which users constantly see when they build Rust projects. So instead of focusing on (arguably) far more important metrics, like the total amount of LoC in a project or the number of groups which maintain the project's dependencies, they tend to use the size of the project's dependency tree as a complexity metric. So when a monolithic crate in a dependency tree splits into a number of smaller crates, they perceive it as an unwarranted complexity explosion which increases risks for their project, even though the total amount of code and the number of maintainers they have to trust haven't changed. In my opinion this psychological factor has a significant influence on people's positions in micro-crate discussions, and ideally it should be suppressed so that people can focus more on technical points.

Proposed Solution

I propose to add the project field to Cargo.toml:

[package]
name = "rand_core"
version = "0.5.1"
project = "rand"

This field will be used to indicate that this crate is part of the given project's umbrella of crates. That makes it much easier to establish that, for example, rand, rand_chacha, rand_core and getrandom are all part of a single project and thus can essentially be viewed as a single crate when analyzing risks for a project.

To make this information more visible, cargo can by default group crates by project name. If we take the sha3 crate as an example, instead of:

Compiling typenum v1.11.2
Compiling byteorder v1.3.2
Compiling byte-tools v0.3.1
Compiling opaque-debug v0.2.3
Compiling keccak v0.1.0
Compiling block-padding v0.1.4
Compiling generic-array v0.12.3
Compiling block-buffer v0.7.3
Compiling digest v0.8.1
Compiling sha3 v0.8.2

You will see:

Compiling typenum v1.11.2
Compiling byteorder v1.3.2
Compiling generic-array v0.12.3
Compiling project crates: crypto (byte-tools v0.3.1, opaque-debug v0.2.3,
    keccak v0.1.0, block-padding v0.1.4, block-buffer v0.7.3,
    digest v0.8.1, sha3 v0.8.2)

This way, during a build users will see one line per group which maintains part of their dependency tree rather than one line per crate, and the number of such groups is, as stated earlier, the more important metric to keep an eye on. If a project has only one crate in a dependency tree, grouping will not be used, to make the output less noisy.
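
The grouping rule can be sketched roughly as follows. This is only an illustration, not cargo code: the `Dep` struct, the `compile_lines` function and the exact wording of the group line are assumptions made here, and the sketch prints a group at the position of its first member, so the ordering may differ from the hand-written output above.

```rust
use std::collections::BTreeMap;

/// One resolved dependency. `project` mirrors the proposed (hypothetical)
/// `project` field; `None` means the crate declares no project.
struct Dep {
    name: &'static str,
    version: &'static str,
    project: Option<&'static str>,
}

/// Build the "Compiling ..." lines. A project with two or more crates in the
/// tree collapses into one line, emitted where its first crate appears.
fn compile_lines(deps: &[Dep]) -> Vec<String> {
    // Collect the members of every project, in build order.
    let mut members: BTreeMap<&str, Vec<String>> = BTreeMap::new();
    for d in deps {
        if let Some(p) = d.project {
            members
                .entry(p)
                .or_default()
                .push(format!("{} v{}", d.name, d.version));
        }
    }

    let mut lines = Vec::new();
    let mut emitted: Vec<&str> = Vec::new();
    for d in deps {
        match d.project {
            // Multi-crate project: print the whole group once.
            Some(p) if members[p].len() > 1 => {
                if !emitted.contains(&p) {
                    emitted.push(p);
                    lines.push(format!(
                        "Compiling project crates: {} ({})",
                        p,
                        members[p].join(", ")
                    ));
                }
            }
            // No project, or a project with a single crate: plain line.
            _ => lines.push(format!("Compiling {} v{}", d.name, d.version)),
        }
    }
    lines
}

fn main() {
    let deps = [
        Dep { name: "typenum", version: "1.11.2", project: None },
        Dep { name: "digest", version: "0.8.1", project: Some("crypto") },
        Dep { name: "rand_core", version: "0.5.1", project: Some("rand") },
        Dep { name: "sha3", version: "0.8.2", project: Some("crypto") },
    ];
    for line in compile_lines(&deps) {
        println!("{}", line);
    }
}
```

Here rand_core stays on its own line because it is the only crate of its project in this tree.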

To protect against project name hijacks, we can mandate that only crate names can be used as project names. To publish a crate with the project name "foo", you have to hold write access to the "foo" crate; otherwise crates.io will deny the upload.
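
A minimal sketch of that upload check, assuming the registry can look up the owners of a crate by name. The `Owners` map and the `check_project` function are hypothetical names used for illustration; this is not the crates.io API.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical registry state: crate name -> user names with write access.
type Owners = HashMap<String, HashSet<String>>;

/// Decide whether `publisher` may upload a crate that declares `project`.
/// The project name must be an existing crate that the publisher can write to.
fn check_project(owners: &Owners, publisher: &str, project: &str) -> Result<(), String> {
    match owners.get(project) {
        None => Err(format!("project `{}` is not an existing crate", project)),
        Some(set) if set.contains(publisher) => Ok(()),
        Some(_) => Err(format!(
            "`{}` has no write access to the `{}` crate",
            publisher, project
        )),
    }
}

fn main() {
    let mut owners: Owners = HashMap::new();
    owners.insert(
        "rand".to_string(),
        vec!["alice".to_string()].into_iter().collect(),
    );

    // An owner of `rand` may publish rand_core with project = "rand".
    println!("{:?}", check_project(&owners, "alice", "rand"));
    // Anyone else is rejected.
    println!("{:?}", check_project(&owners, "mallory", "rand"));
}
```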

Extension

Since many crates do not belong to a big multi-crate project, it may be useful to add an extension which allows using the author's username as a "project" name, with the same grouping behavior during builds. For example:

[package]
name = "byteorder"
version = "1.3.2"
project = "user:BurntSushi"

You will be able to upload such crates only if the "project" user is registered by crates.io as an owner of the crate.

23 Likes

I like the idea of having a project field, but I don't like the user addition. It makes sense to group up by project, as those crates are (probably) related. This does not hold for multiple crates by the same user.

How do you want to realize grouping in general, by the way? Only group crates when no other crates are compiled in the meantime?

As an example, assume the following project: Crate A and B, B depends on A, and crate X and Y, Y depends on X. MyCrate depends on Y, thus X and Y get grouped. It also depends on B, thus A and B are grouped as well, but both groups may appear at a different time when compiling. How would you solve this? Display the project twice or magically group them together?

5 Likes

No magic is required: cargo will have all the information necessary to perform grouping at the start of the compilation process. That means that when it encounters the first crate in a group, it will display the whole group immediately, even though only one crate from the group is being compiled at that moment. To make this more informative we could use color to indicate compilation status (this will require cursor support from the terminal), e.g. a green crate name for "compilation is finished", yellow for "compilation is under way" and grey for "compilation has not started yet".
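
As a rough sketch of the color idea, the group line could be rendered with standard ANSI color escapes. The `Status` enum and `render_group` function are hypothetical names for illustration; redrawing the line in place as statuses change (the part that needs cursor support) is left out.

```rust
/// Compilation state of one crate inside a group (hypothetical type).
#[derive(Clone, Copy)]
enum Status {
    Pending,
    Building,
    Done,
}

/// Map a status to an ANSI SGR color escape sequence.
fn ansi_color(s: Status) -> &'static str {
    match s {
        Status::Done => "\x1b[32m",     // green: compilation is finished
        Status::Building => "\x1b[33m", // yellow: compilation is under way
        Status::Pending => "\x1b[90m",  // grey: compilation has not started yet
    }
}

/// Render one group line, coloring each crate name by its status and
/// resetting the color after every name.
fn render_group(project: &str, crates: &[(&str, Status)]) -> String {
    let parts: Vec<String> = crates
        .iter()
        .map(|(name, st)| format!("{}{}\x1b[0m", ansi_color(*st), name))
        .collect();
    format!("Compiling project crates: {} ({})", project, parts.join(", "))
}

fn main() {
    let line = render_group(
        "crypto",
        &[
            ("digest", Status::Done),
            ("sha3", Status::Building),
            ("keccak", Status::Pending),
        ],
    );
    println!("{}", line);
}
```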

We had a good conversation with BurntSushi about this problem.

I'm planning to add a lot of metrics to cargo-crev to address it.

First, recursive accumulative counts: line, geiger, etc. These are easy to implement and IMO more accurately describe the heaviness of each particular crate.

Second, more to the point that BurntSushi was making, I was planning to count the number of distinct "trust domains" (which roughly corresponds to "project" here). I am still not sure whether I could just deduce that information from the crate owner list for each crate (and then merge ones with clearly shared ownership), or whether I need some separate "project" idea.

While for the cargo-crev use case a project field in Cargo.toml would be great, it has one shortcoming. What is stopping anyone from setting project = "tokio" or project = "user:BurntSushi" in their own project? Unless this is somehow authenticated, it's not very trustworthy. So at minimum crates.io would have to have some ability to check that. In crev I was thinking about introducing a separate artifact, an "Ownership proof", which states "User X (author of proof) says that user Y is an owner (/maintainer/developer) of project Z".

3 Likes

This would only make sense to me if cargo had namespacing.

But even with namespacing, I don't see a benefit and heaps of problems and confusion here.

If the only result is to change the way cargo check/build display things, the mindset here seems to be "don't show the user what's going on so they don't complain about too many dependencies".

(I'm going to sidestep the splitting up crates discussion here. It's a nuanced and complex issue that would derail the topic, but it's a lot more complicated than just "many crates = bloat = bad"! Some of my thoughts are here: https://github.com/tokio-rs/tokio/issues/1318#issuecomment-514536711)

  • Does the project name need to be tied to the crate name? E.g. the tokio project can only contain tokio_* crates?
  • What if a user reserves the tokio project name for his "render gummybears on the terminal" gummybears crate? The crates.io policy has been very hands off; this could introduce a lot of confusion and moderation issues.
  • So, cargo shows I'm building the "futures" project, but I'm only depending on async-timers (hypothetical). Why is that?
  • Am I using the tokio crate, or the tokio project with 5 subcrates? What's the difference?
  • Can I depend on a project? Why not? What are the features of a project? Its subcrates? Why not?
  • futures-executor provides quite orthogonal functionality to the rest of the futures_* crates. Would it be part of the same project?
  • ....

This would introduce a very minimal kind of namespacing with the only purpose of hiding some lines on the terminal.

4 Likes

I agree with the motivation. Using a project field may help, but it seems the more general problem is namespacing, e.g. preventing an unrelated party from registering tokio-foo.

Regarding the packaging problem: as has been pointed out, the only thing preventing a distro from bundling multiple crates into a single RPM/DEB/... is the lack of a universally-applicable version number. If one is prepared to appropriate/fabricate a version number for the whole bundle and restrict from using the full combination of crate versions, a multi-crate package is still viable IMO. This is partially related, in that the package field or similar might be used to build multi-crate shared libs or packages in the future.

3 Likes

This sounds like a really cool idea!

For the question about how to group projects when their dependency trees do not allow displaying them at once: If the terminal does not have cursor support, one could still just display the crates as it is done now in that case, and just prefix them with the project name. For example, write Compiling rand:rand_chacha or so.

One thing I fear because of this is that we might end up with platform effects, that would lead to monopolies of a few developer groups. Suddenly for example the one game engine maintained by a single person might be much more attractive than the other one, where a single person just combined dependencies in a clever way.

But in an open-source ecosystem we actually want people to recombine what others have made, to provide more powerful mechanics to their crates' clients. So it is really important to keep in mind that we want to promote plurality, even if it comes with trust problems. We should therefore try to solve those trust problems in a way that promotes plurality rather than in a way that promotes monopolies.

1 Like

I like the idea, but I think it would be better to list all the crates once in the "top" crate's .toml, instead of having every crate declare what project it's a part of.

[package]
name = "rand"
version = "0.7.0"
is_project_of = ["rand_core", "rand_chacha", "getrandom"]

But I don't think we get any real benefit out of trying to enforce that all crates in a "project" have the same owner, or more generally are part of the same "trust domain". cargo-crev trying to detect or identify trust domains sounds very useful, but also orthogonal, since AFAICT any fleshed-out system for this would require crev proofs of domain membership anyway. I also don't think it makes sense to consider all crates with the same maintainer as a single logical project for either trust or UI purposes.

So my second suggestion is to make this feature not just about "projects" where N crates all have the same maintainers, but the more general "facade" pattern where one crate is acting as a wrapper of several other crates. And the only impact this would have in practice is making the UI of cargo (and perhaps also crates.io) group them together, to help people better understand "what they depend on" at a very high level.

[package]
name = "rand"
version = "0.7.0"
is_facade_of = ["rand_core", "rand_chacha", "getrandom"]

[package]
name = "tokio"
version = "0.3.0"
is_facade_of = ["tokio-executor", "tokio-io", "tokio-net", ...]

In a hypothetical future with std-aware cargo...

[package]
name = "std"
version = "1.39.0"
is_facade_of = ["core", "alloc", "compiler_builtins", "panic_unwind", ...]

And an example where there is no unified maintenance team:

[package]
name = "stdx"
version = "0.0.1"
is_facade_of = ["bitflags", "log", "lazy-static", "num", "rand", "semver", ...]

Note that this implies a single crate can be part of multiple facades. That seems inherent in this design (just as it's inherent in the OP's design that any crate can "join" any project "without permission"). It also implies that a facade (rand) can be part of another facade (stdx), but I have no strong feelings on whether that should be supported (only that we ought to consider it; it seems easy to allow or disallow).

Also, there is a trust benefit to this design (whatever we call it) in that the owners of a single crate actually control what that crate is a project/facade of; no one else can "join" the project/facade without their involvement. I think that's the best we can hope for from cargo changes alone. For cases like rand and tokio where there is a single trust domain due to shared maintenance, I'd imagine cargo-crev could check if the maintainers (well, the ones that use crev) gave high thoroughness/expertise reviews to all the crates in the project/facade, and if so that's a reasonable basis for calling it a single domain, but it shouldn't be the only way it tries to find these trust domains.

5 Likes

+1 I would like to see this for better organization on lib.rs. And I agree that the optics of crate bloat matter.

I already try to guess that grouping from git URLs (treating crates living in the same repo as part of a group), but that's a heuristic. Another heuristic could be workspaces:

[workspace]
members = [
    "rand_core",
    "rand_distr",
    "rand_jitter",
    "rand_os",
    "rand_isaac",
    "rand_chacha",
    "rand_hc",
    "rand_pcg",
    "rand_xorshift",
    "rand_xoshiro",
    "tests/wasm_bindgen",
]
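
The repository-URL heuristic mentioned above could look roughly like this. It is an illustrative sketch only, not how lib.rs actually does it; `group_by_repo` is a hypothetical helper and the normalization is deliberately naive.

```rust
use std::collections::BTreeMap;

/// Group crate names by their (naively normalized) repository URL.
/// Input items are `(crate name, repository URL)` pairs.
fn group_by_repo<'a>(crates: &[(&'a str, &'a str)]) -> BTreeMap<String, Vec<&'a str>> {
    let mut groups: BTreeMap<String, Vec<&'a str>> = BTreeMap::new();
    for (name, repo) in crates {
        // Treat trivially different spellings of one URL as the same repo.
        let key = repo
            .trim_end_matches('/')
            .trim_end_matches(".git")
            .to_lowercase();
        groups.entry(key).or_default().push(*name);
    }
    groups
}

fn main() {
    let groups = group_by_repo(&[
        ("rand", "https://github.com/rust-random/rand"),
        ("rand_core", "https://github.com/rust-random/rand.git"),
        ("getrandom", "https://github.com/rust-random/getrandom/"),
    ]);
    for (repo, names) in &groups {
        println!("{} -> {:?}", repo, names);
    }
}
```

With this input getrandom ends up in its own group, which is exactly the weakness of a per-repository heuristic for projects spread over several repos.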

1 Like

I would much rather have people use better metrics (LoC, cyclomatic complexity, number of maintainers) than try to manually group crates. Tools like cargo-crev or cargo-geiger should fill this niche.

1 Like

I'll jump in on the bandwagon with @Ixrec and @kornel: I think a bundle/facade/pack/workspace with a single unified version number would be better in this regard.

I used to work at a company with many, many internal libraries. To manage the complexity of figuring out compatible versions of libraries (DLL hell), at some point the idea of packs was introduced, and it made everything much, much simpler for users, as suddenly they only had to talk about a handful of packs (and their versions) rather than hundreds of libraries (and their versions).

I would even go further and enforce access to the libraries through the packs, although allowing selecting only a subset:

[dependencies]
crypto = { version = "0.8.2", libraries = ["byte-tools", "keccak", "digest"] }
tokio = { version = "0.1.22", libraries = ["core", "io", "net"] }

A pack would consist of pinned versions of the libraries, allowing one to put their trust in the packager with regard to the packed libraries' quality and security, with an expectation that the packager would have performed at least a minimal audit of all libraries.

On the other hand, packs would be held to the same standards as regular libraries, especially with regard to SemVer, and thus depending on both "0.1.21" and "0.1.22" of tokio would lead cargo to pick "0.1.22".
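
To illustrate the version unification described here, a simplified sketch follows. It assumes each requirement is a plain `major.minor.patch` string treated as a caret-style minimum, and it ignores pre-release tags, ranges and whether the chosen version is actually published; `parse`, `compatible` and `unify` are hypothetical helpers, not cargo's resolver.

```rust
type Version = (u64, u64, u64);

/// Parse exactly three dot-separated numeric components.
fn parse(v: &str) -> Option<Version> {
    let mut it = v.split('.').map(|p| p.parse::<u64>().ok());
    let t = (it.next()??, it.next()??, it.next()??);
    if it.next().is_some() {
        None
    } else {
        Some(t)
    }
}

/// Caret-style compatibility: the leftmost non-zero component must match.
fn compatible(a: Version, b: Version) -> bool {
    if a.0 != 0 || b.0 != 0 {
        a.0 == b.0
    } else if a.1 != 0 || b.1 != 0 {
        a.1 == b.1
    } else {
        a.2 == b.2
    }
}

/// Two compatible requirements collapse to the higher version;
/// incompatible ones cannot be unified into a single copy.
fn unify(a: &str, b: &str) -> Option<Version> {
    let (a, b) = (parse(a)?, parse(b)?);
    if compatible(a, b) {
        Some(a.max(b))
    } else {
        None
    }
}

fn main() {
    println!("{:?}", unify("0.1.21", "0.1.22")); // compatible, higher one wins
    println!("{:?}", unify("0.1.22", "0.2.0")); // incompatible
}
```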

And thus, in this sense, a pack or facade could be thought of as a single entity library, even when composed of independent parts.


Another possibility, of course, would be to go the reverse way: allowing a seemingly monolithic crate such as tokio "0.1.22" to be composed of multiple library binaries that can be mixed and matched.

The main benefit is that we need not introduce a new term, and cargo will continue working in terms of crates just as before. However, authors would be allowed to create a crate composed of multiple libraries (with inter-dependencies), and downstream library authors would be invited to specify only the subset of libraries that they actually use, to minimize compilation time.

It achieves the same benefit as "project", with no issue of rogue membership. On the other hand, it does not allow third parties to repackage independent libraries together with a seal of approval.

2 Likes

So if I am not using the "top" crate (e.g. I only want rand_chacha and, as a consequence, rand_core), I (and my users) will not get any benefit from the grouping? Don't forget that many will not use the "top" level crate, for the various reasons which I listed in the OP. The same problem applies to your facade idea. And what about rand_distr, which sits at a higher level than rand (i.e. it depends on rand and not the other way around)? Should we allow listing crates which depend on a given crate? That looks messy to me. I think in this case a many-to-one relation works better and is more convenient than a one-to-many relation.

This will not work well, for example, for RustCrypto, which has many repositories. For example, I believe that chacha20poly1305 users should not care that it is assembled by combining a number of building blocks from different crates, kept in different repos but maintained by the same group. rand also keeps getrandom in a separate repo and previously had a dedicated repo for PRNG crates.

I don't think this approach is really different from "meta" crates which re-export other crates. And how do you propose to change logs during compilation?

I think in practice it will not work so well as you envision. Let's take hmac crate which provides a generic HMAC implementation over hash functions. Should I make a "pack" for it? But some users will use HMAC-SHA256. Create packs for popular variants? Looks like too much pain for quite little gain. But wait, it goes better, pbkdf2 is generic over MAC functions and can use HMAC or any other MAC function. MAC functions also can be used together with stream or block ciphers (for the latter you will have to use some kind of "block mode") in generic AEAD constructs.

To summarize: as someone who works on two micro-crate projects (RustCrypto and rand), I simply don't see how your idea will work in practice.

Well, one prominent example of a project which does not use the namespace approach is RustCrypto. We try to use "common" names for our crates (i.e. sha2 for SHA-2 hash functions), since users (especially not so experienced ones) usually reach first for crates with such names before trying other available alternatives.

2 Likes

This reminds me somewhat of withoutboats' proposal to be able to include several crates in a single package

2 Likes

I don't think this approach is really different from "meta" crates which re-export other crates. And how do you propose to change logs during compilation?

It is similar to a "meta" crate, with the exception that from the point of view of Cargo there is a single crate, and therefore there is a single version, a single log line during compilation, etc.

I think in practice it will not work so well as you envision.

Well, as I mentioned, this is the design that was used in my previous company.

We had ~5,000 developers, and the team I worked on depended on 4 packs:

  • 3rd-party: a pack of open source libraries, about 20 to 30.
  • core and middleware: packs from the middleware team, about 50 and 100 libraries respectively.
  • reservation: pack from the reservation team, about 500 libraries.

The pack system was used for all 9 years I worked there, and was generally considered an improvement by any developer who had previously had to resolve DLL hell by hand, since it pushed the onus onto pack maintainers.

So I can say with a certain degree of confidence that it works well in practice.

Let's take hmac crate which provides a generic HMAC implementation over hash functions. Should I make a "pack" for it? But some users will use HMAC-SHA256. Create packs for popular variants? Looks like too much pain for quite little gain. But wait, it goes better, pbkdf2 is generic over MAC functions and can use HMAC or any other MAC function. MAC functions also can be used together with stream or block ciphers (for the latter you will have to use some kind of "block mode") in generic AEAD constructs.

There are multiple levels here:

  • Cryptographic hashes.
  • Stream and Block ciphers.
  • Cryptographic algorithms.

A pack could be released which contains "known good" (as per the maintainers' opinions) cryptographic hashes, stream and block ciphers and cryptographic algorithms; you'd depend on this "crypto-blocks" pack and cherry pick the hashes, ciphers and algorithms that you care about. The main advantage you'd have as a user is that you could trust in the maintainers' having checked the quality of the implementation and focus on which functionality you need; furthermore, over time and releases, the maintainers would be culling out deprecated libraries and including new known good ones, so you'd have an up-to-date list of primitives from which to build.

Or alternatively you'd depend on an easier-to-use library which itself depends on the pack but re-exports "known good" combinations for particular use cases, using feature toggles to selectively cherry-pick the libraries it needs.

To summarize: as someone who works on two micro-crate projects (RustCrypto and rand), I simply don't see how your idea will work in practice.

You wouldn't be the one deciding whether those micro-crates belong to any pack, though you could lobby a pack maintainer for inclusion.


Real-life examples of packs:

  • std can be thought of as a pack.
  • In the C++ world, Qt and Boost can be thought of as packs.

1 Like

I think you are trying to solve a somewhat different problem than I am. I want to make micro-crate designs less scary in people's eyes by making it visible that having 30 crates in your dependency tree does not mean you have to trust 30 groups of developers, and that this number is often significantly smaller. Plus I want to keep all the advantages of micro-crate designs, with which you can depend only on those parts which you really need, while your approach, if I understand it correctly, has essentially the same drawbacks as monolithic crates.

You, on the other hand, try to move trust points from crate maintainers to meta/pack maintainers, so instead of trusting 10 developers, you will have to trust just a couple of crate reviewers. It's indeed a viable approach, which tackles the issue from a different angle. And I don't think the two approaches are mutually exclusive.

Although I think using meta/pack crates is a roundabout way of doing what we really want, and instead we should use a proper review framework which would let us declare restrictions for our project like "use only crates and versions which were approved by reviewers A, B and C; never use crate versions which were blacklisted by reviewers D and F". The former reviewers (A, B, C) can be trusted figures in the community, your company's own registry of approved crates, or even paid subscriptions with liability guarantees, while the latter (D and F) may be vulnerability databases like RustSec.
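
A minimal sketch of such a policy check, under assumptions made here for illustration: an approval from any one of the trusted approvers is enough, and a single blacklist entry from a trusted blocker always wins. `Policy`, `Review` and `allowed` are hypothetical names and do not correspond to cargo-crev's data model.

```rust
use std::collections::HashSet;

/// Hypothetical project policy: whose approvals and whose blacklists we trust.
struct Policy {
    approvers: HashSet<&'static str>,
    blockers: HashSet<&'static str>,
}

/// One review of an exact crate version (hypothetical shape).
struct Review {
    reviewer: &'static str,
    krate: &'static str,
    version: &'static str,
    approve: bool,
}

/// A crate version is usable if at least one trusted approver approved it
/// and no trusted blocker blacklisted it.
fn allowed(p: &Policy, reviews: &[Review], krate: &str, version: &str) -> bool {
    let mut approved = false;
    for r in reviews
        .iter()
        .filter(|r| r.krate == krate && r.version == version)
    {
        if !r.approve && p.blockers.contains(r.reviewer) {
            return false; // a trusted blacklist entry always wins
        }
        if r.approve && p.approvers.contains(r.reviewer) {
            approved = true;
        }
    }
    approved
}

fn main() {
    let policy = Policy {
        approvers: ["A", "B", "C"].iter().copied().collect(),
        blockers: ["D", "F"].iter().copied().collect(),
    };
    let reviews = [
        Review { reviewer: "A", krate: "foo", version: "1.0.0", approve: true },
        Review { reviewer: "B", krate: "foo", version: "1.0.1", approve: true },
        Review { reviewer: "D", krate: "foo", version: "1.0.1", approve: false },
    ];
    println!("foo 1.0.0: {}", allowed(&policy, &reviews, "foo", "1.0.0"));
    println!("foo 1.0.1: {}", allowed(&policy, &reviews, "foo", "1.0.1"));
    println!("bar 0.1.0: {}", allowed(&policy, &reviews, "bar", "0.1.0"));
}
```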

3 Likes

To explicitly state some of the implicit assumptions I was making in my last post: The problem of "user depends on N crates, and N seems like a lot to them, but it's really not" seems to require that the user is not familiar with those N crates. If the user has already gone to the trouble of specifically depending on just tokio-io and tokio-timer instead of the whole tokio bundle, then they're probably familiar enough with the tokio ecosystem not to be spooked or misled by seeing multiple tokio-related crates in their cargo output. That's why I don't see any conflict between the two suggestions in my previous post and the advantages of being able to depend on exactly the microcrates you actually use.

Crates often indicate their place in the hierarchy of a project with their name; for example, rand_chacha is a part of the rand library, and futures-util is a part of the futures project.

Perhaps this could be formalised? A "path separator" could be standardised (perhaps "/" rather than hyphen or underscore?), which could be used to provide a hierarchy to crates in a project. This could provide cargo, crates.io and docs.rs with information to accurately represent a hierarchy of crates.

Publishing a crate like rand/chacha could depend on ownership of the rand crate. In this way authorship is assured, and a namespace for the project's crates is created which could be browsable on crates.io and docs.rs.

There have been previous proposals to allow you to claim myprefix- or myprefix_, which I thought was a good idea, and which requires no other changes to Rust.

I believe this is the most relevant past discussion:

3 Likes

We should not be hiding crate dependencies because if any reason for crate separation exists then a reason for displaying the separate crate exists. In particular, if you build sha1 then sha1 should show up, not just hashes.

As others said, if/when cargo adds namespaces then yes cargo should list namespaces when listing crates.

In practice, we often choose not between feature flags and subcrates, but between feature flags alone and subcrates plus feature flags, which gets redundant. If your subcrate requires a feature flag then maybe you should only have the feature flag?

1 Like