Discussion: Improved UX for Distributed/De-centralized Development

I am interested in improving Cargo's dependency resolution and downloading for distributed development.

In distributed development, developers do not necessarily share [much] infrastructure. A famous example of this is the Linux kernel. There is no "one place" where Linux lives. There is no single git repo to which all developers git push at the end of the day. For the kernel, a significant amount of developer communication takes place on mailing lists through patches.

It is easy and straightforward to pass around a single git repository, regardless of how de-centralized your workflow is. Problems arise when dependencies are introduced between crates from different repositories. At present, I believe that there are only two basic options for referring to crates. Please correct me if I am wrong:

  1. Reference by absolute URL:

    • Crate registries like crates.io, and alternative/private registries, are centralized places which must be common to all developers. The crate registry is essentially a mechanism for turning a crate name + version into an absolute URL.

    • Git repositories are another option, but they too must be referenced by absolute URLs that are common among all developers.

  2. Reference by local filesystem: Workspaces and relative file paths are intended to assemble together crates which are part of the same repository. Such crates are, by necessity, always versioned and released together.

Neither of these methods works well for distributed development.

The requirement for using absolute URLs means that a crate's identity (name + version + origin) is inextricably linked to where it can be found. In cases where developers can't share (or don't want to share) centralized server infrastructure, this makes things difficult. There is no [easy] way to tell Cargo,

  • "Look for this dependency relative to something else, like my repo's origin."
  • "Use this dependency by name, but require (or allow) the top-level crate to declare where it can be found."

Some workarounds that I do not consider real solutions include:

  • Path-based dependencies, in conjunction with git submodules. Submodules are awkward to work with. They also lead to lower-level projects declaring dependencies for higher-level projects, which isn't good.

  • Local registries. These require that each and every developer publish all crates locally. This may entail a non-trivial amount of "manual labor," as crates have to be published in dependency order. Unpublished development snapshots cannot be used.

  • Vendored dependencies. These are perfectly workable, but the present-generation tools require an existing registry or an absolute-URL repository source.

Is there community interest in supporting distributed workflows? Does anyone have any suggestions for improving these workflows? I will gladly assist with writing an RFC, if one is called for.

Related topics / further reading:

  • #6859: feature request for relative paths to git repositories. I have commented here.

  • #6713: "Want config option for definitively specifying local crate paths"

  • path overrides cannot be used "to tell Cargo how to find local unpublished crates"

  • Git distributed workflows.

(EDIT: Crates are also identified by their origin, and not just name + version.)

Cargo also includes the origin of a crate in its identity.

That is, regex 1.0.0 from crates.io and regex 1.0.0 from the local file system can coexist in the same crate graph.
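One rough way to picture this, as a sketch only: if package identity is modeled as a (name, version, source) triple, then two packages that agree on name and version but differ in source are distinct keys. The PackageKey type, the helper functions, and the source strings below are hypothetical stand-ins for illustration, not Cargo's internal types.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a package identity; not Cargo's real type.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct PackageKey {
    name: String,
    version: String,
    /// Where the package came from (registry, local path, git URL, ...).
    source: String,
}

/// Build a key from borrowed strings.
fn key(name: &str, version: &str, source: &str) -> PackageKey {
    PackageKey {
        name: name.to_string(),
        version: version.to_string(),
        source: source.to_string(),
    }
}

/// Count how many distinct packages a list of keys describes.
fn distinct_packages(keys: &[PackageKey]) -> usize {
    keys.iter().collect::<HashSet<_>>().len()
}
```

With this model, a regex 1.0.0 key from a registry and a regex 1.0.0 key from a local path count as two packages, while repeating the registry key adds nothing.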

@matklad, you're right about the crate origin being part of its identity. I will make a minor edit to reflect this. I am not the most familiar with cargo's internals, so please bear with me.

The valid forms for [dependencies] toml appear to be:

  • foo = "0.1.0": Use foo-0.1.0 from crates.io
  • foo = {version = "0.1.0", registry = "mine"}: Use foo-0.1.0 from an alternative registry defined in your .cargo/config file
  • foo = {version = "0.1.0", path="baz/foo"}: Use foo-0.1.0 from a local relative path, which must be a plain directory that does not require any VCS manipulation.
  • foo = {git = "https://somewhere.org/foo.git"}: Use foo from the given git repository, identified by absolute URL.

If that is correct, then—with the exception of alternative registries—a crate cannot declare a dependency without also specifying its absolute URL. A higher-level project can override the source… but the crate still has to reside on infrastructure that is "globally accessible" for all its users.

It may be a good idea (or a very bad one) to soften this requirement a bit. This is especially true of git repository sources, which can easily be mirrored and relocated while maintaining their integrity. We could use relative URLs, and resolve them to absolute ones during SourceID creation. We could also use non-URL identifiers such as URNs or UUIDs, if that might be needed to break ambiguities.
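To illustrate what resolving a relative URL to an absolute one could look like, here is a minimal sketch in Rust. It assumes the relative URL is resolved against the URL of the depending repository, much as git does for submodules. The function name resolve_git_url is hypothetical, and the handling is deliberately simplified: only "./" and "../" segments are understood, and nothing here is existing Cargo behavior.

```rust
/// Resolve `relative` (e.g. "../foo.git") against `base` (e.g. the URL of
/// the depending repository). URLs that already carry a scheme are returned
/// unchanged. Returns None if `base` has no scheme or if `relative` tries to
/// climb above the root of `base`.
fn resolve_git_url(base: &str, relative: &str) -> Option<String> {
    // Anything with a scheme is treated as already absolute.
    if relative.contains("://") {
        return Some(relative.to_string());
    }

    // Split the base into "scheme://", the authority, and path segments.
    let scheme_end = base.find("://")? + 3;
    let (prefix, rest) = base.split_at(scheme_end);
    let mut parts = rest.trim_end_matches('/').split('/');
    let authority = parts.next()?;
    let mut segments: Vec<&str> = parts.collect();

    // Walk the relative path, treating the base as a directory.
    for seg in relative.split('/') {
        match seg {
            "" | "." => {}
            ".." => {
                // Refuse to climb above the host.
                segments.pop()?;
            }
            other => segments.push(other),
        }
    }

    Some(format!("{}{}/{}", prefix, authority, segments.join("/")))
}
```

Under these assumptions, "../foo.git" resolved against ssh://example.org/git/top.git yields ssh://example.org/git/foo.git, and the same relative URL resolved against a file:// mirror yields the mirrored sibling. The host example.org is a placeholder.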

I find this particular notion of "distributed/decentralized" quite specific and a little vague. I think it would be very helpful if you could outline your particular use case: the way you would like to set things up for one of the projects you participate in, what the constraints are for that project, and what makes things hard. As is, you have a fairly long description that does not (to me at least) make the problem very clear.

I haven't tried and seen where it fails, but here's a sketch of the workflow I think works best for something like this:

Dependencies are defined with package = { rev = "hash" }. This identifies a specific revision without providing a source. Then, the user adds in a [patch] table the actual path to the git repository with package = { git = "path" }. If they need to temporarily change the revision target, they can do so in either location.

How do you usually solve this problem? What would you do if Cargo wasn't used?

@djc, the specific things I am interested in doing include:

  • "Publish" to git repositories instead of to a registry. This is already well-supported in Cargo. I am not interested in setting up a registry.

  • Take my repositories "on the road," by mirroring them, so that I can use them in the field without needing a network connection back to my server. I want to be able to do full development work on any or all of my crates; not just a top-level crate. I'd like this to be as painless as possible: ideally, I could mass-correct all my git [dependencies] by changing one line in a config file under ~/.cargo or the like. This also needs to work across URL schemes: my main git server uses ssh://, but when I am on the road I might want to use file://.

When I say "distributed workflows," I mostly mean that I want to take advantage of the fact that git is decentralized and is easily mirrored and forked in multiple places.

@kornel, at present I use git submodules with [path] dependencies. Submodules can have relative URLs, and the URL is interpreted relative to that of the remote tracking branch. To change locations, all I need to do is change where my origin in the top-level project points. This solution is acceptable for projects with simple, one-level dependency graphs. The minute you add more than one level of dependency, things get ugly very quickly. One winds up redoing a lot of the packaging work done by the lower-level projects.

The solution proposed by @CAD97 is certainly an alternative. In this case, however, I have to [patch] whatever I want to build. This requires introducing more commits whose sole purpose is to change URLs. I would rather avoid doing this, as it adds clutter and merge conflicts, and said commits will definitely break things for others if accidentally merged. I would much rather be able to redirect en masse without needing to touch my VCS.

Would allowing [patch] tables at the level of ~/.cargo/config.toml solve the problem? That would be a central place to tell cargo about where git repositories are and then versioning can be done by revision hashes.

(That said, I was thinking something similar could be done for individual projects outside the repo, by adding a virtual manifest one directory up that makes a workspace of just that crate and applies the [patch] there. That only works if the crate isn't already a workspace, though, I suppose.)

@CAD97, [patch] in ~/.cargo/config could be made to work, so long as it would properly union the URL in ~/.cargo/config.toml with the tag/branch/rev specifier in Cargo.toml. I certainly don't want to be required to specify versions in ~/.cargo/config.

Needing to specify each crate this way may not scale well. Any new git crate added anywhere would probably need an entry in this file. The user would have to walk the dependency graph themselves to see which crates need patching, and patch them.
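The "union" behavior described above can be sketched roughly as follows. This is an illustration under stated assumptions, not how Cargo works: the GitRef and GitDep types, the relocate function, and the idea of a user-level table mapping crate names to URLs are all hypothetical. The point is only that the user-level entry supplies the location, while the branch/tag/rev always comes from Cargo.toml.

```rust
use std::collections::HashMap;

/// Which commit a git dependency points at, as written in Cargo.toml.
#[derive(Debug, Clone, PartialEq, Eq)]
enum GitRef {
    Branch(String),
    Tag(String),
    Rev(String),
}

/// Hypothetical, simplified view of one git dependency from Cargo.toml.
#[derive(Debug, Clone, PartialEq, Eq)]
struct GitDep {
    name: String,
    url: String,
    reference: GitRef,
}

/// Replace only the URL of `dep` when the user-level table has an entry for
/// it. The branch/tag/rev from the manifest is always preserved.
fn relocate(dep: &GitDep, user_urls: &HashMap<String, String>) -> GitDep {
    let url = user_urls
        .get(&dep.name)
        .cloned()
        .unwrap_or_else(|| dep.url.clone());
    GitDep { url, ..dep.clone() }
}
```

Note that this still needs one entry per crate, which is the scaling concern raised above.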

What about adding very basic URL rewriting capability? Something like:

# ~/.cargo/config
[rewrites]
"ssh://whereever.org/git" = "file:///home/me/git"

which would perform a prefix match on the key URL and replace with the value URL, like so:

ssh://whereever.org/git/mycrate.git => file:///home/me/git/mycrate.git

The above TOML syntax isn't very environment-variable friendly. We could use a more verbose syntax instead:

# ~/.cargo/config
[rewrites]
enable = true     # default

[rewrites.one]
enable = true     # default
prefix = "ssh://whereever.org/git"
replace = "file:///home/me/git"
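To make the matching rule concrete, here is a minimal sketch of prefix rewriting in Rust. It only illustrates the proposed [rewrites] table, which does not exist in Cargo. The Rewrite struct (its fields mirror enable/prefix/replace above), the rewrite_url function, the first-match-wins ordering, and the path-boundary check are all assumptions made for illustration.

```rust
/// One hypothetical rewrite rule, mirroring a [rewrites.<name>] table.
struct Rewrite {
    enable: bool,
    prefix: String,
    replace: String,
}

/// Apply the first enabled rule whose prefix matches; otherwise return the
/// URL unchanged.
fn rewrite_url(rules: &[Rewrite], url: &str) -> String {
    for rule in rules.iter().filter(|r| r.enable) {
        if let Some(rest) = url.strip_prefix(rule.prefix.as_str()) {
            // Assumption: match only on a path boundary, so that ".../git"
            // does not also capture ".../gitolite".
            let on_boundary =
                rest.is_empty() || rest.starts_with('/') || rule.prefix.ends_with('/');
            if on_boundary {
                return format!("{}{}", rule.replace, rest);
            }
        }
    }
    url.to_string()
}
```

With the single rule from the example above, ssh://whereever.org/git/mycrate.git becomes file:///home/me/git/mycrate.git, while a URL under ssh://whereever.org/gitolite is left alone.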

Alternatively, we could restrict rewrites to special URLs which serve as placeholders. For example,

# Cargo.toml
[dependencies]
foo = {git = "placeholder://SOME_UNIQUEISH_IDENTIFIER/foo.git", branch = "blah" }

# ~/.cargo/config
[placeholders]
SOME_UNIQUEISH_IDENTIFIER = "file:///home/me/git"

The placeholder approach may side-step some match issues, such as usernames and case-sensitivity. It also ensures that a URL will never be assumed—the user has to explicitly define one. (This may or may not be a good thing.)
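The placeholder idea could be sketched like this. Again, the placeholder:// scheme and the [placeholders] table are proposals from this post rather than Cargo features, and the expand_placeholder function and its error message are hypothetical. The sketch shows the "never assume a URL" property: an undefined placeholder is an error instead of a guess.

```rust
use std::collections::HashMap;

/// Expand "placeholder://NAME/rest" using the user-defined `placeholders`.
/// Returns Err if NAME is not defined, so a location is never guessed.
/// URLs that do not use the placeholder scheme pass through untouched.
fn expand_placeholder(
    placeholders: &HashMap<String, String>,
    url: &str,
) -> Result<String, String> {
    let rest = match url.strip_prefix("placeholder://") {
        Some(rest) => rest,
        None => return Ok(url.to_string()),
    };

    // Split "NAME/path" into the identifier and the remaining path.
    let (name, path) = match rest.split_once('/') {
        Some((name, path)) => (name, path),
        None => (rest, ""),
    };

    let base = placeholders
        .get(name)
        .ok_or_else(|| format!("undefined placeholder: {}", name))?;

    if path.is_empty() {
        Ok(base.clone())
    } else {
        Ok(format!("{}/{}", base.trim_end_matches('/'), path))
    }
}
```

The identifier is compared as an exact, case-sensitive string here, which is one possible way of avoiding the matching ambiguities mentioned above.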

Why can't you just use

foo = { version = "1.0.4", path = "../foo" }

Then, if you don't put a foo in the parent directory, cargo sources it from crates.io; otherwise it uses your local copy. I do this frequently for simultaneous development on two related crates. No need to use git submodules or anything tricky.

@droundy, I never would have expected cargo to fall back to crates.io if it was instructed to search for a file path.

There are two reasons why this solution is less than ideal for me:

  1. A cross-repository path dependency is under-specified. It could refer to any commit which carries the specified version number. While that's fine for local development, confusion will arise if I try to share this with others. "Which version of foo should I check out?" The dependency graph doesn't say.

  2. Cargo cannot assist the user in obtaining or checking out the correct version.

In addition, my use-case covers crates which are not stored in any registry, including crates.io. There is no registry version on which to fall back.

Is there a specific reason why a monorepo doesn't work for you? The project you gave as your "famous example", Linux, works with a single source tree, and so do many other projects, like Firefox, that have used and influenced the design of Cargo.

@notriddle, there are a number of factors which make a monorepo attractive and appropriate. I don't believe it's a good fit for me, however.

Monorepos are well-suited for monolithic projects, like linux, which are tightly integrated and are used "all at once." These products typically do not need to concern themselves with "out-of-project" code re-use. Is the kernel's DMAEngine, for example, really useful without the rest of the kernel infrastructure around it? Probably not.

Conversely, when the goal is to write a re-usable library or eight, it may be difficult to convince downstream consumers to ingest a monorepo. When all I need is the ability to read PNGs, I will likely opt for something like libpng over GDAL. The latter is arguably much more capable… but if I don't need spatial data support or the ability to read TIFFs, then all GDAL does is add API surface, code mass, and build headaches.

It has been my personal experience that drawing hard borders between projects leads to better-planned, better-versioned APIs with smaller surfaces. Projects which stand apart can evolve without too many constraints from their up- or down-stream dependencies. One substantial advantage of individual repos is the ability to specify versions "loosely." Consumers are not forced to use "whatever version was current as of time X" of every package or crate.

Human factors are also important here. A monolithic project probably has similar contribution guidelines and review procedures for every part of the codebase. There is probably a committee, a benevolent dictator, or such, who watches over the entire thing. If there is no such entity—perhaps because the parts of the system are too disparate to have a single stakeholder, or because they are never designed to all fit together—then there is trouble ahead. These projects likely do not belong together.

In my opinion, tightly-coupled projects can reap the benefits of sharing a repository. Loosely-coupled projects, which may lack a single purpose or an overarching maintainer of the "system of systems," probably should not.

Of course, the minute we start gluing projects together from multiple repositories, we start to need some kind of package manager.

One issue, I believe, is that given an arbitrary git repository, there is no mechanism to know which commits correspond to releases on crates.io. If there were, we could write a tool which, given a top-level directory of git repositories, spins up an instance of crates.io and runs cargo with the default registry set to the local instance instead of crates.io.

Perhaps a field in Cargo.toml or some other file (Releases.toml?) could correlate semver versions with version-control commits. Without that, I don't see it providing any benefit beyond just using paths.

There is a .cargo_vcs_info.json in recently published crates that contains the commit hash of the checkout at the time of publishing. However, there is never any guarantee about the relationship between the code on crates.io and git (there are exclude directives in Cargo.toml, and of course it's always possible to reset a git repo to any commit and publish something completely different).

But "spinning up crates.io" is an entirely unnecessary complication that you never need. You can specify mixed crates.io+path and crates.io+git dependencies. There are [patch.crates-io] and [patch.'git url'] directives that can be set for the entire workspace. And if all else fails, there's source replacement in .cargo/config that only needs a local path to a git clone of the index.

If you use dependency versioning diligently, then a wrong commit will cause a version mismatch, and from the error message it will be known which version is needed (although I must admit that when developing mostly within git I forget to bump deps' versions).

You can also specify rev or tag for dependencies in Cargo.toml requiring a specific commit that you want. If commit/tag is too specific, you can require a branch (and a version range if branch is too ephemeral).

But in the end if you want multiple repositories, and something on top of them that tracks them and their versions — that's git submodules, unfortunately.

It's possible to independently make a cargo workspace for crates living in different git repositories by using submodules and [patch.'git url'] dep = {path = "./dep"} overrides.

Correct, I do not want to replicate crates.io.

I should add, however, that the source replacement in .cargo/config cannot be used to tell Cargo how to find local unpublished crates.

Thanks for pointing out .cargo_vcs_info.json. It is a bit unfortunate that this exists in the published crate rather than as a list of version/sha1 pairs in the crates.io-index. As it is, given an arbitrary git repository, to figure out the crates derived from it you would need to read Cargo.toml, check crates.io, download every cargo package, attempt to rebuild the published crate, and check the checksum against the crates.io-index... and, as you say, there is no guarantee that this last step will work (e.g. if the history was rewritten since the crate was released, or presumably for any number of other reasons).

I know about the Cargo.toml patch directives and .cargo/config. The main things that are bothersome about these are:

  • Cargo.toml is checked into the repository, mixing local build stuff with remote build stuff, so I end up stashing these a lot.
  • There is no option afaict to specify the path to .cargo/config, so we need to switch around filesystem state.

Finally, I wasn't suggesting replicating all of crates.io, just populating an instance of a registry served from local git repositories.

Correct, but I believe that we were discussing out-of-tree path dependencies instead. If one replaces a git source with a path source, I'm pretty sure that the rev and related information is ignored. The safe way to do things is, as you have mentioned, git submodules. Submodules are presently how I solve this problem. This approach does come with a number of drawbacks, as mentioned above.

@ratmice, are you trying to do something similar to me? If so, can you describe your current workflow?

@cbs228 my workflow is currently basically ad-hoc modification via [patch], migrating at various times back and forth between

  • path="../foo",
  • git = "https://somewhere.org/foo.git",
  • version = "..." once everything finally gets upstreamed.

I suppose I am trying to do something similar to you, in that I think the current workflow I use is fairly tedious.

If I could just always use the last option along with registry="some_local_registry" configured via .cargo/config, and ensure that the cargo publish step can be done reproducibly from a local git repository to my local registry and checked against crates.io-index checksums, it could eliminate a lot of the fiddling and conflicts that currently arise from modifying Cargo.toml.