Discussion: Improved UX for Distributed/De-centralized Development

I haven't tried it to see where it fails, but here's a sketch of the workflow I think works best for something like this:

Dependencies are defined with package = { rev = "hash" }. This identifies a specific revision without providing a source. Then, the user adds in a [patch] table the actual path to the git repository with package = { git = "path" }. If they need to temporarily change the revision target, they can do so in either location.

How do you usually solve this problem? What would you do if Cargo wasn't used?

@djc, the specific things I am interested in doing include:

  • "Publish" to git repositories instead of to a registry. This is already well-supported in Cargo. I am not interested in setting up a registry.

  • Take my repositories "on the road," by mirroring them, so that I can use them in the field without needing a network connection back to my server. I want to be able to do full development work on any or all of my crates; not just a top-level crate. I'd like this to be as painless as possible: ideally, I could mass-correct all my git [dependencies] by changing one line in a config file under ~/.cargo or the like. This also needs to work across URL schemes: my main git server uses ssh://, but when I am on the road I might want to use file://.

When I say "distributed workflows," I mostly mean that I want to take advantage of the fact that git is decentralized and is easily mirrored and forked in multiple places.

@kornel, at present I use git submodules with [path] dependencies. Submodules can have relative URLs, and the URL is interpreted relative to that of the remote tracking branch. To change locations, all I need to do is change where my origin in the top-level project points. This solution is acceptable for projects with simple, one-level dependency graphs. The minute you add more than one level of dependency, things get ugly very quickly. One winds up redoing a lot of the packaging work done by the lower-level projects.

The solution proposed by @CAD97 is certainly an alternative. In this case, however, I have to [patch] whatever I want to build. This requires introducing more commits whose sole purpose is to change URLs. I would rather avoid doing this: it adds clutter, invites merge conflicts, and the commits will definitely break things for others if accidentally merged. I would much rather be able to redirect en masse without needing to touch my VCS.

Would allowing [patch] tables at the level of ~/.cargo/config.toml solve the problem? That would be a central place to tell cargo about where git repositories are and then versioning can be done by revision hashes.

(That said, I was thinking something similar could be done for individual projects outside the repo by adding a virtual manifest one directory up to workspace just that crate and [patch] it. Only works if it's not already a workspace, though, I suppose.)

@CAD97, [patch] in ~/.cargo/config could be made to work, so long as it would properly union the URL in ~/.cargo/config.toml with the tag/branch/rev specifier in Cargo.toml. I certainly don't want to be required to specify versions in ~/.cargo/config.

Needing to specify each crate this way may not scale well. Any new git crate added anywhere would probably need an entry in this file. The user would have to walk the dependency graph themselves to see which crates need patching, and patch them.

What about adding very basic URL rewriting capability? Something like:

# ~/.cargo/config
[rewrites]
"ssh://whereever.org/git" = "file:///home/me/git"

which would perform a prefix match on the key URL and replace with the value URL, like so:

ssh://whereever.org/git/mycrate.git => file:///home/me/git/mycrate.git

The above TOML syntax isn't very environment-variable friendly. We could use a more verbose syntax instead:

# ~/.cargo/config
[rewrites]
enable = true     # default

[rewrites.one]
enable = true     # default
prefix = "ssh://whereever.org/git"
replace = "file:///home/me/git"
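To make the prefix-match idea concrete, here is a minimal sketch in Rust. It is not how Cargo works today: `rewrite_url` is a hypothetical helper, and the rule that a match must end on a `/` boundary is my own assumption, added so that one prefix does not accidentally capture a sibling directory.

```rust
/// Hypothetical helper: apply the first matching (prefix, replacement) rule.
/// Returns the URL unchanged when no rule matches.
fn rewrite_url(url: &str, rules: &[(&str, &str)]) -> String {
    for (prefix, replacement) in rules {
        if let Some(rest) = url.strip_prefix(*prefix) {
            // Assumption: only accept matches that end on a path boundary,
            // so "ssh://whereever.org/git" does not match ".../gitolite/x.git".
            if rest.is_empty() || rest.starts_with('/') {
                return format!("{}{}", replacement, rest);
            }
        }
    }
    url.to_string()
}
```

With the single rule from the example config, ssh://whereever.org/git/mycrate.git maps to file:///home/me/git/mycrate.git, and any URL outside that prefix passes through untouched.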

Alternatively, we could restrict rewrites to special URLs which serve as placeholders. For example,

# Cargo.toml
[dependencies]
foo = {git = "placeholder://SOME_UNIQUEISH_IDENTIFIER/foo.git", branch = "blah" }

# ~/.cargo/config
[placeholders]
SOME_UNIQUEISH_IDENTIFIER = "file:///home/me/git"

The placeholder approach may side-step some match issues, such as usernames and case-sensitivity. It also ensures that a URL will never be assumed—the user has to explicitly define one. (This may or may not be a good thing.)
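A rough sketch of how placeholder resolution could behave, again in Rust. Everything here is hypothetical: the `placeholder://` scheme is only the proposal above, `resolve_placeholder` is an illustrative name, and the map stands in for a `[placeholders]` table read from the config. The point it shows is that an undefined placeholder is an error rather than a silent guess.

```rust
use std::collections::HashMap;

/// Hypothetical helper: expand `placeholder://ID/path` using a user-supplied map.
/// URLs with any other scheme are returned unchanged.
fn resolve_placeholder(
    url: &str,
    placeholders: &HashMap<&str, &str>,
) -> Result<String, String> {
    let rest = match url.strip_prefix("placeholder://") {
        Some(rest) => rest,
        None => return Ok(url.to_string()),
    };
    // Split the identifier from the remainder of the path, if there is one.
    let (id, path) = match rest.split_once('/') {
        Some((id, path)) => (id, path),
        None => (rest, ""),
    };
    match placeholders.get(id) {
        Some(base) if path.is_empty() => Ok(base.to_string()),
        Some(base) => Ok(format!("{}/{}", base, path)),
        // The user never defined this placeholder: refuse to assume a URL.
        None => Err(format!("placeholder `{}` is not defined", id)),
    }
}
```

Because the identifier is matched exactly, questions about usernames or case-sensitivity in the real URL never arise during the lookup.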

Why can't you just use

foo = { version = "1.0.4", path = "../foo" }

Then, if you don't put a foo in the parent directory, cargo sources it from crates.io; otherwise it uses your local copy. I do this frequently for simultaneous development on two related crates. No need to use git submodules or anything tricky.

@droundy, I never would have expected cargo to fall back to crates.io if it was instructed to search for a file path.

There are two reasons why this solution is less than ideal for me:

  1. A cross-repository path dependency is under-specified. It could refer to any commit which carries the specified version number. While that's fine for local development, confusion will arise if I try to share this with others. "Which version of foo should I check out?" The dependency graph doesn't say.

  2. Cargo cannot assist the user in obtaining or checking out the correct version.

In addition, my use-case covers crates which are not stored in any registry, including crates.io. There is no registry version on which to fall back.

Is there a specific reason why a monorepo doesn't work for you? The project you gave as your "famous example", Linux, works with a single source tree, and so do many other projects, like Firefox, that have used and influenced the design of Cargo.

@notriddle, there are a number of factors which make a monorepo attractive and appropriate. I don't believe it's a good fit for me, however.

Monorepos are well-suited for monolithic projects, like Linux, which are tightly integrated and are used "all at once." These products typically do not need to concern themselves with "out-of-project" code re-use. Is the kernel's DMAEngine, for example, really useful without the rest of the kernel infrastructure around it? Probably not.

Conversely, when the goal is to write a re-usable library or eight, it may be difficult to convince downstream consumers to ingest a monorepo. When all I need is the ability to read PNGs, I will likely opt for something like libpng over GDAL. The latter is arguably much more capable… but if I don't need spatial data support or the ability to read TIFFs, then all GDAL does is add API surface, code mass, and build headaches.

It has been my personal experience that drawing hard borders between projects leads to better-planned, better-versioned APIs with smaller surfaces. Projects which stand apart can evolve without too many constraints from their up- or down-stream dependencies. One substantial advantage of individual repos is the ability to specify versions "loosely." Consumers are not forced to use "whatever version was current as of time X" of every package or crate.

Human factors are also important here. A monolithic project probably has similar contribution guidelines and review procedures for every part of the codebase. There is probably a committee, a benevolent dictator, or such, who watches over the entire thing. If there is no such entity—perhaps because the parts of the system are too disparate to have a single stakeholder, or because they are never designed to all fit together—then there is trouble ahead. These projects likely do not belong together.

In my opinion, tightly-coupled projects can reap the benefits of sharing a repository. Loosely-coupled projects, which may lack a single purpose or an overarching maintainer of the "system of systems," probably should not.

Of course, the minute we start gluing projects together from multiple repositories, we start to need some kind of package manager.

One issue, I believe, is that given an arbitrary git repository, there is no mechanism to know which commits correlate to releases on crates.io. If there were, we could write a tool which, given a top-level directory of git repositories, spins up an instance of crates.io and runs cargo with the default registry set to the local instance instead of crates.io.

Perhaps this would work if Cargo.toml or some other file (Releases.toml?) existed to correlate semver and version control, but without that I don't see it providing any benefit beyond just using paths, really.

There is a .cargo_vcs_info.json in recently published crates that contains the commit hash of the checkout at the time of publishing. However, there is never any guarantee about the relationship between the code on crates.io and git (there are exclude directives in Cargo.toml, and of course it's always possible to reset the git repo to any commit and publish something completely different).

BUT "spinning crates.io" is an entirely unnecessary complication you never need. You can specify mixed crates.io+path and crates.io+git dependencies. There are [patch.crates-io] and [patch.'git url'] directives that can be set for the entire workspace. And if all else fails, there's source replacement in .cargo/config that only needs a local path to a git clone of the index.

If you use dependency versioning diligently, then a wrong commit will cause a version mismatch, and from the error message it will be known which version is needed (although I must admit that when developing mostly within git I forget to bump deps' versions).

You can also specify rev or tag for dependencies in Cargo.toml requiring a specific commit that you want. If commit/tag is too specific, you can require a branch (and a version range if branch is too ephemeral).

But in the end if you want multiple repositories, and something on top of them that tracks them and their versions — that's git submodules, unfortunately.

It's possible to independently make a cargo workspace for crates living in different git repositories by using submodules and [patch.'git url'] dep = {path = "./dep"} overrides.

Correct, I do not want to replicate crates.io.

I should add, however, that the source replacement in .cargo/config cannot be used to tell Cargo how to find local unpublished crates.

Thanks for pointing out .cargo_vcs_info.json. It is a bit unfortunate that this exists in the published crate rather than as a list of version/sha1 pairs in the crates.io-index. As it is, given an arbitrary git repository, to figure out the crates derived from it you would need to read Cargo.toml, check crates.io, download every cargo package, attempt to rebuild the published crate, and check the checksum against the crates.io-index. As you say, there is no guarantee that this last step will work (e.g. if the history was rewritten since the crate was released, or for any number of other reasons).

I know about the Cargo.toml patch directives and .cargo/config, the main thing that is bothersome about these is that

  • Cargo.toml is checked into the repository, mixing local build stuff with remote build stuff, so I end up stashing these a lot.
  • There is no option afaict to specify the path to .cargo/config so we need to switch around filesystem state

Finally, I wasn't suggesting replicating all of crates.io, just populating an instance of a registry served from local git repositories.

Correct, but I believe that we were discussing out-of-tree path dependencies instead. If one replaces a git source with a path source, I'm pretty sure that the rev and related information is ignored. The safe way to do things is, as you have mentioned, git submodules. Submodules are presently how I solve this problem. This approach does come with a number of drawbacks, as mentioned above.

@ratmice, are you trying to do something similar to me? If so, can you describe your current workflow?

@cbs228 my workflow is currently basically ad-hoc modification via [patch], migrating at various times back and forth between

  • path="../foo",
  • git = "https://somewhere.org/foo.git",
  • version = "..." once everything finally gets upstreamed.

I suppose I am trying to do something similar to you, in that I think the current workflow I use is fairly tedious.

If I could just always use the last option along with registry="some_local_registry" configured via .cargo/config, and ensure that cargo publish can be done reproducibly from a local git repository to my local registry and checked against crates.io-index checksums, it could eliminate a lot of the fiddling and conflicts that currently arise from modifying Cargo.toml.

@ratmice: Are you trying to use a local registry as an alternative to [patch] with a path or git repository? I've not tried it, but that sounds like it could be a lot of work.

If we had some form of .cargo/config "source replacement" for git repositories, it would be less necessary to switch between path and git patches. You could define the git URL via your own personal environment. It would still be necessary to use [patch] to switch to an unpublished version, but then you could staple it to a particular branch and just use that.

The existing source replacement feature in .cargo/config can only replace the crate index in its entirety, which is not what we want.

@cbs228 yes, a local registry as an alternative to [patch] basically.

But populating the local registry from git repositories, by having a mechanism to:

  • enumerate versions from some working dir.
  • a map from version to sha1.
  • ensuring bit for bit reproducible source release process.

So, in essence, my "local registry" is populated by searching some root for ./*/Cargo.toml, going through the above 3 steps, and populating the registry contents from the repository contents.
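The first part of that, searching a root for ./*/Cargo.toml, could be sketched roughly as follows. This covers only the discovery step, using the standard library; `find_manifests` is a hypothetical name, and mapping versions to commits or rebuilding crates reproducibly is not attempted here.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical helper: return every `<root>/<dir>/Cargo.toml` that exists.
/// Only looks one level deep, mirroring the `./*/Cargo.toml` glob.
fn find_manifests(root: &Path) -> io::Result<Vec<PathBuf>> {
    let mut manifests = Vec::new();
    for entry in fs::read_dir(root)? {
        // Joining onto a plain file yields a path that does not exist,
        // so non-directory entries are skipped by the `is_file` check.
        let manifest = entry?.path().join("Cargo.toml");
        if manifest.is_file() {
            manifests.push(manifest);
        }
    }
    // Sort so the result does not depend on directory iteration order.
    manifests.sort();
    Ok(manifests)
}
```

Each manifest found this way would then be the starting point for enumerating versions and looking up the matching commits.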

Anyhow, I'm mostly just spit-balling, because this mechanism sets a very low bar for taking an arbitrary crate and using it in a distributed fashion without really modifying its [dependencies] section. The changes required are those that make the above 3 steps feasible to do and check.

@ratmice, yes, it certainly sounds like the registry changes are a separate feature. Would easier local replacement of git sources help you at all?

It looks like PR 7199 is relevant. It discusses the possibility of adding [patch] support to the Cargo configuration files, as @CAD97 suggested. The use of paths = […] was recommended as a potential alternative, but this does not work for unpublished crates.

It would definitely be an improvement, in that I could just keep a few files, such as config.local_fs and config.local_git, and swap .cargo/config via symlink as needed, absent any kind of option to specify the path to .cargo/config.

I don't really work with unpublished crates enough (except in the initial stages of creation), for that limitation to be a real nuisance.