LLVM is migrating to GitHub as a monorepo


#1

I attended the LLVM Dev Meeting in San Jose last week, and I participated in the round-table discussion on the GitHub migration. I don’t think there’s any recording, but here are some related llvm-dev threads too:

This is going to affect how we manage our LLVM-related submodules. Currently we’re based on the git-svn mirrors, which the LLVM project has published separately for each subproject. Going forward, the Source of Truth will instead look like this llvm-git-prototype, and it’s not clear if they will attempt to maintain the subproject read-only mirrors at all anymore.

One option for Rust is to just maintain our own subproject repos as needed. We rebase relatively infrequently, so an occasional git subtree split to pull out the paths we care about might be just fine.
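For anyone unfamiliar with that workflow, here is a minimal sketch of what such a split could look like. It builds a tiny stand-in repository in a temporary directory rather than using the real monorepo; the directory layout, file names, and the `llvm-only` branch name are all made up for illustration, and it assumes the `git subtree` contrib command is installed.

```shell
#!/bin/sh
set -eu

# Build a toy "monorepo" with two subprojects (hypothetical layout).
work=$(mktemp -d)
cd "$work"
git init -q mono
cd mono
git config user.name "Example"
git config user.email "example@example.invalid"
mkdir -p llvm/lib clang/lib
echo 'int llvm;' > llvm/lib/a.c
echo 'int clang;' > clang/lib/b.c
git add .
git commit -q -m "Initial monorepo layout"

# Extract only the history that touches llvm/ onto its own branch.
# Paths on the new branch are relative to the llvm/ prefix.
git subtree split --prefix=llvm -b llvm-only >/dev/null 2>&1

# List the files on the split branch; clang/ should not appear.
split_files=$(git ls-tree -r --name-only llvm-only)
echo "$split_files"
```

The resulting branch could then be pushed to a standalone repository and referenced from rust as a submodule, which is the part that would need repeating on each rebase.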

We could also consider just embracing the monorepo as one big submodule for Rust. We currently have src/llvm, src/tools/clang, src/tools/lld, and src/tools/lldb, and this constitutes the bulk of the monorepo already. The big difference is that the ones in src/tools are currently optional, but as a monorepo everyone will have to pay the price of that larger git fetch and local storage.

There’s also src/llvm-emscripten, an older fork, but it’s not clear to me how long this will continue to exist, compared to wasm32-unknown-unknown. And there’s the sub-submodule src/libcompiler_builtins/compiler-rt which overlaps the monorepo too, but I don’t know if it’s worth trying to merge that.


attn @alexcrichton in particular, since you most often deal with our submodules. :slight_smile:


#2

Thanks for the heads up! I think it’s fine for us to eat the cost of the submodule and switch to the monorepo. We can probably implement optimized downloads (like we do for CI) if necessary for checkouts by default. I’d personally like to stick to “official LLVM”, though, to make sure we don’t diverge too much in terms of usage.


#3

Well OK then – I thought this might be at least a little contentious… :relieved:

Their plan was to finalize the prototype today, freezing the commits such that it would be ready for people to migrate in earnest. It will still continue syncing from SVN; it just won’t be regenerated from scratch anymore. It sounds like someone found a few issues, so that’s not ready quite yet, but when it is I can start experimenting with moving us over.


#4

Since we update the LLVM submodule so infrequently, I think using a subtree might be the way to go here.


#5

No, please don’t check the LLVM sources directly into the rust tree. A submodule makes it easier to reconcile with LLVM git history.


#6

At least on my part, I only meant to use subtree to extract the subprojects from the LLVM monorepo, and then still use submodules to include those in the rust repo.

But the surest way to stay close to LLVM history is to just use the monorepo directly as a rust submodule.
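As a rough sketch of that direct approach, the following uses two throwaway local repositories as stand-ins for the monorepo and for rust. The `src/llvm-project` path is a hypothetical name, not a decided location, and the `protocol.file.allow` override is only needed because this demo clones from a local path instead of a real URL.

```shell
#!/bin/sh
set -eu

work=$(mktemp -d)
cd "$work"

# Stand-in for the LLVM monorepo (hypothetical contents).
git init -q monorepo
(
  cd monorepo
  git config user.name "Example"
  git config user.email "example@example.invalid"
  mkdir llvm clang
  echo llvm > llvm/README
  echo clang > clang/README
  git add .
  git commit -q -m "Toy monorepo"
)

# Stand-in for the rust repo, with the whole monorepo as one submodule.
git init -q rust
cd rust
git config user.name "Example"
git config user.email "example@example.invalid"
git -c protocol.file.allow=always submodule --quiet add "$work/monorepo" src/llvm-project
git commit -q -m "Add the monorepo as a single submodule"

# Every subproject now lives under the one submodule path.
submodule_paths=$(git submodule status | awk '{print $2}')
echo "$submodule_paths"
ls src/llvm-project
```

With this layout the submodule commit is a commit in LLVM's own history (plus whatever patches we carry), so there is no separate split step to redo on each update.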


#7

The current checkouts for llvm and llvm-emscripten are about 600MB together. Is there a size estimate for the monorepo?


#8

I just made fresh clones of the llvm-git-prototype and rust with the current llvm submodules.

The disk usage of llvm-git-prototype is 1383M total – 601M in .git and 782M for the checkout.

$ du -BM -d1 | sort -h
1M      ./debuginfo-tests
1M      ./libunwind
1M      ./parallel-libs
4M      ./libclc
7M      ./libcxxabi
7M      ./openmp
11M     ./clang-tools-extra
15M     ./lld
29M     ./llgo
32M     ./polly
42M     ./compiler-rt
43M     ./libcxx
99M     ./lldb
136M    ./clang
363M    ./llvm
601M    ./.git
1383M   .

For rust submodules, remember that the metadata is separate under .git/modules.

$ du -BM -c -s src/{llvm{,-emscripten},tools/{clang,lld,lldb}} | sort -h
14M     src/tools/lld
99M     src/tools/lldb
135M    src/tools/clang
227M    src/llvm-emscripten
355M    src/llvm
828M    total
$ du -BM -c -s .git/modules/src/{llvm{,-emscripten},tools/{clang,lld,lldb}} | sort -h
31M     .git/modules/src/tools/lld
162M    .git/modules/src/tools/lldb
432M    .git/modules/src/tools/clang
881M    .git/modules/src/llvm-emscripten
882M    .git/modules/src/llvm
2386M   total

Again, note that those in src/tools/ are optional for general rust developers, as is llvm-emscripten. You really only need src/llvm, and not even that if you use external LLVM (but then you shouldn’t need the monorepo either).

But note that the monorepo’s .git is already much better packed, despite having more content! So the monorepo’s full 1383M can be directly compared to:

$ du -BM -c -s src/llvm .git/modules/src/llvm
355M    src/llvm
882M    .git/modules/src/llvm
1237M   total

So the monorepo actually reduces network use for the git data, and is only a little bit bigger in total disk usage with the checked out data.
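To spell out the arithmetic behind that comparison, here is a small sketch using only the sizes (in MiB) from the du output above. The variable names are just for illustration.

```shell
#!/bin/sh
# Sizes in MiB, copied from the du -BM output above.
mono_git=601        # llvm-git-prototype .git
mono_checkout=782   # llvm-git-prototype working tree
llvm_git=882        # .git/modules/src/llvm
llvm_checkout=355   # src/llvm
all_checkout=828    # all five LLVM-related submodule checkouts
all_git=2386        # all five .git/modules entries

mono_total=$((mono_git + mono_checkout))    # 1383
llvm_total=$((llvm_git + llvm_checkout))    # 1237
git_saved=$((llvm_git - mono_git))          # 281 less git data to fetch
extra_disk=$((mono_total - llvm_total))     # 146 more on disk overall
all_total=$((all_checkout + all_git))       # 3214 with every submodule

echo "monorepo total:            ${mono_total}M"
echo "src/llvm only total:       ${llvm_total}M"
echo "git data saved:            ${git_saved}M"
echo "extra disk vs. llvm only:  ${extra_disk}M"
echo "all submodules total:      ${all_total}M"
echo "saved vs. all submodules:  $((all_total - mono_total))M"
```

So compared with src/llvm alone, the monorepo fetches 281M less git data and costs 146M more on disk; compared with having every optional submodule checked out, it is 1831M smaller.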

EDIT: I usually have an alias du='du -h', but I think some of the rounding from M to G is confusing here. I’ve updated the numbers using du -BM for a more consistent comparison.


#9

Thanks, I thought it was somewhere else.

I also didn’t know about the other modules under src/tools. Looks like it’s actually somewhat of an optimization to just use the monorepo as a submodule!


#10

Any chance of committing our LLVM changes upstream?