[Pre-RFC] Lang-Level License Management

It's fair to add hooks to the compiler to expose information such as whether libstd is used, to allow embedding extra metadata in executables, and to add the ability to force a specific crate to never be inlined and always be dynamically linked (for LGPL). But these can be created as license-agnostic mechanisms for use by Cargo or other tools.

Licenses are more of a legal concept than a technical one, so I'd rather not burden the compiler with them. It's a big and hairy problem. The crate ecosystem hasn't even started dealing with commercial licensing yet, so it's better to leave it to tools that can still evolve.

There is prior art in Go - there's a tool that does exactly that: https://github.com/mitchellh/golicense

As pointed out above, an RFC to embed similar info in Rust binaries is currently open: https://github.com/rust-lang/rfcs/pull/2801. It needs to incorporate feedback from the discussion thread, but otherwise looks like it would solve your use case. I'd appreciate it if you could check exactly what info the Go compiler embeds and comment on the RFC to make sure Rust includes something equivalent.

Once that's in, people can create license-related tooling on top of it without requiring it to go through the heavyweight process of including it in Rust itself.

OK, I think that there is a misunderstanding here. I think that cargo-about and cargo-deny collect all of the licenses, and then check to see if there is a license that the user's lawyers don't like, correct? But that's all, correct? My issue is that I know that there are licenses that are mutually incompatible, so you can have a situation like the following:

  • Crate A is multi-licensed under licenses a, b, and c.
  • Crate B is multi-licensed under licenses a, b, and d.
  • Crate C is multi-licensed under licenses a, e, and c.
  • Crate D is multi-licensed under licenses f, e, and c.
  • All licenses are individually acceptable to our lawyers, and so are on the accepted list.
  • Licenses a and f are mutually incompatible, as are b and e, and c and d.

Can either of the tools determine that licenses a, b, and c are sufficient to license all 4 crates? If they do, then you've already solved my problem! :+1:

My wish is for a tool that can search through all of the licenses and look for incompatibilities as shown above to see if there is a feasible set of licenses. Once one (or more) feasible sets are discovered, I can then work to find the set that our lawyers most prefer. Does this make sense?

Fortunately, when you fill in concrete Open Source licenses, you don't tend to run into arbitrary satisfiability/coverage problems.

In practice, I've only ever seen a handful of compatibility "sets", and they mostly tend to have simple relationships with each other.

Building a fully general tool that computes compatibility for you isn't necessarily an NP-complete satisfiability problem, but it is a complicated task. I would suggest treating it as separate from the task of listing all the licenses that apply to a compiled work, providing a file containing all the license texts and similar attributions required for compliance, listing any other requirements (like copyleft) needed for license compliance, and notifying the user if those requirements change from what the user expects (e.g. introducing a proprietary license into a copyleft project or vice versa).


Regarding the license compatibility problem, the landscape tends to look roughly like this. (Note that "compatible" below refers to compatibility between separate modules rather than for code in the same file/library. Also note that the below is intended as a mental model, and for practical implementation purposes in a tool, you may want to factor licenses out into "license requirements" and declare compatibility between license requirements rather than between licenses.)

  • Permissive licenses, compatible with everything.
  • Weak copyleft licenses (LGPLv2, LGPLv3, MPL 2.0), also compatible with everything as long as you take some care in how you construct combinations. (That care in construction isn't something a license-based tool can typically help you with, so these don't tend to affect license compatibility, just license compliance in that you need to supply source and satisfy their other requirements. As a notable exception to that, LGPL doesn't mix with proprietary licenses that completely prohibit reverse-engineering, but again, tools can't help very much with that unless you manually "tag" proprietary licenses as "allows reverse engineering for compatibility".)
  • Proprietary licenses. Since proprietary licenses tend to be entirely bespoke, license compatibility tools are unlikely to help you figure out any properties about them, but you can generally assume any bespoke license is incompatible with copyleft unless you specifically identify it otherwise.
  • Obnoxious licenses, incompatible with everything except permissive and weak-copyleft licenses. (e.g. the current OpenSSL license until they finish transitioning to Apache 2.0, or the old 4-clause BSD license with advertising clause)
    • In practice, you can also lump proprietary licenses in the same category, including things like the JSON license. Those have the same property: they don't mix with anything but permissive or weak-copyleft.
    • You can also lump "Copyleft licenses that aren't compatible with GPLv3" in this category, as they also typically only combine with themselves and permissive/weak. Fortunately, Open Source licenses in this category are relatively uncommon these days.
  • Licenses compatible with GPLv3 but not GPLv2, most notably Apache 2.0.
  • Copyleft licenses in the GPLv3 family (GPLv3 and AGPLv3).
  • GPLv2-only.
  • Dual/triple/multiple licenses, as well as "or later" licenses, which have the compatibility of any of the license options they offer. Pick the subset you're willing to comply with, then assume you have the maximum compatibility of any of those. (For instance, if you have code under GPLv2+, you have the compatibility of GPLv2 or GPLv3, so you can combine this with Apache 2.0. But if your application can't comply with GPLv3's requirements, you can only use GPLv2, so you can't combine this with Apache 2.0.)

There are, naturally, unusual exceptions to this. But in practice, the majority of licenses fit this model, and any license that doesn't typically warrants detailed review.

(Source: I review licenses and license compatibility professionally, though I am not a lawyer.)

Also, note again that the above is intended as a rough mental model, and a compliance tool would likely want to implement a model where licenses consist of requirements and the "compatible" relationship occurs between requirements.
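
The multi-license rule from the list above (a work "has the compatibility of any of the license options it offers") can be sketched in a few lines. This is only an illustration of the mental model, working on licenses directly rather than on requirements: the `License` enum, `pair_compatible`, and `compatible` are hypothetical names, and the toy universe covers just the three licenses from the GPLv2+/Apache 2.0 example.

```rust
/// A tiny, deliberately incomplete license universe for illustration.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum License {
    Gpl2Only,
    Gpl3Only,
    Apache2,
}

/// Pairwise compatibility for this toy universe only: a license combines
/// with itself, and Apache 2.0 combines with GPLv3 but not with GPLv2.
fn pair_compatible(a: License, b: License) -> bool {
    use License::*;
    a == b || matches!((a, b), (Apache2, Gpl3Only) | (Gpl3Only, Apache2))
}

/// A multi-licensed work has the compatibility of any option it offers:
/// two works combine if some choice on each side is pairwise compatible.
fn compatible(options_a: &[License], options_b: &[License]) -> bool {
    options_a
        .iter()
        .any(|&a| options_b.iter().any(|&b| pair_compatible(a, b)))
}

fn main() {
    use License::*;
    // "GPLv2 or later", limited here to the versions in the toy universe.
    let gpl2_plus = [Gpl2Only, Gpl3Only];
    let apache = [Apache2];

    println!("GPLv2+ with Apache 2.0: {}", compatible(&gpl2_plus, &apache));
    // If the application cannot comply with GPLv3, only GPLv2 remains.
    println!(
        "GPLv2 only with Apache 2.0: {}",
        compatible(&[Gpl2Only], &apache)
    );
}
```

Dropping an option you are not willing to comply with simply shrinks the slice you pass in, which is how the GPLv2+ case loses its Apache 2.0 compatibility.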

I fully agree that the tasks should be separated, which is why I wanted to know all the licenses that govern a particular chunk of code, not the most preferred one. The rest is up to me (make PRs against various tools I guess...)

As for the rest of your comment, I'm way, way, way too familiar with it all; until I transferred my duties, part of my job was to get my organization to do more Open Source work, which involved figuring out these legal issues. Our organization's legal team is still working out which licenses are acceptable to the organization, and which ones are mutually incompatible with each other (since they are the ones that will have to defend the organization, we always defer to their judgement on what is compatible, and what is incompatible).

Agreed completely. Which license of multi-licensed code you want to use depends both on your requirements and on what other works, under what other licenses, you need to work with.

The cargo deny list command (via cargo-deny) will list crates that are available under more than one license under each license entry. From there you could run your set solver on it.

Yes, cargo-about and cargo-deny solve this. Just say that you only approve of licenses a, b, and c; since all 4 crates are available under at least one of those licenses, it will pass and generate an attribution file saying which of the 3 approved licenses each of the 4 crates is using (based on the priority order of the licenses).

So you should be good! Do try it out and let us know if you run into any issues.

I think you should re-read @josh's comments above. The problem is that I won't necessarily know that licenses a, b, and c are sufficient. I want the tool to figure that out for me.

Here's a question for you; are there other combinations of licenses that would be sufficient in my example? If so, what are they?

Ah I see, that is not a problem we've run into. Not yet at least. We start from the other direction: manually creating and reviewing a list of licenses we approve of and support (and that are compatible with each other). The job of cargo-deny is then to enforce that only libraries with those licenses are used, and the job of cargo-about is to generate the license attribution.

We will likely have to manually pick a specific license for certain libraries to make them fit into a compatible set, but we haven't yet gotten to the point where we need that done automatically. Now that the base tooling exists, though, this could be built on top of it.

The base tooling is the important part! In the worst case, you can brute force the combinations to see if there is a satisfying solution. (And yes, I'm aware that this can take a long time, which is why I said 'in the worst case')
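
As a rough sketch of that brute-force approach, here is the four-crate example from earlier in the thread encoded directly. Everything here is hypothetical scaffolding rather than anything cargo-deny or cargo-about provides: licenses are single `char`s, and a set counts as "feasible" under the assumption that it must contain no mutually incompatible pair and every crate must offer at least one license in it.

```rust
/// True if `set` contains no mutually incompatible pair of licenses.
fn is_consistent(set: &[char], incompatible: &[(char, char)]) -> bool {
    incompatible
        .iter()
        .all(|&(x, y)| !(set.contains(&x) && set.contains(&y)))
}

/// True if every crate offers at least one license that is in `set`.
fn covers_all(set: &[char], crates: &[&[char]]) -> bool {
    crates
        .iter()
        .all(|options| options.iter().any(|l| set.contains(l)))
}

/// Brute force: try every non-empty subset of `licenses` (2^n - 1 of them).
fn feasible_sets(
    licenses: &[char],
    crates: &[&[char]],
    incompatible: &[(char, char)],
) -> Vec<Vec<char>> {
    let mut found = Vec::new();
    for mask in 1u32..(1u32 << licenses.len()) {
        let set: Vec<char> = licenses
            .iter()
            .enumerate()
            .filter(|&(i, _)| mask & (1u32 << i) != 0)
            .map(|(_, &l)| l)
            .collect();
        if is_consistent(&set, incompatible) && covers_all(&set, crates) {
            found.push(set);
        }
    }
    found
}

/// Keep only the feasible sets that have no smaller feasible subset.
fn minimal_sets(sets: &[Vec<char>]) -> Vec<Vec<char>> {
    sets.iter()
        .filter(|s| {
            !sets
                .iter()
                .any(|t| t.len() < s.len() && t.iter().all(|l| s.contains(l)))
        })
        .cloned()
        .collect()
}

/// Crates A-D and the incompatible pairs from the example in this thread.
fn thread_example() -> Vec<Vec<char>> {
    let licenses = ['a', 'b', 'c', 'd', 'e', 'f'];
    let crates: [&[char]; 4] = [
        &['a', 'b', 'c'],
        &['a', 'b', 'd'],
        &['a', 'e', 'c'],
        &['f', 'e', 'c'],
    ];
    let incompatible = [('a', 'f'), ('b', 'e'), ('c', 'd')];
    feasible_sets(&licenses, &crates, &incompatible)
}

fn main() {
    let all = thread_example();
    println!("feasible sets: {}", all.len());
    for set in minimal_sets(&all) {
        println!("minimal feasible set: {:?}", set);
    }
}
```

Under those assumptions, {a, b, c} is feasible, and the minimal feasible sets for the example are {a, c}, {b, c}, and {a, e}. The search is exponential in the number of distinct licenses, not the number of crates, which is why it stays tractable when only a handful of licenses appear in a dependency tree.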

Thank you all for your thoughtful replies to this. There are various assorted things I'd like to say at this point, so I'll just dump them all out here.

The re-licensing situation is made more complex by the fact that libstd has a good number of dependencies from the crates.io ecosystem. I haven't checked all of them, but I'd assume they're all made available under MIT/Apache-2.0 and would also need to be re-licensed if we wanted to get rid of the standard library's attribution requirement.

The RFC as written just uses the first license in the list as a default, since there has to be some mechanism for deciding what to use when the user doesn't specify anything. However, it also provides a mechanism for selecting preferred licenses, in the licenses.prefer section of Cargo.toml (see the License Specification->In Cargo section of the RFC). There are other ways we could potentially handle this, like warning if there's a license in the tree that hasn't been explicitly allowed, but I'm not sure if either solution is inherently better than the other.

Understanding the semantics of licenses is very much out-of-scope for this RFC, though I suppose that isn't explicitly stated anywhere. To clarify: this RFC only manages compiling and bundling licenses. License comprehension and incompatibility detection are left for other tools to handle, although we may want to consider having some sort of licenses.deny key in Cargo.toml at some point.

This is super cool! I wasn't aware that this tool existed when I first wrote the RFC; this seems like an excellent place to start when building out official support. I still think we should integrate a similar sort of tool into the official toolchain directly so we can more easily manage libstd's licenses, but this seems like good work so far. I don't particularly like using HTML as the default license format, but that's a bikesheddy thing that doesn't take away from the overall technical accomplishment.


Anyhow, there are various addendums I'd like to make to the current RFC's text and my comments elsewhere in the thread.

I still think it's a good idea to have some sort of License struct embedded into the language - doing that would make HTML, other custom license formatting, or license conflict resolution far easier for external users - but the RFC's implementation is pretty intrusive to the compiler and similar goals can be accomplished with minimal or potentially no compiler integration. The include!() solution I posted in the thread also wouldn't work, since the licenses need to be added in at the end of the build process. Instead of passing licenses to the compiler and using a built-in macro to output the license list, we could have a LicenseList::licenses() function that accesses a licenses array that gets linked in at the end of the build process. This is already possible today - @dtolnay has pulled off some linker magic that lets you assemble arrays at link time, as seen in linkme (:heart:) - but if we use that solution we may want to have cleaner support for it built into the compiler directly.

build.rs files that access the license list should use the license list from the parent binary, not from the build script.

The RFC's syntax for external license specification in Cargo.toml is bad and should be replaced with something similar to what @repi proposed.

OK, but do you have a plan in place to collect all licenses a given artifact is licensed under? Like I said earlier, there are times when you have to calculate over several sets of licenses, rather than have one preferred license in the beginning.

We could have some sort of Licenses::everything() method. Cargo should also have some subcommand for retrieving license data outside of the source code, and we could let users choose to print out all licenses available for all crates if they so desire. If you want to perform arbitrary post-processing on the license list before it gets linked, we could conceivably add a build.rs key that lets you replace the license list that gets embedded in the codebase and output by the cargo subcommand.

Good! Other tooling can be built on top of this. I like the License type, it will make things easier.

My only suggestion there is to add another key for copyright holders. That way we don't have to parse "SPDX License Identifier: Optional Copyright Holder" to pull out the license and copyright holder separately; they are already separated by the tooling. Maybe make it copyright_holder: &[&str] or something similar in case there are multiple holders?

I like that. I'd prefer to combine the license name and holder into a single type that gets included in the License struct to make it clear that the license name combined with the copyright holder(s) is the logical key in the license map, but separating the copyright holders from the license name is a good idea.

pub enum LicenseId {
    /// License is used by several different copyright holders.
    Common(&'static str),
    /// License text is customized to a particular set of copyright holders.
    Owned(&'static str, &'static [CopyrightHolder])
}

pub struct CopyrightHolder {
    pub year: &'static [i32],
    pub name: &'static str
}

EDIT: I've adjusted year to be an array of years rather than a particular year because there are licenses that do that
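
To make the shape of that proposal concrete, here is a small usage sketch. The two types are repeated so the block compiles on its own; `describe`, `EXAMPLE_HOLDERS`, the output format, and the holder name are made up for illustration and are not part of the RFC.

```rust
pub enum LicenseId {
    /// License is used by several different copyright holders.
    Common(&'static str),
    /// License text is customized to a particular set of copyright holders.
    Owned(&'static str, &'static [CopyrightHolder]),
}

pub struct CopyrightHolder {
    pub year: &'static [i32],
    pub name: &'static str,
}

/// Hypothetical data: one holder with copyright spanning two years.
pub static EXAMPLE_HOLDERS: [CopyrightHolder; 1] = [CopyrightHolder {
    year: &[2018, 2019],
    name: "Example Author",
}];

/// Hypothetical helper: render a license id as a single attribution line.
pub fn describe(id: &LicenseId) -> String {
    match id {
        LicenseId::Common(name) => name.to_string(),
        LicenseId::Owned(name, holders) => {
            let parts: Vec<String> = holders
                .iter()
                .map(|h| {
                    let years: Vec<String> =
                        h.year.iter().map(|y| y.to_string()).collect();
                    format!("{} {}", years.join(", "), h.name)
                })
                .collect();
            format!("{}: Copyright (c) {}", name, parts.join("; "))
        }
    }
}

fn main() {
    let ids = [
        LicenseId::Common("Apache-2.0"),
        LicenseId::Owned("MIT", &EXAMPLE_HOLDERS),
    ];
    for id in &ids {
        println!("{}", describe(id));
    }
}
```

Because the holders travel with the license name inside `Owned`, a tool can group or deduplicate entries on the whole `LicenseId` without parsing a combined string.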

Thanks! We've only been working on this for the last few weeks and released it today, so it's just a happy coincidence that it came out at the same time.

We've been meaning to investigate further how to include libstd and its dependencies, since those also need to be attributed. So yes, this is something we are interested in as well, to get as complete and accurate a picture as possible, automatically and out of the box, for ourselves and anyone else using cargo-about and cargo-deny.

Forgot to mention that cargo-about outputs using handlebars, so you can output in whatever format you want with a template you provide. So it can be text, JSON, HTML, anything really.

How about:

pub enum LicenseId {
    /// License is used by several different copyright holders.
    SpdxIdentifier(&'static str),
    /// License text is customized to a particular set of copyright holders.
    Proprietary(&'static str, &'static [CopyrightHolder])
}

I don't like what that naming implies. Having one key be an SpdxIdentifier implies that it is always an SPDX identifier and the other is not (counter-example: MIT), and that licenses with explicit copyright holders are always proprietary (also not MIT).

I understand, but I can't think of a better way of expressing that a given string is just the identifier and not the complete text. In addition, as far as I know the SPDX identifiers are the closest thing to a universally accepted set of standard license identifiers.

Alternatively, you could embed the license text into the CopyrightHolder struct so that everyone knows what the complete license is. Or, you could have an array of license strings, and each CopyrightHolder struct has an array of indices into the array of license strings.
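
The second alternative (a shared array of license strings, with each holder carrying indices into it) might look roughly like this. The `licenses` field, `LICENSE_TEXTS`, `HOLDERS`, `texts_for`, and the placeholder strings are all assumptions made for this sketch.

```rust
pub struct CopyrightHolder {
    pub year: &'static [i32],
    pub name: &'static str,
    /// Indices into the shared table of license texts.
    pub licenses: &'static [usize],
}

/// Placeholder strings standing in for complete license texts.
pub static LICENSE_TEXTS: [&str; 2] =
    ["<full MIT text>", "<full Apache-2.0 text>"];

/// Hypothetical holders; the second one only uses the second text.
pub static HOLDERS: [CopyrightHolder; 2] = [
    CopyrightHolder {
        year: &[2019],
        name: "Example Author A",
        licenses: &[0, 1],
    },
    CopyrightHolder {
        year: &[2018, 2019],
        name: "Example Author B",
        licenses: &[1],
    },
];

/// Resolve a holder's indices to license texts, skipping any index that
/// falls outside the table instead of panicking.
pub fn texts_for(
    holder: &CopyrightHolder,
    table: &[&'static str],
) -> Vec<&'static str> {
    holder
        .licenses
        .iter()
        .filter_map(|&i| table.get(i).copied())
        .collect()
}

fn main() {
    for holder in &HOLDERS {
        let texts = texts_for(holder, &LICENSE_TEXTS);
        println!("{} ({:?}): {:?}", holder.name, holder.year, texts);
    }
}
```

The index table stores each license text once no matter how many holders share it, at the cost of an extra lookup and the possibility of a stale index, which is why the lookup here is written to tolerate out-of-range values.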