SemVer Compatibility Tool - GSoC Proposal Draft and Discussion

Since I plan to apply for this GSoC project, I hereby present my ideas and a draft of my project proposal. Any discussion, comments, and feedback are welcome. The following section outlines my proposal’s concept.

Thank you for your time and input.

Project Proposal

I propose to implement a proof-of-concept tool to check SemVer compatibility of library crates in the Rust ecosystem, as suggested on the list of official projects. The expectation is for it to be integrated with cargo and surrounding infrastructure to maximize ease of use and adoption amongst library authors. To reach this goal, a solution to three key problems has to be derived:

  • The infrastructure necessary to obtain and represent the public interface of a crate at compile time needs to be carefully designed and integrated into cargo and into a compiler plugin run by rustc.
  • Consequently, the interface description needs to be serialized and stored in a well-specified format in a standard location, to allow further processing by the tool and possibly other components in the future. Alternatively, an approach where no information is ever permanently dumped to disk can be pursued, but it in turn needs machinery to obtain interface descriptions for past versions of the currently compiled crate, as well as the current version.
  • Finally, the core of the tool, an algorithm to verify SemVer-compatibility of two interface descriptions and the corresponding version numbers, needs to be designed and implemented, with special consideration applied to ensure the robustness and flexibility of the resulting code.

This structuring of the task is proposed because it allows for maximal separation of concerns in the final product, which in turn eases further refinements and additions to the design of individual components. This is especially useful considering that the features need to be implemented across multiple codebases, as some components need direct access to the compiler’s data structures, while others interact with cargo to obtain version information and possibly other metadata.

Based on this rough design, a few discussion points remain:

  • We need to determine the most beneficial way of obtaining the interface representation of the already published crate version. The simplest approach is to store it in a serialized form alongside (or even as part of) Cargo.lock, that is, in a file on disk. This has a few benefits, most prominently the ability to reuse the data stored that way in other tools and for other purposes. However, the decision to add more metadata to a typical crate’s repository should not be taken lightly. Alternatives include integration with a VCS, such as git, or pulling the reference version from crates.io or another source. Both options trade in some of the flexibility of the first approach for a less intrusive change to the already established functionality of cargo.
  • Concrete implementation details for the components described above need to be decided upon. A thorough look at projects with similar requirements and structure, like clippy and its integration with cargo, could serve as guidance here.

Obviously, it is also necessary to outline the basic working of the algorithm enforcing SemVer compatibility. Since this requires operating on crate interface descriptions, I assume a naive implementation of such a data structure: a mapping from identifiers to type signatures and definitions, together with the set of available types and identifiers.

A rough sketch of the algorithm could look like the following:

  1. Identify the removals and additions of signatures and/or types between versions. The addition of any element forces at least a minor version bump; the removal of any element forces a major version bump.
  2. For each identifier common between versions, examine the changes to its signature, if any. Changes that generalize the definition and can’t introduce any ambiguity in user code (which would result in a compile error) force at least a minor version bump. All other changes are considered breaking and require a major version bump. (A small code sketch of these two steps follows below.)
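
A minimal sketch of this in code, assuming the naive interface representation described above; all of the names here (Bump, Signature, Interface, required_bump) are hypothetical placeholders for illustration, not existing APIs:

// Hypothetical sketch: none of these types exist yet; they only illustrate
// the naive interface representation described above.
use std::collections::BTreeMap;

/// The level of version bump a set of changes requires (Patch < Minor < Major).
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Bump {
    Patch,
    Minor,
    Major,
}

/// Placeholder for a (simplified) type signature of an exported item.
type Signature = String;

/// Naive interface description: a map from exported identifier to signature.
type Interface = BTreeMap<String, Signature>;

/// Stub for the interesting part: does `new` generalize `old` without being
/// able to introduce ambiguity in user code?
fn is_compatible_generalization(_old: &Signature, _new: &Signature) -> bool {
    false
}

fn required_bump(old: &Interface, new: &Interface) -> Bump {
    let mut bump = Bump::Patch;

    for (name, old_sig) in old {
        match new.get(name) {
            // Removal of any element forces a major version bump.
            None => return Bump::Major,
            Some(new_sig) if new_sig != old_sig => {
                if is_compatible_generalization(old_sig, new_sig) {
                    bump = bump.max(Bump::Minor);
                } else {
                    return Bump::Major;
                }
            }
            _ => {}
        }
    }

    // Addition of any element forces at least a minor version bump.
    if new.keys().any(|name| !old.contains_key(name)) {
        bump = bump.max(Bump::Minor);
    }

    bump
}

Almost all of the real difficulty then hides in is_compatible_generalization and in how the Interface values are obtained from the compiler.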

The actual implementation would have to consider many more details than presented here, as generics, traits, and different trait implementations all interact subtly. It is important to note, however, that absence of specialization in a change does not imply backwards-compatibility. This hints at a core problem the proposed project will face: the generated version number is only a suggestion, since it is generated in a conservative fashion whenever possible, but won’t include any information on function implementations in the first iterations of the project.


Regarding the serialization format, one idea would be to take the entire module/item structure and just print it out in a Rust-like syntax, except that function bodies and private types, fields, etc. are omitted.
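
To illustrate, a made-up example of what such output could look like (not the output of any existing tool):

// crate source
pub struct Point {
    pub x: i32,
    scale: f64,
}

impl Point {
    pub fn new(x: i32) -> Point {
        Point { x, scale: 1.0 }
    }

    fn normalize(&mut self) {
        self.scale = 1.0;
    }
}

// printed interface: bodies, private fields and private items omitted
pub struct Point {
    pub x: i32,
    // 1 private field
}

impl Point {
    pub fn new(x: i32) -> Point;
}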

Regarding the compatibility checking, an addition can be a breaking change, for example, adding a function to a public trait without a default implementation. I think a more thorough analysis of what kinds of changes constitute a breaking change would be necessary.
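
A minimal made-up illustration of that trait case:

// Hypothetical before/after of a public trait.
mod v1_0 {
    pub trait Greet {
        fn hello(&self);
    }
}

mod v1_1 {
    pub trait Greet {
        fn hello(&self);
        // Added without a default implementation: every downstream
        // `impl Greet for ...` written against 1.0 stops compiling.
        fn goodbye(&self);
    }
}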

You can’t just use the printed output; you always have to canonicalize every use of a definition to one of the paths it’s exported through.

And the canonicalization has to be done using an old canonical output and the new raw output, otherwise they’re not comparable, i.e. adding an export should be backwards-compatible.

Awesome proposal @twk, I’m super excited to see this problem get taken on!

In terms of getting source code to “diff” (e.g. check semver compat with) I’d recommend a strategy that looks like:

  • Assume one copy is on disk in the cwd. That is, I think we can assume that the “diff to” is the current Cargo project.
  • For the “diff from” copy, I’d look in the crates.io index for the greatest version less than the local copy. This can then be downloaded and compiled directly from crates.io (using Cargo as a library); a sketch of the version selection follows below.
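
Assuming the list of published version strings has already been fetched from the index, the selection step could look like this (made-up helper names, using the semver crate for parsing and ordering):

use semver::Version;

/// Hypothetical helper: pick the greatest published version that is strictly
/// less than the local one. `published` would come from the crates.io index.
fn previous_version(local: &Version, published: &[&str]) -> Option<Version> {
    published
        .iter()
        .filter_map(|s| Version::parse(s).ok())
        .filter(|v| v < local)
        .max()
}

fn main() {
    let local = Version::parse("1.4.0").unwrap();
    let published = ["0.9.7", "1.2.0", "1.3.1", "1.4.0"];
    // picks 1.3.1 here
    println!("{:?}", previous_version(&local, &published));
}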

That’ll sidestep the need to stabilize a new on-filesystem format, which should give this tool greater flexibility in the future. I think it can also reuse lots of existing pieces of Cargo to avoid reinventing downloading/parsing/etc.

Once those two crates are available, you’re then left with computing a “semantic difference” between the two (and classifying whether that’s a breaking change or a minor upgrade). You could use a compiler plugin to work through this but it may be somewhat difficult. I’d definitely recommend looking at clippy in terms of architecture, but you may also want to explore rustfmt which uses syntex_syntax on crates.io (a vendored copy of libsyntax, the rustc parser itself). I’m not sure how much analysis you’ll end up needing, but if you largely just need an AST that’s what I’d recommend.

Super excited to see this come to fruition!

You can’t use just the AST, it will have to be like clippy or a custom driver.

I suppose having two rustc threads at the same time in memory might make this easier, as you can transfer data between them as needed to compute the “canonical export map” and then, based on that, compare declarations for equivalence.

Thank you all for your input so far.

I have to agree, using the last version published on crates.io seems to be the most ergonomic solution. It also seems like some more fine-grained analysis of the possible kinds of changes is still necessary.

However, I don’t think I fully understand @eddyb’s point. As far as I am aware, the interface of a crate can be roughly described as a subset of the AST of the complete source code of said crate. Thus, the concept of “canonical exports” seems to be not-quite-clearly defined, especially since we only care about the perceived compatibility (that is, whether changes to code using the library being checked will be necessary). If the path a function or type is exported through changes between versions, this is equivalent to an addition of a symbol in one place and the removal of a symbol with the same name in a different one. Either way, this amounts to a breaking change (modulo a few very special cases).

So where exactly does this more simplistic view of the problem domain break down?

As far as the usage of rustc internals vs. syntex_syntax goes, it might also be beneficial to find a way of providing support for different versions of the toolchain, while stable should probably be a priority (?). The clippy project was searching for solutions to this problem as well, last time I checked.

FYI, a few years back, I was thinking about doing this by using rustdoc’s JSON output as an interface definition. Sadly, this JSON output has never worked, which is why rustdoc no longer has this option at all.

I do think that https://github.com/rust-lang/rfcs/pull/1868 is highly related to this. Fundamentally, checking compatibility across “time and space” (versions and cfgs) is very similar, and both are useful.

You can do a “library mashup” by putting #[cfg(old_ver)] or #[cfg(not(old_ver))] on every definition. Then the same rules can check whether a specific downstream usage is “old_ver-portable”. The final step would only need to extend rustc to compare interfaces, not check usage, but this would share similar code.
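
Concretely, on a single definition this could look like the following (the cfg name and items are made up for illustration):

// Hypothetical example: old and new versions of the same item compiled
// side by side, selected by a cfg flag.
pub struct Data;
pub type Error = String;

#[cfg(old_ver)]
pub fn foo() -> Result<Data, String> {
    Ok(Data)
}

#[cfg(not(old_ver))]
pub fn foo() -> Result<Data, Error> {
    Ok(Data)
}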

This may seem roundabout, but we’d be able to use this for e.g. making sure that per-platform “imp” modules present the same interface.

The problem is that the following should all be considered identical:

// variant 1
pub struct Data {...}
pub type Error = String;
pub fn foo() -> Result<Data, String> {...}

// variant 2
pub struct Data {...}
pub type Error = String;
pub fn foo() -> Result<Data, Error> {...}

// variant 3: the same items in a private module, re-exported
mod detail {
    pub struct Data {...}
    pub type Error = String;
    pub fn foo() -> Result<Data, Error> {...}
}
pub use detail::*;

Particularly, that last one requires that the path you compare, say, in foo's return type, is not detail::Data but rather just Data because that’s what is exported, i.e. the canonicalization for all of them is:

pub struct Data {...}
pub type Error = std::string::String;
pub fn foo() -> std::result::Result<Data, std::string::String>;

Now this is a deterministic algorithm that requires name resolution (so an AST isn’t enough), but there’s another hitch: what if you make detail public? Or what if it was public from the start? Or what if you add a third way to access those 3 items?

The only solution, IMO, is to use an arbitrary but deterministic choice (e.g. shortest path, then lexicographic order, so Data rather than foo::Data, aaaaa::Data, or detail::Data) for the older version; then, if any exported item in the new version has the same export path as an item from the old version, it uses the old canonical path, otherwise the choice is again arbitrary.
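
In code, that tie-breaking could be as simple as the following sketch (made-up types; paths are assumed to already be fully resolved and represented as segment lists):

// Hypothetical sketch of the canonical-path choice described above.
/// An export path, e.g. ["detail", "Data"].
type Path = Vec<String>;

/// Keep the old canonical path if the item is still exported there,
/// otherwise fall back to the arbitrary but deterministic choice
/// (shortest path, then lexicographic).
fn canonical_path(mut export_paths: Vec<Path>, old_canonical: Option<&Path>) -> Option<Path> {
    if let Some(old) = old_canonical {
        if export_paths.iter().any(|p| p == old) {
            return Some(old.clone());
        }
    }
    export_paths.sort_by(|a, b| a.len().cmp(&b.len()).then_with(|| a.cmp(b)));
    export_paths.into_iter().next()
}

fn main() {
    let paths = vec![
        vec!["detail".to_string(), "Data".to_string()],
        vec!["Data".to_string()],
    ];
    // With no old canonical path, the shortest one ("Data") wins.
    println!("{:?}", canonical_path(paths, None));
}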

You can get away with doing only some equivalences (i.e. not substituting type aliases, although this feels like it’d have a high false positive rate), but you still need name resolution to have completed to handle any shape of exports.


Thanks, this clarifies a lot. It seems like this indeed forces the implementation of such functionality into a compiler plugin. However, there are some edge cases left if we’re not careful. For instance, if detail was public in the older version and additionally reexported, but becomes private in a new version, this is obviously a breaking change for all users importing it directly, and the canonicalization needs to reflect that fact. Alternatively, keeping track of the different paths to the same definitions would neatly solve this issue. So the system ends up resolving all exported names and storing them either way to match them with a different version. I believe this is a sane approach, but I see your concerns with my initially flawed assumptions about the problem. I think I should revisit the name resolution problem in greater depth in the proposal.

I just remembered that the canonical paths only matter if you’re trying to output something to diff; if you’re doing this in memory, you can internally allocate indices for all exported items in the old version and then map the new exports to the old indices. And yes, each item would be described by its API and export paths.
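
Roughly like this (made-up sketch, with export paths flattened to plain strings):

// Hypothetical sketch of the in-memory matching: every old exported item gets
// an index, and each of its export paths points at that index. New export
// paths are then looked up here to find which old item (if any) they
// correspond to, and the two item descriptions are compared directly.
use std::collections::HashMap;

fn index_exports(old_items: &[Vec<String>]) -> HashMap<String, usize> {
    let mut by_path = HashMap::new();
    for (idx, export_paths) in old_items.iter().enumerate() {
        for path in export_paths {
            by_path.insert(path.clone(), idx);
        }
    }
    by_path
}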

My real problem here is that I’ve discussed all this with @vibhavp on IRC and the details from those discussions stuck in my head as “the right solution” - they were going to output something from a lint or some such and then diff the outputs - and that’s where canonical paths (as opposed to integers) were needed.

I think that in order to know whether a change generalizes, you also need to consider the whole environment of libraries.

Consider the following two functions:

use std::net::Ipv6Addr;

fn to_addr(x: [u8; 16]) -> Ipv6Addr {
    // construct the address directly from the raw bytes
    Ipv6Addr::from(x)
}

fn to_addr2<T: Into<Ipv6Addr>>(x: T) -> Ipv6Addr {
    x.into()
}

While going from the first signature to the second is a generalization on Rust 1.9 and up, it would be a breaking change on Rust 1.8 and lower, since the impl From<[u8; 16]> for Ipv6Addr (and with it Into<Ipv6Addr> for [u8; 16]) only exists since Rust 1.9. I think the Rust version, and the versions of all crates involved, are necessary information to compute SemVer compatibility.


Agreed. I should keep the flexibility of the ruleset in mind so that the implementation can handle cases like this one, especially if similar rules get added over time.

@jonasbb brings up a good point: you will need to check compatibility between concrete types and generics, and between different generics. This sounds like it might be pretty tough.

Yes, I think it’s a good idea to keep this similar use case in mind. However, I’d rather approach it by keeping the component(s) determining the two versions/sources to compare separate from the actual comparison code. That way the component can be reused without burdening the design of superficially unrelated functionality too much with details about the similar usage in a different place.

That seems fine. Certainly Cargo should be responsible for pulling libraries and dealing with the versions themselves, whereas rustc should just be given source code of unknown provenance and asked whether one presents a subinterface of the other.

I've just been talking about the rustc side of things, which is the interesting part. Certainly there one should leverage related problems for code reuse as much as possible.

Hi @twk.

I’m really impressed with the legwork you’ve done here. This is a strong way to approach these GSoC proposals - it gives a lot of confidence.

It looks to me like you’ve gotten a lot of good advice.

Generally, I think you should focus on the algorithm. That is the hard problem here and I suspect one could spend much more than a single season getting it correct.

That’s not to say it shouldn’t have a usable interface. But it would not be the best use of your time getting bogged down in specifying new compiler interfaces.

Clippy is a good project to imitate here. It is a custom rustc driver that sets up the compiler pipeline in memory and modifies it to whatever ends it needs. You can do a lot this way, out of tree, and it’s an appropriate approach for speculative Rust compiler projects.

You might think a bit, and write a bit, about how you will validate and test your system.

You might also look to prior art for ideas. I think Elm does this.


Thanks for the feedback and encouragement (addressed at everyone, actually).

I agree that the focus should be on the algorithmic core and the verification of its correctness. Maybe I’ll be able to write quickcheck-style tests based on a specification of how exactly SemVer is to be interpreted in the context of Rust, i.e. what constitutes a breaking change and what doesn’t, whether we want a guaranteed minimal version number to be our output or only a best guess that could be too liberal, etc.
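
As a first, made-up example of such a property (assuming the hypothetical Interface/Bump/required_bump sketch from further up the thread and the quickcheck crate): comparing any interface against itself should never suggest a version bump.

#[cfg(test)]
mod tests {
    // Assumes the hypothetical required_bump / Bump sketch from earlier in the thread.
    use super::{required_bump, Bump};
    use quickcheck::quickcheck;
    use std::collections::BTreeMap;

    #[test]
    fn identical_interfaces_need_no_bump() {
        fn prop(items: Vec<(String, String)>) -> bool {
            let interface: BTreeMap<String, String> = items.into_iter().collect();
            required_bump(&interface, &interface) == Bump::Patch
        }
        quickcheck(prop as fn(Vec<(String, String)>) -> bool);
    }
}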

I’ve looked over elm-package, and it seems there is only a semantic lesson to take from there, since Elm seems to be significantly simpler with respect to the analysis we want to perform, especially since polymorphism is heavily restricted and no typeclasses/traits exist. So what elm-package does is simply check whether declarations have been added (if so, it’s at least a minor version bump) and whether declarations have been changed or removed (if so, it’s at least a major version bump); everything else is a patch (based on the assumption that some changes have been performed). This is probably a good first direction in terms of how exactly we should apply the results of our analysis to give the user some output.


Can such API extracts be used not only for compatibility checks, but also for indexing?

I expect those extracts to be much smaller than actual crates, so it may be feasible to download API-extracts of entire crates.io for nice local searching. Like Debian’s apt-file and Contents.gz.

I don’t see why this shouldn’t be possible. Since I plan to separate the modules generating interface descriptions from the actual analysis, this sounds like a good example of a functionality that can reuse the former.