Project idea: datalog output from rustc


#1

I want to have a tool that would enable us to answer all kinds of queries about the structure of Rust code that exists in the wild. This should cover everything from synctactic queries like “How often do people write let x = if { ... } else { match foo { ... } }?” to semantic queries like “How often do people call unsafe functions in another module?” I have some ideas about how to build such a tool, but (I suspect) not enough time to pursue them. I’m looking for people who might be interested in working on it!

The basic idea is to build on Datalog. Datalog, if you’re not familiar with it, is a very simple scheme for relating facts and then performing analyses on them. It has a bunch of high-performance implementations, notably souffle, which is also available on GitHub. (Sadly, it generates C++ code, but maybe we’ll fix that another day.)

Let me work through a simple example of how I see this working. Perhaps we would like to answer the question: How often do people write tests in a separate file (foo/test.rs) versus an inline module (mod test { ... })?

We would (to start) have some hacked up version of rustc that serializes the HIR in Datalog form. This can include as much information as we would like. To start, we can stick to the syntactic structures. So perhaps we would encode the module tree via a series of facts like so:

// links a module with the id `id` to its parent `parent_id`
ModuleParent(id, parent_id).
ModuleName(id, name).

// specifies the file where a given `id` is located
File(id, filename).

So for a module structure like:

// foo/mod.rs:
mod test;

// foo/test.rs:
#[test] 
fn test() { }

we might generate the following facts:

// module with id 0 has name "" and is in foo/mod.rs
ModuleName(0, "").
File(0, "foo/mod.rs").

// module with id 1 is in foo/test.rs,
// and its parent is module with id 0.
ModuleName(1, "test").
ModuleParent(1, 0).
File(1, "foo/test.rs").

Then we can write a query to find all the modules named test which are in a different file from their parent module:

// module T is a test module in a separate file if...
TestModuleInSeparateFile(T) :-
    // ...the name of module T is test, and...
    ModuleName(T, "test"),
    // ...it is in the file T_File... 
    File(T, T_File),
    // ...it has a parent module P, and...
    ModuleParent(T, P),
    // ...the parent module P is in the file P_File... 
    File(P, P_File),
    // ...and file of the parent is not the same as the file of the child.
    T_File != P_File.

Anyway, I’m waving my hands here, and probably getting datalog syntax all wrong, but you get the idea!

Obviously my encoding here is highly specific for my particular query. But eventually we can start to encode all kinds of information this way. For example, we could encode the types of every expression, and what definition each path resolved to. Then we can use this to answer all kinds of interesting queries. For example, some things I would like to use this for right now (or in the recent past):

So, you interested? (Feel free to contact me over privmsg on IRC if you prefer, as well.)


Next steps for the unsafe code guidelines
#2

As I mentioned to Niko, there’s a really cool (proprietary) database called datomic which uses datalog as a query language. One aspect of datomic is that it’s designed to be append only, and have an explicit understanding of how data changes over time.

It would be cool, eventually, to have all of the code on crates.io represented in a datomic or similar database, with an explicit understanding that ‘foobar v1.1 is a new version of foobar v1.0’, and the time each crate was uploaded, so that we could examine not only the current state of Rust code, but also how that state changed over time.


#3

I think the rust ecosystem is such that a minimally viable, but limited version, should be pretty easy to cobble together in pure rust.


#4

See also: codeq


#5

A couple of things that might be of interest:

  1. Mozilla’s Mentat is “datomish” and in Rust.
  2. Frank McSherry’s differential dataflow for datalog

#6

I’d generally like ability to search all Rust code for some pattern. I needed this to show that #[no_mangle] is redundant (it’s a poor replacement for pub).

I don’t know Datalog though. Would that be easier than writing a program that walks AST?


#7

A bit offtopic, but how is it redundant?

// This is perfectly valid right now:
pub mod a {
    pub extern fn foo() {}
}
pub mod b {
    pub extern fn foo() {}
}

#8

@eddyb I don’t want to hijack this discussion, so I’ve created a new topic: #[no_mangle] vs modules and pub


#9

I would love to work on this project. It is actually an idea I have been toying around with for almost a year. I would love to get some input on what your needs would be as I get started, though.


#10

If anyone is interested, the differential dataflow stuff could use some exercise. :slight_smile:

The main feature differential dataflow has is the ability to efficiently update given changes in the inputs. This makes it best for maintaining e.g. Datalog queries over live data, but it should also work for static datasets. The main problem I think it has is that the Datalog queries get compiled down to Rust code, and that takes minutes. Not very interactive, unfortunately.

On the positive side, differential dataflow is more general than Datalog, which only supports monotonic fact sets (without stratification and other band-aids). DD is just “join, group, map, filter” plus an “iterate” method, so if you want to iteratively do “join, distinct” you get Datalog, but if you want to do more sophisticated things (“join, count, filter by count, repeat”) you’ll need something better.

I’m available to give a hand if there is interest.


#11

I’m glad you wrote in, since actually I’ve been meaning to ping you about this. Ever since I saw your talk I’ve been hoping to put some of your work to use. =) I don’t know that this work necessarily has the “evolving over time” aspect, at least not to start, but do you think that the differential dataflow work might be a scalable simply as a replacement for “plain-old-datalog”? i.e., if we just have a very large number of base facts we want to combine.


#12

It might be a decent scalable replacement for “plain-old-datalog”, though… in a perfect world there would be even better implementations. Our experience at ETHZ was that it (dd) was able to do things that the other systems we used (e.g. logicblox, socialite, dlv) fell down on, but not for any identifiable good reason. Mostly, the system (at the time) paid good attention to how data are laid out, did sequential scans though memory, stuff like that. We’re not sure why the other systems didn’t do as well; if there are known better solutions, I don’t want to make restrictive statements about them.

If the project is at / gets to the stage where all the data are at hand, and it is hard to find some compute layer that runs properly, I’m more than happy to port some rules to dd and demo it. I’ve also got a bit of experience dealing with cyclic joins in Datalog (e.g. triangles), which can end up being a pain point.

Maybe related, maybe not: it seems like a fair bit of the unification stuff might also fit under this heading. It could be that bespoke union-find is the best, but if you end up in a spot where you’d really like to put in some custom derivations, and the choices end up being (i) union-find, (ii) datalog, (iii) prolog, with a transition from “too restrictive” to “too general”, dd might be able to help out again, somewhere between Datalog and Prolog.


#13

FYI, pfff (from Facebook) does this using SWI Prolog, for quite a few languages.


#14

So, some time ago @joshuata wrote to me expressing some interest in this project, and we talked. They’ve done a bit of hacking (which I’ll put some links to below). In the meantime, a few other people have expressed interest, so I thought maybe we could start using this thread as a place to talk over the details.

First off, @joshuata can probably explain what they’ve done better, but they sent me a link to a fork of clippy that adds a new kind of clippy plug-in (an analysis).

My feeling is that it makes sense to simultaneously have a “general” vision for how to structure the “facts” (whether they be in datalog or not) but mostly with a focus on specifics. To that end, I think a good “first application” would be trying to evaluate lifetime elision – specifically, to try and uncover how often there are “hidden references” returned via elision. In particular, code like this:

struct Ref<'a> { x: &'a i32 } 
impl Foo {
    fn foo(&self) -> Ref { ... }
}

Here, the return type Ref actually takes a lifetime parameter, but we allow it to be elided. The more explicit form would be fn foo<'a>(&'a self) -> Ref<'a>. I’d like to figure out how often people “elide” lifetime parameters on structs that appear in return position and how often they write out the fully expanded form (once we get that working, I have some follow-up queries). =)

But perhaps that’s too complicate. An even simpler thing might be to evaluate (a) how many projects use unsafe blocks and where – in methods? in free functions? (Again I have plenty of follow-up queries…)

My general thought was that we would need some common threads to “bind together” the facts that we emit. The most natural thing to use is probably the HirId that @michaelwoerister recently landed in a PR. This is basically a kind of “absolute reference” into the HIR for a krate; it combines a DefId (in compiler terms), which is a unique identifier for a fn or other item, with a relative position within that DefId. Each HIR node in the compiler currently has a NodeId; this can be converted to a HirId. We plan to eliminate NodeId from the HIR altogether eventually and just use these relative identifiers.

My thought is that for every item-like thing in the HIR, we’d emit some facts classifying it. This is probably almost a “transcription” of the HIR. For example:

itemKind(defId1, enum). // item with id `DefId` is an `enum`
itemKind(defId2, struct). // etc
itemKind(defId3, fn). // etc

Then within each fn we can start enumerate out the AST.

I guess my concern is that generating this from the HIR seems sort of awfully rote. One could imagine some automatic analysis to generate these facts from the HIR.

Instead of using the HIR, then, another option might be to focus on the actual AST (and relate facts to that). This has the advantage that the AST is more tied to the lexical structure of Rust and hence more stable than HIR. (In that case, we would not want HirId but rather NodeId, I suppose, as the canonical identifier – probably combined with a krate identifier.)

Then we’d “enrich” the AST with types etc extracted from the HIR – I imagine that eventually we’ll also want to encode the MIR, at least for some of the analyses.

OK, these are still some preliminary thoughts, but I’d be curious to get feedback.


#15

Hello, any progress on this? I could not find anything that would be later than this post.


#16

There was some interesting, but I’m not aware of anything coming out of it. I’d still like to see progress here! One thing I would note is that this is very similar to the RLS’s save-analysis data, so it’s plausible that the rls-analysis crate could be a fruitful starting point.


#17

Thank you for the update. I am thinking of making a student project about this. I will let you know if I find anyone.


#18

Sounds great!


#19

I think doing this kind of analysis is precisely Semmle’s business model, so you might want to check them out, although their Datalog dialect/product (SemmleQL) is unfortunately proprietary. I also work on a generalized-Datalog language, Datafun, but I think it’s a bit early to say whether it would be usable for your use-case; it’s a very early prototype and we have no performance to speak of.