The Rust Project needs much better visibility into important metrics

Hey Rust.

This is a post about a bad problem we have in Rust, and another plea for help. It’s a braindump, just hoping to get the need out there and see if anybody can help push us in a better direction.

Very often we have questions about the health of the project, about whether various things are getting better or worse, and we mostly just guess. It would be so much better if we had numbers.

We have lots of automated systems and lots of bots floating around the project. A lot of the pieces exist, but they don’t exist in one place.

I read recently that Miguel de Icaza was interested in improving Xamarin’s issue triage using Power BI, and after just looking at the front-page screenshot I immediately wished we had that.

I’m looking for somebody interested in creating such a system for Rust. We can start by creating a SQL database and a batch process for scraping the data, picking some numbers that are easy to get ahold of, and setting up a simple web page to graph them.
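
To make that concrete, here is a minimal sketch of what the batch half could look like, assuming a hypothetical daily_metrics table and the postgres crate; the real scrapers, schema, and metric names are all still up for grabs:

```rust
use postgres::{Client, NoTls};

fn main() -> Result<(), postgres::Error> {
    // Hypothetical connection string and table name; a real run would scrape
    // GitHub/buildbot/crates.io here instead of hard-coding a value.
    let mut client = Client::connect("host=localhost user=dashboard dbname=metrics", NoTls)?;
    client.batch_execute(
        "CREATE TABLE IF NOT EXISTS daily_metrics (
             day    date   NOT NULL,
             metric text   NOT NULL,
             value  bigint NOT NULL,
             PRIMARY KEY (day, metric)
         )",
    )?;
    let prs_opened_today: i64 = 42; // placeholder for a number pulled from the GitHub API
    // Upsert today's value so the batch job can be re-run safely.
    client.execute(
        "INSERT INTO daily_metrics (day, metric, value)
         VALUES (CURRENT_DATE, $1, $2)
         ON CONFLICT (day, metric) DO UPDATE SET value = EXCLUDED.value",
        &[&"prs_opened", &prs_opened_today],
    )?;
    Ok(())
}
```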

The rest of this message is just a catalog of important information we need better visibility into. Feel free to mention more.

CI metrics

  • Opened PRs per day
  • Closed PRs per day
  • Time from PR open to PR close
  • Number of “@bors retries” per PR
  • CI errors per buildbot builder. Which builders fail the most?
  • Time per ‘auto-’ builder run
  • Release channel health. What’s the current release? How many days has nightly been broken?

Bug metrics

  • Opened issues per day
  • Closed issues per day
  • Time from issue open to issue close
  • P-high issue count
  • Regression issue count

Performance metrics

  • Compile-time / runtime % change per day
  • Performance improvements/losses per contributor

Some of this information is generated by perf-rustc, which has an undocumented API that can be used to retrieve data.

Crate metrics

  • Downloads per crate per day
  • Crate publications per day

User metrics

  • Downloads per day
  • Downloads per day per artifact
  • Twitter mentions
  • urlo / reddit posts per day
  • Top error codes

Download information is hard to get without backend help.

Approximations may be achievable with rustup telemetry, though not all downloads will ever go through rustup and not everybody will turn on telemetry.

Error codes also can be collected through rustup.

cc @bstrie @jntrnr @nikomatsakis @nrc people who love metrics

Here’s a “release health” dashboard that Firefox recently rolled out. It would be great to have the same for Rust.

It’s not perfect by any means, but https://github.com/rust-lang/rust/pulse has some of this kind of info, in a way.

This is definitely something I’d be interested in contributing to, although I am not familiar with much of the Rust infra so I might not be as efficient as others who’d be interested.

Also, do you mean using PowerBI specifically? It’d be a shame for Rust to lock itself into proprietary BI software.

Thinking out loud here about the source for each of these metrics.

I'm pretty sure most of these could come from the GitHub, TravisCI, crates.io, Twitter, reddit, and buildbot APIs. I can definitely dig into the perf-rustc and crates.io repos to figure out those APIs. There are, however, a few of these for which I don't see a clear path forward:

Release channel health. What's the current release? How many days has nightly been broken?

Is there an API for releases? It doesn't look like buildbot exposes that. Perhaps scraping the archive page?

Here's a "release health" dashboard that Firefox rolled out frequently. It would be great to have the same for Rust.

Are there GitHub Issues tags that would provide this info? It looks like the Firefox dashboard relies on specific affected versions metadata in bugzilla. I only see generic regression labels in GitHub Issues, but I'm not super familiar.

Compile-time / runtime % change per day

I think perf-rustc is just measuring compile time. If that's the case, I don't think it's a very good proxy for runtime performance, because the implementation shifts day-to-day. I don't know if this is the only discussion, but a separate runtime benchmark suite came up in issue 31265.

Also, it looks like perf-rustc is updated pretty consistently. I assume it's automated? Is it running on metal, or is it virtualized? I suspect running runtime microbenchmarks on a VM could be problematic.

Performance improvements/losses per contributor

Are the per-PR builds running on infrastructure with predictable performance, and are perf-rustc style numbers exposed for those? I don't see any timing numbers on the buildbot pages, but I am not really familiar with those interfaces.

User metrics

  • Downloads per day
  • Downloads per day per artifact

Do you mean in the context of rustup and website downloads? Is this the backend support you mentioned needing?

Top error codes

Where will the rustup telemetry be stored and will it be accessible over an API?

Another possible metric is the total number of Rust lines in GitHub repositories. Repos from rust-lang and servo should be excluded, I think.

Scraping the channel manifests is the way to do it: https://static.rust-lang.org/dist/channel-rust-nightly.toml is for nightly. This file contains the version number and the publication date.
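
A rough sketch of reading that manifest, assuming the blocking reqwest client and naive line matching (a real implementation would use a TOML parser and pull the version out of the [pkg.rust] section):

```rust
use reqwest::blocking::get;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Fetch the current nightly manifest and pull out its top-level `date` key.
    let body = get("https://static.rust-lang.org/dist/channel-rust-nightly.toml")?.text()?;
    let date = body
        .lines()
        .find(|line| line.trim_start().starts_with("date"))
        .and_then(|line| line.split('"').nth(1));
    println!("current nightly date: {:?}", date);
    Ok(())
}
```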

There's not enough data or process for that, no. The best we can do right now is probably to take the regression-from-X-to-Y tags and try to correlate them with regressions on a specific channel.

Yep, we have no runtime benchmarks running now.

It's automated and running on dedicated hardware.

Performance data for the 'auto-' builders won't be useful. There's not enough data to correlate performance changes to contributors now. What I might expect is to use the perf-rustc data to find interesting ranges then do another pass to bisect down to specific commits. Future work.

We would want to get this from CloudFront, but so far we don't know how to do it.

Not yet determined.

Obviously there's a lot of infrastructure missing to collect some of this data. I'd suggest that the CI and issue tracking metrics are the best starting point. Lamenting the turnaround time for landing PRs was the impetus for this post, so just getting insight there is super useful.

Ooh, yeah, scraping Rust repos has a lot of promise.

Nah, we have to use open source software.

I'm happy for you to run with this @dikaiosune. It will need to run on AWS eventually. I'd suggest writing it in Rust and using a postgres backend since postgres is well supported on AWS.

@brson OK, sounds good.

I'm currently messing with the GitHub API, and hopefully I'll have some progress there soon. Haven't pushed anything yet, but here's the repo I'm using for now in case anyone wants to follow along:

No promises, but I'll keep an eye on this thread and post back here once I've got something useful.

If you scrape the data into some machine-parseable format, a Shiny application will give you the power of the open source R data analysis platform (i.e., almost no statistic or visualization will be impractical to obtain), and the results are presented dynamically. It appears to be similar to Power BI.

@brson Now that I’ve got GitHub scraped (mostly; I still need to insert issue labels into the database), I’m tinkering with queries for these metrics. On these two:

  • Time from PR open to PR close
  • Time from issue open to issue close

How should this be presented? The open times for these items which were closed within the last 7 days? The average per-PR/issue open time for all of history (since 1.0?)? With ~4400 PRs and ~40000 issues updated since 1.0, I don’t think it’s feasible to show this on a per-issue or per-PR basis. I’m thinking about splitting each of these into two possible metrics:

  1. Mean and median age of PRs/issues which were closed in the last 7 days (perhaps as a time series for the last year?)
  2. Mean and median age of PRs/issues which are still open (again with some historical data presented)

How do those sound?
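
For concreteness, metric (1) boils down to something like the sketch below (plain Rust over pre-computed ages, just to pin down the definition; the real thing would be a SQL/Diesel query):

```rust
// Given the ages (close time minus open time, in seconds) of the PRs/issues
// closed in the last 7 days, compute the mean and median age in days.
fn mean_and_median_age_days(mut ages_secs: Vec<i64>) -> (f64, f64) {
    assert!(!ages_secs.is_empty());
    ages_secs.sort();
    let mean = ages_secs.iter().sum::<i64>() as f64 / ages_secs.len() as f64;
    let mid = ages_secs.len() / 2;
    let median = if ages_secs.len() % 2 == 0 {
        (ages_secs[mid - 1] + ages_secs[mid]) as f64 / 2.0
    } else {
        ages_secs[mid] as f64
    };
    (mean / 86_400.0, median / 86_400.0)
}

fn main() {
    // Three PRs that stayed open for 1, 2 and 10 days before being closed.
    let day = 86_400;
    let (mean, median) = mean_and_median_age_days(vec![day, 2 * day, 10 * day]);
    println!("mean: {:.1} days, median: {:.1} days", mean, median); // mean: 4.3, median: 2.0
}
```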

Also, for “number of @bors retries per PR” – based on data from 2015-05-15 to 2016-04-19 (last night), I count 416 PRs which included a bors retry, so probably too many to just present as a list. I’m curious what sort of aggregate metric would be useful to the Rust teams, as I don’t think a simple average would be.

Another question, although this time about release manifests. It looks like the historical releases (the YYYY-MM-DD folders) are just omitted if there wasn’t a release for that day. Is that true? If so, is that sufficient for seeing whether the overall nightly build was broken?

Is it worth parsing the manifest to get an exact breakdown of which platforms are missing? Is it common for a nightly release to go out when not all platforms are available?

Now that I've got GitHub scraped (mostly; I still need to insert issue labels into the database), I'm tinkering with queries for these metrics.

Nice! Any chance you could post a script so we can scrape the data ourselves, or upload the database somewhere? I'd love to tinker with the data!

On these two:

  • Time from PR open to PR close
  • Time from issue open to issue close

How should this be presented?

Apart from point estimates like mean/median, you could try some graphs. In this case, I would plot some sort of histogram where the X axis is categories like "closed within one hour", "one day", "one week", etc. and the Y axis is the number of issues/PRs in that category.
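
One way to bucket the data for such a histogram, sketched in plain Rust with illustrative boundaries (the dashboard would presumably do this in SQL instead):

```rust
// Count close times (in seconds) into coarse buckets for a histogram.
fn close_time_histogram(close_secs: &[i64]) -> [(&'static str, usize); 4] {
    let mut buckets = [
        ("within one hour", 0usize),
        ("within one day", 0),
        ("within one week", 0),
        ("longer than a week", 0),
    ];
    for &secs in close_secs {
        let idx = match secs {
            s if s <= 3_600 => 0,
            s if s <= 86_400 => 1,
            s if s <= 7 * 86_400 => 2,
            _ => 3,
        };
        buckets[idx].1 += 1;
    }
    buckets
}

fn main() {
    let sample = [120, 5_000, 90_000, 400_000, 2_000_000];
    for (label, count) in close_time_histogram(&sample) {
        println!("{:<20} {}", label, count);
    }
}
```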

Also, for "number of @bors retries per PR"

Here, a histogram would also help. X axis: number of retries (1, 2, 3, 4, 5+). Y axis: number of PRs that had N retries. Another interesting number would be the % of merged PRs that didn't need a "retry".

I think once we have the data we have to try to visualize it in different ways to see what it can tell us.

My understanding (i.e. you should wait for @brson's confirmation):

It looks like the historical releases (the YYYY-MM-DD folders) are just omitted if there wasn't a release for that day. Is that true?

AFAIK, yes.

If so, is that sufficient for seeing whether the overall nightly build was broken?

Yeah, you could get a percentage out of this. I think "over the past 365 days we've uploaded 360 nightly releases" sounds nicer though.

Is it worth parsing the manifest to get an exact breakdown of which platforms are missing? Is it common for a nightly release to go out when not all platforms are available?

AFAIK, all the builds (each and every platform) have to succeed to get a nightly release. If you go ahead and parse the manifests, you'll find out when the first binary release for each platform was produced though.

Thanks!

Any chance you could post a script so we can scrape the data ourselves, or upload the database somewhere? I'd love to tinker with the data!

Yep, all the code is in the repo, and there's a bootstrap.sql which will build a PostgreSQL DB with the GitHub data (it doesn't include CREATE DATABASE, but if you point it at an empty database it will init a schema and insert all the scraped values). I'm using pg 9.5 on my machine, but I don't think I'm relying on any behavior not available in 9.3/9.4 (I could be wrong though). The code is still super rough, as is the database schema, but it mostly works for now! I also uploaded a file (queries.sql) that has some test queries for most of these metrics that you can try out and mess with. Issue reports are more than welcome if you find something wrong while poking around.

Apart from point estimates like mean/median, you could try some graphs. In this case, I would plot some sort of histogram where the X axis is categories like "closed within one hour", "one day", "one week", etc. and the Y axis is the number of issues/PRs in that category.

I agree that graphs are useful! I've been thinking that most of the metrics discussed here should be presented as a time series of some sort (in addition to histograms/pie charts/etc., which I agree would be super useful). I think time series are useful for dashboards because they allow one to see incremental creep that might not be noticeable otherwise. The only problem is that to have a good time series, one needs a decent metric for each point (thus my suggestion of a scalar metric for open time).

Yeah, you could get a percentage out of this. I think "over the past 365 days we've uploaded 360 nightly releases" sounds nicer though.

In terms of visualizations, I was picturing something like a 15x30 (or whatever) grid with red and green squares to give a quick, at-a-glance representation (eyes are good at telling proportions, IME, and it'd be easy to see "streaks" that way too).

AFAIK, all the builds (each and every platform) have to succeed to get a nightly release. If you go ahead and parse the manifests, you'll find out when the first binary release for each platform was produced though.

I'm just now finishing up a scraper that says "if we hit the manifest URL and get a 404, the nightly failed; if we get a 200, it was released." If it'd be useful to have a per-platform/per-binary breakdown (maybe for tier-2/3 builds?), I can add that in, but I imagine there are more important data points hiding in TravisCI and the buildbots.
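
For reference, that check is roughly this shape (assuming the archived manifests live at dist/<YYYY-MM-DD>/channel-rust-nightly.toml and using the blocking reqwest client):

```rust
use reqwest::blocking::Client;
use reqwest::StatusCode;

// Did a nightly go out on the given date? 200 on the archived manifest means
// yes, 404 means no release that day.
fn nightly_released_on(client: &Client, date: &str) -> reqwest::Result<bool> {
    let url = format!(
        "https://static.rust-lang.org/dist/{}/channel-rust-nightly.toml",
        date
    );
    // A HEAD request is enough; we only care about the status code.
    Ok(client.head(url).send()?.status() == StatusCode::OK)
}

fn main() -> reqwest::Result<()> {
    let client = Client::new();
    for date in ["2016-04-18", "2016-04-19"] {
        println!("{}: released = {}", date, nightly_released_on(&client, date)?);
    }
    Ok(())
}
```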

This is perfect. We definitely want to see a graph over time. As you say, seeing the last year of data in 1-week chunks should be good. Though I'm definitely curious how it's looked over the entire life of the project too.

For the currently open PRs, just the number representing the current average age is fine, though if possible (I'm guessing not) it would also be great to see that over time.

These two data points can probably be presented together: show the graph of the average age of closed PRs, then next to it show the average age of all currently open PRs.

Personally, just the mean is fine by me.

I'm again interested in changes over time, so I'd suggest quantifying these as retries per PR per week (or day maybe?), where the time is the PR closing time. The actual PRs don't matter that much, though for investigative purposes it could be very nice to click through to the worst ones.

An amazing version of this might let you click on the data point in the time series, then next to the chart show the top ten PRs for that time slice.

Yes that is true and it is sufficient. Every day in the archives without a nightly manifest can be considered a broken nightly.

A histogram does sound useful. Again it might be good to link two visualizations together. On one side you have a chart of mean performance over time, quantized to the week or day; when you click on a week it then adjusts the histogram to be for that time slice.

@alexcrichton's and my main concern with nightly bustage is that we just never know it's happening until somebody happens to point it out, so just a big label saying either "It's been X days since nightly broke" or "Nightly has been broken for X days" would be a big improvement.

This is not quite true. There are a few platforms that are allowed to fail. I don't have any immediate need to know this information though.

I'm so excited for this.

I have some other questions about the discussion here, but I just want to quickly throw this one out there: should the dashboard care about the duration of failed CI builds? I would assume that the times would vary heavily depending on the stage of the failure, and so wouldn’t be useful, but I wanted to check before I throw away times for failed builds.

A not-so-quick update:

I’ve been busy wrapping up my semester, so I’m not as far along as I’d hoped, but this is getting close to an end-to-end prototype.

My big question right now is what to use for the presentation layer. After a conversation on IRC with @brson, I think the dashboard would work best as a JS SPA in front of the Rust API server, rather than dealing with the deployment complexity of a separate server-side application like Shiny. But there remains a panoply of routes in that direction. Should I expect this to grow into a larger interactive application which needs something like Angular or Ember? Should I worry about that later and just build a dashboard MVP with jQuery? Not sure myself, and I’d be eager to hear any input (if anyone is still reading this thread, that is :slight_smile: ).

Working:

  • Bootstrapping the database with data from GitHub, nightly releases, and BuildBot
  • A scraper daemon which scrapes those three data sources at configurable intervals
  • A JSON API server with one working endpoint (/summary/) which serves a summary of the last 120 days of activity
  • A bunch of queries written mostly in Diesel (many many many thanks to @sgrif for his help), yielding pretty good performance. After parallelization, the full summary endpoint (15-20 queries + JSON marshalling) takes ~80ms per request on my desktop with an SSD, down from ~150ms when done serially (which is still pretty good though).
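
For anyone curious, the fan-out for those summary queries is roughly this shape (a simplified sketch with placeholder closures standing in for the real Diesel queries and per-thread connections):

```rust
use std::sync::mpsc;
use std::thread;

// Run a set of independent jobs on their own threads and collect the results
// in submission order. In the real endpoint each closure would open its own
// DB connection and run one of the summary queries.
fn run_parallel<T, F>(jobs: Vec<F>) -> Vec<T>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    let n = jobs.len();
    for (idx, job) in jobs.into_iter().enumerate() {
        let tx = tx.clone();
        thread::spawn(move || {
            let _ = tx.send((idx, job()));
        });
    }
    drop(tx); // so rx.recv() errors out instead of hanging if a worker panics
    let mut results: Vec<Option<T>> = (0..n).map(|_| None).collect();
    for _ in 0..n {
        let (idx, value) = rx.recv().expect("a worker thread panicked");
        results[idx] = Some(value);
    }
    results.into_iter().map(Option::unwrap).collect()
}

fn main() {
    // Toy usage: three "queries" that just return numbers.
    let jobs: Vec<Box<dyn FnOnce() -> i64 + Send>> =
        vec![Box::new(|| 1), Box::new(|| 2), Box::new(|| 3)];
    println!("{:?}", run_parallel(jobs));
}
```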

TODO before a deployment:

  • Clean up the code for the summary endpoint – parallelizing it has improved the speed, but it’s ugly and I’m sure there’s a better way. Would love to hear if anyone has better ideas about how to elegantly farm out the queries to threads.
  • Extend the summary endpoint to have a configurable date range
  • Write a simple JavaScript page/app/thing to grab the JSON, parse it, and graph it (still need to decide how to do this – the KISS in me says jQuery + Highcharts/Highstock)
  • Split the scraper daemon and web server out into separate executables with separate configuration (to keep separate API credentials from being known to the API server and allow for easily using a read-only DB account for the API server)
  • Some other stuff I’m surely forgetting

Down the road:

  • Everything else listed in this thread and in the issues on the repository, including but not limited to:
    • Convert some of the scalar metrics to time series (not sure how best to do this efficiently because SQL is a bit complex for windowing over dates)
    • Scrape many more data sources and add them to the summary
    • Include medians for the average scalar metrics
    • Many other things I’m forgetting

I would go with Ember. I think you would be able to prototype faster with Ember than with plain jQuery (not a very reliable data point, though; I last touched Ember about two years ago). Also, crates.io is implemented in Ember.

Hi @dikaiosune!

That number doesn't interest me right now.

I don't know enough about Ember to say. Personally I'd first figure out how to display the data graphically to prove that you can get the data all the way to the screen. If Ember helps with that then use Ember.

Do you have a sense for how much data is being transferred in the scrapes? Do subsequent scrapes retransfer all the same data or are they incrementalized somehow?

How large is the summary data now?

That's so cool you are using Diesel!

SGTM.

The data you've collected already has been interesting and useful!