I've started prototyping with Ember and Highcharts.
It ain't pretty yet, but it's a start. I'm hoping to have all of the core graphing in place sometime this weekend, time permitting.
Do you have a sense for how much data is being transferred in the scrapes? Do subsequent scrapes retransfer all the same data or are they incrementalized somehow?
The scrapes of GitHub and the nightly releases are incremental, and the incremental scrapes are pretty quick. GitHub provides a since
keyword on most of their query strings, so I can just search the database for the most recent update, and ask for any updates since then. It works OK, although if the scraper daemon is killed in the middle of doing GitHub, then it's necessary to run the bootstrap
command to get all data since the last known good scrape (that command takes a date, so it'll grab some extra, but that's OK too -- it'll just update the existing data in place). The nightly releases are super simple to check, so each scrape just looks for a YYYY-MM-DD/index.html
that occurred after the last successful release. The downside there is that if a release is yanked for some reason after the scraper has already recorded it, that record won't be updated, but that's the cost of incremental updates, I think, and probably not that big a deal.
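For illustration, here's a rough sketch of what the since-based incremental fetch could look like. This isn't the actual scraper code -- the helper functions, repo path, and error handling are placeholders, and it assumes reqwest (blocking) and chrono:

```rust
use chrono::{DateTime, Utc};

/// Hypothetical sketch of an incremental GitHub scrape: look up the newest
/// record already stored, then ask GitHub only for items updated after it.
fn scrape_github_incrementally(
    client: &reqwest::blocking::Client,
) -> Result<(), Box<dyn std::error::Error>> {
    // Placeholder: e.g. SELECT MAX(updated_at) from the relevant table.
    let since: DateTime<Utc> = latest_update_in_db()?;

    // GitHub's list endpoints accept a `since` query parameter (ISO 8601),
    // so only items updated after that timestamp come back.
    let url = format!(
        "https://api.github.com/repos/rust-lang/rust/issues?since={}&state=all",
        since.to_rfc3339()
    );

    let body = client
        .get(url)
        .header("User-Agent", "dashboard-scraper") // GitHub requires a UA header
        .send()?
        .text()?;

    // Parse and upsert into the database; existing rows get updated in place,
    // which is why re-fetching a little extra data after a bootstrap is harmless.
    store_or_update_records(&body)?;
    Ok(())
}

fn latest_update_in_db() -> Result<DateTime<Utc>, Box<dyn std::error::Error>> {
    unimplemented!("placeholder: query the most recent updated_at from the DB")
}

fn store_or_update_records(_json: &str) -> Result<(), Box<dyn std::error::Error>> {
    unimplemented!("placeholder: deserialize and upsert rows")
}
```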
The buildbot scraping is very much not ideal -- it needs to go to each builder and ask for all builds on record, and unfortunately it doesn't look like the current API supports anything more granular. This takes about 5-15 minutes per run. I'm currently running it once every ~90 minutes. @alexcrichton kept an eye on the build cluster when I first ran it a little while ago, and it didn't seem like the API (despite taking forever) put much load on the build machines. But that may be worth revisiting. Also, to make things more interesting, the 0.9 series of buildbot (IIRC Rust is running 0.8.10) will remove the JSON API that I'm currently using and replace it with something much better. So that module will require a rewrite if/when the Rust CI system is updated past the 0.8.* series of buildbot.
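As a rough illustration of the shape of that work (not the real module), something like the following walks every builder and re-pulls its full build history each run. The base URL and the exact endpoint paths are assumptions on my part, and it assumes reqwest (with the json feature) and serde_json:

```rust
use std::collections::HashMap;

/// Sketch of a buildbot 0.8-era JSON scrape: list the builders, then fetch
/// each builder's builds. Paths and base URL are placeholders.
fn scrape_buildbot(client: &reqwest::blocking::Client) -> Result<(), Box<dyn std::error::Error>> {
    let base = "https://buildbot.example.org"; // placeholder base URL

    // Assumed: /json/builders returns a map of builder name -> metadata.
    let builders: HashMap<String, serde_json::Value> =
        client.get(format!("{}/json/builders", base)).send()?.json()?;

    for name in builders.keys() {
        // There's no "only builds since X" filter here, so every run
        // re-requests the full build history for each builder.
        let builds: serde_json::Value = client
            .get(format!("{}/json/builders/{}/builds/_all", base, name))
            .send()?
            .json()?;

        store_builds(name, &builds)?;
    }
    Ok(())
}

fn store_builds(
    _builder: &str,
    _builds: &serde_json::Value,
) -> Result<(), Box<dyn std::error::Error>> {
    unimplemented!("placeholder: deserialize and upsert build records")
}
```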
It remains to be seen how incremental I can make the scrapes for the other data sources. I decided that the 3 I've got were enough to build the API server and web frontend, so I haven't read up much yet on the other sources discussed.
How large is the summary data now?
If I'm reading curl's output correctly, it looks like it's about 66KB that's currently transferred from the summary endpoint for 120 days of data. I'm in the process of reworking the database functions so that they return data in a format that's directly graphable without transformation on the front-end, so that number could go up or down by a bit.
$ curl http://localhost:8080/summary > /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 65882  100 65882    0     0  1452k      0 --:--:-- --:--:-- --:--:-- 1462k
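To give a concrete (purely illustrative) idea of what "directly graphable" could mean, the payload might be shaped as named series of [timestamp, value] pairs that Highcharts can consume without reshaping. The field names here are hypothetical, not the actual API, and it assumes serde with the derive feature:

```rust
use serde::Serialize;

/// Hypothetical summary payload shaped for direct charting.
#[derive(Serialize)]
struct SummarySeries {
    name: String,          // e.g. a label like "nightly releases"
    data: Vec<(i64, f64)>, // (millisecond timestamp, value) pairs
}

#[derive(Serialize)]
struct Summary {
    series: Vec<SummarySeries>,
}
```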
That's so cool you are using Diesel!
It's been a bit of a learning curve, but I think that it's been the best example of Rust's special sauce that I've used yet. Strong typing with clean abstractions and screaming performance.
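To give a flavor of what that looks like, here's a minimal sketch with a made-up table (not the dashboard's actual schema) of the "most recent update" lookup that feeds the since parameter, written against a Diesel 1.x-style API with the chrono feature:

```rust
#[macro_use]
extern crate diesel;

use chrono::NaiveDateTime;
use diesel::pg::PgConnection;
use diesel::prelude::*;

// Hypothetical table definition; the real schema comes from migrations.
table! {
    issues (id) {
        id -> Integer,
        number -> Integer,
        updated_at -> Timestamp,
    }
}

/// Find the newest updated_at on record. The query is type-checked at
/// compile time against the table! definition above.
fn latest_issue_update(conn: &PgConnection) -> QueryResult<Option<NaiveDateTime>> {
    use self::issues::dsl::*;

    issues
        .select(diesel::dsl::max(updated_at))
        .first(conn)
}
```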
The data you've collected already has been interesting and useful!
Glad to hear it!