Idea: Using BigQuery's public GitHub data for change impact analysis

Hi!

I’ve just heard about the partnership between Google and GitHub to open up all of GitHub’s public repo data on the BigQuery platform. I think this might be worth looking into from a change impact analysis standpoint. It should be able to answer questions not only about how many times a certain API was used, but also how it was used, and whether a certain API might be hard to use (e.g. if lines involving its use change a lot). This is just an idea; I haven’t had time to experiment with it yet, but I’m looking to actually do so at some point.

References:
https://changelog.com/209/ - interview with people from GitHub and Google on the topic
https://github.com/blog/2201-making-open-source-data-more-available - GitHub’s announcement
https://cloud.google.com/bigquery/public-data/github - the dataset itself

Neat idea. One piece of data I’d like that may be easy to get out of this is which unstable features are used most. It’s basically a matter of grepping for #[unstable(. If somebody wants to experiment with this, that would be a good query to go for.
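To make that concrete, here is a rough, untested sketch in BigQuery’s legacy SQL, run against the smaller sample_contents table to keep the scan cheap (the full contents table would work the same way):

-- count files mentioning an unstable attribute (sketch, untested)
SELECT COUNT(*) AS files_mentioning_unstable
FROM [bigquery-public-data:github_repos.sample_contents]
WHERE content CONTAINS '#[unstable('

Note this counts files containing at least one match rather than individual uses; per-use counts would need a regex extraction instead.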

cc @dikaiosune

Good point! Since I had something like this in mind when they announced it, I asked in the HN thread how frequently it’d be updated, and someone from GitHub (I think?) said they’re shooting for weekly updates. So we could potentially even get a nice time series going if we find some queries we like that fit under the monthly allotment.

I’d also be interested in just getting a better idea of std’s API usage frequency, how certain crates are used, error handling behavior, etc.
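For the crate-usage part, something like the following might serve as a first cut. This is an untested sketch in legacy SQL against the sample_contents table; REGEXP_EXTRACT only returns the first extern crate per file, so it undercounts files that declare several, and it scans files of every language (though extern crate is distinctive enough that noise should be low):

SELECT crate, COUNT(*) AS n
FROM (
  -- pull the first `extern crate foo;` declaration out of each file
  SELECT REGEXP_EXTRACT(content, r'extern crate (\w+)') AS crate
  FROM [bigquery-public-data:github_repos.sample_contents]
)
WHERE crate IS NOT NULL
GROUP BY crate
ORDER BY n DESC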

I need to look at how easy it is to filter down a query to just Rust repos, otherwise it’d be very easy to blow through the free BigQuery allotment.

It seems that you can create a new dataset of just some files: see “Analyzing Go code with BigQuery” by Francesc Campoy (Google Cloud Community, Medium). I will try to take a look at that if time allows.

As for the update frequency: in the interview they say it’s weekly for now but might become more frequent. The data already contains all commits/history, so even if updates are irregular we can still build a nice time series, imho. I’m not sure how this would affect staying within the free tier, though; intuitively, grepping through the history might touch quite a lot of data.

One idea for filtering to Rust repos is to look for a Cargo.toml. Concretely:

SELECT repo_name
FROM [bigquery-public-data:github_repos.files]
WHERE path = 'Cargo.toml'

If we want to run many queries, it is desirable to create a new dataset from the contents table that includes only Rust repos, to save on billing. But creating that dataset can’t be done within the free quota, and the dataset’s storage isn’t free either.


I played around with it a little bit. I created a table of Rust files by:

  • Filtering for repos which GitHub claims have Rust files
  • Then filtering for files whose paths end with ‘.rs’

Maybe I can make this available to the public? I don’t know yet. The content of all the Rust files is about 1.4 GB, so the storage cost for maintaining that table is practically free. However, I don’t think you can get around the fact that you need to scan about 1.5 TB to collect the Rust files, so updating the database will cost about $7 at BigQuery’s $5/TB query rate (about $2.50 if you only do it once a month, since the first 1 TB of queries each month is free).

Anyway, I don’t know much about SQL except that you need to SELECT stuff FROM something, but these results may be interesting nonetheless. If you have a specific query in mind, let me know, or perhaps I’ll just figure out how to make the table public.

Fun note: This table sources data from 11,947 repos, which contain 160,447 Rust files (corrected from an earlier count of 840,778; see the files/contents discrepancy below), consisting of 1.40 GB.

#[unstable(…)]

I selected everything that had [unstable(...)] with an optional leading !. I guess I forgot to add a # (shrugs). Inside the ... I selected all the features, i.e. the strings inside feature = "...". It turns out that everybody except for one person actually used a space between feature and =, so that was nice. Here is the query:

SELECT line, count(*) as n
FROM (
  SELECT
    -- inner extract: pull the feature name out of feature = "..."
    REGEXP_EXTRACT(
      -- outer extract: grab the arguments of [unstable(...)],
      -- with (?s) so the match can span newlines
      REGEXP_EXTRACT(
        content, 
        r'(?s)!?\[unstable\((.*?)\)\]'
      ),
      r'feature = \"(.*?)\"'
    ) as line
  FROM rust_lang.contents
)
GROUP BY line
ORDER BY n DESC

Here are the results: https://pastebin.mozilla.org/8884291. So the top 10 are:

  1. core (1429)
  2. rustc_private (542)
  3. std_misc (313)
  4. collections (220)
  5. alloc (88)
  6. rand (84)
  7. hash (73)
  8. unicode (71)
  9. io (55)
  10. as_slice (52)
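For anyone re-running this, anchoring the missing # should just mean tightening the outer pattern. An untested variant of the same query (the numbers above are from the unanchored version):

SELECT line, count(*) as n
FROM (
  SELECT
    REGEXP_EXTRACT(
      REGEXP_EXTRACT(
        content,
        -- as above, but requiring the leading # before the optional !
        r'(?s)#!?\[unstable\((.*?)\)\]'
      ),
      r'feature = \"(.*?)\"'
    ) as line
  FROM rust_lang.contents
)
GROUP BY line
ORDER BY n DESC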

#![feature(…)]

I figure knowing which features are being used could potentially be useful. I first selected things that looked like [feature(...)], and from there I split on commas and stripped whitespace. Here is the query:

SELECT line, count(*) as n
FROM (
  SELECT
    -- strip whitespace and newlines from each feature name
    REGEXP_REPLACE(
      -- SPLIT returns a repeated field in legacy SQL, so each
      -- comma-separated feature becomes its own row here
      SPLIT(
        REGEXP_EXTRACT(
          content, 
          r'(?s)!?\[feature\((.*?)\)\]'
        ), 
        ','
      ),
      r'[\r\n\s]+',
      ''
    ) as line
  FROM rust_lang.contents
)
GROUP BY line
ORDER BY n DESC

Here are the results: https://pastebin.mozilla.org/8884293.

Please feel free to let me know if I screwed up any of the queries :)

cc @brson


I think I made the table public: https://bigquery.cloud.google.com/dataset/rust-fun:rust_lang. Could someone give it a try and let me know if I succeeded in making it public? It’s only 1.4 GB, so with the 1 TB/month free tier you can run roughly 700 full-table queries for free each month.

Wow, this is awesome @cbreeden! Will try to play with the data myself later :)

As for the refresh costing money: can we set up a fund for things like these? Or maybe ask Google to chip in? What do you think?

It’s only $3 or so if you only do it once a month. After that the queries are dirt cheap, and the storage is essentially free. Apparently there is a decent API for this kind of stuff as well (even a crate! https://crates.io/crates/google-bigquery2). Perhaps we can come up with some queries that we run once a month to create a nifty little report.

I wouldn’t mind refreshing the data once a month. But I’m worried that I might have done something wrong. The numbers don’t seem to add up. I’ll post my queries here in case someone wants to double check them.

First I selected the repos that have “Rust” as part of the language tag, and stored this in the table rust-fun:rust_lang.repos:

SELECT *
FROM [bigquery-public-data:github_repos.languages]
WHERE language.name = 'Rust'

Then I selected all files from these repos that have a ‘.rs’ extension and stored this in the table rust-fun:rust_lang.files:

SELECT *
FROM [bigquery-public-data:github_repos.files]
WHERE RIGHT(path, 3) = '.rs' 
AND repo_name IN (SELECT repo_name FROM [rust_lang.repos])

Then I selected all file contents matching this result and stored them in rust-fun:rust_lang.contents:

SELECT *
FROM [bigquery-public-data:github_repos.contents]
WHERE id IN (SELECT id FROM [rust_lang.files])

The odd thing about this is that there are 840,778 rows in the rust-fun:rust_lang.files table and only 160,447 rows in the rust-fun:rust_lang.contents table. I don’t know what to make of that. I guess there are duplicate file IDs? I can imagine some being copies of files with the same hash or whatnot, but the numbers don’t seem to add up to me.
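If duplicate IDs are the cause, a quick sanity check would be to compare total rows against distinct ids in the files table (a sketch; legacy SQL’s COUNT(DISTINCT ...) is an approximation by default, which is fine here):

SELECT COUNT(id) AS total_rows, COUNT(DISTINCT id) AS unique_ids
FROM [rust_lang.files]

Since id is the git blob hash, identical files (forks, vendored copies, and so on) presumably share a single id, and the contents table stores each unique blob once, which would explain the collapse from 840,778 to 160,447.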

Cool! I’m hoping to have time to play with this tomorrow. I’m definitely interested in implementing some queries on GitHub data for http://rusty-dash.com (for example https://github.com/dikaiosune/rust-dashboard/issues/7, but also things like feature usage, crate usage, and many more), so that may be a decent place to host some of the query results. The server also already has some timed tasks which could be generalized to rebuild the dataset on a schedule and to run the queries when updated.

Badass. Thanks @cbreeden! I wonder why plugin doesn’t show up at all. That’s the one feature we know is a huge problem.

Edit: Oh heh, I was looking at your first ‘unstable’ list; ‘plugin’ is obviously in the second ‘feature’ list.

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.