GitHub Data

How to query public data sets using BigQuery

BigQuery is a fully managed data warehouse and analytics platform. Public datasets are available for you to analyze using SQL queries. You can access BigQuery public data sets using the web UI the command-line tool, or by making calls to the BigQuery REST API using a variety of client libraries such as Java, .NET, or Python.

To get started using a BigQuery public dataset, create or select a project. The first terabyte of data processed per month is free, so you can start querying public datasets without enabling billing. If you intend to go beyond the free tier, you should also enable billing.

  1. Sign in to your Google account.

    If you don't already have one, sign up for a new account.

  2. Select or create a Cloud Platform project.

    Go to the Manage resources page

  3. Enable billing for your project.

    Enable billing

  4. BigQuery is automatically enabled in new projects. To activate BigQuery in a pre-existing project, Enable the BigQuery API.

    Enable the API

Dataset overview

GitHub is how people build software and is home to the largest community of open source developers in the world, with over 12 million people contributing to 31 million projects on GitHub since 2008.

This 3TB+ dataset comprises the largest released source of GitHub activity to date. It contains a full snapshot of the content of more than 2.8 million open source GitHub repositories including more than 145 million unique commits, over 2 billion different file paths, and the contents of the latest revision for 163 million files, all of which are searchable with regular expressions.

You can start exploring this data in the BigQuery console:

Go to the GitHub public dataset

Sample Queries

Here are some examples of SQL queries you can run on this data in BigQuery. For more tips, updated resources, and community content, see the updated resource list at https://medium.com/@hoffa/b3576fd2b150

Most commonly used Go packages

SELECT
  REGEXP_EXTRACT(line, r'"([^"]+)"') AS url,
  COUNT(*) AS count
FROM
  FLATTEN( (
    SELECT
      SPLIT(SPLIT(REGEXP_EXTRACT(content, r'.*import\s*[(]([^)]*)[)]'), '\n'), ';') AS line,
    FROM (
      SELECT
        id,
        content
      FROM
        [bigquery-public-data:github_repos.sample_contents]
      WHERE
        REGEXP_MATCH(content, r'.*import\s*[(][^)]*[)]')) AS C
    JOIN (
      SELECT
        id
      FROM
        [bigquery-public-data:github_repos.sample_files]
      WHERE
        path LIKE '%.go'
      GROUP BY
        id) AS F
    ON
      C.id = F.id), line)
GROUP BY
  url
HAVING
  url IS NOT NULL
ORDER BY
  count DESC
LIMIT 10

Most commonly used Go packages

Most commonly used Java packages

This query uses one of the smaller sample tables to find the most popular Java packages.

SELECT
  package,
  COUNT(*) count
FROM (
  SELECT
    REGEXP_EXTRACT(line, r' ([a-z0-9\._]*)\.') package,
    id
  FROM (
    SELECT
      SPLIT(content, '\n') line,
      id
    FROM
      [bigquery-public-data:github_repos.sample_contents]
    WHERE
      content CONTAINS 'import'
      AND sample_path LIKE '%.java'
    HAVING
      LEFT(line, 6)='import' )
  GROUP BY
    package,
    id )
GROUP BY
  1
ORDER BY
  count DESC
LIMIT
  40;

Most commonly used Java packages

How many times 'This should never happen' appears

This query uses a smaller sample table to find how many times the comment "this should never happen" is present.

SELECT
  SUM(copies)
FROM
  [bigquery-public-data:github_repos.sample_contents]
WHERE
  NOT binary
  AND content CONTAINS 'This should never happen'

Hint: if you run this query against the entire dataset the answer is around 1,000,000!

How many times 'This should never happen' appears

About the dataset

Dataset Source: GitHub

Category: Technology, Social

Use: This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source - https://help.github.com/articles/github-terms-of-service/ - and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.

Update Frequency: Weekly

View in BigQuery: Go to GitHub data

Send feedback about...