Edit on GitHub
Report issue
Page history

Author(s): @{[ username ]}  Published: {[ TutorialCtrl.tutorial.date_published | date:'mediumDate' ]}

title: Querying the Most Popular NPM Packages on GitHub with Google BigQuery description: Learn how to find the most popular NPM packages on GitHub by querying Google BigQuery's public GitHub datasets. author: JustinBeckwith tags: BigQuery, GitHub, NPM date_published: 2017-02-08


This tutorial shows how to query the public GitHub dataset on Google BigQuery in order to determine the most popular NPM packages among public projects on GitHub.

Objectives

  • Run a simple query to determine how many package.json files exist on GitHub
  • Run a complex query to determine the most popular top-level NPM imports on GitHub

Costs

This tutorial uses billable components of Google Cloud Platform, including:

  • Google BigQuery

Before you begin

  1. Select or create a Google Cloud Platform Console project. Go to the projects page.
  2. Enable billing for your project. Enable billing.

The BigQuery console

Navigate to bigquery.cloud.google.com/welcome/YOUR_PROJECT_ID to start running queries.

When querying data BigQuery charges $5 (USD) per terabyte of data processed, where the first terabyte (1 TB) per month is free.

The queries in this tutorial will process more than 1 TB, which is why billing must be enabled.

How many package.json files are on GitHub?

  1. In the BigQuery console click Compose Query and paste in the following SQL query:

    SELECT
      COUNT(*) AS num_files
    FROM [bigquery-public-data:github_repos.files]
      WHERE
        RIGHT(path, 12) = "package.json"
    
  2. Click Run Query to execute the query, which should only take a few seconds.

  3. You should see a result like the following:

    query-1

Which NPM packages are imported the most on GitHub?

  1. In the BigQuery console click Compose Query and paste in the following SQL query:

    SELECT
      COUNT(*) as times_imported, package
    FROM
      JS(
        (SELECT content FROM [bigquery-public-data:github_repos.contents] WHERE id IN (
          SELECT id FROM [bigquery-public-data:github_repos.files] WHERE RIGHT(path, 12) = "package.json"
        )),
        content,
        "[{ name: 'package', type: 'string'}]",
        "function(row, emit) {
          try {
            x = JSON.parse(row.content);
            if (x.dependencies) {
              Object.keys(x.dependencies).forEach(function(dep) {
                emit({ package: dep });
              });
            }
          } catch (e) {}
        }"
      )
    GROUP BY package
    ORDER BY times_imported DESC
    LIMIT 1000
    
  2. Click Run Query to execute the query, which should take 2-3 minutes and process several terabytes of data.

  3. You should see a result like the following:

    query-2

There you have it, the most imported NPM packages on GitHub!

Next steps

See more by @{[ username ]} and more tagged {[ tag ]}{[ $last ? '' : ', ' ]}

Submit a Tutorial

Share step-by-step guides

SUBMIT A TUTORIAL

Request a Tutorial

Ask for community help

SUBMIT A REQUEST

GCP Tutorials

Tutorials published by GCP

VIEW TUTORIALS

Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see our Site Policies. Java is a registered trademark of Oracle and/or its affiliates.