Google Cloud Big Data and Machine Learning Blog

Innovation in data processing and machine learning technology

Using BigQuery to Analyze PHP on GitHub

Thursday, September 22, 2016

Posted by Brent Shaffer, Developer Programs Engineer

This has been an exciting summer for Google BigQuery, with the release of Standard SQL (beta) and the availability of the Github Public Dataset. Additionally, many of us have been working hard to spread the love for PHP within Google. With that in mind, we decided to use BigQuery and Standard SQL to uncover data about some of the most popular open-source languages and libraries. Read on to see what we discovered — and for inspiration on how to use BigQuery and Standard SQL in your own queries.

Query #1: Programming languages with the best test coverage

Which programming language has the best test coverage? There are several ways to go about answering this question; our approach will check each repo for known Continuous Integration (CI) service files, specifically travis.yml, Jenkinsfile, circle.yml, shippable.yml and appveyor.yml. While tests may exist that don’t include those CI service integration files, targeting a specific file is less likely to generate false positives.

#StandardSQL
SELECT
  (
    SELECT ANY_VALUE(name)
    FROM UNNEST(language)
    WHERE bytes = most_bytes
  ) AS language_name,
  ROUND(
    SUM(
      CAST(has_tests AS INT64)
    )/COUNT(repo_name) * 100, 1
  ) AS percent,
  COUNT(repo_name) as count
FROM (
  SELECT
    l.repo_name,
    has_tests,
    language,
    (
      SELECT MAX(bytes)
      FROM UNNEST(language)
    ) as most_bytes
  FROM (
    SELECT
      repo_name,
      SUM(
        CAST(
          REGEXP_CONTAINS(
            path, 
            r'^\.travis\.yml|^Jenkinsfile|^circle\.yml|^shippable\.yml|^appveyor\.yml'
          ) AS INT64
        )
      ) > 0 AS has_tests
    FROM
      `bigquery-public-data.github_repos.files`
    GROUP BY
      repo_name
    ) f
  JOIN
    `bigquery-public-data.github_repos.languages` l
  ON
    l.repo_name = f.repo_name ) l2
GROUP BY
  language_name
HAVING
  COUNT(repo_name) > 50000
ORDER BY
  percent DESC

Here are our results:

language_name percent count
1 Ruby 30.9 192337
2 Go 29.5 64500
3 JavaScript 25.8 534327
4 PHP 20.2 223780
5 Python 18.7 290011
6 Objective-C 16.4 70555
7 Java 13.5 263704
8 C++ 11.2 122605
9 HTML 10.1 100850
10 C 9.7 130414

With BigQuery’s new Standard SQL syntax, repeated fields can no longer be queried using nested syntax or keywords like WITHIN RECORD. These fields (see language_name and most_bytes in the query above) are now selected using a subquery and UNNEST. To determine whether a repository has tests, we ran a fairly simple regex subquery on the files table and joined the results to the languages table to determine the primary language used in the repository. Finally, we aggregated the results to determine the percent coverage for each language.

PHP comes in 4th place, ahead of Python, Java and Objective-C. Well done, PHP community. A quick calculation using the count column shows PHP only needs to add 12,532 tests to pass Javascript and capture the bronze. Let’s get to it!

Query #2: The most popular packages in Composer

Using User Defined Functions (UDFs), BigQuery can parse the contents of PHP’s package management system Composer, by examining composer.json and compiling the list of dependencies for each repository. The UDF field in BigQuery’s legacy SQL is no longer valid for Standard SQL. Instead, functions are declared inline with the rest of the Standard SQL Query.

#StandardSQL
CREATE TEMPORARY FUNCTION parseComposer(json STRING)
RETURNS ARRAY<STRING> LANGUAGE js AS """
  var result = []
  try {
    var json = JSON.parse(json)
    for (key in json.require) {
      result.push(key)
    }
    delete json;
  } catch(e) {
  }
  return result;
  """;

SELECT
  package_name,
  COUNT(package_name) AS count
FROM (
  SELECT
    parseComposer(c.content) AS package_name
  FROM
    `bigquery-public-data.github_repos.contents` c
  JOIN (
    SELECT
      id
    FROM
      `bigquery-public-data.github_repos.files`
    WHERE
      path = "composer.json" ) f
  ON
    f.id = c.id ) packages,
  UNNEST(package_name) AS package_name
WHERE
  package_name IS NOT NULL
GROUP BY
  package_name
ORDER BY
  count DESC
LIMIT
  20 

Here are the top packages, according to our query:

package_name count
1 php 90216
2 doctrine/orm 17028
3 symfony/symfony 16945
4 doctrine/doctrine-bundle 15871
5 symfony/monolog-bundle 15771
6 sensio/distribution-bundle 15708
7 sensio/framework-extra-bundle 15607
8 symfony/swiftmailer-bundle 15458
9 symfony/assetic-bundle 14545
10 twig/extensions 13549
11 incenteev/composer-parameter-handler 13212

To get the package owner instead of the package name, only use the part of the string preceding the forward slash (line 7):

CREATE TEMPORARY FUNCTION parseComposer(json STRING)
RETURNS ARRAY<STRING> LANGUAGE js AS """
  var result = []
  try {
    var json = JSON.parse(json)
    for (key in json.require) {
      result.push(key.split("/")[0]) // line 7
    }
    delete json;
  } catch(e) {
  }
  return result;
  """;

Here are the top package owners:

owner count
1 symfony 105511
2 php 90217
3 doctrine 54377
4 sensio 39301
5 yiisoft 20585
6 zendframework 18274
7 twig 18058
8 incenteev 13213
9 illuminate 10992
10 jms 9310
11 sonata-project 8773

The most popular package is doctrine/orm (as “php” represents the required PHP version and not a package). symfony is the most popular package owner, meaning packages prefixed with “symfony” are more common than any other package prefix. Thanks to all these package owners for creating and maintaining these great libraries!

Query #3: The most popular packages across PHP, Python and Ruby

A UDF can parse a single language’s requirements file, so why not parse more than one language’s requirements file? The following UDF selects from all repositories containing composer.json, requirements.txt or Gemfile


#StandardSQL
CREATE TEMPORARY FUNCTION parsePackageFile(path STRING, json STRING)
RETURNS ARRAY<STRING> LANGUAGE js AS """
  var result = []
  switch (path) {
    case 'requirements.txt':
      requirements = content.split('\\n')
      for (i in requirements) {
        if (requirements[i].trim().length > 0) {
          req = requirements[i].split(/==|>|<|>=|<=|,/)
          result.push('Python - ' + req[0].toLowerCase())
        }
      }
      delete requirements
      break;
    case  'Gemfile':
      regexp = /gem (?:\"([^\"]+)\"|'([^']+)')/g
      while ((match = regexp.exec(content)) != null) {
        result.push('Ruby - ' + match[1].toLowerCase())
      }
      delete match
      break;
    case 'composer.json':
      try{
        var json = JSON.parse(content)
        for (key in json.require) {
          result.push('PHP - ' + key.toLowerCase())
        }
        delete json
      }catch(e){
      }
      break;
  }
  return result;
  """;

SELECT
  package_name,
  COUNT(package_name) AS count
FROM (
  SELECT
    parsePackageFile(f.path, c.content) AS package_name
  FROM
    `bigquery-public-data.github_repos.contents` c
  JOIN (
    SELECT
      id,
      path
    FROM
      `bigquery-public-data.github_repos.files`
    WHERE
      path IN ("composer.json", "requirements.txt", "Gemfile")
    ) f
  ON
    f.id = c.id 
  WHERE
    content is NOT NULL ) packages,
  UNNEST(package_name) AS package_name
WHERE
  package_name IS NOT NULL
GROUP BY
  package_name
ORDER BY
  count DESC
LIMIT
  21

And here are our results:

row package_name count
1 PHP - php 91388
2 Ruby - rails 41761
3 Ruby - rake 29662
4 Ruby - jquery-rails 23775
5 Ruby - rspec 23274
6 Ruby - uglifier 22836
7 Ruby - sass-rails 21970
8 Ruby - coffee-rails 19710
9 Python - requests 18628
10 Ruby - pg 17690
11 Ruby - therubyracer 17639
12 PHP - doctrine/orm 17142
13 PHP - symfony/symfony 17063
14 Ruby - unicorn 17033
15 Ruby - sqlite3 16658
16 Ruby - jbuilder 16228
17 PHP - doctrine/doctrine-bundle 15974
18 PHP - symfony/monolog-bundle 15879
19 PHP - sensio/distribution-bundle 15806
20 PHP - sensio/framework-extra-bundle 15699
21 PHP - symfony/swiftmailer-bundle 15552

While Ruby does have the lion’s share of the top 10 packages, PHP has a strong presence in the top 20. It may not be very significant to compare requirements in this way, due to differences between language communities, but this should be encouraging for PHP, as Composer is the youngest package manager of them all!

Query #4: The most commonly used languages with PHP

One interesting aspect of the GitHub data is each repository contains a row for each language it contains. This allows us to query for repositories primarily containing PHP code and discover what other languages are used the most in tandem with PHP. Consider this query:
#StandardSQL
SELECT
  auxiliary_languages.name,
  COUNT(auxiliary_languages.name) AS count
FROM (
  SELECT
    language,
    (
      SELECT
        ANY_VALUE(name)
      FROM
        UNNEST(language)
      WHERE bytes = most_bytes
    ) AS primary_language
  FROM (
    SELECT
      language,
      (
        SELECT
          MAX(bytes)
        FROM
          UNNEST(language)
      ) AS most_bytes
    FROM
      `bigquery-public-data.github_repos.languages` 
    ) t1 
  ) t2,
  UNNEST(t2.language) AS all_languages
WHERE
  primary_language = "PHP"
  AND auxiliary_languages.name != "PHP"
GROUP BY
  auxiliary_languages.name
ORDER BY
  count DESC
Here are the results:
Row auxiliary_languages.name count
1 CSS 92690
2 JavaScript 88308
3 HTML 46153
4 ApacheConf 41167
5 Shell 37849
6 Ruby 9617
7 Perl 9532
8 Batchfile 8002
9 Smarty 5894
10 Python 5472
11 Makefile 5371
12 XSLT 4966
13 C++ 4344
14 C 2632

The top three are HTML, CSS and Javascript, which is not too surprising, followed by Apache configuration files and Shell scripts. Ruby comes in at a distant sixth.

Wrap up

The GitHub public dataset is incredibly fun to query, and BigQuery makes doing so incredibly fast! The longest of the above queries took 40 seconds to complete, and that’s with parsing 2TB of data with UDFs. So go forth and start querying!
  • Big Data Solutions

  • Product deep dives, technical comparisons, how-to's and tips and tricks for using the latest data processing and machine learning technologies.

  • Learn More

12 Months FREE TRIAL

Try BigQuery, Machine Learning and other cloud products and get $300 free credit to spend over 12 months.

TRY IT FREE

Monitor your resources on the go

Get the Google Cloud Console app to help you manage your projects.