Google Cloud Platform

Using BigQuery to Analyze PHP on GitHub

This has been an exciting summer for Google BigQuery, with the release of Standard SQL (beta) and the availability of the Github Public Dataset. Additionally, many of us have been working hard to spread the love for PHP within Google. With that in mind, we decided to use BigQuery and Standard SQL to uncover data about some of the most popular open-source languages and libraries. Read on to see what we discovered — and for inspiration on how to use BigQuery and Standard SQL in your own queries.

Query #1: Programming languages with the best test coverage

Which programming language has the best test coverage? There are several ways to go about answering this question; our approach will check each repo for known Continuous Integration (CI) service files, specifically travis.yml, Jenkinsfile, circle.yml, shippable.yml and appveyor.yml. While tests may exist that don’t include those CI service integration files, targeting a specific file is less likely to generate false positives.

  #StandardSQL
SELECT
  (
    SELECT ANY_VALUE(name)
    FROM UNNEST(language)
    WHERE bytes = most_bytes
  ) AS language_name,
  ROUND(
    SUM(
      CAST(has_tests AS INT64)
    )/COUNT(repo_name) * 100, 1
  ) AS percent,
  COUNT(repo_name) as count
FROM (
  SELECT
    l.repo_name,
    has_tests,
    language,
    (
      SELECT MAX(bytes)
      FROM UNNEST(language)
    ) as most_bytes
  FROM (
    SELECT
      repo_name,
      SUM(
        CAST(
          REGEXP_CONTAINS(
            path, 
            r'^\.travis\.yml|^Jenkinsfile|^circle\.yml|^shippable\.yml|^appveyor\.yml'
          ) AS INT64
        )
      ) > 0 AS has_tests
    FROM
      `bigquery-public-data.github_repos.files`
    GROUP BY
      repo_name
    ) f
  JOIN
    `bigquery-public-data.github_repos.languages` l
  ON
    l.repo_name = f.repo_name ) l2
GROUP BY
  language_name
HAVING
  COUNT(repo_name) > 50000
ORDER BY
  percent DESC

Here are our results:

language_namepercentcount
1Ruby30.9192337
2Go29.564500
3JavaScript25.8534327
4PHP20.2223780
5Python18.7290011
6Objective-C16.470555
7Java13.5263704
8C++11.2122605
9HTML10.1100850
10C9.7130414
php-on-github-8esek.PNG

With BigQuery’s new Standard SQL syntax, repeated fields can no longer be queried using nested syntax or keywords like WITHIN RECORD. These fields (see language_name and most_bytes in the query above) are now selected using a subquery and UNNEST. To determine whether a repository has tests, we ran a fairly simple regex subquery on the files table and joined the results to the languages table to determine the primary language used in the repository. Finally, we aggregated the results to determine the percent coverage for each language.

PHP comes in 4th place, ahead of Python, Java and Objective-C. Well done, PHP community. A quick calculation using the count column shows PHP only needs to add 12,532 tests to pass Javascript and capture the bronze. Let’s get to it!

Query #2: The most popular packages in Composer

Using User Defined Functions (UDFs), BigQuery can parse the contents of PHP’s package management system Composer, by examining composer.json and compiling the list of dependencies for each repository. The UDF field in BigQuery’s legacy SQL is no longer valid for Standard SQL. Instead, functions are declared inline with the rest of the Standard SQL Query.

  #StandardSQL
CREATE TEMPORARY FUNCTION parseComposer(json STRING)
RETURNS ARRAY<STRING> LANGUAGE js AS """
  var result = []
  try {
    var json = JSON.parse(json)
    for (key in json.require) {
      result.push(key)
    }
    delete json;
  } catch(e) {
  }
  return result;
  """;
SELECT
  package_name,
  COUNT(package_name) AS count
FROM (
  SELECT
    parseComposer(c.content) AS package_name
  FROM
    `bigquery-public-data.github_repos.contents` c
  JOIN (
    SELECT
      id
    FROM
      `bigquery-public-data.github_repos.files`
    WHERE
      path = "composer.json" ) f
  ON
    f.id = c.id ) packages,
  UNNEST(package_name) AS package_name
WHERE
  package_name IS NOT NULL
GROUP BY
  package_name
ORDER BY
  count DESC
LIMIT
  20

Here are the top packages, according to our query:

package_namecount
1php90216
2doctrine/orm17028
3symfony/symfony16945
4doctrine/doctrine-bundle15871
5symfony/monolog-bundle15771
6sensio/distribution-bundle15708
7sensio/framework-extra-bundle15607
8symfony/swiftmailer-bundle15458
9symfony/assetic-bundle14545
10twig/extensions13549
11incenteev/composer-parameter-handler13212
php-on-github-11emnl.PNG

To get the package owner instead of the package name, only use the part of the string preceding the forward slash (line 7):

  CREATE TEMPORARY FUNCTION parseComposer(json STRING)
RETURNS ARRAY<STRING> LANGUAGE js AS """
  var result = []
  try {
    var json = JSON.parse(json)
    for (key in json.require) {
      result.push(key.split("/")[0]) // line 7
    }
    delete json;
  } catch(e) {
  }
  return result;
  """;

Here are the top package owners:

ownercount
1symfony105511
2php90217
3doctrine54377
4sensio39301
5yiisoft20585
6zendframework18274
7twig18058
8incenteev13213
9illuminate10992
10jms9310
11sonata-project8773
php-on-github-10fkh1.PNG

The most popular package is doctrine/orm (as “php” represents the required PHP version and not a package). symfony is the most popular package owner, meaning packages prefixed with “symfony” are more common than any other package prefix. Thanks to all these package owners for creating and maintaining these great libraries!

Query #3: The most popular packages across PHP, Python and Ruby

A UDF can parse a single language’s requirements file, so why not parse more than one language’s requirements file? The following UDF selects from all repositories containing composer.json, requirements.txt or Gemfile

  #StandardSQL
CREATE TEMPORARY FUNCTION parsePackageFile(path STRING, json STRING)
RETURNS ARRAY<STRING> LANGUAGE js AS """
  var result = []
  switch (path) {
    case 'requirements.txt':
      requirements = content.split('\\n')
      for (i in requirements) {
        if (requirements[i].trim().length > 0) {
          req = requirements[i].split(/==|>|<|>=|<=|,/)
          result.push('Python - ' + req[0].toLowerCase())
        }
      }
      delete requirements
      break;
    case  'Gemfile':
      regexp = /gem (?:\"([^\"]+)\"|'([^']+)')/g
      while ((match = regexp.exec(content)) != null) {
        result.push('Ruby - ' + match[1].toLowerCase())
      }
      delete match
      break;
    case 'composer.json':
      try{
        var json = JSON.parse(content)
        for (key in json.require) {
          result.push('PHP - ' + key.toLowerCase())
        }
        delete json
      }catch(e){
      }
      break;
  }
  return result;
  """;
SELECT
  package_name,
  COUNT(package_name) AS count
FROM (
  SELECT
    parsePackageFile(f.path, c.content) AS package_name
  FROM
    `bigquery-public-data.github_repos.contents` c
  JOIN (
    SELECT
      id,
      path
    FROM
      `bigquery-public-data.github_repos.files`
    WHERE
      path IN ("composer.json", "requirements.txt", "Gemfile")
    ) f
  ON
    f.id = c.id 
  WHERE
    content is NOT NULL ) packages,
  UNNEST(package_name) AS package_name
WHERE
  package_name IS NOT NULL
GROUP BY
  package_name
ORDER BY
  count DESC
LIMIT
  21

And here are our results:

rowpackage_namecount
1PHP - php91388
2Ruby - rails41761
3Ruby - rake29662
4Ruby - jquery-rails23775
5Ruby - rspec23274
6Ruby - uglifier22836
7Ruby - sass-rails21970
8Ruby - coffee-rails19710
9Python - requests18628
10Ruby - pg17690
11Ruby - therubyracer17639
12PHP - doctrine/orm17142
13PHP - symfony/symfony17063
14Ruby - unicorn17033
15Ruby - sqlite316658
16Ruby - jbuilder16228
17PHP - doctrine/doctrine-bundle15974
18PHP - symfony/monolog-bundle15879
19PHP - sensio/distribution-bundle15806
20PHP - sensio/framework-extra-bundle15699
21PHP - symfony/swiftmailer-bundle15552
php-on-github-9yqdv.PNG

While Ruby does have the lion’s share of the top 10 packages, PHP has a strong presence in the top 20. It may not be very significant to compare requirements in this way, due to differences between language communities, but this should be encouraging for PHP, as Composer is the youngest package manager of them all!

Query #4: The most commonly used languages with PHP

One interesting aspect of the GitHub data is each repository contains a row for each language it contains. This allows us to query for repositories primarily containing PHP code and discover what other languages are used the most in tandem with PHP. Consider this query:

  #StandardSQL
SELECT
  auxiliary_languages.name,
  COUNT(auxiliary_languages.name) AS count
FROM (
  SELECT
    language,
    (
      SELECT
        ANY_VALUE(name)
      FROM
        UNNEST(language)
      WHERE bytes = most_bytes
    ) AS primary_language
  FROM (
    SELECT
      language,
      (
        SELECT
          MAX(bytes)
        FROM
          UNNEST(language)
      ) AS most_bytes
    FROM
      `bigquery-public-data.github_repos.languages` 
    ) t1 
  ) t2,
  UNNEST(t2.language) AS all_languages
WHERE
  primary_language = "PHP"
  AND auxiliary_languages.name != "PHP"
GROUP BY
  auxiliary_languages.name
ORDER BY
  count DESC

Here are the results:

Rowauxiliary_languages.namecount
1CSS92690
2JavaScript88308
3HTML46153
4ApacheConf41167
5Shell37849
6Ruby9617
7Perl9532
8Batchfile8002
9Smarty5894
10Python5472
11Makefile5371
12XSLT4966
13C++4344
14C2632
php-on-github-7y7fa.PNG

The top three are HTML, CSS and Javascript, which is not too surprising, followed by Apache configuration files and Shell scripts. Ruby comes in at a distant sixth.

Wrap up

The GitHub public dataset is incredibly fun to query, and BigQuery makes doing so incredibly fast! The longest of the above queries took 40 seconds to complete, and that’s with parsing 2TB of data with UDFs. So go forth and start querying!