Using BigQuery to Analyze PHP on GitHub
Brent Shaffer
Developer Programs Engineer
This has been an exciting summer for Google BigQuery, with the release of Standard SQL (beta) and the availability of the Github Public Dataset. Additionally, many of us have been working hard to spread the love for PHP within Google. With that in mind, we decided to use BigQuery and Standard SQL to uncover data about some of the most popular open-source languages and libraries. Read on to see what we discovered — and for inspiration on how to use BigQuery and Standard SQL in your own queries.
Query #1: Programming languages with the best test coverage
Which programming language has the best test coverage? There are several ways to go about answering this question; our approach will check each repo for known Continuous Integration (CI) service files, specificallytravis.yml
, Jenkinsfile
, circle.yml
, shippable.yml
and appveyor.yml
. While tests may exist that don’t include those CI service integration files, targeting a specific file is less likely to generate false positives.Here are our results:
language_name | percent | count | |
1 | Ruby | 30.9 | 192337 |
2 | Go | 29.5 | 64500 |
3 | JavaScript | 25.8 | 534327 |
4 | PHP | 20.2 | 223780 |
5 | Python | 18.7 | 290011 |
6 | Objective-C | 16.4 | 70555 |
7 | Java | 13.5 | 263704 |
8 | C++ | 11.2 | 122605 |
9 | HTML | 10.1 | 100850 |
10 | C | 9.7 | 130414 |
With BigQuery’s new Standard SQL syntax, repeated fields can no longer be queried using nested syntax or keywords like WITHIN RECORD. These fields (see language_name and most_bytes in the query above) are now selected using a subquery and UNNEST. To determine whether a repository has tests, we ran a fairly simple regex subquery on the files table and joined the results to the languages table to determine the primary language used in the repository. Finally, we aggregated the results to determine the percent coverage for each language.
PHP comes in 4th place, ahead of Python, Java and Objective-C. Well done, PHP community. A quick calculation using the count column shows PHP only needs to add 12,532 tests to pass Javascript and capture the bronze. Let’s get to it!
Query #2: The most popular packages in Composer
Using User Defined Functions (UDFs), BigQuery can parse the contents of PHP’s package management system Composer, by examining composer.json and compiling the list of dependencies for each repository. The UDF field in BigQuery’s legacy SQL is no longer valid for Standard SQL. Instead, functions are declared inline with the rest of the Standard SQL Query.Here are the top packages, according to our query:
package_name | count | |
1 | php | 90216 |
2 | doctrine/orm | 17028 |
3 | symfony/symfony | 16945 |
4 | doctrine/doctrine-bundle | 15871 |
5 | symfony/monolog-bundle | 15771 |
6 | sensio/distribution-bundle | 15708 |
7 | sensio/framework-extra-bundle | 15607 |
8 | symfony/swiftmailer-bundle | 15458 |
9 | symfony/assetic-bundle | 14545 |
10 | twig/extensions | 13549 |
11 | incenteev/composer-parameter-handler | 13212 |
To get the package owner instead of the package name, only use the part of the string preceding the forward slash (line 7
):
Here are the top package owners:
owner | count | |
1 | symfony | 105511 |
2 | php | 90217 |
3 | doctrine | 54377 |
4 | sensio | 39301 |
5 | yiisoft | 20585 |
6 | zendframework | 18274 |
7 | twig | 18058 |
8 | incenteev | 13213 |
9 | illuminate | 10992 |
10 | jms | 9310 |
11 | sonata-project | 8773 |
The most popular package is doctrine/orm (as “php” represents the required PHP version and not a package). symfony is the most popular package owner, meaning packages prefixed with “symfony” are more common than any other package prefix. Thanks to all these package owners for creating and maintaining these great libraries!
Query #3: The most popular packages across PHP, Python and Ruby
A UDF can parse a single language’s requirements file, so why not parse more than one language’s requirements file? The following UDF selects from all repositories containingcomposer.json
, requirements.txt
or Gemfile
And here are our results:
row | package_name | count |
1 | PHP - php | 91388 |
2 | Ruby - rails | 41761 |
3 | Ruby - rake | 29662 |
4 | Ruby - jquery-rails | 23775 |
5 | Ruby - rspec | 23274 |
6 | Ruby - uglifier | 22836 |
7 | Ruby - sass-rails | 21970 |
8 | Ruby - coffee-rails | 19710 |
9 | Python - requests | 18628 |
10 | Ruby - pg | 17690 |
11 | Ruby - therubyracer | 17639 |
12 | PHP - doctrine/orm | 17142 |
13 | PHP - symfony/symfony | 17063 |
14 | Ruby - unicorn | 17033 |
15 | Ruby - sqlite3 | 16658 |
16 | Ruby - jbuilder | 16228 |
17 | PHP - doctrine/doctrine-bundle | 15974 |
18 | PHP - symfony/monolog-bundle | 15879 |
19 | PHP - sensio/distribution-bundle | 15806 |
20 | PHP - sensio/framework-extra-bundle | 15699 |
21 | PHP - symfony/swiftmailer-bundle | 15552 |
While Ruby does have the lion’s share of the top 10 packages, PHP has a strong presence in the top 20. It may not be very significant to compare requirements in this way, due to differences between language communities, but this should be encouraging for PHP, as Composer is the youngest package manager of them all!
Query #4: The most commonly used languages with PHP
One interesting aspect of the GitHub data is each repository contains a row for each language it contains. This allows us to query for repositories primarily containing PHP code and discover what other languages are used the most in tandem with PHP. Consider this query:Here are the results:
Row | auxiliary_languages.name | count |
1 | CSS | 92690 |
2 | JavaScript | 88308 |
3 | HTML | 46153 |
4 | ApacheConf | 41167 |
5 | Shell | 37849 |
6 | Ruby | 9617 |
7 | Perl | 9532 |
8 | Batchfile | 8002 |
9 | Smarty | 5894 |
10 | Python | 5472 |
11 | Makefile | 5371 |
12 | XSLT | 4966 |
13 | C++ | 4344 |
14 | C | 2632 |
The top three are HTML, CSS and Javascript, which is not too surprising, followed by Apache configuration files and Shell scripts. Ruby comes in at a distant sixth.