
Ibis and BigQuery: scalable analytics with the comfort of Python


Hedge funds have long sought to build increasingly robust and accurate quantitative models. Today, live or up-to-date third-party datasets provide a wealth of information that needs to be correctly analyzed at a scale that has traditionally required custom solutions and many person-hours of work. On top of that, sharing these analyses is critical for avoiding duplicate work.

This blog post demonstrates how Ibis works with BigQuery to provide an elegant and flexible Python interface for composing SQL queries.

As an example, consider the following SQL query that computes the acquisition rate of a number of different companies. Here the acquisition rate is defined as the number of acquisitions (or IPOs) divided by the number of investments a particular venture capital (VC) firm has made. It’s a complex query that proves to be as difficult to debug as it is to compose:

Language: SQL

  SELECT investor_name,
         num_investments,
         acq_ipos,
         acq_ipos / num_investments AS acq_rate
  FROM (
      SELECT CASE WHEN i.investor_name IS NULL THEN 'NO INVESTOR'
                  ELSE i.investor_name
             END AS investor_name,
             COUNT(DISTINCT c.permalink) AS num_investments,
             COUNT(DISTINCT
                 CASE WHEN c.status IN ('ipo', 'acquired') THEN c.permalink
                 END) AS acq_ipos
      FROM crunchbase_companies
      LEFT JOIN crunchbase_investments
          ON c.permalink = i.company_permalink
      GROUP BY investor_name
  ) t
  ORDER BY acq_rate DESC

There’s a bug in this query, and it’s pretty difficult to spot.

Spoiler alert: the tables in the “t” subquery’s FROM clause are missing aliases that are referenced elsewhere in the query.

The error in this query is difficult to spot because it’s a few missing characters in a sea of text. SQL has evolved through years of engineering to take on more capabilities and more syntax, and while it has grown more powerful, it has also grown in complexity.

However, SQL isn’t going anywhere anytime soon. There are some extremely powerful analytic SQL engines available, including Apache Hive, Cloudera Impala, Amazon Redshift, and, of course, Google BigQuery. The common thread among all of these systems is SQL.

One of the downsides of using SQL is that it’s a closed system: you can’t easily use another language to solve the same problem at the same scale in the same amount of time. This leads to code that passes opaque SQL strings around, often to the chagrin of the folks reading and writing it. Imagine the query above but with template parameters and additional string processing. It quickly becomes hard to manage this kind of code, let alone turn it into something that can be trusted and widely used across a large company.
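To make the maintenance problem concrete, here is a minimal sketch of the string-templating approach this paragraph describes. The table and column names are hypothetical:

```python
# Building SQL by string substitution: nothing validates the pieces.
QUERY_TEMPLATE = """
SELECT {investor_col},
       COUNT(DISTINCT {company_col}) AS num_investments
FROM {table}
GROUP BY {investor_col}
"""

def build_query(table, investor_col, company_col):
    # Nothing checks that these names exist in the database, are quoted
    # correctly, or stay consistent with aliases elsewhere in the template.
    return QUERY_TEMPLATE.format(
        table=table,
        investor_col=investor_col,
        company_col=company_col,
    )

sql = build_query('crunchbase_investments', 'investor_name', 'company_permalink')
```

Every call site now owns a fragment of an opaque SQL string, and a typo in any fragment only surfaces when the assembled query is finally sent to the engine.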

Over in the Python world, libraries such as pandas provide convenient and performant APIs for wrangling data. From joins to time series analysis, pandas makes it easy to do complex data wrangling and analysis from start to finish. One downside of pandas is that it becomes challenging to use once you need to distribute your computations.
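For a sense of that convenience, here is a toy pandas version (with made-up miniature data) of the join-and-aggregate pattern behind the acquisition-rate query above:

```python
import pandas as pd

# Made-up miniature stand-ins for the Crunchbase tables.
companies = pd.DataFrame({
    'permalink': ['a', 'b', 'c'],
    'status': ['ipo', 'operating', 'acquired'],
})
investments = pd.DataFrame({
    'company_permalink': ['a', 'a', 'c'],
    'investor_name': ['vc1', 'vc2', 'vc1'],
})

# Join companies to their investments, then count distinct companies
# per investor -- the num_investments piece of the SQL query.
joined = companies.merge(
    investments,
    left_on='permalink',
    right_on='company_permalink',
    how='left',
)
num_investments = joined.groupby('investor_name')['permalink'].nunique()
```

This reads naturally right up until the data outgrows a single machine.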

How Ibis helps you scale

Ibis is a Python analytics library designed to provide the convenience of pandas’ APIs with the scalability of analytic SQL engines like BigQuery. It does this in a type-safe way, letting you build analytics expressions that compile to SQL and run on your favorite large-scale SQL engine. When you execute Ibis expressions, they turn into pandas DataFrames, which gives you access to the ecosystem of Python data libraries once you don’t need the scale that distributed SQL provides.

With Ibis, you no longer need to think about certain classes of bugs like the aliasing problem we saw earlier. Invalid operations (like taking the sum of a string column) are simply not available in the API.

Passing large, templated SQL strings around is a thing of the past.

Another goal of Ibis is that the expressions it creates should be composable. In this scenario, composable means that expression methods return other expressions. The following Ibis code chains together expressions to compute the percentage of GitHub repositories created between the hours of midnight and 4 AM, relative to the time zone reported in the created_at field of the table:

Language: Python

  import ibis

  con = ibis.bigquery.connect(billing_project, 'bigquery-public-data.samples')
  t = con.table('github_nested')

  expr = (
      t.mutate(
          # get the hour in UTC during which a repo was created
          hour=lambda t: t.created_at
                          .to_timestamp('%Y/%m/%d %T %z')
                          .hour(),
          # compute the UTC offset to adjust in the next expression
          utc_offset=lambda t: t.created_at.split(' ')[2].cast('int64') // 100)

      # group by the adjusted hour, count and sort by descending count
      .groupby(hour=lambda t: t.hour + t.utc_offset)
      .count()
      .sort_by(ibis.desc('count'))

      # sum up the number of repos that were created between midnight and 4 AM
      # local time
      .aggregate(
          total=lambda t: t['count'].sum(),
          night_owl=lambda t: t['count'].sum(where=t.hour.between(0, 4)))

      # compute the percentage of repos created between midnight and 4 AM
      .mutate(night_owl_perc=lambda t: t.night_owl / t.total)
  )

  df = expr.execute()

Looking at df—the result of executing the expression—gives the following result:

  night_owl    total  night_owl_perc
0     422315  2541639        0.166159

This example shows the composability of Ibis and just how concisely you can write queries.

BigQuery and Ibis make an extremely powerful combination. The Ibis team closely collaborates with the folks working on the BigQuery Python interface, and we aim to keep it that way!

Check out the Ibis documentation and the community tutorial to get a feel for what it’s like to use Ibis.