Ibis and BigQuery: scalable analytics with the comfort of Python
Hedge funds have long sought to build increasingly robust and accurate quantitative models. Today, live or up-to-date third-party datasets provide a wealth of information that needs to be correctly analyzed at a scale that has traditionally required custom solutions and many person-hours of work. On top of that, sharing these analyses is critical for avoiding duplicate work.
This blog post demonstrates how Ibis works with BigQuery to provide an elegant and flexible Python interface for composing SQL queries.
As an example, consider the following SQL query that computes the acquisition rate of a number of different companies. Here the acquisition rate is defined as the number of acquisitions (or IPOs) divided by the number of investments a particular venture capital (VC) firm has made. It’s a complex query that proves to be as difficult to debug as it is to compose:
```sql
SELECT CASE WHEN i.investor_name IS NULL THEN 'NO INVESTOR'
            ELSE i.investor_name
       END AS investor_name,
       COUNT(DISTINCT c.permalink) AS num_investments,
       COUNT(DISTINCT
         CASE WHEN c.status IN ('ipo', 'acquired') THEN c.permalink
         END) AS acq_ipos,
       acq_ipos / num_investments AS acq_rate
FROM crunchbase_companies
LEFT JOIN crunchbase_investments
  ON c.permalink = i.company_permalink
GROUP BY 1
ORDER BY 2 DESC
```
There’s a bug in this query, and it’s pretty difficult to spot. Spoiler alert: the tables in the FROM clause are missing the aliases (`c` and `i`) that are referenced elsewhere in the query.
The error in this query is difficult to spot because it amounts to a few missing characters in a sea of text. SQL has evolved through years of engineering to take on more capabilities with more syntax; while it has grown more powerful, it has also grown in complexity.
However, SQL isn’t going anywhere anytime soon. There are some extremely powerful analytic SQL engines available, including Apache Hive, Cloudera Impala, Amazon Redshift, and, of course, Google BigQuery. The common thread among all of these systems is SQL.
One of the downsides of using SQL is that it’s a closed system: you can’t easily use another language to solve the same problem at the same scale in the same amount of time. This leads to code that passes opaque SQL strings around, often to the chagrin of the folks reading and writing it. Imagine the query above but with template parameters and additional string processing. It quickly becomes hard to manage this kind of code, let alone turn it into something that can be trusted and widely used across a large company.
Over in the Python world, libraries such as pandas provide convenient and performant APIs for wrangling data. From joins to time series analysis, pandas makes it convenient to do complex data wrangling and analysis from start to finish. One downside of pandas is that it’s challenging to use as you encounter the need to distribute your computations.
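As a quick illustration of that convenience, here is a minimal pandas sketch (with hypothetical miniature data, not the real crunchbase tables) of the same join-then-aggregate pattern the SQL query above performs:

```python
import pandas as pd

# Hypothetical, tiny stand-ins for the crunchbase tables in the SQL example.
companies = pd.DataFrame({
    "permalink": ["a", "b", "c"],
    "status": ["ipo", "operating", "acquired"],
})
investments = pd.DataFrame({
    "company_permalink": ["a", "b", "c"],
    "investor_name": ["vc1", "vc1", "vc2"],
})

# Join companies to their investors, then compute each investor's
# acquisition rate: the fraction of its investments that reached
# 'ipo' or 'acquired' status.
joined = companies.merge(
    investments, left_on="permalink", right_on="company_permalink", how="left")
acq_rate = (
    joined.assign(acq=joined["status"].isin(["ipo", "acquired"]))
          .groupby("investor_name")["acq"]
          .mean())
```

With this data, `acq_rate` maps `vc1` to 0.5 (one of its two investments was acquired or went public) and `vc2` to 1.0.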
How Ibis helps you scale
Ibis is a Python analytics library designed to provide the convenience of pandas’ APIs with the scalability of analytic SQL engines like BigQuery. It does this in a type-safe way, letting you build analytics expressions that compile to SQL and run on your favorite large-scale SQL engine. When you execute Ibis expressions, they turn into pandas DataFrames, giving you access to the ecosystem of Python data libraries once you no longer need the scale that a distributed SQL engine provides.
With Ibis, you no longer need to think about certain classes of bugs like the aliasing problem we saw earlier. Invalid operations (like taking the sum of a string column) are simply not available in the API.
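To see what “not available in the API” means in practice, here is a toy sketch (not Ibis internals, just an illustration of the idea) of how typed expression wrappers rule out invalid operations: only the numeric column type exposes `sum()` at all, so trying to sum a string column fails before any SQL is ever generated:

```python
# Toy illustration of type-safe expression building (NOT Ibis's actual
# implementation): each column type only exposes operations that make
# sense for it.
class StringColumn:
    def __init__(self, name):
        self.name = name

    def length(self):
        # A string operation returns an integer expression.
        return IntegerColumn(f"length({self.name})")


class IntegerColumn:
    def __init__(self, name):
        self.name = name

    def sum(self):
        # A numeric aggregation returns another numeric expression.
        return IntegerColumn(f"sum({self.name})")


n = IntegerColumn("num_investments")
s = StringColumn("investor_name")

total = n.sum()          # fine: builds a sum(num_investments) expression
hasattr(s, "sum")        # False: a string column simply has no sum() method
```

The invalid operation isn’t caught at query time by the database; it can’t even be written down, because the method doesn’t exist on that type.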
Another goal of Ibis is that the expressions it creates should be composable. In this scenario, composable means that expression methods return other expressions. The following Ibis code chains together expressions to compute the percentage of GitHub repositories created between the hours of midnight and 4 AM, relative to the time zone reported in the `created_at` field of the table:
```python
import ibis

# billing_project is your GCP project ID used for billing.
con = ibis.bigquery.connect(billing_project, 'bigquery-public-data.samples')
t = con.table('github_nested')

expr = (
    # get the hour in UTC during which a repo was created
    t.mutate(hour=lambda t: t.created_at
                             .to_timestamp('%Y/%m/%d %T %z')
                             .hour(),
             # compute the UTC offset to adjust in the next expression
             utc_offset=lambda t: t.created_at.split(' ')[2].cast('int64') // 100)
    # group by the adjusted hour, count and sort by descending count
    .groupby(hour=lambda t: t.hour + t.utc_offset)
    .count()
    .sort_by(ibis.desc('count'))
    # sum up the number of repos that were created between midnight and 4 AM
    # local time
    .aggregate(total=lambda t: t['count'].sum(),
               night_owl=lambda t: t['count'].sum(where=t.hour.between(0, 4)))
    # compute the percentage of repos created between midnight and 4 AM
    .mutate(night_owl_perc=lambda t: t.night_owl / t.total))

df = expr.execute()
```
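The offset arithmetic inside the expression can be sanity-checked with plain Python. Assuming `created_at` values look like `'2012/03/11 09:01:57 -0800'` (an assumption about the data’s format, based on the format string in the query), splitting on spaces and integer-dividing by 100 converts the textual offset into whole hours:

```python
# Plain-Python sketch of the utc_offset arithmetic in the Ibis expression.
# The sample timestamp string is an assumed example of the data's format.
created_at = "2012/03/11 09:01:57 -0800"

offset_str = created_at.split(" ")[2]   # '-0800'
utc_offset = int(offset_str) // 100     # -800 // 100 == -8 hours
```

Note that integer division by 100 deliberately keeps only whole hours, so sub-hour offsets such as `+0530` lose their minutes component.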
`df`, the result of executing the expression, looks like this:

```
   night_owl    total  night_owl_perc
0     422315  2541639        0.166159
```
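The percentage column is just the ratio of the two counts, which we can verify directly from the numbers in the result:

```python
# Verify night_owl_perc from the two counts in the result above.
night_owl, total = 422315, 2541639
night_owl_perc = night_owl / total
round(night_owl_perc, 6)  # 0.166159
```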
This example shows the composability of Ibis and just how concisely you can write queries.
Together, BigQuery and Ibis are an extremely powerful combination. The Ibis team collaborates closely with the folks working on the BigQuery Python interface, and we aim to keep it that way!