Ibis and BigQuery: scalable analytics with the comfort of Python
Hedge funds have long sought to build increasingly robust and accurate quantitative models. Today, live or up-to-date third-party datasets provide a wealth of information that needs to be correctly analyzed at a scale that has traditionally required custom solutions and many person-hours of work. On top of that, sharing these analyses is critical for avoiding duplicate work.
This blog post demonstrates how Ibis works with BigQuery to provide an elegant and flexible Python interface for composing SQL queries.
As an example, consider the following SQL query that computes the acquisition rate of a number of different companies. Here the acquisition rate is defined as the number of acquisitions (or IPOs) divided by the number of investments a particular venture capital (VC) firm has made. It’s a complex query that proves to be as difficult to debug as it is to compose:
```sql
SELECT CASE WHEN i.investor_name IS NULL THEN 'NO INVESTOR'
            ELSE i.investor_name
       END AS investor_name,
       COUNT(DISTINCT c.permalink) AS num_investments,
       COUNT(DISTINCT
         CASE WHEN c.status IN ('ipo', 'acquired') THEN c.permalink
         END) AS acq_ipos,
       acq_ipos / num_investments AS acq_rate
FROM crunchbase_companies
LEFT JOIN crunchbase_investments
  ON c.permalink = i.company_permalink
GROUP BY 1
ORDER BY 2 DESC
```
There’s a bug in this query, and it’s pretty difficult to spot. Spoiler alert: the tables in the FROM clause are missing the aliases (`c` and `i`) that are referenced elsewhere in the query.
The error in this query is difficult to spot because it amounts to a few missing characters in a sea of text. SQL has evolved through years of engineering to take on more capabilities with more syntax; while it has grown more powerful, it has also grown in complexity.
However, SQL isn’t going anywhere anytime soon. There are some extremely powerful analytic SQL engines available, including Apache Hive, Cloudera Impala, Amazon Redshift, and, of course, Google BigQuery. The common thread among all of these systems is SQL.
One of the downsides of using SQL is that it’s a closed system: you can’t easily use another language to solve the same problem at the same scale in the same amount of time. This leads to code that passes opaque SQL strings around, often to the chagrin of the folks reading and writing it. Imagine the query above but with template parameters and additional string processing. It quickly becomes hard to manage this kind of code, let alone turn it into something that can be trusted and widely used across a large company.
Over in the Python world, libraries such as pandas provide convenient and performant APIs for wrangling data. From joins to time series analysis, pandas makes it convenient to do complex data wrangling and analysis from start to finish. One downside of pandas is that it’s challenging to use as you encounter the need to distribute your computations.
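As a quick illustration of that convenience, here is a minimal pandas sketch (with hypothetical miniature data, not the real crunchbase tables) of the same join-then-aggregate pattern the SQL query above performs:

```python
import pandas as pd

# Hypothetical, tiny stand-ins for the crunchbase tables in the SQL example.
companies = pd.DataFrame({
    "permalink": ["a", "b", "c"],
    "status": ["ipo", "operating", "acquired"],
})
investments = pd.DataFrame({
    "company_permalink": ["a", "b", "c"],
    "investor_name": ["vc1", "vc1", "vc2"],
})

# Join companies to their investors, then compute each investor's
# acquisition rate: the fraction of its investments that reached
# 'ipo' or 'acquired' status.
joined = companies.merge(
    investments, left_on="permalink", right_on="company_permalink", how="left")
acq_rate = (
    joined.assign(acq=joined["status"].isin(["ipo", "acquired"]))
          .groupby("investor_name")["acq"]
          .mean())
```

With this data, `acq_rate` maps `vc1` to 0.5 (one of its two investments was acquired or went public) and `vc2` to 1.0.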
How Ibis helps you scale
Ibis is a Python analytics library designed to provide the convenience of pandas’ APIs with the scalability of analytic SQL engines like BigQuery. It does this in a type-safe way, letting you build analytics expressions that compile to SQL and run on your favorite large-scale SQL engine. When you execute Ibis expressions, they turn into pandas DataFrames, giving you access to the ecosystem of Python data libraries once you no longer need the scale that a distributed SQL engine provides.
With Ibis, you no longer need to think about certain classes of bugs like the aliasing problem we saw earlier. Invalid operations (like taking the sum of a string column) are simply not available in the API.
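To see what “not available in the API” means in practice, here is a toy sketch (not Ibis internals, just an illustration of the idea) of how typed expression wrappers rule out invalid operations: only the numeric column type exposes `sum()` at all, so trying to sum a string column fails before any SQL is ever generated:

```python
# Toy illustration of type-safe expression building (NOT Ibis's actual
# implementation): each column type only exposes operations that make
# sense for it.
class StringColumn:
    def __init__(self, name):
        self.name = name

    def length(self):
        # A string operation returns an integer expression.
        return IntegerColumn(f"length({self.name})")


class IntegerColumn:
    def __init__(self, name):
        self.name = name

    def sum(self):
        # A numeric aggregation returns another numeric expression.
        return IntegerColumn(f"sum({self.name})")


n = IntegerColumn("num_investments")
s = StringColumn("investor_name")

total = n.sum()          # fine: builds a sum(num_investments) expression
hasattr(s, "sum")        # False: a string column simply has no sum() method
```

The invalid operation isn’t caught at query time by the database; it can’t even be written down, because the method doesn’t exist on that type.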
Another goal of Ibis is that the expressions it creates should be composable. In this scenario, composable means that expression methods return other expressions. The following Ibis code chains together expressions to compute the percentage of GitHub repositories created between the hours of midnight and 4 AM, relative to the time zone reported in the `created_at` field of the table:
```python
import ibis

# billing_project is your GCP project ID used for billing.
con = ibis.bigquery.connect(billing_project, 'bigquery-public-data.samples')
t = con.table('github_nested')

expr = (
    # get the hour in UTC during which a repo was created
    t.mutate(hour=lambda t: t.created_at
                             .to_timestamp('%Y/%m/%d %T %z')
                             .hour(),
             # compute the UTC offset to adjust in the next expression
             utc_offset=lambda t: t.created_at.split(' ')[2].cast('int64') // 100)
    # group by the adjusted hour, count and sort by descending count
    .groupby(hour=lambda t: t.hour + t.utc_offset)
    .count()
    .sort_by(ibis.desc('count'))
    # sum up the number of repos that were created between midnight and 4 AM
    # local time
    .aggregate(total=lambda t: t['count'].sum(),
               night_owl=lambda t: t['count'].sum(where=t.hour.between(0, 4)))
    # compute the percentage of repos created between midnight and 4 AM
    .mutate(night_owl_perc=lambda t: t.night_owl / t.total))

df = expr.execute()
```
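The offset arithmetic inside the expression can be sanity-checked with plain Python. Assuming `created_at` values look like `'2012/03/11 09:01:57 -0800'` (an assumption about the data’s format, based on the format string in the query), splitting on spaces and integer-dividing by 100 converts the textual offset into whole hours:

```python
# Plain-Python sketch of the utc_offset arithmetic in the Ibis expression.
# The sample timestamp string is an assumed example of the data's format.
created_at = "2012/03/11 09:01:57 -0800"

offset_str = created_at.split(" ")[2]   # '-0800'
utc_offset = int(offset_str) // 100     # -800 // 100 == -8 hours
```

Note that integer division by 100 deliberately keeps only whole hours, so sub-hour offsets such as `+0530` lose their minutes component.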
`df`, the result of executing the expression, looks like this:

```
   night_owl    total  night_owl_perc
0     422315  2541639        0.166159
```
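The percentage column is just the ratio of the two counts, which we can verify directly from the numbers in the result:

```python
# Verify night_owl_perc from the two counts in the result above.
night_owl, total = 422315, 2541639
night_owl_perc = night_owl / total
round(night_owl_perc, 6)  # 0.166159
```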
This example shows the composability of Ibis and just how concisely you can write queries.
Together, BigQuery and Ibis are an extremely powerful combination. The Ibis team collaborates closely with the folks working on the BigQuery Python interface, and we aim to keep it that way!