Improve BigQuery ingestion times 10x by using Avro source format
Sam McVeety
Technical Lead, Google Cloud Platform
In the 1.5.0 release of the Dataflow SDK, the speed at which Google BigQuery data becomes available for processing has increased significantly: our internal benchmarks show that the BigQuery-reading segments of a pipeline execute as much as ten times faster, and the contrast is especially stark for an export-only pipeline.
Many users have asked how they can speed up the sections of their pipelines that ingest data from BigQuery, and we're happy to provide a method that does exactly that. Until now, ingestion speed has been limited by the file format in which data is exported from BigQuery. In prior releases of the SDK, tables and queries were made available to Dataflow as JSON-encoded objects in Google Cloud Storage. Because every such record has the same schema, this representation is extremely redundant: the schema is duplicated, in string form, in every record.
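To make that redundancy concrete, consider a hypothetical table with two fields, name and score (an illustrative schema, not one from our benchmarks). A newline-delimited JSON export repeats every field name, as a string, in every single row:

    {"name": "alice", "score": "98"}
    {"name": "bob", "score": "85"}
    {"name": "carol", "score": "91"}

For a wide table with long field names, this per-record overhead can easily exceed the size of the values themselves.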
In the 1.5.0 release, Dataflow instead uses the Avro file format to binary-encode and decode BigQuery data according to a single shared schema, stored once per file, so the size of each record is driven by its actual field values rather than by repeated schema strings. This efficiency is one reason we're so fond of Protocol Buffers (one of Apache Avro's many cousins) here at Google.
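No pipeline changes are required to pick up the improvement: the same BigQueryIO read you write today is backed by the Avro export under the hood, and records still arrive as TableRow objects. Here's a minimal sketch of such a read (the project, dataset, and table names are placeholders):

    import com.google.api.services.bigquery.model.TableRow;
    import com.google.cloud.dataflow.sdk.Pipeline;
    import com.google.cloud.dataflow.sdk.io.BigQueryIO;
    import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
    import com.google.cloud.dataflow.sdk.values.PCollection;

    public class ReadFromBigQuery {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(
            PipelineOptionsFactory.fromArgs(args).withValidation().create());

        // With SDK 1.5.0+, this read is served from an Avro export rather than
        // JSON; downstream code is unchanged because records still arrive as
        // TableRow objects.
        PCollection<TableRow> rows = p.apply(
            BigQueryIO.Read.named("ReadMyTable")
                .from("my-project:my_dataset.my_table"));

        p.run();
      }
    }

Reads based on a query rather than a table work the same way via BigQueryIO.Read.fromQuery(...).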
Naturally, a similar performance gain is available for BigQuery sinks, and the Dataflow and BigQuery teams look forward to releasing support for Avro encoding on the write path in a future release.
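For reference, here's a sketch of the existing sink API that such a future Avro-encoded path would presumably slot in beneath without user-visible changes (table and field names below are placeholders, and as of 1.5.0 the write path still uses JSON encoding):

    import com.google.api.services.bigquery.model.TableFieldSchema;
    import com.google.api.services.bigquery.model.TableRow;
    import com.google.api.services.bigquery.model.TableSchema;
    import com.google.cloud.dataflow.sdk.Pipeline;
    import com.google.cloud.dataflow.sdk.coders.TableRowJsonCoder;
    import com.google.cloud.dataflow.sdk.io.BigQueryIO;
    import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
    import com.google.cloud.dataflow.sdk.transforms.Create;
    import java.util.Arrays;

    public class WriteToBigQuery {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(
            PipelineOptionsFactory.fromArgs(args).withValidation().create());

        // Declare the destination schema up front; BigQuery creates the table
        // from it if needed.
        TableSchema schema = new TableSchema().setFields(Arrays.asList(
            new TableFieldSchema().setName("name").setType("STRING"),
            new TableFieldSchema().setName("score").setType("INTEGER")));

        p.apply(Create.of(
                new TableRow().set("name", "alice").set("score", 98),
                new TableRow().set("name", "bob").set("score", 85))
            .withCoder(TableRowJsonCoder.of()))
         .apply(BigQueryIO.Write.named("WriteResults")
             .to("my-project:my_dataset.my_output_table")
             .withSchema(schema)
             .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
             .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

        p.run();
      }
    }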