Google Cloud Big Data and Machine Learning Blog

Innovation in data processing and machine learning technology

Improve BigQuery ingestion times 10x by using Avro source format

Tuesday, March 15, 2016

In the 1.5.0 release of the Dataflow SDK, you'll find that the speed at which Google BigQuery data is available for processing has significantly increased. In fact, our internal benchmarks show that these segments of a pipeline execute as much as ten times faster. The contrast is pretty stark for an export-only pipeline, as the following diagram illustrates:

Many users have asked how they can speed up the sections of their pipelines that ingest data from BigQuery, and we're happy to be able to provide a method that does exactly that. The ingestion speed has, to this point, been dependent upon the file format that we export from BigQuery. In prior releases of the SDK, tables and queries were made available to Dataflow as JSON-encoded objects in Google Cloud Storage. Considering that every such entry has the same schema, this representation is extremely redundant, essentially duplicating the schema, in string form, for every record.

In the 1.5.0 release, Dataflow uses the Avro file format to binary-encode and decode BigQuery data according to a single shared schema. This reduces the size of each individual record to correspond to the actual field values. This efficiency is one reason we're so fond of Protocol Buffers (one of Apache Avro's many cousins) here at Google.

Naturally, there's a similar performance gain to be had for BigQuery sinks, and the Dataflow and BigQuery teams are excited to release support for Avro encoding in the future.

Posted by Sam McVeety, Technical Lead, Google Cloud Platform

  • Big Data Solutions

  • Product deep dives, technical comparisons, how-to's and tips and tricks for using the latest data processing and machine learning technologies.

  • Learn More

12 Months FREE TRIAL

Try BigQuery, Machine Learning and other cloud products and get $300 free credit to spend over 12 months.


Monitor your resources on the go

Get the Google Cloud Console app to help you manage your projects.