Keep Parquet and ORC from the data graveyard with new BigQuery features
Tino Tereshko
Product Manager, Google BigQuery
Parquet and ORC are popular columnar open source formats for large-scale data analytics. As you make your move to the cloud, you may want to use the power of BigQuery to analyze data stored in these formats. Choosing between keeping these files in Cloud Storage vs. loading your data into BigQuery can be a difficult decision, leading to your data platform looking more like a spooky data graveyard where data goes to disappear. However, it’s now possible to merge the worlds of the living and the undead: your old columnar-format files in Cloud Storage with BigQuery’s Standard SQL interface.
We’re pleased to announce that BigQuery has conjured up (OK, launched) beta support for querying Parquet and ORC file formats in Cloud Storage. This new feature joins other federated querying capabilities from within BigQuery, including storage systems such as Cloud Bigtable, Google Sheets, and Cloud SQL, as well as AVRO, CSV, and JSON file formats in Cloud Storage —all part of BigQuery’s commitment to building an open and accessible data warehouse.
Federated queries allow you to access real-time data from many different sources with one query, helping you do advanced analytics faster, thus bringing you the power of BigQuery analysis to your data, wherever it is. You don’t have to move any data, and you can be sure of the integrity of the data you’re querying—no evil twin copies lurking about. In addition, you can now query and load Hive partitioned tables stored in Cloud Storage from within BigQuery.
You’ll find that using these new features builds a bridge between your datasets and can help you be more flexible. Your data stays in your preferred open source formats in Cloud Storage and you can use BigQuery’s ANSI Standard SQL for analytics and data processing. This means:
BigQuery is able to take full advantage of the columnar nature of Parquet and ORC to efficiently project columns.
BigQuery’s support for understanding Hive Partitions scales to 10 levels of partitioning and millions of partition permutations.
BigQuery is able to efficiently prune partitions for Hive partitioned tables.
Using federated queries to avoid data graveyards
We were fortunate to have nearly two hundred customers participate in the alpha release of this feature, and their feedback and input was invaluable in the release and development process. In this blog post, you’ll hear about the early impact on three of those customers: Pandora, Truecaller, and Cardinal Health.
“At Pandora, we have petabytes of data spread across multiple Google Cloud storage services; accordingly, we expect BigQuery’s federated query capability to be a useful tool for integrating our diverse data assets into a unified analytics ecosystem,” says Greg Kurzhals, product manager at Pandora. “The support for Parquet and other external data source formats will give us the ability to choose the best underlying storage option for each use case, while still surfacing all our data within a centralized, BigQuery-based data lake optimized for analytics and insights.” Gaining this flexibility eliminated some difficult architectural trade-offs, helping to simplify the design process and ultimately facilitate the creation of an efficient, accessible data structure in the cloud for the music services company.
When Cardinal Health started their journey to the cloud, they chose a lift-and-shift strategy, migrating all of their Hadoop jobs to run in Cloud Dataproc. “We also wanted to leverage cloud-native options like BigQuery but without necessarily rewriting our entire ingestion pipeline,” says Ken Flannery, senior enterprise architect at Cardinal Health. “We needed a quick and cost-effective way to allow our users the flexibility of using different compute options (BigQuery or Hive) without necessarily sacrificing performance or data integrity. Adding ORC federation support to BigQuery was exactly what we needed and was timed perfectly for our migration.”
As soon as Cardinal Health started migrating jobs to Cloud Dataproc, the same datasets that users were already querying from Cloud Dataproc were now simultaneously available to them in BigQuery. “ORC federation helped us take advantage of BigQuery much sooner than otherwise possible and gave us the needed flexibility of choosing when and how much of BigQuery we would use,” says Flannery.
Software company Truecaller was using Hive/Spark to query data before it tested external table support on the columnar format—but it was slower and cost twice as much. They were working on onboarding teams to BigQuery quickly, so they decided to try external tables vs. managed tables. “We were impressed by how convenient it was: There is zero setup cost and it is incredibly simple,” says Juliana Araújo, data product manager at Truecaller. “All we had to do was set the Cloud Storage URL path to our data and make a permanent table that references the data source. Now we can have our EDWH and data lake under the same stack.
The greatest benefit of using BigQuery external tables for Truecaller is that it provides unprecedented opportunity to do ad-hoc analysis on enormous datasets that we don’t want to store in BigQuery and are too big for usual Hadoop processing.” This has saved hours of time for Truecaller. For example, in one use case, querying external tables was 30 times faster than querying Hive/Spark in the Truecaller data platform.
With the release of querying Parquet and ORC files in Cloud Storage, you can continue to use Cloud Storage as your storage system and take advantage of BigQuery’s data processing capabilities. Moreover, BigQuery’s managed storage is able to provide a higher level of automation, performance, security, and capability—something to consider as you move forward.
Loading Hive partitioned data into BigQuery
In addition to the native functionality provided by BigQuery, you may take advantage of the convenient command-line open source utility Hive External Table Loader for BigQuery. This utility aids in loading Hive partitioned data into BigQuery.
You may want to use this tool if:
Your Hive partitioned data does not have a default Hive partitioned layout encoding all partition keys and values
Your Hive partitioned data does not share a common source URI prefix for all URIs and requires metastore for partition locations
Your Hive partitioned data relies on metastore positional column matching for schema detection
Commitment to open data warehousing
BigQuery’s original columnar file format ColumnIO inspired the open source ecosystem to develop open columnar file formats, including Parquet. Today, dozens of exabytes are stored in Parquet across organizations of all shapes and sizes. This data format has come full circle: Parquet is now a first-class citizen of the BigQuery ecosystem. We’re pleased to be able to continue our commitment to open source with this integration.
“In 2012, I worked on a side project that was going to become the basis for Apache Parquet: I implemented the column-striping algorithm from ColumnIO based on the Dremel paper,” says Julien Le Dem, vice president, Apache Parquet. “At the time, Google had recently made that technology available through BigQuery. I didn’t imagine that one day they would support Parquet, integrating the work of its contributors. That’s the magic of open source!”
Learn more about staying in the land of the living with BigQuery
For more information and practical examples on how to take advantage of Parquet, ORC, and Hive partitioned data, head over to the documentation. As always, you can try BigQuery with our free perpetual tier of 1TB of data processed and 10GB of data stored per month. Keep your data well away from the land of the undead with our rich ecosystem across different file formats and storage types.