Google Cloud Platform

Accessing external (federated) data sources with BigQuery’s data access layer

Although BigQuery is a fully-managed query service for massive datasets, you might not realize you can use the data access layer in BigQuery without moving your data to a BigQuery dataset.

BigQuery has a federated data access model that allows for querying data directly from Bigtable, Cloud Storage, and Google Drive with permanent and temporary tables. If you have data in multiple Google Cloud Platform services and are building a data lake or data warehouse strategy, you might want to learn more about this capability.

While the storage costs in BigQuery are comparable to regional per GB charges in Cloud Storage, there may be cases where you want to consider querying an external data source instead of importing and storing data in BigQuery. External Data Sources can be useful for avoiding duplicate copies of data, single or ad-hoc queries for data that is not intended to be loaded or streamed, or reading data in one pass from a source.

External data sources may fit your data lake strategy

There are cases in data lake architectures where external federated data sources may make sense. Traditionally we find that data lakes are comprised of raw, un/semi-structured, and schema on read.

Corner cases for external data sources:

  • Avoiding duplicate data in BigQuery storage
  • Queries that do not have strong performance requirements
  • Small amount of frequently changing data to join with other tables in BigQuery
  • Irregular, non scheduled, non shared queries

From BigQuery to Google Cloud Storage

There are two ways to use BigQuery to query files in a Cloud Storage bucket in the same project; by querying a temporary or permanent table.

  • Permanent table—You create a table in a BigQuery dataset that is linked to your external data source. This allows you to use BigQuery dataset-level IAM roles to share the table with others who may have access to the underlying external data source. Use permanent tables when you need to share the table with others.
  • Temporary table—You submit a command that includes a query and creates a non-permanent table linked to the external data source. With this approach you do not create a table in one of your BigQuery datasets, so make sure to give consideration towards sharing the query or table. Consider using a temporary table for one-time, ad-hoc queries, or for one time extract, transform, or load (ETL) workflows
Today we support querying AVRO, CSV, or JSON, row oriented formats from BigQuery to Cloud Storage. Remember, you will not receive the benefits of columnar storage by choosing this approach.

Query a temporary table linked to an external data source with a wildcard Cloud Storage URI

Language: lang-sh

  $ BigQuery query --external_table_definition=healthwatch::date:DATETIME,bpm:INTEGER,sleep:STRING,type:STRING@CSV=gs://healthwatch2/healthwatchdetail*.csv 'SELECT date,bpm,type FROM healthwatch WHERE type = "elevated" and bpm > 150;'Waiting on BigQueryjob_r5770d3fba8d81732_00000162ad25a6b8_1 ... (0s) Current status: DONE   +---------------------+-----+----------+|        date    | bpm | type   |+---------------------+-----+----------+| 2018-02-07T11:14:44 | 186 | elevated || 2018-02-07T11:14:49 | 184 | elevated |+---------------------+-----+----------+

As you can see in the above screenshot, we’re using the BigQuery command line to query multiple CSVs in a Cloud Storage bucket. We give the table a name, define the schema, identify the external data source, and compose the query in the above snippet.

Why choose to store data in BigQuery tables over an external (federated) data source?

  1. Storage affinity, compute affinity, and speed. We do have regional requirements for federated data, but reading cross-continent is still much slower. If you have a massive dataset and need a fast response, store your data in BigQuery storage.
  2. Transactional guarantees. If the data is changing a lot in the external source, there is no guarantee that BigQuery wont read partially correct data. If the data is deleted, queries will fail.
  3. You need DML (data manipulation language) support. So if you need to do updates, inserts, deletes, external queries are not for you. Federated queries are strictly read only.
  4. You need DDL (data definition language) support. You cannot create a table or modify schema with external data sources since you are essentially only linking a table to the external source.
In conclusion, you can use BigQuery External Data Sources to take advantage of Google Cloud’s computational infrastructure to query row-oriented data in Cloud Storage, Bigtable, and Drive. Consider external data sources when building your data lake strategy with BigQuery.