This page provides an overview of querying data stored outside of BigQuery.
An external data source (also known as a federated data source) is a data source that you can query directly even though the data is not stored in BigQuery. Instead of loading or streaming the data, you create a table that references the external data source.
BigQuery offers support for querying data directly from:
Supported formats are:
- JSON (newline delimited only)
Use cases for external data sources include:
- Loading and cleaning your data in one pass by querying the data from an external data source (a location external to BigQuery) and writing the cleaned result into BigQuery storage.
- Having a small amount of frequently changing data that you join with other tables. As an external data source, the frequently changing data does not need to be reloaded every time it is updated.
External data source limitations
External data source limitations include the following:
- BigQuery does not guarantee data consistency for external data sources. Changes to the underlying data while a query is running can result in unexpected behavior.
- Query performance for external data sources may not be as high as querying data in a native BigQuery table. If query speed is a priority, load the data into BigQuery instead of setting up an external data source. The performance of a query that includes an external data source depends on the external storage type. For example, querying data stored in Cloud Storage is faster than querying data stored in Google Drive. In general, query performance for external data sources should be equivalent to reading the data directly from the external storage.
You cannot use the
TableDataListJSON API method to retrieve data from tables that reside in an external data source. For more information, see
To work around this limitation, you can save query results in a destination table. You can then use the
TableDataListmethod on the results table.
You cannot run a BigQuery job that exports data from an external data source.
To work around this limitation, you can save query results in a destination table. You can then run an export job against the results table.
You cannot reference an external data source in a wildcard table query.
External data sources support table partitioning or clustering in limited ways. For details, see Querying externally partitioned data.
When you query an external data source, the results are not cached. You are charged for each query against an external table even if you issue the same query multiple times. If you need to repeatedly issue a query against an external table that does not change frequently, consider writing the query results to a permanent table and run the queries against the permanent table instead.
- Querying data in Cloud Bigtable is currently only available in the following regions and zones:
Region Zone(s) us-central1
You are limited to 4 concurrent queries against a Cloud Bigtable external data source.
When you choose a location for your data, consider the following:
- Colocate your BigQuery dataset and your external data source.
- When you query data in an external data source such as Cloud Storage, the data you're querying must be in the same location as your BigQuery dataset. For example, if your BigQuery dataset is in the EU multi-regional location, the Cloud Storage bucket containing the data you're querying must be in a multi-regional bucket in the EU. If your dataset is in the US multi-regional location, your Cloud Storage bucket must be in a multi-regional bucket in the US.
- If your dataset is in a regional location, the Cloud Storage bucket containing the data you're querying must be in a regional bucket in the same location. For example, if your dataset is in the Tokyo region, your Cloud Storage bucket must be a regional bucket in Tokyo.
- If your external dataset is in Cloud Bigtable, your dataset must be in the US or the EU multi-regional location. Your Cloud Bigtable data must be in one of the supported Cloud Bigtable locations.
- Location considerations do not apply to Google Drive external data sources.
- Develop a data management plan.
- If you choose a regional storage resource such as a BigQuery dataset or a Cloud Storage bucket, develop a plan for geographically managing your data.
For more information on Cloud Storage locations, see Bucket Locations in the Cloud Storage documentation.
To manually move a dataset from one location to another, follow this process:
Export the data from your BigQuery tables to a regional or multi-region Cloud Storage bucket in the same location as your dataset. For example, if your dataset is in the EU multi-region location, export your data into a regional or multi-region bucket in the EU.
Copy or move the data from your Cloud Storage bucket to a regional or multi-region bucket in the new location. For example, if you are moving your data from the US multi-region location to the Tokyo regional location, you would transfer the data to a regional bucket in Tokyo. For information on transferring Cloud Storage objects, see Renaming, copying, and moving objects in the Cloud Storage documentation.
Note that transferring data between regions incurs network egress charges in Cloud Storage.
After you transfer the data to a Cloud Storage bucket in the new location, create a new BigQuery dataset (in the new location). Then, load your data from the Cloud Storage bucket into BigQuery.
You are not charged for loading the data into BigQuery, but you will incur charges for storing the data in Cloud Storage until you delete the data or the bucket. You are also charged for storing the data in BigQuery after it is loaded. Loading data into BigQuery is subject to the limits on load jobs.
You can also use Cloud Composer to move and copy large datasets programmatically.
For more information on using Cloud Storage to store and move large datasets, see Using Cloud Storage with big data.
External data source pricing
When querying an external data source from BigQuery, you are charged for the number of bytes read by the query. For more information, see Query pricing and Querying data stored on Cloud Storage pricing.
If your data is stored in ORC or Parquet on Cloud Storage, see Querying columnar formats on Cloud Storage.
You are also charged for storing the data and any resources used by the source application, subject to the application's pricing guidelines:
- For information on Cloud Storage pricing, see Cloud Storage Pricing.
- For information on Cloud Bigtable pricing, see Pricing.
- For information on Google Drive pricing, see Pricing.