Introduction to BigLake tables
This document provides an overview of BigLake and assumes familiarity with database tables and Identity and Access Management (IAM). To query data stored in the supported data stores, you must first create BigLake tables and then query them using GoogleSQL syntax:
- Create Cloud Storage BigLake tables and then query.
- Create Amazon S3 BigLake tables and then query.
- Create Azure Blob Storage BigLake tables and then query.
You can also upgrade an external table to BigLake. For more information, see Upgrade an external table to BigLake.
BigLake tables let you query structured data in external data stores with access delegation. Access delegation decouples access to the BigLake table from access to the underlying data store. An external connection associated with a service account is used to connect to the data store. Because the service account handles retrieving data from the data store, you only have to grant users access to the BigLake table. This lets you enforce fine-grained security at the table level, including row-level and column-level security. For BigLake tables based on Cloud Storage, you can also use dynamic data masking. To learn more about multi-cloud analytic solutions using BigLake tables with Amazon S3 or Blob Storage data, see BigQuery Omni.
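For example, the following sketch shows roughly how a Cloud Storage BigLake table might be created over Parquet files through a connection; the dataset, connection, and bucket names are placeholders, not names from this document.

    CREATE EXTERNAL TABLE mydataset.sales_biglake
    WITH CONNECTION `us.my-gcs-connection`
    OPTIONS (
      -- Placeholder format and bucket path; BigLake reads these files
      -- through the connection's service account, not the user's credentials.
      format = 'PARQUET',
      uris = ['gs://my-bucket/sales/*.parquet']
    );

Users who query this table need access only to the table itself; the connection's service account handles access to the bucket.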
Supported data stores
You can use BigLake tables with the following data stores:
- Amazon S3 by using BigQuery Omni
- Blob Storage by using BigQuery Omni
- Cloud Storage
Temporary table support
BigLake tables based on Cloud Storage can be temporary or permanent. BigLake tables based on Amazon S3 or Blob Storage must be permanent.
Multiple source files
You can create a BigLake table based on multiple external data sources, provided those data sources have the same schema.
Joins
BigLake tables based on Cloud Storage can be joined to other BigQuery tables, subject to location considerations. BigLake tables based on Amazon S3 or Blob Storage can only be joined with other BigLake tables in the same location.
Connectors
You can access data in BigLake tables based on Cloud Storage from other data processing tools by using BigQuery connectors. For example, you could access data in BigLake tables from Apache Spark, Apache Hive, TensorFlow, Trino, or Presto. The BigQuery Storage API enforces row- and column-level governance policies on all data access to BigLake tables, including through connectors.
For example, users can access authorized data through the BigQuery Storage API by using open source query engines such as Apache Spark.
BigLake tables on object stores
For data lake administrators, BigLake lets you set access controls on tables rather than files, which gives you finer-grained options when setting user access to data in the data lake.
Because BigLake tables simplify access control in this way, we recommend using them to build and maintain connections to external object stores.
You can use external tables in cases where governance is not a requirement, or for ad hoc data discovery and manipulation.
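As a hedged sketch of table-level governance, the following example adds a row access policy to a hypothetical BigLake table so that one group sees only rows for its region; the policy, table, column, and group names are assumptions.

    -- Hypothetical policy: members of the group see only rows where region = 'US'.
    CREATE ROW ACCESS POLICY us_rows_only
    ON mydataset.sales_biglake
    GRANT TO ('group:us-sales@example.com')
    FILTER USING (region = 'US');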
Limitations
- All limitations for external tables apply to BigLake tables.
- BigLake tables on object stores are subject to the same limitations as BigQuery tables. For more information, see Quotas.
- BigLake does not support downscoped credentials from Dataproc Personal Cluster Authentication. As a workaround, to use clusters with Personal Cluster Authentication, you must inject your credentials using an empty Credential Access Boundary with the --access-boundary=<(echo -n "{}") flag. For example, the following command enables a credential propagation session in a project named myproject for the cluster named mycluster:

      gcloud dataproc clusters enable-personal-auth-session \
          --region=us \
          --project=myproject \
          --access-boundary=<(echo -n "{}") \
          mycluster
- BigLake tables are read-only. You cannot modify BigLake tables using DML statements or other methods.
- BigLake tables support the following formats:
  - Avro
  - CSV
  - Iceberg
  - JSON
  - ORC
  - Parquet
- You can only use cached metadata with BigLake tables that use Parquet, JSON, and CSV formats.
- The BigQuery Storage API is not available in other cloud environments, such as AWS and Azure.
Security model
The following organizational roles are typically involved in managing and using BigLake tables:
- Data lake administrators. These administrators typically manage Identity and Access Management (IAM) policies on Cloud Storage buckets and objects.
- Data warehouse administrators. These administrators typically create, delete, and update tables.
- Data analysts. Analysts typically read data and run queries.
Data lake administrators are responsible for creating connections and sharing them with data warehouse administrators. In turn, data warehouse administrators create tables, set appropriate access controls, and share the tables with data analysts.
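For example, a data warehouse administrator might share a BigLake table with analysts by using GoogleSQL DCL, roughly as in the following sketch; the table name and principal are placeholders.

    -- Grant read access on the table only; analysts don't need access
    -- to the underlying Cloud Storage objects.
    GRANT `roles/bigquery.dataViewer`
    ON TABLE mydataset.sales_biglake
    TO 'user:analyst@example.com';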
Metadata caching for performance
You can use cached metadata to improve query performance on BigLake tables. Metadata caching is especially helpful when you are working with large numbers of files or when the data is Hive partitioned.
This feature is in preview. For feedback or questions on this feature, email biglake-help@google.com.
BigLake and object tables support caching metadata about files from Cloud Storage. The metadata includes file names, partitioning information, and physical metadata from files such as row counts. You can choose whether or not to enable metadata caching on a table. Queries with a large number of files and with Hive partition filters benefit the most from metadata caching.
If you don't enable metadata caching, queries on the table must read Cloud Storage to get object metadata, which increases query latency; listing millions of files from Cloud Storage can take several minutes. If you enable metadata caching, queries can avoid listing files from Cloud Storage and achieve faster partition and file pruning.
There are two properties that control this feature:
- Maximum staleness, which controls when queries use cached metadata.
- Metadata cache mode, which controls how the metadata is collected.
When you have metadata caching enabled, you specify the maximum interval of metadata staleness that is acceptable for operations against the table. For example, if you specify an interval of 1 hour, then operations against the table use cached metadata if it has been refreshed within the past hour. If the cached metadata is older than that, the operation falls back to retrieving metadata from Cloud Storage instead. You can specify a staleness interval between 30 minutes and 7 days.
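For example, the following sketch enables metadata caching on a hypothetical existing BigLake table with a one-hour staleness interval; the table name is a placeholder, and the cache mode shown here is described next.

    ALTER TABLE mydataset.sales_biglake
    SET OPTIONS (
      -- Use cached metadata if it was refreshed within the past hour.
      max_staleness = INTERVAL 1 HOUR,
      -- Let BigQuery refresh the cache on its own schedule.
      metadata_cache_mode = 'AUTOMATIC'
    );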
You can choose to refresh the cache either automatically or manually:
- For automatic refreshes, the cache is refreshed at a system defined interval, usually somewhere between 30 and 60 minutes. Refreshing the cache automatically is a good approach if the files in Cloud Storage are added, deleted, or modified at random intervals. If you need to control the timing of the refresh, for example to trigger the refresh at the end of an extract-transform-load job, use manual refresh.
- For manual refreshes, you run the BQ.REFRESH_EXTERNAL_METADATA_CACHE system procedure to refresh the metadata cache on whatever schedule you determine. Refreshing the cache manually is a good approach if the files in Cloud Storage are added, deleted, or modified at known intervals, for example as the output of a pipeline. If you issue multiple concurrent manual refreshes, only one will succeed.
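For example, a manual refresh is a single procedure call, sketched here for a hypothetical fully qualified table name:

    -- Refresh the metadata cache for one table on your own schedule.
    CALL BQ.REFRESH_EXTERNAL_METADATA_CACHE('myproject.mydataset.sales_biglake');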
The metadata cache expires after 7 days if it isn't refreshed.
You should consider how the staleness interval and metadata caching mode values will interact before you set them. Consider the following examples:
- If you are manually refreshing the metadata cache for a table, and you set the staleness interval to 2 days, you must run the BQ.REFRESH_EXTERNAL_METADATA_CACHE system procedure every 2 days or less if you want operations against the table to use cached metadata.
- If you are automatically refreshing the metadata cache for a table, and you set the staleness interval to 30 minutes, it is possible that some of your operations against the table will read from Cloud Storage if the metadata cache refresh takes on the longer side of the usual 30 to 60 minute window.
To find information about metadata refresh jobs, query the INFORMATION_SCHEMA.JOBS view, as shown in the following example:

    SELECT *
    FROM `region-us.INFORMATION_SCHEMA.JOBS_BY_PROJECT`
    WHERE job_id LIKE '%metadata_cache_refresh%'
      AND creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 6 HOUR)
    ORDER BY start_time DESC
    LIMIT 10;
Materialized views over BigLake metadata cache-enabled tables can be used to improve performance and efficiency when querying structured data stored in Cloud Storage. These materialized views function like materialized views over BigQuery-managed storage tables, including the benefits of automatic refresh and smart tuning.
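As a rough sketch, and assuming metadata caching is already enabled on the base table, such a materialized view might precompute an aggregate; the table and column names are assumptions.

    CREATE MATERIALIZED VIEW mydataset.daily_sales_mv AS
    SELECT
      sale_date,
      SUM(amount) AS total_amount
    FROM mydataset.sales_biglake
    GROUP BY sale_date;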
For more information on setting caching options, see Create a BigLake table or Update a BigLake table.
Integrations
BigLake tables are accessible from a number of other BigQuery features and gcloud CLI services, including the following highlighted services:
Analytics Hub
BigLake tables are compatible with Analytics Hub. Datasets containing BigLake tables can be published as Analytics Hub listings. Analytics Hub subscribers can subscribe to these listings, which provision a read-only dataset, called a linked dataset, in their project. Subscribers can query all tables in the linked dataset, including all BigLake tables. For more information, see Subscribe to a listing.
BigQuery ML
You can use BigQuery ML to train and run models on BigLake tables in Cloud Storage.
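For illustration, the following sketch trains a simple classification model directly on a hypothetical BigLake table; the model type and the feature and label columns are assumptions.

    CREATE OR REPLACE MODEL mydataset.churn_model
    OPTIONS (
      model_type = 'LOGISTIC_REG',
      input_label_cols = ['churned']
    ) AS
    SELECT
      plan_type,
      monthly_spend,
      churned
    FROM mydataset.customers_biglake;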
Cloud DLP
Cloud Data Loss Prevention (DLP) scans your BigLake tables to identify and classify sensitive data. If sensitive data is detected, Cloud DLP deidentification transformations can mask, delete, or otherwise obscure that data.
Costs
Costs are associated with the following aspects of BigLake tables:
- Querying the tables.
- Refreshing the metadata cache.
The following table shows how your pricing model affects how these costs are applied:
| | On-demand pricing | Flat-rate pricing and Standard, Enterprise, and Enterprise Plus editions |
| --- | --- | --- |
| Queries | You are billed for the bytes processed by user queries. | Slots in reservation assignments with a QUERY job type are consumed during query time. |
| Manually refreshing the metadata cache | You are billed for the bytes processed to refresh the cache. | Slots in reservation assignments with a QUERY job type are consumed during cache refresh. |
| Automatically refreshing the metadata cache | You are billed for the bytes processed to refresh the cache. | Slots in reservation assignments with a BACKGROUND job type are consumed during cache refresh. If there are no BACKGROUND reservations available for refreshing the metadata cache, BigQuery automatically uses slots in QUERY reservations instead if you are using flat-rate pricing, or Enterprise or Enterprise Plus edition. |
What's next
- Learn how to upgrade external tables to BigLake tables.
- Learn how to create a Cloud Storage BigLake table.
- Learn how to create an Amazon S3 BigLake table.
- Learn how to create a Blob Storage BigLake table.
- Learn how to create data quality checks with Dataplex.