Create AWS Glue federated datasets
This document describes how to create a federated dataset in BigQuery that's linked to an existing database in AWS Glue.
A federated dataset is a connection between BigQuery and an external data source at the dataset level. The tables in a federated dataset are automatically populated from the tables in the corresponding external data source. You can query these tables directly in BigQuery, but you cannot make modifications, additions, or deletions. However, any updates that you make in the external data source are automatically reflected in BigQuery.
Before you begin
Ensure that you have a connection to access AWS Glue data.
To create or modify a connection, follow the instructions in Connect to Amazon S3. When you create that connection, include the following policy statement for AWS Glue in your AWS Identity and Access Management policy for BigQuery. Include this statement in addition to the other permissions on the Amazon S3 bucket where the data in your AWS Glue tables is stored.
{ "Effect": "Allow", "Action": [ "glue:GetDatabase", "glue:GetTable", "glue:GetTables", "glue:GetPartitions" ], "Resource": [ "arn:aws:glue:REGION:ACCOUNT_ID:catalog", "arn:aws:glue:REGION:ACCOUNT_ID:database/DATABASE_NAME", "arn:aws:glue:REGION:ACCOUNT_ID:table/DATABASE_NAME/*" ] }
Replace the following:

- `REGION`: the AWS region, for example `us-east-1`
- `ACCOUNT_ID`: the 12-digit AWS account ID
- `DATABASE_NAME`: the AWS Glue database name
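For example, if your BigQuery connection uses an AWS IAM role, one way to grant these permissions is to attach the statement as an inline policy with the AWS CLI. This is a sketch: the role and policy names are hypothetical, and `glue-policy.json` is assumed to be a local file containing a policy document that wraps the statement above.

```bash
# Attach the AWS Glue permissions to the role used by the BigQuery connection.
aws iam put-role-policy \
    --role-name bigquery-omni-connection-role \
    --policy-name bigquery-glue-access \
    --policy-document file://glue-policy.json
```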
Required permissions
To get the permissions that you need to create a federated dataset, ask your administrator to grant you the BigQuery Admin (`roles/bigquery.admin`) IAM role.
For more information about granting roles, see Manage access to projects, folders, and organizations.
This predefined role contains the following permissions, which are required to create a federated dataset:
- `bigquery.datasets.create`
- `bigquery.connections.use`
- `bigquery.connections.delegate`
You might also be able to get these permissions with custom roles or other predefined roles.
For more information about IAM roles and permissions in BigQuery, see Introduction to IAM.
Create a federated dataset
To create a federated dataset, do the following:
Console
1. Open the BigQuery page in the Google Cloud console.
2. In the Explorer panel, select the project where you want to create the dataset.
3. Expand the Actions option and click Create dataset.
4. On the Create dataset page, do the following:
   - For Dataset ID, enter a unique dataset name.
   - For Location type, choose an AWS location for the dataset, such as `aws-us-east-1`. After you create a dataset, the location can't be changed.
   - For External Dataset, do the following:
     - Check the box next to Link to an external dataset.
     - For External dataset type, select AWS Glue.
     - For External source, enter `aws-glue://` followed by the Amazon Resource Name (ARN) of the AWS Glue database, for example, `aws-glue://arn:aws:glue:us-east-1:123456789:database/test_database`.
     - For Connection ID, select your AWS connection.
   - Leave the other default settings as they are.
5. Click Create dataset.
SQL
Use the `CREATE EXTERNAL SCHEMA` data definition language (DDL) statement.
In the Google Cloud console, go to the BigQuery page.
In the query editor, enter the following statement:
```sql
CREATE EXTERNAL SCHEMA DATASET_NAME
WITH CONNECTION PROJECT_ID.CONNECTION_LOCATION.CONNECTION_NAME
OPTIONS (
  external_source = 'AWS_GLUE_SOURCE',
  location = 'LOCATION');
```
Replace the following:

- `DATASET_NAME`: the name of your new dataset in BigQuery.
- `PROJECT_ID`: your project ID.
- `CONNECTION_LOCATION`: the location of your AWS connection, for example `aws-us-east-1`.
- `CONNECTION_NAME`: the name of your AWS connection.
- `AWS_GLUE_SOURCE`: the Amazon Resource Name (ARN) of the AWS Glue database with a prefix identifying the source, for example `aws-glue://arn:aws:glue:us-east-1:123456789:database/test_database`.
- `LOCATION`: the location of your new dataset in BigQuery, for example `aws-us-east-1`. After you create a dataset, you can't change its location.
Click Run.
For more information about how to run queries, see Run an interactive query.
bq
In a command-line environment, create a dataset by using the `bq mk` command:
```bash
bq --location=LOCATION mk --dataset \
    --external_source aws-glue://AWS_GLUE_SOURCE \
    --connection_id PROJECT_ID.CONNECTION_LOCATION.CONNECTION_NAME \
    DATASET_NAME
```
Replace the following:

- `LOCATION`: the location of your new dataset in BigQuery, for example `aws-us-east-1`. After you create a dataset, you can't change its location. You can set a default location value by using the `.bigqueryrc` file.
- `AWS_GLUE_SOURCE`: the Amazon Resource Name (ARN) of the AWS Glue database, for example `arn:aws:glue:us-east-1:123456789:database/test_database`.
- `PROJECT_ID`: your BigQuery project ID.
- `CONNECTION_LOCATION`: the location of your AWS connection, for example `aws-us-east-1`.
- `CONNECTION_NAME`: the name of your AWS connection.
- `DATASET_NAME`: the name of your new dataset in BigQuery. To create a dataset in a project other than your default project, add the project ID to the dataset name in the following format: `PROJECT_ID:DATASET_NAME`.
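For example, a fully substituted command might look like the following; the project, connection, and dataset names and the account ID are hypothetical:

```bash
bq --location=aws-us-east-1 mk --dataset \
    --external_source aws-glue://arn:aws:glue:us-east-1:123456789012:database/test_database \
    --connection_id myproject.aws-us-east-1.my-aws-connection \
    my_glue_dataset
```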
Terraform
Use the `google_bigquery_dataset` resource.
To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.
The following example creates an AWS Glue federated dataset:
resource "google_bigquery_dataset" "dataset" { provider = google-beta dataset_id = "example_dataset" friendly_name = "test" description = "This is a test description." location = "aws-us-east-1" external_dataset_reference { external_source = "aws-glue://arn:aws:glue:us-east-1:999999999999:database/database" connection = "projects/project/locations/aws-us-east-1/connections/connection" } }
To apply your Terraform configuration in a Google Cloud project, complete the steps in the following sections.
Prepare Cloud Shell
1. Launch Cloud Shell.
2. Set the default Google Cloud project where you want to apply your Terraform configurations.

    You only need to run this command once per project, and you can run it in any directory.

    ```bash
    export GOOGLE_CLOUD_PROJECT=PROJECT_ID
    ```

    Environment variables are overridden if you set explicit values in the Terraform configuration file.
Prepare the directory
Each Terraform configuration file must have its own directory (also called a root module).
1. In Cloud Shell, create a directory and a new file within that directory. The filename must have the `.tf` extension, for example `main.tf`. In this tutorial, the file is referred to as `main.tf`.

    ```bash
    mkdir DIRECTORY && cd DIRECTORY && touch main.tf
    ```

2. If you are following a tutorial, you can copy the sample code in each section or step.

    Copy the sample code into the newly created `main.tf`.

    Optionally, copy the code from GitHub. This is recommended when the Terraform snippet is part of an end-to-end solution.

3. Review and modify the sample parameters to apply to your environment.
4. Save your changes.
5. Initialize Terraform. You only need to do this once per directory.

    ```bash
    terraform init
    ```

    Optionally, to use the latest Google provider version, include the `-upgrade` option:

    ```bash
    terraform init -upgrade
    ```
Apply the changes
1. Review the configuration and verify that the resources that Terraform is going to create or update match your expectations:

    ```bash
    terraform plan
    ```

    Make corrections to the configuration as necessary.

2. Apply the Terraform configuration by running the following command and entering `yes` at the prompt:

    ```bash
    terraform apply
    ```

    Wait until Terraform displays the "Apply complete!" message.

3. Open your Google Cloud project to view the results. In the Google Cloud console, navigate to your resources in the UI to make sure that Terraform has created or updated them.
API
Call the `datasets.insert` method with a defined dataset resource and the `externalDatasetReference` field for your AWS Glue database.
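As a minimal sketch with `curl`, assuming hypothetical project, dataset, and connection names, and request field names that mirror the Terraform example above:

```bash
curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    "https://bigquery.googleapis.com/bigquery/v2/projects/myproject/datasets" \
    -d '{
      "datasetReference": {"datasetId": "my_glue_dataset"},
      "location": "aws-us-east-1",
      "externalDatasetReference": {
        "externalSource": "aws-glue://arn:aws:glue:us-east-1:123456789012:database/test_database",
        "connection": "projects/myproject/locations/aws-us-east-1/connections/my-aws-connection"
      }
    }'
```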
List tables in a federated dataset
To list the tables that are available for query in your federated dataset, see Listing datasets.
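For example, with the bq command-line tool, listing the tables in a federated dataset works the same as for any other dataset; the project and dataset names here are hypothetical:

```bash
bq ls myproject:my_glue_dataset
```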
Get table information
To get information on the tables in your federated dataset, such as schema details, see Get table information.
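For example, with the bq command-line tool; the project, dataset, and table names are hypothetical:

```bash
# Show table metadata, including the schema mapped from AWS Glue.
bq show --format=prettyjson myproject:my_glue_dataset.orders
```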
Control access to tables
To manage access to the tables in your federated dataset, see Control access to resources with IAM.
Row-level security, column-level security, and data masking are also supported for tables in federated datasets.
Schema operations that might invalidate security policies, such as deleting a column in AWS Glue, can cause jobs to fail until the policies are updated. Additionally, if you delete a table in AWS Glue and recreate it, your security policies no longer apply to the recreated table.
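As a sketch, row-level security on a federated table uses the same DDL as for any other BigQuery table; the policy, table, column, and grantee names below are hypothetical. If the filtered column is later deleted in AWS Glue, queries against the table fail until the policy is updated or dropped, as noted above.

```bash
bq query --nouse_legacy_sql \
"CREATE ROW ACCESS POLICY us_rows
    ON my_glue_dataset.orders
    GRANT TO ('user:analyst@example.com')
    FILTER USING (country = 'US');"
```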
Query AWS Glue data
Querying tables in federated datasets is the same as querying tables in any other BigQuery dataset.
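For example, a query from the bq command-line tool might look like the following; the dataset and table names are hypothetical:

```bash
bq query --nouse_legacy_sql \
    'SELECT order_id, status FROM my_glue_dataset.orders LIMIT 10;'
```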
You can query AWS Glue tables in the following formats:
- CSV (compressed and uncompressed)
- JSON (compressed and uncompressed)
- Parquet
- ORC
- Avro
- Iceberg
Table mapping details
Every table that you grant access to in your AWS Glue database appears as an equivalent table in your BigQuery dataset.
Format
The format of each BigQuery table is determined by the following fields of the respective AWS Glue table:
- `InputFormat` (`Table.StorageDescriptor.InputFormat`)
- `OutputFormat` (`Table.StorageDescriptor.OutputFormat`)
- `SerializationLib` (`Table.StorageDescriptor.SerdeInfo.SerializationLibrary`)

The only exception is Iceberg tables, which use the `TableType` (`Table.Parameters["table_type"]`) field.
For example, an AWS Glue table with the following fields is mapped to an ORC table in BigQuery:
- `InputFormat` = `"org.apache.hadoop.hive.ql.io.orc.OrcInputFormat"`
- `OutputFormat` = `"org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat"`
- `SerializationLib` = `"org.apache.hadoop.hive.ql.io.orc.OrcSerde"`
Location
The location of each BigQuery table is determined by the following:
- Iceberg tables: the `Table.Parameters["metadata_location"]` field in the AWS Glue table
- Non-Iceberg unpartitioned tables: the `Table.StorageDescriptor.Location` field in the AWS Glue table
- Non-Iceberg partitioned tables: the AWS Glue GetPartitions API
Other properties
Additionally, some AWS Glue table properties are automatically mapped to format-specific options in BigQuery:
| Format | SerializationLib | AWS Glue table value | BigQuery option |
|---|---|---|---|
| CSV | `LazySimpleSerDe` | `Table.StorageDescriptor.SerdeInfo.Parameters["field.delim"]` | `CsvOptions.fieldDelimiter` |
| CSV | `LazySimpleSerDe` | `Table.StorageDescriptor.Parameters["serialization.encoding"]` | `CsvOptions.encoding` |
| CSV | `LazySimpleSerDe` | `Table.StorageDescriptor.Parameters["skip.header.line.count"]` | `CsvOptions.skipLeadingRows` |
| CSV | `OpenCsvSerDe` | `Table.StorageDescriptor.SerdeInfo.Parameters["separatorChar"]` | `CsvOptions.fieldDelimiter` |
| CSV | `OpenCsvSerDe` | `Table.StorageDescriptor.SerdeInfo.Parameters["quoteChar"]` | `CsvOptions.quote` |
| CSV | `OpenCsvSerDe` | `Table.StorageDescriptor.Parameters["serialization.encoding"]` | `CsvOptions.encoding` |
| CSV | `OpenCsvSerDe` | `Table.StorageDescriptor.Parameters["skip.header.line.count"]` | `CsvOptions.skipLeadingRows` |
| JSON | Hive `JsonSerDe` | `Table.StorageDescriptor.Parameters["serialization.encoding"]` | `JsonOptions.encoding` |
Create a view in a federated dataset
You can't create a view in a federated dataset. However, you can create a view in a standard dataset that's based on a table in a federated dataset. For more information, see Create views.
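For example, with the bq command-line tool; the dataset, table, and view names are hypothetical:

```bash
# Create a view in a standard dataset that reads from a federated table.
bq mk --use_legacy_sql=false \
    --view 'SELECT order_id, status FROM myproject.my_glue_dataset.orders' \
    my_standard_dataset.orders_view
```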
Delete a federated dataset
Deleting a federated dataset is the same as deleting any other BigQuery dataset. For more information, see Delete datasets.
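For example, with the bq command-line tool; the `-r` flag also removes the dataset's tables, and the dataset name is hypothetical:

```bash
bq rm -r -d myproject:my_glue_dataset
```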
Pricing
For information about pricing, see BigQuery Omni pricing.
Limitations
- All BigQuery Omni limitations apply.
- You can't add, delete, or update data or metadata in tables in an AWS Glue federated dataset.
- You can't create new tables, views, or materialized views in an AWS Glue federated dataset.
- `INFORMATION_SCHEMA` views aren't supported.
- Metadata caching isn't supported.
- Dataset-level settings that are related to table creation defaults don't affect federated datasets because you can't create tables manually.
- The Apache Hive data type `UNION` isn't supported for Avro tables.
What's next
- Learn more about BigQuery Omni.
- Try the BigQuery Omni with AWS lab.