Create AWS Glue federated datasets

This document describes how to create a federated dataset in BigQuery that's linked to an existing database in AWS Glue.

A federated dataset is a connection between BigQuery and an external data source at the dataset level. The tables in a federated dataset are automatically populated from the tables in the corresponding external data source. You can query these tables directly in BigQuery, but you cannot make modifications, additions, or deletions. However, any updates that you make in the external data source are automatically reflected in BigQuery.

Before you begin

Ensure that you have a connection to access AWS Glue data.

  • To create or modify a connection, see Connect to Amazon S3, and include the following additional policy statement for AWS Glue in your AWS Identity and Access Management policy for BigQuery:

    {
     "Effect": "Allow",
     "Action": [
       "glue:GetDatabase",
       "glue:GetTable",
       "glue:GetTables",
       "glue:GetPartitions"
     ],
     "Resource": [
       "arn:aws:glue:REGION:ACCOUNT_ID:catalog",
       "arn:aws:glue:REGION:ACCOUNT_ID:database/DATABASE_NAME",
       "arn:aws:glue:REGION:ACCOUNT_ID:table/DATABASE_NAME/*"
     ]
    }
    

    Replace the following:

    • REGION: the AWS region—for example us-east-1
    • ACCOUNT_ID:: the 12-digit AWS Account ID
    • DATABASE_NAME: the AWS Glue database name

Required permissions

To get the permissions that you need to create a federated dataset, ask your administrator to grant you the BigQuery Admin (roles/bigquery.admin) IAM role. For more information about granting roles, see Manage access.

This predefined role contains the permissions required to create a federated dataset. To see the exact permissions that are required, expand the Required permissions section:

Required permissions

The following permissions are required to create a federated dataset:

  • bigquery.datasets.create
  • bigquery.connections.use
  • bigquery.connections.delegate

You might also be able to get these permissions with custom roles or other predefined roles.

For more information about IAM roles and permissions in BigQuery, see Introduction to IAM.

Create a federated dataset

To created a federated dataset, do the following:

bq

In a command-line environment, create a dataset by using the bq mk command:

bq --location=LOCATION mk --dataset \
    --external_source aws-glue://AWS_GLUE_SOURCE \
    --connection_id PROJECT_ID.CONNECTION_LOCATION.CONNECTION_NAME \
    DATASET_NAME

Replace the following:

  • LOCATION: the location of your new dataset in BigQuery—for example, aws-us-east-1. After you create a dataset, you can't change its location. You can set a default location value by using the .bigqueryrc file.
  • AWS_GLUE_SOURCE: the Amazon Resource Name (ARN) of the AWS Glue database—for example, arn:aws:glue:us-east-1:123456789:database/test_database.
  • PROJECT_ID: your BigQuery project ID.
  • CONNECTION_LOCATION: the location of your AWS connection—for example, aws-us-east-1.
  • CONNECTION_NAME: the name of your AWS connection.
  • DATASET_NAME: the name of your new dataset in BigQuery. To create a dataset in a project other than your default project, add the project ID to the dataset name in the following format: PROJECT_ID:DATASET_NAME.

API

Call the datasets.insert method with a defined dataset resource and externalDatasetReference field for your AWS Glue database.

List tables in a federated dataset

To list the tables that are available for query in your federated dataset, see Listing datasets.

Get table information

To get information on the tables in your federated dataset, such as schema details, see Get table information.

Control access to tables

To manage access to the tables in your federated dataset, see Control access to resources with IAM.

Query AWS Glue data

Querying tables in federated datasets is the same as querying tables in any other BigQuery dataset.

You can query AWS Glue tables in the following formats:

  • CSV (compressed and uncompressed)
  • JSON (compressed and uncompressed)
  • Parquet
  • ORC
  • Avro
  • Iceberg

Table mapping details

Every table that you grant access to in your AWS Glue database appears as an equivalent table in your BigQuery dataset.

Format

The format of each BigQuery table is determined by the following fields of the respective AWS Glue table:

  • InputFormat (Table.StorageDescriptor.InputFormat)
  • OutputFormat (Table.StorageDescriptor.OutputFormat)
  • SerializationLib (Table.StorageDescriptor.SerdeInfo.SerializationLibrary)

The only exception is Iceberg tables, which use the TableType (Table.Parameters["table_type"]) field.

For example, an AWS Glue table with the following fields is mapped to an ORC table in BigQuery:

  • InputFormat = "org.apache.hadoop.hive.ql.io.orc.OrcInputFormat"
  • OutputFormat = "org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat"
  • SerializationLib = "org.apache.hadoop.hive.ql.io.orc.OrcSerde"

Location

The location of each BigQuery table is determined by the following:

  • Iceberg tables: the Table.Parameters["metadata_location"] field in the AWS Glue table
  • Non-Iceberg unpartitioned tables: the Table.StorageDescriptor.Location field in the AWS Glue table
  • Non-Iceberg partitioned tables: the AWS Glue GetPartitions API

Other properties

Additionally, some AWS Glue table properties are automatically mapped to format-specific options in BigQuery:

Format SerializationLib AWS Glue table value BigQuery option
CSV LazySimpleSerDe Table.StorageDescriptor.SerdeInfo.Parameters["field.delim"] CsvOptions.fieldDelimiter
CSV LazySimpleSerDe Table.StorageDescriptor.Parameters["serialization.encoding"] CsvOptions.encoding
CSV LazySimpleSerDe Table.StorageDescriptor.Parameters["skip.header.line.count"] CsvOptions.skipLeadingRows
CSV OpenCsvSerDe Table.StorageDescriptor.SerdeInfo.Parameters["separatorChar"] CsvOptions.fieldDelimiter
CSV OpenCsvSerDe Table.StorageDescriptor.SerdeInfo.Parameters["quoteChar"] CsvOptions.quote
CSV OpenCsvSerDe Table.StorageDescriptor.Parameters["serialization.encoding"] CsvOptions.encoding
CSV OpenCsvSerDe Table.StorageDescriptor.Parameters["skip.header.line.count"] CsvOptions.skipLeadingRows
JSON Hive JsonSerDe Table.StorageDescriptor.Parameters["serialization.encoding"] JsonOptions.encoding

Create a view in a federated dataset

You can't create a view in a federated dataset. However, you can create a view in a standard dataset that's based on a table in a federated dataset. For more information, see Create views.

Delete a federated dataset

Deleting a federated dataset is the same as deleting any other BigQuery dataset. For more information, see Delete datasets.

Pricing

For information about pricing, see BigQuery Omni pricing.

Limitations

What's next