
Query Amazon S3 data

This document describes how to query data stored in Amazon Simple Storage Service (Amazon S3) by creating a BigLake table. A BigLake table lets you use access delegation to query data in Amazon S3. Access delegation decouples access to the BigLake table from access to the underlying data store.

Before you begin

Ensure that you have a connection to access Amazon S3 data.

Required role

To get the permissions that you need to create an external table, ask your administrator to grant you the BigQuery Admin (roles/bigquery.admin) IAM role on your dataset. For more information about granting roles, see Manage access.

This predefined role contains the permissions required to create an external table. To see the exact permissions that are required, expand the Required permissions section:

Required permissions

  • bigquery.tables.create
  • bigquery.tables.getData
  • bigquery.jobs.create
  • bigquery.connections.delegate

You might also be able to get these permissions with custom roles or other predefined roles.

Create a dataset

Before you create an external table, you need to create a dataset in the aws-us-east-1 region. Select one of the following options:

Console

  1. Go to the BigQuery page.

    Go to BigQuery

  2. In the Explorer pane, select the project where you want to create the dataset.
  3. Expand the View actions option and click Create dataset.
  4. On the Create dataset page, specify the following details:
    1. For Dataset ID, enter a unique dataset name.
    2. For Data location, choose aws-us-east-1.
    3. Optional: To delete tables automatically, select the Enable table expiration checkbox and set the Default maximum table age in days. Data in Amazon S3 is not deleted when the table expires.
    4. Optional: Expand the Advanced options section and select the following options:
      1. If you want to use a customer-managed encryption key, then select the Customer-managed encryption key (CMEK) option. By default, BigQuery encrypts customer content stored at rest by using a Google-managed key.
      2. If you want to use default collation, then select the Enable default collation option.
    5. Click Create dataset.

SQL

Use the CREATE SCHEMA DDL statement:

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the query editor, enter the following statement:

    CREATE SCHEMA mydataset
    OPTIONS (
      location = 'aws-us-east-1');
    

  3. Click Run.

For more information about how to run queries, see Running interactive queries.

bq

In a command-line environment, create a dataset using the bq mk command:

bq --location=LOCATION mk \
    --dataset \
    PROJECT_ID:DATASET_NAME

The --project_id flag overrides the default project.

Replace the following:

  • LOCATION: the location of your dataset

    The only valid value is aws-us-east-1. After you create a dataset, you can't change its location. You can set a default value for the location by using the .bigqueryrc file.

  • PROJECT_ID: your project ID

  • DATASET_NAME: the name of the dataset that you want to create

    To create a dataset in a project other than your default project, add the project ID to the dataset name in the following format: PROJECT_ID:DATASET_NAME.
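For example, the following command creates a dataset named awsdataset (a hypothetical name) in your default project in the aws-us-east-1 location:

bq --location=aws-us-east-1 mk \
    --dataset \
    awsdataset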

Java

Before trying this sample, follow the Java setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Java API reference documentation.

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryException;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.Dataset;
import com.google.cloud.bigquery.DatasetInfo;

// Sample to create an AWS dataset
public class CreateDatasetAws {

  public static void main(String[] args) {
    // TODO(developer): Replace these variables before running the sample.
    String projectId = "MY_PROJECT_ID";
    String datasetName = "MY_DATASET_NAME";
    // Note: As of now location only supports aws-us-east-1
    String location = "aws-us-east-1";
    createDatasetAws(projectId, datasetName, location);
  }

  public static void createDatasetAws(String projectId, String datasetName, String location) {
    try {
      // Initialize client that will be used to send requests. This client only needs to be created
      // once, and can be reused for multiple requests.
      BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

      DatasetInfo datasetInfo =
          DatasetInfo.newBuilder(projectId, datasetName).setLocation(location).build();

      Dataset dataset = bigquery.create(datasetInfo);
      System.out.println(
          "Aws dataset created successfully :" + dataset.getDatasetId().getDataset());
    } catch (BigQueryException e) {
      System.out.println("Aws dataset was not created. \n" + e.toString());
    }
  }
}

Create BigLake tables

Select one of the following options:

Console

  1. Go to the BigQuery page.

    Go to BigQuery

  2. In the Explorer pane, expand your project, and then select a dataset.

  3. In the Dataset info section, click Create table.

  4. On the Create table page, in the Source section, do the following:

    1. For Create table from, select Amazon S3.
    2. For Select S3 path, enter a URI pointing to the Amazon S3 data in the format s3://BUCKET_NAME/PATH. Replace BUCKET_NAME with the name of the Amazon S3 bucket; the bucket's region should match the dataset's region. Replace PATH with the path to the data that you want to query; the path can contain one wildcard *.
    3. For File format, select the data format in Amazon S3. Supported formats are AVRO, PARQUET, ORC, CSV, and JSONL (Newline delimited JSON).
  5. Optional: If the data in Amazon S3 uses hive partitioning, then do the following:

    1. Select the Source data partitioning checkbox.
    2. In the Select Source URI Prefix field, enter the URI prefix. The URI prefix is the part of the URI that appears before the partition encoding. For example, s3://s3-bucket/path/.
    3. Optional: To require a partition filter on all queries for this table, select the Require partition filter checkbox. A partition filter can reduce cost and improve performance. For more information, see Set partition filter requirements.
    4. Select a Partition inference mode. If you select Provide my own, then enter schema information for the partition keys.
  6. On the Create table page, in the Destination section, specify the following details:

    1. For Dataset, choose the appropriate dataset.
    2. In the Table field, enter the name of the table.
    3. Verify that Table type is set to External table.
    4. For Connection ID, choose the appropriate connection ID from the dropdown. For information about connections, see Connect to Amazon S3.
  7. Click Create table.

SQL

To create an external table, use the CREATE EXTERNAL TABLE statement:

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the query editor, enter the following statement:

    CREATE EXTERNAL TABLE DATASET_NAME.TABLE_NAME
      WITH CONNECTION `AWS_LOCATION.CONNECTION_NAME`
      OPTIONS (
        format = "DATA_FORMAT",
        uris = ["S3_URI"]);
    

    Replace the following:

    • DATASET_NAME: the name of the dataset you created
    • TABLE_NAME: the name you want to give to this table
    • AWS_LOCATION: an AWS location in Google Cloud (for example, `aws-us-east-1`)
    • CONNECTION_NAME: the name of the connection you created
    • DATA_FORMAT: any of the supported BigQuery federated formats (such as AVRO or CSV)
    • S3_URI: a URI pointing to the Amazon S3 data (for example, s3://bucket/path)

  3. Click Run.

For more information about how to run queries, see Running interactive queries.

Example:

CREATE EXTERNAL TABLE awsdataset.awstable
  WITH CONNECTION `aws-us-east-1.s3-read-connection`
  OPTIONS (format="CSV", uris=["s3://s3-bucket/path/file.csv"]);

bq

Create a table definition file:


bq mkdef  \
--source_format=DATA_FORMAT \
--connection_id=AWS_LOCATION.CONNECTION_NAME \
S3_URI > table_def

Replace the following:

  • DATA_FORMAT: any of the supported BigQuery federated formats (such as AVRO or CSV).
  • S3_URI: a URI pointing to the Amazon S3 data (for example, s3://bucket/path).
  • AWS_LOCATION: an AWS location in Google Cloud (for example, aws-us-east-1).
  • CONNECTION_NAME: the name of the connection you created.

Next, create the external table:

bq mk --external_table_definition=table_def DATASET_NAME.TABLE_NAME

Replace the following:

  • DATASET_NAME: the name of the dataset you created.
  • TABLE_NAME: the name you want to give to this table.

For example, the following command creates a new external table, awsdataset.awstable, which can query your Amazon S3 data that's stored at the path s3://s3-bucket/path/file.csv and has a read connection in the location aws-us-east-1:

bq mkdef  \
--autodetect \
--source_format=CSV \
--connection_id=aws-us-east-1.s3-read-connection \
s3://s3-bucket/path/file.csv > table_def

bq mk --external_table_definition=table_def awsdataset.awstable

Hive partitioning

If the data in Amazon S3 uses hive partitioning, then create the table definition file as follows:

bq mkdef  \
--source_format=DATA_FORMAT \
--connection_id=AWS_LOCATION.CONNECTION_NAME \
--hive_partitioning_mode=HIVE_PARTITIONING_MODE \
--hive_partitioning_source_uri_prefix=HIVE_PARTITIONING_URI_PREFIX \
"S3_URI" > table_def

Replace the following:

  • HIVE_PARTITIONING_MODE: The partition schema detection mode. Valid values are AUTO, STRINGS, and CUSTOM.
  • HIVE_PARTITIONING_URI_PREFIX: The URI prefix, which is the part of the URI that appears before the partition encoding. For example, s3://s3-bucket/path/.
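
For example, the following command creates a definition file for hive-partitioned Parquet data, using the hypothetical bucket and connection names from the earlier examples and letting BigQuery detect the partition keys automatically:

bq mkdef \
--source_format=PARQUET \
--connection_id=aws-us-east-1.s3-read-connection \
--hive_partitioning_mode=AUTO \
--hive_partitioning_source_uri_prefix=s3://s3-bucket/path/ \
"s3://s3-bucket/path/*" > table_def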

API

Call the tables.insert API method, and create an ExternalDataConfiguration in the Table resource that you pass in.

Specify the schema property or set the autodetect property to true to enable schema auto detection for supported data sources.

Specify the connectionId property to identify the connection to use for connecting to Amazon S3.

Java

Before trying this sample, follow the Java setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Java API reference documentation.

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryException;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.CsvOptions;
import com.google.cloud.bigquery.ExternalTableDefinition;
import com.google.cloud.bigquery.Field;
import com.google.cloud.bigquery.Schema;
import com.google.cloud.bigquery.StandardSQLTypeName;
import com.google.cloud.bigquery.TableId;
import com.google.cloud.bigquery.TableInfo;

// Sample to create an external aws table
public class CreateExternalTableAws {

  public static void main(String[] args) {
    // TODO(developer): Replace these variables before running the sample.
    String projectId = "MY_PROJECT_ID";
    String datasetName = "MY_DATASET_NAME";
    String tableName = "MY_TABLE_NAME";
    String connectionId = "MY_CONNECTION_ID";
    String sourceUri = "s3://your-bucket-name/";
    CsvOptions options = CsvOptions.newBuilder().setSkipLeadingRows(1).build();
    Schema schema =
        Schema.of(
            Field.of("name", StandardSQLTypeName.STRING),
            Field.of("post_abbr", StandardSQLTypeName.STRING));
    ExternalTableDefinition externalTableDefinition =
        ExternalTableDefinition.newBuilder(sourceUri, options)
            .setConnectionId(connectionId)
            .setSchema(schema)
            .build();
    createExternalTableAws(projectId, datasetName, tableName, externalTableDefinition);
  }

  public static void createExternalTableAws(
      String projectId,
      String datasetName,
      String tableName,
      ExternalTableDefinition externalTableDefinition) {
    try {
      // Initialize client that will be used to send requests. This client only needs to be created
      // once, and can be reused for multiple requests.
      BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

      TableId tableId = TableId.of(projectId, datasetName, tableName);
      TableInfo tableInfo = TableInfo.newBuilder(tableId, externalTableDefinition).build();

      bigquery.create(tableInfo);
      System.out.println("Aws external table created successfully");

      // Clean up
      bigquery.delete(TableId.of(projectId, datasetName, tableName));
    } catch (BigQueryException e) {
      System.out.println("Aws external was not created." + e.toString());
    }
  }
}

Query the table

BigQuery Omni lets you query the external table like any BigQuery table. The maximum result size for interactive queries is 20 GiB logical bytes (preview). For more information, see BigQuery Omni limitations. If your query result is larger than 20 GiB, we recommend that you export it to Amazon S3. The cached query results are stored in a BigQuery temporary table.
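
If you need to materialize a large result in Amazon S3, you can use the EXPORT DATA statement. The following is a minimal sketch; the connection name s3-write-connection and the destination bucket are hypothetical, the connection must have write access to the bucket, and the destination URI must contain a wildcard:

EXPORT DATA WITH CONNECTION `aws-us-east-1.s3-write-connection`
OPTIONS (
  uri = "s3://s3-bucket/exported/*",
  format = "CSV")
AS SELECT * FROM awsdataset.awstable;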

To run a query, use SQL or Java.

SQL

To query the table:

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the query editor, enter the following statement:

    SELECT * FROM DATASET_NAME.TABLE_NAME;
    

    Replace the following:

    • DATASET_NAME: the dataset name that you created
    • TABLE_NAME: the name of the table that you created

  3. Click Run.

For more information about how to run queries, see Running interactive queries.
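
For example, using the hypothetical dataset and table names from the earlier examples:

SELECT * FROM awsdataset.awstable LIMIT 10;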

Java

Before trying this sample, follow the Java setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Java API reference documentation.

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryException;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.CsvOptions;
import com.google.cloud.bigquery.DatasetId;
import com.google.cloud.bigquery.ExternalTableDefinition;
import com.google.cloud.bigquery.Field;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.Schema;
import com.google.cloud.bigquery.StandardSQLTypeName;
import com.google.cloud.bigquery.TableId;
import com.google.cloud.bigquery.TableInfo;
import com.google.cloud.bigquery.TableResult;

// Sample to query an external AWS S3 data source using a permanent table
public class QueryExternalTableAws {

  public static void main(String[] args) throws InterruptedException {
    // TODO(developer): Replace these variables before running the sample.
    String projectId = "MY_PROJECT_ID";
    String datasetName = "MY_DATASET_NAME";
    String externalTableName = "MY_EXTERNAL_TABLE_NAME";
    // Query to find states starting with 'W'
    String query =
        String.format(
            "SELECT * FROM s%.%s.%s WHERE name LIKE 'W%%'",
            projectId, datasetName, externalTableName);
    queryExternalTableAws(query);
  }

  public static void queryExternalTableAws(String query) throws InterruptedException {
    try {
      // Initialize client that will be used to send requests. This client only needs to be created
      // once, and can be reused for multiple requests.
      BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

      TableResult results = bigquery.query(QueryJobConfiguration.of(query));

      results
          .iterateAll()
          .forEach(row -> row.forEach(val -> System.out.printf("%s,", val.toString())));

      System.out.println("Query on aws external permanent table performed successfully.");
    } catch (BigQueryException e) {
      System.out.println("Query not performed \n" + e.toString());
    }
  }
}

View resource metadata with INFORMATION_SCHEMA

When you query the JOBS_BY_*, JOBS_TIMELINE_BY_*, and RESERVATION* BigQuery INFORMATION_SCHEMA views, you must specify us-east4 (Northern Virginia), the Google Cloud region that is collocated with the table's region, as the query's processing location. The job fails if the location is not specified. For all other system tables, specifying the query job location is optional.

For information about the system tables that BigQuery Omni supports, see Limitations.

To query JOBS_* and RESERVATION* system tables, specify the processing location as follows:

Console

  1. Go to the BigQuery page.

    Go to BigQuery

  2. If the Editor tab isn't visible, then click Compose new query.

  3. Click More > Query settings. The Query settings dialog opens.

  4. In the Query settings dialog, for Additional settings > Data location, select us-east4 (Northern Virginia).

  5. Optionally adjust the remaining fields, and then click Save.

bq

Use the --location flag to set the job's processing location to us-east4.

Example

bq --location=us-east4 query --use_legacy_sql=false \
'SELECT * FROM `region-aws-us-east-1`.INFORMATION_SCHEMA.JOBS LIMIT 10;'

API

If you are running jobs programmatically, ensure that you set the location argument to us-east4.
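
For example, with the Java client library you can pin the processing location by setting it on the job ID. The following is a minimal sketch, not an official sample; the class name and query are illustrative, and the client setup mirrors the earlier samples:

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryException;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.JobId;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableResult;

// Sketch: query an INFORMATION_SCHEMA view with an explicit processing location.
public class QueryInformationSchemaWithLocation {

  public static void main(String[] args) throws InterruptedException {
    try {
      // Initialize client that will be used to send requests.
      BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

      QueryJobConfiguration queryConfig =
          QueryJobConfiguration.of(
              "SELECT * FROM `region-aws-us-east-1`.INFORMATION_SCHEMA.JOBS LIMIT 10");

      // Set the job's processing location to us-east4 by building the job ID explicitly.
      JobId jobId = JobId.newBuilder().setRandomJob().setLocation("us-east4").build();

      TableResult results = bigquery.query(queryConfig, jobId);
      results
          .iterateAll()
          .forEach(row -> row.forEach(val -> System.out.printf("%s,", val.toString())));
    } catch (BigQueryException e) {
      System.out.println("Query not performed \n" + e.toString());
    }
  }
}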

What's next