Create an external table for Amazon S3 data

BigQuery Omni does not manage data stored in Amazon S3. To access S3 data, define an external table.

Create a dataset

To create a dataset, use the Google Cloud Console, the bq command-line tool, or a client library:

Google Cloud Console

  1. Open the BigQuery page in the Google Cloud Console.

    Go to the BigQuery page

  2. In the navigation panel, in the Resources section, select your project.

  3. In the details panel, click Create dataset.

  4. On the Create dataset page, enter the following information:

    • For Dataset ID, enter a unique dataset name.
    • For Data location, choose aws-us-east-1.

    • For Default table expiration, choose one of the following options:

      • Never: (Default) Tables created in the dataset are never automatically deleted. You must delete them manually.
      • Number of days after table creation: This value determines when a newly created table in the dataset is deleted. This value is applied if you do not set a table expiration when the table is created. Data in Amazon S3 is not deleted when the table expires.
  5. Click Create dataset.

bq

To create a new dataset, use the bq mk command with the --location flag. Optional parameters include --default_table_expiration and --description.

To create a dataset in a project other than your default project, add the project ID to the dataset name in the following format: PROJECT_ID:DATASET_NAME.

bq --location=LOCATION mk \
--dataset \
--default_table_expiration INTEGER1 \
--description DESCRIPTION \
PROJECT_ID:DATASET_NAME

Replace the following:

  • LOCATION: Replace with aws-us-east-1. After you create a dataset, you can't change its location. You can set a default value for the location by using the .bigqueryrc file.
  • INTEGER1: The default lifetime (in seconds) for newly created tables. The minimum value is 3600 (1 hour). The expiration time evaluates to the current time plus the integer value. Any table created in the dataset is deleted INTEGER1 seconds after the table's creation time. This value is applied if you do not set a table expiration when you create the table.
  • DESCRIPTION: A description of the dataset in quotes.
  • PROJECT_ID: Your project ID.
  • DATASET_NAME: The name of the dataset you're creating.
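For example, the following command creates a dataset named awsdataset in the aws-us-east-1 location, with a default table expiration of 86400 seconds (24 hours). The project and dataset names here are hypothetical placeholders:

bq --location=aws-us-east-1 mk \
--dataset \
--default_table_expiration 86400 \
--description "Dataset for Amazon S3 data." \
myproject:awsdataset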

Java

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryException;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.Dataset;
import com.google.cloud.bigquery.DatasetInfo;

// Sample to create an AWS dataset
public class CreateDatasetAws {

  public static void main(String[] args) {
    // TODO(developer): Replace these variables before running the sample.
    String projectId = "MY_PROJECT_ID";
    String datasetName = "MY_DATASET_NAME";
    // Note: Currently, the only supported location is aws-us-east-1
    String location = "aws-us-east-1";
    createDatasetAws(projectId, datasetName, location);
  }

  public static void createDatasetAws(String projectId, String datasetName, String location) {
    try {
      // Initialize client that will be used to send requests. This client only needs to be created
      // once, and can be reused for multiple requests.
      BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

      DatasetInfo datasetInfo =
          DatasetInfo.newBuilder(projectId, datasetName).setLocation(location).build();

      Dataset dataset = bigquery.create(datasetInfo);
      System.out.println(
          "Aws dataset created successfully :" + dataset.getDatasetId().getDataset());
    } catch (BigQueryException e) {
      System.out.println("Aws dataset was not created. \n" + e.toString());
    }
  }
}

Create an external table

To create an external table, follow these steps:

Google Cloud Console

  1. Open the BigQuery page in the Google Cloud Console.

    Go to the BigQuery page

  2. In the navigation panel, in the Resources section, expand your project and select a dataset.

  3. In the details panel, click Create table.

  4. On the Create table page, in the Source section, do the following:

    • For Create table from, select Amazon S3.
    • For Select S3 path, enter a URI pointing to the S3 data in the format s3://BUCKET_NAME/PATH. Replace BUCKET_NAME with the name of the S3 bucket; the bucket's region must match the dataset's region. Replace PATH with the path to the data that you want to query; it can contain one wildcard *.
    • For File format, select the data format in S3. Supported formats are AVRO, PARQUET, ORC, CSV, and NEWLINE_DELIMITED_JSON.
  5. Optional: If the data in AWS uses hive partitioning, then do the following:

    • Select the Source data partitioning checkbox.
    • In the Select AWS path field, enter the AWS S3 path using wildcard format. For example, s3://s3-bucket/path/*.
    • In the Select Source URI Prefix field, enter the URI prefix. The URI prefix is the part of the URI that appears before the partition encoding. For example, s3://s3-bucket/path/.
    • Select a Partition inference mode. If you select Provide my own, then enter schema information for the partition keys.
  6. On the Create table page, in the Destination section, do the following:

    • For Dataset name, choose the appropriate dataset.
    • In the Table name field, enter the name of the table you're creating.
    • Verify that Table type is set to External table.
    • For Connection ID, choose the appropriate connection ID from the dropdown.
  7. Click Create table.

SQL

To create an external table, use the CREATE EXTERNAL TABLE statement:

CREATE EXTERNAL TABLE DATASET_NAME.TABLE_NAME
  WITH CONNECTION `AWS_LOCATION.CONNECTION_NAME`
  OPTIONS(
    format="DATA_FORMAT",
    uris=["S3_URI"]
  )

Replace the following:

  • DATASET_NAME: the name of the dataset you created.
  • TABLE_NAME: the name you want to give to this table.
  • AWS_LOCATION: an AWS location in Google Cloud (for example, aws-us-east-1).
  • CONNECTION_NAME: the name of the connection you created.
  • DATA_FORMAT: any of the supported BigQuery federated formats (such as AVRO or CSV).
  • S3_URI: a URI pointing to the S3 data (for example, s3://bucket/path).

Example:

CREATE EXTERNAL TABLE awsdataset.awstable
  WITH CONNECTION `aws-us-east-1.s3-read-connection`
  OPTIONS(format="CSV", uris=["s3://s3-bucket/path/file.csv"])

bq

Create a table definition file:

bq mkdef \
--source_format=DATA_FORMAT \
--connection_id=AWS_LOCATION.CONNECTION_NAME \
S3_URI > table_def

Replace the following:

  • DATA_FORMAT: any of the supported BigQuery federated formats (such as AVRO or CSV).
  • S3_URI: a URI pointing to the S3 data (for example, s3://bucket/path).
  • AWS_LOCATION: an AWS location in Google Cloud (for example, aws-us-east-1).
  • CONNECTION_NAME: the name of the connection you created.

Next, create the external table:

bq mk --external_table_definition=table_def DATASET_NAME.TABLE_NAME

Replace the following:

  • DATASET_NAME: the name of the dataset you created.
  • TABLE_NAME: the name you want to give to this table.

For example, the following command creates a new external table, awsdataset.awstable, which can query your AWS data that's stored at the path s3://s3-bucket/path/file.csv and has a read connection in the location aws-us-east-1:

bq mkdef \
--autodetect \
--source_format=CSV \
--connection_id=aws-us-east-1.s3-read-connection \
s3://s3-bucket/path/file.csv > table_def

bq mk --external_table_definition=table_def awsdataset.awstable

Hive partitioning

If the data in AWS uses hive partitioning, then create the table definition file as follows:

bq mkdef \
--source_format=DATA_FORMAT \
--connection_id=AWS_LOCATION.CONNECTION_NAME \
--hive_partitioning_mode=HIVE_PARTITIONING_MODE \
--hive_partitioning_source_uri_prefix=HIVE_PARTITIONING_URI_PREFIX \
"S3_URI" > table_def

Replace the following:

  • HIVE_PARTITIONING_MODE: The partition schema detection mode: AUTO, STRINGS, or CUSTOM.
  • HIVE_PARTITIONING_URI_PREFIX: The URI prefix, which is the part of the URI that appears before the partition encoding. For example, s3://s3-bucket/path/.
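For example, the following commands create an external table over hive-partitioned data, reusing the hypothetical bucket and connection names from the earlier example and letting BigQuery detect the partition schema (AUTO mode):

bq mkdef \
--source_format=CSV \
--connection_id=aws-us-east-1.s3-read-connection \
--hive_partitioning_mode=AUTO \
--hive_partitioning_source_uri_prefix=s3://s3-bucket/path/ \
"s3://s3-bucket/path/*" > table_def

bq mk --external_table_definition=table_def awsdataset.awstable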

Java

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryException;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.CsvOptions;
import com.google.cloud.bigquery.ExternalTableDefinition;
import com.google.cloud.bigquery.Field;
import com.google.cloud.bigquery.Schema;
import com.google.cloud.bigquery.StandardSQLTypeName;
import com.google.cloud.bigquery.TableId;
import com.google.cloud.bigquery.TableInfo;

// Sample to create an external AWS table
public class CreateExternalTableAws {

  public static void main(String[] args) {
    // TODO(developer): Replace these variables before running the sample.
    String projectId = "MY_PROJECT_ID";
    String datasetName = "MY_DATASET_NAME";
    String tableName = "MY_TABLE_NAME";
    String connectionId = "MY_CONNECTION_ID";
    String sourceUri = "s3://your-bucket-name/";
    CsvOptions options = CsvOptions.newBuilder().setSkipLeadingRows(1).build();
    Schema schema =
        Schema.of(
            Field.of("name", StandardSQLTypeName.STRING),
            Field.of("post_abbr", StandardSQLTypeName.STRING));
    ExternalTableDefinition externalTableDefinition =
        ExternalTableDefinition.newBuilder(sourceUri, options)
            .setConnectionId(connectionId)
            .setSchema(schema)
            .build();
    createExternalTableAws(projectId, datasetName, tableName, externalTableDefinition);
  }

  public static void createExternalTableAws(
      String projectId,
      String datasetName,
      String tableName,
      ExternalTableDefinition externalTableDefinition) {
    try {
      // Initialize client that will be used to send requests. This client only needs to be created
      // once, and can be reused for multiple requests.
      BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

      TableId tableId = TableId.of(projectId, datasetName, tableName);
      TableInfo tableInfo = TableInfo.newBuilder(tableId, externalTableDefinition).build();

      bigquery.create(tableInfo);
      System.out.println("Aws external table created successfully");

      // Clean up
      bigquery.delete(TableId.of(projectId, datasetName, tableName));
    } catch (BigQueryException e) {
      System.out.println("Aws external was not created." + e.toString());
    }
  }
}

Query the table

BigQuery Omni lets you query the external table like any BigQuery table. The maximum result size for interactive queries is 2 MB. If your query result is larger than that, consider exporting it to Amazon S3. The query result is stored in a BigQuery anonymous table.
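As a minimal sketch of that export path, the following EXPORT DATA statement writes a query result back to S3 through your connection; it assumes the hypothetical bucket, path, and connection names from the earlier examples, and the destination URI must contain a wildcard:

EXPORT DATA WITH CONNECTION `aws-us-east-1.s3-read-connection`
  OPTIONS(uri="s3://s3-bucket/exports/*", format="CSV")
  AS SELECT * FROM awsdataset.awstable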

To run a query, use the Google Cloud Console or bq command-line tool.

Google Cloud Console

  1. Open the BigQuery page in the Google Cloud Console.

    Go to the BigQuery page

  2. Enter a standard SQL query in the Query editor box. Standard SQL is the default syntax in the Google Cloud Console.

    SELECT * FROM DATASET_NAME.TABLE_NAME
    

    Replace the following:

    • DATASET_NAME: the dataset name that you created.
    • TABLE_NAME: the name of the external table that you created.
  3. Click Run.

bq

Use the query command and specify standard SQL syntax by using the --nouse_legacy_sql or --use_legacy_sql=false flag.

To run the query, enter the following:

bq query --nouse_legacy_sql \
'SELECT * FROM DATASET_NAME.TABLE_NAME'

Replace the following:

  • DATASET_NAME: the dataset name that you created.
  • TABLE_NAME: the external table name that you created.
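For example, the following command queries the hypothetical table created earlier:

bq query --nouse_legacy_sql \
'SELECT * FROM awsdataset.awstable'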

Java

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryException;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.CsvOptions;
import com.google.cloud.bigquery.DatasetId;
import com.google.cloud.bigquery.ExternalTableDefinition;
import com.google.cloud.bigquery.Field;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.Schema;
import com.google.cloud.bigquery.StandardSQLTypeName;
import com.google.cloud.bigquery.TableId;
import com.google.cloud.bigquery.TableInfo;
import com.google.cloud.bigquery.TableResult;

// Sample to query an external AWS S3 data source using a permanent table
public class QueryExternalTableAws {

  public static void main(String[] args) throws InterruptedException {
    // TODO(developer): Replace these variables before running the sample.
    String projectId = "MY_PROJECT_ID";
    String datasetName = "MY_DATASET_NAME";
    String externalTableName = "MY_EXTERNAL_TABLE_NAME";
    // Query to find states starting with 'W'
    String query =
        String.format(
            "SELECT * FROM s%.%s.%s WHERE name LIKE 'W%%'",
            projectId, datasetName, externalTableName);
    queryExternalTableAws(query);
  }

  public static void queryExternalTableAws(String query) throws InterruptedException {
    try {
      // Initialize client that will be used to send requests. This client only needs to be created
      // once, and can be reused for multiple requests.
      BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

      TableResult results = bigquery.query(QueryJobConfiguration.of(query));

      results
          .iterateAll()
          .forEach(row -> row.forEach(val -> System.out.printf("%s,", val.toString())));

      System.out.println("Query on aws external permanent table performed successfully.");
    } catch (BigQueryException e) {
      System.out.println("Query not performed \n" + e.toString());
    }
  }
}

What's next