Query Cloud Storage data in external tables

This document describes how to query data stored in a Cloud Storage external table.

Before you begin

Ensure that you have a Cloud Storage external table.

Required roles

To query Cloud Storage external tables, ensure you have the following roles:

  • BigQuery Data Viewer (roles/bigquery.dataViewer)
  • BigQuery User (roles/bigquery.user)
  • Storage Object Viewer (roles/storage.objectViewer)

Depending on your permissions, you can grant these roles to yourself or ask your administrator to grant them to you. For more information about granting roles, see Viewing the grantable roles on resources.

To see the exact BigQuery permissions that are required to query external tables, expand the Required permissions section:

Required permissions

You might also be able to get these permissions with custom roles or other predefined roles.

Query permanent external tables

After creating an Cloud Storage external table, you can query it using GoogleSQL syntax, the same as if it were a standard BigQuery table. For example, SELECT field1, field2 FROM mydataset.my_cloud_storage_table;.

Query temporary external tables

Querying an external data source using a temporary table is useful for one-time, ad-hoc queries over external data, or for extract, transform, and load (ETL) processes.

To query an external data source without creating a permanent table, you provide a table definition for the temporary table, and then use that table definition in a command or call to query the temporary table. You can provide the table definition in any of the following ways:

The table definition file or supplied schema is used to create the temporary external table, and the query runs against the temporary external table.

When you use a temporary external table, you do not create a table in one of your BigQuery datasets. Because the table is not permanently stored in a dataset, it cannot be shared with others.

You can create and query a temporary table linked to an external data source by using the bq command-line tool, the API, or the client libraries.

bq

You query a temporary table linked to an external data source using the bq query command with the --external_table_definition flag. When you use the bq command-line tool to query a temporary table linked to an external data source, you can identify the table's schema using:

(Optional) Supply the --location flag and set the value to your location.

To query a temporary table linked to your external data source using a table definition file, enter the following command.

bq --location=LOCATION query \
--external_table_definition=TABLE::DEFINITION_FILE \
'QUERY'

Replace the following:

  • LOCATION: the name of your location. The --location flag is optional. For example, if you are using BigQuery in the Tokyo region, you can set the flag's value to asia-northeast1. You can set a default value for the location using the .bigqueryrc file.
  • TABLE: the name of the temporary table you're creating.
  • DEFINITION_FILE: the path to the table definition file on your local machine.
  • QUERY: the query you're submitting to the temporary table.

For example, the following command creates and queries a temporary table named sales using a table definition file named sales_def.

bq query \
--external_table_definition=sales::sales_def \
'SELECT
  Region,
  Total_sales
FROM
  sales'

To query a temporary table linked to your external data source using an inline schema definition, enter the following command.

bq --location=LOCATION query \
--external_table_definition=TABLE::SCHEMA@SOURCE_FORMAT=BUCKET_PATH \
'QUERY'

Replace the following:

  • LOCATION: the name of your location. The --location flag is optional. For example, if you are using BigQuery in the Tokyo region, you can set the flag's value to asia-northeast1. You can set a default value for the location using the .bigqueryrc file.
  • TABLE: the name of the temporary table you're creating.
  • SCHEMA: the inline schema definition in the format field:data_type,field:data_type.
  • SOURCE_FORMAT: the format of the external data source, for example, CSV.
  • BUCKET_PATH: the path to the Cloud Storage bucket that contains the data for the table, in the format gs://bucket_name/[folder_name/]file_pattern.

    You can select multiple files from the bucket by specifying one asterisk (*) wildcard character in the file_pattern. For example, gs://mybucket/file00*.parquet. For more information, see Wildcard support for Cloud Storage URIs.

    You can specify multiple buckets for the uris option by providing multiple paths.

    The following examples show valid uris values:

    • gs://bucket/path1/myfile.csv
    • gs://bucket/path1/*.parquet
    • gs://bucket/path1/file1*, gs://bucket1/path1/*

    When you specify uris values that target multiple files, all of those files must share a compatible schema.

    For more information about using Cloud Storage URIs in BigQuery, see Cloud Storage resource path.

  • QUERY: the query you're submitting to the temporary table.

For example, the following command creates and queries a temporary table named sales linked to a CSV file stored in Cloud Storage with the following schema definition: Region:STRING,Quarter:STRING,Total_sales:INTEGER.

bq query \
--external_table_definition=sales::Region:STRING,Quarter:STRING,Total_sales:INTEGER@CSV=gs://mybucket/sales.csv \
'SELECT
  Region,
  Total_sales
FROM
  sales'

To query a temporary table linked to your external data source using a JSON schema file, enter the following command.

bq --location=LOCATION query \
--external_table_definition=SCHEMA_FILE@SOURCE_FORMAT=BUCKET_PATH \
'QUERY'

Replace the following:

  • LOCATION: the name of your location. The --location flag is optional. For example, if you are using BigQuery in the Tokyo region, you can set the flag's value to asia-northeast1. You can set a default value for the location using the .bigqueryrc file.
  • SCHEMA_FILE: the path to the JSON schema file on your local machine.
  • SOURCE_FORMAT: the format of the external data source, for example, CSV.
  • BUCKET_PATH: the path to the Cloud Storage bucket that contains the data for the table, in the format gs://bucket_name/[folder_name/]file_pattern.

    You can select multiple files from the bucket by specifying one asterisk (*) wildcard character in the file_pattern. For example, gs://mybucket/file00*.parquet. For more information, see Wildcard support for Cloud Storage URIs.

    You can specify multiple buckets for the uris option by providing multiple paths.

    The following examples show valid uris values:

    • gs://bucket/path1/myfile.csv
    • gs://bucket/path1/*.parquet
    • gs://bucket/path1/file1*, gs://bucket1/path1/*

    When you specify uris values that target multiple files, all of those files must share a compatible schema.

    For more information about using Cloud Storage URIs in BigQuery, see Cloud Storage resource path.

  • QUERY: the query you're submitting to the temporary table.

For example, the following command creates and queries a temporary table named sales linked to a CSV file stored in Cloud Storage using the /tmp/sales_schema.json schema file.

  bq query \
  --external_table_definition=sales::/tmp/sales_schema.json@CSV=gs://mybucket/sales.csv \
  'SELECT
      Region,
      Total_sales
    FROM
      sales'

API

To run a query using the API, follow these steps:

  1. Create a Job object.
  2. Populate the configuration section of the Job object with a JobConfiguration object.
  3. Populate the query section of the JobConfiguration object with a JobConfigurationQuery object.
  4. Populate the tableDefinitions section of the JobConfigurationQuery object with an ExternalDataConfiguration object.
  5. Call the jobs.insert method to run the query asynchronously or the jobs.query method to run the query synchronously, passing in the Job object.

Java

Before trying this sample, follow the Java setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Java API reference documentation.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryException;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.CsvOptions;
import com.google.cloud.bigquery.ExternalTableDefinition;
import com.google.cloud.bigquery.Field;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.Schema;
import com.google.cloud.bigquery.StandardSQLTypeName;
import com.google.cloud.bigquery.TableResult;

// Sample to queries an external data source using a temporary table
public class QueryExternalGCSTemp {

  public static void runQueryExternalGCSTemp() {
    // TODO(developer): Replace these variables before running the sample.
    String tableName = "MY_TABLE_NAME";
    String sourceUri = "gs://cloud-samples-data/bigquery/us-states/us-states.csv";
    Schema schema =
        Schema.of(
            Field.of("name", StandardSQLTypeName.STRING),
            Field.of("post_abbr", StandardSQLTypeName.STRING));
    String query = String.format("SELECT * FROM %s WHERE name LIKE 'W%%'", tableName);
    queryExternalGCSTemp(tableName, sourceUri, schema, query);
  }

  public static void queryExternalGCSTemp(
      String tableName, String sourceUri, Schema schema, String query) {
    try {
      // Initialize client that will be used to send requests. This client only needs to be created
      // once, and can be reused for multiple requests.
      BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

      // Skip header row in the file.
      CsvOptions csvOptions = CsvOptions.newBuilder().setSkipLeadingRows(1).build();

      // Configure the external data source and query job.
      ExternalTableDefinition externalTable =
          ExternalTableDefinition.newBuilder(sourceUri, csvOptions).setSchema(schema).build();
      QueryJobConfiguration queryConfig =
          QueryJobConfiguration.newBuilder(query)
              .addTableDefinition(tableName, externalTable)
              .build();

      // Example query to find states starting with 'W'
      TableResult results = bigquery.query(queryConfig);

      results
          .iterateAll()
          .forEach(row -> row.forEach(val -> System.out.printf("%s,", val.toString())));

      System.out.println("Query on external temporary table performed successfully.");
    } catch (BigQueryException | InterruptedException e) {
      System.out.println("Query not performed \n" + e.toString());
    }
  }
}

Node.js

Before trying this sample, follow the Node.js setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Node.js API reference documentation.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

// Import the Google Cloud client library and create a client
const {BigQuery} = require('@google-cloud/bigquery');
const bigquery = new BigQuery();

async function queryExternalGCSTemp() {
  // Queries an external data source using a temporary table.

  const tableId = 'us_states';

  // Configure the external data source
  const externalDataConfig = {
    sourceFormat: 'CSV',
    sourceUris: ['gs://cloud-samples-data/bigquery/us-states/us-states.csv'],
    // Optionally skip header row.
    csvOptions: {skipLeadingRows: 1},
    schema: {fields: schema},
  };

  // Example query to find states starting with 'W'
  const query = `SELECT post_abbr
  FROM \`${tableId}\`
  WHERE name LIKE 'W%'`;

  // For all options, see https://cloud.google.com/bigquery/docs/reference/v2/tables#resource
  const options = {
    query,
    tableDefinitions: {[tableId]: externalDataConfig},
  };

  // Run the query as a job
  const [job] = await bigquery.createQueryJob(options);
  console.log(`Job ${job.id} started.`);

  // Wait for the query to finish
  const [rows] = await job.getQueryResults();

  // Print the results
  console.log('Rows:');
  console.log(rows);
}

Python

Before trying this sample, follow the Python setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Python API reference documentation.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

# Configure the external data source and query job.
external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = [
    "gs://cloud-samples-data/bigquery/us-states/us-states.csv"
]
external_config.schema = [
    bigquery.SchemaField("name", "STRING"),
    bigquery.SchemaField("post_abbr", "STRING"),
]
external_config.options.skip_leading_rows = 1
table_id = "us_states"
job_config = bigquery.QueryJobConfig(table_definitions={table_id: external_config})

# Example query to find states starting with 'W'.
sql = 'SELECT * FROM `{}` WHERE name LIKE "W%"'.format(table_id)

query_job = client.query(sql, job_config=job_config)  # Make an API request.

w_states = list(query_job)  # Wait for the job to complete.
print("There are {} states with names starting with W.".format(len(w_states)))

Query the _FILE_NAME pseudo-column

Tables based on external data sources provide a pseudo-column named _FILE_NAME. This column contains the fully qualified path to the file to which the row belongs. This column is available only for tables that reference external data stored in Cloud Storage, Google Drive, Amazon S3, and Azure Blob Storage.

The _FILE_NAME column name is reserved, which means that you cannot create a column by that name in any of your tables. To select the value of _FILE_NAME, you must use an alias. The following example query demonstrates selecting _FILE_NAME by assigning the alias fn to the pseudo-column.

  bq query \
  --project_id=PROJECT_ID \
  --use_legacy_sql=false \
  'SELECT
     name,
     _FILE_NAME AS fn
   FROM
     `DATASET.TABLE_NAME`
   WHERE
     name contains "Alex"' 

Replace the following:

  • PROJECT_ID is a valid project ID (this flag is not required if you use Cloud Shell or if you set a default project in the Google Cloud CLI)
  • DATASET is the name of the dataset that stores the permanent external table
  • TABLE_NAME is the name of the permanent external table

When the query has a filter predicate on the _FILE_NAME pseudo-column, BigQuery attempts to skip reading files that do not satisfy the filter. Similar recommendations to querying ingestion-time partitioned tables using pseudo-columns apply when constructing query predicates with the _FILE_NAME pseudo-column.

What's next