Query unpartitioned data in Cloud Storage

BigQuery supports querying Cloud Storage data in the following formats:

  • Comma-separated values (CSV)
  • JSON (newline-delimited)
  • Avro
  • ORC
  • Parquet
  • Datastore exports
  • Firestore exports

BigQuery supports querying Cloud Storage data from these storage classes:

  • Standard
  • Nearline
  • Coldline
  • Archive

To query a Cloud Storage external data source, provide the Cloud Storage URI path to your data and create a table that references the data source. The table used to reference the Cloud Storage data source can be a permanent table or a temporary table.

Be sure to consider the location of your dataset and Cloud Storage bucket when you query data stored in Cloud Storage.

Limitations

For information about limitations that apply to external tables, see External table limitations.

Before you begin

Grant Identity and Access Management (IAM) roles that give users the necessary permissions to perform each task in this document. The permissions required to perform a task (if any) are listed in the "Required permissions" section of the task.

Retrieve the Cloud Storage URI

To create an external table using a Cloud Storage data source, you must provide the Cloud Storage URI.

The Cloud Storage resource path contains your bucket name and your object (filename). For example, if the Cloud Storage bucket is named mybucket and the data file is named myfile.csv, the resource path would be gs://mybucket/myfile.csv. If your data is separated into multiple files, you can use a wildcard in the path.

BigQuery does not support Cloud Storage resource paths that include multiple consecutive slashes after the initial double slash. Cloud Storage object names can contain multiple consecutive slash ("/") characters. However, BigQuery converts multiple consecutive slashes into a single slash. For example, the following resource path, though valid in Cloud Storage, does not work in BigQuery: gs://bucket/my//object//name.

To retrieve the Cloud Storage resource path:

  1. Open the Cloud Storage console.

    Cloud Storage console

  2. Browse to the location of the object (file) that contains the source data.

  3. Click on the name of the object.

    The Object details page opens.

  4. Copy the value provided in the gsutil URI field, which begins with gs://.
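Alternatively, you can list object URIs programmatically. The following is a minimal sketch that uses the google-cloud-storage Python client to print the gs:// URI of each object in a bucket; mybucket and the fed-samples/ prefix are placeholder values:

from google.cloud import storage

# List objects under the prefix and print their gs:// URIs.
client = storage.Client()
for blob in client.list_blobs("mybucket", prefix="fed-samples/"):
    print(f"gs://{blob.bucket.name}/{blob.name}")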

Create and query Cloud Storage data using permanent external tables

Use the information in this section to learn how to create and then query an external table.

Required permissions

To query external data in Cloud Storage using a permanent table, you need permissions to do the following:

  • Run a query job at the project level or higher.
  • Create a table that points to the external data.
  • Access the table.

If your external data is stored in Cloud Storage, you also need permissions to access the bucket that contains your data.

Permissions to create and query an external table in BigQuery

To create and query an external table in BigQuery, you need the following IAM permissions:

  • bigquery.tables.create
  • bigquery.tables.getData
  • bigquery.jobs.create

Each of the following predefined IAM roles includes the permissions that you need in order to create and query an external table in BigQuery:

  • roles/bigquery.dataEditor
  • roles/bigquery.dataOwner
  • roles/bigquery.admin (includes the bigquery.jobs.create permission)
  • roles/bigquery.user (includes the bigquery.jobs.create permission)
  • roles/bigquery.jobUser (includes the bigquery.jobs.create permission)

Additionally, if you have the bigquery.datasets.create permission, you can create and access external tables in the datasets that you create. However, you still require the bigquery.jobs.create permission to query the data.

For more information on IAM roles and permissions in BigQuery, see Predefined roles and permissions.
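If you are unsure whether you hold the table-level permissions listed above, you can check programmatically. The following is a minimal sketch using the Python client's test_iam_permissions method; mydataset.mytable is a placeholder table, and note that bigquery.jobs.create is a project-level permission that can't be tested against a table:

from google.cloud import bigquery

client = bigquery.Client()
# Returns the subset of the requested table-level permissions that you hold.
response = client.test_iam_permissions(
    "mydataset.mytable", ["bigquery.tables.getData"]
)
print(response.get("permissions", []))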

Permissions to query external data in a Cloud Storage bucket

To query external data in a Cloud Storage bucket, you need the following IAM permissions:

  • storage.buckets.get
  • storage.objects.get
  • storage.objects.list (required if you are using a URI wildcard)
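You can verify which of these bucket permissions you hold with a testIamPermissions call. The following is a minimal sketch using the google-cloud-storage Python client, with mybucket as a placeholder:

from google.cloud import storage

bucket = storage.Client().bucket("mybucket")
# Returns the subset of the requested permissions that the caller holds.
granted = bucket.test_iam_permissions(
    ["storage.buckets.get", "storage.objects.get", "storage.objects.list"]
)
print(granted)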

Access scopes for Compute Engine instances

If you need to query an external table that is linked to a Cloud Storage source from a Compute Engine instance, the instance must have at least the Cloud Storage read-only access scope (https://www.googleapis.com/auth/devstorage.read_only).

The scopes control the Compute Engine instance's access to Google Cloud products, including Cloud Storage. Applications running on the instance use the service account attached to the instance to call Google Cloud APIs.

If you set up a Compute Engine instance to run as the default Compute Engine service account, the instance is granted a number of default scopes, including the https://www.googleapis.com/auth/devstorage.read_only scope.

If instead you set up the instance with a custom service account, make sure to explicitly grant the https://www.googleapis.com/auth/devstorage.read_only scope to the instance.

For information on applying scopes to a Compute Engine instance, see Changing the service account and access scopes for an instance. For more information on Compute Engine service accounts, see Service accounts.
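To confirm which scopes an instance actually has, you can query the metadata server from inside the instance. The following is a minimal sketch, assuming the requests package is installed:

import requests

# Read the access scopes granted to the instance's attached service account.
resp = requests.get(
    "http://metadata.google.internal/computeMetadata/v1/"
    "instance/service-accounts/default/scopes",
    headers={"Metadata-Flavor": "Google"},
)
# The output should include .../auth/devstorage.read_only or a broader scope.
print(resp.text)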

Create a permanent external table

You can create a permanent table linked to your external data source by using the Google Cloud console, SQL, the bq command-line tool, the API, or the client libraries.

To create an external table:

Console

  1. In the Google Cloud console, open the BigQuery page.

    Go to BigQuery

  2. In the Explorer panel, expand your project and select a dataset.

  3. Expand the Actions option and click Create table.

  4. For Create table from, select Google Cloud Storage, and then select the data source for the external table from your Cloud Storage bucket.

    You cannot include multiple URIs in the Google Cloud console, but wildcards are supported. The Cloud Storage bucket must be in the same location as the dataset that contains the table you're creating.

  5. For File format, select the format that matches your file.

  6. On the Create table page, in the Destination section:

    • For Dataset, choose the appropriate dataset.
    • Verify that Table type is set to External table.
    • In the Table field, enter the name of the table you're creating in BigQuery.

  7. In the Schema section, you can either enable schema auto-detection or specify a schema manually.

    • To enable schema auto-detection, select the Auto-detect option.

    • To manually specify a schema, leave the Auto-detect option unchecked, enable Edit as text, and enter the table schema as a JSON array.

  8. Click Create table.

After the permanent table is created, you can run a query against the table as if it were a native BigQuery table. After your query completes, you can export the results as CSV or JSON files, save the results as a table, or save the results to Google Sheets.

SQL

You can create a permanent external table by running the CREATE EXTERNAL TABLE DDL statement. You can specify the schema explicitly. If you don't specify a schema, BigQuery uses schema auto-detection to infer the schema from the external data.

The following example uses schema auto-detection to create an external table named sales that is linked to a CSV file stored in Cloud Storage:

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the query editor, enter the following statement:

    CREATE OR REPLACE EXTERNAL TABLE mydataset.sales
      OPTIONS (
      format = 'CSV',
      uris = ['gs://mybucket/sales.csv']);
    

  3. Click Run.

For more information about how to run queries, see Running interactive queries.

The next example specifies a schema explicitly and skips the first row in the CSV file:

CREATE OR REPLACE EXTERNAL TABLE mydataset.sales (
  Region STRING,
  Quarter STRING,
  Total_Sales INT64
) OPTIONS (
    format = 'CSV',
    uris = ['gs://mybucket/sales.csv'],
    skip_leading_rows = 1);

bq

To create an external table, use the bq mk command with the --external_table_definition flag. This flag contains either a path to a table definition file or an inline table definition.

Option 1: Table definition file

Use the bq mkdef command to create a table definition file, and then pass the file path to the bq mk command as follows:

bq mkdef --source_format=SOURCE_FORMAT \
  BUCKET_URI > DEFINITION_FILE

bq mk --table \
  --external_table_definition=DEFINITION_FILE \
  DATASET_NAME.TABLE_NAME \
  SCHEMA

Where:

  • SOURCE_FORMAT is the format of the external data source. For example, CSV.
  • BUCKET_URI is your Cloud Storage URI.
  • DEFINITION_FILE is the path to the table definition file on your local machine.
  • DATASET_NAME is the name of the dataset that contains the table.
  • TABLE_NAME is the name of the table you're creating.
  • SCHEMA specifies a path to a JSON schema file, or specifies the schema in the form field:data_type,field:data_type,....

Example:

bq mkdef --source_format=CSV gs://mybucket/sales.csv > mytable_def

bq mk --table --external_table_definition=mytable_def \
  mydataset.mytable \
  Region:STRING,Quarter:STRING,Total_sales:INTEGER

To use schema auto-detection, set the --autodetect=true flag in the mkdef command and omit the schema:

bq mkdef --source_format=CSV --autodetect=true \
  gs://mybucket/sales.csv > mytable_def

bq mk --table --external_table_definition=mytable_def \
  mydataset.mytable

Option 2: Inline table definition

Instead of creating a table definition file, you can pass the table definition directly to the bq mk command:

bq mk --table \
  --external_table_definition=@SOURCE_FORMAT=BUCKET_URI \
  DATASET_NAME.TABLE_NAME \
  SCHEMA

Where:

  • SOURCE_FORMAT is the format of the external data source, for example, CSV.
  • BUCKET_URI is your Cloud Storage URI.
  • DATASET_NAME is the name of the dataset that contains the table.
  • TABLE_NAME is the name of the table you're creating.
  • SCHEMA specifies a path to a JSON schema file, or specifies the schema in the form field:data_type,field:data_type,.... To use schema auto-detection, omit this argument.

Example:

bq mk --table \
  --external_table_definition=@CSV=gs://mybucket/sales.csv \
  mydataset.mytable \
  Region:STRING,Quarter:STRING,Total_sales:INTEGER

API

Use the tables.insert API method, and create an ExternalDataConfiguration in the Table resource that you pass in. Specify the schema property or set the autodetect property to true to enable schema auto-detection for supported data sources.
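For illustration, the following is a minimal sketch of a tables.insert call made through the google-api-python-client discovery client (an assumption; any HTTP client works). PROJECT_ID, DATASET_ID, and TABLE_ID are placeholders:

from googleapiclient.discovery import build

bigquery = build("bigquery", "v2")
table = {
    "tableReference": {
        "projectId": "PROJECT_ID",
        "datasetId": "DATASET_ID",
        "tableId": "TABLE_ID",
    },
    # The external data source; autodetect enables schema auto-detection.
    "externalDataConfiguration": {
        "sourceFormat": "CSV",
        "sourceUris": ["gs://mybucket/sales.csv"],
        "autodetect": True,
    },
}
bigquery.tables().insert(
    projectId="PROJECT_ID", datasetId="DATASET_ID", body=table
).execute()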

Java

Before trying this sample, follow the Java setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Java API reference documentation.

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryException;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.CsvOptions;
import com.google.cloud.bigquery.ExternalTableDefinition;
import com.google.cloud.bigquery.Field;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.Schema;
import com.google.cloud.bigquery.StandardSQLTypeName;
import com.google.cloud.bigquery.TableId;
import com.google.cloud.bigquery.TableInfo;
import com.google.cloud.bigquery.TableResult;

// Sample to query an external data source using a permanent table
public class QueryExternalGCSPerm {

  public static void runQueryExternalGCSPerm() {
    // TODO(developer): Replace these variables before running the sample.
    String datasetName = "MY_DATASET_NAME";
    String tableName = "MY_TABLE_NAME";
    String sourceUri = "gs://cloud-samples-data/bigquery/us-states/us-states.csv";
    Schema schema =
        Schema.of(
            Field.of("name", StandardSQLTypeName.STRING),
            Field.of("post_abbr", StandardSQLTypeName.STRING));
    String query =
        String.format("SELECT * FROM %s.%s WHERE name LIKE 'W%%'", datasetName, tableName);
    queryExternalGCSPerm(datasetName, tableName, sourceUri, schema, query);
  }

  public static void queryExternalGCSPerm(
      String datasetName, String tableName, String sourceUri, Schema schema, String query) {
    try {
      // Initialize client that will be used to send requests. This client only needs to be created
      // once, and can be reused for multiple requests.
      BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

      // Skip header row in the file.
      CsvOptions csvOptions = CsvOptions.newBuilder().setSkipLeadingRows(1).build();

      TableId tableId = TableId.of(datasetName, tableName);
      // Create a permanent table linked to the GCS file
      ExternalTableDefinition externalTable =
          ExternalTableDefinition.newBuilder(sourceUri, csvOptions).setSchema(schema).build();
      bigquery.create(TableInfo.of(tableId, externalTable));

      // Example query to find states starting with 'W'
      TableResult results = bigquery.query(QueryJobConfiguration.of(query));

      results
          .iterateAll()
          .forEach(row -> row.forEach(val -> System.out.printf("%s,", val.toString())));

      System.out.println("Query on external permanent table performed successfully.");
    } catch (BigQueryException | InterruptedException e) {
      System.out.println("Query not performed \n" + e.toString());
    }
  }
}

Node.js

Before trying this sample, follow the Node.js setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Node.js API reference documentation.

// Import the Google Cloud client library and create a client
const {BigQuery} = require('@google-cloud/bigquery');
const bigquery = new BigQuery();

async function queryExternalGCSPerm() {
  // Queries an external data source using a permanent table

  /**
   * TODO(developer): Uncomment the following lines before running the sample.
   */
  // const datasetId = "my_dataset";
  // const tableId = "my_table";

  // Define the schema for the external table.
  const schema = [
    {name: 'name', type: 'STRING'},
    {name: 'post_abbr', type: 'STRING'},
  ];

  // Configure the external data source
  const dataConfig = {
    sourceFormat: 'CSV',
    sourceUris: ['gs://cloud-samples-data/bigquery/us-states/us-states.csv'],
    // Optionally skip header row
    csvOptions: {skipLeadingRows: 1},
  };

  // For all options, see https://cloud.google.com/bigquery/docs/reference/v2/tables#resource
  const options = {
    schema: schema,
    externalDataConfiguration: dataConfig,
  };

  // Create an external table linked to the GCS file
  const [table] = await bigquery
    .dataset(datasetId)
    .createTable(tableId, options);

  console.log(`Table ${table.id} created.`);

  // Example query to find states starting with 'W'
  const query = `SELECT post_abbr
  FROM \`${datasetId}.${tableId}\`
  WHERE name LIKE 'W%'`;

  // Run the query as a job
  const [job] = await bigquery.createQueryJob(query);
  console.log(`Job ${job.id} started.`);

  // Wait for the query to finish
  const [rows] = await job.getQueryResults();

  // Print the results
  console.log('Rows:');
  console.log(rows);
}

Python

Before trying this sample, follow the Python setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Python API reference documentation.

from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

# TODO(developer): Set dataset_id to the ID of the dataset that will
# contain the external table.
dataset_id = "my_dataset"

# Configure the external data source.
dataset_ref = bigquery.DatasetReference(client.project, dataset_id)
table_id = "us_states"
schema = [
    bigquery.SchemaField("name", "STRING"),
    bigquery.SchemaField("post_abbr", "STRING"),
]
table = bigquery.Table(dataset_ref.table(table_id), schema=schema)
external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = [
    "gs://cloud-samples-data/bigquery/us-states/us-states.csv"
]
external_config.options.skip_leading_rows = 1  # optionally skip header row
table.external_data_configuration = external_config

# Create a permanent table linked to the GCS file
table = client.create_table(table)  # API request

# Example query to find states starting with 'W'
sql = 'SELECT * FROM `{}.{}` WHERE name LIKE "W%"'.format(dataset_id, table_id)

query_job = client.query(sql)  # API request

w_states = list(query_job)  # Waits for query to finish
print("There are {} states with names starting with W.".format(len(w_states)))

Query a permanent external table

To query an external data source using a permanent table, you create a table in a BigQuery dataset that is linked to your external data source. The data is not stored in the BigQuery table. Because the table is permanent, you can use access controls to share the table with others who also have access to the underlying external data source.

When you create a permanent external table, you can specify the schema in the following ways:

  • Use schema auto-detection.
  • Specify the schema explicitly, either inline or in a JSON schema file.
  • Provide a table definition file.

Query Cloud Storage data using temporary tables

To query an external data source without creating a permanent table, you run a command to combine:

  • A table definition file with a query
  • An inline schema definition with a query
  • A JSON schema definition file with a query

The table definition file or supplied schema is used to create the temporary external table, and the query runs against the temporary external table. Querying an external data source using a temporary table is supported by the bq command-line tool and the API.

When you use a temporary external table, you do not create a table in one of your BigQuery datasets. Because the table is not permanently stored in a dataset, it cannot be shared with others. Querying an external data source using a temporary table is useful for one-time, ad-hoc queries over external data, or for extract, transform, and load (ETL) processes.

Required permissions

To query external data in Cloud Storage using a temporary table, you need permissions to run a query job at the project level or higher and access to the dataset that contains the table that points to the external data. To query data in Cloud Storage, you also need permissions to access the bucket that contains your data.

Permissions to query an external table in BigQuery

To query an external table in BigQuery using a temporary table, you need the following IAM permissions:

  • bigquery.tables.getData
  • bigquery.jobs.create

Each of the following predefined IAM roles includes the permissions that you need in order to query an external table in BigQuery using a temporary table:

  • roles/bigquery.dataEditor
  • roles/bigquery.dataOwner
  • roles/bigquery.admin (includes the bigquery.jobs.create permission)
  • roles/bigquery.user (includes the bigquery.jobs.create permission)
  • roles/bigquery.jobUser (includes the bigquery.jobs.create permission)

Additionally, if you have the bigquery.datasets.create permission, you can create and access external tables in the datasets that you create. However, you still require the bigquery.jobs.create permission to query the data.

For more information on IAM roles and permissions in BigQuery, see Predefined roles and permissions.

Permissions to query external data in a Cloud Storage bucket

To query external data in a Cloud Storage bucket, you need the following IAM permissions:

  • storage.objects.get
  • storage.objects.list (required if you are using a URI wildcard)

The predefined IAM role roles/storage.objectViewer includes all the permissions that you need in order to query external data in a Cloud Storage bucket.

Create and query a temporary table

You can create and query a temporary table linked to an external data source by using the bq command-line tool, the API, or the client libraries.

bq

You query a temporary table linked to an external data source using the bq query command with the --external_table_definition flag. When you use the bq command-line tool to query a temporary table linked to an external data source, you can identify the table's schema using:

  • A table definition file (stored on your local machine)
  • An inline schema definition
  • A JSON schema file (stored on your local machine)

(Optional) Supply the --location flag and set the value to your location.

To query a temporary table linked to your external data source using a table definition file, enter the following command.

bq --location=location query \
--external_table_definition=table::definition_file \
'query'

Where:

  • location is the name of your location. The --location flag is optional. For example, if you are using BigQuery in the Tokyo region, you can set the flag's value to asia-northeast1. You can set a default value for the location using the .bigqueryrc file.
  • table is the name of the temporary table you're creating.
  • definition_file is the path to the table definition file on your local machine.
  • query is the query you're submitting to the temporary table.

For example, the following command creates and queries a temporary table named sales using a table definition file named sales_def.

bq query \
--external_table_definition=sales::sales_def \
'SELECT
  Region,
  Total_sales
FROM
  sales'

To query a temporary table linked to your external data source using an inline schema definition, enter the following command.

bq --location=location query \
--external_table_definition=table::schema@source_format=Cloud Storage URI \
'query'

Where:

  • location is the name of your location. The --location flag is optional. For example, if you are using BigQuery in the Tokyo region, you can set the flag's value to asia-northeast1. You can set a default value for the location using the .bigqueryrc file.
  • table is the name of the temporary table you're creating.
  • schema is the inline schema definition in the format field:data_type,field:data_type.
  • source_format is the format of the external data source, for example, CSV.
  • Cloud Storage URI is your Cloud Storage URI.
  • query is the query you're submitting to the temporary table.

For example, the following command creates and queries a temporary table named sales linked to a CSV file stored in Cloud Storage with the following schema definition: Region:STRING,Quarter:STRING,Total_sales:INTEGER.

bq query \
--external_table_definition=sales::Region:STRING,Quarter:STRING,Total_sales:INTEGER@CSV=gs://mybucket/sales.csv \
'SELECT
  Region,
  Total_sales
FROM
  sales'

To query a temporary table linked to your external data source using a JSON schema file, enter the following command.

bq --location=location query \
--external_table_definition=table::schema_file@source_format=Cloud Storage URI \
'query'

Where:

  • location is the name of your location. The --location flag is optional. For example, if you are using BigQuery in the Tokyo region, you can set the flag's value to asia-northeast1. You can set a default value for the location using the .bigqueryrc file.
  • table is the name of the temporary table you're creating.
  • schema_file is the path to the JSON schema file on your local machine.
  • source_format is the format of the external data source, for example, CSV.
  • Cloud Storage URI is your Cloud Storage URI.
  • query is the query you're submitting to the temporary table.

For example, the following command creates and queries a temporary table named sales linked to a CSV file stored in Cloud Storage using the /tmp/sales_schema.json schema file.

bq query \
--external_table_definition=sales::/tmp/sales_schema.json@CSV=gs://mybucket/sales.csv \
'SELECT
  Region,
  Total_sales
FROM
  sales'

API
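
Use the jobs.insert API method and define the temporary external table in the tableDefinitions property of the query job configuration. The following is a minimal sketch using the google-api-python-client discovery client (an assumption; any HTTP client works), with PROJECT_ID and the bucket path as placeholders:

from googleapiclient.discovery import build

bigquery = build("bigquery", "v2")
job = {
    "configuration": {
        "query": {
            "query": "SELECT Region, Total_sales FROM sales",
            "useLegacySql": False,
            # Maps the table name used in the query to an external source.
            "tableDefinitions": {
                "sales": {
                    "sourceFormat": "CSV",
                    "sourceUris": ["gs://mybucket/sales.csv"],
                    "autodetect": True,
                }
            },
        }
    }
}
bigquery.jobs().insert(projectId="PROJECT_ID", body=job).execute()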

Java

Before trying this sample, follow the Java setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Java API reference documentation.

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryException;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.CsvOptions;
import com.google.cloud.bigquery.ExternalTableDefinition;
import com.google.cloud.bigquery.Field;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.Schema;
import com.google.cloud.bigquery.StandardSQLTypeName;
import com.google.cloud.bigquery.TableResult;

// Sample to query an external data source using a temporary table
public class QueryExternalGCSTemp {

  public static void runQueryExternalGCSTemp() {
    // TODO(developer): Replace these variables before running the sample.
    String tableName = "MY_TABLE_NAME";
    String sourceUri = "gs://cloud-samples-data/bigquery/us-states/us-states.csv";
    Schema schema =
        Schema.of(
            Field.of("name", StandardSQLTypeName.STRING),
            Field.of("post_abbr", StandardSQLTypeName.STRING));
    String query = String.format("SELECT * FROM %s WHERE name LIKE 'W%%'", tableName);
    queryExternalGCSTemp(tableName, sourceUri, schema, query);
  }

  public static void queryExternalGCSTemp(
      String tableName, String sourceUri, Schema schema, String query) {
    try {
      // Initialize client that will be used to send requests. This client only needs to be created
      // once, and can be reused for multiple requests.
      BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

      // Skip header row in the file.
      CsvOptions csvOptions = CsvOptions.newBuilder().setSkipLeadingRows(1).build();

      // Configure the external data source and query job.
      ExternalTableDefinition externalTable =
          ExternalTableDefinition.newBuilder(sourceUri, csvOptions).setSchema(schema).build();
      QueryJobConfiguration queryConfig =
          QueryJobConfiguration.newBuilder(query)
              .addTableDefinition(tableName, externalTable)
              .build();

      // Example query to find states starting with 'W'
      TableResult results = bigquery.query(queryConfig);

      results
          .iterateAll()
          .forEach(row -> row.forEach(val -> System.out.printf("%s,", val.toString())));

      System.out.println("Query on external temporary table performed successfully.");
    } catch (BigQueryException | InterruptedException e) {
      System.out.println("Query not performed \n" + e.toString());
    }
  }
}

Node.js

Before trying this sample, follow the Node.js setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Node.js API reference documentation.

// Import the Google Cloud client library and create a client
const {BigQuery} = require('@google-cloud/bigquery');
const bigquery = new BigQuery();

async function queryExternalGCSTemp() {
  // Queries an external data source using a temporary table.

  const tableId = 'us_states';

  // Define the schema for the external table.
  const schema = [
    {name: 'name', type: 'STRING'},
    {name: 'post_abbr', type: 'STRING'},
  ];

  // Configure the external data source
  const externalDataConfig = {
    sourceFormat: 'CSV',
    sourceUris: ['gs://cloud-samples-data/bigquery/us-states/us-states.csv'],
    // Optionally skip header row.
    csvOptions: {skipLeadingRows: 1},
    schema: {fields: schema},
  };

  // Example query to find states starting with 'W'
  const query = `SELECT post_abbr
  FROM \`${tableId}\`
  WHERE name LIKE 'W%'`;

  // For all options, see https://cloud.google.com/bigquery/docs/reference/v2/tables#resource
  const options = {
    query,
    tableDefinitions: {[tableId]: externalDataConfig},
  };

  // Run the query as a job
  const [job] = await bigquery.createQueryJob(options);
  console.log(`Job ${job.id} started.`);

  // Wait for the query to finish
  const [rows] = await job.getQueryResults();

  // Print the results
  console.log('Rows:');
  console.log(rows);
}

Python

Before trying this sample, follow the Python setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Python API reference documentation.

from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

# Configure the external data source and query job.
external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = [
    "gs://cloud-samples-data/bigquery/us-states/us-states.csv"
]
external_config.schema = [
    bigquery.SchemaField("name", "STRING"),
    bigquery.SchemaField("post_abbr", "STRING"),
]
external_config.options.skip_leading_rows = 1
table_id = "us_states"
job_config = bigquery.QueryJobConfig(table_definitions={table_id: external_config})

# Example query to find states starting with 'W'.
sql = 'SELECT * FROM `{}` WHERE name LIKE "W%"'.format(table_id)

query_job = client.query(sql, job_config=job_config)  # Make an API request.

w_states = list(query_job)  # Wait for the job to complete.
print("There are {} states with names starting with W.".format(len(w_states)))

Query externally partitioned data

See instructions for querying externally partitioned Cloud Storage data.

Wildcard support for Cloud Storage URIs

If your Cloud Storage data is separated into multiple files that share a common base-name, you can use a wildcard in the URI in the table definition file. You can also use a wildcard when you create an external table without using a table definition file.

To add a wildcard to the Cloud Storage URI, append an asterisk (*) to the base name.

Examples:

  • The following wildcard URI selects all files in all folders that start with the prefix gs://mybucket/fed-samples/fed-sample:

    gs://mybucket/fed-samples/fed-sample*
    
  • The following wildcard URI selects only files with a .csv extension in the folder named fed-samples and any subfolders of fed-samples:

    gs://mybucket/fed-samples/*.csv
    
  • The following wildcard URI selects files with a naming pattern of fed-sample*.csv in the folder named fed-samples. This example doesn't select files in subfolders of fed-samples.

    gs://mybucket/fed-samples/fed-sample*.csv
    

When using the bq command-line tool, you might need to escape the asterisk on some platforms.

You can use only one wildcard for objects (filenames) within your bucket. The wildcard can appear inside the object name or at the end of the object name. Appending a wildcard to the bucket name is unsupported. Multiple wildcards are not supported in the source URI. For example, the path gs://mybucket/fed-*/temp/*.csv is not valid.
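For example, the following minimal Python sketch configures an external data source with a single, valid wildcard URI; the bucket and path are placeholders:

from google.cloud import bigquery

# One wildcard, inside the object name only; bucket names can't contain one.
external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = ["gs://mybucket/fed-samples/fed-sample*.csv"]
external_config.autodetect = True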

For Google Datastore exports, only one URI can be specified, and it must end with .backup_info or .export_metadata.

The asterisk wildcard character isn't allowed when you do the following:

  • Create external tables linked to Datastore or Firestore exports.
  • Load Datastore or Firestore export data from Cloud Storage.

The _FILE_NAME pseudo column

Tables based on external data sources provide a pseudo column named _FILE_NAME. This column contains the fully qualified path to the file to which the row belongs. This column is available only for tables that reference external data stored in Cloud Storage and Google Drive.

The _FILE_NAME column name is reserved, which means that you cannot create a column by that name in any of your tables. To select the value of _FILE_NAME, you must use an alias. The following example query demonstrates selecting _FILE_NAME by assigning the alias fn to the pseudo column.

bq query \
--project_id=project_id \
--use_legacy_sql=false \
'SELECT
   name,
   _FILE_NAME AS fn
 FROM
   `dataset.table_name`
 WHERE
   name LIKE "%Alex%"'

Where:

  • project_id is a valid project ID (this flag is not required if you use Cloud Shell or if you set a default project in the Google Cloud CLI)
  • dataset is the name of the dataset that stores the permanent external table
  • table_name is the name of the permanent external table

When the query has a filter predicate on the _FILE_NAME pseudo column, BigQuery attempts to skip reading files that do not satisfy the filter. Similar recommendations to querying ingestion-time partitioned tables using pseudo columns apply when constructing query predicates with the _FILE_NAME pseudo column.
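For example, the following minimal Python sketch filters on the pseudo column so that BigQuery can skip files outside the matching path; mydataset.mytable and the gs:// prefix are placeholders:

from google.cloud import bigquery

client = bigquery.Client()
sql = """
    SELECT name, _FILE_NAME AS fn
    FROM `mydataset.mytable`
    WHERE _FILE_NAME LIKE 'gs://mybucket/fed-samples/%'
"""
# Iterating the job waits for completion and yields result rows.
for row in client.query(sql):
    print(row["name"], row["fn"])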

What's next