Querying Google Cloud Storage Data

This page describes how to use BigQuery to query data stored in Google Cloud Storage.

BigQuery supports the following formats for data stored in Google Cloud Storage:

  • Comma-separated values (CSV)
  • JSON (newline-delimited)
  • Avro files
  • Google Cloud Datastore backup files

BigQuery supports querying Google Cloud Storage data from these storage classes:

  • Multi-Regional
  • Regional
  • Nearline
  • Coldline

To directly query a Cloud Storage external data source, provide the Cloud Storage URI path to your data and create a table that references the data source. The table used to reference the Cloud Storage data source can be a permanent table or a temporary table.

Retrieving the Google Cloud Storage URI

To create an external table using a Google Cloud Storage data source, you must provide the Cloud Storage URI.

The Cloud Storage URI comprises your bucket name and your object (filename). For example, if the Cloud Storage bucket is named mybucket and the data file is named myfile.csv, the URI would be gs://mybucket/myfile.csv. If your data is separated into multiple files, you can use a wildcard in the URI. For more information, see Cloud Storage Request URIs.

BigQuery does not support source URIs that include multiple consecutive slashes after the initial double slash. Cloud Storage object names can contain multiple consecutive slash ("/") characters. However, BigQuery converts multiple consecutive slashes into a single slash. For example, the following source URI, though valid in Cloud Storage, does not work in BigQuery: gs://[BUCKET]/my//object//name.
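
The slash restriction can be checked before a URI is handed to BigQuery. The following is a minimal sketch, not part of the bq tool: check_gcs_uri is a hypothetical helper name, and it only inspects the text of the URI.

```shell
# Hypothetical helper: succeed only if the URI starts with gs:// and has no
# run of two or more slashes after that initial double slash.
check_gcs_uri() {
  uri=$1
  case $uri in
    gs://*) ;;                                   # expected scheme
    *) echo "not a gs:// URI: $uri" >&2; return 1 ;;
  esac
  path=${uri#gs://}                              # keep bucket/object only
  case $path in
    *//*) echo "consecutive slashes in: $uri" >&2; return 1 ;;
  esac
  return 0
}

check_gcs_uri "gs://mybucket/myfile.csv" && echo "ok"
check_gcs_uri "gs://mybucket/my//object//name" || echo "rejected"
```

The check is purely textual. It does not confirm that the bucket or object exists.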

To retrieve the Cloud Storage URI:

  1. Open the Cloud Storage web UI.

  2. Browse to the location of the object (file) that contains the source data.

  3. At the top of the Cloud Storage web UI, note the path to the object. To compose the URI, replace gs://[BUCKET]/[FILE] with the appropriate path, for example, gs://mybucket/myfile.json. [BUCKET] is the Cloud Storage bucket name and [FILE] is the name of the object (file) containing the data.

Access controls and scopes

Access controls for permanent external tables

You can share access to a permanent external table linked to a Google Cloud Storage data source. You can share access with users (including service accounts) or groups. To query the external table, your users or groups need (at a minimum):

  • READER or bigquery.dataViewer access to the dataset containing the external table
  • bigquery.user access to the project containing the dataset (in order to run query jobs)
  • The predefined IAM role storage.objectViewer to read the Cloud Storage data or the following permissions:

    • storage.objects.get
    • storage.objects.list if you're using a URI wildcard
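
As an illustration of the Cloud Storage permissions listed above, the sketch below prints what a given source URI requires. The function name required_storage_permissions is hypothetical, and it covers only the storage permissions, not the BigQuery dataset and project roles.

```shell
# Hypothetical helper: print the Cloud Storage permissions needed to query
# the given source URI, following the list above.
required_storage_permissions() {
  echo "storage.objects.get"
  case $1 in
    *\**) echo "storage.objects.list" ;;   # wildcard URIs also need list
  esac
}

required_storage_permissions "gs://mybucket/fed-sample*"
```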

Scopes for Compute Engine instances

When you create a Compute Engine instance, you can specify a list of scopes for the instance. The scopes control the instance's access to Google Cloud Platform products, including Google Cloud Storage. Applications running on the VM use the service account attached to the instance to call Google Cloud APIs.

If you set up a Compute Engine instance to run as the default Compute Engine service account, and that service account accesses an external table linked to a Cloud Storage data source, the instance requires read-only access to Google Cloud Storage. The default Compute Engine service account is automatically granted the https://www.googleapis.com/auth/devstorage.read_only scope. If you create your own service account, apply the Cloud Storage read-only scope to the instance.
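
One way to confirm that an instance can read Cloud Storage is to inspect its scope list. The sketch below is a hypothetical helper, has_storage_read_scope, that reads one scope URL per line on standard input. It assumes that the broader devstorage.read_write, devstorage.full_control, and cloud-platform scopes also permit reads.

```shell
# Hypothetical helper: read scope URLs (one per line) from stdin and succeed
# if any of them allows reading Cloud Storage objects.
has_storage_read_scope() {
  # The "|| [ -n ... ]" clause handles a final line without a newline.
  while IFS= read -r scope || [ -n "$scope" ]; do
    case $scope in
      https://www.googleapis.com/auth/devstorage.read_only|\
      https://www.googleapis.com/auth/devstorage.read_write|\
      https://www.googleapis.com/auth/devstorage.full_control|\
      https://www.googleapis.com/auth/cloud-platform)
        return 0 ;;
    esac
  done
  return 1
}

printf '%s\n' "https://www.googleapis.com/auth/devstorage.read_only" \
  | has_storage_read_scope && echo "read access"
```

On a Compute Engine VM, the scope list can typically be fetched from the metadata server (the service-accounts/default/scopes path, with the Metadata-Flavor: Google header) and piped into the helper.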

For information on applying scopes to a Compute Engine instance, see Changing the service account and access scopes for an instance. For more information on Compute Engine service accounts, see Service Accounts.

Permanent versus temporary external tables

You can query an external data source in BigQuery using a permanent table or a temporary table. When you use a permanent table, you create a table in a BigQuery dataset that is linked to your external data source. Because the table is permanent, you can use dataset-level access controls to share the table with others who also have access to the underlying external data source, and you can query the table at any time.

When you query an external data source using a temporary table, you submit a command that includes a query and creates a non-permanent table linked to the external data source. When you use a temporary table, you do not create a table in one of your BigQuery datasets. Because the table is not permanently stored in a dataset, it cannot be shared with others. Querying an external data source using a temporary table is useful for one-time, ad-hoc queries over external data, or for extract, transform, and load (ETL) processes.

Querying Cloud Storage data using permanent external tables

To query an external data source using a permanent table, you create a table in a BigQuery dataset that is linked to your external data source. The data is not stored in the BigQuery table. Because the table is permanent, you can use dataset-level access controls to share the table with others who also have access to the underlying external data source.

There are three ways to specify schema information when you create a permanent external table in BigQuery:

  • If you are using the API to create a permanent external table, you begin by creating a table definition file that defines the schema and metadata for the external data source. When you create a table definition file, you can enable schema auto-detection for supported data sources.
  • If you are using the CLI to create a permanent external table, you can use a table definition file, you can create and use your own schema file, or you can enter the schema inline (on the command line).
  • If you are using the web UI to create a permanent external table, you can enter the table schema manually or use schema auto-detection for supported data sources.

To query Cloud Storage data using a permanent external table, you:

  • Create a table definition file (for the API and optionally for the CLI)
  • Create a table in BigQuery linked to the external data source
  • Query the table linked to the external data source
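
For the first step, a table definition file is a small JSON document. The sketch below prints a minimal definition for a CSV source with schema auto-detection enabled. The helper name make_csv_table_def is hypothetical, and the field names are assumed to follow the BigQuery ExternalDataConfiguration resource (sourceFormat, sourceUris, autodetect).

```shell
# Hypothetical helper: print a minimal table definition for one CSV source
# URI, with schema auto-detection turned on.
# Assumes the URI contains no double quotes or backslashes.
make_csv_table_def() {
  cat <<EOF
{
  "autodetect": true,
  "sourceFormat": "CSV",
  "sourceUris": ["$1"]
}
EOF
}

make_csv_table_def "gs://mybucket/sales.csv"
```

Redirecting the output to a file, for example /tmp/mytable_def, gives a definition file that can be passed to the --external_table_definition flag.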

Creating a permanent external table

You can create a permanent table linked to your external data source using the web UI, the CLI, or the API.

Web UI

  1. Go to the BigQuery web UI.

  2. In the navigation panel, hover on a dataset, click the down arrow icon, and click Create new table.

  3. On the Create Table page, in the Source Data section:

    • For Location, select Google Cloud Storage and in the source field, enter the Cloud Storage URI. Note that wildcards are supported for Google Cloud Storage URIs.
    • For File format, select the format of your data. Valid formats for Cloud Storage data include:

      • Comma-separated values (CSV)
      • JSON (newline delimited)
      • Avro
      • Cloud Datastore backup
  4. In the Schema section, enter the schema information.

    • For JSON or CSV files, you can check the Auto-detect option to enable schema auto-detect. Auto-detect is not available for Cloud Datastore backups and Avro files. Schema information is automatically retrieved from Cloud Datastore backups and Avro files.

    • You can also enter CSV or JSON schema information manually by:

      • Clicking Edit as text and entering the table schema in JSON format
      • Using Add Field to manually input the schema
  5. Select applicable items in the Options section and then click Create Table.

Once the permanent table is created, you can run a query against the table as if it were a native BigQuery table, subject to the limits on external data sources.

After your query completes, you can export the results as CSV or JSON files, save the results as a table, or save the results to Google Sheets. See Download, save, and export data for more information.

CLI

You create a table in the BigQuery command-line tool using the bq mk command. When you use the CLI to create a table linked to an external data source, you can identify the table's schema using:

  • A table definition file (stored on your local machine)
  • An inline schema definition
  • A JSON schema file (stored on your local machine)

To create a permanent table linked to your Cloud Storage data source using a table definition file, enter the following command.

bq mk --external_table_definition=[DEFINITION_FILE] [DATASET_ID].[TABLE_NAME]

Where:

  • [DEFINITION_FILE] is the path to the table definition file on your local machine
  • [DATASET_ID] is the name of the dataset that contains the table
  • [TABLE_NAME] is the name of the table you're creating

For example, the following command creates a permanent table named mytable using a table definition file named mytable_def.

bq mk --external_table_definition=/tmp/mytable_def mydataset.mytable

To create a permanent table linked to your external data source using an inline schema definition, enter the following command.

bq mk --external_table_definition=[SCHEMA]@[SOURCE_FORMAT]=[CLOUD_STORAGE_URI] [DATASET_ID].[TABLE_NAME]

Where:

  • [SCHEMA] is the schema definition (in the format [FIELD]:[DATA_TYPE],[FIELD]:[DATA_TYPE])
  • [SOURCE_FORMAT] is CSV, NEWLINE_DELIMITED_JSON, AVRO, or DATASTORE_BACKUP
  • [CLOUD_STORAGE_URI] is your Cloud Storage URI
  • [DATASET_ID] is the name of the dataset that contains the table
  • [TABLE_NAME] is the name of the table you're creating

For example, the following command creates a permanent table named sales linked to a CSV file stored in Cloud Storage with the following schema definition: Region:STRING,Quarter:STRING,Total_sales:INTEGER.

bq mk --external_table_definition=Region:STRING,Quarter:STRING,Total_sales:INTEGER@CSV=gs://mybucket/sales.csv mydataset.sales
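
Because the inline form packs three values into one flag, it can help to assemble it in a function. The following is a sketch under the syntax shown above. inline_table_def is a hypothetical helper, and it only rejects source formats that are not in the documented list.

```shell
# Hypothetical helper: compose the value of --external_table_definition from
# an inline schema, a source format, and a Cloud Storage URI.
inline_table_def() {
  schema=$1 format=$2 uri=$3
  case $format in
    CSV|NEWLINE_DELIMITED_JSON|AVRO|DATASTORE_BACKUP) ;;
    *) echo "unsupported source format: $format" >&2; return 1 ;;
  esac
  printf '%s@%s=%s\n' "$schema" "$format" "$uri"
}

inline_table_def "Region:STRING,Quarter:STRING,Total_sales:INTEGER" CSV "gs://mybucket/sales.csv"
```

The printed value can be substituted into the command, for example bq mk --external_table_definition="$(inline_table_def ...)" mydataset.sales.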

To create a permanent table linked to your external data source using a JSON schema file, enter the following command.

bq mk --external_table_definition=[SCHEMA_FILE]@[SOURCE_FORMAT]=[CLOUD_STORAGE_URI] [DATASET_ID].[TABLE_NAME]

Where:

  • [SCHEMA_FILE] is the path to the JSON schema file on your local machine
  • [SOURCE_FORMAT] is CSV, NEWLINE_DELIMITED_JSON, AVRO, or DATASTORE_BACKUP
  • [CLOUD_STORAGE_URI] is your Cloud Storage URI
  • [DATASET_ID] is the name of the dataset that contains the table
  • [TABLE_NAME] is the name of the table you're creating

For example, the following command creates a table named sales linked to a CSV file stored in Cloud Storage using the /tmp/sales_schema.json schema file.

bq mk --external_table_definition=/tmp/sales_schema.json@CSV=gs://mybucket/sales.csv mydataset.sales
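
The contents of the schema file are not shown above. As a sketch, a JSON schema file for the sales example could look like the output of the hypothetical helper below, assuming the usual BigQuery schema layout of one object per column with name and type keys.

```shell
# Hypothetical helper: print a JSON schema matching the inline definition
# Region:STRING,Quarter:STRING,Total_sales:INTEGER.
sales_schema_json() {
  cat <<'EOF'
[
  {"name": "Region", "type": "STRING"},
  {"name": "Quarter", "type": "STRING"},
  {"name": "Total_sales", "type": "INTEGER"}
]
EOF
}

sales_schema_json
```

Redirecting the output to /tmp/sales_schema.json produces a file of the kind the example command expects.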

Once the permanent table is created, you can then run a query against the table as if it were a native BigQuery table, subject to the limitations on external data sources.

After your query completes, you can download the results as CSV or JSON, save the results as a table, or save the results to Google Sheets. See Download, save, and export data for more information.

Querying Cloud Storage data using temporary tables

To query an external data source without creating a permanent table, you run a command to combine:

  • A table definition file with a query
  • An inline schema definition with a query
  • A JSON schema definition file with a query

The table definition file or supplied schema is used to create the temporary external table, and the query runs against the temporary external table. Querying an external data source using a temporary table is supported by the BigQuery CLI and API.

When you use a temporary external table, you do not create a table in one of your BigQuery datasets. Because the table is not permanently stored in a dataset, it cannot be shared with others. Querying an external data source using a temporary table is useful for one-time, ad-hoc queries over external data, or for extract, transform, and load (ETL) processes.

Creating and querying a temporary table

You can create and query a temporary table linked to an external data source using the CLI or the API.

CLI

You query a temporary table linked to an external data source using the bq query command with the --external_table_definition flag. When you use the CLI to query a temporary table linked to an external data source, you can identify the table's schema using:

  • A table definition file (stored on your local machine)
  • An inline schema definition
  • A JSON schema file (stored on your local machine)

To query a temporary table linked to your external data source using a table definition file, enter the following command.

bq query --external_table_definition=[TABLE_NAME]::[DEFINITION_FILE] '[QUERY]'

Where:

  • [TABLE_NAME] is the name of the temporary table you're creating
  • [DEFINITION_FILE] is the path to the table definition file on your local machine
  • [QUERY] is the query you're submitting to the temporary table

For example, the following command creates and queries a temporary table named sales using a table definition file named sales_def.

bq query --external_table_definition=sales::sales_def 'SELECT Region,Total_sales FROM sales;'

To query a temporary table linked to your external data source using an inline schema definition, enter the following command.

bq query --external_table_definition=[TABLE_NAME]::[SCHEMA]@[SOURCE_FORMAT]=[CLOUD_STORAGE_URI] '[QUERY]'

Where:

  • [TABLE_NAME] is the name of the temporary table you're creating
  • [SCHEMA] is the inline schema definition (in the format [FIELD]:[DATA_TYPE],[FIELD]:[DATA_TYPE])
  • [SOURCE_FORMAT] is CSV, NEWLINE_DELIMITED_JSON, AVRO, or DATASTORE_BACKUP
  • [CLOUD_STORAGE_URI] is your Cloud Storage URI
  • [QUERY] is the query you're submitting to the temporary table

For example, the following command creates and queries a temporary table named sales linked to a CSV file stored in Cloud Storage with the following schema definition: Region:STRING,Quarter:STRING,Total_sales:INTEGER.

bq query --external_table_definition=sales::Region:STRING,Quarter:STRING,Total_sales:INTEGER@CSV=gs://mybucket/sales.csv 'SELECT Region,Total_sales FROM sales;'

To query a temporary table linked to your external data source using a JSON schema file, enter the following command.

bq query --external_table_definition=[TABLE_NAME]::[SCHEMA_FILE]@[SOURCE_FORMAT]=[CLOUD_STORAGE_URI] '[QUERY]'

Where:

  • [TABLE_NAME] is the name of the temporary table you're creating
  • [SCHEMA_FILE] is the path to the JSON schema file on your local machine
  • [SOURCE_FORMAT] is CSV, NEWLINE_DELIMITED_JSON, AVRO, or DATASTORE_BACKUP
  • [CLOUD_STORAGE_URI] is your Cloud Storage URI
  • [QUERY] is the query you're submitting to the temporary table

For example, the following command creates and queries a temporary table named sales linked to a CSV file stored in Cloud Storage using the /tmp/sales_schema.json schema file.

bq query --external_table_definition=sales::/tmp/sales_schema.json@CSV=gs://mybucket/sales.csv 'SELECT Region,Total_sales FROM sales;'
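
All three temporary-table forms share the same shape: a table name, a double colon, and a definition. The sketch below builds that flag. temp_table_flag is a hypothetical helper, and it checks only that both parts are non-empty.

```shell
# Hypothetical helper: build the --external_table_definition flag for a
# temporary table. The definition may be a definition file path, or a
# schema (inline or file) followed by @FORMAT=URI.
temp_table_flag() {
  name=$1 definition=$2
  if [ -z "$name" ] || [ -z "$definition" ]; then
    echo "usage: temp_table_flag TABLE_NAME DEFINITION" >&2
    return 1
  fi
  printf '%s%s::%s\n' "--external_table_definition=" "$name" "$definition"
}

temp_table_flag sales "sales_def"
temp_table_flag sales "/tmp/sales_schema.json@CSV=gs://mybucket/sales.csv"
```

The result can be passed straight to the query command, for example bq query "$(temp_table_flag sales sales_def)" 'SELECT Region,Total_sales FROM sales;'.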

Wildcard support for Cloud Storage URIs

If your Google Cloud Storage data is separated into multiple files that share a common base-name, you can use a wildcard in the URI in the table definition file. You can also use a wildcard when you create an external table without using a table definition file.

To add a wildcard to the Cloud Storage URI, you append an asterisk (*) to the base-name. For example, if you have two files named fed-sample000001.csv and fed-sample000002.csv, the URI would be gs://mybucket/fed-sample*. This wildcard URI can then be used in the web UI, CLI, or API.

You can use only one wildcard for objects (filenames) within your bucket. The wildcard can appear inside the object name or at the end of the object name. Appending a wildcard to the bucket name is unsupported.
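
These wildcard rules are simple enough to check mechanically. The following is a minimal sketch with a hypothetical helper, check_wildcard_uri. It assumes the URI has the form gs://bucket/object and tests only the placement and count of asterisks.

```shell
# Hypothetical helper: succeed if the URI has no wildcard in the bucket name
# and at most one wildcard in the object name.
check_wildcard_uri() {
  path=${1#gs://}          # bucket/object
  bucket=${path%%/*}       # text before the first slash
  object=${path#*/}        # text after the first slash
  case $bucket in
    *\**) echo "wildcard not allowed in bucket name" >&2; return 1 ;;
  esac
  case $object in
    *\**\**) echo "only one wildcard is allowed" >&2; return 1 ;;
  esac
  return 0
}

check_wildcard_uri "gs://mybucket/fed-sample*" && echo "ok"
check_wildcard_uri "gs://mybucket*/fed-sample.csv" || echo "rejected"
```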

For Google Cloud Datastore backups, only one URI can be specified, and it must end with .backup_info or .export_metadata. The * wildcard character is not allowed when creating external tables linked to Cloud Datastore backups or when loading Cloud Datastore backup data from Cloud Storage.
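
The Cloud Datastore backup rule can be sketched in the same way. check_datastore_backup_uri is a hypothetical helper, and the object path in the example is made up for illustration.

```shell
# Hypothetical helper: succeed if the URI has no wildcard and ends with
# .backup_info or .export_metadata.
check_datastore_backup_uri() {
  case $1 in
    *\**) echo "wildcards are not allowed" >&2; return 1 ;;
    *.backup_info|*.export_metadata) return 0 ;;
    *) echo "must end with .backup_info or .export_metadata" >&2; return 1 ;;
  esac
}

# Illustrative object name, not taken from a real bucket.
check_datastore_backup_uri "gs://mybucket/backups/mybackup.backup_info" && echo "ok"
```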

The _FILE_NAME pseudo column

Tables based on external data sources provide a pseudo column named _FILE_NAME. This column contains the fully qualified path to the file to which the row belongs. This column is available only for tables that reference external data stored in Google Cloud Storage and Google Drive.

The _FILE_NAME column name is reserved, which means that you cannot create a column by that name in any of your tables. To select the value of _FILE_NAME, you must use an alias. The following example query demonstrates selecting _FILE_NAME by assigning the alias fn to the pseudo column.

bq query --project_id=[PROJECT_ID] --use_legacy_sql=false 'SELECT name, _FILE_NAME AS fn FROM [DATASET].[TABLE_NAME] WHERE name LIKE "%Alex%";'

Where:

  • [PROJECT_ID] is a valid project ID (this flag is not required if you use Cloud Shell or if you set a default project in the Cloud SDK)
  • [DATASET] is the name of the dataset that stores the permanent external table
  • [TABLE_NAME] is the name of the permanent external table
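
A common use of the pseudo column is to see how many rows each source file contributes, which is useful when a wildcard URI matches many files. The sketch below only prints a standard SQL query string. file_row_count_query is a hypothetical helper, and the query is an untested illustration that assumes grouping by the alias behaves as it does for ordinary columns.

```shell
# Hypothetical helper: print a standard SQL query that counts rows per
# source file for the given dataset and external table.
file_row_count_query() {
  printf 'SELECT _FILE_NAME AS fn, COUNT(*) AS row_count FROM %s.%s GROUP BY fn\n' "$1" "$2"
}

file_row_count_query mydataset mytable
```

The printed query can be passed to bq query with --use_legacy_sql=false, for example bq query --use_legacy_sql=false "$(file_row_count_query mydataset mytable)".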
