BigQuery supports querying Cloud Storage data in the following formats:
- Comma-separated values (CSV)
- JSON (newline-delimited)
- Avro
- ORC (beta)
- Parquet (beta)
- Datastore exports
- Cloud Firestore exports
BigQuery supports querying Cloud Storage data from these storage classes:
- Multi-Regional
- Regional
- Nearline
- Coldline
To query a Cloud Storage external data source, provide the Cloud Storage URI path to your data and create a table that references the data source. The table used to reference the Cloud Storage data source can be a permanent table or a temporary table.
Be sure to consider the location of your dataset and Cloud Storage bucket when you query data stored in Cloud Storage.
Retrieving the Cloud Storage URI
To create an external table using a Cloud Storage data source, you must provide the Cloud Storage URI.
The Cloud Storage URI comprises your bucket name and your object (filename).
For example, if the Cloud Storage bucket is named mybucket and the data file is named myfile.csv, the bucket URI would be gs://mybucket/myfile.csv.
If your data is separated into multiple files, you can use a wildcard in the URI. For more information, see Cloud Storage Request URIs.
BigQuery does not support source URIs that include multiple consecutive slashes after the initial double slash. Cloud Storage object names can contain multiple consecutive slash ("/") characters. However, BigQuery converts multiple consecutive slashes into a single slash. For example, the following source URI, though valid in Cloud Storage, does not work in BigQuery: gs://bucket/my//object//name.
To retrieve the Cloud Storage URI:
- Open the Cloud Storage console.
- Browse to the location of the object (file) that contains the source data.
- At the top of the Cloud Storage console, note the path to the object. To compose the URI, replace gs://bucket/file with the appropriate path, for example, gs://mybucket/myfile.json. bucket is the Cloud Storage bucket name and file is the name of the object (file) containing the data.
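If you prefer to list objects programmatically and compose their URIs, the following is a minimal sketch using the google-cloud-storage Python client library; the bucket name mybucket is illustrative:
from google.cloud import storage

client = storage.Client()

# Print a gs:// URI for every object in the bucket (illustrative bucket name).
for blob in client.list_blobs("mybucket"):
    print("gs://mybucket/{}".format(blob.name))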
Permanent versus temporary external tables
You can query an external data source in BigQuery by using a permanent table or a temporary table. A permanent table is a table that is created in a dataset and is linked to your external data source. Because the table is permanent, you can use dataset-level access controls to share the table with others who also have access to the underlying external data source, and you can query the table at any time.
When you query an external data source using a temporary table, you submit a command that includes a query and creates a non-permanent table linked to the external data source. When you use a temporary table, you do not create a table in one of your BigQuery datasets. Because the table is not permanently stored in a dataset, it cannot be shared with others. Querying an external data source using a temporary table is useful for one-time, ad-hoc queries over external data, or for extract, transform, and load (ETL) processes.
Querying Cloud Storage data using permanent external tables
Required permissions and scopes
When you query external data in Cloud Storage using a permanent table, you need permissions to run a query job at the project level or higher, you need permissions that allow you to create a table that points to the external data, and you need permissions that allow you to access the table. When your external data is stored in Cloud Storage, you also need permissions to access the data in the Cloud Storage bucket.
BigQuery permissions
At a minimum, the following permissions are required to create and query an external table in BigQuery.
- bigquery.tables.create
- bigquery.tables.getData
- bigquery.jobs.create
The following predefined Cloud IAM roles include both bigquery.tables.create and bigquery.tables.getData permissions:
- bigquery.dataEditor
- bigquery.dataOwner
- bigquery.admin
The following predefined Cloud IAM roles include bigquery.jobs.create permissions:
- bigquery.user
- bigquery.jobUser
- bigquery.admin
In addition, if a user has bigquery.datasets.create permissions, when that user creates a dataset, they are granted bigquery.dataOwner access to it. bigquery.dataOwner access gives the user the ability to create external tables in the dataset, but bigquery.jobs.create permissions are still required to query the data.
For more information on Cloud IAM roles and permissions in BigQuery, see Predefined roles and permissions.
Cloud Storage permissions
In order to query external data in a Cloud Storage bucket, you must be granted storage.objects.get permissions. If you are using a URI wildcard, you must also have storage.objects.list permissions.
The predefined Cloud IAM role storage.objectViewer can be granted to provide both storage.objects.get and storage.objects.list permissions.
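If you administer the bucket, the role can also be granted programmatically. The following is a minimal sketch using the google-cloud-storage Python client library; the bucket name and member are hypothetical:
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("mybucket")  # hypothetical bucket name

# Read the bucket's IAM policy, add the member to the
# storage.objectViewer role, and save the updated policy.
policy = bucket.get_iam_policy()
policy["roles/storage.objectViewer"].add("user:analyst@example.com")  # hypothetical member
bucket.set_iam_policy(policy)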
Scopes for Compute Engine instances
When you create a Compute Engine instance, you can specify a list of scopes for the instance. The scopes control the instance's access to GCP products, including Cloud Storage. Applications running on the VM use the service account attached to the instance to call Google Cloud APIs.
If you set up a Compute Engine instance to run as the default Compute Engine service account, and that service account accesses an external table linked to a Cloud Storage data source, the instance requires read-only access to Cloud Storage. The default Compute Engine service account is automatically granted the https://www.googleapis.com/auth/devstorage.read_only scope. If you create your own service account, apply the Cloud Storage read-only scope to the instance.
For information on applying scopes to a Compute Engine instance, see Changing the service account and access scopes for an instance. For more information on Compute Engine service accounts, see Service accounts.
Creating and querying a permanent external table
You can create a permanent table linked to your external data source by:
- Using the Cloud Console or the classic BigQuery web UI
- Using the command-line tool's bq mk command
- Creating an ExternalDataConfiguration when you use the tables.insert API method
- Using the client libraries
To query an external data source using a permanent table, you create a table in a BigQuery dataset that is linked to your external data source. The data is not stored in the BigQuery table. Because the table is permanent, you can use dataset-level access controls to share the table with others who also have access to the underlying external data source.
There are three ways to specify schema information when you create a permanent external table in BigQuery:
- If you are using the tables.insert API method to create a permanent external table, you create a table resource that includes a schema definition and an ExternalDataConfiguration. Set the autodetect parameter to true to enable schema auto-detection for supported data sources.
- If you are using the CLI to create a permanent external table, you can use a table definition file, you can create and use your own schema file, or you can enter the schema inline (on the command line). When you create a table definition file, you can enable schema auto-detection for supported data sources.
- If you are using the console or the classic BigQuery web UI to create a permanent external table, you can enter the table schema manually or use schema auto-detection for supported data sources.
To create an external table:
Console
- Open the BigQuery web UI in the Cloud Console.
- In the navigation panel, in the Resources section, expand your project and select a dataset.
- Click Create table on the right side of the window.
On the Create table page, in the Source section:
- For Create table from, select Cloud Storage.
- In the Select file from Cloud Storage bucket field, browse for the file/Cloud Storage bucket, or enter the Cloud Storage URI. Note that you cannot include multiple URIs in the Cloud Console, but wildcards are supported. The Cloud Storage bucket must be in the same location as the dataset that contains the table you're creating.
- For File format, select the format of your data. Valid formats for external Cloud Storage data include:
  - Comma-separated values (CSV)
  - JSON (newline delimited)
  - Avro
  - Datastore backup (also used for Cloud Firestore)
On the Create table page, in the Destination section:
- For Dataset name, choose the appropriate dataset.
- Verify that Table type is set to External table.
- In the Table name field, enter the name of the table you're creating in BigQuery.
In the Schema section, enter the schema definition.
- For JSON or CSV files, you can check the Auto-detect option to enable schema auto-detection. Auto-detect is not available for Datastore exports, Cloud Firestore exports, and Avro files. Schema information for these file types is automatically retrieved from the self-describing source data.
- For CSV and JSON files, you can enter schema information manually by:
  - Enabling Edit as text and entering the table schema as a JSON array. Note: You can view the schema of an existing table in JSON format by entering the following CLI command: bq show --format=prettyjson dataset.table.
  - Using Add field to manually input the schema.
Click Create table.
After the permanent table is created, you can run a query against the table as if it were a native BigQuery table. After your query completes, you can export the results as CSV or JSON files, save the results as a table, or save the results to Google Sheets.
Classic UI
Go to the BigQuery web UI.
In the navigation panel, hover on a dataset, click the down arrow icon, and click Create new table.
On the Create Table page, in the Source Data section:
- For Location, select Cloud Storage and in the source field, enter the Cloud Storage URI. Note that wildcards are supported for Cloud Storage URIs.
- For File format, select the format of your data. Valid formats for Cloud Storage data include:
  - Comma-separated values (CSV)
  - JSON (newline delimited)
  - Avro
  - Datastore backup (also used for Cloud Firestore)
On the Create Table page, in the Destination Table section:
- For Table name, choose the appropriate dataset, and in the table name field, enter the name of the permanent table you're creating in BigQuery.
- Verify that Table type is set to External table.
In the Schema section, enter the schema information.
For JSON or CSV files, you can check the Auto-detect option to enable schema auto-detect. Auto-detect is not available for Datastore exports, Cloud Firestore exports, and Avro files. Schema information for these file types is automatically retrieved from the self-describing source data.
You can also enter CSV or JSON schema information manually by:
- Clicking Edit as text and entering the table schema in JSON format
- Using Add Field to manually input the schema
Select applicable items in the Options section and then click Create Table.
After the permanent table is created, you can run a query against the table as if it were a native BigQuery table. After your query completes, you can export the results as CSV or JSON files, save the results as a table, or save the results to Google Sheets.
CLI
You create a table in the BigQuery command-line tool using the bq mk command. When you use the CLI to create a table linked to an external data source, you can identify the table's schema using:
- A table definition file (stored on your local machine)
- An inline schema definition
- A JSON schema file (stored on your local machine)
To create a permanent table linked to your Cloud Storage data source by using a table definition file, enter the following command.
bq mk \
--external_table_definition=definition_file \
dataset.table
Where:
- definition_file is the path to the table definition file on your local machine.
- dataset is the name of the dataset that contains the table.
- table is the name of the table you're creating.
For example, the following command creates a permanent table named mytable using a table definition file named mytable_def.
bq mk --external_table_definition=/tmp/mytable_def mydataset.mytable
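A table definition file is a JSON representation of an ExternalDataConfiguration. As a sketch, a file such as /tmp/mytable_def might look like the following; the URI, options, and schema shown here are hypothetical:
{
  "sourceFormat": "CSV",
  "sourceUris": ["gs://mybucket/myfile.csv"],
  "csvOptions": {"skipLeadingRows": 1},
  "schema": {
    "fields": [
      {"name": "Region", "type": "STRING"},
      {"name": "Quarter", "type": "STRING"},
      {"name": "Total_sales", "type": "INTEGER"}
    ]
  }
}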
To create a permanent table linked to your external data source using an inline schema definition, enter the following command.
bq mk \
--external_table_definition=schema@source_format=Cloud Storage URI \
dataset.table
Where:
- schema is the schema definition in the format field:data_type,field:data_type.
- source_format is CSV, NEWLINE_DELIMITED_JSON, AVRO, or DATASTORE_BACKUP (DATASTORE_BACKUP is also used for Cloud Firestore).
- Cloud Storage URI is your Cloud Storage URI.
- dataset is the name of the dataset that contains the table.
- table is the name of the table you're creating.
For example, the following command creates a permanent table named sales linked to a CSV file stored in Cloud Storage with the following schema definition: Region:STRING,Quarter:STRING,Total_sales:INTEGER.
bq mk \
--external_table_definition=Region:STRING,Quarter:STRING,Total_sales:INTEGER@CSV=gs://mybucket/sales.csv \
mydataset.sales
To create a permanent table linked to your external data source using a JSON schema file, enter the following command.
bq mk \
--external_table_definition=schema@source_format=Cloud Storage URI \
dataset.table
Where:
- schema is the path to the JSON schema file on your local machine.
- source_format is CSV, NEWLINE_DELIMITED_JSON, AVRO, or DATASTORE_BACKUP (DATASTORE_BACKUP is also used for Cloud Firestore).
- Cloud Storage URI is your Cloud Storage URI.
- dataset is the name of the dataset that contains the table.
- table is the name of the table you're creating.
For example, the following command creates a table named sales linked to a CSV file stored in Cloud Storage using the /tmp/sales_schema.json schema file.
bq mk \
--external_table_definition=/tmp/sales_schema.json@CSV=gs://mybucket/sales.csv \
mydataset.sales
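A JSON schema file is an array of field definitions. As a sketch, /tmp/sales_schema.json for the sales table above might contain:
[
  {"name": "Region", "type": "STRING"},
  {"name": "Quarter", "type": "STRING"},
  {"name": "Total_sales", "type": "INTEGER"}
]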
After the permanent table is created, you can run a query against the table as if it were a native BigQuery table. After your query completes, you can export the results as CSV or JSON files, save the results as a table, or save the results to Google Sheets.
API
Create an ExternalDataConfiguration when you use the tables.insert API method. Specify the schema property or set the autodetect property to true to enable schema auto-detection for supported data sources.
Python
Before trying this sample, follow the Python setup instructions in the BigQuery Quickstart Using Client Libraries. For more information, see the BigQuery Python API reference documentation.
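The following is a minimal sketch of creating and querying a permanent external table with the google-cloud-bigquery client library; the dataset, table, URI, and schema are illustrative:
from google.cloud import bigquery

client = bigquery.Client()

# Configure the external data source: a CSV file in Cloud Storage.
external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = ["gs://mybucket/sales.csv"]  # illustrative URI
external_config.options.skip_leading_rows = 1  # skip a header row

# Define the schema explicitly; for supported formats you could instead
# set external_config.autodetect = True.
schema = [
    bigquery.SchemaField("Region", "STRING"),
    bigquery.SchemaField("Quarter", "STRING"),
    bigquery.SchemaField("Total_sales", "INTEGER"),
]

# Create a permanent table in the dataset that references the external data.
table_ref = client.dataset("mydataset").table("sales")  # illustrative names
table = bigquery.Table(table_ref, schema=schema)
table.external_data_configuration = external_config
table = client.create_table(table)

# Query the external table as if it were a native BigQuery table.
query_job = client.query("SELECT Region, Total_sales FROM `mydataset.sales`")
for row in query_job.result():
    print(row.Region, row.Total_sales)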
Querying Cloud Storage data using temporary tables
To query an external data source without creating a permanent table, you run a command to combine:
- A table definition file with a query
- An inline schema definition with a query
- A JSON schema definition file with a query
The table definition file or supplied schema is used to create the temporary external table, and the query runs against the temporary external table. Querying an external data source using a temporary table is supported by the BigQuery CLI and API.
When you use a temporary external table, you do not create a table in one of your BigQuery datasets. Because the table is not permanently stored in a dataset, it cannot be shared with others. Querying an external data source using a temporary table is useful for one-time, ad-hoc queries over external data, or for extract, transform, and load (ETL) processes.
Required permissions
When you query external data in Cloud Storage using a temporary table, you need permissions to run a query job at the project level or higher, and you need access to the dataset that contains the table that points to the external data. When you query data in Cloud Storage, you also need permissions to access the bucket that contains your data.
BigQuery permissions
At a minimum, the following permissions are required to query an external table in BigQuery using a temporary table.
- bigquery.tables.getData
- bigquery.jobs.create
The following predefined Cloud IAM roles include bigquery.tables.getData permissions:
- bigquery.dataEditor
- bigquery.dataOwner
- bigquery.admin
The following predefined Cloud IAM roles include bigquery.jobs.create permissions:
- bigquery.user
- bigquery.jobUser
- bigquery.admin
In addition, if a user has bigquery.datasets.create permissions, when that user creates a dataset, they are granted bigquery.dataOwner access to it. bigquery.dataOwner access gives the user the ability to create and access external tables in the dataset, but bigquery.jobs.create permissions are still required to query the data.
For more information on Cloud IAM roles and permissions in BigQuery, see Predefined roles and permissions.
Cloud Storage permissions
In order to query external data in a Cloud Storage bucket, you must be granted storage.objects.get permissions. If you are using a URI wildcard, you must also have storage.objects.list permissions.
The predefined Cloud IAM role storage.objectViewer can be granted to provide both storage.objects.get and storage.objects.list permissions.
Creating and querying a temporary table
You can create and query a temporary table linked to an external data source by using the CLI, the API, or the client libraries.
CLI
You query a temporary table linked to an external data source using the bq query command with the --external_table_definition flag. When you use the CLI to query a temporary table linked to an external data source, you can identify the table's schema using:
- A table definition file (stored on your local machine)
- An inline schema definition
- A JSON schema file (stored on your local machine)
(Optional) Supply the --location flag and set the value to your location.
To query a temporary table linked to your external data source using a table definition file, enter the following command.
bq --location=location query \
--external_table_definition=table::definition_file \
'query'
Where:
- location is the name of your location. The --location flag is optional. For example, if you are using BigQuery in the Tokyo region, you can set the flag's value to asia-northeast1. You can set a default value for the location using the .bigqueryrc file.
- table is the name of the temporary table you're creating.
- definition_file is the path to the table definition file on your local machine.
- query is the query you're submitting to the temporary table.
For example, the following command creates and queries a temporary table named sales using a table definition file named sales_def.
bq query \
--external_table_definition=sales::sales_def \
'SELECT
Region,
Total_sales
FROM
sales'
To query a temporary table linked to your external data source using an inline schema definition, enter the following command.
bq --location=location query \
--external_table_definition=table::schema@source_format=Cloud Storage URI \
'query'
Where:
- location is the name of your location. The --location flag is optional. For example, if you are using BigQuery in the Tokyo region, you can set the flag's value to asia-northeast1. You can set a default value for the location using the .bigqueryrc file.
- table is the name of the temporary table you're creating.
- schema is the inline schema definition in the format field:data_type,field:data_type.
- source_format is CSV, NEWLINE_DELIMITED_JSON, AVRO, or DATASTORE_BACKUP (DATASTORE_BACKUP is also used for Cloud Firestore).
- Cloud Storage URI is your Cloud Storage URI.
- query is the query you're submitting to the temporary table.
For example, the following command creates and queries a temporary table named sales linked to a CSV file stored in Cloud Storage with the following schema definition: Region:STRING,Quarter:STRING,Total_sales:INTEGER.
bq query \
--external_table_definition=sales::Region:STRING,Quarter:STRING,Total_sales:INTEGER@CSV=gs://mybucket/sales.csv \
'SELECT
Region,
Total_sales
FROM
sales'
To query a temporary table linked to your external data source using a JSON schema file, enter the following command.
bq --location=location query \
--external_table_definition=table::schema_file@source_format=Cloud Storage URI \
'query'
Where:
- location is the name of your location. The --location flag is optional. For example, if you are using BigQuery in the Tokyo region, you can set the flag's value to asia-northeast1. You can set a default value for the location using the .bigqueryrc file.
- table is the name of the temporary table you're creating.
- schema_file is the path to the JSON schema file on your local machine.
- source_format is CSV, NEWLINE_DELIMITED_JSON, AVRO, or DATASTORE_BACKUP (DATASTORE_BACKUP is also used for Cloud Firestore).
- Cloud Storage URI is your Cloud Storage URI.
- query is the query you're submitting to the temporary table.
For example, the following command creates and queries a temporary table named sales linked to a CSV file stored in Cloud Storage using the /tmp/sales_schema.json schema file.
bq query \
--external_table_definition=sales::/tmp/sales_schema.json@CSV=gs://mybucket/sales.csv \
'SELECT
Region,
Total_sales
FROM
sales'
API
- Create a query job configuration. See Querying data for information about calling jobs.query and jobs.insert.
- Specify the external data source by creating an ExternalDataConfiguration.
Python
Before trying this sample, follow the Python setup instructions in the BigQuery Quickstart Using Client Libraries. For more information, see the BigQuery Python API reference documentation.
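The following is a minimal sketch of querying a temporary external table with the google-cloud-bigquery client library; the table name, URI, and schema are illustrative:
from google.cloud import bigquery

client = bigquery.Client()

# Configure the external data source and its schema.
external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = ["gs://mybucket/sales.csv"]  # illustrative URI
external_config.schema = [
    bigquery.SchemaField("Region", "STRING"),
    bigquery.SchemaField("Quarter", "STRING"),
    bigquery.SchemaField("Total_sales", "INTEGER"),
]
external_config.options.skip_leading_rows = 1

# Map a temporary table name to the external configuration. The table
# exists only for the duration of the query job and is not saved to a dataset.
table_id = "sales"
job_config = bigquery.QueryJobConfig()
job_config.table_definitions = {table_id: external_config}

query_job = client.query(
    "SELECT Region, Total_sales FROM `{}`".format(table_id),
    job_config=job_config,
)
for row in query_job.result():
    print(row.Region, row.Total_sales)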
Querying externally partitioned data
See instructions for querying externally partitioned Cloud Storage data.
Wildcard support for Cloud Storage URIs
If your Cloud Storage data is separated into multiple files that share a common base-name, you can use a wildcard in the URI in the table definition file. You can also use a wildcard when you create an external table without using a table definition file.
To add a wildcard to the Cloud Storage URI, you append an asterisk (*) to the base-name. For example, if you have two files named fed-sample000001.csv and fed-sample000002.csv, the bucket URI would be gs://mybucket/fed-sample*.
This wildcard URI can then be used in the console, the classic web UI, the CLI, the API, or the client libraries.
You can use only one wildcard for objects (filenames) within your bucket. The wildcard can appear inside the object name or at the end of the object name. Appending a wildcard to the bucket name is unsupported.
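As a sketch, this is how a wildcard URI might be supplied through the Python client library; the bucket and file names are illustrative:
from google.cloud import bigquery

external_config = bigquery.ExternalConfig("CSV")
# One wildcard, inside or at the end of the object name; wildcards in the
# bucket name itself are unsupported.
external_config.source_uris = ["gs://mybucket/fed-sample*"]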
For Google Datastore exports, only one URI can be specified, and it must end with .backup_info or .export_metadata.
The * wildcard character is not allowed when you:
- Create external tables linked to Datastore or Cloud Firestore exports
- Load Datastore or Cloud Firestore export data from Cloud Storage
The _FILE_NAME pseudo column
Tables based on external data sources provide a pseudo column named _FILE_NAME. This column contains the fully qualified path to the file to which the row belongs. This column is available only for tables that reference external data stored in Cloud Storage and Google Drive.
The _FILE_NAME column name is reserved, which means that you cannot create a column by that name in any of your tables. To select the value of _FILE_NAME, you must use an alias. The following example query demonstrates selecting _FILE_NAME by assigning the alias fn to the pseudo column.
bq query \
--project_id=project_id \
--use_legacy_sql=false \
'SELECT
name,
_FILE_NAME AS fn
FROM
`dataset.table_name`
WHERE
name LIKE "%Alex%"'
Where:
- project_id is a valid project ID (this flag is not required if you use Cloud Shell or if you set a default project in the Cloud SDK)
- dataset is the name of the dataset that stores the permanent external table
- table_name is the name of the permanent external table
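The pseudo column can also be selected through the client libraries. The following is a minimal sketch in Python; the dataset and table names are illustrative:
from google.cloud import bigquery

client = bigquery.Client()

# _FILE_NAME must be aliased (here as fn) to appear in the results.
sql = """
    SELECT name, _FILE_NAME AS fn
    FROM `mydataset.table_name`
    WHERE name LIKE "%Alex%"
"""
for row in client.query(sql).result():
    print(row.name, row.fn)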