This page describes how to use BigQuery to query data stored in Drive.
BigQuery supports queries against both personal Drive files and shared files. For more information on Drive, see the G Suite learning center.
You can query files in Drive in the following formats:
- Comma-separated values (CSV)
- Newline-delimited JSON
- Avro
- Sheets
To query a Drive external data source, provide the Drive URI path to your data and create an external table that references that data source. The table used to reference the Drive data source can be a permanent table or a temporary table.
Retrieving the Drive URI
To create an external table for a Drive data source, you must provide the Drive URI. To retrieve the Drive URI:
Go to Drive.
Right-click your file and select Get link. The URI looks something like the following:
https://docs.google.com/spreadsheets/d/FILE_ID
or
https://drive.google.com/file/d/FILE_ID
where FILE_ID is the alphanumeric ID for your Drive file.
Alternatively, you can retrieve the URI by opening the file. For example, to retrieve the URI for a Sheets file:
Go to Sheets.
Open your spreadsheet and copy the URI in the browser's address bar.
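If you need to pull the FILE_ID out of a URI programmatically, a small Python sketch can do it. The helper name and the regular expression below are illustrative, not part of any Google API:

```python
import re

def drive_file_id(uri: str) -> str:
    """Return the alphanumeric file ID from either Drive URI form shown above."""
    match = re.search(r"/(?:spreadsheets/d|file/d)/([A-Za-z0-9_-]+)", uri)
    if not match:
        raise ValueError(f"not a recognized Drive URI: {uri}")
    return match.group(1)
```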
Enabling Drive access
Accessing data hosted in Drive requires an additional OAuth scope, both when you define the external data source and when you run a query. The scope is not enabled by default, but you can request it in the Cloud Console, in the bq command-line tool, or through the API by using the following mechanisms:
Console
Follow the web-based authentication steps when creating a permanent table in the Cloud Console. When you are prompted, click Allow to give BigQuery Client Tools access to Drive.
gcloud
To enable Drive access:
Enter the following command to ensure that you have the latest version of the gcloud command-line tool:
gcloud components update
Enter the following command to authenticate with Drive.
gcloud auth login --enable-gdrive-access
API
If you are using the BigQuery API, request the OAuth scope for Drive in addition to the scope for BigQuery.
Python
Before trying this sample, follow the Python setup instructions in the BigQuery Quickstart Using Client Libraries. For more information, see the BigQuery Python API reference documentation.
Java
Before trying this sample, follow the Java setup instructions in the BigQuery Quickstart Using Client Libraries. For more information, see the BigQuery Java API reference documentation.
Permanent versus temporary external tables
You can query an external data source in BigQuery by using a permanent table or a temporary table. A permanent table is a table that is created in a dataset and is linked to your external data source. Because the table is permanent, you can use access controls to share the table with others who also have access to the underlying external data source, and you can query the table at any time.
When you query an external data source using a temporary table, you submit a command that includes a query and creates a non-permanent table linked to the external data source. When you use a temporary table, you do not create a table in one of your BigQuery datasets. Because the table is not permanently stored in a dataset, it cannot be shared with others. Querying an external data source using a temporary table is useful for one-time, ad-hoc queries over external data, or for extract, transform, and load (ETL) processes.
Querying Drive data using permanent external tables
Required permissions and scopes
When you query external data in Drive using a permanent table, you need permissions to run a query job at the project level or higher, permissions to create a table that points to the external data, and permissions to access the table data. When your external data is stored in Drive, you also need permissions to access the Drive file that is linked to your external table.
BigQuery permissions
At a minimum, the following permissions are required to create and query an external table in BigQuery:
- bigquery.tables.create
- bigquery.tables.getData
- bigquery.jobs.create
The following predefined IAM roles include both bigquery.tables.create and bigquery.tables.getData permissions:
- bigquery.dataEditor
- bigquery.dataOwner
- bigquery.admin
The following predefined IAM roles include bigquery.jobs.create permissions:
- bigquery.user
- bigquery.jobUser
- bigquery.admin
In addition, if a user has bigquery.datasets.create permissions, when that user creates a dataset, they are granted bigquery.dataOwner access to it. bigquery.dataOwner access gives the user the ability to create external tables in the dataset, but bigquery.jobs.create permissions are still required to query the data.
For more information on IAM roles and permissions in BigQuery, see Predefined roles and permissions.
Drive permissions
At a minimum, to query external data in Drive you must be
granted View
access to the Drive file linked to the external table.
Scopes for Compute Engine instances
When you create a Compute Engine instance, you can specify a list of scopes for the instance. The scopes control the instance's access to Google Cloud products, including Drive. Applications running on the VM use the service account to call Google Cloud APIs.
If you set up a Compute Engine instance to run as a
service account,
and that service account accesses an external table linked to a Drive
data source, you must add the
OAuth scope for Drive
(https://www.googleapis.com/auth/drive
) to the instance.
For information on applying scopes to a Compute Engine instance, see Changing the service account and access scopes for an instance. For more information on Compute Engine service accounts, see Service accounts.
Creating and querying a permanent external table
You can create a permanent table linked to your external data source by:
- Using the Cloud Console
- Using the bq command-line tool's mk command
- Creating an ExternalDataConfiguration when you use the tables.insert API method
- Using the client libraries
To query an external data source using a permanent table, you create a table in a BigQuery dataset that is linked to your external data source. The data is not stored in the BigQuery table. Because the table is permanent, you can use access controls to share the table with others who also have access to the underlying external data source.
There are three ways to specify schema information when you create a permanent external table in BigQuery:
- If you are using the tables.insert API method to create a permanent external table, you create a table resource that includes a schema definition and an ExternalDataConfiguration. Set the autodetect parameter to true to enable schema auto-detection for supported data sources.
- If you are using the bq command-line tool to create a permanent external table, you can use a table definition file, you can create and use your own schema file, or you can enter the schema inline with the bq tool. When you create a table definition file, you can enable schema auto-detection for supported data sources.
- If you are using the Cloud Console to create a permanent external table, you can enter the table schema manually or use schema auto-detection for supported data sources.
To create an external table:
Console
Open the BigQuery page in the Cloud Console.
In the Explorer panel, expand your project and select a dataset.
In the details panel, click Create table.
On the Create table page, in the Source section:
For Create table from, select Drive.
In the Select Drive URI field, enter the Drive URI. Note that wildcards are not supported for Drive URIs.
For File format, select the format of your data. Valid formats for Drive data include:
- Comma-separated values (CSV)
- Newline delimited JSON
- Avro
- Sheets
(Optional) If you choose Sheets, in the Sheet range (Optional) box, specify the sheet and cell range to query. You can specify a sheet name, or you can specify sheet_name!top_left_cell_id:bottom_right_cell_id for a cell range; for example, "Sheet1!A1:B20". If Sheet range is not specified, the first sheet in the file is used.
On the Create table page, in the Destination section:
- For Dataset name, choose the appropriate dataset, and in the Table name field, enter the name of the table you're creating in BigQuery.
- Verify that Table type is set to External table.
In the Schema section, enter the schema definition.
- For JSON or CSV files, you can check the Auto-detect option to enable schema auto-detect. Auto-detect is not available for Datastore exports, Firestore exports, and Avro files. Schema information for these file types is automatically retrieved from the self-describing source data.
- Enter schema information manually by:
  - Enabling Edit as text and entering the table schema as a JSON array.
    Note: You can view the schema of an existing table in JSON format by entering the command bq show --format=prettyjson DATASET.TABLE in the bq command-line tool.
  - Using Add field to manually input the schema.
Click Create table.
If necessary, select your account and then click Allow to give the BigQuery client tools access to Drive.
You can then run a query against the table as if it were a native BigQuery table, subject to the limitations on external data sources.
After your query completes, you can download the results as CSV or JSON, save the results as a table, or save the results to Sheets. See Download, save, and export data for more information.
bq
You create a table in the bq
command-line tool using the bq mk
command. When you use
the bq
command-line tool to create a table linked to an external data source, you can
identify the table's schema using:
- A table definition file (stored on your local machine)
- An inline schema definition
- A JSON schema file (stored on your local machine)
To create a permanent table linked to your Drive data source using a table definition file, enter the following command.
bq mk \
--external_table_definition=DEFINITION_FILE \
DATASET.TABLE
Where:
- DEFINITION_FILE is the path to the table definition file on your local machine.
- DATASET is the name of the dataset that contains the table.
- TABLE is the name of the table you're creating.
For example, the following command creates a permanent table named mytable using a table definition file named mytable_def.
bq mk --external_table_definition=/tmp/mytable_def mydataset.mytable
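A table definition file is a JSON document; you would normally generate one with bq mkdef, but as a sketch of what one contains (the keys follow the table-definition format, and the URI is the placeholder used in these examples), writing one by hand looks like this:

```python
import json
import os
import tempfile

# Hypothetical table definition for a Sheets file in Drive.
table_def = {
    "sourceFormat": "GOOGLE_SHEETS",
    "sourceUris": ["https://drive.google.com/open?id=1234_AbCD12abCd"],
    "autodetect": True,
}

# Write the definition to a file that a later `bq mk` call can reference.
path = os.path.join(tempfile.gettempdir(), "mytable_def")
with open(path, "w") as f:
    json.dump(table_def, f, indent=2)
```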
To create a permanent table linked to your external data source using an inline schema definition, enter the following command.
bq mk \
--external_table_definition=SCHEMA@SOURCE_FORMAT=DRIVE_URI \
DATASET.TABLE
Where:
- SCHEMA is the schema definition in the format FIELD:DATA_TYPE,FIELD:DATA_TYPE.
- SOURCE_FORMAT is CSV, NEWLINE_DELIMITED_JSON, AVRO, or GOOGLE_SHEETS.
- DRIVE_URI is your Drive URI.
- DATASET is the name of the dataset that contains the table.
- TABLE is the name of the table you're creating.
For example, the following command creates a permanent table named sales linked to a Sheets file stored in Drive with the following schema definition: Region:STRING,Quarter:STRING,Total_sales:INTEGER.
bq mk \
--external_table_definition=Region:STRING,Quarter:STRING,Total_sales:INTEGER@GOOGLE_SHEETS=https://drive.google.com/open?id=1234_AbCD12abCd \
mydataset.sales
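The SCHEMA@SOURCE_FORMAT=DRIVE_URI argument is just a formatted string, so a script can assemble it from its parts; for example, this Python sketch rebuilds the argument used in the command above (the URI is the same placeholder):

```python
# Build the --external_table_definition argument from its parts.
fields = [("Region", "STRING"), ("Quarter", "STRING"), ("Total_sales", "INTEGER")]
schema = ",".join(f"{name}:{dtype}" for name, dtype in fields)

drive_uri = "https://drive.google.com/open?id=1234_AbCD12abCd"  # placeholder URI
definition = f"{schema}@GOOGLE_SHEETS={drive_uri}"
```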
To create a permanent table linked to your external data source using a JSON schema file, enter the following command.
bq mk \
--external_table_definition=SCHEMA_FILE@SOURCE_FORMAT=DRIVE_URI \
DATASET.TABLE
Where:
- SCHEMA_FILE is the path to the JSON schema file on your local machine.
- SOURCE_FORMAT is CSV, NEWLINE_DELIMITED_JSON, AVRO, or GOOGLE_SHEETS.
- DRIVE_URI is your Drive URI.
- DATASET is the name of the dataset that contains the table.
- TABLE is the name of the table you're creating.
For example, the following command creates a table named sales linked to a CSV file stored in Drive using the /tmp/sales_schema.json schema file.
bq mk \
--external_table_definition=/tmp/sales_schema.json@CSV=https://drive.google.com/open?id=1234_AbCD12abCd \
mydataset.sales
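For reference, a JSON schema file like /tmp/sales_schema.json is an array of field objects. The sketch below writes one matching the inline schema from the earlier example; the NULLABLE modes are an assumption:

```python
import json
import os
import tempfile

# Fields match the inline schema used in the examples above.
schema = [
    {"name": "Region", "type": "STRING", "mode": "NULLABLE"},
    {"name": "Quarter", "type": "STRING", "mode": "NULLABLE"},
    {"name": "Total_sales", "type": "INTEGER", "mode": "NULLABLE"},
]

path = os.path.join(tempfile.gettempdir(), "sales_schema.json")
with open(path, "w") as f:
    json.dump(schema, f, indent=2)
```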
After the permanent table is created, you can then run a query against the table as if it were a native BigQuery table, subject to the limitations on external data sources.
After your query completes, you can download the results as CSV or JSON, save the results as a table, or save the results to Sheets. See Download, save, and export data for more information.
API
Create an ExternalDataConfiguration when you use the tables.insert API method. Specify the schema property or set the autodetect property to true to enable schema auto-detection for supported data sources.
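As a sketch of the tables.insert request body (the field names follow the REST table resource; the project, dataset, and URI values are placeholders):

```python
# Request body for tables.insert: a table resource whose
# externalDataConfiguration points at the Drive file.
table_resource = {
    "tableReference": {
        "projectId": "my-project",
        "datasetId": "mydataset",
        "tableId": "mytable",
    },
    "externalDataConfiguration": {
        "sourceFormat": "GOOGLE_SHEETS",
        "sourceUris": ["https://drive.google.com/open?id=1234_AbCD12abCd"],
        "autodetect": True,
    },
}
```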
Python
Before trying this sample, follow the Python setup instructions in the BigQuery Quickstart Using Client Libraries. For more information, see the BigQuery Python API reference documentation.
Java
Before trying this sample, follow the Java setup instructions in the BigQuery Quickstart Using Client Libraries. For more information, see the BigQuery Java API reference documentation.
Querying Drive data using temporary tables
To query an external data source without creating a permanent table, you run a command to combine:
- A table definition file with a query
- An inline schema definition with a query
- A JSON schema definition file with a query
The table definition file or supplied schema is used to create the temporary external table,
and the query runs against the temporary external table. Querying an external data source using a
temporary table is supported by the bq
command-line tool and the API.
When you use a temporary external table, you do not create a table in one of your BigQuery datasets. Because the table is not permanently stored in a dataset, it cannot be shared with others. Querying an external data source using a temporary table is useful for one-time, ad-hoc queries over external data, or for extract, transform, and load (ETL) processes.
Required permissions
When you query external data in Drive using a temporary table, you need permissions to run a query job at the project level or higher, and you need access to the dataset that contains the table that points to the external data. When you query data in Drive, you also need permissions to access the Drive file that contains your data.
BigQuery permissions
At a minimum, the following permissions are required to query an external table in BigQuery using a temporary table:
- bigquery.tables.getData
- bigquery.jobs.create
The following predefined IAM roles include bigquery.tables.getData permissions:
- bigquery.dataEditor
- bigquery.dataOwner
- bigquery.admin
The following predefined IAM roles include bigquery.jobs.create permissions:
- bigquery.user
- bigquery.jobUser
- bigquery.admin
In addition, if a user has bigquery.datasets.create permissions, when that user creates a dataset, they are granted bigquery.dataOwner access to it. bigquery.dataOwner access gives the user the ability to create and access external tables in the dataset, but bigquery.jobs.create permissions are still required to query the data.
For more information on IAM roles and permissions in BigQuery, see Predefined roles and permissions.
Drive permissions
At a minimum, to query external data in Drive you must be
granted View
access to the Drive file linked to the external table.
Creating and querying a temporary table
You can create and query a temporary table linked to an external data source
by using the bq
command-line tool, the API, or the client libraries.
bq
You query a temporary table linked to an external data source using the
bq query
command with the --external_table_definition
flag. When you use
the bq
command-line tool to query a temporary table linked to an external data source, you
can identify the table's schema using:
- A table definition file (stored on your local machine)
- An inline schema definition
- A JSON schema file (stored on your local machine)
To query a temporary table linked to your external data source using a table definition file, enter the following command.
bq --location=LOCATION query \
--external_table_definition=TABLE::DEFINITION_FILE \
'QUERY'
Where:
- LOCATION is your location. The --location flag is optional.
- TABLE is the name of the temporary table you're creating.
- DEFINITION_FILE is the path to the table definition file on your local machine.
- QUERY is the query you're submitting to the temporary table.
For example, the following command creates and queries a temporary table named sales using a table definition file named sales_def.
bq query \
--external_table_definition=sales::sales_def \
'SELECT
Region,Total_sales
FROM
sales'
To query a temporary table linked to your external data source using an inline schema definition, enter the following command.
bq --location=LOCATION query \
--external_table_definition=TABLE::SCHEMA@SOURCE_FORMAT=DRIVE_URI \
'QUERY'
Where:
- LOCATION is your location. The --location flag is optional.
- TABLE is the name of the temporary table you're creating.
- SCHEMA is the inline schema definition in the format FIELD:DATA_TYPE,FIELD:DATA_TYPE.
- SOURCE_FORMAT is CSV, NEWLINE_DELIMITED_JSON, AVRO, or GOOGLE_SHEETS.
- DRIVE_URI is your Drive URI.
- QUERY is the query you're submitting to the temporary table.
For example, the following command creates and queries a temporary table named sales linked to a CSV file stored in Drive with the following schema definition: Region:STRING,Quarter:STRING,Total_sales:INTEGER.
bq --location=US query \
--external_table_definition=sales::Region:STRING,Quarter:STRING,Total_sales:INTEGER@CSV=https://drive.google.com/open?id=1234_AbCD12abCd \
'SELECT
Region,Total_sales
FROM
sales'
To query a temporary table linked to your external data source using a JSON schema file, enter the following command.
bq --location=LOCATION query \
--external_table_definition=TABLE::SCHEMA_FILE@SOURCE_FORMAT=DRIVE_URI \
'QUERY'
Where:
- LOCATION is your location. The --location flag is optional.
- TABLE is the name of the temporary table you're creating.
- SCHEMA_FILE is the path to the JSON schema file on your local machine.
- SOURCE_FORMAT is CSV, NEWLINE_DELIMITED_JSON, AVRO, or GOOGLE_SHEETS.
- DRIVE_URI is your Drive URI.
- QUERY is the query you're submitting to the temporary table.
For example, the following command creates and queries a temporary table named sales linked to a CSV file stored in Drive using the /tmp/sales_schema.json schema file.
bq query \
--external_table_definition=sales::/tmp/sales_schema.json@CSV=https://drive.google.com/open?id=1234_AbCD12abCd \
'SELECT
Total_sales
FROM
sales'
API
- Create a query job configuration. See Querying data for information about calling jobs.query and jobs.insert.
- Specify the external data source by creating an ExternalDataConfiguration.
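A sketch of that job configuration as a REST request body (the tableDefinitions map keys the temporary table name to its ExternalDataConfiguration; the query and URI values are placeholders from the earlier examples):

```python
# Request body for jobs.insert: the query references the temporary table
# "sales", which is defined inline under tableDefinitions.
job = {
    "configuration": {
        "query": {
            "query": "SELECT Region, Total_sales FROM sales",
            "useLegacySql": False,
            "tableDefinitions": {
                "sales": {
                    "sourceFormat": "GOOGLE_SHEETS",
                    "sourceUris": ["https://drive.google.com/open?id=1234_AbCD12abCd"],
                    "autodetect": True,
                }
            },
        }
    }
}
```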
Python
Before trying this sample, follow the Python setup instructions in the BigQuery Quickstart Using Client Libraries. For more information, see the BigQuery Python API reference documentation.
Java
Before trying this sample, follow the Java setup instructions in the BigQuery Quickstart Using Client Libraries. For more information, see the BigQuery Java API reference documentation.
The _FILE_NAME pseudo column
Tables based on external data sources provide a pseudo column named _FILE_NAME
. This
column contains the fully qualified path to the file to which the row belongs. This column is
available only for tables that reference external data stored in Cloud Storage
and Google Drive.
The _FILE_NAME
column name is reserved, which means that you cannot
create a column by that name in any of your tables. To select the value of
_FILE_NAME
, you must use an alias. The following example query demonstrates
selecting _FILE_NAME
by assigning the alias fn
to the pseudo column.
bq query \
--project_id=project_id \
--use_legacy_sql=false \
'SELECT
name,
_FILE_NAME AS fn
FROM
`dataset.table_name`
WHERE
name LIKE "%Alex%"'
Where:
- project_id is a valid project ID (this flag is not required if you use Cloud Shell or if you set a default project in the Cloud SDK)
- dataset is the name of the dataset that stores the permanent external table
- table_name is the name of the permanent external table
When the query has a filter predicate on the _FILE_NAME
pseudo column,
BigQuery attempts to skip reading files that do not satisfy the filter. Similar
recommendations to
querying ingestion-time partitioned tables using pseudo columns
apply when constructing query predicates with the _FILE_NAME
pseudo column.
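Because _FILE_NAME must be selected through an alias, a programmatically built query might look like the following sketch (the table name and filter value are hypothetical):

```python
# _FILE_NAME is reserved, so select it under an alias; a predicate on it can
# let BigQuery skip files that do not satisfy the filter.
table = "mydataset.mytable"  # hypothetical permanent external table
query = (
    "SELECT name, _FILE_NAME AS fn "
    f"FROM `{table}` "
    "WHERE _FILE_NAME LIKE '%.csv'"
)
```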