Querying Google Drive data

This page describes how to use BigQuery to query data stored in Google Drive.

BigQuery supports queries against both personal Google Drive files and shared files. For more information on Google Drive, see the G Suite learning center.

You can query files in Google Drive in the following formats:

  • Comma-separated values (CSV)
  • Newline-delimited JSON
  • Avro
  • Google Sheets

To query a Google Drive external data source, provide the Google Drive URI path to your data and create an external table that references that data source. The table used to reference the Google Drive data source can be a permanent table or a temporary table.

Retrieving the Google Drive URI

To create an external table for a Google Drive data source, you must provide the Google Drive URI. To retrieve the Google Drive URI:

  1. Go to Google Drive.

  2. Right-click your file and select Get shareable link. The URI should look like this: https://drive.google.com/open?id=file_id.

    Where:

    • file_id is the alphanumeric id for your Google Drive file

Alternatively, you can retrieve the URI by opening the file. For example, to retrieve the URI for a Google Sheets file:

  1. Go to Google Sheets.

  2. Open your spreadsheet and copy the URI in the browser's address bar. The URI should look like the following: https://docs.google.com/spreadsheets/d/file_id.

Enabling Google Drive access

Accessing data hosted in Google Drive requires an additional OAuth scope, both when you define the federated source and when you run queries against it. The scope is not enabled by default, but you can request it in the GCP Console, the classic web UI, the CLI, or the API by using the following mechanisms:

Console

Follow the web-based authentication steps when you create a permanent table in the GCP Console. When you are prompted, click Allow to give BigQuery Client Tools access to Google Drive.

Classic UI

Follow the web-based authentication steps when you create a permanent table in the classic BigQuery web UI. When you are prompted, click Allow to give BigQuery Client Tools access to Google Drive.

CLI

To enable Google Drive access:

  1. Enter the following command to ensure that you have the latest version of the command-line tool.

    gcloud components update
    
  2. Enter the following command to authenticate with Google Drive.

    gcloud auth login --enable-gdrive-access
    

API

If you are using the BigQuery API, request the OAuth scope for Google Drive in addition to the scope for BigQuery.

Python

Before trying this sample, follow the Python setup instructions in the BigQuery Quickstart Using Client Libraries. For more information, see the BigQuery Python API reference documentation.

import google.auth
from google.cloud import bigquery

# Create credentials with Drive & BigQuery API scopes
# Both APIs must be enabled for your project before running this code
credentials, project = google.auth.default(scopes=[
    'https://www.googleapis.com/auth/drive',
    'https://www.googleapis.com/auth/bigquery',
])
client = bigquery.Client(credentials=credentials, project=project)

Permanent versus temporary external tables

You can query an external data source in BigQuery by using a permanent table or a temporary table. A permanent table is a table that is created in a dataset and is linked to your external data source. Because the table is permanent, you can use dataset-level access controls to share the table with others who also have access to the underlying external data source, and you can query the table at any time.

When you query an external data source using a temporary table, you submit a command that includes a query and creates a non-permanent table linked to the external data source. When you use a temporary table, you do not create a table in one of your BigQuery datasets. Because the table is not permanently stored in a dataset, it cannot be shared with others. Querying an external data source using a temporary table is useful for one-time, ad-hoc queries over external data, or for extract, transform, and load (ETL) processes.

Querying Google Drive data using permanent external tables

Required permissions and scopes

When you query external data in Google Drive using a permanent table, you need permissions to run a query job at the project level or higher, permissions to create a table that points to the external data, and permissions to access the table data. When your external data is stored in Google Drive, you also need permissions to access the Google Drive file that is linked to your external table.

BigQuery permissions

At a minimum, the following permissions are required to create and query an external table in BigQuery.

  • bigquery.tables.create
  • bigquery.tables.getData
  • bigquery.jobs.create

The following predefined Cloud IAM roles include both bigquery.tables.create and bigquery.tables.getData permissions:

  • bigquery.dataEditor
  • bigquery.dataOwner
  • bigquery.admin

The following predefined Cloud IAM roles include bigquery.jobs.create permissions:

  • bigquery.user
  • bigquery.jobUser
  • bigquery.admin

In addition, if a user has bigquery.datasets.create permissions, when that user creates a dataset, they are granted bigquery.dataOwner access to it. bigquery.dataOwner access gives the user the ability to create external tables in the dataset, but bigquery.jobs.create permissions are still required to query the data.

For more information on Cloud IAM roles and permissions in BigQuery, see Predefined roles and permissions.

Google Drive permissions

At a minimum, to query external data in Google Drive you must be granted View access to the Google Drive file linked to the external table.

Scopes for Compute Engine instances

When you create a Compute Engine instance, you can specify a list of scopes for the instance. The scopes control the instance's access to Google APIs and services, including Google Drive. Applications running on the VM use the service account to call Google APIs.

If you set up a Compute Engine instance to run as a service account, and that service account accesses an external table linked to a Google Drive data source, you must add the OAuth scope for Google Drive (https://www.googleapis.com/auth/drive) to the instance.

For information on applying scopes to a Compute Engine instance, see Changing the service account and access scopes for an instance. For more information on Compute Engine service accounts, see Service accounts.
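
To confirm which scopes a running instance was granted, you can query the Compute Engine metadata server from inside the VM. The following Python sketch is illustrative only; it assumes the third-party requests library is installed on the instance.

import requests

# Ask the Compute Engine metadata server for the access scopes granted to
# the instance's default service account.
METADATA_URL = ('http://metadata.google.internal/computeMetadata/v1/'
                'instance/service-accounts/default/scopes')

response = requests.get(METADATA_URL, headers={'Metadata-Flavor': 'Google'})
response.raise_for_status()
scopes = response.text.split()

if 'https://www.googleapis.com/auth/drive' in scopes:
    print('This instance can request Google Drive access.')
else:
    print('Add the Google Drive scope to the instance before querying Drive data.')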

Creating and querying a permanent external table

You can create a permanent table linked to your external data source by:

  • Using the GCP Console or the classic BigQuery web UI
  • Using the command-line tool's mk command
  • Creating an ExternalDataConfiguration when you use the tables.insert API method
  • Using the client libraries

To query an external data source using a permanent table, you create a table in a BigQuery dataset that is linked to your external data source. The data is not stored in the BigQuery table. Because the table is permanent, you can use dataset-level access controls to share the table with others who also have access to the underlying external data source.

There are three ways to specify schema information when you create a permanent external table in BigQuery:

  • If you are using the tables.insert API method to create a permanent external table, you create a table resource that includes a schema definition and an ExternalDataConfiguration. Set the autodetect parameter to true to enable schema auto-detection for supported data sources (see the Python sketch after this list).
  • If you are using the CLI to create a permanent external table, you can use a table definition file, you can create and use your own schema file, or you can enter the schema inline (on the command line). When you create a table definition file, you can enable schema auto-detection for supported data sources.
  • If you are using the console or the classic BigQuery web UI to create a permanent external table, you can enter the table schema manually or use schema auto-detection for supported data sources.
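
The following Python sketch illustrates the API approach with schema auto-detection enabled. The dataset name, table name, and Drive URI are placeholders; replace them with your own values.

import google.auth
from google.cloud import bigquery

# Credentials must include the Drive scope (see Enabling Google Drive access)
credentials, project = google.auth.default(scopes=[
    'https://www.googleapis.com/auth/drive',
    'https://www.googleapis.com/auth/bigquery',
])
client = bigquery.Client(credentials=credentials, project=project)

# Placeholder dataset, table name, and Drive URI
dataset_ref = client.dataset('my_dataset')
external_config = bigquery.ExternalConfig('CSV')
external_config.source_uris = ['https://drive.google.com/open?id=file_id']
external_config.autodetect = True  # infer the schema from the CSV data

table = bigquery.Table(dataset_ref.table('my_csv_table'))
table.external_data_configuration = external_config
table = client.create_table(table)  # API request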

To create an external table:

Console

  1. Open the BigQuery web UI in the GCP Console.
    Go to the GCP Console

  2. In the navigation panel, in the Resources section, expand your project and select a dataset. Click Create table on the right side of the window.


  3. On the Create table page, in the Source section:

    • For Create table from, select Drive.

    • In the Select Drive URI field, enter the Google Drive URI. Note that wildcards are not supported for Google Drive URIs.

    • For File format, select the format of your data. Valid formats for Google Drive data include:

      • Comma-separated values (CSV)
      • Newline delimited JSON
      • Avro
      • Google Sheets
  4. (Optional) If you choose Google Sheets, in the Sheet range (Optional) box, specify the sheet and cell range to query. You can specify a sheet name, or you can specify sheet_name!top_left_cell_id:bottom_right_cell_id for a cell range; for example, "Sheet1!A1:B20". If Sheet range is not specified, the first sheet in the file is used.

  5. On the Create table page, in the Destination section:

    • For Dataset name, choose the appropriate dataset, and in the Table name field, enter the name of the table you're creating in BigQuery.

    • Verify that Table type is set to External table.
  6. In the Schema section, enter the schema definition.

    • For JSON or CSV files, you can check the Auto-detect option to enable schema auto-detect. Auto-detect is not available for Cloud Datastore exports, Cloud Firestore exports, and Avro files. Schema information for these file types is automatically retrieved from the self-describing source data.
    • Enter schema information manually by:
      • Enabling Edit as text and entering the table schema as a JSON array. Note: You can view the schema of an existing table in JSON format by entering the following CLI command: bq show --format=prettyjson dataset.table.
      • Using Add field to manually input the schema.
  7. Click Create table.

  8. If necessary, select your account and then click Allow to give the BigQuery client tools access to Google Drive.

You can then run a query against the table as if it were a native BigQuery table, subject to the limitations on external data sources.

After your query completes, you can download the results as CSV or JSON, save the results as a table, or save the results to Google Sheets. See Download, save, and export data for more information.

Classic UI

  1. Go to the BigQuery web UI.

    Go to the BigQuery web UI

  2. In the navigation panel, hover over a dataset, click the down arrow icon, and click Create new table.

  3. On the Create Table page, in the Source Data section:

    • For Location, select Google Drive and in the source field, enter the Google Drive URI. Note that wildcards are not supported for Google Drive URIs.
    • For File format, select the format of your data. Valid formats for Google Drive data include:

      • Comma-separated values (CSV)
      • JSON (newline delimited)
      • Avro
      • Google Sheets
  4. On the Create Table page, in the Destination Table section:

    • For Table name, choose the appropriate dataset, and in the table name field, enter the name of the permanent table you're creating in BigQuery.
    • Verify that Table type is set to External table.
  5. In the Schema section, enter the schema information.

    • For JSON or CSV files, you can check the Auto-detect option to enable schema auto-detect. Schema auto-detection in the web UI is not currently supported for Google Sheets (though it is available using the CLI and API). Also, Auto-detect is not available for Avro data sources. Schema information is automatically retrieved from Avro files.

    • For CSV, JSON, or Google Sheets files, you can enter schema information manually by:

      • Clicking Edit as text and entering the table schema in JSON format
      • Using Add Field to manually input the schema
  6. Select applicable items in the Options section and then click Create Table.

  7. Select your account and then click Allow to give the BigQuery client tools access to Google Drive.


You can then run a query against the table as if it were a native BigQuery table, subject to the limitations on external data sources.

After your query completes, you can download the results as CSV or JSON, save the results as a table, or save the results to Google Sheets. See Download, save, and export data for more information.

CLI

You create a table in the BigQuery command-line tool using the bq mk command. When you use the CLI to create a table linked to an external data source, you can identify the table's schema using:

  • A table definition file (stored on your local machine)
  • An inline schema definition
  • A JSON schema file (stored on your local machine)

To create a permanent table linked to your Google Drive data source using a table definition file, enter the following command.

bq mk \
--external_table_definition=definition_file \
dataset.table

Where:

  • definition_file is the path to the table definition file on your local machine.
  • dataset is the name of the dataset that contains the table.
  • table is the name of the table you're creating.

For example, the following command creates a permanent table named mytable using a table definition file named mytable_def.

bq mk --external_table_definition=/tmp/mytable_def mydataset.mytable

To create a permanent table linked to your external data source using an inline schema definition, enter the following command.

bq mk \
--external_table_definition=schema@source_format=drive_uri \
dataset.table

Where:

  • schema is the schema definition in the format field:data_type,field:data_type.
  • source_format is CSV, NEWLINE_DELIMITED_JSON, AVRO, or GOOGLE_SHEETS.
  • drive_uri is your Google Drive URI.
  • dataset is the name of the dataset that contains the table.
  • table is the name of the table you're creating.

For example, the following command creates a permanent table named sales linked to a Google Sheets file stored in Google Drive with the following schema definition: Region:STRING,Quarter:STRING,Total_sales:INTEGER.

bq mk \
--external_table_definition=Region:STRING,Quarter:STRING,Total_sales:INTEGER@GOOGLE_SHEETS=https://drive.google.com/open?id=1234_AbCD12abCd \
mydataset.sales

To create a permanent table linked to your external data source using a JSON schema file, enter the following command.

bq mk \
--external_table_definition=schema_file@source_format=drive_uri \
dataset.table

Where:

  • schema_file is the path to the JSON schema file on your local machine.
  • source_format is CSV, NEWLINE_DELIMITED_JSON, AVRO, or GOOGLE_SHEETS.
  • drive_uri is your Google Drive URI.
  • dataset is the name of the dataset that contains the table.
  • table is the name of the table you're creating.

For example, the following command creates a table named sales linked to a CSV file stored in Google Drive using the /tmp/sales_schema.json schema file.

bq mk \
--external_table_definition=/tmp/sales_schema.json@CSV=https://drive.google.com/open?id=1234_AbCD12abCd \
mydataset.sales

After the permanent table is created, you can then run a query against the table as if it were a native BigQuery table, subject to the limitations on external data sources.

After your query completes, you can download the results as CSV or JSON, save the results as a table, or save the results to Google Sheets. See Download, save, and export data for more information.

API

Create an ExternalDataConfiguration when you use the tables.insert API method. Specify the schema property or set the autodetect property to true to enable schema auto-detection for supported data sources.

Python

Before trying this sample, follow the Python setup instructions in the BigQuery Quickstart Using Client Libraries. For more information, see the BigQuery Python API reference documentation.

import google.auth
from google.cloud import bigquery

dataset_id = 'my_dataset'  # ID of the dataset that will contain the external table

# Create credentials with Drive & BigQuery API scopes
# Both APIs must be enabled for your project before running this code
credentials, project = google.auth.default(scopes=[
    'https://www.googleapis.com/auth/drive',
    'https://www.googleapis.com/auth/bigquery',
])
client = bigquery.Client(credentials=credentials, project=project)

# Configure the external data source
dataset_ref = client.dataset(dataset_id)
table_id = 'us_states'
schema = [
    bigquery.SchemaField('name', 'STRING'),
    bigquery.SchemaField('post_abbr', 'STRING')
]
table = bigquery.Table(dataset_ref.table(table_id), schema=schema)
external_config = bigquery.ExternalConfig('GOOGLE_SHEETS')
# Use a shareable link or grant viewing access to the email address you
# used to authenticate with BigQuery (this example Sheet is public)
sheet_url = (
    'https://docs.google.com/spreadsheets'
    '/d/1i_QCL-7HcSyUZmIbP9E6lO_T5u3HnpLe7dnpHaijg_E/edit?usp=sharing')
external_config.source_uris = [sheet_url]
external_config.options.skip_leading_rows = 1  # optionally skip header row
table.external_data_configuration = external_config

# Create a permanent table linked to the Sheets file
table = client.create_table(table)  # API request

# Example query to find states starting with 'W'
sql = 'SELECT * FROM `{}.{}` WHERE name LIKE "W%"'.format(
    dataset_id, table_id)

query_job = client.query(sql)  # API request

w_states = list(query_job)  # Waits for query to finish
print('There are {} states with names starting with W.'.format(
    len(w_states)))
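
If you want the external table to expose only part of the spreadsheet (the Sheet range option described in the console steps), the sheet range can also be set on the external configuration before the table is created. This is a minimal sketch that reuses the external_config object from the sample above; the range value is an example, and the option requires a client library version that exposes it on the Google Sheets options.

# Optionally restrict the external table to a sheet and cell range
# (example range; requires GoogleSheetsOptions to expose the range option)
external_config.options.range = 'Sheet1!A1:B20'

Set the range before calling client.create_table so that it is stored as part of the table's external configuration.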

Querying Google Drive data using temporary tables

To query an external data source without creating a permanent table, you run a command to combine:

  • A table definition file with a query
  • An inline schema definition with a query
  • A JSON schema definition file with a query

The table definition file or supplied schema is used to create the temporary external table, and the query runs against the temporary external table. Querying an external data source using a temporary table is supported by the BigQuery CLI and API.

When you use a temporary external table, you do not create a table in one of your BigQuery datasets. Because the table is not permanently stored in a dataset, it cannot be shared with others. Querying an external data source using a temporary table is useful for one-time, ad-hoc queries over external data, or for extract, transform, and load (ETL) processes.

Required permissions

When you query external data in Google Drive using a temporary table, you need permissions to run a query job at the project level or higher, and you need access to the dataset that contains the table that points to the external data. When you query data in Google Drive, you also need permissions to access the Google Drive file that contains your data.

BigQuery permissions

At a minimum, the following permissions are required to query an external table in BigQuery using a temporary table.

  • bigquery.tables.getData
  • bigquery.jobs.create

The following predefined Cloud IAM roles include bigquery.tables.getData permissions:

  • bigquery.dataEditor
  • bigquery.dataOwner
  • bigquery.admin

The following predefined Cloud IAM roles include bigquery.jobs.create permissions:

  • bigquery.user
  • bigquery.jobUser
  • bigquery.admin

In addition, if a user has bigquery.datasets.create permissions, when that user creates a dataset, they are granted bigquery.dataOwner access to it. bigquery.dataOwner access gives the user the ability to create and access external tables in the dataset, but bigquery.jobs.create permissions are still required to query the data.

For more information on Cloud IAM roles and permissions in BigQuery, see Predefined roles and permissions.

Google Drive permissions

At a minimum, to query external data in Google Drive you must be granted View access to the Google Drive file linked to the external table.

Creating and querying a temporary table

You can create and query a temporary table linked to an external data source by using the CLI, the API, or the client libraries.

CLI

You query a temporary table linked to an external data source using the bq query command with the --external_table_definition flag. When you use the CLI to query a temporary table linked to an external data source, you can identify the table's schema using:

  • A table definition file (stored on your local machine)
  • An inline schema definition
  • A JSON schema file (stored on your local machine)

To query a temporary table linked to your external data source using a table definition file, enter the following command.

bq --location=location query \
--external_table_definition=table::definition_file \
'query'

Where:

  • location is your location. The --location flag is optional.
  • table is the name of the temporary table you're creating.
  • definition_file is the path to the table definition file on your local machine.
  • query is the query you're submitting to the temporary table.

For example, the following command creates and queries a temporary table named sales using a table definition file named sales_def.

bq query \
--external_table_definition=sales::sales_def \
'SELECT
   Region,Total_sales
 FROM
   sales'

To query a temporary table linked to your external data source using an inline schema definition, enter the following command.

bq --location=location query \
--external_table_definition=table::schema@source_format=drive_uri \
'query'

Where:

  • location is your location. The --location flag is optional.
  • table is the name of the temporary table you're creating.
  • schema is the inline schema definition in the format field:data_type,field:data_type.
  • source_format is CSV, NEWLINE_DELIMITED_JSON, AVRO, or GOOGLE_SHEETS.
  • drive_uri is your Google Drive URI.
  • query is the query you're submitting to the temporary table.

For example, the following command creates and queries a temporary table named sales linked to a CSV file stored in Google Drive with the following schema definition: Region:STRING,Quarter:STRING,Total_sales:INTEGER.

bq --location=US query \
--external_table_definition=sales::Region:STRING,Quarter:STRING,Total_sales:INTEGER@CSV=https://drive.google.com/open?id=1234_AbCD12abCd \
'SELECT
   Region,Total_sales
 FROM
   sales'

To query a temporary table linked to your external data source using a JSON schema file, enter the following command.

bq --location=location query \
--external_table_definition=table::schema_file@source_format=drive_uri \
'query'

Where:

  • location is your location. The --location flag is optional.
  • table is the name of the temporary table you're creating.
  • schema_file is the path to the JSON schema file on your local machine.
  • source_format is CSV, NEWLINE_DELIMITED_JSON, AVRO, or GOOGLE_SHEETS.
  • drive_uri is your Google Drive URI.
  • query is the query you're submitting to the temporary table.

For example, the following command creates and queries a temporary table named sales linked to a CSV file stored in Google Drive using the /tmp/sales_schema.json schema file.

bq query \
--external_table_definition=sales::/tmp/sales_schema.json@CSV=https://drive.google.com/open?id=1234_AbCD12abCd \
'SELECT
   Total_sales
 FROM
   sales'

API

Python

Before trying this sample, follow the Python setup instructions in the BigQuery Quickstart Using Client Libraries. For more information, see the BigQuery Python API reference documentation.

import google.auth
from google.cloud import bigquery

# Create credentials with Drive & BigQuery API scopes
# Both APIs must be enabled for your project before running this code
credentials, project = google.auth.default(scopes=[
    'https://www.googleapis.com/auth/drive',
    'https://www.googleapis.com/auth/bigquery',
])
client = bigquery.Client(credentials=credentials, project=project)

# Configure the external data source and query job
external_config = bigquery.ExternalConfig('GOOGLE_SHEETS')
# Use a shareable link or grant viewing access to the email address you
# used to authenticate with BigQuery (this example Sheet is public)
sheet_url = (
    'https://docs.google.com/spreadsheets'
    '/d/1i_QCL-7HcSyUZmIbP9E6lO_T5u3HnpLe7dnpHaijg_E/edit?usp=sharing')
external_config.source_uris = [sheet_url]
external_config.schema = [
    bigquery.SchemaField('name', 'STRING'),
    bigquery.SchemaField('post_abbr', 'STRING')
]
external_config.options.skip_leading_rows = 1  # optionally skip header row
table_id = 'us_states'
job_config = bigquery.QueryJobConfig()
job_config.table_definitions = {table_id: external_config}

# Example query to find states starting with 'W'
sql = 'SELECT * FROM `{}` WHERE name LIKE "W%"'.format(table_id)

query_job = client.query(sql, job_config=job_config)  # API request

w_states = list(query_job)  # Waits for query to finish
print('There are {} states with names starting with W.'.format(
    len(w_states)))

The _FILE_NAME pseudo column

Tables based on external data sources provide a pseudo column named _FILE_NAME. This column contains the fully qualified path to the file to which the row belongs. This column is available only for tables that reference external data stored in Cloud Storage and Google Drive.

The _FILE_NAME column name is reserved, which means that you cannot create a column by that name in any of your tables. To select the value of _FILE_NAME, you must use an alias. The following example query demonstrates selecting _FILE_NAME by assigning the alias fn to the pseudo column.

bq query \
--project_id=project_id \
--use_legacy_sql=false \
'SELECT
   name,
   _FILE_NAME AS fn
 FROM
   `dataset.table_name`
 WHERE
   name LIKE "%Alex%"'

Where:

  • project_id is a valid project ID (this flag is not required if you use Cloud Shell or if you set a default project in the Cloud SDK)
  • dataset is the name of the dataset that stores the permanent external table
  • table_name is the name of the permanent external table
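
The same query can also be issued through the Python client library. The following sketch assumes a permanent external table named mydataset.drive_table already exists and that the credentials include the Google Drive scope, as shown earlier.

import google.auth
from google.cloud import bigquery

# Credentials must include the Drive scope to read Drive-backed tables
credentials, project = google.auth.default(scopes=[
    'https://www.googleapis.com/auth/drive',
    'https://www.googleapis.com/auth/bigquery',
])
client = bigquery.Client(credentials=credentials, project=project)

# Placeholder table name -- replace with your external table
sql = """
    SELECT name, _FILE_NAME AS fn
    FROM `mydataset.drive_table`
    WHERE name LIKE '%Alex%'
"""

query_job = client.query(sql)  # API request

for row in query_job:  # iterating waits for the query to finish
    print('{} came from {}'.format(row['name'], row['fn']))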