Querying Google Drive Data

This page describes how to use BigQuery to query data stored in Google Drive.

BigQuery supports the following data-source formats for Google Drive:

  • Comma-separated values (CSV)
  • JSON (newline delimited)
  • Avro
  • Google Sheets (first tab only)

To directly query a Google Drive external data source, provide the Google Drive URI path to your data and create an external table that references that data source. The table used to reference the Google Drive data source can be a permanent table or a temporary table.

For CSV and JSON files, you can also include the table schema. If you omit the schema, BigQuery attempts to automatically detect the schema. For Avro files, BigQuery reads the schema from the data source.

Before you begin

  • If you plan to use the bq command-line tool instead of the web UI, ensure that you have activated BigQuery and installed the Google Cloud SDK. For more information, see the Before you begin section of the command-line quickstart.
  • If you are using the command-line tool, set a default project so that you can omit the project_id flag. For information on how to set a default project, see the Cloud SDK documentation for gcloud config set in the command-line tool quickstart.

Permanent versus temporary tables

You can query an external data source in BigQuery using a permanent table or a temporary table. When you use a permanent table, you create a table in a BigQuery dataset that is linked to your external data source. Because the table is permanent, you can use dataset-level access controls to share the table with others who also have access to the underlying external data source, and you can query the table at any time.

When you query an external data source using a temporary table, you submit a command that includes a query and creates a non-permanent table linked to the external data source. When you use a temporary table, you do not create a table in one of your BigQuery datasets. Because the table is not permanently stored in a dataset, it cannot be shared with others. Querying an external data source using a temporary table is useful for one-time, ad-hoc queries over external data, or for extract, transform, and load (ETL) processes.

Querying Google Drive data using permanent tables

To query an external data source using a permanent table, you create a table in a BigQuery dataset that is linked to your external data source using a table definition. The data is not stored in the BigQuery table.

If you are using the CLI or API, you begin by creating a table definition file that defines the schema for the external data source. You can also create and use your own schema file or enter the schema inline. If you are using the web UI, you enter the table schema manually.

Before you begin, ensure that you have enabled Google Drive access.

To specify a Google Drive file, use one of these URI formats:

Source format: CSV, JSON, Avro, or Google Sheets
URL format: https://drive.google.com/open?id=[FILE_ID]

To access a Google Drive file's URI in this format:

  1. Go to Google Drive.
  2. Right-click on the file and select Get shareable link.

Source format: Google Sheets
URL format: https://docs.google.com/spreadsheets/d/[FILE_ID]/edit#gid=[IDENTIFIER]

To access a Google Drive file's URI in this format:

  1. Go to Google Sheets.
  2. Open the file.
  3. Copy the URL from your browser's URL address bar.
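
For example, a shareable link for a file stored on Google Drive looks like the following, where the file ID shown is a hypothetical placeholder:

    https://drive.google.com/open?id=0B0123456789abcdefghij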

Web UI

  1. Go to the BigQuery web UI.

  2. In the navigation panel, hover on an empty dataset, click the down arrow icon, and click Create new table. You can also hover on an existing table in a dataset.

  3. Under Source data, select Google Drive for the Location, and then enter the URI to your Google Drive file.

  4. Select the format of your data. For Google Sheets, select the CSV option. Valid formats include the following:

    • Comma-separated values (CSV)
    • JSON (newline delimited)
    • Avro
    • Google Sheets (first tab only)
  5. Under Destination Table, input a table ID, and then select External table for the Table type field.

  6. For CSV and JSON files, input the table schema. For Avro files, you can omit the table schema and let BigQuery detect the schema. Click the Submit button to complete the process.

You can then run a query against the table as if it were a native BigQuery table, subject to the limitations of external data sources.
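
For example, assuming a permanent external table named mydataset.fed_sample (the table created in the CLI example that follows), a legacy SQL query against it might look like this sketch:

    SELECT name FROM mydataset.fed_sample WHERE name CONTAINS "Alex";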

After your query completes, you can download the results as CSV or JSON, save the results as a table, or save the results to Google Sheets. See Download, save, and export data for more information.

CLI

To follow along with the example below:

  1. Create a CSV file named fed-sample.csv that contains the following sample data:

    1,Alice,10,one
    2,Alex,11,one
    3,Bob,11,two
    
  2. Upload the file to Google Drive.

To create an external table:

  1. Create the table by running the following command with your own values:

    bq mk --external_table_definition=[SCHEMA]@[SOURCE_FORMAT]=[GOOGLE_DRIVE_URL] [DATASET_ID].[TABLE_NAME]
    

    Where [SOURCE_FORMAT] is one of: CSV, NEWLINE_DELIMITED_JSON, AVRO, or GOOGLE_SHEETS.

    For example, the following command creates a table named fed_sample in a dataset named mydataset, based on the fed-sample.csv file:

    bq mk --external_table_definition=id:integer,name:string,num:integer,ord:string@CSV=[GOOGLE_DRIVE_URL] mydataset.fed_sample
    
  2. Run a query against the table by running the following command:

    bq query 'select name from [DATASET_ID].fed_sample where name contains "Alex";'
    

    Where [DATASET_ID] is a valid dataset ID.
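
You can also pass a table definition file, instead of an inline schema, as the value of the --external_table_definition flag. A minimal sketch, assuming a definition file at the hypothetical path /tmp/fed-sample-def (see Creating table definition files):

    bq mk --external_table_definition=/tmp/fed-sample-def mydataset.fed_sample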

API

  • The sourceUris must be fully qualified, in the format https://drive.google.com/open?id=[FILE_ID] or https://docs.google.com/spreadsheets/d/[FILE_ID]/edit#gid=[IDENTIFIER].

  • Specify the data format by setting the externalDataConfiguration.sourceFormat property. Valid values include:

    • CSV
    • NEWLINE_DELIMITED_JSON
    • AVRO
    • GOOGLE_SHEETS
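
For example, the body of a tables.insert request that defines a permanent external table over a CSV file on Google Drive might look like the following sketch, where the dataset, table, and field names are hypothetical and [FILE_ID] is a placeholder:

{
  "tableReference": {
    "projectId": "[PROJECT_ID]",
    "datasetId": "mydataset",
    "tableId": "fed_sample"
  },
  "externalDataConfiguration": {
    "sourceFormat": "CSV",
    "sourceUris": [
      "https://drive.google.com/open?id=[FILE_ID]"
    ],
    "schema": {
      "fields": [
        {"name": "id", "type": "INTEGER"},
        {"name": "name", "type": "STRING"}
      ]
    }
  }
}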

Querying external data using temporary tables

To query an external data source without creating a permanent table, you run a command that combines a table definition with a query. The table definition is used to create the temporary table, and the query runs against the temporary table. Querying an external data source using a temporary table is supported by the BigQuery CLI and API.

Querying data from Google Drive

Before you begin, ensure that you have enabled Google Drive access.

To specify a Google Drive file, use one of these URI formats:

Source format: CSV, JSON, Avro, or Google Sheets
URL format: https://drive.google.com/open?id=[FILE_ID]

To access a Google Drive file's URI in this format:

  1. Go to Google Drive.
  2. Right-click on the file and select Get shareable link.

Source format: Google Sheets
URL format: https://docs.google.com/spreadsheets/d/[FILE_ID]/edit#gid=[IDENTIFIER]

To access a Google Drive file's URI in this format:

  1. Go to Google Sheets.
  2. Open the file.
  3. Copy the URL from your browser's URL address bar.

CLI

  1. Create a table definition by running one of the following commands, which pipe the output to the file /tmp/wikipedia:

    CSV
    bq mkdef --source_format=CSV "[GOOGLE_DRIVE_URI]" id:integer,name:string,category:integer,depth:string > /tmp/wikipedia

    The table is in CSV format and has four fields: id, name, category, and depth.

    JSON
    bq mkdef --source_format=NEWLINE_DELIMITED_JSON "[GOOGLE_DRIVE_URI]" id:integer,name:string,category:integer,depth:string > /tmp/wikipedia

    The table is in JSON (newline delimited) format and has four fields: id, name, category, and depth.

    Avro
    bq mkdef --source_format=AVRO "[GOOGLE_DRIVE_URI]" > /tmp/wikipedia

    If you use an Avro file as your external data source, BigQuery can detect the schema if you don't specify one.

    Google Sheets
    bq mkdef --source_format=GOOGLE_SHEETS "[GOOGLE_DRIVE_URI]" > /tmp/wikipedia

    The source is the first tab of a Google Sheets file.

  2. Run a query against the table by running the following command:

    bq query --project_id=[PROJECT_ID] --external_table_definition=wikipedia::/tmp/wikipedia 'select name from wikipedia where name contains "Alex";'

    The preceding command runs a query that references the table as wikipedia and supplies the table definition from the file created in the previous step.
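
    If you prefer not to create a definition file, bq query also accepts an inline table definition of the form [TABLE_NAME]::[SCHEMA]@[SOURCE_FORMAT]=[URI]. A sketch using the same hypothetical schema:

    bq query --project_id=[PROJECT_ID] --external_table_definition=wikipedia::id:integer,name:string,category:integer,depth:string@CSV=[GOOGLE_DRIVE_URI] 'select name from wikipedia where name contains "Alex";'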

API

If you are using the BigQuery API, define the temporary table as part of your query job configuration. For more information, see configuration.query.tableDefinitions in the API Reference.
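
For example, the configuration portion of a query job that defines a temporary table named wikipedia might look like the following sketch (schema fields omitted for brevity; [FILE_ID] is a placeholder):

{
  "configuration": {
    "query": {
      "query": "SELECT name FROM wikipedia WHERE name CONTAINS \"Alex\";",
      "tableDefinitions": {
        "wikipedia": {
          "sourceFormat": "CSV",
          "sourceUris": [
            "https://drive.google.com/open?id=[FILE_ID]"
          ],
          "schema": {
            "fields": [
              ...
            ]
          }
        }
      }
    }
  }
}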

Creating table definition files

A table definition file contains metadata about a table, such as the table's source format and related properties. Table definition files can be used to create permanent external tables or temporary external tables. The command-line tool and the API support the creation and use of table definition files.

To create and use a table definition:

CLI

  1. Use the command-line tool's mkdef command to create a table definition or create the table definition manually. The mkdef command generates a table definition file in JSON format.

  2. If necessary, manually edit the table definition file to modify, add, or delete custom parameters.

  3. Create a permanent external table by using bq mk, or query a temporary external table by using bq query, passing the table definition file as the value of the --external_table_definition flag, as sketched below.
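
    For example, the following sequence sketches the full flow for a hypothetical Google Drive CSV file and a dataset named mydataset:

    bq mkdef --source_format=CSV "[GOOGLE_DRIVE_URI]" id:integer,name:string > /tmp/my-table-def
    bq mk --external_table_definition=/tmp/my-table-def mydataset.my_external_table
    bq query 'select name from mydataset.my_external_table;'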

API

If you are using the BigQuery API, define a table definition as part of your job configuration. For more information, see configuration.query.tableDefinitions in the API Reference.

Many table definition configuration settings apply to more than one source format. For detailed information about settings and source formats, see the API Reference descriptions for each of the following settings:

General table definition settings
autodetect

Type: boolean

Default value: true

Description: See CSV and JSON schema auto-detection.

API Reference: configuration.query.tableDefinitions.(key).autodetect

compression

Type: string

Default value: NONE

Valid values: GZIP and NONE.

API Reference: configuration.query.tableDefinitions.(key).compression

ignoreUnknownValues

Type: boolean

Default value: false

API Reference: configuration.query.tableDefinitions.(key).ignoreUnknownValues

maxBadRecords

Type: integer

Default value: 0

API Reference: configuration.query.tableDefinitions.(key).maxBadRecords

schema

Type: nested object

API Reference: configuration.query.tableDefinitions.(key).schema

sourceFormat

Type: string

API Reference: configuration.query.tableDefinitions.(key).sourceFormat

sourceUris

Type: list

API Reference: configuration.query.tableDefinitions.(key).sourceUris

For the complete list of table definition configuration settings, see configuration.query.tableDefinitions in the API Reference.
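
As an illustration, a table definition that tolerates up to 10 bad records and ignores values that do not match the schema might include an excerpt like the following (a sketch combining the general settings described above):

{
  "ignoreUnknownValues": true,
  "maxBadRecords": 10,
  "sourceFormat": "CSV",
  ...
}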

Table definitions based on CSV source files

The following command creates a table definition based on a CSV file named fed-sample.csv in a Cloud Storage bucket named [BUCKET], where the schema is stored in /tmp/fed-sample-schema.json and the generated definition file is stored in /tmp/fed-sample-def:

bq mkdef --source_format=CSV 'gs://[BUCKET]/fed-sample.csv' /tmp/fed-sample-schema.json > /tmp/fed-sample-def

The contents of fed-sample-def (schema values omitted for brevity):

{
  "csvOptions": {
    "allowJaggedRows": false,
    "allowQuotedNewlines": false,
    "encoding": "UTF-8",
    "fieldDelimiter": ",",
    "quote": "\"",
    "skipLeadingRows": 0
  },
  "schema": {
    "fields": [
      ...
    ]
  },
  "sourceFormat": "CSV",
  "sourceUris": [
    "gs://[BUCKET]/fed-sample.csv"
  ]
}

You can modify a table definition file so that subsequent calls to bq mk or bq query use the updated values. For example, if you decide that you want to allow jagged rows, you can modify that property in fed-sample-def:

{
  "csvOptions": {
    "allowJaggedRows": true,
  ...

The following options are available only for table definitions based on CSV files:

CSV table definition options
allowJaggedRows

Type: boolean

Default value: false

API Reference: configuration.query.tableDefinitions.(key).csvOptions.allowJaggedRows

allowQuotedNewlines

Type: boolean

Default value: false

Description: CSV files with newline characters are difficult to process in parallel. Setting this to true can reduce the performance gains associated with parallel processing.

API Reference: configuration.query.tableDefinitions.(key).csvOptions.allowQuotedNewlines

encoding

Type: string

Default value: UTF-8

Valid values: UTF-8 and ISO_8859_1.

API Reference: configuration.query.tableDefinitions.(key).csvOptions.encoding

fieldDelimiter

Type: string

Default value: ","

API Reference: configuration.query.tableDefinitions.(key).csvOptions.fieldDelimiter

quote

Type: string

Default value: " (quotation mark)

API Reference: configuration.query.tableDefinitions.(key).csvOptions.quote

skipLeadingRows

Type: long

Default value: 0

API Reference: configuration.query.tableDefinitions.(key).csvOptions.skipLeadingRows
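
For example, to read a pipe-delimited CSV file that begins with one header row, you might edit the csvOptions section of a definition file as follows (a sketch using the options described above):

{
  "csvOptions": {
    "fieldDelimiter": "|",
    "skipLeadingRows": 1
  },
  ...
}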

Table definitions based on JSON source files

The following command creates a table definition based on a JSON file named fed-sample.json in a Cloud Storage bucket named [BUCKET], where the schema is stored in /tmp/fed-sample-schema.json and the generated definition file is stored in /tmp/fed-sample-json-def:

bq mkdef --source_format=NEWLINE_DELIMITED_JSON 'gs://[BUCKET]/fed-sample.json' /tmp/fed-sample-schema.json > /tmp/fed-sample-json-def

The contents of fed-sample-json-def (schema values omitted for brevity):

{
  "autodetect": true,
  "schema": {
    "fields": [
      ...
    ]
  },
  "sourceFormat": "NEWLINE_DELIMITED_JSON",
  "sourceUris": [
    "gs://[BUCKET]/fed-sample.json"
  ]
}

There are no configuration settings specific to JSON source files, but many of the general settings apply to JSON files. For more information, see General table definition settings.

Table definitions based on Google Sheets source files

The following command creates a table definition based on a Google Sheet stored on Google Drive specified by [GOOGLE_DRIVE_URI], where the schema is auto-detected and the generated definition file is stored in /tmp/google-sheet-def:

bq mkdef --source_format=GOOGLE_SHEETS "[GOOGLE_DRIVE_URI]" > /tmp/google-sheet-def

The contents of google-sheet-def:

{
  "autodetect": true,
  "sourceFormat": "GOOGLE_SHEETS",
  "sourceUris": [
    "[GOOGLE_DRIVE_URI]"
  ]
}

You can modify the table definition to skip leading rows by adding a new googleSheetsOptions section. The following example modifies the table definition so that it skips the first 3 rows:

{
  "autodetect": true,
  "sourceFormat": "GOOGLE_SHEETS",
  "sourceUris": [
    "[GOOGLE_DRIVE_URI]"
  ],
  "googleSheetsOptions": {
    "skipLeadingRows": 3
  }
}
Google Sheets table definition settings
autodetect

Type: boolean

Default value: true

API Reference: configuration.query.tableDefinitions.(key).autodetect

skipLeadingRows

Type: long

Default value: 0

API Reference: configuration.query.tableDefinitions.(key).googleSheetsOptions.skipLeadingRows

For a list of general configuration settings, see General table definition settings.

Enabling Google Drive access

Web UI

Follow the web-based authentication steps when creating the table in the web UI.

CLI

To enable Google Drive access:

  1. Use the Google Cloud Platform Console to enable the Google Drive API for the project you plan to use.

  2. Ensure that you have the latest version of the command-line tool:

    gcloud components update
    
  3. Authenticate with Google Drive:

    gcloud auth login --enable-gdrive-access
    

API

If you are using the BigQuery API:

  1. Use the Google Cloud Platform Console to enable the Google Drive API for the project making the API call.

  2. Request the OAuth scope for Google Drive in addition to the scope for BigQuery.
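
    For example, a request might combine the BigQuery scope with a read-only Google Drive scope; the exact Drive scope depends on the access your application needs:

    https://www.googleapis.com/auth/bigquery
    https://www.googleapis.com/auth/drive.readonly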

CSV and JSON schema auto-detection

If you specify a CSV or JSON file without including a schema description, BigQuery makes a best-effort attempt to automatically infer the schema. Automatic detection is currently available in the command-line tool and the BigQuery API.

BigQuery starts the inference process by selecting a random file in the data set and scanning up to 100 rows of the file to use as a representative sample. BigQuery then examines each field and attempts to assign a data type to that field based on the values in that sample.

Schema auto-detection is enabled by default when you create a table definition file. To disable auto-detection, you can pass the --noautodetect flag when creating the table definition using the mkdef command. You can also modify an existing table definition file by changing the value of the autodetect property to false. For example, the following excerpt from a table definition JSON file disables auto-detection:

{
  "autodetect": false,
  ...
}
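
Alternatively, you can disable auto-detection when you create the definition file by using the --noautodetect flag mentioned above, as in this sketch:

bq mkdef --noautodetect --source_format=CSV "[GOOGLE_DRIVE_URI]" id:integer,name:string > /tmp/my-table-def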

Other automatic detection details include:

Compression
BigQuery recognizes gzip-compatible file compression when opening a file.
CSV Delimiter
BigQuery detects the following delimiters:
  • comma (,)
  • pipe (|)
  • tab (\t)
CSV Header
BigQuery infers headers by comparing the first row of the file with other rows in the data set. If the first line contains only strings, and the other lines do not, BigQuery assumes that the first row is a header row.
CSV Quoted new lines
BigQuery detects quoted new line characters within a CSV field and does not interpret the quoted new line character as a row boundary.
Timestamps
BigQuery detects a wide array of timestamp formats, including, but not limited to:
  • `yyyy-mm-dd` in any order
  • `yyyy-mm-dd hh:mm:ss`
  • `yyyy-mm-dd hh:mm:ss.sss`
Other timestamp details:
  • Date separators can be "-", "/", or "."
  • A time zone can be appended using an offset or name

The _FILE_NAME pseudo column

Tables based on external data sources provide a pseudo column named `_FILE_NAME`. This column contains the fully qualified path to the file to which the row belongs. This column is available only in external tables that reference Google Cloud Storage and Google Drive.

The _FILE_NAME column name is reserved, which means that you cannot create a column by that name in any of your tables. To select the value of _FILE_NAME, you must use an alias. The following example demonstrates selecting _FILE_NAME by assigning the alias fn to the pseudo column.

  1. Create a table definition for the table wikipedia using the URI for the files from a Cloud Storage bucket. For example:

    bq mkdef --source_format=CSV "gs://[BUCKET_URI]" id:integer,name:string,category:integer,depth:string > /tmp/wikipedia

    The [BUCKET_URI] comprises your bucket name and your filename. For example, if the Cloud Storage bucket is named mybucket and the data file is named fed-sample.csv, the bucket URI would be gs://mybucket/fed-sample.csv.

  2. Query the _FILE_NAME pseudo column:

    bq query --project_id=[PROJECT_ID] --external_table_definition=wikipedia::/tmp/wikipedia 'select name, _FILE_NAME as fn from wikipedia where name contains "Alex";'
