Querying Google Cloud Storage Data

This page describes how to use BigQuery to query data stored in Google Cloud Storage.

BigQuery supports the following formats for data stored in Google Cloud Storage:

  • Comma-separated values (CSV)
  • JSON (newline delimited)
  • Avro
  • Cloud Datastore backups

To directly query a Cloud Storage external data source, provide the Cloud Storage URI path to your data and create a table that references the data source. The table used to reference the Cloud Storage data source can be a permanent table or a temporary table.

Before you begin

  • If you plan to use the bq command-line tool instead of the web UI, ensure that you have activated BigQuery and installed the Google Cloud SDK. For more information, see the Before you begin section of the command-line quickstart.
  • If you are using the command-line tool, set a default project so that you can omit the --project_id flag. For information on how to set a default project, see the Cloud SDK documentation for gcloud config set.

Permanent versus temporary tables

You can query an external data source in BigQuery using a permanent table or a temporary table. When you use a permanent table, you create a table in a BigQuery dataset that is linked to your external data source. Because the table is permanent, you can use dataset-level access controls to share the table with others who also have access to the underlying external data source, and you can query the table at any time.

When you query an external data source using a temporary table, you submit a command that includes a query and creates a non-permanent table linked to the external data source. When you use a temporary table, you do not create a table in one of your BigQuery datasets. Because the table is not permanently stored in a dataset, it cannot be shared with others. Querying an external data source using a temporary table is useful for one-time, ad-hoc queries over external data, or for extract, transform, and load (ETL) processes.

Querying Cloud Storage data using permanent tables

To query an external data source using a permanent table, you create a table in a BigQuery dataset that is linked to your external data source using a table definition. The data is not stored in the BigQuery table. Because the table is permanent, you can use dataset-level access controls to share the table with others who also have access to the underlying external data source.

If you are using the CLI or API, you begin by creating a table definition file that is used to define the schema for the external data source. You can also create and use your own schema file or enter the schema inline. If you are using the web UI, you enter the table schema manually.

To query Cloud Storage data using a permanent table, you:

  • Create a table definition file (for the CLI or API)
  • Create a table in BigQuery linked to the external data source
  • Query the data using the permanent table

Creating a table definition file

A table definition file defines the schema for a Google Cloud Storage external data source; by default, the schema is generated using schema auto-detection. The table definition also contains metadata about the table, such as the table's source format and related properties. You can use a table definition file with both permanent and temporary tables.

Alternatively, you can manually describe an external table's schema definition using the web UI, the CLI, or the API. In addition, you can generate your own JSON schema file for use with the CLI and API. If you define the schema manually or use a schema file, you do not need to create a table definition file.

To create a table definition file:

CLI

  1. Use the command-line tool's mkdef command to create a table definition. The mkdef command generates a table definition file in JSON format.

    The following examples create a table definition and pipe the output to a file named /tmp/wikipedia:

    CSV
    bq mkdef --source_format=CSV "gs://[BUCKET_URI]" id:integer,name:string,category:integer,depth:string > /tmp/wikipedia

    The table is in CSV format and has four fields: id, name, category, and depth.

    If you use a CSV file as your external data source, BigQuery can auto-detect the schema if you don't specify one.
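    For example, to rely on auto-detection, you can omit the inline schema entirely (a sketch using the same placeholder URI):

    bq mkdef --source_format=CSV "gs://[BUCKET_URI]" > /tmp/wikipedia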

    JSON
    bq mkdef --source_format=NEWLINE_DELIMITED_JSON "gs://[BUCKET_URI]" id:integer,name:string,category:integer,depth:string > /tmp/wikipedia

    The table is in JSON (newline delimited) format and has four fields: id, name, category, and depth.

    If you use a JSON file as your external data source, BigQuery can auto-detect the schema if you don't specify one.

    Avro
    bq mkdef --source_format=AVRO "gs://[BUCKET_URI]" > /tmp/wikipedia

    If you use an Avro file as your external data source, BigQuery auto-detects the schema if you don't specify one.

    Cloud Datastore
    bq mkdef --source_format=DATASTORE_BACKUP "gs://[BUCKET_URI]" > /tmp/wikipedia

    If you use a Cloud Datastore backup as your external data source, BigQuery auto-detects the schema if you don't specify one.

    The [BUCKET_URI] comprises your bucket name and your object (filename). For example, if the Cloud Storage bucket is named mybucket and the data file is named fed-sample.csv, the bucket URI would be gs://mybucket/fed-sample.csv.

    If your table is separated into multiple files that share a common base-name, you can refer to the files using a wildcard. You append an asterisk (*) to the base-name and enclose the bucket and filename in quotes. For example, if you have two files named fed-sample000001.csv and fed-sample000002.csv, the bucket URI would be "gs://mybucket/fed-sample*". You can only use one wildcard for objects (filenames) within your bucket. The wildcard can appear inside the object name or at the end of the object name. Appending a wildcard to the bucket name is unsupported.
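    For example, the following command generates a single table definition covering both fed-sample files by using a wildcard URI (a sketch; the inline schema is illustrative):

    bq mkdef --source_format=CSV "gs://mybucket/fed-sample*" id:integer,name:string,category:integer,depth:string > /tmp/fed-sample-def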

  2. If necessary, manually edit the table definition file to modify, add, or delete custom parameters and schema information.

API

If you are using the BigQuery API, define a table definition as part of your job configuration. For more information, see configuration.query.tableDefinitions in the API Reference.

Many table definition configuration settings apply to more than one source format. For detailed information about settings and source formats, see the API Reference descriptions for each of the following settings:

General table definition settings
autodetect

Type: boolean

Default value: true

Description: See CSV and JSON schema auto-detection.

API Reference: configuration.query.tableDefinitions.(key).autodetect

compression

Type: string

Default value: NONE

Valid values: GZIP and NONE.

API Reference: configuration.query.tableDefinitions.(key).compression

ignoreUnknownValues

Type: boolean

Default value: false

API Reference: configuration.query.tableDefinitions.(key).ignoreUnknownValues

maxBadRecords

Type: integer

Default value: 0

API Reference: configuration.query.tableDefinitions.(key).maxBadRecords

schema

Type: nested object

API Reference: configuration.query.tableDefinitions.(key).schema

sourceFormat

Type: string

API Reference: configuration.query.tableDefinitions.(key).sourceFormat

sourceUris

Type: list

API Reference: configuration.query.tableDefinitions.(key).sourceUris

For the complete list of table definition configuration settings, see configuration.query.tableDefinitions in the API Reference.
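
For example, the following excerpt from a table definition file sets several of these general settings explicitly (a sketch; the values shown are illustrative):

{
  "autodetect": true,
  "compression": "NONE",
  "ignoreUnknownValues": false,
  "maxBadRecords": 0,
  "sourceFormat": "CSV",
  "sourceUris": [
    "gs://[BUCKET]/fed-sample.csv"
  ]
}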

Table definitions based on CSV source files

For CSV and JSON files, you can generate a table definition or manually include the table schema. If you generate a table definition, BigQuery attempts to automatically detect the schema using your source data. For Avro and Cloud Datastore backup files, BigQuery reads the schema directly from the data source.

The following command creates a table definition based on a CSV file named fed-sample.csv in a Cloud Storage bucket named [BUCKET], where the schema is stored in /tmp/fed-sample-schema.json and the generated definition file is stored in /tmp/fed-sample-def:

bq mkdef --source_format=CSV 'gs://[BUCKET]/fed-sample.csv' /tmp/fed-sample-schema.json > /tmp/fed-sample-def

The contents of fed-sample-def (schema values omitted for brevity):

{
  "csvOptions": {
    "allowJaggedRows": false,
    "allowQuotedNewlines": false,
    "encoding": "UTF-8",
    "fieldDelimiter": ",",
    "quote": "\"",
    "skipLeadingRows": 0
  },
  "schema": {
    "fields": [
      ...
    ]
  },
  "sourceFormat": "CSV",
  "sourceUris": [
    "gs://[BUCKET]/fed-sample.csv"
  ]
}

You can modify a table definition file so that subsequent calls to bq mk or bq query use the updated values. For example, if you decide that you want to allow jagged rows, you can modify that property in fed-sample-def:

{
  "csvOptions": {
    "allowJaggedRows": true,
  ...

The following options are available only for table definitions based on CSV files:

CSV table definition options
allowJaggedRows

Type: boolean

Default value: false

API Reference: configuration.query.tableDefinitions.(key).csvOptions.allowJaggedRows

allowQuotedNewlines

Type: boolean

Default value: false

Description: CSV files with newline characters are difficult to process in parallel. Setting this to true can reduce the performance gains associated with parallel processing.

API Reference: configuration.query.tableDefinitions.(key).csvOptions.allowQuotedNewlines

encoding

Type: string

Default value: UTF-8

Valid values: UTF-8 and ISO_8859_1.

API Reference: configuration.query.tableDefinitions.(key).csvOptions.encoding

fieldDelimiter

Type: string

Default value: ","

API Reference: configuration.query.tableDefinitions.(key).csvOptions.fieldDelimiter

quote

Type: string

Default value: " (quotation mark)

API Reference: configuration.query.tableDefinitions.(key).csvOptions.quote

skipLeadingRows

Type: long

Default value: 0

API Reference: configuration.query.tableDefinitions.(key).csvOptions.skipLeadingRows
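
For example, the following excerpt from a table definition file describes a tab-delimited file with a single header row to skip (a sketch; the values are illustrative):

{
  "csvOptions": {
    "fieldDelimiter": "\t",
    "skipLeadingRows": 1,
  ...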

Table definitions based on JSON source files

For CSV and JSON files, you can generate a table definition or manually include the table schema. If you generate a table definition, BigQuery attempts to automatically detect the schema using your source data. For Avro and Cloud Datastore backup files, BigQuery reads the schema directly from the data source.

The following command creates a table definition based on a JSON file named fed-sample.json in a Cloud Storage bucket named [BUCKET], where the schema is stored in /tmp/fed-sample-schema.json and the generated definition file is stored in /tmp/fed-sample-json-def:

bq mkdef --source_format=NEWLINE_DELIMITED_JSON 'gs://[BUCKET]/fed-sample.json' /tmp/fed-sample-schema.json > /tmp/fed-sample-json-def

The contents of fed-sample-json-def (schema values omitted for brevity):

{
  "autodetect": true,
  "schema": {
    "fields": [
      ...
    ]
  },
  "sourceFormat": "NEWLINE_DELIMITED_JSON",
  "sourceUris": [
    "gs://[BUCKET]/fed-sample.json"
  ]
}

There are no configuration settings specific to JSON source files, but many of the general settings apply to JSON files. For more information, see General table definition settings.

Avro and Cloud Datastore backup schema detection

If you use an Avro file or a Cloud Datastore backup as an external data source, BigQuery detects the schema using the source data. In the web UI, choosing Avro or Cloud Datastore backup hides the Automatically detect option because BigQuery automatically generates schema information for these files.

When BigQuery detects schemas, it might, on rare occasions, change a field name to make it compatible with BigQuery SQL syntax.

To see the detected schema using the CLI, create a permanent table and use the bq show command. Alternatively, you can use the BigQuery web UI to view the schema.
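
For example, the following command displays the detected schema for a permanent table (a sketch, assuming the mydataset.fed_sample table created later on this page):

bq show --format=prettyjson mydataset.fed_sample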

For information about data type conversion, see:

  • Avro conversions
  • Cloud Datastore conversions

CSV and JSON schema auto-detection

In the web UI, if you check Automatically detect, BigQuery makes a best-effort attempt to automatically infer the schema for CSV and JSON files. In the CLI or API, if you specify a CSV or JSON file without including a schema description, a schema file, or a table definition file, BigQuery makes a best-effort attempt to automatically infer the schema.

BigQuery starts the inference process by selecting a random file in the data set and scanning up to 100 rows of the file to use as a representative sample. BigQuery then examines each field and attempts to assign a data type to that field based on the values in that sample.

In the CLI, schema auto-detection is enabled by default when you create a table definition file. To disable auto-detection, pass the --noautodetect flag when creating the table definition using the mkdef command. You can also modify an existing table definition file by changing the value of the autodetect property to false. For example, the following excerpt from a table definition JSON file disables auto-detection:

{
  "autodetect": false,
  ...
}
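
Equivalently, the following command creates the table definition with auto-detection disabled from the start (a sketch reusing the placeholder URI and inline schema from the earlier examples):

bq mkdef --noautodetect --source_format=CSV "gs://[BUCKET_URI]" id:integer,name:string,category:integer,depth:string > /tmp/wikipedia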

Other automatic detection details include:

Compression
BigQuery recognizes gzip-compatible file compression when opening a file.
CSV Delimiter
BigQuery detects the following delimiters:
  • comma (,)
  • pipe (|)
  • tab (\t)
CSV Header
BigQuery infers headers by comparing the first row of the file with other rows in the data set. If the first line contains only strings, and the other lines do not, BigQuery assumes that the first row is a header row.
CSV Quoted new lines
BigQuery detects quoted new line characters within a CSV field and does not interpret the quoted new line character as a row boundary.
Timestamps
BigQuery detects a wide array of timestamp formats, including, but not limited to:
  • `yyyy-mm-dd` in any order
  • `yyyy-mm-dd hh:mm:ss`
  • `yyyy-mm-dd hh:mm:ss.sss`
Other timestamp details:
  • Date separators can be "-", "/", or "."
  • A time zone can be appended using an offset or name
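
For example, given the detection rules above, each of the following illustrative values would be detected as a timestamp:

2017-05-01 10:30:00
2017/05/01 10:30:00.123
2017.05.01 10:30:00 UTC
2017-05-01 10:30:00-07:00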

Creating and querying a permanent table

You can create a permanent table linked to your external data source using the web UI, CLI, or API. If you use the CLI or API, you can generate a table definition file, you can manually provide the table's schema definition, or you can provide a schema file. If you do not provide schema information, schema auto-detection is used.

If you use the web UI, you input the table schema manually, or for supported data sources, you can enable schema auto-detection.

To create and query a permanent table:

Web UI

  1. Go to the BigQuery web UI.

  2. In the navigation panel, hover over an empty dataset, click the down arrow icon, and then click Create new table. You can also hover over an existing table in a dataset.

  3. On the Create Table page, in the Source Data section:

    • For Location, select Google Cloud Storage, and in the source field, enter the Cloud Storage URI in the following format: gs://[BUCKET]/[FILE], where [BUCKET] is the Cloud Storage bucket name and [FILE] is the name of the file containing the table data. For example, enter gs://mybucket/yob2010.csv for a file named yob2010.csv in a bucket named mybucket.
    • For File format, select the format of your data. Valid formats for Cloud Storage locations include:
      • Comma-separated values (CSV)
      • JSON (newline delimited)
      • Avro
      • Cloud Datastore backup
  4. On the Create Table page, in the Destination Table section:

    • For Table name, choose the appropriate dataset, and in the table name field, enter the name of the permanent table you're creating in BigQuery.
    • Verify that Table type is set to External table.
  5. In the Schema section, for CSV and JSON files, enter the table schema or check Automatically detect to allow BigQuery to make a best-effort attempt to automatically infer the schema. For Cloud Datastore backups and Avro files, the table schema is automatically detected.

  6. Select applicable items in the Options section and then click Create Table.

Once the permanent table is created, you can run a query against the table as if it were a native BigQuery table, subject to the limits of external data sources. To query the table in the web UI, click Compose Query, enter your query in the New Query box, and then click Run Query.

After your query completes, you can export the results as CSV or JSON files, save the results as a table, or save the results to Google Sheets. See Download, save, and export data for more information.

CLI

  1. Use the bq mk command to create a permanent table. In the following examples, replace [DATASET] with the name of the appropriate dataset. This example assumes you are manually entering schema information on the command line.

     bq mk --project_id=[PROJECT_ID] --external_table_definition=[SCHEMA]@[SOURCE_FORMAT]=[CLOUD_STORAGE_URI] [DATASET].[TABLE_NAME]
    

    Where [SOURCE_FORMAT] is one of: CSV, NEWLINE_DELIMITED_JSON, AVRO, or DATASTORE_BACKUP.

    For example, the following command creates a permanent table named fed_sample in a dataset named mydataset, based on the fed-sample.csv file:

    bq mk --project_id=[PROJECT_ID] --external_table_definition=id:integer,name:string,num:integer,ord:string@CSV=gs://[MY_BUCKET]/fed-sample.csv mydataset.fed_sample
    

    Where [MY_BUCKET] is a valid Cloud Storage bucket name.

    If your schema is more complicated, you can store it in a separate file in JSON format. For example, the following JSON array represents the schema for the fed-sample.csv data.

    [
      {
        "name": "id",
        "type": "integer",
        "mode": "nullable"
      },
      {
        "name": "name",
        "type": "string",
        "mode": "nullable"
      },
      {
        "name": "num",
        "type": "integer",
        "mode": "nullable"
      },
      {
        "name": "ord",
        "type": "string",
        "mode": "nullable"
      }
    ]
    

    If the JSON schema is stored in a file named fed-sample-schema.json in your /tmp directory, you can refer to that file instead of supplying the schema inline. You can also use a generated external table definition file in the command.

    bq mk --project_id=[PROJECT_ID] --external_table_definition=/tmp/fed-sample-schema.json@CSV=gs://[MY_BUCKET]/fed-sample.csv mydataset.fed_sample
    
  2. Run a query against the table by running the following command:

    bq query --project_id=[PROJECT_ID] 'select name from [DATASET].fed_sample where name contains "Alex";'
    

    Where [PROJECT_ID] is a valid project ID.

API
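
To create a permanent table using the API, call tables.insert with an externalDataConfiguration in the table resource. The following is a minimal sketch of the request body (the project, dataset, bucket, and table names are illustrative):

{
  "tableReference": {
    "projectId": "[PROJECT_ID]",
    "datasetId": "mydataset",
    "tableId": "fed_sample"
  },
  "externalDataConfiguration": {
    "sourceFormat": "CSV",
    "sourceUris": [
      "gs://[MY_BUCKET]/fed-sample.csv"
    ],
    "autodetect": true
  }
}

You can then query the table as you would a native table, subject to the limits of external data sources.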

Querying Cloud Storage data using temporary tables

To query an external data source without creating a permanent table, you run a command that combines a table definition with a query. The table definition is used to create the temporary table, and the query runs against the temporary table. Querying an external data source using a temporary table is supported by the BigQuery CLI and API.

When you use a temporary table, you do not create a table in one of your BigQuery datasets. Because the table is not permanently stored in a dataset, it cannot be shared with others. Querying an external data source using a temporary table is useful for one-time, ad-hoc queries over external data, or for extract, transform, and load (ETL) processes.

To query Cloud Storage data using a temporary table, you:

  • Generate a table definition file or provide your own schema file
  • Run a query using the table definition file or schema file

Creating and querying a temporary table

Creating and querying a temporary table is currently supported by the CLI and API. When you use temporary tables, you submit both a query and a table definition file.

To create and query a temporary table:

CLI

  1. Run a query against a temporary table using the following command:

    bq query --project_id=[PROJECT_ID] --external_table_definition=[TABLE_NAME]::[FILE_PATH] 'select count(*) from [TABLE_NAME];'

    Where [PROJECT_ID] is a valid project ID, [FILE_PATH] is the path to the table definition file, and [TABLE_NAME] is the name of the temporary table you're creating.

API
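
To query an external data source using a temporary table with the API, supply the table definition in the tableDefinitions map of a query job configuration passed to jobs.insert. The following is a minimal sketch of the configuration portion of the request (the table name, URI, and query are illustrative):

{
  "configuration": {
    "query": {
      "query": "select count(*) from wikipedia;",
      "tableDefinitions": {
        "wikipedia": {
          "sourceFormat": "CSV",
          "sourceUris": [
            "gs://[BUCKET_URI]"
          ],
          "autodetect": true
        }
      }
    }
  }
}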

The _FILE_NAME pseudo column

Tables based on external data sources provide a pseudo column named `_FILE_NAME`. This column contains the fully qualified path to the file to which the row belongs. This column is available only in external tables that reference Google Cloud Storage and Google Drive.

The _FILE_NAME column name is reserved, which means that you cannot create a column by that name in any of your tables. To select the value of _FILE_NAME, you must use an alias. The following example demonstrates selecting _FILE_NAME by assigning the alias fn to the pseudo column.

  1. Create a table definition for the table wikipedia using the URI for the files from a Cloud Storage bucket. For example:

    bq mkdef --source_format=CSV "gs://[BUCKET_URI]" id:integer,name:string,category:integer,depth:string > /tmp/wikipedia

    The [BUCKET_URI] comprises your bucket name and your filename. For example, if the Cloud Storage bucket is named mybucket and the data file is named fed-sample.csv, the bucket URI would be gs://mybucket/fed-sample.csv.

  2. Query the _FILE_NAME pseudo column:

    bq query --project_id=[PROJECT_ID] --external_table_definition=wikipedia::/tmp/wikipedia 'select name, _FILE_NAME as fn from wikipedia where name contains "Alex";'
