Creating and Querying Federated Data Sources

This page describes how to query data stored outside of BigQuery.

A federated data source is a data source that you can query directly even though the data is not stored in BigQuery. Instead of loading or streaming the data, you create a table that references the external data source. You can create a permanent federated table or a temporary table.

BigQuery offers support for querying data directly from Google Cloud Storage and Google Drive. Use cases include:

  • Loading and cleaning your data in one pass by querying the data from a federated data source (a location external to BigQuery) and writing the cleaned result into BigQuery storage, as shown in the sketch after this list.
  • Having a small amount of frequently changing data that you join with other tables. As a federated data source, the frequently changing data does not need to be reloaded every time it is updated.
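
For example, the following minimal sketch (assuming a federated table named mydataset.fed_sample, which this page shows how to create, and a hypothetical destination table mydataset.cleaned) queries the external data and writes the cleaned result into BigQuery storage in one pass:

bq query --destination_table=mydataset.cleaned 'select id, name from mydataset.fed_sample where name is not null;'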

Limitations include the following:

  • BigQuery does not guarantee data consistency for federated data sources. Changes to the underlying data while a query is running can result in unexpected behavior.
  • If query speed is a priority, load the data into BigQuery instead of setting up a federated data source.

BigQuery supports the following data-source formats:

Google Cloud Storage:

  • Comma-separated values (CSV)
  • JSON (newline-delimited)
  • Avro files
  • Google Cloud Datastore backup files

Google Drive:

  • Comma-separated values (CSV)
  • JSON (newline-delimited)
  • Avro files
  • Google Sheets (first tab only)

To directly query a federated data source, provide the Cloud Storage or Google Drive URI path to your data and create an external table that references that data source. For CSV and JSON files, you can also include the table schema. If you omit the schema, BigQuery attempts to automatically detect the schema. For Avro and Cloud Datastore backup files, BigQuery reads the schema from the data source.

Before you begin

  • Identify a data file that you plan to use as your federated data source.
  • If you plan to use the bq command-line tool instead of the web UI, ensure that you have activated BigQuery and installed the Google Cloud SDK. For more information, see the Before you begin section of the command-line tool quickstart.
  • The command-line tool examples on this page assume that you have set a default project so that you can omit the project_id flag. For information on how to set a default project, see the Cloud SDK documentation for gcloud config set.

Creating federated tables

A federated table is a permanent table that references an external data source, which means that the data is not stored in BigQuery. To create a federated table, use the external table type. Because the table is permanent, you can share the table with others who also have access to the underlying external data source.

Creating a federated table using Cloud Storage

Web UI

  1. Go to the BigQuery web UI.

  2. In the navigation, hover over a dataset ID, click the down arrow icon next to the ID, and then click Create new table.

  3. Under Source data, select Google Cloud Storage for the Location, and then enter the URI to your Cloud Storage file. For example, enter gs://mybucket/yob2010.csv for a file named yob2010.csv in the mybucket Cloud Storage bucket.

  4. Choose the format of your data. Valid formats include the following:

    • Comma-separated values (CSV)
    • JSON (newline delimited)
    • Avro
    • Cloud Datastore backup
  5. Under Destination Table, input a table ID, and then select External table for the Table type field.

  6. For CSV and JSON files, input the table schema. For Cloud Datastore backups and Avro files, you can omit the table schema and let BigQuery detect the schema. Click the Submit button to complete the process.

You can then run a query against the table as if it were a native BigQuery table, subject to the limitations of federated data sources.

After your query completes, you can download the results as CSV or JSON, save the results as a table, or save the results to Google Sheets. See Download, save, and export data for more information.

CLI

To follow along with the example below:

  1. Create a CSV file named fed-sample.csv that contains the following sample data:

    1,Alice,10,one
    2,Alex,11,one
    3,Bob,11,two
    
  2. Upload the file to a Cloud Storage bucket using the gsutil tool (replace [MY_BUCKET] with the name of a Cloud Storage bucket):

    gsutil cp fed-sample.csv gs://[MY_BUCKET]/fed-sample.csv
    

To create an external table:

  1. Create the table by running the following command with your own values for the dataset ID ([DATASET_ID]):

     bq mk --external_table_definition=[SCHEMA]@[SOURCE_FORMAT]=[CLOUD_STORAGE_URI] [DATASET_ID].[TABLE_NAME]
    

    Where [SCHEMA] is either an inline schema definition or the path to a JSON schema file, and [SOURCE_FORMAT] is one of: CSV, NEWLINE_DELIMITED_JSON, AVRO, or DATASTORE_BACKUP.

    For example, the following command uses the fed-sample.csv file to create a table named fed_sample in a dataset named mydataset:

    bq mk --external_table_definition=id:integer,name:string,num:integer,ord:string@CSV=gs://[MY_BUCKET]/fed-sample.csv mydataset.fed_sample
    

    Where [MY_BUCKET] is a valid Cloud Storage bucket name. If your schema is more complicated, you can store it in a separate file in JSON format. For example, the following JSON array represents the same schema as the inline definition for the fed-sample.csv data.

    [
      {
        "name": "id",
        "type": "integer",
        "mode": "nullable"
      },
      {
        "name": "name",
        "type": "string",
        "mode": "nullable"
      },
      {
        "name": "num",
        "type": "integer",
        "mode": "nullable"
      },
      {
        "name": "ord",
        "type": "string",
        "mode": "nullable"
      }
    ]
    

    If the JSON schema is stored in a file named fed-sample-schema.json in your /tmp directory, you can refer to that file instead of supplying the schema inline:

    bq mk --external_table_definition=/tmp/fed-sample-schema.json@CSV=gs://[MY_BUCKET]/fed-sample.csv mydataset.fed_sample
    
  2. Run a query against the table by running the following command:

    bq query 'select name from [DATASET_ID].fed_sample where name contains "Alex";'
    

    Where [DATASET_ID] is a valid dataset ID.

API

  • The sourceUris must be fully qualified, in the format gs://[BUCKET]/[OBJECT].

  • Specify the data format by setting the externalDataConfiguration.sourceFormat property. Valid values include:

    • CSV
    • NEWLINE_DELIMITED_JSON
    • AVRO
    • DATASTORE_BACKUP
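
As a minimal sketch of such a call (not a complete recipe), you can create the table with a direct REST request to the tables.insert endpoint. An OAuth access token is assumed in the TOKEN environment variable, and the bracketed values are placeholders:

curl -X POST -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
  "https://www.googleapis.com/bigquery/v2/projects/[PROJECT_ID]/datasets/[DATASET_ID]/tables" \
  -d '{
    "tableReference": {
      "projectId": "[PROJECT_ID]",
      "datasetId": "[DATASET_ID]",
      "tableId": "fed_sample"
    },
    "externalDataConfiguration": {
      "sourceFormat": "CSV",
      "sourceUris": ["gs://[MY_BUCKET]/fed-sample.csv"],
      "schema": {
        "fields": [
          {"name": "id", "type": "INTEGER"},
          {"name": "name", "type": "STRING"},
          {"name": "num", "type": "INTEGER"},
          {"name": "ord", "type": "STRING"}
        ]
      }
    }
  }'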

Creating a federated table using Google Drive

Before you begin, ensure that you have enabled Google Drive access.

To specify a Google Drive file, use one of these URI formats:

CSV, JSON, Avro, or Google Sheets:

  https://drive.google.com/open?id=[FILE_ID]

  To access a Google Drive file's URI in this format:

  1. Go to Google Drive.
  2. Right-click the file and select Get shareable link.

Google Sheets:

  https://docs.google.com/spreadsheets/d/[FILE_ID]/edit#gid=[IDENTIFIER]

  To access a Google Drive file's URI in this format:

  1. Go to Google Sheets.
  2. Open the file.
  3. Copy the URL from your browser's address bar.

Web UI

  1. Go to the BigQuery web UI.

  2. In the navigation, hover over a dataset ID, click the down arrow icon next to the ID, and then click Create new table.

  3. Under Source data, select Google Drive for the Location, and then enter the URI to your Google Drive file.

  4. Select the format of your data. For Google Sheets, select the CSV option. Valid formats include the following:

    • Comma-separated values (CSV)
    • JSON (newline delimited)
    • Avro
    • Google Sheets (first tab only)
  5. Under Destination Table, input a table ID, and then select External table for the Table type field.

  6. For CSV and JSON files, input the table schema. For Avro files, you can omit the table schema and let BigQuery detect the schema. Click the Submit button to complete the process.

You can then run a query against the table as if it were a native BigQuery table, subject to the limitations of federated data sources.

After your query completes, you can download the results as CSV or JSON, save the results as a table, or save the results to Google Sheets. See Download, save, and export data for more information.

CLI

To follow along with the example below:

  1. Create a CSV file named fed-sample.csv that contains the following sample data:

    1,Alice,10,one
    2,Alex,11,one
    3,Bob,11,two
    
  2. Upload the file to Google Drive.

To create an external table:

  1. Create the table by running the following command with your own values:

    bq mk --external_table_definition=[SCHEMA]@[SOURCE_FORMAT]=[GOOGLE_DRIVE_URL] [DATASET_ID].[TABLE_NAME]
    

    Where [SOURCE_FORMAT] is one of: CSV, NEWLINE_DELIMITED_JSON, AVRO, or GOOGLE_SHEETS.

    For example, the following command uses the fed-sample.csv file to create a table named fed_sample in a dataset named mydataset:

    bq mk --external_table_definition=id:integer,name:string,num:integer,ord:string@CSV=[GOOGLE_DRIVE_URL] mydataset.fed_sample
    
  2. Run a query against the table by running the following command:

    bq query 'select name from [DATASET_ID].fed_sample where name contains "Alex";'
    

    Where [DATASET_ID] is a valid dataset ID.

API

  • The sourceUris must be fully qualified, in the format https://drive.google.com/open?id=[FILE_ID] or https://docs.google.com/spreadsheets/d/[FILE_ID]/edit#gid=[IDENTIFIER].

  • Specify the data format by setting the externalDataConfiguration.sourceFormat property. Valid values include:

    • CSV
    • NEWLINE_DELIMITED_JSON
    • AVRO
    • GOOGLE_SHEETS
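
The request shape matches the Cloud Storage API sketch earlier on this page; only the external configuration changes. As an illustrative fragment (placeholders throughout, with schema auto-detection shown as one option), the externalDataConfiguration for a Google Sheets source might look like the following:

{
  "sourceFormat": "GOOGLE_SHEETS",
  "sourceUris": [
    "https://docs.google.com/spreadsheets/d/[FILE_ID]/edit#gid=[IDENTIFIER]"
  ],
  "autodetect": true
}

The request must also be authorized with the Google Drive OAuth scope, as described in Enabling Google Drive access.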

Querying external data using temporary tables

To directly query an external data source without creating a permanent table, run a query that creates a temporary table. Querying a temporary table is useful for one-time, ad-hoc queries over remote data, or for extract, transform, and load (ETL) processes.

The example below creates a table definition file that contains the schema. The example then uses the definition file and the data source URIs to create and query a temporary table.

Querying data from Cloud Storage

CLI

  1. Create a table definition by running one of the following commands, which writes the table definition to the file /tmp/wikipedia:

    CSV
    bq mkdef --source_format=CSV "gs://[BUCKET_URI]" id:integer,name:string,category:integer,depth:string > /tmp/wikipedia

    The table is in CSV format and has four fields: id, name, category, and depth.

    JSON
    bq mkdef --source_format=NEWLINE_DELIMITED_JSON "gs://[BUCKET_URI]" id:integer,name:string,category:integer,depth:string > /tmp/wikipedia

    The table is in JSON (newline-delimited) format and has four fields: id, name, category, and depth.

    Avro
    bq mkdef --source_format=AVRO "gs://[BUCKET_URI]" > /tmp/wikipedia

    If you use an Avro file as your federated data source, BigQuery can detect the schema if you don't specify one.

    Cloud Datastore
    bq mkdef --source_format=DATASTORE_BACKUP "gs://[BUCKET_URI]" > /tmp/wikipedia

    If you use a Cloud Datastore backup as your federated data source, BigQuery can auto-detect the schema if you don't specify one.

    The [BUCKET_URI] comprises your bucket name and your filename. For example, if the Cloud Storage bucket is named mybucket and the data file is named fed-sample.csv, the bucket URI would be gs://mybucket/fed-sample.csv.

    If your table is separated into multiple files that share a common base-name, append an asterisk (*) to the base-name and enclose the bucket and filename in quotes. For example, if you have two files named fed-sample000001.csv and fed-sample000002.csv, the bucket URI would be "gs://mybucket/fed-sample*".

  2. Run a query against the table by running the following command:

    bq query --external_table_definition=wikipedia::/tmp/wikipedia 'select name from wikipedia where name contains "Alex";'

    This command runs a query that references the table as wikipedia and supplies the table definition from the file created in the previous step.

API

If you are using the BigQuery API, define the temporary table as part of your query job configuration. For more information, see configuration.query.tableDefinitions in the API Reference.

Querying data from Google Drive

Before you begin, ensure that you have enabled Google Drive access.

To specify a Google Drive file, use one of these URI formats:

CSV, JSON, Avro, or Google Sheets:

  https://drive.google.com/open?id=[FILE_ID]

  To access a Google Drive file's URI in this format:

  1. Go to Google Drive.
  2. Right-click the file and select Get shareable link.

Google Sheets:

  https://docs.google.com/spreadsheets/d/[FILE_ID]/edit#gid=[IDENTIFIER]

  To access a Google Drive file's URI in this format:

  1. Go to Google Sheets.
  2. Open the file.
  3. Copy the URL from your browser's address bar.

CLI

  1. Create a table definition by running one of the following commands, which writes the table definition to the file /tmp/wikipedia:

    CSV
    bq mkdef --source_format=CSV "[GOOGLE_DRIVE_URI]" id:integer,name:string,category:integer,depth:string > /tmp/wikipedia

    The table is in CSV format and has four fields: id, name, category, and depth.

    JSON
    bq mkdef --source_format=NEWLINE_DELIMITED_JSON "[GOOGLE_DRIVE_URI]" id:integer,name:string,category:integer,depth:string > /tmp/wikipedia

    The table is in JSON (newline-delimited) format and has four fields: id, name, category, and depth.

    Avro
    bq mkdef --source_format=AVRO "[GOOGLE_DRIVE_URI]" > /tmp/wikipedia

    If you use an Avro file as your federated data source, BigQuery can detect the schema if you don't specify one.

    Google Sheets
    bq mkdef --source_format=GOOGLE_SHEETS "[GOOGLE_DRIVE_URI]" > /tmp/wikipedia

    The source is the first tab of a Google Sheets file.

    Where [GOOGLE_DRIVE_URI] is the URI for your Google Drive file, in one of the formats described above. Wildcard URIs are not supported for Google Drive sources.

  2. Run a query against the table by running the following command:

    bq query --external_table_definition=wikipedia::/tmp/wikipedia 'select name from wikipedia where name contains "Alex";'

    This command runs a query that references the table as wikipedia and supplies the table definition from the file created in the previous step.

API

If you are using the BigQuery API, define the temporary table as part of your query job configuration. For more information, see configuration.query.tableDefinitions in the API Reference.

Creating table definition files

A table definition file contains metadata about a table, such as the table's source format and related properties. Table definition files can be used to create permanent federated tables or temporary external tables. The command-line tool and the API support the creation and use of table definition files.

To create and use a table definition:

CLI

  1. Use the command-line tool's mkdef command to create a table definition or create the table definition manually. The mkdef command generates a table definition file in JSON format.

  2. If necessary, manually edit the table definition file to modify, add, or delete custom parameters.

  3. Create a federated table by using bq mk, or query a temporary external table by using bq query, passing the table definition file as the value of the --external_table_definition flag.
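
For example, the following sketch strings the steps together, reusing the fed-sample.csv placeholders from earlier on this page (adjust the bucket, dataset, and schema to match your data):

bq mkdef --source_format=CSV "gs://[MY_BUCKET]/fed-sample.csv" id:integer,name:string,num:integer,ord:string > /tmp/fed-sample-def
# Optionally edit /tmp/fed-sample-def, then create a permanent federated table:
bq mk --external_table_definition=/tmp/fed-sample-def mydataset.fed_sample
# Or query a temporary external table directly from the definition file:
bq query --external_table_definition=fed_sample::/tmp/fed-sample-def 'select name from fed_sample;'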

API

If you are using the BigQuery API, define a table definition as part of your job configuration. For more information, see configuration.query.tableDefinitions in the API Reference.
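
As a hedged sketch of that job configuration (bracketed values are placeholders, and an OAuth access token is assumed in the TOKEN environment variable), a query job that defines a temporary external table named fed_sample might be submitted as follows:

curl -X POST -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
  "https://www.googleapis.com/bigquery/v2/projects/[PROJECT_ID]/jobs" \
  -d '{
    "configuration": {
      "query": {
        "query": "select name from fed_sample where name contains \"Alex\"",
        "tableDefinitions": {
          "fed_sample": {
            "sourceFormat": "CSV",
            "sourceUris": ["gs://[MY_BUCKET]/fed-sample.csv"],
            "schema": {
              "fields": [
                {"name": "id", "type": "INTEGER"},
                {"name": "name", "type": "STRING"},
                {"name": "num", "type": "INTEGER"},
                {"name": "ord", "type": "STRING"}
              ]
            }
          }
        }
      }
    }
  }'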

Many table definition configuration settings apply to more than one source format. For detailed information about settings and source formats, see the API Reference descriptions for each of the following settings:

General table definition settings
autodetect

Type: boolean

Default value: true

Description: See CSV and JSON schema auto-detection.

API Reference: configuration.query.tableDefinitions.(key).autodetect

compression

Type: string

Default value: NONE

Valid values: GZIP and NONE.

API Reference: configuration.query.tableDefinitions.(key).compression

ignoreUnknownValues

Type: boolean

Default value: false

API Reference: configuration.query.tableDefinitions.(key).ignoreUnknownValues

maxBadRecords

Type: integer

Default value: 0

API Reference: configuration.query.tableDefinitions.(key).maxBadRecords

schema

Type: nested object

API Reference: configuration.query.tableDefinitions.(key).schema

sourceFormat

Type: string

API Reference: configuration.query.tableDefinitions.(key).sourceFormat

sourceUris

Type: list

API Reference: configuration.query.tableDefinitions.(key).sourceUris

For the complete list of table definition configuration settings, see configuration.query.tableDefinitions in the API Reference.

Table definitions based on CSV source files

The following command creates a table definition based on a CSV file named fed-sample.csv in a Cloud Storage bucket named [BUCKET], where the schema is stored in /tmp/fed-sample-schema.json and the generated definition file is stored in /tmp/fed-sample-def:

bq mkdef --source_format=CSV 'gs://[BUCKET]/fed-sample.csv' /tmp/fed-sample-schema.json > /tmp/fed-sample-def

The contents of fed-sample-def (schema values omitted for brevity):

{
  "csvOptions": {
    "allowJaggedRows": false,
    "allowQuotedNewlines": false,
    "encoding": "UTF-8",
    "fieldDelimiter": ",",
    "quote": "\"",
    "skipLeadingRows": 0
  },
  "schema": {
    "fields": [
      ...
    ]
  },
  "sourceFormat": "CSV",
  "sourceUris": [
    "gs://[BUCKET]/fed-sample.csv"
  ]
}

You can modify a table definition file so that subsequent calls to bq mk or bq query use the updated values. For example, if you decide that you want to allow jagged rows, you can modify that property in fed-sample-def:

{
  "csvOptions": {
    "allowJaggedRows": true,
  ...
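
Subsequent commands that use the modified file then pick up the new value; for example (a sketch reusing the earlier placeholder names):

bq query --external_table_definition=fed_sample::/tmp/fed-sample-def 'select name from fed_sample;'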

The following options are available only for table definitions based on CSV files:

CSV table definition options
allowJaggedRows

Type: boolean

Default value: false

API Reference: configuration.query.tableDefinitions.(key).csvOptions.allowJaggedRows

allowQuotedNewlines

Type: boolean

Default value: false

Description: CSV files with newline characters are difficult to process in parallel. Setting this to true can reduce the performance gains associated with parallel processing.

API Reference: configuration.query.tableDefinitions.(key).csvOptions.allowQuotedNewlines

encoding

Type: string

Default value: UTF-8

Valid values: UTF-8 and ISO_8859_1.

API Reference: configuration.query.tableDefinitions.(key).csvOptions.encoding

fieldDelimiter

Type: string

Default value: ","

API Reference: configuration.query.tableDefinitions.(key).csvOptions.fieldDelimiter

quote

Type: string

Default value: " (quotation mark)

API Reference: configuration.query.tableDefinitions.(key).csvOptions.quote

skipLeadingRows

Type: long

Default value: 0

API Reference: configuration.query.tableDefinitions.(key).csvOptions.skipLeadingRows

Table definitions based on JSON source files

The following command creates a table definition based on a JSON file named fed-sample.json in a Cloud Storage bucket named [BUCKET], where the schema is stored in /tmp/fed-sample-schema.json and the generated definition file is stored in /tmp/fed-sample-json-def:

bq mkdef --source_format=NEWLINE_DELIMITED_JSON 'gs://[BUCKET]/fed-sample.json' /tmp/fed-sample-schema.json > /tmp/fed-sample-json-def

The contents of fed-sample-json-def (schema values omitted for brevity):

{
  "autodetect": true,
  "schema": {
    "fields": [
      ...
    ]
  },
  "sourceFormat": "NEWLINE_DELIMITED_JSON",
  "sourceUris": [
    "gs://[BUCKET]/fed-sample.json"
  ]
}

There are no configuration settings specific to JSON source files, but many of the general settings apply to JSON files. For more information, see General table definition settings.

Table definitions based on Google Sheets source files

The following command creates a table definition based on a Google Sheet stored on Google Drive specified by [GOOGLE_DRIVE_URI], where the schema is auto-detected and the generated definition file is stored in /tmp/google-sheet-def:

bq mkdef --source_format=GOOGLE_SHEETS "[GOOGLE_DRIVE_URI]" > /tmp/google-sheet-def

The contents of google-sheet-def:

{
  "autodetect": true,
  "sourceFormat": "GOOGLE_SHEETS",
  "sourceUris": [
    "[GOOGLE_DRIVE_URI]"
  ]
}

You can modify the table definition to skip leading rows by adding a new googleSheetsOptions section. The following example modifies the table definition so that it skips the first 3 rows:

{
  "autodetect": true,
  "sourceFormat": "GOOGLE_SHEETS",
  "sourceUris": [
    "[GOOGLE_DRIVE_URI]"
  ],
  "googleSheetsOptions": {
    "skipLeadingRows": 3
  }
}
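
After saving the modified definition, you can query the sheet through it; for example (a sketch, where sheet is an arbitrary name for the temporary table):

bq query --external_table_definition=sheet::/tmp/google-sheet-def 'select * from sheet limit 10;'
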
Google Sheets table definition settings
autodetect

Type: boolean

Default value: true

API Reference: configuration.query.tableDefinitions.(key).autodetect

skipLeadingRows

Type: long

Default value: 0

API Reference: configuration.query.tableDefinitions.(key).googleSheetsOptions.skipLeadingRows

For a list of general configuration settings, see General table definition settings.

Enabling Google Drive access

Web UI


Follow the web-based authentication steps when creating the table in the web UI.

CLI


To enable Google Drive access:

  1. Use the Google Cloud Platform Console to enable the Google Drive API for the project you plan to use.

  2. Ensure that you have the latest version of the command-line tool:

    gcloud components update
    
  3. Authenticate with Google Drive:

    gcloud auth login --enable-gdrive-access
    

API


If you are using the BigQuery API:

  1. Use the Google Cloud Platform Console to enable the Google Drive API for the project making the API call.

  2. Request the OAuth scope for Google Drive in addition to the scope for BigQuery.

Avro and Cloud Datastore backup schema detection

If you use an Avro file or a Cloud Datastore backup as a federated data source, BigQuery detects the schema for you.

When BigQuery detects schemas, it might, on rare occasions, change a field name to make it compatible with BigQuery SQL syntax.

To see the detected schema, create an external table and use the bq show command. Alternatively, you can use the BigQuery web UI to see the schema.
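
For example, the following sketch (the Avro file and table names are hypothetical placeholders) creates an external table and then displays the detected schema:

bq mk --external_table_definition=@AVRO=gs://[MY_BUCKET]/fed-sample.avro mydataset.fed_sample_avro
bq show mydataset.fed_sample_avro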

For information about data type conversion, see:

CSV and JSON schema auto-detection

If you specify a CSV or JSON file without including a schema description, BigQuery makes a best-effort attempt to automatically infer the schema. Automatic detection is currently available in the command-line tool and the BigQuery API.

BigQuery starts the inference process by selecting a random file in the data set and scanning up to 100 rows of the file to use as a representative sample. BigQuery then examines each field and attempts to assign a data type to that field based on the values in that sample.

Schema auto-detection is enabled by default. To disable auto-detection, you can pass the --noautodetect flag when creating the table definition using mkdef or modify an existing table definition file by changing the value of the autodetect property to false. For example, the following excerpt from a table definition JSON file disables auto-detection:

{
  "autodetect": false,
  ...
}
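
For example, the following command (reusing the placeholders from earlier examples) creates a CSV table definition with auto-detection disabled:

bq mkdef --noautodetect --source_format=CSV "gs://[MY_BUCKET]/fed-sample.csv" id:integer,name:string,num:integer,ord:string > /tmp/fed-sample-def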

Other automatic detection details include:

Compression
BigQuery recognizes gzip-compatible file compression when opening a file.
CSV Delimiter
BigQuery detects the following delimiters:
  • comma (,)
  • pipe (|)
  • tab (\t)
CSV Header
BigQuery infers headers by comparing the first row of the file with other rows in the data set. If the first line contains only strings, and the other lines do not, BigQuery assumes that the first row is a header row.
CSV Quoted new lines
BigQuery detects quoted new line characters within a CSV field and does not interpret the quoted new line character as a row boundary.
Timestamps
BigQuery detects a wide array of timestamp formats, including, but not limited to:
  • `yyyy-mm-dd` in any order
  • `yyyy-mm-dd hh:mm:ss`
  • `yyyy-mm-dd hh:mm:ss.sss`
Other timestamp details:
  • Date separators can be "-", "/", or "."
  • A time zone can be appended using an offset or name

The _FILE_NAME pseudo column

Tables based on federated data sources provide a pseudo column named _FILE_NAME. This column contains the fully qualified path to the file to which the row belongs. This column is available only in external tables that reference Google Cloud Storage and Google Drive.

The _FILE_NAME column name is reserved, which means that you cannot create a column by that name in any of your tables. To select the value of _FILE_NAME, you must use an alias. The following example demonstrates selecting _FILE_NAME by assigning the alias fn to the pseudo column.

  1. Create a table definition for the table wikipedia using the URI for the files from a Cloud Storage bucket. For example:

    bq mkdef --source_format=CSV "gs://[BUCKET_URI]" id:integer,name:string,category:integer,depth:string > /tmp/wikipedia

    The [BUCKET_URI] comprises your bucket name and your filename. For example, if the Cloud Storage bucket is named mybucket and the data file is named fed-sample.csv, the bucket URI would be gs://mybucket/fed-sample.csv.

    If your table is separated into multiple files that share a common base-name, append an asterisk (*) to the base-name and enclose the bucket and filename in quotes. For example, if you have two files named fed-sample000001.csv and fed-sample000002.csv, the bucket URI would be "gs://mybucket/fed-sample*".

  2. Query the _FILE_NAME pseudo column:

    bq query --external_table_definition=wikipedia::/tmp/wikipedia 'select name, _FILE_NAME as fn from wikipedia where name contains "Alex";'

Limits

The limits for federated data sources are the same as the limits for load jobs, as described in the Load jobs section on the Quota Policy page.
