Creating a table definition file for an external data source

This page describes how to create a table definition file for an external data source. An external data source (also known as a federated data source) is a data source that you can query directly even though the data is not stored in BigQuery.

Table definition files

A table definition file contains an external table's schema definition and metadata, such as the table's data format and related properties. When you create a table definition file, you can use schema auto-detection to define the schema for an external data source, you can provide the schema inline (on the command line), or you can provide a JSON file containing the schema definition.

Table definition files are used with the BigQuery CLI. The properties in a table definition file also apply to creating an ExternalDataConfiguration when you use the REST API. You do not use table definition files when you create an external table by using the Cloud Console or the classic BigQuery web UI.

You can create table definition files for the following external data sources:

  • Cloud Storage

    • Comma-separated values (CSV)
    • Newline-delimited JSON
    • Avro files
    • Datastore export files
    • ORC files (beta)
    • Parquet files (beta)
    • Cloud Firestore export files
  • Google Drive

    • Comma-separated values (CSV)
    • Newline-delimited JSON
    • Avro files
    • Google Sheets
  • Cloud Bigtable

Before you begin

To create a table definition file, you need the URI for your data source:

Permanent versus temporary external tables

You can query an external data source in BigQuery by using a permanent table or a temporary table. A permanent table is a table that is created in a dataset and is linked to your external data source. Because the table is permanent, you can use dataset-level access controls to share the table with others who also have access to the underlying external data source, and you can query the table at any time.

When you query an external data source using a temporary table, you submit a command that includes a query and creates a non-permanent table linked to the external data source. When you use a temporary table, you do not create a table in one of your BigQuery datasets. Because the table is not permanently stored in a dataset, it cannot be shared with others. Querying an external data source using a temporary table is useful for one-time, ad-hoc queries over external data, or for extract, transform, and load (ETL) processes.

You can use a table definition file to describe a permanent or temporary external table.

Creating a table definition using schema auto-detect

If you specify a CSV, JSON, or Google Sheets file without including an inline schema description or a schema file, you can use the --autodetect flag to set the "autodetect" option to true in the table definition file. When auto-detect is enabled, BigQuery makes a best-effort attempt to automatically infer the schema. See Schema auto-detection in BigQuery for more information.

You can use a schema auto-detect when creating table definitions for:

  • JSON files stored in Cloud Storage or Google Drive
  • CSV files stored in Cloud Storage or Google Drive
  • Google Sheets files stored in Google Drive

Cloud Storage

To use the CLI to create a table definition for a Cloud Storage data source:

  1. Use the command-line tool's mkdef command with the --autodetect flag to create a table definition. The mkdef command generates a table definition file in JSON format. The following example creates a table definition and writes the output to a file: /tmp/file_name.

    In the command, replace:

    • source_format with your file format: NEWLINE_DELIMITED_JSON, CSV, or GOOGLE_SHEETS.
    • file_name with the name of your table definition file.
    • bucket_uri with your Cloud Storage URI, for example, gs://mybucket/myfile.
    bq mkdef \
    --autodetect \
    --source_format=source_format \
    "bucket_uri" > /tmp/file_name
    
  2. (Optional) Open the table definition file in a text editor. For example, the command nano /tmp/file_name opens the file in nano. The file should look like the following for a CSV external data source. Notice "autodetect" is set to true.

    {
    "autodetect": true,
    "csvOptions": {
      "allowJaggedRows": false,
      "allowQuotedNewlines": false,
      "encoding": "UTF-8",
      "fieldDelimiter": ",",
      "quote": "\"",
      "skipLeadingRows": 0
    },
    "sourceFormat": "CSV",
    "sourceUris": [
      "bucket_uri"
    ]
    }
    
  3. (Optional) Manually edit the table definition file to modify, add, or delete general settings such as maxBadRecords and ignoreUnknownValues. There are no configuration settings that are specific to JSON source files, but there are settings that apply to CSV and Google Sheets files. For more information, see ExternalDataConfiguration in the API reference.

Externally partitioned data

To configure a table definition file for externally partitioned data on Cloud Storage, see Querying externally partitioned data.

Google Drive

To use the CLI to create a table definition for a Google Drive data source:

  1. Use the command-line tool's mkdef command with the --autodetect flag to create a table definition. The mkdef command generates a table definition file in JSON format. The following example creates a table definition and writes the output to a file: /tmp/file_name.

    In the command, replace:

    • source_format with your file format: NEWLINE_DELIMITED_JSON, CSV, or GOOGLE_SHEETS.
    • file_name with the name of your table definition file.
    • drive_uri with your Google Drive URI, for example, https://drive.google.com/open?id=123ABCD123AbcD123Abcd.
    bq mkdef \
    --autodetect \
    --source_format=source_format \
    "drive_uri" > /tmp/file_name
    
  2. Open the table definition file in a text editor. For example, the command nano /tmp/file_name opens the file in nano. The file should look like the following for a Google Sheets external data source. Notice "autodetect" is set to true.

    {
    "autodetect": true,
    "sourceFormat": "GOOGLE_SHEETS",
    "sourceUris": [
      "drive_uri"
    ]
    }
    
  3. (Optional) Manually edit the table definition file to modify, add, or delete general settings such as maxBadRecords and ignoreUnknownValues. There are no configuration settings that are specific to JSON source files, but there are settings that apply to CSV and Google Sheets files. For more information, see ExternalDataConfiguration in the API reference.

  4. To specify a particular sheet or a cell range in a Google Sheets file, add the range property to the table definition file. To query a particular sheet, specifiy the sheet name. To query a cell range, specify the range in the form: sheet_name!top_left_cell_id:bottom_right_cell_id, for example, "Sheet1!A1:B20". If the range parameter is not specified, the first sheet in the file is used.

Creating a table definition using an inline schema

If you do not want to use schema auto-detect, you can create a table definition file by providing an inline schema definition. To provide an inline schema definition, list the fields and data types on the command line in the following format: field:data_type,field:data_type.

You can use an inline schema definition when creating table definition files for:

  • JSON files stored in Cloud Storage or Google Drive
  • CSV files stored in Cloud Storage or Google Drive
  • Google Sheets files stored in Google Drive

To use the CLI to create a table definition for a Cloud Storage data source by using an inline schema definition:

  1. Use the command-line tool's mkdef command with the --noautodetect flag to create a table definition. The mkdef command generates a table definition file in JSON format. The following example creates a table definition and writes the output to a file: /tmp/file_name.

    In the command, replace:

    • source_format with your file format: NEWLINE_DELIMITED_JSON, CSV, or GOOGLE_SHEETS.
    • uri with your Cloud Storage URI or your Google Drive URI. For example, https://drive.google.com/open?id=123ABCD123AbcD123Abcd for Google Drive or gs://mybucket/myfile for Cloud Storage.
    • field:data_type,field:data_type with your schema definition, for example, Name:STRING,Address:STRING, ....
    • file_name with the name of your table definition file.
    bq mkdef \
    --noautodetect \
    --source_format=source_format \
    "uri" \
    field:data_type,field:data_type > /tmp/file_name
    
  2. (Optional) Open the table definition file in a text editor. For example, the command nano /tmp/file_name opens the file in nano. The file should look like the following. Notice "autodetect" is not enabled, and the schema information is written to the table definition file.

    {
    "schema": {
      "fields": [
        {
          "name": "field",
          "type": "data_type"
        },
        {
          "name": "field",
          "type": "data_type"
        }
        ...
      ]
    },
    "sourceFormat": "NEWLINE_DELIMITED_JSON",
    "sourceUris": [
      "uri"
    ]
    }
    
  3. (Optional) Manually edit the table definition file to modify, add, or delete general settings such as maxBadRecords and ignoreUnknownValues. There are no configuration settings that are specific to JSON source files, but there are settings that apply to CSV and Google Sheets files. For more information, see ExternalDataConfiguration in the API reference.

Creating a table definition using a JSON schema file

If you do not want to use auto-detect or provide an inline schema definition, you can create a JSON schema file and reference it when creating your table definition file. You must create the JSON schema file manually, and it must be on your local machine. Referencing a JSON schema file stored in Cloud Storage or in Google Drive is not supported.

You can use a JSON schema file when creating table definitions for:

  • JSON files stored in Cloud Storage or Google Drive
  • CSV files stored in Cloud Storage or Google Drive
  • Google Sheets files stored in Google Drive

Cloud Storage

To use the CLI to create a table definition for a Cloud Storage data source using a JSON schema file:

  1. Use the command-line tool's mkdef command with the --noautodetect flag to create a table definition. The mkdef command generates a table definition file in JSON format. The following example creates a table definition and writes the output to a file: /tmp/file_name.

    In the command, replace:

    • source_format with your file format: NEWLINE_DELIMITED_JSON, CSV, or GOOGLE_SHEETS.
    • file_name with the name of your table definition file.
    • bucket_uri with your Cloud Storage URI, for example, gs://mybucket/myfile.
    • path_to_schema_file with the location of the JSON schema file on your local machine.
    bq mkdef \
    --noautodetect \
    --source_format=source_format \
    "bucket_uri" \
    path_to_schema_file > /tmp/file_name
    
  2. (Optional) Open the table definition file in a text editor. For example, the command nano /tmp/file_name opens the file in
    nano. The file should look like the following. Notice "autodetect" is not enabled, and the schema information is written to the table definition file.

    {
    "schema": {
      "fields": [
        {
          "name": "field",
          "type": "data_type"
        },
        {
          "name": "field",
          "type": "data_type"
        }
        ...
      ]
    },
    "sourceFormat": "NEWLINE_DELIMITED_JSON",
    "sourceUris": [
      "bucket_uri"
    ]
    }
    
  3. (Optional) Manually edit the table definition file to modify, add, or delete general settings such as maxBadRecords and ignoreUnknownValues. There are no configuration settings that are specific to JSON source files, but there are settings that apply to CSV and Google Sheets files. For more information, see ExternalDataConfiguration in the API reference.

Google Drive

To use the CLI to create a table definition for a Google Drive data source using a JSON schema file:

  1. Use the command-line tool's mkdef command with the --noautodetect flag to create a table definition. The mkdef command generates a table definition file in JSON format. The following example creates a table definition and writes the output to a file: /tmp/file_name.

    In the command, replace:

    • source_format with your file format: NEWLINE_DELIMITED_JSON, CSV, or GOOGLE_SHEETS.
    • drive_uri with your Google Drive URI, for example, https://drive.google.com/open?id=123ABCD123AbcD123Abcd.
    • path_to_schema_file with the location of the JSON schema file on your local machine.
    • file_name with the name of your table definition file.
    bq mkdef \
    --noautodetect \
    --source_format=source_format \
    "drive_uri" \
    path_to_schema_file > /tmp/file_name
    
  2. Open the table definition file in a text editor. For example, the command nano /tmp/file_name opens the file in nano. The file should look like the following. Notice "autodetect" is not enabled, and the schema information is written to the table definition file.

    {
    "schema": {
      "fields": [
        {
          "name": "field",
          "type": "data_type"
        },
        {
          "name": "field",
          "type": "data_type"
        }
        ...
      ]
    },
    "sourceFormat": "GOOGLE_SHEETS",
    "sourceUris": [
      "drive_uri"
    ]
    }
    
  3. (Optional) Manually edit the table definition file to modify, add, or delete general settings such as maxBadRecords and ignoreUnknownValues. There are no configuration settings that are specific to JSON source files, but there are settings that apply to CSV and Google Sheets files. For more information, see ExternalDataConfiguration in the API reference.

  4. To specify a particular sheet or a cell range in a Google Sheets file, add the range property to the table definition file. To query a particular sheet, specifiy the sheet name. To query a cell range, specify the range in the form: sheet_name!top_left_cell_id:bottom_right_cell_id, for example, "Sheet1!A1:B20". If the range parameter is not specified, the first sheet in the file is used.

Creating Avro table definitions

If you use an Avro file as an external data source, BigQuery automatically retrieves the schema using the source data. When creating a table definition for Avro files, you do not need to use schema auto-detect, and you do not need to provide an inline schema definition or schema file.

You can create a table definition file for Avro data stored in Cloud Storage or Google Drive.

To use the CLI to create a table definition file for Avro data:

  1. Use the command-line tool's mkdef command to create a table definition. You do not need to use the --noautodetect flag with Avro files. Schema auto-detect is disabled for Avro files. The mkdef command generates a table definition file in JSON format. The following example creates a table definition and writes the output to a file: /tmp/file_name.

    In the command, replace:

    • uri with your Cloud Storage URI or your Google Drive URI. For example, https://drive.google.com/open?id=123ABCD123AbcD123Abcd for Google Drive or gs://mybucket/myfile for Cloud Storage.
    • file_name with the name of your table definition file.
    bq mkdef \
    --source_format=AVRO \
    "uri" > /tmp/file_name
    
  2. (Optional) Open the table definition file in a text editor. For example, the command nano /tmp/file_name opens the file in nano. The file should look like the following. Notice there is no need for the "autodetect" setting.

    {
    "sourceFormat": "AVRO",
    "sourceUris": [
      "uri"
    ]
    }
    
  3. (Optional) Manually edit the table definition file to modify, add, or delete general settings such as maxBadRecords and ignoreUnknownValues. There are no configuration settings that are specific to Avro source files, but there are settings that apply to CSV and Google Sheets files. For more information, see ExternalDataConfiguration in the API reference.

Creating Datastore and Cloud Firestore export table definitions

If you use a Datastore or Cloud Firestore export as an external data source, BigQuery automatically retrieves the schema using the self-describing source data. When creating a table definition for Datastore and Cloud Firestore backup files, you do not need to use schema auto-detect, and you do not need to provide an inline schema definition or schema file.

You can create a table definition file for Datastore and Cloud Firestore export data stored in Cloud Storage.

To use the CLI to create a table definition file for Datastore or Cloud Firestore export data:

  1. Use the command-line tool's mkdef command to create a table definition. You do not need to use the --noautodetect flag with Datastore or Cloud Firestore backup files. Schema auto-detect is disabled for these file types. The mkdef command generates a table definition file in JSON format. The following example creates a table definition and writes the output to a file: /tmp/file_name.

    In the command, replace:

    • bucket_uri with your Cloud Storage URI.
    • file_name with the name of your table definition file.

    Note that the DATASTORE_BACKUP source format is used for both Datastore and Cloud Firestore.

    bq mkdef \
    --source_format=DATASTORE_BACKUP \
    "uri" > /tmp/file_name
    
  2. (Optional) Open the table definition file in a text editor. For example, the command nano /tmp/file_name opens the file in nano. The file should look like the following. Notice there is no need for the "autodetect" setting.

    {
    "sourceFormat": "DATASTORE_BACKUP",
    "sourceUris": [
      "gs://bucket_uri"
    ]
    }
    
  3. (Optional) Manually edit the table definition file to modify, add, or delete settings such as maxBadRecords and ignoreUnknownValues. There are no configuration settings that are specific to Datastore and Cloud Firestore export files, but there are settings that apply to CSV and Google Sheets files. For more information, see ExternalDataConfiguration in the API reference.

Creating Cloud Bigtable table definitions

When you create a table definition file for Cloud Bigtable, you manually generate the file in JSON format. Using the mkdef command to create a table definition is not currently supported for Cloud Bigtable data sources. Schema auto-detect is also unsupported for Cloud Bigtable. For a list of Cloud Bigtable table definition options, see BigtableOptions in the REST API reference.

A JSON table definition file for Cloud Bigtable looks like the following. Using this table definition file, BigQuery reads data from a single column family, interpreting the values as binary encoded integers.

{
    "sourceFormat": "BIGTABLE",
    "sourceUris": [
        "https://googleapis.com/bigtable/projects/project_id/instances/instance_id/tables/table_name"
    ],
    "bigtableOptions": {
        "columnFamilies" : [
            {
                "familyId": "family_int",
                "type": "INTEGER",
                "encoding": "BINARY"
            }
        ],
    }
}

Wildcard support for table definition files

If your Cloud Storage data is separated into multiple files that share a common base-name, you can use a wildcard in the URI in the table definition file. You append an asterisk (*) to the base-name and enclose the bucket and filename in quotes. For example, if you have two files named fed-sample000001.csv and fed-sample000002.csv, the bucket URI would be "gs://mybucket/fed-sample*".

You can only use one wildcard for objects (filenames) within your bucket. The wildcard can appear inside the object name or at the end of the object name. Appending a wildcard to the bucket name is unsupported.

For Cloud Bigtable data, only one URI can be specified and it has to be a fully specified and valid HTTPS URL for a Cloud Bigtable table. For Datastore backups, only one URI can be specified, and it must end with .backup_info.

The * wildcard character is not allowed when creating table definition files for:

  • Cloud Bigtable data sources
  • Datastore exports stored in Cloud Storage
  • Cloud Firestore exports stored in Cloud Storage
  • Data stored in Google Drive
Was deze pagina nuttig? Laat ons weten hoe goed we u hebben geholpen:

Feedback verzenden over...

Hulp nodig? Ga naar onze ondersteuningspagina.