Create a table definition file for an external data source
This page describes how to create a table definition file for an external data source. An external data source is a data source that you can query directly even though the data is not stored in BigQuery.
A table definition file contains an external table's schema definition and metadata, such as the table's data format and related properties. When you create a table definition file, you can use schema auto-detection to define the schema for an external data source. You can provide the schema inline or you can provide a JSON file containing the schema definition.
Table definition files are used with the bq command-line tool. The
properties in a table definition file also apply to creating an
ExternalDataConfiguration
when you use the REST API. You do not use table definition files when you create
an external table by using the Google Cloud console.
You can create table definition files to describe a permanent or temporary external table for the following external data sources:
Cloud Storage
- Comma-separated values (CSV)
- Newline-delimited JSON
- Avro files
- Datastore export files
- ORC files
- Parquet files
- Firestore export files
Google Drive
- Comma-separated values (CSV)
- Newline-delimited JSON
- Avro files
- Google Sheets
Bigtable
Before you begin
To create a table definition file, you need the URI for your data source:
- For a Drive data source, you need the Drive URI
- For a Cloud Storage data source, you need the Cloud Storage URI
- For a Bigtable data source, you need the Bigtable URI
Create a definition file for CSV, JSON, or Google Sheets files
Use one of the following methods to create a table definition file for CSV, JSON, or Google Sheets files in Cloud Storage or Drive:
Use the autodetect flag

If you specify a CSV, JSON, or Google Sheets file without including an inline schema description or a schema file, you can use the --autodetect flag to set the "autodetect" option to true in the table definition file. When auto-detect is enabled, BigQuery makes a best-effort attempt to automatically infer the schema. For more information, see Schema auto-detection for external data sources.
Use auto-detect with a Cloud Storage data source
Create a table definition file for a Cloud Storage data source:
Use the bq mkdef command with the --autodetect flag to create a table definition file. The mkdef command generates a table definition file in JSON format. The following example creates a table definition and writes the output to the file /tmp/FILE_NAME:

```
bq mkdef \
  --autodetect \
  --source_format=SOURCE_FORMAT \
  "URI" > /tmp/FILE_NAME
```

Replace the following:

- SOURCE_FORMAT: your file format
- FILE_NAME: the name of your table definition file
- URI: the Cloud Storage URI. For example, gs://mybucket/myfile.
(Optional) Open the table definition file in a text editor. For example, the command nano /tmp/FILE_NAME opens the file in nano. The file should look like the following for a CSV external data source. Notice that "autodetect" is set to true.

```
{
  "autodetect": true,
  "csvOptions": {
    "allowJaggedRows": false,
    "allowQuotedNewlines": false,
    "encoding": "UTF-8",
    "fieldDelimiter": ",",
    "quote": "\"",
    "skipLeadingRows": 0
  },
  "sourceFormat": "CSV",
  "sourceUris": [
    "URI"
  ]
}
```
(Optional) Manually edit the table definition file to modify, add, or delete general settings such as maxBadRecords and ignoreUnknownValues. There are no configuration settings that are specific to JSON source files, but there are settings that apply to CSV and Google Sheets files. For more information, see ExternalDataConfiguration in the API reference.
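For example, a filled-in invocation for CSV data might look like the following sketch; the output filename is a hypothetical placeholder, not a value defined elsewhere in this guide:

```
# Auto-detect the schema of a CSV file in Cloud Storage and write the
# resulting table definition to a local file.
bq mkdef \
  --autodetect \
  --source_format=CSV \
  "gs://mybucket/myfile" > /tmp/csv_table_def.json
```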
Use auto-detect with a Drive data source
Create a table definition file for a Drive data source:
Use the bq mkdef command with the --autodetect flag to create a table definition. The mkdef command generates a table definition file in JSON format. The following example creates a table definition and writes the output to the file /tmp/FILE_NAME:

```
bq mkdef \
  --autodetect \
  --source_format=SOURCE_FORMAT \
  "URI" > /tmp/FILE_NAME
```

Replace the following:

- SOURCE_FORMAT: your file format
- FILE_NAME: the name of your table definition file
- URI: the Drive URI. For example, https://drive.google.com/open?id=123ABCD123AbcD123Abcd.
Open the table definition file in a text editor. For example, the command nano /tmp/FILE_NAME opens the file in nano. The file should look like the following for a Google Sheets external data source. Notice that "autodetect" is set to true.

```
{
  "autodetect": true,
  "sourceFormat": "GOOGLE_SHEETS",
  "sourceUris": [
    "URI"
  ]
}
```
(Optional) Manually edit the table definition file to modify, add, or delete general settings such as maxBadRecords and ignoreUnknownValues. There are no configuration settings that are specific to JSON source files, but there are settings that apply to CSV and Google Sheets files. For more information, see ExternalDataConfiguration in the API reference.

To specify a particular sheet or a cell range in a Google Sheets file, add the range property to the table definition file. To query a particular sheet, specify the sheet name. To query a cell range, specify the range in the form sheet_name!top_left_cell_id:bottom_right_cell_id, for example, "Sheet1!A1:B20". If the range parameter is not specified, the first sheet in the file is used.
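As a sketch, a Sheets table definition restricted to a cell range might look like the following; this assumes the range property is nested under the googleSheetsOptions settings described in the ExternalDataConfiguration reference, and the range value is only an example:

```
{
  "autodetect": true,
  "sourceFormat": "GOOGLE_SHEETS",
  "googleSheetsOptions": {
    "range": "Sheet1!A1:B20"
  },
  "sourceUris": [
    "URI"
  ]
}
```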
Use an inline schema
If you do not want to use schema auto-detect, you can create a table definition
file by providing an inline schema definition. To provide an inline schema
definition, list the fields and data types on the command line in the following
format: FIELD:DATA_TYPE,FIELD:DATA_TYPE.
Use an inline schema with a Cloud Storage or Drive data source
Create a table definition for a Cloud Storage or Drive data source by using an inline schema definition:
Use the bq mkdef command with the --noautodetect flag to create a table definition. The mkdef command generates a table definition file in JSON format. The following example creates a table definition and writes the output to the file /tmp/FILE_NAME:

```
bq mkdef \
  --noautodetect \
  --source_format=SOURCE_FORMAT \
  "URI" \
  FIELD:DATA_TYPE,FIELD:DATA_TYPE > /tmp/FILE_NAME
```

Replace the following:

- SOURCE_FORMAT: the source file format
- URI: the Cloud Storage URI or your Drive URI. For example, gs://mybucket/myfile for Cloud Storage or https://drive.google.com/open?id=123ABCD123AbcD123Abcd for Drive.
- FIELD:DATA_TYPE,FIELD:DATA_TYPE: the schema definition. For example, Name:STRING,Address:STRING, ...
- FILE_NAME: the name of your table definition file
(Optional) Open the table definition file in a text editor. For example, the command nano /tmp/FILE_NAME opens the file in nano. The file should look like the following. Notice that "autodetect" is not enabled, and the schema information is written to the table definition file.

```
{
  "schema": {
    "fields": [
      {
        "name": "FIELD",
        "type": "DATA_TYPE"
      },
      {
        "name": "FIELD",
        "type": "DATA_TYPE"
      }
      ...
    ]
  },
  "sourceFormat": "NEWLINE_DELIMITED_JSON",
  "sourceUris": [
    "URI"
  ]
}
```
(Optional) Manually edit the table definition file to modify, add, or delete general settings such as maxBadRecords and ignoreUnknownValues. There are no configuration settings that are specific to JSON source files, but there are settings that apply to CSV and Google Sheets files. For more information, see ExternalDataConfiguration in the API reference.
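After you have a definition file, one way to use it (a sketch with hypothetical dataset and table names) is to create a permanent external table with the bq mk command and its --external_table_definition flag:

```
# Create a permanent external table from the generated definition file.
# mydataset and mytable are hypothetical names.
bq mk --table \
  --external_table_definition=/tmp/FILE_NAME \
  mydataset.mytable
```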
Use a JSON schema file
If you do not want to use auto-detect or provide an inline schema definition, you can create a JSON schema file and reference it when creating your table definition file. Create the JSON schema file manually on your local machine. Referencing a JSON schema file stored in Cloud Storage or in Drive is not supported.
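For reference, a bq JSON schema file is an array of column definitions. A minimal, hypothetical example (the field names and types are placeholders) might look like the following:

```
[
  {
    "name": "Name",
    "type": "STRING",
    "mode": "NULLABLE"
  },
  {
    "name": "Address",
    "type": "STRING",
    "mode": "NULLABLE"
  }
]
```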
Use a schema file with a Cloud Storage data source
Create a table definition for a Cloud Storage data source using a JSON schema file:
Use the bq mkdef command with the --noautodetect flag to create a table definition. The mkdef command generates a table definition file in JSON format. The following example creates a table definition and writes the output to the file /tmp/FILE_NAME:

```
bq mkdef \
  --noautodetect \
  --source_format=SOURCE_FORMAT \
  "URI" \
  PATH_TO_SCHEMA_FILE > /tmp/FILE_NAME
```

Replace the following:

- SOURCE_FORMAT: your file format
- FILE_NAME: the name of your table definition file
- URI: the Cloud Storage URI. For example, gs://mybucket/myfile.
- PATH_TO_SCHEMA_FILE: the location of the JSON schema file on your local machine
(Optional) Open the table definition file in a text editor. For example, the command nano /tmp/FILE_NAME opens the file in nano. The file should look like the following. Notice that "autodetect" is not enabled, and the schema information is written to the table definition file.

```
{
  "schema": {
    "fields": [
      {
        "name": "FIELD",
        "type": "DATA_TYPE"
      },
      {
        "name": "FIELD",
        "type": "DATA_TYPE"
      }
      ...
    ]
  },
  "sourceFormat": "NEWLINE_DELIMITED_JSON",
  "sourceUris": [
    "URI"
  ]
}
```
(Optional) Manually edit the table definition file to modify, add, or delete general settings such as maxBadRecords and ignoreUnknownValues. There are no configuration settings that are specific to JSON source files, but there are settings that apply to CSV and Google Sheets files. For more information, see ExternalDataConfiguration in the API reference.
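You can also query the data through a temporary external table by passing the definition file to bq query with the --external_table_definition flag in the form TABLE::DEFINITION_FILE. The following sketch uses hypothetical table and column names:

```
# Query Cloud Storage data through a temporary external table named "sales".
bq query \
  --use_legacy_sql=false \
  --external_table_definition=sales::/tmp/FILE_NAME \
  'SELECT Name, Address FROM sales LIMIT 10'
```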
Use a schema file with a Drive data source
Create a table definition for a Drive data source using a JSON schema file:
Use the bq mkdef command with the --noautodetect flag to create a table definition. The mkdef command generates a table definition file in JSON format. The following example creates a table definition and writes the output to the file /tmp/FILE_NAME:

```
bq mkdef \
  --noautodetect \
  --source_format=SOURCE_FORMAT \
  "URI" \
  PATH_TO_SCHEMA_FILE > /tmp/FILE_NAME
```

Replace the following:

- SOURCE_FORMAT: the source file format
- URI: the Drive URI. For example, https://drive.google.com/open?id=123ABCD123AbcD123Abcd.
- PATH_TO_SCHEMA_FILE: the location of the JSON schema file on your local machine
- FILE_NAME: the name of your table definition file
Open the table definition file in a text editor. For example, the command nano /tmp/FILE_NAME opens the file in nano. The file should look like the following. Notice that "autodetect" is not enabled, and the schema information is written to the table definition file.

```
{
  "schema": {
    "fields": [
      {
        "name": "FIELD",
        "type": "DATA_TYPE"
      },
      {
        "name": "FIELD",
        "type": "DATA_TYPE"
      }
      ...
    ]
  },
  "sourceFormat": "GOOGLE_SHEETS",
  "sourceUris": [
    "URI"
  ]
}
```
(Optional) Manually edit the table definition file to modify, add, or delete general settings such as maxBadRecords and ignoreUnknownValues. There are no configuration settings that are specific to JSON source files, but there are settings that apply to CSV and Google Sheets files. For more information, see ExternalDataConfiguration in the API reference.

To specify a particular sheet or a cell range in a Google Sheets file, add the range property to the table definition file. To query a particular sheet, specify the sheet name. To query a cell range, specify the range in the form sheet_name!top_left_cell_id:bottom_right_cell_id, for example, "Sheet1!A1:B20". If the range parameter is not specified, the first sheet in the file is used.
Create a definition file for self-describing formats
Avro, Parquet, and ORC are self-describing formats. Data files in these formats contain their own schema information. If you use one of these formats as an external data source, then BigQuery automatically retrieves the schema using the source data. When creating a table definition, you don't need to use schema auto-detection, and you don't need to provide an inline schema definition or schema file.
You can create a table definition file for Avro, Parquet, or ORC data stored in Cloud Storage or Drive:
Use the bq mkdef command to create a table definition:

```
bq mkdef \
  --source_format=FORMAT \
  "URI" > FILE_NAME
```

Replace the following:

- FORMAT: the source format
- URI: the Cloud Storage URI or your Drive URI. For example, gs://mybucket/myfile for Cloud Storage or https://drive.google.com/open?id=123ABCD123AbcD123Abcd for Drive.
- FILE_NAME: the name of your table definition file
Optional: Open the table definition file in a text editor. The file looks similar to the following:
{ "sourceFormat": "AVRO", "sourceUris": [ "URI" ] }
Optional: Manually edit the table definition file to modify, add, or delete general settings such as maxBadRecords and ignoreUnknownValues. For more information, see ExternalDataConfiguration in the API reference.
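For example, a filled-in invocation for Parquet files might look like the following sketch; the bucket path and output filename are hypothetical:

```
# Generate a table definition for Parquet files in Cloud Storage.
# The schema is read from the files themselves, so no schema flags are needed.
bq mkdef \
  --source_format=PARQUET \
  "gs://mybucket/mydata/*.parquet" > /tmp/parquet_table_def.json
```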
Create a definition file for hive-partitioned data
Use the bq mkdef command with the --hive_partitioning_mode and --hive_partitioning_source_uri_prefix flags to create a definition file for hive-partitioned data that's stored in Cloud Storage, Amazon Simple Storage Service (Amazon S3), or Azure Blob Storage.
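A minimal sketch, assuming hive-partitioned Parquet files under a hypothetical gs://mybucket/mytable prefix, might look like the following; AUTO mode detects the partition key names and types from the URIs:

```
# Build a definition file that exposes hive partition keys as columns.
bq mkdef \
  --source_format=PARQUET \
  --hive_partitioning_mode=AUTO \
  --hive_partitioning_source_uri_prefix=gs://mybucket/mytable \
  "gs://mybucket/mytable/*.parquet" > /tmp/hive_table_def.json
```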
Create a definition file for Datastore and Firestore
If you use a Datastore or Firestore export as an external data source, BigQuery automatically retrieves the schema using the self-describing source data. When creating a table definition, you don't need to provide an inline schema definition or schema file.
You can create a table definition file for Datastore and Firestore export data stored in Cloud Storage:
Use the bq mkdef command to create a table definition. You do not need to use the --noautodetect flag with Datastore or Firestore backup files. Schema auto-detect is disabled for these file types. The mkdef command generates a table definition file in JSON format. The following example creates a table definition and writes the output to the file /tmp/FILE_NAME:

```
bq mkdef \
  --source_format=DATASTORE_BACKUP \
  "URI" > /tmp/FILE_NAME
```

Replace the following:

- URI: the Cloud Storage URI
- FILE_NAME: the name of your table definition file
The DATASTORE_BACKUP source format is used for both Datastore and Firestore.

(Optional) Open the table definition file in a text editor. For example, the command nano /tmp/FILE_NAME opens the file in nano. The file should look like the following. Notice that there is no need for the "autodetect" setting.

```
{
  "sourceFormat": "DATASTORE_BACKUP",
  "sourceUris": [
    "gs://URI"
  ]
}
```
(Optional) Manually edit the table definition file to modify, add, or delete settings such as maxBadRecords and ignoreUnknownValues. There are no configuration settings that are specific to Datastore and Firestore export files. For more information, see ExternalDataConfiguration in the API reference.
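For example, a filled-in invocation might look like the following sketch; the export path is hypothetical, but note that the URI must end with .backup_info or .export_metadata:

```
# Build a table definition from a hypothetical Firestore export file.
bq mkdef \
  --source_format=DATASTORE_BACKUP \
  "gs://mybucket/exports/kind_Task.export_metadata" > /tmp/task_table_def.json
```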
Create a definition file for Bigtable
When you create a table definition file for Bigtable, you manually
generate the file in JSON format. Using the mkdef
command to create a table
definition is not currently supported for Bigtable data sources.
Schema auto-detect is also unsupported for Bigtable. For a list of
Bigtable table definition options, see
BigtableOptions
in
the REST API reference.
A JSON table definition file for Bigtable looks like the following. Using this table definition file, BigQuery reads data from a single column family, interpreting the values as binary encoded integers.
{ "sourceFormat": "BIGTABLE", "sourceUris": [ "https://googleapis.com/bigtable/projects/PROJECT_ID/instances/INSTANCE_ID/tables/TABLE_NAME" ], "bigtableOptions": { "columnFamilies" : [ { "familyId": "FAMILY_ID", "type": "INTEGER", "encoding": "BINARY" } ] } }
Replace the following:

- PROJECT_ID: the project containing your Bigtable cluster
- INSTANCE_ID: the Bigtable instance ID
- TABLE_NAME: the name of the table you're querying
- FAMILY_ID: the column family identifier
For more information, see Retrieving the Bigtable URI.
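Because mkdef is not supported for Bigtable, you pass the hand-written definition file directly to other bq commands. The following is a sketch with hypothetical file, dataset, and table names:

```
# Create a permanent external table from the manually created
# Bigtable table definition file.
bq mk --table \
  --external_table_definition=/tmp/bigtable_table_def.json \
  mydataset.bigtable_data
```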
Wildcard support for table definition files
If your data is separated into multiple files, you can use an asterisk (*) wildcard to select multiple files. Use of the asterisk wildcard must follow these rules:
- The asterisk can appear inside the object name or at the end of the object name.
- Using multiple asterisks is unsupported. For example, the path gs://mybucket/fed-*/temp/*.csv is invalid.
- Using an asterisk with the bucket name is unsupported.
Examples:
The following example shows how to select all of the files in all the folders that start with the prefix gs://mybucket/fed-samples/fed-sample:

gs://mybucket/fed-samples/fed-sample*

The following example shows how to select only files with a .csv extension in the folder named fed-samples and any subfolders of fed-samples:

gs://mybucket/fed-samples/*.csv

The following example shows how to select files with a naming pattern of fed-sample*.csv in the folder named fed-samples. This example doesn't select files in subfolders of fed-samples:

gs://mybucket/fed-samples/fed-sample*.csv
When using the bq command-line tool, you might need to escape the asterisk on some platforms.
If you use an asterisk wildcard, enclose the bucket and filename in quotes. For example, if you have two files named fed-sample000001.csv and fed-sample000002.csv and you want to use an asterisk to select both of them, the bucket URI would be "gs://mybucket/fed-sample*".
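For example, a quoted wildcard in a mkdef call might look like the following sketch; the output filename is hypothetical:

```
# Quote the wildcard URI so that the shell does not expand the asterisk.
bq mkdef \
  --autodetect \
  --source_format=CSV \
  "gs://mybucket/fed-sample*" > /tmp/fed_sample_def.json
```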
The *
wildcard character is not allowed when creating table definition files
for the following data sources:
- Bigtable. For Bigtable data, only one data source can be specified. The URI value must be a valid HTTPS URL for a Bigtable table.
- Datastore or Firestore. Datastore or Firestore exports stored in Cloud Storage. For Datastore backups, only one data source can be specified. The URI value must end with .backup_info or .export_metadata.
- Drive. Data stored in Drive.
What's next
- Learn how to query Cloud Storage data.
- Learn how to query Drive data.
- Learn how to query Bigtable data.