Using Schema Auto-Detection

Schema auto-detection is available when you load data into BigQuery, and when you query an external data source.

When auto-detection is enabled, BigQuery starts the inference process by selecting a random file in the data source and scanning up to 100 rows of data to use as a representative sample. BigQuery then examines each field and attempts to assign a data type to that field based on the values in the sample.
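The inference step can be pictured as a small heuristic over sampled values: try the narrowest type first and widen until every value in the sample fits. The sketch below is an illustration of that idea only, not BigQuery's actual algorithm; the helper name `infer_type` is hypothetical.

```python
# Simplified illustration of per-field type inference over a sample of rows.
# NOT BigQuery's actual algorithm: try narrow types first, fall back to STRING.

def _is_float(v):
    try:
        float(v)
        return True
    except ValueError:
        return False

def infer_type(values):
    """Return the narrowest type name that fits every value in the sample."""
    for candidate, check in [
        ("INTEGER", lambda v: v.lstrip("-").isdigit()),
        ("FLOAT", _is_float),
        ("BOOLEAN", lambda v: v.lower() in ("true", "false")),
    ]:
        if all(check(v) for v in values):
            return candidate
    return "STRING"  # fallback when nothing narrower fits every value

print(infer_type(["42", "7", "-3"]))   # INTEGER
print(infer_type(["3.14", "2"]))       # FLOAT
print(infer_type(["Texas", "Ohio"]))   # STRING
```

A single non-conforming value in the sample widens the whole column, which is why a representative sample matters.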

To see the detected schema for a table:

  • Use the command-line tool's bq show command
  • Use the BigQuery web UI to view the table's schema

When BigQuery detects schemas, it might, on rare occasions, change a field name to make it compatible with BigQuery SQL syntax.
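BigQuery column names may contain only letters, digits, and underscores, and must not start with a digit. The sketch below shows the general kind of cleanup such a rename could apply; the helper `make_compatible` is hypothetical and not a BigQuery API.

```python
import re

# Hypothetical sketch of making a detected field name compatible with
# BigQuery SQL naming rules (letters, digits, underscores; no leading digit).
def make_compatible(name):
    cleaned = re.sub(r"[^A-Za-z0-9_]", "_", name)  # replace invalid characters
    if re.match(r"^[0-9]", cleaned):               # names cannot start with a digit
        cleaned = "_" + cleaned
    return cleaned

print(make_compatible("order-date"))   # order_date
print(make_compatible("2019 sales"))   # _2019_sales
```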

Loading data using schema auto-detection

To enable schema auto-detection when loading data:

  • BigQuery web UI: In the Schema section, check the Automatically detect option.
  • CLI: Use the bq load command with the --autodetect parameter.

When enabled, BigQuery makes a best-effort attempt to automatically infer the schema for CSV and JSON files.

Schema auto-detection is not used with Avro files, Parquet files, ORC files, Cloud Firestore export files, or Cloud Datastore export files. When you load these files into BigQuery, the table schema is automatically retrieved from the self-describing source data.

To use schema auto-detection when you load JSON or CSV data:

Web UI

  1. Go to the BigQuery web UI.

  2. Click the down arrow icon next to your dataset name in the navigation and click Create new table.

    Note: In the UI, the process for loading data is the same as the process for creating a table.

  3. On the Create table page:

    • For Source Data, click Create from source.
    • For Destination Table, choose your dataset and enter the table name in the Destination table name field.
    • For Schema, click Automatically detect to determine the schema.

  4. Click Create Table.

CLI

Issue the bq load command with the --autodetect parameter. Supply the --location flag and set the value to your location.

The following command loads a file using schema auto-detect:

bq --location=[LOCATION] load --autodetect --source_format=[FORMAT] [DATASET].[TABLE] [PATH_TO_SOURCE]

Where:

  • [LOCATION] is the name of your location. The --location flag is optional if your data is in the US or the EU multi-region location. For example, if you are using BigQuery in the Tokyo region, set the flag's value to asia-northeast1. You can set a default value for the location using the .bigqueryrc file.
  • [FORMAT] is either NEWLINE_DELIMITED_JSON or CSV.
  • [DATASET] is the dataset that contains the table into which you're loading data.
  • [TABLE] is the name of the table into which you're loading data.
  • [PATH_TO_SOURCE] is the location of the CSV or JSON file.

Examples:

Enter the following command to load myfile.csv from your local machine into a table named mytable, which is stored in a dataset named mydataset. mydataset was created in the US multi-region location.

bq --location=US load --autodetect --source_format=CSV mydataset.mytable ./myfile.csv

Enter the following command to load myfile.csv from your local machine into a table named mytable, which is stored in a dataset named mydataset. mydataset was created in the asia-northeast1 region.

bq --location=asia-northeast1 load --autodetect --source_format=CSV mydataset.mytable ./myfile.csv

API

  1. Create a load job that points to the source data. For information about creating jobs, see Running BigQuery jobs programmatically. Specify your location in the location property in the jobReference section.

  2. Specify the data format by setting the configuration.load.sourceFormat property. To use schema autodetection, this value must be set to NEWLINE_DELIMITED_JSON or CSV.

  3. Set schema autodetection to true using the configuration.load.autodetect property.
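Put together, a minimal load-job request body for the REST API might look like the following sketch. The project, dataset, table, and Cloud Storage path are placeholders; substitute your own values.

```python
import json

# Sketch of a minimal jobs.insert request body with schema auto-detection
# enabled. All identifiers below are placeholders.
job = {
    "jobReference": {
        "projectId": "your-project-id",
        "location": "US",  # set this to your location
    },
    "configuration": {
        "load": {
            "sourceUris": ["gs://your-bucket/your-file.json"],
            "sourceFormat": "NEWLINE_DELIMITED_JSON",  # or "CSV"
            "autodetect": True,  # enable schema auto-detection
            "destinationTable": {
                "projectId": "your-project-id",
                "datasetId": "mydataset",
                "tableId": "mytable",
            },
        }
    },
}
print(json.dumps(job, indent=2))
```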

Go

Before trying this sample, follow the Go setup instructions in the BigQuery Quickstart Using Client Libraries. For more information, see the BigQuery Go API reference documentation.

// To run this sample, you will need to create (or reuse) a context and
// an instance of the bigquery client.  For example:
// import "cloud.google.com/go/bigquery"
// ctx := context.Background()
// client, err := bigquery.NewClient(ctx, "your-project-id")
gcsRef := bigquery.NewGCSReference("gs://cloud-samples-data/bigquery/us-states/us-states.json")
gcsRef.SourceFormat = bigquery.JSON
gcsRef.AutoDetect = true
loader := client.Dataset(datasetID).Table(tableID).LoaderFrom(gcsRef)
loader.WriteDisposition = bigquery.WriteEmpty

job, err := loader.Run(ctx)
if err != nil {
	return err
}
status, err := job.Wait(ctx)
if err != nil {
	return err
}

if status.Err() != nil {
	return fmt.Errorf("job completed with error: %v", status.Err())
}

Python

Before trying this sample, follow the Python setup instructions in the BigQuery Quickstart Using Client Libraries. For more information, see the BigQuery Python API reference documentation.

To enable schema auto-detection, set the LoadJobConfig.autodetect property to True.

# from google.cloud import bigquery
# client = bigquery.Client()
# dataset_id = 'my_dataset'

dataset_ref = client.dataset(dataset_id)
job_config = bigquery.LoadJobConfig()
job_config.autodetect = True
job_config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON
uri = 'gs://cloud-samples-data/bigquery/us-states/us-states.json'
load_job = client.load_table_from_uri(
    uri,
    dataset_ref.table('us_states'),
    job_config=job_config)  # API request

assert load_job.job_type == 'load'

load_job.result()  # Waits for table load to complete.

assert load_job.state == 'DONE'
assert client.get_table(dataset_ref.table('us_states')).num_rows == 50

Schema auto-detection for external data sources

To enable schema auto-detection in the web UI when you create a table that is linked to an external data source, check the Automatically detect option. When enabled, BigQuery makes a best-effort attempt to automatically infer the schema for CSV and JSON external data sources.

Currently, you cannot enable schema auto-detection for Google Sheets external data sources by using the web UI. Also, schema auto-detection is not used with Avro files, Cloud Firestore export files, or Cloud Datastore export files. When you create a table that is linked to one of these file types, BigQuery automatically retrieves the schema from the self-describing source data.

Using the CLI, you can enable schema auto-detection when you create a table definition file for CSV, JSON, or Google Sheets data. When using the CLI to create a table definition file, you can pass the --autodetect flag to the mkdef command to enable schema auto-detection, or you can pass the --noautodetect flag to disable it.

When you use the --autodetect flag, the "autodetect" setting is set to true in the table definition file. When you use the --noautodetect flag, the "autodetect" setting is set to false. If you do not provide a schema definition for the external data source when you create a table definition, and you do not use the --noautodetect or --autodetect flag, the "autodetect" setting defaults to true.

When you create a table definition file by using the API, set the value of the "autodetect" property to true or false. Setting autodetect to true enables auto-detection. Setting autodetect to false disables it.
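As a sketch, a table definition for an external CSV source with auto-detection enabled contains JSON along these lines (the source URI is a placeholder):

```python
import json

# Sketch of a table definition for an external CSV data source with
# schema auto-detection enabled. The source URI is a placeholder.
table_def = {
    "autodetect": True,          # enable schema auto-detection
    "sourceFormat": "CSV",
    "sourceUris": ["gs://your-bucket/your-file.csv"],
}
print(json.dumps(table_def, indent=2))
```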

Auto-detection details

In addition to detecting schema details, auto-detection recognizes the following:

Compression

BigQuery recognizes gzip-compatible file compression when opening a file.
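gzip streams are identifiable by their first two bytes, the magic number 0x1f 0x8b, which is the kind of signal a loader can check before decompressing. A small sketch of such a check:

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream

def looks_gzipped(data):
    """Return True if the byte string starts with the gzip magic number."""
    return data[:2] == GZIP_MAGIC

compressed = gzip.compress(b"state,abbr\nTexas,TX\n")
print(looks_gzipped(compressed))                 # True
print(looks_gzipped(b"state,abbr\nTexas,TX\n"))  # False
```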

CSV Delimiter

BigQuery detects the following delimiters:

  • comma ( , )
  • pipe ( | )
  • tab ( \t )

CSV Header

BigQuery infers headers by comparing the first row of the file with other rows in the data set. If the first line contains only strings, and the other lines do not, BigQuery assumes that the first row is a header row.
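That heuristic can be sketched as: assume a header only when the first row contains no numeric values while the remaining rows do. The code below is a simplified illustration, not BigQuery's exact logic.

```python
def _is_numeric(cell):
    try:
        float(cell)
        return True
    except ValueError:
        return False

# Simplified sketch of the header heuristic (not BigQuery's exact logic):
# treat the first row as a header when it is all non-numeric strings but
# the remaining rows contain numeric values.
def has_header(rows):
    first, rest = rows[0], rows[1:]
    first_all_strings = not any(_is_numeric(c) for c in first)
    rest_has_numbers = any(_is_numeric(c) for row in rest for c in row)
    return first_all_strings and rest_has_numbers

rows = [["state", "population"], ["Texas", "29000000"], ["Ohio", "11800000"]]
print(has_header(rows))  # True
```

When every column is a string in every row, no heuristic of this shape can distinguish a header from data, which is why all-string files may need an explicit schema.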

CSV Quoted new lines

BigQuery detects quoted new line characters within a CSV field and does not interpret the quoted new line character as a row boundary.

Dates

When you use schema detection for JSON or CSV data, values in DATE columns must use a dash (-) separator and must be in the following format: YYYY-MM-DD (year-month-day).
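A quick way to check values against that format before loading, sketched with the standard library (the helper name is hypothetical):

```python
from datetime import datetime

# Check that a value matches the YYYY-MM-DD format that schema
# auto-detection expects for DATE columns.
def is_detectable_date(value):
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except ValueError:
        return False

print(is_detectable_date("2018-07-05"))  # True
print(is_detectable_date("07/05/2018"))  # False: slash separator
```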

Timestamps

BigQuery detects a wide array of timestamp formats, including, but not limited to:

  • yyyy-mm-dd
  • yyyy-mm-dd hh:mm:ss
  • yyyy-mm-dd hh:mm:ss.mmm

A timestamp can also contain a UTC offset or the UTC zone designator (Z). Integer-based timestamp values are also supported.

When you use schema detection for JSON or CSV data, values in TIMESTAMP columns must use a dash (-) separator for the date portion of the timestamp, and the date must be in the following format: YYYY-MM-DD (year-month-day). The hh:mm:ss (hour-minute-second) portion of the timestamp must use a colon (:) separator.

Timestamp examples

The following are examples of timestamp formats auto-detected by BigQuery:

  • 253402300799
  • 2018-07-05 12:54:00 UTC
  • 2018-08-19 07:11:35.220 -05:00
  • 2018-08-19 12:11:35.220 UTC
  • 2018-08-19T12:11:35.220Z
  • 2.53402300799e11
  • 2018-08-19 12:11:35.220000
  • 2018-08-19 12:11:35.220