Using schema auto-detection

Schema auto-detection is available when you load data into BigQuery, and when you query an external data source.

When auto-detection is enabled, BigQuery starts the inference process by selecting a random file in the data source and scanning up to 100 rows of data to use as a representative sample. BigQuery then examines each field and attempts to assign a data type to that field based on the values in the sample.
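
The per-field step can be pictured roughly as follows. This is a hypothetical Python sketch of the kind of inference involved, not BigQuery's actual implementation; the return values mirror BigQuery's INTEGER, FLOAT, BOOLEAN, and STRING type names.

```python
def infer_field_type(sample_values):
    """Guess a column type from sampled string values (illustrative only)."""
    def all_parse(parse):
        for v in sample_values:
            try:
                parse(v)
            except ValueError:
                return False
        return True

    if all_parse(int):
        return "INTEGER"
    if all_parse(float):
        return "FLOAT"
    if all(v.lower() in ("true", "false") for v in sample_values):
        return "BOOLEAN"
    return "STRING"  # fall back to the most permissive type

print(infer_field_type(["1", "2", "3"]))  # INTEGER
print(infer_field_type(["1.5", "2"]))     # FLOAT
print(infer_field_type(["WA", "OR"]))     # STRING
```

A real detector has to reconcile many more cases (nulls, dates, quoted numbers), but the narrowing from most specific to most permissive type is the core idea.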

To see the detected schema for a table:

  • From the command line, use the bq show command
  • Use the Cloud Console or the classic web UI to view the table's schema

When BigQuery detects schemas, it might, on rare occasions, change a field name to make it compatible with BigQuery SQL syntax.
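
The kind of renaming involved can be illustrated with a small sketch. This is not BigQuery's exact algorithm; it only mirrors the documented column naming rules (letters, digits, and underscores, not starting with a digit):

```python
import re

def sanitize_column_name(name):
    """Rewrite a field name to satisfy BigQuery column naming rules
    (letters, digits, and underscores only; must not start with a digit).
    Illustrative sketch only."""
    clean = re.sub(r"[^A-Za-z0-9_]", "_", name)
    if clean and clean[0].isdigit():
        clean = "_" + clean
    return clean

print(sanitize_column_name("order date"))  # order_date
print(sanitize_column_name("2019 total"))  # _2019_total
```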

Loading data using schema auto-detection

To enable schema auto-detection when loading data:

  • Cloud Console: in the Schema section, for Auto detect, check the Schema and input parameters option.
  • Classic BigQuery web UI: in the Schema section, check the Automatically detect option.
  • bq: use the bq load command with the --autodetect parameter.

When enabled, BigQuery makes a best-effort attempt to automatically infer the schema for CSV and JSON files.

Schema auto-detection is not used with Avro files, Parquet files, ORC files, Firestore export files, or Datastore export files. When you load these files into BigQuery, the table schema is automatically retrieved from the self-describing source data.

To use schema auto-detection when you load JSON or CSV data:

Console

  1. In the Cloud Console, go to the BigQuery web UI.

  2. From the Resources section of the navigation panel, select a dataset.

  3. Click Create table.

  4. On the Create table page, in the Source section:

    • For Create table from, select your desired source type.
    • In the source field, browse for the file or Cloud Storage bucket, or enter the Cloud Storage URI. You cannot include multiple URIs in the BigQuery web UI, but wildcards are supported. The Cloud Storage bucket must be in the same location as the dataset that contains the table you're creating.

    • For File format, select CSV or JSON.

  5. On the Create table page, in the Destination section:

    • For Dataset name, choose the appropriate dataset.

    • In the Table name field, enter the name of the table you're creating.

    • Verify that Table type is set to Native table.

  6. Click Create table.

Classic UI

  1. Go to the BigQuery web UI.

  2. In the navigation, next to your dataset name, click the down arrow icon.

  3. Click Create new table.

    Note: In the UI, the process for loading data is the same as the process for creating a table.
  4. On the Create table page:

    • For Source Data, click Create from source.
    • For Destination Table, choose your dataset and enter the table name in the Destination table name field.
    • For Schema, click Automatically detect to determine the schema.

    • Click Create Table.

bq

Issue the bq load command with the --autodetect parameter.

(Optional) Supply the --location flag and set the value to your location.

The following command loads a file using schema auto-detection:

bq --location=LOCATION load \
--autodetect \
--source_format=FORMAT \
DATASET.TABLE \
PATH_TO_SOURCE

Replace the following:

  • LOCATION: the name of your location. The --location flag is optional. For example, if you are using BigQuery in the Tokyo region, set the flag's value to asia-northeast1. You can set a default value for the location by using the .bigqueryrc file.
  • FORMAT: either NEWLINE_DELIMITED_JSON or CSV.
  • DATASET: the dataset that contains the table into which you're loading data.
  • TABLE: the name of the table into which you're loading data.
  • PATH_TO_SOURCE: the location of the CSV or JSON file.

Examples:

Enter the following command to load myfile.csv from your local machine into a table named mytable that is stored in a dataset named mydataset.

bq load --autodetect --source_format=CSV mydataset.mytable ./myfile.csv

Enter the following command to load myfile.json from your local machine into a table named mytable that is stored in a dataset named mydataset.

bq load --autodetect --source_format=NEWLINE_DELIMITED_JSON \
mydataset.mytable ./myfile.json

API

  1. Create a load job that points to the source data. For information about creating jobs, see Running BigQuery jobs programmatically. Specify your location in the location property in the jobReference section.

  2. Specify the data format by setting the sourceFormat property. To use schema autodetection, this value must be set to NEWLINE_DELIMITED_JSON or CSV.

  3. Set the autodetect property to true to enable schema auto-detection.

Go

Before trying this sample, follow the Go setup instructions in the BigQuery Quickstart Using Client Libraries. For more information, see the BigQuery Go API reference documentation.

import (
	"context"
	"fmt"

	"cloud.google.com/go/bigquery"
)

// importJSONAutodetectSchema demonstrates loading data from newline-delimited JSON data in Cloud Storage
// and using schema autodetection to identify the available columns.
func importJSONAutodetectSchema(projectID, datasetID, tableID string) error {
	// projectID := "my-project-id"
	// datasetID := "mydataset"
	// tableID := "mytable"
	ctx := context.Background()
	client, err := bigquery.NewClient(ctx, projectID)
	if err != nil {
		return fmt.Errorf("bigquery.NewClient: %v", err)
	}
	defer client.Close()

	gcsRef := bigquery.NewGCSReference("gs://cloud-samples-data/bigquery/us-states/us-states.json")
	gcsRef.SourceFormat = bigquery.JSON
	gcsRef.AutoDetect = true
	loader := client.Dataset(datasetID).Table(tableID).LoaderFrom(gcsRef)
	loader.WriteDisposition = bigquery.WriteEmpty

	job, err := loader.Run(ctx)
	if err != nil {
		return err
	}
	status, err := job.Wait(ctx)
	if err != nil {
		return err
	}

	if status.Err() != nil {
		return fmt.Errorf("job completed with error: %v", status.Err())
	}
	return nil
}

Node.js

Before trying this sample, follow the Node.js setup instructions in the BigQuery Quickstart Using Client Libraries. For more information, see the BigQuery Node.js API reference documentation.

// Import the Google Cloud client libraries
const {BigQuery} = require('@google-cloud/bigquery');
const {Storage} = require('@google-cloud/storage');

/**
 * TODO(developer): Uncomment the following lines before running the sample.
 */
// const datasetId = "my_dataset";
// const tableId = "my_table";

/**
 * This sample loads the JSON file at
 * https://storage.googleapis.com/cloud-samples-data/bigquery/us-states/us-states.json
 *
 * TODO(developer): Replace the following lines with the path to your file.
 */
const bucketName = 'cloud-samples-data';
const filename = 'bigquery/us-states/us-states.json';

async function loadJSONFromGCSAutodetect() {
  // Imports a GCS file into a table with autodetected schema.

  // Instantiate clients
  const bigquery = new BigQuery();
  const storage = new Storage();

  // Configure the load job. For full list of options, see:
  // https://cloud.google.com/bigquery/docs/reference/rest/v2/Job#JobConfigurationLoad
  const metadata = {
    sourceFormat: 'NEWLINE_DELIMITED_JSON',
    autodetect: true,
    location: 'US',
  };

  // Load data from a Google Cloud Storage file into the table
  const [job] = await bigquery
    .dataset(datasetId)
    .table(tableId)
    .load(storage.bucket(bucketName).file(filename), metadata);
  // load() waits for the job to finish
  console.log(`Job ${job.id} completed.`);

  // Check the job's status for errors
  const errors = job.status.errors;
  if (errors && errors.length > 0) {
    throw errors;
  }
}
loadJSONFromGCSAutodetect();

PHP

Before trying this sample, follow the PHP setup instructions in the BigQuery Quickstart Using Client Libraries. For more information, see the BigQuery PHP API reference documentation.

use Google\Cloud\BigQuery\BigQueryClient;
use Google\Cloud\Core\ExponentialBackoff;

/** Uncomment and populate these variables in your code */
// $projectId  = 'The Google project ID';
// $datasetId  = 'The BigQuery dataset ID';

// instantiate the bigquery table service
$bigQuery = new BigQueryClient([
    'projectId' => $projectId,
]);
$dataset = $bigQuery->dataset($datasetId);
$table = $dataset->table('us_states');

// create the import job
$gcsUri = 'gs://cloud-samples-data/bigquery/us-states/us-states.json';
$loadConfig = $table->loadFromStorage($gcsUri)->autodetect(true)->sourceFormat('NEWLINE_DELIMITED_JSON');
$job = $table->runJob($loadConfig);
// poll the job until it is complete
$backoff = new ExponentialBackoff(10);
$backoff->execute(function () use ($job) {
    print('Waiting for job to complete' . PHP_EOL);
    $job->reload();
    if (!$job->isComplete()) {
        throw new Exception('Job has not yet completed', 500);
    }
});
// check if the job has errors
if (isset($job->info()['status']['errorResult'])) {
    $error = $job->info()['status']['errorResult']['message'];
    printf('Error running job: %s' . PHP_EOL, $error);
} else {
    print('Data imported successfully' . PHP_EOL);
}

Python

To enable schema auto-detection, set the LoadJobConfig.autodetect property to True.

Before trying this sample, follow the Python setup instructions in the BigQuery Quickstart Using Client Libraries. For more information, see the BigQuery Python API reference documentation.

# from google.cloud import bigquery
# client = bigquery.Client()
# dataset_id = 'my_dataset'

dataset_ref = client.dataset(dataset_id)
job_config = bigquery.LoadJobConfig()
job_config.autodetect = True
job_config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON
uri = "gs://cloud-samples-data/bigquery/us-states/us-states.json"
load_job = client.load_table_from_uri(
    uri, dataset_ref.table("us_states"), job_config=job_config
)  # API request
print("Starting job {}".format(load_job.job_id))

load_job.result()  # Waits for table load to complete.
print("Job finished.")

destination_table = client.get_table(dataset_ref.table("us_states"))
print("Loaded {} rows.".format(destination_table.num_rows))

Ruby

Before trying this sample, follow the Ruby setup instructions in the BigQuery Quickstart Using Client Libraries. For more information, see the BigQuery Ruby API reference documentation.

require "google/cloud/bigquery"

def load_table_gcs_json_autodetect dataset_id = "your_dataset_id"
  bigquery = Google::Cloud::Bigquery.new
  dataset  = bigquery.dataset dataset_id
  gcs_uri  = "gs://cloud-samples-data/bigquery/us-states/us-states.json"
  table_id = "us_states"

  load_job = dataset.load_job table_id,
                              gcs_uri,
                              format:     "json",
                              autodetect: true
  puts "Starting job #{load_job.job_id}"

  load_job.wait_until_done! # Waits for table load to complete.
  puts "Job finished."

  table = dataset.table table_id
  puts "Loaded #{table.rows_count} rows to table #{table.id}"
end

Schema auto-detection for external data sources

When you create a table that is linked to an external data source, enable schema auto-detection:

  • In the Cloud Console, for Auto detect, check the Schema and input parameters option.
  • In the classic BigQuery web UI, check the Automatically detect option.

When enabled, BigQuery makes a best-effort attempt to automatically infer the schema for CSV and JSON external data sources.

Currently, you cannot enable schema auto-detection for Google Sheets external data sources by using the Cloud Console or the classic web UI. Also, schema auto-detection is not used with external Avro files, Firestore export files, or Datastore export files. When you create a table that is linked to one of these file types, BigQuery automatically retrieves the schema from the self-describing source data.

Using the command-line interface, you can enable schema auto-detection when you create a table definition file for CSV, JSON, or Google Sheets data. Pass the --autodetect flag to the mkdef command to enable schema auto-detection, or pass the --noautodetect flag to disable it.

When you use the --autodetect flag, the autodetect setting is set to true in the table definition file. When you use the --noautodetect flag, the autodetect setting is set to false. If you do not provide a schema definition for the external data source when you create a table definition, and you do not use the --noautodetect or --autodetect flag, the autodetect setting defaults to true.
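
The precedence described above can be summarized in a small sketch (a hypothetical helper, not part of the bq tool):

```python
def resolve_autodetect(flag=None, schema=None):
    """Return the effective autodetect setting for a table definition.

    flag:   True for --autodetect, False for --noautodetect, None if absent.
    schema: an explicit schema definition, or None.
    """
    if flag is not None:
        return flag            # an explicit flag always wins
    return schema is None      # default to True only when no schema is given

print(resolve_autodetect(flag=False))              # False
print(resolve_autodetect(schema=None))             # True
print(resolve_autodetect(schema=["name:STRING"]))  # False
```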

When you create a table definition file by using the API, set the value of the autodetect property to true or false. Setting autodetect to true enables auto-detection. Setting autodetect to false disables it.

Auto-detection details

In addition to detecting schema details, auto-detection recognizes the following:

Compression

BigQuery recognizes gzip-compatible file compression when opening a file.
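
gzip streams are identifiable by a two-byte magic number, 0x1f 0x8b, at the start of the file. How BigQuery recognizes compression internally is not documented; the magic-number check shown here is simply the standard technique:

```python
import gzip
import io

def is_gzip(data: bytes) -> bool:
    """Check for the gzip magic number at the start of a byte stream."""
    return data[:2] == b"\x1f\x8b"

# Compress a small payload in memory and verify the magic number.
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as f:
    f.write(b"name,post_abbr\nWashington,WA\n")

print(is_gzip(buf.getvalue()))       # True
print(is_gzip(b"name,post_abbr\n"))  # False
```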

CSV delimiter

BigQuery detects the following delimiters:

  • comma ( , )
  • pipe ( | )
  • tab ( \t )
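
Python's standard library offers an analogous delimiter sniffer, which can be handy for checking what a file uses before loading it. This is csv.Sniffer, not BigQuery's detector, restricted here to the same three candidates:

```python
import csv

sample = "name|post_abbr\nWashington|WA\nOregon|OR\n"

# Restrict the sniffer to the delimiters BigQuery detects.
dialect = csv.Sniffer().sniff(sample, delimiters=",|\t")
print(dialect.delimiter)  # |
```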

CSV header

BigQuery infers headers by comparing the first row of the file with other rows in the data set. If the first line contains only strings, and the other lines contain other data types, BigQuery assumes that the first row is a header row. In that case, BigQuery assigns column names based on the field names in the header row. The names might be modified to meet the naming rules for columns in BigQuery. For example, spaces will be replaced with underscores.

Otherwise, BigQuery assumes that the first row is a data row, and assigns generic column names such as string_field_1. Auto-detection cannot recover meaningful column names in this case, although you can change the names manually after the table is created. Another option is to provide an explicit schema instead of using auto-detection.

You might have a CSV file with a header row where all of the data fields are strings. In that case, BigQuery does not automatically detect that the first row is a header. Use the --skip_leading_rows option to skip the header row; otherwise, the header is imported as data. Also consider providing an explicit schema in this case, so that you can assign column names.
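
The header heuristic can be sketched as follows. This is a simplification for illustration; it treats "non-string" as "numeric", whereas the actual comparison covers every detectable type:

```python
def looks_like_header(first_row, data_rows):
    """Guess whether first_row is a header: it should contain only
    strings while later rows contain at least one non-string value."""
    def is_numeric(value):
        try:
            float(value)
            return True
        except ValueError:
            return False

    first_is_all_strings = not any(is_numeric(v) for v in first_row)
    rest_has_typed_values = any(
        is_numeric(v) for row in data_rows for v in row
    )
    return first_is_all_strings and rest_has_typed_values

print(looks_like_header(["state", "population"],
                        [["Washington", "7535591"]]))  # True

# All-string data: the first row is indistinguishable from a data row.
print(looks_like_header(["state", "post_abbr"],
                        [["Washington", "WA"]]))       # False
```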

CSV quoted new lines

BigQuery detects quoted new line characters within a CSV field and does not interpret the quoted new line character as a row boundary.
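
Python's csv module treats quoted newlines the same way, which makes the distinction easy to see (illustrative, not BigQuery code):

```python
import csv
import io

data = 'id,note\n1,"first line\nsecond line"\n'

rows = list(csv.reader(io.StringIO(data)))
print(len(rows))         # 2 -- the quoted newline does not start a new row
print(repr(rows[1][1]))  # 'first line\nsecond line'
```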

Date and time values

BigQuery detects date and time values based on the formatting of the source data.

Values in DATE columns must be in the following format: YYYY-MM-DD.

Values in TIME columns must be in the following format: HH:MM:SS[.SSSSSS] (the fractional-second component is optional).

For TIMESTAMP columns, BigQuery detects a wide array of timestamp formats, including, but not limited to:

  • YYYY-MM-DD HH:MM
  • YYYY-MM-DD HH:MM:SS
  • YYYY-MM-DD HH:MM:SS.SSSSSS
  • YYYY/MM/DD HH:MM

A timestamp can also contain a UTC offset or the UTC zone designator ('Z').

Here are some examples of values that BigQuery will automatically detect as timestamp values:

  • 2018-08-19 12:11
  • 2018-08-19 12:11:35.22
  • 2018/08/19 12:11
  • 2018-08-19 07:11:35.220 -05:00

If BigQuery doesn't recognize the format, it loads the column as a string data type. In that case, you might need to preprocess the source data before loading it. For example, if you are exporting CSV data from a spreadsheet, set the date format to match one of the examples shown here. Alternatively, you can transform the data after loading it into BigQuery.
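
A format-matching fallback of this kind can be sketched with strptime. The format list here is a small subset for illustration, not BigQuery's full set:

```python
from datetime import datetime

# A few of the formats listed above, in strptime syntax.
TIMESTAMP_FORMATS = [
    "%Y-%m-%d %H:%M:%S.%f",
    "%Y-%m-%d %H:%M:%S",
    "%Y-%m-%d %H:%M",
    "%Y/%m/%d %H:%M",
]

def detect_type(value):
    """Return TIMESTAMP if the value matches a known format, else STRING."""
    for fmt in TIMESTAMP_FORMATS:
        try:
            datetime.strptime(value, fmt)
            return "TIMESTAMP"
        except ValueError:
            pass
    return "STRING"

print(detect_type("2018-08-19 12:11"))        # TIMESTAMP
print(detect_type("2018-08-19 12:11:35.22"))  # TIMESTAMP
print(detect_type("Aug 19, 2018"))            # STRING
```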