Loading Data

This page describes how to load data into BigQuery.

Introduction

In some situations, you can query data without loading it first (see Alternatives to loading data below). In all other situations, you must load your data into BigQuery before you can run queries.

You can load data from Google Cloud Storage or from a readable local file, or you can stream data into BigQuery one record at a time. Loaded data can be added to a new table, appended to an existing table, or used to overwrite an existing table.

Before you begin

To load data into BigQuery, you must set up billing and ensure that you have read access to the data source and write access to the destination table.

  1. Sign in to your Google account.

    If you don't already have one, sign up for a new account.

  2. Select or create a Cloud Platform Console project.

    Go to the Projects page

  3. Enable billing for your project.

    Enable billing

  4. Ensure that you have read access to your data source. If you are loading content from Google Cloud Storage, and you are an owner of the project that contains your data source, you probably have read access.

    To set READ access on a Cloud Storage object, see Creating and Managing Access Control Lists (ACLs) in the Cloud Storage documentation. A command-line sketch follows these steps.

  5. Ensure that you have write access to your destination table. If you are the owner of the dataset that contains your destination table, you probably have write access.

    To set write access to a dataset in BigQuery:

    1. Go to the BigQuery web UI.
    2. In the navigation, hover over a dataset ID, click the down arrow icon next to the ID, and then click Share dataset.
    3. Add a person and give that person edit access, then click Save changes.
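
For step 4, one way to grant read access on a single Cloud Storage object from the command line is gsutil. The following is a sketch only; the account, bucket, and object names are placeholders:

gsutil acl ch -u jane@example.com:R gs://example-bucket/example-object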

Specifying the source format

BigQuery supports loading data from several source formats, including CSV, JSON, Avro, and Google Cloud Datastore backup files. For more information, see Data Formats.

The default source format for loading data is CSV. To load data that is stored in one of the other supported data formats, specify the format explicitly.

Loading CSV files

For an example of loading a CSV file from the command line, see the following sketch.
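
The table, file, and field names below are placeholders; because CSV is the default source format, the --source_format flag can be omitted.

bq load mydataset.mytable mydata.csv name:STRING,count:INTEGER

Passing --source_format=CSV explicitly is equivalent.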

Loading JSON files

The following instructions describe how to load data in JSON format.

Web UI

  1. Go to the BigQuery web UI.

  2. In the navigation, hover over a dataset ID, click the down arrow icon next to the ID, and then click Create new table.

  3. Under Source data, select a Location. The supported source formats appear in the File format selector.

  4. Select JSON (newline delimited) for the File format.

  5. Under Destination Table, enter a value for the destination table name.

  6. In the Schema section, specify a schema. If your schema description is in JSON format, click the Edit as Text link and replace the contents of the Schema input area with the JSON file's associated schema.

  7. Click the Create Table button.

Command-line


Use the bq load command with the source_format flag.

bq load --source_format=NEWLINE_DELIMITED_JSON [DATASET].[TABLE_NAME] [PATH_TO_SOURCE] [SCHEMA]

For example, the following command loads the personsData.json file into a table named persons_data in the mydataset dataset using the schema defined in personsDataSchema.json:

bq load --source_format=NEWLINE_DELIMITED_JSON mydataset.persons_data personsData.json personsDataSchema.json
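
The source can also be a fully-qualified Cloud Storage URI instead of a local file. As a sketch, with a placeholder bucket name:

bq load --source_format=NEWLINE_DELIMITED_JSON mydataset.persons_data gs://mybucket/personsData.json personsDataSchema.json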

API

  1. Create a load job that points to the source data in Google Cloud Storage. For information about creating jobs, see Managing Jobs, Datasets, and Projects.

  2. The source URIs must be fully-qualified, in the format gs://[BUCKET]/[OBJECT]. Each URI can contain one '*' wildcard character and it must come after the bucket name.

  3. Specify the data format by setting the configuration.load.sourceFormat property. For example, to load a JSON file, set this property to the value NEWLINE_DELIMITED_JSON.

  4. To check the job status, call jobs.get([JOB_ID]) with the ID of the job returned by the initial request.

    • If status.state = DONE, the job completed successfully.
    • If the status.errorResult property is present, the request failed, and that object will include information describing what went wrong. When a request fails, no table is created and no data is added.
    • If status.errorResult is absent, the job finished successfully, although there might have been some non-fatal errors, such as problems importing a few rows. Non-fatal errors are listed in the returned job object's status.errors property.

API notes:

  • Load jobs are atomic and consistent; if a load job fails, none of the data is available, and if a load job succeeds, all of the data is available.

  • As a best practice, generate a unique ID and pass it as jobReference.jobId when calling jobs.insert() to create a load job. This approach is more robust to network failure because the client can poll or retry on the known job ID.

  • Calling jobs.insert() on a given job ID is idempotent; in other words, you can retry as many times as you like on the same job ID, and at most one of those operations will succeed.

For sample code and information about loading data with a POST request, see loading data with a POST request.
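
The same flow can also be driven from a client library. The following is a minimal sketch only, assuming the google-cloud-bigquery Python client library (not described on this page) and placeholder bucket, field, and job names; it creates the load job with an explicit job ID and waits for it to finish:

from google.cloud import bigquery
import uuid

client = bigquery.Client()

# Describe the load: newline-delimited JSON in Cloud Storage with an explicit schema.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    schema=[
        bigquery.SchemaField("name", "STRING"),    # hypothetical fields
        bigquery.SchemaField("age", "INTEGER"),
    ],
)

# A unique job ID makes the insert idempotent, so it is safe to retry on network failure.
job_id = "load_persons_data_" + uuid.uuid4().hex

job = client.load_table_from_uri(
    "gs://mybucket/personsData.json",    # placeholder source URI
    "mydataset.persons_data",            # destination table
    job_id=job_id,
    job_config=job_config,
)

job.result()       # blocks until status.state = DONE; raises if status.errorResult is present
print(job.errors)  # non-fatal errors (status.errors), if any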

Loading Avro files

Follow along with the instructions below by downloading and unzipping the sample binary Avro file, yob1900.avro, which contains baby names recorded in the United States in the year 1900.

To load an Avro file:

Web UI

  1. Go to the BigQuery web UI.

  2. In the navigation, hover over a dataset ID, click the down arrow icon next to the ID, and then click Create new table.

  3. Under Source Data, leave the default setting of File upload for the Location. Click Choose file and select an Avro file. For example, you can use the yob1900.avro file that you previously downloaded and unzipped.

  4. Select Avro for the File format.

  5. Under Destination Table, enter a value for the destination table name. For example:

    names_1900
    
  6. Click the Create Table button.

  7. After the table loads, click on the table name to display its schema.

Command-line


Use the bq load command with source_format set to AVRO.

bq load --source_format=[FORMAT] [DATASET].[TABLE_NAME] [PATH_TO_SOURCE]

where:

  • [FORMAT] is AVRO
  • [DATASET].[TABLE_NAME] is the destination table
  • [PATH_TO_SOURCE] is the path to the Avro source file

For example, the following command loads the yob1900.avro file into a table named babynames.names_1900_bq:

bq load --source_format=AVRO babynames.names_1900_bq yob1900.avro
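
Because Avro files include their own schema, no schema argument is needed. To inspect the schema that BigQuery created, one option is bq show, assuming the table created above:

bq show --format=prettyjson babynames.names_1900_bq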

API

  1. Create a load job that points to the source data in Google Cloud Storage.

  2. The source URIs must be fully-qualified, in the format gs://[BUCKET]/[OBJECT]. Each URI can contain one '*' wildcard character and it must come after the bucket name.

  3. Specify the Avro data format by setting the configuration.load.sourceFormat property to AVRO.

  4. To check the job status, call jobs.get([JOB_ID]), where [JOB_ID] is the ID of the job returned by the initial request.

    • If status.state = DONE, the job completed successfully.
    • If the status.errorResult property is present, the request failed, and that object will include information describing what went wrong. When a request fails, no table is created and no data is added.
    • If status.errorResult is absent, the job finished successfully, although there might have been some non-fatal errors, such as problems importing a few rows. Non-fatal errors are listed in the returned job object's status.errors property.

API notes:

  • Load jobs are atomic and consistent; if a load job fails, none of the data is available, and if a load job succeeds, all of the data is available.

  • As a best practice, generate a unique ID and pass it as jobReference.jobId when calling jobs.insert() to create a load job. This approach is more robust to network failure because the client can poll or retry on the known job ID.

  • Calling jobs.insert() on a given job ID is idempotent; in other words, you can retry as many times as you like on the same job ID, and at most one of those operations will succeed.

For sample code and information about loading data with a POST request, see loading data with a POST request.
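
For reference, a rough sketch of the request body described by steps 1 through 3 above might look like the following for this Avro example; the project, bucket, and job names are placeholders:

{
  "jobReference": {
    "projectId": "myproject",
    "jobId": "load_names_1900_0001"
  },
  "configuration": {
    "load": {
      "sourceUris": ["gs://mybucket/yob1900.avro"],
      "sourceFormat": "AVRO",
      "destinationTable": {
        "projectId": "myproject",
        "datasetId": "babynames",
        "tableId": "names_1900"
      }
    }
  }
}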

Loading nested and repeated data

BigQuery supports loading nested and repeated data from source formats that support object-based schemas, such as JSON, Avro, and Cloud Datastore backup files.

Loading nested and repeated JSON data

Follow along with the instructions below by using two sample files, personsData.json (the data) and personsDataSchema.json (the schema), which represent a dataset with nested and repeated data.

To load a JSON file that contains nested and repeated data:

Web UI

  1. Go to the BigQuery web UI.

  2. In the navigation, hover over a dataset ID, click the down arrow icon next to the ID, and then click Create new table.

  3. Under Source Data, leave the default setting of File upload for the Location. Click Choose file and select a JSON file. For example, personsData.json.

  4. Select JSON (Newline Delimited) for the File format.

  5. Under Destination Table, enter a value for the destination table name. For example:

    persons_data
    
  6. In the Schema section, click the Edit as Text link to input the nested schema.

  7. Replace the contents of the Schema input area with the JSON file's associated schema. For example, copy and paste the contents of personsDataSchema.json.

  8. Click the Create Table button.

Command-line


Use the bq load command with source_format set to NEWLINE_DELIMITED_JSON.

bq load --source_format=[FORMAT] --schema=[SCHEMA] [DATASET].[TABLE_NAME] [PATH_TO_SOURCE]

where:

  • [FORMAT] is NEWLINE_DELIMITED_JSON
  • [SCHEMA] is the schema of the data to be loaded
  • [DATASET].[TABLE_NAME] is the destination table
  • [PATH_TO_SOURCE] is the path to the JSON source file

For example, the following command loads the personsData.json file into a table named mydataset.persons_data using the schema defined in the personsDataSchema.json file:

bq load --source_format=NEWLINE_DELIMITED_JSON --schema=personsDataSchema.json mydataset.persons_data personsData.json
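
The schema file for nested and repeated data describes nested structures as RECORD fields and repeated structures with the REPEATED mode. The contents of personsDataSchema.json are not reproduced here; the following is a purely hypothetical sketch of the shape of such a schema and of one matching newline-delimited record:

[
  {"name": "fullName", "type": "STRING", "mode": "REQUIRED"},
  {"name": "children", "type": "RECORD", "mode": "REPEATED", "fields": [
    {"name": "name", "type": "STRING", "mode": "NULLABLE"},
    {"name": "age", "type": "INTEGER", "mode": "NULLABLE"}
  ]}
]

{"fullName": "Jane Doe", "children": [{"name": "Sam", "age": 6}, {"name": "Kit", "age": 8}]}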

API

  1. Create a load job that points to the source data in Google Cloud Storage.

  2. The source URIs must be fully-qualified, in the format gs://[BUCKET]/[OBJECT]. Each URI can contain one '*' wildcard character and it must come after the bucket name.

  3. Specify the JSON data format by setting the configuration.load.sourceFormat property to NEWLINE_DELIMITED_JSON.

  4. To check the job status, call jobs.get([JOB_ID]), where [JOB_ID] is the ID of the job returned by the initial request.

    • If status.state = DONE, the job completed successfully.
    • If the status.errorResult property is present, the request failed, and that object will include information describing what went wrong. When a request fails, no table is created and no data is added.
    • If status.errorResult is absent, the job finished successfully, although there might have been some non-fatal errors, such as problems importing a few rows. Non-fatal errors are listed in the returned job object's status.errors property.

For more information about loading data with a POST request, see loading data with a POST request.

For information about how to query nested and repeated data, see Nested and Repeated Fields.

Loading data from other Google services

Google Analytics

To learn how to export your session and hit data from a Google Analytics Premium reporting view into BigQuery, see BigQuery Export in the Google Analytics Help Center.

Google Cloud Storage

BigQuery supports loading data from Cloud Storage. For more information, see loading data from Cloud Storage.

Google Cloud Datastore

BigQuery supports loading data from Cloud Datastore backups. For more information, see loading data from Cloud Datastore.

Google Cloud Dataflow

Google Cloud Dataflow can load data directly into BigQuery. For more information about using Cloud Dataflow to read from, and write to, BigQuery, see BigQuery I/O in the Cloud Dataflow documentation.

Cloud Storage access and storage logs

Google Cloud Storage provides access and storage log files in CSV format, which can be directly imported into BigQuery for analysis. In order to access these logs, you must:

  1. Set up log delivery and enable logging.

  2. Download the schema files, which you can use to specify the schema for the log data.

    More information is available in the Cloud Storage access logs and storage data documentation.

  3. Use the bq load command to load storage and access logs into BigQuery from the command line:

    bq load --schema=[SCHEMA] [DATASET].[TABLE_NAME] [PATH_TO_SOURCE]
    

    where:

    • [SCHEMA] is the associated schema file that you downloaded
    • [DATASET].[TABLE_NAME] is the dataset and name of the destination table
    • [PATH_TO_SOURCE] is the fully-qualified location of the logs data

    For example, the following command loads logs data into a table named my_dataset.usage_2012_06_18_v0:

    bq load --schema=cloud_storage_usage_schema.json my_dataset.usage_2012_06_18_v0 gs://my_logs/bucket_usage_2012_06_18_14_v0
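
    To load several log files in one job, the Cloud Storage URI can contain a single '*' wildcard after the bucket name. For example, using the same bucket as above and a placeholder destination table:

    bq load --schema=cloud_storage_usage_schema.json my_dataset.usage_2012_06 gs://my_logs/bucket_usage_2012_06_*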
    

Alternatives to loading data

You do not need to load data before running queries in the following situations:

Public datasets
Public datasets are datasets that another party stores in BigQuery and shares with the public. For more information, see Public datasets.
Shared datasets
You can share datasets stored in BigQuery. If someone has shared a dataset with you, you can run queries on that dataset without loading the data first.
Federated data sources
You can skip the data loading process by creating a table that is based on an external data source. For information about the benefits and limitations of this approach, see federated data sources.
Stackdriver log files
Cloud Logging provides an option to export log files, such as App Engine request and application logs, straight into BigQuery. See ways to configure logs export for more information.

Another alternative to loading data is to stream the data one record at a time. For information about streaming, see Streaming Data.

Quota policy

For information about the quota policy for loading data, see Load jobs on the Quota Policy page.
