
Loading Data Into BigQuery

Before you can query your data, you need to load it into BigQuery. You can bulk load the data by using a job, or stream records individually. Alternatively, you can skip the loading process entirely by setting up a table as a federated data source.

Load jobs support three data sources:

  1. Objects in Google Cloud Storage
  2. Data sent with the job or streaming insert
  3. A Google Cloud Datastore backup

Loaded data can be written to a new table, appended to an existing table, or used to overwrite an existing table. Data can be represented as a flat or nested/repeated schema, as described in Data formats. An individual load job can load data from multiple sources, which you configure with the sourceUris property.
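For example, here is a minimal sketch of the load portion of a job configuration that reads from two source objects and appends to an existing table; the bucket, project, dataset, and table names are placeholders:

# Sketch of a job's configuration.load section. The source URIs, project,
# dataset, and table names are placeholders; writeDisposition selects
# between writing a new table, appending, and overwriting.
load_config = {
    'sourceUris': [
        'gs://my_bucket/data_part1.csv',
        'gs://my_bucket/data_part2.csv',
    ],
    'sourceFormat': 'CSV',
    'createDisposition': 'CREATE_IF_NEEDED',
    'writeDisposition': 'WRITE_APPEND',   # or WRITE_TRUNCATE / WRITE_EMPTY
    'destinationTable': {
        'projectId': 'my-project',
        'datasetId': 'my_dataset',
        'tableId': 'my_table',
    },
}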

It can be helpful to prepare the data before loading it into BigQuery, and to transform it if needed.


Access control

Loading data into BigQuery requires the following access levels.

  • BigQuery: WRITE access for the dataset that contains the destination table. For more information, see access control.

  • Google Cloud Storage: READ access for the object in Google Cloud Storage, if loading data from Google Cloud Storage. For more information, see Access Control - Google Cloud Storage.

  • Google Cloud Datastore: READ access to the Cloud Datastore backup objects in Google Cloud Storage, if loading data from Cloud Datastore. For more information, see Access Control - Google Cloud Storage.


Data consistency

Once you've called jobs.insert() to start a job, you can poll the job for its status by calling jobs.get().

We recommend generating a job ID and passing it as jobReference.jobId when calling jobs.insert(). This approach is more robust to network failure because the client can poll or retry on the known job ID.

Note that calling jobs.insert() on a given job ID is idempotent; in other words, you can retry as many times as you like on the same job ID, and at most one of those operations will succeed.
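For example, here is a minimal sketch of this pattern, assuming bigquery is an authorized BigQuery v2 service object (for instance, built with googleapiclient.discovery.build) and load_config is the job's load configuration; the project ID is a placeholder:

import uuid
from googleapiclient.errors import HttpError

# Generate the job ID on the client so jobs.insert() can be retried safely.
job_id = 'load_job_%s' % uuid.uuid4()
job_body = {
    'jobReference': {'projectId': 'my-project', 'jobId': job_id},
    'configuration': {'load': load_config},
}

# Retrying on the same job ID is safe: at most one insert succeeds, and a
# later attempt that finds the job already created returns a 409 error.
for attempt in range(5):
    try:
        bigquery.jobs().insert(projectId='my-project', body=job_body).execute()
        break
    except HttpError as err:
        if err.resp.status == 409:   # the job already exists; a prior attempt won
            break
        raise
    except IOError:
        continue                     # network failure; retry with the same job ID

# Poll the known job ID for status, regardless of which attempt created the job.
job = bigquery.jobs().get(projectId='my-project', jobId=job_id).execute()
print(job['status']['state'])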


Quota policy

The following limits apply for loading data into BigQuery.

  • Daily limit: 1,000 load jobs per table per day (including failures), 10,000 load jobs per project per day (including failures)
  • Maximum file sizes (CSV and JSON):

    • CSV, compressed: 4 GB
    • CSV, uncompressed, with new-lines in strings: 4 GB
    • CSV, uncompressed, without new-lines in strings: 5 TB
    • JSON, compressed: 4 GB
    • JSON, uncompressed: 5 TB
  • Maximum size per load job: 5 TB across all input files for CSV and JSON.
  • Maximum number of files per load job: 10,000
  • There are several additional limits that are specific to BigQuery's supported data formats. For more information, see preparing data for BigQuery.


Additional limits

The following additional limits apply for loading data into BigQuery.

  • Maximum columns per table: 10,000
  • Data format limits: Depending on which format you use to load your data, additional limits may apply. For more information, see Data formats.


Loading data from Google Cloud Storage

To load data from Google Cloud Storage:

  1. Upload your data to Google Cloud Storage.

    The easiest way to upload your data to Google Cloud Storage is to use the Google Developers Console. Be sure to upload your data to a project that has the BigQuery service activated.

  2. Create a load job pointing to the source data in Google Cloud Storage. The source URIs must be fully-qualified, in the format gs://<bucket>/<object>.
  3. Check the job status.

    Call jobs.get(<jobId>) with the ID of the job returned by the initial request, and check for status.state = DONE. If the status.errorResult property is present, the request failed; that object includes information describing what went wrong, and no table is created and no data is added. If status.errorResult is absent, the job finished successfully, although there might have been some non-fatal errors, such as problems importing a few rows. Non-fatal errors are listed in the returned job object's status.errors property.

Example

The following Python client example loads CSV data from a Google Cloud Storage bucket and prints the results on the command line.
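A minimal sketch of such a loader, assuming the google-api-python-client library and Application Default Credentials; the project, bucket, dataset, table, and schema below are placeholders:

import time
import uuid

import google.auth
from googleapiclient.discovery import build

# Build an authorized BigQuery v2 service object.
credentials, _ = google.auth.default(
    scopes=['https://www.googleapis.com/auth/bigquery'])
bigquery = build('bigquery', 'v2', credentials=credentials)

PROJECT_ID = 'my-project'
JOB_ID = 'load_job_%s' % uuid.uuid4()

job_body = {
    'jobReference': {'projectId': PROJECT_ID, 'jobId': JOB_ID},
    'configuration': {
        'load': {
            'sourceUris': ['gs://my_bucket/data.csv'],
            'sourceFormat': 'CSV',
            'skipLeadingRows': 1,
            'schema': {'fields': [
                {'name': 'name', 'type': 'STRING'},
                {'name': 'value', 'type': 'INTEGER'},
            ]},
            'destinationTable': {
                'projectId': PROJECT_ID,
                'datasetId': 'my_dataset',
                'tableId': 'my_table',
            },
        }
    },
}

# Start the load job, then poll jobs.get() until it reaches the DONE state.
bigquery.jobs().insert(projectId=PROJECT_ID, body=job_body).execute()
while True:
    job = bigquery.jobs().get(projectId=PROJECT_ID, jobId=JOB_ID).execute()
    if job['status']['state'] == 'DONE':
        break
    time.sleep(5)

if 'errorResult' in job['status']:
    print('Load failed: %s' % job['status']['errorResult'])
else:
    print('Loaded gs://my_bucket/data.csv into my_dataset.my_table')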



Loading data with a POST request

You can load data directly into BigQuery by sending a POST request. For more information, see loading data with a POST request.

Example

The following Python client sample demonstrates one way of constructing and sending a load request from a local file using the httplib2 library. The script asks for the name of the local file, appends the file's contents to the body of the request, and then submits the request:
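A minimal sketch of such a script, assuming the standard multipart media-upload endpoint and an OAuth 2.0 access token obtained separately; the project, dataset, table, schema, and token values are placeholders:

import json
import httplib2

PROJECT_ID = 'my-project'
ACCESS_TOKEN = 'placeholder-oauth2-access-token'

# Job configuration, sent as the first part of the multipart request body.
job_config = {
    'configuration': {
        'load': {
            'sourceFormat': 'CSV',
            'schema': {'fields': [
                {'name': 'name', 'type': 'STRING'},
                {'name': 'value', 'type': 'INTEGER'},
            ]},
            'destinationTable': {
                'projectId': PROJECT_ID,
                'datasetId': 'my_dataset',
                'tableId': 'my_table',
            },
        }
    }
}

# Ask for the local file and append its contents to the request body.
filename = input('Enter the name of the CSV file to load: ')
with open(filename, 'r') as f:
    data = f.read()

boundary = 'xxx'
body = ('--%s\n' % boundary +
        'Content-Type: application/json; charset=UTF-8\n\n' +
        json.dumps(job_config) + '\n' +
        '--%s\n' % boundary +
        'Content-Type: application/octet-stream\n\n' +
        data + '\n' +
        '--%s--\n' % boundary)

url = ('https://www.googleapis.com/upload/bigquery/v2/projects/%s/jobs'
       '?uploadType=multipart' % PROJECT_ID)
headers = {
    'Authorization': 'Bearer %s' % ACCESS_TOKEN,
    'Content-Type': 'multipart/related; boundary=%s' % boundary,
}

# Submit the POST request and print the server's response.
resp, content = httplib2.Http().request(
    url, method='POST', body=body.encode('utf-8'), headers=headers)
print(resp.status)
print(content)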


Loading data from other Google services

Google Cloud Datastore

BigQuery supports loading data from Cloud Datastore backups. For more information, see loading data from Cloud Datastore.

App Engine log files to BigQuery

log2bq is a Python-based App Engine application that provides handlers for moving App Engine log data into BigQuery via Google Cloud Storage.

There are also open source tools for loading App Engine log files. One example is Mache, an open source Java App Engine framework for exporting App Engine logs to Google BigQuery.

Cloud Storage access and storage logs

Google Cloud Storage provides access and storage log files in CSV format, which can be imported directly into BigQuery for analysis. To access these logs, you must set up log delivery and enable logging. The schemas are available online, in JSON format, for both the storage access logs and the storage bucket data. More information is available in the Cloud Storage access logs and storage data documentation.

To load access and storage logs into BigQuery from the command line, use a command such as:

bq load --schema=cloud_storage_usage_schema.json my_dataset.usage_2012_06_18_v0 gs://my_logs/bucket_usage_2012_06_18_14_v0
