Before you can query your data, you must first load it into BigQuery or set up the data as a federated data source. You can bulk load the data by using a job, or stream records individually. Alternately, you can skip the loading process by setting up a table as a federated data source.
Load jobs support three data sources:
- Objects in Google Cloud Storage
- Data sent with the job or streaming insert
- A Google Cloud Datastore backup
Loaded data can be added to a new table, appended to a table, or can overwrite a table. Data can be represented as a flat or nested/repeated schema, as described in Data formats. Each individual load job can load data from multiple sources, configured with the sourceUris property.
You can also use Google Cloud Dataflow to load data into BigQuery. For more information about using Cloud Dataflow to read from, and write to, BigQuery, see BigQuery I/O in the Cloud Dataflow documentation.
- Access control
- Load status, retries and data consistency
- Quota policy
- Additional limits
- Loading data from Google Cloud Storage
- Loading data with a POST request
- Loading data from other Google services
To load data into BigQuery you need the following access levels.
Google Cloud Storage
Load status, retries and data consistency
Once you've called
jobs.insert() to start a load job, you can poll the job for its status by calling
We recommend generating a unique ID and passing it as
jobReference.jobId when calling
jobs.insert(). This approach
is more robust to network failure because the client can poll or retry on the known job ID.
Note that calling
jobs.insert() on a given job ID is idempotent; in other words, you can retry as many times as you
like on the same job ID, and at most one of those operations will succeed.
The following limits apply for loading data into BigQuery.
- Daily limit: 1,000 load jobs per table per day (including failures), 10,000 load jobs per project per day (including failures)
- Maximum File Sizes:
For CSV and JSON:
File Type Compressed Uncompressed CSV 4 GB
- With quoted new-lines in values: 4 GB
- Without new-lines in values: 5 TB
JSON 4 GB 5 TB
- Maximum size per load job: 12 TB across all input files for CSV and JSON.
- Maximum number of files per load job: 10,000
- There are several additional limits that are specific to BigQuery's supported data formats. For more information, see preparing data for BigQuery.
The following additional limits apply for loading data into BigQuery.
- Maximum columns per table: 10,000
- Data format limits: Depending on which format you use to load your data, additional limits may apply. For more information, see Data formats.
Loading data from Google Cloud Storage
BigQuery supports loading data from two storage classes:
BigQuery does not support loading data from Cloud Storage Nearline.
To load data from Google Cloud Storage:
- Upload your data to Google Cloud Storage.
The easiest way to upload your data to Google Cloud Storage is to use the Google Cloud Platform Console. Be sure that you upload your data to a project the BigQuery service is activated on.
- Create a load job pointing to the source data in Google Cloud Storage.
The source URIs must be fully-qualified, in the format gs://<bucket>/<object>. Each URI can contain one '*' wildcard character and it must come after the bucket name.
Note: BigQuery does not support source URIs that include multiple consecutive slashes after the initial double slash. Cloud Storage object names can contain multiple consecutive slash ("/") characters. However, BigQuery converts multiple consecutives slashes into a single slash. For example, the following source URI, though valid in Cloud Storage, does not work in BigQuery:
Check the job status.
Call jobs.get(<jobId>) with the ID of the job returned by the initial request, and check for
status.state = DONE; if the
status.errorResultproperty is present, the request failed, and that object will include information describing what went wrong. If the request failed, no table will have been created or data added. If
status.errorResultis absent, the job finished successfully, although there might have been some non-fatal errors, such as problems importing a few rows. Non-fatal errors are listed in the returned job object's
The following Python client example loads CSV data from a Google Cloud Storage bucket and prints the results on the command line.
Loading data with a POST request
You can load data directly into BigQuery by sending a POST request. For more information, see loading data with a POST request.
The following Python client sample loads data into BigQuery from a local file:
BigQuery supports loading data from Cloud Datastore backups. For more information, see loading data from Cloud Datastore.
Cloud Logging provides an option to export App Engine request and application logs straight into BigQuery. See ways to configure logs export for more information.
Google Cloud Storage provides access and storage log files in CSV formats which can be directly imported into BigQuery for analysis. In order to access these logs, you must set up log delivery and enable logging. The schemas are available online, in JSON format, for both the storage access logs and storage bucket data. More information is available in the Cloud Storage access logs and storage data documentation.
In order to load storage and access logs into BigQuery from the command line, use a command such as:
bq load --schema=cloud_storage_usage_schema.json my_dataset.usage_2012_06_18_v0 gs://my_logs/bucket_usage_2012_06_18_14_v0