Introduction to loading data

This document provides an overview of loading data into BigQuery.

Overview

There are several ways to ingest data into BigQuery:

  • Batch load a set of data records.
  • Stream individual records or batches of records.
  • Use queries to generate new data and append or overwrite the results in a table.
  • Use a third-party application or service.

Batch loading

With batch loading, you load the source data into a BigQuery table in a single batch operation. For example, the data source could be a CSV file, an external database, or a set of log files. Traditional extract, transform, and load (ETL) jobs fall into this category.

Options for batch loading in BigQuery include the following:

  • Load jobs. Load data from Cloud Storage or from a local file by creating a load job. The records can be in Avro, CSV, JSON, ORC, or Parquet format.
  • SQL. The LOAD DATA SQL statement loads data from one or more files into a new or existing table. You can use the LOAD DATA statement to load Avro, CSV, JSON, ORC, or Parquet files.
  • BigQuery Data Transfer Service. Use BigQuery Data Transfer Service to automate loading data from Google Software as a Service (SaaS) apps or from third-party applications and services.
  • BigQuery Storage Write API. The Storage Write API lets you batch-process an arbitrarily large number of records and commit them in a single atomic operation. If the commit operation fails, you can safely retry the operation. Unlike BigQuery load jobs, the Storage Write API does not require staging the data to intermediate storage such as Cloud Storage.
  • Other managed services. Use other managed services to export data from an external data store and import it into BigQuery. For example, you can load data from Firestore exports.

When choosing a batch load method, use a load job or the LOAD DATA SQL statement for most file-based patterns. For data held in other services, generally use BigQuery Data Transfer Service or export the data from the Google service.
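
As an illustration of the load-job option, the following minimal sketch uses the google-cloud-bigquery Python client to batch load Parquet files from Cloud Storage; the project, dataset, table, and bucket names are hypothetical placeholders:

    from google.cloud import bigquery

    client = bigquery.Client()

    # Hypothetical destination table and source bucket.
    table_id = "my-project.my_dataset.events"
    uri = "gs://my-bucket/events/*.parquet"

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.PARQUET,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )

    load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
    load_job.result()  # Waits for the load job to complete.

    table = client.get_table(table_id)
    print(f"Loaded table now has {table.num_rows} rows.")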

Batch loading can be done as a one-time operation or on a recurring schedule. For example, you can do the following:

  • Run BigQuery Data Transfer Service transfers on a schedule.
  • Use an orchestration service such as Cloud Composer to schedule load jobs (see the sketch after this list).
  • Use a cron job to load data on a schedule.
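
For the Cloud Composer option, a scheduling sketch might look like the following Airflow DAG. It assumes the Google provider package that supplies GCSToBigQueryOperator is installed, and the bucket, dataset, and schedule values are hypothetical:

    from datetime import datetime

    from airflow import DAG
    from airflow.providers.google.cloud.transfers.gcs_to_bigquery import (
        GCSToBigQueryOperator,
    )

    with DAG(
        dag_id="nightly_gcs_to_bigquery",
        schedule_interval="0 3 * * *",  # Run every night at 03:00.
        start_date=datetime(2024, 1, 1),
        catchup=False,
    ) as dag:
        load_events = GCSToBigQueryOperator(
            task_id="load_events",
            bucket="my-bucket",
            source_objects=["events/*.csv"],
            destination_project_dataset_table="my-project.my_dataset.events",
            source_format="CSV",
            write_disposition="WRITE_APPEND",
            autodetect=True,
        )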

Streaming

With streaming, you continually send smaller batches of data in real time, so the data is available for querying as it arrives. Options for streaming in BigQuery include the following:

  • Storage Write API. The Storage Write API supports high-throughput streaming ingestion with exactly-once delivery semantics.
  • Dataflow. Use Dataflow with the Apache Beam SDK to set up a streaming pipeline that writes to BigQuery. For more information, see BigQuery I/O connector in the Apache Beam documentation and the Stream from Pub/Sub to BigQuery tutorial.
  • Datastream. Datastream uses BigQuery change data capture functionality and the Storage Write API to replicate data and schema updates from operational databases directly into BigQuery. Follow this quickstart for an example of replicating from a Cloud SQL for PostgreSQL database into BigQuery.
  • BigQuery Connector for SAP. The BigQuery Connector for SAP enables near real-time replication of SAP data directly into BigQuery. For more information, see the BigQuery Connector for SAP planning guide.
  • Pub/Sub. Pub/Sub is a messaging service you can use to coordinate streaming analytics and data integration pipelines. You can use BigQuery subscriptions to write messages directly to an existing BigQuery table.
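
As a sketch of the Pub/Sub option, the following snippet creates a BigQuery subscription that writes messages directly to a table. It assumes a recent google-cloud-pubsub client version, and the project, topic, subscription, and table names are hypothetical:

    from google.cloud import pubsub_v1

    project_id = "my-project"
    topic_id = "events-topic"
    subscription_id = "events-to-bigquery"
    table = "my-project.my_dataset.events"  # Format: project.dataset.table

    publisher = pubsub_v1.PublisherClient()
    subscriber = pubsub_v1.SubscriberClient()
    topic_path = publisher.topic_path(project_id, topic_id)
    subscription_path = subscriber.subscription_path(project_id, subscription_id)

    bigquery_config = pubsub_v1.types.BigQueryConfig(
        table=table,
        write_metadata=True,  # Also write Pub/Sub message metadata columns.
    )

    with subscriber:
        subscription = subscriber.create_subscription(
            request={
                "name": subscription_path,
                "topic": topic_path,
                "bigquery_config": bigquery_config,
            }
        )
        print(f"Created BigQuery subscription: {subscription.name}")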

Generated data

You can use SQL to generate data and store the results in BigQuery. Options for generating data include:

  • Use data manipulation language (DML) statements to perform bulk inserts into an existing table or store query results in a new table.

  • Use a CREATE TABLE ... AS statement to create a new table from a query result.

  • Run a query and save the results to a table. You can append the results to an existing table or write to a new table. For more information, see Writing query results.
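
For example, the following sketch uses the google-cloud-bigquery Python client to create a table from a query result and then append further query results to it by setting a destination table; the dataset, table, and column names are hypothetical:

    from google.cloud import bigquery

    client = bigquery.Client()

    # Create a new table from a query result (CREATE TABLE ... AS SELECT).
    ctas = """
    CREATE TABLE my_dataset.daily_order_counts AS
    SELECT order_date, COUNT(*) AS order_count
    FROM my_dataset.orders
    GROUP BY order_date
    """
    client.query(ctas).result()

    # Append today's results to the existing table by setting a destination.
    job_config = bigquery.QueryJobConfig(
        destination="my-project.my_dataset.daily_order_counts",
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    client.query(
        "SELECT order_date, COUNT(*) AS order_count "
        "FROM my_dataset.orders "
        "WHERE order_date = CURRENT_DATE() "
        "GROUP BY order_date",
        job_config=job_config,
    ).result()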

Third-party applications

Some third-party applications and services provide connectors that can ingest data into BigQuery. The details of how to configure and manage the ingestion pipeline depend on the application. For example, to load data from external sources to BigQuery's storage, you can use Informatica Data Loader or Fivetran Data Pipelines. For more information, see Load data using a third-party application.

Choosing a data ingestion method

Consider the following factors when you choose a data ingestion method.

Data source. The source of the data or the data format can determine whether batch loading or streaming is simpler to implement and maintain. Consider the following points:

  • If the BigQuery Data Transfer Service supports the data source, transferring the data directly into BigQuery might be the simplest solution to implement.

  • For data from third-party sources that aren't supported by the BigQuery Data Transfer Service, transform the data into a format supported by batch loading and use that method instead.

  • If your data comes from Spark or Hadoop, consider using BigQuery connectors to simplify data ingestion.

  • For local files, consider batch load jobs, especially if BigQuery supports the file format without requiring a transformation or data cleansing step.

  • For application data such as application events or a log stream, it might be easier to stream the data in real time, rather than implement batch loading.

Slow-changing versus fast-changing data. If you need to ingest and analyze data in near real time, consider streaming the data. With streaming, the data is available for querying as soon as each record arrives. Avoid using DML statements to submit large numbers of individual row updates or insertions. For frequently updated data, it's often better to stream a change log and use a view to obtain the latest results. Another option is to use Cloud SQL as your online transaction processing (OLTP) database and use federated queries to join the data in BigQuery.
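
As a sketch of the change-log pattern mentioned above, the following view exposes only the latest version of each row from a streamed change log; the table and column names are hypothetical:

    from google.cloud import bigquery

    client = bigquery.Client()

    # The view keeps only the most recent change record per customer_id.
    latest_view = """
    CREATE OR REPLACE VIEW my_dataset.customers_latest AS
    SELECT * EXCEPT (row_num)
    FROM (
      SELECT
        *,
        ROW_NUMBER() OVER (
          PARTITION BY customer_id
          ORDER BY change_timestamp DESC
        ) AS row_num
      FROM my_dataset.customer_changes
    )
    WHERE row_num = 1
    """
    client.query(latest_view).result()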

If your source data changes slowly or you don't need continuously updated results, consider using a load job. For example, if you use the data to run a daily or hourly report, load jobs can be less expensive and can use fewer system resources.

Another scenario is data that arrives infrequently or in response to an event. In that case, consider using Dataflow to stream the data or use Cloud Functions to call the streaming API in response to a trigger.

Reliability of the solution. BigQuery has a Service Level Agreement (SLA). However, you also need to consider the reliability of the particular solution that you implement. Consider the following points:

  • With loosely typed formats such as JSON or CSV, bad data can make an entire load job fail. Consider whether you need a data cleansing step before loading, and consider how to respond to errors. Also consider using a strongly typed format such as Avro, ORC, or Parquet.
  • Periodic load jobs require scheduling, using Cloud Composer, cron, or another tool. The scheduling component could be a failure point in the solution.
  • With streaming, you can check the success of each record and quickly report an error. Consider writing failed messages to an unprocessed messages queue for later analysis and processing. For more information about BigQuery streaming errors, see Troubleshooting streaming inserts.
  • Streaming and load jobs are subject to quotas. For information about how to handle quota errors, see Troubleshooting BigQuery quota errors.
  • Third-party solutions might differ in configurability, reliability, ordering guarantees, and other factors, so consider these before adopting a solution.

Latency. Consider how much data you load and how soon you need the data to be available. Streaming offers the lowest latency of data being available for analysis. Periodic load jobs have a higher latency, because new data is only available after each load job finishes.

Load jobs use a shared pool of slots by default. A load job might wait in a pending state until slots are available, especially if you load a very large amount of data. If that creates unacceptable wait times, you can purchase dedicated slots, instead of using the shared slot pool. For more information, see Introduction to Reservations.

Query performance for external data sources might not be as high as query performance for data stored in BigQuery. If minimizing query latency is important, then we recommend loading the data into BigQuery.

Data ingestion format. Choose a data ingestion format based on the following factors:

  • Schema support. Avro, ORC, Parquet, and Firestore exports are self-describing formats. BigQuery creates the table schema automatically based on the source data. For JSON and CSV data, you can provide an explicit schema, or you can use schema auto-detection (see the sketch after this list).

  • Flat data or nested and repeated fields. Avro, CSV, JSON, ORC, and Parquet all support flat data. Avro, JSON, ORC, Parquet, and Firestore exports also support data with nested and repeated fields. Nested and repeated data is useful for expressing hierarchical data. Nested and repeated fields also reduce data duplication when loading the data.

  • Embedded newlines. When you are loading data from JSON files, the rows must be newline delimited. BigQuery expects newline-delimited JSON files to contain a single record per line.

  • Encoding. BigQuery supports UTF-8 encoding for both flat data and nested or repeated data. For CSV files only, BigQuery also supports ISO-8859-1 encoding, and only for flat data.
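
To illustrate the schema options above, the following sketch contrasts providing an explicit schema for CSV with using schema auto-detection for newline-delimited JSON; it uses the google-cloud-bigquery Python client, and the field, bucket, and table names are hypothetical:

    from google.cloud import bigquery

    client = bigquery.Client()

    # CSV load with an explicit schema.
    explicit_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        schema=[
            bigquery.SchemaField("name", "STRING"),
            bigquery.SchemaField("age", "INT64"),
        ],
    )

    # Newline-delimited JSON load, letting BigQuery infer the schema.
    autodetect_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        autodetect=True,
    )

    client.load_table_from_uri(
        "gs://my-bucket/people.csv",
        "my-project.my_dataset.people",
        job_config=explicit_config,
    ).result()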

Load nested and repeated data

You can load data into nested and repeated fields in the following data formats:

  • Avro
  • JSON (newline delimited)
  • ORC
  • Parquet
  • Datastore exports
  • Firestore exports

For information about specifying nested and repeated fields in your schema when you're loading data, see Specifying nested and repeated fields.
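
As a brief sketch, the following schema defines a repeated RECORD field and uses it to load newline-delimited JSON with the google-cloud-bigquery Python client; the field, bucket, and table names are hypothetical:

    from google.cloud import bigquery

    # Each person row can contain zero or more address sub-records.
    schema = [
        bigquery.SchemaField("name", "STRING", mode="REQUIRED"),
        bigquery.SchemaField(
            "addresses",
            "RECORD",
            mode="REPEATED",
            fields=[
                bigquery.SchemaField("city", "STRING"),
                bigquery.SchemaField("postal_code", "STRING"),
            ],
        ),
    ]

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        schema=schema,
    )
    client.load_table_from_uri(
        "gs://my-bucket/people.json",
        "my-project.my_dataset.people",
        job_config=job_config,
    ).result()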

Load data from other Google services

Some Google services export data to BigQuery using scheduled queries, exports, or transfers. For more information about services that support exports to BigQuery, see Load data from Google services.

Other Google services support data exports initiated from BigQuery Data Transfer Service. For more information about services that support exports initiated by BigQuery Data Transfer Service, see BigQuery Data Transfer Service.

Quota

For information about load job and streaming quotas, see Quotas and limits.

Alternatives to loading data

You don't need to load data before running queries in the following situations:

Public datasets
Public datasets are datasets stored in BigQuery and shared with the public. For more information, see BigQuery public datasets.
Shared datasets
You can share datasets stored in BigQuery. If someone has shared a dataset with you, you can run queries on that dataset without loading the data.
External data sources
BigQuery can run queries on certain forms of external data, without loading the data into BigQuery storage. This approach lets you take advantage of the analytic capabilities of BigQuery without moving data that is stored elsewhere. For information about the benefits and limitations of this approach, see external data sources.
Logging files
Cloud Logging provides an option to export log files into BigQuery. See Configure and manage sinks for more information.

Monitor usage of load jobs

You can monitor usage of load jobs in the following two ways:

  • Use Cloud Monitoring. For more information, see BigQuery metrics. Specifically, you can monitor the amount of data and the number of rows uploaded to a specific table. If your load jobs upload data to a specific table, these metrics can serve as a proxy for monitoring load job data usage.

  • Use INFORMATION_SCHEMA.JOBS_BY_PROJECT. You can use the INFORMATION_SCHEMA.JOBS_BY_PROJECT view to get the number of load jobs per table per day.
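
For example, the following sketch queries the INFORMATION_SCHEMA.JOBS_BY_PROJECT view to count load jobs per destination table per day; the `region-us` qualifier and the seven-day window are assumptions you would adjust for your datasets:

    from google.cloud import bigquery

    client = bigquery.Client()

    query = """
    SELECT
      destination_table.table_id AS table_id,
      DATE(creation_time) AS job_date,
      COUNT(*) AS load_job_count
    FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
    WHERE job_type = 'LOAD'
      AND creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
    GROUP BY table_id, job_date
    ORDER BY job_date DESC, load_job_count DESC
    """
    for row in client.query(query):
        print(row.table_id, row.job_date, row.load_job_count)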

Example use case

The following examples explain the methods to use based on your use case and how to use them with other data analytics solutions.

Stream data using Storage Write API

Suppose that there is a pipeline processing event data from endpoint logs. Events are generated continuously and need to be available for querying in BigQuery as soon as possible. Because data freshness is paramount for this use case, the Storage Write API is the best choice to ingest data into BigQuery. A recommended architecture to keep these endpoints lean is to send events to Pub/Sub, from which they are consumed by a streaming Dataflow pipeline that streams directly into BigQuery.
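
A minimal sketch of such a pipeline with the Apache Beam Python SDK is shown below. The subscription, table, and schema are hypothetical, Dataflow-specific options (runner, project, region, temporary locations) are omitted, and the STORAGE_WRITE_API write method assumes a recent Beam SDK version:

    import json

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    SUBSCRIPTION = "projects/my-project/subscriptions/events-sub"
    TABLE = "my-project:my_dataset.events"

    options = PipelineOptions(streaming=True)

    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | "ReadFromPubSub" >> beam.io.ReadFromPubSub(subscription=SUBSCRIPTION)
            | "Parse" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
            | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                TABLE,
                schema="event_id:STRING,event_time:TIMESTAMP,payload:STRING",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
                method=beam.io.WriteToBigQuery.Method.STORAGE_WRITE_API,
            )
        )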

A primary reliability concern for this architecture is how to deal with failing to insert a record into BigQuery. If each record is important and cannot be lost, data needs to be buffered before attempting to insert. In the recommended architecture above, Pub/Sub can play the role of a buffer with its message retention capabilities. The Dataflow pipeline should be configured to retry BigQuery streaming inserts with truncated exponential backoff. After the capacity of Pub/Sub as a buffer is exhausted, for example in the case of prolonged unavailability of BigQuery or a network failure, data needs to be persisted on the client and the client needs a mechanism to resume inserting persisted records once availability is restored. For more information about how to handle this situation, see the Google Pub/Sub Reliability Guide blog post.

Another failure case to handle is that of a poison record. A poison record is either a record rejected by BigQuery because the record fails to insert with a non-retryable error or a record that has not been successfully inserted after the maximum number of retries. Both types of records should be stored in a "dead letter queue" by the Dataflow pipeline for further investigation.

If exactly-once semantics are required, create a write stream of committed type, with record offsets provided by the client. This avoids duplicates, because the write operation is only performed if the offset value matches the next append offset. Not providing an offset means records are appended to the current end of the stream, and retrying a failed append could result in the record appearing more than once in the stream.
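
The following sketch shows this pattern with the google-cloud-bigquery-storage Python client. The project, dataset, and table names are hypothetical, and event_pb2.Event stands in for a protobuf message compiled (with protoc) to mirror the destination table's schema:

    from google.cloud import bigquery_storage_v1
    from google.cloud.bigquery_storage_v1 import types, writer
    from google.protobuf import descriptor_pb2

    from my_protos import event_pb2  # Hypothetical generated protobuf module.

    write_client = bigquery_storage_v1.BigQueryWriteClient()
    parent = write_client.table_path("my-project", "my_dataset", "events")

    # Create an application-managed stream of committed type: rows become
    # visible as soon as an append succeeds.
    write_stream = types.WriteStream()
    write_stream.type_ = types.WriteStream.Type.COMMITTED
    write_stream = write_client.create_write_stream(
        parent=parent, write_stream=write_stream
    )

    # The first request on the connection carries the writer schema.
    proto_descriptor = descriptor_pb2.DescriptorProto()
    event_pb2.Event.DESCRIPTOR.CopyToProto(proto_descriptor)
    request_template = types.AppendRowsRequest()
    request_template.write_stream = write_stream.name
    proto_data = types.AppendRowsRequest.ProtoData()
    proto_data.writer_schema = types.ProtoSchema(proto_descriptor=proto_descriptor)
    request_template.proto_rows = proto_data

    append_stream = writer.AppendRowsStream(write_client, request_template)

    # Append a batch of serialized rows at an explicit offset. Retrying the
    # same request with the same offset cannot produce duplicates.
    proto_rows = types.ProtoRows()
    proto_rows.serialized_rows.append(
        event_pb2.Event(event_id="e-1").SerializeToString()
    )
    request = types.AppendRowsRequest()
    request.offset = 0
    request.proto_rows = types.AppendRowsRequest.ProtoData(rows=proto_rows)

    append_stream.send(request).result()  # Raises on non-retryable errors.
    append_stream.close()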

If exactly-once guarantees are not required, writing to the default stream allows for a higher throughput and also does not count against the quota limit on creating write streams.

Estimate the throughput you need in advance, and ensure that you have adequate quota to serve it.

If your workload is generating or processing data at a very uneven rate, then try to smooth out any load spikes on the client and stream into BigQuery with a constant throughput. This can simplify your capacity planning. If that is not possible, ensure you are prepared to handle 429 (resource exhausted) errors if and when your throughput goes over quota during short spikes.
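
One way to prepare for such spikes, sketched below with the retry helper from google-api-core, is to retry only on 429 RESOURCE_EXHAUSTED errors with exponential backoff. The wrapped function is a hypothetical placeholder built around the Storage Write API sketch above:

    from google.api_core import exceptions, retry

    # Illustrative policy: back off exponentially on quota errors only.
    quota_retry = retry.Retry(
        predicate=retry.if_exception_type(exceptions.ResourceExhausted),
        initial=1.0,     # First delay, in seconds.
        multiplier=2.0,  # Exponential backoff factor.
        maximum=60.0,    # Cap on the delay between attempts.
    )

    @quota_retry
    def append_with_backoff(append_stream, request):
        # Resending the same request (same offset) after a quota error is safe.
        return append_stream.send(request).result()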

Batch data processing

Suppose there is a nightly batch processing pipeline that needs to be completed by a fixed deadline. Data needs to be available by this deadline for further processing by another batch process to generate reports to be sent to a regulator. This use case is common in regulated industries such as finance.

Batch loading of data with load jobs is the right approach for this use case because latency is not a concern provided the deadline can be met. Ensure your Cloud Storage buckets meet the location requirements for loading data into the BigQuery dataset.

The result of a BigQuery load job is atomic; either all records get inserted or none do. As a best practice, when inserting all data in a single load job, create a new table by using the WRITE_TRUNCATE disposition of the JobConfigurationLoad resource. This is important when retrying a failed load job, because the client might not be able to distinguish between a job that actually failed and a failure caused by, for example, an error in communicating the success state back to the client.

Assuming data to be ingested has been successfully copied to Cloud Storage already, retrying with exponential backoff is sufficient to address ingestion failures.
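
A minimal sketch of this retry pattern with the google-cloud-bigquery Python client follows; the bucket and table names are hypothetical, and the backoff loop is deliberately simple:

    import time

    from google.api_core import exceptions
    from google.cloud import bigquery

    client = bigquery.Client()

    # WRITE_TRUNCATE makes the load idempotent: a retried job simply replaces
    # the table contents, so a job whose success state was lost can be rerun.
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.AVRO,
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    )

    for attempt in range(5):
        try:
            job = client.load_table_from_uri(
                "gs://my-bucket/nightly/*.avro",
                "my-project.my_dataset.nightly_positions",
                job_config=job_config,
            )
            job.result()  # Waits for completion; raises if the job failed.
            break
        except exceptions.GoogleAPIError:
            if attempt == 4:
                raise
            time.sleep(2 ** attempt)  # Truncated exponential backoff (sketch).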

A nightly batch job won't hit the default quota of 1,500 loads per table per day, even with retries. When loading data incrementally, the default quota is sufficient to run a load job every 5 minutes and still have unconsumed quota for at least one retry per job on average.

What's next