Instead of using a job to load data into BigQuery, you can choose to stream your data into BigQuery one record at a time by using the tabledata().insertAll() method. This approach enables querying data without the delay of running a load job. There are several important trade-offs to consider before choosing an approach.
- Access control
- Quota policy
- Data availability
- Data consistency
- Example use cases
- Troubleshooting streaming inserts
- Streaming insert examples
Access control
Streaming data into BigQuery requires write access to the destination table's dataset.
Quota policy
The following limits apply to streaming data into BigQuery.
- Maximum row size: 1 MB
- HTTP request size limit: 10 MB
- Maximum rows per second: 100,000 rows per second, per table. Exceeding this rate causes exceeded quota errors.
- Maximum rows per request: 500
- Maximum bytes per second: 100 MB per second, per table. Exceeding this rate causes exceeded quota errors.
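Because of the per-request limits, a high-volume producer needs to split its rows into batches before sending them. The helper below is a minimal sketch, not part of any Google client library; it assumes rows are JSON-serializable dicts and approximates request size by serializing each row.

```python
import json

# Documented streaming limits: 500 rows per request, 10 MB per HTTP request.
MAX_ROWS_PER_REQUEST = 500
MAX_REQUEST_BYTES = 10 * 1024 * 1024

def batch_rows(rows, max_rows=MAX_ROWS_PER_REQUEST, max_bytes=MAX_REQUEST_BYTES):
    """Yield lists of rows, each list fitting under both limits."""
    batch, batch_bytes = [], 0
    for row in rows:
        row_bytes = len(json.dumps(row).encode("utf-8"))
        # Start a new batch when adding this row would break either limit.
        if batch and (len(batch) >= max_rows or batch_bytes + row_bytes > max_bytes):
            yield batch
            batch, batch_bytes = [], 0
        batch.append(row)
        batch_bytes += row_bytes
    if batch:
        yield batch
```

Each yielded batch can then be sent as a single insertAll() request. The byte estimate ignores request envelope overhead, so in practice you may want a safety margin below the 10 MB ceiling.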
Data availability
The first time a streaming insert occurs, the streamed data is inaccessible for a warm-up period of up to two minutes. After the warm-up period, all streamed data added during and after the warm-up period is immediately queryable. After several hours of inactivity, the warm-up period occurs again during the next insert.
Data can take up to 90 minutes to become available for copy and export operations.
Data consistency
To help ensure data consistency, you can supply an insertId for each inserted row. BigQuery remembers this ID for at least one minute. If you try to stream the same set of rows within that time period and the insertId property is set, BigQuery uses the insertId property to de-duplicate your data on a best-effort basis.
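The request body for tabledata().insertAll() carries one insertId per row. The sketch below builds that body in plain Python, assuming random UUIDs are acceptable as insertId values (your app could instead derive a deterministic ID from the record itself, so a retried record always carries the same ID).

```python
import uuid

def make_insert_all_body(rows):
    """Build the JSON body for a tabledata().insertAll() request, attaching
    a unique insertId to each row so BigQuery can de-duplicate retries."""
    return {
        "kind": "bigquery#tableDataInsertAllRequest",
        "rows": [
            {"insertId": str(uuid.uuid4()), "json": row}
            for row in rows
        ],
    }
```

Because the insertId values live in the request body, re-sending the exact same body after a failure gives BigQuery what it needs to drop duplicates.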
The primary reason to use the de-duplication process is for retrying inserts, as described in Troubleshooting streaming inserts: in certain error conditions, such as network errors between your system and BigQuery or internal errors within BigQuery, there's no way to determine the state of a streaming insert. In rare instances of regional data center unavailability, data duplication might occur for the data hosted in the region experiencing the disruption. New row insertions would be routed to data centers in another region, but de-duplication against the unavailable data would not be possible.
Example use cases
High volume event logging
If you have an app that collects a large amount of data in real time, streaming inserts can be a good choice. Generally, these types of apps meet the following criteria:
- Not transactional. High volume, continuously appended rows. The app can tolerate a rare possibility that duplication might occur or that data might be temporarily unavailable.
- Aggregate analysis. Queries generally are performed for trend analysis, as opposed to single or narrow record selection.
One example of high volume event logging is event tracking. Suppose you have a mobile app that tracks events. Your app, or mobile servers, could independently record user interactions or system errors and stream them into BigQuery. You could analyze this data to determine overall trends, such as areas of high interaction or problems, and monitor error conditions in real-time.
Real-time dashboards and queries
In certain situations, streaming data into BigQuery enables real-time analysis over transactional data. Since streaming data comes with a possibility of duplicated data, ensure that you have a primary, transactional data store outside of BigQuery.
You can take a few precautions to ensure that you'll be able to perform analysis over transactional data, and also have an up-to-the-second view of your data:
- Create two tables with an identical schema. The first table is for the reconciled data, and the second table is for the real-time, unreconciled data.
- On the client side, maintain a transactional data store for records.
- Send fire-and-forget insertAll() requests for these records. The insertAll() request should specify the real-time, unreconciled table as the destination table.
- At some interval, append the reconciled data from the transactional data store and truncate the unreconciled data table.
- For real-time dashboards and queries, you can select data from both tables. The unreconciled data table might include duplicates or dropped records.
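The last step above can be sketched as a query that unions both tables. The snippet below just assembles the query text in Python; the table names are placeholders for your own reconciled and unreconciled tables, and it assumes both share an identical schema as described in the first step.

```python
def dashboard_query(reconciled_table, unreconciled_table):
    """Combine the reconciled table with the real-time, unreconciled table
    for an up-to-the-second view. Both tables must share the same schema."""
    return (
        f"SELECT * FROM `{reconciled_table}` "
        f"UNION ALL "
        f"SELECT * FROM `{unreconciled_table}`"
    )
```

A dashboard issuing this query sees reconciled history plus the freshest streamed rows, at the cost of possible duplicates or dropped records in the unreconciled portion.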
Troubleshooting streaming inserts
Failure HTTP response codes
If you receive a failure HTTP response code such as a network error, there's no way to tell whether the streaming insert succeeded. If you simply re-send the request, you might end up with duplicated rows in your table. To help protect your table against duplication, set the insertId property when sending your request; BigQuery uses the insertId property for de-duplication.
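Since de-duplication only works when the retried request carries the same insertId values, a retry loop should re-send the same body rather than rebuild it. The sketch below is a generic retry wrapper, not part of any Google client library; `send_request` is a placeholder for whatever performs the actual HTTP call and is assumed to raise on network or server errors.

```python
import time

def insert_with_retry(send_request, request_body, max_attempts=3, base_delay=1.0):
    """Retry a streaming insert on transient failures, re-sending the SAME
    request body (same insertId values) so BigQuery can de-duplicate."""
    for attempt in range(max_attempts):
        try:
            return send_request(request_body)
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the error to the caller
            time.sleep(base_delay * (2 ** attempt))  # simple exponential backoff
```

Building the body once and retrying it unchanged is the whole point: a rebuilt body with fresh insertId values would defeat de-duplication.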
If you receive a permission error, an invalid table name error or an exceeded quota error, no rows are inserted and the entire request fails.
Success HTTP response codes
Even if you receive a success HTTP response code, you'll need to check the insertErrors property of the response to determine whether the row insertions were successful, because it's possible that BigQuery was only partially successful at inserting the rows. If the insertErrors property is an empty list, all of the rows were inserted successfully. Otherwise, except in cases where there was a schema mismatch in any of the rows, the rows indicated in the insertErrors property were not inserted, and all other rows were inserted successfully. The errors property contains detailed information about why each unsuccessful row failed, and the index property indicates the 0-based row index of the request that the error applies to.
If BigQuery encounters a schema mismatch on individual rows in the request, none of the rows are inserted and an insertErrors entry is returned for each row, even the rows that did not have a schema mismatch. Rows that did not have a schema mismatch have an error with the reason property set to stopped, and can be re-sent as-is. Rows that failed include detailed information about the schema mismatch.
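The response-handling rules above can be sketched as a small classifier. This helper is hypothetical (not part of any client library); it assumes the response follows the documented shape, with each insertErrors entry carrying an index and a list of errors with a reason property.

```python
def classify_insert_errors(rows, response):
    """Split the rows of an insertAll() request into (failed, resendable)
    based on the response's insertErrors. Rows whose only error reason is
    "stopped" were skipped because of another row's schema mismatch and
    can be re-sent as-is."""
    failed, resendable = [], []
    for entry in response.get("insertErrors", []):
        row = rows[entry["index"]]  # index is the 0-based row index
        reasons = {err.get("reason") for err in entry.get("errors", [])}
        if reasons == {"stopped"}:
            resendable.append(row)
        else:
            failed.append(row)
    return failed, resendable
```

An empty insertErrors list yields two empty lists, meaning every row was inserted; otherwise the resendable rows can go straight into a follow-up request while the failed rows are inspected for schema problems.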