Streaming data into BigQuery requires write access to the dataset that contains the destination table.
Quota policy
The following limits apply for streaming data into BigQuery.
- Maximum row size: 1 MB
- HTTP request size limit: 10 MB
- Maximum rows per second: 100,000 rows per second, per table, and 1,000,000 rows per second, per project. Exceeding either amount will cause quota_exceeded errors. If your project requires a higher limit, contact support.
- Maximum rows per request: There is no hard limit, but we recommend a maximum of 500 rows. Batching can increase performance and throughput to a point, but at the cost of per-request latency. Too few rows per request and the overhead of each request can make ingestion inefficient. Too many rows per request and the throughput may drop. We recommend using about 500 rows per request, but experimentation with representative data (schema and data sizes) will help you determine the ideal batch size.
- Maximum bytes per second: 100 MB per second, per table. Exceeding this amount will cause quota_exceeded errors.
Streamed data is available for real-time analysis within a few seconds of the first streaming insertion into a table.
Data can take up to 90 minutes to become available for copy and export
operations. To see whether data is available for copy and export, check the
tables.get response for a section named
streamingBuffer. If that section
is absent, your data should be available for copy or export.
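As a minimal sketch, the following check uses the Google APIs Client Library for Python; the project, dataset, and table names are placeholders.

import google.auth
from googleapiclient import discovery

# Build an authenticated BigQuery API client using application default credentials.
credentials, _ = google.auth.default(scopes=["https://www.googleapis.com/auth/bigquery"])
bq = discovery.build("bigquery", "v2", credentials=credentials)

# Fetch the table resource and look for the streamingBuffer section.
table = bq.tables().get(
    projectId="my-project", datasetId="my_dataset", tableId="my_table"
).execute()

if "streamingBuffer" in table:
    print("Recently streamed rows are still buffered; copy and export may be incomplete.")
else:
    print("No streaming buffer reported; the data should be available for copy or export.")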
To help ensure data consistency, you can supply an insertId for each inserted row. BigQuery remembers this ID for at least one minute. If you try to stream
the same set of rows within that time period and the
insertId property is
set, BigQuery uses the
insertId property to de-duplicate your data on a
best effort basis.
You might have to retry an insert because there's no way to determine the state
of a streaming insert under certain error conditions, such as network errors
between your system and BigQuery or internal errors within BigQuery. If you
retry an insert, use the same insertId for the same set of rows so that
BigQuery can attempt to de-duplicate your data. For more information, see
troubleshooting streaming inserts.
In the rare instance of a Google datacenter losing connectivity unexpectedly, automatic deduplication may not be possible.
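The sketch below illustrates this pattern with the Google APIs Client Library for Python; the project, dataset, table, and row contents are placeholders. The insertId values are generated once, outside the retry loop, so a retried request reuses the same IDs.

import uuid

import google.auth
from googleapiclient import discovery

credentials, _ = google.auth.default(scopes=["https://www.googleapis.com/auth/bigquery"])
bq = discovery.build("bigquery", "v2", credentials=credentials)

# Generate the insertIds once so that a retry sends the same IDs for the same rows.
rows = [
    {"insertId": str(uuid.uuid4()), "json": {"user_id": "alice", "event": "login"}},
    {"insertId": str(uuid.uuid4()), "json": {"user_id": "bob", "event": "logout"}},
]

for attempt in range(3):
    try:
        response = bq.tabledata().insertAll(
            projectId="my-project",
            datasetId="my_dataset",
            tableId="my_table",
            body={"rows": rows},
        ).execute()
        break  # The request reached BigQuery; inspect response.get("insertErrors") if present.
    except Exception:  # For example, a network error; the state of the insert is unknown.
        continue       # Retry with the same rows and the same insertIds.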
Streaming data across data locations
You can stream data to datasets in both the US and EU. Data can flow through
machines outside the dataset's location while BigQuery processes the
request. If you are streaming data from a location outside of the
dataset's location, you might experience increased latency and error rates.
Automatic table creation using template tables
A common usage pattern for streaming data into BigQuery is to split a logical table into many smaller tables, either for creating smaller sets of data (e.g., by date or by user ID) or for scalability (e.g., streaming more than the current limit of 100,000 rows per second). To split a table into many smaller tables without adding complex client-side code, use the BigQuery template tables feature to let BigQuery create the tables for you.
To use a template table via the BigQuery API, add a templateSuffix parameter to your
insertAll request. For the bq command line tool, add the
template_suffix flag to your
insert command. If BigQuery detects a
templateSuffix parameter or the
template_suffix flag, it treats the targeted
table as a base template, and creates a new table that shares the same
schema as the targeted table and has a name that includes the specified suffix:
<targeted_table_name> + <templateSuffix>
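For example, the following sketch uses the Google APIs Client Library for Python; the project, dataset, base table name, suffix, and row contents are placeholders. Streaming to the base table mytable with the suffix _20240101 produces a generated table named mytable_20240101.

import google.auth
from googleapiclient import discovery

credentials, _ = google.auth.default(scopes=["https://www.googleapis.com/auth/bigquery"])
bq = discovery.build("bigquery", "v2", credentials=credentials)

# Target the base template table "mytable"; BigQuery creates "mytable_20240101"
# with the same schema if it does not already exist.
response = bq.tabledata().insertAll(
    projectId="my-project",
    datasetId="my_dataset",
    tableId="mytable",
    body={
        "templateSuffix": "_20240101",
        "rows": [{"json": {"user_id": "alice", "event": "login"}}],
    },
).execute()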
By using a template table, you avoid the overhead of creating each table individually and specifying the schema for each table. You need only create a single template, and supply different suffixes so that BigQuery can create the new tables for you. BigQuery places the tables in the same project and dataset. Templates also make it easier to update the schema because you need only update the template table.
Tables created via template tables are usually available within a few seconds. On rare occasions they may take longer to become available.
How to change the template table schema
If you change a template table schema, all subsequently generated tables will use the updated schema. Previously generated tables will not be affected, unless the existing table still has a streaming buffer.
For existing tables that still have a streaming buffer, if you modify the template table schema in a backward compatible way, the schema of those actively streamed generated tables will also be updated. However, if you modify the template table schema in a non-backward compatible way, any buffered data that uses the old schema will be lost. Additionally, you will not be able to stream new data to existing generated tables that use the old, but now incompatible, schema.
After you change a template table schema, wait until the changes have propagated before you try to insert new data or query generated tables. Requests to insert new fields should succeed within a few minutes. Attempts to query the new fields might require a longer wait of up to 90 minutes.
If you want to change a generated table's schema, do not change the
schema until streaming via the template table has ceased and the generated table's streaming statistics section is absent from the
tables.get() response, which indicates that no data is buffered on the table.
Template table details
- Template suffix value: The templateSuffix (or --template_suffix) value must contain only letters (a-z, A-Z), numbers (0-9), or underscores (_). The maximum combined length of the table name and the table suffix is 1024 characters.
- Quota: The same quotas apply to all tables, whether they are based on templates or created manually.
- Time to live: The generated table inherits its expiration time from the dataset. As with normal streaming data, generated tables cannot be copied or exported immediately.
- Deduplication only happens between uniform references to a destination table. For example, if you simultaneously stream to a generated table using both template tables and a regular insertAll command, no deduplication occurs between the rows inserted by the template tables and the rows inserted by the regular insertAll command.
- The template table and the generated tables should not be views.
Example use cases
High volume event logging
If you have an app that collects a large amount of data in real-time, streaming inserts can be a good choice. Generally, these types of apps have the following criteria:
- Not transactional. High volume, continuously appended rows. The app can tolerate a rare possibility that duplication might occur or that data might be temporarily unavailable.
- Aggregate analysis. Queries generally are performed for trend analysis, as opposed to single or narrow record selection.
One example of high volume event logging is event tracking. Suppose you have a mobile app that tracks events. Your app, or mobile servers, could independently record user interactions or system errors and stream them into BigQuery. You could analyze this data to determine overall trends, such as areas of high interaction or problems, and monitor error conditions in real-time.
Real-time dashboards and queries
In certain situations, streaming data into BigQuery enables real-time analysis over transactional data. Since streaming data comes with a possibility of duplicated data, ensure that you have a primary, transactional data store outside of BigQuery.
You can take a few precautions to ensure that you'll be able to perform analysis over transactional data, and also have an up-to-the-second view of your data:
- Create two tables with an identical schema. The first table is for the reconciled data, and the second table is for the real-time, unreconciled data.
- On the client side, maintain a transactional data store for records.
- Fire-and-forget insertAll() requests for these records. The insertAll() request should specify the real-time, unreconciled table as the destination table.
- At some interval, append the reconciled data from the transactional data store and truncate the unreconciled data table.
- For real-time dashboards and queries, you can select data from both tables. The unreconciled data table might include duplicates or dropped records.
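For the last step, a dashboard query can read from both tables. The sketch below uses the Google APIs Client Library for Python with standard SQL; the project, dataset, and table names are placeholders for illustration.

import google.auth
from googleapiclient import discovery

credentials, _ = google.auth.default(scopes=["https://www.googleapis.com/auth/bigquery"])
bq = discovery.build("bigquery", "v2", credentials=credentials)

# Combine the reconciled history with the real-time, unreconciled tail.
query = """
    SELECT * FROM `my-project.my_dataset.orders_reconciled`
    UNION ALL
    SELECT * FROM `my-project.my_dataset.orders_unreconciled`
"""
result = bq.jobs().query(
    projectId="my-project",
    body={"query": query, "useLegacySql": False},
).execute()

for row in result.get("rows", []):
    print(row)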
Manual duplicate removal
You can use the following manual process to ensure that no duplicate rows exist after you are done streaming.
- Add the insertId as a column in your table schema and include the insertId value in the data for each row.
After streaming has stopped, perform the following query to check for duplicates:
SELECT MAX(count) FROM (
  SELECT <id_column>, COUNT(*) AS count
  FROM <table>
  GROUP BY <id_column>
)
If the result is greater than 1, duplicates exist.
To remove duplicates, perform the following query. You should specify a destination table, allow large results, and disable result flattening.
SELECT * FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY <id_column>) row_number
  FROM <table>
)
WHERE row_number = 1
Notes about the duplicate removal query:
- The safer strategy for the duplicate removal query is to target a new table. Alternatively, you can target the source table with the write disposition WRITE_TRUNCATE.
- The duplicate removal query adds a row_number column with the value 1 to the end of the table schema. You can select specific column names to omit this column.
- For querying live data with duplicates removed, you can also create a view over your table using the duplicate removal query. Be aware that query costs against the view will be calculated based on the columns selected in your view, which can result in large bytes scanned sizes.
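A sketch of creating such a view with the Google APIs Client Library for Python follows; the project, dataset, table, view name, and ID column are placeholders, and the view query is written in standard SQL rather than the legacy syntax used above.

import google.auth
from googleapiclient import discovery

credentials, _ = google.auth.default(scopes=["https://www.googleapis.com/auth/bigquery"])
bq = discovery.build("bigquery", "v2", credentials=credentials)

# Define a view that keeps one row per ID value, using the same
# row-numbering technique as the duplicate removal query above.
view_query = """
    SELECT * EXCEPT(row_number) FROM (
      SELECT *, ROW_NUMBER() OVER (PARTITION BY my_id_column) AS row_number
      FROM `my-project.my_dataset.my_table`
    )
    WHERE row_number = 1
"""
bq.tables().insert(
    projectId="my-project",
    datasetId="my_dataset",
    body={
        "tableReference": {
            "projectId": "my-project",
            "datasetId": "my_dataset",
            "tableId": "my_table_deduplicated",
        },
        "view": {"query": view_query, "useLegacySql": False},
    },
).execute()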
How to stream more than 100k rows per second
A single table only supports streaming at the rates listed in the Quota policy section. If you need to stream at a higher rate than is supported by a single table, you can use template tables to shard your logical destination table, as each unique template table suffix can support the full table streaming quota.
For example, if your logical table needs to receive 250,000 rows per second, you can create a base template table and distribute your row insertions evenly among the template suffixes "1", "2", and "3". This makes the effective quota of the logical table roughly three times the single-table limit. If you retry writing a row, take care to write it to the same shard so that the row can still be de-duplicated.
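A sketch of this sharding scheme, again with the Google APIs Client Library for Python: the project, dataset, base table, and row contents are placeholders, and the insertId is hashed so that a retried row always maps to the same suffix.

import hashlib
import uuid

import google.auth
from googleapiclient import discovery

NUM_SHARDS = 3  # suffixes "1", "2", "3"

def shard_suffix(insert_id):
    # Deterministic: the same insertId always maps to the same suffix,
    # so retries go to the same generated table and can be de-duplicated.
    digest = int(hashlib.md5(insert_id.encode("utf-8")).hexdigest(), 16)
    return str(digest % NUM_SHARDS + 1)

credentials, _ = google.auth.default(scopes=["https://www.googleapis.com/auth/bigquery"])
bq = discovery.build("bigquery", "v2", credentials=credentials)

insert_id = str(uuid.uuid4())
bq.tabledata().insertAll(
    projectId="my-project",
    datasetId="my_dataset",
    tableId="events",  # the base template table
    body={
        "templateSuffix": shard_suffix(insert_id),
        "rows": [{"insertId": insert_id, "json": {"event": "click"}}],
    },
).execute()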
Troubleshooting streaming inserts
For information about how to troubleshoot errors during streaming inserts, see Troubleshooting streaming inserts on the Troubleshooting Errors page.
Streaming insert examples
Samples are available using the Google APIs Client Library for Java and the Google APIs Client Library for Python.
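The sketch below uses the Google APIs Client Library for Python; the project, dataset, table, and row contents are placeholders. It streams a small batch of rows and checks the response for per-row insertErrors.

import uuid

import google.auth
from googleapiclient import discovery

credentials, _ = google.auth.default(scopes=["https://www.googleapis.com/auth/bigquery"])
bq = discovery.build("bigquery", "v2", credentials=credentials)

rows = [
    {"insertId": str(uuid.uuid4()), "json": {"user_id": "alice", "score": 97}},
    {"insertId": str(uuid.uuid4()), "json": {"user_id": "bob", "score": 88}},
]

response = bq.tabledata().insertAll(
    projectId="my-project",
    datasetId="my_dataset",
    tableId="scores",
    body={"rows": rows, "skipInvalidRows": False, "ignoreUnknownValues": False},
).execute()

# Each entry in insertErrors names the index of a failed row and the reasons it failed.
for error in response.get("insertErrors", []):
    print("Row {} failed: {}".format(error["index"], error["errors"]))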