This document describes how to stream data into BigQuery by using the
legacy tabledata.insertAll
method.
For new projects, we recommend using the
BigQuery Storage Write API instead of the
tabledata.insertAll
method. The Storage Write API has lower
pricing and more robust features, including exactly-once delivery semantics. If
you are migrating an existing project from the tabledata.insertAll
method
to the Storage Write API, we recommend selecting the
default stream. The
tabledata.insertAll
method is still fully supported.
Before you begin
Ensure that you have write access to the dataset that contains your destination table. The table must exist before you begin writing data to it unless you are using template tables. For more information on template tables, see Creating tables automatically using template tables.
Check the quota policy for streaming data.
-
Make sure that billing is enabled for your Google Cloud project.
Grant Identity and Access Management (IAM) roles that give users the necessary permissions to perform each task in this document.
Streaming is not available through the free tier. If you attempt to use streaming without enabling billing, you receive the following error: BigQuery: Streaming insert is not allowed in the free tier.
Required permissions
To stream data into BigQuery, you need the following IAM permissions:
bigquery.tables.updateData
(lets you insert data into the table)bigquery.tables.get
(lets you obtain table metadata)bigquery.datasets.get
(lets you obtain dataset metadata)bigquery.tables.create
(required if you use a template table to create the table automatically)
Each of the following predefined IAM roles includes the permissions that you need in order to stream data into BigQuery:
roles/bigquery.dataEditor
roles/bigquery.dataOwner
roles/bigquery.admin
For more information about IAM roles and permissions in BigQuery, see Predefined roles and permissions.
Stream data into BigQuery
C#
Before trying this sample, follow the C# setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery C# API reference documentation.
To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.
Go
Before trying this sample, follow the Go setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Go API reference documentation.
To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.
Java
Before trying this sample, follow the Java setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Java API reference documentation.
To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.
Node.js
Before trying this sample, follow the Node.js setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Node.js API reference documentation.
To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.
PHP
Before trying this sample, follow the PHP setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery PHP API reference documentation.
To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.
Python
Before trying this sample, follow the Python setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Python API reference documentation.
To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.
Ruby
Before trying this sample, follow the Ruby setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Ruby API reference documentation.
To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.
You don't need to populate the insertID
field when you insert rows.
The following example shows how to avoid sending an insertID
for each row
when streaming.
Java
Before trying this sample, follow the Java setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Java API reference documentation.
To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.
Python
Before trying this sample, follow the Python setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Python API reference documentation.
To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.
Send date and time data
For date and time fields, format the data in the tabledata.insertAll
method as
follows:
Type | Format |
---|---|
DATE |
A string in the form "YYYY-MM-DD" |
DATETIME |
A string in the form "YYYY-MM-DD [HH:MM:SS]" |
TIME |
A string in the form "HH:MM:SS" |
TIMESTAMP |
The number of seconds since 1970-01-01 (the Unix epoch), or a string
in the form "YYYY-MM-DD HH:MM[:SS]" |
Send range data
For fields with type RANGE<T>
, format the data in the tabledata.insertAll
method as a JSON object with two fields, start
and end
.
Missing or NULL values for the start
and end
fields represent unbounded boundaries.
These fields must have the same supported JSON format of type T
, where
T
can be one of DATE
, DATETIME
, and TIMESTAMP
.
In the following example, the f_range_date
field represents a RANGE<DATE>
column in a table. A row is inserted into this column using the
tabledata.insertAll
API.
{
"f_range_date": {
"start": "1970-01-02",
"end": null
}
}
Stream data availability
Data is available for real-time analysis using GoogleSQL
queries immediately after BigQuery successfully acknowledges a
tabledata.insertAll
request.
Recently streamed rows to an ingestion time partitioned table temporarily have a
NULL value for the _PARTITIONTIME
pseudocolumn. For such rows, BigQuery assigns the final non-NULL
value of the PARTITIONTIME
column in the background, typically within a few
minutes. In rare cases, this can take up to 90 minutes.
Some recently streamed rows might not be available for table copy typically for a
few minutes. In rare cases, this can take up to 90 minutes. To see whether data
is available for table copy, check the tables.get
response for a section named
streamingBuffer
.
If the streamingBuffer
section is absent, your data is available for copy.
You can also use the streamingBuffer.oldestEntryTime
field to identify the
age of records in the streaming buffer.
Best effort de-duplication
When you supply insertId
for an inserted row, BigQuery uses this
ID to support best effort de-duplication for up to one minute. That is, if
you stream the same row with the same insertId
more than once within
that time period into the same table, BigQuery might de-duplicate
the multiple occurrences of that row, retaining only one of those occurrences.
The system expects that rows provided with identical insertId
s are also
identical. If two rows have identical insertId
s, it is nondeterministic
which row BigQuery preserves.
De-duplication is generally meant for retry scenarios in a distributed system where there's
no way to determine the state of a streaming insert under certain error
conditions, such as network errors between your system and BigQuery
or internal errors within BigQuery.
If you retry an insert, use the same insertId
for the same set of rows so that
BigQuery can attempt to de-duplicate your data. For more
information, see troubleshooting streaming inserts.
De-duplication offered by BigQuery is best effort, and it should not be relied upon as a mechanism to guarantee the absence of duplicates in your data. Additionally, BigQuery might degrade the quality of best effort de-duplication at any time in order to guarantee higher reliability and availability for your data.
If you have strict de-duplication requirements for your data, Google Cloud Datastore is an alternative service that supports transactions.
Disabling best effort de-duplication
You can disable best effort de-duplication by not populating the insertId
field for each row inserted. This is the recommended way to insert data.
Apache Beam and Dataflow
To disable best effort de-duplication when you use Apache Beam's
BigQuery I/O connector
for Java, use the
ignoreInsertIds()
method.
Manually removing duplicates
To ensure that no duplicate rows exist after you are done streaming, use the following manual process:
- Add the
insertId
as a column in your table schema and include theinsertId
value in the data for each row. - After streaming has stopped, perform the following query to check for
duplicates:
#standardSQL SELECT MAX(count) FROM( SELECT ID_COLUMN, count(*) as count FROM `TABLE_NAME` GROUP BY ID_COLUMN)
If the result is greater than 1, duplicates exist. - To remove duplicates, run the following query. Specify
a destination table, allow large results, and disable result flattening.
#standardSQL SELECT * EXCEPT(row_number) FROM ( SELECT *, ROW_NUMBER() OVER (PARTITION BY ID_COLUMN) row_number FROM `TABLE_NAME`) WHERE row_number = 1
Notes about the duplicate removal query:
- The safer strategy for the duplicate removal query is to target a new table.
Alternatively, you can target the source table with write disposition
WRITE_TRUNCATE
. - The duplicate removal query adds a
row_number
column with the value1
to the end of the table schema. The query uses aSELECT * EXCEPT
statement from GoogleSQL to exclude therow_number
column from the destination table. The#standardSQL
prefix enables GoogleSQL for this query. Alternatively, you can select by specific column names to omit this column. - For querying live data with duplicates removed, you can also create a view over your table using the duplicate removal query. Be aware that query costs against the view are calculated based on the columns selected in your view, which can result in large bytes scanned sizes.
Stream into time-partitioned tables
When you stream data to a time-partitioned table, each partition has a streaming
buffer. The streaming buffer is retained when you perform a load, query, or copy
job that overwrites a partition by setting the writeDisposition
property to
WRITE_TRUNCATE
. If you want to remove the streaming buffer, verify that the
streaming buffer is empty by calling
tables.get
on the partition.
Ingestion-time partitioning
When you stream to an ingestion-time partitioned table, BigQuery infers the destination partition from the current UTC time.
Newly arriving data is temporarily placed in the __UNPARTITIONED__
partition
while in the streaming buffer. When there's enough unpartitioned data,
BigQuery partitions the data into the correct partition. However,
there is no SLA for how long it takes for data to move out of the
__UNPARTITIONED__
partition. A query
can exclude data in the streaming buffer from a query by filtering out the
NULL
values from the __UNPARTITIONED__
partition by using one of the
pseudocolumns (_PARTITIONTIME
or _PARTITIONDATE
depending on your preferred data type).
If you are streaming data into a daily partitioned table, then you can override
the date inference by supplying a partition decorator as part of the insertAll
request. Include the decorator in the tableId
parameter. For example, you can
stream to the partition corresponding to 2021-03-01 for table table1
using the
partition decorator:
table1$20210301
When streaming using a partition decorator, you can stream to partitions within the last 31 days in the past and 16 days in the future relative to the current date, based on current UTC time. To write to partitions for dates outside these allowed bounds, use a load or query job instead, as described in Appending to and overwriting partitioned table data.
Streaming using a partition decorator is only supported for daily partitioned tables. It is not supported for hourly, monthly, or yearly partitioned tables.
For testing, you can use the bq command-line tool
bq insert
CLI command.
For example, the following command streams a single row to a partition for the
date January 1, 2017 ($20170101
) into a partitioned table named
mydataset.mytable
:
echo '{"a":1, "b":2}' | bq insert 'mydataset.mytable$20170101'
Time-unit column partitioning
You can stream data into a table partitioned on a DATE
, DATETIME
, or
TIMESTAMP
column that is between 5 years in the past and 1 year in the future.
Data outside this range is rejected.
When the data is streamed, it is initially placed in the __UNPARTITIONED__
partition. When there's enough unpartitioned data, BigQuery
automatically repartitions the data, placing it into the appropriate partition.
However, there is no SLA for how long it takes for data to move out of the
__UNPARTITIONED__
partition.
- Note: Daily partitions are processed differently than hourly, monthly and yearly partitions. Only data outside of the date range (last 7 days to future 3 days) is extracted to the UNPARTITIONED partition, waiting to be repartitioned. On the other hand, for hourly partitioned table, data is always extracted to the UNPARTITIONED partition, and later repartitioned.
Create tables automatically using template tables
Template tables provide a mechanism to split a logical table into many smaller tables to create smaller sets of data (for example, by user ID). Template tables have a number of limitations described below. Instead, partitioned tables and clustered tables are the recommended ways to achieve this behavior.
To use a template table through the BigQuery API, add a templateSuffix
parameter
to your insertAll
request. For the bq command-line tool, add the template_suffix
flag
to your insert
command. If BigQuery detects a templateSuffix
parameter or the template_suffix
flag, it treats the targeted table as a base
template. It creates a new table that shares the same schema as the targeted
table and has a name that includes the specified suffix:
<targeted_table_name> + <templateSuffix>
By using a template table, you avoid the overhead of creating each table individually and specifying the schema for each table. You need only create a single template, and supply different suffixes so that BigQuery can create the new tables for you. BigQuery places the tables in the same project and dataset.
Tables created by using template tables are usually available within a few seconds. On rare occasions, they may take longer to become available.
Change the template table schema
If you change a template table schema, all tables that are generated subsequently use the updated schema. Previously generated tables are not affected, unless the existing table still has a streaming buffer.
For existing tables that still have a streaming buffer, if you modify the template table schema in a backward compatible way, the schema of those actively streamed generated tables is also updated. However, if you modify the template table schema in a non-backward compatible way, any buffered data that uses the old schema is lost. Also, you cannot stream new data to existing generated tables that use the old, but now incompatible, schema.
After you change a template table schema, wait until the changes have propagated before you try to insert new data or query the generated tables. Requests to insert new fields should succeed within a few minutes. Attempts to query the new fields might require a longer wait of up to 90 minutes.
If you want to change a generated table's schema, do not change the
schema until streaming through the template table has ceased and the generated
table's streaming statistics section is absent from the tables.get()
response,
which indicates that no data is buffered on the table.
Partitioned tables and clustered tables do not suffer from the preceding limitations and are the recommended mechanism.
Template table details
- Template suffix value
- The
templateSuffix
(or--template_suffix
) value must contain only letters (a-z, A-Z), numbers (0-9), or underscores (_). The maximum combined length of the table name and the table suffix is 1024 characters. - Quota
Template tables are subject to streaming quota limitations. Your project can make up to 10 tables per second with template tables, similar to the
tables.insert
API. This quota only applies to tables being created, not to tables being modified.If your application needs to create more than 10 tables per second, we recommend using clustered tables. For example, you can put the high cardinality table ID into the key column of a single clustering table.
- Time to live
The generated table inherits its expiration time from the dataset. As with normal streaming data, generated tables cannot be copied immediately.
- Deduplication
Deduplication only happens between uniform references to a destination table. For example, if you simultaneously stream to a generated table using both template tables and a regular
insertAll
command, no deduplication occurs between rows inserted by template tables and a regularinsertAll
command.- Views
The template table and the generated tables should not be views.
Troubleshoot streaming inserts
The following sections discuss how to troubleshoot errors that occur when you stream data into BigQuery using the legacy streaming API. For more information on how to resolve quota errors for streaming inserts, see Streaming insert quota errors.
Failure HTTP response codes
If you receive a failure HTTP response code such as a network error, there's
no way to tell whether the streaming insert succeeded. If you try to re-send
the request, you might end up with duplicated rows in your table. To help
protect your table against duplication, set the insertId
property when
sending your request. BigQuery uses the insertId
property
for de-duplication.
If you receive a permission error, an invalid table name error, or an exceeded quota error, no rows are inserted and the entire request fails.
Success HTTP response codes
Even if you receive a
success HTTP response code, you'll need to check the
insertErrors
property of the response to determine whether the row insertions
were successful because it's possible that BigQuery was only partially
successful at inserting the rows. You might encounter one of the following scenarios:
- All rows inserted successfully. If the
insertErrors
property is an empty list, all of the rows were inserted successfully. - Some rows inserted successfully. Except in cases where there is a
schema mismatch in any of the rows, rows indicated in the
insertErrors
property are not inserted, and all other rows are inserted successfully. Theerrors
property contains detailed information about why each unsuccessful row failed. Theindex
property indicates the 0-based row index of the request that the error applies to. - None of the rows inserted successfully. If BigQuery encounters a
schema mismatch on individual rows in the request, none of the rows are inserted and an
insertErrors
entry is returned for each row, even the rows that did not have a schema mismatch. Rows that did not have a schema mismatch have an error with thereason
property set tostopped
, and can be re-sent as-is. Rows that failed include detailed information about the schema mismatch. To learn about the supported protocol buffer types for each BigQuery data type, see Data type conversions.
Metadata errors for streaming inserts
Because BigQuery's streaming API is designed for high insertion rates, modifications to the underlying table metadata exhibit are eventually consistent when interacting with the streaming system. Most of the time, metadata changes are propagated within minutes, but during this period API responses might reflect the inconsistent state of the table.
Some scenarios include:
- Schema Changes. Modifying the schema of a table that has recently received streaming inserts can cause responses with schema mismatch errors because the streaming system might not immediately pick up the schema change.
- Table Creation/Deletion. Streaming to a nonexistent table returns a variation of a
notFound
response. A table created in response might not immediately be recognized by subsequent streaming inserts. Similarly, deleting or recreating a table can create a period of time where streaming inserts are effectively delivered to the old table. The streaming inserts might not be present in the new table. - Table Truncation. Truncating a table's data (by using a query job that uses writeDisposition of WRITE_TRUNCATE) can similarly cause subsequent inserts during the consistency period to be dropped.
Missing/Unavailable data
Streaming inserts reside temporarily in the write-optimized storage, which has different availability
characteristics than managed storage. Certain operations in BigQuery do not interact
with the write-optimized storage, such as table copy jobs and API methods like tabledata.list
.
Recent streaming data won't be present in the destination table or output.