This page describes how to configure your BigQuery destination to stream data from a source database using Datastream.
Configure the destination datasets
When you configure datasets for the BigQuery destination, you can select one of the following options:

- Dataset for each schema: The dataset is selected or created in the specified BigQuery location, based on the schema name of the source. As a result, for each schema in the source, Datastream creates a dataset in BigQuery automatically. If you select this option, then Datastream creates the datasets in the project that contains the stream.

  For example, if you have a MySQL source, and this source has a `mydb` database and an `employees` table within the database, then Datastream creates the `mydb` dataset and `employees` table in BigQuery.

- Single dataset for all schemas: You can select a BigQuery dataset for the stream. Datastream streams all data into this dataset. For the dataset that you select, Datastream creates all tables as `<schema>_<table>`.

  For example, if you have a MySQL source, and this source has a `mydb` database and an `employees` table within the database, then Datastream creates the `mydb_employees` table in the dataset that you select. The query sketch after this list shows how table references differ between the two layouts.
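To see the difference in practice, here's a minimal sketch of how you'd reference the resulting tables in each layout. The project ID (`my-project`) and the single-dataset name (`datastream_data`) are hypothetical; `mydb` and `employees` are the example source names used above.

```sql
-- Dataset-per-schema layout: the table keeps its source name.
SELECT * FROM `my-project.mydb.employees` LIMIT 10;

-- Single-dataset layout: the table name is prefixed with the source schema.
SELECT * FROM `my-project.datastream_data.mydb_employees` LIMIT 10;
```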
Write behavior
The maximum event size when you stream data into BigQuery is 20 MB.
When you configure your stream, you can select the way that Datastream writes your change data to BigQuery. For more information, see Configure write mode.
Configure write mode
There are two modes you can use to define how you want your data written to BigQuery:

- Merge: This is the default write mode. When selected, BigQuery reflects the way that your data is stored in the source database. This means that Datastream writes all changes to your data to BigQuery, and BigQuery then consolidates the changes with existing data, thus creating final tables that are replicas of the source tables. With merge mode, no historical record of the change events is kept. For example, if you insert and then update a row, BigQuery only keeps the updated data. If you then delete the row from the source table, BigQuery no longer keeps any record of that row.
- Append-only: The append-only write mode lets you add data to BigQuery as a stream of changes (`INSERT`, `UPDATE-INSERT`, `UPDATE-DELETE`, and `DELETE` events). Use this mode when you need to retain the historical state of your data (see the query sketch after this list). To get a better understanding of the append-only write mode, consider the following scenarios:
  - Initial backfill: After the initial backfill, all events are written to BigQuery as `INSERT` type events, with the same timestamp, universally unique identifier (UUID), and change sequence number.
  - Primary key update: When a primary key changes, two rows are written to BigQuery:
    - An `UPDATE-DELETE` row with the original primary key
    - An `UPDATE-INSERT` row with the new primary key
  - Row update: When you update a row, a single `UPDATE-INSERT` row is written to BigQuery.
  - Row deletion: When you delete a row, a single `DELETE` row is written to BigQuery.
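In append-only mode, each source row's history is visible as ordinary rows in the destination table. As a sketch (all project, table, and column names here are hypothetical), you can inspect the change events for one row like this:

```sql
-- View the full change history of a single row in an append-only table.
SELECT
  employee_id,
  salary,
  datastream_metadata.change_type,
  datastream_metadata.source_timestamp
FROM `my-project.mydb.employees`
WHERE employee_id = 1001
ORDER BY datastream_metadata.change_sequence_number;
```

The `datastream_metadata` fields used here are described in the Table metadata section that follows.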
Table metadata
Datastream appends a `STRUCT` column named `datastream_metadata` to each table that's written to the BigQuery destination.
Merge write mode
If a table has a primary key at the source, then the column contains the following fields:

- `UUID`: This field has the `STRING` data type.
- `SOURCE_TIMESTAMP`: This field has the `INTEGER` data type.

If a table doesn't have a primary key, then the column contains an additional field: `IS_DELETED`. This field has the `BOOLEAN` data type, and it indicates whether the data that Datastream streams to the destination is associated with a `DELETE` operation at the source. Tables without primary keys are append-only.
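For a merge-mode table without a primary key, you typically want to exclude rows that represent deletions at the source. A minimal sketch, assuming a hypothetical `audit_log` table in the `mydb` dataset:

```sql
-- Exclude rows that represent DELETE operations at the source.
SELECT *
FROM `my-project.mydb.audit_log`
WHERE datastream_metadata.is_deleted IS NOT TRUE;
```

Note that `IS NOT TRUE` also keeps rows where the field is `NULL`, unlike `= FALSE`.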
Append-only write mode
The `datastream_metadata` column contains the same fields for tables with and without primary keys:

- `UUID`: This field has the `STRING` data type.
- `SOURCE_TIMESTAMP`: This field has the `INTEGER` data type.
- `CHANGE_SEQUENCE_NUMBER`: This field has the `STRING` data type. It's an internal sequence number used by Datastream for each change event.
- `CHANGE_TYPE`: This field has the `STRING` data type. It indicates the type of the change event: `INSERT`, `UPDATE-INSERT`, `UPDATE-DELETE`, or `DELETE`.
- `SORT_KEYS`: This field contains an array of `STRING` values. You can use the values to sort the change events.
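Combining these fields lets you approximate merge-mode semantics on top of an append-only table. The following is a sketch, not a guaranteed recipe: it assumes hypothetical names (`my-project.mydb.employees`, key column `employee_id`) and assumes that ordering events by `CHANGE_SEQUENCE_NUMBER` is sufficient for your stream:

```sql
-- Reconstruct the approximate current state of the source table:
-- keep only the latest event per key, then drop keys whose latest
-- event is a delete.
SELECT * EXCEPT (rn)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY employee_id
      ORDER BY datastream_metadata.change_sequence_number DESC
    ) AS rn
  FROM `my-project.mydb.employees`
)
WHERE rn = 1
  AND datastream_metadata.change_type NOT IN ('DELETE', 'UPDATE-DELETE');
```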
Use BigQuery tables with the `max_staleness` option
As part of near real-time ingestion, Datastream uses BigQuery's built-in support for upsert operations, such as updating, inserting, and deleting data. Upsert operations let you dynamically update the BigQuery destination as rows are added, modified, or deleted. Datastream streams these upsert operations into the destination table using the BigQuery Storage Write API.
Specify data staleness limit
BigQuery applies source modifications in the background on an ongoing basis, or at query run time, according to the configured data staleness limit. When Datastream creates a new table in BigQuery, the table's `max_staleness` option is set according to the current data staleness limit value for the stream.

For more information about using BigQuery tables with the `max_staleness` option, see Table staleness.
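If you want to change the limit on a table that Datastream has already created, you can set the option directly in BigQuery with standard DDL. A sketch with hypothetical dataset and table names:

```sql
-- Adjust the staleness limit on an existing destination table.
ALTER TABLE `my-project.mydb.employees`
SET OPTIONS (max_staleness = INTERVAL 15 MINUTE);
```

A lower value gives fresher query results but can increase the cost of applying changes; a higher value does the opposite.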
Control BigQuery costs
BigQuery costs are charged separately from Datastream. To learn how to control your BigQuery costs, see BigQuery CDC pricing.
Map data types
The following table lists data type conversions from supported source databases to the BigQuery destination.
| Source database | Source data type | BigQuery data type |
|---|---|---|
| MySQL | `BIGINT(size)` | `LONG` |
| MySQL | `BIGINT` (unsigned) | `DECIMAL` |
| MySQL | `BINARY(size)` | `STRING` (hex encoded) |
| MySQL | `BIT(size)` | `INT64` |
| MySQL | `BLOB(size)` | `STRING` (hex encoded) |
| MySQL | `BOOL` | `INT64` |
| MySQL | `CHAR(size)` | `STRING` |
| MySQL | `DATE` | `DATE` |
| MySQL | `DATETIME(fsp)` | `DATETIME` |
| MySQL | `DECIMAL(precision, scale)` | If the precision value is <=38 and the scale value is <=9, then `NUMERIC`. Otherwise `BIGNUMERIC`. |
| MySQL | `DOUBLE(size, d)` | `FLOAT64` |
| MySQL | `ENUM(val1, val2, val3, ...)` | `STRING` |
| MySQL | `FLOAT(precision)` | `FLOAT64` |
| MySQL | `FLOAT(size, d)` | `FLOAT64` |
| MySQL | `INTEGER(size)` | `INT64` |
| MySQL | `INTEGER` (unsigned) | `LONG` |
| MySQL | `JSON` | `JSON` |
| MySQL | `LONGBLOB` | `STRING` (hex encoded) |
| MySQL | `LONGTEXT` | `STRING` |
| MySQL | `MEDIUMBLOB` | `STRING` (hex encoded) |
| MySQL | `MEDIUMINT(size)` | `INT64` |
| MySQL | `MEDIUMTEXT` | `STRING` |
| MySQL | `SET(val1, val2, val3, ...)` | `STRING` |
| MySQL | `SMALLINT(size)` | `INT64` |
| MySQL | `TEXT(size)` | `STRING` |
| MySQL | `TIME(fsp)` | `INTERVAL` |
| MySQL | `TIMESTAMP(fsp)` | `TIMESTAMP` |
| MySQL | `TINYBLOB` | `STRING` (hex encoded) |
| MySQL | `TINYINT(size)` | `INT64` |
| MySQL | `TINYTEXT` | `STRING` |
| MySQL | `VARBINARY(size)` | `STRING` (hex encoded) |
| MySQL | `VARCHAR` | `STRING` |
| MySQL | `YEAR` | `INT64` |
| Oracle | `ANYDATA` | UNSUPPORTED |
| Oracle | `BFILE` | `STRING` |
| Oracle | `BINARY DOUBLE` | `FLOAT64` |
| Oracle | `BINARY FLOAT` | `FLOAT64` |
| Oracle | `BLOB` | `BYTES` |
| Oracle | `CHAR` | `STRING` |
| Oracle | `CLOB` | `STRING` |
| Oracle | `DATE` | `DATETIME` |
| Oracle | `DOUBLE PRECISION` | `FLOAT64` |
| Oracle | `FLOAT(p)` | `FLOAT64` |
| Oracle | `INTERVAL DAY TO SECOND` | UNSUPPORTED |
| Oracle | `INTERVAL YEAR TO MONTH` | UNSUPPORTED |
| Oracle | `LONG`/`LONG RAW` | `STRING` |
| Oracle | `NCHAR` | `STRING` |
| Oracle | `NCLOB` | `STRING` |
| Oracle | `NUMBER(precision, scale>0)` | If 0 < p <= 78, then map to parameterized decimal types. If p >= 79, map to `STRING`. |
| Oracle | `NVARCHAR2` | `STRING` |
| Oracle | `RAW` | `STRING` |
| Oracle | `ROWID` | `STRING` |
| Oracle | `SDO_GEOMETRY` | UNSUPPORTED |
| Oracle | `SMALLINT` | `INT64` |
| Oracle | `TIMESTAMP` | `TIMESTAMP` |
| Oracle | `TIMESTAMP WITH TIME ZONE` | `TIMESTAMP` |
| Oracle | `UDT` (user-defined type) | UNSUPPORTED |
| Oracle | `UROWID` | `STRING` |
| Oracle | `VARCHAR` | `STRING` |
| Oracle | `VARCHAR2` | `STRING` |
| Oracle | `XMLTYPE` | UNSUPPORTED |
| PostgreSQL | `ARRAY` | `JSON` |
| PostgreSQL | `BIGINT` | `INT64` |
| PostgreSQL | `BIT` | `BYTES` |
| PostgreSQL | `BIT_VARYING` | `BYTES` |
| PostgreSQL | `BOOLEAN` | `BOOLEAN` |
| PostgreSQL | `BOX` | UNSUPPORTED |
| PostgreSQL | `BYTEA` | `BYTES` |
| PostgreSQL | `CHARACTER` | `STRING` |
| PostgreSQL | `CHARACTER_VARYING` | `STRING` |
| PostgreSQL | `CIDR` | `STRING` |
| PostgreSQL | `CIRCLE` | UNSUPPORTED |
| PostgreSQL | `DATE` | `DATE` |
| PostgreSQL | `DOUBLE_PRECISION` | `FLOAT64` |
| PostgreSQL | `ENUM` | `STRING` |
| PostgreSQL | `INET` | `STRING` |
| PostgreSQL | `INTEGER` | `INT64` |
| PostgreSQL | `INTERVAL` | `INTERVAL` |
| PostgreSQL | `JSON` | `JSON` |
| PostgreSQL | `JSONB` | `JSON` |
| PostgreSQL | `LINE` | UNSUPPORTED |
| PostgreSQL | `LSEG` | UNSUPPORTED |
| PostgreSQL | `MACADDR` | `STRING` |
| PostgreSQL | `MONEY` | `FLOAT64` |
| PostgreSQL | `NUMERIC` | If precision = -1, then `STRING` (BigQuery `NUMERIC` types require fixed precision). Otherwise `BIGNUMERIC`/`NUMERIC`. For more information, see the Arbitrary precision numbers section in the PostgreSQL documentation. |
| PostgreSQL | `OID` | `INT64` |
| PostgreSQL | `PATH` | UNSUPPORTED |
| PostgreSQL | `POINT` | UNSUPPORTED |
| PostgreSQL | `POLYGON` | UNSUPPORTED |
| PostgreSQL | `REAL` | `FLOAT64` |
| PostgreSQL | `SMALLINT` | `INT64` |
| PostgreSQL | `SMALLSERIAL` | `INT64` |
| PostgreSQL | `SERIAL` | `INT64` |
| PostgreSQL | `TEXT` | `STRING` |
| PostgreSQL | `TIME` | `TIME` |
| PostgreSQL | `TIMESTAMP` | `TIMESTAMP` |
| PostgreSQL | `TIMESTAMP_WITH_TIMEZONE` | `TIMESTAMP` |
| PostgreSQL | `TIME_WITH_TIMEZONE` | `TIME` |
| PostgreSQL | `TSQUERY` | `STRING` |
| PostgreSQL | `TSVECTOR` | `STRING` |
| PostgreSQL | `TXID_SNAPSHOT` | `STRING` |
| PostgreSQL | `UUID` | `STRING` |
| PostgreSQL | `XML` | `STRING` |
| SQL Server | `BIGINT` | `INT64` |
| SQL Server | `BINARY` | `BYTES` |
| SQL Server | `BIT` | `BOOL` |
| SQL Server | `CHAR` | `STRING` |
| SQL Server | `DATE` | `DATE` |
| SQL Server | `DATETIME2` | `DATETIME` |
| SQL Server | `DATETIME` | `DATETIME` |
| SQL Server | `DATETIMEOFFSET` | `TIMESTAMP` |
| SQL Server | `DECIMAL` | `BIGNUMERIC` |
| SQL Server | `FLOAT` | `FLOAT64` |
| SQL Server | `IMAGE` | `BYTES` |
| SQL Server | `INT` | `INT64` |
| SQL Server | `MONEY` | `BIGNUMERIC` |
| SQL Server | `NCHAR` | `STRING` |
| SQL Server | `NTEXT` | `STRING` |
| SQL Server | `NUMERIC` | `BIGNUMERIC` |
| SQL Server | `NVARCHAR` | `STRING` |
| SQL Server | `NVARCHAR(MAX)` | `STRING` |
| SQL Server | `REAL` | `FLOAT64` |
| SQL Server | `SMALLDATETIME` | `DATETIME` |
| SQL Server | `SMALLINT` | `INT64` |
| SQL Server | `SMALLMONEY` | `NUMERIC` |
| SQL Server | `TEXT` | `STRING` |
| SQL Server | `TIME` | `TIME` |
| SQL Server | `TIMESTAMP`/`ROWVERSION` | `BYTES` |
| SQL Server | `TINYINT` | `INT64` |
| SQL Server | `UNIQUEIDENTIFIER` | `STRING` |
| SQL Server | `VARBINARY` | `BYTES` |
| SQL Server | `VARBINARY(MAX)` | `BYTES` |
| SQL Server | `VARCHAR` | `STRING` |
| SQL Server | `VARCHAR(MAX)` | `STRING` |
| SQL Server | `XML` | `STRING` |
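After a stream starts creating tables, you can confirm how Datastream mapped each column by inspecting the destination table's schema. A sketch using BigQuery's `INFORMATION_SCHEMA` views (project, dataset, and table names are hypothetical):

```sql
-- List the BigQuery column types that Datastream created for a table.
SELECT column_name, data_type
FROM `my-project.mydb.INFORMATION_SCHEMA.COLUMNS`
WHERE table_name = 'employees'
ORDER BY ordinal_position;
```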
Query a PostgreSQL array as a BigQuery array data type
If you prefer to query a PostgreSQL array as a BigQuery `ARRAY` data type, you can convert the `JSON` values to a BigQuery array using the BigQuery `JSON_VALUE_ARRAY` function:
```sql
SELECT ARRAY(
  SELECT CAST(element AS TYPE)
  FROM UNNEST(JSON_VALUE_ARRAY(BQ_COLUMN_NAME, '$')) AS element
) AS array_col
```
Replace the following:

- TYPE: the BigQuery type that matches the element type in the source PostgreSQL array. For example, if the source type is an array of `BIGINT` values, then replace TYPE with `INT64`. For more information about how to map the data types, see Map data types.
- BQ_COLUMN_NAME: the name of the relevant column in the BigQuery table.
There are two exceptions to the way that you convert the values:

- For arrays of `BIT`, `BIT_VARYING`, or `BYTEA` values in the source column, run the following query:

  ```sql
  SELECT ARRAY(
    SELECT FROM_BASE64(element)
    FROM UNNEST(JSON_VALUE_ARRAY(BQ_COLUMN_NAME, '$')) AS element
  ) AS array_of_bytes
  ```

- For arrays of `JSON` or `JSONB` values in the source column, use the `JSON_QUERY_ARRAY` function:

  ```sql
  SELECT ARRAY(
    SELECT element
    FROM UNNEST(JSON_QUERY_ARRAY(BQ_COLUMN_NAME, '$')) AS element
  ) AS array_of_jsons
  ```
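As a concrete instance of the pattern above, suppose a hypothetical `assigned_ids` column that was a PostgreSQL `BIGINT[]` array at the source and is therefore replicated to BigQuery as `JSON`:

```sql
-- Convert a replicated BIGINT[] column (stored as JSON) to ARRAY<INT64>.
SELECT
  employee_id,
  ARRAY(
    SELECT CAST(element AS INT64)
    FROM UNNEST(JSON_VALUE_ARRAY(assigned_ids, '$')) AS element
  ) AS assigned_ids_array
FROM `my-project.mydb.employees`;
```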
Known limitations
Known limitations for using BigQuery as a destination include:
- You can only replicate data into a BigQuery dataset that resides in the same Google Cloud project as the Datastream stream.
- By default, Datastream doesn't support adding a primary key to a table that's already replicated to BigQuery without a primary key, or removing a primary key from a table that's replicated to BigQuery with a primary key. If you need to perform such changes, contact Google Support. For information about changing the primary key definition for a source table that already has a primary key, see Diagnose issues.
- Primary keys in BigQuery must be of the following data types:

  - `DATE`
  - `BOOL`
  - `GEOGRAPHY`
  - `INT64`
  - `NUMERIC`
  - `BIGNUMERIC`
  - `STRING`
  - `TIMESTAMP`
  - `DATETIME`

  Tables that contain primary keys of unsupported data types aren't replicated by Datastream.

- BigQuery doesn't support table names with `.`, `$`, `/`, `@`, or `+` characters. Datastream replaces such characters with underscores when creating destination tables.

  For example, `table.name` in the source database becomes `table_name` in BigQuery.

  For more information on table names in BigQuery, see Table naming.
- BigQuery doesn't support more than four clustering columns. When replicating a table with more than four primary key columns, Datastream uses four primary key columns as the clustering columns.
- Datastream maps out-of-range date and time literals, such as PostgreSQL infinity date types, to the following values:
  - Positive `DATE` to the value of `9999-12-31`
  - Negative `DATE` to the value of `0001-01-01`
  - Positive `TIMESTAMP` to the value of `9999-12-31 23:59:59.999000 UTC`
  - Negative `TIMESTAMP` to the value of `0001-01-01 00:00:00 UTC`
- BigQuery doesn't support streaming to tables that have primary keys of `FLOAT` or `REAL` data types. Such tables aren't replicated.
To learn more about BigQuery date types and ranges, see Data types.
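Because out-of-range values are mapped to fixed sentinels rather than rejected, it can be worth checking for them after a backfill. A sketch with hypothetical table and column names:

```sql
-- Find rows where an out-of-range source date was mapped to a sentinel.
SELECT *
FROM `my-project.mydb.employees`
WHERE hire_date IN (DATE '9999-12-31', DATE '0001-01-01');
```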