This page describes how to configure your BigQuery destination to stream data from a source database using Datastream.
Configure the destination datasets
When you configure datasets for the BigQuery destination, you can select one of the following options:
- Dataset for each schema: The dataset is selected or created in the specified BigQuery location, based on the schema name of the source. As a result, for each schema in the source, Datastream creates a dataset in BigQuery automatically.

  For example, if you have a MySQL source, and this source has a `mydb` database and an `employees` table within the database, then Datastream creates the `mydb` dataset and the `employees` table in BigQuery.

  If you select this option, then Datastream creates the datasets in the project that contains the stream. Although you don't need to create the datasets in the same region as your stream, we recommend keeping all resources for the stream, including the datasets, in the same region for cost and performance optimization.
- Single dataset for all schemas: You can select a BigQuery dataset for the stream. Datastream streams all data into this dataset. For the dataset that you select, Datastream creates all tables as `<schema>_<table>`.

  For example, if you have a MySQL source, and this source has a `mydb` database and an `employees` table within the database, then Datastream creates the `mydb_employees` table in the dataset that you select.
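For example, assuming a Google Cloud project named `myproject` and, for the single dataset option, a destination dataset named `mydataset` (both hypothetical names), the tables from the `mydb` example could be queried as follows:

```sql
-- Dataset for each schema: the table lands in a dataset named after the
-- source schema (mydb), with the original table name (employees).
SELECT * FROM `myproject.mydb.employees` LIMIT 10;

-- Single dataset for all schemas: the table lands in the selected dataset,
-- named <schema>_<table> (here, mydb_employees).
SELECT * FROM `myproject.mydataset.mydb_employees` LIMIT 10;
```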
Write behavior
The maximum event size when you stream data into BigQuery is 20 MB.
When you configure your stream, you can select the way that Datastream writes your change data to BigQuery. For more information, see Configure write mode.
Configure write mode
There are two modes you can use to define how you want your data written to BigQuery:
- Merge: This is the default write mode. When selected, BigQuery reflects the way your data is stored in the source database. This means that Datastream writes all changes to your data to BigQuery, and BigQuery then consolidates the changes with existing data, thus creating final tables that are replicas of the source tables. With Merge mode, no historical record of the change events is kept. For example, if you insert and then update a row, BigQuery only keeps the updated data. If you then delete the row from the source table, BigQuery no longer keeps any record of that row.
- Append-only: The append-only write mode lets you add data to BigQuery as a stream of changes (`INSERT`, `UPDATE-INSERT`, `UPDATE-DELETE`, and `DELETE` events). Use this mode when you need to retain the historical state of your data. To get a better understanding of the append-only write mode, consider the following scenarios:
  - Initial backfill: after the initial backfill, all events are written to BigQuery as `INSERT` type events, with the same timestamp, universally unique identifier (UUID), and change sequence number.
  - Primary key update: when a primary key changes, two rows are written to BigQuery:
    - An `UPDATE-DELETE` row with the original primary key
    - An `UPDATE-INSERT` row with the new primary key
  - Row update: when you update a row, a single `UPDATE-INSERT` row is written to BigQuery.
  - Row deletion: when you delete a row, a single `DELETE` row is written to BigQuery.
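As a sketch of what this change stream looks like in practice, the following query lists every change event recorded for a single source row, using the `datastream_metadata` fields described in the Table metadata section below. The table and column names (`mydataset.mydb_employees`, `employee_id`) are hypothetical, and the metadata fields are referenced in lower case, which BigQuery resolves case-insensitively:

```sql
-- Show the full change history of one row, oldest event first.
SELECT
  employee_id,
  datastream_metadata.change_type AS change_type,
  datastream_metadata.source_timestamp AS source_timestamp,
  datastream_metadata.change_sequence_number AS change_sequence_number
FROM `mydataset.mydb_employees`
WHERE employee_id = 42
ORDER BY datastream_metadata.source_timestamp,
         datastream_metadata.change_sequence_number;
```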
Table metadata
Datastream appends a `STRUCT` column named `datastream_metadata` to each table that's written to the BigQuery destination.
Merge write mode
If a table has a primary key at the source, then the column contains the following fields:
- `UUID`: This field has the `STRING` data type.
- `SOURCE_TIMESTAMP`: This field has the `INTEGER` data type.
If a table doesn't have a primary key, then the column contains an additional field: `IS_DELETED`. This field has the `BOOLEAN` data type, and it indicates whether the data that Datastream streams to the destination is associated with a `DELETE` operation at the source. Tables without primary keys are append-only.
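For example, the following sketch filters out the change rows that record `DELETE` operations. The table name is hypothetical, and the metadata field is referenced in lower case, which BigQuery resolves case-insensitively:

```sql
-- Keep only the rows that don't correspond to a DELETE at the source.
SELECT *
FROM `mydataset.mydb_orders`
WHERE datastream_metadata.is_deleted IS NOT TRUE;
```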
Append-only write mode
The `datastream_metadata` column contains the same fields for tables with and without primary keys:

- `UUID`: This field has the `STRING` data type.
- `SOURCE_TIMESTAMP`: This field has the `INTEGER` data type.
- `CHANGE_SEQUENCE_NUMBER`: This field has the `STRING` data type. It's an internal sequence number used by Datastream for each change event.
- `CHANGE_TYPE`: This field has the `STRING` data type. It indicates the type of the change event: `INSERT`, `UPDATE-INSERT`, `UPDATE-DELETE`, or `DELETE`.
- `SORT_KEYS`: This field contains an array of `STRING` values. You can use the values to sort the change events.
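For example, you can use these fields to reconstruct the current state of the source table from the append-only change stream. The following is a sketch that assumes a hypothetical table `mydataset.mydb_employees` with the primary key column `employee_id`; ordering by `SOURCE_TIMESTAMP` with `CHANGE_SEQUENCE_NUMBER` as a tiebreaker is an assumption of this sketch:

```sql
-- Keep the most recent change event for each primary key, then drop keys
-- whose latest event removed the row.
SELECT * EXCEPT (rn)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY employee_id
      ORDER BY datastream_metadata.source_timestamp DESC,
               datastream_metadata.change_sequence_number DESC
    ) AS rn
  FROM `mydataset.mydb_employees`
)
WHERE rn = 1
  AND datastream_metadata.change_type NOT IN ('DELETE', 'UPDATE-DELETE');
```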
Use BigQuery tables with the `max_staleness` option
As part of near real-time ingestion, Datastream uses BigQuery's built-in support for upsert operations, such as updating, inserting, and deleting data. Upsert operations let you dynamically update the BigQuery destination as rows are added, modified, or deleted. Datastream streams these upsert operations into the destination table using the BigQuery Storage Write API.
Specify data staleness limit
BigQuery applies source modifications in the background on an ongoing basis, or at query run time, according to the configured data staleness limit. When Datastream creates a new table in BigQuery, the table's `max_staleness` option is set according to the current data staleness limit value for the stream.
For more information about using BigQuery tables with the `max_staleness` option, see Table staleness.
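If you want to adjust the staleness limit on a table that Datastream has already created, you can change the option directly with BigQuery DDL. In this sketch, the table name and the 15-minute interval are hypothetical:

```sql
-- Allow query results for this table to be up to 15 minutes stale.
ALTER TABLE `mydataset.mydb_employees`
SET OPTIONS (max_staleness = INTERVAL 15 MINUTE);
```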
Control BigQuery costs
BigQuery costs are charged separately from Datastream. To learn how to control your BigQuery costs, see BigQuery CDC pricing.
Map data types
The following table lists data type conversions from supported source databases to the BigQuery destination.
| Source database | Source data type | BigQuery data type |
|---|---|---|
| MySQL | `BIGINT(size)` | `INT64` |
| MySQL | `BIGINT` (unsigned) | `DECIMAL` |
| MySQL | `BINARY(size)` | `STRING` (hex encoded) |
| MySQL | `BIT(size)` | `INT64` |
| MySQL | `BLOB(size)` | `STRING` (hex encoded) |
| MySQL | `BOOL` | `INT64` |
| MySQL | `CHAR(size)` | `STRING` |
| MySQL | `DATE` | `DATE` |
| MySQL | `DATETIME(fsp)` | `DATETIME` |
| MySQL | `DECIMAL(precision, scale)` | If the precision value is <= 38 and the scale value is <= 9, then `NUMERIC`. Otherwise, `BIGNUMERIC`. |
| MySQL | `DOUBLE(size, d)` | `FLOAT64` |
| MySQL | `ENUM(val1, val2, val3, ...)` | `STRING` |
| MySQL | `FLOAT(precision)` | `FLOAT64` |
| MySQL | `FLOAT(size, d)` | `FLOAT64` |
| MySQL | `INTEGER(size)` | `INT64` |
| MySQL | `INTEGER` (unsigned) | `INT64` |
| MySQL | `JSON` | `JSON` |
| MySQL | `LONGBLOB` | `STRING` (hex encoded) |
| MySQL | `LONGTEXT` | `STRING` |
| MySQL | `MEDIUMBLOB` | `STRING` (hex encoded) |
| MySQL | `MEDIUMINT(size)` | `INT64` |
| MySQL | `MEDIUMTEXT` | `STRING` |
| MySQL | `SET(val1, val2, val3, ...)` | `STRING` |
| MySQL | `SMALLINT(size)` | `INT64` |
| MySQL | `TEXT(size)` | `STRING` |
| MySQL | `TIME(fsp)` | `INTERVAL` |
| MySQL | `TIMESTAMP(fsp)` | `TIMESTAMP` |
| MySQL | `TINYBLOB` | `STRING` (hex encoded) |
| MySQL | `TINYINT(size)` | `INT64` |
| MySQL | `TINYTEXT` | `STRING` |
| MySQL | `VARBINARY(size)` | `STRING` (hex encoded) |
| MySQL | `VARCHAR` | `STRING` |
| MySQL | `YEAR` | `INT64` |
| Oracle | `ANYDATA` | `UNSUPPORTED` |
| Oracle | `BFILE` | `STRING` |
| Oracle | `BINARY DOUBLE` | `FLOAT64` |
| Oracle | `BINARY FLOAT` | `FLOAT64` |
| Oracle | `BLOB` | `BYTES` |
| Oracle | `CHAR` | `STRING` |
| Oracle | `CLOB` | `STRING` |
| Oracle | `DATE` | `DATETIME` |
| Oracle | `DOUBLE PRECISION` | `FLOAT64` |
| Oracle | `FLOAT(p)` | `FLOAT64` |
| Oracle | `INTERVAL DAY TO SECOND` | `UNSUPPORTED` |
| Oracle | `INTERVAL YEAR TO MONTH` | `UNSUPPORTED` |
| Oracle | `LONG`/`LONG RAW` | `STRING` |
| Oracle | `NCHAR` | `STRING` |
| Oracle | `NCLOB` | `STRING` |
| Oracle | `NUMBER(precision, scale>0)` | If 0 < p <= 78, then map to parameterized decimal types. If p >= 79, map to `STRING`. |
| Oracle | `NVARCHAR2` | `STRING` |
| Oracle | `RAW` | `STRING` |
| Oracle | `ROWID` | `STRING` |
| Oracle | `SDO_GEOMETRY` | `UNSUPPORTED` |
| Oracle | `SMALLINT` | `INT64` |
| Oracle | `TIMESTAMP` | `TIMESTAMP` |
| Oracle | `TIMESTAMP WITH TIME ZONE` | `TIMESTAMP` |
| Oracle | `UDT` (user-defined type) | `UNSUPPORTED` |
| Oracle | `UROWID` | `STRING` |
| Oracle | `VARCHAR` | `STRING` |
| Oracle | `VARCHAR2` | `STRING` |
| Oracle | `XMLTYPE` | `UNSUPPORTED` |
| PostgreSQL | `ARRAY` | `JSON` |
| PostgreSQL | `BIGINT` | `INT64` |
| PostgreSQL | `BIT` | `BYTES` |
| PostgreSQL | `BIT_VARYING` | `BYTES` |
| PostgreSQL | `BOOLEAN` | `BOOLEAN` |
| PostgreSQL | `BOX` | `UNSUPPORTED` |
| PostgreSQL | `BYTEA` | `BYTES` |
| PostgreSQL | `CHARACTER` | `STRING` |
| PostgreSQL | `CHARACTER_VARYING` | `STRING` |
| PostgreSQL | `CIDR` | `STRING` |
| PostgreSQL | `CIRCLE` | `UNSUPPORTED` |
| PostgreSQL | `DATE` | `DATE` |
| PostgreSQL | `DOUBLE_PRECISION` | `FLOAT64` |
| PostgreSQL | `ENUM` | `STRING` |
| PostgreSQL | `INET` | `STRING` |
| PostgreSQL | `INTEGER` | `INT64` |
| PostgreSQL | `INTERVAL` | `INTERVAL` |
| PostgreSQL | `JSON` | `JSON` |
| PostgreSQL | `JSONB` | `JSON` |
| PostgreSQL | `LINE` | `UNSUPPORTED` |
| PostgreSQL | `LSEG` | `UNSUPPORTED` |
| PostgreSQL | `MACADDR` | `STRING` |
| PostgreSQL | `MONEY` | `FLOAT64` |
| PostgreSQL | `NUMERIC` | If precision = -1, then `STRING` (BigQuery `NUMERIC` types require fixed precision). Otherwise, `BIGNUMERIC`/`NUMERIC`. For more information, see the Arbitrary precision numbers section in the PostgreSQL documentation. |
| PostgreSQL | `OID` | `INT64` |
| PostgreSQL | `PATH` | `UNSUPPORTED` |
| PostgreSQL | `POINT` | `UNSUPPORTED` |
| PostgreSQL | `POLYGON` | `UNSUPPORTED` |
| PostgreSQL | `REAL` | `FLOAT64` |
| PostgreSQL | `SMALLINT` | `INT64` |
| PostgreSQL | `SMALLSERIAL` | `INT64` |
| PostgreSQL | `SERIAL` | `INT64` |
| PostgreSQL | `TEXT` | `STRING` |
| PostgreSQL | `TIME` | `TIME` |
| PostgreSQL | `TIMESTAMP` | `TIMESTAMP` |
| PostgreSQL | `TIMESTAMP_WITH_TIMEZONE` | `TIMESTAMP` |
| PostgreSQL | `TIME_WITH_TIMEZONE` | `TIME` |
| PostgreSQL | `TSQUERY` | `STRING` |
| PostgreSQL | `TSVECTOR` | `STRING` |
| PostgreSQL | `TXID_SNAPSHOT` | `STRING` |
| PostgreSQL | `UUID` | `STRING` |
| PostgreSQL | `XML` | `STRING` |
| SQL Server | `BIGINT` | `INT64` |
| SQL Server | `BINARY` | `BYTES` |
| SQL Server | `BIT` | `BOOL` |
| SQL Server | `CHAR` | `STRING` |
| SQL Server | `DATE` | `DATE` |
| SQL Server | `DATETIME2` | `DATETIME` |
| SQL Server | `DATETIME` | `DATETIME` |
| SQL Server | `DATETIMEOFFSET` | `TIMESTAMP` |
| SQL Server | `DECIMAL` | `BIGNUMERIC` |
| SQL Server | `FLOAT` | `FLOAT64` |
| SQL Server | `IMAGE` | `BYTES` |
| SQL Server | `INT` | `INT64` |
| SQL Server | `MONEY` | `BIGNUMERIC` |
| SQL Server | `NCHAR` | `STRING` |
| SQL Server | `NTEXT` | `STRING` |
| SQL Server | `NUMERIC` | `BIGNUMERIC` |
| SQL Server | `NVARCHAR` | `STRING` |
| SQL Server | `NVARCHAR(MAX)` | `STRING` |
| SQL Server | `REAL` | `FLOAT64` |
| SQL Server | `SMALLDATETIME` | `DATETIME` |
| SQL Server | `SMALLINT` | `INT64` |
| SQL Server | `SMALLMONEY` | `NUMERIC` |
| SQL Server | `TEXT` | `STRING` |
| SQL Server | `TIME` | `TIME` |
| SQL Server | `TIMESTAMP`/`ROWVERSION` | `BYTES` |
| SQL Server | `TINYINT` | `INT64` |
| SQL Server | `UNIQUEIDENTIFIER` | `STRING` |
| SQL Server | `VARBINARY` | `BYTES` |
| SQL Server | `VARBINARY(MAX)` | `BYTES` |
| SQL Server | `VARCHAR` | `STRING` |
| SQL Server | `VARCHAR(MAX)` | `STRING` |
| SQL Server | `XML` | `STRING` |
Query a PostgreSQL array as a BigQuery array data type
If you prefer to query a PostgreSQL array as a BigQuery `ARRAY` data type, you can convert the `JSON` values to a BigQuery array by using the BigQuery `JSON_VALUE_ARRAY` function:

SELECT ARRAY(SELECT CAST(element AS TYPE) FROM UNNEST(JSON_VALUE_ARRAY(BQ_COLUMN_NAME, '$')) AS element) AS array_col
Replace the following:
- TYPE: the BigQuery type that matches the element type in the PostgreSQL source array. For example, if the source type is an array of `BIGINT` values, then replace TYPE with `INT64`. For more information about how to map the data types, see Map data types.
- BQ_COLUMN_NAME: the name of the relevant column in the BigQuery table.
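For example, for a hypothetical PostgreSQL column `tag_ids` defined as `BIGINT[]` at the source, the conversion could look like this:

```sql
-- Convert the JSON-encoded PostgreSQL array into a BigQuery ARRAY<INT64>.
SELECT
  ARRAY(
    SELECT CAST(element AS INT64)
    FROM UNNEST(JSON_VALUE_ARRAY(tag_ids, '$')) AS element
  ) AS tag_ids_array
FROM `mydataset.public_products`;
```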
There are two exceptions to the way that you convert the values:

- For arrays of `BIT`, `BIT_VARYING`, or `BYTEA` values in the source column, run the following query:

  SELECT ARRAY(SELECT FROM_BASE64(element) FROM UNNEST(JSON_VALUE_ARRAY(BQ_COLUMN_NAME, '$')) AS element) AS array_of_bytes

- For arrays of `JSON` or `JSONB` values in the source column, use the `JSON_QUERY_ARRAY` function:

  SELECT ARRAY(SELECT element FROM UNNEST(JSON_QUERY_ARRAY(BQ_COLUMN_NAME, '$')) AS element) AS array_of_jsons
Known limitations
Known limitations for using BigQuery as a destination include:
- You can only replicate data into a BigQuery dataset that resides in the same Google Cloud project as the Datastream stream.
- By default, Datastream doesn't support adding a primary key to a table that's already replicated to BigQuery without a primary key, or removing a primary key from a table that's replicated to BigQuery with a primary key. If you need to perform such changes, contact Google Support. For information about changing the primary key definition for a source table that already has a primary key, see Diagnose issues.
- Primary keys in BigQuery must be of the following data types:

  - `DATE`
  - `BOOL`
  - `GEOGRAPHY`
  - `INT64`
  - `NUMERIC`
  - `BIGNUMERIC`
  - `STRING`
  - `TIMESTAMP`
  - `DATETIME`

  Tables that contain primary keys of unsupported data types aren't replicated by Datastream.
- BigQuery doesn't support table names with the `.`, `$`, `/`, `@`, or `+` characters. Datastream replaces such characters with underscores when creating destination tables.

  For example, `table.name` in the source database becomes `table_name` in BigQuery.

  For more information on table names in BigQuery, see Table naming.
- BigQuery doesn't support more than four clustering columns. When replicating a table with more than four primary key columns, Datastream uses four primary key columns as the clustering columns.
- Datastream maps out-of-range date and time literals, such as PostgreSQL infinity date types, to the following values:
  - Positive `DATE` to the value of `9999-12-31`
  - Negative `DATE` to the value of `0001-01-01`
  - Positive `TIMESTAMP` to the value of `9999-12-31 23:59:59.999000 UTC`
  - Negative `TIMESTAMP` to the value of `0001-01-01 00:00:00 UTC`
- BigQuery doesn't support streaming to tables that have primary keys of `FLOAT` or `REAL` data types. Such tables aren't replicated.
To learn more about BigQuery date types and ranges, see Data types.