Configure a BigQuery destination

This page describes how to configure your BigQuery destination to stream data from a source database using Datastream.

Configure the destination datasets

When you configure datasets for the BigQuery destination, you can select one of the following options:

  • Dataset for each schema: The dataset is selected or created in the BigQuery location specified, based on the schema name of the source. As a result, for each schema in the source, Datastream creates a dataset in BigQuery automatically.

    For example, if you have a MySQL source, and this source has a mydb database and an employees table within the database, then Datastream creates the mydb dataset and the employees table in BigQuery.

    If you select this option, then Datastream creates datasets in the project that contains the stream. Although you don't need to create the datasets in the same region as your stream, we recommend keeping all resources for the stream, including the datasets, in the same region for cost and performance optimization.

  • Single dataset for all schemas: You can select a BigQuery dataset for the stream. Datastream streams all data into this dataset. For the dataset that you select, Datastream creates all tables as <schema>_<table>.

    For example, if you have a MySQL source, and this source has a mydb database and an employees table within the database, then Datastream creates the mydb_employees table in the dataset that you select.
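
To illustrate the difference between the two options, the following is a hedged sketch of how you would address the resulting employees data in a query in each case; the project and dataset names are hypothetical:

  -- Dataset for each schema: the dataset is named after the source schema
  SELECT COUNT(*) FROM `myproject.mydb.employees`

  -- Single dataset for all schemas: tables are prefixed with the schema name
  SELECT COUNT(*) FROM `myproject.datastream_dataset.mydb_employees`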

Write behavior

  • The maximum event size when you stream data into BigQuery is 20 MB.

  • When you configure your stream, you can select the way that Datastream writes your change data to BigQuery. For more information, see Configure write mode.

Configure write mode

There are two modes you can use to define how you want your data written to BigQuery:

  • Merge: This is the default write mode. When selected, your BigQuery tables reflect the way your data is stored in the source database. This means that Datastream writes all changes to your data to BigQuery, and BigQuery then consolidates the changes with existing data, creating final tables that are replicas of the source tables. With Merge mode, no historical record of the change events is kept. For example, if you insert and then update a row, BigQuery only keeps the updated data. If you then delete the row from the source table, BigQuery no longer keeps any record of that row.
  • Append-only: The append-only write mode lets you add data to BigQuery as a stream of changes (INSERT, UPDATE-INSERT, UPDATE-DELETE, and DELETE events). Use this mode when you need to retain the historical state of your data. To get a better understanding of the append-only write mode, consider the following scenarios, and the example query after this list:
    • Initial backfill: after the initial backfill, all events are written to BigQuery as INSERT type events, with the same timestamp, universally unique identifier (UUID), and change sequence number.
    • Primary key update: when a primary key changes, two rows are written to BigQuery:
      • An UPDATE-DELETE row with the original primary key
      • An UPDATE-INSERT row with the new primary key
    • Row update: when you update a row, a single UPDATE-INSERT row is written to BigQuery
    • Row deletion: when you delete a row, a single DELETE row is written to BigQuery
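
As a sketch of how those scenarios surface in the destination, the following query lists the change events that Datastream wrote for a single source row, in the order they occurred. The table and key column names are hypothetical, and the datastream_metadata fields that it references are described in the Table metadata section:

  SELECT
    datastream_metadata.change_type,
    datastream_metadata.source_timestamp,
    datastream_metadata.change_sequence_number
  FROM `myproject.mydataset.mydb_employees`  -- hypothetical append-only table
  WHERE employee_id = 1234                   -- hypothetical primary key column
  ORDER BY
    datastream_metadata.source_timestamp,
    datastream_metadata.change_sequence_number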

Table metadata

Datastream appends a STRUCT column named datastream_metadata to each table that's written to the BigQuery destination.

Merge write mode

If a table has a primary key at the source, then the column contains the following fields:

  • UUID: This field has the STRING data type.
  • SOURCE_TIMESTAMP: This field has the INTEGER data type.

If a table doesn't have a primary key, then the column contains an additional field: IS_DELETED. This field has the BOOLEAN data type, and it indicates whether the data that Datastream streams to the destination is associated with a DELETE operation at the source. Tables without primary keys are append-only.
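
Because such tables are append-only, the rows that represent DELETE operations remain in the destination table. If you want to exclude them when querying, the following is a minimal sketch; the table name is hypothetical, and the field is referenced in lowercase, which GoogleSQL resolves case-insensitively:

  SELECT *
  FROM `myproject.mydataset.mydb_events`  -- hypothetical table without a primary key
  WHERE datastream_metadata.is_deleted IS NOT TRUE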

Append-only write mode

The datastream_metadata column contains the same fields for tables with and without primary keys:

  • UUID: This field has the STRING data type.
  • SOURCE_TIMESTAMP: This field has the INTEGER data type.
  • CHANGE_SEQUENCE_NUMBER: This field has the STRING data type. It's an internal sequence number used by Datastream for each change event.
  • CHANGE_TYPE: This field has the STRING data type. It indicates the type of the change event: INSERT, UPDATE-INSERT, UPDATE-DELETE, or DELETE.
  • SORT_KEYS: This field contains an array of STRING values. You can use the values to sort the change events.
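
For example, if you need the current state of the source table rather than its change history, you can collapse the change stream using these fields. The following is a minimal sketch, with hypothetical table and primary key column names; it assumes that ordering by SOURCE_TIMESTAMP and then CHANGE_SEQUENCE_NUMBER reflects the order of the change events:

  SELECT * EXCEPT (row_num)
  FROM (
    SELECT
      *,
      ROW_NUMBER() OVER (
        PARTITION BY employee_id  -- hypothetical primary key column
        ORDER BY
          datastream_metadata.source_timestamp DESC,
          datastream_metadata.change_sequence_number DESC
      ) AS row_num
    FROM `myproject.mydataset.mydb_employees`  -- hypothetical append-only table
  )
  WHERE row_num = 1
    AND datastream_metadata.change_type NOT IN ('DELETE', 'UPDATE-DELETE')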

Use BigQuery tables with the max_staleness option

As part of near real-time ingestion, Datastream uses BigQuery's built-in support for upsert operations, such as updating, inserting, and deleting data. Upsert operations let you dynamically update the BigQuery destination as rows are added, modified, or deleted. Datastream streams these upsert operations into the destination table using the BigQuery Storage Write API.

Specify data staleness limit

BigQuery applies source modifications in the background on an ongoing basis, or at query run time, according to the configured data staleness limit. When Datastream creates a new table in BigQuery, the table's max_staleness option is set according to the current data staleness limit value for the stream.
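
If you want to change the option on a destination table that already exists, you can use standard BigQuery DDL. The following is a minimal sketch; the table name and staleness interval are hypothetical:

  ALTER TABLE `myproject.mydataset.mydb_employees`  -- hypothetical destination table
  SET OPTIONS (max_staleness = INTERVAL 15 MINUTE)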

For more information about using BigQuery tables with the max_staleness option, see Table staleness.

Control BigQuery costs

BigQuery costs are charged separately from Datastream. To learn how to control your BigQuery costs, see BigQuery CDC pricing.

Map data types

The following table lists data type conversions from supported source databases to the BigQuery destination.


| Source database | Source data type | BigQuery data type |
| --- | --- | --- |
| MySQL | BIGINT(size) | LONG |
| MySQL | BIGINT (unsigned) | DECIMAL |
| MySQL | BINARY(size) | STRING (hex encoded) |
| MySQL | BIT(size) | INT64 |
| MySQL | BLOB(size) | STRING (hex encoded) |
| MySQL | BOOL | INT64 |
| MySQL | CHAR(size) | STRING |
| MySQL | DATE | DATE |
| MySQL | DATETIME(fsp) | DATETIME |
| MySQL | DECIMAL(precision, scale) | If the precision value is <=38, and the scale value is <=9 then NUMERIC. Otherwise BIGNUMERIC |
| MySQL | DOUBLE(size, d) | FLOAT64 |
| MySQL | ENUM(val1, val2, val3, ...) | STRING |
| MySQL | FLOAT(precision) | FLOAT64 |
| MySQL | FLOAT(size, d) | FLOAT64 |
| MySQL | INTEGER(size) | INT64 |
| MySQL | INTEGER (unsigned) | LONG |
| MySQL | JSON | JSON |
| MySQL | LONGBLOB | STRING (hex encoded) |
| MySQL | LONGTEXT | STRING |
| MySQL | MEDIUMBLOB | STRING (hex encoded) |
| MySQL | MEDIUMINT(size) | INT64 |
| MySQL | MEDIUMTEXT | STRING |
| MySQL | SET(val1, val2, val3, ...) | STRING |
| MySQL | SMALLINT(size) | INT64 |
| MySQL | TEXT(size) | STRING |
| MySQL | TIME(fsp) | INTERVAL |
| MySQL | TIMESTAMP(fsp) | TIMESTAMP |
| MySQL | TINYBLOB | STRING (hex encoded) |
| MySQL | TINYINT(size) | INT64 |
| MySQL | TINYTEXT | STRING |
| MySQL | VARBINARY(size) | STRING (hex encoded) |
| MySQL | VARCHAR | STRING |
| MySQL | YEAR | INT64 |
| Oracle | ANYDATA | UNSUPPORTED |
| Oracle | BFILE | STRING |
| Oracle | BINARY DOUBLE | FLOAT64 |
| Oracle | BINARY FLOAT | FLOAT64 |
| Oracle | BLOB | BYTES |
| Oracle | CHAR | STRING |
| Oracle | CLOB | STRING |
| Oracle | DATE | DATETIME |
| Oracle | DOUBLE PRECISION | FLOAT64 |
| Oracle | FLOAT(p) | FLOAT64 |
| Oracle | INTERVAL DAY TO SECOND | UNSUPPORTED |
| Oracle | INTERVAL YEAR TO MONTH | UNSUPPORTED |
| Oracle | LONG/LONG RAW | STRING |
| Oracle | NCHAR | STRING |
| Oracle | NCLOB | STRING |
| Oracle | NUMBER | STRING |
| Oracle | NUMBER(precision=*) | STRING |
| Oracle | NUMBER(precision, scale<=0) | If p<=18, then INT64. If 18<p<=78, then map to parameterized decimal types. If p>=79, map to STRING |
| Oracle | NUMBER(precision, scale>0) | If 0<p<=78, then map to parameterized decimal types. If p>=79, map to STRING |
| Oracle | NVARCHAR2 | STRING |
| Oracle | RAW | STRING |
| Oracle | ROWID | STRING |
| Oracle | SDO_GEOMETRY | UNSUPPORTED |
| Oracle | SMALLINT | INT64 |
| Oracle | TIMESTAMP | TIMESTAMP |
| Oracle | TIMESTAMP WITH TIME ZONE | TIMESTAMP |
| Oracle | UDT (user-defined type) | UNSUPPORTED |
| Oracle | UROWID | STRING |
| Oracle | VARCHAR | STRING |
| Oracle | VARCHAR2 | STRING |
| Oracle | XMLTYPE | UNSUPPORTED |
| PostgreSQL | ARRAY | JSON |
| PostgreSQL | BIGINT | INT64 |
| PostgreSQL | BIT | BYTES |
| PostgreSQL | BIT_VARYING | BYTES |
| PostgreSQL | BOOLEAN | BOOLEAN |
| PostgreSQL | BOX | UNSUPPORTED |
| PostgreSQL | BYTEA | BYTES |
| PostgreSQL | CHARACTER | STRING |
| PostgreSQL | CHARACTER_VARYING | STRING |
| PostgreSQL | CIDR | STRING |
| PostgreSQL | CIRCLE | UNSUPPORTED |
| PostgreSQL | DATE | DATE |
| PostgreSQL | DOUBLE_PRECISION | FLOAT64 |
| PostgreSQL | ENUM | STRING |
| PostgreSQL | INET | STRING |
| PostgreSQL | INTEGER | INT64 |
| PostgreSQL | INTERVAL | INTERVAL |
| PostgreSQL | JSON | JSON |
| PostgreSQL | JSONB | JSON |
| PostgreSQL | LINE | UNSUPPORTED |
| PostgreSQL | LSEG | UNSUPPORTED |
| PostgreSQL | MACADDR | STRING |
| PostgreSQL | MONEY | FLOAT64 |
| PostgreSQL | NUMERIC | If precision = -1, then STRING (BigQuery NUMERIC types require fixed precision). Otherwise BIGNUMERIC/NUMERIC. For more information, see the Arbitrary precision numbers section in the PostgreSQL documentation. |
| PostgreSQL | OID | INT64 |
| PostgreSQL | PATH | UNSUPPORTED |
| PostgreSQL | POINT | UNSUPPORTED |
| PostgreSQL | POLYGON | UNSUPPORTED |
| PostgreSQL | REAL | FLOAT64 |
| PostgreSQL | SMALLINT | INT64 |
| PostgreSQL | SMALLSERIAL | INT64 |
| PostgreSQL | SERIAL | INT64 |
| PostgreSQL | TEXT | STRING |
| PostgreSQL | TIME | TIME |
| PostgreSQL | TIMESTAMP | TIMESTAMP |
| PostgreSQL | TIMESTAMP_WITH_TIMEZONE | TIMESTAMP |
| PostgreSQL | TIME_WITH_TIMEZONE | TIME |
| PostgreSQL | TSQUERY | STRING |
| PostgreSQL | TSVECTOR | STRING |
| PostgreSQL | TXID_SNAPSHOT | STRING |
| PostgreSQL | UUID | STRING |
| PostgreSQL | XML | STRING |
| SQL Server | BIGINT | INT64 |
| SQL Server | BINARY | BYTES |
| SQL Server | BIT | BOOL |
| SQL Server | CHAR | STRING |
| SQL Server | DATE | DATE |
| SQL Server | DATETIME2 | DATETIME |
| SQL Server | DATETIME | DATETIME |
| SQL Server | DATETIMEOFFSET | TIMESTAMP |
| SQL Server | DECIMAL | BIGNUMERIC |
| SQL Server | FLOAT | FLOAT64 |
| SQL Server | IMAGE | BYTES |
| SQL Server | INT | INT64 |
| SQL Server | MONEY | BIGNUMERIC |
| SQL Server | NCHAR | STRING |
| SQL Server | NTEXT | STRING |
| SQL Server | NUMERIC | BIGNUMERIC |
| SQL Server | NVARCHAR | STRING |
| SQL Server | NVARCHAR(MAX) | STRING |
| SQL Server | REAL | FLOAT64 |
| SQL Server | SMALLDATETIME | DATETIME |
| SQL Server | SMALLINT | INT64 |
| SQL Server | SMALLMONEY | NUMERIC |
| SQL Server | TEXT | STRING |
| SQL Server | TIME | TIME |
| SQL Server | TIMESTAMP/ROWVERSION | BYTES |
| SQL Server | TINYINT | INT64 |
| SQL Server | UNIQUEIDENTIFIER | STRING |
| SQL Server | VARBINARY | BYTES |
| SQL Server | VARBINARY(MAX) | BYTES |
| SQL Server | VARCHAR | STRING |
| SQL Server | VARCHAR(MAX) | STRING |
| SQL Server | XML | STRING |

Query a PostgreSQL array as a BigQuery array data type

If you prefer to query a PostgreSQL array as a BigQuery ARRAY data type, you can convert the JSON values to a BigQuery array using the BigQuery JSON_VALUE_ARRAY function:

  SELECT ARRAY(SELECT CAST(element AS TYPE) FROM UNNEST(JSON_VALUE_ARRAY(BQ_COLUMN_NAME,'$')) AS element) AS array_col
  

Replace the following:

  • TYPE: the BigQuery type that matches the element type in the PostgreSQL source array. For example, if the source type is an array of BIGINT values, then replace TYPE with INT64.

    For more information about how to map the data types, see Map data types.

  • BQ_COLUMN_NAME: the name of the relevant column in the BigQuery table.
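
For example, if the source column is a PostgreSQL array of BIGINT values, the query might look like the following; the column and table names are hypothetical:

  SELECT
    ARRAY(
      SELECT CAST(element AS INT64)
      FROM UNNEST(JSON_VALUE_ARRAY(tag_ids, '$')) AS element  -- tag_ids is a hypothetical column
    ) AS tag_ids
  FROM `myproject.mydataset.mydb_products`  -- hypothetical destination table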

There are two exceptions to the way that you convert the values:

  • For arrays of BIT, BIT_VARYING or BYTEA values in the source column, run the following query:

    SELECT ARRAY(SELECT FROM_BASE64(element) FROM UNNEST(JSON_VALUE_ARRAY(BQ_COLUMN_NAME,'$')) AS element) AS array_of_bytes
    
  • For arrays of JSON or JSONB values in the source column, use the JSON_QUERY_ARRAY function:

    SELECT ARRAY(SELECT element FROM UNNEST(JSON_QUERY_ARRAY(BQ_COLUMN_NAME,'$')) AS element) AS array_of_jsons
    

Known limitations

Known limitations for using BigQuery as a destination include:

  • You can only replicate data into a BigQuery dataset that resides in the same Google Cloud project as the Datastream stream.
  • By default, Datastream doesn't support adding a primary key to a table that's already replicated to BigQuery without a primary key, or removing a primary key from a table that's replicated to BigQuery with a primary key. If you need to perform such changes, contact Google Support. For information about changing the primary key definition for a source table that already has a primary key, see Diagnose issues.
  • Primary keys in BigQuery must be of the following data types:

    • DATE
    • BOOL
    • GEOGRAPHY
    • INT64
    • NUMERIC
    • BIGNUMERIC
    • STRING
    • TIMESTAMP
    • DATETIME

    Tables that contain primary keys of unsupported data types aren't replicated by Datastream.

  • BigQuery doesn't support table names with ., $, /, @, or + characters. Datastream replaces such characters with underscores when creating destination tables.

    For example, table.name in the source database becomes table_name in BigQuery.

    For more information on table names in BigQuery, see Table naming.

  • BigQuery doesn't support more than four clustering columns. When replicating a table with more than four primary key columns, Datastream uses four primary key columns as the clustering columns.
  • Datastream maps out-of-range date and time literals such as PostgreSQL infinity date types to the following values:
    • Positive DATE to the value of 9999-12-31
    • Negative DATE to the value of 0001-01-01
    • Positive TIMESTAMP to the value of 9999-12-31 23:59:59.999000 UTC
    • Negative TIMESTAMP to the value of 0001-01-01 00:00:00 UTC
  • BigQuery doesn't support streaming to tables that have primary keys of the FLOAT or REAL data types. Such tables aren't replicated.
  • To learn more about BigQuery date types and ranges, see Data types.

What's next