Configure a BigQuery destination

This page describes how to configure your BigQuery destination to stream data from a source database using Datastream.

Configure the destination datasets

When you configure datasets for the BigQuery destination, you can select one of the following options:

  • Dataset for each schema: The dataset is selected or created in the BigQuery location specified, based on the schema name of the source. As a result, for each schema in the source, Datastream creates a dataset in BigQuery automatically.

    If you select this option, then Datastream creates datasets in the project that contains the stream.

    For example, if you have a MySQL source, and this source has a mydb database and an employees table within the database, then Datastream creates the mydb dataset and the employees table in BigQuery.

  • Single dataset for all schemas: You can select a BigQuery dataset for the stream. Datastream streams all data into this dataset. For the dataset that you select, Datastream creates all tables as <schema>_<table>.

    For example, if you have a MySQL source, and this source has a mydb database and an employees table within the database, then Datastream creates the mydb_employees table in the dataset that you select.
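
For example, assuming the MySQL source described above and an illustrative project named myproject, the two layouts would be queried as follows (the project and dataset names are hypothetical):

  -- Dataset for each schema: the employees table lands in a dataset named mydb.
  SELECT * FROM `myproject.mydb.employees`;

  -- Single dataset for all schemas (here, a dataset named mydataset):
  -- the table name is prefixed with the source schema name.
  SELECT * FROM `myproject.mydataset.mydb_employees`;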

Write behavior

Datastream appends a STRUCT column named datastream_metadata to each table that's written to the BigQuery destination.

If a table has a primary key at the source, then the column contains the following fields:

  • UUID: This field has the string data type.
  • SOURCE_TIMESTAMP: This field has the integer data type.

If a table doesn't have a primary key, then the column contains an additional field: IS_DELETED. This field has the boolean data type, and it indicates whether the data that Datastream streams to the destination is associated with a DELETE operation at the source. Tables without primary keys are append-only.
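
For example, for an append-only table you can reconstruct the current state of the source table by keeping only the latest version of each row and filtering out deletes. The following query is a sketch only; the employees table and the employee_id key column are hypothetical, and the field names follow the datastream_metadata schema described above:

  -- Latest version of each row in an append-only table, with deletes removed.
  SELECT * EXCEPT (row_rank)
  FROM (
    SELECT
      *,
      ROW_NUMBER() OVER (
        PARTITION BY employee_id  -- hypothetical business key
        ORDER BY datastream_metadata.source_timestamp DESC
      ) AS row_rank
    FROM `mydataset.employees`
  )
  WHERE row_rank = 1
    AND datastream_metadata.is_deleted = FALSE;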

The maximum event size when you stream data into BigQuery is 20 MB.

Use BigQuery tables with the max_staleness option

As part of near real-time ingestion, Datastream uses BigQuery's built-in support for upsert operations (updates, inserts, and deletes). Upsert operations let you dynamically update the BigQuery destination as rows are added, modified, or deleted. Datastream streams these upsert operations into the destination table using the BigQuery Storage Write API.

Specify data staleness limit

BigQuery applies source modifications in the background on an ongoing basis, or at query run time, according to the configured data staleness limit. When Datastream creates a new table in BigQuery, the table's max_staleness option is set according to the current data staleness limit value for the stream.

For more information about using BigQuery tables with the max_staleness option, see Table staleness.
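
For example, you can change the staleness limit on an existing destination table with a DDL statement such as the following (the table name and interval are illustrative):

  -- Allow query results to be up to 15 minutes stale on this table.
  ALTER TABLE `mydataset.employees`
  SET OPTIONS (max_staleness = INTERVAL 15 MINUTE);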

Control BigQuery costs

BigQuery costs are charged separately from Datastream. To learn how to control your BigQuery costs, see BigQuery CDC pricing.

Map data types

The following table lists data type conversions from supported source databases to the BigQuery destination.


| Source database | Source data type | BigQuery data type |
| --- | --- | --- |
| MySQL | BIGINT(size) | INT64 |
| MySQL | BIGINT (unsigned) | NUMERIC |
| MySQL | BINARY(size) | STRING (hex encoded) |
| MySQL | BIT(size) | INT64 |
| MySQL | BLOB(size) | STRING (hex encoded) |
| MySQL | BOOL | INT64 |
| MySQL | CHAR(size) | STRING |
| MySQL | DATE | DATE |
| MySQL | DATETIME(fsp) | DATETIME |
| MySQL | DECIMAL(precision, scale) | If precision <= 38 and scale <= 9, then NUMERIC; otherwise, BIGNUMERIC |
| MySQL | DOUBLE(size, d) | FLOAT64 |
| MySQL | ENUM(val1, val2, val3, ...) | STRING |
| MySQL | FLOAT(precision) | FLOAT64 |
| MySQL | FLOAT(size, d) | FLOAT64 |
| MySQL | INTEGER(size) | INT64 |
| MySQL | JSON | JSON |
| MySQL | LONGBLOB | STRING (hex encoded) |
| MySQL | LONGTEXT | STRING |
| MySQL | MEDIUMBLOB | STRING (hex encoded) |
| MySQL | MEDIUMINT(size) | INT64 |
| MySQL | MEDIUMTEXT | STRING |
| MySQL | SET(val1, val2, val3, ...) | STRING |
| MySQL | SMALLINT(size) | INT64 |
| MySQL | TEXT(size) | STRING |
| MySQL | TIME(fsp) | INTERVAL |
| MySQL | TIMESTAMP(fsp) | TIMESTAMP |
| MySQL | TINYBLOB | STRING (hex encoded) |
| MySQL | TINYINT(size) | INT64 |
| MySQL | TINYTEXT | STRING |
| MySQL | VARBINARY(size) | STRING (hex encoded) |
| MySQL | VARCHAR | STRING |
| MySQL | YEAR | INT64 |
| Oracle | ANYDATA | UNSUPPORTED |
| Oracle | BFILE | STRING |
| Oracle | BINARY DOUBLE | FLOAT64 |
| Oracle | BINARY FLOAT | FLOAT64 |
| Oracle | BLOB | BYTES |
| Oracle | CHAR | STRING |
| Oracle | CLOB | STRING |
| Oracle | DATE | DATETIME |
| Oracle | DOUBLE PRECISION | FLOAT64 |
| Oracle | FLOAT(p) | FLOAT64 |
| Oracle | INTERVAL DAY TO SECOND | UNSUPPORTED |
| Oracle | INTERVAL YEAR TO MONTH | UNSUPPORTED |
| Oracle | LONG/LONG RAW | STRING |
| Oracle | NCHAR | STRING |
| Oracle | NCLOB | STRING |
| Oracle | NUMBER | STRING |
| Oracle | NUMBER(precision=*) | STRING |
| Oracle | NUMBER(precision, scale<=0) | If p <= 18, then INT64. If 18 < p <= 78, then map to parameterized decimal types. If p >= 79, map to STRING |
| Oracle | NUMBER(precision, scale>0) | If 0 < p <= 78, then map to parameterized decimal types. If p >= 79, map to STRING |
| Oracle | NVARCHAR2 | STRING |
| Oracle | RAW | STRING |
| Oracle | ROWID | STRING |
| Oracle | SDO_GEOMETRY | UNSUPPORTED |
| Oracle | SMALLINT | INT64 |
| Oracle | TIMESTAMP | TIMESTAMP |
| Oracle | TIMESTAMP WITH TIME ZONE | TIMESTAMP |
| Oracle | UDT (user-defined type) | UNSUPPORTED |
| Oracle | UROWID | STRING |
| Oracle | VARCHAR | STRING |
| Oracle | VARCHAR2 | STRING |
| Oracle | XMLTYPE | UNSUPPORTED |
| PostgreSQL | ARRAY | JSON |
| PostgreSQL | BIGINT | INT64 |
| PostgreSQL | BIT | BYTES |
| PostgreSQL | BIT_VARYING | BYTES |
| PostgreSQL | BOOLEAN | BOOLEAN |
| PostgreSQL | BOX | UNSUPPORTED |
| PostgreSQL | BYTEA | BYTES |
| PostgreSQL | CHARACTER | STRING |
| PostgreSQL | CHARACTER_VARYING | STRING |
| PostgreSQL | CIDR | STRING |
| PostgreSQL | CIRCLE | UNSUPPORTED |
| PostgreSQL | DATE | DATE |
| PostgreSQL | DOUBLE_PRECISION | FLOAT64 |
| PostgreSQL | ENUM | STRING |
| PostgreSQL | INET | STRING |
| PostgreSQL | INTEGER | INT64 |
| PostgreSQL | INTERVAL | INTERVAL |
| PostgreSQL | JSON | JSON |
| PostgreSQL | JSONB | JSON |
| PostgreSQL | LINE | UNSUPPORTED |
| PostgreSQL | LSEG | UNSUPPORTED |
| PostgreSQL | MACADDR | STRING |
| PostgreSQL | MONEY | FLOAT64 |
| PostgreSQL | NUMERIC | If precision = -1, then STRING; otherwise, BIGNUMERIC/NUMERIC. For more information, see parameterized decimal types. |
| PostgreSQL | OID | INT64 |
| PostgreSQL | PATH | UNSUPPORTED |
| PostgreSQL | POINT | UNSUPPORTED |
| PostgreSQL | POLYGON | UNSUPPORTED |
| PostgreSQL | REAL | FLOAT64 |
| PostgreSQL | SMALLINT | INT64 |
| PostgreSQL | SMALLSERIAL | INT64 |
| PostgreSQL | SERIAL | INT64 |
| PostgreSQL | TEXT | STRING |
| PostgreSQL | TIME | TIME |
| PostgreSQL | TIMESTAMP | TIMESTAMP |
| PostgreSQL | TIMESTAMP_WITH_TIMEZONE | TIMESTAMP |
| PostgreSQL | TIME_WITH_TIMEZONE | TIME |
| PostgreSQL | TSQUERY | STRING |
| PostgreSQL | TSVECTOR | STRING |
| PostgreSQL | TXID_SNAPSHOT | STRING |
| PostgreSQL | UUID | STRING |
| PostgreSQL | XML | STRING |
| SQL Server | BIGINT | INT64 |
| SQL Server | BINARY | BYTES |
| SQL Server | BIT | BOOL |
| SQL Server | CHAR | STRING |
| SQL Server | DATE | DATE |
| SQL Server | DATETIME2 | DATETIME |
| SQL Server | DATETIME | DATETIME |
| SQL Server | DATETIMEOFFSET | TIMESTAMP |
| SQL Server | DECIMAL | BIGNUMERIC |
| SQL Server | FLOAT | FLOAT64 |
| SQL Server | IMAGE | BYTES |
| SQL Server | INT | INT64 |
| SQL Server | MONEY | BIGNUMERIC |
| SQL Server | NCHAR | STRING |
| SQL Server | NTEXT | STRING |
| SQL Server | NUMERIC | BIGNUMERIC |
| SQL Server | NVARCHAR | STRING |
| SQL Server | NVARCHAR(MAX) | STRING |
| SQL Server | REAL | FLOAT64 |
| SQL Server | SMALLDATETIME | DATETIME |
| SQL Server | SMALLINT | INT64 |
| SQL Server | SMALLMONEY | NUMERIC |
| SQL Server | TEXT | STRING |
| SQL Server | TIME | TIME |
| SQL Server | TIMESTAMP/ROWVERSION | BYTES |
| SQL Server | TINYINT | INT64 |
| SQL Server | UNIQUEIDENTIFIER | STRING |
| SQL Server | VARBINARY | BYTES |
| SQL Server | VARBINARY(MAX) | BYTES |
| SQL Server | VARCHAR | STRING |
| SQL Server | VARCHAR(MAX) | STRING |
| SQL Server | XML | STRING |
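
Some of these mappings may call for a cast at query time. For example, Oracle NUMBER columns without an explicit precision arrive in BigQuery as STRING. The following is a minimal sketch that converts such a column back to a numeric type; the amount column and the mydataset.orders table are hypothetical:

  -- SAFE_CAST returns NULL instead of failing on non-numeric values.
  SELECT SAFE_CAST(amount AS BIGNUMERIC) AS amount_numeric
  FROM `mydataset.orders`;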

Query a PostgreSQL array as a BigQuery array data type

If you prefer to query a PostgreSQL array as a BigQuery ARRAY data type, you can convert the JSON values to a BigQuery array using the BigQuery JSON_VALUE_ARRAY function:

  SELECT ARRAY(SELECT CAST(element AS TYPE) FROM UNNEST(JSON_VALUE_ARRAY(BQ_COLUMN_NAME,'$')) AS element) AS array_col
  

Replace the following:

  • TYPE: the BigQuery type that matches the element type in the PostgreSQL source array. For example, if the source type is an array of BIGINT values, then replace TYPE with INT64.

    For more information about how to map the data types, see Map data types.

  • BQ_COLUMN_NAME: the name of the relevant column in the BigQuery table.
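
For example, if the source column was an array of BIGINT values, the query might look like the following (the column and table names are hypothetical):

  SELECT
    ARRAY(
      SELECT CAST(element AS INT64)
      FROM UNNEST(JSON_VALUE_ARRAY(employee_ids, '$')) AS element
    ) AS employee_ids_array
  FROM `mydataset.teams`;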

There are 2 exceptions to the way that you convert the values:

  • For arrays of BIT, BIT_VARYING or BYTEA values in the source column, run the following query:

    SELECT ARRAY(SELECT FROM_BASE64(element) FROM UNNEST(JSON_VALUE_ARRAY(BQ_COLUMN_NAME,'$')) AS element) AS array_of_bytes
    
  • For arrays of JSON or JSONB values in the source column, use the JSON_QUERY_ARRAY function:

    SELECT ARRAY(SELECT element FROM UNNEST(JSON_QUERY_ARRAY(BQ_COLUMN_NAME,'$')) AS element) AS array_of_jsons
    

Known limitations

Known limitations for using BigQuery as a destination include:

  • You can only replicate data into a BigQuery dataset that resides in the same Google Cloud project as the Datastream stream.
  • If a table in the source database has a primary key (for example, the user ID), and the primary key then changes at the source (for example, to a composite key of the user ID and the date), Datastream can't stream this change into the destination.

    To have Datastream stream the table into the destination with the modified primary key, complete the following steps:

    1. Change the primary key for the source database table.
    2. Check the total latency metric for the stream and wait at least as long as the current latency to ensure that any in-flight events are written to the destination. This allows all events with the original primary key to be successfully streamed.
    3. Pause the stream.
    4. Drop the table in the destination.
    5. Resume the stream.
    6. Trigger backfill on the table.
  • Primary keys in BigQuery must be of the following data types:

    • DATE
    • BOOL
    • GEOGRAPHY
    • INT64
    • NUMERIC
    • BIGNUMERIC
    • STRING
    • TIMESTAMP
    • DATETIME

    Tables that contain primary keys of unsupported data types aren't replicated by Datastream.

  • BigQuery doesn't support table names with ., $, /, @, or + characters. Datastream replaces such characters with underscores when creating destination tables.

    For example, table.name in the source database becomes table_name in BigQuery.

    For more information on table names in BigQuery, see Table naming.

  • BigQuery doesn't support more than four clustering columns. When replicating a table with more than four primary key columns, Datastream uses four primary key columns as the clustering columns.
  • Datastream maps out-of-range date and time literals, such as PostgreSQL infinity date types, to the following values (see the example after this list):
    • Positive DATE to the value of 9999-12-31
    • Negative DATE to the value of 0001-01-01
    • Positive TIMESTAMP to the value of 9999-12-31 23:59:59.999000 UTC
    • Negative TIMESTAMP to the value of 0001-01-01 00:00:00 UTC
  • BigQuery doesn't support streaming to tables whose primary keys use the FLOAT or REAL data types. Such tables aren't replicated.
  • To learn more about BigQuery date types and ranges, see Data types.
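
For example, to find rows whose source DATE value was positive infinity, you can filter on the sentinel value directly (the table and column names are hypothetical):

  -- Rows where the PostgreSQL source stored 'infinity' in a DATE column.
  SELECT *
  FROM `mydataset.events`
  WHERE event_date = DATE '9999-12-31';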