Overview
The data hierarchy in Datastream is:
- A stream, which consists of a data source and a destination.
- An object, which is a portion of a stream, such as a table from a specific database.
- An event, which is a single change generated by a specific object, such as a database insert.
Streams, objects, and events have data and metadata associated with them. This data and metadata can be used for different purposes.
About events
Each event consists of three types of data:
- Event data: The change to the data itself from the object originating in the stream source. Every event contains the entirety of the row that changed.
- Generic metadata: Metadata that appears on every event generated by Datastream and that's used for actions such as removing duplicate data in the destination.
- Source-specific metadata: Metadata that appears on every event generated by a specific stream source. This metadata varies by source.
Event data
Event data is the payload of every change from a given object originating from a stream source.
Events are in either the Avro or JSON format. In the Avro format, the event contains the column index and value for each column. Using the column index, the column name and unified type can be retrieved from the schema in the Avro header.
In the JSON format, the event contains the column name and value for each column.
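To make the Avro lookup concrete, here's a minimal Python sketch, assuming the schema from the Avro header has been parsed into a dictionary and that the event exposes (column index, value) pairs as described above. The variable names and structures are illustrative, not Datastream's API.

```python
# Minimal sketch: resolving column names and unified types for an Avro-format
# event. Assumes the Avro header schema is available as a parsed dict and the
# event carries (column_index, value) pairs. Names here are illustrative.

avro_schema = {
    "type": "record",
    "name": "SAMPLE",
    "fields": [
        {"name": "THIS_IS_MY_PK", "type": "string"},
        {"name": "FIELD1", "type": ["null", "string"]},
        {"name": "FIELD2", "type": "string"},
    ],
}

event_columns = [(0, "1231535353"), (1, None), (2, "TLV")]

# The column index points into the schema's ordered field list, which yields
# the column name and its unified type.
fields = avro_schema["fields"]
row = {fields[i]["name"]: value for i, value in event_columns}
types = {fields[i]["name"]: fields[i]["type"] for i, _ in event_columns}

print(row)    # {'THIS_IS_MY_PK': '1231535353', 'FIELD1': None, 'FIELD2': 'TLV'}
print(types)  # {'THIS_IS_MY_PK': 'string', 'FIELD1': ['null', 'string'], 'FIELD2': 'string'}
```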
Event metadata can be used to collect information about the event's origin, to remove duplicate data in the destination, and to order events in the downstream consumer.
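For example, a downstream consumer might use the uuid field to drop duplicates and the source_timestamp field to restore source order. The following is a minimal sketch under that assumption; the helper functions are illustrative, not a Datastream API.

```python
# Minimal sketch: deduplicating and ordering events in a downstream consumer
# using generic event metadata. Assumes events are already parsed into dicts.

def deduplicate(events):
    """Yield each event once, keyed on the Datastream-generated uuid."""
    seen = set()
    for event in events:
        if event["uuid"] not in seen:
            seen.add(event["uuid"])
            yield event

def in_source_order(events):
    """Sort events by when the record changed on the source."""
    return sorted(events, key=lambda e: e["source_timestamp"])

events = [
    {"uuid": "b", "source_timestamp": "2019-11-07T02:17:39"},
    {"uuid": "a", "source_timestamp": "2019-11-07T02:15:39"},
    {"uuid": "b", "source_timestamp": "2019-11-07T02:17:39"},  # duplicate
]
print(in_source_order(deduplicate(events)))  # "a" first, duplicate "b" dropped
```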
The following tables list and describe the fields and data types for generic and source-specific event metadata.
Generic metadata
This metadata is consistent across streams of all types.
Field | Avro type | JSON type | Description |
---|---|---|---|
stream_name | string | string | The unique stream name as defined at creation time. |
read_method | string | string | Indicates whether the data was read from the source using a change data capture (CDC) method, as part of historical backfill, or as part of a supplementation task that's created when a transaction is rolled back during CDC replication. The possible values depend on the source and read method (for example, oracle-cdc-logminer in the event flow below). |
object | string | string | The name used to group different types of events, typically the name of the table or object in the source. |
schema_key | string | string | The unique identifier for the unified schema of the event. |
uuid | string | string | A unique identifier for the event, generated by Datastream. |
read_timestamp | timestamp-millis | string | The timestamp (UTC) when the record was read by Datastream (the epoch timestamp in milliseconds). |
source_timestamp | timestamp-millis | string | The timestamp (UTC) when the record changed on the source (the epoch timestamp in milliseconds). |
sort_keys | {"type": "array", "items": ["string", "long"]} | array | An array of values that can be used to sort events in the order in which they occurred. |
Source-specific metadata
This metadata is associated with CDC and backfill events from a source database. The following table lists the metadata fields for each source.
Source | Field | Avro type | JSON type | Description |
---|---|---|---|---|
MySQL | log_file | string | string | The log file that Datastream pulls events from in CDC replication. |
MySQL | log_position | long | long | The log position (offset) in the MySQL binary log. |
MySQL | primary_keys | string array | string array | The list of (one or more) column names that make up the table's primary key. If the table doesn't have a primary key, then this field is empty. |
MySQL | is_deleted | boolean | boolean | Indicates whether the row was deleted at the source (true for DELETE events). |
MySQL | database | string | string | The database associated with the event. |
MySQL | table | string | string | The table associated with the event. |
MySQL | change_type | string | string | The type of change (INSERT, UPDATE, DELETE) that the event represents. |
Oracle | log_file | string | string | The log file that Datastream pulls events from in CDC replication. |
Oracle | scn | long | long | The log position (offset) in the Oracle transaction log. |
Oracle | row_id | string | string | Oracle's row_id. |
Oracle | is_deleted | boolean | boolean | Indicates whether the row was deleted at the source (true for DELETE events). |
Oracle | database | string | string | The database associated with the event. |
Oracle | schema | string | string | The schema associated with the table from the event. |
Oracle | table | string | string | The table associated with the event. |
Oracle | change_type | string | string | The type of change (INSERT, UPDATE, DELETE) that the event represents. |
Oracle | tx_id | string | string | The transaction ID to which the event belongs. |
Oracle | rs_id | string | string | The record set ID. The coupling of rs_id and ssn uniquely identifies a row in V$LOGMNR_CONTENTS. rs_id uniquely identifies the redo record that generated the row. |
Oracle | ssn | long | long | A SQL sequence number. This number is used with rs_id and uniquely identifies a row in V$LOGMNR_CONTENTS. |
PostgreSQL | schema | string | string | The schema associated with the table from the event. |
PostgreSQL | table | string | string | The table associated with the event. |
PostgreSQL | is_deleted | boolean | boolean | Indicates whether the row was deleted at the source (true for DELETE events). |
PostgreSQL | change_type | string | string | The type of change (INSERT, UPDATE, DELETE) that the event represents. |
PostgreSQL | tx_id | string | string | The transaction ID to which the event belongs. |
PostgreSQL | lsn | string | string | The log sequence number (LSN) of the current entry. |
PostgreSQL | primary_keys | string array | string array | The list of (one or more) column names that make up the table's primary key. If the table doesn't have a primary key, then this field is empty. |
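For sources that expose primary_keys (MySQL and PostgreSQL in the table above), a consumer can combine that metadata with the payload to build a stable row identity for merging changes. A minimal, hypothetical sketch:

```python
# Minimal sketch: deriving a row identity from the primary_keys metadata so
# that events for the same row can be merged in the destination. The helper
# is illustrative, not a Datastream API.

def row_key(event):
    pk_columns = event["source_metadata"]["primary_keys"]
    return tuple(event["payload"][column] for column in pk_columns)

event = {
    "source_metadata": {"primary_keys": ["THIS_IS_MY_PK"]},
    "payload": {"THIS_IS_MY_PK": "1231535353", "FIELD1": "foo"},
}
print(row_key(event))  # ('1231535353',)
```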
Example of an event flow
This flow illustrates the events generated by three consecutive operations (INSERT, UPDATE, and DELETE) on a single row in a SAMPLE table in a source database.
TIME | THIS_IS_MY_PK (int) | FIELD1 (nchar nullable) | FIELD2 (nchar non-null) |
---|---|---|---|
0 | 1231535353 | foo | TLV |
1 | 1231535353 | NULL | TLV |
INSERT (T0)
The message payload consists of the entirety of the new row.
```json
{
  "stream_name": "projects/myProj/locations/myLoc/streams/Oracle-to-Source",
  "read_method": "oracle-cdc-logminer",
  "object": "SAMPLE.TBL",
  "uuid": "d7989206-380f-0e81-8056-240501101100",
  "read_timestamp": "2019-11-07T07:37:16.808Z",
  "source_timestamp": "2019-11-07T02:15:39",
  "source_metadata": {
    "log_file": "",
    "scn": 15869116216871,
    "row_id": "AAAPwRAALAAMzMBABD",
    "is_deleted": false,
    "database": "DB1",
    "schema": "ROOT",
    "table": "SAMPLE",
    "change_type": "INSERT",
    "tx_id": "",
    "rs_id": "0x0073c9.000a4e4c.01d0",
    "ssn": 67
  },
  "payload": {
    "THIS_IS_MY_PK": "1231535353",
    "FIELD1": "foo",
    "FIELD2": "TLV"
  }
}
```
UPDATE (T1)
The message payload consists of the entirety of the new row. It doesn't include previous values.
```json
{
  "stream_name": "projects/myProj/locations/myLoc/streams/Oracle-to-Source",
  "read_method": "oracle-cdc-logminer",
  "object": "SAMPLE.TBL",
  "uuid": "e6067366-1efc-0a10-a084-0d8701101101",
  "read_timestamp": "2019-11-07T07:37:18.808Z",
  "source_timestamp": "2019-11-07T02:17:39",
  "source_metadata": {
    "log_file": "",
    "scn": 15869150473224,
    "row_id": "AAAGYPAATAAPIC5AAB",
    "is_deleted": false,
    "database": "DB1",
    "schema": "ROOT",
    "table": "SAMPLE",
    "change_type": "UPDATE",
    "tx_id": "",
    "rs_id": "0x006cf4.00056b26.0010",
    "ssn": 0
  },
  "payload": {
    "THIS_IS_MY_PK": "1231535353",
    "FIELD1": null,
    "FIELD2": "TLV"
  }
}
```
DELETE (T2)
The message payload consists of the entirety of the deleted row (its most recent values), and is_deleted in the source metadata is set to true.
```json
{
  "stream_name": "projects/myProj/locations/myLoc/streams/Oracle-to-Source",
  "read_method": "oracle-cdc-logminer",
  "object": "SAMPLE.TBL",
  "uuid": "c504f4bc-0ffc-4a1a-84df-6aba382fa651",
  "read_timestamp": "2019-11-07T07:37:20.808Z",
  "source_timestamp": "2019-11-07T02:19:39",
  "source_metadata": {
    "log_file": "",
    "scn": 158691504732555,
    "row_id": "AAAGYPAATAAPIC5AAC",
    "is_deleted": true,
    "database": "DB1",
    "schema": "ROOT",
    "table": "SAMPLE",
    "change_type": "DELETE",
    "tx_id": "",
    "rs_id": "0x006cf4.00056b26.0011",
    "ssn": 0
  },
  "payload": {
    "THIS_IS_MY_PK": "1231535353",
    "FIELD1": null,
    "FIELD2": "TLV"
  }
}
```
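To make the flow concrete, the following minimal Python sketch replays the three events above into an in-memory table keyed on THIS_IS_MY_PK. A real destination such as BigQuery applies equivalent merge logic; this is an illustration, not Datastream's implementation.

```python
# Minimal sketch: replaying the INSERT, UPDATE, and DELETE events above into
# an in-memory table. Only the fields needed for the merge are shown.

events = [
    {"source_metadata": {"is_deleted": False},  # INSERT (T0)
     "payload": {"THIS_IS_MY_PK": "1231535353", "FIELD1": "foo", "FIELD2": "TLV"}},
    {"source_metadata": {"is_deleted": False},  # UPDATE (T1)
     "payload": {"THIS_IS_MY_PK": "1231535353", "FIELD1": None, "FIELD2": "TLV"}},
    {"source_metadata": {"is_deleted": True},   # DELETE (T2)
     "payload": {"THIS_IS_MY_PK": "1231535353", "FIELD1": None, "FIELD2": "TLV"}},
]

table = {}  # THIS_IS_MY_PK -> latest row values
for event in events:
    key = event["payload"]["THIS_IS_MY_PK"]
    if event["source_metadata"]["is_deleted"]:
        table.pop(key, None)           # DELETE removes the row
    else:
        table[key] = event["payload"]  # INSERT and UPDATE upsert the full row

print(table)  # {} -- the row was inserted, updated, and then deleted
```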
Ordering and consistency
This section covers how Datastream handles ordering and consistency.
Ordering
Datastream doesn't guarantee ordering, but each event contains the full row of data and the timestamp of when the data was written to the source. In BigQuery, out-of-order events are merged into the correct sequence automatically. In Cloud Storage, events from the same point in time can span more than one file.
Out-of-order events occur by design for events that are backfilled as part of the initial backfill of data, which takes place when the stream is started.
Ordering can be inferred on a source-by-source basis.
Source | Description |
---|---|
MySQL | Events that are part of the initial backfill carry a backfill read_method value, and events that are part of the ongoing replication carry a CDC read_method value. The order in which events occurred can be inferred from the combination of the log_file and log_position fields. |
Oracle | Events that are part of the initial backfill carry a backfill read_method value, and events that are part of the ongoing replication carry a CDC read_method value (for example, oracle-cdc-logminer, as in the examples above). The order in which events occurred can be inferred from the combination of the rs_id and ssn fields. |
PostgreSQL | Events that are part of the initial backfill carry a backfill read_method value, and events that are part of the ongoing replication carry a CDC read_method value. The order in which events occurred can be inferred from the lsn field. |
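As a sketch of how a consumer might apply this table, the following hypothetical Python helper derives a per-source ordering key from source_metadata. The field choices follow the metadata tables on this page; the function itself is illustrative, not part of Datastream.

```python
# Minimal sketch: deriving a per-source ordering key from source_metadata,
# following the table above. Illustrative only.

def ordering_key(source, source_metadata):
    if source == "mysql":
        return (source_metadata["log_file"], source_metadata["log_position"])
    if source == "oracle":
        return (source_metadata["rs_id"], source_metadata["ssn"])
    if source == "postgresql":
        return source_metadata["lsn"]
    raise ValueError(f"unknown source: {source}")

# Usage: events.sort(key=lambda e: ordering_key("oracle", e["source_metadata"]))
```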