This section contains information about:
- The behavior of how Datastream handles data that's being pulled from a source MySQL database
- The versions of MySQL database that Datastream supports
- Known limitations for using MySQL database as a source
- An overview of how to setup a source MySQL database so that data can be streamed from it to a destination
Behavior
This section describes the behavior of MySQL sources when you replicate data using Datastream. When you ingest data from MySQL databases, you can use binlog-based replication or global transaction identifier (GTID)-based replication (Preview). You select your CDC method when you create a stream.
Binlog-based replication
Datastream can use binary log files to keep a record of data changes in MySQL databases. The information contained in these log files is then replicated to the destination to reproduce the changes made on the source.
The key characteristics of binlog-based replication in Datastream are:
- All databases or specific databases from a given MySQL source, as well as all tables from the databases or specific tables, can be selected.
- All historical data is replicated.
- All data manipulation language (DML) changes, such as inserts, updates, and deletes from the specified databases and tables, are replicated.
- Only committed changes are replicated.
Global transaction identifier (GTID)-based replication
Datastream also supports global identifier (GTID)-based replication.
Global transaction identifier (GTID) is a unique identifier created and associated with each transaction committed on a MySQL source. This identifier is unique not only to the source on which it originated, but also across all servers in a given replication topology, as opposed to the binary log-based replication where each node in the database cluster maintains its own binlog files, with its own numbering. Maintaining separate binlog files and numbering might become an issue in the event of a failure or planned downtime, because the binlog continuity is broken and the binlog-based replication fails.
GTID-based replication supports failovers, self-managed database clusters, and continues to work irrespective of changes in the database cluster.
The key characteristics of GTID-based replication in Datastream are:
- All databases or specific databases from a given MySQL source, as well as all tables from the databases or specific tables, can be selected.
- All historical data is replicated.
- All data manipulation language (DML) changes, such as inserts, updates, and deletes from the specified databases and tables, are replicated.
- Only committed changes are replicated.
- Seamless support for failovers.
Versions
Datastream supports the following versions of MySQL database:
- MySQL 5.6
- MySQL 5.7
MySQL 8.0
Datastream supports the following types of MySQL database:
- Self-hosted MySQL
- Cloud SQL for MySQL
- Amazon RDS for MySQL
- Amazon Aurora MySQL
- MariaDB
- Alibaba Cloud PolarDB
- Percona Server for MySQL
Known limitations
Known limitations for using MySQL database as a source include:
- Streams are limited to 10,000 tables.
- Tables that have a primary key defined as
INVISIBLE
can't be backfilled. - A table that has more than 500 million rows can't be backfilled unless the following conditions are met:
- The table has a unique index.
- None of the columns of the index are nullable.
- The index isn't descending.
- All columns of the index are included in the stream.
- Datastream periodically fetches the latest schema from the source as events are processed. If a schema changes, Datastream detects the schema change and triggers a schema fetch. However, some events might get processed incorrectly or get dropped between the schema fetches, which can cause data discrepancies.
- Not all changes to the source schema can be detected automatically, in which case data corruption may occur. The following schema changes may cause data corruption or failure to process the events downstream:
- Dropping columns
- Adding columns to the middle of a table
- Changing the data type of a column
- Reordering columns
- Dropping tables (relevant if the same table is then recreated with new data added)
- Truncating tables
- Datastream doesn't support replicating views.
- Datastream doesn't support columns of spatial data types. The values in these columns are replaced with
NULL
values. - Datastream doesn't support the zero value (
0000-00-00 00:00:00
) in columns of theDATETIME
,DATE
, orTIMESTAMP
data types. The zero value is replaced with theNULL
value. - Datastream doesn't support replicating rows which include the following values in
JSON
columns:DECIMAL
,NEWDECIMAL
,TIME
,TIME2
DATETIME
,DATETIME2
,DATE
,TIMESTAMP
orTIMESTAMP2
. Events containing such values are discarded. - Datastream doesn't support binary log transaction compression.
- Datastream doesn't support SSL certificate chains in the source MySQL connection profiles. Only single, x509 PEM-encoded certificates are supported.
- Datastream doesn't support cascading deletes. Such events aren't written to the binary log, and as a result, aren't propagated to the destination.
- When using the binary log-based replication, Datastream doesn't support failovers to replicas. Because of this, we don't recommend using Datastream for replication from Cloud SQL for MySQL Enterprise Plus sources. Cloud SQL Enterprise Plus edition instances are subject to near-zero downtime maintenance and fail over to a replica during maintenance. This breaks the binlog continuity, and as a result, affected streams fail permanently.
Additional limitations for the GTID-based replication
- Stream recovery isn't supported when you use the GTID-based replication.
- Creating tables from other tables using the
CREATE TABLE ... SELECT
statements isn't supported. - For MySQL restrictions that apply to GTID-based replication, see MySQL documentation.
What's next
- Learn how to configure a MySQL source for use with Datastream.