Source MySQL database

This section contains information about:

  • The behavior of how Datastream handles data that's being pulled from a source MySQL database
  • The versions of MySQL database that Datastream supports
  • Known limitations for using MySQL database as a source
  • An overview of how to setup a source MySQL database so that data can be streamed from it to a destination

Behavior

This section describes the behavior of MySQL sources when you replicate data using Datastream. When you ingest data from MySQL databases, you can use binlog-based replication or global transaction identifier (GTID)-based replication (Preview). You select your CDC method when you create a stream.

Binlog-based replication

Datastream can use binary log files to keep a record of data changes in MySQL databases. The information contained in these log files is then replicated to the destination to reproduce the changes made on the source.

The key characteristics of binlog-based replication in Datastream are:

  • All databases or specific databases from a given MySQL source, as well as all tables from the databases or specific tables, can be selected.
  • All historical data is replicated.
  • All data manipulation language (DML) changes, such as inserts, updates, and deletes from the specified databases and tables, are replicated.
  • Only committed changes are replicated.

Global transaction identifier (GTID)-based replication

Datastream also supports global identifier (GTID)-based replication.

Global transaction identifier (GTID) is a unique identifier created and associated with each transaction committed on a MySQL source. This identifier is unique not only to the source on which it originated, but also across all servers in a given replication topology, as opposed to the binary log-based replication where each node in the database cluster maintains its own binlog files, with its own numbering. Maintaining separate binlog files and numbering might become an issue in the event of a failure or planned downtime, because the binlog continuity is broken and the binlog-based replication fails.

GTID-based replication supports failovers, self-managed database clusters, and continues to work irrespective of changes in the database cluster.

The key characteristics of GTID-based replication in Datastream are:

  • All databases or specific databases from a given MySQL source, as well as all tables from the databases or specific tables, can be selected.
  • All historical data is replicated.
  • All data manipulation language (DML) changes, such as inserts, updates, and deletes from the specified databases and tables, are replicated.
  • Only committed changes are replicated.
  • Seamless support for failovers.

Versions

Datastream supports the following versions of MySQL database:

  • MySQL 5.6
  • MySQL 5.7
  • MySQL 8.0

Datastream supports the following types of MySQL database:

Known limitations

Known limitations for using MySQL database as a source include:

  • Streams are limited to 10,000 tables.
  • Tables that have a primary key defined as INVISIBLE can't be backfilled.
  • A table that has more than 500 million rows can't be backfilled unless the following conditions are met:
    1. The table has a unique index.
    2. None of the columns of the index are nullable.
    3. The index isn't descending.
    4. All columns of the index are included in the stream.
  • Datastream periodically fetches the latest schema from the source as events are processed. If a schema changes, Datastream detects the schema change and triggers a schema fetch. However, some events might get processed incorrectly or get dropped between the schema fetches, which can cause data discrepancies.
  • Not all changes to the source schema can be detected automatically, in which case data corruption may occur. The following schema changes may cause data corruption or failure to process the events downstream:
    • Dropping columns
    • Adding columns to the middle of a table
    • Changing the data type of a column
    • Reordering columns
    • Dropping tables (relevant if the same table is then recreated with new data added)
    • Truncating tables
  • Datastream doesn't support replicating views.
  • Datastream doesn't support columns of spatial data types. The values in these columns are replaced with NULL values.
  • Datastream doesn't support the zero value (0000-00-00 00:00:00) in columns of the DATETIME, DATE, or TIMESTAMP data types. The zero value is replaced with the NULL value.
  • Datastream doesn't support replicating rows which include the following values in JSON columns: DECIMAL, NEWDECIMAL, TIME, TIME2 DATETIME, DATETIME2, DATE, TIMESTAMP or TIMESTAMP2. Events containing such values are discarded.
  • Datastream doesn't support binary log transaction compression.
  • Datastream doesn't support SSL certificate chains in the source MySQL connection profiles. Only single, x509 PEM-encoded certificates are supported.
  • Datastream doesn't support cascading deletes. Such events aren't written to the binary log, and as a result, aren't propagated to the destination.
  • When using the binary log-based replication, Datastream doesn't support failovers to replicas. Because of this, we don't recommend using Datastream for replication from Cloud SQL for MySQL Enterprise Plus sources. Cloud SQL Enterprise Plus edition instances are subject to near-zero downtime maintenance and fail over to a replica during maintenance. This breaks the binlog continuity, and as a result, affected streams fail permanently.

Additional limitations for the GTID-based replication

  • Stream recovery isn't supported when you use the GTID-based replication.
  • Creating tables from other tables using the CREATE TABLE ... SELECT statements isn't supported.
  • For MySQL restrictions that apply to GTID-based replication, see MySQL documentation.

What's next