FAQ


Overview

In this topic, you'll find answers to frequently asked questions (FAQ) about using Datastream. This FAQ covers:

  • General questions about Datastream and change data capture (CDC)
  • MySQL change data capture (CDC)
  • Oracle change data capture (CDC)
  • PostgreSQL change data capture (CDC)
  • Common use cases for using Datastream
  • How Datastream integrates with Google Cloud data services
  • The sources and destinations that Datastream supports
  • The behavior and limitations of Datastream
  • Security and connectivity
  • Processing Datastream files in Cloud Storage
  • Monitoring Datastream
  • Pricing
  • Additional information about Datastream

General questions about Datastream and change data capture (CDC)

What's Datastream?

Datastream is a serverless and easy-to-use change data capture (CDC) and replication service. Datastream enables seamless replication of data from relational database sources such as AlloyDB (Preview), PostgreSQL (Preview), MySQL, and Oracle, directly into BigQuery (Preview). Datastream offers the scale, speed, security, and simplicity that modern businesses need. With a serverless, auto-scaling architecture, Datastream allows you to easily set up an ELT (Extract, Load, and Transform) pipeline for low-latency data replication enabling real-time insights.

Datastream also has the ability to synchronize data streams across heterogeneous databases and applications reliably and with minimal latency. The service offers you streamlined integration with Dataflow templates to replicate databases into Cloud SQL or Cloud Spanner for database synchronization, or leverage the event stream directly from Cloud Storage to realize event-driven architectures. In addition, Datastream integrates with Cloud Data Fusion for customers who want to build data pipelines with transformations to BigQuery.

What's CDC?

CDC is an approach to data integration that allows you to integrate and analyze data faster, using fewer system resources. It's a method for pulling only the latest changes (updates, inserts, or deletes) from a data source, often by reading the log of changes that the source keeps for its own internal transactional integrity.

CDC is a highly efficient mechanism for limiting the impact on the source when loading new data into operational data stores and data warehouses. It also eliminates the need for bulk-load updating and inconvenient batch windows by enabling incremental loading or real-time streaming of data changes into a data destination.

CDC can be used in many use cases that derive value from constant access to data changes as they happen, such as analytics, synchronizing data across geographically distributed systems, and event-driven architectures.

What's backfill? In addition to ongoing changes, Datastream also uses backfill to pull all existing data from a source, and then stream the data into a destination. As a result, the destination is "backfilled" with all historical data from the source.
What are the methods by which Datastream streams data? Datastream streams data from a source to a destination by one of two methods:
  • CDC: Capturing ongoing changes from the source in real time.
  • Backfill: Providing a historical snapshot of data that exists in a table.
What's a change stream? A change stream is a sequence of events that Datastream emits to a destination such as BigQuery (Preview) or Cloud Storage for downstream consumption. The result of a change stream being written to a destination is a set of files that contain change log events from a specific database table over a period of time. These events represent inserts, updates, and deletes to that table from the source, and the events can be consumed and processed downstream of the destination by services such as Dataflow for event-driven architectures.
What's a materialized view?

A materialized view is a 1-to-1, up-to-date representation of a source database table in a destination. An example of this type of view is an Oracle database table where changes from this table are replicated continuously to a BigQuery table. As a result, the BigQuery table is kept up-to-date using the changes that Datastream streamed from the Oracle table.

Materialized views leverage change streams by consuming each change event and using it to update the materialized view. Datastream delivers materialized views directly into BigQuery. Other destinations, such as Cloud SQL and Cloud Spanner, are supported through integration with other services such as Dataflow.

Does Datastream require an agent on the source? You don't have to install an agent on the source. Datastream uses existing interfaces (such as Oracle LogMiner) to get the data out of the source.
What about the performance impact of Datastream's usage on a production database?

CDC is a highly efficient mechanism for limiting the impact on the source when new data is loaded into destination data stores and data warehouses. CDC also eliminates the need for bulk-load updating and inconvenient batch windows by enabling incremental loading or real-time streaming of data changes into a destination.

In addition, Datastream minimizes the impact of an initial backfill by limiting the number of simultaneous backfill tasks, and offering you the control to decide which objects to backfill, and when to backfill them.

Can you copy the log files from the source to Google Cloud? No. Datastream queries the logs directly on the database server, and only changes to the specified tables are written to the destination.
Can Datastream also transfer historical data? Yes. By default, Datastream gets all historical data from the database tables of your source that you specify, in parallel to the CDC stream. Datastream provides you with the flexibility to obtain historical data from either some or all database tables in your source. In addition, when you configure your stream, you can choose to disable historical backfill.
What's the maximum row size that Datastream supports? Datastream is currently limited to rows no larger than 3 MB.
Does Datastream guarantee ordering? Datastream doesn't guarantee ordering. It provides additional metadata on each event, which can be used to ensure eventual consistency in the destination. Depending on the source, the rate and frequency of changes, and other parameters, eventual consistency can generally be achieved within a 1-hour window.
Does Datastream guarantee exactly-once delivery? No. Datastream provides at-least-once delivery. You can use the additional metadata that Datastream writes to each event to remove duplicate data.
Do you handle encrypted databases? Yes.
Can Datastream stream specific tables and columns from a source? Yes. With Datastream, you can specify include and exclude lists for tables and schemas, to stream only the data that you want from a source to a destination. For included tables, you can exclude specific columns of the tables to further fine-tune exactly which data you want to be streamed into the destination. However, materialized views can't be replicated.
How do you move a stream to another project or region?
  1. Create a stream in a new region or project with the same configuration as the existing stream, but don't select the Backfill historical data check box.
  2. Start the stream that you created.
  3. After the stream that you created has a status of RUNNING, pause the existing stream.
  4. Optionally, modify the new stream by selecting the Backfill historical data check box. Existing data in tables added to the stream in the future will be streamed from the source into the destination.
How does Datastream handle uncommitted transactions in the database log files? When database log files contain uncommitted transactions, if any transactions are rolled back, then the database reflects this in the log files as "reverse" data manipulation language (DML) operations. For example, a rolled-back INSERT operation will have a corresponding DELETE operation. Datastream reads these operations from the log files.
What's Datastream's regional availability?

To view a listing of the regions where Datastream is available, see IP allowlists and regions.

MySQL change data capture (CDC)

How does Datastream get data out of MySQL? Datastream uses MySQL's binary log to extract change events from MySQL.
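If you manage the MySQL source yourself, you can verify that the binary log is configured the way CDC requires by using standard MySQL statements. This is only a quick sketch; the expected values in the comments are assumptions based on typical CDC requirements, so refer to the MySQL source configuration documentation for the authoritative settings.

-- Check that the binary log is enabled and uses row-based logging.
SHOW VARIABLES LIKE 'log_bin';          -- expected: ON
SHOW VARIABLES LIKE 'binlog_format';    -- expected: ROW
SHOW VARIABLES LIKE 'binlog_row_image'; -- expected: FULL
-- Confirm that the server is currently writing a binary log file.
SHOW MASTER STATUS;
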
Does Datastream support Cloud SQL for MySQL read replica instances?

Yes, Datastream supports read replica instances for Cloud SQL for MySQL versions 5.7 and 8.0.

You can enable binary logging on these instances. To do so, execute the following gcloud command:

gcloud sql instances patch INSTANCE_NAME --enable-bin-log

For more information, see Binary logging on read replicas.

Does Datastream lock MySQL database tables? Why does Datastream require LOCK TABLE permissions?

In most cases, Datastream doesn't need to lock the tables. However, under certain conditions (for example, MySQL 8 with the InnoDB storage engine and the isolation level set to REPEATABLE READ), Datastream acquires locks on the tables during the backfill.
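As a quick way to see whether your source is running with the isolation level mentioned above, you can query the server variables directly. This is an illustrative check only, not a recommendation to change the setting.

-- Show the isolation level at the global and session scope (MySQL 5.7.20+ and 8.0).
SELECT @@global.transaction_isolation, @@session.transaction_isolation;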

Oracle change data capture (CDC)

How does Datastream get data out of Oracle? Datastream uses Oracle LogMiner to extract the data from Oracle's redo logs.
Does Datastream require a GoldenGate license from Oracle? No. Datastream uses Oracle LogMiner to read the data from the database's redo logs.
What happens when Oracle LogMiner isn't supported anymore? Oracle LogMiner is still available in all generally available Oracle releases, and Datastream will continue to support Oracle in future releases.
Does Datastream support encryption of data in-transit from Oracle databases?

Datastream supports encryption of data in-transit based on Oracle Net Services. Datastream runs in REQUESTED mode.

Does Datastream support Oracle multi-tenant architecture, specifically Container Databases (CDBs) and Pluggable Databases (PDBs)? Datastream doesn't support CDBs and PDBs currently, but this is on our roadmap.
Your organization doesn't permit granting the GRANT SELECT ANY TABLE permission. Can you still use Datastream?

If you can't grant the GRANT SELECT ANY TABLE permission, then you can grant SELECT permissions on the following list of tables (see the example after this list):

  • ALL_COL_COMMENTS
  • ALL_CONS_COLUMNS
  • ALL_CONSTRAINTS
  • ALL_DB_LINKS
  • ALL_EXTERNAL_TABLES
  • ALL_IND_COLUMNS
  • ALL_INDEXES
  • ALL_LOG_GROUPS
  • ALL_MVIEWS
  • ALL_OBJECTS
  • ALL_PART_TABLES
  • ALL_SEQUENCES
  • ALL_SOURCE
  • ALL_SYNONYMS
  • ALL_TAB_COLS
  • ALL_TAB_COLUMNS
  • ALL_TAB_COMMENTS
  • ALL_TABLES
  • ALL_TRIGGERS
  • ALL_TRIGGER_COLS
  • ALL_TYPES
  • ALL_USERS
  • ALL_VIEWS
  • DATABASE_PROPERTIES
  • DBA_ROLE_PRIVS
  • DUAL
  • PRODUCT_COMPONENT_VERSION
  • ROLE_SYS_PRIVS
  • USER_ROLE_PRIVS
  • USER_TAB_PRIVS

To start the stream, bypass the validation using the API. For more information, see Start a stream.
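As a sketch of what granting access to the list above might look like, you can issue individual GRANT statements as a suitably privileged user. DATASTREAM_USER is a placeholder for the database user that Datastream connects with; only a few of the views are shown here.

-- Grant SELECT on each dictionary view in the list above to the Datastream user.
GRANT SELECT ON ALL_COL_COMMENTS TO DATASTREAM_USER;
GRANT SELECT ON ALL_CONSTRAINTS TO DATASTREAM_USER;
GRANT SELECT ON ALL_TAB_COLUMNS TO DATASTREAM_USER;
-- ...repeat for the remaining views in the list...
GRANT SELECT ON DBA_ROLE_PRIVS TO DATASTREAM_USER;
GRANT SELECT ON DUAL TO DATASTREAM_USER;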

PostgreSQL change data capture (CDC)

How does Datastream get data out of PostgreSQL? Datastream uses PostgreSQL's logical decoding feature to extract change events from PostgreSQL.
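For context, logical decoding is typically consumed through a publication and a replication slot that uses the built-in pgoutput plugin. The following is a minimal sketch with placeholder names; check the PostgreSQL source configuration documentation for the exact requirements.

-- The server must run with wal_level = logical.
-- Publish the tables whose changes should be available to the slot.
CREATE PUBLICATION datastream_publication FOR ALL TABLES;
-- Create a logical replication slot that decodes changes with the pgoutput plugin.
SELECT PG_CREATE_LOGICAL_REPLICATION_SLOT('datastream_replication_slot', 'pgoutput');
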

Use cases

What are some common use cases for using Datastream?

Datastream is a CDC and replication service, which means it's flexible across various use cases that can benefit from access to continuously streaming change data. The most common use cases for Datastream, in order of commonality, are:

  1. Analytics: By integrating Datastream with Dataflow templates for loading data into BigQuery (Preview), you can get up-to-date materialized views of source data in BigQuery. You can use this continuously updated data to build up-to-date dashboards for the data. This can be used, for example, for monitoring your systems and for deriving up-to-date insights about the state of your business.
  2. Database replication and synchronization scenarios: By integrating Datastream with Dataflow templates for loading data into Cloud SQL or Cloud Spanner, you can get up-to-date materialized views of your source data in these databases. You can use this continuously updated data in the destination databases for a low downtime database migration from the source to the destination, or for hybrid-cloud configurations, where the source and destination reside in different hosting environments.
  3. Event-driven architectures: Modern microservices-based architectures rely on central hubs of data that are updated with events continuously from across your organization to be event-driven. By continuously writing event data into destinations such as BigQuery and Cloud Storage, you can build event-driven architectures that are based on consumption of event data from these destinations.

Integrations

How does Datastream integrate with Google Cloud data services?

Datastream complements and enhances the Google Cloud data suite by providing CDC data replication from sources to various Google Cloud services. By integrating seamlessly with these services, Datastream fits into the larger Google Cloud ecosystem.

Datastream integrates with the following data services:

  • BigQuery (Preview): Datastream uses the BigQuery Write API to integrate with BigQuery directly. Datastream writes the change events directly to the BigQuery dataset, where the changes are merged continuously (UPSERT) with the existing data (see the conceptual sketch after this list).
  • Dataflow: Datastream integrates with Dataflow via Dataflow templates. These templates read data from Cloud Storage and load that data into BigQuery, Cloud SQL for PostgreSQL, or Cloud Spanner. The purpose of these templates is to maintain up-to-date materialized views of the replicated source tables in the destinations. The templates are available in the Dataflow UI, and are built for Datastream-generated files for easy, out-of-the-box processing.
  • Cloud Data Fusion: Datastream integrates with Cloud Data Fusion by powering an Oracle connector through Cloud Data Fusion's replication feature. You can create a replication Oracle source that Datastream "powers" under the hood transparently to build a data pipeline easily. This pipeline includes rich transformations for the Oracle data that Datastream streams into Cloud Data Fusion.
  • Cloud Storage: Datastream integrates with Cloud Storage by writing to it as a change streaming destination.
  • Cloud SQL, Cloud Spanner, and BigQuery: Datastream integrates with these destination databases. By using Dataflow templates, you can maintain up-to-date materialized views in the databases.
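The continuous UPSERT into BigQuery mentioned in the list above is conceptually similar to a MERGE statement keyed on the source table's primary key. The following BigQuery SQL is only an illustration of that effect with hypothetical table and column names; it isn't Datastream's actual implementation.

-- Conceptual illustration: fold a batch of change events into the replicated table.
MERGE my_dataset.orders AS t
USING my_dataset.order_change_events AS s
ON t.order_id = s.order_id
WHEN MATCHED AND s.change_type = 'DELETE' THEN
  DELETE
WHEN MATCHED THEN
  UPDATE SET status = s.status, amount = s.amount
WHEN NOT MATCHED AND s.change_type != 'DELETE' THEN
  INSERT (order_id, status, amount) VALUES (s.order_id, s.status, s.amount);
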
Does the Datastream-to-BigQuery template in Dataflow have any limitations for the number of data manipulation language (DML) operations? No. The template uses the streaming inserts API to load data into BigQuery. As a result, there's no limit to the number of DML operations. However, BigQuery quota limits still apply.

Sources and destinations

Which sources and destinations does Datastream support?
  • Sources: Datastream supports streaming data from Oracle, MySQL, and PostgreSQL (Preview) sources, both cloud-hosted and self-managed. For more information about source support, see Behavior and limitations.
  • Destinations: Datastream supports applying changes to data directly to BigQuery tables (Preview), or storing the changes to data in files that are streamed into Cloud Storage. From Cloud Storage, you can then use Dataflow templates to transfer the change data to additional destinations, such as Cloud Spanner and Cloud SQL.
Can Datastream handle encrypted databases? Yes.

Behavior and limitations

What source versions does Datastream support?

For MySQL, Datastream supports versions 5.6, 5.7, and 8.0. Any MySQL-compliant database is supported, including, but not limited to:

  • A self-hosted database, either on-premises or on any cloud provider
  • Cloud SQL for MySQL
  • AWS RDS MySQL and Aurora MySQL
  • MariaDB
  • Alibaba Cloud PolarDB
  • Percona Server for MySQL

For Oracle, Datastream supports 11g Release 2 and later. RAC in a SCAN configuration, CDB/PDB multi-tenant architecture, and Autonomous Database aren't supported. Any Oracle-compliant database is supported, including, but not limited to:

  • A self-hosted database, either on-premises or on any cloud provider
  • AWS RDS
  • Oracle Cloud

For PostgreSQL, Datastream supports version 10 and above. Any PostgreSQL-compliant database is supported, including, but not limited to:

  • Self-hosted PostgreSQL
  • Cloud SQL for PostgreSQL
  • Amazon RDS for PostgreSQL
  • Amazon Aurora PostgreSQL
  • AlloyDB for PostgreSQL
How does Datastream extract data from the sources?

For MySQL, Datastream processes the MySQL binary log to extract change events.

For Oracle, Datastream uses LogMiner and supplemental logging settings to extract data from Oracle's redo logs.

For PostgreSQL, Datastream relies on PostgreSQL's logical decoding feature. Logical decoding exposes all changes committed to the database and allows consuming and processing these changes.
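The supplemental logging settings mentioned for Oracle are enabled with standard ALTER statements. This is a hedged sketch run by a privileged user, with a placeholder table name; the scope you need (database-level or per-table) depends on how you configure the source.

-- Option 1: enable supplemental log data for all columns at the database level.
ALTER DATABASE ADD SUPPLEMENTAL LOG DATA (ALL) COLUMNS;
-- Option 2: enable minimal database-level supplemental logging, then add it per table.
ALTER DATABASE ADD SUPPLEMENTAL LOG DATA;
ALTER TABLE inventory.orders ADD SUPPLEMENTAL LOG DATA (ALL) COLUMNS;
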

Can you copy log files directly from a source to Google Cloud? Datastream doesn't copy the entirety of the log files, but queries the log files directly from the database server, and only replicates changes from the specified tables to the destination.
For Oracle sources, does Datastream require a GoldenGate license? Datastream doesn't require a GoldenGate license because it uses Oracle LogMiner to read the data from the database's redo logs.
What happens if Oracle stops supporting LogMiner in newer versions? Oracle has discontinued support for the LogMiner CONTINUOUS_MINE option. However, Datastream doesn't use this option, so the change has no impact on Datastream.
What are the limitations on the data that Datastream can process?

There are general limitations, limitations for MySQL sources, and limitations for Oracle sources.

General limitations include:

  • Throughput: approximately 5 MB per second, with a maximum row size of 3 MB.
  • Some data definition language (DDL) operations aren't supported during replication, including:
    • Dropping a column from the middle of a table. This may cause a data discrepancy because values are associated with the wrong column.
    • Changing the data type of a column. This may cause a data discrepancy because data isn't mapped properly to the correct Datastream unified type, and the data may get corrupted.
  • Cascading deletes are ignored.
  • Table truncation is ignored.

MySQL limitations include backfill not being supported for tables which don't have a primary key and are larger than 100 million rows.

Oracle limitations include:

  • Column support:
    • Some data types aren't supported, and are replicated with NULL values. These data types include abstract data types (ADT), ANYDATA, collections (VARRAY), BLOB/CLOB/LOB/NCLOB, LONG, UDT, UROWID, and XMLType.
  • Unsupported Oracle features:
    • External tables
    • Database links (dblinks)
    • Index-only tables (IOTs)
    • Oracle Label Security (OLS) isn't replicated.
  • Tables that have more than 100 million rows can't be backfilled unless all of the following conditions are met:
    • The table has a unique index defined on it.
    • The index is a B-tree index (the default index type). The index can be composite.
    • The index doesn't contain a function-based column.
    • All columns of the index are allowed.
    • None of the index columns is nullable.
    • No column in the index has the DATE type with negative dates as values.

PostgreSQL limitations include:

  • Tables without primary keys must have a REPLICA IDENTITY (see the example after this list).
  • Datastream doesn't support columns of the Geometric data types or Range data types.
  • Datastream doesn't support replicating views.
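For the REPLICA IDENTITY requirement above, a minimal sketch with a placeholder table name looks like the following; FULL makes UPDATE and DELETE events carry the old row values, while USING INDEX works if the table has a suitable unique index on NOT NULL columns.

-- Record the full old row for UPDATE and DELETE operations on a table without a primary key.
ALTER TABLE public.orders_archive REPLICA IDENTITY FULL;
-- Alternative: use an existing unique, non-partial index on NOT NULL columns.
-- ALTER TABLE public.orders_archive REPLICA IDENTITY USING INDEX orders_archive_uidx;
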
Which data is included in every event that's generated by Datastream? Each generated event (for inserts, updates, and deletes) includes the entire row of data from the source, with the data type and value of each column. Each event also includes event metadata that can be used to establish ordering and ensure exactly-once delivery.
Does Datastream guarantee ordering? Although Datastream doesn't guarantee ordering, it provides additional metadata for each event. This metadata can be used to ensure eventual consistency in the destination. Depending on the source, rate and frequency of changes, and other parameters, eventual consistency can generally be achieved within a 1-hour window.
Does Datastream guarantee exactly-once delivery? Datastream provides at-least-once delivery. You can eliminate duplicate data by using additional metadata that Datastream writes to each event.
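As an illustration of removing duplicates with that metadata, the following BigQuery SQL keeps one copy of each event by deduplicating on a per-event identifier. The table and field names (uuid, read_timestamp) are assumptions for this sketch; use whichever identifying metadata your events actually carry.

-- Keep a single row per event identifier, preferring the most recently read copy.
SELECT * EXCEPT(row_num)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY uuid ORDER BY read_timestamp DESC) AS row_num
  FROM my_dataset.orders_changelog
)
WHERE row_num = 1;
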
How does Datastream represent data types from the source for use in downstream processing?

Datastream makes downstream processing of data across data sources easy and straightforward by normalizing data types across all sources. Datastream takes the original source data type (for example, a MySQL or PostgreSQL NUMERIC type or an Oracle NUMBER type), and normalizes it into a Datastream unified type.

Unified types represent a lossless superset of all possible source types, and normalization means that data from different sources can be processed easily and queried downstream in a source-agnostic way.

Should a downstream system need to know the source data type, it can make an API call to Datastream's schema registry. This registry stores up-to-date, versioned schemas for every data source. The schema registry API also allows downstream consumers to resolve schema drift quickly as source database schemas change.

How does Datastream handle structure (schema) changes in the source? Datastream fetches the schema from the source periodically. Schemas are stored in Datastream's internal schema registry, and each event references the schema at the time that it was generated. This allows Datastream to track schema changes and adjust based on these changes, ensuring that all data is streamed and loaded to the destination properly. However, in between schema fetches, some schema changes may go undetected, potentially causing data discrepancies.
The stream fails to start due to a validation failure, but I think the validation is incorrect or I want to try running the stream anyway. How can I get past this validation? You can start the stream using the API by adding the force=true flag to skip the validation checks. Note that skipping validations doesn't guarantee that the stream will be able to run; it may still fail if the database is missing the necessary configuration or permissions.

Security and connectivity

Is Datastream a secure service for sensitive data? Datastream supports multiple secure, private connectivity configurations to protect data in transit as it's streamed from a source to a destination. After the data is replicated, it's encrypted, by default, and leverages BigQuery's or Cloud Storage's security controls. Any data that's buffered by Datastream is encrypted at rest.
Which connectivity options are available for connecting your sources to Datastream?

There are three types of connectivity methods that you can configure:

  • IP allowlist: This method provides you with public connectivity by allowlisting Datastream's regional IP addresses on the source. When you create your stream, Datastream displays these IP addresses in the UI.
  • Forward SSH tunnel: This method provides you with secure connectivity over public networks by leveraging a customer-configured SSH bastion in addition to allowlisting regional IP addresses.
  • Private connectivity over VPC peering: Use this method to connect to your Google Cloud-hosted databases over Google Cloud's internal network, or leverage an existing VPN or Interconnect by establishing VPC peering between Datastream's private network and your organization's Google Cloud VPC.
How can I limit Datastream's processing of sensitive data?

Datastream allows you to specify which specific data elements (schemas, tables, and columns) of your source you want to stream into a destination, and which elements you want to exclude from being streamed.

Database logs may contain change data from elements that were excluded in your stream definition. Because you can't filter these elements at the source, Datastream will read, but ignore, any data associated with the elements.

How does Datastream keep database credentials secure? All user metadata (including the username and password used to access the data sources) is encrypted in transit and at rest. Access to this metadata is tracked and audited.

Process Datastream files in Cloud Storage

How are files created in Cloud Storage? Datastream creates a folder for each table. In each folder, Datastream rotates the file (or creates a new file) whenever it reaches the user-defined threshold of size or time. Datastream also rotates the file whenever a schema change is detected. The file name will be composed of a unique schema key (based on a hash of the schema), followed by the timestamp of the first event in the file. For security purposes, these filenames aren't meant to be read or understood by people.
If data in Cloud Storage isn't ordered, then how can the events be reordered before loading them into the destination? Each event contains several metadata fields that uniquely identify the row in the Oracle redo log. Specifically:
  • rs_id, which is composed of three values (for example, 0x0073c9.000a4e4c.01d0). The rs_id uniquely identifies the record within the redo log.
  • ssn, which is used when the record in the redo log is too long. Because of this, Oracle LogMiner splits the record into multiple rows.
To sort the events, you need to sort by each part of the rs_id, and then by the ssn.
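As a sketch of that ordering in BigQuery (table and column names are assumptions), and assuming the rs_id parts are fixed-width, zero-padded hexadecimal strings as in the example above, sorting the parts lexicographically matches their numeric order:

-- Order Oracle change events by the three rs_id parts, then by ssn.
SELECT *
FROM my_dataset.orders_changelog
ORDER BY
  SPLIT(rs_id, '.')[OFFSET(0)],
  SPLIT(rs_id, '.')[OFFSET(1)],
  SPLIT(rs_id, '.')[OFFSET(2)],
  ssn;
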
If multiple files are created for the same timestamp, then in which order should they be processed? Because ordering within and across files isn't guaranteed, the best way to determine the order to process the files is to get all events from all files for the specific timestamp, and then apply ordering using the method mentioned above.
How are primary key updates handled? Is there before and after information in the event? When a row's primary key changes, Datastream will generate two events for the change: an UPDATE-DELETE and an UPDATE-INSERT.
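Conceptually, a consumer can apply such a pair of events as a delete under the old key followed by an insert under the new key. The following SQL is purely illustrative, with hypothetical table, column, and key values:

-- UPDATE-DELETE event: remove the row that was stored under the old primary key.
DELETE FROM my_dataset.customers WHERE customer_id = 1001;
-- UPDATE-INSERT event: add the same row back under the new primary key.
INSERT INTO my_dataset.customers (customer_id, name) VALUES (2002, 'Ada Lovelace');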

Monitor Datastream

How do I know when all of my historical data has been copied to the destination? Datastream provides information about its current status in the log files. A log entry is created to indicate when a table is done backfilling.
Latency occasionally goes up and then gradually decreases over time. Is this to be expected? Datastream scales up automatically when event throughput generated by the source increases. However, to protect both the source database as well as Datastream, there are limits to how many concurrent connections and processes Datastream can open at any time. Temporary spikes in latency are expected when there's a significant increase in event throughput, and are expected to decrease gradually as the backlog of events is processed.

Pricing

How is Datastream priced?

Datastream is priced based on the volume (GB) of data that's streamed from your source into a destination.

To learn more about pricing information for Datastream, see Pricing.

How do you calculate the size of the data? Billing is calculated based on the raw (uncompressed) data size that's streamed from the source to the destination. Datastream charges only for data that's streamed into the destination.
If you use Datastream with Dataflow or Cloud Data Fusion, then what do you pay for? Each service is priced and charged for separately.

Additional information

What if I have additional questions or issues with using Datastream? Google's support team can help if you run into issues while using Datastream. In addition, the troubleshooting guide covers common issues that you may encounter while using Datastream.