Overview
On this page, you can find answers to frequently asked questions (FAQ) about using Datastream. This FAQ covers:
- General questions about Datastream and change data capture (CDC)
- General source behavior and limitations
- MySQL source behavior and limitations
- Oracle source behavior and limitations
- PostgreSQL source behavior and limitations (including AlloyDB for PostgreSQL)
- SQL Server source behavior and limitations
- BigQuery destination behavior
- Cloud Storage destination behavior
- Common use cases for using Datastream
- How Datastream integrates with Google Cloud data services
- Security and connectivity
- Monitoring Datastream
- Pricing
- Additional information about Datastream
General questions about Datastream and change data capture (CDC)
Question | Answer |
---|---|
What's Datastream? | Datastream is a serverless and easy-to-use change data capture (CDC) and replication service. Datastream enables seamless replication of data from relational database sources such as AlloyDB for PostgreSQL, PostgreSQL, MySQL, SQL Server and Oracle, directly into BigQuery. Datastream offers the scale, speed, security, and simplicity that modern businesses need. With a serverless, autoscaling architecture, Datastream lets you easily set up an ELT (Extract, Load, and Transform) pipeline for low-latency data replication enabling near real-time insights. Datastream also has the ability to synchronize data streams across heterogeneous databases and applications reliably and with minimal latency. The service offers you streamlined integration with Dataflow templates to replicate databases into Cloud SQL or Spanner for database synchronization, or leverage the event stream directly from Cloud Storage to realize event-driven architectures. |
What are the methods by which Datastream streams data? | Datastream streams data from a source to a destination by one of two methods:
- CDC: capturing ongoing changes from the source in real time.
- Backfill: providing a historical snapshot of the data that exists in a table.
|
What's CDC? |
CDC is an approach to data integration that lets you integrate and analyze data faster, using fewer system resources. It's a method for pulling only the latest changes (updates, inserts, or deletes) from a data source, often by reading the log of changes that the source keeps for its own internal transactional integrity. CDC is a highly efficient mechanism for limiting the impact on the source when loading new data into operational data stores and data warehouses, and it eliminates the need for bulk-load updating and inconvenient batch windows by enabling incremental loading or near real-time streaming of data changes into a data destination. CDC can be used in many use cases that derive value from constant access to data changes as they happen, such as analytics, synchronizing data across geographically distributed systems, and event-driven architectures. |
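As a rough illustration of the log-reading idea (an illustrative model, not Datastream code), CDC can be sketched as a consumer that tracks an offset into an append-only change log, so that each poll returns only the changes made since the previous read, rather than re-scanning the whole table:

```python
# Minimal sketch of the CDC pattern: an append-only change log plus a
# consumer that remembers how far it has read. Names and structure are
# hypothetical, for illustration only.

class ChangeLog:
    """An append-only log of change events, like a database transaction log."""
    def __init__(self):
        self.entries = []

    def append(self, op, row):
        self.entries.append({"op": op, "row": row})

class CdcConsumer:
    """Tracks an offset so each poll returns only the latest changes."""
    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self):
        new = self.log.entries[self.offset:]
        self.offset = len(self.log.entries)
        return new

log = ChangeLog()
log.append("INSERT", {"id": 1})
log.append("UPDATE", {"id": 1, "name": "a"})

consumer = CdcConsumer(log)
first = consumer.poll()    # returns both existing events
log.append("DELETE", {"id": 1})
second = consumer.poll()   # returns only the newly appended DELETE
```

The consumer never touches rows that didn't change, which is what keeps the impact on the source low.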
What's backfill? | In addition to ongoing changes, Datastream also uses backfill to pull all existing data from a source, and then stream the data into a destination. As a result, the destination is "backfilled" with all historical data from the source. There are two types of backfill:
|
Are there any limitations you need to consider when performing a backfill? |
For information about backfill constraints, see the known limitations pages for the respective source types:
|
What's the recommended order for CDC and backfill operations? | In most cases, there's no special consideration here. When backfill is enabled on a stream, Datastream performs both backfill and CDC at the same time. In some edge cases, for example when replicating very large tables that can't be backfilled using Datastream, you might need to load the data into BigQuery first before starting the stream. |
Can I monitor the status of CDC and backfill operations? | You can monitor the backfill status per object:
The CDC status is effectively the stream status. If the CDC operation fails, then the entire stream fails. Additional object-level metrics are available in Cloud Monitoring. For more information, see the Cloud Monitoring documentation. |
What's a change stream versus replication? | A change stream is a sequence of events that Datastream emits to a destination for the purposes of downstream consumption. The result of a change stream being written to Cloud Storage is a set of files that contain change log events from a specific table over a period of time. The events represent inserts, updates, and deletes to that table, and the events can be consumed and processed downstream of Cloud Storage by services like Dataflow for use cases like event-driven architectures. Replication in the Datastream context means a 1:1, up-to-date representation of the source table in the destination. For example, a table in Oracle whose changes are replicated continuously to a BigQuery table, where the BigQuery table is kept up-to-date using the changes streamed from the Oracle table. Replication leverages change streams by consuming each change event and using it to update the destination. Datastream enables direct replication into BigQuery, and supports additional destinations like Cloud SQL and Spanner through integration with Dataflow, by leveraging templates that pull the change stream events from Cloud Storage and update the destination tables accordingly. |
Does Datastream require an agent on the source? | You don't have to install an agent on the source. Datastream uses existing interfaces (such as Oracle LogMiner) to get the data out of the source. |
How many tables can I include in a single stream? | Datastream can handle up to 10,000 tables in a single stream. There's no technical reason to split a database into multiple streams, although there might be some business considerations to use different streams for different logical flows, for example when replicating large transactions from PostgreSQL sources. However, each stream adds load on the source database. Such load is negligible for CDC, but can be significant for backfill. |
What about the performance impact of Datastream's usage on a production database? |
CDC is a highly efficient mechanism for limiting the impact on the source when new data is loaded into destination data stores and data warehouses. CDC also eliminates the need for bulk-load updating and inconvenient batch windows by enabling incremental loading or near real-time streaming of data changes into a destination. In addition, Datastream minimizes the impact of an initial backfill by limiting the number of simultaneous backfill tasks, and offering you the control to decide which objects to backfill, and when to backfill them. |
Can you copy the log files from the source to Google Cloud? | No. Datastream queries the logs directly on the database server, and only changes to the specified tables are written to the destination. |
Can Datastream also transfer historical data? | Yes. By default, Datastream gets all historical data from the database tables of your source that you specify, in parallel to the CDC stream. Datastream provides you with the flexibility to obtain historical data from either some or all database tables in your source. In addition, when you configure your stream, you can choose to disable historical backfill. |
Does Datastream guarantee exactly-once delivery? | No. Datastream provides at-least-once delivery. The additional metadata that Datastream writes to each event can be used to remove duplicate data. |
Can Datastream handle encrypted databases? | Yes. |
Can Datastream stream specific tables and columns from a source? | Yes. With Datastream, you can specify include and exclude lists for tables and schemas, to stream only the data that you want from a source to a destination. For included tables, you can exclude specific columns of the tables to further fine-tune exactly which data you want to be streamed into the destination. However, materialized views can't be replicated. |
How do you move a stream to another project or region? |
|
How does Datastream handle uncommitted transactions in the database log files? | When database log files contain uncommitted transactions, if any transactions are rolled back, then the database reflects this in the log files as "reverse" data manipulation language (DML) operations. For example, a rolled-back INSERT operation will have a corresponding DELETE operation. Datastream reads these operations from the log files. |
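The rollback behavior described above can be sketched as follows (an illustrative model, not Datastream code): rolling back a transaction appends a compensating "reverse" operation to the log for each change the transaction made:

```python
# Illustrative sketch only: a rolled-back transaction appears in the log as
# reverse DML. A rollback appends, for each of the transaction's operations,
# the operation that undoes it (e.g. a DELETE that cancels an INSERT).

REVERSE_OP = {"INSERT": "DELETE", "DELETE": "INSERT", "UPDATE": "UPDATE"}

def rollback(log, txn_ops):
    """Append compensating entries for the transaction's ops, newest first."""
    for op, row in reversed(txn_ops):
        log.append((REVERSE_OP[op], row))

log = [("INSERT", {"id": 7})]           # the transaction's original log entry
rollback(log, [("INSERT", {"id": 7})])  # the transaction is rolled back

# The log now also contains the reverse operation that a log reader sees:
# ("DELETE", {"id": 7})
```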
What's Datastream's regional availability? | To view a listing of the regions where Datastream is available, see IP allowlists and regions. |
General source behavior and limitations
Question | Answer |
---|---|
Which sources does Datastream support? | Datastream supports streaming data from Oracle, MySQL, PostgreSQL, AlloyDB for PostgreSQL and SQL Server sources, both cloud-hosted and self-managed. For information about source-specific versions, see the following pages:
|
Can Datastream handle encrypted databases? | Yes. |
How does Datastream extract data from the sources? | For MySQL, Datastream processes the MySQL binary log to extract change events. For Oracle, Datastream uses LogMiner and supplemental logging settings to extract data from Oracle's redo logs. For PostgreSQL and AlloyDB for PostgreSQL, Datastream relies on PostgreSQL's logical decoding feature. Logical decoding exposes all changes committed to the database and allows consuming and processing these changes. For SQL Server, Datastream tracks data manipulation language (DML) changes using transaction logs. |
Can you copy log files directly from a source to Google Cloud? | Datastream doesn't copy the entirety of the log files, but queries the log files directly from the database server, and only replicates changes from the specified tables to the destination. |
What are the limitations on the data that Datastream can process? |
There are general limitations, and limitations that apply to specific database sources. General limitations include:
For source-specific limitations, see the following pages:
|
Which data is included in every event that's generated by Datastream? | Each generated event (for inserts, updates, and deletes) includes the entire row of data from the source, with the data type and value of each column. Each event also includes event metadata that can be used to establish ordering and deduplicate events in the destination. |
Does Datastream guarantee ordering? | Although Datastream doesn't guarantee ordering, it provides additional metadata for each event. This metadata can be used to ensure eventual consistency in the destination. Depending on the source, rate and frequency of changes, and other parameters, eventual consistency can generally be achieved within a 1-hour window. |
Does Datastream guarantee exactly-once delivery? | No. Datastream provides at-least-once delivery. You can eliminate duplicate data by using the additional metadata that Datastream writes to each event. |
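As an illustration of how a downstream consumer might use this metadata to deduplicate and reorder events, consider the following sketch. The field names (`uuid`, `source_timestamp`, `position`) are hypothetical stand-ins, not the exact Datastream metadata schema:

```python
# Illustrative sketch only: with at-least-once delivery, a consumer can use
# per-event metadata to drop redelivered duplicates and restore source order.

def dedupe_and_order(events):
    seen = set()
    unique = []
    for e in events:
        if e["uuid"] not in seen:        # drop redelivered duplicates
            seen.add(e["uuid"])
            unique.append(e)
    # Sort by source timestamp, then log position, to restore source order.
    return sorted(unique, key=lambda e: (e["source_timestamp"], e["position"]))

events = [
    {"uuid": "b", "source_timestamp": 2, "position": 1, "op": "UPDATE"},
    {"uuid": "a", "source_timestamp": 1, "position": 5, "op": "INSERT"},
    {"uuid": "b", "source_timestamp": 2, "position": 1, "op": "UPDATE"},  # duplicate
]
ordered = dedupe_and_order(events)  # the duplicate is gone, order restored
```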
How does Datastream represent data types from the source for use in downstream processing? | When writing to a schemaless destination, such as Cloud Storage, Datastream simplifies downstream processing of data by normalizing data types across all sources. Datastream takes the original source data type (for example, a MySQL or PostgreSQL type), and normalizes it into a Datastream unified type. Unified types represent a lossless superset of all possible source types, and normalization means that data from different sources can be processed easily and queried downstream in a source-agnostic way. |
How does Datastream handle structure (schema) changes in the source? | Datastream fetches the schema from the source periodically. Schemas are stored in Datastream's internal schema registry, and each event references the schema at the time that it was generated. This allows Datastream to track schema changes and adjust based on these changes, ensuring that all data is streamed and loaded to the destination properly. However, in between schema fetches, some schema changes may go undetected, potentially causing data discrepancies. |
I deleted (dropped) a column in my source database, but it still appears in my destination. Why? | Datastream doesn't delete columns in the destination. You must delete columns manually. This behavior is intentional, because in some cases, you might prefer to keep the deleted column in the destination. For example, Datastream interprets renaming a column in the source as deleting a column and adding another column. In this case, you don't want to lose the original column in the destination. |
The stream fails to start due to a validation failure, but I think it's an error or I want to try running the stream regardless. How can I get past this validation? | You can start the stream using the API by adding the force=true flag and skipping the validation checks. Note that skipping validations doesn't guarantee that the stream will be able to run, and it may still fail if the database is missing the necessary configuration or permissions.
|
Can I restart a failed stream? | You can fix and then resume a stream with the Failed status. You can also recover a stream with the Failed permanently status. |
Does Datastream lock database tables? | No. Datastream doesn't need to lock the tables. |
The limitation pages for each source specify that streams can have up to 10,000 tables. Does this mean that Datastream can't run CDC operations in parallel for more than 10,000 tables at the same time? | No. The limit mentioned is per stream. You can have multiple streams with a total number of tables exceeding 10,000. |
MySQL source behavior and limitations
Question | Answer |
---|---|
How does Datastream get data out of MySQL? | Datastream uses MySQL's binary log to extract change events from MySQL. |
Does Datastream support Cloud SQL for MySQL read replica instances? | Yes, Datastream supports read replica instances for Cloud SQL for MySQL versions 5.7 and 8.0. You can enable binary logging on these instances. To do so, execute the following gcloud command: gcloud sql instances patch INSTANCE_NAME --enable-bin-log For more information, see Binary logging on read replicas. |
What permissions at minimum does Datastream need to replicate data from a source MySQL database? | You must grant the following permissions to your Datastream user to replicate data from a MySQL database:
|
Does Datastream support binary log compression? | No, Datastream doesn't support compressing transaction events with binary log transaction compression. Disable binary log transaction compression to use the change data capture functionality for MySQL in Datastream. |
Your organization doesn't permit granting the SELECT permission on all its tables and databases. Can you still use Datastream? |
You can grant the SELECT permission on specific tables or on an entire database: GRANT SELECT ON DATABASE_NAME.TABLE_NAME TO USER_NAME@'%'; GRANT SELECT ON DATABASE_NAME.* TO USER_NAME@'%'; Replace the following:
If you want to restrict the permissions to a particular hostname, replace the % character with the specific IP address or hostname. |
Oracle source behavior and limitations
Question | Answer |
---|---|
How does Datastream get data out of Oracle? | Datastream uses Oracle LogMiner to extract the data from Oracle's redo logs. |
Does Datastream require a GoldenGate license from Oracle? | No. Datastream doesn't require a GoldenGate license because it uses Oracle LogMiner to read the data from the database's redo logs. |
What happens when Oracle LogMiner isn't supported anymore? | Oracle LogMiner is still supported in all generally available Oracle releases, and Datastream will continue to support LogMiner in all future releases. Oracle has discontinued support for the LogMiner CONTINUOUS_MINE option, but Datastream doesn't use this option, so the deprecation has no impact on Datastream. |
Does Datastream support encryption of data in-transit from Oracle databases? | Datastream supports encryption of data in-transit based on Oracle Net Services. Datastream runs in |
Does Datastream support Secure Sockets Layer (SSL) and Transport Layer Security (TLS) encryption? | Yes, Datastream supports the TCPS protocol for SSL/TLS-encrypted connections to Oracle sources. |
Does Datastream support Oracle multi-tenant architecture, specifically Container Databases (CDBs) and Pluggable Databases (PDBs)? | Yes, Datastream supports Oracle multi-tenant architecture (CDB/PDB). However, you can replicate only a single pluggable database in a stream. |
Your organization doesn't permit granting the GRANT SELECT ANY TABLE permission. Can you still use Datastream? |
If you can't grant the GRANT SELECT ANY TABLE permission, you can grant the SELECT permission on the specific tables that you want to replicate.

To stream changes using Datastream, you also need read access to all tables included in the stream. To start the stream, bypass the validation using the API. For more information, see Start a stream.
Does Datastream support replication from Oracle read-replica instances? | Datastream supports replication from Oracle read-replica instances as long as the instance is configured according to the configuration guidelines. |
Does Datastream support Oracle SCAN RAC databases? | Yes, it does. However, you can't establish direct connectivity to databases using the Single Client Access Name (SCAN) feature in Oracle Real Application Clusters (RAC) environments using Datastream. To overcome this limitation, connect directly to one of the nodes, or use the Oracle Connection Manager. You can also create a private connectivity configuration using reverse proxy solutions like HAProxy. |
PostgreSQL source behavior and limitations
Question | Answer |
---|---|
How does Datastream get data out of PostgreSQL? | Datastream uses PostgreSQL's logical decoding feature to extract change events from PostgreSQL. |
Does Datastream support replication from AlloyDB for PostgreSQL or AlloyDB Omni sources? | Yes. Datastream supports both of these sources. For information about configuring your source AlloyDB for PostgreSQL database, see Configure an AlloyDB for PostgreSQL database. For information about configuring your source AlloyDB Omni database, see Configure a self-managed PostgreSQL database. |
Does Datastream support replication from a PostgreSQL read replica instance? | Yes, Datastream supports replication from read replica instances for PostgreSQL versions 16 and later. |
Does Datastream support Row-Level Security (RLS)? | Datastream doesn't support replication of tables with RLS enabled. However, you can bypass this limitation by using the BYPASSRLS clause:
ALTER USER USER_NAME BYPASSRLS; Replace USER_NAME with the name of the user that Datastream uses to connect to the database, and for whom you want to bypass RLS policies. |
Does Datastream support concurrent CDC operations when replicating from a PostgreSQL source? | No. PostgreSQL replication relies on the concept of replication slots, and replication slots don't support concurrent connections. There can only be one CDC task reading from the replication slot at a time. For example, if you delete a large number of records from the source database, the replication slot might get overloaded with delete events. Any subsequent changes to the source are delayed until the delete events already in the replication slot queue are processed. For information about the suggested approach when replicating large transaction data, see Diagnose issues. |
Does Datastream support Secure Sockets Layer (SSL) and Transport Layer Security (TLS) encryption? | Datastream supports SSL/TLS-encrypted connections that don't require client certificates. |
SQL Server source behavior and limitations
Question | Answer |
---|---|
Which SQL Server editions does Datastream support? | Datastream supports SQL Server database versions that support change data capture. For a full list of supported SQL Server versions and editions, see Versions. |
How does the Datastream CDC replication work for SQL Server sources? | The Datastream data ingestion process utilizes transaction logs or change tables, depending on the source configuration. For more information, see Source SQL Server database. |
What permissions at minimum does Datastream need to replicate data from a source SQL Server database? | You must grant the following roles and permissions to your Datastream user to replicate data from a SQL Server database: For the change tables CDC method:
Additionally, for the transaction logs CDC method:
For detailed information about configuring your source, see the respective configuration pages for your database type. |
Why do you need configuration such as changing polling intervals and setting up a truncation safeguard when using transaction logs? | When you replicate data from a SQL Server source using transaction logs, if a log gets truncated before Datastream reads it, data loss occurs. Changing polling intervals and setting up a truncation safeguard are optional, but provide an additional layer of protection to ensure that Datastream can read the data even in scenarios where downtime or connectivity issues occur. For more information, see Configure a source SQL Server database. |
Which CDC method should you select when replicating from a SQL Server source? | You can select one of the two available CDC methods, based on your source SQL Server database configuration:
For more information, see Overview of SQL Server as a source. |
BigQuery destination behavior
Question | Answer |
---|---|
How do Datastream and BigQuery work together to replicate changes from the source database? | Datastream continuously reads the change stream from the source database, and streams the upsert and delete events to the BigQuery destination tables using the Storage Write API. BigQuery then applies the changes to the table based on the table staleness configuration. |
How are Data Manipulation Language (DML) operations applied in BigQuery? |
|
How do Datastream and BigQuery handle event ordering? | BigQuery uses the event metadata and an internal change sequence number (CSN) to apply the events to the table in the correct order. For more information about event metadata, see Events and streams. |
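The ordering idea can be sketched as a last-writer-wins merge keyed by primary key (an illustrative model only; the real apply logic happens inside BigQuery, and the event shape here is hypothetical):

```python
# Illustrative sketch only: merge change events into a table keyed by primary
# key, keeping each row's highest change sequence number (CSN) so that a
# late-arriving older event never overwrites newer state.

def apply_events(table, events):
    """table maps primary key -> (csn, row), where row is None if deleted."""
    for e in events:
        pk, csn = e["pk"], e["csn"]
        current = table.get(pk)
        if current is None or csn > current[0]:   # only apply newer changes
            table[pk] = (csn, None if e["op"] == "DELETE" else e["row"])
    return table

table = {}
apply_events(table, [
    {"pk": 1, "csn": 10, "op": "UPSERT", "row": {"name": "old"}},
    {"pk": 1, "csn": 30, "op": "UPSERT", "row": {"name": "new"}},
    {"pk": 1, "csn": 20, "op": "UPSERT", "row": {"name": "stale"}},  # ignored
])
# table[1] now holds the row with the highest CSN, regardless of arrival order.
```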
How are BigQuery costs calculated when used with Datastream? | BigQuery costs are calculated and charged separately from Datastream. To learn how to control your BigQuery costs, see BigQuery CDC pricing. |
What's the maximum row size that Datastream supports when streaming data into BigQuery? | The maximum row size that Datastream supports is 20 MB. |
When you stream to BigQuery, tables are prefixed with the source schema name. Can you change this behavior so that the tables in BigQuery follow the same structure as they have on the source database? |
When you configure datasets for a BigQuery destination, you create a dataset for each schema or a single dataset for all schemas. When you create a dataset for each schema, each schema in the source is mapped to a dataset in BigQuery, and all the tables in the source schema are mapped to the tables in the dataset. When you select the single dataset for all schemas option, the table names are prefixed with the source schema name, as per the table naming convention in BigQuery. |
Can CDC capture source schema changes, for example, when a column is inserted or deleted in the source? Are such changes automatically reflected in the already populated tables in the destination BigQuery datasets? | Datastream detects new columns automatically and adds them to the destination schema for the new rows, but not the ones already replicated to the destination. Deleted columns are ignored, and Datastream populates them with NULL values in the destination. You can then manually delete these columns in BigQuery. |
Cloud Storage destination behavior
Question | Answer |
---|---|
How are files created in Cloud Storage? | Datastream creates a folder for each table. In each folder, Datastream rotates the file (or creates a new file) whenever it reaches the user-defined threshold of size or time. Datastream also rotates the file whenever a schema change is detected. Each filename is composed of a unique schema key (based on a hash of the schema), followed by the timestamp of the first event in the file. For security purposes, these filenames aren't meant to be read or understood by people. |
If data in Cloud Storage isn't ordered, then how can the events be reordered before loading them into the destination? | Each event contains several metadata fields that uniquely identify the row in the log files, and let you sort the events. These fields include: For Oracle sources:
For MySQL sources:
For PostgreSQL sources:
For more information about metadata fields, see Source-specific metadata. |
If multiple files are created for the same timestamp, then in which order should they be processed? | Because ordering within and across files isn't guaranteed, the best way to determine the order to process the files for the source is to get all events from all files for the specific timestamp, and then apply ordering using the method mentioned earlier in this section. |
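A possible processing approach can be sketched in Python as follows. The file naming scheme and event fields here are hypothetical stand-ins (real Datastream filenames are a schema-key hash plus a timestamp):

```python
# Illustrative sketch only: because ordering within and across files isn't
# guaranteed, gather every file that shares a timestamp, pool their events,
# and sort the pooled events by metadata before applying them downstream.

def events_for_timestamp(files, ts):
    """files maps filename -> list of events; each event carries metadata."""
    pooled = []
    for name, events in files.items():
        if name.endswith(f"_{ts}.jsonl"):   # hypothetical naming scheme
            pooled.extend(events)
    # Sort the pooled events by log position to restore source order.
    return sorted(pooled, key=lambda e: e["position"])

files = {
    "schemakey1_100.jsonl": [{"position": 3, "op": "UPDATE"}],
    "schemakey2_100.jsonl": [{"position": 1, "op": "INSERT"}],
    "schemakey1_200.jsonl": [{"position": 9, "op": "DELETE"}],
}
batch = events_for_timestamp(files, 100)  # both timestamp-100 files, in order
```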
How are primary key updates handled? Is there before and after information in the event? | When a row's primary key changes, Datastream generates two events for the change: an UPDATE-DELETE and an UPDATE-INSERT . The UPDATE-DELETE event represents the data before the update and UPDATE-INSERT represents the data after the update. For more information about source-specific metadata, see Events and streams. |
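Applying such an event pair can be sketched as follows (illustrative only; the event shape is a hypothetical stand-in):

```python
# Illustrative sketch only: a primary key change arrives as two events, an
# UPDATE-DELETE (the row under its old key) followed by an UPDATE-INSERT
# (the row under its new key). Applying both moves the row in the destination.

def apply(store, event):
    if event["op"] == "UPDATE-DELETE":
        store.pop(event["pk"], None)         # remove the row at the old key
    elif event["op"] == "UPDATE-INSERT":
        store[event["pk"]] = event["row"]    # insert it at the new key
    return store

store = {1: {"name": "alice"}}
apply(store, {"op": "UPDATE-DELETE", "pk": 1})
apply(store, {"op": "UPDATE-INSERT", "pk": 2, "row": {"name": "alice"}})
# The row now lives under primary key 2 instead of 1.
```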
What's the maximum row size that Datastream supports when streaming data into Cloud Storage? | The maximum row size that Datastream supports is 100 MB. |
Use cases
Question | Answer |
---|---|
What are some common use cases for using Datastream? | Datastream is a CDC and replication service, which means it's flexible across various use cases that can benefit from access to continuously streaming change data. The most common use cases for Datastream are:
|
Integrations
Question | Answer |
---|---|
How does Datastream integrate with Google Cloud data services? | Datastream complements and enhances the Google Cloud data suite by providing CDC data replication from sources to various Google Cloud services. By integrating seamlessly with these services, Datastream fits into the larger Google Cloud ecosystem. Datastream integrates with the following data services:
|
Does the Datastream-to-BigQuery template in Dataflow have any limitations for the number of data manipulation language (DML) operations? | No. The template uses the streaming inserts API to load data into BigQuery. As a result, there's no limit to the number of DML operations. However, some quota limitations are applicable. |
Security and connectivity
Question | Answer |
---|---|
Is Datastream a secure service for sensitive data? | Datastream supports multiple secure, private connectivity configurations to protect data in transit as it's streamed from a source to a destination. After the data is replicated, it's encrypted, by default, and leverages BigQuery's or Cloud Storage's security controls. Any data that's buffered by Datastream is encrypted at rest. |
Which connectivity options are available for connecting your sources to Datastream? | There are three types of connectivity methods that you can configure:
|
How can I limit Datastream's processing of sensitive data? |
Datastream lets you specify which data elements (schemas, tables, and columns) of your source you want to stream into a destination, and which elements you want to exclude from being streamed. Database logs may contain change data from elements that were excluded in your stream definition. Because you can't filter these elements at the source, Datastream reads, but ignores, any data associated with the elements. |
How does Datastream keep database credentials secure? | All user metadata (including the username and password used to access the data sources) is encrypted in transit and at rest, and Google stores it in its proprietary storage systems. Access to this metadata is tracked and audited. |
Why does the Create a private connectivity configuration page state that you must grant the roles/compute.networkAdmin role to the Datastream service account to create a private connectivity configuration on a shared VPC? |
The networkAdmin role is required only to create the VPC peering. After the peering is established, you no longer need the role. If your organization doesn't permit granting the networkAdmin role to the Datastream service account, then create a custom role with the following specific permissions:
- Select IP ranges dynamically
- Create peered networks
- Reserve IP addresses
|
Can I use Private Service Connect to create a private connectivity configuration? | No, Datastream doesn't support Private Service Connect. |
Monitor Datastream
Question | Answer |
---|---|
How do I know when all of my historical data has been copied to the destination? | Datastream provides information about its current status in the log files. A log entry is created to indicate when a table is done backfilling. |
Latency occasionally goes up and then gradually decreases over time. Is this to be expected? | Datastream scales up automatically when the event throughput generated by the source increases. However, to protect both the source database and Datastream, there are limits to how many concurrent connections and processes Datastream can open at any time. Temporary spikes in latency are expected when there's a significant increase in event throughput, and these spikes are expected to decrease gradually as the backlog of events is processed. |
Pricing
Question | Answer |
---|---|
How is Datastream priced? | Datastream is priced based on the volume (GB) of data processed from your source into a destination. To learn more about pricing information for Datastream, see Pricing. |
How do you calculate the size of the data? | Billing is calculated based on the size of the data that's processed by Datastream. Datastream charges only for data that's streamed into the destination. |
If you use Datastream with BigQuery or Dataflow, then what do you pay for? | Each service is priced and charged for separately. |
Additional information
Question | Answer |
---|---|
What if I have additional questions or issues with using Datastream? | Google's support team can provide support if you're having issues using Datastream. In addition, the troubleshooting guide covers common issues that you might face while using Datastream. |
What's next
- Find out more about Datastream.
- Learn about key concepts and features of Datastream.
- Learn about sources that Datastream supports.
- Learn about destinations that Datastream supports.