Unlock the power of change data capture and replication with new, serverless Datastream
Andi Gutmans
GM & VP of Engineering, Databases
Noaa Cohn
Product Manager, Databases
Today, we’re announcing Datastream, a serverless change data capture (CDC) and replication service, available now in preview. Datastream allows enterprises to synchronize data across heterogeneous databases, storage systems, and applications reliably and with minimal latency to support real-time analytics, database replication, and event-driven architectures. You can now easily and seamlessly deliver change streams from Oracle and MySQL databases into Google Cloud services such as BigQuery, Cloud SQL, Google Cloud Storage, and Cloud Spanner, saving time and resources and ensuring your data is accurate and up-to-date.
Datastream provides an integrated solution for CDC and replication use cases across a variety of sources and destinations
*Check the documentation page for all supported sources and destinations.
"Global companies are demanding change data capture to provide replication capabilities across disparate data sources, and provide a real-time source of streaming data for real-time analytics and business operations," says Stewart Bond, Director, Data Integration and Intelligence Software Research at IDC.
However, companies are finding it difficult to realize these capabilities because commonly used data replication offerings are costly, cumbersome to set up, and require significant management and monitoring overhead to run flexibly or at scale. This leaves customers with a difficult-to-maintain and fragmented architecture.
Datastream's differentiated approach
Datastream is taking on these challenges with a differentiated approach. Its serverless architecture seamlessly and transparently scales up or down as data volumes shift in real time, freeing teams to focus on delivering up-to-date insights instead of managing infrastructure. It also provides the streamlined customer experience, ease of use, and security that our customers have come to expect from Google Cloud, with private connectivity options built into the guided setup experience.
Datastream integrates with purpose-built and extensible Dataflow templates to pull the change streams written to Cloud Storage, and create up-to-date replicated tables in BigQuery for analytics. It also leverages Dataflow templates to replicate and synchronize databases into Cloud SQL or Cloud Spanner for database migrations and hybrid cloud configurations.
Datastream also powers a Google-native Oracle connector in Cloud Data Fusion’s new replication feature for easy ETL/ELT pipelining. And by delivering change streams directly into Cloud Storage, customers can leverage Datastream to implement modern, event-driven architectures.
Customers tell us about the benefits they’ve found using Datastream. Among them is Schnuck Markets, Inc. “Leveraging Datastream, we’ve been able to replicate data from our on-premises databases to BigQuery reliably and with little impact to our production workloads. This new method replaced our batch processing and allowed insights to be leveraged from BigQuery more quickly,” says Caleb Carr, principal technologist at Schnuck Markets. “Furthermore, implementing Datastream removed the need for our analytics group to reference on-premises databases to do their work and support our business users.”
Cogeco Communications, Inc. also used Datastream to realize the value of low-latency data access. “Datastream unlocked new customer interaction opportunities not previously possible by enabling low-latency access in BigQuery to our operational Oracle data,” says Jean-Lou Dupont, Senior Director, Enterprise Architecture, Cogeco Communications, Inc. “This streamlined integration process brings data from hundreds of disparate Oracle tables into a unified data hub. Datastream enabled us to achieve this with 10X time and effort efficiency.”
In addition, Major League Baseball (MLB) used Datastream’s replication capabilities to migrate their data from Oracle to Cloud SQL for PostgreSQL. “As we’re modernizing our applications, replicating the database data reliably out of Oracle and into Cloud SQL for PostgreSQL is a critical component of that process,” says Shawn O’Rourke, manager of technology at MLB. “Using Datastream's CDC capabilities, we were able to replicate our database securely and with low latency, resulting in minimal downtime to our application. We can now standardize on this process and repeat it for our next databases, regardless of scale.”
Our partner HCL has worked with many organizations looking to get more out of their data and plan for the future. “HCL customers across every industry are looking for ways to extract more value out of their vast amounts of data,” says Siva G. Subramanian, Global Head for Data & Analytics at HCL Google Business Unit. “CDC plays a big part in the solutions we offer to our customers using Google Cloud. Datastream enables us to deliver a secure and reliable solution to our customers that’s easy to set up and maintain. CDC is a key and integrated part of Google Cloud Data Solutions.”
“Google Cloud’s new CDC offering, Datastream, is a differentiator for Google among hyperscale cloud service providers, by supporting replication of data from Oracle and MySQL databases into the Google Cloud environment using a serverless cloud-native architecture, which removes the burden of infrastructure management for organizations, and provides elastic scalability to handle real-time workloads," says Stewart Bond, Director, Data Integration and Intelligence Software Research at IDC.
Datastream under the hood
Datastream reads CDC events (inserts, updates, and deletes) from source databases and writes those events with minimal latency to a data destination. It leverages the fact that each source database maintains its own transaction log for internal replication and consistency purposes: for MySQL it’s the binary log (binlog), and for Oracle it’s the redo log, which Datastream reads via the LogMiner utility. Using Google-native, agentless, high-scale log reader technology, Datastream can quickly and efficiently generate change streams populated by events from the database’s log while minimizing performance impact on the source database.
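Datastream’s log reader itself is proprietary, but the underlying log-based CDC technique is easy to illustrate. Here is a minimal sketch using the open-source python-mysql-replication library (not part of Datastream; the connection settings and server ID are placeholders) that reads a MySQL binlog and prints insert, update, and delete row events:

# Illustration of log-based CDC from a MySQL binlog using the open-source
# python-mysql-replication library. Datastream's own log reader is agentless
# and proprietary; this sketch only demonstrates the underlying idea.
from pymysqlreplication import BinLogStreamReader
from pymysqlreplication.row_event import (
    DeleteRowsEvent,
    UpdateRowsEvent,
    WriteRowsEvent,
)

MYSQL = {"host": "127.0.0.1", "port": 3306, "user": "repl", "passwd": "secret"}  # placeholders

stream = BinLogStreamReader(
    connection_settings=MYSQL,
    server_id=4242,  # must be unique among the server's replicas
    only_events=[WriteRowsEvent, UpdateRowsEvent, DeleteRowsEvent],
    resume_stream=True,
    blocking=False,  # read to the current end of the log, then stop
)

for event in stream:
    for row in event.rows:  # each event carries the changed rows
        print(event.schema, event.table, type(event).__name__, row)

stream.close()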
Each generated event includes the entire row of data from the database, with the data type and value of each column. The original source data types (for example, an Oracle NUMBER type or a MySQL NUMERIC type) are normalized into Datastream’s unified types. The unified types represent a lossless superset of all possible source types, and the normalization means data from different sources can easily be processed and queried downstream in a source-agnostic way. Should a downstream system need to know the original source data type, it can make a quick API call to Datastream’s Schema Registry, which stores up-to-date, versioned schemas for every data source. This also allows for in-flight resolution of downstream schema drift as source database schemas change.
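To make the event structure concrete, here is what a single change event might look like, shown as a Python dict. The field names below are approximations for illustration only; consult the Datastream documentation for the authoritative event format:

# Illustrative shape of a Datastream change event. Field names are
# approximations for explanation only; see the documentation for the
# authoritative format.
change_event = {
    "read_timestamp": "2021-05-26T12:00:00.123Z",    # when Datastream read the event
    "source_timestamp": "2021-05-26T11:59:59.987Z",  # when the change committed at the source
    "object": "inventory.orders",                    # source schema and table
    "source_metadata": {
        "change_type": "UPDATE",                     # INSERT / UPDATE / DELETE
        "log_file": "mysql-bin.000042",
        "log_position": 1337,
    },
    "payload": {                                     # the entire row, in unified types
        "order_id": 98765,
        "total": "129.90",   # e.g., Oracle NUMBER / MySQL NUMERIC, normalized losslessly
        "updated_at": "2021-05-26T11:59:59Z",
    },
}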
The generated streams of events, referred to as “change streams,” are then written as files into a Cloud Storage bucket, organized by source table and event time. During preview the files are written in JSON or Avro format, with other formats, such as Parquet, planned for the future. Files are rotated whenever a table’s schema changes, so the events in a single file always share the same schema, and also according to a configurable file size or rotation frequency setting. This way, customers can find the balance between speed of data availability and file size that makes the most sense for their business use case.
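Once a stream is running, you can browse the delivered files with the standard Cloud Storage client. A minimal sketch, assuming a hypothetical bucket name and per-table prefix (the exact path layout Datastream writes is described in the documentation):

# List the change-stream files delivered to Cloud Storage for one source
# table. The bucket name and prefix here are hypothetical; check the
# documentation for the exact path structure Datastream writes.
from google.cloud import storage  # pip install google-cloud-storage

client = storage.Client()
prefix = "datastream/inventory_orders/"  # hypothetical per-table prefix

for blob in client.list_blobs("my-cdc-bucket", prefix=prefix):
    print(blob.name, blob.size, blob.updated)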
Through its integration with Dataflow, Datastream powers up-to-date, replicated tables for analytics in BigQuery, and for data replication and synchronization to Cloud SQL and Spanner. Datastream refers to these constantly updated tables as “materialized views.” They are kept up to date via Dataflow template-based upserts into Cloud SQL or Spanner, or through consolidations into BigQuery. The consolidations, performed as part of the Dataflow template, take the change streams that are written into a log table in BigQuery and push those changes into a final table that mirrors the source table.
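Conceptually, that consolidation behaves like a periodic MERGE from the log table into the replica table. The sketch below shows the core idea; the table and column names are hypothetical, and the actual Dataflow template also handles event ordering, deletes, and schema changes far more carefully:

# Conceptual sketch of the log-table-to-replica-table consolidation in
# BigQuery. Table and column names are hypothetical; the real Dataflow
# template handles ordering, deletes, and schema drift more robustly.
from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client()
client.query("""
    MERGE `myproject.replica.orders` AS t
    USING (
      -- keep only the most recent change per primary key
      SELECT * EXCEPT (rn) FROM (
        SELECT *, ROW_NUMBER() OVER (
          PARTITION BY order_id ORDER BY source_timestamp DESC) AS rn
        FROM `myproject.cdc_log.orders`)
      WHERE rn = 1
    ) AS s
    ON t.order_id = s.order_id
    WHEN MATCHED AND s.change_type = 'DELETE' THEN
      DELETE
    WHEN MATCHED THEN
      UPDATE SET total = s.total, updated_at = s.updated_at
    WHEN NOT MATCHED AND s.change_type != 'DELETE' THEN
      INSERT (order_id, total, updated_at)
      VALUES (s.order_id, s.total, s.updated_at)
""").result()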
Datastream offers a variety of secure connectivity methods to sources, so your data is always safe in transit. And with its serverless architecture, Datastream can scale up and down readers and processing power to seamlessly keep up with the speed of data and ensure minimal latency end to end. As data volumes decrease, Datastream automatically scales back down—the result is a “pay for what you use” pricing model, where you never have to pay for idle machines or worry about bottlenecks and delays during data peaks.
Get started with Datastream
Datastream, now available in preview, supports streaming change data from Oracle and MySQL sources, hosted either on-premises or in the cloud, into Cloud Storage. You can start streaming your data today for $2 per GB of data processed by Datastream.
To get started, head over to the Datastream area of your Google Cloud console, under Big Data, and click Create Stream. There you can:
Initiate stream creation, and see what actions you need to take to set up your source and destination for successful streaming.
Define your source and destination, whose connectivity information is saved as connection profiles you can re-use for other streams. Sources support multiple connectivity methods, both private and public, to suit your business needs.
Select the source data you’d like to stream, and which you’d like to exclude.
Test your stream to ensure it will be successful when you’re ready to go.
Start your stream, and your database’s CDC data will start to flow to your Cloud Storage bucket! From there you can integrate with Dataflow templates to load data into BigQuery, Spanner, or Cloud SQL.
Datastream’s preview is supported in us-central1, europe-west1, and asia-east1, with additional regions coming soon.
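And because change streams land as ordinary Cloud Storage objects, the event-driven architectures mentioned earlier are straightforward to wire up. A minimal sketch, assuming the stream writes newline-delimited JSON and using a background Cloud Functions trigger on object creation (the handler name and field access are illustrative):

# Sketch of an event-driven consumer: a background Cloud Function triggered
# whenever a new change-stream file is finalized in the bucket. Assumes the
# stream writes newline-delimited JSON; field names are illustrative.
import json
from google.cloud import storage

def on_change_file(event, context):
    """Triggered by google.storage.object.finalize on the CDC bucket."""
    blob = storage.Client().bucket(event["bucket"]).blob(event["name"])
    for line in blob.download_as_text().splitlines():
        change = json.loads(line)
        print(change.get("object"),
              change.get("source_metadata", {}).get("change_type"))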
Datastream will become generally available later this year, and will expand its support to include PostgreSQL and SQL Server as sources, as well as out-of-the-box integration with BigQuery for easy delivery of up-to-date replicated tables for analytics, and with message queues like Pub/Sub for real-time change stream access.
For more resources to help get you started with change streaming, check out the Datastream documentation.