Jump to Content
Developers & Practitioners

What is Datastream?

July 30, 2021
Priyanka Vergadia

Staff Developer Advocate, Google Cloud

With data volumes constantly growing, many companies find it difficult to use data effectively and gain insights from it. Often these organizations are burdened with cumbersome and difficult-to-maintain data architectures. 

One way that companies are addressing this challenge is with change streaming: the movement of data changes as they happen from a source (typically a database) to a destination. Powered by change data capture (CDC), change streaming has become a critical data architecture building block. We recently announced Datastream, a serverless change data capture and replication service. Datastream’s key capabilities include:

  • Replicate and synchronize data across your organization with minimal latency. You can synchronize data across heterogeneous databases and applications reliably, with low latency, and with minimal impact to the performance of your source. Unlock the power of data streams for analytics, database replication, cloud migration, and event-driven architectures across hybrid environments.
  • Scale up or down with a serverless architecture seamlessly. Get up and running fast with a serverless and easy-to-use service that scales seamlessly as your data volumes shift. Focus on deriving up-to-date insights from your data and responding to high-priority issues, instead of managing infrastructure, performance tuning, or resource provisioning.
  • Integrate with the Google Cloud data integration suite. Connect data across your organization with Google Cloud data integration  products. Datastream leverages Dataflow templates to load data into BigQuery, Cloud Spanner, and Cloud SQL; it also powers Cloud Data Fusion's CDC Replicator connectors for easier-than-ever data pipelining.
Click to enlarge

Datastream use cases

Datastream captures change streams from Oracle, MySQL, and other sources for destinations such as Cloud Storage, Pub/Sub, BigQuery, Spanner and more. Some use cases of Datastream:  

  • For analytics use Datastream with a pre-built Dataflow template to create up-to-date replicated tables in BigQuery in a fully-managed way.
  • For database replication use Datastream with pre-built Dataflow templates to continuously replicate and synchronize database data into Cloud SQL for PostgreSQL or Spanner to power low-downtime database migration or hybrid-cloud configuration.
  • For building event-driven architectures use Datastream to ingest changes from multiple sources into object stores like Google Cloud Storage or, in the future, messaging services such as Pub/Sub or Kafka 
  • Streamline real-time data pipeline that continually streams data from legacy relational data stores (like Oracle and MySQL) using Datastream into MongoDB

How do you set up Datastream?

  1. Create a source connection profile.
  2. Create a destination connection profile.
  3. Create a stream using the source and destination connection profiles, and define the objects to pull from the source.
  4. Validate and start the stream.

Once started, a stream continuously streams data from the source to the destination. You can pause and then resume the stream. 

Connectivity options

To use Datastream to create a stream from the source database to the destination, you must establish connectivity to the source database. Datastream supports the IP allowlist, forward SSH tunnel, and VPC peering network connectivity methods.

Private connectivity configurations enable Datastream to communicate with a data source over a private network (internally within Google Cloud, or with external sources connected over VPN or Interconnect). This communication happens through a Virtual Private Cloud (VPC) peering connection. 

For a more in-depth look into Datastream check out the documentation.

For more #GCPSketchnote, follow the GitHub repo. For similar cloud content follow me on Twitter @pvergadia and keep an eye out on thecloudgirl.dev.

Posted in