What is change data capture?

Change data capture (CDC) is a process used to track and capture changes made to data in a source system, such as a transactional database, data warehouse, or SaaS application. By recording changes in real time, CDC enables you to keep systems synchronized, power analytics, train AI models, and support seamless data migrations.

Knowing what changes occur in a database is essential for maintaining data integrity, optimizing workflows, and unlocking real-time insights. A typical example is in retail—when a customer places an order, CDC captures that sale from the transactional database at the moment it happens. A CDC service like Google Cloud’s Datastream can replicate the changes from an on-premises PostgreSQL database into a BigQuery data warehouse, powering an analytics dashboard in near real time and eliminating the wait for slow, nightly batch jobs.

Video: Introduction to Datastream for BigQuery

Key takeaways

Change data capture (CDC) is a process used to identify and capture data modifications—such as insertions, updates, and deletions—from a source system, often in near real time. CDC is known for:

  • Replacing slow, nightly batch jobs: Allows you to capture changes to data as they happen, instead of extracting entire datasets later
  • Powerful use cases: CDC supports seamless database migrations, real-time analytics, and AI training

What is the difference between ETL and CDC?

The ETL process, which stands for ‘extract, transform, load,’ involves extracting raw data from its source system, transforming it into the necessary structure, and then loading it into the destination system, typically a data warehouse. ETL helps to ensure that data is organized, clean, and ready for downstream use in analytics or applications.

Traditionally, ETL runs in large, scheduled batches, often overnight. This means the data in the target system is only as fresh as the most recent batch run.

The key difference is in how CDC extracts data. Instead of extracting the dataset in batches, CDC captures the changes as they happen, making it possible to always query the most current data instead of relying on periodic batch processing. This can help you respond faster to changes and gain more accurate insights.

What is the difference between SCD and CDC?

Change data capture is primarily a process focused on the efficient extraction of all incremental changes from a source database, often in near real time, to minimize the load on the source system.

In contrast, slowly changing dimensions (SCD) is a data modeling technique applied in the target data warehouse or analytical system, focusing on how to store and manage the historical state of dimension data (like customer records or product attributes) over time.

While CDC is the tool for efficient data transport of the changes, these changes are typically applied using a specific SCD strategy to maintain a complete and accurate history within the destination system.
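
To make the relationship concrete, here is a minimal Python sketch of applying a CDC update event to a dimension table using an SCD Type 2 strategy. The event shape, table layout, and field names are illustrative assumptions, not any specific product's format:

  from datetime import datetime, timezone

  # A hypothetical in-memory "customer" dimension table.
  # SCD Type 2 keeps history: old rows are closed out, new rows are opened.
  dimension = [
      {"customer_id": 42, "city": "Berlin",
       "valid_from": "2024-01-01T00:00:00Z", "valid_to": None, "is_current": True},
  ]

  def apply_scd2(table, change_event):
      """Apply a CDC update event using an SCD Type 2 strategy."""
      now = datetime.now(timezone.utc).isoformat()
      for row in table:
          # Close out the currently valid row for this business key.
          if row["customer_id"] == change_event["customer_id"] and row["is_current"]:
              row["valid_to"] = now
              row["is_current"] = False
      # Open a new current row carrying the changed attributes.
      table.append({**change_event, "valid_from": now,
                    "valid_to": None, "is_current": True})

  # A CDC update event captured from the source: customer 42 moved.
  apply_scd2(dimension, {"customer_id": 42, "city": "Munich"})
  for row in dimension:
      print(row)

After the event is applied, the table holds both the closed-out Berlin row and the new current Munich row, preserving the full history that CDC alone does not model.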

Common CDC sources and destinations

Change data capture can be a powerful bridge for connecting the diverse systems that make up a modern data stack or data cloud. Rather than being limited to specific databases, CDC tools are designed to work with a wide variety of sources and destinations.

Transactional databases

Transactional databases are the most common sources for CDC, capturing every insert, update, and delete from the systems that run daily business operations.

  • Examples: MySQL, PostgreSQL, Oracle, SQL Server

Data warehouses

Often the primary destination for CDC data, these systems consolidate information from multiple sources for business intelligence, analytics, and AI.

  • Examples: BigQuery, Snowflake, Amazon Redshift, Databricks

SaaS applications

Use CDC to pull change data from critical business platforms (often via APIs or webhooks) to synchronize customer, sales, or support data across the organization.

  • Examples: Salesforce, SAP, Stripe, HubSpot

Streaming and messaging platforms

Streaming and messaging platforms often act as a real-time messaging bus, receiving change events from a source and distributing them to multiple downstream consumers simultaneously.

  • Examples: Apache Kafka, Pub/Sub, Amazon Kinesis

Object storage

Object storage is a common destination for raw CDC event logs, providing a scalable and cost-effective foundation for building a data lake.

Key use cases for change data capture

CDC can unlock a wide range of critical use cases, from ensuring data continuity to powering real-time analytics and AI.

  • Replication: Secondary databases or backups remain consistent with the primary system by capturing real-time changes. This is critical for disaster recovery and high-availability architectures.
  • Migration: Database migrations are simplified by tracking ongoing changes and ensuring the destination system reflects the source accurately, even during live migrations.
  • Real-time analytics: By feeding live data into analytics systems, CDC helps you respond to trends and changes as they occur, offering up-to-the-minute insights.
  • Data for AI training: CDC supplies the most current data, enabling AI and machine learning models to be trained with accurate, relevant information, improving predictions and decision-making.
  • Building a data cloud: CDC is an essential service for creating a unified data cloud, streaming real-time changes from diverse, siloed sources (like transactional databases and SaaS applications) into a central analytics platform and back. This helps break down data barriers and creates a single source of truth.

FAQs about change data capture

How does CDC keep systems synchronized?

By capturing and propagating changes as they occur, CDC keeps multiple systems (such as a primary database and a backup, or a primary database and an analytics platform) synchronized. This reduces data discrepancies and helps maintain data integrity across all platforms.

How does change data capture work?

Change data capture works by identifying, tracking, and extracting the changes made to data in a source database. Several approaches are common, including the following (a sketch of the timestamp-based method appears after this list):

  • Log-based CDC reads the database's internal transaction log (the record of all changes) and streams those changes to a target system in near real time, minimizing the load on the source system
  • Trigger-based CDC creates database triggers on tables that record changes into separate shadow tables whenever a modification occurs, though this can add overhead to the source database's transaction processing
  • Timestamp-based CDC, the simplest but least performant method, relies on a timestamp or version column in the source table and periodically queries for rows whose change time is newer than the last extraction time
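
As a minimal illustration of the timestamp-based approach, the following Python sketch polls a table for rows changed since the last extraction. The schema, sample data, and watermark format are assumptions for the example (SQLite stands in for the source database):

  import sqlite3

  # Timestamp-based CDC assumes the source table carries an updated_at
  # column that the application sets on every insert or update.
  conn = sqlite3.connect(":memory:")
  conn.execute("CREATE TABLE orders (id INTEGER, amount REAL, updated_at TEXT)")
  conn.execute("INSERT INTO orders VALUES (1, 9.99, '2024-06-01T10:00:00Z')")
  conn.execute("INSERT INTO orders VALUES (2, 5.00, '2024-06-01T11:30:00Z')")

  def extract_changes(last_extracted_at):
      """Fetch only rows modified since the last extraction (the watermark)."""
      rows = conn.execute(
          "SELECT id, amount, updated_at FROM orders "
          "WHERE updated_at > ? ORDER BY updated_at",
          (last_extracted_at,),
      ).fetchall()
      # The newest timestamp seen becomes the watermark for the next poll.
      new_watermark = rows[-1][2] if rows else last_extracted_at
      return rows, new_watermark

  changes, watermark = extract_changes("2024-06-01T10:30:00Z")
  print(changes)    # only order 2, which changed after the watermark
  print(watermark)  # '2024-06-01T11:30:00Z'

Note that this method misses deletes and any intermediate updates between polls, which is one reason log-based CDC is generally preferred.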

How CDC works with databases

Databases are the most common and reliable source for change data capture because of a core component they all share: the transaction log. This log is an immutable, ordered record of every change, making it the perfect foundation for capturing data modifications accurately and efficiently without impacting database performance. Here’s a deeper look at the key concepts.

Understanding the database transaction log

Before any data is written to the actual database tables, the change (an insert, update, or delete) is first recorded in a transaction log. Modern CDC tools tap into this log to capture changes non-intrusively. This is a core part of database replication technology, and while the concept is universal, the name for this log varies across popular databases (a minimal PostgreSQL polling sketch follows the list). Some notable examples:

  • PostgreSQL: write-ahead log (WAL)
  • Oracle: redo log
  • SQL Server: transaction log
  • MySQL: binary log (Binlog)
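
As an illustration of log-based capture, this Python sketch polls PostgreSQL's write-ahead log through its built-in logical decoding functions (pg_create_logical_replication_slot and pg_logical_slot_get_changes are standard PostgreSQL functions). The connection details and slot name are assumptions, the server must run with wal_level set to logical, and production tools typically use the streaming replication protocol rather than polling:

  import psycopg2

  # Assumed connection details; requires wal_level = logical on the server.
  conn = psycopg2.connect("dbname=shop user=cdc_reader")
  conn.autocommit = True
  cur = conn.cursor()

  # Create a logical replication slot using the built-in test_decoding
  # plugin. The slot remembers our position in the write-ahead log (WAL).
  cur.execute(
      "SELECT * FROM pg_create_logical_replication_slot('cdc_demo', 'test_decoding')"
  )

  # Poll the slot: each row is one decoded change plus its log position.
  cur.execute(
      "SELECT lsn, xid, data FROM pg_logical_slot_get_changes('cdc_demo', NULL, NULL)"
  )
  for lsn, xid, data in cur.fetchall():
      print(lsn, data)  # e.g. "table public.orders: INSERT: id[integer]:1 ..."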

Combining initial loads with CDC

For large, existing databases, reading the entire history from the transaction log can be impractical due to network performance and other constraints. A common real-world pattern is to first perform a bulk load and then use CDC to synchronize (a sketch follows the list below). The process involves:

  1. Taking a consistent snapshot or backup of the database at a specific point in time.
  2. Loading that large snapshot into the destination system.
  3. Starting the CDC stream from the precise log position where the snapshot was taken.
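
A high-level Python sketch of this sequence follows. The helper methods are hypothetical placeholders for whatever snapshot, load, and streaming mechanisms your database and destination provide:

  def snapshot_and_catch_up(source, destination):
      """Hypothetical orchestration of an initial load followed by CDC."""
      # 1. Take a consistent snapshot and record the log position it maps to.
      #    (In PostgreSQL, creating a replication slot returns this position.)
      snapshot, start_lsn = source.take_consistent_snapshot()

      # 2. Bulk-load the snapshot; this can take hours for large databases,
      #    while new changes keep accumulating in the transaction log.
      destination.bulk_load(snapshot)

      # 3. Replay every change recorded at or after the snapshot position,
      #    then keep streaming so the destination stays in sync.
      for change in source.stream_changes(from_lsn=start_lsn):
          destination.apply(change)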

This “snapshot and catch-up” method ensures the destination is fully seeded with historical data and then kept perfectly in sync with all subsequent live changes. This is how managed services like Google Cloud's Datastream and Database Migration Service perform seamless and reliable backfills.

Positioning and recovery with log sequence numbers

To manage this process reliably, CDC relies on unique identifiers within the transaction log, typically called log sequence numbers (LSNs). These numbers act as precise bookmarks for the stream of changes. They are critical for:

  • Positioning: Knowing exactly where in the log to start or stop a stream
  • Recovery: If a CDC process fails, it can restart from the last successfully processed LSN, ensuring that every change is captured without loss or duplication (a checkpointing sketch follows this list)
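
The following Python sketch shows the checkpointing idea: persist the last processed LSN so a restarted stream resumes at its bookmark. The file location and the source/destination objects are assumptions for illustration:

  import json
  import os

  CHECKPOINT_FILE = "cdc_checkpoint.json"  # assumed location for the bookmark

  def load_checkpoint():
      """Return the last successfully processed LSN, or None on first run."""
      if os.path.exists(CHECKPOINT_FILE):
          with open(CHECKPOINT_FILE) as f:
              return json.load(f)["lsn"]
      return None

  def save_checkpoint(lsn):
      """Persist the LSN only after the change is safely applied downstream."""
      with open(CHECKPOINT_FILE, "w") as f:
          json.dump({"lsn": lsn}, f)

  def run_stream(source, destination):
      """Resume from the last bookmark; after a crash, restart picks up here."""
      start = load_checkpoint()
      for change in source.stream_changes(from_lsn=start):  # hypothetical API
          destination.apply(change)
          save_checkpoint(change.lsn)  # advance the bookmark per change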

These mechanisms—the transaction log, the initial snapshot, and the log sequence number—work in concert to make database CDC both highly efficient and reliable.

Benefits of change data capture

CDC can deliver several advantages that make it indispensable in modern data workflows:

Data consistency across systems

Synchronizes changes as they happen, reducing discrepancies and ensuring more accurate data across platforms.

Efficient data movement

Captures only the changes instead of entire datasets, significantly reducing strain on source systems and network usage.

Seamless integration

Connects your systems using a library of pre-built connectors. Many platforms provide no-code interfaces to configure pipelines and automatically manage complex backend tasks like schema drift detection and in-flight data transformations, ensuring your data flows without interruption.

Operational resilience

Supports disaster recovery and replication efforts by automatically keeping systems and backups aligned without manual intervention.

Real-time insights

Gives access to the latest data nearly instantly, enabling faster and more informed decision-making. This agility can provide a critical edge in fast-paced environments like financial trading, e-commerce, and logistics, where the value of data decays in seconds.

Competing technologies and alternatives

Several technologies overlap with or complement CDC, offering varied approaches to managing data changes and enabling real-time processing. Here’s how CDC compares to other prominent approaches:

Hybrid transactional and analytical processing (HTAP)

HTAP combines transactional (OLTP) and analytical (OLAP) workloads into a single system, providing immediate analytics on the data. While this makes HTAP well suited for unified analytics in a self-contained environment, it can struggle to synchronize with the diverse and distributed data sources that exist across an organization. CDC, on the other hand, excels at maintaining data consistency across multiple systems and enabling distributed workflows.

Federated queries

Federated queries (and a similar approach called ‘external tables’) allow a data warehouse or query engine to read data directly from an external source system, such as a transactional database, at the moment a query is run. This approach avoids the need to copy or move data, providing live access to the information where it resides.

This method can be excellent for ad hoc analysis or occasional queries, as it offers immediate access without the setup of a replication pipeline. However, for regularly used or performance-sensitive queries, it can be inefficient, as it puts a direct load on the source operational system with every query. Unlike CDC, which proactively pushes a stream of changes to a destination for optimized analytics, federation pulls data on demand, which may be unsuitable for high-frequency workloads.

Data streaming platforms

Kafka is an open source platform designed for high-throughput data streaming and processing. While it efficiently manages large-scale data flows, Kafka complements CDC rather than replacing it. CDC focuses on capturing precise database changes in real time, while Kafka streams broader event data, such as a website click, an IoT sensor reading, or an item being added to a shopping cart. Together, these technologies can create a powerful combination for real-time workflows.
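
As a small illustration of that combination, a downstream service might consume CDC change events that a capture tool has published to a Kafka topic. This Python sketch uses the kafka-python client; the topic name and event shape are assumptions (tools such as Debezium publish similar JSON change events):

  import json
  from kafka import KafkaConsumer  # pip install kafka-python

  # Assumed topic carrying JSON change events published by a CDC tool.
  consumer = KafkaConsumer(
      "shop.public.orders",
      bootstrap_servers="localhost:9092",
      value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
  )

  for message in consumer:
      event = message.value
      # A typical change event carries the operation and the row images.
      print(event.get("op"), event.get("before"), event.get("after"))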

Application-level integration

While CDC uses a no-code approach to replicate data, integration technologies at the application layer connect systems and SaaS platforms using a low-code approach. If you want the ability to transform data as it moves between platforms, then application-level integration products, such as Google Cloud’s Application Integration, can be a better tool for the job.


Get started with CDC on Google Cloud

Datastream is Google Cloud’s reliable, easy-to-use, and serverless change data capture (CDC) and replication service. It minimizes latency while enabling the synchronization of data across diverse databases and applications.
