Change data capture (CDC) is a process used to track and capture changes made to data in a source system, such as a transactional database, data warehouse, or SaaS application. By recording changes in real time, CDC enables you to keep systems synchronized, power analytics, train AI models, and support seamless data migrations.
Knowing what changes occur in a database is essential for maintaining data integrity, optimizing workflows, and unlocking real-time insights. A typical example is in retail—when a customer places an order, CDC captures that sale from the transactional database at the moment it happens. A CDC service like Google Cloud’s Datastream can replicate the changes from an on-premises PostgreSQL database into a BigQuery data warehouse, powering an analytics dashboard in near real time and eliminating the wait for slow, nightly batch jobs.
Change data capture (CDC) is a process used to identify and capture data modifications—such as insertions, updates, and deletions—from a source system, often in near real time. CDC is best known for keeping downstream systems consistent with the source while moving only the data that has actually changed.
The ETL process, which stands for ‘extract, transform, load,’ involves extracting raw data from its source system, transforming it into the necessary structure, and then loading it into the destination system, typically a data warehouse. ETL helps to ensure that data is organized, clean, and ready for downstream use in analytics or applications.
Traditionally, ETL runs in large, scheduled batches, often overnight. This means the data in the target system is only as fresh as the most recent batch run.
The key difference is in how CDC extracts data. Instead of extracting the dataset in batches, CDC captures the changes as they happen, making it possible to always query the most current data instead of relying on periodic batch processing. This can help you respond faster to changes and gain more accurate insights.
Change data capture is primarily a process focused on the efficient extraction of all incremental changes from a source database, often in near real time, to minimize the load on the source system.
In contrast, slowly changing dimensions (SCD) is a data modeling technique applied in the target data warehouse or analytical system, focusing on how to store and manage the historical state of dimension data (like customer records or product attributes) over time.
While CDC is the tool for efficient data transport of the changes, these changes are typically applied using a specific SCD strategy to maintain a complete and accurate history within the destination system.
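To make the relationship concrete, here is a minimal Python sketch of applying a stream of CDC change events to a dimension table using an SCD Type 2 strategy, where an update closes out the previous version of a row and inserts a new current one. The event shape and field names are illustrative assumptions, not the output of any particular CDC tool.

```python
from datetime import datetime, timezone

# Illustrative CDC change events; real tools emit similar before/after payloads.
events = [
    {"op": "insert", "key": 42, "after": {"customer_id": 42, "tier": "bronze"}},
    {"op": "update", "key": 42, "after": {"customer_id": 42, "tier": "gold"}},
]

# The "dimension table" is modeled as a list of row versions (SCD Type 2):
# each version carries valid_from / valid_to timestamps and an is_current flag.
dimension = []

def apply_scd2(event, table):
    now = datetime.now(timezone.utc)
    if event["op"] in ("update", "delete"):
        # Close out the currently active version of this business key.
        for row in table:
            if row["customer_id"] == event["key"] and row["is_current"]:
                row["valid_to"] = now
                row["is_current"] = False
    if event["op"] in ("insert", "update"):
        # Insert a new, current version reflecting the captured change.
        table.append({**event["after"],
                      "valid_from": now, "valid_to": None, "is_current": True})

for e in events:
    apply_scd2(e, dimension)

for version in dimension:
    print(version)
```

In a production pipeline the same logic would more likely run as a merge statement inside the destination warehouse, but the division of labor is the same: CDC delivers the change, and the SCD strategy decides how its history is stored.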
Change data capture can be a powerful bridge for connecting the diverse systems that make up a modern data stack or data cloud. Rather than being limited to specific databases, CDC tools are designed to work with a wide variety of sources and destinations.
Transactional databases are the most common sources for CDC, capturing every insert, update, and delete from the systems that run daily business operations.
Data warehouses and data lakes are often the primary destination for CDC data; these systems consolidate information from multiple sources for business intelligence, analytics, and AI.
Use CDC to pull change data from critical business platforms (often via APIs or webhooks) to synchronize customer, sales, or support data across the organization.
Streaming and messaging platforms often act as a real-time messaging bus, receiving change events from a source and distributing them to multiple downstream consumers simultaneously.
Object storage is a common destination for raw CDC event logs, as it provides a scalable and cost-effective foundation for building a data lake.
CDC can unlock a wide range of critical use cases, from ensuring data continuity to powering real-time analytics and AI.
By capturing and propagating changes as they occur, CDC ensures that multiple systems (like a primary database plus a backup, or a primary database plus an analytics platform) are always synchronized. This eliminates data discrepancies and maintains data integrity across all platforms.
Change data capture works by identifying, tracking, and extracting the changes made to data in a source database. Common approaches include reading the database's transaction log, using triggers that record each change to a side table, and periodically querying for rows with updated timestamps; log-based capture, described below, is the most widely used.
Databases are the most common and reliable source for change data capture because of a core component they all share: the transaction log. This log is an immutable, ordered record of every change, making it the perfect foundation for capturing data modifications accurately and efficiently without impacting database performance. Here’s a deeper look at the key concepts.
Before any data is written to the actual database tables, the change (an insert, update, or delete) is first recorded in a transaction log. Modern CDC tools tap into this log to capture changes non-intrusively. This is a core part of database replication technology, and while the concept is universal, the name for this log varies across popular databases: PostgreSQL calls it the write-ahead log (WAL), MySQL the binary log (binlog), SQL Server the transaction log, and Oracle the redo log.
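As an illustration of log-based capture, the following sketch reads changes from PostgreSQL's write-ahead log through logical decoding, using the psycopg2 client. It assumes a database configured with wal_level=logical and the wal2json output plugin installed; the connection string and slot name are placeholders.

```python
import psycopg2
import psycopg2.extras

# Placeholders: adjust the DSN and slot name for your environment.
DSN = "dbname=shop user=cdc_reader password=secret host=localhost"
SLOT = "demo_cdc_slot"

conn = psycopg2.connect(
    DSN, connection_factory=psycopg2.extras.LogicalReplicationConnection
)
cur = conn.cursor()

# A replication slot tells PostgreSQL to retain WAL until this consumer confirms it.
try:
    cur.create_replication_slot(SLOT, output_plugin="wal2json")
except psycopg2.errors.DuplicateObject:
    pass  # slot already exists from a previous run

cur.start_replication(slot_name=SLOT, decode=True)

def handle(msg):
    # msg.payload is a JSON document describing the inserts, updates, and deletes
    # in one transaction; msg.data_start is its log sequence number (LSN).
    print(msg.data_start, msg.payload)
    # Acknowledge the LSN so the server can recycle WAL up to this point.
    msg.cursor.send_feedback(flush_lsn=msg.data_start)

cur.consume_stream(handle)  # blocks, streaming changes as they are committed
```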
For large, existing databases, reading the entire history from the transaction log can be impractical due to network performance and other constraints. A common real-world pattern is to first perform a bulk load and then use CDC to synchronize. The process involves taking an initial snapshot (a bulk load) of the existing data, recording the position in the transaction log at the moment the snapshot begins, and then streaming every change committed after that position.
This “snapshot and catch-up” method ensures the destination is fully seeded with historical data and then kept perfectly in sync with all subsequent live changes. This is how managed services like Google Cloud's Datastream and Database Migration Service perform seamless and reliable backfills.
To manage this process reliably, CDC relies on unique identifiers within the transaction log, typically called log sequence numbers (LSNs). These numbers act as precise bookmarks for the stream of changes. They are critical for resuming the stream exactly where it left off after an interruption, preserving the order in which changes were committed, and marking the precise point where the initial snapshot ends and live replication begins, all without losing or duplicating changes.
These mechanisms—the transaction log, the initial snapshot, and the log sequence number—work in concert to make database CDC both highly efficient and reliable.
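A schematic version of that interplay, with the LSN persisted as a resumable bookmark, might look like the sketch below. The source and destination objects and their methods are placeholders for whatever bulk-load and log-reading interfaces a given database or CDC service exposes.

```python
import json
import pathlib

CHECKPOINT = pathlib.Path("cdc_checkpoint.json")

def load_checkpoint():
    """Return the last acknowledged LSN, or None on the very first run."""
    if CHECKPOINT.exists():
        return json.loads(CHECKPOINT.read_text())["lsn"]
    return None

def save_checkpoint(lsn):
    CHECKPOINT.write_text(json.dumps({"lsn": lsn}))

def run_pipeline(source, destination):
    lsn = load_checkpoint()
    if lsn is None:
        # 1. Record the current log position, then bulk-copy the existing rows.
        lsn = source.current_lsn()             # placeholder API
        for batch in source.snapshot():        # placeholder API: full table scan
            destination.write(batch)
        save_checkpoint(lsn)
    # 2. Catch up: stream only the changes committed after the bookmark.
    for change in source.read_log(since=lsn):  # placeholder API: tail the transaction log
        destination.apply(change)
        save_checkpoint(change.lsn)            # after a failure, resume exactly here
```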
CDC can deliver several advantages that make it indispensable in modern data workflows:
Data consistency across systems
Synchronizes changes as they happen, reducing discrepancies and ensuring more accurate data across platforms.
Efficient data movement
Captures only the changes instead of entire datasets, significantly reducing strain on the source system and network usage.
Seamless integration
Connects your systems using a library of pre-built connectors. Many platforms provide no-code interfaces to configure pipelines and automatically manage complex backend tasks like schema drift detection and in-flight data transformations, ensuring your data flows without interruption.
Operational resilience
Supports disaster recovery and replication efforts by automatically keeping systems and backups aligned without manual intervention.
Real-time insights
Gives access to the latest data nearly instantly, enabling faster and more informed decision-making. This agility can provide a critical edge in fast-paced environments like financial trading, e-commerce, and logistics, where the value of data decays in seconds.
Several technologies overlap with or complement CDC, offering varied approaches to managing data changes and enabling real-time processing. Here’s how CDC compares to other prominent approaches:
HTAP combines transactional (OLTP) and analytical (OLAP) workloads into a single system, providing immediate analytics on the data. While this makes HTAP well suited for unified analytics in a self-contained environment, it can struggle to synchronize with the diverse and distributed data sources that exist across an organization. CDC, on the other hand, excels at maintaining data consistency across multiple systems and enabling distributed workflows.
Federated queries (and a similar approach called ‘external tables’) allow a data warehouse or query engine to read data directly from an external source system, such as a transactional database, at the moment a query is run. This approach avoids the need to copy or move data, providing live access to the information where it resides.
This method can be excellent for ad hoc analysis or occasional queries, as it offers immediate access without the setup of a replication pipeline. However, for regularly used or performance-sensitive queries, it can be inefficient, as it puts a direct load on the source operational system with every query. Unlike CDC, which proactively pushes a stream of changes to a destination for optimized analytics, federation pulls data on demand, which may be unsuitable for high-frequency workloads.
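To make the pull-on-demand model concrete, here is a hedged sketch of a federated query issued from BigQuery against an external Cloud SQL database using the Python client. The connection name, table, and columns are placeholders; note that every execution of this query reaches back into the operational database.

```python
from google.cloud import bigquery

client = bigquery.Client()

# EXTERNAL_QUERY runs the inner SQL on the external (for example, Cloud SQL)
# database at query time, so each run puts load on the source system.
sql = """
SELECT *
FROM EXTERNAL_QUERY(
  'my-project.us.my-postgres-connection',            -- placeholder connection
  'SELECT order_id, status, updated_at FROM orders'  -- runs on the source database
)
"""

for row in client.query(sql).result():
    print(dict(row))
```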
Kafka is an open source platform designed for high-throughput data streaming and processing. While it efficiently manages large-scale data flows, Kafka complements CDC rather than replacing it. CDC focuses on capturing precise database changes in real time, while Kafka streams broader event data, such as a website click, an IoT sensor reading, or an item being added to a shopping cart. Together, these technologies can create a powerful combination for real-time workflows.
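As a small illustration of that combination, a CDC consumer can publish each captured change onto a Kafka topic so that any number of downstream consumers receive it. This sketch uses the kafka-python client; the broker address, topic name, and event shape are assumptions for illustration.

```python
import json
from kafka import KafkaProducer

# Placeholder broker; the event below mimics a typical CDC payload with
# before/after row images and the source log position.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

change_event = {
    "table": "orders",
    "op": "update",
    "before": {"order_id": 1001, "status": "pending"},
    "after": {"order_id": 1001, "status": "shipped"},
    "lsn": "0/16B3748",
}

# Keying by primary key keeps all changes for one row in the same partition, in order.
producer.send(
    "orders.cdc",
    key=str(change_event["after"]["order_id"]).encode("utf-8"),
    value=change_event,
)
producer.flush()
```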
While CDC uses a no-code approach to replicate data, application-layer integration technologies connect systems and SaaS platforms using a low-code approach. If you want the ability to transform data as it moves between platforms, then application-level integration products, such as Google Cloud’s Application Integration, can be a better tool for the job.
Datastream is Google Cloud’s reliable, easy-to-use, and serverless change data capture (CDC) and replication service. It minimizes latency while enabling the synchronization of data across diverse databases and applications. Here are some resources to help you get started: