Introducing real-time data integration for BigQuery with Cloud Data Fusion
Businesses today have a growing demand for real-time data integration, analysis, and action. More often than not, the valuable data driving these actions (transactional and operational data) is stored on-prem or in public clouds, in traditional relational databases that aren’t suited to continuous analytics. While old-school migrations or batch ETL loads can get data into a data warehouse, these high-latency approaches don’t cut it when decisions need to be accurate and based on the most up-to-date insights.
Cloud Data Fusion is a fully managed, cloud-native data integration and ingestion service that helps developers, data engineers, and business analysts alike to efficiently build and manage ETL/ELT jobs. Today we’re announcing the public preview launch of the replication application in Data Fusion that enables low-latency, real-time data replication from transactional and operational databases such as SQL Server and MySQL directly into BigQuery.
Let’s take a closer look at the benefits of replication in Data Fusion:
Remove technical bottlenecks so even citizen developers can set up replication easily
Cloud Data Fusion features a simple, wizard-driven interface that enables even citizen developers such as ETL developers and data analysts to easily set up data replication. This standard, easy-to-use interface eliminates the need for development of complicated, bespoke tools for each type of operational database, thereby enabling self-service, continuous replication of data to BigQuery.
Feasibility assessment and actionable recommendations
Replication in Data Fusion also includes an assessment tool that identifies schema incompatibilities, connectivity issues, and missing features before replication starts, then recommends corrective actions. This helps users get ahead of potential issues, leading to faster development and iteration.
Easily access the latest operational data in real time for analysis within BigQuery
Change data capture, or CDC, represents the changes made to data as a stream, so computations and processing can focus on only the most recently changed records, minimizing the extraction load on sensitive production systems. With this release, Data Fusion now offers log-based replication directly into BigQuery. It integrates with Debezium as the change provider, which makes CDC logs from various databases available in a common format. It currently supports Microsoft SQL Server (which relies on SQL Server CDC) and MySQL (which relies on the MySQL binary log). With support for CDC streams, Google Cloud users have access to the latest data in BigQuery for analysis and action.
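To make the idea of a change stream concrete, here is a minimal Python sketch of how a consumer might apply change events to a replica. It is an illustration only, not how Data Fusion is implemented. The event shape is a simplified assumption modeled on Debezium-style events, with an `op` code plus `before` and `after` row images. The function names `apply_change` and `replay`, the `id` key, and the sample rows are hypothetical.

```python
from typing import Any, Dict, Iterable

Row = Dict[str, Any]


def apply_change(table: Dict[Any, Row], event: Dict[str, Any], key: str = "id") -> None:
    """Apply one simplified change event to an in-memory replica keyed by primary key."""
    op = event["op"]
    if op in ("c", "r", "u"):
        # Create, snapshot read, or update: upsert the new row image.
        row = event["after"]
        table[row[key]] = row
    elif op == "d":
        # Delete: remove the row identified by the old row image.
        table.pop(event["before"][key], None)
    else:
        raise ValueError(f"unknown op: {op!r}")


def replay(events: Iterable[Dict[str, Any]], key: str = "id") -> Dict[Any, Row]:
    """Build a replica by applying events in the order they were captured."""
    table: Dict[Any, Row] = {}
    for event in events:
        apply_change(table, event, key)
    return table


# Hypothetical sample stream: two inserts, one update, one delete.
events = [
    {"op": "c", "before": None, "after": {"id": 1, "status": "new"}},
    {"op": "c", "before": None, "after": {"id": 2, "status": "new"}},
    {"op": "u", "before": {"id": 1, "status": "new"}, "after": {"id": 1, "status": "paid"}},
    {"op": "d", "before": {"id": 2, "status": "new"}, "after": None},
]

replica = replay(events)
print(replica)  # {1: {'id': 1, 'status': 'paid'}}
```

Only the four changed records are processed here. The consumer never rescans the source table, which is why log-based capture keeps the load on the production database low.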
Enterprise scalability to support high-volume transactional databases
Initial loads of data into BigQuery use zero-downtime snapshot replication, which prepares the data warehouse to consume changes continuously. Once the initial snapshot is done, high-throughput, continuous replication of changes starts in real time.
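The snapshot-then-stream handoff can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Data Fusion's implementation: the `Source` class, its append-only `log`, and the `apply_changes` helper are hypothetical. The idea shown is that the snapshot records the log position it corresponds to, so writes that arrive during or after the initial load are picked up from that position onward.

```python
from typing import Any, Dict, List, Optional, Tuple

Change = Tuple[str, int, Optional[Any]]  # (op, key, value); op is "upsert" or "delete"


class Source:
    """Toy source database: a table plus an append-only change log."""

    def __init__(self) -> None:
        self.rows: Dict[int, Any] = {}
        self.log: List[Change] = []

    def write(self, key: int, value: Any) -> None:
        self.rows[key] = value
        self.log.append(("upsert", key, value))

    def delete(self, key: int) -> None:
        self.rows.pop(key, None)
        self.log.append(("delete", key, None))

    def snapshot(self) -> Tuple[Dict[int, Any], int]:
        """Return a copy of the table and the log position that copy corresponds to."""
        return dict(self.rows), len(self.log)


def apply_changes(target: Dict[int, Any], log: List[Change], start: int) -> int:
    """Apply log entries from `start` onward and return the new log position."""
    for op, key, value in log[start:]:
        if op == "upsert":
            target[key] = value
        else:
            target.pop(key, None)
    return len(log)


source = Source()
source.write(1, "a")
source.write(2, "b")

# Initial load: the source stays writable while the snapshot is copied.
target, position = source.snapshot()

# Writes that arrive after the snapshot was taken.
source.write(3, "c")
source.delete(1)

# Continuous replication resumes from the recorded log position.
position = apply_changes(target, source.log, position)
print(target, position)  # {2: 'b', 3: 'c'} 4
```

Because replication resumes from the position recorded at snapshot time, the target converges to the source without pausing writes on the source.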
End-to-end operational visibility
Data Fusion also provides operational dashboards to monitor throughput, latency, and errors in replication jobs. These dashboards give real-time insight into replication performance, letting users proactively identify potential bottlenecks and monitor data delivery SLAs.
Take advantage of key Google Cloud features and integrations
Replication is available in all Google Cloud regions supported today for Data Fusion. This launch includes support for Customer-Managed Encryption Keys (CMEK) and VPC-SC. Cloud Data Fusion’s integration within the Google Cloud platform ensures that the highest levels of enterprise security and privacy are observed while making the latest data available in your data warehouse for analytics.
Ready to try out replication? Create a new instance of Data Fusion and add the replication app. Don’t forget to bring the getting started guide along for the ride.