Apache Kafka for GCP users: connectors for Pub/Sub, Dataflow and BigQuery
Eric Anderson
Engineering Lead, gRPC
Kir Titievsky
Product Manager, Google Cloud
Google Cloud Platform allows you to choose between using managed services and deploying services on top of raw compute, storage and networking. All this choice means new design decisions for engineers: should I run my own services or let Google do the heavy lifting? For big data teams, that's a particularly thorny question, as the Apache Hadoop ecosystem has so many great open-source solutions, while GCP offers incredible services such as Google BigQuery and Google Cloud Dataflow. For developers of event-stream processing pipelines and distributed systems in particular, one key decision is between Apache Kafka, a high-throughput, distributed, publish-subscribe messaging system, and Google Cloud Pub/Sub, our managed offering.
We’re here today to say: Why choose? Our goal is to make GCP the best platform to run your own services while offering incredible managed alternatives — and make it easy to run them in parallel or migrate between them. This is why, for example, customers building streaming and batch processing systems can choose Google Cloud Dataproc for the familiar open source Apache Spark and Hadoop tools or Cloud Dataflow, based on Apache Beam (incubating), for Google’s fully-managed unified batch and stream processing stack.
Today, we are happy to talk about several connector projects that make GCP services interoperate with Apache Kafka.
Google Cloud Pub/Sub sink and source connectors using Kafka Connect
This code is actively maintained by the Google Cloud Pub/Sub team. This general solution is useful if you're building a system that combines GCP services such as Stackdriver Logging, Cloud Dataflow, or Cloud Functions with an existing Kafka deployment.
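To make the setup concrete, here is a minimal sketch of a sink connector configuration that forwards a Kafka topic to a Pub/Sub topic. The property names follow the connector's README at the time of writing, and values such as my-project, events and events-pubsub are placeholders; check the GitHub repository for the current options.

name=pubsub-sink
connector.class=com.google.pubsub.kafka.sink.CloudPubSubSinkConnector
tasks.max=10
# Kafka topic to read from (placeholder)
topics=events
# Pub/Sub project and topic to publish to (placeholders)
cps.project=my-project
cps.topic=events-pubsub

The source connector works the same way in reverse: com.google.pubsub.kafka.source.CloudPubSubSourceConnector reads from a Pub/Sub subscription (cps.subscription) and writes to a Kafka topic (kafka.topic).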
KafkaIO for Apache Beam and Dataflow
This native connector, developed by the Beam team at Google, lets pipelines consume data directly from existing Kafka clusters while providing the full processing power of Dataflow as well as simple connectivity to other services, including Pub/Sub and BigQuery.
Next year, Dataflow's integrated autoscaling, update and drain features will become available for KafkaIO (they are available today with Cloud Pub/Sub). This means you'll be able to write a single pipeline capable of both backfill and live data processing that scales automatically and updates in place with no data loss or downtime, and with no changes to your Kafka deployment.
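As a sketch of what this looks like in practice, the following pipeline reads records from a Kafka topic and republishes the payloads to Cloud Pub/Sub. It follows the KafkaIO API as documented in the Beam Java SDK; the broker address, topic names and project are placeholders.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Values;
import org.apache.kafka.common.serialization.StringDeserializer;

public class KafkaToPubSub {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    p.apply("ReadFromKafka", KafkaIO.<String, String>read()
            .withBootstrapServers("kafka-broker:9092")    // placeholder broker address
            .withTopic("events")                          // placeholder Kafka topic
            .withKeyDeserializer(StringDeserializer.class)
            .withValueDeserializer(StringDeserializer.class)
            .withoutMetadata())                           // drop Kafka metadata, keep key/value pairs
        .apply(Values.<String>create())                   // keep only the message payload
        .apply("WriteToPubSub", PubsubIO.writeStrings()
            .to("projects/my-project/topics/my-topic"));  // placeholder Pub/Sub topic
    p.run();
  }
}

Running the same pipeline on Dataflow rather than a local runner is simply a matter of passing --runner=DataflowRunner along with your project and staging options.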
Kafka to BigQuery connector
Recently developed by our friends at WePay, the BigQuery connector is an easy path to BigQuery from existing Kafka clusters.
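As with the Pub/Sub connector, configuration is a standard Kafka Connect properties file. The sketch below is illustrative only; the exact property names vary by connector version, so consult WePay's repository for the options matching your release. The project, dataset and topic values are placeholders.

name=kafka-to-bigquery
connector.class=com.wepay.kafka.connect.bigquery.BigQuerySinkConnector
tasks.max=1
# Kafka topic(s) to stream into BigQuery (placeholder)
topics=events
# Destination GCP project and topic-to-dataset mapping (placeholders)
project=my-project
datasets=.*=kafka_ingest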
In the short term, our plan is to work together with our partner Confluent, which offers a real-time data streaming platform, to offer broader support for Kafka on GCP, to develop benchmarks and to offer guidance on choosing between OSS and managed services.
"Connectors play a critical role in developing the Apache Kafka ecosystem," said Neha Narkhede, Confluent CTO. "As one of the creators of Apache Kafka and a co-founder of Confluent, it's always exciting to see a growing open source ecosystem. With these new connectors, customers who are using Google Cloud Platform can experience the power of the Apache Kafka technology and Confluent platform, and we're happy to collaborate with Google to make this experience easier for our joint customers."
In the meantime, we welcome you to try the Kafka connectors, and to be part of our effort to make GCP the best place to run your own streaming services or our managed offerings. Please send us your feedback, bugs and pull requests on GitHub.