Jump to Content
Data Analytics

Dataflow Under the Hood: Comparing Dataflow with other tools

August 24, 2020
Sam McVeety

Principal Engineer

Ryan Lippert

Product Manager

Editor's note: This is the third blog in a three-part series examining the internal Google history that led to Dataflow, how Dataflow works as a Google Cloud service, and here, how it compares and contrasts with other products in the marketplace.

To place Google Cloud’s stream and batch processing tool Dataflow in the larger ecosystem, we'll discuss how it compares to other data processing systems. Each system that we talk about has a unique set of strengths and applications that it has been optimized for. We’re biased, of course, but we think that we've balanced these needs particularly well in Dataflow.

Apache Kafka is a very popular system for message delivery and subscription, and provides a number of extensions that increase its versatility and power. Here, we'll talk specifically about the core Kafka experience. Because it is a message delivery system, Kafka does not have direct support for state storage for aggregates or timers. These can be layered on top through abstractions like Kafka Streams. Kafka does support transactional interactions between two topics in order to provide exactly once communication between two systems that support these transactional semantics. It does not natively support watermark semantics (though can support them through Kafka Streams) or autoscaling, and users must re-shard their application in order to scale the system up or down.

Apache Spark is a data processing engine that was (and still is) developed with many of the same goals as Google Flume and Dataflow—providing higher-level abstractions that hide underlying infrastructure from users. Spark has a rich ecosystem, including a number of tools for ML workloads. Spark has native exactly once support, as well as support for event time processing. Spark does have some limitations as far as its ability to handle late data, because its event processing capabilities (and thus garbage collection) are based on static thresholds rather than watermarks. State management in Spark is similar to the original MillWheel concept of providing a coarse-grained persistence mechanism. Users need to manually scale their Spark clusters up and down. One major limitation of structured streaming like this is that it is currently unable to handle multi-stage aggregations within a single pipeline.

Apache Flink is a data processing engine that incorporates many of the concepts from MillWheel streaming. It has native support for exactly-once processing and event time, and provides coarse-grained state that is persisted through periodic checkpointing. The effect of this on the cost of state persistence is ambiguous, since most Flink deployments still write to a local RocksDB instance frequently, and periodically checkpoint this to an external file system. Depending on the frequency of checkpointing, this can increase time to recovery in the case that computation has to be repeated. Flink also requires manual scaling by its users; some vendors are working towards autoscaling Flink, but that would still require learning the ins and outs of a new vendor’s platform.

Finally, a brief word on Apache Beam, Dataflow’s SDK. Given Google Cloud’s broad open source commitment (Cloud Composer, Cloud Dataproc, and Cloud Data Fusion are all managed OSS offerings), Beam is often confused for an execution engine, with the assumption that Dataflow is a managed offering of Beam. That’s not the case—Dataflow jobs are authored in Beam, with Dataflow acting as the execution engine. The benefits of Apache Beam come from open-source development and portability. Jobs can be written to Beam in a variety of languages, and those jobs can be run on Dataflow, Apache Flink, Apache Spark, and other execution engines. That means you’re never locked into Google Cloud.

This concludes our three-part Under the Hood walk-through covering Dataflow. Check out part 1 and part 2. We're excited about the current state of Dataflow, and the state of the overall data processing industry. We look forward to delivering a steady "stream" of innovations to our customers in the months and years ahead.

The complete Dataflow Under the Hood series

Posted in