This page provides a conceptual overview of Dataflow's integration with Pub/Sub. The overview describes some optimizations that are available in the Dataflow runner's implementation of the Pub/Sub I/O connector. Pub/Sub is a scalable, durable event ingestion and delivery system. Dataflow compliments Pub/Sub's scalable, at-least-once delivery model with message deduplication and exactly-once, in-order processing if you use windows and buffering. To use Dataflow, write your pipeline using the Apache Beam SDK and then execute the pipeline code on the Dataflow service.
Before you begin, learn about the basic concepts of Apache Beam and streaming pipelines. Read the following resources for more information:
- Intro to Apache Beam concepts such as PCollections, triggers, windows, and watermarks
- After Lambda: Exactly-once processing in Dataflow Part 1 and Part 3: Sources and Sinks
- The world beyond batch streaming 101 and 102
- Apache Beam programming guide
Building streaming pipelines with Pub/Sub
To get the benefits of Dataflow's integration with Pub/Sub, you can use any of the following ways to build your streaming pipelines:
Use Google-provided Dataflow templates and the corresponding template source code in Java.
Google provides a set of Dataflow templates that offer a UI-based way to start Pub/Sub stream processing pipelines. If you use Java, you can also use the source code of these templates as a starting point to create a custom pipeline.
The following streaming templates export Pub/Sub data to different destinations:
- Pub/Sub Subscription to BigQuery
- Pub/Sub to Pub/Sub relay
- Pub/Sub to Cloud Storage Avro
- Pub/Sub to Cloud Storage Text
- Storage Text to Pub/Sub (Stream)
The following batch template imports a stream of data into a Pub/Sub topic:
Pub/Sub and Dataflow integration features
Apache Beam provides a reference I/O source implementation (
which is used by non-Dataflow runners (such as the Apache Spark
runner, Apache Flink runner, and the direct runner).
However, the Dataflow runner uses a different, private
PubsubIO. This implementation takes advantage of
Google Cloud-internal APIs and services to offer three main advantages: low
latency watermarks, high watermark accuracy (and therefore data completeness),
and efficient deduplication.
Low latency watermarks
Dataflow has access to Pub/Sub's private API which provides the age of the oldest unacknowledged message in a subscription with lower latency than is available in Stackdriver. For comparison, the Pub/Sub backlog metrics that are available in Stackdriver are typically delayed by two to three minutes, but the metrics are only delayed by approximately ten seconds for Dataflow. This makes it possible for Dataflow to advance pipeline watermarks and emit windowed computation results sooner.
High watermark accuracy
Another important problem solved natively by the Dataflow
integration with Pub/Sub is the need for a robust watermark for
windows defined in event time. Event time is a timestamp
specified by the publisher application as an attribute of a
Pub/Sub message, rather than the
field set on a message by Pub/Sub service. Because
Pub/Sub computes backlog statistics only with
respect to the service-assigned (or processing time) timestamps, estimating the
event time watermark requires a separate mechanism.
To solve this problem, if the user elects to use custom event timestamps, the Dataflow service creates a second tracking subscription. This tracking subscription is used to inspect the event times of the messages in the backlog of the base subscription, and estimate the event time backlog. See the StackOverflow page that covers how Dataflow computes Pub/Sub watermarks for more information.
Message deduplication is required for exactly-once message processing.
Dataflow deduplicates messages with respect to the
Pub/Sub message ID. As a result, all processing logic
can assume that the messages are already unique with respect to the
Pub/Sub message ID. The efficient, incremental aggregation
mechanism to accomplish this is abstracted in the