This page provides a conceptual overview of Dataflow's integration with Pub/Sub. The overview describes some optimizations that are available in the Dataflow runner's implementation of the Pub/Sub I/O connector. Pub/Sub is a scalable, durable event ingestion and delivery system. Dataflow complements Pub/Sub's scalable, at-least-once delivery model with message deduplication, exactly-once processing, and generation of a data watermark from timestamped events. To use Dataflow, write your pipeline using the Apache Beam SDK and then execute the pipeline code on the Dataflow service.
Before you begin, learn about the basic concepts of Apache Beam and streaming pipelines. Read the following resources for more information:
- Intro to Apache Beam concepts such as PCollections, triggers, windows, and watermarks
- After Lambda: Exactly-once processing in Dataflow Part 1 and Part 3: Sources and Sinks
- The world beyond batch: Streaming 101 and Streaming 102
- Apache Beam programming guide
Building streaming pipelines with Pub/Sub
To get the benefits of Dataflow's integration with Pub/Sub, you can build your streaming pipelines in any of the following ways:
- Use existing streaming pipeline example code from the Apache Beam GitHub repo, such as streaming word extraction (Java), streaming wordcount (Python), and streaming_wordcap (Go).
- Write a new pipeline using the Apache Beam API reference (Java, Python, or Go).
- Use Google-provided Dataflow templates and the corresponding template source code in Java.
Google provides a set of Dataflow templates that offer a UI-based way to start Pub/Sub stream processing pipelines. If you use Java, you can also use the source code of these templates as a starting point to create a custom pipeline.
The following streaming templates export Pub/Sub data to different destinations:
- Pub/Sub Subscription to BigQuery
- Pub/Sub to Pub/Sub relay
- Pub/Sub to Cloud Storage Avro
- Pub/Sub to Cloud Storage Text
- Cloud Storage Text to Pub/Sub (Stream)
The following batch template imports a stream of data into a Pub/Sub topic:
- Cloud Storage Text to Pub/Sub (Batch)
Follow the Pub/Sub quickstart for stream processing with Dataflow to run a simple pipeline.
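To give a concrete sense of what such a pipeline looks like, the following Java sketch reads messages from a Pub/Sub subscription, counts words in one-minute windows, and publishes the counts to another Pub/Sub topic. It is a minimal illustration rather than one of the samples linked above; the project, subscription, and topic paths are placeholders.

```java
// A minimal sketch of a streaming word count (not one of the linked samples).
// Placeholders: my-project, my-subscription, my-output-topic.
import java.util.Arrays;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.StreamingOptions;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.FlatMapElements;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.joda.time.Duration;

public class StreamingWordCount {
  public static void main(String[] args) {
    StreamingOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(StreamingOptions.class);
    options.setStreaming(true);

    Pipeline pipeline = Pipeline.create(options);

    pipeline
        // Read messages from an existing subscription (placeholder path).
        .apply("ReadFromPubSub",
            PubsubIO.readStrings()
                .fromSubscription("projects/my-project/subscriptions/my-subscription"))
        // Split each message into words.
        .apply("ExtractWords",
            FlatMapElements.into(TypeDescriptors.strings())
                .via((String line) -> Arrays.asList(line.split("[^\\p{L}]+"))))
        // Window the unbounded stream so it can be aggregated.
        .apply("Window", Window.<String>into(FixedWindows.of(Duration.standardMinutes(1))))
        // Count occurrences of each word per window.
        .apply("CountWords", Count.perElement())
        // Format each count as "word: n".
        .apply("FormatResults",
            MapElements.into(TypeDescriptors.strings())
                .via((KV<String, Long> kv) -> kv.getKey() + ": " + kv.getValue()))
        // Publish the results to another topic (placeholder path).
        .apply("WriteToPubSub",
            PubsubIO.writeStrings().to("projects/my-project/topics/my-output-topic"));

    pipeline.run();
  }
}
```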
Pub/Sub and Dataflow integration features
Apache Beam provides a reference I/O source implementation (PubsubIO) for Pub/Sub (Java, Python, and Go). This I/O source implementation is used by non-Dataflow runners, such as the Apache Spark runner, the Apache Flink runner, and the direct runner.

The Dataflow runner uses a different, private implementation of PubsubIO (for Java, Python, and Go). This implementation takes advantage of Google Cloud-internal APIs and services to offer three main advantages: low latency watermarks, high watermark accuracy (and therefore data completeness), and efficient deduplication (exactly-once message processing).
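In practice, this substitution is transparent: pipeline code calls the same PubsubIO transform regardless of runner, and the runner is chosen through pipeline options. The following minimal Java sketch assumes a placeholder subscription path.

```java
// A minimal sketch: the same PubsubIO read is used on any runner. Pass
// --runner=DirectRunner to use the Apache Beam reference implementation
// locally, or --runner=DataflowRunner (with --project, --region, and
// --tempLocation) to let the Dataflow service substitute its own
// implementation. The subscription path is a placeholder.
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class RunnerSelection {
  public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
    Pipeline pipeline = Pipeline.create(options);

    pipeline.apply("ReadFromPubSub",
        PubsubIO.readStrings()
            .fromSubscription("projects/my-project/subscriptions/my-subscription"));
    // ... downstream transforms go here ...

    pipeline.run();
  }
}
```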
The Apache Beam I/O connectors let you interact with Dataflow by using controlled sources and sinks. The Dataflow runner's implementation of PubsubIO automatically acknowledges messages after they are successfully processed by the first fused stage and side-effects of that processing are written to persistent storage. See the fusion documentation for more details. Messages are therefore only acknowledged when Dataflow can guarantee that there is no data loss if some component crashes or a connection is lost.
Low latency watermarks
Dataflow has access to Pub/Sub's private API that provides the age of the oldest unacknowledged message in a subscription, with lower latency than is available in Cloud Monitoring. For comparison, the Pub/Sub backlog metrics that are available in Cloud Monitoring are typically delayed by two to three minutes, but the equivalent metrics are available to Dataflow with a delay of only approximately ten seconds. This makes it possible for Dataflow to advance pipeline watermarks and emit windowed computation results sooner.
High watermark accuracy
Another important problem solved natively by the Dataflow integration with Pub/Sub is the need for a robust watermark for windows defined in event time. The event time is a timestamp specified by the publisher application as an attribute of a Pub/Sub message, rather than the publish_time field set on a message by the Pub/Sub service itself. Because Pub/Sub computes backlog statistics only with respect to the service-assigned (or processing time) timestamps, estimating the event time watermark requires a separate mechanism.
To solve this problem, if the user elects to use custom event timestamps, the Dataflow service creates a second tracking subscription. This tracking subscription is used to inspect the event times of the messages in the backlog of the base subscription, and estimate the event time backlog. See the StackOverflow page that covers how Dataflow computes Pub/Sub watermarks for more information.
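In the Java SDK, you opt into custom event timestamps by telling PubsubIO which message attribute carries them. The attribute name ts and the subscription path in the following sketch are placeholder assumptions.

```java
// A minimal sketch (Java SDK): read messages and take the event time from a
// publisher-set attribute. The attribute name "ts" and the subscription path
// are placeholder assumptions.
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class CustomEventTimestamps {
  public static void main(String[] args) {
    Pipeline pipeline =
        Pipeline.create(PipelineOptionsFactory.fromArgs(args).withValidation().create());

    pipeline.apply("ReadWithEventTimestamps",
        PubsubIO.readStrings()
            .fromSubscription("projects/my-project/subscriptions/my-subscription")
            // Use the "ts" attribute as each element's event time. Publishers must
            // set it to milliseconds since the Unix epoch or an RFC 3339 timestamp.
            // Dataflow then creates the tracking subscription described above to
            // estimate the event-time watermark.
            .withTimestampAttribute("ts"));
    // ... event-time windowing and aggregation go here ...

    pipeline.run();
  }
}
```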
Efficient deduplication
Message deduplication is required for exactly-once message processing, and you can use the Apache Beam programming model to achieve exactly-once processing of Pub/Sub message streams. Dataflow deduplicates messages with respect to the Pub/Sub message ID. As a result, all processing logic can assume that the messages are already unique with respect to the Pub/Sub message ID. The efficient, incremental aggregation mechanism that accomplishes this is abstracted in the PubsubIO API.

If PubsubIO is configured to use a Pub/Sub message attribute for deduplication instead of the message ID, Dataflow deduplicates messages published to Pub/Sub within 10 minutes of each other.
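For example, with the Java SDK the attribute-based deduplication described above is configured on the read transform. The attribute name uniqueId and the subscription path below are placeholders.

```java
// A minimal sketch (Java SDK): deduplicate on a publisher-provided attribute
// instead of the Pub/Sub message ID. The attribute name "uniqueId" and the
// subscription path are placeholders.
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class AttributeDeduplication {
  public static void main(String[] args) {
    Pipeline pipeline =
        Pipeline.create(PipelineOptionsFactory.fromArgs(args).withValidation().create());

    pipeline.apply("ReadDedupedOnAttribute",
        PubsubIO.readStrings()
            .fromSubscription("projects/my-project/subscriptions/my-subscription")
            // Messages carrying the same "uniqueId" attribute value are treated
            // as duplicates if they are published within the deduplication window.
            .withIdAttribute("uniqueId"));

    pipeline.run();
  }
}
```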
The standard sorting APIs of the Dataflow service allow you to use ordered processing with Dataflow. Alternatively, to use ordering with Pub/Sub, see Message Ordering.
Unsupported Pub/Sub features
Dead-letter topics and exponential backoff delay retry policies
Pub/Sub dead-letter topics and exponential backoff delay retry policies are not fully supported by Dataflow. Instead, implement these patterns explicitly within the pipeline. Two examples of dead-letter patterns are provided in the retail application and the Pub/Sub to BigQuery template.
There are two reasons why dead-letter topics and exponential backoff delay retry policies are not fully supported by Dataflow.
First, Dataflow does not NACK messages (that is, send a negative acknowledgement) to Pub/Sub when pipeline code fails. Instead, Dataflow retries message processing indefinitely, while continually extending the acknowledgment deadline for the message. However, the Dataflow backend might NACK messages for various internal reasons, so it is possible for messages to be delivered to the dead-letter topic even when there have been no failures in the pipeline code.
Second, Dataflow might acknowledge messages before the pipeline fully processes the data. Specifically, Dataflow acknowledges messages after they are successfully processed by the first fused stage and side-effects of that processing have been written to persistent storage. If the pipeline has multiple fused stages and failures occur at any point after the first stage, the messages are already acknowledged.
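One common way to implement a dead-letter pattern inside the pipeline, as recommended above, is to catch failures in a DoFn and route the offending elements to a secondary output, which is then written to a destination such as a Pub/Sub topic or a BigQuery table. The following Java sketch shows the general shape under that assumption; the parsing logic, subscription, and dead-letter topic are placeholders, not the code used by the samples mentioned earlier.

```java
// A minimal sketch of an in-pipeline dead-letter pattern: elements that fail
// processing are sent to a secondary output instead of being retried forever.
// The parsing logic and the dead-letter topic name are placeholders.
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.apache.beam.sdk.values.TupleTag;
import org.apache.beam.sdk.values.TupleTagList;

public class DeadLetterPattern {
  private static final TupleTag<String> PARSED = new TupleTag<String>() {};
  private static final TupleTag<String> FAILED = new TupleTag<String>() {};

  public static void main(String[] args) {
    Pipeline pipeline =
        Pipeline.create(PipelineOptionsFactory.fromArgs(args).withValidation().create());

    PCollectionTuple results =
        pipeline
            .apply("ReadFromPubSub",
                PubsubIO.readStrings()
                    .fromSubscription("projects/my-project/subscriptions/my-subscription"))
            .apply("ParseOrDeadLetter",
                ParDo.of(new DoFn<String, String>() {
                  @ProcessElement
                  public void processElement(@Element String message, MultiOutputReceiver out) {
                    try {
                      // Placeholder for real parsing or validation logic.
                      out.get(PARSED).output(message.trim());
                    } catch (Exception e) {
                      // Route the failing element to the dead-letter output instead
                      // of letting the pipeline retry it indefinitely.
                      out.get(FAILED).output(message);
                    }
                  }
                }).withOutputTags(PARSED, TupleTagList.of(FAILED)));

    // Write failed elements to a dead-letter topic (placeholder path).
    results.get(FAILED)
        .apply("WriteDeadLetters",
            PubsubIO.writeStrings().to("projects/my-project/topics/dead-letter"));

    // Continue processing the successfully parsed elements.
    // results.get(PARSED).apply(...);

    pipeline.run();
  }
}
```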
Pub/Sub exactly-once delivery
Because Dataflow has its own exactly-once processing, using Pub/Sub exactly-once delivery with Dataflow is not recommended. Enabling Pub/Sub exactly-once delivery reduces pipeline performance, because it limits the messages available for parallel processing.
Upcoming migration from Synchronous Pull to Streaming Pull (Streaming Engine only)
Streaming Engine currently uses Synchronous Pull to consume data from Pub/Sub. In the future, Streaming Engine will use Streaming Pull for improved performance.
During the migration, a job might use Synchronous Pull for one period of time and Streaming Pull for another period of time. This transition will affect the Pub/Sub metrics displayed in the Dataflow UI and reported to Cloud Monitoring. After a job switches to Streaming Pull, some existing metrics will not be reported. For more information, see Using the Dataflow monitoring interface.
Synchronous Pull and Streaming Pull consume separate quotas. The Dataflow team proactively increases quotas for projects that already consume large amounts of data using Synchronous Pull.
Streaming jobs that don't use Streaming Engine won't be migrated to Streaming Pull and won't be affected by this change.
If you have any questions about the migration, contact your account team.
What's next
- Stream Processing with Pub/Sub and Dataflow: Qwik Start (self-paced lab)
- Stream from Pub/Sub to BigQuery