Streaming with Pub/Sub

This page provides a conceptual overview of Dataflow's integration with Pub/Sub. The overview describes some optimizations that are available in the Dataflow runner's implementation of the Pub/Sub I/O connector. Pub/Sub is a scalable, durable event ingestion and delivery system. Dataflow compliments Pub/Sub's scalable, at-least-once delivery model with message deduplication, exactly-once processing, and generation of a data watermark from timestamped events. To use Dataflow, write your pipeline using the Apache Beam SDK and then execute the pipeline code on the Dataflow service.

Before you begin, learn about the basic concepts of Apache Beam and streaming pipelines. Read the following resources for more information:

Building streaming pipelines with Pub/Sub

To get the benefits of Dataflow's integration with Pub/Sub, you can build your streaming pipelines in any of the following ways:

Pub/Sub and Dataflow integration features

Apache Beam provides a reference I/O source implementation (PubsubIO) for Pub/Sub (Java and Python), which is used by non-Dataflow runners (such as the Apache Spark runner, Apache Flink runner, and the direct runner).

However, the Dataflow runner uses a different, private implementation of PubsubIO. This implementation takes advantage of Google Cloud-internal APIs and services to offer three main advantages: low latency watermarks, high watermark accuracy (and therefore data completeness), and efficient deduplication.

The Dataflow runner's implementation of PubsubIO automatically acknowledges messages once they have been successfully processed by the first fused stage (and side-effects of that processing have been written to persistent storage). See the fusion documentation for more details. Messages are therefore only acknowledged when Dataflow can guarantee that there will be no data loss if some component crashed, or a connection were lost.

Low latency watermarks

Dataflow has access to Pub/Sub's private API that provides the age of the oldest unacknowledged message in a subscription, with lower latency than is available in Cloud Monitoring. For comparison, the Pub/Sub backlog metrics that are available in Cloud Monitoring are typically delayed by two to three minutes, but the metrics are delayed only by approximately ten seconds for Dataflow. This makes it possible for Dataflow to advance pipeline watermarks and emit windowed computation results sooner.

High watermark accuracy

Another important problem solved natively by the Dataflow integration with Pub/Sub is the need for a robust watermark for windows defined in event time. The event time is a timestamp specified by the publisher application as an attribute of a Pub/Sub message, rather than the publish_time field set on a message by the Pub/Sub service itself. Because Pub/Sub computes backlog statistics only with respect to the service-assigned (or processing time) timestamps, estimating the event time watermark requires a separate mechanism.

To solve this problem, if the user elects to use custom event timestamps, the Dataflow service creates a second tracking subscription. This tracking subscription is used to inspect the event times of the messages in the backlog of the base subscription, and estimate the event time backlog. See the StackOverflow page that covers how Dataflow computes Pub/Sub watermarks for more information.

Efficient deduplication

Message deduplication is required for exactly-once message processing. Dataflow deduplicates messages with respect to the Pub/Sub message ID. As a result, all processing logic can assume that the messages are already unique with respect to the Pub/Sub message ID. The efficient, incremental aggregation mechanism to accomplish this is abstracted in the PubsubIO API.

If PubsubIO is configured to use the Pub/Sub message attribute for deduplication instead of the message ID, Dataflow deduplicates messages published to Pub/Sub within 10 minutes of each other.

Unsupported Pub/Sub features

Dead-letter topics and exponential backoff delay retry policies

Pub/Sub dead-letter topics and exponential backoff delay retry policies are not fully supported by Dataflow. Instead, implement these patterns explicitly within the pipeline. Two examples of dead-letter patterns are provided in the retail application and the Pub/Sub to BigQuery template.

There are two reasons why dead-letter topics and exponential backoff delay retry policies do not work with Dataflow.

First, Dataflow does not NACK messages (that is, send a negative acknowledgement) to Pub/Sub when pipeline code fails. Instead, Dataflow retries message processing indefinitely, while continually extending the acknowledgment deadline for the message. However, the Dataflow backend may NACK messages for various internal reasons, so it is possible for messages to be delivered to the dead-letter topic even when there have been no failures in the pipeline code.

Second, Dataflow may acknowledge messages before the pipeline fully processes the data. Specifically, Dataflow acknowledges messages after they have been successfully processed by the first fused stage (and side-effects of that processing have been written to persistent storage). If the pipeline has multiple fused stages, and there are failures at any point after the first stage, the messages will already have been acknowledged.