Use Pub/Sub Seek with Dataflow

Pub/Sub Seek lets you replay and reprocess previously acknowledged messages, or acknowledge messages in bulk. However, we do not recommend accessing Pub/Sub Seek directly from within a running Dataflow pipeline. Direct access invalidates Dataflow's watermark logic and does not work well with exactly-once processing. It also conflicts with pipeline state that already incorporates the processed data.

We recommend using Pub/Sub Seek with the following workflow:

  1. Create a snapshot of the subscription.
  2. Drain the Dataflow pipeline so that in-flight messages finish processing.
  3. Seek the subscription to the snapshot.
  4. Resubmit the pipeline.

Creating a snapshot

To replay messages, you seek the subscription to a snapshot and reprocess from that point. To create the snapshot using the gcloud command-line tool, run the following commands:

alias pubsub='gcloud pubsub'
pubsub snapshots create my-snapshot --subscription=seek-demo-sub

To verify that the snapshot was created, run the following command:

pubsub snapshots list
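
The two commands above can also be combined into one scripted check. The following is a minimal sketch, not an official tool: the function name create_and_verify_snapshot is hypothetical, and the default names are the example values used in this guide. It calls gcloud pubsub directly because shell aliases such as pubsub are normally not expanded in non-interactive scripts.

```shell
# Sketch: create a snapshot, then fail fast if it cannot be found.
# Defaults match the example names in this guide; override as needed.
SUBSCRIPTION="${SUBSCRIPTION:-seek-demo-sub}"
SNAPSHOT="${SNAPSHOT:-my-snapshot}"

# Hypothetical helper: returns non-zero if either gcloud call fails.
create_and_verify_snapshot() {
  gcloud pubsub snapshots create "$SNAPSHOT" --subscription="$SUBSCRIPTION" || return 1
  # "describe" exits with a non-zero status when the snapshot does not exist.
  gcloud pubsub snapshots describe "$SNAPSHOT"
}
```

After sourcing the file, run create_and_verify_snapshot and continue with the drain only if it succeeds.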

Draining the subscription

To drain the pipeline, seek the subscription, and resubmit the pipeline, follow these steps:

  1. Navigate to the Dataflow console and click your streaming pipeline.
  2. In the Summary pane, click Stop Job.
  3. Select Drain so that in-flight messages finish processing, and then wait until the job stops.
  4. Seek your subscription to the snapshot with the gcloud command-line tool:
    pubsub subscriptions seek seek-demo-sub --snapshot=my-snapshot
  5. Resubmit your Dataflow pipeline.
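
If you prefer the command line to the console, the drain and seek steps can be sketched as a script. This is an illustrative sketch under several assumptions: the function names job_state and drain_and_seek are hypothetical, the job ID and region are placeholders that you supply, and the 30-second polling interval is arbitrary. It relies on gcloud dataflow jobs drain and on the currentState field reported by gcloud dataflow jobs describe. Resubmitting the pipeline depends on how you launch it, so that step is left out.

```shell
# Sketch: drain a streaming job, wait for the drain to finish, then seek.
# Defaults match the example names in this guide; override as needed.
SUBSCRIPTION="${SUBSCRIPTION:-seek-demo-sub}"
SNAPSHOT="${SNAPSHOT:-my-snapshot}"
POLL_SECONDS="${POLL_SECONDS:-30}"   # arbitrary polling interval

# Print the job's current state, for example JOB_STATE_DRAINING.
job_state() {
  gcloud dataflow jobs describe "$1" --region="$2" --format='value(currentState)'
}

# Hypothetical helper. Usage: drain_and_seek JOB_ID REGION
drain_and_seek() {
  job_id="$1"
  region="$2"
  gcloud dataflow jobs drain "$job_id" --region="$region" || return 1
  # Poll until the job reports that it has finished draining.
  while :; do
    state="$(job_state "$job_id" "$region")" || return 1
    case "$state" in
      JOB_STATE_DRAINED) break ;;
      JOB_STATE_FAILED|JOB_STATE_CANCELLED)
        echo "job ended in state $state; not seeking" >&2
        return 1 ;;
    esac
    sleep "$POLL_SECONDS"
  done
  # Seek only after the drain completes, so the running job never sees it.
  gcloud pubsub subscriptions seek "$SUBSCRIPTION" --snapshot="$SNAPSHOT"
}
```

Seeking only after the job reports JOB_STATE_DRAINED keeps the seek out of the running pipeline, which is the point of the workflow described above. Once drain_and_seek returns successfully, resubmit the pipeline as you normally would.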