Using Cloud Pub/Sub Seek with Cloud Dataflow

Cloud Pub/Sub Seek lets you replay and reprocess previously acknowledged messages, or acknowledge messages in bulk. However, we do not recommend seeking a subscription while a Cloud Dataflow pipeline is reading from it. Seeking under a running pipeline invalidates Cloud Dataflow's watermark logic and interferes with exactly-once processing. It also conflicts with pipeline state that already incorporates the processed data.

We recommend using Cloud Pub/Sub Seek with the following workflow:

  1. Make a snapshot of the subscription.
  2. Drain the Cloud Dataflow pipeline so that in-flight messages finish processing.
  3. Seek the subscription to the snapshot.
  4. Restart the pipeline.

Creating a snapshot

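A subscription snapshot captures the acknowledgment state of a subscription at the time the snapshot is created. Seeking to the snapshot later lets you reprocess messages from that point. To create a snapshot using the gcloud command-line tool, run the following commands: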

alias pubsub='gcloud pubsub'
pubsub snapshots create my-snapshot --subscription=seek-demo-sub

To verify that the snapshot was created, run the following command:

pubsub snapshots list
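
For scripted use, the create and verify steps can be wrapped in a small shell function. The following is a minimal sketch rather than an official recipe: the function name create_and_verify_snapshot is hypothetical, and it assumes that gcloud pubsub snapshots describe exits with a non-zero status when the snapshot does not exist.

```shell
# Hypothetical helper: create a snapshot, then confirm that it is visible.
# Usage: create_and_verify_snapshot SNAPSHOT SUBSCRIPTION
create_and_verify_snapshot() {
  snapshot="$1"
  subscription="$2"

  # Create the snapshot; stop here if creation fails
  # (for example, because the name is already taken).
  gcloud pubsub snapshots create "$snapshot" \
    --subscription="$subscription" || return 1

  # Assumption: describe returns non-zero for a missing snapshot.
  if gcloud pubsub snapshots describe "$snapshot" >/dev/null 2>&1; then
    echo "snapshot $snapshot is available"
  else
    echo "snapshot $snapshot was not found" >&2
    return 1
  fi
}
```

With the names used on this page, the call would be create_and_verify_snapshot my-snapshot seek-demo-sub.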

Draining the subscription

To drain the subscription and restart the pipeline, follow these steps:

  1. Go to the Cloud Dataflow console and click your streaming pipeline.
  2. In the Summary pane, click Stop Job.
  3. Select Drain so that in-flight messages finish processing, then wait until the job has stopped.
  4. Seek your subscription to the snapshot with the gcloud command-line tool:
    pubsub subscriptions seek seek-demo-sub --snapshot=my-snapshot
  5. Restart your Cloud Dataflow pipeline.
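
Steps 2 through 4 can also be scripted instead of using the console. The sketch below is illustrative only: the function name drain_and_seek, its arguments, and the POLL_SECONDS variable with its 10-second default are assumptions, and it relies on gcloud dataflow jobs describe reporting a currentState of JOB_STATE_DRAINED once the drain has finished.

```shell
# Hypothetical helper: drain a Dataflow job, wait until it has drained,
# then seek the subscription back to the snapshot.
# Usage: drain_and_seek JOB_ID REGION SUBSCRIPTION SNAPSHOT
drain_and_seek() {
  job_id="$1"
  region="$2"
  subscription="$3"
  snapshot="$4"

  # Ask Dataflow to drain the job so in-flight messages are processed.
  gcloud dataflow jobs drain "$job_id" --region="$region" || return 1

  # Poll the job state; seek only after the job reports it has drained.
  while :; do
    state=$(gcloud dataflow jobs describe "$job_id" --region="$region" \
      --format='value(currentState)') || return 1
    case "$state" in
      JOB_STATE_DRAINED)
        break
        ;;
      JOB_STATE_FAILED|JOB_STATE_CANCELLED)
        echo "job $job_id ended in $state; not seeking" >&2
        return 1
        ;;
    esac
    # Assumed polling interval; override with POLL_SECONDS.
    sleep "${POLL_SECONDS:-10}"
  done

  # The pipeline is no longer reading, so it is safe to seek.
  gcloud pubsub subscriptions seek "$subscription" --snapshot="$snapshot"
}
```

Restarting the pipeline (step 5) is left out of the sketch because the launch command depends on how your pipeline is deployed.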
