
Reliable streaming pipeline development with Cloud Pub/Sub’s Replay

February 1, 2019
Alex Mordkovich

Software Engineer

By its very definition, streaming data moves quickly, sometimes leaving you wishing you had a ‘rewind’ button. Cloud Pub/Sub’s new Replay feature gives you that option. Cloud Pub/Sub is a simple, reliable, and scalable messaging service that streams messages from publisher to subscriber clients. With Replay, you can now replay old events (or messages), giving your subscribers another chance to process them. In this post, we’ll explain how this new feature can help you recover from bugs in either your publisher’s or subscriber’s codebase.

Let’s suppose we are building a pipeline that processes telemetry events from a fleet of vehicles. Each vehicle generates events at a rate of 60 per second, sending them off to front ends running in GCP, or perhaps streaming them into Cloud IoT Core. In either case, a Cloud Pub/Sub publisher client ultimately publishes the events to a Pub/Sub topic as messages. On the other side, subscriber processes consume these events in parallel from a single subscription attached to the Pub/Sub topic.
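To make the examples below concrete, assume a topic named vehicle-telemetry and a subscription named telemetry-processor; both names are placeholders, not part of the original setup. A rough sketch of creating them with the gcloud CLI:

    # Topic that the publisher clients write telemetry events to
    # (names here are illustrative placeholders).
    gcloud pubsub topics create vehicle-telemetry

    # Subscription that the subscriber processes pull events from.
    gcloud pubsub subscriptions create telemetry-processor \
        --topic=vehicle-telemetry \
        --ack-deadline=60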

After a few days of running this pipeline, we notice that some subscribers occasionally and inexplicably start to crash. When this happens, the crashing subscriber fails to acknowledge the message it is processing. Looking at the Cloud Pub/Sub service metrics in Stackdriver, we see that the oldest unacknowledged message age is growing:

[Chart: oldest unacknowledged message age, growing steadily (https://storage.googleapis.com/gweb-cloudblog-publish/images/1_pub_sub_metrics.max-1300x1300.png)]

This means that some messages in the subscription’s backlog are not being acknowledged by the subscribers. Is it possible that the subscriber clients are under-provisioned and cannot keep up with the flow of events? When we look at the size of the subscription’s backlog to test this hypothesis, we see that it remains low and is not growing:

[Chart: subscription backlog size, staying low and flat (https://storage.googleapis.com/gweb-cloudblog-publish/images/4_backlog.max-1300x1300.png)]

This chart indicates that the subscribers are generally keeping up, and thus we conclude that a small number of messages are languishing in the subscription’s backlog. These messages are being delivered to the subscribers repeatedly, but are never acknowledged.
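If we want to see what these stuck messages actually contain, one option is to pull a handful of them without acknowledging, so the backlog is left untouched; the subscription name is the placeholder from earlier:

    # Pull up to five messages for inspection. Without --auto-ack they are
    # not acknowledged and will be redelivered later.
    gcloud pubsub subscriptions pull telemetry-processor --limit=5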

We consider the following scenarios for addressing this issue:

Scenario 1:

We spot a bug that causes the publisher to occasionally publish a message that crashes the subscriber. For instance, the message might not conform to a particular schema or encoding that the subscriber expects. We will refer to such a message as “non-conformant.” We fix the bug and roll out a new version of the publisher. However, the oldest unacknowledged message age is still growing, because some non-conformant messages (published prior to our fix) remain in the subscription’s backlog, and the subscribers are not able to acknowledge these messages. If we are happy to simply discard the non-conformant messages, we can use the Seek API to “fast forward” and bulk-acknowledge all messages published before a certain point in time, as demonstrated by the following command:
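Roughly, with the gcloud CLI, using the placeholder subscription name and an illustrative timestamp for when the fixed publisher went live:

    # Mark everything published before the given time as acknowledged,
    # discarding the non-conformant messages already in the backlog.
    # Subscription name and timestamp are placeholders.
    gcloud pubsub subscriptions seek telemetry-processor \
        --time=2019-02-01T00:00:00Z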

The oldest unacknowledged message age drops once the Seek operation completes:

[Chart: oldest unacknowledged message age dropping once the Seek operation completes (https://storage.googleapis.com/gweb-cloudblog-publish/images/3_seek_operation.max-1300x1300.png)]

And the backlog size drops to zero as well:

[Chart: subscription backlog size dropping to zero (https://storage.googleapis.com/gweb-cloudblog-publish/images/2_subscription_backlog.max-1300x1300.png)]

Scenario 2:

We fail to find the bug in the publisher, and instead fix the subscriber code to correctly handle the non-conformant messages. When we deploy the new subscriber, it turns out that the new code is even more broken! It (still) fails to process the non-conformant messages, and instead of crashing, it just acknowledges these messages. This lets us make progress, but results in data loss. When we realize this, we roll back the subscribers to the previous version of the code.

Without the Replay feature, once a message is acknowledged, the Pub/Sub system discards the message, leaving us with no way to recover it. With Replay, however, we are able to use the Seek API to “rewind” and unacknowledge all messages that were published in the past couple of days, including the messages that were wrongly acknowledged:

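Again roughly, with placeholder names and a timestamp a couple of days in the past:

    # Rewind: mark retained messages published after the given time as
    # unacknowledged, including ones the broken subscriber wrongly acked.
    # Subscription name and timestamp are placeholders.
    gcloud pubsub subscriptions seek telemetry-processor \
        --time=2019-01-30T00:00:00Z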

The ability to use Seek in this manner requires the retain_acked_messages property to have been enabled on the subscription in advance.
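For reference, a sketch of turning this on for an existing subscription; the seven-day retention window is just an example:

    # Retain acknowledged messages (here for up to 7 days) so that a later
    # seek can make them deliverable again.
    gcloud pubsub subscriptions update telemetry-processor \
        --retain-acked-messages \
        --message-retention-duration=7d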

After the Seek operation completes, we see that the size of our subscription backlog jumps up significantly:

[Chart: subscription backlog size jumping up significantly after the Seek operation completes (https://storage.googleapis.com/gweb-cloudblog-publish/images/5_completed_seek_operations.max-1300x1300.png)]

Our subscribers will then need to work through several days’ worth of old messages, most of which they’ve already processed before. At a publishing rate of 60 messages per second, that is roughly five million messages a day, so tens of millions of messages may need processing, taking hours. Can we do better than this? With snapshots, we can.

Scenario 3:

As in Scenario 2, we can’t spot the bug in the publisher, so we decide to instead fix the subscriber code to correctly handle the non-conformant messages. This time, just before we roll out the new subscriber code, we create a snapshot of the subscription by using the CreateSnapshot API:

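With the gcloud CLI this might look like the following; the snapshot name is a placeholder:

    # Capture the subscription's current acknowledgment state in a snapshot.
    gcloud pubsub snapshots create pre-rollout-snapshot \
        --subscription=telemetry-processor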

The snapshot captures the exact state of the subscription’s backlog at the time the snapshot is created. We then proceed to roll out the new subscriber code. As in Scenario 2, the new subscriber code turns out to contain even more significant bugs, and incorrectly acknowledges the non-conformant messages. Upon discovering the new bugs, we roll back the subscribers to the previous version of the code. Because we had created the snapshot, we can now use the Seek API to restore the subscription to the state captured by that snapshot:

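Again as a sketch, with the same placeholder names:

    # Restore the subscription's acknowledgment state to what it was when
    # the snapshot was taken.
    gcloud pubsub subscriptions seek telemetry-processor \
        --snapshot=pre-rollout-snapshot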

When the Seek operation completes, we see that the size of our subscription backlog jumps, but not as significantly as in Scenario 2:

[Chart: subscription backlog size after seeking to the snapshot, a smaller jump than in Scenario 2 (https://storage.googleapis.com/gweb-cloudblog-publish/images/6_seek_operations.max-1300x1300.png)]

The messages that the subscribers now need to re-process are those that were in the subscription’s backlog at the time the snapshot was created, which is primarily the non-conformant messages that were causing trouble before, plus any messages published since the snapshot was taken.
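Putting the pieces together, a snapshot-guarded rollout along these lines might look roughly as follows; all names are placeholders, and the deployment step stands in for whatever release process you use:

    # 1. Snapshot the subscription's state just before deploying new code.
    gcloud pubsub snapshots create pre-rollout-snapshot \
        --subscription=telemetry-processor

    # 2. Roll out the new subscriber code (placeholder for your own deploy step).
    # ./deploy_subscribers.sh v2

    # 3. If the new code misbehaves, roll back and restore the backlog.
    gcloud pubsub subscriptions seek telemetry-processor \
        --snapshot=pre-rollout-snapshot

    # 4. Once a good version is running, clean up the snapshot.
    gcloud pubsub snapshots delete pre-rollout-snapshot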

You can repeat the above process until you’re able to deploy a version of the subscriber code that handles the non-conformant messages correctly instead of crashing. Cloud Pub/Sub Replay gives you the development flexibility to try solutions safely, knowing you can always undo and retry without the risk of message loss. If you’re interested in trying out your own Pub/Sub Replay scenarios, please check out our documentation and quickstart guide.
