Data Analytics

Simple backup and replay of streaming events using Cloud Pub/Sub, Cloud Storage, and Cloud Dataflow

August 7, 2018

Sergei Sokolenko

Cloud Dataflow Product Manager

Google Cloud is announcing an easy way to back up and replay your streaming pipeline events directly from the Cloud Console, via a new collection of simple import/export templates. If you are a developer interested in data stream processing, you’ll likely find this feature very handy.

For example, let’s assume you’ve built a streaming pipeline on Cloud Pub/Sub and Cloud Dataflow and deployed it into production. A few days later, after being notified by Cloud Error Reporting and inspecting logs, you realize that some of the events travelling through your Pub/Sub topics are malformed. You want to investigate this further, but don’t want to interrupt your main production pipeline. “Wouldn’t it be great if there were a simple way to siphon events into a Cloud Storage file that you can then inspect for errors?” you ask. Or perhaps there is nothing wrong with your events, but you just want to save a collection of them as a simple backup, and then replay these events in the future to debug a problem or run a reproducible test.

For several months now, Cloud Dataflow has offered a collection of simple import/export templates that connect data sources such as Pub/Sub, BigQuery, Datastore, Bigtable, and files in text and Avro format, among many others. We are now bringing these import/export capabilities right to where the rest of your data lives: Pub/Sub topics, Cloud Storage folders and files, and so on.

Here’s how it works:

Saving streaming events into text files

In the Pub/Sub console, find the Pub/Sub topic that you want to back up to a text file.

https://storage.googleapis.com/gweb-cloudblog-publish/images/streaming-events-1.max-800x800.png

After confirming that you want to run a Dataflow export job, you will see a simple page where the name of the Pub/Sub topic is pre-selected and where you can enter the destination Cloud Storage folder. We selected gs://PROJECT_ID/teleport/output/ as our output location and gs://PROJECT_ID/teleport/tmp as our temporary location, and set the prefix for output file names to “events”.

https://storage.googleapis.com/gweb-cloudblog-publish/images/streaming-events-2.max-900x900.png
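The same export job can probably be launched from a script instead of the Console. The sketch below only assembles a gcloud command line using the locations and prefix chosen above. The function name `build_export_command`, the job name, and the topic name are hypothetical, and the template path `Cloud_PubSub_to_GCS_Text` and the parameter names `inputTopic`, `outputDirectory`, and `outputFilenamePrefix` are assumptions that should be checked against the current Dataflow template documentation before use.

```python
def build_export_command(job_name, project_id, topic, prefix="events"):
    """Return an argv list that would launch the export template via gcloud.

    Assumption: the template path and parameter names below match the
    Google-provided "Pub/Sub to Cloud Storage Text" template. Verify them
    against the Dataflow documentation before running the command.
    """
    base = f"gs://{project_id}/teleport"
    parameters = {
        "inputTopic": f"projects/{project_id}/topics/{topic}",
        "outputDirectory": f"{base}/output/",
        "outputFilenamePrefix": prefix,
    }
    # gcloud expects template parameters as a comma-separated key=value list.
    joined = ",".join(f"{key}={value}" for key, value in parameters.items())
    return [
        "gcloud", "dataflow", "jobs", "run", job_name,
        "--gcs-location",
        "gs://dataflow-templates/latest/Cloud_PubSub_to_GCS_Text",
        "--staging-location", f"{base}/tmp",
        "--parameters", joined,
    ]


if __name__ == "__main__":
    # Print the command instead of executing it, so it can be reviewed first.
    print(" ".join(build_export_command("export-events", "PROJECT_ID", "my-topic")))
```

Printing the command rather than running it keeps the sketch free of side effects; passing the returned list to `subprocess.run` would be one way to execute it once the names have been verified.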

Clicking “Run Job” will start a continuous export job (a streaming pipeline) that captures all events passing through the Pub/Sub topic and saves them into text files at 5-minute intervals.

https://storage.googleapis.com/gweb-cloudblog-publish/images/streaming-events-3.max-1500x1500.png
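To make the 5-minute interval concrete, here is a minimal sketch of fixed-window grouping in plain Python. It is not the template's implementation (Dataflow handles windowing itself); the helper names and the `(timestamp, payload)` event shape are assumptions chosen for illustration.

```python
from collections import defaultdict
from datetime import datetime, timezone

WINDOW_SECONDS = 5 * 60  # the 5-minute interval mentioned above


def window_start(timestamp):
    """Round an epoch timestamp (seconds) down to the start of its window."""
    return int(timestamp // WINDOW_SECONDS) * WINDOW_SECONDS


def group_by_window(events):
    """Group (epoch_seconds, payload) pairs by 5-minute window start.

    Each resulting group corresponds to what one batch of output would
    hold under fixed windowing.
    """
    windows = defaultdict(list)
    for timestamp, payload in events:
        windows[window_start(timestamp)].append(payload)
    return dict(windows)


def window_label(start):
    """Format a window start as a UTC timestamp string (hypothetical label)."""
    moment = datetime.fromtimestamp(start, tz=timezone.utc)
    return moment.strftime("%Y-%m-%dT%H:%M:%SZ")


if __name__ == "__main__":
    sample = [(0, "a"), (299, "b"), (300, "c"), (901.5, "d")]
    for start, payloads in sorted(group_by_window(sample).items()):
        print(window_label(start), payloads)
```

Events at 0 and 299 seconds share a window, while the event at 300 seconds starts the next one, which is why output files appear in 5-minute steps rather than one per event.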

Let’s look at the output folder. After navigating to the Cloud Storage browser, we see several output files that were created by our pipeline.

https://storage.googleapis.com/gweb-cloudblog-publish/images/streaming-events-4.max-1100x1100.png

We can now inspect their contents and learn more about the events that pass through our Pub/Sub topic. We can also replay these events.
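Once an output file has been downloaded, the inspection step could be scripted along these lines. This is a sketch under the assumption that each line of the file holds one event encoded as a JSON object; the post does not specify the event format, and `find_malformed` is a hypothetical helper name.

```python
import json


def find_malformed(lines):
    """Return (line_number, text, reason) for lines that are not JSON objects.

    Assumption: one event per line, each expected to be a JSON object.
    Blank lines are skipped but still counted in the line numbering.
    """
    problems = []
    for number, raw in enumerate(lines, start=1):
        text = raw.rstrip("\n")
        if not text.strip():
            continue
        try:
            event = json.loads(text)
        except json.JSONDecodeError as exc:
            problems.append((number, text, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(event, dict):
            problems.append((number, text, "not a JSON object"))
    return problems


if __name__ == "__main__":
    sample = ['{"id": 1}\n', '{"id": 2\n', "\n", "[1, 2]\n"]
    for number, text, reason in find_malformed(sample):
        print(f"line {number}: {reason}: {text}")
```

Because the function accepts any iterable of lines, an open file handle can be passed directly, for example `find_malformed(open(path, encoding="utf-8"))`.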

Replaying events

To replay saved events, click on the options menu in the Cloud Storage browser near the file that you want to replay.

https://storage.googleapis.com/gweb-cloudblog-publish/images/streaming-events-5.max-600x600.png

You can also replay entire folders. The “Export to Pub/Sub” menu is available for both folder and file objects in the Cloud Storage browser.

When you select the “Export to Cloud Pub/Sub” menu, you will see a very similar Dataflow page where you can confirm the selected file and enter the destination Pub/Sub topic.

https://storage.googleapis.com/gweb-cloudblog-publish/images/streaming-events-6.max-900x900.png

When you click “Run job”, a simple batch import job is created that parses the file you specified and sends a Pub/Sub message for each record it finds in that file. If you selected a folder rather than a single file, all files in that folder are processed the same way.

https://storage.googleapis.com/gweb-cloudblog-publish/images/streaming-events-7.max-1700x1700.png
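The replay logic described above (one message per record) can be sketched as follows. This is an illustration, not the template's code: it assumes a record is one line of text, and it takes the publishing function as a parameter so the example runs without any cloud dependency. `replay_lines` is a hypothetical name.

```python
def replay_lines(lines, publish):
    """Call publish(data) once per non-empty line and return the count sent.

    Assumptions: each line of the saved file is one record, and `publish`
    accepts the message payload as bytes. With the google-cloud-pubsub
    client, `publish` could wrap PublisherClient.publish(topic_path, data).
    """
    sent = 0
    for raw in lines:
        record = raw.rstrip("\r\n")
        if not record:
            continue  # skip blank lines instead of sending empty messages
        publish(record.encode("utf-8"))
        sent += 1
    return sent


if __name__ == "__main__":
    outbox = []  # stand-in for a real Pub/Sub topic
    count = replay_lines(["first event\n", "\n", "second event"], outbox.append)
    print(count, outbox)
```

Injecting `publish` keeps the parsing logic testable on its own; replaying a whole folder would amount to calling `replay_lines` once per file.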

This is how you back up and replay streaming events in Cloud Pub/Sub using the simple import/export functionality in Cloud Dataflow.

Next steps

We hope you’ve found this quick explainer useful for testing Cloud Pub/Sub scenarios and simulating results in Dataflow based on these backed-up and replayed datasets. For some additional, related topics, feel free to check out the following tutorials:
