Using Dataflow snapshots

Dataflow snapshots save the state of a streaming pipeline, which allows you to start a new version of your Dataflow job without losing state. This is useful for backup and recovery, testing and rolling back updates to streaming pipelines, and other similar scenarios.

You can create a Dataflow snapshot of any running streaming job; however, any new job you create from a snapshot uses Streaming Engine. You can also use a Dataflow snapshot to migrate your existing pipeline over to the more efficient and scalable Streaming Engine with minimal downtime.

This guide explains how to create snapshots, manage snapshots, and create jobs from snapshots.

Before you begin

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud Console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Make sure that billing is enabled for your Cloud project. Learn how to confirm that billing is enabled for your project.

  4. Enable the Cloud Dataflow, Compute Engine, Stackdriver Logging, Cloud Storage, Cloud Storage JSON, BigQuery, Cloud Pub/Sub, Cloud Datastore, and Cloud Resource Manager APIs.

    Enable the APIs

Creating a snapshot

Console

  1. Open the Dataflow monitoring UI.
    Go to the Dataflow Web UI

    A list of Dataflow jobs appears along with their status. If you don't see any streaming jobs, you need to run a new streaming job. For an example of a streaming job, see the Using Templates quickstart.

  2. Select a job.
  3. In the menu bar on the Job details page, click Create Snapshot.
  4. In the Create a snapshot dialog, select either Without data sources to snapshot only your Dataflow job state, or With data sources to snapshot your Dataflow job state along with your Pub/Sub source.

CLI

To create a snapshot, use the create command:

gcloud beta dataflow snapshots create \
    --job-id=JOB_ID \
    --snapshot-ttl=DURATION \
    --snapshot-sources=true \
    --region=REGION

Replace the following:

  • JOB_ID: your streaming job ID.
  • DURATION: the amount of time (in days) before the snapshot expires, after which no new jobs can be created from the snapshot. The --snapshot-ttl flag is optional; if it is not specified, the snapshot expires in 7 days. Specify the value in the format 5d. The maximum duration you can specify is 30 days (30d).
  • REGION: the region where your job is running.

The --snapshot-sources flag specifies whether to snapshot the Pub/Sub sources along with the Dataflow snapshot. If true, Pub/Sub sources are automatically snapshotted and the Pub/Sub snapshot IDs are shown in the output response. After running the create command, check the snapshot status by running either the list or the describe command.
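
For example, the following commands create a snapshot of a hypothetical streaming job with a 14-day expiration and then list the snapshots in the region to check the status. The job ID shown is a placeholder for illustration:

gcloud beta dataflow snapshots create \
    --job-id=2021-06-01_08_00_00-1234567890123456789 \
    --snapshot-ttl=14d \
    --snapshot-sources=true \
    --region=us-central1

gcloud beta dataflow snapshots list \
    --region=us-central1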

The following apply when creating Dataflow snapshots:

  • Dataflow snapshots incur a charge on disk usage.
  • Snapshots are created in the same region as the job and only for jobs running in us-central1, europe-west1, europe-west4, or asia-northeast1. More regions will be available upon General Availability.
  • If the job's worker location is different from the job's region, snapshot creation fails. See the regional endpoints guide.
  • You can only take snapshots of non-Streaming Engine jobs if the jobs were started or updated after September 1, 2020.
  • Pub/Sub snapshots created with Dataflow snapshots are managed by the Pub/Sub service and incur a charge.
  • A Pub/Sub snapshot expires no later than 7 days after its creation. Its exact lifetime is determined at creation time by the backlog in the source subscription: the lifetime of the Pub/Sub snapshot is 7 days minus the age of the oldest unacknowledged message in the subscription. For example, if the oldest unacked message in a subscription is 3 days old, a Pub/Sub snapshot created from that subscription expires in 4 days; the snapshot captures that 3-day-old backlog for as long as it exists. See the Pub/Sub snapshot reference.
  • During the snapshot operation, your Dataflow job pauses, then resumes after the snapshot is ready. The time needed depends on the size of the pipeline state; for example, snapshots of Streaming Engine jobs generally complete faster than snapshots of non-Streaming Engine jobs.
  • You can cancel the job while a snapshot is in progress, which then cancels the snapshot.
  • You cannot update or drain the job while a snapshot is in progress. You must wait until the job has resumed from the snapshot process before you can update or drain the job.

Using the snapshots UI

After you create a snapshot, you can use the Snapshots page to manage the snapshots for your project.

Clicking on a snapshot opens the Snapshot details page where you can get additional metadata about the snapshot as well as a link to the source job and any Pub/Sub snapshots.


Deleting a snapshot

Console

  1. On the Snapshots or Snapshot details page, click Delete.
  2. In the Delete snapshot dialog, click Delete to confirm.

CLI

To delete a snapshot, use the delete command:

gcloud beta dataflow snapshots delete SNAPSHOT_ID \
    --region=REGION

Where:

  • SNAPSHOT_ID is your snapshot ID.
  • REGION is the region for the snapshot.
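
For example, the following command deletes a snapshot in us-central1. The snapshot ID shown is a placeholder; use the ID that was returned when the snapshot was created, or look it up with the list command or on the Snapshots page:

gcloud beta dataflow snapshots delete 0c8e3c10-5c29-43f5-a723-123456789abc \
    --region=us-central1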

Creating a job from a snapshot

After you create a snapshot, you can restore the job's state by creating a new job from that snapshot.

Console

You cannot create a job from a snapshot in the Google Cloud Console.

Java

To create a new job from a snapshot, use both the --createFromSnapshot and
--enableStreamingEngine flags when starting a job. For example:

mvn -Pdataflow-runner compile exec:java \
    -Dexec.mainClass=org.apache.beam.examples.WordCount \
    -Dexec.args="--project=PROJECT_ID \
    --stagingLocation=gs://STORAGE_BUCKET/staging/ \
    --inputFile=gs://apache-beam-samples/shakespeare/* \
    --output=gs://STORAGE_BUCKET/output \
    --runner=DataflowRunner \
    --enableStreamingEngine \
    --createFromSnapshot=SNAPSHOT_ID \
    --region=REGION"

Where:

  • PROJECT_ID is the project ID.
  • STORAGE_BUCKET is the storage bucket you use for temporary job assets and the final output.
  • SNAPSHOT_ID is the snapshot ID you want to create a new job from.
  • REGION is the location for the job.
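
After the new job starts, you can describe it to confirm that it is running and to review its configuration. For example, using placeholder values, the following command returns the job's metadata; whether Streaming Engine is enabled is typically reflected in the job's environment settings, though the exact output fields can vary:

gcloud dataflow jobs describe JOB_ID \
    --region=REGION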

Python

This feature is not yet supported in the Apache Beam SDK for Python.

The following apply when creating jobs from Dataflow snapshots:

  • Jobs created from snapshots must run in the same region where the snapshot is stored.
  • If a Dataflow snapshot includes snapshots of Pub/Sub sources, jobs created from that Dataflow snapshot automatically seek to those Pub/Sub snapshots as sources. You must specify the same Pub/Sub topics used by the source job when creating jobs from that Dataflow snapshot.

  • If a Dataflow snapshot doesn't include snapshots of Pub/Sub sources and the source job uses a Pub/Sub source, you must specify a Pub/Sub topic when creating jobs from that Dataflow snapshot.

  • New jobs created from a snapshot are still subject to an update compatibility check.

Known limitations

The following limitations apply to Dataflow snapshots:

  • You cannot create jobs from snapshots using templates or the Dataflow SQL editor.
  • The snapshot expiration timeframe can only be set through the gcloud tool.
  • Dataflow snapshots support only Pub/Sub source snapshots at this time.
  • Sink snapshots are not supported at this time. For example, you cannot create a BigQuery snapshot when creating a Dataflow snapshot.
  • Each project is limited to 10 ready or running snapshots per region. Attempting to create snapshots beyond this limit results in an error. This limit will be relaxed upon General Availability.

Troubleshooting

This section provides instructions for troubleshooting common issues found when interacting with Dataflow snapshots.

Before reaching out for support, ensure that you have ruled out the problems covered in the known limitations and in the following troubleshooting sections.

Snapshot creation request is rejected

After a snapshot creation request is submitted, either through the Cloud Console or by using the gcloud tool, the Dataflow service performs a precondition check and returns any error messages. The creation request can be rejected for reasons such as an unsupported job type or an unavailable region, as specified in the error messages.

If the request is rejected because the job is too old, you must update your job before you request a snapshot.

If the request is rejected because your job has reached the snapshot count limit, remove old snapshots or reach out to the support team to increase the limit.

Unable to create a job from snapshot

When creating a job from a snapshot, ensure that the snapshot exists and has not expired. The new job must run on Streaming Engine.
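
To verify, you can describe the snapshot and check fields such as its state, creation time, and TTL. SNAPSHOT_ID and the region below are placeholders:

gcloud beta dataflow snapshots describe SNAPSHOT_ID \
    --region=us-central1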

For common job creation issues, refer to the Dataflow troubleshooting guide. In particular, new jobs created from snapshots are subject to an update compatibility check: the new job must be compatible with the snapshotted source job.

Job created from snapshot makes little progress

Creating a job from a snapshot involves a step that loads the snapshot data into the new job. The job starts processing data only after the snapshot is fully loaded. Depending on the size of the snapshot, this step can take some time.

For jobs created from Dataflow snapshots that include Pub/Sub source snapshots, the new job might have large Pub/Sub backlogs to process. Streaming autoscaling might help the new job to clear the backlog faster.

The snapshotted source job might already have been in an unhealthy state before the snapshot was taken. Understanding why the source job was unhealthy might help resolve issues in the new job. For common job debugging tips, refer to the Dataflow troubleshooting guide.

Job created from snapshot does not pull Pub/Sub sources

Make sure that the Pub/Sub snapshots captured along with the Dataflow snapshot exist and have not expired. Pub/Sub snapshots expire as soon as the oldest message in the snapshot is older than seven days.
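
To check, you can list the Pub/Sub snapshots in your project and review their expiration times. The exact output fields can vary by gcloud version:

gcloud pubsub snapshots list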