Upgrade a streaming pipeline

This page provides guidance and recommendations for upgrading your streaming pipelines. For example, you might need to upgrade to a newer version of the Apache Beam SDK, or you might want to update your pipeline code. Different options are provided to suit different scenarios.

Whereas batch pipelines stop when the job is complete, streaming pipelines often run continuously in order to provide uninterrupted processing. Therefore, when you upgrade streaming pipelines, you need to account for the following considerations:

  • You might need to minimize or avoid disruption to the pipeline. In some cases, you might be able to tolerate a temporary disruption to processing while a new version of a pipeline is deployed. In other cases, your application might not be able to tolerate any disruption.
  • Pipeline update processes need to handle schema changes in a way that minimizes disruption to message processing and to other attached systems. For example, if the schema for messages in an event-processing pipeline changes, schema changes might also be necessary in downstream data sinks.

You can use one of the following methods to update streaming pipelines, depending on your pipeline and update requirements:

  • Perform an in-flight update: change certain job options, such as the autoscaling range, without stopping the job.
  • Launch a replacement job: replace the running job with a new job that runs the updated pipeline code, by using the update option.
  • Stop and replace the pipeline: drain or cancel the existing job, and then launch a new job that contains the updated pipeline code.
  • Run parallel pipelines: run the updated pipeline alongside the existing pipeline, and then retire the existing pipeline.

For more information about issues you might encounter during an update and how to prevent them, see Validate a replacement job and Job compatibility check.

Best practices

  • Upgrade the Apache Beam SDK version separately from any pipeline code changes.
  • Test your pipeline after each change before making additional updates.
  • Regularly upgrade the Apache Beam SDK version that your pipeline uses.
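As a minimal sketch of the first two practices, assuming a Python pipeline with pinned dependencies (the version number, test directory, and file layout are illustrative):

```shell
# Illustrative only: upgrade the Beam SDK by itself, without touching
# pipeline code. The version shown is a placeholder.
pip install --upgrade 'apache-beam[gcp]==2.53.0'

# Verify that the pipeline still behaves as expected before making any
# separate code changes.
python -m pytest tests/
```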

Perform in-flight updates

You can update some ongoing streaming pipelines without stopping the job. This scenario is called an in-flight job update. In-flight job updates are only available in limited circumstances:

  • The job must use Streaming Engine.
  • The job must be in the running state.
  • You are only changing the minimum or maximum number of workers that the job uses.

For more information, see Set the autoscaling range in the Horizontal Autoscaling page.

For instructions explaining how to perform an in-flight job update, see Update an existing pipeline.
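For example, an in-flight update of the autoscaling range might look like the following sketch, which assumes a Streaming Engine job in us-central1 (the job ID, region, and worker counts are placeholders):

```shell
# Sketch: adjust the autoscaling range of a running Streaming Engine job
# without stopping it. JOB_ID is a placeholder.
gcloud dataflow jobs update-options \
    --region=us-central1 \
    --min-num-workers=5 \
    --max-num-workers=20 \
    JOB_ID
```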

Launch a replacement job

If the updated job is compatible with the existing job, you can update your pipeline by using the update option. When you replace an existing job, a new job runs your updated pipeline code. The Dataflow service retains the job name but runs the replacement job with an updated Job ID. This process might cause downtime while the existing job stops, the compatibility check runs, and the new job starts. For more details, see The effects of replacing a job.

Dataflow performs a compatibility check to ensure that the updated pipeline code can be safely deployed to the running pipeline. Certain code changes cause the compatibility check to fail, such as when side inputs are added to or removed from an existing step. When the compatibility check fails, you can't perform an in-place job update.

For instructions explaining how to launch a replacement job, see Launch a replacement job.
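As a sketch, for a Beam Python pipeline the replacement launch might look like the following; the file name, project, region, job name, and transform mapping are placeholders, and the --update flag is what triggers Dataflow's compatibility check:

```shell
# Sketch: relaunch updated pipeline code as a replacement for a running
# job with the same job name. If a transform was renamed between
# versions, map the old name to the new one so state can be transferred.
python my_pipeline.py \
    --runner=DataflowRunner \
    --project=PROJECT_ID \
    --region=REGION \
    --streaming \
    --update \
    --job_name=EXISTING_JOB_NAME \
    --transform_name_mapping='{"OldTransformName":"NewTransformName"}'
```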

If the pipeline update is incompatible with the current job, you need to stop and replace the pipeline. If your pipeline can't tolerate downtime, run parallel pipelines.

Stop and replace pipelines

If you can temporarily halt processing, you can cancel or drain the pipeline, and then replace it with the updated pipeline. Canceling a pipeline causes Dataflow to immediately halt processing and shut down resources as quickly as possible, which can cause some loss of data that's being processed, known as in-flight data. To avoid data loss, in most cases, draining is the preferred action.

Draining a pipeline immediately closes any in-process windows and fires all triggers. Although in-flight data isn't lost, draining might cause windows to have incomplete data. If this happens, in-process windows emit partial or incomplete results. For more information, see Effects of draining a job. After the existing job completes, launch a new streaming job that contains your updated pipeline code, which enables processing to resume.

With this method, you incur some downtime between the time when the existing streaming job stops and the time when the replacement pipeline is ready to resume processing data. However, canceling or draining an existing pipeline and then launching a new job with the updated pipeline is less complicated than running parallel pipelines.

For more detailed instructions, see Drain a Dataflow job. After you drain the current job, start a new job with the same job name.
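The drain-then-replace sequence can be sketched as follows (the job ID and region are placeholders):

```shell
# Drain the running job so that in-flight data finishes processing.
gcloud dataflow jobs drain JOB_ID --region=REGION

# Poll the job state; launch the replacement job only after this
# reports JOB_STATE_DRAINED.
gcloud dataflow jobs describe JOB_ID --region=REGION \
    --format='value(currentState)'
```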

Message reprocessing with Pub/Sub Snapshot and Seek

In some situations, after you replace or cancel a drained pipeline, you might need to reprocess previously delivered Pub/Sub messages. For example, you might need to use updated business logic to reprocess data. Pub/Sub Seek is a feature that lets you replay messages from a Pub/Sub snapshot. You can use Pub/Sub Seek with Dataflow to reprocess messages from the time when the subscription snapshot is created.

During development and testing, you can also use Pub/Sub Seek to replay known messages repeatedly to verify the output from your pipeline. Don't seek a subscription snapshot while the subscription is being consumed by a pipeline. If you do, the seek can invalidate Dataflow's watermark logic and might impact the exactly-once processing of Pub/Sub messages.

A recommended workflow for using Pub/Sub Seek with Dataflow pipelines from the gcloud CLI is as follows:

  1. To create a snapshot of the subscription, use the gcloud pubsub snapshots create command:

    gcloud pubsub snapshots create SNAPSHOT_NAME --subscription=PIPELINE_SUBSCRIPTION_NAME
    
  2. To drain or cancel the pipeline, use the gcloud dataflow jobs drain command or the gcloud dataflow jobs cancel command:

    gcloud dataflow jobs drain JOB_ID
    

    or

    gcloud dataflow jobs cancel JOB_ID
    
  3. To seek to the snapshot, use the gcloud pubsub subscriptions seek command:

    gcloud pubsub subscriptions seek PIPELINE_SUBSCRIPTION_NAME --snapshot=SNAPSHOT_NAME
    
  4. Deploy a new pipeline that consumes the subscription.

Run parallel pipelines

If you need to avoid disruption to your streaming pipeline during an update, run parallel pipelines. Create a new streaming job that has the updated pipeline code, and run the new pipeline in parallel with the existing pipeline.

When you create the new pipeline, use the same windowing strategy that you used for the existing pipeline. Let the existing pipeline continue to run until its watermark exceeds the timestamp of the earliest complete window processed by the updated pipeline. Then, drain or cancel the existing pipeline. The updated pipeline continues to run in its place and effectively takes over processing on its own.

The following diagram illustrates this process.

Pipeline A overlaps with Pipeline B for a 5-minute window.

In the diagram, Pipeline B is the updated job that takes over from Pipeline A. The value t is the timestamp of the earliest complete window processed by Pipeline B. The value w is the watermark for Pipeline A. For simplicity, a perfect watermark is assumed with no late data. Processing and wall time are represented on the horizontal axis. Both pipelines use five-minute fixed (tumbling) windows. Results are triggered after the watermark passes the end of each window.

Because concurrent output occurs during the time period where the two pipelines overlap, configure the two pipelines to write results to different destinations. Downstream systems can then use an abstraction over the two destination sinks, such as a database view, to query the combined results. These systems can also use the abstraction to deduplicate results from the overlapping period.
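As a sketch of such an abstraction, assuming both pipelines write windowed results that include a window_end timestamp to hypothetical tables results_a and results_b, a deduplicating façade view might prefer the updated pipeline's rows:

```shell
# Sketch: create a view that returns Pipeline B's output, plus Pipeline A's
# output only for windows that Pipeline B didn't produce. The project,
# dataset, table, and column names are hypothetical.
bq mk --use_legacy_sql=false \
    --view 'SELECT * FROM `PROJECT_ID.dataset.results_b`
            UNION ALL
            SELECT a.*
            FROM `PROJECT_ID.dataset.results_a` AS a
            WHERE a.window_end NOT IN (
              SELECT window_end FROM `PROJECT_ID.dataset.results_b`)' \
    PROJECT_ID:dataset.facade_view
```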

The following example describes this approach by using a pipeline that reads input data from Pub/Sub, performs some processing, and writes the results to BigQuery.

  1. In the initial state, the existing streaming pipeline (Pipeline A) is running and reading messages from a Pub/Sub topic (Topic) by using a subscription (Subscription A). The results are written to a BigQuery table (Table A). Results are consumed through a BigQuery view, which acts as a façade to mask underlying table changes. This process is an application of a design method called the façade pattern. The following diagram shows the initial state.

    One pipeline with one subscription, and writing to a single BigQuery table.

  2. Create a new subscription (Subscription B) for the updated pipeline. Deploy the updated pipeline (Pipeline B), which reads from the Pub/Sub topic (Topic) by using Subscription B and writes to a separate BigQuery table (Table B). The following diagram illustrates this flow.

    Two pipelines, each with one subscription. Each pipeline writes to a separate BigQuery table. A façade view reads from both tables.

    At this point, Pipeline A and Pipeline B are running in parallel and writing results to separate tables. You record time t as the timestamp of the earliest complete window processed by Pipeline B.

  3. When the watermark of Pipeline A exceeds time t, drain Pipeline A. When you drain the pipeline, any open windows close, and processing for in-flight data completes. If complete windows are important (assuming no late data), let both pipelines run until you have complete overlapping windows before draining Pipeline A. Stop the streaming job for Pipeline A after all in-flight data is processed and written to Table A. The following diagram shows this stage.

    Pipeline A drains and no longer reads Subscription A, and it no longer sends data to Table A after the drain is complete. All processing is handled by the second pipeline.

  4. At this point, only Pipeline B is running. You can query a BigQuery view (Façade View), which acts as a façade over Table A and Table B. For rows that have the same timestamp in both tables, configure the view to return the rows from Table B or, if the rows don't exist in Table B, to fall back to Table A. The following diagram shows the view (Façade View) reading from both Table A and Table B.

    Pipeline A is gone, and only Pipeline B runs.

    At this point, you can delete Subscription A.
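The CLI side of steps 2 through 4 can be sketched as follows (the subscription, topic, job, and region names are placeholders):

```shell
# Step 2: create the subscription that the updated pipeline consumes,
# then deploy Pipeline B so that it reads from this subscription.
gcloud pubsub subscriptions create SUBSCRIPTION_B --topic=TOPIC_NAME

# Step 3: after Pipeline A's watermark exceeds time t, drain Pipeline A.
gcloud dataflow jobs drain PIPELINE_A_JOB_ID --region=REGION

# Step 4: once Pipeline B has fully taken over, delete the old subscription.
gcloud pubsub subscriptions delete SUBSCRIPTION_A
```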

When issues are detected with a new pipeline deployment, having parallel pipelines can simplify rollback. In this example, you might want to keep Pipeline A running while you monitor Pipeline B for correct operation. If any issues occur with Pipeline B, you can roll back to Pipeline A.

Handle schema mutations

Data-handling systems often need to accommodate schema mutations over time, sometimes due to changes in business requirements and other times for technical reasons. Applying schema updates typically requires careful planning and execution to avoid disruptions to business information systems.

Consider a pipeline that reads messages that contain JSON payloads from a Pub/Sub topic. The pipeline converts each message into a TableRow instance and then writes the rows to a BigQuery table. The schema of the output table corresponds to the schema of the messages that are processed by the pipeline. In the following diagram, the schema is referred to as Schema A.

Pipeline that reads a subscription and writes to a BigQuery output table using Schema A.

Over time, the message schema might mutate in non-trivial ways. For example, fields are added, removed, or replaced. Schema A evolves into a new schema. In the discussion that follows, the new schema is referred to as Schema B. In this case, Pipeline A needs to be updated, and the output table schema needs to support Schema B.

For the output table, you can perform some schema mutations without downtime. For example, you can add new fields or relax column modes, such as changing REQUIRED to NULLABLE. These mutations don't usually impact existing queries. However, schema mutations that modify or remove existing schema fields break queries or cause other disruptions. The following approach accommodates changes without requiring downtime.

Separate the data that's written by the pipeline into a principal table and into one or more staging tables. The principal table stores historic data written by the pipeline. Staging tables store the latest pipeline output. You can define a BigQuery façade view over the principal and staging tables, which lets consumers query both historic and up-to-date data.

The following diagram revises the previous pipeline flow to include a staging table (Staging Table A), a principal table, and a façade view.

Pipeline that reads a subscription and writes to a BigQuery staging table. A second (principal) table has output from a previous version of the schema. A façade view reads from both the staging table and the principal table.

In the revised flow, Pipeline A processes messages that use Schema A and writes the output to Staging Table A, which has a compatible schema. The principal table contains historic data written by previous versions of the pipeline, as well as results that are periodically merged from the staging table. Consumers can query up-to-date data, including both historic and real-time data, by using the façade view.

When the message schema mutates from Schema A to Schema B, you need to update the pipeline code to be compatible with messages that use Schema B, and the existing pipeline needs to be replaced with the new implementation. By running parallel pipelines, you can ensure that streaming data processing continues without disruption. In contrast, terminating and replacing the pipeline results in a break in processing, because no pipeline is running for a period of time.

The updated pipeline writes to an additional staging table (Staging Table B) that uses Schema B. You can use an orchestrated workflow to create the new staging table before updating the pipeline. Update the façade view to include results from the new staging table, potentially using a related workflow step.
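One possible sketch of that orchestration step, assuming a JSON schema file for Schema B and hypothetical table and column names:

```shell
# Sketch: create the new staging table from a Schema B definition, then
# redefine the façade view to read from it as well. The file path,
# dataset, table, and column names are illustrative; here Schema B adds
# a new_field column that the older tables don't have.
bq mk --table PROJECT_ID:dataset.staging_table_b ./schema_b.json

bq update --use_legacy_sql=false \
    --view 'SELECT event_id, payload, NULL AS new_field
              FROM `PROJECT_ID.dataset.principal_table`
            UNION ALL
            SELECT event_id, payload, NULL AS new_field
              FROM `PROJECT_ID.dataset.staging_table_a`
            UNION ALL
            SELECT event_id, payload, new_field
              FROM `PROJECT_ID.dataset.staging_table_b`' \
    PROJECT_ID:dataset.facade_view
```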

The following diagram shows the updated flow: Staging Table B uses Schema B, and the façade view is updated to include content from the principal table and from both staging tables.

The pipeline now uses Schema B and writes to Staging Table B. A façade view reads from the Principal table, Staging Table A, and Staging Table B.

As a separate process from the pipeline update, you can merge the staging tables into the principal table, either periodically or as required. The following diagram shows how Staging Table A is merged into the principal table.

Staging Table A is merged into the principal table. The façade view reads from Staging Table B and from the principal table.
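The merge step can be sketched with a BigQuery MERGE statement; the table names and the event_id join key are hypothetical, and your merge conditions will depend on your schema:

```shell
# Sketch: fold Staging Table A's rows into the principal table, inserting
# rows that the principal table doesn't already contain.
bq query --use_legacy_sql=false '
  MERGE `PROJECT_ID.dataset.principal_table` AS p
  USING `PROJECT_ID.dataset.staging_table_a` AS s
  ON p.event_id = s.event_id
  WHEN NOT MATCHED THEN
    INSERT ROW'

# Optionally empty the staging table after the merge succeeds.
bq query --use_legacy_sql=false \
    'TRUNCATE TABLE `PROJECT_ID.dataset.staging_table_a`'
```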

What's next