Stopping a running pipeline

If you need to stop a running Dataflow job, you can do so from the Dataflow monitoring interface, with the Dataflow command-line interface, or through the Dataflow REST API. There are two commands you can issue to stop your job: Cancel and Drain.

Stopping a job using the Dataflow monitoring interface

To stop a job, select the job from the jobs list in the Dataflow monitoring interface. On the top panel, click Stop.

Figure 1: Dataflow monitoring interface page with the Stop button on the top panel.

The Stop Job dialog appears with the options for stopping your job:

Figure 2: The Stop Job dialog with options for Cancel and Drain.

Select the Cancel or Drain option as appropriate, then click Stop Job.

Stopping a job using the Dataflow command-line interface

To stop a job using the Dataflow command-line interface, run the following command either in a terminal with Cloud SDK installed or in Cloud Shell:

  gcloud dataflow jobs cancel JOB_ID
  gcloud dataflow jobs drain JOB_ID
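
For example, draining or cancelling a specific job looks like the following. The job ID shown is a hypothetical placeholder; you can find your job's real ID in the monitoring interface or with gcloud dataflow jobs list, and you can pass --region if your job runs outside the default region:

  # Drain a job (the job ID below is a made-up placeholder).
  gcloud dataflow jobs drain 2015-02-09_11_39_40-9385469904546513300 --region=us-central1

  # Cancel the same job instead:
  gcloud dataflow jobs cancel 2015-02-09_11_39_40-9385469904546513300 --region=us-central1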

Stopping a job using the Dataflow REST API

To cancel or drain a job using the Dataflow REST API, send a request to projects.locations.jobs.update or projects.jobs.update and pass the desired state in the requestedState field of the job instance in the request body. To cancel the job, set requestedState to JOB_STATE_CANCELLED. To drain the job, set requestedState to JOB_STATE_DRAINED.
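
For example, a drain request might look like the following curl sketch, where PROJECT_ID, REGION, and JOB_ID are placeholders for your project, your job's regional endpoint, and the job's ID:

  # Drain a job through the REST API (PROJECT_ID, REGION, and JOB_ID are placeholders).
  curl -X PUT \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    -d '{"requestedState": "JOB_STATE_DRAINED"}' \
    "https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/REGION/jobs/JOB_ID"

To cancel the job instead, send JOB_STATE_CANCELLED as the requestedState value.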

Cancel

Using the Cancel option to stop your job tells the Dataflow service to cancel your job immediately. The service halts all data ingestion and processing as soon as possible and immediately begins cleaning up the Google Cloud resources attached to your job. This cleanup may include shutting down Compute Engine worker instances and closing active connections to I/O sources or sinks.

Because Cancel halts processing immediately, you may lose any "in-flight" data, that is, data that has been read but is still being processed by your pipeline. Data written from your pipeline to an output sink before you issued the Cancel command may still be accessible at that sink.

If data loss is not a concern, use the Cancel option to stop your pipeline; it ensures that the Google Cloud resources associated with your job are shut down as soon as possible.

Drain

Using the Drain option to stop your job tells the Dataflow service to finish your job in its current state. Your job stops ingesting new data from input sources soon after receiving the drain request (typically within a few minutes). However, the Dataflow service preserves any existing resources, such as worker instances, to finish processing and writing any buffered data in your pipeline. When all pending processing and write operations are complete, the Dataflow service cleans up the Google Cloud resources associated with your job.

If you want to prevent data loss as you bring down your pipelines, use the Drain option to stop your job.

Effects of draining a job

When you issue the Drain command, Dataflow immediately closes any in-process windows and fires all triggers. The system does not wait for any outstanding time-based windows to finish. For example, if your pipeline is ten minutes into a two-hour window when you issue the Drain command, Dataflow won't wait for the remainder of the window to finish. It closes the window immediately with partial results.

In the detailed view of your pipeline's transforms, you can see the effects of an in-process Drain command:

Figure 3: A step view with Drain in progress; notice the watermark has advanced to the maximum value.