Stopping a running pipeline

You cannot delete a Dataflow job; you can only stop it.

To stop a Dataflow job, you can use the Google Cloud Console, Cloud Shell, a local terminal with the Cloud SDK installed, or the Dataflow REST API.

You can stop a Dataflow job in the following two ways:

  • Canceling a job. This method applies to both streaming and batch pipelines. Canceling a job stops the Dataflow service from processing any data, including buffered data. For more information, see Canceling a job.

  • Draining a job. This method applies only to streaming pipelines. Draining a job lets the Dataflow service finish processing the buffered data while it stops ingesting new data. For more information, see Draining a job.
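The choice between the two methods can be sketched as a small helper. This function is hypothetical (not part of any Google SDK); it only encodes the decision described above:

```python
def stop_method(is_streaming: bool, data_loss_ok: bool) -> str:
    """Pick a stop method for a Dataflow job (illustrative only).

    Draining is available only for streaming pipelines; canceling
    applies to both pipeline types but can discard buffered
    (in-flight) data.
    """
    if is_streaming and not data_loss_ok:
        return "drain"
    return "cancel"
```

For a batch pipeline, canceling is the only option regardless of data-loss tolerance.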

Canceling a job

When you cancel a job, the Dataflow service stops the job immediately.

The following actions occur when you cancel a job:

  1. The Dataflow service halts all data ingestion and data processing.

  2. The Dataflow service begins cleaning up the Google Cloud resources that are attached to your job.

    This cleanup can include shutting down Compute Engine worker instances and closing active connections to I/O sources or sinks.

Important information on canceling a job

  • Canceling a job immediately halts the processing of the pipeline.

  • You might lose in-flight data when you cancel a job. In-flight data refers to data that is already read but is still being processed by the pipeline.

  • Data that the pipeline wrote to an output sink before you canceled the job might still be available in that sink.

  • If data loss is not a concern, canceling your job ensures that the Google Cloud resources that are associated with your job are shut down as soon as possible.

Draining a job

When you drain a job, the Dataflow service finishes your job in its current state. If you want to prevent data loss as you bring down your streaming pipelines, the best option is to drain your job.

The following actions occur when you drain a job:

  1. Your job stops ingesting new data from input sources soon after receiving the drain request (typically within a few minutes).

  2. The Dataflow service preserves any existing resources, such as worker instances, to finish processing and writing any buffered data in your pipeline.

  3. When all pending processing and write operations are complete, the Dataflow service shuts down Google Cloud resources that are associated with your job.

Important information on draining a job

  • Draining a job is not supported for batch pipelines.

  • Your pipeline continues to incur the cost of maintaining any associated Google Cloud resources until all processing and writing is finished.

  • If your streaming pipeline code includes a looping timer, the job cannot be drained.

  • If your streaming pipeline includes a Splittable DoFn, you must truncate the result before running the drain option. For more information on truncating Splittable DoFns, see the Apache Beam documentation.

  • You can update a pipeline that is being drained.

  • Draining a job can take a significant amount of time to complete, such as when your pipeline has a large amount of buffered data.

  • You can cancel a job that is currently draining.

  • In some cases, a Dataflow job might be unable to complete the drain operation. You can consult the job logs to determine the root cause and take appropriate action.
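Because draining can take a long time and, in some cases, might never complete, a common pattern is to poll the job state after requesting the drain and give up (or fall back to canceling) after a deadline. A minimal sketch; `get_job_state` is a placeholder you would implement with `gcloud` or the REST API:

```python
import time

def wait_for_drain(get_job_state, timeout_s=3600, poll_s=30,
                   clock=time.monotonic, sleep=time.sleep):
    """Poll until the job reports JOB_STATE_DRAINED, or give up.

    get_job_state: zero-argument callable returning the current job
    state string (placeholder for a gcloud or REST API lookup).
    Returns True if the job drained within timeout_s, False otherwise.
    """
    deadline = clock() + timeout_s
    while clock() < deadline:
        if get_job_state() == "JOB_STATE_DRAINED":
            return True
        sleep(poll_s)
    return False
```

If `wait_for_drain` returns False, consult the job logs before deciding whether to cancel.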

Effects of draining a job

When you drain a streaming pipeline, Dataflow immediately closes any in-process windows and fires all triggers.

The system does not wait for any outstanding time-based windows to finish in a drain operation.

For example, if your pipeline is ten minutes into a two-hour window when you drain the job, Dataflow won't wait for the remainder of the window to finish. It closes the window immediately with partial results. Dataflow causes open windows to close by advancing the system watermark to infinity. This functionality also works with custom data sources.

When draining a pipeline that uses a custom data source class, Dataflow stops issuing requests for new data, advances the system watermark to infinity, and calls your source's finalize() method on the last checkpoint.
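The watermark behavior described above can be illustrated with a small model: a time-based window fires once the watermark reaches or passes its end, so advancing the watermark to infinity fires every open window immediately, emitting partial results. The window representation here is invented for illustration and is not Beam's actual windowing API:

```python
import math

def windows_fired(watermark, window_ends):
    """Return the end times of windows that close at this watermark.

    A time-based window fires when the watermark reaches or passes
    its end. Draining advances the watermark to infinity, so every
    open window fires immediately with whatever data it holds.
    """
    return [end for end in window_ends if watermark >= end]

open_windows = [120, 7200]   # e.g. a 2-minute window and a 2-hour window
normal = windows_fired(600, open_windows)        # only the 2-minute window fires
drained = windows_fired(math.inf, open_windows)  # drain: every window fires
```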

In the Google Cloud Console, you can view the details of your pipeline's transforms. The following diagram shows the effects of an in-process drain operation. Note that the watermark is advanced to the maximum value.


Figure 1. A step view of a drain operation.

Stopping a job

Before stopping a job, you must understand the effects of canceling or draining a job.

Console

  1. Go to the Dataflow Jobs page.


  2. Click the job that you want to stop.

    To stop a job, the status of the job must be running.

  3. In the job details page, click Stop.

  4. Do one of the following:

    • For a batch pipeline, click Cancel.

    • For a streaming pipeline, click either Cancel or Drain.

  5. To confirm your choice, click Stop Job.

gcloud

To drain or cancel a Dataflow job, use the gcloud dataflow jobs command from Cloud Shell or a local terminal with the Cloud SDK installed.

  1. Log in to your shell.

  2. List the job IDs for the Dataflow jobs that are currently running, and then note the job ID for the job that you want to stop:

    gcloud dataflow jobs list
    

    If the --region flag is not set, Dataflow jobs from all available regions are displayed.

  3. Do one of the following:

    • To drain a streaming job:

       gcloud dataflow jobs drain JOB_ID
      

      Replace JOB_ID with the job ID that you copied earlier.

    • To cancel a batch or streaming job:

      gcloud dataflow jobs cancel JOB_ID
      

      Replace JOB_ID with the job ID that you copied earlier.
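If you need to select a job ID programmatically rather than by eye, you can ask gcloud dataflow jobs list for JSON output (--format=json) and filter it. The sample record below is a simplified assumption for illustration; check the actual field names and output of your gcloud version:

```python
import json

# Simplified stand-in for: gcloud dataflow jobs list --format=json
sample_output = """
[
  {"id": "2023-01-01_00_00_00-1111111111111111111",
   "name": "wordcount-streaming", "state": "Running", "type": "Streaming"},
  {"id": "2023-01-01_00_00_00-2222222222222222222",
   "name": "nightly-batch", "state": "Done", "type": "Batch"}
]
"""

def running_job_ids(raw_json):
    """Return the IDs of jobs whose state is Running."""
    return [job["id"] for job in json.loads(raw_json)
            if job["state"] == "Running"]
```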

API

To cancel or drain a job using the Dataflow REST API, use either projects.locations.jobs.update or projects.jobs.update. In the request body, set the requestedState field of the job resource to the state that you want.

  • To cancel the job, set the job state to JOB_STATE_CANCELLED.

  • To drain the job, set the job state to JOB_STATE_DRAINED.
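As a sketch, the update request can be built as follows. The URL shape follows the projects.locations.jobs.update method; actually sending the request would additionally require an OAuth 2.0 access token, which is omitted here:

```python
import json

def build_stop_request(project_id, location, job_id, drain=False):
    """Build the URL and body for a projects.locations.jobs.update call.

    Setting requestedState is how both cancel and drain are requested:
    drain=True asks for JOB_STATE_DRAINED, otherwise JOB_STATE_CANCELLED.
    """
    url = (
        "https://dataflow.googleapis.com/v1b3/projects/"
        f"{project_id}/locations/{location}/jobs/{job_id}"
    )
    body = {"requestedState": "JOB_STATE_DRAINED" if drain
            else "JOB_STATE_CANCELLED"}
    return url, json.dumps(body)
```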

What's next