Troubleshoot slow or stuck jobs

This page explains how to troubleshoot common causes of slow or stuck Dataflow streaming and batch jobs.

Streaming

If you notice the following symptoms, your Dataflow streaming job might be running slowly or stuck:

The pipeline isn't reading data from the source. For example, Pub/Sub has a growing backlog.
The pipeline isn't writing data to the sink.
The data freshness metric is increasing.
The system latency metric is increasing.

Use the information in the following sections to identify and diagnose the problem.

Investigate repeated failures

In a streaming job, some failures are retried indefinitely. These retries prevent the pipeline from progressing. To identify repeated failures, check the worker logs for exceptions.

If the exception is with user code, debug and fix the issue in the code or in the data.
To prevent unexpected failures from stalling your pipeline, implement a dead-letter queue. For an example implementation, see BigQuery patterns in the Apache Beam documentation.
If the exception is an out of memory (OOM) error, see Troubleshoot Dataflow out of memory errors.
For other exceptions, see Troubleshoot Dataflow errors.

Identify unhealthy workers

If the workers processing the streaming job are unhealthy, the job might be slow or appear stuck. To identify unhealthy workers:

Check for memory pressure by using the memory utilization metrics and by looking for out of memory errors in the worker logs. For more information, see Troubleshoot Dataflow out of memory errors.
If you're using Streaming Engine, use the persistence metrics to identify bottlenecks with the disk input/output operations (IOPS).
Check the worker logs for other errors. For more information, see Work with pipeline logs and Troubleshoot Dataflow errors.

Identify stragglers

A straggler is a work item that is slow relative to other work items in the stage. For information about identifying and fixing stragglers, see Troubleshoot stragglers in streaming jobs.

Troubleshoot insufficient parallelism

For scalability and efficiency, Dataflow runs the stages of your pipeline in parallel across multiple workers. The smallest unit of parallel processing in Dataflow is a key. Incoming messages for each fused stage are associated with a key. The key is defined in one of the following ways:

The key is implicitly defined by the properties of the source, such as Pub/Sub partitions.
The key is explicitly defined by aggregation logic in the pipeline, such as GroupByKey.

If the pipeline doesn't have enough keys for a given stage, it limits parallel processing. That stage might become a bottleneck.

Identify stages with low parallelism

To identify if pipeline slowness is caused by low parallelism, view the CPU utilization metrics. If CPU is low but evenly distributed across workers, your job might have insufficient parallelism. If your job is using Streaming Engine, to see if a stage has low parallelism, in the Job Metrics tab, view the parallelism metrics. To mitigate this issue:

In the Google Cloud console, on the Job info page, use the Autoscaling tab to see if the job is having problems scaling up. If autoscaling is the problem, see Troubleshoot Dataflow autoscaling.
Use the job graph to check the steps in the stage. If the stage is reading from a source or writing to a sink, review the documentation for the service of the source or sink. Use the documentation to determine if that service is configured for sufficient scalability.
- To gather more information, use the input and output metrics provided by Dataflow.
- If you're using Kafka, check the number of Kafka partitions. For more information, see the Apache Kafka documentation.
- If you're using a BigQuery sink, enable automatic sharding to improve parallelism. For more information, see 3x Dataflow Throughput with Auto Sharding for BigQuery.

Check for hot keys

If tasks are unevenly distributed across workers and worker utilization is very uneven, your pipeline might have a hot key. A hot key is a key that has far more elements to process compared to other keys. To resolve this issue, take one or more of the following steps:

Rekey your data. To output new key-value pairs, apply a ParDo transform. For more information, see the Java ParDo transform page or the Python ParDo transform page in the Apache Beam documentation.
Use .withFanout in your combine transforms. For more information, see the Combine.PerKey class in the Java SDK or the with_hot_key_fanout operation in the Python SDK.
If you have a Java pipeline that processes high-volume unbounded PCollections, we recommend that you do the following:
- Use Combine.Globally.withFanout instead of Combine.Globally.
- Use Combine.PerKey.withHotKeyFanout instead of Count.PerKey.

Check for insufficient quota

Make sure you have sufficient quota for your source and sink. For example, if your pipeline reads input from Pub/Sub or BigQuery, your Google Cloud project might have insufficient quota. For more information about quota limits for these services, see Pub/Sub quota or BigQuery quota.

If your job is generating a high number of 429 (Rate Limit Exceeded) errors, it might have insufficient quota. To check for errors, try the following steps:

Go to the Google Cloud console.
In the navigation pane, click APIs & services.
In the menu, click Library.
Use the search box to search for Pub/Sub.
Click Cloud Pub/Sub API.
Click Manage.
In the Traffic by response code chart, look for (4xx) client error codes.

You can also use Metrics Explorer to check quota usage. If your pipeline uses a BigQuery source or sink, to troubleshoot quota issues, use the BigQuery Storage API metrics. For example, to create a chart showing the BigQuery concurrent connection count, follow these steps:

In the Google Cloud console, select Monitoring:

Go to Monitoring
In the navigation pane, select Metrics explorer.
In the Select a metric pane, for Metric, filter to BigQuery Project > Write > concurrent connection count.

For instructions about viewing Pub/Sub metrics, see Monitor quota usage in "Monitor Pub/Sub in Cloud Monitoring." For instructions about viewing BigQuery metrics, see View quota usage and limits in "Create dashboards, charts, and alerts."

Batch

If your batch job is slow or stuck, use the Execution details tab to find more information about the job and to identify the stage or worker that's causing a bottleneck.

Identify stragglers

A straggler is a work item that is slow relative to other work items in the stage. For information about identifying and fixing stragglers, see Troubleshoot stragglers in batch jobs.

Identify slow or stuck stages

To identify slow or stuck stages, use the Stage progress view. Longer bars indicate that the stage takes more time. Use this view to identify the slowest stages in your pipeline.

After you find the bottleneck stage, you can take the following steps:

Identify the lagging worker within that stage.
If there are no lagging workers, identify the slowest step by using the Stage info panel. Use this information to identify candidates for user code optimization.
To find parallelism bottlenecks, use Dataflow monitoring metrics.

Identify a lagging worker

To identify a lagging worker for a specific stage, use the Worker progress view. This view shows whether all workers are processing work until the end of the stage, or if a single worker is stuck on a lagging task. If you find a lagging worker, take the following steps:

View the log files for that worker. For more information, see Monitor and view pipeline logs.
View the CPU utilization metrics and the worker progress details for lagging workers. If you see unusually high or low CPU utilization, in the log files for that worker, look for the following issues:
- A hot key ... was detected
- Processing stuck ... Operation ongoing

Tools for debugging

When you have a slow or stuck pipeline, the following tools can help you diagnose the problem.

To correlate incidents and identify bottlenecks, use Cloud Monitoring for Dataflow.
To monitor pipeline performance, use Cloud Profiler.
Some transforms are better suited to high-volume pipelines than others. Log messages can identify a stuck user transform in either batch or streaming pipelines.
To learn more about a stuck job, use Dataflow job metrics. The following list includes useful metrics:
- The Backlog bytes metric (backlog_bytes) measures the amount of unprocessed input in bytes by stage. Use this metric to find a fused step that has no throughput. Similarly, the backlog elements metric (backlog_elements) measures the number of unprocessed input elements for a stage.
- The Processing parallelism keys (processing_parallelism_keys) metric measures the number of parallel processing keys for a particular stage of the pipeline over the last five minutes. Use this metric to investigate in the following ways:
  - Narrow the issue down to specific stages and confirm hot key warnings, such as A hot key ... was detected.
  - Find throughput bottlenecks caused by insufficient parallelism. These bottlenecks can result in slow or stuck pipelines.
- The System lag metric (system_lag) and the per-stage system lag metric (per_stage_system_lag) measure the maximum amount of time an item of data has been processing or awaiting processing. Use these metrics to identify inefficient stages and bottlenecks from data sources.

For additional metrics that aren't included in the Dataflow monitoring web interface, see the complete list of Dataflow metrics in Google Cloud metrics.