Troubleshoot slow or stuck batch jobs

This page explains how to troubleshoot common causes of slow or stuck Dataflow batch jobs.

If your batch job is slow or stuck, use the Execution details tab to find more information about the job and to identify the stage or worker that's causing a bottleneck.

Identify the root cause

Check whether the job is running into issues during worker startup. For more information, see Error syncing pod.

To verify that the job has started processing data, look in the job-message log for the following log entry:
```
All workers have finished the startup processes and began to receive work requests
```
To compare performance across jobs fairly, make sure that the volume of input data, worker configuration, autoscaling behavior, and Dataflow Shuffle settings are the same for each job.
Check the job-message logs for issues such as quota limits, stockout issues, or IP address exhaustion.
In the Execution details tab, compare the stage progress to identify stages that took longer.
Look for any stragglers in the job. For more information, see Troubleshoot stragglers in batch jobs.
Check the throughput, CPU, and memory utilization metrics.
Check the worker logs for warnings and errors.
- If the worker logs contain errors, view the stack trace. Investigate whether the error is caused by a bug in your code.
- Look for Dataflow errors. See Troubleshoot Dataflow errors.
- Look for out-of-memory errors, which can cause a stuck pipeline. If you see out-of-memory errors, follow the steps in Troubleshoot Dataflow out of memory errors.
- To identify a slow or stuck step, check the worker logs for Operation ongoing messages. View the stack trace to see where the step is spending time. For more information, see Processing stuck or operation ongoing.
Check for hot keys.
If you aren't using Dataflow Shuffle, check the shuffler logs for warnings and errors during shuffle operations. If you see an RPC timeout error on port 12345 or 12346, your job might be missing a firewall rule. See Firewall rules for Dataflow.
If Runner v2 is enabled, check the harness logs for errors. For more information, see Troubleshoot Runner v2.
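
The startup check in the steps above can be scripted. The following is a minimal sketch, not an official tool: it builds a Cloud Logging query string that searches a job's job-message log for the startup entry. The helper name `startup_filter` is hypothetical. The sketch also assumes that Dataflow job logs use the `dataflow_step` resource type with a `job_id` label, and that the job-message log ID is `dataflow.googleapis.com/job-message`; confirm both in Logs Explorer for your project before relying on the query.

```python
STARTUP_MESSAGE = (
    "All workers have finished the startup processes "
    "and began to receive work requests"
)


def startup_filter(job_id: str) -> str:
    """Build a Logging query for one job's startup entry (hypothetical helper)."""
    if not job_id or '"' in job_id:
        raise ValueError("job_id must be a non-empty string without quotes")
    clauses = [
        'resource.type="dataflow_step"',  # assumed resource type
        f'resource.labels.job_id="{job_id}"',  # assumed label name
        'log_id("dataflow.googleapis.com/job-message")',  # assumed log ID
        f'"{STARTUP_MESSAGE}"',  # quoted text is searched across all fields
    ]
    return " AND ".join(clauses)


if __name__ == "__main__":
    # JOB_ID is a placeholder; substitute the ID of the job you are checking.
    print(startup_filter("JOB_ID"))
```

If the query matches no entries, the job might still be in worker startup, so the worker startup check in the first step is the place to continue.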

Identify stragglers

A straggler is a work item that is slow relative to other work items in the stage. For information about identifying and fixing stragglers, see Troubleshoot stragglers in batch jobs.
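
To make the definition concrete, the following sketch flags work items whose duration is far above the stage median. It's an illustration only: the function name `find_stragglers`, the input shape, and the threshold of three times the median are assumptions, not the rule that Dataflow uses to detect stragglers.

```python
from statistics import median


def find_stragglers(durations_s: dict[str, float], factor: float = 3.0) -> list[str]:
    """Return IDs of work items slower than `factor` times the stage median.

    durations_s maps a work item ID to its elapsed time in seconds.
    The default factor of 3.0 is an arbitrary illustrative threshold.
    """
    if not durations_s:
        return []
    typical = median(durations_s.values())
    return sorted(
        item for item, secs in durations_s.items() if secs > factor * typical
    )


if __name__ == "__main__":
    # Made-up durations for four work items in one stage.
    stage = {"item-a": 40.0, "item-b": 45.0, "item-c": 42.0, "item-d": 600.0}
    print(find_stragglers(stage))
```

Using the median rather than the mean keeps a single very slow item from raising the baseline that it's compared against.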

Identify slow or stuck stages

To identify slow or stuck stages, use the Stage progress view. A longer bar indicates that the stage takes more time. Use this view to identify the slowest stages in your pipeline.

After you find the bottleneck stage, you can take the following steps:

Identify the lagging worker within that stage.
If there are no lagging workers, identify the slowest step by using the Stage info panel. Use this information to identify candidates for user code optimization.
To find parallelism bottlenecks, use Dataflow monitoring metrics.
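
The Stage progress view makes this comparison visually. If you have recorded stage start and end times yourself, the same ranking can be sketched in a few lines. The function name `rank_stages` and the input format are hypothetical, and the stage names and timestamps in the example are made up.

```python
from datetime import datetime


def rank_stages(stages: dict[str, tuple[str, str]]) -> list[tuple[str, float]]:
    """Sort stages by wall-clock duration in seconds, longest first.

    stages maps a stage name to (start, end) ISO 8601 timestamps.
    """
    ranked = []
    for name, (start, end) in stages.items():
        elapsed = datetime.fromisoformat(end) - datetime.fromisoformat(start)
        ranked.append((name, elapsed.total_seconds()))
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Illustrative data only.
    example = {
        "stage-1": ("2024-01-01T10:00:00", "2024-01-01T10:05:00"),
        "stage-2": ("2024-01-01T10:05:00", "2024-01-01T10:50:00"),
        "stage-3": ("2024-01-01T10:50:00", "2024-01-01T10:52:00"),
    }
    for name, seconds in rank_stages(example):
        print(f"{name}: {seconds:.0f} s")
```

The first entry in the result is the candidate bottleneck stage to inspect for lagging workers.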

Identify a lagging worker

To identify a lagging worker for a specific stage, use the Worker progress view. This view shows whether all workers are processing work until the end of the stage, or whether a single worker is stuck on a lagging task. If you find a lagging worker, take the following steps:

View the log files for that worker. For more information, see Monitor and view pipeline logs.
View the CPU utilization metrics and the worker progress details for lagging workers. If CPU utilization is unusually high or low, look in that worker's log files for the following messages:
- A hot key ... was detected
- Processing stuck ... Operation ongoing
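
The two log fragments listed above contain job-specific text where the ellipses appear, so a search has to match around that text. The following is a minimal sketch, assuming you've saved a worker's log messages as plain strings. The names `PATTERNS` and `classify` are hypothetical, and the patterns encode only the fragments quoted on this page, not the full message format.

```python
import re

# Patterns built only from the fragments quoted in the list above.
# ".*" stands in for the elided, job-specific text.
PATTERNS = {
    "hot_key": re.compile(r"A hot key .* was detected"),
    "stuck": re.compile(r"Processing stuck|Operation ongoing"),
}


def classify(messages: list[str]) -> dict[str, list[str]]:
    """Group log messages by the issue whose pattern they match."""
    found: dict[str, list[str]] = {name: [] for name in PATTERNS}
    for message in messages:
        for name, pattern in PATTERNS.items():
            if pattern.search(message):
                found[name].append(message)
    return found


if __name__ == "__main__":
    # Synthetic lines for illustration; real messages contain more detail.
    sample = [
        "A hot key <key> was detected",
        "Operation ongoing",
        "an unrelated message",
    ]
    for issue, hits in classify(sample).items():
        print(issue, len(hits))
```

The `stuck` pattern accepts either fragment on its own, because a message might contain only one of them.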

Tools for debugging

When you have a slow or stuck pipeline, the following tools can help you diagnose the problem.

To correlate incidents and identify bottlenecks, use Cloud Monitoring for Dataflow.
To monitor pipeline performance, use Cloud Profiler.
To identify a stuck user transform in a batch or streaming pipeline, use log messages. Some transforms are better suited to high-volume pipelines than others.
To learn more about a stuck job, use Dataflow job metrics. The following list includes useful metrics:
- The Backlog bytes metric (backlog_bytes) measures the amount of unprocessed input in bytes by stage. Use this metric to find a fused step that has no throughput. Similarly, the Backlog elements metric (backlog_elements) measures the number of unprocessed input elements for a stage.
- The Processing parallelism keys metric (processing_parallelism_keys) measures the number of parallel processing keys for a particular stage of the pipeline over the last five minutes. Use this metric to do the following:
  - Narrow the issue down to specific stages and confirm hot key warnings, such as A hot key ... was detected.
  - Find throughput bottlenecks caused by insufficient parallelism. These bottlenecks can result in slow or stuck pipelines.
- The System lag metric (system_lag) and the per-stage system lag metric (per_stage_system_lag) measure the maximum amount of time an item of data has been processing or awaiting processing. Use these metrics to identify inefficient stages and bottlenecks from data sources.
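
As one way to act on these metrics, the following sketch looks for stages whose backlog stays above zero without shrinking, which is how a fused step with no throughput might appear in `backlog_bytes` samples. The function name `stalled_stages`, the input shape (stage name mapped to samples ordered oldest to newest), and the stall rule itself are assumptions. Fetching the samples from Cloud Monitoring isn't shown.

```python
def stalled_stages(backlog_bytes: dict[str, list[int]]) -> list[str]:
    """Return stages whose backlog is nonzero and never decreases.

    backlog_bytes maps a stage name to samples ordered oldest to newest.
    Stages with fewer than two samples, or with an empty backlog at the
    latest sample, are skipped.
    """
    stalled = []
    for stage, samples in backlog_bytes.items():
        if len(samples) < 2 or samples[-1] == 0:
            continue
        never_shrinks = all(b >= a for a, b in zip(samples, samples[1:]))
        if never_shrinks:
            stalled.append(stage)
    return sorted(stalled)


if __name__ == "__main__":
    # Made-up samples for three stages.
    samples = {
        "stage-a": [500, 500, 520],  # backlog never shrinks
        "stage-b": [900, 400, 0],    # backlog drains
        "stage-c": [100, 80, 90],    # backlog shrinks at least once
    }
    print(stalled_stages(samples))
```

A stage that this rule flags is only a candidate. Compare it against the Processing parallelism keys and System lag metrics before concluding that the stage is stuck.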

For additional metrics that aren't included in the Dataflow monitoring web interface, see the complete list of Dataflow metrics in Google Cloud Platform metrics.