Troubleshoot slow or stuck jobs

This page explains how to troubleshoot common causes of slow or stuck Dataflow streaming and batch jobs.

Streaming

If you notice the following symptoms, your Dataflow streaming job might be running slowly or stuck:

  • The pipeline isn't reading data from the source. For example, Pub/Sub has a growing backlog.
  • The pipeline isn't writing data to the sink.
  • The data freshness metric is increasing.
  • The system latency metric is increasing.

Use the information in the following sections to identify and diagnose the problem.

Identify the root cause

  1. Check the data freshness and backlog bytes metrics.

    • If both metrics are monotonically increasing, it means the pipeline is stuck and not progressing.
    • If data freshness is increasing, but backlog bytes remains normal, it means that one or more work items are stuck in the pipeline.

    Look at the stages where these metrics are increasing to identify any problematic stage and the operations performed in that stage.

  2. Check the Parallel processing chart to see if any stage is stuck due to low parallelism. See Troubleshoot parallelism.

  3. Check the job logs for issues such as quota limits, stockout issues, or IP address exhaustion.

  4. Check the worker logs for warnings and errors.

    • If worker logs contain errors, view the stack trace. Investigate whether the error is caused by a bug in your code.
    • Look for Dataflow errors. See Troubleshoot Dataflow errors.
    • Look for errors that show the job exceeded a limit, such as the maximum Pub/Sub message size.
    • Look for out-of-memory errors, which can cause a stuck pipeline. If you see out-of-memory errors, follow the steps in Troubleshoot Dataflow out of memory errors.
    • To identify a slow or stuck step, check the worker logs for Operation ongoing messages. View the stack trace to see where the step is spending time. For more information, see Processing stuck or operation ongoing. A sample log filter for these messages follows this list.
  5. If a work item is stuck on a specific worker, restart that worker VM.

  6. If you aren't using Streaming Engine, check the shuffler logs for warnings and errors. If you see an RPC timeout error on port 12345 or 12346, your job might be missing a firewall rule. See Firewall rules for Dataflow.

  7. Check for hot keys.

  8. If Runner v2 is enabled, check the harness logs for errors. For more information, see Troubleshoot Runner v2.
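
For step 4 of the preceding list, a log filter similar to the following can surface Operation ongoing messages. This filter is a sketch: it assumes the default worker log name and matches the message text anywhere in the log entry.

  resource.type="dataflow_step"
  resource.labels.job_id=JOB_ID
  log_id("dataflow.googleapis.com/worker")
  "Operation ongoing"

Replace JOB_ID with the ID of your job.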

Investigate repeated failures

In a streaming job, some failures are retried indefinitely. These retries prevent the pipeline from progressing. To identify repeated failures, check the worker logs for exceptions.
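
To start, you can narrow the worker logs for your job to error-level entries with a filter like the following, where JOB_ID is the ID of your job. This is a sketch; add terms from the recurring stack trace to narrow it further.

  resource.type="dataflow_step"
  resource.labels.job_id=JOB_ID
  log_id("dataflow.googleapis.com/worker")
  severity>=ERROR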

Identify unhealthy workers

If the workers processing the streaming job are unhealthy, the job might be slow or appear stuck. To identify unhealthy workers, check the worker logs for errors, such as out-of-memory errors, and review the worker CPU and memory utilization metrics.

Identify stragglers

A straggler is a work item that is slow relative to other work items in the stage. For information about identifying and fixing stragglers, see Troubleshoot stragglers in streaming jobs.

Troubleshoot insufficient parallelism

For scalability and efficiency, Dataflow runs the stages of your pipeline in parallel across multiple workers. The smallest unit of parallel processing in Dataflow is a key. Incoming messages for each fused stage are associated with a key. The key is defined in one of the following ways:

  • The key is implicitly defined by the properties of the source, such as Pub/Sub partitions.
  • The key is explicitly defined by aggregation logic in the pipeline, such as GroupByKey.

If the pipeline doesn't have enough keys for a given stage, parallel processing is limited, and that stage might become a bottleneck.
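
For example, the following Java sketch makes the user ID the key for a downstream aggregation. The Event type, the getUserId() accessor, and the events collection are hypothetical stand-ins for your own pipeline:

  import org.apache.beam.sdk.transforms.GroupByKey;
  import org.apache.beam.sdk.transforms.WithKeys;
  import org.apache.beam.sdk.values.KV;
  import org.apache.beam.sdk.values.PCollection;
  import org.apache.beam.sdk.values.TypeDescriptors;

  // Assumed: an existing PCollection<Event> named events.
  PCollection<KV<String, Iterable<Event>>> perUser =
      events
          // The key is defined explicitly here: one key per user ID.
          .apply(WithKeys.of((Event e) -> e.getUserId())
              .withKeyType(TypeDescriptors.strings()))
          // GroupByKey makes the user ID the unit of parallelism for the
          // downstream fused stage. Few distinct user IDs means few keys,
          // which limits how many workers can process that stage in parallel.
          .apply(GroupByKey.<String, Event>create());

If a stage like this has only a handful of distinct keys, adding workers doesn't help, because there is no additional work that can be split across them.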

Identify stages with low parallelism

To determine whether pipeline slowness is caused by low parallelism, view the CPU utilization metrics. If CPU utilization is low but evenly distributed across workers, your job might have insufficient parallelism. If your job uses Streaming Engine, view the parallelism metrics in the Job Metrics tab to see whether a stage has low parallelism. To mitigate this issue, try the following:

  • In the Google Cloud console, on the Job info page, use the Autoscaling tab to see if the job is having problems scaling up. If autoscaling is the problem, see Troubleshoot Dataflow autoscaling.
  • Use the job graph to check the steps in the stage. If the stage is reading from a source or writing to a sink, review the documentation for the service of the source or sink. Use the documentation to determine if that service is configured for sufficient scalability.

Check for hot keys

If tasks are unevenly distributed across your workers and worker utilization is very uneven, your pipeline might have a hot key. A hot key is a key that has far more elements to process than other keys.

Check for hot keys by using the following log filter:

  resource.type="dataflow_step"
  resource.labels.job_id=JOB_ID
  jsonPayload.line:"hot_key_logger"

Replace JOB_ID with the ID of your job.

To resolve this issue, take one or more of the following steps:

  • Rekey your data. To output new key-value pairs, apply a ParDo transform. For more information, see the Java ParDo transform page or the Python ParDo transform page in the Apache Beam documentation.
  • Use .withFanout or .withHotKeyFanout in your combine transforms. For more information, see the Combine.PerKey class in the Java SDK or the with_hot_key_fanout operation in the Python SDK. A Java sketch of rekeying and hot-key fanout follows this list.
  • If you have a Java pipeline that processes high-volume unbounded PCollections, we recommend that you do the following:
    • Use Combine.Globally.withFanout instead of Combine.Globally.
    • Use Combine.PerKey.withHotKeyFanout instead of Count.PerKey.
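
The following Java sketch illustrates the first two options. The key and value types, the transform names, and the NUM_SHARDS constant are assumptions for illustration; adapt them to your pipeline.

  import java.util.concurrent.ThreadLocalRandom;
  import org.apache.beam.sdk.transforms.Combine;
  import org.apache.beam.sdk.transforms.DoFn;
  import org.apache.beam.sdk.transforms.ParDo;
  import org.apache.beam.sdk.transforms.Sum;
  import org.apache.beam.sdk.values.KV;
  import org.apache.beam.sdk.values.PCollection;

  // Assumed: an existing PCollection<KV<String, Long>> named counts, and a
  // hypothetical shard count. Tune NUM_SHARDS to how hot the key is.
  final int NUM_SHARDS = 16;

  // Option 1: rekey the data. A ParDo appends a random shard suffix so that a
  // single hot key is split across NUM_SHARDS sub-keys.
  PCollection<KV<String, Long>> rekeyed = counts.apply("Rekey",
      ParDo.of(
          new DoFn<KV<String, Long>, KV<String, Long>>() {
            @ProcessElement
            public void processElement(ProcessContext c) {
              KV<String, Long> kv = c.element();
              int shard = ThreadLocalRandom.current().nextInt(NUM_SHARDS);
              c.output(KV.of(kv.getKey() + "-" + shard, kv.getValue()));
            }
          }));

  // Option 2: keep the original key, but add hot-key fanout so Dataflow
  // pre-aggregates the hot key on several workers before the final combine.
  PCollection<KV<String, Long>> summed = counts.apply("SumPerKey",
      Combine.<String, Long, Long>perKey(Sum.ofLongs()).withHotKeyFanout(NUM_SHARDS));

If you rekey your data, a later aggregation produces one partial result per shard, so add a second aggregation on the original key to recombine them. Hot-key fanout performs this recombination for you, which often makes it the simpler option.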

Check for insufficient quota

Make sure you have sufficient quota for your source and sink. For example, if your pipeline reads input from Pub/Sub or BigQuery, your Google Cloud project might have insufficient quota. For more information about quota limits for these services, see Pub/Sub quota or BigQuery quota.

If your job is generating a high number of 429 (Rate Limit Exceeded) errors, it might have insufficient quota. To check for these errors, try the following steps, or use the sample log filter that follows the steps:

  1. Go to the Google Cloud console.
  2. In the navigation pane, click APIs & services.
  3. In the menu, click Library.
  4. Use the search box to search for Pub/Sub.
  5. Click Cloud Pub/Sub API.
  6. Click Manage.
  7. In the Traffic by response code chart, look for (4xx) client error codes.
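
As an alternative to the console steps, you can search the worker logs for your job for rate-limit responses. The following filter is a sketch; the exact error text depends on the service and client library, so adjust the quoted terms as needed.

  resource.type="dataflow_step"
  resource.labels.job_id=JOB_ID
  log_id("dataflow.googleapis.com/worker")
  ("429" OR "RESOURCE_EXHAUSTED" OR "rateLimitExceeded")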

You can also use Metrics Explorer to check quota usage. If your pipeline uses a BigQuery source or sink, use the BigQuery Storage API metrics to troubleshoot quota issues. For example, to create a chart that shows the BigQuery concurrent connection count, follow these steps:

  1. In the Google Cloud console, select Monitoring:

    Go to Monitoring

  2. In the navigation pane, select Metrics explorer.

  3. In the Select a metric pane, for Metric, filter to BigQuery Project > Write > concurrent connection count.

For instructions about viewing Pub/Sub metrics, see Monitor quota usage in "Monitor Pub/Sub in Cloud Monitoring." For instructions about viewing BigQuery metrics, see View quota usage and limits in "Create dashboards, charts, and alerts."

Batch

If your batch job is slow or stuck, use the Execution details tab to find more information about the job and to identify the stage or worker that's causing a bottleneck.

Identify the root cause

  1. Check whether the job is running into issues during worker startup. For more information, see Error syncing pod.

    To verify that the job has started processing data, look in the job-message log for the following log entry (a sample log filter for this entry follows this list):

    All workers have finished the startup processes and began to receive work requests
    
  2. To compare job performance between different jobs, make sure the volume of input data, worker configuration, autoscaling behavior, and Dataflow Shuffle settings are the same.

  3. Check the job-message logs for issues such as quota limits, stockout issues, or IP address exhaustion.

  4. In the Execution details tab, compare the stage progress to identify stages that took longer.

  5. Look for any stragglers in the job. For more information, see Troubleshoot stragglers in batch jobs.

  6. Check the throughput, CPU, and memory utilization metrics.

  7. Check the worker logs for warnings and errors.

  8. Check for hot keys.

  9. If you aren't using Dataflow Shuffle, check the shuffler logs for warnings and errors during shuffle operation. If you see an RPC timeout error on port 12345 or 12346, your job might be missing a firewall rule. See Firewall rules for Dataflow.

  10. If Runner v2 is enabled, check the harness logs for errors. For more information, see Troubleshoot Runner v2.
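
For step 1 of the preceding list, a filter similar to the following finds the startup entry in the job-message log (a sketch; JOB_ID is the ID of your job):

  resource.type="dataflow_step"
  resource.labels.job_id=JOB_ID
  log_id("dataflow.googleapis.com/job-message")
  "All workers have finished the startup processes"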

Identify stragglers

A straggler is a work item that is slow relative to other work items in the stage. For information about identifying and fixing stragglers, see Troubleshoot stragglers in batch jobs.

Identify slow or stuck stages

To identify slow or stuck stages, use the Stage progress view. Longer bars indicate that the stage takes more time. Use this view to identify the slowest stages in your pipeline.

After you find the bottleneck stage, you can take the following steps:

  • Identify the lagging worker within that stage.
  • If there are no lagging workers, identify the slowest step by using the Stage info panel. Use this information to identify candidates for user code optimization.
  • To find parallelism bottlenecks, use Dataflow monitoring metrics.

Identify a lagging worker

To identify a lagging worker for a specific stage, use the Worker progress view. This view shows whether all workers are processing work until the end of the stage, or whether a single worker is stuck on a lagging task. If you find a lagging worker, check that worker's logs for warnings, errors, and Operation ongoing messages to see where the work item is spending time.

Tools for debugging

When you have a slow or stuck pipeline, the following tools can help you diagnose the problem.

  • To correlate incidents and identify bottlenecks, use Cloud Monitoring for Dataflow.
  • To monitor pipeline performance, use Cloud Profiler.
  • Some transforms are better suited to high-volume pipelines than others. Log messages can identify a stuck user transform in either batch or streaming pipelines.
  • To learn more about a stuck job, use Dataflow job metrics. The following list includes useful metrics:
    • The Backlog bytes metric (backlog_bytes) measures the amount of unprocessed input in bytes by stage. Use this metric to find a fused step that has no throughput. Similarly, the backlog elements metric (backlog_elements) measures the number of unprocessed input elements for a stage.
    • The Processing parallelism keys (processing_parallelism_keys) metric measures the number of parallel processing keys for a particular stage of the pipeline over the last five minutes. Use this metric to investigate in the following ways:
      • Narrow the issue down to specific stages and confirm hot key warnings, such as A hot key ... was detected.
      • Find throughput bottlenecks caused by insufficient parallelism. These bottlenecks can result in slow or stuck pipelines.
    • The System lag metric (system_lag) and the per-stage system lag metric (per_stage_system_lag) measure the maximum amount of time an item of data has been processing or awaiting processing. Use these metrics to identify inefficient stages and bottlenecks from data sources.

For additional metrics that aren't included in the Dataflow monitoring web interface, see the complete list of Dataflow metrics in Google Cloud metrics.