Troubleshoot Dataflow autoscaling


This page shows you how to resolve issues with the Dataflow autoscaling features and provides information about how to manage autoscaling.

Job doesn't scale up or down

This section provides information about scenarios that might prevent workers from scaling up or down.

Streaming job doesn't scale up

When your streaming pipeline has a backlog, the workers don't scale up.

This issue occurs when the backlog lasts less than a few minutes or when the workers are using less than 20% of their CPUs.

Sometimes, backlog is elevated but CPU utilization is low. Because some tasks do not require high CPU utilization, adding workers doesn't improve performance. In those cases, Dataflow doesn't scale up. For more information, see Streaming autoscaling. This scenario might occur for the following reasons:

  • The pipeline is I/O intensive.
  • The pipeline is waiting for external RPC calls.
  • Hot keys cause uneven worker CPU utilization.
  • The pipeline does not have enough keys.
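
The last two causes share one mechanism: each key is processed by a single worker, so the number of distinct keys caps the useful parallelism. The following snippet is an illustrative model of that cap, not Dataflow's actual scaling logic; the function name is hypothetical.

```python
# Illustrative model (not Dataflow's implementation): each key is
# processed by at most one worker, so effective parallelism is capped
# by the number of distinct keys in the pipeline.

def effective_parallelism(num_keys: int, num_workers: int) -> int:
    """Number of workers that can do useful keyed work in parallel."""
    return min(num_keys, num_workers)

# With only 4 distinct keys, scaling from 4 to 16 workers adds no
# parallelism, so adding workers wouldn't reduce the backlog.
assert effective_parallelism(num_keys=4, num_workers=4) == 4
assert effective_parallelism(num_keys=4, num_workers=16) == 4
```

Because adding workers beyond the key count can't increase throughput, Dataflow has no reason to scale up in this situation even when backlog is elevated.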

Batch and streaming jobs don't scale up

Your batch or streaming job runs as expected, but when more workers are needed, the job doesn't scale up.

This issue might occur for one of the following reasons:

  • The staging or temp files are inaccessible. If your job uses a Cloud Storage bucket, the bucket might have a lifecycle configuration that deletes objects in the bucket, including staging and temp folders and files. To verify whether files have been deleted, check the lifecycle configuration for the bucket. If the staging or temp folders or files were deleted after the job started, the packages required to create new workers might not exist. To resolve this issue, recreate the folders and files in the bucket.
  • Firewall rules prevent workers from sending and receiving traffic on the necessary TCP ports. Firewall rules might prevent workers from starting. Dataflow workers need to be able to send and receive traffic on TCP ports 12345 and 12346. For more information, including steps to resolve this issue, see Firewall rules for Dataflow.
  • A custom source has a getProgress() method that returns a NULL value. When you use a custom source, the backlog metrics rely on the return value of your custom source's getProgress() method to start collecting data. The default implementation for getProgress() returns a NULL value. To resolve this issue, ensure that your custom source overrides the default getProgress() method to return a non-NULL value.
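
To see why a NULL progress value matters, consider the following pure-Python sketch. It models the relationship between a reader's progress value and backlog estimation; the class and method names are hypothetical and don't reflect the Beam SDK API.

```python
# Hypothetical model (not the Beam SDK API): backlog estimation needs
# a non-null progress value from the source's reader.

class DefaultReader:
    def get_progress(self):
        # Models a default implementation that reports no progress.
        return None

class InstrumentedReader(DefaultReader):
    def __init__(self, consumed: int, total: int):
        self.consumed, self.total = consumed, total

    def get_progress(self):
        # Override to report the fraction of input consumed.
        return self.consumed / self.total

def backlog_fraction(reader):
    progress = reader.get_progress()
    if progress is None:
        return None  # No signal: backlog can't be estimated.
    return 1.0 - progress

assert backlog_fraction(DefaultReader()) is None
assert backlog_fraction(InstrumentedReader(25, 100)) == 0.75
```

With the default reader, the backlog metric never gets a value, so autoscaling has nothing to act on; the instrumented reader gives it a usable signal.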

Streaming job doesn't scale down

When your streaming job has a low backlog and low CPU utilization, the workers don't scale down. This issue can occur for any of the following reasons:

  • When jobs don't use Streaming Engine, Dataflow balances the number of persistent disks between the workers, meaning each worker must have an equal number of persistent disks. For example, if there are 100 disks and 100 workers, each worker has one disk. When the job scales down, the job can have 50 workers with two persistent disks per worker. The job doesn't scale down again until it can have 25 workers with four persistent disks per worker. In addition, the minimum number of workers is the value assigned to maxNumWorkers divided by 15. For more information, see Scaling range for streaming autoscaling pipelines.

  • When jobs use Streaming Engine, the downscaling target is based on a target CPU utilization of 75%. When this CPU utilization can't be achieved, downscaling is disabled.

  • The backlog time estimate needs to stay below ten seconds for at least two minutes before workers scale down. Fluctuations in backlog time might disable scaling down. In addition, low throughput can skew the time estimate.
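
The last condition can be sketched as a simple check over recent backlog samples. The thresholds (10 seconds, two minutes) come from the text above; the sampling period and function are illustrative assumptions, not Dataflow's implementation.

```python
# Illustrative check of the downscaling condition described above:
# the backlog time estimate must stay below 10 seconds for at least
# 120 seconds before workers can scale down. Assumes one backlog
# sample every 10 seconds (a hypothetical sampling period).

def can_downscale(backlog_seconds: list, sample_period_s: float = 10.0) -> bool:
    """backlog_seconds: most recent backlog samples, oldest first."""
    required = int(120 / sample_period_s)  # samples covering two minutes
    recent = backlog_seconds[-required:]
    return len(recent) == required and all(b < 10.0 for b in recent)

assert can_downscale([5.0] * 12)               # two minutes of low backlog
assert not can_downscale([5.0] * 11 + [30.0])  # one spike blocks downscaling
```

This is why fluctuating backlog can keep a job from scaling down: any sample above the threshold restarts the required quiet window.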

Scaling up stops

Your batch or streaming job starts scaling up, but the workers stop scaling up even though a backlog remains.

This issue occurs when quota limits are reached.

  • Compute Engine quotas: Dataflow jobs are subject to the project's Compute Engine quota. If multiple jobs are running, the project might be at the limit of its Compute Engine quota, in which case, Dataflow can't increase the number of workers.
  • CPU quotas: Dataflow jobs are also subject to the project's CPU quota. If the worker type uses more than 1 CPU, the project might be at the limit of the CPU quota.
  • External IP address quotas: When your job uses external IP addresses to communicate with resources, you need as many external IP addresses as workers. When the number of workers scales up, the number of external IP addresses also increases. When you reach the IP address limit, the workers stop scaling up.

In addition, if the region you choose has run out of a resource, you can't create new resources of that type, even if you have remaining quota in your region or project. For example, you might still have quota to create external IP addresses in us-central1, but that region might not have available IP addresses. For more information, see Quotas and resource availability.

To resolve this issue, request a quota increase or run the job in a different region.

CPU is unevenly distributed

When the job is autoscaling, CPU utilization is unevenly distributed among workers. Some workers have higher CPU utilization, system latency, or data freshness than others.

This issue can occur if your data contains a hot key. A hot key is a key with enough elements to negatively impact pipeline performance. Each key must be processed by a single worker, so the work can't be split between workers.

For more information, see A hot key ... was detected.
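
The following snippet models the skew a hot key causes. It is an illustrative model, not Dataflow's routing logic: it assumes hash-based key-to-worker assignment, and the function name is hypothetical.

```python
# Illustrative model (not Dataflow internals): elements are routed to
# workers by key, so every element of a hot key lands on one worker.

def worker_load(elements_per_key: dict, num_workers: int) -> list:
    """Return the element count handled by each worker."""
    load = [0] * num_workers
    for key, count in elements_per_key.items():
        load[hash(key) % num_workers] += count
    return load

# 9,700 of 10,000 elements share one key: a single worker handles most
# of the work, no matter how many workers autoscaling adds.
load = worker_load({"hot": 9700, "a": 100, "b": 100, "c": 100}, num_workers=4)
assert max(load) >= 9700
```

The overloaded worker shows high CPU utilization, system latency, and data freshness, while the others sit mostly idle, which matches the symptoms described above.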

The work item requesting state read is no longer valid on the backend

During communication between worker VM instances and Streaming Engine tasks in a streaming pipeline, the following error occurs:

The work item requesting state read is no longer valid on the backend.
The work has already completed or will be retried.
This is expected during autoscaling events.

During autoscaling, worker VM instances communicate with multiple Streaming Engine tasks, and each task serves multiple worker VM instances. Item keys are used to distribute the work. Each task and worker VM instance have a collection of key ranges, and the distribution of these ranges can change dynamically. For example, during autoscaling, job resizing can cause the key range distribution to change. When a key range changes, this error can occur. The error is expected, and unless you see a correlation between these messages and an underperforming pipeline, you can ignore it.
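
To illustrate why reads can hit a stale owner, the following sketch models key ranges assigned to backend tasks and shows that a resize reshuffles the assignment. The round-robin assignment scheme is a simplifying assumption, not how Streaming Engine actually distributes ranges.

```python
# Illustrative model: key ranges are assigned to backend tasks, and a
# resize changes the assignment, invalidating in-flight state reads
# that were addressed to a range's previous owner.

def assign_ranges(num_ranges: int, num_tasks: int) -> dict:
    """Map each key range to a task, round-robin (a simplification)."""
    return {r: r % num_tasks for r in range(num_ranges)}

before = assign_ranges(16, 4)
after = assign_ranges(16, 6)   # autoscaling changed the task count

moved = [r for r in before if before[r] != after[r]]
assert moved  # some ranges changed owners after the resize
```

A state read issued against the old owner of a moved range fails with the error above and is retried against the new owner, which is why the message is harmless on its own.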

Scaling range for streaming autoscaling pipelines

This section provides details about the scaling range for streaming autoscaling pipelines.

Java

For streaming autoscaling jobs that do not use Streaming Engine, the Dataflow service allocates between 1 and 15 Persistent Disks to each worker. This means that the minimum number of workers used for a streaming autoscaling pipeline is N/15, where N is the value of --maxNumWorkers.

For streaming autoscaling jobs that use Streaming Engine, the minimum number of workers is 1.

Dataflow balances the number of Persistent Disks between the workers. For example, if your pipeline needs 3 or 4 workers in steady state, you could set --maxNumWorkers=15. The pipeline automatically scales between 1 and 15 workers, using 1, 2, 3, 4, 5, 8, or 15 workers, which corresponds to 15, 8, 5, 4, 3, 2, or 1 Persistent Disks per worker, respectively.
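
The worker counts in this example can be derived arithmetically. Assuming the service allocates maxNumWorkers Persistent Disks in total, as the minimum-worker formula above implies, the reachable worker counts are ceil(maxNumWorkers / d) for each possible disks-per-worker value d from 1 to 15. The following sketch reproduces the example:

```python
import math

# Sketch of the disk-balancing arithmetic above: with maxNumWorkers
# Persistent Disks in total and each worker holding an equal share of
# d disks (1 <= d <= 15), the reachable worker counts are
# ceil(maxNumWorkers / d).

def reachable_worker_counts(max_num_workers: int) -> list:
    return sorted({math.ceil(max_num_workers / d) for d in range(1, 16)})

# Matches the example in the text for --maxNumWorkers=15.
assert reachable_worker_counts(15) == [1, 2, 3, 4, 5, 8, 15]
```

This also shows why scaling down happens in coarse steps for non-Streaming Engine jobs: the job can only settle on worker counts where the disks divide evenly.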

--maxNumWorkers can be 1000 at most.

Python

For streaming autoscaling jobs that do not use Streaming Engine, the Dataflow service allocates between 1 and 15 Persistent Disks to each worker. This means that the minimum number of workers used for a streaming autoscaling pipeline is N/15, where N is the value of --max_num_workers.

For streaming autoscaling jobs that use Streaming Engine, the minimum number of workers is 1.

Dataflow balances the number of Persistent Disks between the workers. For example, if your pipeline needs 3 or 4 workers in steady state, you could set --max_num_workers=15. The pipeline automatically scales between 1 and 15 workers, using 1, 2, 3, 4, 5, 8, or 15 workers, which corresponds to 15, 8, 5, 4, 3, 2, or 1 Persistent Disks per worker, respectively.

--max_num_workers can be 1000 at most.

Go

For streaming autoscaling jobs that do not use Streaming Engine, the Dataflow service allocates between 1 and 15 Persistent Disks to each worker. This means that the minimum number of workers used for a streaming autoscaling pipeline is N/15, where N is the value of --max_num_workers.

For streaming autoscaling jobs that use Streaming Engine, the minimum number of workers is 1.

Dataflow balances the number of Persistent Disks between the workers. For example, if your pipeline needs 3 or 4 workers in steady state, you could set --max_num_workers=15. The pipeline automatically scales between 1 and 15 workers, using 1, 2, 3, 4, 5, 8, or 15 workers, which corresponds to 15, 8, 5, 4, 3, 2, or 1 Persistent Disks per worker, respectively.

--max_num_workers can be 1000 at most.

Maximum number of workers streaming autoscaling might use

Java

Dataflow operates within the limits of your project's Compute Engine instance count quota or maxNumWorkers, whichever is lower.

Python

Dataflow operates within the limits of your project's Compute Engine instance count quota or max_num_workers, whichever is lower.

Go

Dataflow operates within the limits of your project's Compute Engine instance count quota or max_num_workers, whichever is lower.

Limit autoscaling to reduce the impact on billing

If you don't want autoscaling to increase your bill, you can limit the maximum number of workers that your streaming job can use.

Java

By specifying --maxNumWorkers, you limit the scaling range used to process your job.

Python

By specifying --max_num_workers, you limit the scaling range used to process your job.

Go

By specifying --max_num_workers, you limit the scaling range used to process your job.

Change scaling range

To change the scaling range on a streaming pipeline, follow these steps.

Java

You can't change the scaling range with Update. You must stop your pipeline by using Cancel or Drain and redeploy your pipeline with the new desired maxNumWorkers.

Python

You can't change the scaling range with Update. You must stop your pipeline by using Cancel or Drain and redeploy your pipeline with the new desired max_num_workers.

Go

You can't change the scaling range with Update. You must stop your pipeline by using Cancel or Drain and redeploy your pipeline with the new desired max_num_workers.

Turn off autoscaling on streaming pipelines

To turn off autoscaling on a streaming pipeline, follow these steps.

Java

Set --autoscalingAlgorithm=NONE. Update the pipeline with fixed cluster specifications, as described in the manual scaling documentation, where numWorkers is within the scaling range.

Python

Set --autoscaling_algorithm=NONE. Update the pipeline with fixed cluster specifications, as described in the manual scaling documentation, where num_workers is within the scaling range.

Go

Set --autoscaling_algorithm=NONE. Update the pipeline with fixed cluster specifications, as described in the manual scaling documentation, where num_workers is within the scaling range.

Use a fixed number of workers

For streaming jobs that don't use Streaming Engine, streaming autoscaling is opt-in; it isn't enabled by default. These jobs use a fixed number of workers unless you enable autoscaling, so to keep using a fixed number of workers, you don't need to do anything.