Common error guidance

This page describes some common errors you might encounter when running your Dataflow job, and suggests some courses of action for dealing with those errors.

This page covers the following error categories:

  • Resource stock-out errors
  • Job messages
  • Job submission errors
  • Worker logs
  • Input and output errors
  • Worker startup errors

Resource stock-out errors

Resolving errors for an exhausted resource pool

Sometimes, when you create a Google Cloud resource, you might see an error for an exhausted resource pool. This error indicates a temporary stock-out condition for a specific resource in a specific zone.


To resolve the issue, you can either wait for a period of time or create the same resource in another zone. As a best practice, we recommend that you distribute your resources across multiple zones and regions to tolerate outages.

Job messages

The job failed because a work item failed 4 times.

If a single operation causes the worker code to fail four times, either by throwing an exception or by crashing, Dataflow fails the job and displays the message a work item failed 4 times. You can't configure this failure threshold. For more details, refer to pipeline error and exception handling.

Look in the job's Cloud Monitoring logs for the four individual failures. Look for Error-level or Fatal-level log entries in the worker logs that show exceptions or errors. You should see at least four of these exceptions or errors.

Encoding errors, IOExceptions, or unexpected behavior in user code.

The Apache Beam SDKs and the Dataflow workers depend on common third-party components. These components import additional dependencies. Version collisions can result in unexpected behavior in the service. If you are using any of these packages in your code, be aware that some libraries are not forward-compatible. You may need to pin to the listed versions that will be in scope during execution. SDK and Worker Dependencies contains a list of dependencies and their required versions.

Jobs that used to run now fail with Staged inaccessible errors

  • Verify that the Cloud Storage bucket used for staging does not have TTL settings that cause staged packages to be deleted.
  • Verify that your Dataflow project's controller service account has permission to access the Cloud Storage bucket used for staging. Permission gaps can occur for any of the following reasons:

    • The Cloud Storage bucket used for staging is present in a different project.
    • The Cloud Storage bucket used for staging was migrated from fine-grained access to uniform bucket-level access. Uniform bucket-level access disallows ACLs on Cloud Storage resources, and because IAM and ACL policies are not kept consistent during the migration, the controller service account in your Dataflow project can lose the access to the staging bucket that it previously held through ACLs.

For more information, see Accessing Cloud Storage buckets across Google Cloud projects.

Flex Template polling timeout

Your Flex Template job might return the following error message:

Timeout in polling result file: ${file_path}.
Service account: ${service_account_email}
Image URL: ${image_url}
Troubleshooting guide at

This could be due to the following reasons:

  1. The base Docker image was overridden.
  2. The service account that fills in ${service_account_email} does not have some necessary permissions.
  3. The program that creates the graph takes too long to finish.
  4. (Python only) There is a problem with the requirements.txt file.
  5. There was a transient error.

Beyond checking job logs and retrying in case of transient failures, here are some additional troubleshooting steps.

Verify Docker entrypoint

This step applies if you are running a template from a custom Docker image rather than using one of the provided templates.

Check for the container entrypoint using the following command:

docker inspect $TEMPLATE_IMAGE

In the command output, check the Entrypoint field. If it does not match the default entrypoint of the Flex Template base image that you built from, your Docker container's entrypoint is overridden. Rebuild $TEMPLATE_IMAGE so that it uses the default entrypoint.

Check service account permissions

Check that the service account mentioned in the message has the following permissions:

  • It must be able to read and write the Cloud Storage path that fills in ${file_path} in the message.
  • It must be able to read the Docker image that fills in ${image_url} in the message.

Check if the launcher program fails to exit

The program that constructs the pipeline must finish before the pipeline can be launched. The polling error could indicate that it took too long to do so.

Some things you can do to locate the cause in code are:

  • Check job logs and see if any operation appears to take a long time to complete. An example would be a request for an external resource.
  • Make sure no threads are blocking the program from exiting. Some clients create their own threads; if those clients are not shut down, the program waits forever for the threads to be joined.

Pipelines launched directly (without a template) are not subject to these limitations, so if the pipeline worked when launched directly but fails as a template, the launcher timeout could be the root cause.
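To illustrate the thread point, the following is a minimal plain-Python sketch (all names are hypothetical; this is not Dataflow launcher code) of one way to guarantee that helper threads are joined before graph construction returns:

```python
import concurrent.futures

def fetch_schema(table_name):
    # Stand-in for a call to an external resource made during
    # pipeline construction.
    return {"table": table_name, "fields": ["a", "b"]}

def build_graph(tables):
    # Using the executor as a context manager joins all worker
    # threads before returning, so the launcher process can exit
    # as soon as graph construction finishes. A client that spawns
    # non-daemon threads and is never shut down would instead keep
    # the process alive indefinitely.
    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(fetch_schema, tables))

schemas = build_graph(["users", "orders"])
print(len(schemas))
```

If you use client libraries that manage their own thread pools, call their close or shutdown methods before the launcher program returns.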

(Python Only) Remove Apache Beam from the Requirements File

If your Dockerfile includes a requirements.txt with apache-beam[gcp], then you should remove it from the file and install it separately. Example:

RUN pip install apache-beam[gcp]
RUN pip install -U -r ./requirements.txt

Putting Beam in the requirements file is known to lead to long launch times, and it often causes a timeout.

Flex Template job in Python queues and times out

If you are running a Dataflow job by using Flex Template and Python, your job might queue for a period of time and then fail to run. A Timeout in polling error message is displayed.

The error is caused by the requirements.txt file that's used to install the required dependencies. When you launch a Dataflow job, all of the dependencies are staged first so that these files are accessible to the worker VMs. This process involves downloading and recompiling every direct and indirect dependency in the requirements.txt file. Some dependencies might take several minutes to compile. Notably, PyArrow, an indirect dependency used by Apache Beam and most Google Cloud client libraries, is slow to compile.

As a workaround, perform the following steps:

  1. In the Dockerfile, download the precompiled dependencies into the Dataflow staging directory.

  2. Set the environment variable PIP_NO_DEPS to True.

    The setting prevents pip from re-downloading and re-compiling all the dependencies, which helps prevent the timeout error.

The following is a code sample that shows how the dependencies are pre-downloaded.

# Copyright 2020 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.



COPY . /template

# libffi-dev and git are required only if your dependencies need them.
RUN apt-get update \
    && apt-get install -y libffi-dev git \
    && rm -rf /var/lib/apt/lists/* \
    # Upgrade pip and install the requirements.
    && pip install --no-cache-dir --upgrade pip \
    && pip install --no-cache-dir -r $FLEX_TEMPLATE_PYTHON_REQUIREMENTS_FILE \
    # Download the requirements to speed up launching the Dataflow job.
    && pip download --no-cache-dir --dest /tmp/dataflow-requirements-cache -r $FLEX_TEMPLATE_PYTHON_REQUIREMENTS_FILE

# Since we already downloaded all the dependencies, there's no need to rebuild everything.
ENV PIP_NO_DEPS=True

Job submission errors

413 Request Entity Too Large / The size of serialized JSON representation of the pipeline exceeds the allowable limit

If you encounter an error about the JSON payload when submitting your job, it means your pipeline's JSON representation exceeds the maximum 20 MB request size. These errors might appear as one of the following messages in your console or terminal window:

  • 413 Request Entity Too Large
  • "The size of serialized JSON representation of the pipeline exceeds the allowable limit"
  • "Failed to create a workflow job: Invalid JSON payload received"
  • "Failed to create a workflow job: Request payload exceeds the allowable limit"

The size of your job is specifically tied to the JSON representation of the pipeline. A larger pipeline means a larger request. Dataflow currently has a limitation that caps requests at 20 MB.

To estimate the size of your pipeline's JSON request, run your pipeline with the following option:

Java:

--dataflowJobFile=< path to output file >

Python:

--dataflow_job_file=< path to output file >

This command writes a JSON representation of your job to a file. The size of the serialized file is a good estimate of the size of the request. The actual size is slightly larger due to additional information included in the request.

Certain conditions in your pipeline can cause the JSON representation to exceed the limit. Common conditions include:

  • A Create transform that includes a large amount of in-memory data.
  • A large DoFn instance that is serialized for transmission to remote workers.
  • A DoFn as an anonymous inner class instance that (possibly inadvertently) pulls in a large amount of data to be serialized.

To avoid these conditions, consider restructuring your pipeline.
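A quick local check, sketched below in plain Python (the class is hypothetical; the Beam Python SDK serializes DoFn instances in a similar, pickle-based way), is to pickle the DoFn object and inspect its size. Data captured by the instance ends up in the job payload:

```python
import pickle

class LookupDoFn:
    """Stand-in for a DoFn that captures an in-memory lookup table."""
    def __init__(self, table):
        self.table = table

    def process(self, element):
        yield self.table.get(element, element)

small_fn = LookupDoFn({i: i for i in range(10)})
large_fn = LookupDoFn({i: "x" * 100 for i in range(50_000)})

small_size = len(pickle.dumps(small_fn))
large_size = len(pickle.dumps(large_fn))

# The captured table dominates the serialized size, which is what
# inflates the pipeline's JSON representation.
print(small_size, large_size)
```

In the Java SDK, the analogous concern is an anonymous inner-class DoFn, which serializes its enclosing instance along with itself.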

The job graph is too large. Please try again with a smaller job graph, or split your job into two or more smaller jobs.

Your job's graph size must not exceed 10 MB. Certain conditions in your pipeline can cause the job graph to exceed the limit. Common conditions include:

  • A Create transform that includes a large amount of in-memory data.
  • A large DoFn instance that is serialized for transmission to remote workers.
  • A DoFn as an anonymous inner class instance that (possibly inadvertently) pulls in a large amount of data to be serialized.

To avoid these conditions, consider restructuring your pipeline.

Total number of BoundedSource objects generated by splitIntoBundles() operation is larger than the allowable limit or Total size of the BoundedSource objects generated by splitIntoBundles() operation is larger than the allowable limit.


You might encounter this error if you're reading from a very large number of files via TextIO, AvroIO, or some other file-based source. The particular limit depends on the details of your source (e.g. embedding schema in AvroIO.Read will allow fewer files), but it is on the order of tens of thousands of files in one pipeline.

You might also encounter this error if you've created a custom data source for your pipeline and your source's splitIntoBundles method returned a list of BoundedSource objects that takes up more than 20 MB when serialized.

The allowable limit for the total size of the BoundedSource objects generated by your custom source's splitIntoBundles() operation is 20 MB. You can work around this limitation by modifying your custom BoundedSource subclass so that the total size of the generated BoundedSource objects is smaller than the 20 MB limit. For example, your source might generate fewer splits initially, and rely on Dynamic Work Rebalancing to further split inputs on demand.
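The idea of generating fewer initial splits can be sketched in plain Python (toy code, not the Beam BoundedSource API): the serialized size of the split list scales with the number of bundles, so emitting coarser bundles keeps the metadata under the limit:

```python
import pickle

def split_range(start, stop, desired_bundles):
    """Toy splitIntoBundles: cap the number of (start, stop) bundles."""
    step = max(1, (stop - start) // desired_bundles)
    bundles = []
    pos = start
    while pos < stop:
        bundles.append((pos, min(pos + step, stop)))
        pos += step
    return bundles

fine = split_range(0, 1_000_000, 100_000)   # one bundle per 10 elements
coarse = split_range(0, 1_000_000, 100)     # one bundle per 10,000 elements

# Serialized split metadata shrinks roughly in proportion to the bundle
# count; Dynamic Work Rebalancing can still subdivide coarse bundles.
print(len(fine), len(coarse))
assert len(pickle.dumps(coarse)) < len(pickle.dumps(fine)) // 100
```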

"SDK pipeline options or staging file list exceeds size limit. Please keep their length under 256K Bytes each and 512K Bytes in total."

The pipeline could not be started due to Google Compute Engine metadata limits being exceeded. These limits cannot be changed. Dataflow uses Compute Engine metadata for pipeline options. The limit is documented in the Compute Engine custom metadata limitations.

Staging too many JAR files is a common cause of exceeding this limit.

To find the full pipeline options and their length, run your pipeline with the following option:

Java:

--dataflowJobFile=< path to output file >

Python:

--dataflow_job_file=< path to output file >

The size of the output file from this command must be less than 256 KB. The 512 KB in the error message refers to the total size of that output file plus the custom metadata options for the Compute Engine VM instance. You can get a rough estimate of the custom metadata options from the running Dataflow jobs in the project: choose any running Dataflow job, take one of its VM instances, and navigate to the Compute Engine VM instance details page for that VM to check the custom metadata section. The combined length of the custom metadata and the output file should be less than 512 KB. An accurate estimate for the failed job is not possible, because the VMs for that job were never spun up.

If your JAR list is hitting the 256 KB limit, review it and remove any unnecessary JAR files. If it is still too large, try running the Dataflow job with an uber JAR.

The Cloud Functions section of the documentation has an example of how to create and use an uber JAR.

Startup of the worker pool in zone $ZONE failed to bring up any of the desired $N workers. The project quota may have been exceeded or access control policies may be preventing the operation; review the Cloud Logging 'VM Instance' log for diagnostics.

This error has two possible causes:

  • You might have exceeded one of the Compute Engine quotas that Dataflow worker creation relies on.
  • Your organization has constraints in place that prohibit some aspect of the VM instance creation process, like the account being used, or the zone being targeted.

To troubleshoot this issue:

Review the VM Instance log

  1. Go to the Cloud Logging viewer.
  2. In the Audited Resource drop-down list, select VM Instance.
  3. In the All logs drop-down list, select the log for Compute Engine activity.
  4. Scan the log for any entries related to VM instance creation failure.

Check your usage of Compute Engine quotas

  1. On your local machine command line or in Cloud Shell, run the following command to view Compute Engine resource usage compared to Dataflow quotas for the zone you are targeting:

    gcloud compute regions describe [REGION]

  2. Review the results to see whether any resources, such as CPUs, persistent disk, or in-use IP addresses, are exceeding quota.
  3. If needed, request a quota change.

Review your organization policy constraints

  1. Go to the Organization policies page.
  2. Review the constraints for any that might limit VM instance creation for either the account you are using (this is the Dataflow service account by default) or in the zone you are targeting.

Worker logs

"Processing stuck/Operation ongoing in step <step_id> for at least <time_interval> without outputting or completing in state finish at <stack_trace>"

If Dataflow spends more time executing a DoFn than the time specified in <time_interval> without returning, this message is displayed.

This error has two possible causes:

  • Your DoFn code is simply slow, or is waiting for a slow external operation to complete.
  • Your DoFn code is stuck, deadlocked, or abnormally slow to finish processing.

To determine which of these is the case, expand the Cloud Monitoring log entry to see a stack trace. Look for messages indicating that the DoFn code is stuck or otherwise encountering issues. If none are present, the issue might be the execution speed of the DoFn code. Consider using Cloud Profiler or another tool to investigate the code's performance.

If your pipeline is built on the Java VM (using either Java or Scala), you can further investigate the cause of the stuck code by taking a full thread dump of the whole JVM (not just the stuck thread). Follow these steps:

  1. Make note of the worker name from the log entry.
  2. In the Compute Engine section of the Cloud Console, find the Compute Engine instance with the worker name you noted.
  3. SSH into the instance with that name.
  4. Run the following command:

    curl http://localhost:8081/threadz

Shuffle key too large exceptions

If you encounter exceptions indicating the shuffle key is too large, it means that the serialized key emitted to a particular (Co-)GroupByKey, after the corresponding coder is applied, is too large. Dataflow has a limit for serialized shuffle keys. Consider reducing the size of the keys or using more space-efficient coders.


In streaming scenarios, this might be caused by grouping a very large amount of data without using a Combine transform, or by producing a large amount of data from a single input element. The following strategies can reduce the chance of raising this exception:

  • Ensure that processing a single element cannot result in outputs or state modifications exceeding the limit.
  • If multiple elements were grouped by a key, consider increasing the key space to reduce the elements grouped per key.
  • If elements for a key are emitted at a high frequency over a short period of time, the windows for that key can accumulate many GB of events. Rewrite the pipeline to detect such keys and emit only an output indicating that the key was frequently present in that window.
  • Use sub-linear space Combine transforms for commutative and associative operations. Don't use a combiner if it doesn't reduce space. For example, a combiner for strings that just appends them together is worse than not using a combiner at all.
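The last two bullets can be illustrated in plain Python (toy code; in a pipeline the same ideas map to Beam's Combine transforms). A count occupies constant space no matter how many elements arrive, appending strings keeps the accumulator as large as the input, and sharding a hot key widens the key space before a final re-combine:

```python
from collections import defaultdict

events = [("user-1", "click") for _ in range(10_000)]

# Sub-linear combine: a per-key count stays a single integer.
counts = defaultdict(int)
for key, _ in events:
    counts[key] += 1

# Anti-pattern: a string-appending "combiner" accumulates the whole input.
blob = ",".join(value for _, value in events)
assert len(blob) > 10_000          # linear in the input size
assert counts["user-1"] == 10_000  # constant-space accumulator

# Hot-key mitigation: shard the key, combine per shard, then re-combine.
NUM_SHARDS = 10
sharded = defaultdict(int)
for i, (key, _) in enumerate(events):
    sharded[(key, i % NUM_SHARDS)] += 1

total = sum(v for (key, _), v in sharded.items() if key == "user-1")
print(len(sharded), total)
```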

RPC timeout exceptions, DEADLINE_EXCEEDED exceptions, or Server Unresponsive errors

If you encounter RPC timeouts, DEADLINE_EXCEEDED exceptions, or Server Unresponsive errors while your job runs, these typically indicate one of two problems:

  • The VPC network used for your job might be missing a firewall rule. The firewall rule needs to enable all TCP traffic among VMs in the VPC network you specified in your pipeline options. See Specifying your network and subnetwork for more details.

  • Your job is shuffle-bound. Consider one of, or a combination of, the following courses of action:


    Java:

    • If the job is not using the service-based shuffle, switch to the service-based Dataflow Shuffle by setting --experiments=shuffle_mode=service. For details and availability, read Dataflow Shuffle.
    • Add more workers. Try setting --numWorkers to a higher value when you run your pipeline.
    • Increase the size of the attached disk for workers. Try setting --diskSizeGb to a higher value when you run your pipeline.
    • Use an SSD-backed persistent disk. Try setting --workerDiskType="<project>/zones/<zone>/diskTypes/pd-ssd" when you run your pipeline.

    Python:

    • If the job is not using the service-based shuffle, switch to the service-based Dataflow Shuffle by setting --experiments=shuffle_mode=service. For details and availability, read Dataflow Shuffle.
    • Add more workers. Try setting --num_workers to a higher value when you run your pipeline.
    • Increase the size of the attached disk for workers. Try setting --disk_size_gb to a higher value when you run your pipeline.
    • Use an SSD-backed persistent disk. Try setting --worker_disk_type="<project>/zones/<zone>/diskTypes/pd-ssd" when you run your pipeline.

A call from the Java worker harness to a Python DoFn failed with the error <error message>

If a DoFn implemented in Python fails and throws an exception, a relevant error message is displayed.

Expand the Cloud Monitoring error log entry and look at the error message and traceback. It shows you which code failed so you can correct it if necessary. If you believe that this is a bug in Apache Beam or Dataflow, please report the bug.

No space left on device

When a job runs out of disk space you may see No space left on device errors in worker logs.

If a job downloads large dependencies at runtime, uses large custom containers, or writes a lot of temporary data to local disk, the worker's persistent storage might run out of free space.

When using Dataflow Shuffle, Dataflow sets a lower default disk size. As a result, jobs moving from worker-based shuffle might also see this error.

It is also possible for the worker boot disk to fill up if it is logging more than 50 entries per second.
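If your own code is the source of the log volume, a rate-limiting logging filter is one way to stay under that threshold. The sketch below is illustrative plain Python (the logger name and handler are hypothetical), not a Dataflow feature:

```python
import logging

class RateLimitFilter(logging.Filter):
    """Drop records beyond max_per_sec within any one-second window."""
    def __init__(self, max_per_sec=50):
        super().__init__()
        self.max_per_sec = max_per_sec
        self.window = -1
        self.count = 0

    def filter(self, record):
        window = int(record.created)
        if window != self.window:
            self.window, self.count = window, 0
        self.count += 1
        return self.count <= self.max_per_sec

class ListHandler(logging.Handler):
    """Captures records so the effect is visible without writing to disk."""
    def __init__(self):
        super().__init__()
        self.records = []

    def emit(self, record):
        self.records.append(record)

logger = logging.getLogger("noisy-dofn")
logger.setLevel(logging.INFO)
logger.propagate = False
handler = ListHandler()
handler.addFilter(RateLimitFilter(max_per_sec=50))
logger.addHandler(handler)

for i in range(500):
    logger.info("processed element %d", i)

# At most 50 records survive per one-second window.
print(len(handler.records))
```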

To see disk resources associated with a single worker, look up VM instance details for worker VMs associated with your job. Part of the disk space will be consumed by the operating system, binaries, logs, and containers.

To increase persistent disk or boot disk space, adjust the disk size pipeline option.

You can track disk space usage on the worker VM instances by using Cloud Monitoring. See Receive worker VM metrics from the Monitoring agent for instructions on setting this up.

You can also look for boot disk space issues by Viewing serial port output on the worker VM instances and looking for messages like "Failed to open system journal: No space left on device". If you have a large number of worker VM instances, you might want to create a script to run gcloud compute instances get-serial-port-output on all of them at once and review the output from that instead.

For more information, see Persistent disk resources.

EOFError: marshal data too short

This error sometimes happens when Python pipeline workers run out of disk space. See No space left on device.

Bad request warnings in shuffler logs

During the execution of a Dataflow job, Cloud Monitoring logs might display a series of warnings similar to:

Unable to update setup work item <step_id> error: generic::invalid_argument: Http(400) Bad Request
Update range task returned 'invalid argument'. Assuming lost lease for work with id <lease_id>
with expiration time: <timestamp>, now: <timestamp>. Full status: generic::invalid_argument: Http(400) Bad Request

"Bad request" warnings appear because worker state information is stale or out of sync due to processing delays. Often, your Dataflow job will succeed despite these "bad request" warnings. If that is the case, ignore the warnings.

A hot key <hot-key-name> was detected in...

These errors identify a hot key in your data. A hot key is a key with enough elements to negatively impact pipeline performance. These keys limit Dataflow's ability to process elements in parallel, which increases execution time.

To log the presence of a hot key, use the hot key pipeline option.

Check that your data is evenly distributed. If a key has disproportionately many values, consider rekeying your data, or combine the values for hot keys in stages (for example, by using Combine.PerKey.withHotKeyFanout in the Java SDK).

Python custom container pipeline fails with SystemError: unknown opcode

If you encounter an unexpected Python error SystemError: unknown opcode immediately after job submission, and the stack trace includes apache_beam/internal/, verify that the Python version you use locally matches the version in the container image up to the major and minor version. A difference in the patch version, such as 3.6.7 versus 3.6.8, does not create compatibility issues, but a difference in the minor version, such as 3.6.8 versus 3.8.2, can cause pipeline failures.
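The compatibility rule can be expressed as a small check (the docker command in the comment is one possible way to read the container's interpreter version; adjust it for your image):

```python
import sys

def versions_compatible(local, container):
    """Major and minor must match; the patch component may differ."""
    return local[:2] == container[:2]

# Locally:
local_version = tuple(sys.version_info[:3])
# In the container, something like the following prints the version:
#   docker run --rm --entrypoint python3 $IMAGE --version

assert versions_compatible((3, 6, 7), (3, 6, 8))      # patch differs: OK
assert not versions_compatible((3, 6, 8), (3, 8, 2))  # minor differs: not OK
print("local:", local_version)
```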

Input and output errors

The I/O metric charts use canonical error codes. If these error codes persist in your sources and sinks, refer to the following list for potential causes and actions you can take.

  • RESOURCE_EXHAUSTED. The project might have run out of resource quota for the service the source or sink is using.

    If the error occurs occasionally or when the Requests per sec chart indicates a high volume of requests being made, then this might indicate that you have reached an API rate limiting quota and need to increase the quota.

  • DEADLINE_EXCEEDED. Source or sink might have timed out reading or writing a large batch of data. Check the latency chart and worker logs. If the error persists, contact support.

  • INVALID_ARGUMENT. Parameters specified to the source or sink might be malformed (such as a Pub/Sub topic). Check configuration of the source or sink, and check the worker logs.

  • FAILED_PRECONDITION. Check configuration of the source or sink, and check the worker logs. This could also indicate a bug.

  • OUT_OF_RANGE. Check that the resource being used by the source or sink exists (such as a Pub/Sub topic or subscription).

  • UNAUTHENTICATED. Check that the Dataflow service account has Identity and Access Management permissions to the specific service and relevant APIs are enabled for the project.

  • PERMISSION_DENIED. Check that the Dataflow service account has Identity and Access Management permissions to the specific service and relevant APIs are enabled for the project.

  • NOT_FOUND. Check that the entities being used by the source or sink exist (such as a Pub/Sub topic or subscription).

  • ABORTED. The service might not be properly handling the source's or sink's attempts to read or write data. If the error persists, contact support.

  • ALREADY_EXISTS. I/O might be trying to create an entity which already exists (such as a Pub/Sub topic or subscription). If the error persists, contact support.

  • CANCELLED. This can occur when a Dataflow worker is shut down or source or sink logic intentionally decides to cancel attempts to read or write data.

  • DATALOSS. Indicates unrecoverable data loss or corruption occurred. You might want to create a new dataset for your sources and rerun the Dataflow job.

    You might also see if there are any backup and restoring instructions available for the underlying Google Cloud service.

  • UNKNOWN. Service might be down. Check for updates on Cloud Status Dashboard for more information.

  • INTERNAL. Service might be down. Check for updates on Cloud Status Dashboard for more information.

  • UNAVAILABLE. Service might be down. Check for updates on Cloud Status Dashboard for more information.

  • UNIMPLEMENTED. The source or sink attempted to use the service in an invalid way. Your pipeline might be misconfigured. If the error persists, contact support.

Worker startup errors

Errors in the worker-startup, harness-startup, and kubelet log types indicate configuration problems with a job. They can also indicate conditions that prevent the normal logging path from functioning.

NoClassDefFound errors

These errors can also be of the form Unable to initialize main class [...]. NoClassDefFound errors indicate that the JAR files being uploaded to the Dataflow service do not contain the required classes to run Dataflow. This most frequently occurs when the --filesToStage PipelineOption is overridden and some required JAR or class files are omitted.

Image pull request failed with error

This error indicates that a worker was unable to start up because the worker could not pull a Docker container image. This can happen if the custom SDK container image URL is incorrect, or the worker lacks credential or network access to the remote image.

If you are using a custom container image with your job, verify that your image URL is correct, has a valid tag or digest, and that the image is accessible by the Dataflow workers.

Verify that public images can be pulled locally by running docker pull $image from an unauthenticated machine.

There are additional considerations for private images or private workers.

  • Dataflow only supports private container images hosted on Container Registry. If you are using Container Registry to host your container image, the default Google Cloud service account has access to images in the same project. If you are using images in a different project than the one used to run your Google Cloud job, make sure to configure access control for the default Google Cloud service account.
  • If using shared Virtual Private Cloud (VPC), make sure that workers can access the custom container repository host.
  • Use SSH to connect to a running job worker VM and run docker pull $image to confirm directly that the worker is configured properly.

If workers fail several times in a row due to this error and no work has started on a job, the job can fail with an error like Job appears to be stuck. .... If you have removed access to the image while the job is running, either by removing the image itself or by revoking the Dataflow worker service account's credentials or internet access to the image, Dataflow only logs the errors and does not fail the job. Dataflow also avoids failing long-running streaming pipelines to avoid losing pipeline state.

Other possible errors can arise from repository quota issues or outages. If you experience issues exceeding the Docker Hub quota for pulling public images or general third-party repository outages, consider using Container Registry as the image repository.

Error syncing pod ... failed to "StartContainer"

This error indicates that a worker Docker container failed to start. This typically happens when one of the containers is continuously crashing on startup.

If the container is crashing, look for an error message describing the failure in the worker-startup, harness-startup, or docker logs. A common startup crash for Python jobs is a pip installation error (pip install failed with error...) caused by dependency issues, such as incompatible specified dependencies or network issues that make the specified repositories unreachable. The Apache Beam Python SDK installed in the container should be the same version as the SDK used to launch the job.

This error can also occur if the container image cannot be pulled during a long-running job. For more details, read "image pull request failed with error".

"A fatal error has been detected by the Java Runtime Environment"

This indicates that the pipeline is using Java Native Interface (JNI) to run non-Java code and that there is an error in that code or the JNI bindings.

Custom container pipeline fails or is slow to start up workers

A known issue can cause Dataflow to mark workers as unresponsive and restart them if the workers use a very large custom container (~10 GB) that takes a long time to download. This can make pipeline startup slow or, in extreme cases, prevent workers from starting up at all.

You can confirm this by looking for worker logs that show several unsuccessful attempts to pull a container and that kubelet does not start on the worker. For example, look for Pulling image <URL> entries without a corresponding Successfully pulled image <URL> entry, or for the absence of a Started Start kubelet entry after all images have been pulled.

A workaround for this issue is to run the pipeline with the --experiments=disable_worker_container_image_prepull pipeline option.


I don't see any logs

If you don't see any logs for your jobs, remove any exclusion filters containing resource.type="dataflow_step" from all of your Cloud Logging Log Router sinks.

Go to Log Router

For more details on removing your logs exclusions, refer to the Removing exclusions guide.


High Streaming Engine transfer quota usage

Jobs using Streaming Engine have a regional limit on how much data they can transfer to and from Streaming Engine. Staying below the limit is important because reaching it can affect job performance, and reducing the number of bytes transferred can also reduce billed cost.

To view your project's limit, see the instructions in Quotas & limits. To view your jobs' current usage, open Metrics Explorer. If necessary, select the project where your job is running, replace REGION with the region you're interested in, and click "Run Query".

Suggestions to reduce usage:

  • Reduce the amount of data processed by GroupByKey and Combine transforms. This could be done by reducing the number of elements passed or omitting unnecessary fields.
  • Encode data more efficiently to take up less space. In some cases this may increase CPU usage on the worker.

    • For custom data types, the default coder may be inefficient (SerializableCoder in Java, PickleCoder in Python). Try selecting a more efficient default coder for the custom type, or a PCollection with a schema.

      For a more detailed explanation on coders and how to specify them, see the Beam Programming Guide chapter on data encoding. To learn more about PCollections with schemas, see the chapter on schemas.

    • For large elements, applying compression could help.

  • Reduce large state reads. For example, a pipeline that has a large enough BagState will need to transfer it from Streaming Engine whenever it is fetched.

    • Try reducing state size.
    • Try reducing the frequency of state fetches.
    • State is tied to a specific key and window, so using non-global windowing can limit state size.

    To learn more about state, see the Beam Programming Guide chapter on State and Timers.
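The coder suggestion above can be illustrated locally (toy code; Python's fallback PickleCoder behaves like pickle here, and the packed form stands in for a schema-aware or custom coder):

```python
import pickle
import struct

class Point:
    """Hypothetical custom type with two integer fields."""
    def __init__(self, x, y):
        self.x, self.y = x, y

p = Point(3, 4)

# Fallback coding pickles the object, storing class metadata with it.
pickled = pickle.dumps(p)

# A schema-style encoding stores only the field values.
packed = struct.pack("<qq", p.x, p.y)

print(len(pickled), len(packed))
```

The compact form trades a little encode/decode CPU on the worker for fewer bytes transferred, which matches the trade-off described above.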

Autoscaling: maxNumWorkers could be increased

Dataflow detected that a job is using the maximum number of allowed workers, maxNumWorkers (or max_num_workers), and that it might utilize more workers if this maximum was raised. For example, this recommendation would be given for a job that has maxNumWorkers set to 50, and is using 50 workers with an average worker CPU utilization above 80%.

Typically, increasing maxNumWorkers increases pipeline throughput: a batch pipeline could complete in less time, and a streaming pipeline could handle larger spikes in data and process more elements per second. However, this may come at an increased cost (see Worker resources pricing). See the Autoscaling guide for details on how the Autoscaling algorithm works and how to configure it.


  • Try increasing the maxNumWorkers pipeline option, or removing it. Without the option, Dataflow will use the defaults listed in the Autoscaling guide.
  • It's okay to do nothing if pipeline performance is adequate.
    • For batch pipelines, check that the total running time meets your requirements.
    • For streaming pipelines, check the Data freshness graph in the Job Metrics tab on the job page. Make sure it isn't continuously increasing, and that the values are within acceptable bounds.

High fan-out detected

Dataflow detected that a job has one or more transforms with a "high fan-out". This occurs when a ParDo that has a high output-to-input element count ratio is fused with a subsequent ParDo. In this situation, the second ParDo runs sequentially with the first. This forces all the output elements of a given input onto the same worker, reducing parallelism and slowing down performance.


  • Insert a GroupByKey and ungroup after your first ParDo. The Dataflow service never fuses ParDo operations across an aggregation.
  • Pass the intermediate PCollection as a side input to another ParDo. The Dataflow service always materializes side inputs.
  • Insert a Reshuffle step. Reshuffle prevents fusion, checkpoints the data, and reconfigures the windowing strategy so that no data is dropped. Dataflow supports Reshuffle even though it is marked deprecated in the Apache Beam documentation. Note that reshuffling data can increase the cost of running your pipeline.