
Troubleshoot Dataflow errors

If you run into problems with your Dataflow pipeline or job, this page lists error messages that you might see and provides suggestions for how to fix each error.

Errors in the log types dataflow.googleapis.com/worker-startup, dataflow.googleapis.com/harness-startup, and dataflow.googleapis.com/kubelet indicate configuration problems with a job. They can also indicate conditions that prevent the normal logging path from functioning.

Your pipeline might throw exceptions while processing data. Some of these errors are transient, for example when temporary difficulty accessing an external service occurs. Some of these errors are permanent, such as errors caused by corrupt or unparseable input data, or null pointers during computation.

Dataflow processes elements in arbitrary bundles and retries the complete bundle when an error is thrown for any element in that bundle. When running in batch mode, bundles including a failing item are retried four times. The pipeline fails completely when a single bundle fails four times. When running in streaming mode, a bundle including a failing item is retried indefinitely, which might cause your pipeline to permanently stall.

Exceptions in user code, for example, your DoFn instances, are reported in the Dataflow Monitoring Interface. If you run your pipeline with BlockingDataflowPipelineRunner, you also see error messages printed in your console or terminal window.

Consider guarding against errors in your code by adding exception handlers. For example, if you want to drop elements that fail some custom input validation done in a ParDo, use a try/catch block within your ParDo to handle the exception, then log and drop the element. For production workloads, implement an unprocessed message (dead-letter) pattern. To keep track of the error count, use aggregation transforms.
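For example, a Python ParDo might catch validation errors, count them with a metric, and route the failing elements to a dead-letter output instead of failing the bundle. The following is a minimal sketch under those assumptions, not a drop-in implementation:

    import logging

    import apache_beam as beam
    from apache_beam.metrics import Metrics

    class ParseRecordFn(beam.DoFn):
        """Illustrative DoFn that drops unparseable records into a dead-letter output."""

        DEAD_LETTER_TAG = "dead_letter"

        def __init__(self):
            self.parse_errors = Metrics.counter(self.__class__, "parse_errors")

        def process(self, element):
            try:
                yield int(element)  # stand-in for your custom parsing or validation
            except (TypeError, ValueError) as error:
                self.parse_errors.inc()
                logging.warning("Dropping element %r: %s", element, error)
                yield beam.pvalue.TaggedOutput(self.DEAD_LETTER_TAG, element)

    with beam.Pipeline() as pipeline:
        results = (
            pipeline
            | "Create" >> beam.Create(["1", "2", "not-a-number"])
            | "Parse" >> beam.ParDo(ParseRecordFn()).with_outputs(
                ParseRecordFn.DEAD_LETTER_TAG, main="parsed")
        )
        parsed = results.parsed
        dead_letters = results[ParseRecordFn.DEAD_LETTER_TAG]
        # Write dead_letters to a sink such as Cloud Storage or Pub/Sub for later analysis.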

Missing log files

If you don't see any logs for your jobs, remove any exclusion filters containing resource.type="dataflow_step" from all of your Cloud Logging Log Router sinks.

For more details about removing your logs exclusions, refer to the Removing exclusions guide.
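For example, you can list your Log Router sinks and inspect their exclusion filters with the gcloud CLI. The _Default sink shown here is a common place for exclusions, but check every sink in your project:

    gcloud logging sinks list
    gcloud logging sinks describe _Default --format="yaml(exclusions)"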

Pipeline errors

The following sections contain common pipeline errors that you might encounter and steps for resolving or troubleshooting the errors.

Some Cloud APIs need to be enabled

When you try to run a Dataflow job, the following error occurs:

Some Cloud APIs need to be enabled for your project in order for Cloud Dataflow to run this job.

This issue occurs because some required APIs are not enabled in your project.

To resolve this issue and run a Dataflow job, enable the following Google Cloud APIs in your project:

  • Compute Engine API (Compute Engine)
  • Cloud Logging API
  • Cloud Storage
  • Cloud Storage JSON API
  • BigQuery API
  • Pub/Sub
  • Datastore API

For detailed instructions, see the Getting Started section on enabling Google Cloud APIs.
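For example, you can enable these APIs with a single gcloud command. The service IDs shown here are the usual ones; confirm them with gcloud services list --available for your project:

    gcloud services enable \
        dataflow.googleapis.com \
        compute.googleapis.com \
        logging.googleapis.com \
        storage-component.googleapis.com \
        storage-api.googleapis.com \
        bigquery.googleapis.com \
        pubsub.googleapis.com \
        datastore.googleapis.com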

Bad request

During the execution of a Dataflow job, Cloud Monitoring logs display a series of warnings similar to the following:

Unable to update setup work item STEP_ID error: generic::invalid_argument: Http(400) Bad Request
Update range task returned 'invalid argument'. Assuming lost lease for work with id LEASE_ID
with expiration time: TIMESTAMP, now: TIMESTAMP. Full status: generic::invalid_argument: Http(400) Bad Request

Bad request warnings occur if worker state information is stale or out of sync due to processing delays. Often, your Dataflow job succeeds despite the bad request warnings. If that is the case, ignore the warnings.

DEADLINE_EXCEEDED or Server Unresponsive

When you run your jobs, you might encounter RPC timeout exceptions or one of the following errors:

DEADLINE_EXCEEDED

Or:

Server Unresponsive

These errors typically occur when worker resources, such as memory, attached disk, or shuffle capacity, are insufficient for the job.

To resolve this issue, make one or more of the following changes. A combined example follows the per-language lists.

Java

  • If the job is not using the service-based shuffle, switch to using the service-based Dataflow Shuffle by setting --experiments=shuffle_mode=service. For details and availability, read Dataflow Shuffle.
  • Add more workers. Try setting --numWorkers with a higher value when you run your pipeline.
  • Increase the size of the attached disk for workers. Try setting --diskSizeGb with a higher value when you run your pipeline.
  • Use an SSD-backed persistent disk. Try setting --workerDiskType="compute.googleapis.com/projects/<project>/zones/<zone>/diskTypes/pd-ssd" when you run your pipeline.

Python

  • If the job is not using the service-based shuffle, switch to using the service-based Dataflow Shuffle by setting --experiments=shuffle_mode=service. For details and availability, read Dataflow Shuffle.
  • Add more workers. Try setting --num_workers with a higher value when you run your pipeline.
  • Increase the size of the attached disk for workers. Try setting --disk_size_gb with a higher value when you run your pipeline.
  • Use an SSD-backed persistent disk. Try setting --worker_disk_type="compute.googleapis.com/projects/<project>/zones/<zone>/diskTypes/pd-ssd" when you run your pipeline.

Go

  • If the job is not using the service-based shuffle, switch to using the service-based Dataflow Shuffle by setting --experiments=shuffle_mode=service. For details and availability, read Dataflow Shuffle.
  • Add more workers. Try setting --num_workers with a higher value when you run your pipeline.
  • Increase the size of the attached disk for workers. Try setting --disk_size_gb with a higher value when you run your pipeline.
  • Use an SSD-backed persistent disk. Try setting --disk_type="compute.googleapis.com/projects/<project>/zones/<zone>/diskTypes/pd-ssd" when you run your pipeline.
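For example, a Python pipeline submission that combines several of these options might look like the following. The script name and values are illustrative:

    python my_pipeline.py \
        --runner=DataflowRunner \
        --project=PROJECT_ID \
        --region=REGION \
        --experiments=shuffle_mode=service \
        --num_workers=50 \
        --disk_size_gb=100 \
        --worker_disk_type="compute.googleapis.com/projects/PROJECT_ID/zones/ZONE/diskTypes/pd-ssd"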

Encoding errors, IOExceptions, or unexpected behavior in user code

The Apache Beam SDKs and the Dataflow workers depend on common third-party components. These components import additional dependencies. Version collisions can result in unexpected behavior in the service. If you are using any of these packages in your code, be aware that some libraries are not forward-compatible. You might need to pin to the listed versions that are in scope during execution. SDK and Worker Dependencies contains a list of dependencies and their required versions.

Error running LookupEffectiveGuestPolicies

When you run a Dataflow job, you might see the following error in the log files:

OSConfigAgent Error policies.go:49: Error running LookupEffectiveGuestPolicies: 
error calling LookupEffectiveGuestPolicies: code: "Unauthenticated", 
message: "Request is missing required authentication credential. 
Expected OAuth 2 access token, login cookie or other valid authentication credential.

This error occurs if OS configuration management is enabled for the entire project.

To resolve this issue, disable VM Manager policies that apply to the entire project. If disabling VM Manager policies for the entire project isn't possible, you can safely ignore this error and filter it out of log monitoring tools.

Exhausted resource pool

When you create a Google Cloud resource, you might see the following error for an exhausted resource pool:

ERROR: ZONE_RESOURCE_POOL_EXHAUSTED

This error occurs for temporary stock-out conditions for a specific resource in a specific zone.

To resolve the issue, you can either wait for a period of time or create the same resource in another zone. As a best practice, we recommend that you distribute your resources across multiple zones and regions to tolerate outages.

A fatal error has been detected by the Java Runtime Environment

The following error occurs during worker startup:

A fatal error has been detected by the Java Runtime Environment

This error occurs if the pipeline is using Java Native Interface (JNI) to run non-Java code and that code or the JNI bindings contain an error.

A hot key ... was detected

The following error occurs:

A hot key HOT_KEY_NAME was detected in...

These errors occur if your data contains a hot key. A hot key is a key with enough elements to negatively impact pipeline performance. These keys limit Dataflow's ability to process elements in parallel, which increases execution time.

To print the human-readable key to the logs when a hot key is detected in the pipeline, use the hot key pipeline option.
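The exact option name depends on the SDK; confirm it against your SDK version. Typically:

Java

    --hotKeyLoggingEnabled=true

Python

    --enable_hot_key_logging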

To resolve this issue, check that your data is evenly distributed. If a key has disproportionately many values, consider rekeying the data or applying hot-key fanout to the combining step, as in the sketch that follows.
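For example, if the hot key feeds a combining aggregation, the Beam SDK can spread the combine for that key across intermediate workers with hot-key fanout. A minimal Python sketch, with an illustrative fanout value:

    import apache_beam as beam

    with beam.Pipeline() as pipeline:
        totals = (
            pipeline
            | "Create" >> beam.Create([("hot_key", 1)] * 1000 + [("cold_key", 1)])
            # Fan the combine for heavily loaded keys out across 16 intermediate
            # workers before computing the final per-key sum.
            | "SumPerKey" >> beam.CombinePerKey(sum).with_hot_key_fanout(16)
        )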

Invalid table specification in Data Catalog

When you use Dataflow SQL to create Dataflow SQL jobs, your job might fail with the following error in the log files:

Invalid table specification in Data Catalog: Could not resolve table in Data Catalog

This error occurs if the Dataflow service account doesn't have access to the Data Catalog API.

To resolve this issue, enable the Data Catalog API in the Google Cloud project that you're using to write and run queries.

Alternatively, assign the roles/datacatalog.viewer role to the Dataflow service account.

The job graph is too large

Your job might fail with the following error:

The job graph is too large. Please try again with a smaller job graph, 
or split your job into two or more smaller jobs.

This error occurs if your job's graph size exceeds 10 MB. Certain conditions in your pipeline can cause the job graph to exceed the limit. Common conditions include:

  • A Create transform that includes a large amount of in-memory data.
  • A large DoFn instance that is serialized for transmission to remote workers.
  • A DoFn as an anonymous inner class instance that (possibly inadvertently) pulls in a large amount of data to be serialized.
  • A directed acyclic graph (DAG) built by a programmatic loop that enumerates a large list.

To avoid these conditions, consider restructuring your pipeline.

Key Commit Too Large

When running a streaming job, the following error appears in the worker log files:

KeyCommitTooLargeException

This error occurs in streaming scenarios if a very large amount of data is grouped without using a Combine transform, or if a large amount of data is produced from a single input element.

To reduce the possibility of encountering this error, use the following strategies:

  • Ensure that processing a single element cannot result in outputs or state modifications exceeding the limit.
  • If multiple elements are grouped by a key, consider increasing the key space to reduce the number of elements grouped per key.
  • If elements for a key are emitted at a high frequency over a short period of time, the windows for that key can accumulate many gigabytes of events. Rewrite the pipeline to detect keys like this and emit only an output indicating that the key was frequently present in that window.
  • Use sub-linear-space Combine transforms for commutative and associative operations, and don't use a combiner if it doesn't reduce space. For example, a combiner for strings that just appends them together is worse than using no combiner at all. See the sketch after this list.
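For example, counting elements per key keeps the per-key state constant regardless of how many elements share a key, whereas concatenating every value per key grows linearly and does not reduce what is committed. A minimal Python sketch:

    import apache_beam as beam

    with beam.Pipeline() as pipeline:
        events = pipeline | "Create" >> beam.Create(
            [("user1", "click"), ("user1", "view"), ("user2", "click")])

        # Sub-linear: the per-key state is a single count.
        counts = events | "CountPerKey" >> beam.combiners.Count.PerKey()

        # Not sub-linear: the concatenated string grows with the number of
        # elements, so it does not reduce the data committed for the key.
        concatenated = (
            events
            | "GroupByKey" >> beam.GroupByKey()
            | "Concatenate" >> beam.MapTuple(lambda key, values: (key, ",".join(values)))
        )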

Request Entity Too Large

When you submit your job, one of the following errors appears in your console or terminal window:

413 Request Entity Too Large
The size of serialized JSON representation of the pipeline exceeds the allowable limit
Failed to create a workflow job: Invalid JSON payload received
Failed to create a workflow job: Request payload exceeds the allowable limit

These errors occur when your pipeline's JSON representation exceeds the maximum request size of 20 MB.

The size of your job is specifically tied to the JSON representation of the pipeline. A larger pipeline means a larger request. Dataflow currently has a limitation that caps requests at 20 MB.

To estimate the size of your pipeline's JSON request, run your pipeline with the following option:

Java

--dataflowJobFile=PATH_TO_OUTPUT_FILE

Python

--dataflow_job_file=PATH_TO_OUTPUT_FILE

Go

Outputting your job as JSON is not supported in Go.

This command writes a JSON representation of your job to a file. The size of the serialized file is a good estimate of the size of the request. The actual size is slightly larger due to some additional information included in the request.
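For example, for a Python pipeline you might write the file to a temporary location and then check its size. The script name and path are illustrative:

    python my_pipeline.py --runner=DataflowRunner --dataflow_job_file=/tmp/dataflow_job.json
    du -h /tmp/dataflow_job.json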

Certain conditions in your pipeline can cause the JSON representation to exceed the limit. Common conditions include:

  • A Create transform that includes a large amount of in-memory data.
  • A large DoFn instance that is serialized for transmission to remote workers.
  • A DoFn as an anonymous inner class instance that (possibly inadvertently) pulls in a large amount of data to be serialized.

To avoid these conditions, consider restructuring your pipeline.

SDK pipeline options or staging file list exceeds size limit

When running a pipeline, the following error occurs:

SDK pipeline options or staging file list exceeds size limit. 
Please keep their length under 256K Bytes each and 512K Bytes in total.

This error occurs if the pipeline could not be started due to Google Compute Engine metadata limits being exceeded. These limits cannot be changed. Dataflow uses Compute Engine metadata for pipeline options. The limit is documented in the Compute Engine custom metadata limitations.

Having too many JAR files to stage can cause the JSON representation to exceed the limit.

To estimate the size of your pipeline's JSON request, run your pipeline with the following option:

Java

--dataflowJobFile=PATH_TO_OUTPUT_FILE

Python

--dataflow_job_file=PATH_TO_OUTPUT_FILE

Go

Outputting your job as JSON is not supported in Go.

The output file from this command must be smaller than 256 KB. The 512 KB in the error message refers to the combined size of that output file and the custom metadata options for the Compute Engine VM instance.

To get a rough estimate of the custom metadata size, use a running Dataflow job in the project: choose any running Dataflow job, pick one of its VM instances, and then navigate to the Compute Engine VM instance details page for that VM to check the custom metadata section. The total length of the custom metadata and of the file should be less than 512 KB. An accurate estimate for the failed job is not possible, because VMs are not spun up for failed jobs.

If your JAR list is hitting the 256 KB limit, review it and remove any unnecessary JAR files. If it is still too large, try running the Dataflow job by using an uber JAR. The Cloud Functions section of the documentation has an example that shows how to create and use an uber JAR.

Shuffle key too large

The following error appears in the worker log files:

Shuffle key too large

This error occurs if the serialized key emitted to a particular (Co-)GroupByKey is too large after the corresponding coder is applied. Dataflow has a limit for serialized shuffle keys.

To resolve this issue, reduce the size of the keys or use more space-efficient coders.

Total number of BoundedSource objects ... is larger than the allowable limit

One of the following errors might occur when running jobs with Java:

Total number of BoundedSource objects generated by splitIntoBundles() operation is larger than the allowable limit

Or:

Total size of the BoundedSource objects generated by splitIntoBundles() operation is larger than the allowable limit

Java

This error can occur if you're reading from a very large number of files via TextIO, AvroIO, or some other file-based source. The particular limit depends on the details of your source, but it is on the order of tens of thousands of files in one pipeline. For example, embedding schema in AvroIO.Read allows fewer files.

This error can also occur if you created a custom data source for your pipeline and your source's splitIntoBundles method returned a list of BoundedSource objects which takes up more than 20 MB when serialized.

The allowable limit for the total size of the BoundedSource objects generated by your custom source's splitIntoBundles() operation is 20 MB.

To work around this limitation, modify your custom BoundedSource subclass so that the total size of the generated BoundedSource objects is smaller than the 20 MB limit. For example, your source might generate fewer splits initially, and rely on Dynamic Work Rebalancing to further split inputs on demand.

NameError

When you execute your pipeline using the Dataflow service, the following error occurs:

NameError

This error does not occur when you execute locally, such as when you execute using the DirectRunner.

This error occurs if your DoFns are using values in the global namespace that are not available on the Dataflow worker.

By default, global imports, functions, and variables defined in the main session are not saved during the serialization of a Dataflow job.

To resolve this issue, use one of the following methods. If your DoFns are defined in the main file and reference imports and functions in the global namespace, set the --save_main_session pipeline option to True. This change pickles the state of the global namespace and loads it on the Dataflow worker.

If you have objects in your global namespace that can't be pickled, a pickling error occurs. If the error is regarding a module that should be available in the Python distribution, import the module locally, where it is used.

For example, instead of importing the module globally at the top of the file, import it inside the function that uses it. A minimal sketch, with an illustrative module and function:
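    # Instead of relying on --save_main_session to ship a module that is
    # imported at the top of the file:
    #
    #   import re
    #
    #   def extract_words(line):
    #       return re.findall(r"[A-Za-z']+", line)
    #
    # import the module inside the function that uses it, so the import is
    # resolved on the Dataflow worker:

    def extract_words(line):
        import re  # local import, available on the worker without pickling globals
        return re.findall(r"[A-Za-z']+", line)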

"Processing stuck/Operation ongoing in step <step_id> for at least <time_interval> without outputting or completing in state finish at <stack_trace>"

If Dataflow spends more time executing a DoFn than the time specified in <time_interval> without returning, this message is displayed.

This behavior has two possible causes:

  • Your DoFn code is simply slow, or waiting for some slow external operation to complete.
  • Your DoFn code might be stuck, deadlocked, or abnormally slow to finish processing.

To determine which of these is the case, expand the Cloud Monitoring log entry to see a stack trace. Look for messages that indicate that the DoFn code is stuck or otherwise encountering issues. If no messages are present, the issue might be the execution speed of the DoFn code. Consider using Cloud Profiler or another tool to investigate the code's performance.

If your pipeline is built on the Java VM (using either Java or Scala), you can investigate the cause of the stuck code further by taking a full thread dump of the whole JVM (not just the stuck thread). To take the thread dump, follow these steps:

  1. Make note of the worker name from the log entry.
  2. In the Compute Engine section of the Google Cloud console, find the Compute Engine instance with the worker name you noted.
  3. SSH into the instance with that name.
  4. Run the following command:

    curl http://localhost:8081/threadz
    

Pub/Sub quota errors

When running a streaming pipeline from Pub/Sub, the following errors occur:

429 (rateLimitExceeded)

Or:

Request was throttled due to user QPS limit being reached

These errors occur if your project has insufficient Pub/Sub quota.

To find out if your project has insufficient quota, follow these steps to check for client errors:

  1. Go to the Google Cloud console.
  2. In the menu on the left, select APIs & services.
  3. In the Search Box, search for Cloud Pub/Sub.
  4. Click the Usage tab.
  5. Check Response Codes and look for (4xx) client error codes.

Request is prohibited by organization's policy

When running a pipeline, the following error occurs:

Error trying to get gs://BUCKET_NAME/FOLDER/FILE:
{"code":403,"errors":[{"domain":"global","message":"Request is prohibited by organization's policy","reason":"forbidden"}],
"message":"Request is prohibited by organization's policy"}

This error occurs if the Cloud Storage bucket is outside of your service perimeter.

To resolve this issue, create an egress rule that allows access to the bucket outside of the service perimeter.

Could not cancel workflow

When trying to cancel a Dataflow job, the following error occurs:

Could not cancel workflow; user does not have sufficient permissions on project: PROJECT_ID, 
or the job does not exist in the project

To resolve this issue:

  • Verify that the user has sufficient permissions on the project to cancel the job.

  • Verify that the JOB_ID is correct and that you cancel the job by its ID, not by its JOB_NAME, as in the example after this list.
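For example, with the gcloud CLI (the job ID and region are placeholders):

    gcloud dataflow jobs cancel JOB_ID --region=REGION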

Staged package...is inaccessible

Jobs that used to succeed might fail with the following error:

Staged package...is inaccessible

To resolve this issue:

  • Verify that the Cloud Storage bucket used for staging does not have TTL settings that cause staged packages to be deleted.
  • Verify that your Dataflow project's controller service account has the permission to access the Cloud Storage bucket used for staging. Gaps in permission can be due to any of the following reasons:

    • The Cloud Storage bucket used for staging is present in a different project.
    • The Cloud Storage bucket used for staging was migrated from fine-grained access to uniform bucket-level access. Due to the inconsistency between IAM and ACL policies, migrating the staging bucket to uniform bucket-level access disallows ACLs for Cloud Storage resources, which includes the permissions held by your Dataflow project's controller service account over the staging bucket.

For more information, see Accessing Cloud Storage buckets across Google Cloud projects.
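For example, to grant the controller service account (by default, the Compute Engine default service account) access to a staging bucket that uses uniform bucket-level access, you might run a command like the following. The account email, role, and bucket name are placeholders; choose the role that matches your security requirements:

    gsutil iam ch \
        serviceAccount:PROJECT_NUMBER-compute@developer.gserviceaccount.com:roles/storage.objectAdmin \
        gs://STAGING_BUCKET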

A work item failed 4 times

The following error occurs when a job fails:

a work item failed 4 times

This error occurs if a single operation causes the worker code to fail four times. Dataflow fails the job, and this message is displayed.

You can't configure this failure threshold. For more details, refer to pipeline error and exception handling.

To resolve this issue, look in the job's Cloud Monitoring logs for the four individual failures. Look for Error-level or Fatal-level log entries in the worker logs that show exceptions or errors. The exception or error should appear at least four times.

Container image errors

The following sections contain common errors that you might encounter when using custom containers and steps for resolving or troubleshooting the errors. The errors are typically prefixed with the following message:

Unable to pull container image due to error: DETAILED_ERROR_MESSAGE

Custom container pipeline fails or is slow to start up workers

A known issue might cause Dataflow to mark workers as unresponsive and restart them if they use a large custom container (~10 GB) that takes a long time to download. This issue can cause pipeline startup to be slow or, in some extreme cases, prevent workers from starting up at all.

To confirm that this known issue is causing the problem, in dataflow.googleapis.com/kubelet, look for worker logs that show several failed attempts to pull a container and confirm that kubelet did not start on the worker. For example, the logs might contain Pulling image <URL> without a corresponding Successfully pulled image <URL> or Started Start kubelet after all images have been pulled.

To work around this issue, run the pipeline with the --experiments=disable_worker_container_image_prepull pipeline option.

Error syncing pod ... failed to "StartContainer"

The following error occurs during worker startup:

Error syncing pod POD_ID, skipping: [failed to "StartContainer" for CONTAINER_NAME with CrashLoopBackOff: "back-off 5m0s restarting failed container=CONTAINER_NAME pod=POD_NAME]. 

A pod is a co-located group of Docker containers running on a Dataflow worker. This error occurs when one of the Docker containers in the pod fails to start. If the failure is not recoverable, the Dataflow worker isn't able to start, and Dataflow batch jobs eventually fail with errors like the following:

The Dataflow job appears to be stuck because no worker activity has been seen in the last 1h.

This error typically occurs when one of the containers is continuously crashing during startup. To understand the root cause, look for the logs captured immediately prior to the failure. To analyze the logs, use the Logs Explorer.

In the Logs Explorer, limit the log files to log entries emitted from the worker with container startup errors. To limit the log entries, complete the following steps:

  1. In the Logs Explorer, find the Error syncing pod log entry.
  2. To see the labels associated with the log entry, expand the log entry.
  3. Click the label associated with the resource_name, and then click Show matching entries.

In the Logs Explorer, the Dataflow logs are organized into several log streams. The Error syncing pod message is emitted in the log named kubelet. However, the logs from the failing container could be in a different log stream. Each container has a name. Use the following table to determine which log stream might contain logs relevant to the failing container.

Container name                          Log names
sdk, sdk0, sdk1, sdk-0-0, and so on     docker
harness                                 harness, harness-startup
python, java-batch, java-streaming      worker-startup, worker
artifact                                artifact

When you query the Logs Explorer, make sure that the query either includes the relevant log names in the query builder UI or does not have restrictions on the log name.

After you select the relevant logs, the query result might look like the following example:

resource.type="dataflow_step"
resource.labels.job_id="2022-06-29_08_02_54-JOB_ID"
labels."compute.googleapis.com/resource_name"="testpipeline-jenkins-0629-DATE-cyhg-harness-8crw"
logName=("projects/apache-beam-testing/logs/dataflow.googleapis.com%2Fdocker"
OR
"projects/apache-beam-testing/logs/dataflow.googleapis.com%2Fworker-startup"
OR
"projects/apache-beam-testing/logs/dataflow.googleapis.com%2Fworker")

Because the logs reporting the symptom of the container failure are sometimes reported as INFO, include INFO logs in your analysis.

Typical causes of container failures include the following:

  1. Your Python pipeline has additional dependencies that are installed at runtime, and the installation is unsuccessful. You might see errors like pip install failed with error. This issue might occur due to conflicting requirements, or due to a restricted networking configuration that prevents a Dataflow worker from pulling an external dependency from a public repository over the internet.
  2. A worker fails in the middle of the pipeline run due to an out of memory error. You might see an error like one of the following:

    • java.lang.OutOfMemoryError: Java heap space
    • Shutting down JVM after 8 consecutive periods of measured GC thrashing. Memory is used/total/max = 24453/42043/42043 MB, GC last/max = 58.97/99.89 %, #pushbacks=82, gc thrashing=true. Heap dump not written.

    To debug an out of memory issue, see Debugging out-of-memory conditions in Dataflow.

  3. Dataflow is unable to pull the container image. For more information, see Image pull request failed with error.

After you identify the error causing the container to fail, try to address the error, and then resubmit the pipeline.

Image pull request failed with error

During worker startup, one of the following errors appears in the worker or job logs:

Image pull request failed with error
pull access denied for IMAGE_NAME
manifest for IMAGE_NAME not found: manifest unknown: Failed to fetch
Get IMAGE_NAME: Service Unavailable

These errors occur if a worker is unable to start up because the worker can't pull a Docker container image. This issue happens in the following scenarios:

  • The custom SDK container image URL is incorrect
  • The worker lacks credentials or network access to the remote image

To resolve this issue:

  • If you are using a custom container image with your job, verify that your image URL is correct, has a valid tag or digest, and that the image is accessible by the Dataflow workers.
  • Verify that public images can be pulled locally by running docker pull $image from an unauthenticated machine.

For private images or private workers:

  • Dataflow only supports private container images hosted on Container Registry. If you are using Container Registry to host your container image, the default Google Cloud service account has access to images in the same project. If you are using images in a different project than the one used to run your Google Cloud job, make sure to configure access control for the default Google Cloud service account.
  • If using shared Virtual Private Cloud (VPC), make sure that workers can access the custom container repository host.
  • Use ssh to connect with a running job worker VM and run docker pull $image to directly confirm the worker is configured properly.

If workers fail several times in a row due to this error and no work has started on a job, the job can fail with an error similar to the following:

Job appears to be stuck.

If you remove access to the image while the job is running, either by removing the image itself or by revoking the Dataflow worker service account's credentials or internet access to the image, Dataflow only logs errors and doesn't take action to fail the job. Dataflow also avoids failing long-running streaming pipelines to avoid losing pipeline state.

Other possible errors can arise from repository quota issues or outages. If you experience issues exceeding the Docker Hub quota for pulling public images or general third-party repository outages, consider using Container Registry as the image repository.

SystemError: unknown opcode

Your Python custom container pipeline might fail with the following error immediately after job submission:

SystemError: unknown opcode

In addition, the stack trace might include:

apache_beam/internal/pickler.py

To resolve this issue, verify that the Python version that you use locally matches the version in the container image, down to the major and minor version. A difference in the patch version, such as 3.6.7 versus 3.6.8, does not create compatibility issues, but a difference in the minor version, such as 3.6.8 versus 3.8.2, can cause pipeline failures.
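For example, you can compare the local interpreter with the one inside your custom image; the image name is a placeholder:

    python3 --version
    docker run --rm IMAGE_NAME python3 --version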

Worker errors

The following sections contain common worker errors that you might encounter and steps for resolving or troubleshooting the errors.

Call from Java worker harness to Python DoFn fails with error

If a call from the Java worker harness to a Python DoFn fails, a relevant error message is displayed.

To investigate the error, expand the Cloud Monitoring error log entry and look at the error message and traceback. It shows you which code failed so you can correct it if necessary. If you believe that this is a bug in Apache Beam or Dataflow, report the bug.

EOFError: marshal data too short

The following error appears in the worker logs:

EOFError: marshal data too short

This error sometimes occurs when Python pipeline workers run out of disk space.

To resolve this issue, see No space left on device.

No space left on device

When a job runs out of disk space, the following error might appear in the worker logs:

No space left on device

This error can occur for one of the following reasons:

  • The worker persistent storage runs out of free space, which can occur for one of the following reasons:
    • A job downloads large dependencies at runtime
    • A job uses large custom containers
    • A job writes a lot of temporary data to local disk
  • When using Dataflow Shuffle, Dataflow sets a lower default disk size. As a result, this error might occur with jobs that move from worker-based shuffle.
  • The worker boot disk fills up because it is logging more than 50 entries per second.

To resolve this issue, follow these troubleshooting steps:

To see disk resources associated with a single worker, look up VM instance details for worker VMs associated with your job. Part of the disk space is consumed by the operating system, binaries, logs, and containers.

To increase persistent disk or boot disk space, adjust the disk size pipeline option.
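For example, to raise the disk size to 100 GB (an illustrative value), set one of the following options when you run your pipeline:

Java

    --diskSizeGb=100

Python

    --disk_size_gb=100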

Track disk space usage on the worker VM instances by using Cloud Monitoring. See Receive worker VM metrics from the Monitoring agent for instructions explaining how to set this up.

Look for boot disk space issues by viewing the serial port output on the worker VM instances and looking for messages like the following:

Failed to open system journal: No space left on device

If you have a large number of worker VM instances, you can create a script to run gcloud compute instances get-serial-port-output on all of them at once and review the output from that instead.

For more information, see Persistent disk resources.

Python pipeline fails after one hour of worker inactivity

When using the Apache Beam SDK for Python with Dataflow Runner V2 on worker machines with a large number of CPU cores and when the pipeline performs Python dependency installations at startup, the Dataflow job might take a long time to start.

This issue occurs when Runner V2 starts one container per CPU core on workers when running Python pipelines. On workers with many cores, the simultaneous container startup and initialization might cause resource exhaustion.

To resolve this issue, pre-build your Python container. This step can improve VM startup times and horizontal autoscaling performance. To use this experimental feature, enable the Cloud Build API on your project and submit your pipeline with the following parameter:

--prebuild_sdk_container_engine=cloud_build

For more information, see Dataflow Runner V2.

Alternatively, use a custom container image with all dependencies preinstalled.

As a workaround, you can run the pipeline with the --experiments=disable_runner_v2 pipeline option.

Startup of the worker pool in zone failed to bring up any of the desired workers

The following error occurs:

Startup of the worker pool in zone ZONE_NAME failed to bring up any of the desired NUMBER workers.
The project quota may have been exceeded or access control policies may be preventing the operation; 
review the Cloud Logging 'VM Instance' log for diagnostics.

This error occurs for one of the following reasons:

  • You have exceeded one of the Compute Engine quotas that Dataflow worker creation relies on.
  • Your organization has constraints in place that prohibit some aspect of the VM instance creation process, like the account being used, or the zone being targeted.

To resolve this issue, follow these troubleshooting steps:

Review the VM Instance log

  1. Go to the Cloud Logging viewer
  2. In the Audited Resource drop-down list, select VM Instance.
  3. In the All logs drop-down list, select compute.googleapis.com/activity_log.
  4. Scan the log for any entries related to VM instance creation failure.

Check your usage of Compute Engine quotas

  1. On your local machine command line or in Cloud Shell, run the following command to view Compute Engine resource usage compared to Dataflow quotas for the zone you are targeting:

    gcloud compute regions describe [REGION]

  2. Review the results for the following resources to see if any are exceeding quota:

    • CPUS
    • DISKS_TOTAL_GB
    • IN_USE_ADDRESSES
    • INSTANCE_GROUPS
    • INSTANCE_GROUP_MANAGERS
    • INSTANCES
  3. If needed, request a quota change.

Review your organization policy constraints

  1. Go to the Organization policies page
  2. Review the constraints for any that might limit VM instance creation for either the account you are using (this is the Dataflow service account by default) or in the zone you are targeting.

The worker lost contact with the service

When a Dataflow batch job fails, the following error occurs:

Root cause: The worker lost contact with the service.

Sometimes this error occurs when the worker runs out of memory or swap space.

To resolve this issue, as a first step, try running the job again. If the job still fails and the same error occurs, try using a worker with more memory and disk space. For example, add the following pipeline startup option:

--worker_machine_type=m1-ultramem-40 --disk_size_gb=500

Note that changing the worker type could impact billed cost.

Streaming Engine errors

The following sections contain common Streaming Engine errors that you might encounter and steps for resolving or troubleshooting the errors.

The work item requesting state read is no longer valid on the backend

During communication between worker VM instances and Streaming Engine tasks in a streaming pipeline, the following error occurs:

The work item requesting state read is no longer valid on the backend. 
The work has already completed or will be retried. 
This is expected during autoscaling events.

During autoscaling, worker VM instances communicate with multiple Streaming Engine tasks, and each task serves multiple worker VM instances. Item keys are used to distribute the work. Each task and each worker VM instance has a collection of key ranges, and the distribution of these ranges can change dynamically. For example, during autoscaling, job resizing can cause the key range distribution to change. When a key range changes, this error can occur. The error is expected, and unless you see a correlation between these messages and an underperforming pipeline, you can ignore it.

BigQuery connector errors

The following sections contain common BigQuery connector errors that you might encounter and steps for resolving or troubleshooting the errors.

quotaExceeded

When using the BigQuery connector to write to BigQuery using streaming inserts, write throughput is lower than expected, and the following error might occur:

quotaExceeded

Slow throughput might be due to your pipeline exceeding the available BigQuery streaming insert quota. If this is the case, quota related error messages from BigQuery appear in the Dataflow worker logs (look for quotaExceeded errors).

If you see quotaExceeded errors, to resolve this issue:

  • When using the Apache Beam SDK for Java, set the BigQuery sink option ignoreInsertIds().
  • When using the Apache Beam SDK for Python, use the ignore_insert_ids option.

These settings make your project eligible for one GB per second of per-project BigQuery streaming insert throughput. For more information about caveats related to automatic message de-duplication, see the BigQuery documentation. To increase the BigQuery streaming insert quota above one GB per second, submit a request through the Google Cloud console.
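For example, in a Python pipeline the option can be set on the BigQuery sink as follows. The table and schema are placeholders, and you should confirm the parameter against your Beam SDK version:

    import apache_beam as beam

    with beam.Pipeline() as pipeline:
        (
            pipeline
            | "Create" >> beam.Create([{"user_id": "u1", "score": 10}])
            | "WriteToBQ" >> beam.io.WriteToBigQuery(
                table="PROJECT:DATASET.TABLE",
                schema="user_id:STRING,score:INTEGER",
                method=beam.io.WriteToBigQuery.Method.STREAMING_INSERTS,
                ignore_insert_ids=True,  # trades de-duplication for higher throughput
            )
        )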

If you don't see quota-related errors in worker logs, the issue might be that default bundling or batching parameters don't provide adequate parallelism for your pipeline to scale. You can adjust several Dataflow BigQuery connector configurations to achieve the expected performance when writing to BigQuery with streaming inserts. For example, for the Apache Beam SDK for Java, adjust numStreamingKeys to match the maximum number of workers, and consider increasing insertBundleParallelism to configure the BigQuery connector to write to BigQuery using more parallel threads.

For configurations available in the Apache Beam SDK for Java, see BigQueryPipelineOptions, and for configurations available in the Apache Beam SDK for Python, see the WriteToBigQuery transform.

rateLimitExceeded

When using the BigQuery connector, the following error occurs:

rateLimitExceeded

This error occurs if too many API requests are sent to BigQuery during a short period. BigQuery has short-term quota limits, and your Dataflow pipeline might temporarily exceed such a quota. When this happens, API requests from your Dataflow pipeline to BigQuery can fail, resulting in rateLimitExceeded errors in worker logs.

Dataflow retries such failures, so you can safely ignore these errors. If you believe that your pipeline is significantly impacted due to rateLimitExceeded errors, contact Google Cloud Support.

Recommendations

For guidance on recommendations generated by Dataflow Insights, see Insights.