Troubleshoot Dataflow errors

If you run into problems with your Dataflow pipeline or job, this page lists error messages that you might see and provides suggestions for how to fix each error.

Errors in the log types dataflow.googleapis.com/worker-startup, dataflow.googleapis.com/harness-startup, and dataflow.googleapis.com/kubelet indicate configuration problems with a job. They can also indicate conditions that prevent the normal logging path from functioning.

Your pipeline might throw exceptions while processing data. Some of these errors are transient, such as a temporary difficulty accessing an external service. Other errors are permanent, such as errors caused by corrupt or unparseable input data, or null pointers during computation.

Dataflow processes elements in arbitrary bundles and retries the complete bundle when an error is thrown for any element in that bundle. When running in batch mode, bundles including a failing item are retried four times. The pipeline fails completely when a single bundle fails four times. When running in streaming mode, a bundle including a failing item is retried indefinitely, which might cause your pipeline to permanently stall.

Exceptions in user code, for example, your DoFn instances, are reported in the Dataflow monitoring interface. If you run your pipeline with BlockingDataflowPipelineRunner, you also see error messages printed in your console or terminal window.

Consider guarding against errors in your code by adding exception handlers. For example, if you want to drop elements that fail some custom input validation done in a ParDo, use a try/catch block within your ParDo to handle the exception and log and drop the element. For production workloads, implement an unprocessed message pattern. To track the error count, use aggregation transforms.
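
As a minimal sketch of the exception-handling approach described above, the following plain-Python code mirrors what the body of a DoFn's process method might do: elements that pass validation go to a main output, and elements that raise an exception are logged and routed to a dead-letter output instead of failing the bundle. The function names, the output tags, and the validation rule are hypothetical. In a real pipeline, this logic would live inside your ParDo and use your SDK's mechanism for additional outputs.

```python
import logging

# Hypothetical output tags; in a pipeline these would map to tagged outputs.
MAIN = "main"
DEAD_LETTER = "dead_letter"


def parse_record(raw):
    """Hypothetical validation: expects 'name,integer' and raises on bad input."""
    name, value = raw.split(",")
    return name.strip(), int(value)


def process(raw):
    """Yields (tag, payload) pairs instead of letting an exception fail the bundle."""
    try:
        yield MAIN, parse_record(raw)
    except (ValueError, AttributeError) as err:
        # Log and divert the element so that one bad record doesn't trigger retries.
        logging.warning("Dropping unparseable element %r: %s", raw, err)
        yield DEAD_LETTER, (raw, str(err))


def run(elements):
    """Splits inputs into processed results and dead-lettered elements."""
    outputs = {MAIN: [], DEAD_LETTER: []}
    for element in elements:
        for tag, payload in process(element):
            outputs[tag].append(payload)
    return outputs
```

The length of the dead-letter list stands in for the error count that an aggregation transform would track.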

Missing log files

If you don't see any logs for your jobs, remove any exclusion filters containing resource.type="dataflow_step" from all of your Cloud Logging Log Router sinks.

For more details about removing your logs exclusions, refer to the Removing exclusions guide.

Pipeline errors

The following sections contain common pipeline errors that you might encounter and steps for resolving or troubleshooting the errors.

Some Cloud APIs need to be enabled

When you try to run a Dataflow job, the following error occurs:

Some Cloud APIs need to be enabled for your project in order for Cloud Dataflow to run this job.

This issue occurs because some required APIs are not enabled in your project.

To resolve this issue and run a Dataflow job, enable the following Google Cloud APIs in your project:

  • Compute Engine API (Compute Engine)
  • Cloud Logging API
  • Cloud Storage
  • Cloud Storage JSON API
  • BigQuery API
  • Pub/Sub
  • Datastore API

For detailed instructions, see the Getting Started section on enabling Google Cloud APIs.

"@*" and "@N" are reserved sharding specs

When you try to run a job, the following error appears in the log files, and the job fails:

Workflow failed. Causes: "@*" and "@N" are reserved sharding specs. Filepattern must not contain any of them.

This error occurs if the filename for your Cloud Storage path for temporary files (tempLocation or temp_location) has an at sign (@) followed by a number or by an asterisk (*).

To resolve this issue, change the filename so that the at sign is followed by a supported character.
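
The rule above can be checked before you submit a job. The following sketch flags any path in which an at sign is followed by a digit or an asterisk. The helper name is hypothetical, and checking the whole path rather than only the filename is a deliberately conservative assumption.

```python
import re

# "@" followed by a digit or "*" corresponds to the reserved "@N" and "@*" specs.
_RESERVED_SPEC = re.compile(r"@(\d|\*)")


def has_reserved_sharding_spec(path):
    """Returns True if the path contains '@' followed by a digit or an asterisk."""
    return _RESERVED_SPEC.search(path) is not None
```

For example, a temp location such as gs://my-bucket/tmp@2024/ would be flagged, while gs://my-bucket/tmp@latest/ would not.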

Bad request

When you run a Dataflow job, Cloud Monitoring logs display a series of warnings similar to the following:

Unable to update setup work item STEP_ID error: generic::invalid_argument: Http(400) Bad Request
Update range task returned 'invalid argument'. Assuming lost lease for work with id LEASE_ID
with expiration time: TIMESTAMP, now: TIMESTAMP. Full status: generic::invalid_argument: Http(400) Bad Request

Bad request warnings occur if worker state information is stale or out of sync due to processing delays. Often, your Dataflow job succeeds despite the bad request warnings. If that is the case, ignore the warnings.

Cannot read and write in different locations

When you run a Dataflow job, you might see the following error in the log files:

message:Cannot read and write in different locations: source: SOURCE_REGION, destination: DESTINATION_REGION,reason:invalid

This error occurs when the source and destination are in different regions. It can also occur when the staging location and destination are in different regions. For example, if the job reads from Pub/Sub and then writes to a Cloud Storage temp bucket before writing to a BigQuery table, the Cloud Storage temp bucket and the BigQuery table must be in the same region.

Multi-region locations are considered different than single-region locations, even if the single region falls within the scope of the multi-region location. For example, us (multiple regions in the United States) and us-central1 are different regions.

To resolve this issue, have your destination, source, and staging locations in the same region. Cloud Storage bucket locations can't be changed, so you might need to create a new Cloud Storage bucket in the correct region.
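
The comparison described above is a plain string match on location names, which the following sketch illustrates. The function and the resource names passed to it are hypothetical; you supply the locations yourself, for example from your bucket and dataset settings. The match ignores case on the assumption that different services might capitalize location names differently.

```python
def mismatched_locations(locations):
    """Returns names whose location differs from the first entry's location.

    The comparison is an exact, case-insensitive string match, so a
    multi-region such as "us" is treated as different from "us-central1".
    """
    items = list(locations.items())
    if not items:
        return []
    _, expected = items[0]
    return [name for name, loc in items[1:] if loc.lower() != expected.lower()]
```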

Connection timed out

When you run a Dataflow job, you might see the following error in the log files:

org.springframework.web.client.ResourceAccessException: I/O error on GET request for CONNECTION_PATH: Connection timed out (Connection timed out); nested exception is java.net.ConnectException: Connection timed out (Connection timed out)

This issue occurs when the Dataflow workers fail to establish or maintain a connection with the data source or destination.

To resolve the issue, follow these troubleshooting steps:

  • Verify that the data source is running.
  • Verify that the destination is running.
  • Review the connection parameters used in the Dataflow pipeline configuration.
  • Verify that performance issues aren't affecting the source or destination.
  • Make sure that firewall rules aren't blocking the connection.

No such object

When you run your Dataflow jobs, you might see the following error in the log files:

..., 'server': 'UploadServer', 'status': '404'}>, <content <No such object:...

These errors typically occur when some of your running Dataflow jobs use the same temp_location to stage temporary job files created when the pipeline runs. When multiple concurrent jobs share the same temp_location, these jobs might step on the temporary data of each other, and a race condition might occur. To avoid this issue, it's recommended that you use a unique temp_location for each job.
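
One way to get a unique temp_location for each job is to append a random suffix to the path. The following sketch assumes a hypothetical bucket layout (gs://BUCKET/temp/JOB_NAME-SUFFIX); adapt the path to your own conventions.

```python
import uuid


def unique_temp_location(bucket, job_name):
    """Builds a per-job Cloud Storage path so concurrent jobs don't share temp files."""
    # A random suffix keeps paths distinct even when job names repeat.
    suffix = uuid.uuid4().hex
    return f"gs://{bucket}/temp/{job_name}-{suffix}"
```

You would then pass the returned value as the temp_location pipeline option when you launch the job.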

Dataflow is unable to determine backlog

When running a streaming pipeline from Pub/Sub, the following warning occurs:

Dataflow is unable to determine the backlog for Pub/Sub subscription

When a Dataflow pipeline pulls data from Pub/Sub, Dataflow needs to repeatedly request information from Pub/Sub. This information includes the amount of backlog on the subscription and the age of the oldest unacknowledged message. Occasionally, Dataflow is unable to retrieve this information from Pub/Sub because of internal system issues, which may cause a transient accumulation of backlog.

For more information, see Streaming With Cloud Pub/Sub.

DEADLINE_EXCEEDED or Server Unresponsive

When you run your jobs, you might encounter RPC timeout exceptions or one of the following errors:

DEADLINE_EXCEEDED

Or:

Server Unresponsive

These errors typically occur for one of the following reasons:

  • The Virtual Private Cloud (VPC) network used for your job might be missing a firewall rule. The firewall rule needs to enable all TCP traffic among VMs in the VPC network you specified in your pipeline options. For more information, see Firewall rules for Dataflow.

    In some cases, the workers aren't able to communicate with each other. When you run a Dataflow job that doesn't use Dataflow Shuffle or Streaming Engine, workers need to communicate with each other using TCP ports 12345 and 12346 within the VPC network. In this scenario, the error includes the worker harness name and the TCP port that's blocked. The error looks like one of the following examples:

    DEADLINE_EXCEEDED: (g)RPC timed out when SOURCE_WORKER_HARNESS
    talking to DESTINATION_WORKER_HARNESS:12346.
    
    Rpc to WORKER_HARNESS:12345 completed with error UNAVAILABLE: failed to connect to all addresses
    Server unresponsive (ping error: Deadline Exceeded, UNKNOWN: Deadline Exceeded...)
    

    To resolve this issue, use the --rules flag of the gcloud compute firewall-rules create command to allow network traffic to ports 12345 and 12346. The following example demonstrates the Google Cloud CLI command:

    gcloud compute firewall-rules create FIREWALL_RULE_NAME \
      --network NETWORK \
      --action allow \
      --direction IN \
      --target-tags dataflow \
      --source-tags dataflow \
      --priority 0 \
      --rules tcp:12345-12346
    

    Replace the following:

    • FIREWALL_RULE_NAME: the name of your firewall rule
    • NETWORK: the name of your network
  • Your job is shuffle-bound.

    To resolve this issue, make one or more of the following changes.

    Java

    • If the job is not using the service-based shuffle, switch to using the service-based Dataflow Shuffle by setting --experiments=shuffle_mode=service. For details and availability, see Dataflow Shuffle.
    • Add more workers. Try setting --numWorkers with a higher value when you run your pipeline.
    • Increase the size of the attached disk for workers. Try setting --diskSizeGb with a higher value when you run your pipeline.
    • Use an SSD-backed persistent disk. Try setting --workerDiskType="compute.googleapis.com/projects/PROJECT_ID/zones/ZONE/diskTypes/pd-ssd" when you run your pipeline.

    Python

    • If the job is not using the service-based shuffle, switch to using the service-based Dataflow Shuffle by setting --experiments=shuffle_mode=service. For details and availability, see Dataflow Shuffle.
    • Add more workers. Try setting --num_workers with a higher value when you run your pipeline.
    • Increase the size of the attached disk for workers. Try setting --disk_size_gb with a higher value when you run your pipeline.
    • Use an SSD-backed persistent disk. Try setting --worker_disk_type="compute.googleapis.com/projects/PROJECT_ID/zones/ZONE/diskTypes/pd-ssd" when you run your pipeline.

    Go

    • If the job is not using the service-based shuffle, switch to using the service-based Dataflow Shuffle by setting --experiments=shuffle_mode=service. For details and availability, see Dataflow Shuffle.
    • Add more workers. Try setting --num_workers with a higher value when you run your pipeline.
    • Increase the size of the attached disk for workers. Try setting --disk_size_gb with a higher value when you run your pipeline.
    • Use an SSD-backed persistent disk. Try setting --disk_type="compute.googleapis.com/projects/PROJECT_ID/zones/ZONE/diskTypes/pd-ssd" when you run your pipeline.
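
For the Python SDK, the flags listed above can be assembled programmatically. The following sketch only builds the argument list; the helper name and the sample values are hypothetical, and you can keep whichever subset of the flags applies to your job.

```python
def shuffle_bound_options(num_workers, disk_size_gb, project_id, zone):
    """Returns the Python SDK flags listed above as command-line arguments."""
    disk_type = (
        f"compute.googleapis.com/projects/{project_id}"
        f"/zones/{zone}/diskTypes/pd-ssd"
    )
    return [
        "--experiments=shuffle_mode=service",
        f"--num_workers={num_workers}",
        f"--disk_size_gb={disk_size_gb}",
        f"--worker_disk_type={disk_type}",
    ]
```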

Encoding errors, IOExceptions, or unexpected behavior in user code

The Apache Beam SDKs and the Dataflow workers depend on common third-party components. These components import additional dependencies. Version collisions can result in unexpected behavior in the service. Also, some libraries aren't forward-compatible. You might need to pin to the listed versions that are in scope during execution. SDK and Worker Dependencies contains a list of dependencies and their required versions.

Error running LookupEffectiveGuestPolicies

When you run a Dataflow job, you might see the following error in the log files:

OSConfigAgent Error policies.go:49: Error running LookupEffectiveGuestPolicies:
error calling LookupEffectiveGuestPolicies: code: "Unauthenticated",
message: "Request is missing required authentication credential.
Expected OAuth 2 access token, login cookie or other valid authentication credential.

This error occurs if OS configuration management is enabled for the entire project.

To resolve this issue, disable VM Manager policies that apply to the entire project. If disabling VM Manager policies for the entire project isn't possible, you can safely ignore this error and filter it out of log monitoring tools.

Exhausted resource pool

When you create a Google Cloud resource, you might see the following error for an exhausted resource pool:

ERROR: ZONE_RESOURCE_POOL_EXHAUSTED

This error occurs for temporary stock-out conditions for a specific resource in a specific zone.

To resolve the issue, you can either wait or create the same resource in another zone. As a best practice, we recommend that you distribute your resources across multiple zones and regions to tolerate outages.

A fatal error has been detected by the Java Runtime Environment

The following error occurs during worker startup:

A fatal error has been detected by the Java Runtime Environment

This error occurs if the pipeline is using Java Native Interface (JNI) to run non-Java code and that code or the JNI bindings contain an error.

googclient_deliveryattempt attribute key error

Your Dataflow job fails with one of the following errors:

The request contains an attribute key that is not valid (key=googclient_deliveryattempt). Attribute keys must be non-empty and must not begin with 'goog' (case-insensitive).

Or:

Invalid extensions name: googclient_deliveryattempt

This error occurs when your Dataflow job has the following characteristics:

This error occurs because when you use the Pub/Sub Java or C# client library and a dead-letter topic for a subscription is enabled, the delivery attempts are in the googclient_deliveryattempt message attribute instead of the delivery_attempt field. For more information, see Track delivery attempts in the "Handle message failures" page.

To work around this issue, make one or more of the following changes.

A hot key ... was detected

The following error occurs:

A hot key HOT_KEY_NAME was detected in...

These errors occur if your data contains a hot key. A hot key is a key with enough elements to negatively affect pipeline performance. These keys limit the ability of Dataflow to process elements in parallel, which increases execution time.

To print the human-readable key to the logs when a hot key is detected in the pipeline, use the hot key pipeline option.

To resolve this issue, check that your data is evenly distributed. If a key has disproportionately many values, consider the following courses of action:

To view hot keys in the Dataflow monitoring interface, see Troubleshoot stragglers in batch jobs.
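
As a rough way to check whether your data is evenly distributed, you can count elements per key on a sample of your input. The following sketch is an offline check, not a Dataflow feature; the function name and the max_share threshold are illustrative assumptions.

```python
from collections import Counter


def find_skewed_keys(keys, max_share=0.5):
    """Returns keys whose share of all elements exceeds max_share.

    max_share is an illustrative threshold, not a Dataflow setting.
    """
    counts = Counter(keys)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {
        key: count / total
        for key, count in counts.items()
        if count / total > max_share
    }
```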

Invalid table specification in Data Catalog

When you use Dataflow SQL to create Dataflow SQL jobs, your job might fail with the following error in the log files:

Invalid table specification in Data Catalog: Could not resolve table in Data Catalog

This error occurs if the Dataflow service account doesn't have access to the Data Catalog API.

To resolve this issue, enable the Data Catalog API in the Google Cloud project that you're using to write and run queries.

Alternatively, assign the roles/datacatalog.viewer role to the Dataflow service account.

The job graph is too large

Your job might fail with the following error:

The job graph is too large. Please try again with a smaller job graph,
or split your job into two or more smaller jobs.

This error occurs if the graph size of your job exceeds 10 MB. Certain conditions in your pipeline can cause the job graph to exceed the limit. Common conditions include:

  • A Create transform that includes a large amount of in-memory data.
  • A large DoFn instance that is serialized for transmission to remote workers.
  • A DoFn as an anonymous inner class instance that (possibly inadvertently) pulls in a large amount of data to be serialized.
  • A directed acyclic graph (DAG) is being used as part of a programmatic loop that is enumerating a large list.

To avoid these conditions, consider restructuring your pipeline.

Key Commit Too Large

When running a streaming job, the following error appears in the worker log files:

KeyCommitTooLargeException

This error occurs in streaming scenarios if a very large amount of data is grouped without using a Combine transform, or if a large amount of data is produced from a single input element.

To reduce the possibility of encountering this error, use the following strategies:

  • Ensure that processing a single element cannot result in outputs or state modifications exceeding the limit.
  • If multiple elements were grouped by a key, consider increasing the key space to reduce the elements grouped per key.
  • If elements for a key are emitted at a high frequency over a short time, that might result in many GB of events for that key in windows. Rewrite the pipeline to detect keys like this and only emit an output indicating the key was frequently present in that window.
  • Use sublinear space Combine transforms for commutative and associative operations. Don't use a combiner if it doesn't reduce space. For example, a combiner for strings that just appends strings together is worse than not using a combiner.
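
To illustrate the difference between a combiner that reduces space and one that doesn't, the following sketch computes a mean while storing only a running sum and count. The method names mirror the Apache Beam CombineFn interface, but the class is plain Python so that it runs without the SDK; treat it as an illustration rather than a drop-in transform.

```python
class MeanAccumulator:
    """Sublinear-space combiner: keeps only (sum, count), however many inputs arrive."""

    def create_accumulator(self):
        return (0.0, 0)

    def add_input(self, accumulator, value):
        total, count = accumulator
        return (total + value, count + 1)

    def merge_accumulators(self, accumulators):
        # Materialize once so the input can be any iterable.
        accumulators = list(accumulators)
        return (
            sum(total for total, _ in accumulators),
            sum(count for _, count in accumulators),
        )

    def extract_output(self, accumulator):
        total, count = accumulator
        return total / count if count else float("nan")


def concat_strings(values):
    """Counter-example: the result grows with the input, so combining saves no space."""
    return "".join(values)
```

The accumulator stays the same size regardless of how many values share a key, whereas the concatenated string is as large as all of its inputs together.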

Rejecting message over 7168K

When you run a Dataflow job created from a template, the job might fail with the following error:

Error: CommitWork failed: status: APPLICATION_ERROR(3): Pubsub publish requests are limited to 10MB, rejecting message over 7168K (size MESSAGE_SIZE) to avoid exceeding limit with byte64 request encoding.

This error occurs when messages written to a dead-letter queue exceed the size limit of 7168 K. As a workaround, enable Streaming Engine, which has a higher size limit. To enable Streaming Engine, use the following pipeline option.

Java

--enableStreamingEngine=true

Python

--enable_streaming_engine=true

Request Entity Too Large

When you submit your job, one of the following errors appears in your console or terminal window:

413 Request Entity Too Large
The size of serialized JSON representation of the pipeline exceeds the allowable limit
Failed to create a workflow job: Invalid JSON payload received
Failed to create a workflow job: Request payload exceeds the allowable limit

When you encounter an error about the JSON payload when submitting your job, the JSON representation of your pipeline exceeds the maximum 20 MB request size.

The size of your job is tied to the JSON representation of the pipeline. A larger pipeline means a larger request. Dataflow has a limitation that caps requests at 20 MB.

To estimate the size of the JSON request of your pipeline, run your pipeline with the following option:

Java

--dataflowJobFile=PATH_TO_OUTPUT_FILE

Python

--dataflow_job_file=PATH_TO_OUTPUT_FILE

Go

Outputting your job as JSON is not supported in Go.

This command writes a JSON representation of your job to a file. The size of the serialized file is a good estimate of the size of the request. The actual size is slightly larger due to some additional information included in the request.
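
After you write the job file, you can compare its size with the request limit. The following sketch assumes that the 20 MB limit means 20 × 1024 × 1024 bytes, which the text above doesn't specify; the function name is hypothetical.

```python
import os

# Assumption: the 20 MB limit is interpreted as 20 * 1024 * 1024 bytes.
MAX_REQUEST_BYTES = 20 * 1024 * 1024


def job_file_headroom(path, limit=MAX_REQUEST_BYTES):
    """Returns the number of bytes left before the job file reaches the limit.

    A negative result means that the file alone already exceeds the limit.
    """
    return limit - os.path.getsize(path)
```

Because the real request is slightly larger than the file, treat a small positive headroom as a warning rather than a guarantee.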

Certain conditions in your pipeline can cause the JSON representation to exceed the limit. Common conditions include:

  • A Create transform that includes a large amount of in-memory data.
  • A large DoFn instance that is serialized for transmission to remote workers.
  • A DoFn as an anonymous inner class instance that (possibly inadvertently) pulls in a large amount of data to be serialized.

To avoid these conditions, consider restructuring your pipeline.

SDK pipeline options or staging file list exceeds size limit

When running a pipeline, one of the following errors occurs:

SDK pipeline options or staging file list exceeds size limit.
Please keep their length under 256K Bytes each and 512K Bytes in total.

Or:

Value for field 'resource.properties.metadata' is too large: maximum size

These errors occur if the pipeline couldn't be started due to Compute Engine metadata limits being exceeded. These limits can't be changed. Dataflow uses Compute Engine metadata for pipeline options. The limit is documented in the Compute Engine custom metadata limitations.

The following scenarios can cause the JSON representation to exceed the limit:

  • There are too many JAR files to stage.
  • The sdkPipelineOptions request field is too large.

To estimate the size of the JSON request of your pipeline, run your pipeline with the following option:

Java

--dataflowJobFile=PATH_TO_OUTPUT_FILE

Python

--dataflow_job_file=PATH_TO_OUTPUT_FILE

Go

Outputting your job as JSON is not supported in Go.

The size of the output file from this command must be less than 256 KB. The 512 KB in the error message refers to the total size of the output file and the custom metadata options for the Compute Engine VM instance.

You can get a rough estimate of the size of the custom metadata for a VM instance from Dataflow jobs that are running in the project. Choose any running Dataflow job, pick one of its VM instances, and then open the Compute Engine VM instance details page for that VM and check the custom metadata section. The total length of the custom metadata and the file should be less than 512 KB. An accurate estimate for the failed job isn't possible, because VMs aren't created for failed jobs.

If your JAR list is hitting the 256-KB limit, review it and remove any unnecessary JAR files. If it's still too large, try running the Dataflow job by using an uber JAR. For an example that demonstrates how to create and use an uber JAR, see Build and deploy an Uber JAR.

If the sdkPipelineOptions request field is too large, include the following option when you run your pipeline. The pipeline option is the same for Java, Python, and Go.

--experiments=no_display_data_on_gce_metadata

Shuffle key too large

The following error appears in the worker log files:

Shuffle key too large

This error occurs if the serialized key emitted to a particular (Co-)GroupByKey is too large after the corresponding coder is applied. Dataflow has a limit for serialized shuffle keys.

To resolve this issue, reduce the size of the keys or use more space-efficient coders.

Total number of BoundedSource objects ... is larger than the allowable limit

One of the following errors might occur when running jobs with Java:

Total number of BoundedSource objects generated by splitIntoBundles() operation is larger than the allowable limit

Or:

Total size of the BoundedSource objects generated by splitIntoBundles() operation is larger than the allowable limit

Java

This error might occur if you're reading from a very large number of files by using TextIO, AvroIO, BigQueryIO through EXPORT, or some other file-based source. The particular limit depends on the details of your source, but it is on the order of tens of thousands of files in one pipeline. For example, embedding schema in AvroIO.Read allows fewer files.

This error might also occur if you created a custom data source for your pipeline and the splitIntoBundles method of your source returned a list of BoundedSource objects which takes up more than 20 MB when serialized.

The allowable limit for the total size of the BoundedSource objects generated by the splitIntoBundles() operation of your custom source is 20 MB.

To work around this limitation, make one of the following changes:

  1. Enable Runner V2. Runner V2 converts sources into splittable DoFns that don't have this source split limit.

  2. Modify your custom BoundedSource subclass so that the total size of the generated BoundedSource objects is smaller than the 20-MB limit. For example, your source might generate fewer splits initially, and rely on Dynamic Work Rebalancing to further split inputs on demand.

NameError

When you execute your pipeline using the Dataflow service, the following error occurs:

NameError

This error does not occur when you execute locally, such as when you execute using the DirectRunner.

This error occurs if your DoFns are using values in the global namespace that are not available on the Dataflow worker.

By default, global imports, functions, and variables defined in the main session are not saved during the serialization of a Dataflow job.

To resolve this issue, use one of the following methods. If your DoFns are defined in the main file and reference imports and functions in the global namespace, set the --save_main_session pipeline option to True. This change pickles the state of the global namespace and loads it on the Dataflow worker.

If you have objects in your global namespace that can't be pickled, a pickling error occurs. If the error is regarding a module that should be available in the Python distribution, import the module locally, where it's used.

For example, instead of:

import re
…
def myfunc():
  # use re module

use:

def myfunc():
  import re
  # use re module

Alternatively, if your DoFns span multiple files, use a different approach to packaging your workflow and managing dependencies.

Processing stuck or operation ongoing

If Dataflow spends more time executing a DoFn than the time specified in TIME_INTERVAL without returning, the following message is displayed.

Java

Either of the two following log messages, depending on the version:

Processing stuck in step STEP_NAME for at least TIME_INTERVAL

Operation ongoing in bundle BUNDLE_ID for at least TIME_INTERVAL without outputting or completing: at STACK_TRACE

Python

Operation ongoing for over TIME_INTERVAL in state STATE in step STEP_ID without returning. Current Traceback: TRACEBACK

Go

Operation ongoing in transform TRANSFORM_ID for at least TIME_INTERVAL without outputting or completing in state STATE

This behavior has two possible causes:

  • Your DoFn code is slow, or waiting for some slow external operation to complete.
  • Your DoFn code might be stuck, deadlocked, or abnormally slow to finish processing.

To determine which is the case, expand the Cloud Monitoring log entry to see a stack trace. Look for messages that indicate that the DoFn code is stuck or otherwise encountering issues. If no messages are present, the issue might be the execution speed of the DoFn code. Consider using Cloud Profiler or another tool to investigate the performance of your code.

If your pipeline is built on the Java VM (using either Java or Scala), you can investigate the cause of your stuck code. Take a full thread dump of the whole JVM (not just the stuck thread) by following these steps:

  1. Make note of the worker name from the log entry.
  2. In the Compute Engine section of the Google Cloud console, find the Compute Engine instance with the worker name you noted.
  3. Use SSH to connect to the instance with that name.
  4. Run the following command:

    curl http://localhost:8081/threadz
    

Pub/Sub quota errors

When running a streaming pipeline from Pub/Sub, the following errors occur:

429 (rateLimitExceeded)

Or:

Request was throttled due to user QPS limit being reached

These errors occur if your project has insufficient Pub/Sub quota.

To find out if your project has insufficient quota, follow these steps to check for client errors:

  1. Go to the Google Cloud console.
  2. In the menu on the left, select APIs & services.
  3. In the Search Box, search for Cloud Pub/Sub.
  4. Click the Usage tab.
  5. Check Response Codes and look for (4xx) client error codes.

Request is prohibited by organization's policy

When running a pipeline, the following error occurs:

Error trying to get gs://BUCKET_NAME/FOLDER/FILE:
{"code":403,"errors":[{"domain":"global","message":"Request is prohibited by organization's policy","reason":"forbidden"}],
"message":"Request is prohibited by organization's policy"}

This error occurs if the Cloud Storage bucket is outside of your service perimeter.

To resolve this issue, create an egress rule that allows access to the bucket outside of the service perimeter.

Staged package...is inaccessible

Jobs that used to succeed might fail with the following error:

Staged package...is inaccessible

To resolve this issue:

  • Verify that the Cloud Storage bucket used for staging does not have TTL settings that cause staged packages to be deleted.
  • Verify that the worker service account of your Dataflow project has the permission to access the Cloud Storage bucket used for staging. Gaps in permission can be due to any of the following reasons:

    • The Cloud Storage bucket used for staging is present in a different project.
    • The Cloud Storage bucket used for staging was migrated from fine-grained access to uniform bucket-level access. Due to the inconsistency between IAM and ACL policies, migrating the staging bucket to uniform bucket-level access disallows ACLs for Cloud Storage resources. ACLs include the permissions held by the worker service account of your Dataflow project over the staging bucket.

For more information, see Accessing Cloud Storage buckets across Google Cloud projects.

A work item failed 4 times

The following error occurs when a job fails:

a work item failed 4 times

This error occurs if a single operation causes the worker code to fail four times. Dataflow fails the job, and this message is displayed.

You can't configure this failure threshold. For more details, refer to pipeline error and exception handling.

To resolve this issue, look in the Cloud Monitoring logs of the job for the four individual failures. Look for Error-level or Fatal-level log entries in the worker logs that show exceptions or errors. The exception or error should appear at least four times. If the logs only contain generic timeout errors related to accessing external resources, such as MongoDB, verify that the worker service account has permission to access the subnetwork of the resource.
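
If you export the worker logs, a small script can surface the messages that repeat at least four times. The following sketch assumes that each log entry is a dict with hypothetical "severity" and "message" keys; the set of failure severities is also an assumption, so adjust both to match your export format.

```python
from collections import Counter

# Assumption: entries at these severities count as failures.
FAILURE_SEVERITIES = {"ERROR", "CRITICAL", "FATAL"}


def repeated_failures(entries, threshold=4):
    """Returns messages that appear at least `threshold` times at failure severity.

    Each entry is a hypothetical dict with "severity" and "message" keys,
    for example rows exported from the worker logs.
    """
    counts = Counter(
        entry["message"]
        for entry in entries
        if entry.get("severity", "").upper() in FAILURE_SEVERITIES
    )
    return {msg: n for msg, n in counts.items() if n >= threshold}
```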

Timeout in Polling Result File

The following error occurs when a job fails:

Timeout in polling result file: PATH. Possible causes are:
1. Your launch takes too long time to finish. Please check the logs on stackdriver.
2. Service account SERVICE_ACCOUNT may not have enough permissions to pull
container image IMAGE_PATH or create new objects in PATH.
3. Transient errors occurred, please try again.

The issue is often related to how the Python dependencies are being installed by using the requirements.txt file. The Apache Beam stager downloads the source of all dependencies from PyPI, including the sources of transitive dependencies. Then, the wheel compilation happens implicitly during the pip download command for some of the Python packages that are dependencies of apache-beam. A timeout issue might occur because of the requirements.txt file.

For more information, see the Apache Arrow team's bug tracking this issue. The suggested workaround is to install apache-beam directly in the Dockerfile. This way, the timeout for the requirements.txt file is not applied.

Archive job errors

The following sections contain common errors that you might encounter when you try to archive a Dataflow job by using the API.

No value is provided

When you try to archive a Dataflow job by using the API, the following error might occur:

The field mask specifies an update for the field job_metadata.user_display_properties.archived in job JOB_ID, but no value is provided. To update a field, please provide a field for the respective value.

This error occurs for one of the following reasons:

  • The path specified for the updateMask field doesn't follow the correct format. This issue can occur due to typos.

  • The JobMetadata isn't correctly specified. In the JobMetadata field, for userDisplayProperties, use the key-value pair "archived":"true".

To resolve this error, verify that the command that you pass to the API matches the required format. For more details, see Archive a job.

The API does not recognize the value

When you try to archive a Dataflow job by using the API, the following error might occur:

The API does not recognize the value VALUE for the field job_metadata.user_display_properties.archived for job JOB_ID. REASON: Archived display property can only be set to 'true' or 'false'

This error occurs when the value provided in the archive jobs key-value pair isn't a supported value. The supported values for the archive jobs key-value pair are "archived":"true" and "archived":"false".

To resolve this error, verify that the command that you pass to the API matches the required format. For more details, see Archive a job.
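The two archive errors above both come down to the shape of the request. The following Python sketch builds the URL and body for an archive call so that the field mask and the value can be checked before anything is sent. The URL layout and the job_metadata body key are assumptions based on the v1b3 REST API, and build_archive_request is a hypothetical helper name. Confirm the exact format in Archive a job before relying on it.

```python
import json
from urllib.parse import urlencode

# Field-mask path quoted in the error messages above.
ARCHIVE_MASK = "job_metadata.user_display_properties.archived"


def build_archive_request(project_id, region, job_id, archived=True):
    """Return (url, body) for an update call that sets the archive flag.

    Assumption: the endpoint follows the v1b3 projects.locations.jobs
    layout. Nothing is sent here; the caller issues the request.
    """
    if not isinstance(archived, bool):
        raise ValueError("archived must be True or False")
    url = (
        "https://dataflow.googleapis.com/v1b3/"
        f"projects/{project_id}/locations/{region}/jobs/{job_id}"
        "?" + urlencode({"updateMask": ARCHIVE_MASK})
    )
    # The supported values are the strings "true" and "false".
    body = json.dumps(
        {
            "job_metadata": {
                "userDisplayProperties": {"archived": str(archived).lower()}
            }
        }
    )
    return url, body


url, body = build_archive_request("PROJECT_ID", "REGION", "JOB_ID")
print(url)
print(body)
```

Building the mask from a single constant avoids the typos that lead to the "no value is provided" error, and converting a boolean to a lowercase string keeps the value within the two supported options.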

Cannot update both state and mask

When you try to archive a Dataflow job by using the API, the following error might occur:

Cannot update both state and mask.

This error occurs when you try to update both the job state and the archive status in the same API call. You can't make updates to both the job state and the updateMask query parameter in the same API call.

To resolve this error, update the job state in a separate API call. Make updates to the job state before updating the job archive status.

Workflow modification failed

When you try to archive a Dataflow job by using the API, the following error might occur:

Workflow modification failed.

This error usually occurs when you try to archive a job that is running.

To resolve this error, wait until the job completes before archiving it. Completed jobs have one of the following job states:

  • JOB_STATE_CANCELLED
  • JOB_STATE_DRAINED
  • JOB_STATE_DONE
  • JOB_STATE_FAILED
  • JOB_STATE_UPDATED

For more information, see Detect Dataflow job completion.
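As a minimal sketch, a script that archives jobs could check the job state first and skip jobs that are still running. The state names are the ones listed above. The can_archive helper is hypothetical, and how you obtain the current state (for example, from the API or the gcloud CLI) is left to the caller.

```python
# Job states listed above as completed states.
COMPLETED_STATES = frozenset(
    {
        "JOB_STATE_CANCELLED",
        "JOB_STATE_DRAINED",
        "JOB_STATE_DONE",
        "JOB_STATE_FAILED",
        "JOB_STATE_UPDATED",
    }
)


def can_archive(current_state):
    """Return True when the job state is one of the completed states."""
    return current_state in COMPLETED_STATES


for state in ("JOB_STATE_RUNNING", "JOB_STATE_DONE"):
    print(state, "->", "archive" if can_archive(state) else "wait")
```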

Container image errors

The following sections contain common errors that you might encounter when using custom containers and steps for resolving or troubleshooting the errors. The errors are typically prefixed with the following message:

Unable to pull container image due to error: DETAILED_ERROR_MESSAGE

Permission "containeranalysis.occurrences.list" denied

The following error appears in your log files:

Error getting old patchz discovery occurrences: generic::permission_denied: permission "containeranalysis.occurrences.list" denied for project "PROJECT_ID", entity ID "" [region="REGION" projectNum=PROJECT_NUMBER projectID="PROJECT_ID"]

The Container Analysis API is required for vulnerability scanning.

For more information, see OS scanning overview and Configuring access control in the Artifact Analysis documentation.

Error syncing pod ... failed to "StartContainer"

The following error occurs during worker startup:

Error syncing pod POD_ID, skipping: [failed to "StartContainer" for CONTAINER_NAME with CrashLoopBackOff: "back-off 5m0s restarting failed container=CONTAINER_NAME pod=POD_NAME].

A pod is a colocated group of Docker containers running on a Dataflow worker. This error occurs when one of the Docker containers in the pod fails to start. If the failure is not recoverable, the Dataflow worker isn't able to start, and Dataflow batch jobs eventually fail with errors like the following:

The Dataflow job appears to be stuck because no worker activity has been seen in the last 1h.

This error typically occurs when one of the containers is continuously crashing during startup.

To understand the root cause, look for the logs captured immediately prior to the failure. To analyze the logs, use the Logs Explorer. In the Logs Explorer, limit the log files to log entries emitted from the worker with container startup errors. To limit the log entries, complete the following steps:

  1. In the Logs Explorer, find the Error syncing pod log entry.
  2. To see the labels associated with the log entry, expand the log entry.
  3. Click the label associated with the resource_name, and then click Show matching entries.

Figure: The Logs Explorer page with the steps for limiting log files highlighted.

In the Logs Explorer, the Dataflow logs are organized into several log streams. The Error syncing pod message is emitted in the log named kubelet. However, the logs from the failing container could be in a different log stream. Each container has a name. Use the following table to determine which log stream might contain logs relevant to the failing container.

Container name                          | Log names
sdk, sdk0, sdk1, sdk-0-0, and similar   | docker
harness                                 | harness, harness-startup
python, java-batch, java-streaming      | worker-startup, worker
artifact                                | artifact

When you query the Logs Explorer, make sure that the query either includes the relevant log names in the query builder interface or does not have restrictions on the log name.

Figure: A Logs Explorer query that includes the relevant log names.

After you select the relevant logs, the query result might look like the following example:

resource.type="dataflow_step"
resource.labels.job_id="2022-06-29_08_02_54-JOB_ID"
labels."compute.googleapis.com/resource_name"="testpipeline-jenkins-0629-DATE-cyhg-harness-8crw"
logName=("projects/apache-beam-testing/logs/dataflow.googleapis.com%2Fdocker"
OR
"projects/apache-beam-testing/logs/dataflow.googleapis.com%2Fworker-startup"
OR
"projects/apache-beam-testing/logs/dataflow.googleapis.com%2Fworker")

Because the logs reporting the symptom of the container failure are sometimes reported as INFO, include INFO logs in your analysis.
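If you build such queries often, a small helper can assemble the filter text from the job ID, the worker resource name, and the log names that match the failing container. The following Python sketch mirrors the example query and the container-to-log mapping above. The function names are hypothetical, and the output is only a string to paste into the Logs Explorer.

```python
def log_ids_for_container(container_name):
    """Map a container name to the log names from the table above."""
    if container_name.startswith("sdk"):
        return ["docker"]
    table = {
        "harness": ["harness", "harness-startup"],
        "python": ["worker-startup", "worker"],
        "java-batch": ["worker-startup", "worker"],
        "java-streaming": ["worker-startup", "worker"],
        "artifact": ["artifact"],
    }
    return table.get(container_name, [])


def build_log_filter(project_id, job_id, resource_name, log_ids):
    """Return a Logs Explorer filter restricted to one worker and some logs."""
    log_names = "\nOR\n".join(
        f'"projects/{project_id}/logs/dataflow.googleapis.com%2F{log_id}"'
        for log_id in log_ids
    )
    return "\n".join(
        [
            'resource.type="dataflow_step"',
            f'resource.labels.job_id="{job_id}"',
            f'labels."compute.googleapis.com/resource_name"="{resource_name}"',
            f"logName=({log_names})",
        ]
    )


print(
    build_log_filter(
        "PROJECT_ID", "JOB_ID", "WORKER_NAME", log_ids_for_container("python")
    )
)
```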

Typical causes of container failures include the following:

  1. Your Python pipeline has additional dependencies that are installed at runtime, and the installation is unsuccessful. You might see errors like pip install failed with error. This issue might occur due to conflicting requirements, or due to a restricted networking configuration that prevents a Dataflow worker from pulling an external dependency from a public repository over the internet.
  2. A worker fails in the middle of the pipeline run due to an out of memory error. You might see an error like one of the following:

    • java.lang.OutOfMemoryError: Java heap space
    • Shutting down JVM after 8 consecutive periods of measured GC thrashing. Memory is used/total/max = 24453/42043/42043 MB, GC last/max = 58.97/99.89 %, #pushbacks=82, gc thrashing=true. Heap dump not written.

    To debug an out of memory issue, see Troubleshoot Dataflow out of memory errors.

  3. Dataflow is unable to pull the container image. For more information, see Image pull request failed with error.

  4. The container used is not compatible with the worker VM's CPU architecture. In the harness startup logs, you might see an error like the following: exec /opt/apache/beam/boot: exec format error. To check the container image's architecture, run docker image inspect $IMAGE:$TAG and look for the Architecture key. If the command returns Error: No such image: $IMAGE:$TAG, you might need to pull the image first by running docker pull $IMAGE:$TAG. For information on building multi-architecture images, see Build a multi-architecture container image.
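For the architecture check in the last item, the output of docker image inspect is a JSON array, so the Architecture value can be read programmatically. The following Python sketch parses that output. It assumes you capture the command output yourself, and the sample string stands in for real output.

```python
import json


def image_architectures(inspect_output):
    """Return the Architecture value of each image in docker inspect output."""
    return [
        entry.get("Architecture", "unknown")
        for entry in json.loads(inspect_output)
    ]


# Hypothetical, heavily trimmed sample of `docker image inspect` output.
sample = '[{"Id": "sha256:abc", "Architecture": "arm64", "Os": "linux"}]'
print(image_architectures(sample))
```

Compare the reported value with the CPU architecture of the worker machine type. A mismatch is consistent with the exec format error described above.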

After you identify the error causing the container to fail, try to address the error, and then resubmit the pipeline.

Image pull request failed with error

During worker startup, one of the following errors appears in the worker or job logs:

Image pull request failed with error
pull access denied for IMAGE_NAME
manifest for IMAGE_NAME not found: manifest unknown: Failed to fetch
Get IMAGE_NAME: Service Unavailable

These errors occur if a worker is unable to start up because the worker can't pull a Docker container image. This issue happens in the following scenarios:

  • The custom SDK container image URL is incorrect
  • The worker lacks credential or network access to the remote image

To resolve this issue:

  • If you're using a custom container image with your job, verify that your image URL is correct and has a valid tag or digest. The Dataflow workers also need access to the image.
  • Verify that public images can be pulled locally by running docker pull $image from an unauthenticated machine.

For private images or private workers:

  • If you're using Container Registry to host your container image, it is recommended that you use Artifact Registry instead. Effective May 15, 2023, Container Registry is deprecated. If you use Container Registry, you can transition to Artifact Registry. If your images are in a different project than the one used to run your Google Cloud job, configure access control for the default Google Cloud service account.
  • If using shared Virtual Private Cloud (VPC), make sure that workers can access the custom container repository host.
  • Use ssh to connect with a running job worker VM and run docker pull $image to directly confirm that the worker is configured properly.

If workers fail several times in a row due to this error and work has started on a job, the job can fail with an error similar to the following message:

Job appears to be stuck.

If you remove access to the image while the job is running, whether by deleting the image, revoking the Dataflow worker service account credentials, or removing the internet access needed to reach the image, Dataflow only logs errors and doesn't fail the job. Dataflow also avoids failing long-running streaming pipelines so that pipeline state isn't lost.

Other possible errors can arise from repository quota issues or outages. If you experience issues exceeding the Docker Hub quota for pulling public images or general third-party repository outages, consider using Artifact Registry as the image repository.

SystemError: unknown opcode

Your Python custom container pipeline might fail with the following error immediately after job submission:

SystemError: unknown opcode

In addition, the stack trace might include the following:

apache_beam/internal/pickler.py

To resolve this issue, verify that the Python version that you're using locally matches the version in the container image up to the major and minor version. The difference in the patch version, such as 3.6.7 versus 3.6.8, does not create compatibility issues. The difference in minor version, such as 3.6.8 versus 3.8.2, can cause pipeline failures.
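The comparison rule can be expressed as a short check that ignores the patch version. The following Python sketch compares the local interpreter with a container version string that you supply. The same_major_minor helper is hypothetical, and how you look up the container's Python version is outside the scope of the sketch.

```python
import sys


def same_major_minor(local_version, container_version):
    """Return True when two version strings share major and minor parts."""
    return local_version.split(".")[:2] == container_version.split(".")[:2]


# Version of the interpreter that launches the pipeline.
local = ".".join(str(part) for part in sys.version_info[:3])
print("Local Python version:", local)

# Patch-level differences are acceptable; minor-level differences are not.
print(same_major_minor("3.6.7", "3.6.8"))
print(same_major_minor("3.6.8", "3.8.2"))
```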

Worker errors

The following sections contain common worker errors that you might encounter and steps for resolving or troubleshooting the errors.

Call from Java worker harness to Python DoFn fails with error

If a call from the Java worker harness to a Python DoFn fails, a relevant error message is displayed.

To investigate the error, expand the Cloud Monitoring error log entry and look at the error message and traceback. It shows you which code failed so you can correct it if necessary. If you believe that the error is a bug in Apache Beam or Dataflow, report the bug.

EOFError: marshal data too short

The following error appears in the worker logs:

EOFError: marshal data too short

This error sometimes occurs when Python pipeline workers run out of disk space.

To resolve this issue, see No space left on device.

Failed to attach disk

When you try to launch a Dataflow job that uses C3 VMs with Persistent Disk, the job fails with one or both of the following errors:

Failed to attach disk(s), status: generic::invalid_argument: One or more operations had an error
Can not allocate sha384 (reason: -2), Spectre V2 : WARNING: Unprivileged eBPF is enabled with eIBRS on...

These errors occur when you use C3 VMs with an unsupported Persistent Disk type. For more information, see Supported disk types for C3.

To use C3 VMs with your Dataflow job, choose the pd-ssd worker disk type. For more information, see Worker-level options.

Java

--workerDiskType=pd-ssd

Python

--worker_disk_type=pd-ssd

Go

disk_type=pd-ssd

No space left on device

When a job runs out of disk space, the following error might appear in the worker logs:

No space left on device

This error can occur for one of the following reasons:

  • The worker persistent storage runs out of free space, which can occur for one of the following reasons:
    • A job downloads large dependencies at runtime
    • A job uses large custom containers
    • A job writes a large amount of temporary data to local disk
  • When using Dataflow Shuffle, Dataflow sets a lower default disk size. As a result, this error might occur with jobs moving from worker-based shuffle.
  • The worker boot disk fills up because it's logging more than 50 entries per second.

To resolve this issue, follow these troubleshooting steps:

To see disk resources associated with a single worker, look up VM instance details for worker VMs associated with your job. Part of the disk space is consumed by the operating system, binaries, logs, and containers.

To increase persistent disk or boot disk space, adjust the disk size pipeline option.

Track disk space usage on the worker VM instances by using Cloud Monitoring. See Receive worker VM metrics from the Monitoring agent for instructions explaining how to set this up.

Look for boot disk space issues by viewing the serial port output on the worker VM instances and looking for messages like the following:

Failed to open system journal: No space left on device

If you have many worker VM instances, you can create a script to run gcloud compute instances get-serial-port-output on all of them at once. You can review that output instead.
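One possible shape for such a script is shown in the following Python sketch. It runs gcloud compute instances get-serial-port-output for each worker and keeps only the lines that mention the disk space message. The list of (instance, zone) pairs is an assumption that you supply yourself, the helper names are hypothetical, and the gcloud CLI must be installed and authenticated.

```python
import subprocess

MARKER = "No space left on device"


def serial_port_command(instance, zone):
    """Build the gcloud command that prints one VM's serial port output."""
    return [
        "gcloud", "compute", "instances", "get-serial-port-output",
        instance, "--zone", zone,
    ]


def lines_with_marker(output, marker=MARKER):
    """Return the lines of the output that contain the marker text."""
    return [line for line in output.splitlines() if marker in line]


def scan_workers(workers):
    """Return {instance: matching lines} for (instance, zone) pairs."""
    findings = {}
    for instance, zone in workers:
        result = subprocess.run(
            serial_port_command(instance, zone),
            capture_output=True, text=True, check=False,
        )
        hits = lines_with_marker(result.stdout)
        if hits:
            findings[instance] = hits
    return findings
```

For example, scan_workers([("WORKER_NAME", "ZONE_NAME")]) returns an empty dictionary when no worker reports the message.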

Python pipeline fails after one hour of worker inactivity

When using the Apache Beam SDK for Python with Dataflow Runner V2 on worker machines with many CPU cores, use Apache Beam SDK 2.35.0 or later. If your job uses a custom container, use Apache Beam SDK 2.46.0 or later.

Consider pre-building your Python container. This step can improve VM startup times and horizontal autoscaling performance. To use this feature, enable the Cloud Build API on your project and submit your pipeline with the following parameter:

--prebuild_sdk_container_engine=cloud_build

For more information, see Dataflow Runner V2.

You can also use a custom container image with all dependencies preinstalled.

Startup of the worker pool in zone failed to bring up any of the desired workers

The following error occurs:

Startup of the worker pool in zone ZONE_NAME failed to bring up any of the desired NUMBER workers.
The project quota may have been exceeded or access control policies may be preventing the operation;
review the Cloud Logging 'VM Instance' log for diagnostics.

This error occurs for one of the following reasons:

  • You have exceeded one of the Compute Engine quotas that Dataflow worker creation relies on.
  • Your organization has constraints in place that prohibit some aspect of the VM instance creation process, like the account being used, or the zone being targeted.

To resolve this issue, follow these troubleshooting steps:

Review the VM Instance log

  1. Go to the Cloud Logging viewer.
  2. In the Audited Resource drop-down list, select VM Instance.
  3. In the All logs drop-down list, select compute.googleapis.com/activity_log.
  4. Scan the log for any entries related to VM instance creation failure.

Check your usage of Compute Engine quotas

  1. To view Compute Engine resource usage and quota limits for the region that contains the zone you're targeting, run the following command:

    gcloud compute regions describe REGION

  2. Review the results for the following resources to see if any are exceeding quota:

    • CPUS
    • DISKS_TOTAL_GB
    • IN_USE_ADDRESSES
    • INSTANCE_GROUPS
    • INSTANCES
    • REGIONAL_INSTANCE_GROUP_MANAGERS
  3. If needed, request a quota change.
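The review in step 2 can be scripted. The following Python sketch assumes that you run the describe command with --format=json and that the output contains a quotas list whose entries have metric, limit, and usage fields. The 90% threshold and the sample values are hypothetical.

```python
import json

# Quota metrics named in the list above.
WATCHED_METRICS = (
    "CPUS",
    "DISKS_TOTAL_GB",
    "IN_USE_ADDRESSES",
    "INSTANCE_GROUPS",
    "INSTANCES",
    "REGIONAL_INSTANCE_GROUP_MANAGERS",
)


def quotas_near_limit(describe_json, threshold=0.9):
    """Return (metric, usage, limit) for watched quotas at or above threshold."""
    region = json.loads(describe_json)
    flagged = []
    for quota in region.get("quotas", []):
        limit = quota.get("limit", 0)
        usage = quota.get("usage", 0)
        if (
            quota.get("metric") in WATCHED_METRICS
            and limit > 0
            and usage / limit >= threshold
        ):
            flagged.append((quota["metric"], usage, limit))
    return flagged


# Hypothetical sample standing in for real command output.
sample = json.dumps(
    {
        "quotas": [
            {"metric": "CPUS", "limit": 24.0, "usage": 24.0},
            {"metric": "DISKS_TOTAL_GB", "limit": 4096.0, "usage": 100.0},
        ]
    }
)
print(quotas_near_limit(sample))
```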

Review your organization policy constraints

  1. Go to the Organization policies page.
  2. Review the constraints for any that might limit VM instance creation for either the account you're using (by default, the Dataflow service account) or in the zone that you're targeting.
  3. If you have a policy that restricts the use of external IP addresses, turn off external IP addresses for this job. For more information about turning off external IP addresses, see Configure internet access and firewall rules.

Timed out waiting for an update from the worker

When a Dataflow job fails, the following error occurs:

Root cause: Timed out waiting for an update from the worker. For more information, see https://cloud.google.com/dataflow/docs/guides/common-errors#worker-lost-contact.

Sometimes, this error occurs when the worker runs out of memory or swap space. To resolve this issue, as a first step, try running the job again. If the job still fails and the same error occurs, try using a worker with more memory and disk space. For example, add the following pipeline startup option:

--worker_machine_type=m1-ultramem-40 --disk_size_gb=500

Changing the worker type could affect billed cost. For more information, see Troubleshoot Dataflow out of memory errors.

This error can also occur when your data contains a hot key. In this scenario, CPU utilization is high on some workers during most of the duration of the job. However, the number of workers does not reach the maximum allowed. For more information about hot keys and possible solutions, see Writing Dataflow pipelines with scalability in mind.

For additional solutions to this issue, see A hot key ... was detected.

If your Python code calls C/C++ code by using the Python extension mechanism, check whether the extension code releases the Python Global Interpreter Lock (GIL) in computationally intensive parts of code that don't access Python state. Libraries that facilitate interactions with extensions, such as Cython and PyBind, have primitives to control GIL status. You can also manually release the GIL and reacquire it before returning control to the Python interpreter by using the Py_BEGIN_ALLOW_THREADS and Py_END_ALLOW_THREADS macros. For more information, see Thread State and the Global Interpreter Lock in the Python documentation.

In Python pipelines, in the default configuration, Dataflow assumes that each Python process running on the workers efficiently uses one vCPU core. If the pipeline code bypasses the GIL limitations, such as by using libraries that are implemented in C++, processing elements might use resources from more than one vCPU core, and the workers might not get enough CPU resources. To work around this issue, reduce the number of threads on the workers.

Java dependency issues

Incompatible classes and libraries can cause Java dependency issues. When your pipeline has Java dependency issues, one of the following errors might occur:

  • NoClassDefFoundError: This error occurs when an entire class is not available during runtime.
  • NoSuchMethodError: This error occurs when the class in the classpath uses a version that doesn't contain the correct method or when the method signature changed.
  • NoSuchFieldError: This error occurs when the class in the classpath uses a version that doesn't have a field required during runtime.
  • FATAL ERROR in native method: This error occurs when a built-in dependency can't be loaded properly. When using an uber JAR (shaded), don't include libraries that use signatures (such as Conscrypt) in the same JAR.

If your pipeline contains user-specific code and settings, the code can't contain mixed versions of libraries. If you're using a dependency management library, we recommend that you use Google Cloud Libraries BOM.

If you're using the Apache Beam SDK, to import the correct libraries BOM, use beam-sdks-java-google-cloud-platform-bom:

Maven

<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>org.apache.beam</groupId>
      <artifactId>beam-sdks-java-google-cloud-platform-bom</artifactId>
      <version>BEAM_VERSION</version>
      <type>pom</type>
      <scope>import</scope>
    </dependency>
  </dependencies>
</dependencyManagement>

Gradle

dependencies {
    implementation(platform("org.apache.beam:beam-sdks-java-google-cloud-platform-bom:BEAM_VERSION"))
}

For more information, see Manage pipeline dependencies in Dataflow.

InaccessibleObjectException in JDK 17 and later

When you run pipelines with the Java Platform, Standard Edition Development Kit (JDK) versions 17 and later, the following error might appear in the worker log files:

Unable to make protected METHOD accessible:
    module java.MODULE does not "opens java.MODULE" to ...

This issue occurs because starting in Java version 9, open module Java virtual machine (JVM) options are needed to access JDK internals. In Java 16 and later versions, open module JVM options are always required to access JDK internals.

To resolve this issue, when you pass modules to your Dataflow pipeline to open, use the format MODULE/PACKAGE=TARGET_MODULE(,TARGET_MODULE)* with the jdkAddOpenModules pipeline option. This format allows access to the necessary library.

For example, if the error is module java.base does not "opens java.lang" to unnamed module @..., then include the following pipeline option when you run your pipeline:

--jdkAddOpenModules=java.base/java.lang=ALL-UNNAMED

For more information, see the DataflowPipelineOptions class documentation.
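As an illustration of the format, the following Python sketch derives the pipeline option from an error message of the kind shown above. It handles only the unnamed-module case from the example, and the open_module_option helper is hypothetical. Treat the result as a starting point and confirm it against the actual worker log message.

```python
import re

# Matches: module java.base does not "opens java.lang" to unnamed module
_PATTERN = re.compile(
    r'module ([\w.]+) does not "opens ([\w.]+)" to unnamed module'
)


def open_module_option(error_message):
    """Return a jdkAddOpenModules option for the message, or None."""
    match = _PATTERN.search(error_message)
    if match is None:
        return None
    module, package = match.groups()
    return f"--jdkAddOpenModules={module}/{package}=ALL-UNNAMED"


message = 'module java.base does not "opens java.lang" to unnamed module @...'
print(open_module_option(message))
```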

BigQuery connector errors

The following sections contain common BigQuery connector errors that you might encounter and steps for resolving or troubleshooting the errors.

quotaExceeded

When using the BigQuery connector to write to BigQuery using streaming inserts, write throughput is lower than expected, and the following error might occur:

quotaExceeded

Slow throughput might be due to your pipeline exceeding the available BigQuery streaming insert quota. If so, quota-related error messages from BigQuery appear in the Dataflow worker logs (look for quotaExceeded errors).

To resolve quotaExceeded errors, do the following:

  • When using the Apache Beam SDK for Java, set the BigQuery sink option ignoreInsertIds().
  • When using the Apache Beam SDK for Python, use the ignore_insert_ids option.

These settings make you eligible for a BigQuery streaming insert throughput of 1 GB per second per project. For more information on caveats related to automatic message deduplication, see the BigQuery documentation. To increase the BigQuery streaming insert quota above 1 GB per second, submit a request through the Google Cloud console.

If you don't see quota-related errors in worker logs, the issue might be that the default bundling or batching parameters don't provide adequate parallelism for your pipeline to scale. You can adjust several Dataflow BigQuery connector configurations to achieve the expected performance when writing to BigQuery using streaming inserts. For example, for the Apache Beam SDK for Java, adjust numStreamingKeys to match the maximum number of workers, and consider increasing insertBundleParallelism so that the BigQuery connector writes to BigQuery using more parallel threads.

For configurations available in the Apache Beam SDK for Java, see BigQueryPipelineOptions, and for configurations available in the Apache Beam SDK for Python, see the WriteToBigQuery transform.

rateLimitExceeded

When using the BigQuery connector, the following error occurs:

rateLimitExceeded

This error occurs if too many BigQuery API requests are sent during a short duration. BigQuery has short-term quota limits, and your Dataflow pipeline might temporarily exceed such a quota. In this scenario, API requests from your Dataflow pipeline to BigQuery might fail, which could result in rateLimitExceeded errors in worker logs.

Dataflow retries such failures, so you can safely ignore these errors. If you believe that your pipeline is affected by rateLimitExceeded errors, contact Cloud Customer Care.

Miscellaneous errors

The following sections contain miscellaneous errors that you might encounter and steps for resolving or troubleshooting the errors.

No such object: pipeline.pb

When listing jobs using the JOB_VIEW_ALL option, the following error occurs:

No such object: BUCKET_NAME/PATH/pipeline.pb

This error can occur if you delete the pipeline.pb file from the staging files for the job.

Recommendations

For guidance on recommendations generated by Dataflow Insights, see Insights.