If you run into problems with your Dataflow pipeline or job, this page lists error messages that you might see and provides suggestions for how to fix each error.
Errors in the log types dataflow.googleapis.com/worker-startup
,
dataflow.googleapis.com/harness-startup
, and dataflow.googleapis.com/kubelet
indicate configuration problems with a job. They can also indicate conditions
that prevent the normal logging path from functioning.
Your pipeline might throw exceptions while processing data. Some of these errors are transient, for example when temporary difficulty accessing an external service occurs. Some of these errors are permanent, such as errors caused by corrupt or unparseable input data, or null pointers during computation.
Dataflow processes elements in arbitrary bundles and retries the complete bundle when an error is thrown for any element in that bundle. When running in batch mode, bundles including a failing item are retried four times. The pipeline fails completely when a single bundle fails four times. When running in streaming mode, a bundle including a failing item is retried indefinitely, which might cause your pipeline to permanently stall.
Exceptions in user code, for example, your DoFn
instances, are
reported in the
Dataflow monitoring interface.
If you run your pipeline with BlockingDataflowPipelineRunner
, you also see
error messages printed in your console or terminal window.
Consider guarding against errors in your code by adding exception handlers. For
example, if you want to drop elements that fail some custom input validation
done in a ParDo
, use a try/catch block within your ParDo
to handle the
exception and log and drop the element. For production workloads, implement an
unprocessed message pattern. To track the error count, you use
aggregation transforms.
Missing log files
If you don't see any logs for your jobs, remove any exclusion filters containing
resource.type="dataflow_step"
from all of your Cloud Logging Log Router
sinks.
For more details about removing your logs exclusions, refer to the Removing exclusions guide.
Duplicates in output
When you run a Dataflow job, the output contains duplicate records.
This issue can occur when your Dataflow job uses the at-least-once pipeline streaming mode. This mode guarantees that records are processed at least once. However, duplicate records are possible in this mode.
If your workflow can't tolerate duplicate records, use the exactly-once streaming mode. In this mode, Dataflow ensures that records are not dropped or duplicated as the data moves through the pipeline.
To verify which streaming mode your job is using, see View a job's streaming mode.
For more information about streaming modes, see Set the pipeline streaming mode.
Pipeline errors
The following sections contain common pipeline errors that you might encounter and steps for resolving or troubleshooting the errors.
Some Cloud APIs need to be enabled
When you try to run a Dataflow job, the following error occurs:
Some Cloud APIs need to be enabled for your project in order for Cloud Dataflow to run this job.
This issue occurs because some required APIs are not enabled in your project.
To resolve this issue and run a Dataflow job, enable the following Google Cloud APIs in your project:
- Compute Engine API (Compute Engine)
- Cloud Logging API
- Cloud Storage
- Cloud Storage JSON API
- BigQuery API
- Pub/Sub
- Datastore API
For detailed instructions, see the Getting Started section on enabling Google Cloud APIs .
"@*" and "@N" are reserved sharding specs
When you try to run a job, the following error appears in the log files, and the job fails:
Workflow failed. Causes: "@*" and "@N" are reserved sharding specs. Filepattern must not contain any of them.
This error occurs if the filename for your Cloud Storage path for
temporary files (tempLocation
or temp_location
) has an at sign (@) followed
by a number or by an asterisk (*).
To resolve this issue, change the filename so that the at sign is followed by a supported character.
Bad request
When you run a Dataflow job, Cloud Monitoring logs display a series of warnings similar to the following:
Unable to update setup work item STEP_ID error: generic::invalid_argument: Http(400) Bad Request
Update range task returned 'invalid argument'. Assuming lost lease for work with id LEASE_ID
with expiration time: TIMESTAMP, now: TIMESTAMP. Full status: generic::invalid_argument: Http(400) Bad Request
Bad request warnings occur if worker state information is stale or out of sync due to processing delays. Often, your Dataflow job succeeds despite the bad request warnings. If that is the case, ignore the warnings.
Cannot read and write in different locations
When you run a Dataflow job, you might see the following error in the log files:
message:Cannot read and write in different locations: source: SOURCE_REGION, destination: DESTINATION_REGION,reason:invalid
This error occurs when the source and destination are in different regions. It
can also occur when the staging location and destination are in different
regions. For example, if the job reads from Pub/Sub and then writes to a
Cloud Storage temp
bucket before writing to a BigQuery table, the
Cloud Storage temp
bucket and the BigQuery table must be in the
same region.
Multi-region locations are considered different than single-region locations,
even if the single region falls within the scope of the multi-region location.
For example, us (multiple regions in the United States)
and us-central1
are
different regions.
To resolve this issue, have your destination, source, and staging locations in the same region. Cloud Storage bucket locations can't be changed, so you might need to create a new Cloud Storage bucket in the correct region.
Connection timed out
When you run a Dataflow job, you might see the following error in the log files:
org.springframework.web.client.ResourceAccessException: I/O error on GET request for CONNECTION_PATH: Connection timed out (Connection timed out); nested exception is java.net.ConnectException: Connection timed out (Connection timed out)
This issue occurs when the Dataflow workers fail to establish or maintain a connection with the data source or destination.
To resolve the issue, follow these troubleshooting steps:
- Verify that the data source is running.
- Verify that the destination is running.
- Review the connection parameters used in the Dataflow pipeline configuration.
- Verify that performance issues aren't affecting the source or destination.
- Make sure that firewall rules aren't blocking the connection.
No such object
When you run your Dataflow jobs, you might see the following error in the log files:
..., 'server': 'UploadServer', 'status': '404'}>, <content <No such object:...
These errors typically occur when some of your running Dataflow jobs
use the same temp_location
to stage temporary job files created when the
pipeline runs. When multiple concurrent jobs share the same temp_location
,
these jobs might step on the temporary data of each other, and a race condition might
occur. To avoid this issue, it's recommended that you use a unique temp_location
for each job.
Dataflow is unable to determine backlog
When running a streaming pipeline from Pub/Sub, the following warning occurs:
Dataflow is unable to determine the backlog for Pub/Sub subscription
When a Dataflow pipeline pulls data from Pub/Sub, Dataflow needs to repeatedly request information from Pub/Sub. This information includes the amount of backlog on the subscription and the age of the oldest unacknowledged message. Occasionally, Dataflow is unable to retrieve this information from Pub/Sub because of internal system issues, which may cause a transient accumulation of backlog.
For more information, see Streaming With Cloud Pub/Sub.
DEADLINE_EXCEEDED or Server Unresponsive
When you run your jobs, you might encounter RPC timeout exceptions or one of the following errors:
DEADLINE_EXCEEDED
Or:
Server Unresponsive
These errors typically occur for one of the following reasons:
The Virtual Private Cloud (VPC) network used for your job might be missing a firewall rule. The firewall rule needs to enable all TCP traffic among VMs in the VPC network you specified in your pipeline options. For more information, see Firewall rules for Dataflow.
In some cases, the workers aren't able to communicate with each other. When you run a Dataflow job that doesn't use Dataflow Shuffle or Streaming Engine, workers need to communicate with each other using TCP ports
12345
and12346
within the VPC network. In this scenario, the error includes the worker harness name and the TCP port that's blocked. The error looks like one of the following examples:DEADLINE_EXCEEDED: (g)RPC timed out when SOURCE_WORKER_HARNESS talking to DESTINATION_WORKER_HARNESS:12346.
Rpc to WORKER_HARNESS:12345 completed with error UNAVAILABLE: failed to connect to all addresses Server unresponsive (ping error: Deadline Exceeded, UNKNOWN: Deadline Exceeded...)
To resolve this issue, use the
gcloud compute firewall-rules create
rules flag to allow network traffic to ports12345
and12346
. The following example demonstrates the Google Cloud CLI command:gcloud compute firewall-rules create FIREWALL_RULE_NAME \ --network NETWORK \ --action allow \ --direction IN \ --target-tags dataflow \ --source-tags dataflow \ --priority 0 \ --rules tcp:12345-12346
Replace the following:
FIREWALL_RULE_NAME
: the name of your firewall ruleNETWORK
: the name of your network
Your job is shuffle-bound.
To resolve this issue, make one or more of the following changes.
Java
- If the job is not using the service-based shuffle, switch to using the
service-based Dataflow Shuffle by setting
--experiments=shuffle_mode=service
. For details and availability, see Dataflow Shuffle. - Add more workers. Try setting
--numWorkers
with a higher value when you run your pipeline. - Increase the size of the attached disk for workers. Try setting
--diskSizeGb
with a higher value when you run your pipeline. - Use an SSD-backed persistent disk. Try setting
--workerDiskType="compute.googleapis.com/projects/PROJECT_ID/zones/ZONE/diskTypes/pd-ssd"
when you run your pipeline.
Python
- If the job is not using the service-based shuffle, switch to using the
service-based Dataflow Shuffle by setting
--experiments=shuffle_mode=service
. For details and availability, see Dataflow Shuffle. - Add more workers. Try setting
--num_workers
with a higher value when you run your pipeline. - Increase the size of the attached disk for workers. Try setting
--disk_size_gb
with a higher value when you run your pipeline. - Use an SSD-backed persistent disk. Try setting
--worker_disk_type="compute.googleapis.com/projects/PROJECT_ID/zones/ZONE/diskTypes/pd-ssd"
when you run your pipeline.
Go
- If the job is not using the service-based shuffle, switch to using the
service-based Dataflow Shuffle by setting
--experiments=shuffle_mode=service
. For details and availability, see Dataflow Shuffle. - Add more workers. Try setting
--num_workers
with a higher value when you run your pipeline. - Increase the size of the attached disk for workers. Try setting
--disk_size_gb
with a higher value when you run your pipeline. - Use an SSD-backed persistent disk. Try setting
--disk_type="compute.googleapis.com/projects/PROJECT_ID/zones/ZONE/diskTypes/pd-ssd"
when you run your pipeline.
- If the job is not using the service-based shuffle, switch to using the
service-based Dataflow Shuffle by setting
Encoding errors, IOExceptions, or unexpected behavior in user code
The Apache Beam SDKs and the Dataflow workers depend on common third-party components. These components import additional dependencies. Version collisions can result in unexpected behavior in the service. Also, some libraries aren't forward-compatible. You might need to pin to the listed versions that are in scope during execution. SDK and Worker Dependencies contains a list of dependencies and their required versions.
Error running LookupEffectiveGuestPolicies
When you run a Dataflow job, you might see the following error in the log files:
OSConfigAgent Error policies.go:49: Error running LookupEffectiveGuestPolicies:
error calling LookupEffectiveGuestPolicies: code: "Unauthenticated",
message: "Request is missing required authentication credential.
Expected OAuth 2 access token, login cookie or other valid authentication credential.
This error occurs if OS configuration management is enabled for the entire project.
To resolve this issue, disable VM Manager policies that apply to the entire project. If disabling VM Manager policies for the entire project isn't possible, you can safely ignore this error and filter it out of log monitoring tools.
Exhausted resource pool
When you create a Google Cloud resource, you might see the following error for an exhausted resource pool:
ERROR: ZONE_RESOURCE_POOL_EXHAUSTED
This error occurs for temporary stock-out conditions for a specific resource in a specific zone.
To resolve the issue, you can either wait or create the same resource in another zone. As a best practice, we recommend that you distribute your resources across multiple zones and regions to tolerate outages.
A fatal error has been detected by the Java Runtime Environment
The following error occurs during worker startup:
A fatal error has been detected by the Java Runtime Environment
This error occurs if the pipeline is using Java Native Interface (JNI) to run non-Java code and that code or the JNI bindings contain an error.
googclient_deliveryattempt attribute key error
Your Dataflow job fails with one of the following errors:
The request contains an attribute key that is not valid (key=googclient_deliveryattempt). Attribute keys must be non-empty and must not begin with 'goog' (case-insensitive).
Or:
Invalid extensions name: googclient_deliveryattempt
This error occurs when your Dataflow job has the following characteristics:
- The Dataflow job uses Streaming Engine.
- The pipeline has a Pub/Sub sink.
- The pipeline uses a pull subscription.
- The pipeline uses one of the Pub/Sub service APIs to publish messages instead of using the built-in Pub/Sub I/O sink.
- Pub/Sub is using the Java or C# client library.
- The Pub/Sub subscription has a dead-letter topic.
This error occurs because when you use the Pub/Sub Java or C#
client library and a dead-letter topic for a subscription is enabled, the delivery
attempts are in the googclient_deliveryattempt
message attribute instead of
the delivery_attempt
field. For more information, see
Track delivery attempts
in the "Handle message failures" page.
To workaround this issue, make one or more of the following changes.
- Disable Streaming Engine.
- Use the built-in
Apache Beam
PubSubIO
connector instead of the Pub/Sub service API. - Use a different type of Pub/Sub subscription.
- Remove the dead-letter topic.
- Don't use the Java or C# client library with your Pub/Sub pull subscription. For other options, see Client library code samples.
- In your pipeline code, when attribute keys start with
goog
, erase the message attributes before publishing the messages.
A hot key ... was detected
The following error occurs:
A hot key HOT_KEY_NAME was detected in...
These errors occur if your data contains a hot key. A hot key is a key with enough elements to negatively affect pipeline performance. These keys limit the ability of Dataflow to process elements in parallel, which increases execution time.
To print the human-readable key to the logs when a hot key is detected in the pipeline, use the hot key pipeline option.
To resolve this issue, check that your data is evenly distributed. If a key has disproportionately many values, consider the following courses of action:
- Rekey your data. Apply a
ParDo
transform to output new key-value pairs. - For Java jobs, use the
Combine.PerKey.withHotKeyFanout
transform. - For Python jobs, use the
CombinePerKey.with_hot_key_fanout
transform. - Enable Dataflow Shuffle.
To view hot keys in the Dataflow monitoring interface, see Troubleshoot stragglers in batch jobs.
Invalid table specification in Data Catalog
When you use Dataflow SQL to create Dataflow SQL jobs, your job might fail with the following error in the log files:
Invalid table specification in Data Catalog: Could not resolve table in Data Catalog
This error occurs if the Dataflow service account doesn't have access to the Data Catalog API.
To resolve this issue, enable the Data Catalog API in the Google Cloud project that you're using to write and run queries.
Alternately, assign the roles/datacatalog.viewer
role to the
Dataflow service account.
The job graph is too large
Your job might fail with the following error:
The job graph is too large. Please try again with a smaller job graph,
or split your job into two or more smaller jobs.
This error occurs if the graph size of your job exceeds 10 MB. Certain conditions in your pipeline can cause the job graph to exceed the limit. Common conditions include:
- A
Create
transform that includes a large amount of in-memory data. - A large
DoFn
instance that is serialized for transmission to remote workers. - A
DoFn
as an anonymous inner class instance that (possibly inadvertently) pulls in a large amount of data to be serialized. - A directed acyclic graph (DAG) is being used as part of a programmatic loop that is enumerating a large list.
To avoid these conditions, consider restructuring your pipeline.
Key Commit Too Large
When running a streaming job, the following error appears in the worker log files:
KeyCommitTooLargeException
This error occurs in streaming scenarios if a very large amount of data is
grouped without using a Combine
transform, or if a large amount of data is
produced from a single input element.
To reduce the possibility of encountering this error, use the following strategies:
- Ensure that processing a single element cannot result in outputs or state modifications exceeding the limit.
- If multiple elements were grouped by a key, consider increasing the key space to reduce the elements grouped per key.
- If elements for a key are emitted at a high frequency over a short time, that might result in many GB of events for that key in windows. Rewrite the pipeline to detect keys like this and only emit an output indicating the key was frequently present in that window.
- Use sublinear space
Combine
transforms for commutative and associate operations. Don't use a combiner if it doesn't reduce space. For example, combiner for strings that just appends strings together is worse than not using combiner.
Rejecting message over 7168K
When you run a Dataflow job created from a template, the job might fail with the following error:
Error: CommitWork failed: status: APPLICATION_ERROR(3): Pubsub publish requests are limited to 10MB, rejecting message over 7168K (size MESSAGE_SIZE) to avoid exceeding limit with byte64 request encoding.
This error occurs when messages written to a dead-letter queue exceed the size limit of 7168 K. As a workaround, enable Streaming Engine, which has a higher size limit. To enable Streaming Engine, use the following pipeline option.
Java
--enableStreamingEngine=true
Python
--enable_streaming_engine=true
Request Entity Too Large
When you submit your job, one of the following errors appears in your console or terminal window:
413 Request Entity Too Large
The size of serialized JSON representation of the pipeline exceeds the allowable limit
Failed to create a workflow job: Invalid JSON payload received
Failed to create a workflow job: Request payload exceeds the allowable limit
When you encounter an error about the JSON payload when submitting your job, the JSON representation of your pipeline exceeds the maximum 20 MB request size.
The size of your job is tied to the JSON representation of the pipeline. A larger pipeline means a larger request. Dataflow has a limitation that caps requests at 20 MB.
To estimate the size of the JSON request of your pipeline, run your pipeline with the following option:
Java
--dataflowJobFile=PATH_TO_OUTPUT_FILE
Python
--dataflow_job_file=PATH_TO_OUTPUT_FILE
Go
Outputting your job as JSON is not supported in Go.
This command writes a JSON representation of your job to a file. The size of the serialized file is a good estimate of the size of the request. The actual size is slightly larger due to some additional information included the request.
Certain conditions in your pipeline can cause the JSON representation to exceed the limit. Common conditions include:
- A
Create
transform that includes a large amount of in-memory data. - A large
DoFn
instance that is serialized for transmission to remote workers. - A
DoFn
as an anonymous inner class instance that (possibly inadvertently) pulls in a large amount of data to be serialized.
To avoid these conditions, consider restructuring your pipeline.
SDK pipeline options or staging file list exceeds size limit
When running a pipeline, one of the following errors occurs:
SDK pipeline options or staging file list exceeds size limit.
Please keep their length under 256K Bytes each and 512K Bytes in total.
Or:
Value for field 'resource.properties.metadata' is too large: maximum size
These errors occur if the pipeline couldn't be started due to Compute Engine metadata limits being exceeded. These limits can't be changed. Dataflow uses Compute Engine metadata for pipeline options. The limit is documented in the Compute Engine custom metadata limitations.
The following scenarios can cause the JSON representation to exceed the limit:
- There are too many JAR files to stage.
- The
sdkPipelineOptions
request field is too large.
To estimate the size of the JSON request of your pipeline, run your pipeline with the following option:
Java
--dataflowJobFile=PATH_TO_OUTPUT_FILE
Python
--dataflow_job_file=PATH_TO_OUTPUT_FILE
Go
Outputting your job as JSON is not supported in Go.
The size of the output file from this command must be less than 256 KB. The 512 KB in the error message refers to the total size of the output file and the custom metadata options for the Compute Engine VM instance.
You can get a rough estimate of the custom metadata option for VM instance from running Dataflow jobs in the project. Choose any running Dataflow job. Take a VM instance, and then navigate to the Compute Engine VM instance details page for that VM to check for the custom metadata section. The total length of the custom metadata and the file should be less than 512 KB. An accurate estimate for the failed job is not possible, because the VMs are not spun up for failed jobs.
If your JAR list is hitting the 256-KB limit, review it and reduce any unnecessary JAR files. If it's still too large, try running the Dataflow job by using an uber JAR. For an example that demonstrates how to create and use uber JAR, see Build and deploy an Uber JAR.
If the sdkPipelineOptions
request field is too large, include the following option
when you run your pipeline. The pipeline option is the same for Java, Python, and Go.
--experiments=no_display_data_on_gce_metadata
Shuffle key too large
The following error appears in the worker log files:
Shuffle key too large
This error occurs if the serialized key emitted to a particular (Co-)GroupByKey is too large after the corresponding coder is applied. Dataflow has a limit for serialized shuffle keys.
To resolve this issue, reduce the size of the keys or use more space-efficient coders.
For more information, see production limits for Dataflow.
Total number of BoundedSource objects ... is larger than the allowable limit
One of the following errors might occur when running jobs with Java:
Total number of BoundedSource objects generated by splitIntoBundles() operation is larger than the allowable limit
Or:
Total size of the BoundedSource objects generated by splitIntoBundles() operation is larger than the allowable limit
Java
This error might occur if you're reading from a very large number of files by using
TextIO
, AvroIO
, BigQueryIO
through EXPORT, or some other file-based
source. The particular limit depends on the details of your source, but it
is on the order of tens of thousands of files in one pipeline. For example,
embedding schema in AvroIO.Read
allows fewer files.
This error might also occur if you created a custom data source for your
pipeline and the splitIntoBundles
method of your source returned a list of
BoundedSource
objects which takes up more than 20 MB when serialized.
The allowable limit for the total size of the BoundedSource
objects
generated by the splitIntoBundles()
operation of your custom source is
20 MB.
To work around this limitation, make one of the following changes:
Enable Runner V2. Runner v2 converts sources into splittable DoFns that don't have this source split limit.
Modify your custom
BoundedSource
subclass so that the total size of the generatedBoundedSource
objects is smaller han the 20-MB limit. For example, your source might generate fewer splits initially, and rely on Dynamic Work Rebalancing to further split inputs on demand.
NameError
When you execute your pipeline using the Dataflow service, the following error occurs:
NameError
This error does not occur when you execute locally, such as when you execute
using the DirectRunner
.
This error occurs if your DoFn
s are using values in the global namespace that
are not available on the Dataflow worker.
By default, global imports, functions, and variables defined in the main session are not saved during the serialization of a Dataflow job.
To resolve this issue, use one of the following methods. If your DoFn
s are
defined in the main file and reference imports and functions in the global
namespace, set the --save_main_session
pipeline option to True
. This change
pickles the state of the global namespace to and loads it on the
Dataflow worker.
If you have objects in your global namespace that can't be pickled, a pickling error occurs. If the error is regarding a module that should be available in the Python distribution, import the module locally, where it's used.
For example, instead of:
import re … def myfunc(): # use re module
use:
def myfunc(): import re # use re module
Alternatively, if your DoFn
s span multiple files, use
a different approach to packaging your workflow and
managing dependencies.
Object is subject to bucket's retention policy
When you have a Dataflow job that writes to a Cloud Storage bucket, the job fails with the following error:
Object 'OBJECT_NAME' is subject to bucket's retention policy or object retention and cannot be deleted or overwritten
You might also see the following error:
Unable to rename "gs://BUCKET"
The first error occurs when object retention is enabled on the Cloud Storage bucket that the Dataflow job is writing to. For more information, see Enable and use object retention configurations.
To resolve this issue, use one of the following workarounds:
Write to a Cloud Storage bucket that doesn't have a retention policy on the
temp
folder.Remove the retention policy from the bucket that the job writes to. For more information, see Set an object's retention configuration.
The second error can indicate that the object retention is enabled on the Cloud Storage bucket, or it can indicate that the Dataflow worker service account doesn't have permission to write to the Cloud Storage bucket.
If you see the second error and object retention is enabled on the Cloud Storage bucket, try the workarounds described previously. If object retention isn't enabled on the Cloud Storage bucket, verify whether the Dataflow worker service account has write permission on the Cloud Storage bucket. For more information, see Access Cloud Storage buckets.
Processing stuck or operation ongoing
If Dataflow spends more time executing a DoFn
than the time
specified in TIME_INTERVAL without returning, the following message is displayed.
Java
Either of the two following log messages, depending on the version:
Processing stuck in step STEP_NAME for at least TIME_INTERVAL
Operation ongoing in bundle BUNDLE_ID for at least TIME_INTERVAL without outputting or completing: at STACK_TRACE
Python
Operation ongoing for over TIME_INTERVAL in state STATE in step STEP_ID without returning. Current Traceback: TRACEBACK
Go
Operation ongoing in transform TRANSFORM_ID for at least TIME_INTERVAL without outputting or completing in state STATE
This behavior has two possible causes:
- Your
DoFn
code is slow, or waiting for some slow external operation to complete. - Your
DoFn
code might be stuck, deadlocked, or abnormally slow to finish processing.
To determine which is the case, expand the
Cloud Monitoring log entry to
see a stack trace. Look for messages that indicate that the DoFn
code is stuck
or otherwise encountering issues. If no messages are present, the issue might be
the execution speed of the DoFn
code. Consider using
Cloud Profiler or other tool to
investigate the performance of your code.
If your pipeline is built on the Java VM (using either Java or Scala), you can investigate the cause of your stuck code. Take a full thread dump of the whole JVM (not just the stuck thread) by following these steps:
- Make note of the worker name from the log entry.
- In the Compute Engine section of the Google Cloud console, find the Compute Engine instance with the worker name you noted.
- Use SSH to connect to the instance with that name.
Run the following command:
curl http://localhost:8081/threadz
Pub/Sub quota errors
When running a streaming pipeline from Pub/Sub, the following errors occur:
429 (rateLimitExceeded)
Or:
Request was throttled due to user QPS limit being reached
These errors occur if your project has insufficient Pub/Sub quota.
To find out if your project has insufficient quota, follow these steps to check for client errors:
- Go to the Google Cloud console.
- In the menu on the left, select APIs & services.
- In the Search Box, search for Cloud Pub/Sub.
- Click the Usage tab.
- Check Response Codes and look for
(4xx)
client error codes.
Request is prohibited by organization's policy
When running a pipeline, the following error occurs:
Error trying to get gs://BUCKET_NAME/FOLDER/FILE:
{"code":403,"errors":[{"domain":"global","message":"Request is prohibited by organization's policy","reason":"forbidden"}],
"message":"Request is prohibited by organization's policy"}
This error occurs if the Cloud Storage bucket is outside of your service perimeter.
To resolve this issue, create an egress rule that allows access to the bucket outside of the service perimeter.
Staged package...is inaccessible
Jobs that used to succeed might fail with the following error:
Staged package...is inaccessible
To resolve this issue:
- Verify that the Cloud Storage bucket used for staging does not have TTL settings that cause staged packages to be deleted.
Verify that the worker service account of your Dataflow project has the permission to access the Cloud Storage bucket used for staging. Gaps in permission can be due to any of the following reasons:
- The Cloud Storage bucket used for staging is present in a different project.
- The Cloud Storage bucket used for staging was migrated from fine-grained access to uniform bucket-level access. Due to the inconsistency between IAM and ACL policies, migrating the staging bucket to uniform bucket-level access disallows ACLs for Cloud Storage resources. ACLs include the permissions held by the worker service account of your Dataflow project over the staging bucket.
For more information, see Accessing Cloud Storage buckets across Google Cloud projects.
A work item has failed 4 times
The following error occurs when a batch job fails:
The job failed because a work item has failed 4 times.
This error occurs if a single operation in a batch job causes the worker code to fail four times. Dataflow fails the job, and this message is displayed.
When running in streaming mode, a bundle including a failing item is retried indefinitely, which might cause your pipeline to permanently stall.
You can't configure this failure threshold. For more details, refer to pipeline error and exception handling.
To resolve this issue, look in the Cloud Monitoring logs of the job for the four individual failures. In the worker logs, look for Error-level or Fatal-level log entries that show exceptions or errors. The exception or error should appear at least four times. If the logs only contain generic timeout errors related to accessing external resources, such as MongoDB, verify that the worker service account has permission to access the subnetwork of the resource.
Timeout in Polling Result File
The following occurs when a job fails:
Timeout in polling result file: PATH. Possible causes are:
1. Your launch takes too long time to finish. Please check the logs on stackdriver.
2. Service account SERVICE_ACCOUNT may not have enough permissions to pull
container image IMAGE_PATH or create new objects in PATH.
3. Transient errors occurred, please try again.
The issue is often related to how the Python dependencies are being installed
by using the requirements.txt
file. The Apache Beam stager downloads the source of
all dependencies from PyPi, including the sources of transitive dependencies.
Then, the wheel
compilation happens implicitly during the pip
download
command for some of the Python packages that are dependencies of apache-beam
.
A timeout issue might occur because of the requirements.txt
file.
For more information, see
the Apache Arrow team's bug tracking this issue.
The
suggested workaround
is to install apache-beam
directly in the Dockerfile. This way, the timeout
for the requirements.txt
file is not applied.
Write Correct File/Write/WriteImpl/PreFinalize failed
When running a job, the job fails intermittently, and the following error occurs:
Workflow failed. Causes: S27:Write Correct File/Write/WriteImpl/PreFinalize failed., Internal Issue (ID): ID:ID, Unable to expand file pattern gs://BUCKET_NAME/temp/FILE
This error occurs when the same subfolder is used as the temporary storage location for multiple jobs that run concurrently.
To resolve this issue, don't use the same subfolder as the temporary storage location for multiple pipelines. For each pipeline, provide a unique subfolder to use as the temporary storage location.
Element exceeds maximum protobuf message size
When you run Dataflow jobs and your pipeline has large elements, you might see errors similar to the following examples:
Exception serializing message!
ValueError: Message org.apache.beam.model.fn_execution.v1.Elements exceeds maximum protobuf size of 2GB
Or:
Buffer size ... exceeds GRPC limit 2147483548. This is likely due to a single element that is too large.
You might also see a warning similar to the following example:
Data output stream buffer size ... exceeds 536870912 bytes. This is likely due to a large element in a PCollection.
These errors occur when your pipeline contains large elements.
To resolve this issue, if you use the Python SDK, upgrade to Apache Beam version 2.57.0 or later. The Python SDK versions 2.57.0 and later improve the processing of large elements and add relevant logging.
If the errors persist after upgrading or if you're not using the Python SDK, identify the step in the job where the error happens, and try to reduce the size of the elements in that step.
When PCollection
objects in your pipeline have large elements, the RAM requirements for the pipeline increase.
Large elements can also cause runtime errors,
especially when they cross the boundaries of fused stages.
Large elements can occur when a pipeline inadvertently materializes a large
iterable. For example, a pipeline that passes the output of
a GroupByKey
operation into an unnecessary Reshuffle
operation
materializes lists as single elements. These lists potentially contain
a large number of values for each key.
If the error happens in a step that uses a side input, be aware that using side inputs can introduce a fusion barrier. Check whether the transform that produces a large element and the transform that consumes it belong to the same stage.
When constructing your pipeline, follow these best practices:
- In
PCollections
use multiple small elements instead of a single large element. - Store large blobs in external storage systems. Either use
PCollections
to pass their metadata, or use a custom coder that reduces the size of the element. - If you must pass a PCollection that can exceed 2 GB as a side input, use iterable views, such as
AsIterable
andAsMultiMap
.
The maximum size for a single element in a Dataflow job is limited to 2 GB. For more information, see Quotas and limits.
Archive job errors
The following sections contain common errors that you might encounter when you try to archive a Dataflow job by using the API.
No value is provided
When you try to archive a Dataflow job by using the API, the following error might occur:
The field mask specifies an update for the field job_metadata.user_display_properties.archived in job JOB_ID, but no value is provided. To update a field, please provide a field for the respective value.
This error occurs for one of the following reasons:
The path specified for the
updateMask
field doesn't follow the correct format. This issue can occur due to typos.The
JobMetadata
isn't correctly specified. In theJobMetadata
field, foruserDisplayProperties
, use the key-value pair"archived":"true"
.
To resolve this error, verify that the command that you pass to the API matches the required format. For more details, see Archive a job.
The API does not recognize the value
When you try to archive a Dataflow job by using the API, the following error might occur:
The API does not recognize the value VALUE for the field job_metadata.user_display_properties.archived for job JOB_ID. REASON: Archived display property can only be set to 'true' or 'false'
This error occurs when the value provided in the archive jobs key-value pair isn't a
supported value. The supported values for the archive jobs key-value pair are
"archived":"true"
and "archived":"false"
.
To resolve this error, verify that the command that you pass to the API matches the required format. For more details, see Archive a job.
Cannot update both state and mask
When you try to archive a Dataflow job by using the API, the following error might occur:
Cannot update both state and mask.
This error occurs when you try to update both the job state and the archive status in the same API call. You can't make updates to both the job state and the updateMask query parameter in the same API call.
To resolve this error, update the job state in a separate API call. Make updates to the job state before updating the job archive status.
Workflow modification failed
When you try to archive a Dataflow job by using the API, the following error might occur:
Workflow modification failed.
This error usually occurs when you try to archive a job that is running.
To resolve this error, wait until the job completes before archiving it. Completed jobs have one of the following job states:
JOB_STATE_CANCELLED
JOB_STATE_DRAINED
JOB_STATE_DONE
JOB_STATE_FAILED
JOB_STATE_UPDATED
For more information, see Detect Dataflow job completion.
Container image errors
The following sections contain common errors that you might encounter when using custom containers and steps for resolving or troubleshooting the errors. The errors are typically prefixed with the following message:
Unable to pull container image due to error: DETAILED_ERROR_MESSAGE
Permission "containeranalysis.occurrences.list" denied
The following error appears in your log files:
Error getting old patchz discovery occurrences: generic::permission_denied: permission "containeranalysis.occurrences.list" denied for project "PROJECT_ID", entity ID "" [region="REGION" projectNum=PROJECT_NUMBER projectID="PROJECT_ID"]
the Container Analysis API is required for vulnerability scanning.
For more information, see OS scanning overview and Configuring access control in the Artifact Analysis documentation.
Error syncing pod ... failed to "StartContainer"
The following error occurs during worker startup:
Error syncing pod POD_ID, skipping: [failed to "StartContainer" for CONTAINER_NAME with CrashLoopBackOff: "back-off 5m0s restarting failed container=CONTAINER_NAME pod=POD_NAME].
A pod is a colocated group of Docker containers running on a Dataflow worker. This error occurs when one of the Docker containers in the pod fails to start. If the failure is not recoverable, the Dataflow worker isn't able to start, and Dataflow batch jobs eventually fail with errors like the following:
The Dataflow job appears to be stuck because no worker activity has been seen in the last 1h.
This error typically occurs when one of the containers is continuously crashing during startup.
To understand the root cause, look for the logs captured immediately prior to the failure. To analyze the logs, use the Logs Explorer. In the Logs Explorer, limit the log files to log entries emitted from the worker with container startup errors. To limit the log entries, complete the following steps:
- In the Logs Explorer, find the
Error syncing pod
log entry. - To see the labels associated with the log entry, expand the log entry.
- Click the label associated with the
resource_name
, and then click Show matching entries.
In the Logs Explorer, the Dataflow logs are organized into
several log streams. The Error syncing pod
message is emitted in the log named
kubelet
. However, the logs from the failing container could be in a different
log stream. Each container has a name. Use the following table to determine
which log stream might contain logs relevant to the failing container.
Container name | Log names |
---|---|
sdk, sdk0, sdk1, sdk-0-0, and similar | docker |
harness | harness, harness-startup |
python, java-batch, java-streaming | worker-startup, worker |
artifact | artifact |
When you query the Logs Explorer, make sure that the query either includes the relevant log names in the query builder interface or does not have restrictions on the log name.
After you select the relevant logs, the query result might look like the following example:
resource.type="dataflow_step"
resource.labels.job_id="2022-06-29_08_02_54-JOB_ID"
labels."compute.googleapis.com/resource_name"="testpipeline-jenkins-0629-DATE-cyhg-harness-8crw"
logName=("projects/apache-beam-testing/logs/dataflow.googleapis.com%2Fdocker"
OR
"projects/apache-beam-testing/logs/dataflow.googleapis.com%2Fworker-startup"
OR
"projects/apache-beam-testing/logs/dataflow.googleapis.com%2Fworker")
Because the logs reporting the symptom of the container failure are sometimes
reported as INFO
, include INFO
logs in your analysis.
Typical causes of container failures include the following:
- Your Python pipeline has additional dependencies that are installed at
runtime, and the installation is unsuccessful. You might see errors like
pip install failed with error
. This issue might occur due to conflicting requirements, or due to a restricted networking configuration that prevents a Dataflow worker from pulling an external dependency from a public repository over the internet. A worker fails in the middle of the pipeline run due to an out of memory error. You might see an error like one of the following:
java.lang.OutOfMemoryError: Java heap space
Shutting down JVM after 8 consecutive periods of measured GC thrashing. Memory is used/total/max = 24453/42043/42043 MB, GC last/max = 58.97/99.89 %, #pushbacks=82, gc thrashing=true. Heap dump not written.
To debug an out of memory issue, see Troubleshoot Dataflow out of memory errors.
Dataflow is unable to pull the container image. For more information, see Image pull request failed with error.
The container used is not compatible with the worker VM's CPU architecture. In the harness startup logs, you might see an error like the following:
exec /opt/apache/beam/boot: exec format error
. To check the container image's architecture, rundocker image inspect $IMAGE:$TAG
and look for theArchitecture
key word. If it saysError: No such image: $IMAGE:$TAG
, you might need to pull the image first by runningdocker pull $IMAGE:$TAG
. For information on building multi-architecture images, see Build a multi-architecture container image.
After you identify the error causing the container to fail, try to address the error, and then resubmit the pipeline.
Image pull request failed with error
During worker startup, one of the following errors appears in the worker or job logs:
Image pull request failed with error
pull access denied for IMAGE_NAME
manifest for IMAGE_NAME not found: manifest unknown: Failed to fetch
Get IMAGE_NAME: Service Unavailable
These errors occur if a worker is unable to start up because the worker can't pull a Docker container image. This issue happens in the following scenarios:
- The custom SDK container image URL is incorrect
- The worker lacks credential or network access to the remote image
To resolve this issue:
- If you're using a custom container image with your job, verify that your image URL is correct and has a valid tag or digest. The Dataflow workers also need access to the image.
- Verify that public images can be pulled locally by running
docker pull $image
from an unauthenticated machine.
For private images or private workers:
- If you're using Container Registry to host your container image, it is recommended that you use Artifact Registry instead. Effective May 15, 2023, Container Registry is deprecated. If you use Container Registry, you can transition to Artifact Registry. If your images are in a different project than the one used to run your Google Cloud job, configure access control for the default Google Cloud service account.
- If using shared Virtual Private Cloud (VPC), make sure that workers can access the custom container repository host.
- Use
ssh
to connect with a running job worker VM and rundocker pull $image
to directly confirm that the worker is configured properly.
If workers fail several times in a row due to this error and work has started on a job, the job can fail with an error similar to the following message:
Job appears to be stuck.
If you remove access to the image while the job is running, either by removing the image itself or revoking the Dataflow worker Service Account Credentials or internet access to access images, Dataflow only logs errors. Dataflow doesn't fail the job. Dataflow also avoids failing long-running streaming pipelines to avoid losing pipeline state.
Other possible errors can arise from repository quota issues or outages. If you experience issues exceeding the Docker Hub quota for pulling public images or general third-party repository outages, consider using Artifact Registry as the image repository.
SystemError: unknown opcode
Your Python custom container pipeline might fail with the following error immediately after job submission:
SystemError: unknown opcode
In addition, the stack trace might include
apache_beam/internal/pickler.py
To resolve this issue, verify that the Python version that you're using locally matches the version in the container image up to the major and minor version. The difference in the patch version, such as 3.6.7 versus 3.6.8, does not create compatibility issues. The difference in minor version, such as 3.6.8 versus 3.8.2, can cause pipeline failures.
Worker errors
The following sections contain common worker errors that you might encounter and steps for resolving or troubleshooting the errors.
Call from Java worker harness to Python DoFn fails with error
If a call from the Java worker harness to a Python DoFn
fails, a relevant
error message is displayed.
To investigate the error, expand the Cloud Monitoring error log entry and look at the error message and traceback. It shows you which code failed so you can correct it if necessary. If you believe that the error is a bug in Apache Beam or Dataflow, report the bug.
EOFError: marshal data too short
The following error appears in the worker logs:
EOFError: marshal data too short
This error sometimes occurs when Python pipeline workers run out of disk space.
To resolve this issue, see No space left on device.
Failed to attach disk
When you try to launch a Dataflow job that uses C3 VMs with Persistent Disk, the job fails with one or both of the following errors:
Failed to attach disk(s), status: generic::invalid_argument: One or more operations had an error
Can not allocate sha384 (reason: -2), Spectre V2 : WARNING: Unprivileged eBPF is enabled with eIBRS on...
These errors occur when you use C3 VMs with an unsupported Persistent Disk type. For more information, see Supported disk types for C3.
To use C3 VMs with your Dataflow job, choose the
pd-ssd
worker disk type. For more information, see
Worker-level options.
Java
--workerDiskType=pd-ssd
Python
--worker_disk_type=pd-ssd
Go
disk_type=pd-ssd
No space left on device
When a job runs out of disk space, the following error might appear in the worker logs:
No space left on device
This error can occur for one of the following reasons:
- The worker persistent storage runs out of free space, which can occur for
one of the following reasons:
- A job downloads large dependencies at runtime
- A job uses large custom containers
- A job writes many temporary data to local disk
- When using Dataflow Shuffle, Dataflow sets lower default disk size. As a result, this error might occur with jobs moving from worker-based shuffle.
- The worker boot disk fills up because it's logging more than 50 entries per second.
To resolve this issue, follow these troubleshooting steps:
To see disk resources associated with a single worker, look up VM instance details for worker VMs associated with your job. Part of the disk space is consumed by the operating system, binaries, logs, and containers.
To increase persistent disk or boot disk space, adjust the disk size pipeline option.
Track disk space usage on the worker VM instances by using Cloud Monitoring. See Receive worker VM metrics from the Monitoring agent for instructions explaining how to set this up.
Look for boot disk space issues by Viewing serial port output on the worker VM instances and looking for messages like:
Failed to open system journal: No space left on device
If you have many worker VM instances, you can create a script to
run gcloud compute instances get-serial-port-output
on all of them at once.
You can review that output instead.
Python pipeline fails after one hour of worker inactivity
When using the Apache Beam SDK for Python with Dataflow Runner V2 on worker machines with many CPU cores, use Apache Beam SDK 2.35.0 or later. If your job uses a custom container, use Apache Beam SDK 2.46.0 or later.
Consider pre-building your Python container. This step can improve VM startup times and horizontal autoscaling performance. To use this feature, enable the Cloud Build API on your project and submit your pipeline with the following parameter:
‑‑prebuild_sdk_container_engine=cloud_build
.
For more information, see Dataflow Runner V2.
You can also use a custom container image with all dependencies preinstalled.
Startup of the worker pool in zone failed to bring up any of the desired workers
The following error occurs:
Startup of the worker pool in zone ZONE_NAME failed to bring up any of the desired NUMBER workers.
The project quota may have been exceeded or access control policies may be preventing the operation;
review the Cloud Logging 'VM Instance' log for diagnostics.
This error occurs for one of the following reasons:
- You have exceeded one of the Compute Engine quotas that Dataflow worker creation relies on.
- Your organization has constraints in place that prohibit some aspect of the VM instance creation process, like the account being used, or the zone being targeted.
To resolve this issue, follow these troubleshooting steps:
Review the VM Instance log
- Go to the Cloud Logging viewer
- In the Audited Resource drop-down list, select VM Instance.
- In the All logs drop-down list, select compute.googleapis.com/activity_log.
- Scan the log for any entries related to VM instance creation failure.
Check your usage of Compute Engine quotas
To view Compute Engine resource usage compared to Dataflow quotas for the zone you're targeting, run the following command:
gcloud compute regions describe [REGION]
Review the results for the following resources to see if any are exceeding quota:
- CPUS
- DISKS_TOTAL_GB
- IN_USE_ADDRESSES
- INSTANCE_GROUPS
- INSTANCES
- REGIONAL_INSTANCE_GROUP_MANAGERS
If needed, request a quota change.
Review your organization policy constraints
- Go to the Organization policies page
- Review the constraints for any that might limit VM instance creation for either the account you're using (by default, the Dataflow service account) or in the zone that you're targeting.
- If you have a policy that restricts the use of external IP addresses, turn off external IP addresses for this job. For more information about turning off external IP addresses, see Configure internet access and firewall rules.
Timed out waiting for an update from the worker
When a Dataflow job fails, the following error occurs:
Root cause: Timed out waiting for an update from the worker. For more information, see https://cloud.google.com/dataflow/docs/guides/common-errors#worker-lost-contact.
Sometimes, this error occurs when the worker runs out of memory or swap space. To resolve this issue, as a first step, try running the job again. If the job still fails and the same error occurs, try using a worker with more memory and disk space. For example, add the following pipeline startup option:
--worker_machine_type=m1-ultramem-40 --disk_size_gb=500
Changing the worker type could affect billed cost. For more information, see Troubleshoot Dataflow out of memory errors.
This error can also occur when your data contains a hot key. In this scenario, CPU utilization is high on some workers during most of the duration of the job. However, the number of workers does not reach the maximum allowed. For more information about hot keys and possible solutions, see Writing Dataflow pipelines with scalability in mind.
For additional solutions to this issue, see A hot key ... was detected.
If your Python code calls C/C++ code by using the Python extension mechanism,
check whether the extension code releases the Python Global Interpreter Lock (GIL) in computationally intensive
parts of code that don't access Python state. If the GIL is not released for a prolonged period of time, you might see error messages like:
Unable to retrieve status info from SDK harness <...> within allowed time
and SDK worker appears to be permanently unresponsive. Aborting the SDK
.
The libraries that facilitate interactions with extensions like Cython, and PyBind
have primitives to control GIL status. You can also manually release the GIL
and reacquire it before returning control to the Python interpreter by using the
Py_BEGIN_ALLOW_THREADS
and Py_END_ALLOW_THREADS
macros.
For more information, see Thread State and the Global Interpreter Lock
in the Python documentation.
It might be possible to retrieve stacktraces of a thread that is holding the GIL on a running Dataflow worker as follows:
# SSH to a currently running Dataflow worker VM, for example:
gcloud compute ssh --zone "us-central1-a" "beamapp-someworker-harness-abcd" --project "project-id"
# Find a container running the Python SDK harness.
CONTAINER_ID=`docker ps | grep python | head -1 | awk '{print $1}'`
# Start a shell in the running container.
docker exec --privileged -it $CONTAINER_ID /bin/bash
# Inspect python processes in the running container.
ps -A | grep python
PYTHON_PID=$(ps -A | grep python | head -1 | awk '{print $1}')
# Use pystack to retrieve stacktraces from the python process. Note which thread holds the GIL.
pip install pystack
pystack remote $PYTHON_PID
# To look up the native (C/C++) frames, use --native-all flag or use gdb:
pystack remote --native-all $PYTHON_PID
apt update && apt install -y gdb
gdb --quiet \
--eval-command="set pagination off" \
--eval-command="thread apply all bt" \
--eval-command "set confirm off" \
--eval-command="quit" -p $PYTHON_PID
In Python pipelines, in the default configuration, Dataflow assumes that each Python process running on the workers efficiently uses one vCPU core. If the pipeline code bypasses the GIL limitations, such as by using libraries that are implemented in C++, processing elements might use resources from more than one vCPU core, and the workers might not get enough CPU resources. To work around this issue, reduce the number of threads on the workers.
Transient errors publishing to topic
When your streaming job uses the at-least-once streaming mode and publishes to a Pub/Sub sink, the following error appears in the job logs:
There were transient errors publishing to topic
If your job runs correctly, this error is benign, and you can ignore it. Dataflow automatically retries sending the Pub/Sub messages with a backoff delay.
Java dependency issues
Incompatible classes and libraries can cause Java dependency issues. When your pipeline has Java dependency issues, one of the following errors might occur:
NoClassDefFoundError
: This error occurs when an entire class is not available during runtime.NoSuchMethodError
: This error occurs when the class in the classpath uses a version that doesn't contain the correct method or when the method signature changed.NoSuchFieldError
: This error occurs when the class in the classpath uses a version that doesn't have a field required during runtime.FATAL ERROR in native method
: This error occurs when a built-in dependency can't be loaded properly. When using uber JAR (shaded), don't include libraries that use signatures (such as Conscrypt) in the same JAR.
If your pipeline contains user-specific code and settings, the code can't contain mixed versions of libraries. If you're using a dependency management library, we recommend that you use Google Cloud Libraries BOM.
If you're using the Apache Beam SDK, to import the correct libraries BOM,
use beam-sdks-java-io-google-cloud-platform-bom
:
Maven
<dependencyManagement>
<dependencies>
<dependency>
<groupId>org.apache.beam</groupId>
<artifactId>beam-sdks-java-google-cloud-platform-bom</artifactId>
<version>BEAM_VERSION</version>
<type>pom</type>
<scope>import</scope>
</dependency>
</dependencies>
</dependencyManagement>
Gradle
dependencies {
implementation(platform("org.apache.beam:beam-sdks-java-google-cloud-platform-bom:BEAM_VERSION"))
}
For more information, see Manage pipeline dependencies in Dataflow.
InaccessibleObjectException in JDK 17 and later
When you run pipelines with the Java Platform, Standard Edition Development Kit (JDK) versions 17 and later, the following error might appear in the worker log files:
Unable to make protected METHOD accessible:
module java.MODULE does not "opens java.MODULE" to ...
This issue occurs because starting in Java version 9, open module Java virtual machine (JVM) options are needed to access JDK internals. In Java 16 and later versions, open module JVM options are always required to access JDK internals.
To resolve this issue, when you pass modules to your
Dataflow pipeline to open, use the format
MODULE/PACKAGE=TARGET_MODULE(,TARGET_MODULE)*
with the jdkAddOpenModules
pipeline option. This format
allows access to the necessary library.
For example, if the error is
module java.base does not "opens java.lang" to unnamed module @...
,
then include the following pipeline option when you run your pipeline:
--jdkAddOpenModules=java.base/java.lang=ALL-UNNAMED
For more information, see the
DataflowPipelineOptions
class documentation.
BigQuery connector errors
The following sections contain common BigQuery connector errors that you might encounter and steps for resolving or troubleshooting the errors.
quotaExceeded
When using the BigQuery connector to write to BigQuery using streaming inserts, write throughput is lower than expected, and the following error might occur:
quotaExceeded
Slow throughput might be due to your pipeline exceeding the available
BigQuery streaming insert quota. If so, quota related
error messages from BigQuery appear in the Dataflow
worker logs (look for quotaExceeded
errors).
If you see quotaExceeded
errors, to resolve this issue:
- When using the Apache Beam SDK for Java, set the BigQuery sink
option
ignoreInsertIds()
. - When using the Apache Beam SDK for Python, use the
ignore_insert_ids
option.
These settings make you eligible for a one GB per sec, per-project BigQuery streaming insert throughput. For more information on caveats related to automatic message deduplication, see the BigQuery documentation. To increase the BigQuery streaming insert quota higher than one GBps, submit a request through the Google Cloud console.
If you don't see quota related errors in worker logs, the issue might be that
default bundling or batching related parameters don't provide adequate
parallelism for your pipeline to scale. You can adjust several
Dataflow BigQuery connector related configurations to
achieve the expected performance when writing to BigQuery using
streaming inserts. For example, for Apache Beam SDK for Java, adjust
numStreamingKeys
to match the maximum number of workers and consider
increasing insertBundleParallelism
to configure BigQuery connector to
write to BigQuery using more parallel threads.
For configurations available in the Apache Beam SDK for Java, see BigQueryPipelineOptions, and for configurations available in the Apache Beam SDK for Python, see the WriteToBigQuery transform.
rateLimitExceeded
When using the BigQuery connector, the following error occurs:
rateLimitExceeded
This error occurs if BigQuery too many
API requests
are sent during a short duration. BigQuery has short term quota limits.
It's possible for your Dataflow pipeline to
temporarily exceed such a quota. In this scenario,
API requests
from your Dataflow pipeline to BigQuery might fail, which
could result in rateLimitExceeded
errors in worker logs.
Dataflow retries such failures, so you can safely ignore these
errors. If you believe that your pipeline is affected by
rateLimitExceeded
errors, contact Cloud Customer Care.
Miscellaneous errors
The following sections contain miscellaneous errors that you might encounter and steps for resolving or troubleshooting the errors.
Can not allocate sha384
Your job runs correctly, but you see the following error in the job logs:
ima: Can not allocate sha384 (reason: -2)
If your job runs correctly, this error is benign, and you can ignore it. The worker VM base images sometimes produce this message. Dataflow automatically responds to and addresses the underlying issue.
A feature request exists to change the level of this message
from WARN
to INFO
. For more information, see
Lowering the Dataflow system launch error log level to WARN or INFO.
Error initializing dynamic plugin prober
Your job runs correctly, but you see the following error in the job logs:
Error initializing dynamic plugin prober" err="error (re-)creating driver directory: mkdir /usr/libexec/kubernetes: read-only file system
If your job runs correctly, this error is benign, and you can ignore it. This error occurs when the Dataflow job tries to create a directory without the necessary write permissions, and the task fails. If your job succeeds, the directory either wasn't needed, or Dataflow addressed the underlying issue.
A feature request exists to change the level of this message
from WARN
to INFO
. For more information, see
Lowering the Dataflow system launch error log level to WARN or INFO.
No such object: pipeline.pb
When listing jobs using the
JOB_VIEW_ALL
option, the
following error occurs:
No such object: BUCKET_NAME/PATH/pipeline.pb
This error can occur if you delete the pipeline.pb
file from the staging files
for the job.
Skipping pod synchronization
Your job runs correctly, but you see one of the following errors in the job logs:
Skipping pod synchronization" err="container runtime status check may not have completed yet"
Or:
Skipping pod synchronization" err="[container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]"
If your job runs correctly, these errors are benign, and you can ignore them.
The message container runtime status check may not have completed yet
occurs
when the Kubernetes kubelet is skipping the synchronization of pods because it's
waiting for the container runtime to initialize. This scenario occurs
for a variety of reasons, such as when the container runtime has recently
started or is restarting.
The when the message includes PLEG is not healthy: pleg has yet to be successful
,
the kubelet is waiting for the pod lifecycle event generator (PLEG) to become
healthy before synchronizing pods. The PLEG is responsible for generating events
that are used by the kubelet to track the state of pods.
A feature request exists to change the level of this message
from WARN
to INFO
. For more information, see
Lowering the Dataflow system launch error log level to WARN or INFO.
Recommendations
For guidance on recommendations generated by Dataflow Insights,
see Insights.