Common error guidance

This page describes some common errors you might encounter when running your Cloud Dataflow job, and suggests some courses of action for dealing with those errors.

The errors are grouped into the following categories:

  • Job messages
  • Job submission errors
  • Worker logs (Stackdriver)

Job messages

"The job failed because a work item failed 4 times."

If a single operation causes the worker code to fail four times (by throwing an exception or crashing), Cloud Dataflow fails the job and displays the message a work item failed 4 times.

Look in the job's Stackdriver worker logs for the four individual failures: search for Error-level or Fatal-level log entries that show exceptions or errors. You should see at least four such entries.

Encoding errors, IOExceptions, or unexpected behavior in user code.

The Apache Beam SDKs and the Cloud Dataflow workers depend on common third-party components, which in turn import additional dependencies. Version collisions can result in unexpected behavior in the service. If you are using any of these packages in your code, be aware that some libraries are not forward-compatible, and you may need to pin to the listed versions that will be in scope during execution. SDK and Worker Dependencies contains a list of dependencies and their required versions.

Jobs that used to run now fail with "Staged package...is inaccessible"

Verify that the Cloud Storage bucket used for staging does not have TTL settings that cause staged packages to be deleted.

Job submission errors

"413 Request Entity Too Large" / "The size of serialized JSON representation of the pipeline exceeds the allowable limit"

If you encounter an error about the JSON payload when submitting your job, it means your pipeline's JSON representation exceeds the maximum 20 MB request size. These errors might appear as one of the following messages in your console or terminal window:

  • 413 Request Entity Too Large
  • "The size of serialized JSON representation of the pipeline exceeds the allowable limit"
  • "Failed to create a workflow job: Invalid JSON payload received"
  • "Failed to create a workflow job: Request payload exceeds the allowable limit"

The size of your job is specifically tied to the JSON representation of the pipeline. A larger pipeline means a larger request. Cloud Dataflow currently has a limitation that caps requests at 20 MB.

To estimate the size of your pipeline's JSON request, run your pipeline with the following option:

Java: SDK 2.x

--dataflowJobFile=< path to output file >

Python

--dataflow_job_file=< path to output file >

Java: SDK 1.x

--dataflowJobFile=< path to output file >

Running your pipeline with this option writes a JSON representation of your job to a file. The size of the serialized file is a good estimate of the size of the request; the actual size will be slightly larger due to some additional information included in the request.

Certain conditions in your pipeline can cause the JSON representation to exceed the limit. Common conditions include:

  • A Create transform that includes a large amount of in-memory data.
  • A large DoFn instance that is serialized for transmission to remote workers.
  • A DoFn as an anonymous inner class instance that (possibly inadvertently) pulls in a large amount of data to be serialized.

To avoid these conditions, consider restructuring your pipeline.
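
For example, a DoFn declared as a standalone class that builds its large state on the workers, rather than capturing it from an enclosing scope or holding it in a serialized field, keeps that data out of the job request. The sketch below assumes the Java 2.x SDK; the class and field names are illustrative, not part of this page:

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.beam.sdk.transforms.DoFn;

    // Instead of an anonymous inner DoFn (which serializes its enclosing object and any
    // large fields along with the job request), use a standalone DoFn that builds its
    // large state on the worker in @Setup.
    class LargeLookupDoFn extends DoFn<String, String> {

      // transient: excluded from serialization and rebuilt on each worker.
      private transient Map<String, String> lookup;

      @Setup
      public void setup() {
        lookup = new HashMap<>();
        // Populate the map here from an external source (for example, a file in
        // Cloud Storage) rather than embedding the data in the pipeline graph.
      }

      @ProcessElement
      public void processElement(ProcessContext c) {
        String replacement = lookup.get(c.element());
        c.output(replacement != null ? replacement : c.element());
      }
    }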

"The job graph is too large. Please try again with a smaller job graph, or split your job into two or more smaller jobs."

Your job's graph size must not exceed 10 MB. Certain conditions in your pipeline can cause the job graph to exceed the limit. Common conditions include:

  • A Create transform that includes a large amount of in-memory data.
  • A large DoFn instance that is serialized for transmission to remote workers.
  • A DoFn as an anonymous inner class instance that (possibly inadvertently) pulls in a large amount of data to be serialized.

To avoid these conditions, consider restructuring your pipeline.
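
For instance, rather than embedding a large in-memory collection in the graph with a Create transform, you can stage the data in Cloud Storage and read it at execution time. A minimal Java (SDK 2.x) sketch; the bucket path is illustrative:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.values.PCollection;

    public class ReadInsteadOfCreate {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // Instead of Create.of(hugeInMemoryList), which serializes every element into
        // the job graph, read the same data from Cloud Storage at execution time.
        PCollection<String> lines = p.apply(TextIO.read().from("gs://my-bucket/input/*.txt"));

        // ... apply further transforms to 'lines' ...

        p.run();
      }
    }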

"Total number of BoundedSource objects generated by splitIntoBundles() operation is larger than the allowable limit" or "Total size of the BoundedSource objects generated by splitIntoBundles() operation is larger than the allowable limit".

Java: SDK 2.x

You might encounter this error if you're reading from a very large number of files via TextIO, AvroIO, or some other file-based source. The particular limit depends on the details of your source (e.g. embedding schema in AvroIO.Read will allow fewer files), but it is on the order of tens of thousands of files in one pipeline.

You might also encounter this error if you've created a custom data source for your pipeline and your source's splitIntoBundles method returned a list of BoundedSource objects which takes up more than 20 MB when serialized.

The allowable limit for the total size of the BoundedSource objects generated by your custom source's splitIntoBundles() operation is 20 MB. You can work around this limitation by modifying your custom BoundedSource subclass so that the total size of the generated BoundedSource objects is smaller than the 20 MB limit. For example, your source might generate fewer splits initially, and rely on Dynamic Work Rebalancing to further split inputs on demand.
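
If the error comes from reading a very large number of files with TextIO in the 2.x SDK, one option (a general Beam technique, not specific to this page's guidance) is to hint that the filepattern matches many files, which may let the runner expand the pattern at execution time rather than enumerating one source per file at submission time. A sketch with an illustrative filepattern:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.values.PCollection;

    public class ReadManyFiles {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // The hint tells the runner that the pattern matches a very large number of files,
        // so it can choose an execution strategy that does not generate one BoundedSource
        // per file when the job is submitted.
        PCollection<String> lines =
            p.apply(TextIO.read()
                .from("gs://my-bucket/logs/*/*.txt")
                .withHintMatchesManyFiles());

        p.run();
      }
    }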

Java: SDK 1.x

You might encounter this error if you're reading from a very large number of files via TextIO, AvroIO, or some other file-based source. The particular limit depends on the details of your source (e.g. embedding schema in AvroIO.Read will allow fewer files), but it is on the order of tens of thousands of files in one pipeline.

You might also encounter this error if you've created a custom data source for your pipeline and your source's splitIntoBundles method returned a list of BoundedSource objects which takes up more than 20 MB when serialized.

The allowable limit for the total size of the BoundedSource objects generated by your custom source's splitIntoBundles() operation is 20 MB. You can work around this limitation by modifying your custom BoundedSource subclass so that the total size of the generated BoundedSource objects is smaller than the 20 MB limit. For example, your source might generate fewer splits initially, and rely on Dynamic Work Rebalancing to further split inputs on demand.

Worker logs (Stackdriver)

"Processing stuck in step <step_id> for at least <time_interval> without outputting or completing in state finish at <stack_trace>"

Cloud Dataflow displays this message when a DoFn has been executing for longer than the time specified in <time_interval> without returning.

This error has two possible causes:

  • Your DoFn code is simply slow, or is waiting for a slow external operation to complete. In this case, you can ignore the warning.
  • Your DoFn code might be stuck, deadlocked, or abnormally slow to finish processing. If you believe this is the case, expand the Stackdriver log entry to see a stack trace of the stuck code.

If your pipeline is built on the Java VM (using either Java or Scala), you can further investigate the cause of the stuck code by taking a full thread dump of the whole JVM (not just the stuck thread). Follow these steps:

  1. Make note of the worker name from the log entry.
  2. In the Compute Engine section of the GCP Console, find the Compute Engine instance with the worker name you noted.
  3. SSH into the instance with that name.
  4. Run the following command:

    curl http://localhost:8081/threadz
    

RPC timeout exceptions, "DEADLINE_EXCEEDED" exceptions, or "Server Unresponsive" errors

If you encounter RPC timeouts, DEADLINE_EXCEEDED exceptions, or Server Unresponsive errors while your job runs, these typically indicate one of two problems:

  • The VPC network used for your job might be missing a firewall rule. The firewall rule needs to enable all TCP traffic among VMs in the VPC network you specified in your pipeline options. See Specifying your network and subnetwork for more details.

  • Your job is shuffle-bound. Consider one, or a combination, of the following courses of action (a programmatic Java sketch follows this list):

    Java: SDK 2.x

    • Add more workers. Try setting --numWorkers with a higher value when you run your pipeline.
    • Increase the size of the attached disk for workers. Try setting --diskSizeGb with a higher value when you run your pipeline.
    • Use an SSD-backed persistent disk. Try setting --workerDiskType="compute.googleapis.com/projects/<project>/zones/<zone>/diskTypes/pd-ssd" when you run your pipeline.

    Python

    • Add more workers. Try setting --num_workers with a higher value when you run your pipeline.
    • Increase the size of the attached disk for workers. Try setting --disk_size_gb with a higher value when you run your pipeline.
    • Use an SSD-backed persistent disk. Try setting --worker_disk_type="compute.googleapis.com/projects/<project>/zones/<zone>/diskTypes/pd-ssd" when you run your pipeline.

    Java: SDK 1.x

    • Add more workers. Try setting --numWorkers with a higher value when you run your pipeline.
    • Increase the size of the attached disk for workers. Try setting --diskSizeGb with a higher value when you run your pipeline.
    • Use an SSD-backed persistent disk. Try setting --workerDiskType="compute.googleapis.com/projects/<project>/zones/<zone>/diskTypes/pd-ssd" when you run your pipeline.
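
As a programmatic alternative to the Java flags above, the same settings can be applied on the job's options in code. A minimal sketch assuming the Java 2.x SDK and the Dataflow runner's DataflowPipelineOptions; all values and names are illustrative:

    import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public class ShuffleBoundOptions {
      public static void main(String[] args) {
        DataflowPipelineOptions options =
            PipelineOptionsFactory.fromArgs(args).withValidation().as(DataflowPipelineOptions.class);

        // Equivalent to --numWorkers, --diskSizeGb, and --workerDiskType on the command line.
        options.setNumWorkers(50);
        options.setDiskSizeGb(250);
        options.setWorkerDiskType(
            "compute.googleapis.com/projects/my-project/zones/us-central1-f/diskTypes/pd-ssd");

        // ... create and run the pipeline with these options ...
      }
    }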

A call from the Java worker harness to a Python DoFn failed with the error <error message>

If a DoFn implemented in Python fails and throws an exception, a relevant error message is displayed.

Expand the Stackdriver error log entry and look at the error message and traceback. It shows you which code failed so you can correct it if necessary. If you believe that this is a bug in Apache Beam or Cloud Dataflow, please report the bug.

No space left on device

If a job shuffles a large amount of data or writes temporary data to local disks, the worker persistent storage might run out of free space.

If you see a message notifying you of no space left on the device, increase the size of your workers' persistent disks by setting the relevant pipeline option. For Java jobs, use the flag --diskSizeGb. For Python jobs, use --disk_size_gb.

Disk space errors such as "RESOURCE_EXHAUSTED: IO error: No space left on disk"

These errors usually indicate that you have allocated insufficient local disk space to process your job. If you are running your job with default settings, your job is running on 3 workers, each with 25 GB of local disk space, and with no autoscaling. Consider modifying the default settings to increase the number of workers available to your job, to increase the default disk size per worker, or to enable autoscaling.
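
For Java (SDK 2.x) pipelines, one way to change those defaults in code is through the Dataflow worker pool options; a minimal sketch with illustrative values:

    import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
    import org.apache.beam.runners.dataflow.options.DataflowPipelineWorkerPoolOptions.AutoscalingAlgorithmType;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public class ResourceOptions {
      public static void main(String[] args) {
        DataflowPipelineOptions options =
            PipelineOptionsFactory.fromArgs(args).withValidation().as(DataflowPipelineOptions.class);

        // Enable throughput-based autoscaling, allow the worker pool to grow, and
        // give each worker a larger local disk.
        options.setAutoscalingAlgorithm(AutoscalingAlgorithmType.THROUGHPUT_BASED);
        options.setMaxNumWorkers(20);
        options.setDiskSizeGb(100);

        // ... create and run the pipeline with these options ...
      }
    }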

"Bad request" warnings in shuffler logs

During the execution of a Cloud Dataflow job, Stackdriver logs might display a series of warnings similar to:

    Unable to update setup work item <step_id> error: generic::invalid_argument: Http(400) Bad Request
    Update range task returned 'invalid argument'. Assuming lost lease for work with id <lease_id>
    with expiration time: <timestamp>, now: <timestamp>. Full status: generic::invalid_argument: Http(400) Bad Request

"Bad request" warnings appear because worker state information is stale or out of sync due to processing delays. Often, your Cloud Dataflow job will succeed despite these "bad request" warnings. If that is the case, ignore the warnings.

A hot key was detected in step <step_id> with age of <time_interval>s

These errors indicate that you have a hot key. A hot key is a key with enough elements to negatively impact pipeline performance. These keys limit Cloud Dataflow's ability to process elements in parallel, which increases execution time.

Check that your data is evenly distributed. If a key has disproportionately many values, consider the following courses of action:

  • Rekey your data. Apply a ParDo transform to output new key-value pairs.
  • For Java jobs, use the Combine.PerKey.withHotKeyFanout transform (see the sketch after this list).
  • For Python jobs, use the CombinePerKey.with_hot_key_fanout transform.
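
A minimal Java sketch of the fanout approach; the key and value types and the fanout value are illustrative:

    import org.apache.beam.sdk.transforms.Combine;
    import org.apache.beam.sdk.transforms.Sum;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;

    public class HotKeyExample {
      // Sums values per key, spreading each hot key's elements across intermediate
      // fanout keys so that no single worker has to combine all of them alone.
      static PCollection<KV<String, Long>> sumPerKey(PCollection<KV<String, Long>> input) {
        return input.apply(
            Combine.<String, Long, Long>perKey(Sum.ofLongs()).withHotKeyFanout(16));
      }
    }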
