Troubleshooting Your Pipeline

This section is a compendium of troubleshooting tips and debugging strategies that you might find helpful if you're having trouble building or running your Dataflow pipeline. This information can help you detect a pipeline failure, determine the reason behind a failed pipeline run, and suggest some courses of action to correct the problem.

Dataflow provides real-time feedback on your job, and there is a basic set of steps you can use to check error messages and logs, and to detect conditions such as a stall in your job's progress.

This section also contains a catalog of common errors you might encounter when running your Dataflow pipeline, and suggests some corrective actions and workarounds for each.

Checking Your Pipeline's Status

You can detect any errors in your pipeline runs by using the Dataflow Monitoring Interface.

  1. Go to the Google Cloud Platform Console.
  2. Select your Cloud Platform project from the project list.
  3. Click the menu in the upper left corner.
  4. Navigate to the Big Data section and click Dataflow. A list of running jobs appears in the right-hand pane.
  5. Select the pipeline job you want to view. You can see the job's status at a glance in the Status field: "Running," "Succeeded," or "Failed."
Figure 1: A list of Dataflow jobs in the Developer Console with jobs in the running, succeeded, and failed states.

Basic Troubleshooting Workflow

If one of your pipeline jobs has failed, you can select the job to view more detailed information on errors and run results. When you select a job, you can view the execution graph as well as some information about the job on the Summary page to the right of the graph. The top of the page contains a button to view logs, as well as indicators that show whether the job generated errors or warnings during execution.

Figure 2: A Dataflow Job Summary with errors indicated.

Checking Job Error Messages

You can click the Logs button to view log messages generated by your pipeline code and the Dataflow service. Filter the messages that appear in the logs panel by using the Minimum Severity drop-down menu. Select the Error filter to display error messages only.

Click the triangle icon next to each error message to expand it.

Figure 3: A list of Dataflow Job Error messages, with one message expanded.

Viewing Step Logs for Your Job

When you select a step in your pipeline graph, the logs panel toggles from displaying Job Logs generated by the Dataflow service, to showing logs from the Compute Engine instances running your pipeline step.

Figure 4: The Cloud Logging button in the Dataflow Job Summary.

Google Cloud Logging amalgamates all of the collected logs from your project's Compute Engine instances in one location. See Logging Pipeline Messages for more information on using Dataflow's various logging capabilities.

Handling Automated Pipeline Rejection

In some cases, the Dataflow service identifies that your pipeline might trigger known SDK issues. To prevent pipelines that are likely to encounter issues from being submitted, Dataflow automatically rejects your pipeline and displays the following message:

The workflow was automatically rejected by the service because it may trigger an identified bug in the
SDK (details below). If you think this identification is in error, and would like to override this
automated rejection, please re-submit this workflow with the following override flag: ${override-flag}.
Bug-details: ${bug-details-and-explanation}.
Contact for further help. Please use this identifier in your communication: ${bug-id}."

After reading the caveats in the linked bug details, if you want to try to run your pipeline anyway, you can override the automated rejection. Add the flag --experiments=<override-flag> and resubmit your pipeline.

Determining the Cause of a Pipeline Failure

Typically, a failed Dataflow pipeline run can be attributed to one of the following causes:

  • Graph or pipeline construction errors. These errors occur when Dataflow runs into a problem building the graph of steps that compose your pipeline, as described by your Dataflow program.
  • Errors in job validation. The Dataflow service validates any pipeline job you launch; errors in the validation process can prevent your job from being successfully created or executed. Validation errors can include problems with your Cloud Platform project's Cloud Storage bucket, or with your project's permissions.
  • Exceptions in worker code. These errors occur when there are errors or bugs in the user-provided code that Dataflow distributes to parallel workers, such as the DoFn instances of a ParDo transform.
  • Slow-running pipelines or lack of output. If your pipeline runs slowly or runs for a long period of time without reporting results, check your quotas for streaming data sources and sinks such as Pub/Sub. Certain transforms are also better suited to high-volume streaming pipelines than others.
  • Errors caused by transient failures in other Cloud Platform services. Your pipeline may fail because of a temporary outage or other problem in the Cloud Platform services upon which Dataflow depends, such as Compute Engine or Cloud Storage.

Detecting Graph or Pipeline Construction Errors

A graph construction error can occur when Dataflow is building the execution graph for your pipeline from the code in your Dataflow program. During graph construction time, Dataflow checks for illegal operations.

If Dataflow detects an error in graph construction, keep in mind that no job will be created on the Dataflow service, and thus you won't see any feedback in the Dataflow Monitoring Interface. Instead, you'll see an error message similar to the following in the console or terminal window where you ran your Dataflow program:


For example, in the Dataflow SDK for Java, if your pipeline attempts to perform an aggregation like GroupByKey on a globally-windowed, non-triggered, unbounded PCollection, you'll see an error message similar to the following:

... Exception in thread "main" java.lang.IllegalStateException:
... GroupByKey cannot be applied to non-bounded PCollection in the GlobalWindow without a trigger.
... Use a Window.into or Window.triggering transform prior to GroupByKey


In the Dataflow SDK for Python, if your pipeline uses type hints and the argument type in one of the transforms is not as expected, you'll see an error message similar to the following:

... in <module> run()
... in run | beam.Map('count', lambda (word, ones): (word, sum(ones))))
... in __or__ return self.pipeline.apply(ptransform, self)
... in apply transform.type_check_inputs(pvalueish)
... in type_check_inputs self.type_check_inputs_or_outputs(pvalueish, 'input')
... in type_check_inputs_or_outputs pvalue_.element_type)) Input type hint violation at group: expected Tuple[TypeVariable[K], TypeVariable[V]], got <type 'str'>

Should you encounter such an error, check your pipeline code to ensure that your pipeline's operations are legal.

Detecting Errors in Dataflow Job Validation

Once the Dataflow service has received your pipeline's graph, the service will attempt to validate your job. This includes making sure the service can access your job's associated Cloud Storage buckets for file staging and temporary output, checking for the required permissions in your Cloud Platform project, and making sure the service can access input and output sources (like files).

If your job fails the validation process, you'll see an error message in the Dataflow Monitoring Interface, as well as in your console or terminal window if you are using blocking execution. The error message will look similar to the following:


Java:

INFO: To access the Dataflow monitoring console, please navigate to
Submitted job: 2016-03-08_18_59_25-16868399470801620798
... Starting 3 workers...
... Executing operation BigQuery-Read+AnonymousParDo+BigQuery-Write
... Executing BigQuery import job "dataflow_job_16868399470801619475".
... Stopping worker pool...
... Workflow failed. Causes: ...BigQuery-Read+AnonymousParDo+BigQuery-Write failed.
Causes: ... BigQuery getting table "non_existent_table" from dataset "cws_demo" in project "" failed.
Message: Not found: Table HTTP Code: 404
... Worker pool stopped.
... run
INFO: Job finished with status FAILED
Exception in thread "main"
  Job 2016-03-08_18_59_25-16868399470801620798 failed with status FAILED


Python:

INFO:root:Created job with id: [2016-03-08_14_12_01-2117248033993412477]
... Checking required Cloud APIs are enabled.
... Job 2016-03-08_14_12_01-2117248033993412477 is in state JOB_STATE_RUNNING.
... Combiner lifting skipped for step group: GroupByKey not followed by a combiner.
... Expanding GroupByKey operations into optimizable parts.
... Lifting ValueCombiningMappingFns into MergeBucketsMappingFns
... Annotating graph with Autotuner information.
... Fusing adjacent ParDo, Read, Write, and Flatten operations
... Fusing consumer split into read
... Starting 1 workers...
... Executing operation read+split+pair_with_one+group/Reify+group/Write
... Executing failure step failure14
... Workflow failed.
Causes: ... read+split+pair_with_one+group/Reify+group/Write failed.
Causes: ... Unable to view metadata for files: gs://dataflow-samples/shakespeare/missing.txt.
... Cleaning up.
... Tearing down pending resources...
INFO:root:Job 2016-03-08_14_12_01-2117248033993412477 is in state JOB_STATE_FAILED.
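
In output like the samples above, the most useful lines are usually the ones that carry a "Causes:" entry. The following is a minimal sketch, assuming you have captured the console output as text and that it follows the format shown above. The helper name extract_failure_causes is hypothetical and is not part of any Dataflow SDK.

```python
def extract_failure_causes(console_output):
    """Return the text of every 'Causes:' entry found in captured console output.

    Assumes the log format shown in the samples above; this helper is
    hypothetical and not part of the Dataflow SDK.
    """
    marker = "Causes:"
    causes = []
    for line in console_output.splitlines():
        # A cause can follow other text on the same line, for example
        # "... Workflow failed. Causes: ...step failed."
        index = line.find(marker)
        if index == -1:
            continue
        cause = line[index + len(marker):]
        # Drop the leading "..." that stands in for elided log prefixes.
        cause = cause.lstrip(". ").strip()
        if cause:
            causes.append(cause)
    return causes


sample = """... Executing failure step failure14
... Workflow failed.
Causes: ... read+split+pair_with_one+group/Reify+group/Write failed.
Causes: ... Unable to view metadata for files: gs://dataflow-samples/shakespeare/missing.txt.
... Cleaning up."""

for cause in extract_failure_causes(sample):
    print(cause)
```

For the sample text, this prints the failed step followed by the underlying reason (the missing input file), which is typically the line to act on.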

Detecting an Exception in Worker Code

While your job is running, you may encounter errors or exceptions in your worker code. This generally means that the DoFns in your pipeline code have generated unhandled exceptions, which result in failed tasks in your Dataflow job.

Exceptions in user code (for example, your DoFn instances) are reported in the Dataflow Monitoring Interface. If you run your pipeline with blocking execution, you'll also see error messages printed in your console or terminal window, such as the following:


Java:

INFO: To access the Dataflow monitoring console, please navigate to
Submitted job: 2016-03-08_19_09_07-6448127003704955959
... Expanding GroupByKey operations into optimizable parts.
... Lifting ValueCombiningMappingFns into MergeBucketsMappingFns
... Annotating graph with Autotuner information.
... Fusing adjacent ParDo, Read, Write, and Flatten operations
... Starting 1 workers...
... Executing operation TextIO.Write/DataflowPipelineRunner.BatchTextIOWrite/DataflowPipelineRunner.ReshardForWrite/GroupByKey/Create
... Executing operation AnonymousParDo+TextIO.Write/DataflowPipelineRunner.BatchTextIOWrite/DataflowPipelineRunner.ReshardForWrite/Window.Into()+TextIO.Write/DataflowPipelineRunner.BatchTextIOWrite/DataflowPipelineRunner.ReshardForWrite/RandomKey+TextIO.Write/DataflowPipelineRunner.BatchTextIOWrite/DataflowPipelineRunner.ReshardForWrite/GroupByKey/Reify+TextIO.Write/DataflowPipelineRunner.BatchTextIOWrite/DataflowPipelineRunner.ReshardForWrite/GroupByKey/Write
... Workers have started successfully.
... java.lang.ArithmeticException: / by zero
... Executing operation TextIO.Write/DataflowPipelineRunner.BatchTextIOWrite/DataflowPipelineRunner.ReshardForWrite/GroupByKey/Close
... Executing operation TextIO.Write/DataflowPipelineRunner.BatchTextIOWrite/DataflowPipelineRunner.ReshardForWrite/GroupByKey/Read+TextIO.Write/DataflowPipelineRunner.BatchTextIOWrite/DataflowPipelineRunner.ReshardForWrite/GroupByKey/GroupByWindow+TextIO.Write/DataflowPipelineRunner.BatchTextIOWrite/DataflowPipelineRunner.ReshardForWrite/Ungroup+TextIO.Write/DataflowPipelineRunner.BatchTextIOWrite/DataflowPipelineRunner.BatchTextIONativeWrite
... Stopping worker pool...
... Worker pool stopped.
... Cleaning up.

Note: The Dataflow service retries failed tasks up to 4 times in batch mode, and an unlimited number of times in streaming mode. In batch mode, your job will fail; in streaming, it may stall indefinitely.


Python:

INFO:root:Job 2016-03-08_14_21_32-8974754969325215880 is in state JOB_STATE_RUNNING.
INFO:root:... Expanding GroupByKey operations into optimizable parts.
INFO:root:... Lifting ValueCombiningMappingFns into MergeBucketsMappingFns
INFO:root:... Annotating graph with Autotuner information.
INFO:root:... Fusing adjacent ParDo, Read, Write, and Flatten operations
INFO:root:...: Starting 1 workers...
INFO:root:...: Executing operation group/Create
INFO:root:...: Value "group/Session" materialized.
INFO:root:...: Executing operation read+split+pair_with_one+group/Reify+group/Write
INFO:root:Job 2016-03-08_14_21_32-8974754969325215880 is in state JOB_STATE_RUNNING.
INFO:root:...: ...: Workers have started successfully.
INFO:root:Job 2016-03-08_14_21_32-8974754969325215880 is in state JOB_STATE_RUNNING.
INFO:root:...: Traceback (most recent call last):
  File ".../dataflow_worker/", line 384, in do_work self.current_executor.execute(work_item.map_task)
  File ".../apache_beam/examples/wordcount.runfiles/google3/third_party/py/apache_beam/examples/", line 73, in <lambda>
ValueError: invalid literal for int() with base 10: 'www'

Note: The Dataflow service retries failed tasks up to 4 times.

Consider guarding against errors in your code by adding exception handlers. For example, if you'd like to drop elements that fail some custom input validation done in a ParDo, handle the exception within your DoFn and drop the element. You can also track failing elements in a few different ways:

  • You can create an Aggregator to keep track of error counts.

    Java: See the Aggregator class, and an example of how to create a custom Aggregator.

    Python: In the Dataflow SDK for Python, you can find the module in the package. See an example of how to create a custom aggregator.

  • You can log the failing elements and check the output using Cloud Logging.
  • You can have your ParDo write the failing elements to a side output for later inspection.
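
The three tracking options above can be combined in one place. The following is a plain-Python sketch of the pattern and deliberately does not use the SDK API: the returned list of valid results stands in for the main output, the failed list stands in for a side output, and error_count stands in for an Aggregator. All names are hypothetical.

```python
import logging


def parse_counts(elements):
    """Parse string elements to ints, setting aside the ones that fail.

    Plain-Python stand-in for the body of a DoFn: `parsed` plays the role of
    the main output, `failed` plays the role of a side output, and
    `error_count` plays the role of an Aggregator. All names are hypothetical.
    """
    parsed = []
    failed = []
    error_count = 0
    for element in elements:
        try:
            parsed.append(int(element))
        except ValueError:
            # Handle the exception here instead of letting it fail the task.
            error_count += 1
            failed.append(element)
            logging.warning("Dropping element that failed validation: %r", element)
    return parsed, failed, error_count


parsed, failed, error_count = parse_counts(["1", "2", "www", "4"])
print(parsed, failed, error_count)
```

The input 'www' is the same value that raises the ValueError in the Python traceback shown earlier. Here it is counted, logged, and set aside rather than causing a retried and eventually failed task.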

Troubleshooting Slow-Running Pipelines or Lack of Output


If you have a high-volume streaming pipeline that is running slowly or stalled, there are a few things you can check:

Pub/Sub Quota

If your pipeline reads input from Google Cloud Pub/Sub, your Cloud Platform project may have insufficient Pub/Sub quota. One indication of this is if your job is generating a high number of 429 (Rate Limit Exceeded) errors. Try the following steps to check for such errors:

  1. Go to the Google Cloud Platform Console.
  2. In the left-hand navigation pane, click API Manager.
  3. In the Search Box, search for Google Cloud Pub/Sub.
  4. Click the Usage tab.
  5. Check Response Codes and look for (4xx) client error codes.
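
If you also keep your own record of the HTTP status codes returned by Pub/Sub requests, you can quantify how often you are being rate limited. This is a minimal sketch under that assumption; the function name and the sample list of codes are hypothetical.

```python
from collections import Counter


def rate_limited_fraction(response_codes):
    """Return the fraction of responses that were 429 (Rate Limit Exceeded)."""
    counts = Counter(response_codes)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return counts[429] / total


# Hypothetical sample of status codes collected from request logs.
codes = [200, 200, 429, 429, 429, 200, 404, 429]
print(rate_limited_fraction(codes))
```

A fraction that stays high over time suggests that quota, rather than pipeline logic, is limiting throughput.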

Use .withFanout in Your Combine Transforms

If your pipeline processes high-volume unbounded PCollections, we recommend:

  • Use Combine.Globally.withFanout instead of Combine.Globally.
  • Use Combine.PerKey.withHotKeyFanout instead of Count.PerKey.


This feature is not yet supported in the Dataflow SDK for Python.
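
To see why fanout helps, it can be useful to look at the idea outside of any SDK. The following plain-Python sketch is a conceptual illustration only, not the SDK implementation: values are spread across several intermediate shards, each shard is combined on its own, and the partial results are combined once more. The function name and parameters are hypothetical, and the approach assumes an associative and commutative combine function.

```python
import random


def combine_with_fanout(values, fanout, combine_fn=sum, seed=None):
    """Combine values in two stages: per-shard partials, then one final combine.

    Conceptual illustration only. Requires an associative, commutative
    combine_fn such as sum, min, or max.
    """
    rng = random.Random(seed)
    shards = [[] for _ in range(fanout)]
    for value in values:
        # Spread the load so that no single worker sees every value.
        shards[rng.randrange(fanout)].append(value)
    # Stage 1: combine each non-empty shard (these could run in parallel).
    partials = [combine_fn(shard) for shard in shards if shard]
    # Stage 2: combine the much smaller list of partial results.
    return combine_fn(partials)


values = list(range(1, 101))
print(combine_with_fanout(values, fanout=4, seed=0))
```

The result equals a direct sum(values) regardless of how values land in shards; what changes is that no single combining step has to process the entire input.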

Common Errors and Courses of Action

This section describes some common errors you might encounter when running your Dataflow job, and suggests some courses of action for correcting or otherwise dealing with those errors.

RPC timed out Exceptions, DEADLINE_EXCEEDED Exceptions, or Server Unresponsive Errors

If you encounter RPC timeouts, DEADLINE_EXCEEDED exceptions, or Server Unresponsive errors while your job runs, these typically indicate one of two problems:

  • The Google Compute Engine network used for your job may be missing a firewall rule. The firewall rule needs to enable all TCP traffic among the VMs in the project. The network might be either the network named default or a network you specified in your pipeline options.
  • Your job is shuffle-bound. Consider one of, or a combination of, the following courses of action:


    Java:

    • Add more workers. Try setting --numWorkers with a higher value when you run your pipeline.
    • Increase the size of the attached disk for workers. Try setting --diskSizeGb with a higher value when you run your pipeline.
    • Use an SSD-backed persistent disk. Try setting --workerDiskType="" when you run your pipeline.

    Python:

    • Add more workers. Try setting --num_workers with a higher value when you run your pipeline.
    • Increase the size of the attached disk for workers. Try setting --disk_size_gb with a higher value when you run your pipeline.
    • Use an SSD-backed persistent disk. Try setting --worker_disk_type="" when you run your pipeline.
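
If you launch pipelines from a script, the Python flags above can be assembled programmatically. This is a small sketch; the helper name is hypothetical and the values 10 and 500 are purely illustrative, not recommendations. It emits only the --num_workers and --disk_size_gb flags named above.

```python
def shuffle_tuning_args(num_workers=None, disk_size_gb=None):
    """Build Python SDK command-line arguments for a shuffle-bound job.

    Only the --num_workers and --disk_size_gb flags named above are emitted;
    the helper itself is hypothetical.
    """
    args = []
    if num_workers is not None:
        args.append("--num_workers=%d" % num_workers)
    if disk_size_gb is not None:
        args.append("--disk_size_gb=%d" % disk_size_gb)
    return args


# Illustrative values only.
print(shuffle_tuning_args(num_workers=10, disk_size_gb=500))
```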

Disk Space Errors such as RESOURCE_EXHAUSTED: IO error: No space left on disk

These errors usually indicate that you have allocated insufficient local disk space to process your job. If you are running your job with default settings, your job is running on 3 workers, each with 250 GB of local disk space, and with no auto-scaling. Consider modifying the default settings to increase the number of workers available to your job, to increase the default disk size per worker, or to enable auto-scaling.

413 Request Entity Too Large / "The size of serialized JSON representation of the pipeline exceeds the allowable limit"

If you encounter an error about the JSON payload when submitting your job, it means your pipeline's JSON representation exceeds the maximum 10MB request size. These errors might appear as one of the following messages in your console or terminal window:

  • 413 Request Entity Too Large
  • "The size of serialized JSON representation of the pipeline exceeds the allowable limit"
  • "Failed to create a workflow job: Invalid JSON payload received"
  • "Failed to create a workflow job: Request payload exceeds the allowable limit"

The size of your job is specifically tied to the JSON representation of the pipeline; a larger pipeline means a larger request. Dataflow currently has a limitation that caps requests at 10MB.

To estimate the size of your pipeline's JSON request, run your pipeline with the following option:


Java:

--dataflowJobFile=< path to output file >

Python:

--dataflow_job_file=< path to output file >

This option writes a JSON representation of your job to a file. The size of the serialized file is a good estimate of the size of the request; the actual size will be slightly larger due to some additional information included in the request.
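
Once the job file exists, you can compare its size against the limit. The following is a minimal sketch; the helper name is hypothetical, and it assumes that "10MB" means 10 * 1024 * 1024 bytes, which the text above does not specify. The temporary file stands in for a real job file.

```python
import os
import tempfile

# Assumption: the 10MB cap is treated here as 10 * 1024 * 1024 bytes.
MAX_REQUEST_BYTES = 10 * 1024 * 1024


def job_file_headroom(path, limit=MAX_REQUEST_BYTES):
    """Return how many bytes remain before the job file reaches the limit.

    A negative result means the serialized job already exceeds the limit.
    """
    return limit - os.path.getsize(path)


# Stand-in for a file written with the job file option described above.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as handle:
    handle.write("{}")
    job_file = handle.name

print(job_file_headroom(job_file))
os.remove(job_file)
```

Because the actual request is slightly larger than the file, treat a small positive headroom as a warning rather than a guarantee.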

Certain conditions in your pipeline can cause the JSON representation to exceed the limit. Common conditions include:

  • A Create transform that includes a large amount of in-memory data.
  • A large DoFn instance that is serialized for transmission to remote workers.
  • A DoFn as an anonymous inner class instance that (possibly inadvertently) pulls in a large amount of data to be serialized.

Consider restructuring your pipeline to avoid these conditions.
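
One common restructuring is to carry a reference to the data instead of the data itself. The sketch below illustrates the size difference using Python's standard pickle module as a stand-in for the SDK's own serialization, which may differ. The dictionaries stand in for the fields a DoFn instance would carry, and the bucket path is hypothetical.

```python
import pickle


def serialized_size(state):
    """Return the number of bytes produced by pickling `state`."""
    return len(pickle.dumps(state))


# Stand-ins for the fields a DoFn instance would carry when serialized.
lookup_table = {i: "value-%d" % i for i in range(100000)}
embedded_state = {"lookup_table": lookup_table}
# Hypothetical location; workers would load the table from here at run time.
path_state = {"lookup_table_path": "gs://example-bucket/lookup.csv"}

print(serialized_size(embedded_state) > serialized_size(path_state))
```

The embedded version grows with the data, while the path-based version stays at a few dozen bytes no matter how large the table becomes.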

"Total number of BoundedSource objects generated by splitIntoBundles() operation is larger than the allowable limit" or "Total size of the BoundedSource objects generated by splitIntoBundles() operation is larger than the allowable limit".


You might encounter this error if you're reading from a very large number of files via TextIO, AvroIO or some other file-based source. The particular limit depends on the details of your source (e.g. embedding schema in AvroIO.Read will allow fewer files), but it is on the order of tens of thousands of files in one pipeline.

You might also encounter this error if you've created a custom data source for your pipeline and your source's splitIntoBundles method returned a list of BoundedSource objects which takes up more than 20MB when serialized.

The allowable limit for the total size of the BoundedSource objects generated by your custom source's splitIntoBundles() operation is 20MB. You can work around this limitation by modifying your custom BoundedSource subclass so that the total size of the generated BoundedSource objects is smaller than the 20MB limit. For example, your source might generate fewer splits initially, and rely on Dynamic Work Rebalancing to further split inputs on demand.
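
The idea of generating fewer initial splits can be sketched independently of the SDK. The following plain-Python function is a hypothetical illustration, not the BoundedSource API: it plans byte ranges for an input and enlarges the bundle size whenever the requested size would produce more than a chosen maximum number of splits.

```python
def plan_splits(total_bytes, desired_bundle_bytes, max_splits):
    """Return (start, end) byte ranges covering [0, total_bytes).

    If the desired bundle size would yield more than max_splits ranges,
    the bundle size is increased so that at most max_splits are produced.
    """
    if total_bytes <= 0:
        return []
    if desired_bundle_bytes <= 0 or max_splits <= 0:
        raise ValueError("desired_bundle_bytes and max_splits must be positive")
    # Smallest bundle size that keeps the split count within max_splits
    # (ceiling division).
    min_bundle_bytes = -(-total_bytes // max_splits)
    bundle_bytes = max(desired_bundle_bytes, min_bundle_bytes)
    return [
        (start, min(start + bundle_bytes, total_bytes))
        for start in range(0, total_bytes, bundle_bytes)
    ]


splits = plan_splits(total_bytes=1000, desired_bundle_bytes=10, max_splits=4)
print(splits)
```

With these illustrative numbers, a request for 10-byte bundles would have produced 100 splits; the cap reduces that to 4 larger ranges, leaving any further subdivision to be done on demand.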

Jobs that used to run, now fail with "Staged inaccessible"

Verify that the Cloud Storage bucket used for staging does not have TTL settings that cause staged packages to be deleted.

Encoding errors, IOExceptions, or unexpected behavior in user code.

The Dataflow service's SDKs and worker depend on common third-party components, which themselves import various dependencies. Version collisions can result in unexpected behavior in the service. If you are using any of these packages in your code, be aware that some libraries are not forward-compatible, and you may need to pin to the listed versions that will be in scope during execution. To determine whether your JAR uses a conflicting version, inspect the dependency tree of your project. You can generate the dependency tree with various tools, such as Maven.


SDK 1.8.0 dependencies

GroupId ArtifactId Version

Additionally, the Dataflow Worker starts with the following logging JARs prepended to the classpath:

GroupId ArtifactId Version
