Troubleshoot Flex Templates

This page provides troubleshooting tips and debugging strategies for Dataflow Flex Templates. It can help you detect a polling timeout, determine the reason behind the timeout, and correct the problem. It also covers job launch failures and early startup issues.

Troubleshoot polling timeouts

This section provides steps for identifying the cause of polling timeouts.

Polling timeouts

Your Flex Template job might return the following error message:

Timeout in polling result file: ${file_path}.
Service account: ${service_account_email}
Image URL: ${image_url}
Troubleshooting guide at https://cloud.google.com/dataflow/docs/guides/common-errors#timeout-polling

This error can occur for the following reasons:

  1. The base Docker image was overridden.
  2. The service account shown in place of ${service_account_email} is missing required permissions.
  3. External IP addresses are disabled, and VMs can't connect to the set of external IP addresses used by Google APIs and services.
  4. The program that creates the graph takes too long to finish.
  5. Pipeline options are being overwritten.
  6. (Python only) There is a problem with the requirements.txt file.
  7. There was a transient error.

To resolve this issue, first check for transient errors by checking the job logs and retrying. If those steps don't resolve the issue, try the following troubleshooting steps.
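The later steps on this page use the three values that appear in the error message: the result file path, the service account, and the image URL. When triaging several failed launches, it can help to extract them programmatically. The following is a minimal sketch, assuming the message keeps the three labels shown above; the function name and the sample values in the usage note are hypothetical.

```python
# Labels as they appear in the polling timeout message, mapped to field names.
_LABELS = {
    "Timeout in polling result file:": "file_path",
    "Service account:": "service_account",
    "Image URL:": "image_url",
}


def parse_polling_timeout(message: str) -> dict:
    """Extract the file path, service account, and image URL from the message."""
    fields = {}
    for line in message.splitlines():
        line = line.strip()
        for label, key in _LABELS.items():
            if line.startswith(label):
                fields[key] = line[len(label):].strip()
    # The first line of the message ends with a period that is not part of the path.
    if "file_path" in fields:
        fields["file_path"] = fields["file_path"].rstrip(".")
    return fields
```

For example, calling parse_polling_timeout on a message whose second line is "Service account: sa@example-project.iam.gserviceaccount.com" returns a dictionary whose "service_account" entry holds that email address.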

Verify Docker entrypoint

Try this step if you're running a template from a custom Docker image rather than using one of the provided templates.

Check for the container entrypoint using the following command:

docker inspect $TEMPLATE_IMAGE

The Entrypoint field in the output should show the following path:

Java

/opt/google/dataflow/java_template_launcher

Python

/opt/google/dataflow/python_template_launcher

If you get a different output, then the entrypoint of your Docker container has been overridden. Restore the entrypoint of $TEMPLATE_IMAGE to the default.
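The entrypoint check can also be scripted. The following is a minimal sketch, assuming that the docker inspect output is a JSON array whose first element stores the entrypoint under Config.Entrypoint, and that the default entrypoint is a single-element list containing the launcher path. The function name is hypothetical.

```python
import json

# Default launcher paths for each SDK language, as listed above.
EXPECTED_LAUNCHERS = {
    "java": "/opt/google/dataflow/java_template_launcher",
    "python": "/opt/google/dataflow/python_template_launcher",
}


def entrypoint_is_default(inspect_json: str, language: str) -> bool:
    """Return True if the inspected image still uses the default launcher."""
    image = json.loads(inspect_json)[0]  # docker inspect prints a JSON array
    # Entrypoint can be null when the image defines none, so fall back to [].
    entrypoint = image.get("Config", {}).get("Entrypoint") or []
    # Assumption: the default is exactly one element, the launcher path.
    return entrypoint == [EXPECTED_LAUNCHERS[language]]
```

To use the sketch on a machine with Docker installed, capture the output of docker inspect $TEMPLATE_IMAGE (for example, with subprocess.run and capture_output=True) and pass the resulting text to entrypoint_is_default.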

Check service account permissions

Check that the service account mentioned in the message has the following permissions:

  • It must be able to read and write the Cloud Storage path that fills in ${file_path} in the message.
  • It must be able to read the Docker image that fills in ${image_url} in the message.

Configure Private Google Access

By default, when a Compute Engine VM lacks an external IP address assigned to its network interface, it can only send packets to other internal IP address destinations. If external IP addresses are disabled, you need to allow the VMs to connect to the set of external IP addresses used by Google APIs and services. To do so, enable Private Google Access on the subnet used by the network interface of the VM.

For configuration details, see Configuring Private Google Access.

Check if the launcher program fails to exit

The program that constructs the pipeline must finish before the pipeline can be launched. The polling error could indicate that it took too long to do so.

To locate the cause in your code, try the following:

  • Check job logs and see if any operation appears to take a long time to complete. An example would be a request for an external resource.
  • Make sure no threads are blocking the program from exiting. Some clients might create their own threads, and if these clients are not shut down, the program waits forever for these threads to be joined.
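For Python pipelines, one way to check for blocking threads is to list the non-daemon threads that are still alive at the end of the launcher program, because the Python interpreter waits for those threads before it exits. The following is a minimal sketch that uses only the standard threading module; the function name is hypothetical, and where you call it in your pipeline code is up to you.

```python
import threading


def blocking_threads() -> list:
    """Return live non-daemon threads other than the main thread.

    The interpreter waits for these threads at exit, so any thread listed
    here can keep the launcher program from finishing.
    """
    main = threading.main_thread()
    return [
        t
        for t in threading.enumerate()
        if t is not main and t.is_alive() and not t.daemon
    ]
```

Calling blocking_threads() after pipeline construction and logging the thread names can point to a client that was not shut down.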

Pipelines that are launched directly, without a template, don't have these limitations. If the pipeline works when launched directly but not as a template, then the use of a template might be the root cause. In that case, find and fix the issue in the template.

Verify whether required pipeline options are suppressed

When using Flex Templates, you can configure some pipeline options during pipeline initialization, but other pipeline options should not be changed. For more information, see the Failed to read the job file section on this page.

Remove Apache Beam from the requirements file (Python only)

If your Dockerfile installs a requirements.txt file that lists apache-beam[gcp], remove apache-beam[gcp] from the requirements file and install it separately in the Dockerfile. For example:

RUN pip install apache-beam[gcp]
RUN pip install -U -r ./requirements.txt

Putting Apache Beam in the requirements file is known to lead to long launch times, and it often causes a timeout.
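To find the lines to remove, you can scan the requirements file for the Apache Beam package. The following is a minimal sketch, assuming a simple requirements file with one requirement per line; it does not follow nested -r includes, and the function name is hypothetical.

```python
import re

# Matches a line whose package name is apache-beam, with or without extras,
# a version specifier, or an environment marker. It does not match other
# packages whose names merely start with "apache-beam-".
_BEAM_LINE = re.compile(r"^\s*apache[-_.]beam(\[|\s|[=<>!~;@]|$)", re.IGNORECASE)


def split_beam_requirement(requirements_text: str):
    """Split requirement lines into (apache-beam lines, all other lines)."""
    beam, others = [], []
    for line in requirements_text.splitlines():
        (beam if _BEAM_LINE.match(line) else others).append(line)
    return beam, others
```

The second list can be written back as the new requirements.txt file, and the first list shows what to install in its own RUN pip install step.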

Polling timeouts when using Python

If you're running a Dataflow job by using a Flex Template and Python, your job might queue for a period, fail to run, and then display the following error:

Timeout in polling

The error occurs because of the requirements.txt file that's used to install the required dependencies. When you launch a Dataflow job, all the dependencies are staged first to make these files accessible to the worker VMs. The process involves downloading and recompiling every direct and indirect dependency in the requirements.txt file. Some dependencies might take several minutes to compile. PyArrow in particular, an indirect dependency that's used by Apache Beam and most Cloud Client Libraries, can take a long time to compile.

To optimize your job's performance, use a Dockerfile or a custom container to prepackage the dependencies. For more information, see Package dependencies in "Configure Flex Templates."

Job launch failures

The following section contains common errors that lead to job launch failures and steps for resolving or troubleshooting the errors.

Failed to read the job file

When you try to run a job from a Flex Template, your job might fail with the following error:

Failed to read the job file : gs://dataflow-staging-REGION-PROJECT_ID/staging/template_launches/TIMESTAMP/job_object with error message: ...: Unable to open template file

This error occurs when the necessary pipeline initialization options are overwritten. When using Flex Templates, you can configure some pipeline options during pipeline initialization, but other pipeline options should not be changed. If the command line arguments required by the Flex Template are overwritten, the job might ignore, override, or discard the pipeline options passed by the template launcher. The job might fail to launch, or a job that doesn't use the Flex Template might launch.

To avoid this issue, during pipeline initialization, do not change the following pipeline options in user code or in the metadata.json file:

Java

  • runner
  • project
  • jobName
  • templateLocation
  • region

Python

  • runner
  • project
  • job_name
  • template_location
  • region

Go

  • runner
  • project
  • job_name
  • template_location
  • region
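A quick way to catch accidental overrides is to scan the arguments that your code hard-codes, and the parameter names declared in metadata.json, for the option names listed above. The following is a minimal sketch, assuming that metadata.json declares its parameters in a "parameters" list of objects with a "name" field. The function names are hypothetical.

```python
# Option names that must not be changed, per SDK language (from the lists above).
PROTECTED_OPTIONS = {
    "java": {"runner", "project", "jobName", "templateLocation", "region"},
    "python": {"runner", "project", "job_name", "template_location", "region"},
    "go": {"runner", "project", "job_name", "template_location", "region"},
}


def protected_flags_in_args(argv, language):
    """Return protected option names that appear as flags in argv."""
    protected = PROTECTED_OPTIONS[language]
    found = []
    for arg in argv:
        if not arg.startswith("-"):
            continue  # a value, such as the "us-central1" after "--region"
        name = arg.lstrip("-").split("=", 1)[0]
        if name in protected:
            found.append(name)
    return found


def protected_params_in_metadata(metadata: dict, language):
    """Return protected option names declared as template parameters."""
    protected = PROTECTED_OPTIONS[language]
    names = [p.get("name") for p in metadata.get("parameters", [])]
    return [n for n in names if n in protected]
```

An empty list from both functions means that neither the hard-coded arguments nor the metadata file names one of the protected options. Options changed in other ways, such as by assigning attributes on an options object, are not detected by this sketch.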

Permission denied on resource

When you try to run a job from a Flex Template, your job might fail with the following error:

Permission "MISSING_PERMISSION" denied on resource "projects/PROJECT_ID/locations/REGION/repositories/REPOSITORY_NAME" (or it may not exist).

This error occurs when the service account that's used to run the job doesn't have permission to access the resources that the Flex Template needs.

To resolve this issue, check the required permissions and grant any missing permissions to the service account.

Troubleshoot early startup issues

When the template launching process fails in an early stage, regular Flex Template logs might not be available. To investigate startup issues, enable serial port logging for the templates launcher VM.

To enable logging for Java templates, set the enableLauncherVmSerialPortLogging option to true. To enable logging for Python and Go templates, set the enable_launcher_vm_serial_port_logging option to true. In the Google Cloud console, the parameter is listed in Optional parameters as Enable Launcher VM Serial Port Logging.

You can view the serial port output logs of the templates launcher VM in Cloud Logging. To find the logs for a particular launcher VM, use the query resource.type="gce_instance" "launcher-number", where number starts with the current date in the format YYYYMMDD.
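The query can be generated for a given launch date. The following is a minimal sketch that builds the query string from the date prefix only, so it matches every launcher VM whose number starts with that date; the function name is hypothetical.

```python
from datetime import date


def launcher_log_query(day: date) -> str:
    """Build the Cloud Logging query for launcher VMs started on a given day."""
    prefix = "launcher-" + day.strftime("%Y%m%d")  # YYYYMMDD date prefix
    return f'resource.type="gce_instance" "{prefix}"'
```

For example, launcher_log_query(date.today()) returns the query for the current date, which you can paste into the Cloud Logging query editor.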

Your organization policy might prohibit you from enabling serial port output logging.