This page provides troubleshooting tips and debugging strategies that you might find helpful if you're using Dataflow Flex Templates. This information can help you detect a polling timeout, determine the reason behind the timeout, and correct the problem.
Troubleshoot polling timeouts
This section provides steps for identifying the cause of polling timeouts.
Polling timeouts
Your Flex Template job might return the following error message:
Timeout in polling result file: ${file_path}.
Service account: ${service_account_email}
Image URL: ${image_url}
Troubleshooting guide at https://cloud.google.com/dataflow/docs/guides/common-errors#timeout-polling
This error can occur for the following reasons:
- The base Docker image was overridden.
- The service account that fills in ${service_account_email} is missing some necessary permissions.
- External IP addresses are disabled, and VMs can't connect to the set of external IP addresses used by Google APIs and services.
- The program that creates the graph takes too long to finish.
- Pipeline options are being overwritten.
- (Python only) There is a problem with the requirements.txt file.
- There was a transient error.
To resolve this issue, first check for transient errors by checking the job logs and retrying. If those steps don't resolve the issue, try the following troubleshooting steps.
Verify Docker entrypoint
Try this step if you're running a template from a custom Docker image rather than using one of the provided templates.
Check for the container entrypoint using the following command:
docker inspect $TEMPLATE_IMAGE
You should see the following:
Java
/opt/google/dataflow/java_template_launcher
Python
/opt/google/dataflow/python_template_launcher
If you get a different output, then the entrypoint of your Docker container is overridden. Restore $TEMPLATE_IMAGE to the default.
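The check above can also be scripted. The following is a minimal sketch, assuming the `docker inspect` output has been captured as a string and that the entrypoint appears under `Config.Entrypoint` in the JSON that Docker prints. The helper name `entrypoint_is_default` is hypothetical, and the sample input is made up for illustration.

```python
import json

# Launcher paths quoted in this guide, keyed by SDK language.
DEFAULT_LAUNCHERS = {
    "java": "/opt/google/dataflow/java_template_launcher",
    "python": "/opt/google/dataflow/python_template_launcher",
}


def entrypoint_is_default(inspect_output: str, language: str) -> bool:
    """Return True if the inspected image uses the expected launcher entrypoint."""
    # `docker inspect` prints a JSON array with one object per inspected item.
    image = json.loads(inspect_output)[0]
    # Entrypoint can be null when the image defines none, so fall back to [].
    entrypoint = image.get("Config", {}).get("Entrypoint") or []
    return entrypoint == [DEFAULT_LAUNCHERS[language]]


# Illustrative input only; real output contains many more fields.
sample = json.dumps(
    [{"Config": {"Entrypoint": ["/opt/google/dataflow/python_template_launcher"]}}]
)
print(entrypoint_is_default(sample, "python"))
```

In practice, the string would come from running `docker inspect $TEMPLATE_IMAGE` and capturing its standard output.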
Check service account permissions
Check that the service account mentioned in the message has the following permissions:
- It must be able to read and write the Cloud Storage path that fills in ${file_path} in the message.
- It must be able to read the Docker image that fills in ${image_url} in the message.
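To see which path, service account, and image to check, it can help to pull the three values out of the error text. The following is a rough sketch that assumes the message has exactly the layout shown earlier on this page. The function name `parse_polling_timeout` and the sample values are hypothetical.

```python
import re


def parse_polling_timeout(message: str) -> dict[str, str]:
    """Extract the file path, service account, and image URL from the error text."""
    patterns = {
        # The file path is followed by a sentence-ending period, which is dropped.
        "file_path": r"Timeout in polling result file: (\S+)\.\s*$",
        "service_account": r"Service account: (\S+)\s*$",
        "image_url": r"Image URL: (\S+)\s*$",
    }
    fields = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, message, flags=re.MULTILINE)
        if match:
            fields[name] = match.group(1)
    return fields


# Made-up values for illustration; they do not refer to real resources.
sample_message = (
    "Timeout in polling result file: gs://example-bucket/staging/result.\n"
    "Service account: launcher@example-project.iam.gserviceaccount.com\n"
    "Image URL: gcr.io/example-project/example-template:latest\n"
)
print(parse_polling_timeout(sample_message))
```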
Configure Private Google Access
If external IP addresses are disabled, you need to allow Compute Engine VMs to connect to the set of external IP addresses used by Google APIs and services. Enable Private Google Access on the subnet used by the network interface of the VM.
For configuration details, see Configuring Private Google Access.
By default, when a Compute Engine VM lacks an external IP address assigned to its network interface, it can only send packets to other internal IP address destinations.
Check if the launcher program fails to exit
The program that constructs the pipeline must finish before the pipeline can be launched. The polling error could indicate that it took too long to do so.
To locate the cause in your code, try the following:
- Check job logs and see if any operation appears to take a long time to complete. An example would be a request for an external resource.
- Make sure no threads are blocking the program from exiting. Some clients might create their own threads, and if these clients are not shut down, the program waits forever for these threads to be joined.
Pipelines launched directly that don't use a template don't have these limitations. Therefore, if the pipeline worked directly but not as a template, then the use of a template might be the root cause. Finding the issue in the template and fixing the template might resolve the issue.
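As an illustration of the point about threads, here is a minimal Python sketch. It assumes that the pipeline-construction code uses a thread pool to fetch some configuration before building the graph. The names `fetch_config`, `build_pipeline_inputs`, and `lingering_threads` are hypothetical and are not part of any Dataflow or Apache Beam API.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def fetch_config(name: str) -> str:
    # Stand-in for a request to an external resource.
    return f"config-for-{name}"


def build_pipeline_inputs() -> list[str]:
    # Leaving the `with` block calls executor.shutdown(wait=True), so the
    # worker threads have finished before the launcher program returns.
    with ThreadPoolExecutor(max_workers=2) as executor:
        return list(executor.map(fetch_config, ["a", "b"]))


def lingering_threads() -> list[str]:
    # Live non-daemon threads other than the main thread keep the
    # interpreter from exiting.
    return [
        t.name
        for t in threading.enumerate()
        if t is not threading.main_thread() and not t.daemon
    ]


results = build_pipeline_inputs()
print(results, lingering_threads())
```

Calling a helper like `lingering_threads()` just before the program is expected to exit is one way to spot a client that was never shut down.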
Verify whether required pipeline options are suppressed
When using Flex Templates, you can configure some pipeline options during pipeline initialization, but other pipeline options should not be changed. For more information, see the Failed to read the job file section on this page.
Remove Apache Beam from the requirements file (Python Only)
If your Dockerfile includes a requirements.txt with apache-beam[gcp], then you should remove it from the file and install it separately. Example:
RUN pip install apache-beam[gcp]
RUN pip install -U -r ./requirements.txt
Putting Apache Beam in the requirements file is known to lead to long launch times, and it often causes a timeout.
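If an existing requirements file already lists Apache Beam, the entry can be filtered out before the `pip install -r` step. The following is a small sketch, not an official tool. The helper name `without_apache_beam` is hypothetical, and the matching rule (a line that starts with `apache-beam`, with `-`, `_`, or `.` as the separator) is an assumption about how the entry is written.

```python
import re

# Matches "apache-beam", "apache_beam[gcp]==<version>", and similar spellings,
# but not a different package whose name merely starts with "apache-beam-".
_BEAM_LINE = re.compile(r"\s*apache[-_.]beam(?![\w.-])", flags=re.IGNORECASE)


def without_apache_beam(requirements_text: str) -> str:
    """Return the requirements text with apache-beam entries removed."""
    kept = [
        line
        for line in requirements_text.splitlines()
        if not _BEAM_LINE.match(line)
    ]
    return "\n".join(kept)


# Illustrative input; the package names other than apache-beam are placeholders.
sample_requirements = "apache-beam[gcp]\nrequests\n# pinned for tests\npytest"
print(without_apache_beam(sample_requirements))
```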
Polling timeouts when using Python
If you're running a Dataflow job by using a Flex Template and Python, your job might queue for a period, fail to run, and then display the following error:
Timeout in polling
The error occurs because of the requirements.txt file that's used to install the required dependencies. When you launch a Dataflow job, all the dependencies are staged first to make these files accessible to the worker VMs. The process involves downloading and recompiling every direct and indirect dependency in the requirements.txt file.
Some dependencies might take several minutes to compile. Notably, PyArrow might take time to compile. PyArrow is an indirect dependency that's used by Apache Beam and most Cloud Client Libraries.
To optimize your job's performance, use a Dockerfile or a custom container to prepackage the dependencies. For more information, see Package dependencies in "Configure Flex Templates."
Job launch failures
The following section contains common errors that lead to job launch failures and steps for resolving or troubleshooting the errors.
Failed to read the job file
When you try to run a job from a Flex Template, your job might fail with the following error:
Failed to read the job file : gs://dataflow-staging-REGION-PROJECT_ID/staging/template_launches/TIMESTAMP/job_object with error message: ...: Unable to open template file
This error occurs when the necessary pipeline initialization options are overwritten. When using Flex Templates, you can configure some pipeline options during pipeline initialization, but other pipeline options should not be changed. If the command line arguments required by the Flex Template are overwritten, the job might ignore, override, or discard the pipeline options passed by the template launcher. The job might fail to launch, or a job that doesn't use the Flex Template might launch.
To avoid this issue, during pipeline initialization, do not change the following pipeline options in user code or in the metadata.json file:
Java
runner
project
jobName
templateLocation
region
Python
runner
project
job_name
template_location
region
Go
runner
project
job_name
template_location
region
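One way to catch an accidental override in a Python pipeline is to check any hard-coded arguments that user code adds on top of those passed by the template launcher. The sketch below is a hypothetical helper, not part of Apache Beam. It assumes options are written as `--name=value` or `--name value`, and it uses the Python option names listed above.

```python
# Python option names that this guide says must not be changed.
RESERVED_OPTIONS = {"runner", "project", "job_name", "template_location", "region"}


def reserved_overrides(extra_args: list[str]) -> list[str]:
    """Return the reserved option names that appear in extra_args, in order."""
    found = []
    for arg in extra_args:
        if not arg.startswith("--"):
            continue  # A bare value such as "DirectRunner", not an option name.
        name = arg[2:].split("=", 1)[0]
        if name in RESERVED_OPTIONS and name not in found:
            found.append(name)
    return found


# Illustrative arguments that user code might append by mistake.
print(reserved_overrides(["--runner=DirectRunner", "--input=input.txt"]))
```

A non-empty result suggests that user code is setting an option that the launcher should control.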
Permission denied on resource
When you try to run a job from a Flex Template, your job might fail with the following error:
Permission "MISSING_PERMISSION" denied on resource "projects/PROJECT_ID/locations/REGION/repositories/REPOSITORY_NAME" (or it may not exist).
This error occurs when the service account used by the job does not have permission to access the resources needed to run a Flex Template.
To avoid this issue, check the required permissions and grant any that are missing to the service account.
Troubleshoot early startup issues
When the template launching process fails in an early stage, regular Flex Template logs might not be available. To investigate startup issues, enable serial port logging for the templates launcher VM.
To enable logging for Java templates, set the enableLauncherVmSerialPortLogging option to true. To enable logging for Python and Go templates, set the enable_launcher_vm_serial_port_logging option to true. In the Google Cloud console, the parameter is listed in Optional parameters as Enable Launcher VM Serial Port Logging.
You can view the serial port output logs of the templates launcher VM in Cloud Logging. To find the logs for a particular launcher VM, use the query resource.type="gce_instance" "launcher-number" where number starts with the current date in the format YYYYMMDD.
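The query string can be generated for a given day. This is a minimal sketch that assumes the date prefix uses a four-digit year (YYYYMMDD) and that matching on the date prefix alone is enough to narrow the results. The function name `launcher_log_query` is hypothetical.

```python
from datetime import date


def launcher_log_query(day: date) -> str:
    """Build a Cloud Logging query for launcher VMs whose number starts with `day`."""
    # Assumption: the launcher number begins with the date as YYYYMMDD, so
    # searching for this prefix also matches the full VM name.
    return f'resource.type="gce_instance" "launcher-{day:%Y%m%d}"'


print(launcher_log_query(date(2024, 1, 15)))
```

Passing `date.today()` gives the query for launcher VMs created on the current day.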
Your organization policy might prohibit you from enabling serial port output logging.