This page provides troubleshooting tips and debugging strategies that you might find helpful if you're using Dataflow Flex Templates and experiencing polling timeouts. This information can help you detect a polling timeout, determine the reason behind the timeout, and correct the problem.
Troubleshoot polling timeouts
Your Flex Template job might return the following error message:
Timeout in polling result file: ${file_path}.
Service account: ${service_account_email}
Image URL: ${image_url}
Troubleshooting guide at https://cloud.google.com/dataflow/docs/guides/common-errors#timeout-polling
This error can occur for the following reasons:
- The base Docker image was overridden.
- The service account that fills in ${service_account_email} does not have the necessary permissions.
- External IP addresses are disabled, and VMs can't connect to the set of external IP addresses used by Google APIs and services.
- The program that creates the graph takes too long to finish.
- (Python only) There is a problem with the requirements.txt file.
- There was a transient error.
To resolve this issue, first check the job logs and retry the job in case the failure was transient. If that doesn't resolve the issue, work through the following troubleshooting steps.
Verify Docker entrypoint
This step applies only if you're running a template from a custom Docker image rather than using one of the provided templates.
Check for the container entrypoint using the following command:
docker inspect $TEMPLATE_IMAGE
You should see one of the following entrypoints, depending on your template's language:

Java: /opt/google/dataflow/java_template_launcher

Python: /opt/google/dataflow/python_template_launcher
If you get a different output, then your Docker container's entrypoint is overridden. Restore $TEMPLATE_IMAGE to the default.
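If you only want the entrypoint field rather than the full inspect output, you can narrow it with docker's --format flag. As a minimal sketch, the second line shows one way to restore the default in your Dockerfile before rebuilding the image (Python launcher shown; the Java path is analogous):

docker inspect --format='{{json .Config.Entrypoint}}' $TEMPLATE_IMAGE

# In the Dockerfile, restore the default launcher entrypoint (sketch):
ENTRYPOINT ["/opt/google/dataflow/python_template_launcher"]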
Check service account permissions
Check that the service account mentioned in the message has the following permissions:
- It must be able to read and write the Cloud Storage path that fills in ${file_path} in the message.
- It must be able to read the Docker image that fills in ${image_url} in the message.
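As an illustrative sketch, you could grant both permissions at the project level with gcloud. The variable names are hypothetical, and the roles shown are assumptions: roles/storage.objectAdmin covers reading and writing the Cloud Storage path, and roles/artifactregistry.reader applies if the image is in Artifact Registry. Where possible, scope the grants to the specific bucket and repository instead of the whole project.

gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member="serviceAccount:$SERVICE_ACCOUNT_EMAIL" \
    --role="roles/storage.objectAdmin"

gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member="serviceAccount:$SERVICE_ACCOUNT_EMAIL" \
    --role="roles/artifactregistry.reader"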
Configure Private Google Access
By default, when a Compute Engine VM lacks an external IP address assigned to its network interface, it can only send packets to other internal IP address destinations. If external IP addresses are disabled, you need to allow Compute Engine VMs to connect to the set of external IP addresses used by Google APIs and services by enabling Private Google Access on the subnet used by the VM's network interface. For configuration details, see Configuring Private Google Access.
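For example, assuming your subnet is $SUBNET_NAME in $REGION (hypothetical variable names), you can enable Private Google Access with the following gcloud command:

gcloud compute networks subnets update $SUBNET_NAME \
    --region=$REGION \
    --enable-private-ip-google-access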
Check if the launcher program fails to exit
The program that constructs the pipeline must finish before the pipeline can be launched. The polling error could indicate that it took too long to do so.
To locate the cause in your code, try the following:
- Check the job logs to see whether any operation takes a long time to complete, such as a request for an external resource.
- Make sure no threads are blocking the program from exiting. Some clients create their own threads, and if those clients are not shut down, the program waits forever for the threads to be joined. See the sketch after this list.
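The following minimal Python sketch (hypothetical, not Dataflow-specific) shows how a non-daemon background thread keeps the launcher process alive after pipeline construction finishes, and one way to avoid it:

import threading
import time

def poll_external_resource():
    # Stands in for a client that polls in the background.
    while True:
        time.sleep(10)

# A non-daemon thread keeps the Python process alive even after the
# main program finishes, so the launcher never exits and the template
# polling times out.
worker = threading.Thread(target=poll_external_resource)

# Fix: mark helper threads as daemon threads, or shut the owning client
# down explicitly, so they don't block process exit.
worker.daemon = True
worker.start()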
Pipelines launched directly (without a template) don't have these limitations. If the pipeline launches directly but not as a template, the template itself is likely the root cause, and fixing it should resolve the issue.
Remove Apache Beam from the requirements file (Python Only)
If your Dockerfile includes a requirements.txt file that lists apache-beam[gcp], remove it from the file and install Apache Beam separately. Example:
RUN pip install apache-beam[gcp]
RUN pip install -U -r ./requirements.txt
Putting Apache Beam in the requirements file is known to lead to long launch times, and it often causes a timeout.
Polling timeouts when using Python
If you are running a Dataflow job by using a Flex Template and Python, your job might queue for a period of time, fail to run, and then display the following error:
Timeout in polling
The error occurs because of the requirements.txt file that's used to install the required dependencies. When you launch a Dataflow job, all of the dependencies are staged first to make these files accessible to the worker VMs. This process involves downloading and recompiling every direct and indirect dependency in the requirements.txt file.
Some dependencies might take several minutes to compile, notably PyArrow, an indirect dependency that's used by Apache Beam and most Google Cloud client libraries.
As a workaround, perform the following steps:
1. Download the precompiled dependencies in the Dockerfile into the Dataflow staging directory.
2. Set the environment variable PIP_NO_DEPS to True. This setting prevents pip from re-downloading and recompiling all of the dependencies, which helps prevent the timeout error.
The following is a code sample that shows how the dependencies are pre-downloaded.
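This is a minimal sketch rather than a canonical sample: it assumes the python3-template-launcher-base image and uses /tmp/dataflow-requirements-cache, Apache Beam's default requirements cache directory, as the staging path.

FROM gcr.io/dataflow-templates-base/python3-template-launcher-base

ENV FLEX_TEMPLATE_PYTHON_REQUIREMENTS_FILE="/template/requirements.txt"
ENV FLEX_TEMPLATE_PYTHON_PY_FILE="/template/streaming_beam.py"

COPY . /template

# Pre-download the dependencies into the staging directory so that the
# launcher doesn't download and recompile them when the job starts.
RUN pip install --no-cache-dir --upgrade pip \
    && pip install --no-cache-dir -r $FLEX_TEMPLATE_PYTHON_REQUIREMENTS_FILE \
    && pip download --no-cache-dir \
       --dest /tmp/dataflow-requirements-cache \
       -r $FLEX_TEMPLATE_PYTHON_REQUIREMENTS_FILE

# Maps to pip's --no-deps flag: skip re-resolving transitive dependencies.
ENV PIP_NO_DEPS=True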
As an alternative workaround, you can use a custom container by doing the following:
1. Preinstall all dependencies in the custom container and delete FLEX_TEMPLATE_PYTHON_REQUIREMENTS_FILE from dataflow/flex-templates/streaming_beam/Dockerfile. For example:

FROM gcr.io/dataflow-templates-base/python3-template-launcher-base
ENV FLEX_TEMPLATE_PYTHON_PY_FILE="/template/streaming_beam.py"
COPY . /template
RUN pip install --no-cache-dir -r /template/requirements.txt
2. Set the sdk_container_image parameter in the flex-template run command. For example:

gcloud dataflow flex-template run $JOB_NAME \
    --region=$REGION \
    --template-file-gcs-location=$TEMPLATE_PATH \
    --parameters=sdk_container_image=$CUSTOM_CONTAINER_IMAGE \
    --additional-experiments=use_runner_v2
For more information, see Using custom containers in Dataflow.