
Troubleshoot Flex Template timeouts

This page provides troubleshooting tips and debugging strategies that you might find helpful if you're using Dataflow Flex Templates and experiencing polling timeouts. This information can help you detect a polling timeout, determine the reason behind the timeout, and correct the problem.

Troubleshoot polling timeouts

Your Flex Template job might return the following error message:

Timeout in polling result file: ${file_path}.
Service account: ${service_account_email}
Image URL: ${image_url}
Troubleshooting guide at https://cloud.google.com/dataflow/docs/guides/common-errors#timeout-polling

This error can occur for the following reasons:

  1. The base Docker image was overridden.
  2. The service account that fills in ${service_account_email} is missing necessary permissions.
  3. External IP addresses are disabled, and VMs can't connect to the set of external IP addresses used by Google APIs and services.
  4. The program that creates the graph takes too long to finish.
  5. (Python only) There is a problem with the requirements.txt file.
  6. There was a transient error.

To resolve this issue, first check the job logs and retry the job, because the failure might be transient. If that doesn't resolve the issue, work through the following troubleshooting steps.
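
For example, you can pull the logs for a failed launch with gcloud; JOB_ID and PROJECT_ID below are placeholders for your job and project:

# Read recent log entries for the Dataflow job.
gcloud logging read \
    'resource.type="dataflow_step" AND resource.labels.job_id="JOB_ID"' \
    --project=PROJECT_ID \
    --limit=50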

Verify Docker entrypoint

This step applies if you run the template from a custom Docker image rather than using one of the provided templates.

Check the container entrypoint by using the following command:

docker inspect --format='{{.Config.Entrypoint}}' $TEMPLATE_IMAGE

You should see the following:

Java

[/opt/google/dataflow/java_template_launcher]

Python

[/opt/google/dataflow/python_template_launcher]

If you get a different output, your Docker container's entrypoint has been overridden. Restore the entrypoint of $TEMPLATE_IMAGE to the default.
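
If you build your own image on top of the launcher base image, the simplest way to avoid this problem is to not declare an ENTRYPOINT or CMD at all, so the launcher entrypoint is inherited from the base image. The following is a minimal sketch for Python; the COPY and pip install lines are illustrative:

FROM gcr.io/dataflow-templates-base/python3-template-launcher-base

# Add template code and dependencies as needed.
COPY . /template
RUN pip install --no-cache-dir -r /template/requirements.txt

# Deliberately no ENTRYPOINT or CMD line: the base image's
# /opt/google/dataflow/python_template_launcher stays in effect.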

Check service account permissions

Check that the service account mentioned in the message has the following permissions (example grant commands are shown after the list):

  • It must be able to read and write the Cloud Storage path that fills in ${file_path} in the message.
  • It must be able to read the Docker image that fills in ${image_url} in the message.
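
For example, you might grant both permissions with gsutil. The service account, bucket, and project names below are hypothetical placeholders, and the second command assumes a gcr.io-hosted image, whose layers live in the Container Registry bucket artifacts.PROJECT.appspot.com:

# Read/write access to the staging path in ${file_path}:
gsutil iam ch \
    serviceAccount:my-sa@my-project.iam.gserviceaccount.com:roles/storage.objectAdmin \
    gs://my-staging-bucket

# Read access to the image in ${image_url} (gcr.io-backed bucket):
gsutil iam ch \
    serviceAccount:my-sa@my-project.iam.gserviceaccount.com:roles/storage.objectViewer \
    gs://artifacts.my-project.appspot.com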

Configure Private Google Access

By default, when a Compute Engine VM lacks an external IP address assigned to its network interface, it can send packets only to other internal IP address destinations. If external IP addresses are disabled, you must allow the Compute Engine VMs to reach the set of external IP addresses used by Google APIs and services by enabling Private Google Access on the subnet used by the VM's network interface.

For configuration details, see Configuring Private Google Access.
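
For example, you can enable Private Google Access with gcloud; SUBNET_NAME and REGION are placeholders for the subnet that the worker VMs use:

gcloud compute networks subnets update SUBNET_NAME \
    --region=REGION \
    --enable-private-ip-google-access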

Check if the launcher program fails to exit

The program that constructs the pipeline must finish before the pipeline can be launched. The polling error could indicate that it took too long to do so.

To locate the cause in your code, try the following:

  • Check job logs and see if any operation appears to take a long time to complete. An example would be a request for an external resource.
  • Make sure no threads are blocking the program from exiting. Some clients create their own threads, and if those clients are not shut down, the program waits forever for the threads to be joined (see the sketch after this list).
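
The thread-joining pitfall looks like the following minimal Python sketch, where fetch_external_resource is a hypothetical stand-in for any helper or client that spawns background threads:

import time
from concurrent.futures import ThreadPoolExecutor

def fetch_external_resource():
    # Stand-in for a slow external call made while constructing the pipeline.
    time.sleep(5)

executor = ThreadPoolExecutor(max_workers=4)
executor.submit(fetch_external_resource)

# Without an explicit shutdown, the interpreter joins the executor's
# worker threads at exit, so the launcher process lingers until the
# pending work finishes; with clients whose threads never terminate,
# it waits forever and the template launch times out.
executor.shutdown(wait=True)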

Pipelines launched directly (without a template) don't have these limitations. If the pipeline launches successfully when run directly but not as a template, the template's launch path is the likely root cause, and the fix belongs in the template.

Remove Apache Beam from the requirements file (Python only)

If the requirements.txt that your Dockerfile installs includes apache-beam[gcp], remove that entry from the file and install Apache Beam in a separate step. For example:

RUN pip install 'apache-beam[gcp]'
RUN pip install -U -r ./requirements.txt

Putting Apache Beam in the requirements file is known to lead to long launch times, and it often causes a timeout.
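
For example, a trimmed requirements.txt keeps only the other dependencies; the package names below are hypothetical placeholders:

# requirements.txt: apache-beam[gcp] removed; it is installed by its
# own RUN pip install step in the Dockerfile instead.
google-cloud-pubsub
google-cloud-storage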

Polling timeouts when using Python

If you run a Dataflow job by using a Flex Template and Python, your job might queue for a period of time, fail to run, and then display the following error:

Timeout in polling

The error occurs because of the requirements.txt file that's used to install the required dependencies. When you launch a Dataflow job, all the dependencies are staged first to make these files accessible to the worker VMs. The process involves downloading and recompiling every direct and indirect dependency in the requirements.txt file. Some dependencies might take several minutes to compile—notably PyArrow, which is an indirect dependency that's used by Apache Beam and most Google Cloud client libraries.

As a workaround, perform the following steps:

  1. In the Dockerfile, download the dependencies in advance into the local requirements cache directory (/tmp/dataflow-requirements-cache in the sample below).

  2. Set the environment variable PIP_NO_DEPS to True.

    The setting prevents pip from re-downloading and re-compiling all the dependencies, which helps prevent the timeout error.

The following is a code sample that shows how the dependencies are pre-downloaded.

# Copyright 2020 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

FROM gcr.io/dataflow-templates-base/python3-template-launcher-base

ENV FLEX_TEMPLATE_PYTHON_REQUIREMENTS_FILE="/template/requirements.txt"
ENV FLEX_TEMPLATE_PYTHON_PY_FILE="/template/streaming_beam.py"

COPY . /template

# libffi-dev and git are needed to build some common dependencies;
# remove them if your requirements don't need them.
RUN apt-get update \
    && apt-get install -y libffi-dev git \
    && rm -rf /var/lib/apt/lists/* \
    # Upgrade pip and install the requirements.
    && pip install --no-cache-dir --upgrade pip \
    && pip install --no-cache-dir -r $FLEX_TEMPLATE_PYTHON_REQUIREMENTS_FILE \
    # Download the requirements to speed up launching the Dataflow job.
    && pip download --no-cache-dir --dest /tmp/dataflow-requirements-cache -r $FLEX_TEMPLATE_PYTHON_REQUIREMENTS_FILE

# Since we already downloaded all the dependencies, there's no need to rebuild everything.
ENV PIP_NO_DEPS=True
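
This works because pip reads any PIP_<OPTION> environment variable as the corresponding command-line option, so PIP_NO_DEPS=True has the same effect as passing --no-deps to every pip invocation that the launcher makes.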

As an alternative workaround, you can use a custom container by doing the following:

  1. Preinstall all dependencies in the custom container and delete FLEX_TEMPLATE_PYTHON_REQUIREMENTS_FILE from dataflow/flex-templates/streaming_beam/Dockerfile. For example:

    FROM gcr.io/dataflow-templates-base/python3-template-launcher-base
    ENV FLEX_TEMPLATE_PYTHON_PY_FILE="/template/streaming_beam.py"
    COPY . /template
    RUN pip install --no-cache-dir -r /template/requirements.txt
    
  2. Set the sdk_container_image parameter in the flex-template run command. For example:

    gcloud dataflow flex-template run $JOB_NAME \
        --region=$REGION \
        --template-file-gcs-location=$TEMPLATE_PATH \
        --parameters=sdk_container_image=$CUSTOM_CONTAINER_IMAGE \
        --additional-experiments=use_runner_v2
    

    For more information, see Using custom containers in Dataflow.
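
Note that the custom image must be built and pushed before you run the command. One way to do that, assuming Cloud Build and a Dockerfile in the current directory, is:

gcloud builds submit --tag $CUSTOM_CONTAINER_IMAGE .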