Troubleshoot Flex Templates

This page provides troubleshooting tips and debugging strategies that you might find helpful if you're using Dataflow Flex Templates. This information can help you detect a polling timeout, determine the reason behind the timeout, and correct the problem.

Troubleshoot polling timeouts

This section provides steps for identifying the cause of polling timeouts.

Polling timeouts

Your Flex Template job might return the following error message:

Timeout in polling result file: ${file_path}.
Service account: ${service_account_email}
Image URL: ${image_url}
Troubleshooting guide at https://cloud.google.com/dataflow/docs/guides/common-errors#timeout-polling

This error can occur for the following reasons:

  1. The base Docker image was overridden.
  2. The service account that fills in ${service_account_email} doesn't have the necessary permissions.
  3. External IP addresses are disabled, and VMs can't connect to the set of external IP addresses used by Google APIs and services.
  4. The program that creates the graph takes too long to finish.
  5. Pipeline options are being overwritten.
  6. (Python only) There is a problem with the requirements.txt file.
  7. There was a transient error.

To resolve this issue, first check the job logs for transient errors and retry the job. If those steps don't resolve the issue, work through the following troubleshooting steps.

Verify Docker entrypoint

Try this step if you're running a template from a custom Docker image rather than using one of the provided templates.

Check for the container entrypoint using the following command:

docker inspect $TEMPLATE_IMAGE

The following output is expected:

Java

/opt/google/dataflow/java_template_launcher

Python

/opt/google/dataflow/python_template_launcher

If you get a different output, then the entrypoint of your Docker container is overridden. Restore $TEMPLATE_IMAGE to the default.
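
To print only the entrypoint rather than the full image metadata, you can pass a Go template to docker inspect. This is a minimal sketch that assumes $TEMPLATE_IMAGE already points at your template image:

# Print only the configured entrypoint of the template image.
docker inspect --format '{{json .Config.Entrypoint}}' $TEMPLATE_IMAGE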

Check service account permissions

Check that the service account mentioned in the message has the following permissions; a sketch of example grant commands follows the list:

  • It must be able to read and write the Cloud Storage path that fills in ${file_path} in the message.
  • It must be able to read the Docker image that fills in ${image_url} in the message.
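
The following commands are a rough sketch of one way to grant those permissions with the gcloud CLI. The bucket, repository, region, and service account values are placeholders, the image is assumed to be hosted in Artifact Registry, and the roles shown (roles/storage.objectAdmin on the staging bucket and roles/artifactregistry.reader on the repository) are one combination that satisfies the requirements, not the only one:

# Grant read and write access to the Cloud Storage staging bucket.
gcloud storage buckets add-iam-policy-binding gs://BUCKET_NAME \
    --member="serviceAccount:SERVICE_ACCOUNT_EMAIL" \
    --role="roles/storage.objectAdmin"

# Grant read access to the Artifact Registry repository that hosts the image.
gcloud artifacts repositories add-iam-policy-binding REPOSITORY_NAME \
    --location=REGION \
    --member="serviceAccount:SERVICE_ACCOUNT_EMAIL" \
    --role="roles/artifactregistry.reader"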

Configure Private Google Access

If external IP addresses are disabled, you need to allow Compute Engine VMs to connect to the set of external IP addresses used by Google APIs and services. Enable Private Google Access on the subnet used by the network interface of the VM.

For configuration details, see Configuring Private Google Access.

By default, when a Compute Engine VM lacks an external IP address assigned to its network interface, it can only send packets to other internal IP address destinations.
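
As a sketch, you can enable Private Google Access on the subnet with the gcloud CLI; SUBNET_NAME and REGION are placeholders for the subnet used by the launcher and worker VMs:

# Turn on Private Google Access for the subnet used by the VMs.
gcloud compute networks subnets update SUBNET_NAME \
    --region=REGION \
    --enable-private-ip-google-access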

Check if the launcher program fails to exit

The program that constructs the pipeline must finish before the pipeline can be launched. The polling error could indicate that it took too long to do so.

To locate the cause in your code, try the following:

  • Check job logs and see if any operation appears to take a long time to complete. An example would be a request for an external resource.
  • Make sure no threads are blocking the program from exiting. Some clients might create their own threads, and if these clients are not shut down, the program waits forever for these threads to be joined.

Pipelines launched directly, without using a template, don't have these limitations. Therefore, if the pipeline works when launched directly but not as a template, the use of a template might be the root cause. Find and fix the issue in the template to resolve it.

Verify whether required pipeline options are suppressed

When using Flex Templates, you can configure some but not all pipeline options during pipeline initialization. For more information, see the Failed to read the job file section in this document.

Remove Apache Beam from the requirements file (Python Only)

If your Dockerfile installs a requirements.txt file that includes apache-beam[gcp], remove apache-beam[gcp] from the file and install it separately. The following Dockerfile commands demonstrate how to complete this step:

# Install the Apache Beam SDK separately, outside of the requirements file.
RUN pip install 'apache-beam[gcp]'
# Install the remaining dependencies listed in the requirements file.
RUN pip install -U -r ./requirements.txt

Putting Apache Beam in the requirements file can cause long launch times, often resulting in a timeout.
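
As a quick check before rebuilding the image, and assuming the file is named requirements.txt in the build context, you can confirm that the SDK is no longer listed:

# Print any remaining apache-beam entries; no output means the file is clean.
grep -i 'apache-beam' requirements.txt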

Polling timeouts when using Python

If you're running a Dataflow job by using a Flex Template and Python, your job might queue for a period, fail to run, and then display the following error:

Timeout in polling

The requirements.txt file that's used to install the required dependencies causes the error. When you launch a Dataflow job, all of the dependencies are staged first so that these files are accessible to the worker VMs. This process involves downloading and compiling every direct and indirect dependency in the requirements.txt file. Some dependencies, notably PyArrow, might take several minutes to compile. PyArrow is an indirect dependency that's used by Apache Beam and most Cloud Client Libraries.

To optimize your job's performance, use a Dockerfile or a custom container to prepackage the dependencies. For more information, see Package dependencies in "Configure Flex Templates."
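
As a rough sketch of the prepackaging approach, you might build the custom container with Cloud Build and push it to Artifact Registry. The image path is a placeholder, and the Dockerfile in the current directory is assumed to install requirements.txt at build time, as shown in the previous section:

# Build the Flex Template image with dependencies preinstalled, and push it.
gcloud builds submit . \
    --tag REGION-docker.pkg.dev/PROJECT_ID/REPOSITORY_NAME/IMAGE_NAME:TAG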

Job launch failures

The following section contains common errors that lead to job launch failures and steps for resolving or troubleshooting the errors.

Early startup issues

When the template launching process fails at an early stage, regular Flex Template logs might not be available. To investigate startup issues, enable serial port logging for the template launcher VM.

To enable logging for Java templates, set the enableLauncherVmSerialPortLogging option to true. To enable logging for Python and Go templates, set the enable_launcher_vm_serial_port_logging option to true. In the Google Cloud console, the parameter is listed in Optional parameters as Enable Launcher VM Serial Port Logging.

You can view the serial port output logs of the template launcher VM in Cloud Logging. To find the logs for a particular launcher VM, use the query resource.type="gce_instance" "launcher-number", where number starts with the current date in the format YYYYMMDD.
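
For example, you might run this query from the command line; the launcher-20240115 string is only an illustration of the YYYYMMDD prefix, and PROJECT_ID is a placeholder:

# List serial port output log entries from launcher VMs created on the given date.
gcloud logging read 'resource.type="gce_instance" "launcher-20240115"' \
    --project=PROJECT_ID \
    --limit=50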

Your organization policy might prohibit you from enabling serial port output logging.

Failed to read the job file

When you try to run a job from a Flex Template, your job might fail with one of the following errors:

Failed to read the job file : gs://dataflow-staging-REGION-PROJECT_ID/staging/template_launches/TIMESTAMP/job_object with error message: ...: Unable to open template file

Or:

Failed to read the result file : gs://BUCKET_NAME with error message: (ERROR_NUMBER): Unable to open template file: gs://BUCKET_NAME

This error occurs when the necessary pipeline initialization options are overwritten. When using Flex Templates, you can configure some but not all pipeline options during pipeline initialization. If the command line arguments required by the Flex Template are overwritten, the job might ignore, override, or discard the pipeline options passed by the template launcher. The job might fail to launch, or a job that doesn't use the Flex Template might launch.

To avoid this issue, during pipeline initialization, don't change the following pipeline options in user code or in the metadata.json file:

Java

  • runner
  • project
  • jobName
  • templateLocation
  • region

Python

  • runner
  • project
  • job_name
  • template_location
  • region

Go

  • runner
  • project
  • job_name
  • template_location
  • region

Permission denied on resource

When you try to run a job from a Flex Template, your job might fail with the following error:

Permission "MISSING_PERMISSION" denied on resource "projects/PROJECT_ID/locations/REGION/repositories/REPOSITORY_NAME" (or it may not exist).

This error occurs when the service account used for the job doesn't have permission to access the resources that are needed to run the Flex Template.

To avoid this issue, verify that the service account has the required permissions. Adjust the service account permissions as needed.
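
For the Artifact Registry repository named in the error, a minimal sketch of checking which principals currently have access (the project, location, and repository values are placeholders):

# List the current IAM bindings on the repository named in the error message.
gcloud artifacts repositories get-iam-policy REPOSITORY_NAME \
    --project=PROJECT_ID \
    --location=REGION

If the service account is missing from the output, grant it a role that allows pulling images, for example with the add-iam-policy-binding command shown earlier; roles/artifactregistry.reader is one assumption for such a role.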

Flag provided but not defined

When you try to run a Go Flex Template with the worker_machine_type pipeline option, the pipeline fails with the following error:

flag provided but not defined: -machine_type

This error is caused by a known issue in the Apache Beam Go SDK versions 2.47.0 and earlier. To resolve this issue, upgrade to Apache Beam Go version 2.48.0 or later.
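
A minimal sketch of the upgrade, assuming your pipeline module depends on the Beam Go SDK v2 module path:

# Upgrade the Apache Beam Go SDK to a release that includes the fix.
go get github.com/apache/beam/sdks/v2@v2.48.0
go mod tidy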

Flex Template launcher delay

When you submit a Flex Template job, the job request goes into a Spanner queue. The template launcher picks up the job from the Spanner queue and then runs the template. When Spanner has a message backlog, a significant delay might occur between the time you submit the job and the time the job launches.

To work around this issue, launch your Flex Template from a different region.
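
For example, a sketch of launching the same template in another region; the job name, template path, and us-east1 region are placeholders, not a recommendation of a specific region:

# Launch the Flex Template job in a different region.
gcloud dataflow flex-template run "JOB_NAME" \
    --template-file-gcs-location=gs://BUCKET_NAME/TEMPLATE_FILE \
    --region=us-east1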

The template parameters are invalid

When you try to use the gcloud CLI to run a job that uses a Google-provided template, the following error occurs:

ERROR: (gcloud.beta.dataflow.flex-template.run) INVALID_ARGUMENT: The template
parameters are invalid. Details: defaultSdkHarnessLogLevel: Unrecognized
parameter defaultWorkerLogLevel: Unrecognized parameter

This error occurs because some Google-provided templates don't support the defaultSdkHarnessLogLevel and defaultWorkerLogLevel options.

As a workaround, copy the template specification file to a Cloud Storage bucket. Then add the following parameters to the metadata section of the file:

"metadata": {
    ...
    "parameters": [
      ...,
      {
        "name": "defaultSdkHarnessLogLevel",
        "isOptional": true,
        "paramType": "TEXT"
      },
      {
        "name": "defaultWorkerLogLevel",
        "isOptional": true,
        "paramType": "TEXT"
      }
    ]
  }

After you make this change to the template file, run the template with the gcloud dataflow flex-template run command, and point the following flag at the modified file:

--template-file-gcs-location=gs://BUCKET_NAME/FILENAME

Replace the following values:

  • BUCKET_NAME: the name of your Cloud Storage bucket
  • FILENAME: the name of your template specification file
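
Putting the workaround together, the following sketch downloads a template specification, uploads the edited copy to your bucket, and runs it with one of the added parameters. The source path, bucket, file, job name, and region values are placeholders, and passing defaultSdkHarnessLogLevel=DEBUG is only an example value:

# Download the Google-provided template specification file.
gcloud storage cp gs://SOURCE_TEMPLATE_PATH ./FILENAME

# After adding the parameters shown above to ./FILENAME, upload the edited copy.
gcloud storage cp ./FILENAME gs://BUCKET_NAME/FILENAME

# Run the job from the modified template specification.
gcloud dataflow flex-template run "JOB_NAME" \
    --template-file-gcs-location=gs://BUCKET_NAME/FILENAME \
    --region=REGION \
    --parameters defaultSdkHarnessLogLevel=DEBUG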

Flex Template launcher logs show wrong severity

When a custom Flex Template launch fails, the following message appears in the log files with the severity ERROR:

ERROR: Error occurred in the launcher container: Template launch failed. See console logs.

The root cause of the launch failure usually appears in the logs before this message, with the severity INFO. Although this log level might be incorrect, it is expected, because the Flex Template launcher can't extract severity details from the log messages produced by the Apache Beam application.

If you want to see the correct severity for every message in the launcher log, configure your template to generate logs in JSON format instead of plain text. This configuration lets the template launcher extract the correct log message severity. Use the following message structure:

{
  "message": "The original log message",
  "severity": "DEBUG/INFO/WARN/ERROR"
}

In Java, you can use the Logback logger with a custom JSON appender implementation. For more information, see the Logback example configuration and the JSON appender example code on GitHub.

This issue only impacts the logs generated by the Flex Template launcher when the pipeline is launching. When the launch succeeds and the pipeline is running, the logs produced by Dataflow workers have the proper severity.

Google-provided templates show the correct severity during job launch, because the Google-provided templates use this JSON logging approach.