Using custom containers in Dataflow

This page describes how to customize the runtime environment of Python user code in Dataflow pipelines by supplying a custom container image.

When Dataflow launches worker VMs, it uses Docker container images. You can specify a custom container image instead of using one of the default Apache Beam images. When you specify a custom container image, Dataflow launches workers that pull the specified image. These are some reasons you might use a custom container:

  • Preinstalling pipeline dependencies to reduce worker start time.
  • Preinstalling pipeline dependencies not available in public repositories.
  • Launching third-party software in the background.
  • Customizing the execution environment.

For a more in-depth look into custom containers, see the Apache Beam custom container guide.

Before you begin

Verify that you have Apache Beam SDK version 2.25.0 or later installed, and check that this SDK version supports your Python version. For more information, see the guide for Installing the Apache Beam SDK.

Ensure you have Docker installed to test your container image locally. For more information, see Get Docker.
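
For example, you can confirm the installed versions from a shell. The Apache Beam check assumes the SDK is importable from your active Python environment:

# Check the local Python, Apache Beam SDK, and Docker versions.
python --version
python -c "import apache_beam; print(apache_beam.__version__)"
docker --version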

Creating and building the container image

This example uses Python 3.8 with Apache Beam SDK version 2.25.0. To create a custom container image, specify the Apache Beam image as the parent image and add your own customizations.

  1. Create a new Dockerfile that specifies apache/beam_python3.8_sdk:2.25.0 as the parent image, and add any customizations. For more information about writing Dockerfiles, see Best practices for writing Dockerfiles.
    FROM apache/beam_python3.8_sdk:2.25.0
    # Make your customizations here, for example:
    ENV FOO=/bar
    COPY path/to/myfile ./
  2. Build the child image and push it to a container registry.

    Cloud Build

    export PROJECT=PROJECT
    export REPO=REPO
    export TAG=TAG
    export REGISTRY_HOST=HOST
    export IMAGE_URI=$REGISTRY_HOST/$PROJECT/$REPO:$TAG
    
    gcloud builds submit --tag $IMAGE_URI

    Docker

    export PROJECT=PROJECT
    export REPO=REPO
    export TAG=TAG
    export REGISTRY_HOST=HOST
    export IMAGE_URI=$REGISTRY_HOST/$PROJECT/$REPO:$TAG
    
    docker build -f Dockerfile -t $IMAGE_URI ./
    docker push $IMAGE_URI
    Replace the following (a filled-in example follows this list):
    • PROJECT: the project name or user name.
    • REPO: the image repository name.
    • TAG: the image tag.
    • HOST: the image registry host name, for example, gcr.io.
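
For example, with illustrative values (a hypothetical project named my-project pushing to gcr.io), the variables resolve to the following image URI:

export PROJECT=my-project
export REPO=beam-custom
export TAG=latest
export REGISTRY_HOST=gcr.io
export IMAGE_URI=$REGISTRY_HOST/$PROJECT/$REPO:$TAG
echo $IMAGE_URI    # prints gcr.io/my-project/beam-custom:latest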

Testing locally

Test the container image locally by running the Apache Beam wordcount example using the PortableRunner:

python -m apache_beam.examples.wordcount \
  --input=INPUT_FILE \
  --output=OUTPUT_FILE \
  --runner=PortableRunner \
  --job_endpoint=embed \
  --environment_type=DOCKER \
  --environment_config=$IMAGE_URI

Replace the following:

  • INPUT_FILE: a file in the container image that can be read as a text file.
  • OUTPUT_FILE: a file path in the container image for writing the output. The output file is not accessible after execution completes, because the container is shut down.
  • $IMAGE_URI: the custom container image URI. You can use the shell variable $IMAGE_URI constructed in the previous step if the variable is still in scope.

Look at the console logs to verify that the pipeline completed successfully and that the remote image, specified by $IMAGE_URI, was used.
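
You can also inspect the image contents directly without running a pipeline. For example, the following command overrides the image entrypoint to list the installed Python packages; it assumes the image provides /bin/sh and pip, as the official Apache Beam Python images do:

# List installed packages inside the custom image (grep pattern is illustrative).
docker run --rm --entrypoint /bin/sh $IMAGE_URI -c "pip list | grep -i beam"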

For more information, see the Apache Beam guide for Running pipelines with custom container images.

Launching the Dataflow job

Specify the path to the container image when launching the Apache Beam pipeline on Dataflow. If you're launching a batch Python pipeline, you must also set the --experiment=use_runner_v2 flag. If you're launching a streaming Python pipeline, you don't need to specify the experiment. For example, to launch the batch wordcount example with a custom container image:

python -m apache_beam.examples.wordcount \
  --input=INPUT_FILE \
  --output=OUTPUT_FILE \
  --project=PROJECT_ID \
  --region=REGION \
  --temp_location=TEMP_LOCATION \
  --runner=DataflowRunner \
  --experiment=use_runner_v2 \
  --worker_harness_container_image=$IMAGE_URI

Replace the following:

  • INPUT_FILE: the Cloud Storage input file path read by Dataflow when running the example.
  • OUTPUT_FILE: the Cloud Storage output file path written to by the example pipeline. This file contains the word counts.
  • PROJECT_ID: the ID of your Google Cloud project.
  • REGION: the regional endpoint for deploying your Dataflow job.
  • TEMP_LOCATION: the Cloud Storage path for Dataflow to stage temporary job files created during the execution of the pipeline.
  • $IMAGE_URI: the custom container image URI. You can use the shell variable $IMAGE_URI constructed in the previous step if the variable is still in scope.
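
After you submit the pipeline, one way to confirm that the job was created is to list recent Dataflow jobs with the gcloud CLI, replacing REGION with the same regional endpoint used above:

gcloud dataflow jobs list --region=REGION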

Troubleshooting

This section provides instructions for troubleshooting common issues that you might encounter when using custom containers in Dataflow.

Before reaching out for support, ensure that you have ruled out problems related to your container image by following the steps in Testing locally and in the following troubleshooting sections.

Finding container logs

To find Dataflow worker logs for container-related error messages, use the Logs Explorer:

  1. Select the following log names:

    • dataflow.googleapis.com/docker
    • dataflow.googleapis.com/kubelet
    • dataflow.googleapis.com/worker
  2. Select the Dataflow Step resource and specify the job_id.
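
Alternatively, you can query the same logs from the command line. For example, the following gcloud command reads the Docker log entries for a single job; JOB_ID is a placeholder for your job ID:

gcloud logging read \
  'resource.type="dataflow_step" AND resource.labels.job_id="JOB_ID" AND log_id("dataflow.googleapis.com/docker")' \
  --limit=50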

Workers don't start

If your workers aren't starting, or your job times out, verify that Dataflow is able to pull your custom container image.

Query the Dataflow logs in the Logs Explorer for an Error Syncing pod... log message by using the following query, replacing $IMAGE_URI with your container image URI:

resource.type="dataflow_step" AND jsonPayload.message:("$IMAGE_URI") AND severity="ERROR"

The image must be accessible to Dataflow workers so that the workers can pull it at startup; if Dataflow cannot pull the container image, the workers cannot start. If you use Container Registry to host your container image, the default Google Cloud service account can already access images in the same project.

For more information, see Configuring access control.
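
For example, if the image is hosted in Container Registry under gcr.io and the pull fails due to permissions, one possible fix is to grant the worker service account read access to the registry's underlying Cloud Storage bucket. The service account email and bucket name below are placeholders, and the bucket name varies by registry host:

# Grant read access to the Container Registry storage bucket (names are placeholders).
gsutil iam ch serviceAccount:SERVICE_ACCOUNT_EMAIL:objectViewer gs://artifacts.PROJECT.appspot.com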