Using custom containers in Dataflow

This page describes how to customize the runtime environment of Python user code in Dataflow pipelines by supplying a custom container image. Custom containers are only supported for portable pipelines using Dataflow Runner v2.

When Dataflow launches worker VMs, it uses Docker container images. Instead of using one of the default Apache Beam images, you can specify a custom container image; Dataflow then launches workers that pull the specified image. These are some reasons you might use a custom container:

  • Preinstalling pipeline dependencies to reduce worker start time.
  • Preinstalling pipeline dependencies not available in public repositories.
  • Launching third-party software in the background.
  • Customizing the execution environment.

For a more in-depth look into custom containers, see the Apache Beam custom container guide.

Before you begin

Verify that you have Apache Beam SDK version 2.25.0 or later installed, and check that this SDK version supports your Python version. For more information, see the guide for Installing the Apache Beam SDK.
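
To check the installed versions, you can run the following commands; this is a minimal sketch that assumes the Apache Beam SDK was installed with pip into your active Python environment:

# Print the Python version and the installed Apache Beam SDK version.
python --version
pip show apache-beam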

Ensure you have Docker installed to test your container image locally. For more information, see Get Docker.
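
To confirm that Docker is installed and that the Docker daemon is running, you can run a quick check such as the following:

# Verify that the Docker client and daemon are available.
docker --version
docker run --rm hello-world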

Creating and building the container image

This example uses Python 3.8 with Apache Beam SDK version 2.25.0. To create a custom container image, specify the Apache Beam image as the parent image and add your own customizations.

  1. Create a new Dockerfile that specifies apache/beam_python3.8_sdk:2.25.0 as the parent image, and add any customizations. For more information on writing Dockerfiles, see Best practices for writing Dockerfiles.
    FROM apache/beam_python3.8_sdk:2.25.0
    # Make your customizations here, for example:
    ENV FOO=/bar
    COPY path/to/myfile ./
  2. Build the child image and push it to a container registry.

    Cloud Build

    export PROJECT=PROJECT
    export REPO=REPO
    export TAG=TAG
    export REGISTRY_HOST=HOST
    export IMAGE_URI=$REGISTRY_HOST/$PROJECT/$REPO:$TAG
    
    gcloud builds submit --tag $IMAGE_URI

    Docker

    export PROJECT=PROJECT
    export REPO=REPO
    export TAG=TAG
    export REGISTRY_HOST=HOST
    export IMAGE_URI=$REGISTRY_HOST/$PROJECT/$REPO:$TAG
    
    docker build -f Dockerfile -t $IMAGE_URI ./
    docker push $IMAGE_URI
    Replace the following:
    • PROJECT: the project name or user name.
    • REPO: the image repository name.
    • TAG: the image tag.
    • HOST: the image registry host name, for example, gcr.io.
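
Optionally, verify that the image was pushed and is readable in the registry. The following check is a sketch that assumes the image is hosted on Container Registry (gcr.io):

# Describe the pushed image to confirm that it exists in the registry.
gcloud container images describe $IMAGE_URI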

Running a job with custom containers

To avoid unexpected errors, when you run your pipeline, use the same Python version and Apache Beam SDK version that are installed in your custom container image.
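
One way to confirm which versions are inside the image is to run the checks in the container; this is a rough sketch that assumes the image provides /bin/sh and has python and pip on the PATH:

# Print the Python version and Apache Beam SDK version baked into the image,
# then compare them with the versions in your local environment.
docker run --rm --entrypoint /bin/sh $IMAGE_URI -c "python --version && pip show apache-beam"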

Testing locally

Test the container image locally by running the Apache Beam wordcount example using the PortableRunner:

python -m apache_beam.examples.wordcount \
  --input=INPUT_FILE \
  --output=OUTPUT_FILE \
  --runner=PortableRunner \
  --job_endpoint=embed \
  --environment_type=DOCKER \
  --environment_config=$IMAGE_URI

Replace the following:

  • INPUT_FILE: an input file that can be read as a text file. This file must be accessible from inside the SDK harness
    container, either preloaded onto the container image or available as a remote file.
  • OUTPUT_FILE: a file path to write output to. This path is either a remote path or a local path on the container.
  • $IMAGE_URI: the custom container image URI. You can use the shell variable $IMAGE_URI constructed in the previous step if the variable is still in scope.

After the pipeline completes, check the console logs to verify that the pipeline completed successfully and that the remote image specified by $IMAGE_URI was used.

This local test verifies the container image itself. Verify your pipeline without a container image before this step, for example by using the DirectRunner. Because containers are isolated, they cannot access local file systems or local configuration such as environment variables.

As a result, local input files, credential files, and local environment variables, such as those used to set up Application Default Credentials, are not accessible from inside the container. Local outputs are written to the container file system and are no longer accessible after the container shuts down and the pipeline completes.

If you need to persist these outputs, provide a remote file system as the output path and make sure that the container itself has access to that remote file system. For debugging purposes, adding temporary logging might also suffice.

For more information, see the Apache Beam guide for Running pipelines with custom container images.

Launching the Dataflow job

Specify the path to the container image when launching the Apache Beam pipeline on Dataflow. If you're launching a batch Python pipeline, you must set the --experiment=use_runner_v2 flag; for a streaming Python pipeline, specifying the experiment is not necessary. For example, to launch the batch wordcount example with a custom container image:

python -m apache_beam.examples.wordcount \
  --input=INPUT_FILE \
  --output=OUTPUT_FILE \
  --project=PROJECT_ID \
  --region=REGION \
  --temp_location=TEMP_LOCATION \
  --runner=DataflowRunner \
  --experiment=use_runner_v2 \
  --worker_harness_container_image=$IMAGE_URI

Replace the following:

  • INPUT_FILE: the Cloud Storage input file path read by Dataflow when running the example.
  • OUTPUT_FILE: the Cloud Storage output file path written to by the example pipeline. This file contains the word counts.
  • PROJECT_ID: the ID of your Google Cloud project.
  • REGION: the regional endpoint for deploying your Dataflow job.
  • TEMP_LOCATION: the Cloud Storage path for Dataflow to stage temporary job files created during the execution of the pipeline.
  • $IMAGE_URI: the custom container image URI. You can use the shell variable $IMAGE_URI constructed in the previous step if the variable is still in scope.
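
As an illustration, a complete invocation might look like the following. The project ID and bucket are hypothetical placeholders, and the input is the public King Lear sample text:

# Hypothetical values; replace the project ID and bucket with your own.
export PROJECT_ID=my-project-id
export BUCKET=gs://my-bucket

python -m apache_beam.examples.wordcount \
  --input=gs://dataflow-samples/shakespeare/kinglear.txt \
  --output=$BUCKET/wordcount/output \
  --project=$PROJECT_ID \
  --region=us-central1 \
  --temp_location=$BUCKET/tmp/ \
  --runner=DataflowRunner \
  --experiment=use_runner_v2 \
  --worker_harness_container_image=$IMAGE_URI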

Troubleshooting

This section provides instructions for troubleshooting common issues found when interacting with custom containers in Dataflow.

Before reaching out for support, ensure that you have ruled out problems related to your container image by following the steps in testing locally and in the following troubleshooting sections.

Finding container logs

To find Dataflow worker logs for container-related error messages, use the Logs Explorer:

  1. Select the following log names:

    • dataflow.googleapis.com/docker
    • dataflow.googleapis.com/kubelet
    • dataflow.googleapis.com/worker
  2. Select the Dataflow Step resource and specify the job_id.
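
You can also query the same logs from the command line with the gcloud CLI. The following is a sketch that assumes you know the job ID; JOB_ID and PROJECT_ID are placeholders:

# Read Docker-related worker logs for a specific Dataflow job.
gcloud logging read \
  'resource.type="dataflow_step" AND resource.labels.job_id="JOB_ID" AND log_id("dataflow.googleapis.com/docker")' \
  --project=PROJECT_ID \
  --limit=50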

Workers don't start

If your workers aren't starting, or your job times out, verify that Dataflow is able to pull your custom container image.

Using the Logs Explorer, query the Dataflow logs for a log message like Error Syncing pod... by using the following query:

resource.type="dataflow_step" AND jsonPayload.message:("IMAGE_URI") AND severity="ERROR"

Replace IMAGE_URI with the full URI of your custom container image.

Dataflow workers must be able to access the image so that they can pull it when they start; if the workers can't pull the container image, they can't start. If you use Container Registry to host your container image, the default Google Cloud service account can already access images in the same project.
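
If the worker service account cannot read the image, for example because the image is hosted in a different project, one option is to grant it read access to the Container Registry storage bucket. The following sketch assumes a gcr.io-hosted image, whose backing bucket follows the artifacts.PROJECT.appspot.com naming convention; SERVICE_ACCOUNT_EMAIL and PROJECT are placeholders:

# Grant the worker service account read access to the bucket
# that backs Container Registry in the image's project.
gsutil iam ch \
  serviceAccount:SERVICE_ACCOUNT_EMAIL:roles/storage.objectViewer \
  gs://artifacts.PROJECT.appspot.com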

For more information, see Configuring access control.