Use custom containers in Dataflow

This page describes how to customize the runtime environment of user code in Dataflow pipelines by supplying a custom container image. Custom containers are supported for pipelines using Dataflow Runner v2.

When Dataflow launches worker VMs, it uses Docker container images to launch containerized SDK processes on the workers. You can specify a custom container image instead of using one of the default Apache Beam images. When you specify a custom container image, Dataflow launches workers that pull the specified image. The following list includes reasons you might use a custom container:

  • Preinstalling pipeline dependencies to reduce worker start time.
  • Preinstalling pipeline dependencies that are not available in public repositories.
  • Prestaging large files to reduce worker start time.
  • Launching third-party software in the background.
  • Customizing the execution environment.

For a more in-depth look into custom containers, see the Apache Beam custom container guide. For examples of Python pipelines that use custom containers, see Dataflow custom containers.

Before you begin

Verify that the version of the Apache Beam SDK that you have installed supports Runner v2 and your language version.

To enable custom containers, use the following pipeline options:

Java

Use --experiments=use_runner_v2 to enable Runner v2.

Use --sdkContainerImage to specify a custom container image for your Java runtime.

Python

If using SDK version 2.30.0 or later, use the pipeline option --sdk_container_image.

For older versions of the SDK, use the pipeline option --worker_harness_container_image.

Go

If using SDK version 2.40.0 or later, use the pipeline option --sdk_container_image.

For older versions of the SDK, use the pipeline option --worker_harness_container_image.

For more information, see the guide for Installing the Apache Beam SDK.

To test your container image locally, you must have Docker installed. For more information, see Get Docker.

Default SDK container images

We recommend starting with a default Apache Beam SDK image as a base container image. Default images are released as part of Apache Beam releases to Docker Hub.
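
For example, to inspect a default image locally or use it as a base, you can pull it from Docker Hub. The SDK language and version in the tag below are illustrative; use the tag that matches your pipeline:

docker pull apache/beam_python3.8_sdk:2.46.0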

Create and build the container image

This section provides examples for different ways to create a custom SDK container image.

A custom SDK container image must meet the following requirements:

  • The Apache Beam SDK and necessary dependencies are installed.
  • The default ENTRYPOINT script (/opt/apache/beam/boot on default containers) runs as the last step during container startup. For more information, see Modify the container entrypoint.

Use an Apache Beam base image

To create a custom container image, specify the Apache Beam image as the parent image and add your own customizations. For more information about writing Dockerfiles, see Best practices for writing Dockerfiles.

  1. Create a new Dockerfile, specifying the base image using the FROM instruction.

    Java

    In this example, we use Java 8 with the Apache Beam SDK version 2.46.0.

    FROM apache/beam_java8_sdk:2.46.0
    
    # Make your customizations here, for example:
    ENV FOO=/bar
    COPY path/to/myfile ./
    

    The runtime version of the custom container must match the runtime that you will use to start the pipeline. For example, if you will start the pipeline from a local Java 11 environment, the FROM line must specify a Java 11 environment: apache/beam_java11_sdk:....

    Python

    In this example, we use Python 3.8 with the Apache Beam SDK version 2.46.0.

    FROM apache/beam_python3.8_sdk:2.46.0
    
    # Make your customizations here, for example:
    ENV FOO=/bar
    COPY path/to/myfile ./
    

    The runtime version of the custom container must match the runtime that you will use to start the pipeline. For example, if you will start the pipeline from a local Python 3.8 environment, the FROM line must specify a Python 3.8 environment: apache/beam_python3.8_sdk:....

    Go

    In this example, we use Go with the Apache Beam SDK version 2.46.0.

    FROM apache/beam_go_sdk:2.46.0
    
    # Make your customizations here, for example:
    ENV FOO=/bar
    COPY path/to/myfile ./
    
  2. Build the child image and push this image to a container registry.

    Cloud Build

    export PROJECT=PROJECT
    export REPO=REPO
    export TAG=TAG
    export IMAGE_URI=gcr.io/$PROJECT/$REPO:$TAG
    gcloud builds submit . --tag $IMAGE_URI
    

    Docker

    export PROJECT=PROJECT
    export REPO=REPO
    export TAG=TAG
    export IMAGE_URI=gcr.io/$PROJECT/$REPO:$TAG
    docker build . --tag $IMAGE_URI
    docker push $IMAGE_URI
    

    Replace the following:

    • PROJECT: the project name or user name.
    • REPO: the image repository name.
    • TAG: the image tag, typically latest.
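
    Optionally, to confirm that the push succeeded and that the image can be pulled from the registry, you can pull it back. This is a quick sanity check, not a required step, and reuses the $IMAGE_URI variable from above:

    docker pull $IMAGE_URI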

Use a custom base image or multi-stage builds

If you have an existing base image or need to modify some base aspect of the default Apache Beam images (such as the OS version or patches), use a multi-stage build process to copy the necessary artifacts from a default Apache Beam base image and provide your custom container image.

The following example shows a Dockerfile that copies files from the Apache Beam SDK:

Java

FROM openjdk:8

# Copy files from official SDK image, including script/dependencies.
COPY --from=apache/beam_java8_sdk:2.46.0 /opt/apache/beam /opt/apache/beam

# Set the entrypoint to Apache Beam SDK launcher.
ENTRYPOINT ["/opt/apache/beam/boot"]

Python

FROM python:3.8-slim

# Install SDK.
RUN pip install --no-cache-dir apache-beam[gcp]==2.46.0

# Verify that the image does not have conflicting dependencies.
RUN pip check

# Copy files from official SDK image, including script/dependencies.
COPY --from=apache/beam_python3.8_sdk:2.46.0 /opt/apache/beam /opt/apache/beam

# Set the entrypoint to Apache Beam SDK launcher.
ENTRYPOINT ["/opt/apache/beam/boot"]

This example assumes that the necessary dependencies (in this case, Python 3.8 and pip) have been installed on the existing base image. Installing the Apache Beam SDK into the image ensures that the image has the necessary SDK dependencies and reduces the worker startup time. Important: the SDK version specified in the RUN and COPY instructions must match the version used to launch the pipeline.

Go

FROM golang:latest

# Copy files from official SDK image, including script/dependencies.
COPY --from=apache/beam_go_sdk:2.46.0 /opt/apache/beam /opt/apache/beam

# Set the entrypoint to Apache Beam SDK launcher.
ENTRYPOINT ["/opt/apache/beam/boot"]

Modify the container entrypoint

Custom containers must run the default ENTRYPOINT script /opt/apache/beam/boot, which initializes the worker environment and starts the SDK worker process. If you do not set this entrypoint, your worker does not start properly.

If you need to run your own script during container startup, your image ENTRYPOINT must properly start this worker SDK process, including passing Dataflow arguments to this script.

Therefore, your custom ENTRYPOINT must end with running /opt/apache/beam/boot, and any required arguments passed by Dataflow during container startup must be properly passed to the default boot script. To do this step, create a custom script that runs /opt/apache/beam/boot:

#!/bin/bash

echo "This is my custom script"

# ...

# Pass command arguments to the default boot script.
/opt/apache/beam/boot "$@"

Then, override the default ENTRYPOINT. The following Dockerfile example demonstrates this step:

Java

FROM apache/beam_java8_sdk:2.46.0

COPY script.sh path/to/my/script.sh
ENTRYPOINT [ "path/to/my/script.sh" ]

Python

FROM apache/beam_python3.8_sdk:2.46.0

COPY script.sh path/to/my/script.sh
ENTRYPOINT [ "path/to/my/script.sh" ]

Go

FROM apache/beam_go_sdk:2.46.0

COPY script.sh path/to/my/script.sh
ENTRYPOINT [ "path/to/my/script.sh" ]

Run a job with custom containers

This section discusses running pipelines with custom containers, both with local runners for testing and on Dataflow. If you have already verified your container image and pipeline, skip to Launch the Dataflow job.

Before you begin

When running your pipeline, launch the pipeline using the Apache Beam SDK with the same version and language version as the SDK on your custom container image. This step avoids unexpected errors from incompatible dependencies or SDKs.
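
For example, for a Python pipeline, one quick way to check the SDK version in your launch environment before comparing it against your image tag (an optional sanity check):

python --version
python -c "import apache_beam; print(apache_beam.__version__)"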

Test locally

To learn more about Apache Beam-specific usage, see the Apache Beam guide for Running pipelines with custom container images.

Basic testing with PortableRunner

To verify that remote container images can be pulled and can run a simple pipeline, use the Apache Beam PortableRunner. When you use the PortableRunner, job submission occurs in the local environment, and the DoFn execution happens in the Docker environment.

When you use GPUs, the Docker container might not have access to the GPUs. To test your container with GPUs, use the direct runner and follow the steps for testing a container image on a standalone VM with GPUs in the Debug with a standalone VM section of the Use GPUs page.

The following runs an example pipeline:

Java

mvn compile exec:java -Dexec.mainClass=com.example.package.MyClassWithMain \
    -Dexec.args="--runner=PortableRunner \
    --jobEndpoint=ENDPOINT \
    --defaultEnvironmentType=DOCKER \
    --defaultEnvironmentConfig=IMAGE_URI \
    --inputFile=INPUT_FILE \
    --output=OUTPUT_FILE"

Python

python path/to/my/pipeline.py \
  --runner=PortableRunner \
  --job_endpoint=ENDPOINT \
  --environment_type=DOCKER \
  --environment_config=IMAGE_URI \
  --input=INPUT_FILE \
  --output=OUTPUT_FILE

Go

go run path/to/my/pipeline.go \
  --runner=PortableRunner \
  --job_endpoint=ENDPOINT \
  --environment_type=DOCKER \
  --environment_config=IMAGE_URI \
  --input=INPUT_FILE \
  --output=OUTPUT_FILE

Replace the following:

  • ENDPOINT: the job service endpoint to use, in the form of address and port. For example: localhost:3000. Use embed to run an in-process job service.
  • IMAGE_URI: the custom container image URI. You can use the shell variable $IMAGE_URI constructed in the previous step if the variable is still in scope.
  • INPUT_FILE: an input file that can be read as a text file. This file must be accessible by the SDK harness container image, either preloaded on the container image or as a remote file.
  • OUTPUT_FILE: a file path to write output to. This path is either a remote path or a local path on the container.

After the pipeline completes, review the console logs to verify that it completed successfully and that the remote image, specified by $IMAGE_URI, was used.

After the pipeline runs, files saved to the container are not available in your local file system, and the container is stopped. You can copy files from the stopped container file system by using docker cp.
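
For example, assuming the pipeline wrote its output to /tmp/output inside the container, the following commands find the stopped container and copy the output out. The container ID and paths are illustrative:

# List recent containers to find the ID of the stopped one.
docker ps -a
# Copy files from the stopped container to the local file system.
docker cp CONTAINER_ID:/tmp/output ./local-output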

Alternatively:

  • Provide outputs to a remote file system like Cloud Storage. You might need to manually configure access for testing purposes, including for credential files or Application Default Credentials.
  • For quick debugging, add temporary logging.

Use the DirectRunner

For more in-depth local testing of the container image and your pipeline, use the Apache Beam DirectRunner.

You can verify your pipeline separately from the container by testing in a local environment matching the container image, or by launching the pipeline on a running container.

Java

docker run -it --entrypoint "/bin/bash" $IMAGE_URI
...
# On docker container:
root@4f041a451ef3:/#  mvn compile exec:java -Dexec.mainClass=com.example.package.MyClassWithMain ...

Python

docker run -it --entrypoint "/bin/bash" $IMAGE_URI
...
# On docker container:
root@4f041a451ef3:/#  python path/to/my/pipeline.py ...

Go

docker run -it --entrypoint "/bin/bash" $IMAGE_URI
...
# On docker container:
root@4f041a451ef3:/#  go run path/to/my/pipeline.go ...

The examples assume any pipeline files, including the pipeline itself, are on the custom container, have been mounted from a local file system, or are remote and accessible by Apache Beam and the container. For example, to use Maven (mvn) to run the previous Java example, Maven and its dependencies must be staged on the container. See Docker documentation on Storage and docker run for more information.
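
For example, one way to make local pipeline files available inside the container without rebuilding the image is to mount your working directory when starting the container. The /workspace mount path is arbitrary:

docker run -it --rm \
  --entrypoint "/bin/bash" \
  -v "$(pwd)":/workspace \
  -w /workspace \
  $IMAGE_URI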

The goal for DirectRunner testing is to test your pipeline in the custom container environment, not to test running your container with its default ENTRYPOINT. Modify the ENTRYPOINT (for example, docker run --entrypoint ...) to either directly run your pipeline or to allow manually running commands on the container.

If you rely on specific configuration from running on Compute Engine, you can run this container directly on a Compute Engine VM. See Containers on Compute Engine for more information.

Launch the Dataflow job

When launching the Apache Beam pipeline on Dataflow, specify the path to the container image:

Java

Use --sdkContainerImage to specify an SDK container image for your Java runtime.

Use --experiments=use_runner_v2 to enable Runner v2.

Python

If using SDK version 2.30.0 or later, use the pipeline option --sdk_container_image to specify an SDK container image.

For older versions of the SDK, use the pipeline option --worker_harness_container_image to specify the location of the container image to use for the worker harness.

Custom containers are only supported for Dataflow Runner v2. If you are launching a batch Python pipeline, set the --experiments=use_runner_v2 flag. If you are launching a streaming Python pipeline, specifying the experiment is not necessary as streaming Python pipelines use Runner v2 by default.

Go

If using SDK version 2.40.0 or later, use the pipeline option --sdk_container_image to specify an SDK container image.

For older versions of the SDK, use the pipeline option --worker_harness_container_image to specify the location of the container image to use for the worker harness.

Custom containers are supported on all versions of the Go SDK because they use Dataflow Runner v2 by default.

The following example demonstrates how to launch the batch wordcount example with a custom container.

Java

mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
   -Dexec.args="--runner=DataflowRunner \
                --inputFile=INPUT_FILE \
                --output=OUTPUT_FILE \
                --project=PROJECT_ID \
                --region=REGION \
                --gcpTempLocation=TEMP_LOCATION \
                --diskSizeGb=DISK_SIZE_GB \
                --experiments=use_runner_v2 \
                --sdkContainerImage=$IMAGE_URI"

Python

Using the Apache Beam SDK for Python version 2.30.0 or later:

python -m apache_beam.examples.wordcount \
  --input=INPUT_FILE \
  --output=OUTPUT_FILE \
  --project=PROJECT_ID \
  --region=REGION \
  --temp_location=TEMP_LOCATION \
  --runner=DataflowRunner \
  --disk_size_gb=DISK_SIZE_GB \
  --experiments=use_runner_v2 \
  --sdk_container_image=$IMAGE_URI

Go

wordcount --input gs://dataflow-samples/shakespeare/kinglear.txt \
          --output gs://<your-gcs-bucket>/counts \
          --runner dataflow \
          --project your-gcp-project \
          --region your-gcp-region \
          --temp_location gs://<your-gcs-bucket>/tmp/ \
          --staging_location gs://<your-gcs-bucket>/binaries/ \
          --sdk_container_image=$IMAGE_URI

Replace the following:

  • INPUT_FILE: the Cloud Storage input file path read by Dataflow when running the example.
  • OUTPUT_FILE: the Cloud Storage output file path written to by the example pipeline. This file contains the word counts.
  • PROJECT_ID: the ID of your Google Cloud project.
  • REGION: the regional endpoint for deploying your Dataflow job.
  • TEMP_LOCATION: the Cloud Storage path for Dataflow to stage temporary job files created during the execution of the pipeline.
  • DISK_SIZE_GB: Optional. If your container is large, consider increasing the default boot disk size to avoid running out of disk space.
  • $IMAGE_URI: the SDK custom container image URI. You can use the shell variable $IMAGE_URI constructed in the previous step if the variable is still in scope.

Pre-build the Python SDK custom container image with extra dependencies

When you launch a Python Dataflow job, extra dependencies, such as the ones specified by using --requirements_file and --extra_packages, are installed in each SDK worker container started by the Dataflow service. Dependency installation often leads to high CPU usage and a long warm-up period on all newly started Dataflow workers when the job first starts and during autoscaling.

To avoid repetitive dependency installations, you can pre-build the custom Python SDK container image with the dependencies pre-installed.

Pre-build using a Dockerfile

To add any extra dependencies directly to your Python custom container, use the following commands:

FROM apache/beam_python3.8_sdk:2.46.0

# Pre-install Python dependencies.
RUN pip install lxml
# Pre-install other dependencies.
RUN apt-get update \
  && apt-get dist-upgrade -y \
  && apt-get install -y --no-install-recommends ffmpeg

# Set the entrypoint to the Apache Beam SDK launcher.
ENTRYPOINT ["/opt/apache/beam/boot"]

Submit your job with the --sdk_container_image and the --sdk_location pipeline options. The --sdk_location option prevents the SDK from downloading when your job launches. The SDK is retrieved directly from the container image.

python -m apache_beam.examples.wordcount \
  --input=INPUT_FILE \
  --output=OUTPUT_FILE \
  --project=PROJECT_ID \
  --region=REGION \
  --temp_location=TEMP_LOCATION \
  --runner=DataflowRunner \
  --disk_size_gb=DISK_SIZE_GB \
  --experiments=use_runner_v2 \
  --sdk_container_image=$IMAGE_URI \
  --sdk_location=container

Pre-build when submitting the job

You can also use the Python SDK container image pre-building workflow. Submit your job with the --prebuild_sdk_container_engine pipeline option and with the --docker_registry_push_url and --sdk_location options.

When using the --prebuild_sdk_container_engine pipeline option, Apache Beam generates a custom container and installs all the dependencies specified by the --requirements_file and --extra_packages options.

  • If your project has the Cloud Build API enabled, you can build the image with Cloud Build by using the following pipeline options:

    python -m apache_beam.examples.wordcount \
      --input=INPUT_FILE \
      --output=OUTPUT_FILE \
      --project=PROJECT_ID \
      --region=REGION \
      --temp_location=TEMP_LOCATION \
      --runner=DataflowRunner \
      --disk_size_gb=DISK_SIZE_GB \
      --experiments=use_runner_v2 \
      --requirements_file=./requirements.txt \
      --prebuild_sdk_container_engine=cloud_build \
      --docker_registry_push_url=GAR_PATH/GCR_PATH \
      --sdk_location=container
    
  • If you have a local Docker installation and prefer to pre-build the image locally, use the following pipeline options:

    python -m apache_beam.examples.wordcount \
      --input=INPUT_FILE \
      --output=OUTPUT_FILE \
      --project=PROJECT_ID \
      --region=REGION \
      --temp_location=TEMP_LOCATION \
      --runner=DataflowRunner \
      --disk_size_gb=DISK_SIZE_GB \
      --experiments=use_runner_v2 \
      --requirements_file=./requirements.txt \
      --prebuild_sdk_container_engine=local_docker \
      --docker_registry_push_url=GAR_PATH/GCR_PATH \
      --sdk_location=container
    

    Replace GAR_PATH/GCR_PATH with an Artifact Registry folder or a Container Registry path prefix, for example LOCATION-docker.pkg.dev/PROJECT/REPOSITORY or gcr.io/PROJECT. The prebuilt SDK container image is tagged and pushed to that path.

The SDK container image pre-building workflow uses the image passed using the --sdk_container_image pipeline option as the base image. If the option is not set, by default an Apache Beam image is used as the base image.

You can reuse a pre-built Python SDK container image in another job with the same dependencies and SDK version. To reuse the image, pass the pre-built container image URL to the other job by using the --sdk_container_image pipeline option. Remove the dependency options --requirements_file, --extra_packages, and --setup_file.
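
For example, a follow-up Python job that reuses a pre-built image might look like the following sketch, based on the earlier wordcount example, where PREBUILT_IMAGE_URI stands for the URL of the image pushed by the earlier run:

python -m apache_beam.examples.wordcount \
  --input=INPUT_FILE \
  --output=OUTPUT_FILE \
  --project=PROJECT_ID \
  --region=REGION \
  --temp_location=TEMP_LOCATION \
  --runner=DataflowRunner \
  --experiments=use_runner_v2 \
  --sdk_container_image=PREBUILT_IMAGE_URI \
  --sdk_location=container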

If you don't plan to reuse the image, delete it after the job completes. You can delete the image with the gcloud CLI or in the Artifact Registry or Container Registry pages in the Google Cloud console.

For example, if the image is stored in Container Registry, use the container images delete command:

   gcloud container images delete IMAGE --force-delete-tags

If the image is stored in Artifact Registry, use the artifacts docker images delete command:

   gcloud artifacts docker images delete IMAGE --delete-tags

Common issues:

  • If your job has extra Python dependencies from a private PyPI mirror that a remote Cloud Build job cannot pull, try the local Docker option or build your container by using a Dockerfile.

  • If the Cloud Build job fails with docker exit code 137, the build job ran out of memory, potentially due to the size of the dependencies being installed. Use a larger Cloud Build worker machine type by passing --cloud_build_machine_type=machine_type, where machine_type is one of the following options:

    • n1-highcpu-8
    • n1-highcpu-32
    • e2-highcpu-8
    • e2-highcpu-32

By default, Cloud Build uses the machine type e2-medium.

Troubleshooting

This section provides instructions for troubleshooting issues with using custom containers in Dataflow. It focuses on issues with containers or workers not starting. If your workers are able to start and work is progressing, follow the general guidance for Troubleshooting your pipeline.

Before reaching out for support, ensure that you have ruled out problems related to your container image:

  • Follow the steps to test your container image locally.
  • Search for errors in the Job logs or in Worker logs, and compare any errors found with the common error guidance.
  • Make sure that the Apache Beam SDK version and language version you are using to launch the pipeline match the SDK version on your custom container image.
  • If using Java, make sure that the Java major version you use to launch the pipeline matches the version installed in your container image.
  • If using Python, make sure that the Python major-minor version you use to launch the pipeline matches the version installed in your container image, and that the image does not have conflicting dependencies. You can run pip check to confirm.
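
For example, for a Python image, one way to inspect the versions and dependencies inside the image is to run the checks in a temporary container and compare the output with your launch environment. This is a diagnostic sketch that assumes your image URI is in $IMAGE_URI:

docker run --rm --entrypoint /bin/bash $IMAGE_URI -c \
  "python --version && python -c 'import apache_beam; print(apache_beam.__version__)' && pip check"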

Find worker logs related to custom containers

To find Dataflow worker logs that contain container-related error messages, use the Logs Explorer:

  1. Select log names. Custom container startup errors are most likely to be in one of the following:

    • dataflow.googleapis.com/kubelet
    • dataflow.googleapis.com/docker
    • dataflow.googleapis.com/worker-startup
    • dataflow.googleapis.com/harness-startup
  2. Select the Dataflow Step resource and specify the job_id.

If you see Error Syncing pod... log messages, follow the common error guidance. You can query for these log messages in Dataflow worker logs by using the Logs Explorer with the following query:

resource.type="dataflow_step" AND jsonPayload.message:("$IMAGE_URI") AND severity="ERROR"

Common issues

Job has errors or failed because the container image cannot be pulled

Dataflow workers must be able to access custom container images. If the worker is unable to pull the image due to invalid URLs, misconfigured credentials, or missing network access, the worker fails to start.

For batch jobs where no work has started and several workers are unable to start sequentially, Dataflow fails the job. Otherwise, Dataflow logs errors but does not take further action to avoid destroying long-running job state.

For information about how to fix this issue, see Image pull request failed with error in the Troubleshoot Dataflow errors page.

Workers are not starting or work is not progressing

In some cases, if the SDK container fails to start due to an error, Dataflow is unable to determine whether the error is permanent or fatal and continuously attempts to restart the worker as it fails.

If there are no obvious errors but you see [topologymanager] RemoveContainer INFO-level logs in dataflow.googleapis.com/kubelet, these logs indicate that the custom container image is exiting early and did not start the long-running worker SDK process.

If workers have started successfully but no work is happening, an error might be preventing the SDK container from starting. In this case, the following error appears in the diagnostic recommendations:

Failed to start container

In addition, the worker logs do not contain lines such as the following:

Executing: python -m apache_beam.runners.worker.sdk_worker_main or Executing: java ... FnHarness

Find specific errors in Worker logs and check common error guidance.

Common causes for these issues include the following:

  • Problems with package installation, such as pip installation errors due to dependency issues. See Error syncing pod ... failed to "StartContainer".
  • Errors with the custom command arguments or with the ENTRYPOINT set in the Dockerfile. For example, a custom ENTRYPOINT does not start the default boot script /opt/apache/beam/boot or does not pass arguments appropriately to this script. For more information, see Modify the container entrypoint.