Using custom containers in Dataflow

This page describes how to customize the runtime environment of Python user code in Dataflow pipelines by supplying a custom container image. Custom containers are supported only for pipelines that use Dataflow Runner v2.

When Dataflow launches worker VMs, it uses Docker container images to launch containerized SDK processes on the workers. You can specify a custom container image instead of using one of the default Apache Beam images. When you specify a custom container image, Dataflow launches workers that pull the specified image. These are some reasons you might use a custom container:

  • Preinstalling pipeline dependencies to reduce worker start time.
  • Preinstalling pipeline dependencies that are not available in public repositories.
  • Prestaging large files to reduce worker start time.
  • Launching third-party software in the background.
  • Customizing the execution environment.

For a more in-depth look at custom containers, see the Apache Beam custom container guide.

Before you begin

Verify that you have installed a version of the Apache Beam SDK that supports Runner v2 (2.21.0 or later) for your language version, and use one of the following pipeline options for custom containers:

  • If using SDK version 2.30.0 or later, use the pipeline option --sdk_container_image.
  • For older versions of the SDK, use the pipeline option --worker_harness_container_image.
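As an illustrative sketch (the helper name is ours, not part of the Beam SDK), the choice between these two options can be encoded as a small version check:

```python
# Pick the container-image pipeline option name for a given Beam SDK version.
# The version thresholds follow the bullets above; this helper is illustrative only.
def container_image_flag(sdk_version: str) -> str:
    major, minor = (int(part) for part in sdk_version.split(".")[:2])
    if (major, minor) >= (2, 30):
        return "--sdk_container_image"
    return "--worker_harness_container_image"
```

For example, `container_image_flag("2.32.0")` selects `--sdk_container_image`, while `container_image_flag("2.25.0")` falls back to the older option name.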

For more information, see the guide for Installing the Apache Beam SDK.

To test your container image locally, you must have Docker installed. For more information, see Get Docker.

Default SDK container images

We recommend starting with a default Apache Beam SDK image as your base container image. Default images are published to Docker Hub as part of each Apache Beam release.

Creating and building the container image

This section provides examples for different ways to create a custom SDK container image that follows the Apache Beam SDK container contract.

A custom SDK container image must have the following:

  • The Apache Beam SDK and necessary dependencies installed
  • The default ENTRYPOINT script (/opt/apache/beam/boot on default containers) must run as the last step during container startup. See Modifying the container entrypoint for more information.

Using Apache Beam base image

In this example, we use Python 3.8 with the Apache Beam SDK version 2.32.0. To create a custom container image, specify the Apache Beam image as the parent image and add your own customizations.

  1. Create a new Dockerfile, specifying apache/beam_python3.8_sdk:2.32.0 as the parent image, and add any customizations. For more information on writing Dockerfiles, see Best practices for writing Dockerfiles.

    FROM apache/beam_python3.8_sdk:2.32.0
    # Make your customizations here, for example:
    ENV FOO=/bar
    COPY path/to/myfile ./
  2. Build the child image and push this image to a container registry.

    Cloud Build

    export PROJECT=PROJECT
    export REPO=REPO
    export TAG=TAG
    export IMAGE_URI=gcr.io/$PROJECT/$REPO:$TAG
    gcloud builds submit . --tag $IMAGE_URI

    Docker

    export PROJECT=PROJECT
    export REPO=REPO
    export TAG=TAG
    export IMAGE_URI=gcr.io/$PROJECT/$REPO:$TAG
    docker build . --tag $IMAGE_URI
    docker push $IMAGE_URI

    Replace the following:

    • PROJECT: the project name or user name.
    • REPO: the image repository name.
    • TAG: the image tag, typically latest.

Using a custom base image or multi-stage builds

If you have an existing base image or need to modify some base aspect of the default Apache Beam images (operating system version, patches, and so on), use a multi-stage build process to copy the necessary artifacts from a default Apache Beam base image into your custom container image.

A simple example Dockerfile:

FROM python:3.8-slim

# Install SDK
RUN pip install --no-cache-dir apache-beam[gcp]==2.32.0

# Copy files from official SDK image, including script/dependencies
COPY --from=apache/beam_python3.8_sdk:2.32.0 /opt/apache/beam /opt/apache/beam

# Set the entrypoint to Apache Beam SDK launcher.
ENTRYPOINT ["/opt/apache/beam/boot"]

This example assumes necessary dependencies (in this case, Python 3.8 and pip) have been installed on the existing base image.

Modifying the container entrypoint

Custom containers must run the default ENTRYPOINT script /opt/apache/beam/boot, which initializes the worker environment and starts the SDK worker process. If you do not set this entrypoint, your worker will hang and never properly start.

If you need to run your own script on container startup, you must ensure that your image ENTRYPOINT still properly starts the SDK worker process, including passing along the arguments that Dataflow supplies at container startup.

This means that your custom ENTRYPOINT must end with running /opt/apache/beam/boot, and that any required arguments passed by Dataflow on container startup are properly passed to the default boot script. This can be done by creating a custom script that runs /opt/apache/beam/boot:


#!/bin/bash

echo "This is my custom script"

# ...

# Pass command arguments to the default boot script.
/opt/apache/beam/boot "$@"

Then, override the default ENTRYPOINT. An example Dockerfile:

FROM apache/beam_python3.8_sdk:2.32.0

COPY path/to/my/ path/to/my/
ENTRYPOINT [ "path/to/my/" ]
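If you prefer to write the wrapper in Python rather than shell, the same contract applies: forward every container argument, unchanged, to the boot script. A minimal sketch, with our own helper name (not a Beam API); a real entrypoint would finish by replacing its own process via os.execv:

```python
# Hypothetical Python entrypoint wrapper: build the command that hands off
# to the default Beam boot script, forwarding all arguments unchanged.
import os
import sys

BOOT_SCRIPT = "/opt/apache/beam/boot"

def boot_command(argv):
    # Equivalent to `/opt/apache/beam/boot "$@"` in a shell wrapper:
    # the boot script path first, then every argument exactly as received.
    return [BOOT_SCRIPT] + list(argv)

# In a real entrypoint, finish by replacing this process with the boot script:
#   command = boot_command(sys.argv[1:])
#   os.execv(command[0], command)
```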

Running a job with custom containers

This section discusses running pipelines with custom containers both with local runners for testing and on Dataflow. Skip to Launching the Dataflow job if you have already verified your container image and pipeline.

Before you begin

When running your pipeline, make sure to launch the pipeline using the Apache Beam SDK with the same version (e.g. 2.XX.0) and language version (e.g. Python 3.X) as the SDK on your custom container image. This avoids unexpected errors from incompatible dependencies or SDKs.
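As an illustrative guard (the function and the expected versions are our assumptions, not a Beam API), you can compare the launching environment against the versions baked into your image before submitting:

```python
# Sketch: fail fast if the launch environment does not match the versions
# installed in the custom container image (placeholder values for your image).
IMAGE_PYTHON = (3, 8)        # Python major.minor in the image (assumption)
IMAGE_BEAM_SDK = "2.32.0"    # Beam SDK version in the image (assumption)

def environment_matches(local_python, local_beam_sdk):
    # Compare the Python major.minor and the exact Beam SDK version string.
    return tuple(local_python[:2]) == IMAGE_PYTHON and local_beam_sdk == IMAGE_BEAM_SDK
```

At launch time you would pass `sys.version_info` and `apache_beam.__version__` as the local values.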

Testing locally

To learn more about Apache Beam-specific usage, see the Apache Beam guide for Running pipelines with custom container images.

Basic testing with PortableRunner

Test that your remote container image can be pulled and can run at least a simple pipeline by using the Apache Beam PortableRunner.

The following runs a simple Apache Beam wordcount example:

python path/to/my/ \
  --runner=PortableRunner \
  --job_endpoint=embed \
  --environment_type=DOCKER \
  --environment_config=$IMAGE_URI \
  --input=INPUT_FILE \
  --output=OUTPUT_FILE
Replace the following:

  • IMAGE_URI: the custom container image URI. You can use the shell variable $IMAGE_URI constructed in the previous step if the variable is still in scope.
  • INPUT_FILE: an input file that can be read as a text file. This file must be accessible by the SDK harness
    container image, either preloaded on the container image or a remote file.
  • OUTPUT_FILE: a file path to write output to. This path is either a remote path or a local path on the container.

Once the pipeline has completed, check the console logs to verify that the pipeline completed successfully and that the remote image specified by $IMAGE_URI was used.

Note that after the pipeline runs, files saved to the container are not in your local filesystem, and the container has stopped. You can copy files out of the stopped container filesystem by using docker cp.

Alternatively:

  • Provide outputs to remote filesystems like Cloud Storage. Note that this might require manually configuring access for testing purposes, including credential files or Application Default Credentials.
  • Add temporary logging for quick debugging.

Using Direct Runner

For more in-depth local testing of the container image and your pipeline, use the Apache Beam DirectRunner.

You can verify your pipeline separately from the container by testing in a local environment that matches the container image, or by launching the pipeline on a running container.

docker run -it --entrypoint "/bin/bash" $IMAGE_URI
# On docker container:
root@4f041a451ef3:/#  python path/to/my/ ...

This example assumes any pipeline files (including the pipeline itself, path/to/my/) are on the custom container, have been mounted from a local filesystem, or are remote and accessible to Apache Beam and the container. See the Docker documentation on Storage and docker run for more information.

Note that the goal for DirectRunner testing is to test your pipeline in the custom container environment, not to test actually running your container with its default ENTRYPOINT. Modify the ENTRYPOINT (e.g. docker run --entrypoint ...) to either directly run your pipeline or allow manually running commands on the container.

If you rely on specific configuration from running on Compute Engine, you can run this container directly on a Compute Engine VM. See Containers on Compute Engine for more information.

Launching the Dataflow job

Specify the path to the container image when launching the Apache Beam pipeline on Dataflow using pipeline option --sdk_container_image.

Custom containers are only supported for Dataflow Runner v2. If you are launching a batch Python pipeline, set the --experiment=use_runner_v2 flag. If you are launching a streaming Python pipeline, specifying the experiment is not necessary as streaming Python pipelines use Runner v2 by default.

The following example demonstrates how to launch the batch wordcount example with a custom container using the Apache Beam SDK for Python version 2.30.0 or later:

python -m apache_beam.examples.wordcount \
  --input=INPUT_FILE \
  --output=OUTPUT_FILE \
  --project=PROJECT_ID \
  --region=REGION \
  --temp_location=TEMP_LOCATION \
  --runner=DataflowRunner \
  --sdk_container_image=$IMAGE_URI \
  --disk_size_gb=DISK_SIZE_GB \
  --experiment=use_runner_v2

Replace the following:

  • INPUT_FILE: the Cloud Storage input file path read by Dataflow when running the example.
  • OUTPUT_FILE: the Cloud Storage output file path written to by the example pipeline. This will contain the word counts.
  • PROJECT_ID: the ID of your Google Cloud project.
  • REGION: the regional endpoint for deploying your Dataflow job.
  • TEMP_LOCATION: the Cloud Storage path for Dataflow to stage temporary job files created during the execution of the pipeline.
  • IMAGE_URI: the custom container image URI. You can use the shell variable $IMAGE_URI constructed in the previous step if the variable is still in scope.
  • DISK_SIZE_GB: (optional) if your container image is large, consider increasing the default boot disk size to avoid running out of disk space.
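The same flags can be assembled programmatically, for example when launching from a script. A minimal sketch (the helper is ours and the values are placeholders; the resulting list can be passed to PipelineOptions or joined into a command line):

```python
# Build the argument list for launching a pipeline with a custom container.
# These flags correspond to the launch command described in this section.
def dataflow_launch_args(image_uri, project, region, temp_location):
    return [
        "--runner=DataflowRunner",
        f"--project={project}",
        f"--region={region}",
        f"--temp_location={temp_location}",
        f"--sdk_container_image={image_uri}",
        "--experiment=use_runner_v2",
    ]
```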


Troubleshooting

This section provides instructions for troubleshooting issues with using custom containers in Dataflow. It focuses on issues with containers or workers not starting, because most issues with custom containers cause the container to fail to start, which blocks workers from starting work. If your workers are able to start and work is progressing, follow the general guidance for Troubleshooting your pipeline.

Before reaching out for support, ensure that you have ruled out problems related to your container image:

  • Follow the steps to test your container image locally.
  • Search for errors in the Job logs or in Worker logs, and compare any errors found against the Common errors guidance.
  • Make sure that the Apache Beam SDK version and language version you are using to launch the pipeline match the SDK version on your custom container image.
  • If using Python, make sure that the Python major and minor version you use to launch the pipeline matches the version installed in your container image, and that the image does not have conflicting dependencies. You can run pip check to confirm.
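For example, a quick way to run that dependency check from Python (a sketch; it assumes pip is available in the environment being checked):

```python
# Run `pip check` in the current environment and report any dependency
# conflicts; a non-zero return code means pip found broken requirements.
import subprocess
import sys

result = subprocess.run(
    [sys.executable, "-m", "pip", "check"],
    capture_output=True,
    text=True,
)
if result.returncode != 0:
    print("Dependency conflicts found:")
    print(result.stdout)
```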

Finding worker logs related to custom containers

The Dataflow worker logs for container-related error messages can be found using the Logs Explorer:

  1. Select log names. Custom container startup errors are most likely to appear in the worker startup logs.

  2. Select the Dataflow Step resource and specify the job_id.

In particular, if you see Error Syncing pod... log messages, follow the common error guidance. You can query for these log messages in Dataflow worker logs by using Logs Explorer with the following query:

resource.type="dataflow_step" AND jsonPayload.message:("$IMAGE_URI") AND severity="ERROR"

Common issues

Job has errors or failed because the container image cannot be pulled

Custom container images must be accessible by Dataflow workers. If the worker is unable to pull the image due to invalid URLs, misconfigured credentials, or missing network access, the worker will fail to start.

For batch jobs where no work has started and several workers in a row fail to start, Dataflow fails the job. Otherwise, Dataflow logs the errors but does not take further action, to avoid destroying long-running job state.

Follow common error guidance to determine why and how to fix this issue.

Workers are not starting or work is not progressing

In some cases, if the SDK container fails to start due to some error, Dataflow is unable to determine whether the error is permanent or fatal, and it continuously attempts to restart the worker as it fails.

Find specific errors in the worker logs and check the common error guidance.

If there are no obvious errors but you see [topologymanager] RemoveContainer INFO-level logs, these logs indicate that the custom container image exited early and did not start the long-running worker SDK process. This can happen if a custom ENTRYPOINT does not start the default boot script /opt/apache/beam/boot or does not pass arguments to this script appropriately. See Modifying the container entrypoint.