Use custom containers in Dataflow


This page describes how to customize the runtime environment of user code in Dataflow pipelines by supplying a custom container image. Custom containers are supported for pipelines using Dataflow Runner v2.

When Dataflow launches worker VMs, it uses Docker container images to launch containerized SDK processes on the workers. You can specify a custom container image instead of using one of the default Apache Beam images. When you specify a custom container image, Dataflow launches workers that pull the specified image. The following list includes reasons you might use a custom container:

  • Preinstalling pipeline dependencies to reduce worker start time.
  • Preinstalling pipeline dependencies that are not available in public repositories.
  • Prestaging large files to reduce worker start time.
  • Launching third-party software in the background.
  • Customizing the execution environment.

For a more in-depth look into custom containers, see the Apache Beam custom container guide. For examples of Python pipelines that use custom containers, see Dataflow custom containers.

Before you begin

Verify that you have installed a version of the Apache Beam SDK that supports Runner v2 and your language version.

Then use the following pipeline options to enable custom containers:

Java

Use --experiments=use_runner_v2 to enable Runner v2.

Use --sdkContainerImage to specify a custom container image for your Java runtime.

Python

If using SDK version 2.30.0 or later, use the pipeline option --sdk_container_image.

For older versions of the SDK, use the pipeline option --worker_harness_container_image.

Go

If using SDK version 2.40.0 or later, use the pipeline option --sdk_container_image.

For older versions of the SDK, use the pipeline option --worker_harness_container_image.

For more information, see the guide for Installing the Apache Beam SDK.

To test your container image locally, you must have Docker installed. For more information, see Get Docker.

Default SDK container images

We recommend starting with a default Apache Beam SDK image as your base container image. Default images are published to Docker Hub as part of each Apache Beam release.
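
For example, you can pull a default image and inspect it locally before using it as a base. The following commands are an illustrative sketch; choose the image and tag that match your SDK language and version:

docker pull apache/beam_python3.8_sdk:2.43.0
docker run --rm --entrypoint /bin/bash apache/beam_python3.8_sdk:2.43.0 -c "python --version && pip check"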

Create and build the container image

This section provides examples for different ways to create a custom SDK container image.

A custom SDK container image must meet the following requirements:

  • The Apache Beam SDK and necessary dependencies are installed.
  • The default ENTRYPOINT script (/opt/apache/beam/boot on default containers) runs as the last step during container startup. See Modify the container entrypoint for more information.
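
One way to confirm the entrypoint of an image you have built, assuming it is available locally and tagged as $IMAGE_URI, is to inspect the image configuration:

docker inspect --format='{{json .Config.Entrypoint}}' $IMAGE_URI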

Use an Apache Beam base image

To create a custom container image, specify the Apache Beam image as the parent image and add your own customizations. For more information on writing Dockerfiles, see Best practices for writing Dockerfiles.

  1. Create a new Dockerfile, specifying the base image using the FROM instruction.

    Java

    In this example, we use Java 8 with the Apache Beam SDK version 2.43.0.

    FROM apache/beam_java8_sdk:2.43.0
    
    # Make your customizations here, for example:
    ENV FOO=/bar
    COPY path/to/myfile ./
    

    The runtime version of the custom container must match the runtime that you will use to start the pipeline. For example, if you will start the pipeline from a local Java 11 environment, the FROM line must specify a Java 11 environment: apache/beam_java11_sdk:....

    Python

    In this example, we use Python 3.8 with the Apache Beam SDK version 2.43.0.

    FROM apache/beam_python3.8_sdk:2.43.0
    
    # Make your customizations here, for example:
    ENV FOO=/bar
    COPY path/to/myfile ./
    

    The runtime version of the custom container must match the runtime that you will use to start the pipeline. For example, if you will start the pipeline from a local Python 3.8 environment, the FROM line must specify a Python 3.8 environment: apache/beam_python3.8_sdk:....

    Go

    In this example, we use Go with the Apache Beam SDK version 2.43.0.

    FROM apache/beam_go_sdk:2.43.0
    
    # Make your customizations here, for example:
    ENV FOO=/bar
    COPY path/to/myfile ./
    
  2. Build the child image and push this image to a container registry.

    Cloud Build

    export PROJECT=PROJECT
    export REPO=REPO
    export TAG=TAG
    export IMAGE_URI=gcr.io/$PROJECT/$REPO:$TAG
    gcloud builds submit . --tag $IMAGE_URI
    

    Docker

    export PROJECT=PROJECT
    export REPO=REPO
    export TAG=TAG
    export IMAGE_URI=gcr.io/$PROJECT/$REPO:$TAG
    docker build . --tag $IMAGE_URI
    docker push $IMAGE_URI
    

    Replace the following:

    • PROJECT: the Google Cloud project ID or user name.
    • REPO: the image repository name.
    • TAG: the image tag, typically latest.
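
    If you push to Artifact Registry instead of Container Registry, the same commands apply with a different image path. The following sketch assumes a Docker repository named REPO already exists in Artifact Registry in the region REGION:

    export IMAGE_URI=REGION-docker.pkg.dev/PROJECT/REPO/IMAGE:TAG
    gcloud builds submit . --tag $IMAGE_URI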

Use a custom base image or multi-stage builds

If you have an existing base image or need to modify a base aspect of the default Apache Beam images (such as the OS version or patches), use a multi-stage build process to copy the necessary artifacts from a default Apache Beam base image into your custom container image.

Here's an example Dockerfile that copies files from the Apache Beam SDK:

Java

FROM openjdk:8

# Copy files from official SDK image, including script/dependencies.
COPY --from=apache/beam_java8_sdk:2.43.0 /opt/apache/beam /opt/apache/beam

# Set the entrypoint to Apache Beam SDK launcher.
ENTRYPOINT ["/opt/apache/beam/boot"]

Python

FROM python:3.8-slim

# Install SDK.
RUN pip install --no-cache-dir apache-beam[gcp]==2.43.0

# Verify that the image does not have conflicting dependencies.
RUN pip check

# Copy files from official SDK image, including script/dependencies.
COPY --from=apache/beam_python3.8_sdk:2.43.0 /opt/apache/beam /opt/apache/beam

# Set the entrypoint to Apache Beam SDK launcher.
ENTRYPOINT ["/opt/apache/beam/boot"]

This example assumes that the necessary dependencies (in this case, Python 3.8 and pip) are installed on the existing base image. Installing the Apache Beam SDK into the image ensures that the image has the necessary SDK dependencies and reduces worker startup time. Important: the SDK version specified in the RUN and COPY instructions must match the version used to launch the pipeline.
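
As a sanity check, you can build the image and confirm that the Apache Beam version inside it matches the version you launch the pipeline with. This is an illustrative sketch, assuming the image is tagged $IMAGE_URI:

docker build . --tag $IMAGE_URI
docker run --rm --entrypoint /bin/bash $IMAGE_URI \
  -c "python -c 'import apache_beam; print(apache_beam.__version__)'"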

Go

FROM golang:latest

# Copy files from official SDK image, including script/dependencies.
COPY --from=apache/beam_go_sdk:2.43.0 /opt/apache/beam /opt/apache/beam

# Set the entrypoint to Apache Beam SDK launcher.
ENTRYPOINT ["/opt/apache/beam/boot"]

Modify the container entrypoint

Custom containers must run the default ENTRYPOINT script /opt/apache/beam/boot, which initializes the worker environment and starts the SDK worker process. If you do not set this entrypoint, your worker won't start properly.

If you need to run your own script on container startup, you must ensure that your image ENTRYPOINT still properly starts this worker SDK process, including passing Dataflow arguments to this script.

This means that your custom ENTRYPOINT must end by running /opt/apache/beam/boot and must pass any required arguments that Dataflow provides on container startup to the default boot script. To do this, create a custom script that runs /opt/apache/beam/boot:

#!/bin/bash

echo "This is my custom script"

# ...

# Pass command arguments to the default boot script.
/opt/apache/beam/boot "$@"

Then, override the default ENTRYPOINT. An example Dockerfile:

Java

FROM apache/beam_java8_sdk:2.43.0

COPY script.sh path/to/my/script.sh
ENTRYPOINT [ "path/to/my/script.sh" ]

Python

FROM apache/beam_python3.8_sdk:2.43.0

COPY script.sh path/to/my/script.sh
ENTRYPOINT [ "path/to/my/script.sh" ]

Go

FROM apache/beam_go_sdk:2.43.0

COPY script.sh path/to/my/script.sh
ENTRYPOINT [ "path/to/my/script.sh" ]
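
The wrapper script must be executable inside the container. One way to ensure this, assuming script.sh is in your build context, is to set the executable bit before building, because COPY preserves file permissions:

chmod +x script.sh
docker build . --tag $IMAGE_URI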

Run a job with custom containers

This section discusses running pipelines with custom containers, both with local runners for testing and on Dataflow. Skip to Launch the Dataflow job if you have already verified your container image and pipeline.

Before you begin

When running your pipeline, make sure to launch the pipeline using the Apache Beam SDK with the same version and language version as the SDK on your custom container image. This avoids unexpected errors from incompatible dependencies or SDKs.

Test locally

To learn more about Apache Beam-specific usage, see the Apache Beam guide for Running pipelines with custom container images.

Basic testing with PortableRunner

Test that remote container images can be pulled and can at least run a simple pipeline by using the Apache Beam PortableRunner.

The following runs an example pipeline:

Java

mvn compile exec:java -Dexec.mainClass=com.example.package.MyClassWithMain \
    -Dexec.args="--runner=PortableRunner \
    --jobEndpoint=ENDPOINT \
    --defaultEnvironmentType=DOCKER \
    --defaultEnvironmentConfig=IMAGE_URI \
    --inputFile=INPUT_FILE \
    --output=OUTPUT_FILE"

Python

python path/to/my/pipeline.py \
  --runner=PortableRunner \
  --job_endpoint=ENDPOINT \
  --environment_type=DOCKER \
  --environment_config=IMAGE_URI \
  --input=INPUT_FILE \
  --output=OUTPUT_FILE

Go

go run path/to/my/pipeline.go \
  --runner=PortableRunner \
  --job_endpoint=ENDPOINT \
  --environment_type=DOCKER \
  --environment_config=IMAGE_URI \
  --input=INPUT_FILE \
  --output=OUTPUT_FILE

Replace the following:

  • ENDPOINT: the job service endpoint to use. Should be in the form of address and port. For example: localhost:3000. Use embed to run an in-process job service.
  • IMAGE_URI: the custom container image URI. You can use the shell variable $IMAGE_URI constructed in the previous step if the variable is still in scope.
  • INPUT_FILE: an input file that can be read as a text file. This file must be accessible by the SDK harness
    container image, either preloaded on the container image or a remote file.
  • OUTPUT_FILE: a file path to write output to. This path is either a remote path or a local path on the container.

After the pipeline completes, check the console logs to verify that the pipeline completed successfully and that the remote image, specified by $IMAGE_URI, was used.

Note that after the run, files saved to the container are not in your local file system, and the container is stopped. You can copy files from the stopped container's file system by using docker cp.
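
For example (the container ID and output path below are placeholders for illustration):

docker ps -a   # find the ID of the stopped SDK container
docker cp CONTAINER_ID:/path/to/OUTPUT_FILE .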

Alternatively:

  • Provide outputs to a remote file system like Cloud Storage. Note that this might require manually configuring access for testing purposes, including credential files or Application Default Credentials (see the example after this list).
  • Add temporary logging for quick debugging.
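
For example, one common way to set up Application Default Credentials for local testing, assuming the gcloud CLI is installed, is the following. The resulting credentials might also need to be made available inside the SDK container, for example by mounting the gcloud configuration directory.

gcloud auth application-default login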

Use Direct Runner

For more in-depth local testing of the container image and your pipeline, use the Apache Beam DirectRunner.

You can verify your pipeline separately from the container by testing in a local environment matching the container image, or by launching the pipeline on a running container.

Java

docker run -it --entrypoint "/bin/bash" $IMAGE_URI
...
# On docker container:
root@4f041a451ef3:/#  mvn compile exec:java -Dexec.mainClass=com.example.package.MyClassWithMain ...

Python

docker run -it --entrypoint "/bin/bash" $IMAGE_URI
...
# On docker container:
root@4f041a451ef3:/#  python path/to/my/pipeline.py ...

Go

docker run -it --entrypoint "/bin/bash" $IMAGE_URI
...
# On docker container:
root@4f041a451ef3:/#  go run path/to/my/pipeline.go ...

These examples assume that any pipeline files (including the pipeline itself) are on the custom container, are mounted from a local file system, or are remote and accessible to Apache Beam and the container. For example, to use Maven (mvn) to run the preceding Java example, Maven and its dependencies must be staged on the container. For more information, see the Docker documentation on Storage and docker run.
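
For example, to make local pipeline files available inside the container, you can mount your working directory when starting it. The mount point /workspace is an arbitrary choice for illustration:

docker run -it --rm \
    --entrypoint "/bin/bash" \
    -v "$PWD":/workspace -w /workspace \
    $IMAGE_URI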

The goal of DirectRunner testing is to test your pipeline in the custom container environment, not to test running your container with its default ENTRYPOINT. Modify the ENTRYPOINT (for example, docker run --entrypoint ...) to either run your pipeline directly or to allow manually running commands on the container.

If you rely on specific configuration from running on Compute Engine, you can run this container directly on a Compute Engine VM. See Containers on Compute Engine for more information.

Launch the Dataflow job

When launching the Apache Beam pipeline on Dataflow, specify the path to the container image as shown below:

Java

Use --sdkContainerImage to specify an SDK container image for your Java runtime.

Use --experiments=use_runner_v2 to enable Runner v2.

Python

If using SDK version 2.30.0 or later, use the pipeline option --sdk_container_image to specify an SDK container image.

For older versions of the SDK, use the pipeline option --worker_harness_container_image to specify the location of the container image to use for the worker harness.

Custom containers are only supported with Dataflow Runner v2. If you are launching a batch Python pipeline, set the --experiments=use_runner_v2 flag. If you are launching a streaming Python pipeline, specifying the experiment is not necessary, because streaming Python pipelines use Runner v2 by default.

Go

If using SDK version 2.40.0 or later, use the pipeline option --sdk_container_image to specify an SDK container image.

For older versions of the SDK, use the pipeline option --worker_harness_container_image to specify the location of the container image to use for the worker harness.

Custom containers are supported on all versions of the Go SDK because they use Dataflow Runner v2 by default.

The following example demonstrates how to launch the batch wordcount example with a custom container.

Java

mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
   -Dexec.args="--runner=DataflowRunner \
                --inputFile=INPUT_FILE \
                --output=OUTPUT_FILE \
                --project=PROJECT_ID \
                --region=REGION \
                --gcpTempLocation=TEMP_LOCATION \
                --diskSizeGb=DISK_SIZE_GB \
                --experiments=use_runner_v2 \
                --sdkContainerImage=$IMAGE_URI"

Python

Using the Apache Beam SDK for Python version 2.30.0 or later:

python -m apache_beam.examples.wordcount \
  --input=INPUT_FILE \
  --output=OUTPUT_FILE \
  --project=PROJECT_ID \
  --region=REGION \
  --temp_location=TEMP_LOCATION \
  --runner=DataflowRunner \
  --disk_size_gb=DISK_SIZE_GB \
  --experiments=use_runner_v2 \
  --sdk_container_image=$IMAGE_URI

Go

wordcount --input gs://dataflow-samples/shakespeare/kinglear.txt \
          --output gs://<your-gcs-bucket>/counts \
          --runner dataflow \
          --project your-gcp-project \
          --region your-gcp-region \
          --temp_location gs://<your-gcs-bucket>/tmp/ \
          --staging_location gs://<your-gcs-bucket>/binaries/ \
          --sdk_container_image=$IMAGE_URI

Replace the following:

  • INPUT_FILE: the Cloud Storage input file path read by Dataflow when running the example.
  • OUTPUT_FILE: the Cloud Storage output file path written to by the example pipeline. This file contains the word counts.
  • PROJECT_ID: the ID of your Google Cloud project.
  • REGION: the regional endpoint for deploying your Dataflow job.
  • TEMP_LOCATION: the Cloud Storage path for Dataflow to stage temporary job files created during the execution of the pipeline.
  • DISK_SIZE_GB: Optional. If your container is large, consider increasing the default boot disk size to avoid running out of disk space.
  • $IMAGE_URI: the SDK custom container image URI. You can use the shell variable $IMAGE_URI constructed in the previous step if the variable is still in scope.

Pre-build the Python SDK custom container image with extra dependencies

When you launch a Python Dataflow job, extra dependencies, such as those specified by --requirements_file and --extra_packages, are installed in each SDK worker container started by the Dataflow service. This often leads to high CPU usage and a long warm-up period on all newly started Dataflow workers when the job first starts or when autoscaling occurs.

To avoid these repetitive dependency installations, you can pre-build the custom Python SDK container image with all the dependencies pre-installed.

To use the Python SDK container image pre-building workflow, submit your job with the --prebuild_sdk_container_engine option.

  • If your project has the Cloud Build API enabled, you can build the image with Cloud Build by using:

    --prebuild_sdk_container_engine=cloud_build --docker_registry_push_url=GAR_PATH/GCR_PATH
    
  • If you have a local Docker installation and prefer to pre-build the image locally, use:

    --prebuild_sdk_container_engine=local_docker --docker_registry_push_url=GAR_PATH/GCR_PATH
    

    Replace GAR_PATH/GCR_PATH with an Artifact Registry folder or Container Registry path prefix with which the pre-built SDK container image should be tagged and pushed.

The SDK container image pre-building workflow uses the image passed with option --sdk_container_image as the base image. If the option is not set, an Apache Beam image is used as the base image by default.
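
For illustration, a batch wordcount launch that pre-builds the SDK container with Cloud Build might look like the following; the requirements.txt file is a placeholder for your own dependency file:

python -m apache_beam.examples.wordcount \
  --input=INPUT_FILE \
  --output=OUTPUT_FILE \
  --project=PROJECT_ID \
  --region=REGION \
  --temp_location=TEMP_LOCATION \
  --runner=DataflowRunner \
  --experiments=use_runner_v2 \
  --requirements_file=requirements.txt \
  --prebuild_sdk_container_engine=cloud_build \
  --docker_registry_push_url=GAR_PATH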

A pre-built Python SDK container image can be reused in another job that has the same dependencies and SDK version. To do so, pass the pre-built container image URL to the other job by using --sdk_container_image, and remove the dependency options --requirements_file, --extra_packages, and --setup_file. If you don't plan to reuse the image, delete it after the job is done. You can delete the image with gcloud or from the Artifact Registry or Container Registry console. For example, if the image is stored in Container Registry, use:

   gcloud container images delete <IMAGE> --force-delete-tags

If the image is stored in Artifact Registry, use:

   gcloud artifacts docker images delete <IMAGE> --delete-tags

Common issues:

  • If your job has extra Python dependencies that come from a private PyPI mirror and cannot be pulled by a remote Cloud Build job, try the local Docker option.

  • If the Cloud Build job fails with docker exit code 137, the build job ran out of memory, potentially because of the size of the dependencies being installed. Use a larger Cloud Build worker machine type by passing --cloud_build_machine_type=MACHINE_TYPE, where MACHINE_TYPE is one of n1-highcpu-8, n1-highcpu-32, e2-highcpu-8, or e2-highcpu-32. By default, Cloud Build uses e2-medium.

Troubleshooting

This section provides instructions for troubleshooting issues with using custom containers in Dataflow. It focuses on issues with containers or workers not starting. If your workers are able to start and work is progressing, follow the general guidance for Troubleshooting your pipeline.

Before reaching out for support, ensure that you have ruled out problems related to your container image:

  • Follow the steps to test your container image locally.
  • Search for errors in the Job logs or in Worker logs, and compare any errors found with the common error guidance.
  • Make sure that the Apache Beam SDK version and language version you are using to launch the pipeline match the SDK version on your custom container image.
  • If using Java, make sure that the Java major version you use to launch the pipeline matches the version installed in your container image.
  • If using Python, make sure that the Python major-minor version you use to launch the pipeline matches the version installed in your container image, and that the image does not have conflicting dependencies. You can run pip check to confirm.
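
For Python, a quick way to check the container side of this comparison, assuming the image is available locally as $IMAGE_URI, is:

docker run --rm --entrypoint /bin/bash $IMAGE_URI -c "python --version && pip check"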

Find worker logs related to custom containers

To find Dataflow worker logs for container-related error messages, use the Logs Explorer:

  1. Select log names. Custom container startup errors are most likely to be in one of the following:

    • dataflow.googleapis.com/kubelet
    • dataflow.googleapis.com/docker
    • dataflow.googleapis.com/worker-startup
    • dataflow.googleapis.com/harness-startup
  2. Select the Dataflow Step resource and specify the job_id.

If you see Error Syncing pod... log messages, follow the common error guidance. You can query for these log messages in Dataflow worker logs by using Logs Explorer with the following query:

resource.type="dataflow_step" AND jsonPayload.message:("$IMAGE_URI") AND severity="ERROR"
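
If you prefer the command line, an approximately equivalent query with the gcloud CLI is the following sketch; replace IMAGE_URI and PROJECT_ID with your values, and adjust the limit as needed:

gcloud logging read \
  'resource.type="dataflow_step" AND jsonPayload.message:("IMAGE_URI") AND severity="ERROR"' \
  --project=PROJECT_ID --limit=50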

Common issues

Job has errors or fails because the container image cannot be pulled

Dataflow workers must be able to access custom container images. If the worker is unable to pull the image due to invalid URLs, misconfigured credentials, or missing network access, the worker fails to start.

For batch jobs where no work has started and several workers fail to start in sequence, Dataflow fails the job. Otherwise, Dataflow logs errors but doesn't take further action, to avoid destroying long-running job state.

For information about how to fix this issue, see Image pull request failed with error in the Troubleshoot Dataflow errors page.

Workers are not starting or work is not progressing

In some cases, if the SDK container fails to start due to an error, Dataflow is unable to determine whether the error is permanent or transient and continuously attempts to restart the worker as it fails.

If there are no obvious errors but you see [topologymanager] RemoveContainer INFO-level logs in dataflow.googleapis.com/kubelet, these logs indicate that the custom container image is exiting early and did not start the long-running worker SDK process.

If workers have started successfully but no work is happening, an error might be preventing the SDK container from starting. In this case, the following error appears in the diagnostic recommendations:

Failed to start container

In addition, the worker logs do not contain lines such as the following:

Executing: python -m apache_beam.runners.worker.sdk_worker_main

Executing: java ... FnHarness

Find specific errors in Worker logs and check common error guidance.

Common causes for these issues include the following:

  • Problems with package installation, such as pip installation errors due to dependency issues. See Error syncing pod ... failed to "StartContainer".
  • Errors with the custom command arguments or with the ENTRYPOINT set in the Dockerfile. For example, a custom ENTRYPOINT does not start the default boot script /opt/apache/beam/boot or does not pass arguments appropriately to this script. For more information, see Modifying the container entrypoint.