Using custom containers in Dataflow

This page describes how to customize the runtime environment of user code in Dataflow pipelines by supplying a custom container image. Custom containers are supported for pipelines using Dataflow Runner v2.

When Dataflow launches worker VMs, it uses Docker container images to launch containerized SDK processes on the workers. You can specify a custom container image instead of using one of the default Apache Beam images. When you specify a custom container image, Dataflow launches workers that pull the specified image. These are some reasons you might use a custom container:

  • Preinstalling pipeline dependencies to reduce worker start time.
  • Preinstalling pipeline dependencies that are not available in public repositories.
  • Prestaging large files to reduce worker start time.
  • Launching third-party software in the background.
  • Customizing the execution environment.

For a more in-depth look into custom containers, see the Apache Beam custom container guide.

Before you begin

Verify that you have installed a version of the Apache Beam SDK that supports Runner v2 and your language version.

Then you can use the following pipeline options to enable custom containers:

Java

Use --experiments=use_runner_v2 to enable Runner v2.

Use --sdkContainerImage to specify a custom container image for your Java runtime.

Python

If using SDK version 2.30.0 or later, use the pipeline option --sdk_container_image.

For older versions of the SDK, use the pipeline option --worker_harness_container_image.

For more information, see the guide for Installing the Apache Beam SDK.

To test your container image locally, you must have Docker installed. For more information, see Get Docker.

Default SDK container images

We recommend starting with a default Apache Beam SDK image as your base container image. Default images are published to Docker Hub with each Apache Beam release.

Creating and building the container image

This section provides examples for different ways to create a custom SDK container image.

A custom SDK container image must meet the following requirements:

  • The Apache Beam SDK and necessary dependencies are installed.
  • The default ENTRYPOINT script (/opt/apache/beam/boot on default containers) runs as the last step during container startup. See Modifying the container entrypoint for more information.

Using an Apache Beam base image

To create a custom container image, specify the Apache Beam image as the parent image and add your own customizations. For more information on writing Dockerfiles, see Best practices for writing Dockerfiles.

  1. Create a new Dockerfile, specifying the base image using the FROM instruction.

    Java

    In this example, we use Java 8 with the Apache Beam SDK version 2.34.0.

    FROM apache/beam_java8_sdk:2.34.0
    
    # Make your customizations here, for example:
    ENV FOO=/bar
    COPY path/to/myfile ./
    

    The runtime version of the custom container must match the runtime that you will use to start the pipeline. For example, if you will start the pipeline from a local Java 11 environment, the FROM line must specify a Java 11 environment: apache/beam_java11_sdk:....

    Python

    In this example, we use Python 3.8 with the Apache Beam SDK version 2.34.0.

    FROM apache/beam_python3.8_sdk:2.34.0
    
    # Make your customizations here, for example:
    ENV FOO=/bar
    COPY path/to/myfile ./
    

    The runtime version of the custom container must match the runtime that you will use to start the pipeline. For example, if you will start the pipeline from a local Python 3.8 environment, the FROM line must specify a Python 3.8 environment: apache/beam_python3.8_sdk:....

  2. Build the child image and push it to a container registry.

    Cloud Build

    export PROJECT=PROJECT
    export REPO=REPO
    export TAG=TAG
    export IMAGE_URI=gcr.io/$PROJECT/$REPO:$TAG
    gcloud builds submit . --tag $IMAGE_URI
    

    Docker

    export PROJECT=PROJECT
    export REPO=REPO
    export TAG=TAG
    export IMAGE_URI=gcr.io/$PROJECT/$REPO:$TAG
    docker build . --tag $IMAGE_URI
    docker push $IMAGE_URI
    

    Replace the following:

    • PROJECT: the project name or user name.
    • REPO: the image repository name.
    • TAG: the image tag, typically latest.
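As a concrete illustration, with hypothetical placeholder values substituted for PROJECT, REPO, and TAG, the exports above compose an image URI like the following:

```shell
# Hypothetical values; substitute your own project, repository, and tag.
PROJECT=my-project
REPO=beam-custom
TAG=latest
IMAGE_URI="gcr.io/$PROJECT/$REPO:$TAG"
echo "$IMAGE_URI"   # gcr.io/my-project/beam-custom:latest
```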

Using a custom base image or multi-stage builds

If you have an existing base image or need to modify some base aspect of the default Apache Beam images (OS version, patches, and so on), use a multi-stage build process to copy the necessary artifacts from a default Apache Beam base image into your custom container image.

Here's an example Dockerfile that copies files from the Apache Beam SDK:

Java

FROM openjdk:8

# Copy files from official SDK image, including script/dependencies.
COPY --from=apache/beam_java8_sdk:2.34.0 /opt/apache/beam /opt/apache/beam

# Set the entrypoint to Apache Beam SDK launcher.
ENTRYPOINT ["/opt/apache/beam/boot"]

Python

FROM python:3.8-slim

# Install SDK.
RUN pip install --no-cache-dir apache-beam[gcp]==2.34.0

# Verify that the image does not have conflicting dependencies.
RUN pip check

# Copy files from official SDK image, including script/dependencies.
COPY --from=apache/beam_python3.8_sdk:2.34.0 /opt/apache/beam /opt/apache/beam

# Set the entrypoint to Apache Beam SDK launcher.
ENTRYPOINT ["/opt/apache/beam/boot"]

This example assumes the necessary dependencies (in this case, Python 3.8 and pip) have been installed on the existing base image. Installing the Apache Beam SDK into the image ensures that the image has the necessary SDK dependencies and reduces worker startup time. The version specified in the RUN instruction must match the version used to launch the pipeline.

Modifying the container entrypoint

Custom containers must run the default ENTRYPOINT script /opt/apache/beam/boot, which initializes the worker environment and starts the SDK worker process. If you do not set this entrypoint, your worker won't start properly.

If you need to run your own script on container startup, your image ENTRYPOINT must still properly start the SDK worker process, including passing the Dataflow arguments to this script.

This means that your custom ENTRYPOINT must end by running /opt/apache/beam/boot, and must forward any arguments passed by Dataflow on container startup to the default boot script. You can do this by creating a custom script that runs /opt/apache/beam/boot:

#!/bin/bash

echo "This is my custom script"

# ...

# Pass command arguments to the default boot script.
/opt/apache/beam/boot "$@"

Then, override the default ENTRYPOINT. An example Dockerfile:

Java

FROM apache/beam_java8_sdk:2.34.0

COPY script.sh path/to/my/script.sh
ENTRYPOINT [ "path/to/my/script.sh" ]

Python

FROM apache/beam_python3.8_sdk:2.34.0

COPY script.sh path/to/my/script.sh
ENTRYPOINT [ "path/to/my/script.sh" ]
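One pitfall not shown in the Dockerfiles above: if the copied script is not executable, the container fails at startup with a permission error. A minimal sketch (the script contents and name here are placeholders) is to set the executable bit before building the image:

```shell
# Create a placeholder entrypoint script for illustration, then make sure
# the executable bit is set before the script is copied into the image.
printf '#!/bin/bash\n/opt/apache/beam/boot "$@"\n' > script.sh
chmod +x script.sh
test -x script.sh && echo "script.sh is executable"
```

Alternatively, a `RUN chmod +x` instruction in the Dockerfile after the COPY step achieves the same result.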

Running a job with custom containers

This section discusses running pipelines with custom containers both with local runners for testing and on Dataflow. Skip to Launching the Dataflow job if you have already verified your container image and pipeline.

Before you begin

When running your pipeline, make sure to launch the pipeline using the Apache Beam SDK with the same version and language version as the SDK on your custom container image. This avoids unexpected errors from incompatible dependencies or SDKs.
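As a quick sanity check before launching, you can print the locally installed SDK version and compare it with the version tag of your custom container image (a sketch; assumes the Python SDK, and falls back to a message if apache-beam is not installed):

```shell
# Print the locally installed Apache Beam SDK version so you can compare it
# with the version tag of your custom container image (for example, 2.34.0).
BEAM_VERSION=$(python3 -c "import apache_beam; print(apache_beam.__version__)" 2>/dev/null \
  || echo "not installed")
echo "local Beam SDK: $BEAM_VERSION"
```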

Testing locally

To learn more about Apache Beam-specific usage, see the Apache Beam guide for Running pipelines with custom container images.

Basic testing with PortableRunner

Test that the remote container image can be pulled and that it can run a simple pipeline by using the Apache Beam PortableRunner.

The following runs an example pipeline:

Java

mvn compile exec:java -Dexec.mainClass=com.example.package.MyClassWithMain \
    -Dexec.args="--runner=PortableRunner \
    --job_endpoint=embed \
    --environment_type=DOCKER \
    --environment_config=IMAGE_URI \
    --inputFile=INPUT_FILE \
    --output=OUTPUT_FILE"

Python

python path/to/my/pipeline.py \
  --runner=PortableRunner \
  --job_endpoint=embed \
  --environment_type=DOCKER \
  --environment_config=IMAGE_URI \
  --input=INPUT_FILE \
  --output=OUTPUT_FILE

Replace the following:

  • IMAGE_URI: the custom container image URI. You can use the shell variable $IMAGE_URI constructed in the previous step if the variable is still in scope.
  • INPUT_FILE: an input file that can be read as a text file. This file must be accessible to the SDK harness container, either preloaded on the container image or available as a remote file.
  • OUTPUT_FILE: a file path to write output to. This path is either a remote path or a local path on the container.

After the pipeline completes, check the console logs to verify that the pipeline ran successfully and that the remote image, specified by $IMAGE_URI, was used.

Note that after the run, files saved to the container are not in your local file system, and the container has stopped. You can copy files from the stopped container's file system by using docker cp.

Alternatively:

  • Provide outputs to a remote file system like Cloud Storage. Note this may require manually configuring access for testing purposes, including credential files or Application Default Credentials.
  • Add temporary logging for quick debugging.

Using Direct Runner

For more in-depth local testing of the container image and your pipeline, use the Apache Beam DirectRunner.

You can verify your pipeline separately from the container by testing in a local environment matching the container image, or by launching the pipeline on a running container.

Java

docker run -it --entrypoint "/bin/bash" $IMAGE_URI
...
# On docker container:
root@4f041a451ef3:/#  mvn compile exec:java -Dexec.mainClass=com.example.package.MyClassWithMain ...

Python

docker run -it --entrypoint "/bin/bash" $IMAGE_URI
...
# On docker container:
root@4f041a451ef3:/#  python path/to/my/pipeline.py ...

The examples assume any pipeline files (including the pipeline itself) are on the custom container itself, have been mounted from a local file system, or are remote and accessible by Apache Beam and the container. For example, to use Maven (mvn) to run the Java example above, you must have Maven and its dependencies staged on the container. See Docker documentation on Storage and docker run for more information.

Note that the goal of DirectRunner testing is to test your pipeline in the custom container environment, not to test running your container with its default ENTRYPOINT. Modify the ENTRYPOINT (for example, docker run --entrypoint ...) to either run your pipeline directly or allow manually running commands on the container.

If you rely on specific configuration from running on Compute Engine, you can run this container directly on a Compute Engine VM. See Containers on Compute Engine for more information.

Launching the Dataflow job

When launching the Apache Beam pipeline on Dataflow, specify the path to the container image as shown below:

Java

Use --sdkContainerImage to specify a custom container image for your Java runtime.

Use --experiments=use_runner_v2 to enable Runner v2.

Python

If using SDK version 2.30.0 or later, use the pipeline option --sdk_container_image.

For older versions of the SDK, use the pipeline option --worker_harness_container_image.

Custom containers are only supported for Dataflow Runner v2. If you are launching a batch Python pipeline, set the --experiments=use_runner_v2 flag. If you are launching a streaming Python pipeline, specifying the experiment is not necessary as streaming Python pipelines use Runner v2 by default.

The following example demonstrates how to launch the batch wordcount example with a custom container.

Java

mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
   -Dexec.args="--runner=DataflowRunner \
                --inputFile=INPUT_FILE \
                --output=OUTPUT_FILE \
                --project=PROJECT_ID \
                --region=REGION \
                --gcpTempLocation=TEMP_LOCATION \
                --diskSizeGb=DISK_SIZE_GB \
                --experiments=use_runner_v2 \
                --sdkContainerImage=$IMAGE_URI"

Python

Using the Apache Beam SDK for Python version 2.30.0 or later:

python -m apache_beam.examples.wordcount \
  --input=INPUT_FILE \
  --output=OUTPUT_FILE \
  --project=PROJECT_ID \
  --region=REGION \
  --temp_location=TEMP_LOCATION \
  --runner=DataflowRunner \
  --disk_size_gb=DISK_SIZE_GB \
  --experiments=use_runner_v2 \
  --sdk_container_image=$IMAGE_URI

Replace the following:

  • INPUT_FILE: the Cloud Storage input file path read by Dataflow when running the example.
  • OUTPUT_FILE: the Cloud Storage output file path written to by the example pipeline. This will contain the word counts.
  • PROJECT_ID: the ID of your Google Cloud project.
  • REGION: the regional endpoint for deploying your Dataflow job.
  • TEMP_LOCATION: the Cloud Storage path for Dataflow to stage temporary job files created during the execution of the pipeline.
  • DISK_SIZE_GB: optional. If your container is large, consider increasing the default boot disk size to avoid running out of disk space.
  • $IMAGE_URI: the custom container image URI. You can use the shell variable $IMAGE_URI constructed in the previous step if the variable is still in scope.

Troubleshooting

This section provides instructions for troubleshooting issues with using custom containers in Dataflow. It focuses on issues with containers or workers not starting. If your workers are able to start and work is progressing, follow the general guidance for Troubleshooting your pipeline.

Before reaching out for support, ensure that you have ruled out problems related to your container image:

  • Follow the steps to test your container image locally.
  • Search for errors in the Job logs or in Worker logs, and compare any errors found against Common errors guidance.
  • Make sure that the Apache Beam SDK version and language version you are using to launch the pipeline match the SDK on your custom container image.
  • If using Java, make sure that the Java major version you use to launch the pipeline matches the version installed in your container image.
  • If using Python, make sure that the Python major-minor version you use to launch the pipeline matches the version installed in your container image, and that the image does not have conflicting dependencies. You can run pip check to confirm.
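As a sketch of the Python version check, you can extract the major-minor version embedded in a default SDK image name (the image tag shown is a hypothetical example) and compare it with your local interpreter:

```shell
# Hypothetical image name; extract the Python major-minor version from the
# default SDK image tag and compare it to the local interpreter's version.
IMAGE="apache/beam_python3.8_sdk:2.34.0"
IMAGE_PY=$(echo "$IMAGE" | sed -E 's/.*beam_python([0-9]+\.[0-9]+)_sdk.*/\1/')
LOCAL_PY=$(python3 -c 'import sys; print("%d.%d" % sys.version_info[:2])' 2>/dev/null || echo "unknown")
echo "image: $IMAGE_PY, local: $LOCAL_PY"
```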

Finding worker logs related to custom containers

You can find container-related error messages in the Dataflow worker logs by using the Logs Explorer:

  1. Select log names. Custom container startup errors are most likely to appear in one of the following:

    • dataflow.googleapis.com/docker
    • dataflow.googleapis.com/kubelet
    • dataflow.googleapis.com/worker-startup
  2. Select the Dataflow Step resource and specify the job_id.

In particular, if you see Error Syncing pod... log messages, follow the common error guidance. You can query for these log messages in the Dataflow worker logs by using the Logs Explorer with the following query:

resource.type="dataflow_step" AND jsonPayload.message:("$IMAGE_URI") AND severity="ERROR"

Common issues

Job has errors or failed because container image cannot be pulled

Custom container images must be accessible to the Dataflow workers. If a worker cannot pull the image because of an invalid URL, misconfigured credentials, or missing network access, the worker fails to start.

For batch jobs where no work has started and several consecutive worker startups fail, Dataflow fails the job. Otherwise, Dataflow logs the errors but does not take further action, to avoid destroying long-running job state.

Follow common error guidance to determine why and how to fix this issue.

Workers are not starting or work is not progressing

In some cases, if the SDK container fails to start due to an error, Dataflow might be unable to determine whether the error is permanent or fatal, and it continuously attempts to restart the worker as it fails.

Find specific errors in Worker logs and check common error guidance.

If there are no obvious errors but you see [topologymanager] RemoveContainer INFO-level logs in dataflow.googleapis.com/kubelet, these logs indicate that the custom container image is exiting early and did not start the long-running worker SDK process. This can happen if a custom ENTRYPOINT does not start the default boot script /opt/apache/beam/boot or does not pass arguments to it appropriately. See Modifying the container entrypoint.