This page describes how to customize the runtime environment of user code in Dataflow pipelines by supplying a custom container image. Custom containers are supported for pipelines using Dataflow Runner v2.
When Dataflow launches worker VMs, it uses Docker container images to launch containerized SDK processes on the workers. You can specify a custom container image instead of using one of the default Apache Beam images. When you specify a custom container image, Dataflow launches workers that pull the specified image. The following list includes reasons you might use a custom container:
- Preinstalling pipeline dependencies to reduce worker start time.
- Preinstalling pipeline dependencies that are not available in public repositories.
- Prestaging large files to reduce worker start time.
- Launching third-party software in the background.
- Customizing the execution environment.
For a more in-depth look into custom containers, see the Apache Beam custom container guide. For examples of Python pipelines that use custom containers, see Dataflow custom containers.
Before you begin
Verify that the installed Apache Beam SDK version supports Runner v2 and your language version.
Use the following pipeline options to enable custom containers:
Java
Use `--experiments=use_runner_v2` to enable Runner v2.
Use `--sdkContainerImage` to specify a custom container image for your Java runtime.
Python
If using SDK version 2.30.0 or later, use the pipeline option `--sdk_container_image`.
For older versions of the SDK, use the pipeline option `--worker_harness_container_image`.
Go
If using SDK version 2.40.0 or later, use the pipeline option `--sdk_container_image`.
For older versions of the SDK, use the pipeline option `--worker_harness_container_image`.
For more information, see the guide for Installing the Apache Beam SDK.
To test your container image locally, you must have Docker installed. For more information, see Get Docker.
Default SDK container images
We recommend starting with a default Apache Beam SDK image as your base container image. Default images are released as part of Apache Beam releases to Docker Hub.
Create and build the container image
This section provides examples for different ways to create a custom SDK container image.
A custom SDK container image must meet the following requirements:
- The Apache Beam SDK and necessary dependencies are installed.
- The default `ENTRYPOINT` script (`/opt/apache/beam/boot` on default containers) runs as the last step during container startup. See Modify the container entrypoint for more information.
Use an Apache Beam base image
To create a custom container image, specify the Apache Beam image as the parent image and add your own customizations. For more information about writing Dockerfiles, see Best practices for writing Dockerfiles.
Create a new `Dockerfile`, specifying the base image using the `FROM` instruction.
Java
In this example, we use Java 8 with the Apache Beam SDK version 2.46.0.
FROM apache/beam_java8_sdk:2.46.0

# Make your customizations here, for example:
ENV FOO=/bar
COPY path/to/myfile ./
The runtime version of the custom container must match the runtime that you will use to start the pipeline. For example, if you will start the pipeline from a local Java 11 environment, the `FROM` line must specify a Java 11 environment: `apache/beam_java11_sdk:...`.
Python
In this example, we use Python 3.8 with the Apache Beam SDK version 2.46.0.
FROM apache/beam_python3.8_sdk:2.46.0

# Make your customizations here, for example:
ENV FOO=/bar
COPY path/to/myfile ./
The runtime version of the custom container must match the runtime that you will use to start the pipeline. For example, if you will start the pipeline from a local Python 3.8 environment, the `FROM` line must specify a Python 3.8 environment: `apache/beam_python3.8_sdk:...`.
Go
In this example, we use Go with the Apache Beam SDK version 2.46.0.
FROM apache/beam_go_sdk:2.46.0

# Make your customizations here, for example:
ENV FOO=/bar
COPY path/to/myfile ./
Build the child image and push this image to a container registry.
Cloud Build
export PROJECT=PROJECT
export REPO=REPO
export TAG=TAG
export IMAGE_URI=gcr.io/$PROJECT/$REPO:$TAG
gcloud builds submit . --tag $IMAGE_URI
Docker
export PROJECT=PROJECT
export REPO=REPO
export TAG=TAG
export IMAGE_URI=gcr.io/$PROJECT/$REPO:$TAG
docker build . --tag $IMAGE_URI
docker push $IMAGE_URI
Replace the following:
- `PROJECT`: the project name or user name.
- `REPO`: the image repository name.
- `TAG`: the image tag, typically `latest`.
Use a custom base image or multi-stage builds
If you have an existing base image or need to modify some base aspect of the default Apache Beam images (such as the OS version or patches), use a multi-stage build process to copy the necessary artifacts from a default Apache Beam base image into your custom container image.
The following example shows a Dockerfile that copies files from the Apache Beam SDK:
Java
FROM openjdk:8
# Copy files from official SDK image, including script/dependencies.
COPY --from=apache/beam_java8_sdk:2.46.0 /opt/apache/beam /opt/apache/beam
# Set the entrypoint to Apache Beam SDK launcher.
ENTRYPOINT ["/opt/apache/beam/boot"]
Python
FROM python:3.8-slim
# Install SDK.
RUN pip install --no-cache-dir apache-beam[gcp]==2.46.0
# Verify that the image does not have conflicting dependencies.
RUN pip check
# Copy files from official SDK image, including script/dependencies.
COPY --from=apache/beam_python3.8_sdk:2.46.0 /opt/apache/beam /opt/apache/beam
# Set the entrypoint to Apache Beam SDK launcher.
ENTRYPOINT ["/opt/apache/beam/boot"]
This example assumes necessary dependencies (in this case, Python 3.8 and `pip`) have been installed on the existing base image. Installing the Apache Beam SDK into the image ensures that the image has the necessary SDK dependencies and reduces the worker startup time. Important: the SDK version specified in the `RUN` and `COPY` instructions must match the version used to launch the pipeline.
Go
FROM golang:latest
# Copy files from official SDK image, including script/dependencies.
COPY --from=apache/beam_go_sdk:2.46.0 /opt/apache/beam /opt/apache/beam
# Set the entrypoint to Apache Beam SDK launcher.
ENTRYPOINT ["/opt/apache/beam/boot"]
Modify the container entrypoint
Custom containers must run the default `ENTRYPOINT` script `/opt/apache/beam/boot`, which initializes the worker environment and starts the SDK worker process. If you do not set this entrypoint, your worker does not start properly.
If you need to run your own script during container startup, your image `ENTRYPOINT` must properly start the SDK worker process, including passing Dataflow arguments to this script. Therefore, your custom `ENTRYPOINT` must end by running `/opt/apache/beam/boot`, and any required arguments passed by Dataflow during container startup must be passed to the default boot script. To do this, create a custom script that runs `/opt/apache/beam/boot`:
#!/bin/bash
echo "This is my custom script"
# ...
# Pass command arguments to the default boot script.
/opt/apache/beam/boot "$@"
Then, override the default `ENTRYPOINT`. The following `Dockerfile` example demonstrates this step:
Java
FROM apache/beam_java8_sdk:2.46.0
COPY script.sh path/to/my/script.sh
ENTRYPOINT [ "path/to/my/script.sh" ]
Python
FROM apache/beam_python3.8_sdk:2.46.0
COPY script.sh path/to/my/script.sh
ENTRYPOINT [ "path/to/my/script.sh" ]
Go
FROM apache/beam_go_sdk:2.46.0
COPY script.sh path/to/my/script.sh
ENTRYPOINT [ "path/to/my/script.sh" ]
Run a job with custom containers
This section discusses running pipelines with custom containers both with local runners for testing and on Dataflow. Skip to Launching the Dataflow job if you have already verified your container image and pipeline.
Before you begin
When running your pipeline, launch the pipeline using the Apache Beam SDK with the same version and language version as the SDK on your custom container image. This step avoids unexpected errors from incompatible dependencies or SDKs.
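As a quick pre-launch sanity check, you can compare the locally installed SDK version against the version tag on your image URI. The following is a minimal sketch; the helper name is hypothetical, and it assumes the image tag equals the Beam SDK version, as in the examples on this page:

```python
def versions_match(local_sdk_version: str, image_uri: str) -> bool:
    """Return True if the image tag matches the local SDK version.

    Assumes the image URI ends with ':<tag>' and that the tag is the
    Apache Beam SDK version, as in apache/beam_python3.8_sdk:2.46.0.
    """
    tag = image_uri.rsplit(":", 1)[-1]
    return tag == local_sdk_version

print(versions_match("2.46.0", "apache/beam_python3.8_sdk:2.46.0"))  # True
print(versions_match("2.46.0", "apache/beam_python3.8_sdk:2.45.0"))  # False
```

If the check fails, either rebuild the image from the matching base tag or change the SDK version in your launch environment.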
Test locally
To learn more about Apache Beam-specific usage, see the Apache Beam guide for Running pipelines with custom container images.
Basic testing with PortableRunner
To verify that remote container images can be pulled and can run a simple pipeline, use the Apache Beam `PortableRunner`. When you use the `PortableRunner`, job submission occurs in the local environment, and the `DoFn` execution happens in the Docker environment.
When you use GPUs, the Docker container might not have access to the GPUs. To test your container with GPUs, use the direct runner and follow the steps for testing a container image on a standalone VM with GPUs in the Debug with a standalone VM section of the Use GPUs page.
The following runs an example pipeline:
Java
mvn compile exec:java -Dexec.mainClass=com.example.package.MyClassWithMain \
-Dexec.args="--runner=PortableRunner \
--jobEndpoint=ENDPOINT \
--defaultEnvironmentType=DOCKER \
--defaultEnvironmentConfig=IMAGE_URI \
--inputFile=INPUT_FILE \
--output=OUTPUT_FILE"
Python
python path/to/my/pipeline.py \
--runner=PortableRunner \
--job_endpoint=ENDPOINT \
--environment_type=DOCKER \
--environment_config=IMAGE_URI \
--input=INPUT_FILE \
--output=OUTPUT_FILE
Go
go run path/to/my/pipeline.go \
--runner=PortableRunner \
--job_endpoint=ENDPOINT \
--environment_type=DOCKER \
--environment_config=IMAGE_URI \
--input=INPUT_FILE \
--output=OUTPUT_FILE
Replace the following:
- `ENDPOINT`: the job service endpoint to use, in the form of address and port. For example: `localhost:3000`. Use `embed` to run an in-process job service.
- `IMAGE_URI`: the custom container image URI. You can use the shell variable `$IMAGE_URI` constructed in the previous step if the variable is still in scope.
- `INPUT_FILE`: an input file that can be read as a text file. This file must be accessible by the SDK harness container image, either preloaded on the container image or a remote file.
- `OUTPUT_FILE`: a file path to write output to. This path is either a remote path or a local path on the container.
When the pipeline completes, review the console logs to verify that it completed successfully and that the remote image, specified by `$IMAGE_URI`, was used.
After running the pipeline, files saved to the container are not in your local file system, and the container is stopped. You can copy files from the stopped container file system by using `docker cp`.
Alternatively:
- Provide outputs to a remote file system like Cloud Storage. You might need to manually configure access for testing purposes, including for credential files or Application Default Credentials.
- For quick debugging, add temporary logging.
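For the temporary-logging option, a sketch of what a debug log might look like in element-processing code (the function and message are illustrative; in a real Beam pipeline this would live inside a DoFn's `process` method):

```python
import logging

logging.basicConfig(level=logging.INFO)

def process_element(element):
    # Temporary debug logging; remove before production runs.
    logging.info("Processing element: %r", element)
    return element.upper()

print(process_element("hello"))  # HELLO
```

Log output written this way appears in the container's console logs, where you can inspect it without copying files out of the stopped container.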
Use Direct Runner
For more in-depth local testing of the container image and your pipeline, use the Apache Beam DirectRunner.
You can verify your pipeline separately from the container by testing in a local environment matching the container image, or by launching the pipeline on a running container.
Java
docker run -it --entrypoint "/bin/bash" $IMAGE_URI
...
# On docker container:
root@4f041a451ef3:/# mvn compile exec:java -Dexec.mainClass=com.example.package.MyClassWithMain ...
Python
docker run -it --entrypoint "/bin/bash" $IMAGE_URI
...
# On docker container:
root@4f041a451ef3:/# python path/to/my/pipeline.py ...
Go
docker run -it --entrypoint "/bin/bash" $IMAGE_URI
...
# On docker container:
root@4f041a451ef3:/# go run path/to/my/pipeline.go ...
The examples assume any pipeline files, including the pipeline itself, are on the custom container, have been mounted from a local file system, or are remote and accessible by Apache Beam and the container. For example, to use Maven (`mvn`) to run the previous Java example, Maven and its dependencies must be staged on the container. See the Docker documentation on Storage and `docker run` for more information.
The goal for DirectRunner testing is to test your pipeline in the custom container environment, not to test running your container with its default `ENTRYPOINT`. Modify the `ENTRYPOINT` (for example, `docker run --entrypoint ...`) to either directly run your pipeline or to allow manually running commands on the container.
If you rely on specific configuration from running on Compute Engine, you can run this container directly on a Compute Engine VM. See Containers on Compute Engine for more information.
Launch the Dataflow job
When launching the Apache Beam pipeline on Dataflow, specify the path to the container image:
Java
Use `--sdkContainerImage` to specify an SDK container image for your Java runtime.
Use `--experiments=use_runner_v2` to enable Runner v2.
Python
If using SDK version 2.30.0 or later, use the pipeline option `--sdk_container_image` to specify an SDK container image.
For older versions of the SDK, use the pipeline option `--worker_harness_container_image` to specify the location of the container image to use for the worker harness.
Custom containers are only supported for Dataflow Runner v2. If you are launching a batch Python pipeline, set the `--experiments=use_runner_v2` flag. If you are launching a streaming Python pipeline, specifying the experiment is not necessary, because streaming Python pipelines use Runner v2 by default.
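The batch-versus-streaming rule for Python pipelines can be sketched as a small helper (hypothetical; not part of the Beam API):

```python
def needs_runner_v2_experiment(streaming: bool) -> bool:
    """Whether a Python pipeline must set --experiments=use_runner_v2.

    Streaming Python pipelines use Runner v2 by default, so only batch
    pipelines need the experiment flag set explicitly.
    """
    return not streaming

# Batch pipeline: flag required.
print(needs_runner_v2_experiment(streaming=False))  # True
# Streaming pipeline: flag not required.
print(needs_runner_v2_experiment(streaming=True))   # False
```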
Go
If using SDK version 2.40.0 or later, use the pipeline option `--sdk_container_image` to specify an SDK container image.
For older versions of the SDK, use the pipeline option `--worker_harness_container_image` to specify the location of the container image to use for the worker harness.
Custom containers are supported on all versions of the Go SDK because they use Dataflow Runner v2 by default.
The following example demonstrates how to launch the batch wordcount example with a custom container.
Java
mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
-Dexec.args="--runner=DataflowRunner \
--inputFile=INPUT_FILE \
--output=OUTPUT_FILE \
--project=PROJECT_ID \
--region=REGION \
--gcpTempLocation=TEMP_LOCATION \
--diskSizeGb=DISK_SIZE_GB \
--experiments=use_runner_v2 \
--sdkContainerImage=$IMAGE_URI"
Python
Using the Apache Beam SDK for Python version 2.30.0 or later:
python -m apache_beam.examples.wordcount \
--input=INPUT_FILE \
--output=OUTPUT_FILE \
--project=PROJECT_ID \
--region=REGION \
--temp_location=TEMP_LOCATION \
--runner=DataflowRunner \
--disk_size_gb=DISK_SIZE_GB \
--experiments=use_runner_v2 \
--sdk_container_image=$IMAGE_URI
Go
wordcount --input gs://dataflow-samples/shakespeare/kinglear.txt \
--output gs://<your-gcs-bucket>/counts \
--runner dataflow \
--project your-gcp-project \
--region your-gcp-region \
--temp_location gs://<your-gcs-bucket>/tmp/ \
--staging_location gs://<your-gcs-bucket>/binaries/ \
--sdk_container_image=$IMAGE_URI
Replace the following:
- `INPUT_FILE`: the Cloud Storage input file path read by Dataflow when running the example.
- `OUTPUT_FILE`: the Cloud Storage output file path written to by the example pipeline. This file contains the word counts.
- `PROJECT_ID`: the ID of your Google Cloud project.
- `REGION`: the regional endpoint for deploying your Dataflow job.
- `TEMP_LOCATION`: the Cloud Storage path for Dataflow to stage temporary job files created during the execution of the pipeline.
- `DISK_SIZE_GB`: (Optional) If your container is large, consider increasing the default boot disk size to avoid running out of disk space.
- `$IMAGE_URI`: the custom SDK container image URI. You can use the shell variable `$IMAGE_URI` constructed in the previous step if the variable is still in scope.
Pre-build the Python SDK custom container image with extra dependencies
When you launch a Python Dataflow job, extra dependencies, such as those specified by using `--requirements_file` and `--extra_packages`, are installed in each SDK worker container started by the Dataflow service. The dependency installation often leads to high CPU usage and a long warm-up period on all newly started Dataflow workers when the job first starts and during autoscaling.
To avoid repetitive dependency installations, you can pre-build the custom Python SDK container image with the dependencies pre-installed.
Pre-build using a Dockerfile
To add any extra dependencies directly to your Python custom container, use the following commands:
FROM apache/beam_python3.8_sdk:2.46.0
# Pre-install Python dependencies.
RUN pip install lxml
# Pre-install other dependencies.
RUN apt-get update \
  && apt-get dist-upgrade -y \
  && apt-get install -y --no-install-recommends ffmpeg
# Set the entrypoint to the Apache Beam SDK launcher.
ENTRYPOINT ["/opt/apache/beam/boot"]
Submit your job with the `--sdk_container_image` and `--sdk_location` pipeline options. The `--sdk_location` option prevents the SDK from downloading when your job launches; instead, the SDK is retrieved directly from the container image.
python -m apache_beam.examples.wordcount \
--input=INPUT_FILE \
--output=OUTPUT_FILE \
--project=PROJECT_ID \
--region=REGION \
--temp_location=TEMP_LOCATION \
--runner=DataflowRunner \
--disk_size_gb=DISK_SIZE_GB \
--experiments=use_runner_v2 \
--sdk_container_image=$IMAGE_URI \
--sdk_location=container
Pre-build when submitting the job
You can also use the Python SDK container image pre-building workflow. Submit your job with the `--prebuild_sdk_container_engine` pipeline option, along with the `--docker_registry_push_url` and `--sdk_location` options.
When you use the `--prebuild_sdk_container_engine` pipeline option, Apache Beam generates a custom container and installs all the dependencies specified by the `--requirements_file` and `--extra_packages` options.
If your project has the Cloud Build API enabled, you can build the image with Cloud Build by using the following pipeline options:
python -m apache_beam.examples.wordcount \
--input=INPUT_FILE \
--output=OUTPUT_FILE \
--project=PROJECT_ID \
--region=REGION \
--temp_location=TEMP_LOCATION \
--runner=DataflowRunner \
--disk_size_gb=DISK_SIZE_GB \
--experiments=use_runner_v2 \
--requirements_file=./requirements.txt \
--prebuild_sdk_container_engine=cloud_build \
--docker_registry_push_url=GAR_PATH/GCR_PATH \
--sdk_location=container
If you have a local Docker installation and prefer to pre-build the image locally, use the following pipeline options:
python -m apache_beam.examples.wordcount \
--input=INPUT_FILE \
--output=OUTPUT_FILE \
--project=PROJECT_ID \
--region=REGION \
--temp_location=TEMP_LOCATION \
--runner=DataflowRunner \
--disk_size_gb=DISK_SIZE_GB \
--experiments=use_runner_v2 \
--requirements_file=./requirements.txt \
--prebuild_sdk_container_engine=local_docker \
--docker_registry_push_url=GAR_PATH/GCR_PATH \
--sdk_location=container
Replace `GAR_PATH/GCR_PATH` with an Artifact Registry folder or a Container Registry path prefix. The pre-built SDK container is tagged and pushed to this location.
The SDK container image pre-building workflow uses the image passed using the `--sdk_container_image` pipeline option as the base image. If the option is not set, an Apache Beam image is used as the base image by default.
You can reuse a pre-built Python SDK container image in another job with the same dependencies and SDK version.
To reuse the image, pass the pre-built container image URL to the other job by using the `--sdk_container_image` pipeline option. Remove the dependency options `--requirements_file`, `--extra_packages`, and `--setup_file`.
If you don't plan to reuse the image, delete it after the job completes. You can delete the image with the gcloud CLI or in the Artifact Registry or Container Registry pages in the Google Cloud console.
For example, if the image is stored in Container Registry, use the `container images delete` command:
gcloud container images delete IMAGE --force-delete-tags
If the image is stored in Artifact Registry, use the `artifacts docker images delete` command:
gcloud artifacts docker images delete IMAGE --delete-tags
Common issues:
- If your job has extra Python dependencies from a private PyPI mirror that cannot be reached by a remote Cloud Build job, try using the local Docker option or building your container from a Dockerfile.
- If the Cloud Build job fails with `docker exit code 137`, the build job ran out of memory, potentially due to the size of the dependencies being installed. Use a larger Cloud Build worker machine type by passing `--cloud_build_machine_type=machine_type`, where `machine_type` is one of the following options:
  - `n1-highcpu-8`
  - `n1-highcpu-32`
  - `e2-highcpu-8`
  - `e2-highcpu-32`
  By default, Cloud Build uses the machine type `e2-medium`.
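For the private-mirror case above, a Dockerfile-based build might look like the following sketch. The index URL and package name are placeholders; substitute your own mirror and dependencies:

```dockerfile
FROM apache/beam_python3.8_sdk:2.46.0
# Hypothetical private index URL and package; replace with your own.
RUN pip install --index-url https://pypi.example.com/simple/ my-private-package
# Keep the default Apache Beam SDK entrypoint.
ENTRYPOINT ["/opt/apache/beam/boot"]
```

Build and push this image with Docker or Cloud Build as shown earlier, then submit the job with `--sdk_container_image` and without the `--requirements_file` or `--extra_packages` options.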
Troubleshooting
This section provides instructions for troubleshooting issues with using custom containers in Dataflow. It focuses on issues with containers or workers not starting. If your workers are able to start and work is progressing, follow the general guidance for Troubleshooting your pipeline.
Before reaching out for support, ensure that you have ruled out problems related to your container image:
- Follow the steps to test your container image locally.
- Search for errors in the Job logs or in Worker logs, and compare any errors found with the common error guidance.
- Make sure that the Apache Beam SDK version and language version you are using to launch the pipeline match the SDK version on your custom container image.
- If using Java, make sure that the Java major version you use to launch the pipeline matches the version installed in your container image.
- If using Python, make sure that the Python major-minor version you use to launch the pipeline matches the version installed in your container image, and that the image does not have conflicting dependencies. You can run `pip check` to confirm.
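To compare interpreter versions, you can print the major-minor version in the launch environment and inside the container (a minimal sketch):

```python
import sys

# Major-minor version of the Python interpreter in the launch environment;
# compare this against the interpreter installed in your container image.
launch_version = f"{sys.version_info.major}.{sys.version_info.minor}"
print(launch_version)
```

Running the same snippet inside the container (for example, with `docker run --entrypoint python $IMAGE_URI -c '...'`) shows the container's version for comparison.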
Find worker logs related to custom containers
You can find Dataflow worker logs for container-related error messages by using the Logs Explorer:
Select log names. Custom container startup errors are most likely to be in one of the following:
- `dataflow.googleapis.com/kubelet`
- `dataflow.googleapis.com/docker`
- `dataflow.googleapis.com/worker-startup`
- `dataflow.googleapis.com/harness-startup`
Select the Dataflow Step resource and specify the `job_id`.
If you see `Error Syncing pod...` log messages, follow the common error guidance. You can query for these log messages in Dataflow worker logs by using Logs Explorer with the following query:
resource.type="dataflow_step" AND jsonPayload.message:("$IMAGE_URI") AND severity="ERROR"
Common issues
Job has errors or failed because container image cannot be pulled
Dataflow workers must be able to access custom container images. If the worker is unable to pull the image due to invalid URLs, misconfigured credentials, or missing network access, the worker fails to start.
For batch jobs, if no work has started and several workers in a row are unable to start, Dataflow fails the job. Otherwise, Dataflow logs errors but does not take further action, to avoid destroying long-running job state.
For information about how to fix this issue, see Image pull request failed with error in the Troubleshoot Dataflow errors page.
Workers are not starting or work is not progressing
In some cases, if the SDK container fails to start due to an error, Dataflow is unable to determine whether the error is permanent or fatal and continuously attempts to restart the worker as it fails.
If there are no obvious errors but you see `[topologymanager] RemoveContainer` INFO-level logs in `dataflow.googleapis.com/kubelet`, these logs indicate that the custom container image is exiting early and did not start the long-running worker SDK process.
If workers have started successfully but no work is happening, an error might be preventing the SDK container from starting. In this case, the following error appears in the diagnostic recommendations:
Failed to start container
In addition, the worker logs do not contain lines such as the following:
`Executing: python -m apache_beam.runners.worker.sdk_worker_main` or `Executing: java ... FnHarness`
Find specific errors in Worker logs and check common error guidance.
Common causes for these issues include the following:
- Problems with package installation, such as `pip` installation errors due to dependency issues. See Error syncing pod ... failed to "StartContainer".
- Errors with the custom command arguments or with the `ENTRYPOINT` set in the Dockerfile. For example, a custom `ENTRYPOINT` does not start the default boot script `/opt/apache/beam/boot` or does not pass arguments appropriately to this script. For more information, see Modify the container entrypoint.