This page describes how to customize the runtime environment of Python user code in Dataflow pipelines by supplying a custom container image.
When Dataflow launches worker VMs, it uses Docker container images. You can specify a custom container image instead of using one of the default Apache Beam images. When you specify a custom container image, Dataflow launches workers that pull the specified image. These are some reasons you might use a custom container:
- Preinstalling pipeline dependencies to reduce worker start time.
- Preinstalling pipeline dependencies not available in public repositories.
- Launching third-party software in the background.
- Customizing the execution environment.
For a more in-depth look into custom containers, see the Apache Beam custom container guide.
Before you begin
Verify that you have Apache Beam SDK version 2.25.0 or later installed, and check that this SDK version supports your Python version. For more information, see the guide for Installing the Apache Beam SDK.
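For example, one way to install a pinned SDK version and confirm what is installed (the exact version pin is illustrative):

```
# Install the Apache Beam SDK with the GCP extras (2.25.0 is the version
# used in the examples on this page), then print the installed version.
pip install 'apache-beam[gcp]==2.25.0'
python -c "import apache_beam; print(apache_beam.__version__)"
```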
Ensure you have Docker installed to test your container image locally. For more information, see Get Docker.
Creating and building the container image
This example uses Python 3.8 with Apache Beam SDK version 2.25.0. To create a custom container image, specify the Apache Beam image as the parent image and add your own customizations.
- Create a new `Dockerfile`, specifying `apache/beam_python3.8_sdk:2.25.0` as the parent image, and add any customizations. For more information on writing Dockerfiles, see Best practices for writing Dockerfiles.

  ```
  FROM apache/beam_python3.8_sdk:2.25.0

  # Make your customizations here, for example:
  ENV FOO=/bar
  COPY path/to/myfile ./
  ```
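  A common customization is preinstalling pipeline dependencies so that workers don't download them at startup. As a sketch, assuming your dependencies are listed in a `requirements.txt` file (a hypothetical file name):

  ```
  FROM apache/beam_python3.8_sdk:2.25.0

  # Hypothetical example: copy a requirements file into the image and
  # preinstall the pipeline's Python dependencies at build time.
  COPY requirements.txt .
  RUN pip install --no-cache-dir -r requirements.txt
  ```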
- Build the child image and push it to a container registry.

  Cloud Build:

  ```
  export PROJECT=PROJECT
  export REPO=REPO
  export TAG=TAG
  export REGISTRY_HOST=HOST
  export IMAGE_URI=$REGISTRY_HOST/$PROJECT/$REPO:$TAG

  gcloud builds submit --tag $IMAGE_URI
  ```

  Docker:

  ```
  export PROJECT=PROJECT
  export REPO=REPO
  export TAG=TAG
  export REGISTRY_HOST=HOST
  export IMAGE_URI=$REGISTRY_HOST/$PROJECT/$REPO:$TAG

  docker build -f Dockerfile -t $IMAGE_URI ./
  docker push $IMAGE_URI
  ```
Replace the following:

- `PROJECT`: the project name or user name.
- `REPO`: the image repository name.
- `TAG`: the image tag.
- `HOST`: the image registry host name, for example, `gcr.io`.
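For example, if you host the image in Container Registry, the variables might look like this (the project and repository names here are hypothetical):

```
# Hypothetical values for illustration only.
export PROJECT=my-project
export REPO=beam-custom
export TAG=latest
export REGISTRY_HOST=gcr.io
export IMAGE_URI=$REGISTRY_HOST/$PROJECT/$REPO:$TAG

gcloud builds submit --tag $IMAGE_URI
```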
Testing locally
Test the container image locally by running the Apache Beam `wordcount` example using the `PortableRunner`:

```
python -m apache_beam.examples.wordcount \
  --input=INPUT_FILE \
  --output=OUTPUT_FILE \
  --runner=PortableRunner \
  --job_endpoint=embed \
  --environment_type=DOCKER \
  --environment_config=$IMAGE_URI
```
Replace the following:

- `INPUT_FILE`: a file in the container image that can be read as a text file.
- `OUTPUT_FILE`: a file path in the container image for writing the output. The output file is not accessible after execution completes because the container is shut down.
- `$IMAGE_URI`: the custom container image URI. You can use the shell variable `$IMAGE_URI` constructed in the previous step if the variable is still in scope.
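As a concrete sketch, you might read a file that ships in the base image and write the output inside the container (both paths are assumptions for illustration):

```
# Hypothetical invocation: /etc/profile is assumed to exist in the image
# and be readable as text; the output is written to the container's /tmp.
python -m apache_beam.examples.wordcount \
  --input=/etc/profile \
  --output=/tmp/counts \
  --runner=PortableRunner \
  --job_endpoint=embed \
  --environment_type=DOCKER \
  --environment_config=$IMAGE_URI
```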
Look at the console logs to verify that the pipeline completed successfully and that the remote image, specified by `$IMAGE_URI`, was used.
For more information, see the Apache Beam guide for Running pipelines with custom container images.
Launching the Dataflow job
Specify the path to the container image when launching the Apache Beam pipeline on Dataflow. If you're launching a batch Python pipeline, you must set the `--experiment=use_runner_v2` flag. If you're launching a streaming Python pipeline, specifying the experiment is not necessary. For example, to launch the batch `wordcount` example with a custom container image:
```
python -m apache_beam.examples.wordcount \
  --input=INPUT_FILE \
  --output=OUTPUT_FILE \
  --project=PROJECT_ID \
  --region=REGION \
  --temp_location=TEMP_LOCATION \
  --runner=DataflowRunner \
  --experiment=use_runner_v2 \
  --worker_harness_container_image=$IMAGE_URI
```
Replace the following:

- `INPUT_FILE`: the Cloud Storage input file path read by Dataflow when running the example.
- `OUTPUT_FILE`: the Cloud Storage output file path written by the example pipeline. This file contains the word counts.
- `PROJECT_ID`: the ID of your Google Cloud project.
- `REGION`: the regional endpoint for deploying your Dataflow job.
- `TEMP_LOCATION`: the Cloud Storage path for Dataflow to stage temporary job files created during pipeline execution.
- `$IMAGE_URI`: the custom container image URI. You can use the shell variable `$IMAGE_URI` constructed in the previous step if the variable is still in scope.
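For instance, a launch using the public Shakespeare sample file as input, with hypothetical project, region, and bucket values:

```
# The input file is a public Dataflow sample; the project, region, and
# bucket values below are hypothetical.
python -m apache_beam.examples.wordcount \
  --input=gs://dataflow-samples/shakespeare/kinglear.txt \
  --output=gs://my-bucket/counts \
  --project=my-project \
  --region=us-central1 \
  --temp_location=gs://my-bucket/tmp \
  --runner=DataflowRunner \
  --experiment=use_runner_v2 \
  --worker_harness_container_image=$IMAGE_URI
```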
Troubleshooting
This section provides instructions for troubleshooting common issues found when interacting with custom containers in Dataflow.
Before reaching out for support, ensure that you have ruled out problems related to your container image by following the steps in Testing locally and in the following troubleshooting sections.
Finding container logs
To find Dataflow worker logs for container-related error messages, use the Logs Explorer:
1. Select the following log names:

   - `dataflow.googleapis.com/docker`
   - `dataflow.googleapis.com/kubelet`
   - `dataflow.googleapis.com/worker`

2. Select the Dataflow Step resource and specify the `job_id`.
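You can also run an equivalent query from the command line with the gcloud CLI; this sketch assumes a `JOB_ID` placeholder for your job's ID:

```
# Hypothetical sketch: read container-related worker logs for one job.
# Replace JOB_ID with your Dataflow job ID.
gcloud logging read \
  'resource.type="dataflow_step" AND resource.labels.job_id="JOB_ID" AND log_id("dataflow.googleapis.com/docker")' \
  --limit=50
```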
Workers don't start
If your workers aren't starting, or your job times out, verify that Dataflow is able to pull your custom container image.
Query the Dataflow logs using Logs Explorer for an `Error Syncing pod...` log message with the following query:

```
resource.type="dataflow_step" AND jsonPayload.message:("$IMAGE_URI") AND severity="ERROR"
```
The image must be accessible to Dataflow workers so that the workers can pull it at startup. If you use Container Registry to host your container image, the default Google Cloud service account can already access images in the same project. If Dataflow can't pull the container image, the workers cannot start.
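As a quick sanity check, you can try pulling the image with your own credentials; note that this verifies your access, not necessarily the worker service account's:

```
# Confirms the image URI resolves and is pullable with your local
# credentials; worker service-account permissions can still differ.
docker pull $IMAGE_URI
```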
For more information, see Configuring access control.