Use custom containers in Dataflow

You can customize the runtime environment of user code in Dataflow pipelines by supplying a custom container image. Custom containers are supported for pipelines that use Dataflow Runner v2.

When Dataflow starts up worker VMs, it uses Docker container images to launch containerized SDK processes on the workers. By default, a pipeline uses a prebuilt Apache Beam image. However, you can provide a custom container image for your Dataflow job. When you specify a custom container image, Dataflow launches workers that pull the specified image.
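A custom image is typically built on top of the prebuilt Apache Beam SDK image, so the SDK entrypoint the workers expect is already in place. The following sketch assumes a Python pipeline; the SDK version tag (`2.53.0`), Python version, and the `requirements.txt` and `model/` paths are illustrative placeholders you would replace with your own.

```dockerfile
# Start from the prebuilt Apache Beam SDK image. Match the tag to the
# Beam SDK version your pipeline uses (placeholder version shown here).
FROM apache/beam_python3.11_sdk:2.53.0

# Preinstall pipeline dependencies so workers don't install them at startup.
COPY requirements.txt /tmp/requirements.txt
RUN pip install --no-cache-dir -r /tmp/requirements.txt

# Prestage large files (for example, a model) that the pipeline reads at runtime.
COPY model/ /opt/model/
```

Building on the matching Beam base image keeps the container entrypoint and SDK harness compatible with the Dataflow workers that pull it.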

You might use a custom container for the following reasons:

  • Preinstall pipeline dependencies to reduce worker start time.
  • Preinstall pipeline dependencies that are not available in public repositories.
  • Preinstall pipeline dependencies when access to public repositories is turned off. Access might be turned off for security reasons.
  • Prestage large files to reduce worker start time.
  • Launch third-party software in the background.
  • Customize the execution environment.
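Once the image is built and pushed to a registry that the worker service account can read, you point the job at it with the `--sdk_container_image` pipeline option. A minimal sketch for a Python pipeline, where `PROJECT_ID`, `REGION`, `BUCKET`, `REPO`, and `my_pipeline.py` are placeholders:

```shell
python my_pipeline.py \
  --runner=DataflowRunner \
  --project=PROJECT_ID \
  --region=REGION \
  --temp_location=gs://BUCKET/tmp \
  --sdk_container_image=REGION-docker.pkg.dev/PROJECT_ID/REPO/beam-custom:latest
```

When this job starts, each worker VM pulls the specified image instead of the default prebuilt Apache Beam image.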

For more information about custom containers in Apache Beam, see the Apache Beam custom container guide. For examples of Python pipelines that use custom containers, see Dataflow custom containers.

Next steps