Dataproc uses images to tie together useful Google Cloud Platform connectors and Apache Spark and Apache Hadoop components into one package that can be deployed on a Dataproc cluster. These images contain the base operating system (Debian, Rocky Linux, or Ubuntu) for the cluster, along with the core and optional components needed to run jobs, such as Spark, Hadoop, and Hive. The images are upgraded periodically to include new improvements and features. Dataproc versioning allows you to select sets of software versions when you create clusters.
How versioning works
When an image is created, it is given an Image Version number in the following format:
version_major.version_minor.version_sub_minor-os_distribution
The following OS distributions are currently maintained:
OS Distribution Code | OS Distribution |
---|---|
debian10 | Debian 10 |
debian11 | Debian 11 |
rocky8 | Rocky Linux 8 |
rocky9 | Rocky Linux 9 |
ubuntu18 | Ubuntu 18.04 LTS |
ubuntu20 | Ubuntu 20.04 LTS |
ubuntu22 | Ubuntu 22.04 LTS |
See old image versions for previously supported OS distributions.
The recommended practice is to specify the major.minor image version for production environments or when compatibility with specific component versions is important. The sub-minor and OS distributions will be automatically set to the latest weekly release.
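The version format described in this section can be split into its parts with a small parser. The following is a minimal sketch, not part of any Dataproc tooling: the names `ImageVersion`, `parse_image_version`, and `major_minor` are hypothetical, and the regular expression assumes that OS distribution codes are lowercase letters followed by digits, as in the table above.

```python
import re
from typing import NamedTuple, Optional


class ImageVersion(NamedTuple):
    """Components of a version_major.version_minor.version_sub_minor-os_distribution string."""
    major: int
    minor: int
    sub_minor: Optional[int]        # None when the sub-minor version is omitted
    os_distribution: Optional[str]  # None when the OS suffix is omitted


# Assumption: OS codes look like "debian10" or "ubuntu18" (letters, then digits).
_PATTERN = re.compile(r"(\d+)\.(\d+)(?:\.(\d+))?(?:-([a-z]+\d+))?")


def parse_image_version(text: str) -> ImageVersion:
    """Split an image version string into its numeric parts and OS code."""
    match = _PATTERN.fullmatch(text)
    if match is None:
        raise ValueError(f"not a recognizable image version: {text!r}")
    major, minor, sub_minor, os_code = match.groups()
    return ImageVersion(
        int(major),
        int(minor),
        int(sub_minor) if sub_minor is not None else None,
        os_code,
    )


def major_minor(text: str) -> str:
    """Reduce any image version string to the recommended major.minor form."""
    version = parse_image_version(text)
    return f"{version.major}.{version.minor}"


print(parse_image_version("2.0.20-debian10"))
print(major_minor("2.0.20-debian10"))
```

Reducing a fully qualified version to major.minor is one way to follow the recommendation above while still recording which exact image a cluster was observed to run.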
Selecting versions
When you create a new Dataproc cluster, the latest available Debian image version will be used by default. You can select a Debian, Rocky Linux or Ubuntu image version when creating a cluster (see the Dataproc Image version List).
When specifying Debian-based images, you can omit the OS Distribution Code suffix, for example by specifying 2.0 to select the 2.0-debian10 image.
The OS suffix must be used to select a Rocky Linux or Ubuntu-based image, for example by specifying 2.0-ubuntu18.
gcloud Command
When using the gcloud dataproc clusters create command, you can use the --image-version argument to specify an image version for the new cluster.
Debian image example:
gcloud dataproc clusters create cluster-name \
    --image-version=2.0 \
    --region=region
Ubuntu image example:
gcloud dataproc clusters create cluster-name \
    --image-version=2.0-ubuntu18 \
    --region=region
Best practice is to omit the sub-minor version so that the latest sub-minor version is used. However, if necessary, the sub-minor version can be specified, for example, "2.0.20".
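When cluster creation is scripted, the best practice above can be checked before the command runs. The sketch below is an illustration only: `pins_sub_minor` and `build_create_command` are hypothetical helper names, and the only gcloud arguments used are the ones shown in the examples above (`--image-version` and `--region`).

```python
import shlex
from typing import List, Optional


def pins_sub_minor(image_version: str) -> bool:
    """Return True when the version names a sub-minor release, such as 2.0.20."""
    numeric_part = image_version.split("-", 1)[0]  # drop an OS suffix if present
    return len(numeric_part.split(".")) == 3


def build_create_command(
    cluster_name: str, region: str, image_version: Optional[str] = None
) -> List[str]:
    """Assemble the argument list for `gcloud dataproc clusters create`."""
    command = [
        "gcloud", "dataproc", "clusters", "create", cluster_name,
        f"--region={region}",
    ]
    if image_version is not None:
        command.append(f"--image-version={image_version}")
    return command


version = "2.0-ubuntu18"
if pins_sub_minor(version):
    print("note: sub-minor version is pinned; the latest sub-minor will not be used")
print(shlex.join(build_create_command("cluster-name", "region", version)))
```

Building the command as a list rather than a single string avoids shell quoting problems if it is later passed to `subprocess.run`.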
You can check your current version with the Google Cloud CLI.
gcloud dataproc clusters describe cluster-name \
    --region=region
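If the describe output is requested as JSON (for example with the gcloud `--format=json` flag), the image version can be read from the cluster's software configuration. The following sketch assumes the output mirrors the cluster resource, with the version under `config.softwareConfig.imageVersion`; the function name and the sample document are hypothetical and are not real command output.

```python
import json
from typing import Optional


def image_version_from_describe(describe_json: str) -> Optional[str]:
    """Extract config.softwareConfig.imageVersion from JSON describe output."""
    cluster = json.loads(describe_json)
    software_config = cluster.get("config", {}).get("softwareConfig", {})
    return software_config.get("imageVersion")  # None if the field is absent


# Hypothetical, heavily trimmed sample used only to exercise the function.
sample_output = json.dumps({
    "clusterName": "cluster-name",
    "config": {"softwareConfig": {"imageVersion": "2.0.20-debian10"}},
})

print(image_version_from_describe(sample_output))
```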
REST API
You can specify the SoftwareConfig imageVersion field as part of a cluster.create API request.
Example
POST /v1/projects/project-id/regions/us-central1/clusters/
{
  "projectId": "project-id",
  "clusterName": "example-cluster",
  "config": {
    "configBucket": "",
    "gceClusterConfig": {
      "subnetworkUri": "default",
      "zoneUri": "us-central1-b"
    },
    "masterConfig": { ... },
    "workerConfig": { ... },
    "softwareConfig": {
      "imageVersion": "2.0"
    }
  }
}
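The request body can also be assembled programmatically. The sketch below builds only the fields that matter for version selection; `build_create_request_body` is a hypothetical helper, the field names are taken from the example request, and whether a given project needs additional configuration fields (network, machine types, and so on) is left as an assumption.

```python
import json
from typing import Any, Dict


def build_create_request_body(
    project_id: str, cluster_name: str, image_version: str
) -> Dict[str, Any]:
    """Build a cluster.create body that sets softwareConfig.imageVersion."""
    return {
        "projectId": project_id,
        "clusterName": cluster_name,
        "config": {
            "softwareConfig": {
                # A major.minor value such as "2.0" selects the Debian image.
                "imageVersion": image_version,
            },
        },
    }


body = build_create_request_body("project-id", "example-cluster", "2.0")
print(json.dumps(body, indent=2))
```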
Console
Open the Dataproc Create a cluster page. The Set up cluster panel is selected. The Image Type and Version field in the Versioning section shows the image that will be used when creating the cluster, along with its release date. Initially, the default image, the latest available Debian version, is shown as selected. Click CHANGE to display a list of available images. You can select a standard or custom image to use for your cluster.
When new versions are created
New major versions will be created periodically to incorporate one or more of the following:
- Major releases for:
  - Spark, Hadoop, and other Big Data components
  - Google Cloud connectors
- Major changes or updates to Dataproc functionality
New minor versions will be created periodically to incorporate one or more of the following:
- Minor releases and updates for:
  - Spark, Hadoop, and other Big Data components
  - Google Cloud connectors
- Minor changes or updates to Dataproc functionality
When a new minor version is created, its Debian image becomes the default for the major version, and represents the latest release of the major version.
New sub-minor versions will be created periodically to incorporate one or more of the following:
- Patches or fixes for a component in the image
- Component sub-minor version upgrades
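The three release levels described above can be told apart by comparing the numeric parts of two image versions. This is a minimal sketch with hypothetical helper names (`numeric_parts`, `change_level`); the version strings in the example are illustrative and are not a statement about which releases exist.

```python
from typing import Tuple


def numeric_parts(image_version: str) -> Tuple[int, ...]:
    """Return the numeric components of a version, ignoring any OS suffix."""
    numeric = image_version.split("-", 1)[0]
    return tuple(int(part) for part in numeric.split("."))


def change_level(old: str, new: str) -> str:
    """Name the highest-level component that differs between two versions."""
    levels = ("major", "minor", "sub-minor")
    # zip stops at the shorter version, so "2.0" against "2.0.20" yields "none".
    for level, old_part, new_part in zip(levels, numeric_parts(old), numeric_parts(new)):
        if old_part != new_part:
            return level
    return "none"


print(change_level("2.0.20-debian10", "2.0.27-debian10"))
```

A check like this can help decide how much testing an upgrade deserves: a sub-minor change carries patches and fixes, while a major change may bring new major releases of Spark, Hadoop, or the connectors.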
Image Version and Dataproc support
Minor image versions are supported for 24 months after their initial GA (General Availability) release. During this period, clusters using these image versions are eligible for support. To receive fixes, recreate your cluster using the latest supported sub-minor image version. After the support window has closed, clusters using these image versions are not eligible for support.
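The 24-month window can be turned into a simple date calculation. The sketch below is an approximation under stated assumptions: the GA date is hypothetical, the window is counted in calendar months, a day that does not exist in the target month is clamped to that month's last day, and the end date itself is treated as outside the window. None of these boundary details are specified by the support policy text.

```python
import calendar
from datetime import date

SUPPORT_MONTHS = 24  # support window stated in the policy above


def add_months(start: date, months: int) -> date:
    """Add calendar months, clamping the day to the end of the target month."""
    total = start.month - 1 + months
    year = start.year + total // 12
    month = total % 12 + 1
    day = min(start.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def support_end(ga_release: date) -> date:
    """Date on which the support window closes (assumed exclusive)."""
    return add_months(ga_release, SUPPORT_MONTHS)


def is_supported(ga_release: date, today: date) -> bool:
    """True while `today` falls inside the assumed support window."""
    return ga_release <= today < support_end(ga_release)


hypothetical_ga = date(2021, 1, 22)  # illustrative date, not an actual release date
print(support_end(hypothetical_ga))
```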
Old Image Versions
Previously supported OS distributions
The following OS distributions were previously supported:
OS Distribution Code | OS Distribution | Last Patched (End of support) |
---|---|---|
debian9 | Debian 9 | July 10, 2020 |
deb8 | Debian 8 | October 26, 2018 |
Image Versions without explicit OS distribution
Prior to August 16, 2018, image versions were built with Debian 8, and omitted the OS Distribution Code. They are specified in the following format:
version_major.version_minor.version_sub_minor
0.1 and 0.2
Image versions released as alpha or beta releases prior to Dataproc version 1.0 general availability are not subject to the Dataproc support policy.