Create a Dataproc custom image

Dataproc clusters can be provisioned with a custom image that includes a user's pre-installed packages. This page shows you how to create a custom image and install it on a Dataproc cluster.

Usage considerations and limitations

  • Custom image lifetime: To help ensure that clusters receive the latest service updates and bug fixes, the creation of clusters with a custom image is limited to 365 days from the custom image creation date. Note that existing clusters created with a custom-image can run indefinitely.

    You may need to use automation if you want to create clusters with a specific custom image for a period of greater than 365 days. For more information, see How to create a cluster with an expired custom image.

  • Linux only: The instructions in this document apply to Linux operating systems only. Other operating systems may be supported in future Dataproc releases.

  • Supported base images: Custom image builds require starting from a Dataproc base image. The following base images are supported: Debian, Rocky Linux, and Ubuntu.

    • Base image availability: New images announced in the Dataproc release notes are not available for use as the base for custom images until one week from their announcement date.
  • Using optional components: By default, custom images inherit all of the Dataproc optional components (OS packages and configs) from their base images, You can customize the default OS package versions and configs, but you must specify the optional component name when you create your cluster.

    Example cluster creation command:

    gcloud dataproc clusters create --optional-components=COMPONENT_NAME \
        --image=CUSTOM_IMAGE_URI  \
         ... other flags
    

    If the component name is not specified when you create the cluster, the optional component, including any custom OS packages and configs, will be deleted.

  • Using hosted custom images: If you use a custom image hosted in another project, the Dataproc Service Agent service account in your project must have compute.images.get permission on the image in the host project. You can grant this permission by granting the roles/compute.imageUser role on the hosted image to your project's Dataproc Service Agent service account (see Sharing custom images within an organization).

  • Using secure boot MOK (Machine Owner Key) secrets: To enable secure boot with your Dataproc custom image, do the following:

    1. Enable the the Secret Manager API (secretmanager.googleapis.com. Dataproc generates and manages a key pair using the Secret Manager service.

    2. Add the --service-account="SERVICE_ACCOUNT" flag to the generate_custom_image.py command when you generate a custom image. Note: You must grant to the service account the Secret Manager Viewer role (roles/secretmanager.viewer) on the project and the Secret Manager Accessor role (roles/secretmanager.secretAccessor) on' the public and private secrets.

    For more information with examples, see the README.md and other files within the examples/secure-boot directory of the GoogleCloudDataproc/custom-images repository on GitHub.

    To disable secure boot: By default, the Dataproc custom image scripts generate and manage a key pair using Secret Manager when run from a Dataproc cluster. If you don't want to use secure-boot with your custom image, include the --trusted-cert="" (empty flag value) to the generate_custom_image.py command when you generate your custom image.

Before you begin

Make sure to set up your project before generating your custom image.

Set up your project

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Make sure that billing is enabled for your Google Cloud project.

  4. Enable the Dataproc API, Compute Engine API, and Cloud Storage APIs.

    Enable the APIs

  5. Install the Google Cloud CLI.
  6. To initialize the gcloud CLI, run the following command:

    gcloud init
  7. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  8. Make sure that billing is enabled for your Google Cloud project.

  9. Enable the Dataproc API, Compute Engine API, and Cloud Storage APIs.

    Enable the APIs

  10. Install the Google Cloud CLI.
  11. To initialize the gcloud CLI, run the following command:

    gcloud init
  12. Install Python 3.11+
  13. Prepare a customization script that installs custom packages and/or updates configurations, for example:
      #! /usr/bin/bash
      apt-get -y update
      apt-get install python-dev
      apt-get install python-pip
      pip install numpy
      

Create a Cloud Storage bucket in your project

  1. In the Google Cloud console, go to the Cloud Storage Buckets page.

    Go to Buckets page

  2. Click Create bucket.
  3. On the Create a bucket page, enter your bucket information. To go to the next step, click Continue.
    • For Name your bucket, enter a name that meets the bucket naming requirements.
    • For Choose where to store your data, do the following:
      • Select a Location type option.
      • Select a Location option.
    • For Choose a default storage class for your data, select a storage class.
    • For Choose how to control access to objects, select an Access control option.
    • For Advanced settings (optional), specify an encryption method, a retention policy, or bucket labels.
  4. Click Create.

Generate a custom image

You use generate_custom_image.py, a Python program, to create a Dataproc custom image.

How it works

The generate_custom_image.py program launches a temporary Compute Engine VM instance with the specified Dataproc base image, then runs the customization script inside the VM instance to install custom packages and or update configurations. After the customization script finishes, it shuts down the VM instance and creates a Dataproc custom image from the VM instance disk. The temporary VM is deleted after the custom image is created. The custom image is saved and can be used to create Dataproc clusters.

The generate_custom_image.py program uses gcloud CLI to run multi-step workflows on Compute Engine.

Run the code

Fork or clone the files on GitHub at Dataproc custom images.

Then, run the generate_custom_image.py script to have Dataproc generate and save your custom image.

python3 generate_custom_image.py \
    --image-name=CUSTOM_IMAGE_NAME \
    [--family=CUSTOM_IMAGE_FAMILY_NAME] \
    --dataproc-version=IMAGE_VERSION \
    --customization-script=LOCAL_PATH \
    --zone=ZONE \
    --gcs-bucket=gs://BUCKET_NAME \
    [--no-smoke-test]

Required flags

  • --image-name: the output name for your custom image. Note: the image name must match the regular expression [a-z](?:[-a-z0-9]{0,61}[a-z0-9]) with no underscores or spaces, and less than 64 characters.
  • --dataproc-version: the Dataproc image version to use in your custom image. Specify the version in x.y.z-os or x.y.z-rc-os format, for example, "2.0.69-debian10".
  • --customization-script: a local path to your script that the tool will run to install your custom packages or perform other customizations. This script is run only on the temporary VM used to create the custom image. You can specify a different initialization script for other initialization actions you want to perform when you create a cluster with your custom image.
  • --zone: the Compute Engine zone where generate_custom_image.py will create a temporary VM to use to create your custom image.
  • --gcs-bucket: a URI, in the format gs://BUCKET_NAME, that points to your Cloud Storage bucket. generate_custom_image.py writes log files to this bucket.

Optional flags

  • --family: the image family for the custom image. Image families are used to group similar images together, and can be used when creating a cluster as a pointer to the most recent image in the family. For example, custom-2-2-debian12.
  • --no-smoke-test: This is an optional flag that disables smoke testing the newly built custom image. The smoke test creates a Dataproc test cluster with the newly built image, runs a small job, and then deletes the cluster at the end of the test. The smoke test runs by default to verify that the newly built custom image can create a functional Dataproc cluster. Disabling this step by using the --no-smoke-test flag speeds up the custom image build process, but its use is not recommended.
  • --subnet: The subnetwork to use to create the VM that builds the custom Dataproc image. If your project is part of a shared VPC, you must specify the full subnetwork URL in the following format: projects/HOST_PROJECT_ID/regions/REGION/subnetworks/SUBNET.

For a listing of additional optional flags, see Optional Arguments on GitHub.

If generate_custom_image.py is successful, the imageURI of the custom image is displayed in the terminal window output (the full imageUri is shown in bold below):

...
managedCluster:
    clusterName: verify-image-20180614213641-8308a4cd
    config:
      gceClusterConfig:
        zoneUri: ZONE
      masterConfig:
        imageUri: https://www.googleapis.com/compute/beta/projects/PROJECT_ID/global/images/CUSTOM_IMAGE_NAME
...

INFO:__main__:Successfully built Dataproc custom image: CUSTOM_IMAGE_NAME
INFO:__main__:

#####################################################################
  WARNING: DATAPROC CUSTOM IMAGE 'CUSTOM_IMAGE_NAME'
           WILL EXPIRE ON 2018-07-14 21:35:44.133000.
#####################################################################

Custom image version labels (Advanced usage)

When using Dataproc's standard custom image tool, the tool sets a goog-dataproc-version label on the created custom image. The label reflects the feature capabilities and protocols used by Dataproc to manage the software on the image.

Advanced usage: If you use your own process to create a custom Dataproc image, you must add the goog-dataproc-version label manually to your custom image, as follows:

  1. Extract the goog-dataproc-version label from the base Dataproc image used to create the custom image.

    gcloud compute images describe ${BASE_DATAPROC_IMAGE} \
        --project cloud-dataproc \
        --format="value(labels.goog-dataproc-version)"
    

  2. Set the label on the custom image.

    gcloud compute images add-labels IMAGE_NAME --labels=[KEY=VALUE,...]
    

Use a custom image

You specify the custom image when you create a Dataproc cluster. A custom image is saved in Cloud Compute Images, and is valid to create a Dataproc cluster for 365 days from its creation date (see How to create a cluster with an expired custom image to use a custom image after its 365-day expiration date).

Custom image URI

You pass the imageUri of the custom image to the cluster creation operation. This URI can be specified in one of three ways:

  1. Full URI:
    https://www.googleapis.com/compute/beta/projects/PROJECT_ID/global/images/`gs://`BUCKET_NAME`
  2. Partial URI: projects/PROJECT_ID/global/images/CUSTOM_IMAGE_NAME
  3. Short name: CUSTOM_IMAGE_NAME

Custom images can also be specified by their family URI, which always chooses the most recent image within the image family.

  1. Full URI:
    https://www.googleapis.com/compute/beta/projects/PROJECT_ID/global/images/family/CUSTOM_IMAGE_FAMILY_NAME/var>
  2. Partial URI: projects/PROJECT_ID/global/images/family/CUSTOM_IMAGE_FAMILY_NAME

How to find the custom image URI

Google Cloud CLI

Run the following command to list the names of your custom images.

gcloud compute images list

Pass the name of your custom image to the following command to list the URI (selfLink) of your custom image.

gcloud compute images describe custom-image-name

Output snippet:

...
name: CUSTOM_IMAGE_NAME
selfLink: https://www.googleapis.com/compute/v1/projects/PROJECT_ID/global/images/CUSTOM_IMAGE_NAME
...

Console

  1. Open the Compute Engine→Images page in the Google Cloud console, and then click the image name. You can insert a query in the filter images field to limit the number of displayed images.
  2. The Images details page opens. Click Equivalent REST.
  3. The REST response lists additional information about the image, including the selfLink, which is the image URI.
    {
      ...
      "name": "my-custom-image",
      "selfLink": "projects/PROJECT_ID/global/images/CUSTOM_IMAGE_NAME",
      "sourceDisk": ...,
      ...
    }
    

Create a cluster with a custom image

create a cluster with master and worker nodes that use a custom image using the gcloud CLI, the Dataproc API, or the Google Cloud console.

gcloud CLI

Create a Dataproc cluster with a custom image using the dataproc clusters create command with the --image flag.

Example:
gcloud dataproc clusters create CLUSTER-NAME \
    --image=CUSTOM_IMAGE_URI \
    --region=REGION \
    ... other flags ...

REST API

Create a cluster with a custom image by specifying custom image URI in the InstanceGroupConfig.imageUri field in the masterConfig, workerConfig, and, if applicable, secondaryWorkerConfig object included in a cluster.create API request.

Example: REST request to create a standard Dataproc cluster (one master, two worker nodes) with a custom image.

POST /v1/projects/PROJECT_ID/regions/REGION/clusters/
{
  "clusterName": "CLUSTER_NAME",
  "config": {
    "masterConfig": {
      "imageUri": "projects/PROJECT_ID/global/images/CUSTOM_IMAGE_NAME"
    },
    "workerConfig": {
      "imageUri": "projects/PROJECT_ID/global/images/CUSTOM_IMAGE_NAME"
    }
  }
}
  

Console

  1. Open the Dataproc Create a cluster page. The Set up cluster panel is selected.
  2. In the Versioning section, click Change. Select the Custom Image tab, choose the custom image to use for your Dataproc cluster, then click Select. Cluster's VMs will be provisioned with the selected custom image.

Override Dataproc cluster properties with a custom image

You can use custom images to overwrite cluster properties that are set during cluster creation. If you create a cluster with a custom image, and the cluster creation operation sets properties with values that are different from those that are set by your custom image, the property values set by your custom image will take precedence.

To set cluster properties with your custom image:

  1. In your custom image customization script, create a dataproc.custom.properties file in /etc/google-dataproc, then set cluster property values in the file.

    • Sample dataproc.custom.properties file:
    dataproc.conscrypt.provider.enable=VALUE
    dataproc.logging.stackdriver.enable=VALUE
    
    • Sample customization script file-creation snippet to override two cluster properties:
    cat <<EOF >/etc/google-dataproc/dataproc.custom.properties
    dataproc.conscrypt.provider.enable=true
    dataproc.logging.stackdriver.enable=false
    EOF
    

How to create a cluster with an expired custom image

By default, custom images expire 365 days from the date of creation of the image. You can create a cluster that uses an expired custom image by completing the following steps.

  1. Attempt to create a Dataproc cluster with an expired custom image or a custom image that will expire within 10 days.

    gcloud dataproc clusters create CLUSTER-NAME \
        --image=CUSTOM-IMAGE-NAME \
        --region=REGION \
        ... other flags ...
    
  2. The gcloud CLI will issue an error message that includes the cluster dataproc:dataproc.custom.image.expiration.token property name and token value.

dataproc:dataproc.custom.image.expiration.token=TOKEN_VALUE

Copy the TOKEN_VALUE string to the clipboard.

  1. Use the gcloud CLI to create the Dataproc cluster again, adding the copied TOKEN_VALUE as a cluster property.

    gcloud dataproc clusters create CLUSTER-NAME \
        --image=CUSTOM-IMAGE-NAME \
        --properties=dataproc:dataproc.custom.image.expiration.token=TOKEN_VALUE \
        --region=REGION \
        ... other flags ...
    

Cluster creation with the custom image should succeed.