Create a Dataproc custom image

Dataproc clusters can be provisioned with a custom image that includes a user's pre-installed packages. The following steps explain how to create a custom image and install it on a Dataproc cluster.

Notes:

  • The instructions in this document apply to Linux operating systems only. Other operating systems may be supported in future Dataproc releases.
  • Custom image builds require starting from a Dataproc base image (Debian, Rocky Linux, and Ubuntu base images are supported).
  • Using optional components: By default, custom images inherit all of the Dataproc optional components (OS packages and configs) from their base images. You can customize the default OS package versions and configs, but you must specify the optional component name when you create your cluster (for example, by running the gcloud dataproc clusters create --optional-components=COMPONENT_NAME command; see Adding optional components). If the component name is not specified when you create the cluster, the component (including any custom OS packages and configs) will be deleted.

Before you begin

Set up your project

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

  3. Make sure that billing is enabled for your Google Cloud project.

  4. Enable the Dataproc, Compute Engine, and Cloud Storage APIs.

  5. Install the Google Cloud CLI.
  6. To initialize the gcloud CLI, run the following command:

    gcloud init
  7. Install Python 3.11+.
  8. Prepare a customization script that installs custom packages, updates configurations, or both. For example:
      #!/bin/bash
      # Refresh the package index before installing anything.
      apt-get -y update
      # -y answers prompts automatically; the script runs unattended on the build VM.
      apt-get -y install python3-dev python3-pip
      pip3 install numpy
      

Create a Cloud Storage bucket in your project

  1. In the Google Cloud console, go to the Cloud Storage Buckets page.

  2. Click Create bucket.
  3. On the Create a bucket page, enter your bucket information. To go to the next step, click Continue.
    • For Name your bucket, enter a name that meets the bucket naming requirements.
    • For Choose where to store your data, do the following:
      • Select a Location type option.
      • Select a Location option.
    • For Choose a default storage class for your data, select a storage class.
    • For Choose how to control access to objects, select an Access control option.
    • For Advanced settings (optional), specify an encryption method, a retention policy, or bucket labels.
  4. Click Create.

Generate a custom image

You will use generate_custom_image.py, a Python program, to create a Dataproc custom image.

How it works

The generate_custom_image.py program launches a temporary Compute Engine VM instance with the specified Dataproc base image, then runs the customization script inside the VM instance to install custom packages, update configurations, or both. After the customization script finishes, the program shuts down the VM instance and creates a Dataproc custom image from the disk of the VM instance. The temporary VM is deleted after the custom image is created. The custom image is saved and can be used to create Dataproc clusters.

The generate_custom_image.py program uses the gcloud CLI to run multi-step workflows on Compute Engine.

Run the code

Fork or clone the files on GitHub at Dataproc custom images. Then, run the generate_custom_image.py program to have Dataproc generate and save your custom image.

python3 generate_custom_image.py \
    --image-name=CUSTOM_IMAGE_NAME \
    [--family=CUSTOM_IMAGE_FAMILY_NAME] \
    --dataproc-version=IMAGE_VERSION \
    --customization-script=LOCAL_PATH \
    --zone=ZONE \
    --gcs-bucket=gs://BUCKET_NAME \
    [--no-smoke-test]

Required flags

  • --image-name: the output name for your custom image. Note: the image name must match the regex [a-z](?:[-a-z0-9]{0,61}[a-z0-9]). For example, it cannot contain underscores or spaces, and it must be fewer than 64 characters long.
  • --dataproc-version: the Dataproc image version to use in your custom image. Specify the version in "x.y.z-os" or "x.y.z-rc-os" format, for example, "2.0.69-debian10".
  • --customization-script: a local path to your script that the tool will run to install your custom packages or perform other customizations. Note that this script is only run on the temporary VM used to create the custom image. You can specify a different initialization script for any other initialization actions you want to perform when you create a cluster with your custom image.
  • --zone: the Compute Engine zone where generate_custom_image.py will create a temporary VM to use to create your custom image.
  • --gcs-bucket: a URI, in the format gs://BUCKET_NAME, which points to the Cloud Storage bucket that you created in Create a Cloud Storage bucket in your project. generate_custom_image.py will write log files to this bucket.
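The --image-name rule above can be checked locally before starting a build. The following is a minimal Python sketch, not part of generate_custom_image.py; the function name is_valid_image_name is hypothetical, and the pattern is copied from the flag description. Note that, as written, the pattern needs at least two characters.

```python
import re

# Pattern quoted from the --image-name flag description.
IMAGE_NAME_PATTERN = re.compile(r"[a-z](?:[-a-z0-9]{0,61}[a-z0-9])")


def is_valid_image_name(name: str) -> bool:
    """Return True only if the whole string matches the documented pattern."""
    return IMAGE_NAME_PATTERN.fullmatch(name) is not None


# Example candidates (hypothetical names chosen for illustration).
for candidate in ("my-custom-image", "my_custom_image", "Custom-Image", "image-"):
    print(candidate, is_valid_image_name(candidate))
```

Using fullmatch rather than match matters here: match would accept a valid prefix followed by disallowed characters.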

Optional flags

  • --family: the image family for the image. Image families are used to group similar images together, and can be used when creating a cluster as a pointer to the most recent image in the family. For example, "custom-1-5-debian10".
  • --no-smoke-test: disables smoke testing of the newly built custom image. The smoke test creates a Dataproc test cluster with the newly built image, runs a small job, and then deletes the cluster at the end of the test. It runs by default to verify that the newly built custom image can create a functional Dataproc cluster. Using --no-smoke-test speeds up the custom image build, but is not recommended.
  • --subnet: the subnetwork to use to create the VM that builds the custom Dataproc image. If your project is part of a shared VPC, you must specify the full subnetwork URL in the following format: projects/HOST_PROJECT_ID/regions/REGION/subnetworks/SUBNET.

For a listing of additional optional flags, see Optional Arguments on GitHub.
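If you script your builds, it can help to assemble the command line from named values instead of editing a long shell string. The sketch below only builds and prints the argument list; it does not run the tool. The helper name build_generate_command and the example values (image name, script path, zone, bucket) are hypothetical. Only the flags described above are used.

```python
import shlex


def build_generate_command(image_name, dataproc_version, customization_script,
                           zone, gcs_bucket, family=None, subnet=None,
                           smoke_test=True):
    """Assemble the argv list for generate_custom_image.py without running it."""
    argv = [
        "python3", "generate_custom_image.py",
        f"--image-name={image_name}",
        f"--dataproc-version={dataproc_version}",
        f"--customization-script={customization_script}",
        f"--zone={zone}",
        f"--gcs-bucket={gcs_bucket}",
    ]
    # Optional flags are appended only when a value is supplied.
    if family:
        argv.append(f"--family={family}")
    if subnet:
        argv.append(f"--subnet={subnet}")
    if not smoke_test:
        argv.append("--no-smoke-test")
    return argv


# Hypothetical example values; replace them with your own.
argv = build_generate_command(
    image_name="custom-2-0-debian10",
    dataproc_version="2.0.69-debian10",
    customization_script="./customize.sh",
    zone="us-central1-a",
    gcs_bucket="gs://my-bucket",
)
print(shlex.join(argv))
```

To execute the command, the list could be passed to subprocess.run(argv, check=True) from the directory that contains generate_custom_image.py.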

If generate_custom_image.py is successful, the imageUri of the custom image is listed in the terminal window output (see the imageUri field in the sample output below):

...
managedCluster:
    clusterName: verify-image-20180614213641-8308a4cd
    config:
      gceClusterConfig:
        zoneUri: ZONE
      masterConfig:
        imageUri: https://www.googleapis.com/compute/beta/projects/PROJECT_ID/global/images/CUSTOM_IMAGE_NAME
...

INFO:__main__:Successfully built Dataproc custom image: CUSTOM_IMAGE_NAME
INFO:__main__:

#####################################################################
  WARNING: DATAPROC CUSTOM IMAGE 'CUSTOM_IMAGE_NAME'
           WILL EXPIRE ON 2018-07-14 21:35:44.133000.
#####################################################################

Custom image version labels for advanced users

When using Dataproc's standard custom image tool, the tool automatically sets a required goog-dataproc-version label on the created custom image. The label reflects the feature capabilities and protocols used by Dataproc to manage the software on the image.

Advanced users who use their own process to create a custom Dataproc image must add the label manually to their custom image, as follows:

  1. Extract the goog-dataproc-version label from the base Dataproc image used to create the custom image.

    gcloud compute images describe ${BASE_DATAPROC_IMAGE} \
        --project cloud-dataproc \
        --format="value(labels.goog-dataproc-version)"
    

  2. Set the label on the custom image, using the value extracted in step 1 as LABEL_VALUE.

    gcloud compute images add-labels IMAGE_NAME --labels=goog-dataproc-version=LABEL_VALUE
    

Use a custom image

You specify the custom image when you create a Dataproc cluster. A custom image is saved as a Compute Engine image and can be used to create a Dataproc cluster for 365 days from its creation date. To use a custom image after its 365-day expiration date, see How to create a cluster with an expired custom image.
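To estimate when an image will stop being accepted, you can add 365 days to its creation time. The sketch below assumes the creation time is available as an RFC 3339 string, such as the creationTimestamp field of a Compute Engine image; the helper names and the sample timestamp are hypothetical, and the exact cutoff that Dataproc applies may differ from this simple calculation.

```python
from datetime import datetime, timedelta, timezone

# Validity window stated in this document.
VALIDITY = timedelta(days=365)


def expiration_date(created: datetime) -> datetime:
    """Return the estimated expiration time for an image created at `created`."""
    return created + VALIDITY


def is_expired(created: datetime, now: datetime) -> bool:
    """Return True if `now` is at or past the estimated expiration time."""
    return now >= expiration_date(created)


# Hypothetical creation timestamp in RFC 3339 format.
created = datetime.fromisoformat("2024-01-01T00:00:00.000+00:00")
now = datetime.now(timezone.utc)
print("Estimated expiration:", expiration_date(created).isoformat())
print("Expired:", is_expired(created, now))
```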

Custom image URI

You pass the imageUri of the custom image to the cluster create operation. This URI can be specified in one of three ways:

  1. Full URI:
    https://www.googleapis.com/compute/beta/projects/PROJECT_ID/global/images/CUSTOM_IMAGE_NAME
  2. Partial URI: projects/PROJECT_ID/global/images/CUSTOM_IMAGE_NAME
  3. Short name: CUSTOM_IMAGE_NAME

Custom images can also be specified by their family URI, which always chooses the most recent image within the image family.

  1. Full URI:
    https://www.googleapis.com/compute/beta/projects/PROJECT_ID/global/images/family/CUSTOM_IMAGE_FAMILY_NAME
  2. Partial URI: projects/PROJECT_ID/global/images/family/CUSTOM_IMAGE_FAMILY_NAME
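The URI forms listed above differ only in how much of the path is spelled out. As an illustration, this Python sketch derives each form from a project ID and an image or family name. The function names and example values are hypothetical; the URL prefix is the one shown in the full URI examples in this document.

```python
# Prefix taken from the full URI examples in this document.
API_PREFIX = "https://www.googleapis.com/compute/beta/"


def image_uris(project_id: str, image_name: str) -> dict:
    """Return the full, partial, and short forms for a custom image."""
    partial = f"projects/{project_id}/global/images/{image_name}"
    return {"full": API_PREFIX + partial, "partial": partial, "short": image_name}


def family_uris(project_id: str, family: str) -> dict:
    """Return the full and partial forms for an image family."""
    partial = f"projects/{project_id}/global/images/family/{family}"
    return {"full": API_PREFIX + partial, "partial": partial}


# Hypothetical project, image, and family names.
for form, uri in image_uris("my-project", "my-custom-image").items():
    print(f"{form}: {uri}")
for form, uri in family_uris("my-project", "my-image-family").items():
    print(f"family {form}: {uri}")
```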

How to find the custom image URI

gcloud Command

Run the following gcloud command to list the names of your custom images:

gcloud compute images list

Pass the name of your custom image to the following gcloud command to list the URI (selfLink) of your custom image:

gcloud compute images describe CUSTOM_IMAGE_NAME
...
name: CUSTOM_IMAGE_NAME
selfLink: https://www.googleapis.com/compute/v1/projects/PROJECT_ID/global/images/CUSTOM_IMAGE_NAME
...

Console

  1. Open the Compute Engine→Images page in the Google Cloud console and click the image name. You can insert a query in the filter-images text box to limit the number of displayed images.
  2. The Images details page opens. Click Equivalent REST.
  3. The REST response lists additional information about the image, including the selfLink, which is the image URI.
    {
      ...
      "name": "CUSTOM_IMAGE_NAME",
      "selfLink": "projects/PROJECT_ID/global/images/CUSTOM_IMAGE_NAME",
      "sourceDisk": ...,
      ...
    }
    

Create a cluster with a custom image

You can create a cluster with master and worker nodes that use a custom image with the gcloud command-line tool, the Dataproc API, or the Google Cloud console.

gcloud Command

You can create a Dataproc cluster with a custom image using the dataproc clusters create command with the --image flag. Example:
gcloud dataproc clusters create CLUSTER-NAME \
    --image=CUSTOM_IMAGE_URI \
    --region=REGION \
    ... other flags ...

REST API

You can create a cluster with a custom image by specifying the custom image URI in the InstanceGroupConfig.imageUri field of the masterConfig, workerConfig, and, if applicable, secondaryWorkerConfig objects included in a cluster.create API request.

Example: REST request to create a standard Dataproc cluster (one master, two worker nodes) with a custom image.

POST /v1/projects/PROJECT_ID/regions/REGION/clusters/
{
  "clusterName": "CLUSTER_NAME",
  "config": {
    "masterConfig": {
      "imageUri": "projects/PROJECT_ID/global/images/CUSTOM_IMAGE_NAME"
    },
    "workerConfig": {
      "imageUri": "projects/PROJECT_ID/global/images/CUSTOM_IMAGE_NAME"
    }
  }
}
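If you generate the request body programmatically, a small helper keeps the same image URI in every instance group. The following Python sketch builds the JSON body shown above and only prints it; it does not call the API. The function name cluster_request_body and the example values are hypothetical.

```python
import json


def cluster_request_body(cluster_name: str, image_uri: str,
                         secondary_workers: bool = False) -> dict:
    """Build a clusters.create request body that uses one custom image."""
    config = {
        "masterConfig": {"imageUri": image_uri},
        "workerConfig": {"imageUri": image_uri},
    }
    # Add the secondary worker group only when the cluster uses one.
    if secondary_workers:
        config["secondaryWorkerConfig"] = {"imageUri": image_uri}
    return {"clusterName": cluster_name, "config": config}


# Hypothetical cluster name and image URI.
body = cluster_request_body(
    "my-cluster",
    "projects/my-project/global/images/my-custom-image",
)
print(json.dumps(body, indent=2))
```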
  

Console

  1. In the Google Cloud console, open the Dataproc Create a cluster page. The Set up cluster panel is selected.
  2. In the Versioning section, click CHANGE. Select the CUSTOM IMAGE tab, choose the custom image to use for your Dataproc cluster, then click SELECT.

When you submit the Create a cluster form, your cluster's VMs will be provisioned with the selected custom image.

Override Dataproc cluster properties with a custom image

You can use custom images to overwrite any cluster properties set during cluster creation. If a user creates a cluster with your custom image but sets cluster properties different from those you set with your custom image, your custom image cluster property settings will take precedence.

To set cluster properties with your custom image:

  1. In your custom image customization script, create a dataproc.custom.properties file in /etc/google-dataproc, then set cluster property values in the file.
    • Sample dataproc.custom.properties file contents:
      dataproc.conscrypt.provider.enable=VALUE
      dataproc.logging.stackdriver.enable=VALUE
      

Sample customization script file-creation snippet to override two cluster properties:

     cat <<EOF >/etc/google-dataproc/dataproc.custom.properties
     dataproc.conscrypt.provider.enable=true
     dataproc.logging.stackdriver.enable=false
     EOF

How to create a cluster with an expired custom image

By default, custom images expire 365 days from the date of creation of the image. You can create a cluster that uses an expired custom image by completing the following steps.

  1. Attempt to create a Dataproc cluster with an expired custom image or a custom image that will expire within 10 days.

    gcloud dataproc clusters create CLUSTER-NAME \
        --image=CUSTOM-IMAGE-NAME \
        --region=REGION \
        ... other flags ...
    

  2. The gcloud CLI will issue an error message that includes the dataproc:dataproc.custom.image.expiration.token cluster property name and a token value.

    dataproc:dataproc.custom.image.expiration.token=TOKEN_VALUE
    
    Copy the "token value" string to the clipboard.

  3. Use the gcloud CLI to create the Dataproc cluster again, adding the "token value" copied above as a cluster property.

    gcloud dataproc clusters create CLUSTER-NAME \
        --image=CUSTOM-IMAGE-NAME \
        --properties=dataproc:dataproc.custom.image.expiration.token=TOKEN_VALUE \
        --region=REGION \
        ... other flags ...
    

Cluster creation with the custom image should succeed.
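The copy-and-paste step can also be scripted. The sketch below searches error text for the property name given in step 2 and builds the retry command from step 3 as an argument list. The exact wording of the gcloud error message is not documented here, so the sample error text is hypothetical, as are the function names; the sketch also assumes the token value ends at whitespace, a comma, or a quote.

```python
import re
import shlex

TOKEN_PROPERTY = "dataproc:dataproc.custom.image.expiration.token"
# Assumption: the token value ends at whitespace, a comma, or a quote.
TOKEN_PATTERN = re.compile(re.escape(TOKEN_PROPERTY) + r"=([^\s,'\"]+)")


def extract_token(error_text: str):
    """Return the token value found in the error text, or None if absent."""
    match = TOKEN_PATTERN.search(error_text)
    return match.group(1) if match else None


def retry_command(cluster_name: str, image: str, region: str, token: str) -> list:
    """Build the cluster create command that passes the expiration token."""
    return [
        "gcloud", "dataproc", "clusters", "create", cluster_name,
        f"--image={image}",
        f"--properties={TOKEN_PROPERTY}={token}",
        f"--region={region}",
    ]


# Hypothetical error text; the real message wording may differ.
sample_error = (
    "ERROR: custom image has expired. To proceed, set "
    "dataproc:dataproc.custom.image.expiration.token=EXAMPLE_TOKEN"
)
token = extract_token(sample_error)
if token is not None:
    print(shlex.join(retry_command("my-cluster", "my-custom-image", "us-central1", token)))
```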