Create a Cloud Dataproc Custom Image

Cloud Dataproc clusters can be provisioned with a custom image that includes a user's pre-installed packages. The following steps explain how to create a custom image and install it on a Cloud Dataproc cluster.

Before you begin

Set up your project

  1. Sign in to your Google Account.

    If you don't already have one, sign up for a new account.

  2. Select or create a GCP project.

    Go to the Manage resources page

  3. Make sure that billing is enabled for your project.

    Learn how to enable billing

  4. Enable the Cloud Dataproc API, Compute Engine API, and Cloud Storage APIs.

    Enable the APIs

  5. Install and initialize the Cloud SDK.
  6. Install Python 2.7+
  7. Download Daisy, then change its attributes to make it executable by running:
    chmod +x daisy

Create a Cloud Storage bucket in your project

  1. In the GCP Console, go to the Cloud Storage browser.

    Go to the Cloud Storage browser

  2. Click Create bucket.
  3. In the Create bucket dialog, specify the following attributes:
  4. Click Create.

Generating a custom image

You will use generate_custom_image.py, a Python program that creates a custom Compute Engine image and streams the image file to a Cloud Storage bucket.

How it Works

The generate_custom_image.py Python program uses Daisy, a tool for running multi-step workflows on Compute Engine, to create a custom image. The program launches a temporary Compute Engine VM instance, which it uses to create a Cloud Dataproc custom image that includes your custom packages. The custom image is saved and can be used to create Cloud Dataproc clusters. The temporary VM is deleted after the custom image is created.
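The workflow the tool automates can be outlined as follows. This is an illustrative sketch only: the actual steps are orchestrated by a Daisy workflow, and the gcloud commands mentioned in the comments are approximations of each phase, not what the tool literally runs.

```shell
# Conceptual outline of the custom image build (the real work is done by a
# Daisy workflow, not by these commands):
#   1. Launch a temporary VM from a Cloud Dataproc base image
#      (roughly: gcloud compute instances create ...)
#   2. Run your --customization-script on that VM to install packages
#   3. Save the VM's boot disk as a new custom image
#      (roughly: gcloud compute images create ... --source-disk ...)
#   4. Delete the temporary VM
#      (roughly: gcloud compute instances delete ...)
msg="launch temp VM -> run customization script -> save image -> delete temp VM"
echo "$msg"
```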


Running the code

Fork or clone the files on GitHub at GoogleCloudPlatform/cloud-dataproc/custom-images/. Then, run the generate_custom_image.py program to have Cloud Dataproc generate and save your custom image.

python generate_custom_image.py \
    --image-name custom_image_name \
    --dataproc-version Cloud Dataproc version (example: "1.2.22") \
    --customization-script local path to your custom script \
    --daisy-path local path to the downloaded daisy binary \
    --zone Compute Engine zone to use for the location of the temporary VM \
    --gcs-bucket URI (gs://bucket-name) of a Cloud Storage bucket in your project \
    [--no-smoke-test: "true" or "false" (default: "false")]

Required Flags

  • --image-name: the output name for your custom image (no spaces; underscores are allowed).
  • --dataproc-version: the Cloud Dataproc image version to use in your custom image. Specify the version in "x.y.z" format, for example, "1.2.22".
  • --customization-script: a local path to your script that the tool will run to install your custom packages or perform other customizations. Note that this script is only run on the temporary VM used to create the custom image. You can specify a different initialization script for any other initialization actions you want to perform when you create a cluster with your custom image.
  • --daisy-path: a local path to the daisy binary that you downloaded in Before you begin.
  • --zone: the Compute Engine zone where generate_custom_image.py will create a temporary VM to use to create your custom image.
  • --gcs-bucket: a URI, in the format gs://bucket-name, which points to the Cloud Storage bucket that you created in Create a Cloud Storage bucket in your project. generate_custom_image.py will write log files to this bucket.

Optional Flags

  • --no-smoke-test: default value: "false". This is an optional flag that can be set to "true" to disable smoke testing the newly built custom image. The smoke test creates a Cloud Dataproc test cluster with the newly built image, runs a small job, and then deletes the cluster at the end of the test. The smoke test runs by default to verify that the newly built custom image can create a functional Cloud Dataproc cluster. Disabling this step will speed up the custom image build process, but is not recommended.
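Before launching a build, a few of these flag values can be sanity-checked locally. A minimal sketch, assuming bash; the image name, version, and bucket values below are hypothetical:

```shell
#!/usr/bin/env bash
# Local sanity checks for flag values before running generate_custom_image.py.
image_name="my_custom_image"      # hypothetical --image-name value
dataproc_version="1.2.22"        # hypothetical --dataproc-version value
gcs_bucket="gs://my-bucket"      # hypothetical --gcs-bucket value

# --image-name: no spaces (underscores are allowed)
case "$image_name" in
  *" "*) echo "invalid image name" ;;
  *)     echo "image name ok" ;;
esac

# --dataproc-version must be in "x.y.z" format
if echo "$dataproc_version" | grep -Eq '^[0-9]+\.[0-9]+\.[0-9]+$'; then
  echo "version ok"
fi

# --gcs-bucket must be a gs:// URI
case "$gcs_bucket" in
  gs://*) echo "bucket URI ok" ;;
esac
```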

If the command succeeds, the imageUri of the custom image is listed in the terminal window output (the full imageUri appears in the masterConfig section below):

...
managedCluster:
    clusterName: verify-image-20180614213641-8308a4cd
    config:
      gceClusterConfig:
        zoneUri: zone
      masterConfig:
        imageUri: https://www.googleapis.com/compute/beta/projects/project-id/global/images/custom-image-name
...

INFO:__main__:Successfully built Dataproc custom image: custom-image-name
INFO:__main__:

#####################################################################
  WARNING: DATAPROC CUSTOM IMAGE 'custom-image-name'
           WILL EXPIRE ON 2018-07-14 21:35:44.133000.
#####################################################################

Using a custom image

You specify the custom image when you create a Cloud Dataproc cluster. A custom image is saved as a Compute Engine image, and can be used to create Cloud Dataproc clusters for 30 days from its creation date (see How to create a cluster with an expired custom image to use a custom image after its 30-day expiration date).
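The 30-day window can be computed from the image creation date. A small sketch, assuming GNU date; the sample date below matches the build output shown earlier (created 2018-06-14, expiring 2018-07-14):

```shell
# Compute a custom image's expiration date from its creation date.
# GNU date is assumed; "created" is the example date from the sample output.
created="2018-06-14"
expires=$(date -u -d "$created + 30 days" +%Y-%m-%d)
echo "$expires"
```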

Custom image URI

You pass the imageUri of the custom image to the cluster create operation. This URI can be specified in one of three ways:

  1. Full URI:
    https://www.googleapis.com/compute/beta/projects/project-id/global/images/custom-image-name
  2. Partial URI: projects/project-id/global/images/custom-image-name
  3. Short name: custom-image-name
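The three forms differ only in how much of the path they spell out. As a sketch, each can be built from a project ID and image name (the values below are hypothetical):

```shell
# Build the three equivalent image URI forms from a project ID and image name.
# project_id and image_name are hypothetical placeholder values.
project_id="my-project-id"
image_name="my-custom-image"

full_uri="https://www.googleapis.com/compute/beta/projects/${project_id}/global/images/${image_name}"
partial_uri="projects/${project_id}/global/images/${image_name}"
short_name="${image_name}"

echo "$full_uri"
echo "$partial_uri"
echo "$short_name"
```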

How to find the custom image URI

gcloud Command

Run the following gcloud command to list the names of your custom images:

gcloud compute images list

Pass the name of your custom image to the following gcloud command to list the URI (selfLink) of your custom image:

gcloud compute images describe custom-image-name
...
name: custom-image-name
selfLink: https://www.googleapis.com/compute/v1/projects/project-id/global/images/custom-image-name
...
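If you want only the URI, you can parse it out of the describe output (gcloud's built-in formatting, for example --format='value(selfLink)', prints it directly). A sketch using sed over the sample output shown above:

```shell
# Extract the selfLink line from sample `gcloud compute images describe` output.
# With gcloud available, --format='value(selfLink)' achieves the same result.
describe_output='name: custom-image-name
selfLink: https://www.googleapis.com/compute/v1/projects/project-id/global/images/custom-image-name'

image_uri=$(printf '%s\n' "$describe_output" | sed -n 's/^selfLink: //p')
echo "$image_uri"
```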

Console

  1. Open the Compute Engine→Images page in the GCP Console and click the image name. You can enter a query in the Filter images text box to limit the number of displayed images.
  2. The Images details page opens. Click Equivalent REST.
  3. The REST response lists additional information about the image, including the selfLink, which is the image URI.
    {
      ...
      "name": "my-custom-image",
      "selfLink": "projects/my-project-id/global/images/my-custom-image",
      "sourceDisk": ...,
      ...
    }
    

Creating a cluster with a custom image

You can create a cluster with master and worker nodes that use a custom image with the gcloud command-line tool, the Cloud Dataproc API, or the Google Cloud Platform Console.

gcloud Command

You can create a Cloud Dataproc cluster with a custom image by using the `gcloud dataproc clusters create` command with the `--image` flag. Example:
gcloud dataproc clusters create cluster-name \
  --image=custom-image-URI \
  other flags

REST API

You can create a cluster with a custom image by specifying the custom image URI in the InstanceGroupConfig.imageUri field of the `masterConfig`, `workerConfig`, and, if applicable, `secondaryWorkerConfig` objects included in a clusters.create API request.

Example: REST request to create a standard Cloud Dataproc cluster (one master, two worker nodes) with a custom image.

POST /v1/projects/project-id/regions/global/clusters/
{
  "clusterName": "custom-name",
  "config": {
    "masterConfig": {
      "imageUri": "projects/project-id/global/images/custom-image-name"
    },
    "workerConfig": {
      "imageUri": "projects/project-id/global/images/custom-image-name"
    }
  }
}
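As a quick local check, the request body can be composed in a shell variable and validated before sending it. A minimal sketch; the image_uri value below is hypothetical, and python3 is used only as a JSON validator:

```shell
# Compose the clusters.create request body and verify it is well-formed JSON.
# image_uri is a hypothetical placeholder value.
image_uri="projects/my-project-id/global/images/my-custom-image"
body=$(cat <<EOF
{
  "clusterName": "custom-name",
  "config": {
    "masterConfig": { "imageUri": "${image_uri}" },
    "workerConfig": { "imageUri": "${image_uri}" }
  }
}
EOF
)
echo "$body" | python3 -m json.tool > /dev/null && echo "valid JSON"
```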
  

Console

You can create a cluster that uses a custom image from the Cloud Dataproc [Create a cluster](https://console.cloud.google.com/dataproc/clustersAdd) page of the GCP Console.
  1. Open the Advanced options expander at the bottom of the form.
  2. In the Image section, click Change.
  3. Select the Custom Image tab, select the custom image to use for your Cloud Dataproc cluster, and then click Select.
When you submit the Create a cluster form, your cluster's VMs will be provisioned with the selected custom image.

How to create a cluster with an expired custom image

By default, custom images expire 30 days from the date of creation of the image. You can create a cluster that uses an expired custom image by completing the following steps.

  1. Attempt to create a Cloud Dataproc cluster with an expired custom image or a custom image that will expire within 10 days.

    gcloud dataproc clusters create cluster-name \
      --image=custom-image-name \
      other flags
    

  2. The gcloud tool will issue an error message that includes the dataproc:dataproc.custom.image.expiration.token cluster property name and token value.

    dataproc:dataproc.custom.image.expiration.token=token value
    
    Copy the "token value" string to the clipboard.

  3. Use the gcloud tool to create the Cloud Dataproc cluster again, adding the "token value" copied above as a cluster property.

    gcloud dataproc clusters create cluster-name \
      --image=custom-image-name \
      --properties dataproc:dataproc.custom.image.expiration.token=token value \
      other flags
    

Cluster creation with the custom image should succeed.
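The token in step 2's error message can also be pulled out programmatically. A minimal sketch with sed; the error text and token value below are hypothetical placeholders:

```shell
# Extract the expiration token from a sample gcloud error message.
# The error text and token value here are hypothetical placeholders.
error_msg='ERROR: ... set dataproc:dataproc.custom.image.expiration.token=abc123token to override ...'
token=$(printf '%s\n' "$error_msg" | \
  sed -n 's/.*dataproc:dataproc.custom.image.expiration.token=\([^ ]*\).*/\1/p')
echo "$token"
```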
