Create a Dataproc custom image

Dataproc clusters can be provisioned with a custom image that includes a user's pre-installed packages. The following steps explain how to create a custom image and install it on a Dataproc cluster.


  • The instructions in this document apply to Linux operating systems only. Other operating systems may be supported in future Dataproc releases.
  • Custom image builds require starting from a Dataproc base image (CentOS, Debian, and Ubuntu base images are supported).
  • Using optional components: By default, custom images inherit all of the Dataproc optional components (OS packages and configs) from their base images. You can customize the default OS package versions and configs, but you must specify the optional component name when you create your cluster (for example, by running the gcloud dataproc clusters create --optional-components=component-name command; see Adding optional components). If the component name is not specified when you create the cluster, the component (including any custom OS packages and configs) will be deleted.

Before you begin

Set up your project

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud Console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Make sure that billing is enabled for your Cloud project. Learn how to confirm that billing is enabled for your project.

  4. Enable the Dataproc API, Compute Engine API, and Cloud Storage APIs.

    Enable the APIs

  5. Install and initialize the Cloud SDK.
  6. Install Python 2.7 or later.
  7. Prepare a customization script that installs custom packages and/or updates configurations, for example:
      #!/bin/bash
      apt-get -y update
      apt-get install -y python-dev
      apt-get install -y python-pip
      pip install numpy

Create a Cloud Storage bucket in your project

  1. In the Cloud Console, go to the Cloud Storage Browser page.

    Go to Browser

  2. Click Create bucket.
  3. On the Create a bucket page, enter your bucket information. To go to the next step, click Continue.
    • For Name your bucket, enter a name that meets the bucket naming requirements.
    • For Choose where to store your data, do the following:
      • Select a Location type option.
      • Select a Location option.
    • For Choose a default storage class for your data, select a storage class.
    • For Choose how to control access to objects, select an Access control option.
    • For Advanced settings (optional), specify an encryption method, a retention policy, or bucket labels.
  4. Click Create.
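If you prefer the command line, the same bucket can be created with gsutil; this is a sketch, and the bucket name and location below are placeholders:

```shell
# Create a bucket for the image build logs (placeholder name and location).
gsutil mb -l us-central1 gs://your-image-build-bucket
```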

Generating a custom image

You will use generate_custom_image.py, a Python program, to create a Dataproc custom image.

How it works

The program launches a temporary Compute Engine VM instance with the specified Dataproc base image, then runs the customization script inside the VM instance to install custom packages and/or update configurations. After the customization script finishes, it shuts down the VM instance and creates a Dataproc custom image from the disk of the VM instance. The temporary VM is deleted after the custom image is created. The custom image is saved and can be used to create Dataproc clusters.
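Conceptually, the automation performs roughly the following steps. This sketch is an illustrative outline using standard gcloud compute commands with placeholder resource names, not the tool's actual implementation; a default zone is assumed where the flag is omitted:

```shell
# Rough outline of the build flow (placeholder names; not the real tool).
gcloud compute instances create temp-build-vm \
    --image=dataproc-base-image --image-project=cloud-dataproc \
    --zone=us-central1-a                                  # temp VM from base image
gcloud compute scp customize.sh temp-build-vm:            # copy customization script
gcloud compute ssh temp-build-vm --command='sudo bash customize.sh'  # run it
gcloud compute instances stop temp-build-vm               # shut the VM down cleanly
gcloud compute images create my-custom-image \
    --source-disk=temp-build-vm --source-disk-zone=us-central1-a  # disk -> image
gcloud compute instances delete temp-build-vm --quiet     # remove the temp VM
```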

The program uses gcloud (the default) to run multi-step workflows on Compute Engine.

Running the code

Fork or clone the files on GitHub at Dataproc custom images. Then, run the program to have Dataproc generate and save your custom image.

python generate_custom_image.py \
    --image-name custom_image_name \
    [--family custom_image_family_name] \
    --dataproc-version Dataproc version (example: "1.5.10-debian10") \
    --customization-script local path to your custom script \
    --zone Compute Engine zone to use for the location of the temporary VM \
    --gcs-bucket URI (gs://bucket-name) of a Cloud Storage bucket in your project \

Required flags

  • --image-name: the output name for your custom image. Note: the image name must match regex [a-z](?:[-a-z0-9]{0,61}[a-z0-9]) — for example, no underscores or spaces, less than 64 characters.
  • --dataproc-version: the Dataproc image version to use in your custom image. Specify the version in "x.y.z-os" or "x.y.z-rc-os" format, for example, "1.5.35-debian10".
  • --customization-script: a local path to your script that the tool will run to install your custom packages or perform other customizations. Note that this script is only run on the temporary VM used to create the custom image. You can specify a different initialization script for any other initialization actions you want to perform when you create a cluster with your custom image.
  • --zone: the Compute Engine zone where the tool will create a temporary VM to use to create your custom image.
  • --gcs-bucket: a URI, in the format gs://bucket-name, which points to the Cloud Storage bucket that you created in Create a Cloud Storage bucket in your project. The tool will write log files to this bucket.
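Because an invalid --image-name only fails after the build has started, it can help to check the name locally first. A minimal sketch (the `valid_image_name` helper is hypothetical; the pattern mirrors the naming rule quoted above):

```shell
#!/bin/bash
# Validate a proposed custom image name: lowercase letter first, then
# letters/digits/hyphens, ending in a letter or digit, under 64 characters.
valid_image_name() {
  [[ "$1" =~ ^[a-z]([-a-z0-9]{0,61}[a-z0-9])?$ ]]
}

valid_image_name "my-custom-image" && echo "ok"        # passes
valid_image_name "My_Custom_Image" || echo "rejected"  # uppercase/underscores fail
```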

Optional flags

  • --family: the image family for the image. Image families are used to group similar images together, and can be used when creating a cluster as a pointer to the most recent image in the family. For example, "custom-1-5-debian10".
  • --no-smoke-test: This is an optional flag that disables smoke testing the newly built custom image. The smoke test creates a Dataproc test cluster with the newly built image, runs a small job, and then deletes the cluster at the end of the test. The smoke test runs by default to verify that the newly built custom image can create a functional Dataproc cluster. Disabling this step by using the --no-smoke-test flag will speed up the custom image build process, but its use is not recommended.

For a listing of additional optional flags, see Optional Arguments on GitHub.

If the command is successful, the imageUri of the custom image will be listed in the terminal window output (the full imageUri is shown in the sample output below):

    clusterName: verify-image-20180614213641-8308a4cd
        zoneUri: zone

INFO:__main__:Successfully built Dataproc custom image: custom-image-name

  WARNING: DATAPROC CUSTOM IMAGE 'custom-image-name'
           WILL EXPIRE ON 2018-07-14 21:35:44.133000.

Custom image version labels for advanced users

When using Dataproc's standard custom image tool, the tool automatically sets a required goog-dataproc-version label on the created custom image. The label reflects the feature capabilities and protocols used by Dataproc to manage the software on the image.

Advanced users who use their own process to create a custom Dataproc image must add the label manually to their custom image, as follows:

  1. Extract the goog-dataproc-version label from the base Dataproc image used to create the custom image.

    gcloud compute images describe ${BASE_DATAPROC_IMAGE} \
        --project cloud-dataproc \
        --format="value(labels.goog-dataproc-version)"

  2. Set the label on the custom image.

    gcloud compute images add-labels IMAGE_NAME --labels=[KEY=VALUE,...]
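Putting the two steps together, the label value can be captured with a --format projection and applied in one pass. A sketch; the image names are placeholders:

```shell
# Read the goog-dataproc-version label off the base image, then copy it
# onto the custom image (placeholder image names).
DATAPROC_VERSION=$(gcloud compute images describe "${BASE_DATAPROC_IMAGE}" \
    --project cloud-dataproc \
    --format="value(labels.goog-dataproc-version)")
gcloud compute images add-labels my-custom-image \
    --labels="goog-dataproc-version=${DATAPROC_VERSION}"
```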

Using a custom image

You specify the custom image when you create a Dataproc cluster. A custom image is saved in Compute Engine Images, and is valid for creating a Dataproc cluster for 60 days from its creation date (see How to create a cluster with an expired custom image to use a custom image after its 60-day expiration date).

Custom image URI

You pass the imageUri of the custom image to the cluster create operation. This URI can be specified in one of three ways:

  1. Full URI: https://www.googleapis.com/compute/v1/projects/project-id/global/images/custom-image-name
  2. Partial URI: projects/project-id/global/images/custom-image-name
  3. Short name: custom-image-name

Custom images can also be specified by their family URI, which always chooses the most recent image within the image family.

  1. Full URI: https://www.googleapis.com/compute/v1/projects/project-id/global/images/family/custom-image-family-name
  2. Partial URI: projects/project-id/global/images/family/custom-image-family-name
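All of these forms resolve to the same image. A small helper that expands a short name into the partial URI might look like the following (a hypothetical convenience function, not part of the tool):

```shell
#!/bin/bash
# Expand a short custom image name into its partial image URI
# (hypothetical helper; project ID and image name are arguments).
image_partial_uri() {
  local project_id="$1" image_name="$2"
  echo "projects/${project_id}/global/images/${image_name}"
}

image_partial_uri my-project-id my-custom-image
# -> projects/my-project-id/global/images/my-custom-image
```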

How to find the custom image URI

gcloud Command

Run the following gcloud command to list the names of your custom images:

gcloud compute images list

Pass the name of your custom image to the following gcloud command to list the URI (selfLink) of your custom image:

gcloud compute images describe custom-image-name

name: custom-image-name
selfLink: ...
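To print only the URI, the same describe call can be filtered with a standard gcloud --format projection:

```shell
# Print only the selfLink (the image URI) for a custom image.
gcloud compute images describe custom-image-name \
    --format="value(selfLink)"
```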


Console

  1. Open the Compute Engine→Images page in the Cloud Console and click the image name. You can insert a query in the filter-images text box to limit the number of displayed images.
  2. The Images details page opens. Click Equivalent REST.
  3. The REST response lists additional information about the image, including the selfLink, which is the image URI.
      "name": "my-custom-image",
      "selfLink": "projects/my-project-id/global/images/my-custom-image",
      "sourceDisk": ...,

Creating a cluster with a custom image

You can create a cluster with master and worker nodes that use a custom image with the gcloud command-line tool, the Dataproc API, or the Google Cloud Console.

gcloud Command

You can create a Dataproc cluster with a custom image using the dataproc clusters create command with the --image flag. Example:
gcloud dataproc clusters create cluster-name \
    --image=custom-image-URI \
    --region=region \
    ... other flags ...


REST API

You can create a cluster with a custom image by specifying the custom image URI in the InstanceGroupConfig.imageUri field in the masterConfig, workerConfig, and, if applicable, secondaryWorkerConfig objects included in a cluster.create API request.

Example: REST request to create a standard Dataproc cluster (one master, two worker nodes) with a custom image.

POST /v1/projects/project-id/regions/global/clusters/

{
  "clusterName": "custom-name",
  "config": {
    "masterConfig": {
      "imageUri": "projects/project-id/global/images/custom-image-name"
    },
    "workerConfig": {
      "imageUri": "projects/project-id/global/images/custom-image-name"
    }
  }
}
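The request body can be assembled and sanity-checked locally before sending it. This sketch only builds and validates the JSON with placeholder IDs; it does not call the API:

```shell
#!/bin/bash
# Build the cluster-create request body with custom image URIs and verify
# it is well-formed JSON (placeholder IDs; no API call is made).
IMAGE_URI="projects/my-project-id/global/images/my-custom-image"
BODY=$(cat <<EOF
{
  "clusterName": "my-cluster",
  "config": {
    "masterConfig": {"imageUri": "${IMAGE_URI}"},
    "workerConfig": {"imageUri": "${IMAGE_URI}"}
  }
}
EOF
)
echo "${BODY}" | python3 -m json.tool > /dev/null && echo "request body is valid JSON"
```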


Console

  1. In the Cloud Console, open the Dataproc Create a cluster page. The Set up cluster panel is selected.
  2. In the Versioning section, click CHANGE. Select the CUSTOM IMAGE tab, choose the custom image to use for your Dataproc cluster, then click SELECT.

When you submit the Create a cluster form, your cluster's VMs will be provisioned with the selected custom image.

Overriding Dataproc cluster properties with a custom image

You can use custom images to override cluster properties set during cluster creation. If a user creates a cluster with your custom image and sets cluster properties different from those set by your custom image, your custom image property settings take precedence.

To set cluster properties with your custom image:

  1. In your custom image customization script, create the file /etc/google-dataproc/dataproc.properties, then set cluster property values in the file.

Sample customization script snippet that creates the file and overrides two cluster properties:

     cat <<EOF >/etc/google-dataproc/dataproc.properties
     dataproc.conscrypt.provider.enable=true
     capacity-scheduler:yarn.scheduler.capacity.root.queues=queue1,queue2
     EOF
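The file-creation step can be rehearsed locally before baking it into an image. This sketch writes the properties file into a scratch directory and reads a value back; the property name is illustrative:

```shell
#!/bin/bash
# Locally simulate the customization-script step: write a properties file
# into a scratch directory and read one value back (illustrative property).
DIR=$(mktemp -d)
cat <<EOF > "${DIR}/dataproc.properties"
dataproc.conscrypt.provider.enable=true
EOF
grep "^dataproc.conscrypt.provider.enable=true" "${DIR}/dataproc.properties"
```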

How to create a cluster with an expired custom image

By default, custom images expire 60 days from the date of creation of the image. You can create a cluster that uses an expired custom image by completing the following steps.

  1. Attempt to create a Dataproc cluster with an expired custom image or a custom image that will expire within 10 days.

    gcloud dataproc clusters create cluster-name \
        --image=custom-image-name \
        --region=region \
        ... other flags ...

  2. The gcloud tool will issue an error message that includes the dataproc:dataproc.custom.image.expiration.token cluster property name and token value.

    dataproc:dataproc.custom.image.expiration.token=token value

    Copy the "token value" string to the clipboard.

  3. Use the gcloud tool to create the Dataproc cluster again, adding the "token value" copied above as a cluster property.

    gcloud dataproc clusters create cluster-name \
        --image=custom-image-name \
        --properties="dataproc:dataproc.custom.image.expiration.token=token value" \
        --region=region \
        ... other flags ...

Cluster creation with the custom image should succeed.
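The retry step can be scripted, since the token only needs to be pulled out of the error text. This sketch parses a saved error message with standard shell tools; the error text and token value shown are illustrative, not the real message format:

```shell
#!/bin/bash
# Extract the expiration token from a captured cluster-create error message
# (illustrative error text; the real message comes from the failed gcloud call).
ERROR_MSG='... set dataproc:dataproc.custom.image.expiration.token=1.2345abcdef to override ...'
TOKEN=$(echo "${ERROR_MSG}" | sed -n 's/.*expiration\.token=\([^ ]*\).*/\1/p')
echo "${TOKEN}"   # the value to pass via --properties on the retry
```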