Create a Cloud Dataproc Custom Image

Cloud Dataproc clusters can be provisioned with a custom image that includes a user's pre-installed packages. The following steps explain how to create a custom image and install it on a Cloud Dataproc cluster.

Before you begin

Set up your project

  1. Sign in to your Google account.

    If you don't already have one, sign up for a new account.

  2. Select or create a Cloud Platform project.

    Go to the Manage resources page

  3. Enable billing for your project.

  4. Enable the Cloud Dataproc, Compute Engine, and Cloud Storage APIs.

  5. Install and initialize the Cloud SDK.
  6. If you are using a pre-existing gcloud command-line tool installation and haven't updated it recently, run the following command to update the gcloud tool to include the latest beta components:
    gcloud components install beta
  7. Install Python 2.7+
  8. Download Daisy, then change its attributes to make it executable by running:
    chmod +x daisy
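
The last two prerequisites can be checked before you start a build. The following is a minimal sketch and not part of the documented tooling: the function names and the ./daisy default path are hypothetical, chosen only for illustration.

```python
import os
import sys


def python_is_new_enough(version_info=sys.version_info, minimum=(2, 7)):
    """Return True if the interpreter version is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)


def daisy_is_executable(daisy_path):
    """Return True if `daisy_path` is a file the current user may execute."""
    return os.path.isfile(daisy_path) and os.access(daisy_path, os.X_OK)


if __name__ == "__main__":
    # "./daisy" is an assumed location; pass the real path as the first argument.
    path = sys.argv[1] if len(sys.argv) > 1 else "./daisy"
    print("Python 2.7+: %s" % python_is_new_enough())
    print("daisy executable: %s" % daisy_is_executable(path))
```

If the second check prints False for a file that exists, the chmod +x step has probably not been run on it yet.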

Create a Cloud Storage bucket in your project

  1. In the Cloud Platform Console, go to the Cloud Storage browser.

  2. Click Create bucket.
  3. In the Create bucket dialog, specify the bucket's attributes, including a globally unique bucket name.
  4. Click Create.

Generating a custom image

You will use a Python program to create and stream a Compute Engine image file to a Cloud Storage bucket.

How it Works

The Python program uses Daisy, a tool for running multi-step workflows on Compute Engine, to create a custom image. The program launches a temporary Compute Engine VM instance, which it uses to create a Cloud Dataproc custom image that includes your custom packages. The custom image is saved and can be used to create Cloud Dataproc clusters. The temporary VM is deleted after the custom image is created.
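
The order of operations described above can be pictured with a toy model. This sketch makes no Compute Engine or Daisy calls and does not reflect how the actual program is implemented; every name in it is hypothetical. Deleting the VM in a finally clause is a design choice of the sketch, not documented behavior.

```python
import contextlib


@contextlib.contextmanager
def temporary_vm(name, log):
    """Toy stand-in for a temporary VM: records creation and deletion."""
    log.append("create VM %s" % name)
    try:
        yield name
    finally:
        # Runs even if a build step raises, so the toy VM is always removed.
        log.append("delete VM %s" % name)


def build_custom_image(image_name, customization_steps):
    """Simulate a build and return the ordered list of actions taken."""
    log = []
    with temporary_vm(image_name + "-builder", log) as vm:
        for step in customization_steps:
            log.append("run '%s' on %s" % (step, vm))
        log.append("save image %s" % image_name)
    return log


if __name__ == "__main__":
    for action in build_custom_image("my-image", ["install packages"]):
        print(action)
```

The printed sequence mirrors the paragraph: the VM is created, the customizations run on it, the image is saved, and the VM is deleted last.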


Running the code

Fork or clone the files on GitHub at GoogleCloudPlatform/cloud-dataproc/custom-images/. Then, run the program to have Cloud Dataproc generate and save your custom image.

python path to the Python program in your local copy of the repository \
    --image-name custom_image_name \
    --dataproc-version Cloud Dataproc version (example: "1.2.22") \
    --customization-script local path to your custom script \
    --daisy-path local path to the downloaded daisy binary \
    --zone Compute Engine zone to use for the location of the temporary VM \
    --gcs-bucket URI (gs://bucket-name) of a Cloud Storage bucket in your project \
    [--no-smoke-test "true" or "false" (default: "false")]

Required Flags

  • --image-name: the output name for your custom image (no spaces; underscores are allowed).
  • --dataproc-version: the Cloud Dataproc image version to use in your custom image. Specify the version in "x.y.z" format, for example, "1.2.22".
  • --customization-script: a local path to your script that the tool will run to install your custom packages or perform other customizations. Note that this script is only run on the temporary VM used to create the custom image. You can specify a different initialization script for any other initialization actions you want to perform when you create a cluster with your custom image.
  • --daisy-path: a local path to the daisy binary that you downloaded in Before you begin.
  • --zone: the Compute Engine zone where the program will create a temporary VM to use to create your custom image.
  • --gcs-bucket: a URI, in the format gs://bucket-name, which points to the Cloud Storage bucket that you created in Create a Cloud Storage bucket in your project. The program will write log files to this bucket.
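
Checking the flag values before launching a build can catch formatting mistakes early. The sketch below is a hedged illustration that uses only the flag names and formats listed above. The function names, the example values, and the strictness of each check (for instance, rejecting any path after the bucket name) are assumptions and may not match what the program itself accepts.

```python
import re


def validate_flags(image_name, dataproc_version, gcs_bucket):
    """Return a list of problems with the flag values (empty when none found)."""
    problems = []
    # The image name may not contain spaces; any whitespace is rejected here.
    if not image_name or re.search(r"\s", image_name):
        problems.append("--image-name must be non-empty and contain no spaces")
    # "x.y.z" format, for example "1.2.22".
    if not re.match(r"\d+\.\d+\.\d+\Z", dataproc_version):
        problems.append("--dataproc-version must use x.y.z format")
    # gs://bucket-name with nothing after the bucket (assumed strictness).
    if not re.match(r"gs://[^/\s]+\Z", gcs_bucket):
        problems.append("--gcs-bucket must look like gs://bucket-name")
    return problems


def build_arguments(image_name, dataproc_version, customization_script,
                    daisy_path, zone, gcs_bucket):
    """Return the required flags as an argument list, or raise ValueError."""
    problems = validate_flags(image_name, dataproc_version, gcs_bucket)
    if problems:
        raise ValueError("; ".join(problems))
    return [
        "--image-name", image_name,
        "--dataproc-version", dataproc_version,
        "--customization-script", customization_script,
        "--daisy-path", daisy_path,
        "--zone", zone,
        "--gcs-bucket", gcs_bucket,
    ]


if __name__ == "__main__":
    # All values below are placeholders, not defaults of the real program.
    print(" ".join(build_arguments(
        "my-custom-image", "1.2.22", "./install_packages.sh",
        "./daisy", "us-central1-a", "gs://my-bucket")))
```

Returning an argument list rather than a single string makes it straightforward to hand the flags to subprocess without shell quoting concerns.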

Optional Flags

  • --no-smoke-test: set this flag to "true" to disable smoke testing of the newly built custom image (default: "false"). The smoke test creates a Cloud Dataproc test cluster with the newly built image, runs a small job, and then deletes the cluster at the end of the test. The smoke test runs by default to verify that the newly built custom image can create a functional Cloud Dataproc cluster. Disabling this step speeds up the custom image build process, but is not recommended.

Using a custom image

Your custom image is saved by Cloud Dataproc and can be used to create Cloud Dataproc clusters.

Important Note: The custom image is saved in Cloud Compute Images and can be used to create Cloud Dataproc clusters for 30 days. After the 30-day period, you must re-create the custom image to use it again.
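
To plan rebuilds around the 30-day limit, the remaining time can be estimated from the creation date. This is a rough sketch that counts whole calendar days; how the service measures the period exactly (for example, to the second from image creation) is not stated here, so treat the result as an approximation. The function names are hypothetical.

```python
from datetime import date, timedelta

VALIDITY_DAYS = 30  # validity period stated in the note above


def expiry_date(created_on):
    """Return the calendar date 30 days after the image was created."""
    return created_on + timedelta(days=VALIDITY_DAYS)


def days_remaining(created_on, today):
    """Return whole days left before a rebuild is needed (never negative)."""
    return max((expiry_date(created_on) - today).days, 0)


if __name__ == "__main__":
    created = date(2018, 1, 15)  # placeholder creation date
    print(expiry_date(created))                       # 2018-02-14
    print(days_remaining(created, date(2018, 2, 1)))  # 13
```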

To create a Cloud Dataproc cluster with your custom image, you can run the following gcloud command.

gcloud beta dataproc clusters create cluster-name \
  --image=custom-image-URI \
  other flags
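
When cluster creation is scripted, the command can be assembled as an argument list. The sketch below is illustrative only: the function name is hypothetical, the --image flag is an assumption about the gcloud tool that you should confirm with gcloud beta dataproc clusters create --help, and the projects/.../global/images/... shape of the example URI is likewise assumed.

```python
def cluster_create_command(cluster_name, image_uri, other_flags=()):
    """Return the gcloud invocation as an argument list."""
    if not cluster_name or not image_uri:
        raise ValueError("cluster_name and image_uri are both required")
    command = ["gcloud", "beta", "dataproc", "clusters", "create", cluster_name]
    command.append("--image=" + image_uri)  # assumed flag, see the text above
    command.extend(other_flags)
    return command


if __name__ == "__main__":
    # Placeholder values; the URI shape is an assumption.
    uri = "projects/my-project/global/images/my-custom-image"
    print(" ".join(cluster_create_command("my-cluster", uri)))
```

The returned list can be passed directly to subprocess.check_call, which avoids building a shell string by hand.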
