Cloud Dataproc clusters can be provisioned with a custom image that includes a user's pre-installed packages. The following steps explain how to create a custom image and use it to provision a Cloud Dataproc cluster.
Before you begin
Set up your project
Sign in to your Google Account.
If you don't already have one, sign up for a new account.
Select or create a GCP project.
Make sure that billing is enabled for your project.
- Enable the Cloud Dataproc, Compute Engine, and Cloud Storage APIs.
- Install and initialize the Cloud SDK.
- If you are using a pre-existing gcloud command-line tool installation and haven't updated it recently,
run the following command to update the gcloud tool to include the latest beta components:
gcloud components install beta
- Install Python 2.7+
- Download Daisy, then make it executable by running:
chmod +x daisy
Create a Cloud Storage bucket in your project
- In the GCP Console, go to the Cloud Storage browser.
- Click Create bucket.
- In the Create bucket dialog, specify the following attributes:
- Click Create.
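If you prefer the command line, the same bucket can be created with gsutil. The bucket name and location below are hypothetical placeholders, not values from this guide; the command is assembled and echoed rather than executed so the sketch is side-effect free until you run it yourself:

```shell
# Hypothetical bucket name and location; substitute your own.
BUCKET=my-dataproc-custom-images
LOCATION=us-central1
# Assemble the gsutil mb (make bucket) command and print it for review.
MB_CMD="gsutil mb -l ${LOCATION} gs://${BUCKET}"
echo "${MB_CMD}"
```

Run the printed command yourself once the name and location are correct for your project.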
Generating a custom image
You will use generate_custom_image.py, a Python program that creates a Compute Engine image file and streams it to a Cloud Storage bucket.
How it Works
The generate_custom_image.py Python program uses Daisy, a tool for running multi-step workflows on Compute Engine, to create a custom image. The program launches a temporary Compute Engine VM instance, which it uses to create a Cloud Dataproc custom image that includes your custom packages. The custom image is saved and can be used to create Cloud Dataproc clusters. The temporary VM is deleted after the custom image is created.
Running the code
Fork or clone the files on GitHub at GoogleCloudPlatform/cloud-dataproc/custom-images/.
Then, run the generate_custom_image.py program to have Cloud Dataproc generate and save your custom image.
python generate_custom_image.py \
    --image-name custom_image_name \
    --dataproc-version Cloud Dataproc version (example: "1.2.22") \
    --customization-script local path to your custom script \
    --daisy-path local path to the downloaded daisy binary \
    --zone Compute Engine zone to use for the location of the temporary VM \
    --gcs-bucket URI (gs://bucket-name) of a Cloud Storage bucket in your project \
    [--no-smoke-test: "true" or "false" (default: "false")]
--image-name: the output name for your custom image (no spaces; underscores are allowed).
--dataproc-version: the Cloud Dataproc image version to use in your custom image. Specify the version in "x.y.z" format, for example, "1.2.22".
--customization-script: a local path to your script that the tool will run to install your custom packages or perform other customizations. Note that this script is only run on the temporary VM used to create the custom image. You can specify a different initialization script for any other initialization actions you want to perform when you create a cluster with your custom image.
--daisy-path: a local path to the daisy binary that you downloaded in Before you begin.
--zone: the Compute Engine zone where generate_custom_image.py will create a temporary VM to use to create your custom image.
--gcs-bucket: a URI, in the format gs://bucket-name, that points to the Cloud Storage bucket that you created in Create a Cloud Storage bucket in your project. generate_custom_image.py will write log files to this bucket.
--no-smoke-test: default value: "false". This is an optional flag that can be set to "true" to disable smoke testing the newly built custom image. The smoke test creates a Cloud Dataproc test cluster with the newly built image, runs a small job, and then deletes the cluster at the end of the test. The smoke test runs by default to verify that the newly built custom image can create a functional Cloud Dataproc cluster. Disabling this step will speed up the custom image build process, but is not recommended.
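As a concrete illustration of the --customization-script flag, here is a minimal sketch of such a script. The packages installed are hypothetical examples, not requirements of the tool; the script runs once on the temporary VM while the image is being built:

```shell
# Write a minimal customization script; the packages below are hypothetical.
cat > customization_script.sh <<'EOF'
#!/usr/bin/env bash
# Runs once, on the temporary VM, while the custom image is being built.
set -euxo pipefail
apt-get update
apt-get install -y tmux htop
EOF
# Make the script executable.
chmod +x customization_script.sh
```

Pass the script's local path via --customization-script when running generate_custom_image.py.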
Using a custom image
Your custom image is saved by Cloud Dataproc and can be used to create Cloud Dataproc clusters.
Important Note: The custom image is saved in Compute Engine Images, and is valid for creating Cloud Dataproc clusters for 30 days. You must re-create the custom image to reuse it after the 30-day period.
To create a Cloud Dataproc cluster with your custom image, run the following gcloud command:
gcloud beta dataproc clusters create cluster-name \
    --image=custom-image-name \
    other flags
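A filled-in example, with hypothetical cluster, image, and zone names substituted for the placeholders; the command is assembled and echoed rather than executed so the sketch has no side effects until you run it yourself:

```shell
# Hypothetical names; replace with your own cluster, image, and zone.
CLUSTER=my-test-cluster
CUSTOM_IMAGE=my_custom_image    # the --image-name used when generating the image
ZONE=us-central1-a
# Assemble the cluster-create command and print it for review.
CREATE_CMD="gcloud beta dataproc clusters create ${CLUSTER} --image=${CUSTOM_IMAGE} --zone=${ZONE}"
echo "${CREATE_CMD}"
```

Run the printed command once the values match your project.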