Initialization actions

When creating a Cloud Dataproc cluster, you can specify initialization actions: executables or scripts that Cloud Dataproc runs on all nodes in your cluster immediately after the cluster is set up. Frequently used and other sample initialization action scripts are available in a shared Cloud Storage bucket and a GitHub repository (see Sample initialization actions repository, below).

Cluster initialization actions can be specified regardless of how you create a cluster, whether through the Google Cloud Platform Console or the gcloud command line (both are described in Using initialization actions, below).

Important things to know

There are a few important things to know when creating or using initialization actions:

  • Initialization actions run as the root user. This means you do not need to use sudo.
  • You should use absolute paths in initialization actions.
  • Your initialization actions should use a shebang line to indicate how the script should be interpreted (such as #!/bin/bash or #!/usr/bin/python).
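Taken together, these rules suggest a skeleton like the following. This is a minimal, hypothetical sketch (the staging directory and marker file are illustrative, not part of any Cloud Dataproc API); because the script runs as root on every node, it could, for example, install packages directly without sudo:

```shell
#!/bin/bash
# Hypothetical initialization action skeleton (all names are illustrative).
# The script runs as root on every node, so no sudo is needed; for
# example, `apt-get install -y some-package` could be run here directly.
set -euo pipefail

# Use absolute paths throughout.
STAGING_DIR=/tmp/dataproc-init-example
mkdir -p "${STAGING_DIR}"
echo "initialization complete" > "${STAGING_DIR}/marker.txt"
```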

Node selection

If you want to limit initialization actions to master or worker nodes, you can add simple node-selection logic to your executable or script, as shown below.

shell script

ROLE=$(/usr/share/google/get_metadata_value attributes/dataproc-role)
if [[ "${ROLE}" == 'Master' ]]; then
  ... master specific actions ...
else
  ... worker specific actions ...
fi
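
Because the metadata server is only available on cluster nodes, a variant of this pattern with a fallback role can be exercised off-cluster. The fallback below is purely a local-testing stub and is not part of a real initialization action:

```shell
#!/bin/bash
# Node-selection sketch. On a Cloud Dataproc node the role comes from the
# metadata server; off-cluster we fall back to 'Worker' so the script can
# be run and tested locally (the fallback branch is a test stub only).
if [[ -x /usr/share/google/get_metadata_value ]]; then
  ROLE=$(/usr/share/google/get_metadata_value attributes/dataproc-role)
else
  ROLE='Worker'
fi

if [[ "${ROLE}" == 'Master' ]]; then
  echo "running master-specific actions"
else
  echo "running worker-specific actions"
fi
```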

Using initialization actions

Google Cloud Platform Console

When creating a cluster with the Cloud Platform Console, you can specify one or more initialization actions in the Initialization actions field. To see this field, expand the Preemptible workers, bucket, network, version, initialization, & access options panel.

Enter the Cloud Storage location (URI) of each initialization action in this field. Each initialization action must be entered separately (press <Enter> to create a new entry).

Command line

When creating a cluster with the gcloud dataproc clusters create command, specify the Cloud Storage location(s) (URI(s)) of the initialization executable(s) and/or script(s) with the --initialization-actions flag. The syntax for using this flag is shown below (obtained by running gcloud dataproc clusters create --help on the command line):

gcloud dataproc clusters create CLUSTER-NAME \
  [--initialization-actions [GCS_URI,...]] \
  [--initialization-action-timeout TIMEOUT; default="10m"] \
  ... other flags ...

Note that you can use the optional --initialization-action-timeout flag to specify a timeout for the initialization action (the default value is 10 minutes). If the initialization executable or script has not completed by the end of the timeout period, Cloud Dataproc will cancel the initialization action.

Samples and Examples

Example—Staging binaries

A common cluster initialization scenario is the staging of job binaries on a cluster to eliminate the need to stage the binaries each time a job is submitted. For example, assume the following initialization script is stored in gs://my-bucket/copy-jar.sh (the bucket and script names are placeholders):

#!/bin/bash
ROLE=$(/usr/share/google/get_metadata_value attributes/dataproc-role)
if [[ "${ROLE}" == 'Master' ]]; then
  gsutil cp gs://my-bucket/jobs/sessionalize-logs-1.0.jar /home/username
fi

The location of this script can be passed to the gcloud dataproc clusters create command:

gcloud dataproc clusters create my-dataproc-cluster \
    --initialization-actions gs://my-bucket/copy-jar.sh

Cloud Dataproc will run this script on all nodes, and, as a consequence of the script's node-selection logic, will download the jar to the master node. Submitted jobs can then use the pre-staged jar:

gcloud dataproc jobs submit hadoop \
    --cluster my-dataproc-cluster \
    --jar file:///home/username/sessionalize-logs-1.0.jar

Sample initialization actions repository

Frequently used and other sample initialization action scripts are provided in a shared Cloud Storage bucket (gs://dataproc-initialization-actions) and our GitHub repository. If you have a script you would like to contribute, please review the repository's contribution guidelines and file a pull request.

See How to install and run a Jupyter notebook in a Cloud Dataproc cluster for a tutorial that explains the steps to install and run an initialization script on a new Cloud Dataproc cluster.

