Initialization actions

When creating a Cloud Dataproc cluster, you can specify initialization actions in executables or scripts that Cloud Dataproc will run on all nodes in your Cloud Dataproc cluster immediately after the cluster is set up. Frequently used and other sample initialization action scripts are listed in the Initialization actions samples section below.

Cluster initialization actions can be specified regardless of how you create a cluster: with the gcloud command-line tool, with a clusters.create API request, or from the GCP Console (see Using initialization actions below).

Important things to know

There are a few important things to know when creating or using initialization actions:

  • Initialization actions run as the root user. This means you do not need to use sudo.
  • You should use absolute paths in initialization actions.
  • Your initialization actions should use a shebang line to indicate how the script should be interpreted (such as #!/bin/bash or #!/usr/bin/python).
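As a minimal sketch that pulls these points together (the tool name and paths are hypothetical placeholders, chosen only to illustrate the shebang, absolute paths, and running as root without sudo):

#!/bin/bash
# Initialization actions run as root, so no sudo is needed.
# "example-tool" and its paths are hypothetical, for illustration only.
set -euxo pipefail

# Use absolute paths when copying or installing files.
gsutil cp gs://my-bucket/tools/example-tool /usr/local/bin/example-tool
chmod +x /usr/local/bin/example-tool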

Node selection

If you want to limit initialization actions to master or worker nodes, you can add simple node-selection logic to your executable or script, as shown below.

shell script

#!/bin/bash
ROLE=$(/usr/share/google/get_metadata_value attributes/dataproc-role)
if [[ "${ROLE}" == 'Master' ]]; then
  ... master specific actions ...
else
  ... worker specific actions ...
fi
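As a concrete sketch of this pattern (assuming a Debian-based cluster image; the package names are hypothetical examples), an initialization action might install one tool only on the master node and another only on workers:

#!/bin/bash
set -euxo pipefail

ROLE=$(/usr/share/google/get_metadata_value attributes/dataproc-role)
if [[ "${ROLE}" == 'Master' ]]; then
  # Master-only setup; "graphviz" is just an example package.
  apt-get install -y graphviz
else
  # Worker-only setup; "htop" is just an example package.
  apt-get install -y htop
fi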

Using initialization actions

Gcloud command

When creating a cluster with the gcloud dataproc clusters create command, specify the Cloud Storage location(s) (URI(s)) of the initialization executable(s) or script(s) with the --initialization-actions flag. The syntax for using this flag is shown below; you can also view it from the command line by running gcloud dataproc clusters create --help.

gcloud dataproc clusters create cluster-name \
  --initialization-actions Cloud Storage URI(s) (gs://bucket/...) \
  --initialization-action-timeout timeout-value (default=10m) \
  ... other flags ...

You can use the optional --initialization-action-timeout flag to specify a timeout period for the initialization action. The default timeout value is 10 minutes. If the initialization executable or script has not completed by the end of the timeout period, Cloud Dataproc cancels the initialization action.
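For example, a sketch of a create command that runs two initialization actions (the bucket and script names are hypothetical) with a 20-minute timeout might look like this:

gcloud dataproc clusters create my-cluster \
  --initialization-actions gs://my-bucket/init-one.sh,gs://my-bucket/init-two.sh \
  --initialization-action-timeout 20m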


REST API

Specify the NodeInitializationAction script or executable as part of a clusters.create API request.


POST /v1/projects/my-project-id/regions/global/clusters/
{
  "projectId": "my-project-id",
  "clusterName": "example-cluster",
  "config": {
    "configBucket": "",
    "gceClusterConfig": {
      "subnetworkUri": "default",
      "zoneUri": "us-central1-b"
    },
    "masterConfig": {
      "numInstances": 1,
      "machineTypeUri": "n1-standard-4",
      "diskConfig": {
        "bootDiskSizeGb": 500,
        "numLocalSsds": 0
      }
    },
    "workerConfig": {
      "numInstances": 2,
      "machineTypeUri": "n1-standard-4",
      "diskConfig": {
        "bootDiskSizeGb": 500,
        "numLocalSsds": 0
      }
    },
    "initializationActions": [
      {
        "executableFile": "gs://cloud-example-bucket/"
      }
    ]
  }
}


Console

When creating a cluster with the GCP Console, you can specify one or more initialization actions in the Initialization actions field. To see this field, expand the Preemptible workers, bucket, network, version, initialization, & access options panel.
Enter the Cloud Storage location(s) of each initialization action in this field, or click Browse to open the GCP Console Cloud Storage Browser page and select an initialization file. Each initialization file must be entered separately (press <Enter> to add a new entry).


Staging binaries

A common cluster initialization scenario is the staging of job binaries on a cluster to eliminate the need to stage the binaries each time a job is submitted. For example, assume the following initialization script is stored in gs://my-bucket/, a Cloud Storage bucket location:

#!/bin/bash
ROLE=$(/usr/share/google/get_metadata_value attributes/dataproc-role)
if [[ "${ROLE}" == 'Master' ]]; then
  gsutil cp gs://my-bucket/jobs/sessionalize-logs-1.0.jar /home/username
fi

The location of this script can be passed to the gcloud dataproc clusters create command:

gcloud dataproc clusters create my-dataproc-cluster \
    --initialization-actions gs://my-bucket/

Cloud Dataproc runs this script on all nodes; because of the script's node-selection logic, the jar is downloaded only to the master node. Submitted jobs can then use the pre-staged jar:

gcloud dataproc jobs submit hadoop \
    --cluster my-dataproc-cluster \
    --jar file:///home/username/sessionalize-logs-1.0.jar

Initialization actions samples

Frequently used and other sample initialization action scripts are located in gs://dataproc-initialization-actions, a public Cloud Storage bucket, and in a GitHub repository. To contribute a script, review the contribution guidelines in the repository, and then file a pull request.
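For example, you can browse the public bucket with gsutil and pass the path of a chosen sample to --initialization-actions (the sample script path below is a hypothetical placeholder):

# List the available sample initialization actions.
gsutil ls gs://dataproc-initialization-actions/

# Create a cluster using a chosen sample (hypothetical path).
gcloud dataproc clusters create sample-init-cluster \
    --initialization-actions gs://dataproc-initialization-actions/some-sample/some-sample.sh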

What's Next

See Install and run a Jupyter notebook in a Cloud Dataproc cluster and Install and run a Cloud Datalab notebook on a Cloud Dataproc cluster, tutorials that show how to install and run initialization scripts on a new Cloud Dataproc cluster.
