Initialization actions

When creating a Cloud Dataproc cluster, you can specify initialization actions as executables or scripts that Cloud Dataproc will run on all nodes in your Cloud Dataproc cluster immediately after the cluster is set up. Initialization actions often set up job dependencies, such as installing Python packages, so that jobs can be submitted to the cluster without having to install dependencies when the jobs are run.
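As a minimal sketch, an initialization action that installs Python packages for later jobs might look like the following (the package names are illustrative assumptions, and the script assumes pip is available on the cluster image):

#!/bin/bash
# Hypothetical initialization action: install Python job dependencies on every node.
# Replace the package list with the packages your jobs actually need.
set -euxo pipefail

pip install pandas requests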

You can find frequently used and other sample initialization action scripts at the following locations:

  • gs://dataproc-initialization-actions, a public Cloud Storage bucket
  • The initialization actions GitHub repository (see Initialization actions samples below)

Important things to know

There are a few important things to know when creating or using initialization actions:

  • Initialization actions run as the root user. This means you do not need to use sudo.
  • You should use absolute paths in initialization actions.
  • Your initialization actions should use a shebang line to indicate how the script should be interpreted (such as #!/bin/bash or #!/usr/bin/python).
  • If an initialization action terminates with a non-zero exit code, the cluster create operation will report an "ERROR" status. To debug the initialization action, SSH into the cluster's VM instances and examine the logs. After fixing the initialization action problem, you can delete, then re-create the cluster.
  • If you create a Cloud Dataproc cluster with internal IP addresses only, attempts to access the Internet in an initialization action will fail unless you have configured routes to direct the traffic through a NAT or a VPN gateway. Alternatively, you can enable Private Google Access and place job dependencies in Cloud Storage; cluster nodes can then download the dependencies from Cloud Storage over internal IP addresses (see the sketch after this list).
  • You can use Cloud Dataproc custom images instead of initialization actions to set up job dependencies.
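As a minimal sketch of the internal-IP-only approach mentioned above, the following initialization action installs a dependency that has been pre-staged in Cloud Storage rather than downloading it from the Internet (the bucket path and wheel name are hypothetical):

#!/bin/bash
# Hypothetical initialization action for a cluster with internal IP addresses only:
# copy a pre-staged Python wheel from Cloud Storage and install it locally,
# so no Internet access is required.
set -euxo pipefail

gsutil cp gs://my-bucket/deps/my-dependency-1.0-py3-none-any.whl /tmp/
pip install /tmp/my-dependency-1.0-py3-none-any.whl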

Using initialization actions

Cluster initialization actions can be specified regardless of how you create a cluster:

Gcloud command

When creating a cluster with the gcloud dataproc clusters create command, specify the Cloud Storage location(s) (URI(s)) of the initialization executable(s) or script(s) with the --initialization-actions flag. The syntax for using this flag is shown below; you can also view it from the command line by running gcloud dataproc clusters create --help.

gcloud dataproc clusters create cluster-name \
    --initialization-actions Cloud Storage URI(s) (gs://bucket/...) \
    --initialization-action-timeout timeout-value (default=10m) \
    ... other flags ...

You can use the optional --initialization-action-timeout flag to specify a timeout period for the initialization action. The default timeout value is 10 minutes. If the initialization executable or script has not completed by the end of the timeout period, Cloud Dataproc cancels the initialization action.
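For example, the following command creates a cluster with a single initialization action and a 20-minute timeout (the cluster name and bucket path are placeholders):

gcloud dataproc clusters create my-cluster \
    --initialization-actions gs://my-bucket/my-init-action.sh \
    --initialization-action-timeout 20m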

REST API

Specify one or more NodeInitializationAction script(s) or executable(s) in the ClusterConfig.initializationActions array as part of a clusters.create API request.

Example

POST /v1/projects/my-project-id/regions/global/clusters/
{
  "projectId": "my-project-id",
  "clusterName": "example-cluster",
  "config": {
    "configBucket": "",
    "gceClusterConfig": {
      "subnetworkUri": "default",
      "zoneUri": "us-central1-b"
    },
    "masterConfig": {
      "numInstances": 1,
      "machineTypeUri": "n1-standard-4",
      "diskConfig": {
        "bootDiskSizeGb": 500,
        "numLocalSsds": 0
      }
    },
    "workerConfig": {
      "numInstances": 2,
      "machineTypeUri": "n1-standard-4",
      "diskConfig": {
        "bootDiskSizeGb": 500,
        "numLocalSsds": 0
      }
    },
    "initializationActions": [
      {
        "executableFile": "gs://cloud-example-bucket/my-init-action.sh"
      }
    ]
  }
}

Console

When creating a cluster with the GCP Console, you can specify one or more initialization actions in the Initialization actions field. To see this field, expand the Advanced options panel.
Enter the Cloud Storage location(s) of each initialization action in this field. Click Browse to open the GCP Console Cloud Storage Browser page and select an initialization file. Each initialization file must be entered separately (press <Enter> to add a new entry).

Passing arguments to initialization actions

Cloud Dataproc sets special metadata values for the instances that run in your clusters. You can set your own custom metadata as a way to pass arguments to initialization actions.

gcloud dataproc clusters create cluster-name \
    --initialization-actions Cloud Storage URI(s) (gs://bucket/...) \
    --metadata name1=value1,name2=value2... \
    ... other flags ...

Metadata values can be read within initialization actions as follows:

var1=$(/usr/share/google/get_metadata_value attributes/name1)
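For example, an initialization action could read a custom metadata key set at cluster creation and install whatever it names. The metadata key (PIP_PACKAGES), script name, and package list below are assumptions for illustration:

#!/bin/bash
# Hypothetical initialization action: read a space-separated package list from
# custom instance metadata and install it with pip.
PACKAGES=$(/usr/share/google/get_metadata_value attributes/PIP_PACKAGES)
if [[ -n "${PACKAGES}" ]]; then
  # Intentionally unquoted so each package becomes a separate pip argument.
  pip install ${PACKAGES}
fi

The matching create command would set the metadata value when the cluster is created:

gcloud dataproc clusters create my-cluster \
    --initialization-actions gs://my-bucket/install-packages.sh \
    --metadata 'PIP_PACKAGES=pandas requests'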

Node selection

If you want to limit initialization actions to master or worker nodes, you can add simple node-selection logic to your executable or script.

ROLE=$(/usr/share/google/get_metadata_value attributes/dataproc-role)
if [[ "${ROLE}" == 'Master' ]]; then
  ... master specific actions ...
else
  ... worker specific actions ...
fi

Staging binaries

A common cluster initialization scenario is the staging of job binaries on a cluster to eliminate the need to stage the binaries each time a job is submitted. For example, assume the following initialization script is stored in gs://my-bucket/download-job-jar.sh, a Cloud Storage bucket location:

#!/bin/bash
ROLE=$(/usr/share/google/get_metadata_value attributes/dataproc-role)
if [[ "${ROLE}" == 'Master' ]]; then
  gsutil cp gs://my-bucket/jobs/sessionalize-logs-1.0.jar /home/username/
fi

The location of this script can be passed to the gcloud dataproc clusters create command:

gcloud dataproc clusters create my-dataproc-cluster \
    --initialization-actions gs://my-bucket/download-job-jar.sh

Cloud Dataproc will run this script on all nodes, and, as a consequence of the script's node-selection logic, will download the jar to the master node. Submitted jobs can then use the pre-staged jar:

gcloud dataproc jobs submit hadoop \
    --cluster my-dataproc-cluster \
    --jar file:///home/username/sessionalize-logs-1.0.jar

Initialization actions samples

Frequently used and other sample initialization action scripts are located in gs://dataproc-initialization-actions, a public Cloud Storage bucket, and in a GitHub repository. To contribute a script, review the CONTRIBUTING.md document, and then file a pull request.

Logging

Output from the execution of each initialization action is logged for each instance in /var/log/dataproc-initialization-script-X.log, where X is the zero-based index of each successive initialization action script. For example, if your cluster has two initialization actions, the outputs will be logged in /var/log/dataproc-initialization-script-0.log and /var/log/dataproc-initialization-script-1.log.
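For example, to inspect the first initialization action's log on the master node, you could SSH to it and tail the log file. The command below assumes the Dataproc convention of naming the master VM with an -m suffix; the cluster name is a placeholder, and you may also need to pass --zone:

gcloud compute ssh my-cluster-m \
    --command='tail -n 50 /var/log/dataproc-initialization-script-0.log'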

What's Next

Explore the sample initialization actions in the public Cloud Storage bucket and GitHub repository listed in the Initialization actions samples section above.
