Initialization actions

When creating a Cloud Dataproc cluster, you can specify initialization actions in executables or scripts that Cloud Dataproc will run on all nodes in your Cloud Dataproc cluster immediately after the cluster is set up. Initialization actions often set up job dependencies, such as installing Python packages, so that jobs can be submitted to the cluster without having to install dependencies when the jobs are run.
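
For example, a minimal initialization action that installs a Python package on every node might look like the following sketch (the package is illustrative, and this assumes pip is available on the cluster's image version):

#!/bin/bash
# Initialization actions run as root on every node, so no sudo is needed.
set -euxo pipefail

# Install a job dependency once at cluster-creation time instead of per job.
pip install --upgrade requests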

You can find frequently used and other sample initialization action scripts in gs://dataproc-initialization-actions, a public Cloud Storage bucket, and in a GitHub repository.

Important considerations and guidelines

  • Don't create clusters that reference initialization actions located in the gs://dataproc-initialization-actions public bucket. These scripts are provided as reference implementations, and they are synchronized with ongoing GitHub repository changes—a new version of a public bucket initialization action may break your cluster creation. Instead, copy the public bucket initialization action into your bucket, as shown in the following example:
    gsutil cp gs://dataproc-initialization-actions/presto/presto.sh gs://my-bucket/
    Then, create the cluster by referencing the copy:
    gcloud dataproc clusters create cluster-name \
        --initialization-actions gs://my-bucket/presto.sh \
        ... other flags ...
    You can decide when to sync your copy of the initialization action with any changes to the initialization action that occur in the public bucket or GitHub repository.
  • Initialization actions are executed on each node during cluster creation. They are also executed on each newly added node when scaling or autoscaling clusters up.
  • Initialization actions run as the root user. This means you do not need to use sudo.
  • You should use absolute paths in initialization actions.
  • Your initialization actions should use a shebang line to indicate how the script should be interpreted (such as #!/bin/bash or #!/usr/bin/python); a sketch that follows these guidelines appears after this list.
  • If an initialization action terminates with a non-zero exit code, the cluster create operation will report an "ERROR" status. To debug the initialization action, SSH into the cluster's VM instances and examine the logs. After fixing the initialization action problem, you can delete, then re-create the cluster.
  • If you create a Dataproc cluster with internal IP addresses only, attempts to access the Internet in an initialization action will fail unless you have configured routes to direct the traffic through a NAT or a VPN gateway. In this case, you can enable Private Google Access and place job dependencies in Cloud Storage; cluster nodes can then download the dependencies from Cloud Storage from internal IPs.
  • You can use Dataproc custom images instead of initialization actions to set up job dependencies.
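
The following sketch pulls several of these guidelines together: it declares a shebang, runs without sudo (initialization actions already run as root), and uses only absolute paths. The bucket and file names are hypothetical.

#!/bin/bash
set -euxo pipefail

# Use absolute paths; the working directory during execution is not guaranteed.
mkdir -p /opt/myapp
gsutil cp gs://my-bucket/myapp/app-config.properties /opt/myapp/app-config.properties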

Using initialization actions

Cluster initialization actions can be specified regardless of how you create a cluster:

Gcloud command

When creating a cluster with the gcloud dataproc clusters create command, specify one or more comma-separated Cloud Storage locations (URIs) of the initialization executables or scripts with the --initialization-actions flag. Note: Multiple consecutive "/"s in a Cloud Storage location URI after the initial "gs://", such as "gs://bucket/my//object//name", are not supported.

The syntax for using this flag is shown below; you can also view it from the command line by running gcloud dataproc clusters create --help.

gcloud dataproc clusters create cluster-name \
    --initialization-actions Cloud Storage URI(s) (gs://bucket/...) \
    --initialization-action-timeout timeout-value (default=10m) \
    ... other flags ...

You can use the optional --initialization-action-timeout flag to specify a timeout period for the initialization action. The default timeout value is 10 minutes. If the initialization executable or script has not completed by the end of the timeout period, Cloud Dataproc cancels the initialization action.
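
For example, to give a long-running installation up to 30 minutes before Cloud Dataproc cancels it (the script name is hypothetical):

gcloud dataproc clusters create cluster-name \
    --initialization-actions gs://my-bucket/slow-install.sh \
    --initialization-action-timeout 30m \
    ... other flags ...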


REST API

Specify one or more NodeInitializationAction scripts or executables in the ClusterConfig.initializationActions array as part of a clusters.create API request.


POST /v1/projects/my-project-id/regions/us-central1/clusters/
{
  "projectId": "my-project-id",
  "clusterName": "example-cluster",
  "config": {
    "configBucket": "",
    "gceClusterConfig": {
      "subnetworkUri": "default",
      "zoneUri": "us-central1-b"
    },
    "masterConfig": {
      "numInstances": 1,
      "machineTypeUri": "n1-standard-4",
      "diskConfig": {
        "bootDiskSizeGb": 500,
        "numLocalSsds": 0
      }
    },
    "workerConfig": {
      "numInstances": 2,
      "machineTypeUri": "n1-standard-4",
      "diskConfig": {
        "bootDiskSizeGb": 500,
        "numLocalSsds": 0
      }
    },
    "initializationActions": [
      {
        "executableFile": "gs://cloud-example-bucket/"
      }
    ]
  }
}


Console

When creating a cluster with the Cloud Console, you can specify one or more initialization actions in the Initialization actions field. To see this field, expand the Advanced options panel.
Enter the Cloud Storage location(s) of each initialization action in this field. Click Browse to open the Cloud Console Cloud Storage Browser page and select an initialization file. Each initialization file must be entered separately (press <Enter> to add a new entry).

Passing arguments to initialization actions

Dataproc sets special metadata values for the instances that run in your clusters. You can set your own custom metadata as a way to pass arguments to initialization actions.

gcloud dataproc clusters create cluster-name \
    --initialization-actions Cloud Storage URI(s) (gs://bucket/...) \
    --metadata name1=value1,name2=value2... \
    ... other flags ...

Metadata values can be read within initialization actions as follows:

var1=$(/usr/share/google/get_metadata_value attributes/name1)
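
Putting the two halves together, the following sketch passes a hypothetical PIP_PACKAGES metadata value at cluster creation and consumes it in the initialization action (the script name and package list are illustrative):

gcloud dataproc clusters create cluster-name \
    --initialization-actions gs://my-bucket/install-packages.sh \
    --metadata PIP_PACKAGES=pandas

The install-packages.sh script could then read and act on the value:

#!/bin/bash
# Read the custom metadata value that was set with --metadata.
PIP_PACKAGES=$(/usr/share/google/get_metadata_value attributes/PIP_PACKAGES)
pip install ${PIP_PACKAGES}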

Node selection

If you want to limit initialization actions to master or worker nodes, you can add simple node-selection logic to your executable or script.

ROLE=$(/usr/share/google/get_metadata_value attributes/dataproc-role)
if [[ "${ROLE}" == 'Master' ]]; then
  ... master specific actions ...
else
  ... worker specific actions ...
fi
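
As a concrete (and purely illustrative) instance of this pattern, the following sketch installs an extra Python package only on the master node:

#!/bin/bash
ROLE=$(/usr/share/google/get_metadata_value attributes/dataproc-role)
if [[ "${ROLE}" == 'Master' ]]; then
  # Master-only dependency, e.g., for a driver-side tool.
  pip install jupyter
fi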

Staging binaries

A common cluster initialization scenario is the staging of job binaries on a cluster to eliminate the need to stage the binaries each time a job is submitted. For example, assume the following initialization script is stored in gs://my-bucket/, a Cloud Storage bucket location:

#!/bin/bash
ROLE=$(/usr/share/google/get_metadata_value attributes/dataproc-role)
if [[ "${ROLE}" == 'Master' ]]; then
  gsutil cp gs://my-bucket/jobs/sessionalize-logs-1.0.jar /home/username/
fi

The location of this script can be passed to the gcloud dataproc clusters create command:

gcloud dataproc clusters create my-dataproc-cluster \
    --initialization-actions gs://my-bucket/

Cloud Dataproc will run this script on all nodes, and, as a consequence of the script's node-selection logic, will download the jar to the master node. Submitted jobs can then use the pre-staged jar:

gcloud dataproc jobs submit hadoop \
    --cluster my-dataproc-cluster \
    --jar file:///home/username/sessionalize-logs-1.0.jar

Initialization actions samples

Frequently used and other sample initialization action scripts are located in gs://dataproc-initialization-actions, a public Cloud Storage bucket, and in a GitHub repository. To contribute a script, review the repository's contribution guidelines, and then file a pull request.


Output from the execution of each initialization action is logged for each instance in /var/log/dataproc-initialization-script-X.log, where X is the zero-based index of each successive initialization action script. For example, if your cluster has two initialization actions, the outputs will be logged in /var/log/dataproc-initialization-script-0.log and /var/log/dataproc-initialization-script-1.log.
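
For example, you could tail the log of the first initialization action on the master node like this (the cluster name and its -m master suffix are illustrative):

gcloud compute ssh cluster-name-m \
    --command "tail -n 50 /var/log/dataproc-initialization-script-0.log"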

What's Next

Explore the sample initialization actions in gs://dataproc-initialization-actions and the GitHub repository.
