Initialization actions

When creating a Dataproc cluster, you can specify initialization actions in executables or scripts that Dataproc will run on all nodes in your Dataproc cluster immediately after the cluster is set up. Initialization actions often set up job dependencies, such as installing Python packages, so that jobs can be submitted to the cluster without having to install dependencies when the jobs are run.

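You can find sample initialization action scripts in the regional gs://goog-dataproc-initialization-actions-REGION public Cloud Storage buckets and in a GitHub repository (see Initialization actions samples).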

Important considerations and guidelines

  • Don't create production clusters that reference initialization actions located in the gs://goog-dataproc-initialization-actions-REGION public buckets. These scripts are provided as reference implementations and are synchronized with ongoing GitHub repository changes, so a new version of an initialization action in the public buckets may break your cluster creation. Instead, copy the initialization action from the public bucket into your own bucket, as shown in the following example:

    REGION=region
    
    gsutil cp gs://goog-dataproc-initialization-actions-${REGION}/tez/tez.sh gs://my-bucket
    
    Then, create the cluster by referencing the copy:
    
    gcloud dataproc clusters create cluster-name \
        --region=${REGION} \
        --initialization-actions=gs://my-bucket/tez.sh \
        ... other flags ...
    
    You can decide when to sync your copy of the initialization action with any changes to the initialization action that occur in the public bucket or GitHub repository.

  • Initialization actions are executed on each node during cluster creation. They are also executed on each newly added node when scaling or autoscaling clusters up.

  • Initialization actions run as the root user. This means you do not need to use sudo.

  • You should use absolute paths in initialization actions.

  • Your initialization actions should use a shebang line to indicate how the script should be interpreted (such as #!/bin/bash or #!/usr/bin/python).

  • If an initialization action terminates with a non-zero exit code, the cluster create operation will report an "ERROR" status. To debug the initialization action, SSH into the cluster's VM instances, and then examine the logs. After fixing the initialization action problem, you can delete, then re-create the cluster.

  • If you create a Dataproc cluster with internal IP addresses only, attempts to access github.com over the Internet in an initialization action will fail unless you have configured routes to direct the traffic through Cloud NAT or a Cloud VPN. Without access to the Internet, you can enable Private Google Access and place job dependencies in Cloud Storage; cluster nodes can download the dependencies from Cloud Storage from internal IPs.

  • You can use Dataproc custom images instead of initialization actions to set up job dependencies.

  • Initialization processing:

    • Pre-2.0 image clusters:
      • Master: Master node initialization actions do not start until HDFS is writeable (until HDFS has exited safemode and at least two HDFS DataNodes have joined). This allows initialization actions run on masters to write files to HDFS.
      • Worker: If the user sets the dataproc:dataproc.worker.custom.init.actions.mode cluster property to RUN_BEFORE_SERVICES, each worker runs its initialization actions before it starts its HDFS datanode and YARN nodemanager daemons. Since Dataproc does not run master initialization actions until HDFS is writeable, which requires 2 HDFS datanode daemons to be running, setting this property may increase cluster creation time.
    • 2.0+ image clusters:

      • Master: Master node initialization actions may run before HDFS is writeable. If you run initialization actions that stage files in HDFS or depend on the availability of HDFS-dependent services, such as Ranger, set the dataproc:dataproc.master.custom.init.actions.mode cluster property to RUN_AFTER_SERVICES. Note: Because this setting can increase cluster creation time (see the explanation of the cluster creation delay for pre-2.0 image cluster workers), use it only when necessary. As a general practice, rely on the default RUN_BEFORE_SERVICES setting for this property.
      • Worker: The dataproc:dataproc.worker.custom.init.actions.mode cluster property is set to RUN_BEFORE_SERVICES and cannot be changed by the user (it cannot be passed to the cluster when the cluster is created). Each worker runs its initialization actions before it starts its HDFS datanode and YARN nodemanager daemons. Since Dataproc does not wait for HDFS to be writeable before running master initialization actions, master and worker initialization actions run in parallel.
    • Recommendations:

      • Use metadata to determine a node's role to conditionally execute an initialization action on nodes (see Using cluster metadata).
      • Fork a copy of an initialization action to a Cloud Storage bucket for stability (see How initialization actions are used).
      • Add retries when you download from the Internet to help stabilize the initialization action.
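
The recommendation to add retries around Internet downloads can be sketched as a small Bash helper. This is a minimal sketch, not a Dataproc utility: the function name retry_command, the attempt count, the delay, and the example URL are all assumptions chosen for illustration.

```shell
#!/bin/bash
# Sketch of a retry helper for use inside an initialization action.
# retry_command is a hypothetical name, not something Dataproc provides.

# Usage: retry_command MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS attempts have been made.
retry_command() {
  local max_attempts="$1"
  local delay_seconds="$2"
  shift 2
  local attempt=1
  until "$@"; do
    if (( attempt >= max_attempts )); then
      echo "Command failed after ${attempt} attempts: $*" >&2
      return 1
    fi
    echo "Attempt ${attempt} failed; retrying in ${delay_seconds}s: $*" >&2
    sleep "${delay_seconds}"
    attempt=$(( attempt + 1 ))
  done
}

# Example call (commented out; the URL and target path are placeholders):
# retry_command 5 10 /usr/bin/curl -fsSL -o /tmp/dependency.tar.gz \
#     https://example.com/dependency.tar.gz
```

Because the helper returns a non-zero status once the attempts are exhausted, a script that exits on that status would still cause the cluster create operation to report an error, as described in the guidelines above.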

Using initialization actions

Cluster initialization actions can be specified regardless of how you create a cluster:

gcloud command

When creating a cluster with the gcloud dataproc clusters create command, specify one or more comma-separated Cloud Storage locations (URIs) of the initialization executables or scripts with the --initialization-actions flag. Note: Multiple consecutive "/"s in a Cloud Storage location URI after the initial "gs://", such as "gs://bucket/my//object//name", are not supported.
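
The restriction on consecutive slashes can be checked locally before a URI is passed to the --initialization-actions flag. The following is a minimal sketch; is_supported_gcs_uri is a hypothetical helper name, and it checks only the URI shape described in the note above, not whether the object exists.

```shell
#!/bin/bash
# Hypothetical helper: returns 0 if the URI starts with gs://, names a
# bucket, and has no consecutive slashes after the gs:// prefix.
is_supported_gcs_uri() {
  local uri="$1"
  # Must start with gs:// followed by at least one character.
  [[ "${uri}" == gs://?* ]] || return 1
  # Strip the scheme, then reject an empty bucket name or any "//".
  local rest="${uri#gs://}"
  [[ "${rest}" != /* && "${rest}" != *//* ]]
}

# Report on a couple of example URIs (both taken from this page).
for uri in "gs://my-bucket/tez.sh" "gs://bucket/my//object//name"; do
  if is_supported_gcs_uri "${uri}"; then
    echo "OK: ${uri}"
  else
    echo "Not supported: ${uri}"
  fi
done
```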

The syntax for using this flag is shown below. You can also view it from the command line by running gcloud dataproc clusters create --help.

gcloud dataproc clusters create cluster-name \
    --region=${REGION} \
    --initialization-actions=Cloud Storage URI(s) (gs://bucket/...) \
    --initialization-action-timeout=timeout-value (default=10m) \
    ... other flags ...
Notes:
  • Use the --initialization-action-timeout flag to specify a timeout period for the initialization action. The default timeout value is 10 minutes. If the initialization executable or script has not completed by the end of the timeout period, Dataproc cancels the initialization action.
  • Use the dataproc:dataproc.worker.custom.init.actions.mode cluster property to run the initialization action on primary workers before the node manager and datanode daemons are started.
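
As an illustration of the flags described above, the following sketch assembles a create command with two initialization actions and a longer timeout. The cluster name, region, script URIs, and the 20m timeout are placeholder assumptions. The command is printed for review rather than executed.

```shell
#!/bin/bash
# Placeholder values; replace them with your own.
cluster_name="example-cluster"
region="us-central1"
init_actions=(
  "gs://my-bucket/install-deps.sh"
  "gs://my-bucket/configure-node.sh"
)

# Join the URIs with commas, the format --initialization-actions expects.
init_actions_csv="$(IFS=,; echo "${init_actions[*]}")"

create_cmd=(
  gcloud dataproc clusters create "${cluster_name}"
  "--region=${region}"
  "--initialization-actions=${init_actions_csv}"
  "--initialization-action-timeout=20m"
)

# Print the command for review. To run it, execute: "${create_cmd[@]}"
printf '%s ' "${create_cmd[@]}"
printf '\n'
```

Holding the command in an array keeps each flag a single argument, which avoids quoting problems if a value ever contains spaces.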

REST API

Specify one or more scripts or executables in a ClusterConfig.initializationActions array as part of a clusters.create API request.

Example

POST /v1/projects/my-project-id/regions/us-central1/clusters/
{
  "projectId": "my-project-id",
  "clusterName": "example-cluster",
  "config": {
    "configBucket": "",
    "gceClusterConfig": {
      "subnetworkUri": "default",
      "zoneUri": "us-central1-b"
    },
    "masterConfig": {
      "numInstances": 1,
      "machineTypeUri": "n1-standard-4",
      "diskConfig": {
        "bootDiskSizeGb": 500,
        "numLocalSsds": 0
      }
    },
    "workerConfig": {
      "numInstances": 2,
      "machineTypeUri": "n1-standard-4",
      "diskConfig": {
        "bootDiskSizeGb": 500,
        "numLocalSsds": 0
      }
    },
    "initializationActions": [
      {
        "executableFile": "gs://cloud-example-bucket/my-init-action.sh"
      }
    ]
  }
}
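
One way to send a request like the one above is with curl. This is a sketch under several assumptions: the request body is reduced to the fields needed to show initializationActions (other settings are left to service defaults), the project, region, and bucket names are the placeholders from the example, and the access token comes from gcloud auth print-access-token. The function submit_create_request is a hypothetical name; it is defined here but not called.

```shell
#!/bin/bash
project_id="my-project-id"
region="us-central1"
request_file="$(mktemp)"

# Reduced request body; only initializationActions is set under config.
cat > "${request_file}" <<EOF
{
  "projectId": "${project_id}",
  "clusterName": "example-cluster",
  "config": {
    "initializationActions": [
      { "executableFile": "gs://cloud-example-bucket/my-init-action.sh" }
    ]
  }
}
EOF

url="https://dataproc.googleapis.com/v1/projects/${project_id}/regions/${region}/clusters"

# Sends the request; requires curl, gcloud, and valid credentials.
submit_create_request() {
  curl -sS -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    --data @"${request_file}" \
    "${url}"
}

# Call submit_create_request to create the cluster.
```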

Console

  • Open the Dataproc Create a cluster page, then select the Customize cluster panel.
  • In the Initialization actions section, enter the Cloud Storage bucket location(s) of each initialization action in the Executable file field(s). Click BROWSE to open the Cloud Console Cloud Storage Browser page to select a script or executable file. Click ADD INITIALIZATION ACTION to add each new file.

Passing arguments to initialization actions

    Dataproc sets special metadata values for the instances that run in your clusters. You can set your own custom metadata as a way to pass arguments to initialization actions.

    gcloud dataproc clusters create cluster-name \
        --region=${REGION} \
        --initialization-actions=Cloud Storage URI(s) (gs://bucket/...) \
        --metadata=name1=value1,name2=value2... \
        ... other flags ...
    

    Metadata values can be read within initialization actions as follows:

    var1=$(/usr/share/google/get_metadata_value attributes/name1)
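
A script may want a fallback when a custom metadata key was not supplied at cluster creation. The sketch below wraps the lookup shown above. The names get_attribute_or_default and METADATA_CMD are hypothetical, and the behavior of get_metadata_value for a missing key (a non-zero exit or empty output) is an assumption, so the helper handles both.

```shell
#!/bin/bash
# Path from the example above. It is overridable only so the sketch can be
# exercised off-cluster; METADATA_CMD is a hypothetical variable name.
METADATA_CMD="${METADATA_CMD:-/usr/share/google/get_metadata_value}"

# Prints the value of custom metadata key $1, or the default $2 if the
# lookup fails or returns an empty value.
get_attribute_or_default() {
  local key="$1"
  local default_value="$2"
  local value
  if value="$("${METADATA_CMD}" "attributes/${key}" 2>/dev/null)" \
      && [[ -n "${value}" ]]; then
    echo "${value}"
  else
    echo "${default_value}"
  fi
}

# Example: read name1 (set with --metadata=name1=value1), with a fallback.
var1="$(get_attribute_or_default name1 unset)"
echo "name1=${var1}"
```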
    

    Node selection

    If you want to limit initialization actions to master or worker nodes, you can add simple node-selection logic to your executable or script.

    ROLE=$(/usr/share/google/get_metadata_value attributes/dataproc-role)
    if [[ "${ROLE}" == 'Master' ]]; then
      ... master specific actions ...
    else
      ... worker specific actions ...
    fi
    

    Staging binaries

    A common cluster initialization scenario is the staging of job binaries on a cluster to eliminate the need to stage the binaries each time a job is submitted. For example, assume the following initialization script is stored in gs://my-bucket/download-job-jar.sh, a Cloud Storage bucket location:

    #!/bin/bash
    ROLE=$(/usr/share/google/get_metadata_value attributes/dataproc-role)
    if [[ "${ROLE}" == 'Master' ]]; then
      gsutil cp gs://my-bucket/jobs/sessionalize-logs-1.0.jar /home/username
    fi
    

    The location of this script can be passed to the gcloud dataproc clusters create command:

    gcloud dataproc clusters create my-dataproc-cluster \
        --region=${REGION} \
        --initialization-actions=gs://my-bucket/download-job-jar.sh
    

    Dataproc will run this script on all nodes, and, as a consequence of the script's node-selection logic, will download the jar to the master node. Submitted jobs can then use the pre-staged jar:

    gcloud dataproc jobs submit hadoop \
        --cluster=my-dataproc-cluster \
        --region=${REGION} \
        --jar=file:///home/username/sessionalize-logs-1.0.jar
    

    Initialization actions samples

    Frequently used and other sample initialization action scripts are located in the gs://goog-dataproc-initialization-actions-<REGION> regional public Cloud Storage buckets and in a GitHub repository. To contribute a script, review the CONTRIBUTING.md document, and then file a pull request.

    Logging

    Output from the execution of each initialization action is logged for each instance in /var/log/dataproc-initialization-script-X.log, where X is the zero-based index of each successive initialization action script. For example, if your cluster has two initialization actions, the outputs will be logged in /var/log/dataproc-initialization-script-0.log and /var/log/dataproc-initialization-script-1.log.
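
When debugging over SSH, the log naming scheme above can be turned into a small helper that shows the tail of each log. This is a minimal sketch: the function names and the 20-line tail are assumptions, and the number of initialization actions must be supplied by the caller.

```shell
#!/bin/bash
# Hypothetical helper: prints the log path for the initialization action
# at zero-based index $1, following the naming scheme described above.
init_action_log_path() {
  local index="$1"
  echo "/var/log/dataproc-initialization-script-${index}.log"
}

# Shows the last lines of each log for a cluster with $1 initialization
# actions. Missing or unreadable logs are reported on stderr.
show_init_action_logs() {
  local count="$1"
  local i path
  for (( i = 0; i < count; i++ )); do
    path="$(init_action_log_path "${i}")"
    if [[ -r "${path}" ]]; then
      echo "== ${path} =="
      tail -n 20 "${path}"
    else
      echo "No readable log at ${path}" >&2
    fi
  done
}

# Example, run on a cluster node with two initialization actions:
# show_init_action_logs 2
```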

    What's next

    Explore GitHub initialization actions.