When creating a Dataproc cluster, you can specify initialization actions in executables or scripts that Dataproc will run on all nodes in your Dataproc cluster immediately after the cluster is set up. Initialization actions often set up job dependencies, such as installing Python packages, so that jobs can be submitted to the cluster without having to install dependencies when the jobs are run.
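For example, a minimal initialization action that installs job dependencies at cluster-creation time might look like the following sketch; the package names are illustrative, and a Debian-based cluster image with apt-get and pip available is assumed:
```bash
#!/bin/bash
# Minimal sketch of an initialization action that installs job dependencies
# at cluster-creation time. The package names are illustrative only, and a
# Debian-based Dataproc image (with apt-get and pip available) is assumed.
set -euxo pipefail

apt-get update
apt-get install -y jq
pip install requests
```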
You can find frequently used and other sample initialization action scripts at the following locations:
- GitHub repository
- Cloud Storage, in the regional gs://goog-dataproc-initialization-actions-<REGION> buckets
Important considerations and guidelines
- Don't create production clusters that reference initialization actions located in the gs://goog-dataproc-initialization-actions-<REGION> public buckets. These scripts are provided as reference implementations, and they are synchronized with ongoing GitHub repository changes; a new version of an initialization action in the public buckets may break your cluster creation. Instead, copy the initialization action from the public bucket into your own bucket, as shown in the following example:
  REGION=region
  gsutil cp gs://goog-dataproc-initialization-actions-${REGION}/tez/tez.sh gs://my-bucket/
  Then, create the cluster by referencing the copy:
  gcloud dataproc clusters create cluster-name \
      --region=${REGION} \
      --initialization-actions=gs://my-bucket/tez.sh \
      ... other flags ...
  You can decide when to sync your copy of the initialization action with any changes to the initialization action that occur in the public bucket or GitHub repository.
- Initialization actions are executed on each node during cluster creation. They are also executed on each newly added node when scaling or autoscaling a cluster up.
- Initialization actions run as the root user. This means you **do not** need to use sudo.
- You should use absolute paths in initialization actions.
- Your initialization actions should use a shebang line to indicate how the script should be interpreted (such as #!/bin/bash or #!/usr/bin/python).
- If an initialization action terminates with a non-zero exit code, the cluster create operation will report an "ERROR" status. To debug the initialization action, SSH into the cluster's VM instances and examine the logs. After fixing the initialization action problem, you can delete, then re-create the cluster.
- If you create a Dataproc cluster with internal IP addresses only, attempts to access github.com over the Internet in an initialization action will fail unless you have configured routes to direct the traffic through Cloud NAT or a Cloud VPN. Without access to the Internet, you can enable Private Google Access and place job dependencies in Cloud Storage; cluster nodes can then download the dependencies from Cloud Storage over internal IPs.
- You can use Dataproc custom images instead of initialization actions to set up job dependencies.
- Initialization processing:
  - Master node initialization actions are not started until HDFS is writable (that is, until HDFS has exited safemode and at least two HDFS DataNodes have joined). This allows initialization actions that run on masters to write files to HDFS.
  - If the dataproc:dataproc.worker.custom.init.actions.mode=RUN_BEFORE_SERVICES cluster property is set, each worker will run its initialization actions before it starts its HDFS DataNode daemon. Note that master initialization actions will not start until at least two workers have completed their initialization actions, which will likely increase cluster creation time.
  - On each cluster node, multiple initialization actions run in the order specified in the cluster create command. However, initialization actions on separate nodes are processed independently: worker initialization actions may run concurrently with, before, or after master initialization actions.
  - Optional Components selected by a user when a cluster is created are installed and activated on the cluster before initialization actions are run.
- Recommendations:
  - Use metadata to determine a node's role to conditionally execute an initialization action on nodes (see Using cluster metadata).
  - Fork a copy of an initialization action to a Cloud Storage bucket for stability (see How initialization actions are used).
  - Add retries when you download from the Internet to help stabilize the initialization action; see the sketch following this list.
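For example, a simple retry wrapper for downloads in an initialization action might look like the following sketch; the retry count, pause, and download URL are illustrative placeholders:
```bash
#!/bin/bash
# Sketch of a retry helper for internet downloads inside an initialization
# action. Retries the given command up to 5 times with a 10-second pause
# between attempts; the URL below is only a placeholder.
set -euxo pipefail

retry_command() {
  local attempt
  for attempt in {1..5}; do
    if "$@"; then
      return 0
    fi
    sleep 10
  done
  echo "Command failed after ${attempt} attempts: $*" >&2
  return 1
}

retry_command wget -q -O /tmp/my-dependency.tar.gz https://example.com/my-dependency.tar.gz
```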
Using initialization actions
Cluster initialization actions can be specified regardless of how you create a cluster:
- Through the Google Cloud Console
- On the command line with the gcloud command-line tool
- Programmatically with the Dataproc clusters.create API (see NodeInitializationAction)
gcloud command
When creating a cluster with the gcloud dataproc clusters create command, specify one or more comma-separated Cloud Storage locations (URIs) of the initialization executables or scripts with the --initialization-actions flag. Note: Multiple consecutive "/"s in a Cloud Storage location URI after the initial "gs://", such as "gs://bucket/my//object//name", are not supported. The syntax for using this flag is shown below; you can also view it from the command line by running gcloud dataproc clusters create --help.
gcloud dataproc clusters create cluster-name \
    --region=${REGION} \
    --initialization-actions=Cloud Storage URI(s) (gs://bucket/...) \
    --initialization-action-timeout=timeout-value (default=10m) \
    ... other flags ...
Notes:
- Use the --initialization-action-timeout flag to specify a timeout period for the initialization action. The default timeout value is 10 minutes. If the initialization executable or script has not completed by the end of the timeout period, Dataproc cancels the initialization action.
- Use the dataproc:dataproc.worker.custom.init.actions.mode cluster property to run the initialization action on primary workers before the node manager and datanode daemons are started; see the sketch that follows.
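For example, a cluster create command that shortens the timeout and sets the RUN_BEFORE_SERVICES mode might look like the following sketch; the cluster name, region, and bucket path are placeholders:
```bash
# Sketch: shorten the initialization action timeout and run worker
# initialization actions before the DataNode and node manager daemons start.
# The cluster name, region, and bucket path are placeholders.
gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --initialization-actions=gs://my-bucket/my-init-action.sh \
    --initialization-action-timeout=5m \
    --properties=dataproc:dataproc.worker.custom.init.actions.mode=RUN_BEFORE_SERVICES
```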
REST API
Specify one or more scripts or executables in a ClusterConfig.initializationActions array as part of a clusters.create API request.
Example
POST /v1/projects/my-project-id/regions/us-central1/clusters/
{
  "projectId": "my-project-id",
  "clusterName": "example-cluster",
  "config": {
    "configBucket": "",
    "gceClusterConfig": {
      "subnetworkUri": "default",
      "zoneUri": "us-central1-b"
    },
    "masterConfig": {
      "numInstances": 1,
      "machineTypeUri": "n1-standard-4",
      "diskConfig": {
        "bootDiskSizeGb": 500,
        "numLocalSsds": 0
      }
    },
    "workerConfig": {
      "numInstances": 2,
      "machineTypeUri": "n1-standard-4",
      "diskConfig": {
        "bootDiskSizeGb": 500,
        "numLocalSsds": 0
      }
    },
    "initializationActions": [
      {
        "executableFile": "gs://cloud-example-bucket/my-init-action.sh"
      }
    ]
  }
}
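One way to issue this request is with curl, as in the following sketch; it assumes the JSON body above is saved as request.json and that a gcloud access token is used for authentication:
```bash
# Sketch: send the clusters.create request with curl, authenticating with a
# gcloud access token. Assumes the JSON body above is saved as request.json.
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d @request.json \
  "https://dataproc.googleapis.com/v1/projects/my-project-id/regions/us-central1/clusters"
```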
Console
Passing arguments to initialization actions
Dataproc sets special metadata values for the instances that run in your clusters. You can set your own custom metadata as a way to pass arguments to initialization actions.
gcloud dataproc clusters create cluster-name \
    --region=${REGION} \
    --initialization-actions=Cloud Storage URI(s) (gs://bucket/...) \
    --metadata=name1=value1,name2=value2... \
    ... other flags ...
Metadata values can be read within initialization actions as follows:
var1=$(/usr/share/google/get_metadata_value attributes/name1)
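For example, a cluster could be created with --metadata=pip-packages='pandas requests' (pip-packages is a hypothetical key name), and an initialization action could read that value and install the listed packages, roughly as in the following sketch:
```bash
#!/bin/bash
# Sketch: read a custom "pip-packages" metadata value (a hypothetical key,
# set with --metadata=pip-packages='pandas requests' at cluster creation)
# and install the listed packages. Assumes pip is available on the image.
set -euxo pipefail

PIP_PACKAGES=$(/usr/share/google/get_metadata_value attributes/pip-packages || true)
if [[ -n "${PIP_PACKAGES}" ]]; then
  # Intentionally unquoted so the space-separated list expands to
  # individual package arguments.
  pip install ${PIP_PACKAGES}
fi
```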
Node selection
If you want to limit initialization actions to master or worker nodes, you can add simple node-selection logic to your executable or script.
ROLE=$(/usr/share/google/get_metadata_value attributes/dataproc-role)
if [[ "${ROLE}" == 'Master' ]]; then
  ... master specific actions ...
else
  ... worker specific actions ...
fi
Staging binaries
A common cluster initialization scenario is the staging of job binaries on a cluster to eliminate the need to stage the binaries each time a job is submitted. For example, assume the following initialization script is stored in gs://my-bucket/download-job-jar.sh, a Cloud Storage bucket location:
#!/bin/bash
ROLE=$(/usr/share/google/get_metadata_value attributes/dataproc-role)
if [[ "${ROLE}" == 'Master' ]]; then
  gsutil cp gs://my-bucket/jobs/sessionalize-logs-1.0.jar /home/username/
fi
The location of this script can be passed to the gcloud dataproc clusters create command:
gcloud dataproc clusters create my-dataproc-cluster \
    --region=${REGION} \
    --initialization-actions=gs://my-bucket/download-job-jar.sh
Dataproc will run this script on all nodes, and, as a consequence of the script's node-selection logic, will download the jar to the master node. Submitted jobs can then use the pre-staged jar:
gcloud dataproc jobs submit hadoop \
    --cluster=my-dataproc-cluster \
    --region=${REGION} \
    --jar=file:///home/username/sessionalize-logs-1.0.jar
Initialization actions samples
Frequently used and other sample initialization action scripts are located in the regional public Cloud Storage buckets at gs://goog-dataproc-initialization-actions-<REGION> and in a GitHub repository.
To contribute a script, review the CONTRIBUTING.md document, and then file a pull request.
Logging
Output from the execution of each initialization action is logged for each instance in /var/log/dataproc-initialization-script-X.log, where X is the zero-based index of each successive initialization action script. For example, if your cluster has two initialization actions, the outputs will be logged in /var/log/dataproc-initialization-script-0.log and /var/log/dataproc-initialization-script-1.log.
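For example, to view the first initialization action's log on the master node, you can SSH in with a command like the following sketch, which assumes the default master VM name my-cluster-m and its zone:
```bash
# Sketch: view the first initialization action's log on the master node.
# Assumes the default master VM name my-cluster-m and its zone.
gcloud compute ssh my-cluster-m \
    --zone=us-central1-b \
    --command='sudo cat /var/log/dataproc-initialization-script-0.log'
```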
What's Next
Explore GitHub initialization actions.