Using instance metadata in Cloud Dataproc initialization actions

Instance metadata is a powerful feature in Google Cloud Platform’s Compute Engine. Each Compute Engine instance comes with several metadata values that are set by default to provide useful information like the machine type, SSH keys, service accounts, tags, zones, and more.

Cloud Dataproc also sets special metadata values for the instances that run in its clusters, for example:

Metadata key                  Description
dataproc-bucket               Name of the cluster’s staging bucket
dataproc-region               Region of the cluster’s endpoint
dataproc-worker-count         Number of worker nodes in the cluster. The value is 0 for single node clusters.
dataproc-cluster-name         Name of the cluster
dataproc-cluster-uuid         UUID of the cluster
dataproc-role                 Instance’s role, either Master or Worker
dataproc-master               Hostname of the first master node. The value is either [CLUSTER_NAME]-m in a standard or single node cluster, or [CLUSTER_NAME]-m-0 in a high-availability cluster, where [CLUSTER_NAME] is the name of your cluster.
dataproc-master-additional    Comma-separated list of hostnames for the additional master nodes in a high-availability cluster, for example [CLUSTER_NAME]-m-1,[CLUSTER_NAME]-m-2 in a cluster that has 3 master nodes.

These values can be very useful, particularly in initialization actions, the scripts that Cloud Dataproc will run on all instances immediately after you spin up a cluster. One common use case is to check the dataproc-role metadata value to execute different commands on master and worker instances, for example:

ROLE=$(/usr/share/google/get_metadata_value attributes/dataproc-role)
if [[ "${ROLE}" == 'Master' ]]; then
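  # ... master-specific actions ...
  :  # no-op placeholder; replace with real commands
else
  # ... worker-specific actions ...
  :  # no-op placeholder; replace with real commands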
fi
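
In the snippet above, a missing or unexpected metadata value would silently take the else branch. A more defensive variant could look like the following minimal sketch. It is not part of the original example: the function name run_role_actions and the echo placeholders are hypothetical, and the role is passed in as an argument so the logic can be tried outside a cluster.

```shell
#!/usr/bin/env bash

# Hypothetical helper: dispatch on a dataproc-role value and fail loudly on
# anything other than the two documented values, Master and Worker.
run_role_actions() {
  local role="$1"
  case "${role}" in
    Master)
      echo "running master actions"   # placeholder for real commands
      ;;
    Worker)
      echo "running worker actions"   # placeholder for real commands
      ;;
    *)
      echo "unexpected dataproc-role: '${role}'" >&2
      return 1
      ;;
  esac
}

# On a cluster node, the role would come from instance metadata, for example:
#   run_role_actions "$(/usr/share/google/get_metadata_value attributes/dataproc-role)"
```

Returning a non-zero status for an unknown role lets the calling script decide whether to abort instead of quietly treating the node as a worker.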

Another common use case is installing a daemon that does not support high-availability mode or cannot run on more than one instance. In that case, the usual practice is to run the daemon only on the first master node:

FIRST_MASTER_HOSTNAME="$(/usr/share/google/get_metadata_value attributes/dataproc-master)"
if [[ "${HOSTNAME}" == "${FIRST_MASTER_HOSTNAME}" ]]; then
    # ... actions specific to the first master ...
    :  # no-op placeholder; replace with real commands
fi
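
In a high-availability cluster, a script may also need the complete list of master hostnames, which is spread across dataproc-master and dataproc-master-additional. The following is a minimal sketch of combining the two values. The function name list_masters is hypothetical, and both values are passed in as arguments, on the assumption that they have already been read from metadata.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print every master hostname, one per line.
#   $1: value of dataproc-master (first master)
#   $2: value of dataproc-master-additional (comma-separated, may be empty)
list_masters() {
  local first="$1"
  local additional="${2:-}"
  local -a hosts=("${first}")

  if [[ -n "${additional}" ]]; then
    local -a extra
    # Split the comma-separated list into an array; IFS is changed only
    # for this single read command.
    IFS=',' read -r -a extra <<< "${additional}"
    hosts+=("${extra[@]}")
  fi

  printf '%s\n' "${hosts[@]}"
}

# On a cluster node, the arguments would come from instance metadata, for example:
#   list_masters \
#     "$(/usr/share/google/get_metadata_value attributes/dataproc-master)" \
#     "$(/usr/share/google/get_metadata_value attributes/dataproc-master-additional || true)"
```

In a standard or single node cluster the second argument is empty, so the function prints only the first master.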

You can also use your own custom metadata as a way of passing parameters to your initialization actions. This can be done using the --metadata parameter to the gcloud dataproc clusters create command:

gcloud dataproc clusters create [CLUSTER_NAME] \
--metadata KEY=VALUE,[KEY=VALUE,…]

For example, you could pass these parameters:

gcloud dataproc clusters create [CLUSTER_NAME] \
--metadata mysql-client-port=3306,mysql-client-user=myuser \
--initialization-actions=gs://[YOUR_BUCKET]/my-init-action.sh

The above command sets the metadata keys mysql-client-port and mysql-client-user on all instances when the cluster is created. Your my-init-action.sh script can then retrieve the metadata values as follows:

MYSQL_CLIENT_PORT=$(/usr/share/google/get_metadata_value attributes/mysql-client-port || echo -n '3306')

MYSQL_CLIENT_USER=$(/usr/share/google/get_metadata_value attributes/mysql-client-user || echo -n 'root')

cat << EOF > /etc/mysql/conf.d/myconfig.cnf
[client]
protocol = tcp
port = $MYSQL_CLIENT_PORT
user = $MYSQL_CLIENT_USER
EOF
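
When a script reads several custom keys, the "value or default" pattern above can be wrapped in a small function, and values can be checked before they are written to a config file. The following is a minimal sketch under stated assumptions: get_attribute, is_valid_port, and the METADATA_HELPER variable are hypothetical names introduced here, and, unlike the one-liners above, the default is also used when the helper succeeds but returns an empty value.

```shell
#!/usr/bin/env bash

# Command used to read metadata. Defaults to the helper used in this post;
# overridable (an assumption of this sketch) so the logic can be tested locally.
METADATA_HELPER="${METADATA_HELPER:-/usr/share/google/get_metadata_value}"

# Hypothetical helper: print the custom metadata value for key $1, or the
# default $2 if the lookup fails or returns an empty string.
get_attribute() {
  local key="$1"
  local default="${2:-}"
  local value
  if value="$("${METADATA_HELPER}" "attributes/${key}" 2>/dev/null)" \
      && [[ -n "${value}" ]]; then
    printf '%s' "${value}"
  else
    printf '%s' "${default}"
  fi
}

# Hypothetical helper: succeed only if $1 is an integer between 1 and 65535.
is_valid_port() {
  local port="$1"
  [[ "${port}" =~ ^[0-9]{1,5}$ ]] && (( 10#${port} >= 1 && 10#${port} <= 65535 ))
}

# Possible usage inside an initialization action:
#   MYSQL_CLIENT_PORT="$(get_attribute mysql-client-port 3306)"
#   is_valid_port "${MYSQL_CLIENT_PORT}" || { echo "invalid port" >&2; exit 1; }
```

Validating the port before generating the config file makes a mistyped --metadata value fail at cluster creation time, not later when a client tries to connect.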

For more examples, check out how the various initialization actions in the official repository make use of metadata values.

We hope this inspires you to find new ways of customizing your own initialization actions for Cloud Dataproc!