Google Cloud Big Data and Machine Learning Blog

Innovation in data processing and machine learning technology

Using instance metadata in Cloud Dataproc initialization actions

Friday, July 13, 2018

By Julien Phalip, Solutions Architect

Instance metadata is a powerful feature in Google Cloud Platform’s Compute Engine. Each Compute Engine instance comes with several default metadata values that provide useful information such as the machine type, SSH keys, service accounts, tags, and zone.

Cloud Dataproc also sets special metadata values for the instances that run in its clusters, for example:

Metadata key                 Description
dataproc-bucket              Name of the cluster’s staging bucket
dataproc-region              Region of the cluster’s endpoint
dataproc-worker-count        Number of worker nodes in the cluster. The value is 0 for single node clusters.
dataproc-cluster-name        Name of the cluster
dataproc-cluster-uuid        UUID of the cluster
dataproc-role                Instance’s role, either Master or Worker
dataproc-master              Hostname of the first master node. The value is either [CLUSTER_NAME]-m in a standard or single node cluster, or [CLUSTER_NAME]-m-0 in a high-availability cluster, where [CLUSTER_NAME] is the name of your cluster.
dataproc-master-additional   Comma-separated list of hostnames for the additional master nodes in a high-availability cluster, for example [CLUSTER_NAME]-m-1,[CLUSTER_NAME]-m-2 in a cluster that has 3 master nodes.

These values can be very useful, particularly in initialization actions, the scripts that Cloud Dataproc will run on all instances immediately after you spin up a cluster. One common use case is to check the dataproc-role metadata value to execute different commands on master and worker instances, for example:

ROLE=$(/usr/share/google/get_metadata_value attributes/dataproc-role)
if [[ "${ROLE}" == 'Master' ]]; then
  ... master specific actions ...
else
  ... worker specific actions ...
fi
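The same role check can be wrapped in a function so that an unexpected or missing value fails loudly instead of silently falling through to the worker branch. The following is only a sketch: run_for_role, setup_master, setup_worker, and the GET_METADATA override variable are hypothetical names introduced for illustration, not part of Cloud Dataproc.

```shell
#!/usr/bin/env bash
# Sketch only: all function names and GET_METADATA are illustrative assumptions.
# GET_METADATA defaults to the helper used in this post; overriding it
# makes the script testable outside a cluster.
GET_METADATA="${GET_METADATA:-/usr/share/google/get_metadata_value}"

setup_master() { echo "master setup"; }  # placeholder for master specific actions
setup_worker() { echo "worker setup"; }  # placeholder for worker specific actions

run_for_role() {
  local role
  # Treat a failed lookup as an empty role rather than aborting here.
  role="$("${GET_METADATA}" attributes/dataproc-role 2>/dev/null)" || role=''
  case "${role}" in
    Master) setup_master ;;
    Worker) setup_worker ;;
    *)
      echo "unexpected dataproc-role: '${role}'" >&2
      return 1
      ;;
  esac
}
```

An initialization action built this way would end with a single call to run_for_role, so any role other than Master or Worker makes the script exit with a non-zero status.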

Another common use case is to install a daemon that does not support high-availability mode or that cannot run on multiple instances. For that use case, it is common practice to run that single daemon on the first master node:

FIRST_MASTER_HOSTNAME="$(/usr/share/google/get_metadata_value attributes/dataproc-master)"
if [[ "${HOSTNAME}" == "${FIRST_MASTER_HOSTNAME}" ]]; then
    ... first master's specific actions ...
fi
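In a high-availability cluster it can also be useful to know every master hostname, for example to write them into a client configuration. The sketch below combines dataproc-master with dataproc-master-additional. It assumes that dataproc-master-additional is missing or empty outside high-availability clusters; list_masters, is_first_master, and GET_METADATA are hypothetical names chosen for this example.

```shell
#!/usr/bin/env bash
# Sketch only: function names and GET_METADATA are illustrative assumptions.
GET_METADATA="${GET_METADATA:-/usr/share/google/get_metadata_value}"

# Print every master hostname, one per line, starting with the first master.
list_masters() {
  local first additional
  first="$("${GET_METADATA}" attributes/dataproc-master)" || return 1
  # Assumption: this key is absent or empty unless the cluster is high-availability.
  additional="$("${GET_METADATA}" attributes/dataproc-master-additional 2>/dev/null)" || additional=''
  printf '%s\n' "${first}"
  if [ -n "${additional}" ]; then
    # Split the comma-separated list into one hostname per line.
    printf '%s\n' "${additional}" | tr ',' '\n'
  fi
}

# Succeed only if the given hostname (default: this machine's) is the first master.
is_first_master() {
  local host="${1:-$(hostname)}"
  [ "${host}" = "$("${GET_METADATA}" attributes/dataproc-master)" ]
}
```

A script could then guard a single-instance daemon with `if is_first_master; then ...; fi` and loop over the output of list_masters wherever all masters are needed.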

You can also use your own custom metadata to pass parameters to your initialization actions. To do so, use the --metadata flag of the gcloud dataproc clusters create command:

gcloud dataproc clusters create [CLUSTER_NAME] \
--metadata KEY=VALUE,[KEY=VALUE,…]

For example, you could pass these parameters:

gcloud dataproc clusters create [CLUSTER_NAME] \
--metadata mysql-client-port=3306,mysql-client-user=myuser

The above command sets the metadata keys mysql-client-port and mysql-client-user on all instances when the cluster is created. Your script can then retrieve the metadata values as follows:

MYSQL_CLIENT_PORT=$(/usr/share/google/get_metadata_value attributes/mysql-client-port || echo -n '3306')
MYSQL_CLIENT_USER=$(/usr/share/google/get_metadata_value attributes/mysql-client-user || echo -n 'root')

cat << EOF > /etc/mysql/conf.d/myconfig.cnf
[client]
port = ${MYSQL_CLIENT_PORT}
user = ${MYSQL_CLIENT_USER}
protocol = tcp
EOF
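When a script reads several custom keys, the lookup-with-default pattern can be factored into a small helper, and values such as ports can be checked before they are written to a config file. This is a minimal sketch under stated assumptions: metadata_or_default, is_valid_port, and GET_METADATA are hypothetical names, and unlike the snippet above the helper also falls back to the default when the value is empty.

```shell
#!/usr/bin/env bash
# Sketch only: function names and GET_METADATA are illustrative assumptions.
GET_METADATA="${GET_METADATA:-/usr/share/google/get_metadata_value}"

# Print the value of a custom metadata key, or the default if the
# lookup fails or returns an empty string.
metadata_or_default() {
  local key="$1" default="$2" value
  value="$("${GET_METADATA}" "attributes/${key}" 2>/dev/null)" || value=''
  printf '%s\n' "${value:-${default}}"
}

# Succeed only for a decimal integer in the TCP port range 1-65535.
is_valid_port() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;  # reject empty or non-numeric input
  esac
  [ "$1" -ge 1 ] && [ "$1" -le 65535 ]
}
```

For example, `MYSQL_CLIENT_PORT="$(metadata_or_default mysql-client-port 3306)"` followed by `is_valid_port "${MYSQL_CLIENT_PORT}" || exit 1` would stop the initialization action early if someone passed a malformed port.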

For more examples, check out how the various initialization actions in the official repository make use of metadata values.

We hope this inspires you to find new ways of customizing your own initialization actions for Cloud Dataproc!
