Cluster properties

The open source components installed on Dataproc clusters contain many configuration files. For example, Apache Spark and Apache Hadoop have several XML and plain text configuration files. You can use the --properties flag of the gcloud dataproc clusters create command to modify many common configuration files when creating a cluster.

Formatting

The gcloud dataproc clusters create --properties flag accepts the following string format:

file_prefix1:property1=value1,file_prefix2:property2=value2,...
  • The file_prefix maps to a predefined configuration file as shown in the table below, and the property maps to a property within the file.

  • The default delimiter used to separate multiple cluster properties is the comma (,). However, if a comma is included in a property value, you must change the delimiter by specifying "^delimiter^" at the beginning of the property list (see gcloud topic escaping for more information). A fuller escaping example follows this list.

    • Example using a "#" delimiter:
      --properties ^#^file_prefix1:property1=part1,part2#file_prefix2:property2=value2
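
As an illustration only (the spark.jars property and the bucket paths below are placeholders chosen for this sketch, not values from the examples above), a property whose value is itself a comma-separated list can be passed by switching the delimiter:

--properties '^#^spark:spark.jars=gs://my-bucket/lib-a.jar,gs://my-bucket/lib-b.jar#spark:spark.master=yarn'

Because the delimiter is now "#", the commas inside the spark.jars value are passed through unchanged.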
      

Examples

gcloud command

To change the spark.master setting in the spark-defaults.conf file, add the following gcloud dataproc clusters create --properties flag:

--properties 'spark:spark.master=spark://example.com'

You can change several properties at once, in one or more configuration files, by using a comma separator. Each property must be specified in the full file_prefix:property=value format. For example, to change the spark.master setting in the spark-defaults.conf file and the dfs.hosts setting in the hdfs-site.xml file, use the following --properties flag when creating a cluster:

--properties 'spark:spark.master=spark://example.com,hdfs:dfs.hosts=/foo/bar/baz'
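
For reference, the first property above lands in the Spark defaults file on the cluster nodes. Assuming the standard Dataproc layout, /etc/spark/conf/spark-defaults.conf would then contain a line similar to:

spark.master spark://example.com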

REST API

To set spark.executor.memory to 10g, insert the following properties setting in the SoftwareConfig section of your clusters.create request:

"properties": {
  "spark:spark.executor.memory": "10g"
}
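
For added context, here is a minimal, abbreviated sketch of where that block sits inside a clusters.create request body; the cluster name is a placeholder and most other config fields are omitted:

{
  "clusterName": "my-cluster",
  "config": {
    "softwareConfig": {
      "properties": {
        "spark:spark.executor.memory": "10g"
      }
    }
  }
}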

An easy way to see how to construct the JSON body of a Dataproc API clusters REST request is to run the equivalent gcloud command with the --log-http flag. Here is a sample gcloud dataproc clusters create command that sets a cluster property with the --properties spark:spark.executor.memory=10g flag. The stdout log shows the resulting REST request body (the properties snippet is shown below):

gcloud dataproc clusters create my-cluster \
    --region=region \
    --properties=spark:spark.executor.memory=10g \
    --log-http \
    other args ...

Output:

...
== body start ==
{"clusterName": "my-cluster", "config": {"gceClusterConfig": ...
"masterConfig": {... "softwareConfig": {"properties": {"spark:spark.executor.memory": "10g"}},

... == body end == ...

If you do not want the command to take effect, cancel it after the JSON body appears in the output.
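
If you save the captured JSON body to a file, one way to submit it yourself is with curl against the regional Dataproc endpoint (the project ID, region, and file name below are placeholders):

curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    -d @cluster.json \
    "https://dataproc.googleapis.com/v1/projects/project-id/regions/region/clusters"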

Console

To change the spark.master setting in the spark-defaults.conf file:

  1. In the Cloud Console, open the Dataproc Create a cluster page. Click the Customize cluster panel, then scroll to the Cluster properties section.

  2. Click + ADD PROPERTIES. Select spark in the Prefix list, then add "spark.master" in the Key field and its value in the Value field.

Cluster vs. Job Properties

The Apache Hadoop YARN, HDFS, Spark, and other file-prefixed properties are applied at the cluster level when you create a cluster. Many of these properties can also be applied to specific jobs. When applying a property to a job, the file prefix is not used.

Example:

Set Spark executor memory to 4g for a Spark job (spark: prefix omitted).

gcloud dataproc jobs submit spark \
    --region=region \
    --properties=spark.executor.memory=4g \
    ... other args ...
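
For comparison, the same memory setting applied cluster-wide at creation time keeps the spark: file prefix (the cluster name is a placeholder):

gcloud dataproc clusters create my-cluster \
    --region=region \
    --properties=spark:spark.executor.memory=4g \
    ... other args ...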

File-prefixed properties table

File prefix | File | File purpose
capacity-scheduler | capacity-scheduler.xml | Hadoop YARN Capacity Scheduler configuration
core | core-site.xml | Hadoop general configuration
distcp | distcp-default.xml | Hadoop Distributed Copy configuration
hadoop-env | hadoop-env.sh | Hadoop specific environment variables
hbase | hbase-site.xml | HBase configuration
hdfs | hdfs-site.xml | Hadoop HDFS configuration
hive | hive-site.xml | Hive configuration
mapred | mapred-site.xml | Hadoop MapReduce configuration
mapred-env | mapred-env.sh | Hadoop MapReduce specific environment variables
pig | pig.properties | Pig configuration
presto | config.properties | Presto configuration
presto-jvm | jvm.config | Presto specific JVM configuration
spark | spark-defaults.conf | Spark configuration
spark-env | spark-env.sh | Spark specific environment variables
yarn | yarn-site.xml | Hadoop YARN configuration
yarn-env | yarn-env.sh | Hadoop YARN specific environment variables
zeppelin | zeppelin-site.xml | Zeppelin configuration
zeppelin-env | zeppelin-env.sh | Zeppelin specific environment variables (Optional Component only)
zookeeper | zoo.cfg | Zookeeper configuration
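
As an illustrative sketch only (the two settings below are ordinary Hadoop and YARN properties chosen for the example, not recommendations), several prefixes from the table can be combined in a single flag:

--properties 'core:hadoop.tmp.dir=/mnt/tmp,yarn:yarn.nodemanager.resource.memory-mb=12288'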

Notes

  • Some properties are reserved and cannot be overridden because they impact the functionality of the Dataproc cluster. If you try to change a reserved property, you will receive an error message when creating your cluster.
  • You can specify multiple changes by separating each with a comma.
  • The --properties flag cannot modify configuration files not shown above.
  • Changes to properties will be applied before the daemons on your cluster start.
  • If the specified property exists, it will be updated. If the specified property does not exist, it will be added to the configuration file, as illustrated in the sketch below.
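
For example, a property added with hdfs:dfs.hosts=/foo/bar/baz that is not already present would be appended to hdfs-site.xml as a new element along these lines (a sketch of the standard Hadoop site-file format):

<property>
  <name>dfs.hosts</name>
  <value>/foo/bar/baz</value>
</property>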

Dataproc service properties

The properties listed below are specific to Dataproc. These properties can be used to further configure the functionality of your Dataproc cluster. They must be specified at cluster creation and cannot be specified or updated after the cluster is created.

Formatting

The gcloud dataproc clusters create --properties flag accepts the following string format:

property_prefix1:property1=value1,property_prefix2:property2=value2,...
  • The default delimiter used to separate multiple cluster properties is the comma (,). However, if a comma is included in a property value, you must change the delimiter by specifying "^delimiter^" at the beginning of the property list (see gcloud topic escaping for more information).

    • Example using a "#" delimiter:
      --properties ^#^property_prefix1:property1=part1,part2#property_prefix2:property2=value2
      

Example:

Create a cluster and set Enhanced Flexibility Mode to Spark primary worker shuffle.

gcloud dataproc clusters create cluster-name \
    --region=region \
    --properties=dataproc:efm.spark.shuffle=primary-worker \
    ... other args ...
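
As a further illustration, the dataproc:dataproc.allow.zero.workers property described in the table below can be set the same way to create a Single node cluster (the cluster name is a placeholder):

gcloud dataproc clusters create my-cluster \
    --region=region \
    --properties=dataproc:dataproc.allow.zero.workers=true \
    ... other args ...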

Dataproc service properties table

Property prefix | Property | Values | Description
dataproc | am.primary_only | true or false | Set this property to true to prevent the application master from running on Dataproc cluster preemptible workers. Note: This feature is only available with Dataproc 1.2 and higher. The default value is false.
dataproc | dataproc.allow.zero.workers | true or false | Set this SoftwareConfig property to true in a Dataproc clusters.create API request to create a Single node cluster, which changes the default number of workers from 2 to 0 and places worker components on the master host. A Single node cluster can also be created from the Cloud Console or with the gcloud command-line tool by setting the number of workers to 0.
dataproc | dataproc.alpha.master.nvdimm.size.gb | 1500-6500 | Setting a value creates a Dataproc master with Intel Optane DC Persistent Memory. Note: Optane VMs can only be created in us-central1-f zones, only with the n1-highmem-96-aep machine type, and only under whitelisted projects.
dataproc | dataproc.alpha.worker.nvdimm.size.gb | 1500-6500 | Setting a value creates a Dataproc worker with Intel Optane DC Persistent Memory. Note: Optane VMs can only be created in us-central1-f zones, only with the n1-highmem-96-aep machine type, and only under whitelisted projects.
dataproc | dataproc.conscrypt.provider.enable | true or false | Enables (true) or disables (false) Conscrypt as the primary Java security provider. Note: Conscrypt is enabled by default in Dataproc 1.2 and higher, but disabled in 1.0/1.1.
dataproc | dataproc.cooperative.multi-tenancy.user.mapping | user-to-service account mappings | This property takes a list of comma-separated user-to-service account mappings. If a cluster is created with this property set, when a user submits a job, the cluster will attempt to impersonate the corresponding service account when accessing Cloud Storage through the Cloud Storage connector. This feature requires Cloud Storage connector version 2.1.4 or higher (default: empty).
dataproc | dataproc.localssd.mount.enable | true or false | Whether to mount local SSDs as Hadoop/Spark temp directories and HDFS data directories (default: true).
dataproc | dataproc.logging.stackdriver.enable | true or false | Enables (true) or disables (false) Logging (default: true).
dataproc | dataproc.logging.stackdriver.job.driver.enable | true or false | Enables (true) or disables (false) Dataproc job driver logs in Logging (default: false).
dataproc | dataproc.logging.stackdriver.job.yarn.container.enable | true or false | Enables (true) or disables (false) YARN container logs in Logging (default: false).
dataproc | dataproc.monitoring.stackdriver.enable | true or false | Enables (true) or disables (false) the Monitoring Agent.
dataproc | dataproc.scheduler.driver-size-mb | number | The average driver memory footprint, which determines the maximum number of concurrent jobs a cluster will run. The default value is 1 GB; a smaller value, such as 256, may be appropriate for Spark jobs.
dataproc | dataproc.worker.custom.init.actions.mode | RUN_BEFORE_SERVICES | (default: not enabled). If specified, during cluster creation, when a primary worker VM is first booted, its initialization actions will be run before the node manager and datanode daemons are started. See Initialization Actions: Important considerations and guidelines.
dataproc | efm.mapreduce.shuffle | hcfs | If set to hcfs, MapReduce shuffle data is preserved in HDFS. See Dataproc Enhanced Flexibility Mode for more information. Note: Currently, this feature is only available with Dataproc 1.4 and 1.5.
dataproc | efm.spark.shuffle | primary-worker or hcfs | If set to primary-worker, mappers write data to primary workers (available to, and recommended for, Spark jobs). If set to hcfs, Spark shuffle data is preserved in HDFS. See Dataproc Enhanced Flexibility Mode for more information. Note: Currently, this feature is only available with Dataproc 1.4 and 1.5.
dataproc | job.history.to-gcs.enabled | true or false | Allows persisting MapReduce and Spark history files to the Dataproc temp bucket (default: true for image versions 1.5+). Users can override the job history file persistence locations through the following properties: mapreduce.jobhistory.done-dir, mapreduce.jobhistory.intermediate-done-dir, spark.eventLog.dir, and spark.history.fs.logDirectory.
dataproc | jobs.file-backed-output.enable | true or false | Configures Dataproc jobs to pipe their output to temporary files in the /var/log/google-dataproc-job directory. Must be set to true to enable job driver logging in Logging (default: true).
dataproc | jupyter.listen.all.interfaces | true or false | To reduce the risk of remote code execution over unsecured notebook server APIs, the default setting for image versions 1.3+ is false, which restricts connections to localhost (127.0.0.1) when Component Gateway is enabled. This default setting can be overridden by setting this property to true to allow all connections.
dataproc | jupyter.notebook.gcs.dir | gs://<dir-path> | Location in Cloud Storage to save Jupyter notebooks.
dataproc | kerberos.beta.automatic-config.enable | true or false | When set to true, users do not need to specify the Kerberos root principal password with the --kerberos-root-principal-password and --kerberos-kms-key-uri flags (default: false). See Enabling Hadoop Secure Mode via Kerberos for more information.
dataproc | kerberos.cross-realm-trust.admin-server | hostname/address | Hostname/address of the remote admin server (often the same as the KDC server).
dataproc | kerberos.cross-realm-trust.kdc | hostname/address | Hostname/address of the remote KDC.
dataproc | kerberos.cross-realm-trust.realm | realm name | Realm names can consist of any UPPERCASE ASCII string. Usually, the realm name is the same as your DNS domain name (in UPPERCASE). Example: If machines are named "machine-id.example.west-coast.mycompany.com", the associated realm may be designated as "EXAMPLE.WEST-COAST.MYCOMPANY.COM".
dataproc | kerberos.cross-realm-trust.shared-password.uri | gs://<dir-path> | Location in Cloud Storage of the KMS-encrypted shared password.
dataproc | kerberos.kdc.db.key.uri | gs://<dir-path> | Location in Cloud Storage of the KMS-encrypted file containing the KDC database master key.
dataproc | kerberos.key.password.uri | gs://<dir-path> | Location in Cloud Storage of the KMS-encrypted file that contains the password of the key in the keystore file.
dataproc | kerberos.keystore.password.uri | gs://<dir-path> | Location in Cloud Storage of the KMS-encrypted file containing the keystore password.
dataproc | kerberos.keystore.uri¹ | gs://<dir-path> | Location in Cloud Storage of the keystore file containing the wildcard certificate and the private key used by cluster nodes.
dataproc | kerberos.kms.key.uri | KMS key URI | The URI of the KMS key used to decrypt the root password, for example projects/project-id/locations/region/keyRings/key-ring/cryptoKeys/key (see Key resource ID).
dataproc | kerberos.root.principal.password.uri | gs://<dir-path> | Location in Cloud Storage of the KMS-encrypted password for the Kerberos root principal.
dataproc | kerberos.tgt.lifetime.hours | hours | Maximum lifetime of the ticket-granting ticket.
dataproc | kerberos.truststore.password.uri | gs://<dir-path> | Location in Cloud Storage of the KMS-encrypted file that contains the password to the truststore file.
dataproc | kerberos.truststore.uri² | gs://<dir-path> | Location in Cloud Storage of the KMS-encrypted trust store file containing trusted certificates.
dataproc | ranger.kms.key.uri | KMS key URI | The URI of the KMS key used to decrypt the Ranger admin user password, for example projects/project-id/locations/region/keyRings/key-ring/cryptoKeys/key (see Key resource ID).
dataproc | ranger.admin.password.uri | gs://<dir-path> | Location in Cloud Storage of the KMS-encrypted password for the Ranger admin user.
dataproc | ranger.db.admin.password.uri | gs://<dir-path> | Location in Cloud Storage of the KMS-encrypted password for the Ranger database admin user.
dataproc | ranger.cloud-sql.instance.connection.name | cloud sql instance connection name | The connection name of the Cloud SQL instance, for example project-id:region:name.
dataproc | ranger.cloud-sql.root.password.uri | gs://<dir-path> | Location in Cloud Storage of the KMS-encrypted password for the root user of the Cloud SQL instance.
dataproc | ranger.cloud-sql.use-private-ip | true or false | Whether the communication between cluster instances and the Cloud SQL instance should be over private IP (default: false).
dataproc | solr.gcs.path | gs://<dir-path> | Cloud Storage path to act as the Solr home directory.
dataproc | startup.component.service-binding-timeout.hadoop-hdfs-namenode | seconds | The amount of time the Dataproc startup script will wait for hadoop-hdfs-namenode to bind to ports before deciding that its startup has succeeded. The maximum recognized value is 1800 seconds (30 minutes).
dataproc | startup.component.service-binding-timeout.hive-metastore | seconds | The amount of time the Dataproc startup script will wait for the hive-metastore service to bind to ports before deciding that its startup has succeeded. The maximum recognized value is 1800 seconds (30 minutes).
dataproc | startup.component.service-binding-timeout.hive-server2 | seconds | The amount of time the Dataproc startup script will wait for hive-server2 to bind to ports before deciding that its startup has succeeded. The maximum recognized value is 1800 seconds (30 minutes).
dataproc | yarn.log-aggregation.enabled | true or false | Allows (true) turning on YARN log aggregation to a Dataproc temp bucket. The bucket name is of the following form: dataproc-temp-<REGION>-<PROJECT_NUMBER>-<RANDOM_STRING> (default: true for image versions 1.5+). Users can also set the location of aggregated YARN logs by overriding the yarn.nodemanager.remote-app-log-dir YARN property.
knox | gateway.host | ip address | To reduce the risk of remote code execution over unsecured notebook server APIs, the default setting for image versions 1.3+ is 127.0.0.1, which restricts connections to localhost when Component Gateway is enabled. The default setting can be overridden, for example by setting this property to 0.0.0.0 to allow all connections.
zeppelin | zeppelin.notebook.gcs.dir | gs://<dir-path> | Location in Cloud Storage to save Zeppelin notebooks.
zeppelin | zeppelin.server.addr | ip address | To reduce the risk of remote code execution over unsecured notebook server APIs, the default setting for image versions 1.3+ is 127.0.0.1, which restricts connections to localhost when Component Gateway is enabled. This default setting can be overridden, for example by setting this property to 0.0.0.0 to allow all connections.

¹ Keystore file: The keystore file contains the SSL certificate. It should be in Java KeyStore (JKS) format. When copied to VMs, it is renamed to keystore.jks. The SSL certificate should be a wildcard certificate that applies to each node in the cluster.

² Truststore file: The truststore file should be in Java KeyStore (JKS) format. When copied to VMs, it is renamed to truststore.jks.
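
For readers creating these files, both stores must be in JKS format as noted above. A minimal sketch using the standard JDK keytool (the alias, distinguished name, and validity are placeholders; keytool prompts for the store and key passwords):

keytool -genkeypair -keystore keystore.jks -storetype JKS -alias cluster-cert \
    -keyalg RSA -keysize 2048 -validity 365 \
    -dname "CN=*.example.internal"

The wildcard certificate in the resulting keystore should cover every node in the cluster, and the passwords you choose should match the values referenced by kerberos.keystore.password.uri and kerberos.key.password.uri.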