Cluster properties

The open source components installed on Cloud Dataproc clusters contain many configuration files. For example, Apache Spark and Apache Hadoop have several XML and plain text configuration files. From time to time, you may need to update or add to these configuration files. You can use the --properties flag of the gcloud dataproc clusters create command in the Cloud SDK to modify many common configuration files when creating a cluster.

How the properties flag works

To make updating files and properties easy, the gcloud dataproc clusters create --properties flag uses a special format to specify the configuration file and, within that file, the property and value to update.

Formatting

The --properties flag requires a string of text in the following format:

file_prefix1:property1=value1,file_prefix2:property2=value2,...
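
For example, the following flag value (the specific property names here are only illustrative) sets one property in spark-defaults.conf and one in hdfs-site.xml:

--properties 'spark:spark.executor.cores=2,hdfs:dfs.replication=1'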

Notes:

  • The --properties flag is used to modify a specific set of commonly used configuration files. The file_prefix maps to a predefined set of configuration files.

  • The default delimiter used to separate multiple properties is the comma (,). However, if a comma is included in a property value, you must change the delimiter by specifying "^DELIMITER^" at the beginning of the property list.

    Example using a "#" delimiter:

    --properties ^#^file_prefix1:property1=part1,part2#file_prefix2:property2=value2
    
    See gcloud topic escaping for more information.

file_prefix           File                      Purpose of file
capacity-scheduler    capacity-scheduler.xml    Hadoop YARN Capacity Scheduler configuration
core                  core-site.xml             Hadoop general configuration
distcp                distcp-default.xml        Hadoop Distributed Copy configuration
hadoop-env            hadoop-env.sh             Hadoop-specific environment variables
hdfs                  hdfs-site.xml             Hadoop HDFS configuration
hive                  hive-site.xml             Hive configuration
mapred                mapred-site.xml           Hadoop MapReduce configuration
mapred-env            mapred-env.sh             Hadoop MapReduce-specific environment variables
pig                   pig.properties            Pig configuration
presto                config.properties         Presto configuration
presto-jvm            jvm.config                Presto-specific JVM configuration
spark                 spark-defaults.conf       Spark configuration
spark-env             spark-env.sh              Spark-specific environment variables
yarn                  yarn-site.xml             Hadoop YARN configuration
yarn-env              yarn-env.sh               Hadoop YARN-specific environment variables
zeppelin              zeppelin-site.xml         Zeppelin configuration
zeppelin-env          zeppelin-env.sh           Zeppelin-specific environment variables (Optional Component only)
zookeeper             zoo.cfg                   Zookeeper configuration
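
For example, based on the table above, a flag such as the following (the YARN property name is only illustrative) sets yarn.nodemanager.resource.memory-mb in yarn-site.xml:

--properties 'yarn:yarn.nodemanager.resource.memory-mb=4096'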

Important notes

  • Some properties are reserved and cannot be overridden because they impact the functionality of the Cloud Dataproc cluster. If you try to change a reserved property, you will receive an error message when creating your cluster.
  • You can specify multiple changes by separating each with a comma.
  • The --properties flag cannot modify configuration files not shown above.
  • Changing properties when creating clusters in the Google Cloud Platform Console is currently not supported.
  • Changes to properties will be applied before the daemons on your cluster start.
  • If the specified property exists, it will be updated. If the specified property does not exist, it will be added to the configuration file.

Cloud Dataproc service properties

These are additional properties specific to Cloud Dataproc that are not included in the files listed above. These properties can be used to further configure the functionality of your Cloud Dataproc cluster. Note: The following cluster properties are specified at cluster creation. They cannot be specified or updated after cluster creation.

Each entry below lists the property, its allowed values, and its function.

  • dataproc:am.primary_only (true or false): Set this property to true to prevent the application master from running on Cloud Dataproc cluster preemptible workers. Note: This feature is only available with Cloud Dataproc 1.2 and higher. The default value is false.
  • dataproc:dataproc.allow.zero.workers (true or false): Set this SoftwareConfig property to true in a Cloud Dataproc clusters.create API request to create a Single node cluster, which changes the default number of workers from 2 to 0 and places worker components on the master host. A Single node cluster can also be created from the GCP Console or with the gcloud command-line tool by setting the number of workers to 0.
  • dataproc:dataproc.conscrypt.provider.enable (true or false): Enables (true) or disables (false) Conscrypt as the primary Java security provider. Note: Conscrypt is enabled by default in Dataproc 1.2 and higher, but disabled in 1.0/1.1.
  • dataproc:dataproc.localssd.mount.enable (true or false): Whether to mount local SSDs as Hadoop/Spark temp directories and HDFS data directories (default: true).
  • dataproc:dataproc.logging.stackdriver.enable (true or false): Enables (true) or disables (false) logging to Stackdriver (default: true).
  • dataproc:dataproc.logging.stackdriver.job.driver.enable (true or false): Enables (true) or disables (false) Cloud Dataproc job driver logs in Logging (default: false).
  • dataproc:dataproc.logging.stackdriver.job.yarn.container.enable (true or false): Enables (true) or disables (false) YARN container logs in Logging (default: false).
  • dataproc:dataproc.monitoring.stackdriver.enable (true or false): Enables (true) or disables (false) the Stackdriver Monitoring Agent.
  • dataproc:jobs.file-backed-output.enable (true or false): Configures Cloud Dataproc jobs to pipe their output to temporary files in the /var/log/google-dataproc-job directory. Must be set to true to enable job driver logging in Logging (default: true).
  • dataproc:jupyter.notebook.gcs.dir (gs://<dir-path>): Location in Cloud Storage to save Jupyter notebooks.
  • dataproc:kerberos.cross-realm-trust.admin-server (hostname/address): Hostname/address of the remote admin server (often the same as the KDC server).
  • dataproc:kerberos.cross-realm-trust.kdc (hostname/address): Hostname/address of the remote KDC.
  • dataproc:kerberos.cross-realm-trust.realm (realm name): Realm names can consist of any UPPERCASE ASCII string. Usually, the realm name is the same as your DNS domain name (in UPPERCASE). Example: If machines are named "machine-id.example.west-coast.mycompany.com", the associated realm may be designated as "EXAMPLE.WEST-COAST.MYCOMPANY.COM".
  • dataproc:kerberos.cross-realm-trust.shared-password.uri (gs://<dir-path>): Location in Cloud Storage of the KMS-encrypted shared password.
  • dataproc:kerberos.kdc.db.key.uri (gs://<dir-path>): Location in Cloud Storage of the KMS-encrypted file containing the KDC database master key.
  • dataproc:kerberos.key.password.uri (gs://<dir-path>): Location in Cloud Storage of the KMS-encrypted file that contains the password of the key in the keystore file.
  • dataproc:kerberos.keystore.password.uri (gs://<dir-path>): Location in Cloud Storage of the KMS-encrypted file containing the keystore password.
  • dataproc:kerberos.keystore.uri (gs://<dir-path>): Location in Cloud Storage of the keystore file containing the wildcard certificate and the private key used by cluster nodes (see note 1 below).
  • dataproc:kerberos.kms.key.uri (KMS key URI): The URI of the KMS key used to decrypt the root password, for example projects/project-id/locations/region/keyRings/key-ring/cryptoKeys/key (see Key resource ID).
  • dataproc:kerberos.root.principal.password.uri (gs://<dir-path>): Location in Cloud Storage of the KMS-encrypted password for the Kerberos root principal.
  • dataproc:kerberos.tgt.lifetime.hours (hours): Maximum lifetime of the ticket-granting ticket.
  • dataproc:kerberos.truststore.password.uri (gs://<dir-path>): Location in Cloud Storage of the KMS-encrypted file that contains the password to the truststore file.
  • dataproc:kerberos.truststore.uri (gs://<dir-path>): Location in Cloud Storage of the KMS-encrypted truststore file containing trusted certificates (see note 2 below).
  • zeppelin:zeppelin.notebook.gcs.dir (gs://<dir-path>): Location in Cloud Storage to save Zeppelin notebooks.

Note 1 (keystore file): The keystore file contains the SSL certificate. It should be in Java KeyStore (JKS) format. When copied to VMs, it is renamed to keystore.jks. The SSL certificate should be a wildcard certificate that applies to each node in the cluster.

Note 2 (truststore file): The truststore file should be in Java KeyStore (JKS) format. When copied to VMs, it is renamed to truststore.jks.
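
Cloud Dataproc service properties are supplied with the same --properties flag as the file-based properties above. For example, to enable job driver logging in Logging at cluster creation, a flag along the following lines could be used:

--properties 'dataproc:dataproc.logging.stackdriver.job.driver.enable=true'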

Examples

gcloud command

To change the spark.master setting in the spark-defaults.conf file, add the following properties flag when creating a new cluster on the command line:
--properties 'spark:spark.master=spark://example.com'
You can change several properties at once, in one or more configuration files, by using a comma separator. Each property must be specified in the full file_prefix:property=value format. For example, to change the spark.master setting in the spark-defaults.conf file and the dfs.hosts setting in the hdfs-site.xml file, you can use the following flag when creating a cluster:
--properties 'spark:spark.master=spark://example.com,hdfs:dfs.hosts=/foo/bar/baz'
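
These flags are passed to the gcloud dataproc clusters create command. A complete invocation might look like the following sketch (the cluster name and region are placeholders):

gcloud dataproc clusters create example-cluster \
    --region=us-central1 \
    --properties='spark:spark.master=spark://example.com,hdfs:dfs.hosts=/foo/bar/baz'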

REST API

To set spark.executor.memory to 10g, insert the following properties setting in the SoftwareConfig section of your clusters.create request:
"properties": {
  "spark:spark.executor.memory": "10g"
}
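
For context, the "properties" map sits inside the softwareConfig block of the cluster config in the clusters.create request body. A minimal sketch of such a request body (the project and cluster names are placeholders) might look like:

{
  "projectId": "my-project",
  "clusterName": "example-cluster",
  "config": {
    "softwareConfig": {
      "properties": {
        "spark:spark.executor.memory": "10g"
      }
    }
  }
}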

Console

Currently, adding cluster properties from the Cloud Dataproc Create a cluster GCP Console page is not supported.