The open source components installed on Dataproc clusters contain many
configuration files. For example, Apache Spark and Apache Hadoop have several XML
and plain text configuration files. From time to time, you may need to update or
add to these configuration files. You can use the --properties flag of the
gcloud dataproc clusters create command in the Cloud SDK to modify many common
configuration files when creating a cluster.
How the properties flag works
To make updating files and properties easy, the gcloud dataproc clusters create --properties flag uses a special format to specify the configuration file and the property and value within that file that should be updated.
Formatting
The --properties flag requires a string of text in the following format:
file_prefix1:property1=value1,file_prefix2:property2=value2,...
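For example, a single string that sets one Spark property and one YARN property (the values shown here are illustrative placeholders, not recommendations) would look like this:
spark:spark.executor.memory=4g,yarn:yarn.nodemanager.resource.memory-mb=8192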
Notes:
- The --properties flag is used to modify a specific set of commonly used configuration files. The file_prefix maps to a predefined set of configuration files.
- The default delimiter used to separate multiple properties is the comma (,). However, if a comma is included in a property value, you must change the delimiter by specifying "^DELIMITER^" at the beginning of the property list.
Example using a "#" delimiter:
--properties ^#^file_prefix1:property1=part1,part2#file_prefix2:property2=value2
See gcloud topic escaping for more information.
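As a sketch, a complete cluster creation command that uses the "#" delimiter might look like the following, where the cluster name and the queue names in the comma-containing value are placeholders:
gcloud dataproc clusters create example-cluster \
    --properties '^#^capacity-scheduler:yarn.scheduler.capacity.root.queues=default,alpha,beta#spark:spark.master=spark://example.com'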
Apache Hadoop YARN, HDFS, Spark, and related properties
file_prefix | File | Purpose of file |
---|---|---|
capacity-scheduler | capacity-scheduler.xml | Hadoop YARN Capacity Scheduler configuration |
core | core-site.xml | Hadoop general configuration |
distcp | distcp-default.xml | Hadoop Distributed Copy configuration |
hadoop-env | hadoop-env.sh | Hadoop specific environment variables |
hdfs | hdfs-site.xml | Hadoop HDFS configuration |
hive | hive-site.xml | Hive configuration |
mapred | mapred-site.xml | Hadoop MapReduce configuration |
mapred-env | mapred-env.sh | Hadoop MapReduce specific environment variables |
pig | pig.properties | Pig configuration |
presto | config.properties | Presto configuration |
presto-jvm | jvm.config | Presto specific JVM configuration |
spark | spark-defaults.conf | Spark configuration |
spark-env | spark-env.sh | Spark specific environment variables |
yarn | yarn-site.xml | Hadoop YARN configuration |
yarn-env | yarn-env.sh | Hadoop YARN specific environment variables |
zeppelin | zeppelin-site.xml | Zeppelin configuration |
zeppelin-env | zeppelin-env.sh | Zeppelin specific environment variables (Optional Component only) |
zookeeper | zoo.cfg | Zookeeper configuration |
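A single flag can update several of the files above at once. For instance, the following sketch (with placeholder values) changes core-site.xml, hdfs-site.xml, and spark-defaults.conf on a hypothetical cluster:
gcloud dataproc clusters create example-cluster \
    --properties 'core:io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec,hdfs:dfs.replication=1,spark:spark.executor.memory=4g'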
Important notes
- Some properties are reserved and cannot be overridden because they impact the functionality of the Dataproc cluster. If you try to change a reserved property, you will receive an error message when creating your cluster.
- You can specify multiple changes by separating each with a comma.
- The --properties flag cannot modify configuration files not shown above.
- Changes to properties will be applied before the daemons on your cluster start.
- If the specified property exists, it will be updated. If the specified property does not exist, it will be added to the configuration file.
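As an illustration of the last point, the following flag sets hive.exec.dynamic.partition in hive-site.xml (the value is only an example): if the property is already present in the file its value is updated, otherwise the property is added:
--properties 'hive:hive.exec.dynamic.partition=true'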
Dataproc service properties
These are additional properties specific to Dataproc that are not included in the files listed above. These properties can be used to further configure the functionality of your Dataproc cluster. Note: The following cluster properties are specified at cluster creation. They cannot be specified or updated after cluster creation.
Property | Values | Function |
---|---|---|
dataproc:am.primary_only | true or false | Set this property to true to prevent the application master from running on Cloud Dataproc cluster preemptible workers. Note: This feature is only available with Cloud Dataproc 1.2 and higher. The default value is false. |
dataproc:dataproc.allow.zero.workers | true or false | Set this SoftwareConfig property to true in a Dataproc clusters.create API request to create a Single node cluster, which changes the default number of workers from 2 to 0 and places worker components on the master host. A Single node cluster can also be created from the Cloud Console or with the gcloud command-line tool by setting the number of workers to 0. |
dataproc:dataproc.alpha.master.nvdimm.size.gb | 1500-6500 | Setting a value creates a Dataproc master with Intel Optane DC Persistent Memory. Note: Optane VMs can only be created in us-central1-f zones, only with the n1-highmem-96-aep machine type, and only under whitelisted projects. |
dataproc:dataproc.alpha.worker.nvdimm.size.gb | 1500-6500 | Setting a value creates a Dataproc worker with Intel Optane DC Persistent Memory. Note: Optane VMs can only be created in us-central1-f zones, only with the n1-highmem-96-aep machine type, and only under whitelisted projects. |
dataproc:dataproc.conscrypt.provider.enable | true or false | Enables (true) or disables (false) Conscrypt as the primary Java security provider. Note: Conscrypt is enabled by default in Dataproc 1.2 and higher, but disabled in 1.0/1.1. |
dataproc:dataproc.localssd.mount.enable | true or false | Whether to mount local SSDs as Hadoop/Spark temp directories and HDFS data directories (default: true). |
dataproc:dataproc.logging.stackdriver.enable | true or false | Enables (true) or disables (false) logging to Stackdriver (default: true). |
dataproc:dataproc.logging.stackdriver.job.driver.enable | true or false | Enables (true) or disables (false) Dataproc job driver logs in Logging (default: false). |
dataproc:dataproc.logging.stackdriver.job.yarn.container.enable | true or false | Enables (true) or disables (false) YARN container logs in Logging (default: false). |
dataproc:dataproc.monitoring.stackdriver.enable | true or false | Enables (true) or disables (false) the Stackdriver Monitoring Agent. |
dataproc:jobs.file-backed-output.enable | true or false | Configures Dataproc jobs to pipe their output to temporary files in the /var/log/google-dataproc-job directory. Must be set to true to enable job driver logging in Logging (default: true). |
dataproc:jupyter.notebook.gcs.dir | gs://<dir-path> | Location in Cloud Storage to save Jupyter notebooks. |
dataproc:kerberos.cross-realm-trust.admin-server | hostname/address | Hostname/address of the remote admin server (often the same as the KDC server). |
dataproc:kerberos.cross-realm-trust.kdc | hostname/address | Hostname/address of the remote KDC. |
dataproc:kerberos.cross-realm-trust.realm | realm name | Realm names can consist of any UPPERCASE ASCII string. Usually, the realm name is the same as your DNS domain name (in UPPERCASE). Example: If machines are named "machine-id.example.west-coast.mycompany.com", the associated realm may be designated as "EXAMPLE.WEST-COAST.MYCOMPANY.COM". |
dataproc:kerberos.cross-realm-trust.shared-password.uri | gs://<dir-path> | Location in Cloud Storage of the KMS-encrypted shared password. |
dataproc:kerberos.kdc.db.key.uri | gs://<dir-path> | Location in Cloud Storage of the KMS-encrypted file containing the KDC database master key. |
dataproc:kerberos.key.password.uri | gs://<dir-path> | Location in Cloud Storage of the KMS-encrypted file that contains the password of the key in the keystore file. |
dataproc:kerberos.keystore.password.uri | gs://<dir-path> | Location in Cloud Storage of the KMS-encrypted file containing the keystore password. |
dataproc:kerberos.keystore.uri 1 | gs://<dir-path> | Location in Cloud Storage of the keystore file containing the wildcard certificate and the private key used by cluster nodes. |
dataproc:kerberos.kms.key.uri | KMS key URI | The URI of the KMS key used to decrypt the root password, for example projects/project-id/locations/region/keyRings/key-ring/cryptoKeys/key (see Key resource ID). |
dataproc:kerberos.root.principal.password.uri | gs://<dir-path> | Location in Cloud Storage of the KMS-encrypted password for the Kerberos root principal. |
dataproc:kerberos.tgt.lifetime.hours | hours | Maximum lifetime of the ticket-granting ticket. |
dataproc:kerberos.truststore.password.uri | gs://<dir-path> | Location in Cloud Storage of the KMS-encrypted file that contains the password to the truststore file. |
dataproc:kerberos.truststore.uri 2 | gs://<dir-path> | Location in Cloud Storage of the KMS-encrypted truststore file containing trusted certificates. |
zeppelin:zeppelin.notebook.gcs.dir | gs://<dir-path> | Location in Cloud Storage to save Zeppelin notebooks. |
1 Keystore file: The keystore file contains the SSL certificate. It should be in Java KeyStore (JKS) format. When copied to VMs, it is renamed to keystore.jks. The SSL certificate should be a wildcard certificate that applies to each node in the cluster.
2 Truststore file: The truststore file should be in Java KeyStore (JKS) format. When copied to VMs, it is renamed to truststore.jks.
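Dataproc service properties are passed with the same --properties flag at cluster creation. As a minimal sketch (the cluster name and bucket path are placeholders, and the Jupyter property assumes the cluster is otherwise set up to run Jupyter), the following command enables job driver logging and sets a notebook location:
gcloud dataproc clusters create example-cluster \
    --properties 'dataproc:dataproc.logging.stackdriver.job.driver.enable=true,dataproc:jupyter.notebook.gcs.dir=gs://my-bucket/notebooks'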
Examples
gcloud command
To change the spark.master setting in the spark-defaults.conf file, add the following --properties flag when creating a new cluster on the command line:
--properties 'spark:spark.master=spark://example.com'
You can change several properties at once, in one or more configuration files, by using a comma separator. Each property must be specified in the full file_prefix:property=value format. For example, to change the spark.master setting in the spark-defaults.conf file and the dfs.hosts setting in the hdfs-site.xml file, you can use the following flag when creating a cluster:
--properties 'spark:spark.master=spark://example.com,hdfs:dfs.hosts=/foo/bar/baz'
REST API
To set spark.executor.memory to 10g, insert the following properties setting in the SoftwareConfig section of your clusters.create request:
"properties": { "spark:spark.executor.memory": "10g" }
Console
To change the spark.master setting in the spark-defaults.conf file:
- In the Cloud Console, open the Dataproc Create a cluster page. Click Advanced options at the bottom of the page to view the Cluster properties section.
- Click + Add cluster property, select spark in the left drop-down list, then add "spark.master" in the property field and the setting in the value field.