Apache Hadoop YARN, HDFS, Spark, and related properties
The open source components installed on Dataproc clusters contain many
configuration files. For example, Apache Spark and Apache Hadoop have several XML
and plain text configuration files. You can use the
‑‑properties
flag of the
gcloud dataproc clusters create
command to modify many common configuration files when creating a cluster.
Formatting
The gcloud dataproc clusters create --properties
flag accepts the following
string format:
file_prefix1:property1=value1,file_prefix2:property2=value2,...
The file_prefix maps to a predefined configuration file as shown in the table below, and the property maps to a property within the file.
The default delimiter used to separate multiple cluster properties is the comma (,). However, if a comma is included in a property value, you must change the delimiter by specifying a "^delimiter^" at the beginning of the property list (see gcloud topic escaping for more information).
- Example using a "#" delimiter:
--properties ^#^file_prefix1:property1=part1,part2#file_prefix2:property2=value2
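For instance, to pass a Spark property whose value itself contains commas, you could switch the delimiter to "#" so the commas stay inside the value (an illustrative sketch; the option values shown are placeholders, not recommendations):
--properties '^#^spark:spark.driver.extraJavaOptions=-Dkey1=value1,-Dkey2=value2#spark:spark.executor.memory=10g'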
Examples
gcloud command
To change the spark.master
setting in the
spark-defaults.conf
file, add the following
gcloud dataproc clusters create --properties
flag:
--properties 'spark:spark.master=spark://example.com'
You can change several properties at once, in one or more configuration files,
by using a comma separator. Each property must be specified in the full
file_prefix:property=value
format. For example, to change the
spark.master
setting in the spark-defaults.conf
file
and the dfs.hosts
setting in the hdfs-site.xml
file,
use the following --properties
flag when creating a cluster:
--properties 'spark:spark.master=spark://example.com,hdfs:dfs.hosts=/foo/bar/baz'
REST API
To set spark.executor.memory to 10g, insert the following properties setting in the SoftwareConfig section of your clusters.create request:
"properties": { "spark:spark.executor.memory": "10g" }
An easy way to see how to construct the JSON body of a
Dataproc API clusters REST request is to initiate the
equivalent gcloud
command using the --log-http
flag.
Here is a sample gcloud dataproc clusters create
command, which sets cluster
properties with the --properties spark:spark.executor.memory=10g
flag.
The stdout log shows the resulting REST request body (the properties
snippet is shown below):
gcloud dataproc clusters create my-cluster \
    --region=region \
    --properties=spark:spark.executor.memory=10g \
    --log-http \
    other args ...
Output:
...
== body start ==
{"clusterName": "my-cluster",
 "config": {
   "gceClusterConfig": ...
   "masterConfig": {...
   "softwareConfig": {"properties": {"spark:spark.executor.memory": "10g"}},
...
== body end ==
...
Make sure to cancel the command after the JSON body appears in the output if you do not want the command to take effect.
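If you do let the command run, one way to confirm that the setting took effect is to inspect the created cluster's softwareConfig (a sketch with placeholder names):
# The output includes a softwareConfig.properties section listing the applied properties.
gcloud dataproc clusters describe my-cluster \
    --region=region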
Console
To change the spark.master
setting in the
spark-defaults.conf
file:
- In the Cloud Console, open the Dataproc Create a cluster page. Click the Customize cluster panel, then scroll to the Cluster properties section.
- Click + ADD PROPERTIES. Select spark in the Prefix list, then add "spark.master" in the Key field and the setting in the Value field.
Cluster vs. Job Properties
The Apache Hadoop YARN, HDFS, Spark, and other file-prefixed properties are applied at the cluster level when you create a cluster. Many of these properties can also be applied to specific jobs. When applying a property to a job, the file prefix is not used.
Example:
Set Spark executor memory to 4g for a Spark job (spark:
prefix omitted).
gcloud dataproc jobs submit spark \
    --region=region \
    --properties=spark.executor.memory=4g \
    ... other args ...
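For comparison, the same setting applied as a cluster-wide default at cluster creation keeps the spark: file prefix (a sketch with placeholder arguments):
gcloud dataproc clusters create my-cluster \
    --region=region \
    --properties=spark:spark.executor.memory=4g \
    ... other args ...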
File-prefixed properties table
File prefix | File | File purpose |
---|---|---|
capacity-scheduler | capacity-scheduler.xml | Hadoop YARN Capacity Scheduler configuration |
core | core-site.xml | Hadoop general configuration |
distcp | distcp-default.xml | Hadoop Distributed Copy configuration |
hadoop-env | hadoop-env.sh | Hadoop specific environment variables |
hbase | hbase-site.xml | HBase configuration |
hdfs | hdfs-site.xml | Hadoop HDFS configuration |
hive | hive-site.xml | Hive configuration |
mapred | mapred-site.xml | Hadoop MapReduce configuration |
mapred-env | mapred-env.sh | Hadoop MapReduce specific environment variables |
pig | pig.properties | Pig configuration |
presto | config.properties | Presto configuration |
presto-jvm | jvm.config | Presto specific JVM configuration |
spark | spark-defaults.conf | Spark configuration |
spark-env | spark-env.sh | Spark specific environment variables |
yarn | yarn-site.xml | Hadoop YARN configuration |
yarn-env | yarn-env.sh | Hadoop YARN specific environment variables |
zeppelin | zeppelin-site.xml | Zeppelin configuration |
zeppelin-env | zeppelin-env.sh | Zeppelin specific environment variables (Optional Component only) |
zookeeper | zoo.cfg | Zookeeper configuration |
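For example, prefixes from this table can be mixed in a single --properties flag. The property names below map to the files listed above; the values are illustrative only, not recommendations:
--properties 'yarn:yarn.nodemanager.resource.memory-mb=12288,mapred:mapreduce.map.memory.mb=3072,hdfs:dfs.replication=2'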
Notes
- Some properties are reserved and cannot be overridden because they impact the functionality of the Dataproc cluster. If you try to change a reserved property, you will receive an error message when creating your cluster.
- You can specify multiple changes by separating each with a comma.
- The --properties flag cannot modify configuration files not shown above.
- Changes to properties will be applied before the daemons on your cluster start.
- If the specified property exists, it will be updated. If the specified property does not exist, it will be added to the configuration file.
Dataproc service properties
The properties listed below are specific to Dataproc and can be used to further configure the functionality of your Dataproc cluster. These cluster properties must be specified at cluster creation; they cannot be specified or updated afterward.
Formatting
The gcloud dataproc clusters create --properties
flag accepts the following
string format:
property_prefix1:property1=value1,property_prefix2:property2=value2,...
The default delimiter used to separate multiple cluster properties is the comma (,). However, if a comma is included in a property value, you must change the delimiter by specifying a "^delimiter^" at the beginning of the property list (see gcloud topic escaping for more information).
- Example using a "#" delimiter:
--properties ^#^property_prefix1:property1=part1,part2#property_prefix2:property2=value2
Example:
Create a cluster and set Enhanced Flexibility Mode to Spark primary worker shuffle.
gcloud dataproc clusters create my-cluster \
    --region=region \
    --properties=dataproc:efm.spark.shuffle=primary-worker \
    ... other args ...
Dataproc service properties table
Property prefix | Property | Values | Description |
---|---|---|---|
dataproc | am.primary_only | true or false | Set this property to true to prevent the application master from running on Dataproc cluster preemptible workers. Note: This feature is only available with Dataproc 1.2 and higher. The default value is false. |
dataproc | dataproc.allow.zero.workers | true or false | Set this SoftwareConfig property to true in a Dataproc clusters.create API request to create a Single node cluster, which changes the default number of workers from 2 to 0 and places worker components on the master host. A Single node cluster can also be created from the Cloud Console or with the gcloud command-line tool by setting the number of workers to 0. |
dataproc | dataproc.alpha.master.nvdimm.size.gb | 1500-6500 | Setting a value creates a Dataproc master with Intel Optane DC Persistent memory. Note: Optane VMs can only be created in us-central1-f zones, only with the n1-highmem-96-aep machine type, and only under whitelisted projects. |
dataproc | dataproc.alpha.worker.nvdimm.size.gb | 1500-6500 | Setting a value creates a Dataproc worker with Intel Optane DC Persistent memory. Note: Optane VMs can only be created in us-central1-f zones, only with the n1-highmem-96-aep machine type, and only under whitelisted projects. |
dataproc | dataproc.beta.secure.multi-tenancy.user.mapping | user-to-service account mappings | This property takes a list of user-to-service account mappings. Mapped users can submit interactive workloads to the cluster with isolated user identities (see Dataproc Service Account Based Secure Multi-tenancy). |
dataproc | dataproc.conscrypt.provider.enable | true or false | Enables (true) or disables (false) Conscrypt as the primary Java security provider. Note: Conscrypt is enabled by default in Dataproc 1.2 and higher, but disabled in 1.0/1.1. |
dataproc | dataproc.cooperative.multi-tenancy.user.mapping | user-to-service account mappings | This property takes a list of comma-separated user-to-service account mappings. If a cluster is created with this property set, when a user submits a job, the cluster will attempt to impersonate the corresponding service account when accessing Cloud Storage through the Cloud Storage connector. This feature requires Cloud Storage connector version 2.1.4 or higher. For more information, see Dataproc cooperative multi-tenancy. (default: empty). |
dataproc | dataproc.localssd.mount.enable | true or false | Whether to mount local SSDs as Hadoop/Spark temp directories and HDFS data directories (default: true). |
dataproc | dataproc.logging.stackdriver.enable | true or false | Enables (true) or disables (false) Logging (default: true). |
dataproc | dataproc.logging.stackdriver.job.driver.enable | true or false | Enables (true) or disables (false) Dataproc job driver logs in Logging (default: false). |
dataproc | dataproc.logging.stackdriver.job.yarn.container.enable | true or false | Enables (true) or disables (false) YARN container logs in Logging (default: false). |
dataproc | dataproc.monitoring.stackdriver.enable | true or false | Enables (true) or disables (false) the Monitoring Agent. |
dataproc | dataproc.scheduler.driver-size-mb | number | The average driver memory footprint, which determines the maximum number of concurrent jobs a cluster will run. The default value is 1 GB. A smaller value, such as 256, may be appropriate for Spark jobs. |
dataproc | dataproc.worker.custom.init.actions.mode | RUN_BEFORE_SERVICES | For pre-2.0 image clusters, RUN_BEFORE_SERVICES is not set, but can be set by the user when the cluster is created. For 2.0+ image clusters, RUN_BEFORE_SERVICES is set, and the property cannot be passed to the cluster (it cannot be changed by the user). For information on the effect of this setting, see Important considerations and guidelines—Initialization processing. |
dataproc | efm.mapreduce.shuffle | hcfs | If set to hcfs, MapReduce shuffle data is preserved in HDFS. See Dataproc Enhanced Flexibility Mode for more information. Note: Currently, this feature is only available with Dataproc 1.4 and 1.5. |
dataproc | efm.spark.shuffle | primary-worker or hcfs | If set to primary-worker, mappers write data to primary workers (available to, and recommended for, Spark jobs). If set to hcfs, Spark shuffle data is preserved in HDFS. See Dataproc Enhanced Flexibility Mode for more information. Note: Currently, this feature is only available with Dataproc 1.4 and 1.5. |
dataproc | job.history.to-gcs.enabled | true or false | Allows persisting MapReduce and Spark history files to the Dataproc temp bucket (default: true for image versions 1.5+). Users can overwrite the locations of job history file persistence through the following properties: mapreduce.jobhistory.done-dir, mapreduce.jobhistory.intermediate-done-dir, spark.eventLog.dir, and spark.history.fs.logDirectory. |
dataproc | jobs.file-backed-output.enable | true or false | Configures Dataproc jobs to pipe their output to temporary files in the /var/log/google-dataproc-job directory. Must be set to true to enable job driver logging in Logging (default: true). |
dataproc | jupyter.listen.all.interfaces | true or false | To reduce the risk of remote code execution over unsecured notebook server APIs, the default setting for image versions 1.3+ is false, which restricts connections to localhost (127.0.0.1) when Component Gateway is enabled. This default setting can be overridden by setting this property to true to allow all connections. |
dataproc | jupyter.notebook.gcs.dir | gs://<dir-path> | Location in Cloud Storage to save Jupyter notebooks. |
dataproc | kerberos.beta.automatic-config.enable | true or false | When set to true, users do not need to specify the Kerberos root principal password with the --kerberos-root-principal-password and --kerberos-kms-key-uri flags (default: false). See Enabling Hadoop Secure Mode via Kerberos for more information. |
dataproc | kerberos.cross-realm-trust.admin-server | hostname/address | hostname/address of the remote admin server (often the same as the KDC server). |
dataproc | kerberos.cross-realm-trust.kdc | hostname/address | hostname/address of the remote KDC. |
dataproc | kerberos.cross-realm-trust.realm | realm name | Realm names can consist of any UPPERCASE ASCII string. Usually, the realm name is the same as your DNS domain name (in UPPERCASE). Example: If machines are named "machine-id.example.west-coast.mycompany.com", the associated realm may be designated as "EXAMPLE.WEST-COAST.MYCOMPANY.COM". |
dataproc | kerberos.cross-realm-trust.shared-password.uri | gs://<dir-path> | Location in Cloud Storage of the KMS-encrypted shared password. |
dataproc | kerberos.kdc.db.key.uri | gs://<dir-path> | Location in Cloud Storage of the KMS-encrypted file containing the KDC database master key. |
dataproc | kerberos.key.password.uri | gs://<dir-path> | Location in Cloud Storage of the KMS-encrypted file that contains the password of the key in the keystore file. |
dataproc | kerberos.keystore.password.uri | gs://<dir-path> | Location in Cloud Storage of the KMS-encrypted file containing the keystore password. |
dataproc | kerberos.keystore.uri1 | gs://<dir-path> | Location in Cloud Storage of the keystore file containing the wildcard certificate and the private key used by cluster nodes. |
dataproc | kerberos.kms.key.uri | KMS key URI | The URI of the KMS key used to decrypt the root password, for example projects/project-id/locations/region/keyRings/key-ring/cryptoKeys/key (see Key resource ID). |
dataproc | kerberos.root.principal.password.uri | gs://<dir-path> | Location in Cloud Storage of the KMS-encrypted password for the Kerberos root principal. |
dataproc | kerberos.tgt.lifetime.hours | hours | Maximum lifetime of the ticket-granting ticket. |
dataproc | kerberos.truststore.password.uri | gs://<dir-path> | Location in Cloud Storage of the KMS-encrypted file that contains the password to the truststore file. |
dataproc | kerberos.truststore.uri2 | gs://<dir-path> | Location in Cloud Storage of the KMS-encrypted trust store file containing trusted certificates. |
dataproc | ranger.kms.key.uri | KMS key URI | The URI of the KMS key used to decrypt the Ranger admin user password, for example projects/project-id/locations/region/keyRings/key-ring/cryptoKeys/key (see Key resource ID). |
dataproc | ranger.admin.password.uri | gs://<dir-path> | Location in Cloud Storage of the KMS-encrypted password for the Ranger admin user. |
dataproc | ranger.db.admin.password.uri | gs://<dir-path> | Location in Cloud Storage of the KMS-encrypted password for the Ranger database admin user. |
dataproc | ranger.cloud-sql.instance.connection.name | cloud sql instance connection name | The connection name of the Cloud SQL instance, for example project-id:region:name. |
dataproc | ranger.cloud-sql.root.password.uri | gs://<dir-path> | Location in Cloud Storage of the KMS-encrypted password for the root user of the Cloud SQL instance. |
dataproc | ranger.cloud-sql.use-private-ip | true or false | Whether the communication between cluster instances and the Cloud SQL instance should be over private IP (default value is false). |
dataproc | solr.gcs.path | gs://<dir-path> | Cloud Storage path to act as the Solr home directory. |
dataproc | startup.component.service-binding-timeout.hadoop-hdfs-namenode | seconds | The amount of time the Dataproc startup script will wait for hadoop-hdfs-namenode to bind to ports before deciding that its startup has succeeded. The maximum recognized value is 1800 seconds (30 minutes). |
dataproc | startup.component.service-binding-timeout.hive-metastore | seconds | The amount of time the Dataproc startup script will wait for the hive-metastore service to bind to ports before deciding that its startup has succeeded. The maximum recognized value is 1800 seconds (30 minutes). |
dataproc | startup.component.service-binding-timeout.hive-server2 | seconds | The amount of time the Dataproc startup script will wait for hive-server2 to bind to ports before deciding that its startup has succeeded. The maximum recognized value is 1800 seconds (30 minutes). |
dataproc | user-attribution.enabled | true or false | Set this property to true to attribute a Dataproc job to the identity of the user who submitted it. Note: This property does not apply to clusters with Kerberos enabled (default value is false). |
dataproc | yarn.log-aggregation.enabled | true or false | Allows (true) turning on YARN log aggregation to a Dataproc temporary bucket. The bucket name is of the following form: dataproc-temp-<REGION>-<PROJECT_NUMBER>-<RANDOM_STRING> (default: true for image versions 1.5+). Users can also set the location of aggregated YARN logs by overwriting the yarn.nodemanager.remote-app-log-dir YARN property. |
knox | gateway.host | ip address | To reduce the risk of remote code execution over unsecured notebook server APIs, the default setting for image versions 1.3+ is 127.0.0.1, which restricts connections to localhost when Component Gateway is enabled. The default setting can be overridden, for example by setting this property to 0.0.0.0 to allow all connections. |
zeppelin | zeppelin.notebook.gcs.dir | gs://<dir-path> | Location in Cloud Storage to save Zeppelin notebooks. |
zeppelin | zeppelin.server.addr | ip address | To reduce the risk of remote code execution over unsecured notebook server APIs, the default setting for image versions 1.3+ is 127.0.0.1, which restricts connections to localhost when Component Gateway is enabled. This default setting can be overridden, for example by setting this property to 0.0.0.0 to allow all connections. |
1 Keystore file: The keystore file contains the SSL certificate. It should be in Java KeyStore (JKS) format. When copied to VMs, it is renamed to keystore.jks. The SSL certificate should be a wildcard certificate that applies to each node in the cluster.
2 Truststore file: The truststore file should be in Java KeyStore (JKS) format. When copied to VMs, it is renamed to truststore.jks.
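As with file-prefixed properties, several Dataproc service properties from the table above can be combined in a single flag at cluster creation. The following sketch uses placeholder arguments:
gcloud dataproc clusters create my-cluster \
    --region=region \
    --properties='dataproc:dataproc.logging.stackdriver.job.driver.enable=true,dataproc:jobs.file-backed-output.enable=true' \
    ... other args ...
Per the table, jobs.file-backed-output.enable must be true for job driver logs to reach Logging, which is why the two properties are set together here.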