When you enable Dataproc cluster caching, the cluster caches Cloud Storage data frequently accessed by your Spark jobs.
Benefits
- Improved performance: Caching can improve job performance by reducing the amount of time spent retrieving data from storage.
- Reduced storage costs: Since hot data is cached on local disk, fewer API calls are made to storage to retrieve data.
Limitations and requirements
- Caching applies to Dataproc Spark jobs only.
- Only Cloud Storage data is cached.
- Caching applies only to clusters that meet the following requirements:
  - The cluster has one master and n workers. High Availability (HA) and single-node clusters are not supported.
  - The cluster uses Dataproc on Compute Engine image version 2.0.72+ or 2.1.20+.
  - Each cluster node has local SSDs attached with the NVMe (Non-Volatile Memory Express) interface. Persistent Disks (PDs) are not supported; data is cached on NVMe local SSDs only.
  - The cluster uses the default VM service account for authentication. Custom VM service accounts are not supported.
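The image version requirement can be checked before creating a cluster. The following is a minimal sketch, not an official tool: the function name image_supports_caching is hypothetical, and it assumes version strings of the form MAJOR.MINOR.PATCH with an optional OS suffix (for example, 2.0.72-debian10). It encodes only the two minimums listed on this page and reports any other image track as "unknown" rather than guessing.

```shell
# Hypothetical helper (not part of gcloud): compares an image version
# against the minimums listed on this page (2.0.72+ or 2.1.20+).
image_supports_caching() {
  version="${1%%-*}"              # drop an OS suffix such as "-debian10"
  major="${version%%.*}"
  rest="${version#*.}"
  minor="${rest%%.*}"
  case "$rest" in
    *.*) patch="${rest#*.}"; patch="${patch%%.*}" ;;
    *)   patch="" ;;
  esac
  # A missing or non-numeric patch level cannot be compared.
  case "$patch" in
    ''|*[!0-9]*) echo "unknown"; return 0 ;;
  esac
  # Only the 2.0 and 2.1 tracks are documented here; others are unknown.
  case "${major}.${minor}" in
    2.0) min=72 ;;
    2.1) min=20 ;;
    *)   echo "unknown"; return 0 ;;
  esac
  if [ "$patch" -ge "$min" ]; then
    echo "supported"
  else
    echo "unsupported"
  fi
}
```

For example, image_supports_caching 2.1.19-debian11 prints "unsupported" because the patch level is below 20 on the 2.1 track.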
Enable cluster caching
You can enable cluster caching when you create a Dataproc cluster using the Google Cloud CLI or the Dataproc API.
Console
Currently, enabling cluster caching from the Google Cloud console is not supported.
gcloud CLI
Run the gcloud dataproc clusters create command locally in a terminal window or in Cloud Shell, setting the dataproc:dataproc.cluster.caching.enabled=true cluster property.
Example:
gcloud dataproc clusters create CLUSTER_NAME \
    --region=REGION \
    --properties dataproc:dataproc.cluster.caching.enabled=true \
    --num-master-local-ssds=2 \
    --master-local-ssd-interface=NVME \
    --num-worker-local-ssds=2 \
    --worker-local-ssd-interface=NVME \
    other args ...
REST API
Set SoftwareConfig.properties to include the "dataproc:dataproc.cluster.caching.enabled": "true" cluster property as part of a clusters.create request.
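As an illustration, a clusters.create request body that mirrors the gcloud example might look like the sketch below. The function name build_create_request is hypothetical. The property key follows the gcloud example (dataproc:dataproc.cluster.caching.enabled). The diskConfig values (two local SSDs with the "nvme" interface on master and workers) are assumptions chosen to match the gcloud flags, and PROJECT_ID is a placeholder not used elsewhere on this page.

```shell
# Hypothetical helper: prints a clusters.create request body for the
# cluster name passed as the first argument.
build_create_request() {
  cat <<EOF
{
  "clusterName": "$1",
  "config": {
    "masterConfig": {
      "diskConfig": { "numLocalSsds": 2, "localSsdInterface": "nvme" }
    },
    "workerConfig": {
      "diskConfig": { "numLocalSsds": 2, "localSsdInterface": "nvme" }
    },
    "softwareConfig": {
      "properties": {
        "dataproc:dataproc.cluster.caching.enabled": "true"
      }
    }
  }
}
EOF
}

# Example submission (commented out; assumes an authenticated gcloud
# installation and the placeholder values PROJECT_ID and REGION):
# curl -X POST \
#   -H "Authorization: Bearer $(gcloud auth print-access-token)" \
#   -H "Content-Type: application/json" \
#   -d "$(build_create_request CLUSTER_NAME)" \
#   "https://dataproc.googleapis.com/v1/projects/PROJECT_ID/regions/REGION/clusters"
```

Printing the body separately from sending it makes it easy to review the properties and disk settings before the request is submitted.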