When you enable Dataproc cluster caching, the cluster caches Cloud Storage data frequently accessed by your Spark jobs.
Benefits
- Improved performance: Caching can improve job performance by reducing the amount of time spent retrieving data from storage.
- Reduced storage costs: Since hot data is cached on local disk, fewer API calls are made to storage to retrieve data.
Limitations and requirements
- Caching applies to Dataproc Spark jobs only.
- Only Cloud Storage data is cached.
- Caching applies only to clusters that meet the following requirements:
  - The cluster has one master and n workers (High Availability (HA) and single node clusters are not supported).
  - This feature is available in Dataproc on Compute Engine image versions 2.0.72+ or 2.1.20+.
  - Each cluster node must have local SSDs attached with the NVME (Non-Volatile Memory Express) interface (Persistent Disks (PDs) are not supported). Data is cached on NVME local SSDs only.
  - The cluster uses the default VM service account for authentication. Custom VM service accounts are not supported.
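The requirements above can be turned into a quick pre-flight check before a cluster is created. The following is a minimal sketch, not part of Dataproc: the `ClusterPlan` class and both function names are hypothetical, and image tracks other than 2.0 and 2.1 are rejected only because the requirements do not mention them.

```python
import re
from dataclasses import dataclass

# Minimum patch release per image track, from the stated "2.0.72+ or 2.1.20+".
# Assumption of this sketch: tracks not listed here are treated as unsupported.
MIN_PATCH = {(2, 0): 72, (2, 1): 20}


@dataclass
class ClusterPlan:
    """Hypothetical description of a cluster you intend to create."""
    num_masters: int
    num_workers: int
    image_version: str
    master_local_ssds: int
    worker_local_ssds: int
    master_ssd_interface: str
    worker_ssd_interface: str
    uses_custom_service_account: bool


def image_supports_caching(image_version: str) -> bool:
    """Check a version string such as '2.1.20-debian11' against the stated minimums."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", image_version)
    if not match:
        return False
    major, minor, patch = (int(part) for part in match.groups())
    minimum = MIN_PATCH.get((major, minor))
    return minimum is not None and patch >= minimum


def caching_problems(plan: ClusterPlan) -> list:
    """Return the reasons, if any, why the plan does not meet the caching requirements."""
    problems = []
    if plan.num_masters != 1 or plan.num_workers < 1:
        problems.append("cluster must have one master and n workers (no HA or single node)")
    if not image_supports_caching(plan.image_version):
        problems.append("image version must be 2.0.72+ or 2.1.20+")
    groups = (
        ("master", plan.master_local_ssds, plan.master_ssd_interface),
        ("worker", plan.worker_local_ssds, plan.worker_ssd_interface),
    )
    for role, ssd_count, interface in groups:
        if ssd_count < 1 or interface.upper() != "NVME":
            problems.append(f"{role} nodes need local SSDs with the NVME interface")
    if plan.uses_custom_service_account:
        problems.append("cluster must use the default VM service account")
    return problems
```

An empty list means the plan satisfies every requirement listed above; each string otherwise names one requirement that is not met.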
Enable cluster caching
You can enable cluster caching when you create a Dataproc cluster using the Google Cloud console, Google Cloud CLI, or the Dataproc API.
Google Cloud console
- Open the Dataproc Create a cluster on Compute Engine page in the Google Cloud console.
- The Set up cluster panel is selected. In the Spark performance enhancements section, select Enable Google Cloud Storage caching.
- After confirming and specifying cluster details in the cluster create panels, click Create.
gcloud CLI
Run the gcloud dataproc clusters create command locally in a terminal window or in Cloud Shell using the dataproc:dataproc.cluster.caching.enabled=true cluster property.
Example:
gcloud dataproc clusters create CLUSTER_NAME \
    --region=REGION \
    --properties dataproc:dataproc.cluster.caching.enabled=true \
    --num-master-local-ssds=2 \
    --master-local-ssd-interface=NVME \
    --num-worker-local-ssds=2 \
    --worker-local-ssd-interface=NVME \
    other args ...
REST API
Set SoftwareConfig.properties to include the "dataproc:dataproc.cluster.caching.enabled": "true" cluster property as part of a clusters.create request.
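To illustrate where the property sits in the request, here is a minimal sketch that assembles a clusters.create request body as a Python dictionary. The function name, the default of two local SSDs, and the worker count are illustrative assumptions; the field names (`softwareConfig`, `diskConfig`, `numLocalSsds`, `localSsdInterface`) follow the v1 Cluster resource as I understand it and should be verified against the API reference before use.

```python
import json


def build_create_cluster_body(project_id: str, cluster_name: str,
                              num_local_ssds: int = 2) -> dict:
    """Assemble a request body that enables cluster caching (hypothetical helper)."""
    # Caching requires NVME local SSDs on every node, so the same disk
    # settings are applied to the master and worker groups.
    disk_config = {"numLocalSsds": num_local_ssds, "localSsdInterface": "nvme"}
    return {
        "projectId": project_id,
        "clusterName": cluster_name,
        "config": {
            "softwareConfig": {
                "properties": {
                    # The cluster property named in this section.
                    "dataproc:dataproc.cluster.caching.enabled": "true",
                },
            },
            # One master: HA and single node clusters are not supported.
            "masterConfig": {"numInstances": 1, "diskConfig": dict(disk_config)},
            # Worker count is a placeholder value.
            "workerConfig": {"numInstances": 2, "diskConfig": dict(disk_config)},
        },
    }


if __name__ == "__main__":
    body = build_create_cluster_body("my-project", "my-cluster")
    print(json.dumps(body, indent=2))
```

The property value is the string "true", not a JSON boolean, because cluster properties are passed as string key-value pairs.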