Cluster caching

When you enable Dataproc cluster caching, the cluster caches Cloud Storage data frequently accessed by your Spark jobs.

Benefits

  • Improved performance: Caching can improve job performance by reducing the time Spark jobs spend retrieving data from Cloud Storage.
  • Reduced storage costs: Because frequently accessed (hot) data is cached on local disk, the cluster makes fewer Cloud Storage API calls to retrieve data.

Limitations and requirements

Cluster caching stores hot data on local disk attached to cluster nodes; the gcloud CLI example in the following section shows a cluster created with NVMe local SSDs on both master and worker nodes for this purpose.

Enable cluster caching

You can enable cluster caching when you create a Dataproc cluster using the Google Cloud console, Google Cloud CLI, or the Dataproc API.

Google Cloud console

  • Open the Dataproc Create a cluster on Compute Engine page in the Google Cloud console.
  • The Set up cluster panel is selected. In the Spark performance enhancements section, select Enable Google Cloud Storage caching.
  • After confirming and specifying cluster details in the cluster create panels, click Create.

gcloud CLI

Run the gcloud dataproc clusters create command locally in a terminal window or in Cloud Shell, and include the dataproc:dataproc.cluster.caching.enabled=true cluster property.

Example:

gcloud dataproc clusters create CLUSTER_NAME \
    --region=REGION \
    --properties dataproc:dataproc.cluster.caching.enabled=true \
    --num-master-local-ssds=2 \
    --master-local-ssd-interface=NVME \
    --num-worker-local-ssds=2 \
    --worker-local-ssd-interface=NVME \
    other args ...
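
After the cluster is created, one way to confirm that the property took effect is to describe the cluster and inspect its software configuration. The following command is a sketch; CLUSTER_NAME and REGION are placeholders for your own values, and the --format projection is optional and only narrows the output to the cluster properties.

gcloud dataproc clusters describe CLUSTER_NAME \
    --region=REGION \
    --format="yaml(config.softwareConfig.properties)"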
  

REST API

Set SoftwareConfig.properties to include the "dataproc:dataproc.cluster.caching.enabled": "true" cluster property as part of a clusters.create request.
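
As an illustration, a minimal clusters.create request that sets the property might look like the following sketch. PROJECT_ID, REGION, and CLUSTER_NAME are placeholders, and any other cluster configuration you need (machine types, local SSDs, and so on) would be added alongside softwareConfig in the config object.

POST https://dataproc.googleapis.com/v1/projects/PROJECT_ID/regions/REGION/clusters

{
  "clusterName": "CLUSTER_NAME",
  "config": {
    "softwareConfig": {
      "properties": {
        "dataproc:dataproc.cluster.caching.enabled": "true"
      }
    }
  }
}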