Cluster caching

When you enable Dataproc cluster caching, the cluster caches Cloud Storage data frequently accessed by your Spark jobs.

Benefits

  • Improved performance: Caching can improve job performance by reducing the amount of time spent retrieving data from storage.
  • Reduced storage costs: Because hot data is cached on local disk, jobs make fewer API calls to Cloud Storage to retrieve it.

Limitations and requirements

Enable cluster caching

You can enable cluster caching when you create a Dataproc cluster using the Google Cloud CLI or the Dataproc API.

Console

Currently, enabling cluster caching from the Google Cloud console is not supported.

gcloud CLI

Run the gcloud dataproc clusters create command locally in a terminal window or in Cloud Shell, setting the dataproc:dataproc.cluster.caching.enabled=true cluster property.

Example:

gcloud dataproc clusters create CLUSTER_NAME \
    --region=REGION \
    --properties dataproc:dataproc.cluster.caching.enabled=true \
    --num-master-local-ssds=2 \
    --master-local-ssd-interface=NVME \
    --num-worker-local-ssds=2 \
    --worker-local-ssd-interface=NVME \
    other args ...
  

REST API

Set SoftwareConfig.properties to include the "dataproc:dataproc.cluster.caching.enabled": "true" cluster property as part of a clusters.create request.
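
Example:

The following is a minimal sketch of a clusters.create request body, not a definitive request. It mirrors the gcloud CLI example above: it uses the same property name as that example (dataproc:dataproc.cluster.caching.enabled) and attaches two NVMe local SSDs to the master and worker nodes through DiskConfig. The cluster name example-cluster and the file name request.json are placeholders chosen for illustration, and the PROJECT_ID and REGION values in the commented-out curl call must be replaced with your own.

```shell
# Placeholder cluster name (an assumption for this sketch); replace before use.
CLUSTER_NAME="example-cluster"

# Write the clusters.create request body to request.json.
cat > request.json <<EOF
{
  "clusterName": "${CLUSTER_NAME}",
  "config": {
    "softwareConfig": {
      "properties": {
        "dataproc:dataproc.cluster.caching.enabled": "true"
      }
    },
    "masterConfig": {
      "diskConfig": { "numLocalSsds": 2, "localSsdInterface": "nvme" }
    },
    "workerConfig": {
      "diskConfig": { "numLocalSsds": 2, "localSsdInterface": "nvme" }
    }
  }
}
EOF

# To send the request, uncomment the lines below. This requires an
# authenticated gcloud installation; PROJECT_ID and REGION are placeholders.
# curl -X POST \
#   -H "Authorization: Bearer $(gcloud auth print-access-token)" \
#   -H "Content-Type: application/json" \
#   -d @request.json \
#   "https://dataproc.googleapis.com/v1/projects/PROJECT_ID/regions/REGION/clusters"
```

Writing the body to a file first makes it easy to review the properties before any cluster is created.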