Dataproc staging and temp buckets

When you create a cluster, HDFS is used as the default filesystem. You can override this behavior by setting defaultFS to a Cloud Storage bucket. By default, Dataproc also creates a Cloud Storage staging bucket and a Cloud Storage temp bucket in your project, or reuses existing Dataproc-created staging and temp buckets from previous cluster creation requests.

  • Staging bucket: Used to stage cluster job dependencies, job driver output, and cluster config files. It also receives output from the gcloud CLI gcloud dataproc clusters diagnose command (see the example after this list).

  • Temp bucket: Used to store ephemeral cluster and job data, such as Spark and MapReduce history files.
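
For example, the diagnostic tarball produced by the diagnose command is written to the cluster's staging bucket. A minimal sketch, assuming a cluster named my-cluster in us-central1 (both values are placeholders):

# Collect diagnostic data for the cluster; the command prints the Cloud Storage
# URI of the resulting tarball, which is stored in the cluster's staging bucket.
gcloud dataproc clusters diagnose my-cluster \
    --region=us-central1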

If you do not specify a staging or temp bucket when you create a cluster, Dataproc sets a Cloud Storage location in US, ASIA, or EU for your cluster's staging and temp buckets according to the Compute Engine zone where your cluster is deployed, and then creates and manages these project-level, per-location buckets. Dataproc-created staging and temp buckets are shared among clusters in the same region, and are created with a Cloud Storage soft delete retention duration set to 0 seconds.
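
To check a Dataproc-created bucket's location and retention settings, you can inspect its metadata with the gcloud CLI. A sketch with a placeholder bucket name; substitute the staging or temp bucket name listed for your cluster:

# Show bucket metadata, including its location and (in recent gcloud versions)
# its soft delete policy.
gcloud storage buckets describe gs://dataproc-temp-us-central1-example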

The temp bucket contains ephemeral data, and has a TTL of 90 days. The staging bucket, which can contain configuration data and dependency files needed by multiple clusters, does not have a TTL. However, you can apply a lifecycle rule to your dependency files (files with a ".jar" filename extension located in the staging bucket folder) to schedule the removal of your dependency files when they are no longer needed by your clusters.
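
For example, you could apply a lifecycle rule that deletes .jar files in the staging bucket after a fixed age. The following is a sketch, not a recommended policy; the 30-day age and the bucket name are placeholder assumptions:

# Write a lifecycle rule that deletes objects ending in ".jar" once they are
# older than 30 days (the age and bucket name are placeholders).
cat > lifecycle.json <<'EOF'
{
  "rule": [
    {
      "action": {"type": "Delete"},
      "condition": {"age": 30, "matchesSuffix": [".jar"]}
    }
  ]
}
EOF

# Apply the rule to the staging bucket.
gcloud storage buckets update gs://my-staging-bucket \
    --lifecycle-file=lifecycle.json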

Create your own staging and temp buckets

Instead of relying on the creation of a default staging and temp bucket, you can specify existing Cloud Storage buckets that Dataproc will use as your cluster's staging and temp buckets.

gcloud command

Run the gcloud dataproc clusters create command with the --bucket and/or --temp-bucket flags locally in a terminal window or in Cloud Shell to specify your cluster's staging and/or temp bucket.

gcloud dataproc clusters create cluster-name \
    --region=region \
    --bucket=bucket-name \
    --temp-bucket=bucket-name \
    other args ...
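
For example, a filled-in request that uses existing buckets named my-staging-bucket and my-temp-bucket (all names below are placeholders) might look like this:

# Create a cluster that stages to my-staging-bucket and writes ephemeral data
# to my-temp-bucket.
gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --bucket=my-staging-bucket \
    --temp-bucket=my-temp-bucket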

REST API

Use the ClusterConfig.configBucket and ClusterConfig.tempBucket fields in a clusters.create request to specify your cluster's staging and temp buckets.
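
A sketch of such a request, sent with curl; the project, region, cluster, and bucket names are placeholders, and only the bucket-related fields of ClusterConfig are shown:

# clusters.create request that sets the staging (configBucket) and temp
# (tempBucket) buckets; a real request typically includes other config fields.
curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    -d '{
          "clusterName": "my-cluster",
          "config": {
            "configBucket": "my-staging-bucket",
            "tempBucket": "my-temp-bucket"
          }
        }' \
    "https://dataproc.googleapis.com/v1/projects/my-project/regions/us-central1/clusters"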

Console

In the Google Cloud console, open the Dataproc Create a cluster page. Select the Customize cluster panel, then use the File storage field to specify or select the cluster's staging bucket.

Note: Currently, specifying a temp bucket using the Google Cloud console is not supported.

Dataproc uses a defined folder structure for Cloud Storage buckets attached to clusters. Dataproc also supports attaching more than one cluster to a Cloud Storage bucket. The folder structure used for saving job driver output in Cloud Storage is:

cloud-storage-bucket-name
  - google-cloud-dataproc-metainfo
    - list of cluster IDs
      - list of job IDs
        - list of output logs for a job
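
For example, you can browse this structure with the gcloud CLI (the bucket name and cluster UUID below are placeholders):

# List the per-cluster folders under the metainfo prefix, then recursively list
# one cluster's contents, including its job driver output.
gcloud storage ls gs://my-staging-bucket/google-cloud-dataproc-metainfo/
gcloud storage ls --recursive gs://my-staging-bucket/google-cloud-dataproc-metainfo/CLUSTER_UUID/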

You can use the gcloud command line tool, the Dataproc API, or the Google Cloud console to list the names of a cluster's staging and temp buckets.

Console

  • View cluster details, which include the name of the cluster's staging bucket, on the Dataproc Clusters page in the Google Cloud console.
  • On the Google Cloud console Cloud Storage Browser page, filter results that contain "dataproc-temp-".

gcloud command

Run the gcloud dataproc clusters describe command locally in a terminal window or in Cloud Shell. The staging and temp buckets associated with your cluster are listed in the output.

gcloud dataproc clusters describe cluster-name \
    --region=region
...
clusterName: cluster-name
clusterUuid: daa40b3f-5ff5-4e89-9bf1-bcbfec ...
config:
    configBucket: dataproc-...
    ...
    tempBucket: dataproc-temp...
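
To print only the bucket names, you can also apply an output format, for example (cluster name and region are placeholders):

# Print just the staging (configBucket) and temp (tempBucket) bucket names.
gcloud dataproc clusters describe my-cluster \
    --region=us-central1 \
    --format="value(config.configBucket, config.tempBucket)"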

REST API

Call clusters.get to list the cluster details, including the name of the cluster's staging and temp buckets.

{
 "projectId": "vigilant-sunup-163401",
 "clusterName": "cluster-name",
 "config": {
  "configBucket": "dataproc-...",
  ...
  "tempBucket": "dataproc-temp-...",
  ...
 }
}
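
A sketch of the corresponding request, sent with curl; the project, region, and cluster names are placeholders:

# clusters.get request; configBucket and tempBucket appear under "config" in
# the response.
curl -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    "https://dataproc.googleapis.com/v1/projects/my-project/regions/us-central1/clusters/my-cluster"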

defaultFS

You can set core:fs.defaultFS to a bucket location in Cloud Storage (gs://defaultFS-bucket-name) to set Cloud Storage as the default filesystem. This also sets core:fs.gs.reported.permissions, the reported permission returned by the Cloud Storage connector for all files, to 777.

If Cloud Storage is not set as the default filesystem, HDFS will be used, and the core:fs.gs.reported.permissions property will return 700, the default value.

gcloud dataproc clusters create cluster-name \
    --properties=core:fs.defaultFS=gs://defaultFS-bucket-name \
    --region=region \
    --bucket=staging-bucket-name \
    --temp-bucket=temp-bucket-name \
    other args ...
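
To verify the setting after cluster creation, one option is to SSH into the cluster's master node and query the effective Hadoop configuration. A sketch, assuming the default master node name cluster-name-m and a placeholder zone:

# Print the effective default filesystem from the master node; the output
# should be the gs:// bucket URI when Cloud Storage is the defaultFS.
gcloud compute ssh cluster-name-m \
    --zone=zone \
    --command="hdfs getconf -confKey fs.defaultFS"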