Dataproc staging bucket

When you create a cluster, Dataproc by default creates a Cloud Storage staging bucket in your project or reuses a Dataproc-created staging bucket from an earlier cluster creation request. This bucket stages cluster job dependencies, job driver output, and cluster config files. Instead of relying on the default staging bucket, you can specify an existing Cloud Storage bucket for Dataproc to use as your cluster's staging bucket.

gcloud command

Run the gcloud dataproc clusters create command with the --bucket flag locally in a terminal window or in Cloud Shell to specify your cluster's staging bucket.

gcloud dataproc clusters create cluster-name \
    --bucket=bucket URI (for example, gs://mybucket-name) \
    other args ...

REST API

Use the ClusterConfig.configBucket field in a clusters.create request to specify your cluster's staging bucket.
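
As a minimal sketch of the request body, the helper below assembles the JSON for a clusters.create call with the staging bucket set. It assumes the Cluster resource shape visible in the clusters.get response on this page (projectId, clusterName, config.configBucket). The function name build_create_cluster_body and the sample project, cluster, and bucket names are hypothetical. The sample responses show a bare bucket name in configBucket, so the sketch strips a leading gs:// prefix; treat that normalization as an assumption. Authentication and sending the request are left out.

```python
import json


def build_create_cluster_body(project_id: str, cluster_name: str, bucket: str) -> dict:
    """Return a clusters.create request body that sets the staging bucket."""
    # Assumption: configBucket holds a bare bucket name, as in the sample
    # responses on this page, so a leading "gs://" and trailing "/" are dropped.
    bucket_name = bucket[len("gs://"):] if bucket.startswith("gs://") else bucket
    bucket_name = bucket_name.rstrip("/")
    if not bucket_name or "/" in bucket_name:
        raise ValueError(f"expected a bucket, not an object path: {bucket!r}")
    return {
        "projectId": project_id,
        "clusterName": cluster_name,
        "config": {"configBucket": bucket_name},
    }


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    body = build_create_cluster_body("my-project", "cluster-name", "gs://mybucket-name")
    print(json.dumps(body, indent=2))
```

Rejecting object paths up front keeps a mistyped value such as gs://mybucket-name/some/folder from reaching the API.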

Console

Use the Cloud Storage staging bucket field in the Create a cluster→Advanced options panel of the Google Cloud Console to specify or select your cluster's staging bucket.

Dataproc uses a separate staging bucket for each geographic region, as determined by the Compute Engine zone of the cluster; clusters in the same region share a Dataproc-created staging bucket. Staging buckets hold miscellaneous configuration and control files that your cluster needs. They also receive output from the Cloud SDK gcloud dataproc clusters diagnose command.

Dataproc uses a defined folder structure for Cloud Storage buckets attached to clusters, and more than one cluster can be attached to the same Cloud Storage bucket. The folder structure used for saving job driver output in Cloud Storage is:

cloud-storage-bucket-name
  - google-cloud-dataproc-metainfo
    - list of cluster IDs
      - list of job IDs
        - list of output logs for a job
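
The layout above can be turned into a Cloud Storage prefix for one job's output. The following is a rough sketch only: it assumes the levels nest exactly as listed (bucket, google-cloud-dataproc-metainfo, cluster ID, job ID), and an actual bucket may contain additional intermediate folders, so verify the path against your own bucket before relying on it. The function name job_output_prefix and the sample IDs are hypothetical.

```python
METAINFO_FOLDER = "google-cloud-dataproc-metainfo"


def job_output_prefix(bucket: str, cluster_id: str, job_id: str) -> str:
    """Build the gs:// prefix for a job's output, following the listed layout."""
    # Assumption: the levels nest exactly as listed in the folder structure.
    # Real buckets may contain extra intermediate folders.
    parts = [bucket, METAINFO_FOLDER, cluster_id, job_id]
    if any(not p or "/" in p for p in parts):
        raise ValueError("each path component must be non-empty and contain no '/'")
    return "gs://" + "/".join(parts) + "/"


if __name__ == "__main__":
    # Hypothetical bucket, cluster ID, and job ID.
    print(job_output_prefix("my-staging-bucket", "my-cluster-id", "my-job-id"))
```

Because several clusters can share one bucket, keeping the cluster ID as its own path component is what separates one cluster's job output from another's.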

You can use the gcloud command-line tool, the Dataproc API, or the Google Cloud Console to find the name of a cluster's staging bucket.

gcloud command

Run the gcloud dataproc clusters describe command locally in a terminal window or in Cloud Shell. The staging bucket associated with your cluster is listed in the output.

gcloud dataproc clusters describe cluster-name
clusterName: cluster-name
clusterUuid: daa40b3f-5ff5-4e89-9bf1-bcbfec ...
config:
    configBucket: dataproc-edc9d85f-12f9-4905...
    ...

REST API

Call clusters.get to list the cluster details, including the name of the cluster's staging bucket.

{
 "projectId": "vigilant-sunup-163401",
 "clusterName": "cluster-name",
 "config": {
  "configBucket": "dataproc-a8cd0...",
...
}
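
Once a response like the one above has been parsed, the staging bucket name sits under config.configBucket. Here is a small sketch of reading it, assuming the response has already been fetched and decoded into a Python dict. The function name staging_bucket_of and the sample values are hypothetical.

```python
import json


def staging_bucket_of(cluster: dict) -> str:
    """Return config.configBucket from a parsed cluster description."""
    try:
        return cluster["config"]["configBucket"]
    except (KeyError, TypeError) as exc:
        # Covers a missing "config" or "configBucket" key, or a null "config".
        raise ValueError("cluster description has no config.configBucket") from exc


if __name__ == "__main__":
    # Hypothetical response body; all names are placeholders.
    sample = json.loads(
        '{"projectId": "my-project", "clusterName": "cluster-name",'
        ' "config": {"configBucket": "my-staging-bucket"}}'
    )
    print(staging_bucket_of(sample))
```

The same helper should also work on the output of gcloud dataproc clusters describe when it is requested as JSON with the --format=json flag, since both describe the same cluster resource.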

Console

View cluster details, including the name of the cluster's staging bucket, in the Cloud Console.
