Dataproc Personal Cluster Authentication

When you create a Dataproc cluster, you can enable Dataproc Personal Cluster Authentication to allow interactive workloads on the cluster to securely run as your user identity. This means that interactions with other Google Cloud resources such as Cloud Storage will be authenticated as yourself instead of the cluster service account.

Considerations

  • When you create a cluster with Personal Cluster Authentication enabled, the cluster will only be usable by your identity. Other users will not be able to run jobs on the cluster or access Component Gateway endpoints on the cluster.

  • Clusters with Personal Cluster Authentication enabled block SSH access and Compute Engine features such as startup scripts on all VMs in the cluster.

  • Clusters with Personal Cluster Authentication enabled automatically enable and configure Kerberos on the cluster for secure intra-cluster communication. However, all Kerberos identities on the cluster will interact with Google Cloud resources as the same user.

  • Dataproc Personal Cluster Authentication currently does not support Dataproc workflows.

  • Dataproc Personal Cluster Authentication is intended only for interactive jobs run by an individual (human) user. Long-running jobs and operations should configure and use an appropriate service account identity.

  • The propagated credentials are downscoped with a Credential Access Boundary. The default access boundary is limited to reading and writing Cloud Storage objects in Cloud Storage buckets owned by the same project that contains the cluster. You can define a non-default access boundary when you enable an interactive session (the boundary format is sketched after this list).
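
For reference, a Credential Access Boundary is expressed as JSON. The following is a minimal sketch, not a complete policy, that grants read-only access to a single placeholder bucket; the full gcloud example appears later in this guide under Create a cluster and enable an interactive session.

    {
      "access_boundary": {
        "accessBoundaryRules": [{
          "availableResource": "//storage.googleapis.com/projects/_/buckets/bucket-name",
          "availablePermissions": ["inRole:roles/storage.objectViewer"]
        }]
      }
    }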

Objectives

  • Create a Dataproc cluster with Dataproc Personal Cluster Authentication enabled.

  • Start credential propagation to the cluster.

  • Use a Jupyter notebook on the cluster to run Spark jobs that authenticate with your credentials.

Before You Begin

Create a Project

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Make sure that billing is enabled for your Google Cloud project.

  4. Enable the Dataproc API (a gcloud alternative is sketched after this list).

    Enable the API

  5. Install the Google Cloud CLI.
  6. To initialize the gcloud CLI, run the following command:

    gcloud init
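
If you prefer to stay in the gcloud CLI you just initialized, the Dataproc API can also be enabled from the command line. This is a brief sketch; project-id is a placeholder for your project ID.

    gcloud services enable dataproc.googleapis.com --project=project-id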

Configure the Environment

Configure the environment from Cloud Shell or a local terminal:

Cloud Shell

  1. Start a Cloud Shell session.

Local terminal

  1. Run gcloud auth login to obtain valid user credentials.
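
    For example (a minimal sketch; project-id is a placeholder for the project that will host the cluster):

    # Obtain user credentials for the gcloud CLI.
    gcloud auth login

    # Optionally set the project that will host the cluster.
    gcloud config set project project-id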

Create a cluster and enable an interactive session

  1. Find the email address of your active account in gcloud.

    gcloud auth list --filter=status=ACTIVE --format="value(account)"
    

  2. Create a cluster.

    gcloud dataproc clusters create cluster-name \
        --properties=dataproc:dataproc.personal-auth.user=your-email-address \
        --enable-component-gateway \
        --optional-components=ANACONDA,JUPYTER,ZEPPELIN \
        --region=region
    

  3. Enable a credential propagation session for the cluster to start using your personal credentials when interacting with Google Cloud resources.

    gcloud dataproc clusters enable-personal-auth-session \
        --region=region \
        cluster-name
    

    Sample output:

    Injecting initial credentials into the cluster cluster-name...done.
    Periodically refreshing credentials for cluster cluster-name. This will continue running until the command is interrupted...
    

    1. Downscoped access boundary example: The following example enables a personal auth session that is more restrictive than the default downscoped credential access boundary. It restricts access to the Dataproc cluster's staging bucket (see Downscope with Credential Access Boundaries for more information).

      gcloud dataproc clusters enable-personal-auth-session \
          --project=PROJECT_ID \
          --region=REGION \
          --access-boundary=<(echo -n "{ \
        \"access_boundary\": { \
          \"accessBoundaryRules\": [{ \
            \"availableResource\": \"//storage.googleapis.com/projects/_/buckets/$(gcloud dataproc clusters describe --project=PROJECT_ID --region=REGION CLUSTER_NAME --format="value(config.configBucket)")\", \
            \"availablePermissions\": [ \
              \"inRole:roles/storage.objectViewer\", \
              \"inRole:roles/storage.objectCreator\", \
              \"inRole:roles/storage.objectAdmin\", \
              \"inRole:roles/storage.legacyBucketReader\" \
            ] \
          }] \
        } \
      }") \
          CLUSTER_NAME
  4. Keep the command running and switch to a new Cloud Shell tab or terminal session. The client will refresh the credentials while the command is running.

  5. Type Ctrl-C to end the session.
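
For reference, the cluster creation and credential propagation steps above can be combined into a single script. The following is a minimal sketch; my-personal-cluster and us-central1 are placeholder values that you would replace with your own cluster name and region.

    # Capture the email address of the active gcloud account.
    ACCOUNT=$(gcloud auth list --filter=status=ACTIVE --format="value(account)")

    # Create a cluster that runs interactive workloads as that identity.
    gcloud dataproc clusters create my-personal-cluster \
        --properties="dataproc:dataproc.personal-auth.user=${ACCOUNT}" \
        --enable-component-gateway \
        --optional-components=ANACONDA,JUPYTER,ZEPPELIN \
        --region=us-central1

    # Start credential propagation. Leave this running in its own terminal
    # and press Ctrl-C when you are done with the session.
    gcloud dataproc clusters enable-personal-auth-session \
        --region=us-central1 \
        my-personal-cluster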


Access Jupyter on the cluster

gcloud

  1. Get cluster details.
    gcloud dataproc clusters describe cluster-name --region=region
    

    The Jupyter Web interface URL is listed in cluster details.

    ...
    JupyterLab: https://UUID-dot-us-central1.dataproc.googleusercontent.com/jupyter/lab/
    ...
    
  2. Copy the URL into your local browser to launch the Jupyter UI.
  3. Check that personal cluster authentication was successful.
    1. Start a Jupyter terminal.
    2. Run gcloud auth list
    3. Verify that your username is the only active account.
  4. In a Jupyter terminal, enable Jupyter to authenticate with Kerberos and submit Spark jobs.
    kinit -kt /etc/security/keytab/dataproc.service.keytab dataproc/$(hostname -f)
    
    1. Run klist to verify that Jupyter obtained a valid TGT.
  5. In a Jupyter terminal, use gsutil to create a rose.txt file in a Cloud Storage bucket in your project.
    echo "A rose by any other name would smell as sweet" > /tmp/rose.txt
    

    gsutil cp /tmp/rose.txt gs://bucket-name/rose.txt
    
    1. Mark the file as private so that only your user account can read from or write to it. Jupyter will use your personal credentials when interacting with Cloud Storage.
      gsutil acl set private gs://bucket-name/rose.txt
      
    2. Verify your private access (a read-back check is sketched below).
      gsutil acl get gs://bucket-name/rose.txt
      

      [
        {
          "email": "$USER",
          "entity": "user-$USER",
          "role": "OWNER"
        }
      ]
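
      As a further check (a sketch that assumes the same bucket-name placeholder), read the object back from the Jupyter terminal; it should print the line you wrote, because your personal credentials own the object.

      gsutil cat gs://bucket-name/rose.txt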
      

Console

  1. In the Google Cloud console, open the cluster's Web Interfaces tab, then click the Component Gateway Jupyter link to launch the Jupyter UI.
  2. Check that personal cluster authentication was successful.
    1. Start a Jupyter terminal
    2. Run gcloud auth list
    3. Verify that your username is the only active account.
  3. In a Jupyter terminal, enable Jupyter to authenticate with Kerberos and submit Spark jobs.
    kinit -kt /etc/security/keytab/dataproc.service.keytab dataproc/$(hostname -f)
    
    1. Run klist to verify that Jupyter obtained a valid TGT.
  4. In a Jupyter terminal, use gsutil to create a rose.txt file in a Cloud Storage bucket in your project.
    echo "A rose by any other name would smell as sweet" > /tmp/rose.txt
    

    gsutil cp /tmp/rose.txt gs://bucket-name/rose.txt
    
    1. Mark the file as private so that only your user account can read from or write to it. Jupyter will use your personal credentials when interacting with Cloud Storage.
      gsutil acl set private gs://bucket-name/rose.txt
      
    2. Verify your private access.
      gsutil acl get gs://bucket-name/rose.txt
      
      [
        {
          "email": "$USER",
          "entity": "user-$USER",
          "role": "OWNER"
        }
      ]
      

Run a PySpark job from Jupyter

  1. Navigate to a folder, then create a PySpark notebook.
  2. Run a basic word count job against the rose.txt file you created above.

    text_file = sc.textFile("gs://bucket-name/rose.txt")
    counts = text_file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)
    print(counts.collect())
    

    Spark is able to read the rose.txt file in Cloud Storage because it runs with your user credentials.

    You can also check the Cloud Storage Bucket Audit Logs to verify that the job is accessing Cloud Storage with your identity (see Cloud Audit Logs with Cloud Storage for more information).
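
    For example, here is a sketch of such a check from the gcloud CLI, assuming that Data Access audit logs are enabled for Cloud Storage in your project; bucket-name and your-email-address are placeholders.

    gcloud logging read \
        'resource.type="gcs_bucket" AND resource.labels.bucket_name="bucket-name" AND protoPayload.authenticationInfo.principalEmail="your-email-address"' \
        --limit=10 \
        --format="value(protoPayload.methodName)"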

Cleanup

  1. Delete the Dataproc cluster.
    gcloud dataproc clusters delete cluster-name --region=region
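  2. Optionally, delete the rose.txt sample object. This is a sketch that assumes the same bucket-name placeholder used earlier.
    gsutil rm gs://bucket-name/rose.txt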