Dataproc on Google Kubernetes Engine

Overview

This feature allows you to submit Spark jobs to a running Google Kubernetes Engine cluster from the Dataproc Jobs API.

Use this feature to:

  • Deploy unified resource management
  • Isolate Spark jobs to accelerate the analytics life cycle

Running Dataproc jobs on GKE

  1. Create a GKE cluster

    A running GKE cluster is needed as the deployment platform for Dataproc components.

    gcloud

    Set environment variables, then run the gcloud beta container clusters create command locally or in Cloud Shell to create a GKE cluster that can act as a Dataproc cluster.

    1. Set environment variables.
      • Set gke-cluster-name; use lowercase alphanumeric characters and "-" only.
      • Set the cluster region, for example, "us-central1".
      GKE_CLUSTER=gke-cluster-name \
        GCE_REGION=region
      
    2. Run the gcloud command.
      • Set --scopes to "cloud-platform" to use the cluster service account as the permissions mechanism.
      • Set --workload-metadata to "GCE_METADATA" to use Compute Engine VM authentication.
      • Set --machine-type to "n1-standard-4" (a minimum of 4 CPUs is recommended).
      gcloud beta container clusters create "${GKE_CLUSTER}" \
          --scopes=cloud-platform \
          --workload-metadata=GCE_METADATA \
          --machine-type=n1-standard-4 \
          --region="${GCE_REGION}"
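
    Optionally, confirm that the GKE cluster is up and reachable before continuing. The following is a minimal check, assuming the gcloud and kubectl CLIs are installed where you ran the create command:

    # Check that the cluster reports RUNNING status.
    gcloud container clusters describe "${GKE_CLUSTER}" \
        --region="${GCE_REGION}" \
        --format="value(status)"

    # Fetch kubeconfig credentials, then list the cluster nodes.
    gcloud container clusters get-credentials "${GKE_CLUSTER}" \
        --region="${GCE_REGION}"
    kubectl get nodes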
      
  2. Create a Dataproc-on-GKE cluster

    This step creates a Dataproc cluster on an existing GKE cluster, deploying the components that link the GKE cluster to the Dataproc service so that Spark jobs can be submitted.

    gcloud

    Set environment variables, then run the gcloud beta dataproc clusters create command locally or in Cloud Shell to create a Dataproc-on-GKE cluster.

    1. Set environment variables.
      • Set dataproc-cluster-name; use lowercase alphanumeric characters and "-" only.
      • Specify a Dataproc-on-GKE version, for example, "1.4.27-beta".
      • Specify the name of a Cloud Storage bucket to use for staging artifacts.

      DATAPROC_CLUSTER=dataproc-cluster-name \
        VERSION=version \
        BUCKET=bucket-name
      

    2. Run the gcloud command.

      gcloud beta dataproc clusters create "${DATAPROC_CLUSTER}" \
          --gke-cluster="${GKE_CLUSTER}" \
          --region="${GCE_REGION}" \
          --image-version="${VERSION}" \
          --bucket="${BUCKET}"
      

      Notes:

      • The command above automatically generates a namespace within the GKE cluster; to specify your own namespace, add the --gke-cluster-namespace flag.
      • You must grant Dataproc's service accounts access to your GKE cluster. Dataproc's project-specific service account must be granted the Kubernetes Engine Admin IAM role. This service account has the form service-{project-number}@dataproc-accounts.iam.gserviceaccount.com (see the example grant after these notes).

        • For the 1.4.23-beta version only, you must also grant the Dataproc installation account, service-51209575642@gcp-sa-saasmanagement.iam.gserviceaccount.com access to your Kubernetes cluster via the Kubernetes Engine Admin IAM role.
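
      For example, granting the Kubernetes Engine Admin role to the project-specific service account might look like the following sketch, where project-id and project-number are placeholders for your own values:

      # Grant the Kubernetes Engine Admin role (roles/container.admin) to the
      # Dataproc project-specific service account.
      gcloud projects add-iam-policy-binding project-id \
          --member="serviceAccount:service-project-number@dataproc-accounts.iam.gserviceaccount.com" \
          --role="roles/container.admin"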

  3. Submit a Spark job

    Submit Spark jobs to the Dataproc cluster using gcloud dataproc commands, the Dataproc Jobs API, or the Cloud Console (see Submit a job).

    Spark job example:

    gcloud dataproc jobs submit spark \
        --cluster="${DATAPROC_CLUSTER}" \
        --region="${GCE_REGION}" \
        --class=org.apache.spark.examples.SparkPi \
        --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar
    

    PySpark job example:

    gcloud dataproc jobs submit pyspark foo.py \
        --cluster="${DATAPROC_CLUSTER}" \
        --region="${GCE_REGION}"
    

    SparkR job example:

    gcloud dataproc jobs submit spark-r file:/usr/lib/spark/examples/src/main/r/dataframe.R \
        --cluster="${DATAPROC_CLUSTER}" \
        --region="${GCE_REGION}"
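
    After submitting a job, you can check its status and view driver output from the same terminal. A minimal sketch using standard Dataproc job commands (job-id is a placeholder for an ID returned by the submit command):

    # List recent jobs submitted to the cluster.
    gcloud dataproc jobs list \
        --cluster="${DATAPROC_CLUSTER}" \
        --region="${GCE_REGION}"

    # Wait for a job to finish and stream its driver output.
    gcloud dataproc jobs wait job-id \
        --region="${GCE_REGION}"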
    

Versions

Creating a Dataproc-on-GKE cluster requires that you specify a version number. Each version number corresponds to a specific bundle of required and optional components, which are installed on the cluster in the Dataproc cluster-specific namespace. The available versions are listed below.

1.4.23-beta
  • Components: Dataproc Job Agent, Spark Operator
  • Default images: Spark: gcr.io/cloud-dataproc/spark:1.4.23-deb9-beta
  • Notes:
    • Supports Spark, PySpark, and SparkR job types
    • Requires the service-51209575642@gcp-sa-saasmanagement.iam.gserviceaccount.com service account to be granted the Kubernetes Engine Admin IAM role

1.4.27-beta
  • Components: Dataproc Job Agent, Spark Operator
  • Default images: Spark: gcr.io/cloud-dataproc/spark:1.4.27-deb9-beta
  • Notes:
    • Supports Spark, PySpark, and SparkR job types

Dataproc Docker Image

The cluster automatically uses the default Dataproc Docker images that correspond to the version specified when the Dataproc cluster was created. The default image configuration sets up Spark with the Cloud Storage connector. You can view the default images in the Versions section above.
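
If you want to double-check which version a running cluster was created with, you can describe the cluster and inspect its configuration; a sketch (the exact fields shown may vary across beta releases):

gcloud beta dataproc clusters describe "${DATAPROC_CLUSTER}" \
    --region="${GCE_REGION}"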

Logging

Job driver and executor logs are available in Stackdriver Logging under the GKE cluster and namespace.
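
As a command-line starting point, you can query these logs with gcloud. The following sketch assumes the standard GKE container resource type (k8s_container) and filters only on the GKE cluster name; add a namespace or container filter to narrow the results:

# Show recent log entries from containers in the GKE cluster.
gcloud logging read \
    "resource.type=\"k8s_container\" AND resource.labels.cluster_name=\"${GKE_CLUSTER}\"" \
    --limit=50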

Troubleshooting

  • Stalled drivers: Spark 2.4.1+ has a known issue, SPARK-27812, in which drivers (particularly PySpark drivers) stall because of a Kubernetes client thread. To work around this issue, do one of the following:

    1. Stop your Spark application explicitly by calling spark.stop() on your SparkSession or sc.stop() on your SparkContext.
    2. Use a Spark 2.4.0-based image, such as gcr.io/dataproc-2f10d78d114f6aaec7646/spark/spark.

Cleaning up

To delete the allocated resources, use the following gcloud delete commands. To avoid errors, delete the Dataproc cluster before deleting the GKE cluster.

Delete Dataproc cluster resources.

gcloud beta dataproc clusters delete "${DATAPROC_CLUSTER}" \
    --region="${GCE_REGION}"

Delete the GKE cluster.

gcloud beta container clusters delete "${GKE_CLUSTER}" \
    --region="${GCE_REGION}"
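
To confirm that both clusters were removed, you can list the remaining clusters (an optional check):

# Each command should no longer list the deleted cluster.
gcloud beta dataproc clusters list \
    --region="${GCE_REGION}"

gcloud container clusters list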