Dataproc on Google Kubernetes Engine

Access to this feature is by whitelist only. To request access, contact join-dataproc-k8s-alpha@google.com and include your project ID.

Overview

This feature allows you to submit Spark jobs to a running Google Kubernetes Engine cluster from the Dataproc Jobs API.

Use this feature to:

  • Deploy unified resource management
  • Isolate Spark jobs to accelerate the analytics life cycle
  • Build resilient infrastructure

Running Dataproc jobs on GKE

  1. Create a GKE cluster

    GCLOUD COMMAND

    Run the following gcloud beta container clusters create command locally or in Cloud Shell to create a GKE cluster that authenticates with Dataproc:

    CLUSTER=cluster-name (lower-case alphanumerics and "-" only)
    
    GCE_ZONE=zone (example: us-central1-a)
    
    gcloud beta container clusters create $CLUSTER \
        --scopes cloud-platform \
        --workload-metadata-from-node EXPOSED \
        --machine-type n1-standard-4 \
        --zone $GCE_ZONE
    

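    The create command configures kubectl credentials for the new cluster in the shell where it runs. If you run the later kubectl and helm steps from a different machine or Cloud Shell session, a minimal sketch of fetching those credentials yourself is:

    gcloud container clusters get-credentials $CLUSTER --zone $GCE_ZONE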

  2. Set up Helm

    1. Add Helm to your environment.

      Command-line

      Run the following commands locally or in Cloud Shell to download and unpack the Helm 3, beta 2 binary and record its location for use in the following steps.

      wget https://get.helm.sh/helm-v3.0.0-beta.2-linux-amd64.tar.gz
      
      mkdir ~/helm3
      
      tar xf helm-v3.0.0-beta.2-linux-amd64.tar.gz --strip 1 --directory ~/helm3
      
      PATH_TO_HELM=$HOME/helm3
      

      Notes:

      • See the GitHub Helm Releases page to install other versions (earlier or later releases, and builds for other processor architectures).
      • You can install the helm binary into a shared directory, such as /usr/local (see the sketch after these notes).
      • Follow the best practices in Securing your Helm Installation to keep your Helm setup secure.
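
      For example, a minimal sketch of copying the binary into a shared location (assuming /usr/local/bin as the destination) is:

      sudo cp ~/helm3/helm /usr/local/bin/helm
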
    2. Add the Dataproc Helm Chart repository to Helm.

      Command-line

      Run the following commands locally or in Cloud Shell to add the Dataproc Helm Chart repository to Helm.

      alias helm=$PATH_TO_HELM/helm
      
      helm repo add dataproc http://storage.googleapis.com/dataproc-helm-charts
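
      Optionally, you can refresh the repository index and confirm that the repository was added:

      helm repo update
      
      helm repo list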
      
  3. Register the GKE cluster with Dataproc

    To allow Spark job submission on a GKE cluster, install the dataproc-sparkoperator on the cluster. This operator registers itself as a Dataproc cluster. After registration, you can submit a Dataproc Spark job to run on your GKE cluster.

    The dataproc-sparkoperator has the same base functionality as the open source Spark Operator (spark-on-k8s-operator), but only Spark jobs submitted to Dataproc through the dataproc-sparkoperator running on a GKE cluster will be tracked as Dataproc jobs.

    1. Run helm install to register the GKE cluster with Dataproc.

      Command-line

      Set environment variables, then run helm install locally or in Cloud Shell to register the GKE cluster with Dataproc.

      1. Set the project ID where the cluster will be registered. The following command sets the variable to your default project, but you can set the variable to a different project.
        PROJECT=$(gcloud config get-value core/project)
        
      2. Set the name of an existing Cloud Storage bucket for staging temp files. Do not include gs:// in the bucket name.
        BUCKET=bucket-name
        
      3. Set a Dataproc region for the Dataproc cluster. To avoid additional networking charges, specify the region associated with the $GCE_ZONE of your GKE cluster.
        REGION=region (example: us-central1)
        
      4. Set the name of the Helm deployment.
        DEPLOYMENT=deployment-name (example: my-spark-deployment)
        
      5. Run helm install. Note: For simplicity, the --clusterName flag reuses the previously set GKE cluster name as the name of the Dataproc cluster resource. You can set a different name for the Dataproc cluster if you prefer.
        helm install "${DEPLOYMENT}" dataproc/dataproc-sparkoperator \
            --set sparkJobNamespace=default \
            --set projectId="${PROJECT}" \
            --set dataprocRegion="${REGION}" \
            --set bucket="${BUCKET}" \
            --set clusterName="${CLUSTER}"
        
    2. Check deployment status

      Go to Google Kubernetes Engine Workloads in the Google Cloud Console to view deployment status. The deployment is titled "${DEPLOYMENT}"-dataproc-sparkoperator. After the status changes from "Does not have minimum availability" to "Ok", you can view cluster details from the Cloud Console or by running the following gcloud command:

      gcloud dataproc clusters describe --region "${REGION}" "${CLUSTER}"
      

      The cluster's dataproc:alpha.is-kubernetes-cluster property should be set to true.

      ...
          properties:
            dataproc:alpha.is-kubernetes-cluster: 'true'
      ...
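
      You can also check from the command line (assuming kubectl is configured for the GKE cluster) that the operator pod is running; its name begins with the Helm deployment name:

      kubectl get pods --namespace default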
      
  4. Submit a Spark job

    Submit Spark jobs to the Dataproc cluster using gcloud dataproc commands, the Dataproc Jobs API, or the Cloud Console (see Submit a job).

    Spark job example:

    gcloud dataproc jobs submit spark \
        --cluster "${CLUSTER}" \
        --class org.apache.spark.examples.SparkPi \
        --jars file:///usr/lib/spark/examples/jars/spark-examples.jar
    

    PySpark job example:

    gcloud dataproc jobs submit pyspark \
        --cluster "${CLUSTER}" foo.py
    

    SparkR job example:

    gcloud dataproc jobs submit spark-r \
        --cluster "${CLUSTER}" file:/usr/lib/spark/examples/src/main/r/dataframe.R
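
    Jobs submitted this way are tracked by the Dataproc Jobs API like jobs on any other Dataproc cluster. For example, you can list them with:

    gcloud dataproc jobs list \
        --region "${REGION}" \
        --cluster "${CLUSTER}"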
    

Accessing the Spark Operator

Since the dataproc-sparkoperator is based on and supplements the core functionality of the Kubernetes Operator for Apache Spark (spark-on-k8s-operator), you can submit Spark applications directly to this operator with sparkctl create or kubectl apply. Note, however, that these applications will not be surfaced in the Dataproc Jobs API.

Example:

git clone https://github.com/GoogleCloudPlatform/spark-on-k8s-operator
cd spark-on-k8s-operator
kubectl apply -f examples/spark-pi.yaml

To view your Spark Applications:

kubectl get sparkapplications
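
To inspect the status and events of a specific application (the example manifest above names its application spark-pi):

kubectl describe sparkapplication spark-pi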

Dataproc Docker Image

A default Dataproc Docker image is available in the cluster. By default, the image configures Spark with the Cloud Storage connector.
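
This means Spark jobs can read and write gs:// paths without additional connector setup. As a sketch, a PySpark job whose main file is staged in a Cloud Storage bucket (my-job.py is a hypothetical script) could be submitted with:

gcloud dataproc jobs submit pyspark \
    --cluster "${CLUSTER}" gs://"${BUCKET}"/my-job.py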

Logging

Job driver and executor logs are available in Stackdriver Logging under the GKE cluster and namespace.
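
As a sketch, assuming your cluster uses the Kubernetes monitoring resource type k8s_container, you can query these logs from the command line:

gcloud logging read \
    "resource.type=k8s_container AND resource.labels.cluster_name=\"${CLUSTER}\"" \
    --limit 50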

Cleaning up

To delete all Kubernetes resources, delete the GKE cluster:

gcloud beta container clusters delete $CLUSTER

To keep the GKE cluster and remove other resources:

  1. Delete the deployment:
    helm delete "${DEPLOYMENT}"
    
  2. Delete the Dataproc cluster resource (this resource does not incur Dataproc billing charges during alpha):
    gcloud dataproc clusters delete "${CLUSTER}"
    

Troubleshooting

  • Cluster does not appear in the API

    1. Run kubectl get pod to see if the sparkoperator pod has ImagePullBackOff status, which indicates a problem with the service account you are using for the GKE VMs.
    2. If the pod has RUNNING status, get the dataproc agent container logs by running kubectl logs POD_NAME dataproc-agent, and then send the logs to the alpha mailing list.
  • Hung drivers: Spark 2.4.1+ has a known issue, SPARK-27812, where drivers (particularly PySpark drivers) hang due to a Kubernetes client thread. To work around this issue:

    1. Stop your SparkSession or SparkContext by calling spark.stop() on your SparkSession or sc.stop() on your SparkContext.
    2. Use a Spark 2.4.0-based image, such as gcr.io/dataproc-2f10d78d114f6aaec7646/spark/spark.