This feature is available by whitelist only. To request access, contact join-dataproc-k8s-alpha@google.com and include your project ID in your request.
Overview
This feature allows you to submit Spark jobs to a running Google Kubernetes Engine cluster from the Dataproc Jobs API.
Use this feature to:
- Deploy unified resource management
- Isolate Spark jobs to accelerate the analytics life cycle
- Build resilient infrastructure
Running Dataproc jobs on GKE
Create a GKE cluster
GCLOUD COMMAND
Run the following gcloud beta container clusters create command locally or in Cloud Shell to create a GKE cluster that authenticates with Dataproc. This command:
- sets --workload-metadata-from-node to EXPOSED to use Compute Engine VM authentication.
- sets --scopes cloud-platform to obtain permission to register the cluster.
- sets the name of the cluster to $CLUSTER and the cluster zone to $GCE_ZONE.
CLUSTER=cluster-name (lower-case alphanumerics and "-" only)
GCE_ZONE=zone (example: us-central1-a)
gcloud beta container clusters create $CLUSTER \
    --scopes cloud-platform \
    --workload-metadata-from-node EXPOSED \
    --machine-type n1-standard-4 \
    --zone $GCE_ZONE
Notes:
- Service account: This example uses the default Compute Engine service account for authentication. If you set a custom service account, it must be granted the Dataproc Editor and Dataproc Worker roles in the project.
- Machine type: 4 vCPUs are recommended.
- See Security overview to protect your cluster.
Set up Helm
- Add Helm to your environment.
Command-line
Run the following commands locally or in Cloud Shell to unpack the Helm 3, beta 2 binary and add it to your PATH.
wget https://get.helm.sh/helm-v3.0.0-beta.2-linux-amd64.tar.gz
mkdir ~/helm3
tar xf helm-v3.0.0-beta.2-linux-amd64.tar.gz --strip 1 --directory ~/helm3
PATH_TO_HELM=$HOME/helm3
Notes:
- See the GitHub Helm Releases page to install other versions (earlier or later releases, or builds for other processors).
- You can install the helm binary into a shared directory, such as
/usr/local.
- Follow the best practices in Securing your Helm Installation to keep your Helm setup secure.
- Add the Dataproc Helm Chart repository to Helm.
Command-line
Run the following commands locally or in Cloud Shell to add the Dataproc Helm Chart repository to Helm.
alias helm=$PATH_TO_HELM/helm
helm repo add dataproc http://storage.googleapis.com/dataproc-helm-charts
Register the GKE cluster with Dataproc
To allow Spark job submission on a GKE cluster, install the dataproc-sparkoperator on the cluster. This operator registers itself as a Dataproc cluster. After registration, you can submit a Dataproc Spark job on your GKE cluster.
The dataproc-sparkoperator has the same base functionality as the open source Spark Operator (spark-on-k8s-operator), but only Spark jobs submitted to Dataproc using the dataproc-sparkoperator running on a GKE cluster will be tracked as Dataproc jobs.
- Run helm install to register the GKE cluster with Dataproc.
Command-line
Set environment variables, then run helm install locally or in Cloud Shell to register the GKE cluster with Dataproc.
- Set the project ID where the cluster will be registered.
The following command sets the variable to your default project,
but you can set the variable to a different project.
PROJECT=$(gcloud config get-value core/project)
- Set the name of an existing Cloud Storage bucket for staging temp files. Do not include gs:// in the bucket name.
BUCKET=bucket-name
- Set a Dataproc region for the Dataproc cluster. To avoid additional networking charges, specify the region associated with the $GCE_ZONE of your GKE cluster.
REGION=region (example: us-central1)
- Set the name of the Helm deployment.
DEPLOYMENT=deployment-name (example: my-spark-deployment)
- Run helm install. Note: For simplicity, the --clusterName flag uses the previously set GKE cluster name as the Dataproc cluster resource name. You can set a different name for the Dataproc cluster if you prefer.
helm install "${DEPLOYMENT}" dataproc/dataproc-sparkoperator \ --set sparkJobNamespace=default \ --set projectId="${PROJECT}" \ --set dataprocRegion="${REGION}" \ --set bucket="${BUCKET}" \ --set clusterName="${CLUSTER}"
Check deployment status
Go to Google Kubernetes Engine Workloads in the Google Cloud Console to view deployment status. The deployment is titled "${DEPLOYMENT}"-dataproc-sparkoperator. After the status changes from "Does not have minimum availability" to "Ok", you can view cluster details from the Cloud Console or by running the following gcloud command:
gcloud dataproc clusters describe --region "${REGION}" "${CLUSTER}"
The cluster's dataproc:alpha.is-kubernetes-cluster property should be set to true.
...
properties:
  dataproc:alpha.is-kubernetes-cluster: 'true'
...
Submit a Spark job
Submit Spark jobs to the Dataproc cluster using gcloud dataproc commands, the Dataproc Jobs API, or the Cloud Console (see Submit a job).
Spark job example:
gcloud dataproc jobs submit spark \
    --cluster "${CLUSTER}" \
    --class org.apache.spark.examples.SparkPi \
    --jars file:///usr/lib/spark/examples/jars/spark-examples.jar
PySpark job example:
gcloud dataproc jobs submit pyspark \
    --cluster "${CLUSTER}" foo.py
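Here, foo.py is a placeholder for your own PySpark script. As a hypothetical minimal sketch (the file name and logic are illustrative, not part of the product), a script like the following would run on the cluster:

#!/usr/bin/env python
# foo.py -- hypothetical minimal PySpark job used only to illustrate the
# submission command above; replace it with your own application logic.
from operator import add
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("foo").getOrCreate()

# Sum the integers 1..1000 across the cluster's executors.
total = spark.sparkContext.parallelize(range(1, 1001)).reduce(add)
print("sum(1..1000) = {}".format(total))

# Stop the session explicitly (see the hung-driver workaround under Troubleshooting).
spark.stop()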
SparkR job example:
gcloud dataproc jobs submit spark-r \
    --cluster "${CLUSTER}" file:/usr/lib/spark/examples/src/main/r/dataframe.R
Accessing the Spark Operator
Since the dataproc-sparkoperator
is based on and supplements the core
functionality of the
Kubernetes Operator for Apache Spark (spark-on-k8s-operator
), you can directly submit Spark
applications to this operator with sparkctl create
or
kubectl apply
. Note, however, that these applications will not be
surfaced in the Dataproc jobs API.
Example:
git clone https://github.com/GoogleCloudPlatform/spark-on-k8s-operator
cd spark-on-k8s-operator
kubectl apply -f examples/spark-pi.yaml
To view your Spark Applications:
kubectl get sparkapplications
Dataproc Docker Image
A default Dataproc Docker image is available in the cluster. The image's default configuration sets up Spark with the Cloud Storage Connector.
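Because the Cloud Storage Connector is preconfigured, jobs can reference gs:// paths directly. The following PySpark sketch is illustrative only; gs://your-bucket/path/data.txt is a placeholder for an object you own, not a resource created by this guide:

# count_lines.py -- hypothetical example of reading a Cloud Storage object
# through the preinstalled Cloud Storage Connector.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gcs-connector-example").getOrCreate()

# The gs:// scheme is resolved by the Cloud Storage Connector in the image.
lines = spark.read.text("gs://your-bucket/path/data.txt")
print("line count:", lines.count())

spark.stop()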
Logging
Job driver and executor logs are available in Stackdriver Logging under the GKE cluster and namespace.
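As one way to read these logs programmatically, the sketch below uses the google-cloud-logging Python client. The filter is an assumption about how your GKE container logs are labeled (resource type and namespace label); check the entries shown in the Logs Viewer and adjust the filter to match.

# read_driver_logs.py -- hedged sketch for pulling recent GKE container log
# entries; the filter values below are assumptions, not guaranteed label names.
from google.cloud import logging

client = logging.Client()  # uses your default project and credentials

log_filter = 'resource.type="k8s_container" AND resource.labels.namespace_name="default"'
for entry in client.list_entries(filter_=log_filter, page_size=50):
    print(entry.timestamp, entry.payload)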
Cleaning up
To delete all Kubernetes resources, delete the GKE cluster:
gcloud beta container clusters delete $CLUSTER
To keep the GKE cluster and remove other resources:
- Delete the deployment:
helm delete "${DEPLOYMENT}"
- Delete the Dataproc cluster resource (this resource does not incur Dataproc billing charges during alpha):
gcloud dataproc clusters delete "${CLUSTER}"
Troubleshooting
Cluster does not appear in the API
- Run kubectl get pod to see if the sparkoperator pod has ImagePullBackOff status, which indicates a problem with the service account you are using for the GKE VMs.
- If the pod has RUNNING status, get the dataproc agent container logs by running kubectl logs POD_NAME dataproc-agent, and then send the logs to the alpha mailing list.
Hung Drivers
Spark 2.4.1+ has a known issue, SPARK-27812, where drivers (particularly PySpark drivers) hang due to a Kubernetes client thread. To work around this issue:
- Stop your SparkSession or SparkContext by calling spark.stop() on your SparkSession or sc.stop() on your SparkContext, as shown in the sketch after this list.
- Use a Spark 2.4.0-based image, such as gcr.io/dataproc-2f10d78d114f6aaec7646/spark/spark.
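A minimal sketch of the first workaround, assuming a PySpark script; the point is simply that the session is stopped explicitly before the script exits:

# hypothetical PySpark script structure illustrating the explicit-stop workaround
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("explicit-stop-example").getOrCreate()
try:
    # ... your job logic ...
    spark.range(100).count()
finally:
    # Stop the session explicitly so the driver can exit (SPARK-27812 workaround).
    spark.stop()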