Overview
This feature allows you to submit Spark jobs to a running Google Kubernetes Engine cluster from the Dataproc Jobs API.
Use this feature to:
- Deploy unified resource management
- Isolate Spark jobs to accelerate the analytics life cycle
Running Dataproc jobs on GKE
Create a GKE cluster
A running GKE cluster is needed as the deployment platform for Dataproc components.
gcloud

Set environment variables, then run the gcloud beta container clusters create command locally or in Cloud Shell to create a GKE cluster that can act as a Dataproc cluster.

- Set environment variables.
  - Set gke-cluster-name. Use lowercase alphanumerics and "-" only.
  - Set the cluster region, for example, "us-central1".
  - Set the cluster zone to a zone in the selected region, for example, "us-central1-a".

      GKE_CLUSTER=gke-cluster-name \
        GCE_REGION=region

- Run the gcloud command.
  - Set --scopes to "cloud-platform" to use the cluster service account as the permissions mechanism.
  - Set --workload-metadata to "GCE_METADATA" to use Compute Engine VM authentication.
  - Set --machine-type to "n1-standard-4" (a minimum of 4 CPUs is recommended).

      gcloud beta container clusters create "${GKE_CLUSTER}" \
        --scopes=cloud-platform \
        --workload-metadata=GCE_METADATA \
        --machine-type=n1-standard-4 \
        --region="${GCE_REGION}"
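As an optional check that is not part of the original steps, you can fetch credentials for the new cluster and confirm that its nodes are ready before continuing. The sketch below assumes the GKE_CLUSTER and GCE_REGION variables set above.

    # Fetch kubectl credentials for the new GKE cluster.
    gcloud container clusters get-credentials "${GKE_CLUSTER}" \
      --region="${GCE_REGION}"

    # Confirm the cluster nodes are registered and Ready.
    kubectl get nodes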
Create a Dataproc-on-GKE cluster
This step creates a Dataproc cluster on an existing GKE cluster, deploying components that link the GKE cluster to the Dataproc service so that Spark jobs can be submitted.
gcloud

Set environment variables, then run the gcloud beta dataproc clusters create command locally or in Cloud Shell to create a Dataproc-on-GKE cluster.

- Set environment variables.
  - Use lowercase alphanumerics and "-" only for dataproc-cluster-name.
  - Specify a Dataproc-on-GKE version, for example, "1.4.27-beta".
  - Specify a Cloud Storage bucket URI for staging artifacts.

      DATAPROC_CLUSTER=dataproc-cluster-name \
        VERSION=version \
        BUCKET=bucket-name

- Run the gcloud command.

      gcloud beta dataproc clusters create "${DATAPROC_CLUSTER}" \
        --gke-cluster="${GKE_CLUSTER}" \
        --region="${GCE_REGION}" \
        --image-version="${VERSION}" \
        --bucket="${BUCKET}"
Notes:
- The above command automatically generates a namespace within the GKE cluster, but you can specify the namespace by adding the --gke-cluster-namespace argument.
- You must give Dataproc's service accounts access. Dataproc's project-specific service account must be granted the Kubernetes Engine Admin IAM role (an example grant is sketched after these notes). This service account is of the form: service-{project-number}@dataproc-accounts.iam.gserviceaccount.com
- For the 1.4.23-beta version only, you must also grant the Dataproc installation account, service-51209575642@gcp-sa-saasmanagement.iam.gserviceaccount.com, access to your Kubernetes cluster via the Kubernetes Engine Admin IAM role.
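As a minimal sketch of the grant described above, assuming your project ID and project number are available in the hypothetical PROJECT_ID and PROJECT_NUMBER variables, the Kubernetes Engine Admin role (roles/container.admin) can be bound at the project level like this:

    # Hypothetical values; substitute your own project ID and project number.
    PROJECT_ID=my-project-id
    PROJECT_NUMBER=123456789012

    # Grant Dataproc's project-specific service account the Kubernetes Engine Admin role.
    gcloud projects add-iam-policy-binding "${PROJECT_ID}" \
      --member="serviceAccount:service-${PROJECT_NUMBER}@dataproc-accounts.iam.gserviceaccount.com" \
      --role="roles/container.admin"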
Submit a Spark job
Submit Spark jobs to the Dataproc cluster using gcloud dataproc commands, the Dataproc Jobs API, and Cloud Console (see Submit a job).

Spark job example:

    gcloud dataproc jobs submit spark \
      --cluster="${DATAPROC_CLUSTER}" \
      --region="${GCE_REGION}" \
      --class=org.apache.spark.examples.SparkPi \
      --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar

PySpark job example:

    gcloud dataproc jobs submit pyspark foo.py \
      --cluster="${DATAPROC_CLUSTER}" \
      --region="${GCE_REGION}"

SparkR job example:

    gcloud dataproc jobs submit spark-r \
      file:/usr/lib/spark/examples/src/main/r/dataframe.R \
      --cluster="${DATAPROC_CLUSTER}" \
      --region="${GCE_REGION}"
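To follow up on submitted jobs from the command line, you can list and inspect them as shown below. This is an illustrative addition to the steps above; JOB_ID is a placeholder for an ID shown in the list output.

    # List jobs submitted to the Dataproc-on-GKE cluster.
    gcloud dataproc jobs list \
      --cluster="${DATAPROC_CLUSTER}" \
      --region="${GCE_REGION}"

    # Inspect the state and driver output location of a single job.
    gcloud dataproc jobs describe JOB_ID \
      --region="${GCE_REGION}"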
Versions
Creating a Dataproc-on-GKE cluster requires that you specify a version number. A version number corresponds to a specific bundle of required and optional components, which are installed on the cluster in the Dataproc cluster-specific namespace. The following table lists the available versions.
Version | Components | Default Images | Notes
---|---|---|---
1.4.23-beta | Dataproc Job Agent, Spark Operator | Spark: gcr.io/cloud-dataproc/spark:1.4.23-deb9-beta |
1.4.27-beta | Dataproc Job Agent, Spark Operator | Spark: gcr.io/cloud-dataproc/spark:1.4.27-deb9-beta |
Dataproc Docker Image
The cluster automatically uses the appropriate default Dataproc Docker image based on the version specified when the Dataproc cluster was created. The image's default configuration sets up Spark with the Cloud Storage Connector. You can view the default images in the version documentation.
Logging
Job driver and executor logs are available in Stackdriver Logging under the GKE cluster and namespace.
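As an illustration only, you can query these logs from the command line with gcloud logging read. The resource labels below are an assumption about how GKE container logs are labeled, and the namespace is assumed to match the Dataproc cluster name; adjust both to your environment (for example, if you set --gke-cluster-namespace).

    # Read recent driver and executor container logs from the GKE cluster and namespace.
    gcloud logging read \
      "resource.type=\"k8s_container\" AND resource.labels.cluster_name=\"${GKE_CLUSTER}\" AND resource.labels.namespace_name=\"${DATAPROC_CLUSTER}\"" \
      --limit=20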
Troubleshooting
- Stalled Drivers
Spark 2.4.1+ has a known issue, SPARK-27812, where drivers (particularly PySpark drivers) stall due to a Kubernetes client thread. To work around this issue, stop your SparkSession or SparkContext by calling spark.stop() on your SparkSession or sc.stop() on your SparkContext.
Cleaning up
To delete the allocated resources, use the following gcloud
delete commands. To avoid errors, delete the Dataproc cluster before
deleting the GKE cluster.
Delete Dataproc cluster resources.

    gcloud beta dataproc clusters delete "${DATAPROC_CLUSTER}" \
      --region="${GCE_REGION}"

Delete the GKE cluster.

    gcloud beta container clusters delete "${GKE_CLUSTER}" \
      --region="${GCE_REGION}"
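If the Cloud Storage staging bucket was created only for this cluster, you can optionally remove it as well. This is an addition to the original steps and permanently deletes the bucket and all of its contents, so skip it if the bucket is shared.

    # Optional: permanently delete the staging bucket and everything in it.
    gsutil rm -r "gs://${BUCKET}"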