GDC provides Dataproc Container for Spark, an Apache Spark environment for
data processing. For more information about Apache Spark, see
https://spark.apache.org/. Use Dataproc Container for Spark containers to
run new or existing Spark applications within a GDC Kubernetes cluster with
minimal alteration. If you are familiar with Spark tools, you can keep using them.
Define your Spark application in a YAML file, and GDC allocates the resources for you. The Dataproc Container for Spark container starts in seconds. Spark executors scale up or shut down according to your needs.
Configure Dataproc Container for Spark containers on GDC to use specialized hardware, such as GPUs or other specialty hardware nodes.
Deploy the Dataproc Container for Spark service
The Platform Administrator (PA) must install Marketplace services for you before you can use the services. Contact your PA if you need Dataproc Container for Spark. See Install a GDC Marketplace software package for more information.
Prerequisites for running Spark applications
You must have a service account in user clusters to use the Dataproc Container for Spark service. The Dataproc Container for Spark creates a Spark driver pod to run a Spark application. A Spark driver pod needs a Kubernetes service account in the namespace of the pod with permissions to do the following actions:
- Create, get, list, and delete executor pods.
- Create a Kubernetes headless service for the driver.
Before running a Spark application, complete the following steps to ensure you
have a service account with the previous permissions in the foo namespace:
1. Create a service account for a Spark driver pod to use in the foo namespace:
    kubectl create serviceaccount spark --kubeconfig AO_USER_KUBECONFIG --namespace=foo
2. Create a role for granting permissions to create, get, list, and delete executor pods, and create a Kubernetes headless service for the driver in the foo namespace:
    kubectl create role spark-driver --kubeconfig AO_USER_KUBECONFIG --verb=* \
      --resource=pods,services,configmaps,persistentvolumeclaims \
      --namespace=foo
3. Create a role binding for granting the service account role access in the foo namespace:
    kubectl create --kubeconfig AO_USER_KUBECONFIG \
      rolebinding spark-spark-driver \
      --role=spark-driver --serviceaccount=foo:spark \
      --namespace=foo
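To confirm that the role binding took effect, one option is to ask the API server whether the service account can perform each action the driver needs. The following Python sketch only builds the `kubectl auth can-i` command lines for the permissions listed above and prints them for review. The helper name `can_i_commands` is hypothetical, `AO_USER_KUBECONFIG` remains a placeholder, and running the printed commands assumes your own account is allowed to impersonate the service account.

```python
import shlex

# Actions the driver's service account needs, per the permission list above.
REQUIRED = [
    ("create", "pods"),
    ("get", "pods"),
    ("list", "pods"),
    ("delete", "pods"),
    ("create", "services"),
]


def can_i_commands(namespace, account, kubeconfig="AO_USER_KUBECONFIG"):
    """Build one `kubectl auth can-i` command per required permission."""
    subject = f"system:serviceaccount:{namespace}:{account}"
    return [
        ["kubectl", "auth", "can-i", verb, resource,
         "--as", subject, "--namespace", namespace,
         "--kubeconfig", kubeconfig]
        for verb, resource in REQUIRED
    ]


if __name__ == "__main__":
    # Print the commands so you can review them before running any of them.
    for cmd in can_i_commands("foo", "spark"):
        print(shlex.join(cmd))
```

Each command answers `yes` or `no`, so a `no` points to the permission that is still missing from the role.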
Run sample Spark 3 applications
Containerizing Spark applications simplifies running big data applications on
your premises using GDC. As an Application Operator (AO), run Spark
applications specified in GKE objects of the SparkApplication custom resource type.
To run and use an Apache Spark 3 application on GDC, complete the following steps:
1. Examine the spark-operator image in your project to find the $DATAPROC_IMAGE to reference in your Spark application:
    export DATAPROC_IMAGE=$(kubectl get pod --kubeconfig AO_USER_KUBECONFIG \
      --selector app.kubernetes.io/name=spark-operator -n foo \
      -o=jsonpath='{.items[*].spec.containers[0].image}' \
      | sed 's/spark-operator/dataproc/')
   For example:
    export DATAPROC_IMAGE=10.200.8.2:10443/dataproc-service/private-cloud-devel/dataproc:3.1-dataproc-17
2. Write a SparkApplication specification and store it in a YAML file. For more information, see the Write a Spark application specification section.
3. Submit, run, and monitor your Spark application as configured in a SparkApplication specification on the GKE cluster with the kubectl command. For more information, see the Application examples section.
4. Review the status of the application.
5. Optional: Review the application logs. For more information, see the View the logs of a Spark 3 application section.
Use the Spark application to collect and surface the status of the driver and executors to the user.
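The image lookup in the first step relies on a plain text substitution: `sed 's/spark-operator/dataproc/'` rewrites the first occurrence of `spark-operator` in the operator's image reference. The following Python sketch mimics that substitution so you can see what the pipeline produces. The operator image string is hypothetical, chosen only so that the result matches the example value shown in the step, and the function name is illustrative.

```python
def dataproc_image_from_operator(operator_image):
    """Mimic `sed 's/spark-operator/dataproc/'`: replace the first match only."""
    return operator_image.replace("spark-operator", "dataproc", 1)


# Hypothetical operator image, chosen so the result matches the example above.
operator_image = (
    "10.200.8.2:10443/dataproc-service/private-cloud-devel/"
    "spark-operator:3.1-dataproc-17"
)
print(dataproc_image_from_operator(operator_image))
```

Because only the first match is rewritten, the derived value is worth checking by eye if the registry path itself happens to contain `spark-operator`.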
Write a Spark application specification
A SparkApplication specification includes the following components:
- The apiVersion field.
- The kind field.
- The metadata field.
- The spec section.
For more information, see Writing a SparkApplication Spec on GitHub:
https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/user-guide.md#writing-a-sparkapplication-spec
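As a rough illustration of how the four components fit together, the following Python sketch builds a specification as a dictionary and checks that each top-level component is present before printing it as JSON, a format that kubectl accepts alongside YAML. The field values mirror the Spark Pi example in this document, `DATAPROC_IMAGE` stands in for your actual image, and the helper `missing_components` is a hypothetical name, not part of any GDC tooling.

```python
import json

REQUIRED_TOP_LEVEL = ("apiVersion", "kind", "metadata", "spec")


def missing_components(manifest):
    """Return the required top-level components absent from a manifest."""
    return [key for key in REQUIRED_TOP_LEVEL if key not in manifest]


# Values mirror the Spark Pi example in this document; the image is a placeholder.
manifest = {
    "apiVersion": "sparkoperator.k8s.io/v1beta2",
    "kind": "SparkApplication",
    "metadata": {"name": "spark-pi", "namespace": "foo"},
    "spec": {
        "type": "Python",
        "mode": "cluster",
        "image": "DATAPROC_IMAGE",
        "mainApplicationFile": "local:///usr/lib/spark/examples/src/main/python/pi.py",
        "sparkVersion": "3.1.3",
    },
}

assert missing_components(manifest) == []
print(json.dumps(manifest, indent=2))
```

This check only covers the presence of the top-level components; it does not validate the contents of the spec section.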
Application examples
This section includes the following examples with their corresponding
SparkApplication specifications to run Spark applications:
- Spark Pi
- Spark SQL
- Spark MLlib
- SparkR
Spark Pi
This section contains an example to run a compute-intensive Spark Pi application that estimates 𝛑 (pi) by throwing darts in a circle.
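The dart-throwing idea can be sketched in plain Python without Spark: pick random points in a square, count how many land inside the inscribed circle, and scale the ratio by four. This is a single-process illustration of the technique under a fixed random seed, not the bundled pi.py that the example runs.

```python
import math
import random


def estimate_pi(samples, seed=0):
    """Estimate pi from the share of random points inside the unit circle."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(samples):
        # Throw a dart at the square [-1, 1] x [-1, 1].
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:
            inside += 1
    # Circle area divided by square area equals pi / 4.
    return 4.0 * inside / samples


if __name__ == "__main__":
    estimate = estimate_pi(200_000)
    print(f"Pi is roughly {estimate}")
    print(f"Absolute error: {abs(estimate - math.pi):.5f}")
```

The work is compute-intensive because accuracy improves only slowly with the number of samples, which is why the task suits being split across Spark executors.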
Work through the following steps to run Spark Pi:
1. Apply the following SparkApplication specification example in the user cluster:
    apiVersion: "sparkoperator.k8s.io/v1beta2"
    kind: SparkApplication
    metadata:
      name: spark-pi
      namespace: foo
    spec:
      type: Python
      pythonVersion: "3"
      mode: cluster
      image: "${DATAPROC_IMAGE?}"
      imagePullPolicy: IfNotPresent
      mainApplicationFile: "local:///usr/lib/spark/examples/src/main/python/pi.py"
      sparkVersion: "3.1.3"
      restartPolicy:
        type: Never
      driver:
        cores: 1
        coreLimit: "1000m"
        memory: "512m"
        serviceAccount: spark
      executor:
        cores: 1
        instances: 1
        memory: "512m"
2. Verify that the SparkApplication specification example runs and completes in 1-2 minutes using the following command:
    kubectl --kubeconfig AO_USER_KUBECONFIG get SparkApplication spark-pi -n foo
3. View the driver logs to see the result:
    kubectl --kubeconfig AO_USER_KUBECONFIG logs spark-pi-driver -n foo | grep "Pi is roughly"
   The output is similar to the following:
    Pi is roughly 3.1407357036785184
For more information, see the following resources:
- For the application code, see the article Pi estimation from the Apache Spark documentation: https://spark.apache.org/examples.html
- For a sample Spark Pi YAML file, see Write a Spark application specification.
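If you want to script the verification step instead of reading the command output by eye, one approach is to request the resource as JSON (`-o json`) and read the state field from it. The following Python sketch parses such output. It assumes the status layout of the open source v1beta2 operator (`status.applicationState.state`), which may differ on your GDC release; the sample document is hypothetical and heavily trimmed.

```python
import json


def application_state(resource_json):
    """Return the application state from `kubectl get ... -o json` output.

    Assumes the open source v1beta2 layout: status.applicationState.state.
    Returns None when no status has been written yet.
    """
    resource = json.loads(resource_json)
    return resource.get("status", {}).get("applicationState", {}).get("state")


# Hypothetical, trimmed output for illustration only.
sample = (
    '{"metadata": {"name": "spark-pi"}, '
    '"status": {"applicationState": {"state": "COMPLETED"}}}'
)
print(application_state(sample))
```

Returning `None` for a missing status lets a polling loop distinguish an application that has not started from one that has finished.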
Spark SQL
Work through the following steps to run Spark SQL:
1. To run a Spark SQL application that selects the 1 value, use the following query:
    select 1;
2. Apply the following SparkApplication specification example in the user cluster:
    apiVersion: "sparkoperator.k8s.io/v1beta2"
    kind: SparkApplication
    metadata:
      name: pyspark-sql-arrow
      namespace: foo
    spec:
      type: Python
      mode: cluster
      image: "${DATAPROC_IMAGE?}"
      imagePullPolicy: IfNotPresent
      mainApplicationFile: "local:///usr/lib/spark/examples/src/main/python/sql/arrow.py"
      sparkVersion: "3.1.3"
      restartPolicy:
        type: Never
      driver:
        cores: 1
        coreLimit: "1000m"
        memory: "512m"
        serviceAccount: spark
      executor:
        cores: 1
        instances: 1
        memory: "512m"
3. Verify that the SparkApplication specification example runs and completes in less than one minute using the following command:
    kubectl --kubeconfig AO_USER_KUBECONFIG get SparkApplication pyspark-sql-arrow -n foo
Spark MLlib
Work through the following steps to run Spark MLlib:
1. Use the following Scala example to run a Spark MLlib instance that performs statistical analysis and prints a result to the console:
    import org.apache.spark.ml.linalg.{Matrix, Vectors}
    import org.apache.spark.ml.stat.Correlation
    import org.apache.spark.sql.Row

    // toDF requires spark.implicits._, which is already in scope in spark-shell.
    val data = Seq(
      Vectors.sparse(4, Seq((0, 1.0), (3, -2.0))),
      Vectors.dense(4.0, 5.0, 0.0, 3.0),
      Vectors.dense(6.0, 7.0, 0.0, 8.0),
      Vectors.sparse(4, Seq((0, 9.0), (3, 1.0)))
    )

    // Wrap each vector in a single-column row named "features".
    val df = data.map(Tuple1.apply).toDF("features")
    val Row(coeff1: Matrix) = Correlation.corr(df, "features").head
    println(s"Pearson correlation matrix:\n $coeff1")

    val Row(coeff2: Matrix) = Correlation.corr(df, "features", "spearman").head
    println(s"Spearman correlation matrix:\n $coeff2")
2. Apply the following SparkApplication specification example in the user cluster:
    apiVersion: "sparkoperator.k8s.io/v1beta2"
    kind: SparkApplication
    metadata:
      name: spark-ml
      namespace: foo
    spec:
      type: Scala
      mode: cluster
      image: "${DATAPROC_IMAGE?}"
      imagePullPolicy: IfNotPresent
      mainClass: org.apache.spark.examples.ml.SummarizerExample
      mainApplicationFile: "local:///usr/lib/spark/examples/jars/spark-examples_2.12-3.1.3.jar"
      sparkVersion: "3.1.3"
      restartPolicy:
        type: Never
      driver:
        cores: 1
        coreLimit: "1000m"
        memory: "512m"
        serviceAccount: spark
      executor:
        cores: 1
        instances: 1
        memory: "512m"
3. Verify that the SparkApplication specification example runs and completes in less than one minute using the following command:
    kubectl --kubeconfig AO_USER_KUBECONFIG get SparkApplication spark-ml -n foo
SparkR
Work through the following steps to run SparkR:
1. Use the following example code to run a SparkR instance that loads a bundled dataset and prints its first rows:
    library(SparkR)
    sparkR.session()
    df <- as.DataFrame(faithful)
    head(df)
2. Apply the following SparkApplication specification example in the user cluster:
    apiVersion: "sparkoperator.k8s.io/v1beta2"
    kind: SparkApplication
    metadata:
      name: spark-r-dataframe
      namespace: foo
    spec:
      type: R
      mode: cluster
      image: "${DATAPROC_IMAGE?}"
      imagePullPolicy: Always
      mainApplicationFile: "local:///usr/lib/spark/examples/src/main/r/dataframe.R"
      sparkVersion: "3.1.3"
      restartPolicy:
        type: Never
      driver:
        cores: 1
        coreLimit: "1000m"
        memory: "512m"
        serviceAccount: spark
      executor:
        cores: 1
        instances: 1
        memory: "512m"
3. Verify that the SparkApplication specification example runs and completes in less than one minute using the following command:
    kubectl --kubeconfig AO_USER_KUBECONFIG get SparkApplication spark-r-dataframe -n foo
View the logs of a Spark 3 application
Spark has the following two log types that you can view:
- Driver logs
- Event logs
Use the terminal to run commands.
Driver logs
Work through the following steps to view the driver logs of your Spark application:
1. Find your Spark driver pod:
    kubectl -n spark get pods
2. Open the logs from the Spark driver pod:
    kubectl -n spark logs DRIVER_POD
   Replace DRIVER_POD with the name of the Spark driver pod that you found in the previous step.
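When a namespace holds many pods, picking out the driver by eye can be tedious. The following Python sketch filters a pod listing down to names ending in `-driver`, a pattern inferred from the `spark-pi-driver` pod used earlier in this document; treat the pattern as an assumption about your environment. The listing is hypothetical and uses the `pod/NAME` form that `kubectl get pods -o name` prints.

```python
def driver_pods(pod_names):
    """Keep pod names that follow the APP_NAME-driver pattern."""
    # Drop a leading "pod/" prefix if the listing includes one.
    names = [name.split("/", 1)[-1] for name in pod_names]
    return [name for name in names if name.endswith("-driver")]


# Hypothetical listing, in the form printed by `kubectl get pods -o name`.
listing = [
    "pod/spark-pi-driver",
    "pod/spark-pi-exec-1",
    "pod/spark-operator-abc12",
]
print(driver_pods(listing))
```

Any name the filter returns can be used as the DRIVER_POD value in the logs command.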
Event logs
You can find event logs at the path specified in the YAML file of the SparkApplication specification.
Work through the following steps to view the event logs of your Spark application:
- Open the YAML file of the SparkApplication specification.
- Locate the spec field in the file.
- Locate the sparkConf field nested in the spec field.
- Locate the value of the spark.eventLog.dir field nested in the sparkConf section.
- Open the path to view event logs.
For a sample YAML file of the SparkApplication specification, see Write a Spark application specification.
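The lookup described in the steps above can be sketched in Python once the specification has been loaded into a dictionary. The fragment below is hypothetical: the example specifications in this document do not set sparkConf, and the directory shown is a made-up path. The function name `event_log_dir` is illustrative only.

```python
def event_log_dir(manifest):
    """Follow spec -> sparkConf -> spark.eventLog.dir; None if a level is absent."""
    spark_conf = manifest.get("spec", {}).get("sparkConf", {})
    return spark_conf.get("spark.eventLog.dir")


# Hypothetical specification fragment; the directory is a made-up path.
manifest = {
    "kind": "SparkApplication",
    "spec": {
        "sparkConf": {
            "spark.eventLog.enabled": "true",
            "spark.eventLog.dir": "file:///tmp/spark-events",
        }
    },
}
print(event_log_dir(manifest))
```

A result of `None` means the specification does not name an event log directory, so there is no path to open.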
Contact your account manager for more information.