GDC provides Dataproc Container for Spark, an Apache Spark environment for
data processing. For more information about Apache
Spark, see https://spark.apache.org/. Use Dataproc Container for Spark containers to
run new or existing Spark applications within a GDC
Kubernetes cluster with minimal alteration. If you are familiar with
Spark tools, you can keep using them.
Define your Spark application in a YAML file, and GDC allocates the resources for you. The Dataproc Container for Spark container starts in seconds. Spark executors scale up or shut down according to your needs.
You can configure Dataproc Container for Spark containers on GDC to use specialized hardware, such as GPUs or other specialty hardware nodes.
Prerequisites for running Spark applications
Before running a Spark application, ask your Platform Administrator (PA) to
grant you access to the Spark Operator (mkt-spark-operator) role in the
mkt-system namespace.
Run sample Spark 3 applications
Containerizing Spark applications simplifies running big data applications on
your premises using GDC. As an Application Operator
(AO), run Spark applications specified in GKE objects of the
SparkApplication custom resource type.
To run and use an Apache Spark 3 application on GDC, complete the following steps:
1. Examine the spark-operator image in your project to find the $DATAPROC_IMAGE to reference in your Spark application:

       export DATAPROC_IMAGE=$(kubectl get pod --kubeconfig INFRA_CLUSTER_KUBECONFIG \
         --selector app.kubernetes.io/name=spark-operator -n mkt-system \
         -o=jsonpath='{.items[*].spec.containers[0].image}' \
         | sed 's/spark-operator/dataproc/')

   For example:

       export DATAPROC_IMAGE=10.200.8.2:10443/dataproc-service/private-cloud-devel/dataproc:3.1-dataproc-17

2. Write a SparkApplication specification and store it in a YAML file. For more information, see the Write a Spark application specification section.
3. Submit, run, and monitor your Spark application as configured in a SparkApplication specification on the GKE cluster with the kubectl command. For more information, see the Application examples section.
4. Review the status of the application.
5. Optional: Review the application logs. For more information, see the View the logs of a Spark 3 application section.
Use the Spark application to collect and surface the status of the driver and executors to the user.
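The image lookup in the first step is a plain text substitution: it reads the image of the running spark-operator pod and swaps the repository name. The following is a minimal sketch of that substitution on its own, with no cluster access. The spark-operator image string is a hypothetical value chosen so that the result matches the example above, and `to_dataproc_image` is an illustrative helper name, not part of GDC.

```shell
# Hypothetical operator image; in a real project this value comes from the
# kubectl query in the first step.
OPERATOR_IMAGE="10.200.8.2:10443/dataproc-service/private-cloud-devel/spark-operator:3.1-dataproc-17"

# Illustrative helper: replace the first "spark-operator" in the image
# reference with "dataproc", as the sed expression in the first step does.
to_dataproc_image() {
  printf '%s\n' "$1" | sed 's/spark-operator/dataproc/'
}

DATAPROC_IMAGE="$(to_dataproc_image "$OPERATOR_IMAGE")"
export DATAPROC_IMAGE
echo "$DATAPROC_IMAGE"
```

Because the jsonpath expression in the first step uses `items[*]`, the query returns one image per matching pod. If more than one spark-operator pod is running, it is worth confirming that DATAPROC_IMAGE holds a single image reference before you use it.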
Write a Spark application specification
A SparkApplication specification includes the following components:
- The apiVersion field.
- The kind field.
- The metadata field.
- The spec section.
For more information, see Writing a SparkApplication Spec on GitHub: https://github.com/kubeflow/spark-operator/blob/gh-pages/docs/user-guide.md#writing-a-sparkapplication-spec
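As a rough sketch of how the four components fit together, the following shell function prints a minimal specification from a here-document. The values are taken from the Spark Pi example in this document; the function name `render_spark_app` and its argument are illustrative assumptions. Because the here-document is unquoted, the shell expands `${DATAPROC_IMAGE?}` and stops with an error if the variable is unset. The trailing `#` notes are YAML comments.

```shell
# Illustrative helper: print a minimal SparkApplication manifest.
# $1 is the application name (an assumption for this sketch).
render_spark_app() {
  cat <<EOF
apiVersion: "sparkoperator.k8s.io/v1beta2"   # API group and version of the custom resource
kind: SparkApplication                       # custom resource type
metadata:                                    # identifies the object in the cluster
  name: $1
  namespace: mkt-system
spec:                                        # what to run and with which resources
  type: Python
  pythonVersion: "3"
  mode: cluster
  image: "${DATAPROC_IMAGE?}"
  mainApplicationFile: "local:///usr/lib/spark/examples/src/main/python/pi.py"
  sparkVersion: "3.1.3"
  restartPolicy:
    type: Never
  driver:
    cores: 1
    memory: "512m"
    serviceAccount: dataproc-addon-spark
  executor:
    cores: 1
    instances: 1
    memory: "512m"
EOF
}
```

To submit the rendered manifest, one option is to pipe it to kubectl, for example `render_spark_app spark-pi | kubectl --kubeconfig INFRA_CLUSTER_KUBECONFIG apply -f -`.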
Application examples
This section includes the following examples with their corresponding
SparkApplication specifications to run Spark applications:
- Spark Pi
- Spark SQL
- Spark MLlib
- SparkR
Spark Pi
This section contains an example that runs a compute-intensive Spark Pi application, which estimates π (pi) by throwing darts at a circle.
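The dart-throwing idea can be sketched locally without Spark: draw random points in the square from -1 to 1 on each axis, count how many land inside the unit circle, and multiply the hit ratio by 4. This is only an illustration of the method in plain awk, not the bundled pi.py; the function name, sample count, and seed are arbitrary choices.

```shell
# Monte Carlo estimate of pi: the fraction of random points in the square
# that fall inside the unit circle approaches pi/4.
# $1 is the number of random points to draw.
estimate_pi() {
  awk -v n="$1" 'BEGIN {
    srand(42)                      # fixed seed so repeated runs agree
    hits = 0
    for (i = 0; i < n; i++) {
      x = rand() * 2 - 1           # uniform in [-1, 1)
      y = rand() * 2 - 1
      if (x * x + y * y <= 1) hits++
    }
    printf "Pi is roughly %f\n", 4 * hits / n
  }'
}

estimate_pi 200000
```

More samples give a tighter estimate, which is why the Spark version spreads the sampling across executors.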
Work through the following steps to run Spark Pi:
1. Apply the following SparkApplication specification example in the org infrastructure cluster:

       apiVersion: "sparkoperator.k8s.io/v1beta2"
       kind: SparkApplication
       metadata:
         name: spark-pi
         namespace: mkt-system
       spec:
         type: Python
         pythonVersion: "3"
         mode: cluster
         image: "${DATAPROC_IMAGE?}"
         imagePullPolicy: IfNotPresent
         mainApplicationFile: "local:///usr/lib/spark/examples/src/main/python/pi.py"
         sparkVersion: "3.1.3"
         restartPolicy:
           type: Never
         driver:
           cores: 1
           coreLimit: "1000m"
           memory: "512m"
           serviceAccount: dataproc-addon-spark
         executor:
           cores: 1
           instances: 1
           memory: "512m"

2. Verify that the SparkApplication specification example runs and completes in 1-2 minutes using the following command:

       kubectl --kubeconfig INFRA_CLUSTER_KUBECONFIG get SparkApplication spark-pi -n mkt-system

3. View the driver logs to see the result:

       kubectl --kubeconfig INFRA_CLUSTER_KUBECONFIG logs spark-pi-driver -n mkt-system | grep "Pi is roughly"

   The output is similar to the following:

       Pi is roughly 3.1407357036785184
For more information, see the following resources:
- For the application code, see the article Pi estimation from the Apache Spark documentation: https://spark.apache.org/examples.html.
- For a sample Spark Pi YAML file, see Write a Spark application specification.
Spark SQL
Work through the following steps to run Spark SQL:
1. To run a Spark SQL application that selects the 1 value, use the following query:

       select 1;

2. Apply the following SparkApplication specification example in the org infrastructure cluster:

       apiVersion: "sparkoperator.k8s.io/v1beta2"
       kind: SparkApplication
       metadata:
         name: pyspark-sql-arrow
         namespace: mkt-system
       spec:
         type: Python
         mode: cluster
         image: "${DATAPROC_IMAGE?}"
         imagePullPolicy: IfNotPresent
         mainApplicationFile: "local:///usr/lib/spark/examples/src/main/python/sql/arrow.py"
         sparkVersion: "3.1.3"
         restartPolicy:
           type: Never
         driver:
           cores: 1
           coreLimit: "1000m"
           memory: "512m"
           serviceAccount: dataproc-addon-spark
         executor:
           cores: 1
           instances: 1
           memory: "512m"

3. Verify that the SparkApplication specification example runs and completes in less than one minute using the following command:

       kubectl --kubeconfig INFRA_CLUSTER_KUBECONFIG get SparkApplication pyspark-sql-arrow -n mkt-system
Spark MLlib
Work through the following steps to run Spark MLlib:
1. Use the following Scala example to run a Spark MLlib instance that performs statistical analysis and prints a result to the console:

       import org.apache.spark.ml.linalg.{Matrix, Vectors}
       import org.apache.spark.ml.stat.Correlation
       import org.apache.spark.sql.Row

       // toDF needs spark.implicits._, which spark-shell imports automatically.
       val data = Seq(
         Vectors.sparse(4, Seq((0, 1.0), (3, -2.0))),
         Vectors.dense(4.0, 5.0, 0.0, 3.0),
         Vectors.dense(6.0, 7.0, 0.0, 8.0),
         Vectors.sparse(4, Seq((0, 9.0), (3, 1.0)))
       )

       val df = data.map(Tuple1.apply).toDF("features")
       val Row(coeff1: Matrix) = Correlation.corr(df, "features").head
       println(s"Pearson correlation matrix:\n $coeff1")

       val Row(coeff2: Matrix) = Correlation.corr(df, "features", "spearman").head
       println(s"Spearman correlation matrix:\n $coeff2")

2. Apply the following SparkApplication specification example in the org infrastructure cluster:

       apiVersion: "sparkoperator.k8s.io/v1beta2"
       kind: SparkApplication
       metadata:
         name: spark-ml
         namespace: mkt-system
       spec:
         type: Scala
         mode: cluster
         image: "${DATAPROC_IMAGE?}"
         imagePullPolicy: IfNotPresent
         mainClass: org.apache.spark.examples.ml.SummarizerExample
         mainApplicationFile: "local:///usr/lib/spark/examples/jars/spark-examples_2.12-3.1.3.jar"
         sparkVersion: "3.1.3"
         restartPolicy:
           type: Never
         driver:
           cores: 1
           coreLimit: "1000m"
           memory: "512m"
           serviceAccount: dataproc-addon-spark
         executor:
           cores: 1
           instances: 1
           memory: "512m"

3. Verify that the SparkApplication specification example runs and completes in less than one minute using the following command:

       kubectl --kubeconfig INFRA_CLUSTER_KUBECONFIG get SparkApplication spark-ml -n mkt-system
SparkR
Work through the following steps to run SparkR:
1. Use the following example code to run a SparkR instance that loads a bundled dataset and prints its first rows:

       library(SparkR)
       sparkR.session()
       df <- as.DataFrame(faithful)
       head(df)

2. Apply the following SparkApplication specification example in the org infrastructure cluster:

       apiVersion: "sparkoperator.k8s.io/v1beta2"
       kind: SparkApplication
       metadata:
         name: spark-r-dataframe
         namespace: mkt-system
       spec:
         type: R
         mode: cluster
         image: "${DATAPROC_IMAGE?}"
         imagePullPolicy: Always
         mainApplicationFile: "local:///usr/lib/spark/examples/src/main/r/dataframe.R"
         sparkVersion: "3.1.3"
         restartPolicy:
           type: Never
         driver:
           cores: 1
           coreLimit: "1000m"
           memory: "512m"
           serviceAccount: dataproc-addon-spark
         executor:
           cores: 1
           instances: 1
           memory: "512m"

3. Verify that the SparkApplication specification example runs and completes in less than one minute using the following command:

       kubectl --kubeconfig INFRA_CLUSTER_KUBECONFIG get SparkApplication spark-r-dataframe -n mkt-system
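Each example above ends with the same manual check. The following sketch shows one way to automate it by polling the application state until it finishes. It assumes that the SparkApplication object reports its state in `.status.applicationState.state` with values such as COMPLETED and FAILED, and that a shell variable named INFRA_CLUSTER_KUBECONFIG holds the kubeconfig path. The helper names `get_app_state` and `wait_for_spark_app` are illustrative, not part of GDC.

```shell
# Default state lookup: read the application state from the SparkApplication
# object. Assumes kubectl and a kubeconfig path in INFRA_CLUSTER_KUBECONFIG.
get_app_state() {
  kubectl --kubeconfig "${INFRA_CLUSTER_KUBECONFIG?}" get SparkApplication "$1" \
    -n mkt-system -o jsonpath='{.status.applicationState.state}'
}

# Illustrative helper: poll until the application reaches a terminal state.
# $1 = application name, $2 = max attempts (default 30),
# $3 = seconds between attempts (default 5).
# Returns 0 on COMPLETED, 1 on FAILED, 2 on timeout.
wait_for_spark_app() {
  name="$1"; attempts="${2:-30}"; delay="${3:-5}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    state="$(get_app_state "$name")"
    case "$state" in
      COMPLETED) echo "$name: COMPLETED"; return 0 ;;
      FAILED)    echo "$name: FAILED" >&2; return 1 ;;
    esac
    i=$((i + 1))
    sleep "$delay"
  done
  echo "$name: timed out in state '${state:-unknown}'" >&2
  return 2
}
```

For example, `wait_for_spark_app spark-pi 30 5` checks the Spark Pi application every five seconds for up to 30 attempts. Keeping the lookup in its own function makes it possible to swap in a different status source if your environment reports state differently.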
View the logs of a Spark 3 application
Spark has the following two log types that you can view:
- Driver logs
- Event logs
Use the terminal to run commands.
Driver logs
Work through the following steps to view the driver logs of your Spark application:
1. Find your Spark driver pod:

       kubectl --kubeconfig INFRA_CLUSTER_KUBECONFIG get pods -n mkt-system

2. Open the logs from the Spark driver pod:

       kubectl --kubeconfig INFRA_CLUSTER_KUBECONFIG logs DRIVER_POD -n mkt-system

   Replace DRIVER_POD with the name of the Spark driver pod that you found in the previous step.
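The two steps above can be combined into a small helper. This sketch assumes that the driver pod is named after the application with a `-driver` suffix, as in the spark-pi-driver pod used in the Spark Pi example, and that a shell variable named INFRA_CLUSTER_KUBECONFIG holds the kubeconfig path. The helper names are illustrative.

```shell
# Illustrative helper: the Spark Pi example in this document uses a driver pod
# named "<application name>-driver" (spark-pi-driver).
driver_pod_name() {
  printf '%s-driver\n' "$1"
}

# Print the last lines of the driver log for an application.
# $1 = application name, $2 = number of lines (default 50).
# Assumes kubectl and a kubeconfig path in INFRA_CLUSTER_KUBECONFIG.
driver_logs() {
  kubectl --kubeconfig "${INFRA_CLUSTER_KUBECONFIG?}" logs \
    "$(driver_pod_name "$1")" -n mkt-system --tail="${2:-50}"
}
```

For example, `driver_logs spark-pi 20` prints the last 20 lines of the Spark Pi driver log. If the pod name in your cluster differs, Spark on Kubernetes normally labels driver pods with `spark-role=driver`, so listing pods with `-l spark-role=driver` may help you find it.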
Event logs
You can find event logs at the path specified in the YAML file of the
SparkApplication specification.
Work through the following steps to view the event logs of your Spark application:
- Open the YAML file of the SparkApplication specification.
- Locate the spec field in the file.
- Locate the sparkConf field nested in the spec field.
- Locate the value of the spark.eventLog.dir field nested in the sparkConf section.
- Open the path to view event logs.
For a sample YAML file of the SparkApplication specification, see
Write a Spark application specification.
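The lookup steps above can also be scripted. The following sketch pulls the spark.eventLog.dir value out of a specification file with a line-based match. The YAML fragment and its path are hypothetical values for illustration, and `event_log_dir` is an illustrative helper name.

```shell
# Illustrative helper: print the value of spark.eventLog.dir from a
# SparkApplication YAML file. Handles quoted and unquoted keys and values,
# but it is a line-based match, not a YAML parser.
event_log_dir() {
  sed -nE 's/^[[:space:]]*"?spark\.eventLog\.dir"?[[:space:]]*:[[:space:]]*"?([^"]*)"?[[:space:]]*$/\1/p' "$1"
}

# Hypothetical fragment of a specification, written to a temporary file.
spec_file="$(mktemp)"
cat > "$spec_file" <<'EOF'
spec:
  sparkConf:
    "spark.eventLog.enabled": "true"
    "spark.eventLog.dir": "file:///tmp/spark-events"
EOF

event_log_dir "$spec_file"
rm -f "$spec_file"
```

With the fragment shown, the helper prints `file:///tmp/spark-events`. For specifications with unusual formatting, a real YAML parser is the safer choice.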
Contact your account manager for more information.