Write and run Spark Scala jobs on a Cloud Dataproc cluster

This tutorial illustrates different ways to create and submit a Spark Scala job to a Cloud Dataproc cluster, including how to:

  • write and compile a Spark Scala "Hello World" app on a local machine from the command line using the Scala REPL (Read-Evaluate-Print-Loop or interactive interpreter), the SBT build tool, or the Scala IDE plugin for Eclipse
  • package compiled Scala classes into a jar file with a manifest
  • submit the Scala jar to a Spark job that runs on your Cloud Dataproc cluster
  • examine Scala job output from the Google Cloud Platform Console

This tutorial also shows you how to:

  • write and run a Spark Scala "WordCount" mapreduce job directly on a Cloud Dataproc cluster using the spark-shell REPL

  • run pre-installed Apache Spark and Hadoop examples on a cluster

Set up a Google Cloud Platform project

If you haven't already done so:

  1. Set up a project
  2. Create a Cloud Storage bucket
  3. Create a Cloud Dataproc cluster

Write and compile Scala code locally

As a simple exercise for this tutorial, write a "Hello World" Scala app using the Scala REPL, the SBT command line interface, or the Scala IDE plugin for Eclipse locally on your development machine.

Using Scala

  1. Download the Scala binaries from the Scala Install page
  2. Unpack the file, set the SCALA_HOME environment variable, and add it to your path, as shown in the Scala Install instructions. For example:

    export SCALA_HOME=/usr/local/share/scala
    export PATH=$PATH:$SCALA_HOME/bin
    

  3. Launch the Scala REPL

    $ scala
    Welcome to Scala version ...
    Type in expressions to have them evaluated.
    Type :help for more information.
    scala>
    

  4. Copy and paste HelloWorld code into Scala REPL

    object HelloWorld {
      def main(args: Array[String]): Unit = {
        println("Hello, world!")
      }
    }
    
    

  5. Save HelloWorld.scala and exit the REPL

    scala> :save HelloWorld.scala
    scala> :q
    

  6. Compile with scalac

    $ scalac HelloWorld.scala
    

  7. List the compiled .class files

    $ ls HelloWorld*.class
    HelloWorld$.class   HelloWorld.class
    

Using SBT

  1. Download SBT

  2. Create a "HelloWorld" project, as shown below

    $ mkdir hello
    $ cd hello
    $ echo \
      'object HelloWorld {def main(args: Array[String]) = println("Hello, world!")}' > \
      HelloWorld.scala
    

  3. Create a build.sbt config file to set the artifactName (the name of the jar file that you will generate, below) to "HelloWorld.jar" (see Modifying default artifacts). A sketch of a build.sbt for a job that uses Spark APIs follows this list.

    echo \
      'artifactName := { (sv: ScalaVersion, module: ModuleID, artifact: Artifact) =>
      "HelloWorld.jar" }' > \
      build.sbt
    

  4. Launch SBT and run code

    $ sbt
    [info] Set current project to hello ...
    > run
    ... Compiling 1 Scala source to .../hello/target/scala-.../classes...
    ... Running HelloWorld
    Hello, world!
    [success] Total time: 3 s ...
    

  5. Package code into a jar file with a manifest that specifies the main class entry point (HelloWorld), then exit

    > package
    ... Packaging .../hello/target/scala-.../HelloWorld.jar ...
    ... Done packaging.
    [success] Total time: ...
    > exit
    

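The HelloWorld app above does not call any Spark APIs, so it needs no library dependencies. A job that does use Spark classes would also declare Spark as a dependency in build.sbt. The sketch below is one way to do that (the Scala and Spark version numbers are placeholders; match them to the versions installed on your cluster), marking the dependency as "provided" because the cluster already supplies the Spark jars at runtime:

    // Hypothetical build.sbt for a job that uses Spark APIs.
    // The version numbers are placeholders; use your cluster's Scala and Spark versions.
    scalaVersion := "2.11.8"

    libraryDependencies += "org.apache.spark" %% "spark-core" % "2.2.0" % "provided"

    artifactName := { (sv: ScalaVersion, module: ModuleID, artifact: Artifact) =>
      "HelloWorld.jar" }
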
Using the Scala IDE plugin for Eclipse

  1. Install the Scala IDE plugin for Eclipse

  2. Select/create a workspace and start a new Scala project

  3. On the Create a Scala project page, enter "HelloWorld" as the project name, then click Finish to accept the default settings
  4. Select the src folder in the left pane (Package Explorer), then select New→Scala Object
  5. Insert "HelloWorld" as the Scala object name in the Create New File dialog, then click Finish
  6. The HelloWorld.scala file is created with a HelloWorld object template. Fill in the template to complete the HelloWorld object definition, as follows:
    object HelloWorld {
      def main(args: Array[String]): Unit = {
        println("Hello, world!")
      }
    }
    
    
  7. Click in the HelloWorld.scala document (or select the document name in the Package Explorer pane on the left), then right-click and select Run As→Scala Application to build and run the app
  8. The console shows that the app built and ran successfully
  9. The compiled class files are located in your project's bin folder in the Eclipse workspace. Use these files to create a jar (see Create a jar, below).

Create a jar

Create a jar file with SBT or with the jar command.

Using SBT

The SBT package command creates a jar file (see Using SBT).

Manual jar creation

  1. Change directory (cd) into the directory that contains your compiled HelloWorld*.class files, then run the following command to package the class files into a jar with a manifest that specifies the main class entry point (HelloWorld).
    $ jar cvfe HelloWorld.jar HelloWorld HelloWorld*.class
    added manifest
    adding: HelloWorld$.class(in = 637) (out= 403)(deflated 36%)
    adding: HelloWorld.class(in = 586) (out= 482)(deflated 17%)
    

Copy jar to Cloud Storage

  1. Use the gsutil command to copy the jar to a Cloud Storage bucket in your project
    $ gsutil cp HelloWorld.jar gs://<bucket-name>/
    Copying file://HelloWorld.jar [Content-Type=application/java-archive]...
    Uploading   gs://bucket-name/HelloWorld.jar:         1.46 KiB/1.46 KiB

Submit jar to a Cloud Dataproc Spark job

  1. Use the GCP Console to submit the jar file as a Spark job to your Cloud Dataproc cluster

  2. Fill in the fields on the Submit a job page as follows:

    • Cluster: Select your cluster's name from the cluster list
    • Job type: Spark
    • Main class or jar: Specify the Cloud Storage URI path to your HelloWorld jar (gs://your-bucket-name/HelloWorld.jar). If your jar does not include a manifest that specifies the entry point to your code ("Main-Class: HelloWorld"), enter the name of your main class ("HelloWorld") in the "Main class or jar" field, and fill in the "Jar files" field with the URI path to your jar file (gs://your-bucket-name/HelloWorld.jar).
  3. Click Submit to start the job. Once the job starts, it is added to the Jobs list

  4. Click the Job ID to open the Jobs page, where you can view the job's driver output

Write and run Spark Scala code using the cluster's spark-shell REPL

You may want to develop Scala apps directly on your Cloud Dataproc cluster. Hadoop and Spark are pre-installed on Cloud Dataproc clusters, and they are configured with the Cloud Storage connector, which allows your code to read and write data directly from and to Cloud Storage.
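
Because of the connector, Cloud Storage URIs (gs://...) can be used anywhere your Spark code expects a file path. As a minimal sketch (the bucket and object names are placeholders, and sc is the SparkContext that spark-shell predefines for you), reading and writing Cloud Storage data looks like this:

    // Read a text file from Cloud Storage into an RDD, then write results back out.
    // The gs:// paths are placeholders for objects in your own bucket.
    val lines = sc.textFile("gs://<bucket-name>/input.txt")
    lines.saveAsTextFile("gs://<bucket-name>/output/")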

This example shows you how to SSH into your project's Cloud Dataproc cluster master node, then use the spark-shell REPL to create and run a Scala wordcount mapreduce application (a standalone version of the same wordcount, which you can package into a jar and submit as a job, follows these steps).

  1. SSH into the Cloud Dataproc cluster's master node
    1. Go to your project's Cloud Dataproc Clusters page in the GCP Console, then click on the name of your cluster
    2. On the cluster detail page, select the VM Instances tab, then click the SSH selection that appears to the right of the name of your cluster's master node
      A browser window opens at your home directory on the master node
      Connected, host fingerprint: ssh-rsa 2048 ...
      ...
      user@clusterName-m:~$
      
  2. Launch the spark-shell
    $ spark-shell
    ...
    Using Scala version ...
    Type in expressions to have them evaluated.
    Type :help for more information.
    ...
    Spark context available as sc.
    ...
    SQL context available as sqlContext.
    scala>
    
  3. Create an RDD (Resilient Distributed Dataset) from a Shakespeare text snippet located in public Cloud Storage
    scala> val text_file = sc.textFile("gs://pub/shakespeare/rose.txt")
    
  4. Run a wordcount mapreduce on the text, then display the wordcounts result
    scala> val wordCounts = text_file.flatMap(line => line.split(" ")).map(word =>
      (word, 1)).reduceByKey((a, b) => a + b)
    scala> wordCounts.collect
    ... Array((call,1), (What's,1), (sweet.,1), (we,1), (as,1), (name?,1), (any,1), (other,1),
    (rose,1), (smell,1), (name,1), (a,2), (would,1), (in,1), (which,1), (That,1), (By,1))
    
  5. Save the counts in <bucket-name>/wordcounts-out in Cloud Storage, then exit the spark-shell
    scala> wordCounts.saveAsTextFile("gs://<bucket-name>/wordcounts-out/")
    scala> exit
    
  6. Use gsutil to list the output files and display the file contents
    $ gsutil ls gs://<bucket-name>/wordcounts-out/
    gs://spark-scala-demo-bucket/wordcounts-out/
    gs://spark-scala-demo-bucket/wordcounts-out/_SUCCESS
    gs://spark-scala-demo-bucket/wordcounts-out/part-00000
    gs://spark-scala-demo-bucket/wordcounts-out/part-00001
    
  7. Check gs://<bucket-name>/wordcounts-out/part-00000 contents
    $ gsutil cat gs://<bucket-name>/wordcounts-out/part-00000
    (call,1)
    (What's,1)
    (sweet.,1)
    (we,1)
    (as,1)
    (name?,1)
    (any,1)
    (other,1)
    
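
The spark-shell steps above translate directly into a standalone application. The following is a minimal sketch (the output bucket name is a placeholder, and the WordCount object name is illustrative); you could compile it, package it into a jar (see Create a jar), copy the jar to Cloud Storage, and submit it as a Cloud Dataproc Spark job in the same way as HelloWorld.jar:

    import org.apache.spark.{SparkConf, SparkContext}

    // Standalone version of the wordcount run in spark-shell above.
    // Unlike spark-shell, a standalone app must create its own SparkContext.
    object WordCount {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("WordCount")
        val sc = new SparkContext(conf)

        val textFile = sc.textFile("gs://pub/shakespeare/rose.txt")
        val wordCounts = textFile
          .flatMap(line => line.split(" "))
          .map(word => (word, 1))
          .reduceByKey((a, b) => a + b)

        // Write the (word, count) pairs to Cloud Storage; the bucket name is a placeholder.
        wordCounts.saveAsTextFile("gs://<bucket-name>/wordcounts-out/")
        sc.stop()
      }
    }

If you build this app with SBT, declare the Spark dependency as in the build.sbt sketch shown earlier.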

Running pre-installed example code

The Cloud Dataproc master node contains runnable jar files with standard Apache Hadoop and Spark examples.

  • Hadoop: /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar (GitHub source; Apache docs: MapReduce Tutorial)
  • Spark: /usr/lib/spark/lib/spark-examples.jar (GitHub source; Apache docs: Spark Examples)

Submitting examples to your cluster from the command line

Examples can be submitted from your local development machine using the Cloud SDK gcloud command-line tool (you can also submit jobs from the GCP Console; see Using the Google Cloud Platform Console).

Hadoop WordCount example

gcloud dataproc jobs submit hadoop --cluster <cluster-name> \
  --jar file:///usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
  --class org.apache.hadoop.examples.WordCount \
  <URI of input file> <URI of output file>

Spark WordCount example

gcloud dataproc jobs submit spark --cluster <cluster-name> \
  --jar file:///usr/lib/spark/lib/spark-examples.jar \
  --class org.apache.spark.examples.JavaWordCount \
  <URI of input file>

Shut down your cluster

To avoid ongoing charges, shut down your cluster and delete the Cloud Storage resources (the Cloud Storage bucket and files) used for this tutorial.

To shut down a cluster:

gcloud dataproc clusters delete <cluster-name>

To delete the Cloud Storage jar file:

gsutil rm gs://<bucket-name>/HelloWorld.jar

You can delete a bucket and all of its folders and files with the following command:

gsutil rm -r gs://<bucket-name>/
