Google Cloud Big Data and Machine Learning Blog

Innovation in data processing and machine learning technology

Testing future Apache Spark releases and changes on Google Kubernetes Engine and Cloud Dataproc

Wednesday, March 28, 2018

By Holden Karau, Google Cloud Developer Advocate for Open Source Software and Big Data

Do you want to try out a new version of Apache Spark without waiting on the entire release process? Does testing bleeding-edge builds on production data sound fun to you? (Hint: it’s safer not to.) Then this is the blog post for you, my friend!

We’ll help you experiment with code that hasn't even been reviewed yet. If you’re a little cautious, following my advice might sound like a bad idea, and often it is, but if you need to ensure that a pull request (PR) really fixes your bug, or your application will keep running after the release candidate (RC) process is finished, this post will help you try out new versions of Spark with a minimum amount of fuss. But please don't run this in production without a backup and a very fancy support contract for when things go sideways! (Ask me how I know.)

If you’ve ever had the pain of an Apache Spark update breaking (or simply slowing down) your pipeline, then being able to quickly run your pipelines against the release candidate can provide invaluable information—to both yourself and the development community, depending on where the problems are.

Before we dive into the details of running custom versions of Spark, it's important to note that if all you need is to run a “supported” version of Spark on Google Cloud Dataproc or Spark on Kubernetes, there are much easier options and guides out there for you. Also, as alluded to above, it’s important to make sure your experiments don’t destroy your production data, so consider using a sub-account with more restrictive permissions.

If there is an off-the-shelf version of Spark you want to test, you can go ahead and download it here, and if you’re interested in testing the new release candidate the download instructions are posted to the dev mailing list. Or if you’d rather try out a specific patch you can check out the pull request to your local machine with git fetch origin pull/ID/head:BRANCHNAME where ID is the PR number. You may then follow the directions to build Spark. Remember to include the -P components you want or need, including your cluster manager of choice—YARN or Kubernetes.
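For example, checking out and building a hypothetical PR might look like the following sketch (the PR number 21007 and the Maven profile flags are illustrative; pick the ones that match your bug and your cluster manager):

```shell
# Hypothetical PR number, for illustration only
ID=21007
BRANCHNAME=test-pr-$ID
REFSPEC="pull/$ID/head:$BRANCHNAME"
echo "$REFSPEC"

# Fetch the PR into a local branch and build with the profiles you need:
# git fetch origin "$REFSPEC" && git checkout "$BRANCHNAME"
# ./build/mvn -Pyarn -Pkubernetes -DskipTests clean package
```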

Custom Spark on YARN with Cloud Dataproc

If you’re old school, or a hipster who believes that Apache Hadoop YARN has been around long enough to be cool again, Dataproc gives us a nice YARN cluster with all of the services we need to run many of the common Spark benchmark tools (like sql-perf and ml-perf).

My friend Joey covered in YARN with stuffed animals from our Debugging Spark Talk at Strata Singapore 2018

To transfer our packages onto our virtual machines you’ll need to have ssh access. In my case I use a “jump” machine plus my former grand-boss’s sshuttle project (he has since joined Google, too), but you can also set up a public IP and enable ssh access to your test cluster if you're comfortable with that.

Once we’ve decided on a version of Spark we want to test, one of the simplest complete options besides local mode is Cloud Dataproc. You can go ahead and launch your Dataproc cluster as you normally would, then you’re going to want to upload the desired Spark version onto the master machine (I tend to use scp) and ssh over.

scp spark.tar hkarau@[master-machine]:~/

And then ssh into the machine, extract the Spark installation and attempt to run it against the YARN master that Dataproc has:

ssh hkarau@[master-machine]
tar -xf spark.tar && cd spark*  # extract and enter the Spark directory
./bin/spark-shell --master yarn

This, however, results in sadness:

Exception in thread "main" java.lang.Exception: When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.

But it’s OK, this error message tells us that some environment variables need to be set for Spark to find the YARN and Hadoop configurations. You can add them to your ~/.bashrc or set them manually with:

export HADOOP_CONF_DIR="/etc/hadoop/conf"
export HADOOP_HDFS_HOME="/usr/lib/hadoop/../hadoop-hdfs"
export HADOOP_HOME="/usr/lib/hadoop"
export HADOOP_MAPRED_HOME="/usr/lib/hadoop/../hadoop-mapreduce"
export HADOOP_YARN_HOME="/usr/lib/hadoop/../hadoop-yarn"

Now you’re ready to run the shell and try out the new cool functionality—in this case I wanted to try the vectorized UDFs—but it turns out that sadly some of the dependencies required for vectorized UDFs aren’t included in the base image. For now I can use the experimental nteract coffee-boat package, but we’ll come back to that in more detail in the next installment of this series of posts.

So that helps me when I want to test Spark quickly on a YARN cluster, without having to set one up myself, but one of the coolest new features in Spark 2.3 (besides vectorized UDFs) is that the Kubernetes integration has finally made it into the mainstream release (sorry, hipsters).

Two tips:

  • Want to access data stored in buckets? Use spark-packages (e.g. the --packages flag) to bring in the required connectors.
  • You aren’t limited to the shell of course: you can use spark-submit to try out your existing Spark jobs.
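Putting both tips together, a submit might look like the sketch below. The GCS connector coordinates are an assumption on my part (check Maven Central for the version matching your Hadoop build), and the job class and JAR are placeholders:

```shell
# Example Maven coordinates for the GCS connector (version is an assumption;
# verify against Maven Central before relying on it):
GCS_CONNECTOR="com.google.cloud.bigdataoss:gcs-connector:hadoop2-1.9.4"
echo "$GCS_CONNECTOR"

# Then submit an existing job with the connector pulled in via --packages:
# ./bin/spark-submit --master yarn --packages "$GCS_CONNECTOR" \
#   --class com.example.MyJob my-existing-job.jar gs://my-bucket/input
```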

Custom Spark on Kubernetes Engine

Kubernetes expert Kris Nova appears here as a hipster

Now instead of scping tarballs around like it’s the mid-90s, we’ll build a container image and upload it to Google’s built-in registry. If you remember the early 90s, you can pretend this is almost like shipping a PXE boot image—bear with me, I miss the 90s! So, go back to your favourite download or build of Spark and prepare the magic incantation:

shopt -s expand_aliases && alias docker="gcloud docker --"

This step lets us talk to the Google Cloud registry. Then you can use Spark’s built-in docker-image-tool.sh as normal; in my case it ends up being:

export PROJECTNAME=boos-demo-projects-are-rad
export DOCKER_REPO=gcr.io/$PROJECTNAME
export SPARK_VERSION=`git rev-parse HEAD`
./bin/docker-image-tool.sh -r $DOCKER_REPO -t $SPARK_VERSION build
./bin/docker-image-tool.sh -r $DOCKER_REPO -t $SPARK_VERSION push

Now you are ready to kick off your super-fancy K8s Spark cluster & fetch credentials:

gcloud container clusters create mySparkCluster --zone us-east1-b --project $PROJECTNAME
gcloud container clusters get-credentials mySparkCluster --zone us-east1-b --project $PROJECTNAME
# Start a local proxy to access the K8s APIs. Optional, but simplifies some of the shell scripts
kubectl proxy &
echo $! > /tmp/kubectl-proxy.pid  # save the proxy PID (any path works) so we can kill it later

The output of this will tell us where we can submit our job to, but unlike YARN mode, the Kubernetes (K8s, for short) mode does not currently add local dependencies to the distributed cache. Since everyone loves a word count example (and it happens to be built in to the Spark examples), we can upload the example JAR to Google Cloud Storage and make it accessible to our K8s cluster:

export JARBUCKETNAME="mybucket"
export MYJAR=spark-examples_2.11-2.3.0.jar
gsutil cp examples/jars/$MYJAR gs://$JARBUCKETNAME/

For now, though, we don’t have the JARs installed to access our Google Cloud Storage bucket, and Spark on K8s doesn’t currently support spark-packages, so we need a way to make it accessible. The simplest (and least secure) way is to make our JAR public with the following command:

gsutil acl ch -u AllUsers:R gs://$JARBUCKETNAME/$MYJAR
export JAR_URL=https://storage.googleapis.com/$JARBUCKETNAME/$MYJAR

This allows anyone to download our JARs over HTTP so we don’t need the Google Cloud Storage JARs to bootstrap.
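Publicly readable GCS objects are served at a predictable HTTPS address, so the download URL can be built from nothing but the bucket and object names (the names below are the examples from above):

```shell
JARBUCKETNAME="mybucket"                  # example bucket from above
MYJAR=spark-examples_2.11-2.3.0.jar
# Public objects are available at storage.googleapis.com/BUCKET/OBJECT:
JAR_URL="https://storage.googleapis.com/$JARBUCKETNAME/$MYJAR"
echo "$JAR_URL"
```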

Note: please only do this with OSS code, and remember to remove it after!

If you want a more secure distribution of your JARs, you can create a signed URL:

gcloud iam service-accounts create signer --display-name "signer"
gcloud projects add-iam-policy-binding $PROJECTNAME \
--member serviceAccount:signer@$PROJECTNAME.iam.gserviceaccount.com \
--role roles/storage.objectViewer

Get the key:

gcloud iam service-accounts keys create ~/key.json \
--iam-account signer@$PROJECTNAME.iam.gserviceaccount.com

Sign the URL:

export JAR_URL=`gsutil signurl -d 24 -m GET ~/key.json \
gs://$JARBUCKETNAME/$MYJAR | cut -f 4 | tail -n 1`
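The cut/tail at the end is just extracting the signed URL from gsutil signurl’s tab-separated output. The sample line below is fabricated for illustration, but the column layout matches what the command prints (the signed URL is the fourth column):

```shell
# gsutil signurl prints a header row and then one tab-separated row per
# object, with the signed URL in the fourth column; simulate that here:
SAMPLE=$(printf 'URL\tHTTP Method\tExpiration\tSigned URL\ngs://mybucket/my.jar\tGET\t2018-03-29 11:00:00\thttps://storage.googleapis.com/mybucket/my.jar?sig=abc123\n')
JAR_URL=$(echo "$SAMPLE" | cut -f 4 | tail -n 1)
echo "$JAR_URL"
```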

Before we kick off our Spark job we need to make a service account for Spark that will have permission to edit the cluster:

kubectl create serviceaccount spark
kubectl create clusterrolebinding spark-role \
--clusterrole=edit --serviceaccount=default:spark --namespace=default

And now we can finally run our default word count example:

./bin/spark-submit --master k8s://http://localhost:8001 \
 --deploy-mode cluster \
 --conf spark.kubernetes.container.image=$DOCKER_REPO/spark:$SPARK_VERSION \
 --conf spark.executor.instances=1 \
 --class org.apache.spark.examples.JavaWordCount \
 --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
 --name wordcount \
 $JAR_URL \
 /opt/spark/RELEASE

And we can verify the output with:

kubectl logs [podname-from-spark-submit]
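If you’d rather not copy the pod name out of spark-submit’s output, Spark’s K8s backend labels the driver pod, so a label selector can find it for you (a sketch; the label below is the one the Spark 2.3 backend applies):

```shell
# Spark on K8s labels driver pods with spark-role=driver:
SELECTOR="spark-role=driver"
echo "kubectl get pods -l $SELECTOR"
# Then fetch the logs of the first matching driver pod:
# kubectl logs $(kubectl get pods -l "$SELECTOR" -o jsonpath='{.items[0].metadata.name}')
```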

Handling dependencies in Spark K8s (or accessing your data and JARs)

This leaves something to be desired though: we had to either make our data and code public, or set up secure links for our JARs. Thankfully, there is a Spark-on-Google-Kubernetes-Engine solutions article for non-custom versions, and we can follow the same approach as their Dockerfile to add the necessary JARs.

mkdir /tmp/build
cat > /tmp/build/dockerfile <<EOF
FROM $DOCKER_REPO/spark:$SPARK_VERSION
# Add the GCS connector so jobs can read gs:// paths directly
# (connector URL per Google's Hadoop connector documentation):
ADD https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-latest-hadoop2.jar \$SPARK_HOME/jars/
# Drop Spark's bundled Guava to avoid version conflicts with the connector
RUN rm \$SPARK_HOME/jars/guava-14.0.1.jar
ENTRYPOINT [ "/opt/entrypoint.sh" ]
EOF
docker build -t $DOCKER_REPO/spark:$SPARK_VERSION-with-gcs -f /tmp/build/dockerfile /tmp/build

Push to our registry:

docker push $DOCKER_REPO/spark:$SPARK_VERSION-with-gcs

And launch using our Google Cloud Storage JAR rather than HTTP:

./bin/spark-submit --master k8s://http://localhost:8001 \
  --deploy-mode cluster \
  --conf spark.kubernetes.container.image=$DOCKER_REPO/spark:$SPARK_VERSION-with-gcs \
  --conf spark.executor.instances=1 \
  --class org.apache.spark.examples.JavaWordCount \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --name wordcount \
  gs://$JARBUCKETNAME/spark-examples_2.11-2.3.0.jar \
  /opt/spark/RELEASE

Wrapping up

Now that it’s all done, remember to shut down any background processes you started (such as the kubectl proxy) and clean up any artifacts you don’t need anymore (especially if you made them public).
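A sketch of the cleanup, with the resource names assumed from the earlier steps (adjust them to whatever you actually used):

```shell
CLUSTER=mySparkCluster                    # the cluster created earlier
echo "cleaning up $CLUSTER"
# gcloud container clusters delete $CLUSTER --zone us-east1-b --quiet
# gsutil acl ch -d AllUsers gs://$JARBUCKETNAME/$MYJAR  # revoke public read
# kill $(cat /tmp/kubectl-proxy.pid)  # stop kubectl proxy, if you saved its PID
```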

I hope this helps you feel more comfortable upgrading Spark releases frequently, or maybe even helping out the Spark development team by testing for possible performance regressions during code reviews & release candidates. If learning more about the Spark code review process is something that excites you, I encourage you to watch some of my past streamed code reviews and look out for future livestreams. If you don’t have enough pictures of stuffed animals sitting in front of computers, you can also follow me on Twitter.
