Testing future Apache Spark releases and changes on Google Kubernetes Engine and Cloud Dataproc
By Holden Karau, Google Cloud Developer Advocate for Open Source Software and Big Data
Do you want to try out a new version of Apache Spark without waiting on the entire release process? Does testing bleeding-edge builds on production data sound fun to you? (Hint: it’s safer not to.) Then this is the blog post for you, my friend!
We’ll help you experiment with code that hasn't even been reviewed yet. If you’re a little cautious, following my advice might sound like a bad idea, and often it is, but if you need to ensure that a pull request (PR) really fixes your bug, or your application will keep running after the release candidate (RC) process is finished, this post will help you try out new versions of Spark with a minimum amount of fuss. But please don't run this in production without a backup and a very fancy support contract for when things go sideways! (Ask me how I know.)
If you’ve ever had the pain of an Apache Spark update breaking (or simply slowing down) your pipeline, then being able to quickly run your pipelines against the release candidate can provide invaluable information—to both yourself and the development community, depending on where the problems are.
Before we dive into the details of running custom versions of Spark, it's important to note that if all you need is to run a “supported” version of Spark on Google Cloud Dataproc or Spark on Kubernetes, there are much easier options and guides out there for you. Also, as we alluded to above, it’s important to make sure your experiments don’t destroy your production data, so consider using a sub-account with more restrictive permissions.
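As a sketch of that precaution, you could do your experiments as a dedicated service account that can only read storage. The account name "spark-test" and the read-only role here are illustrative choices, and this assumes $PROJECTNAME holds your project ID:

```shell
# Hypothetical restricted service account for experiments -- the name
# "spark-test" and the read-only storage role are illustrative, not required.
gcloud iam service-accounts create spark-test --display-name "spark-test"
gcloud projects add-iam-policy-binding $PROJECTNAME \
  --member serviceAccount:spark-test@$PROJECTNAME.iam.gserviceaccount.com \
  --role roles/storage.objectViewer
```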
If there is an off-the-shelf version of Spark you want to test, you can go ahead and download it from the Apache Spark downloads page, and if you’re interested in testing the new release candidate, the download instructions are posted to the dev mailing list. Or, if you’d rather try out a specific patch, you can check out the pull request to your local machine with git fetch origin pull/ID/head:BRANCHNAME, where ID is the PR number and BRANCHNAME is whatever you want to call the local branch. You may then follow the directions to build Spark. Remember to include the -P profiles for the components you want or need, including your cluster manager of choice—YARN or Kubernetes.
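Concretely, the fetch-and-build flow can look like the following. The PR number 12345 and the branch name are placeholders, and the -P profiles shown are just the two relevant to this post:

```shell
# Fetch a specific pull request as a local branch.
# 12345 is a placeholder PR number; substitute the one you want to test.
git clone https://github.com/apache/spark.git && cd spark
git fetch origin pull/12345/head:pr-12345
git checkout pr-12345
# Build a runnable distribution with the profiles you need,
# e.g. YARN and/or Kubernetes support.
./dev/make-distribution.sh --name custom --tgz -Pyarn -Pkubernetes
```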
Custom Spark on YARN with Cloud Dataproc
If you’re old school, or a hipster who believes that Apache Hadoop YARN has been around long enough to be cool again, Dataproc gives us a nice YARN cluster with all of the services we need to run many of the common Spark benchmark tools.
To transfer our packages onto our virtual machines you’ll need ssh access. In my case I use a “jump” machine plus the sshuttle project from my old grand-boss (who has also since joined Google), but you can also set up a public IP and enable ssh access to your test cluster if you're comfortable with that.
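If you go the jump-machine route, sshuttle gives you a lightweight VPN through it. A minimal sketch (the host name and subnet below are made-up placeholders for your own setup) looks like:

```shell
# Tunnel traffic for the cluster's private subnet through a jump host.
# "jumphost" and 10.128.0.0/16 are placeholder values -- use your own.
sshuttle -r hkarau@jumphost 10.128.0.0/16
```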
Once we’ve decided on a version of Spark to test, one of the simplest complete options besides local mode is Cloud Dataproc. Go ahead and launch your Dataproc cluster as you normally would; then you’ll want to upload the desired Spark version onto the master machine (I tend to use scp):
scp spark.tar hkarau@[master-machine]:~/
ssh into the machine, extract the Spark installation, and attempt to run it against the YARN master that Dataproc has:
ssh hkarau@[master-machine]
./bin/spark-shell --master yarn
This, however, results in sadness:
Exception in thread "main" java.lang.Exception: When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.
But it’s OK, this error message tells us that some environment variables need to be set for Spark to find the YARN and Hadoop configurations. You can add them to your
~/.bashrc or set them manually with:
export HADOOP_CONF_DIR="/etc/hadoop/conf"
export HADOOP_HDFS_HOME="/usr/lib/hadoop/../hadoop-hdfs"
export HADOOP_HOME="/usr/lib/hadoop"
export HADOOP_MAPRED_HOME="/usr/lib/hadoop/../hadoop-mapreduce"
export HADOOP_YARN_HOME="/usr/lib/hadoop/../hadoop-yarn"
Now you’re ready to run the shell and try out the new cool functionality—in this case I wanted to try the vectorized UDFs—but it turns out that sadly some of the dependencies required for vectorized UDFs aren’t included in the base image. For now I can use the experimental nteract coffee-boat package, but we’ll come back to that in more detail in the next installment of this series of posts.
- Want to access data stored in Google Cloud Storage buckets? Use --packages com.google.cloud.bigdataoss:gcs-connector-parent:1.7.0 to bring in the required connectors.
- You aren’t limited to the shell, of course: you can use spark-submit to try out your existing Spark jobs.

So that helps me when I want to test Spark quickly on a YARN cluster, without having to set one up myself, but one of the coolest new features in Spark 2.3 (besides vectorized UDFs) is that the Kubernetes integration has finally made it into the mainstream release (sorry, hipsters).
Custom Spark on Kubernetes Engine
Now, instead of scp-ing tarballs around like it’s the mid-90s, we’ll build a container image and upload it to Google’s built-in registry. If you remember the early 90s, you can pretend this is almost like shipping a PXE boot image—bear with me, I miss the 90s! So, go back to your favourite download or build of Spark and prepare the magic incantation:
shopt -s expand_aliases && alias docker="gcloud docker --"
This step lets us talk to the Google Cloud registry. Then you can use Spark’s built-in docker-image-tool as normal; in my case it ends up being:
export PROJECTNAME=boos-demo-projects-are-rad
export DOCKER_REPO=gcr.io/$PROJECTNAME/spark
export SPARK_VERSION=`git rev-parse HEAD`
./bin/docker-image-tool.sh -r $DOCKER_REPO -t $SPARK_VERSION build
./bin/docker-image-tool.sh -r $DOCKER_REPO -t $SPARK_VERSION push
Now you are ready to kick off your super-fancy K8s Spark cluster & fetch credentials:
gcloud container clusters create mySparkCluster --zone us-east1-b --project $PROJECTNAME
gcloud container clusters get-credentials mySparkCluster --zone us-east1-b --project $PROJECTNAME
# Start a local proxy to access the K8s APIs. Optional, but simplifies some of the shell scripts
kubectl proxy &
echo $! > proxy.pid
The output of this will tell us where we can submit our job to, but unlike YARN mode, the Kubernetes (K8s, for short) mode does not currently add local dependencies to the distributed cache. Since everyone loves a word count example (and it happens to be built into the Spark examples), we can upload the example JAR to Google Cloud Storage and make it accessible to our K8s cluster:
export JARBUCKETNAME="mybucket"
export MYJAR=spark-examples_2.11-2.3.0.jar
gsutil cp $MYJAR gs://$JARBUCKETNAME/
For now, though, we don’t have the JARs installed to access our Google Cloud Storage bucket, and Spark on K8s doesn’t currently support spark-packages, so we need a way to make it accessible. The simplest (and least secure) way is to make our JAR public with the following command:
gsutil acl ch -u AllUsers:R gs://$JARBUCKETNAME/spark-examples_2.11-2.3.0.jar
export JAR_URL=https://storage.googleapis.com/$JARBUCKETNAME/spark-examples_2.11-2.3.0.jar
This allows anyone to download our JARs over HTTP so we don’t need the Google Cloud Storage JARs to bootstrap.
Note: please only do this with OSS code, and remember to remove it after!
If you want a more secure way to distribute your JARs, you can create signed URLs. First, create a service account with permission to read from storage:
gcloud iam service-accounts create signer --display-name "signer"
gcloud projects add-iam-policy-binding $PROJECTNAME \
  --member serviceAccount:signer@$PROJECTNAME.iam.gserviceaccount.com \
  --role roles/storage.objectViewer
Get the key:
gcloud iam service-accounts keys create ~/key.json \
  --iam-account signer@$PROJECTNAME.iam.gserviceaccount.com
Sign the URL:
export JAR_URL=`gsutil signurl -d 24 -m GET ~/key.json \
  gs://$JARBUCKETNAME/$MYJAR | cut -f 4 | tail -n 1`
Before we kick off our Spark job we need to make a service account for Spark that will have permission to edit the cluster:
kubectl create serviceaccount spark
kubectl create clusterrolebinding spark-role \
  --clusterrole=edit --serviceaccount=default:spark --namespace=default
And now we can finally run our default word count example:
./bin/spark-submit --master k8s://http://127.0.0.1:8001 \
  --deploy-mode cluster \
  --conf spark.kubernetes.container.image=$DOCKER_REPO/spark:$SPARK_VERSION \
  --conf spark.executor.instances=1 \
  --class org.apache.spark.examples.JavaWordCount \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --name wordcount \
  $JAR_URL \
  /opt/spark/RELEASE
And we can verify the output with:
kubectl logs [podname-from-spark-submit]
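If you’ve lost track of the pod name, you can usually find the driver by its label; to the best of my knowledge, Spark’s K8s backend labels the driver pod with spark-role=driver:

```shell
# List the driver pod(s) spark-submit started for us...
kubectl get pods -l spark-role=driver
# ...then follow the logs of the name it prints:
kubectl logs -f [podname-from-spark-submit]
```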
Handling dependencies in Spark K8s (or accessing your data and JARs)
This leaves something to be desired, though: we had to either make our data and code public, or set up secure links for our JARs. Thankfully, there is a Spark-on-Google-Kubernetes-Engine solutions article for non-custom versions, and we can follow the same approach as its Dockerfile to add the necessary JARs.
mkdir /tmp/build
cat > /tmp/build/dockerfile <<EOF
FROM $DOCKER_REPO/spark:$SPARK_VERSION
RUN rm \$SPARK_HOME/jars/guava-14.0.1.jar
ADD http://central.maven.org/maven2/com/google/guava/guava/23.0/guava-23.0.jar \$SPARK_HOME/jars
ADD https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-latest-hadoop2.jar \$SPARK_HOME/jars
ENTRYPOINT [ "/opt/entrypoint.sh" ]
EOF
docker build -t $DOCKER_REPO/spark:$SPARK_VERSION-with-gcs -f /tmp/build/dockerfile /tmp/build
Push to our registry:
docker push $DOCKER_REPO/spark:$SPARK_VERSION-with-gcs
And launch using our Google Cloud Storage JAR rather than HTTP:
./bin/spark-submit --master k8s://http://127.0.0.1:8001 \
  --deploy-mode cluster \
  --conf spark.kubernetes.container.image=$DOCKER_REPO/spark:$SPARK_VERSION-with-gcs \
  --conf spark.executor.instances=1 \
  --class org.apache.spark.examples.JavaWordCount \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --name wordcount \
  gs://$JARBUCKETNAME/spark-examples_2.11-2.3.0.jar \
  /opt/spark/RELEASE
Now that it’s all done, remember to shut down any processes (e.g. kill `cat proxy.pid`) and clean up any artifacts you don’t need anymore (especially if you made them public).
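A cleanup pass for everything created in this post might look like the following; the cluster name, zone, and bucket variables match the earlier examples, so adjust to taste:

```shell
# Stop the kubectl proxy we backgrounded earlier
kill `cat proxy.pid` && rm proxy.pid
# Revoke the public read ACL if you made the example JAR public
gsutil acl ch -d AllUsers gs://$JARBUCKETNAME/$MYJAR
# Tear down the test cluster so it stops billing
gcloud container clusters delete mySparkCluster --zone us-east1-b --project $PROJECTNAME
```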
I hope this helps you feel more comfortable upgrading Spark releases frequently, or maybe even helping the Spark development team out by testing for possible performance regressions during code reviews & release candidates. If learning more about the Spark code review process is something that excites you, I encourage you to watch some of my past streamed code reviews and look out for future livestreams. If you don’t have enough pictures of stuffed animals sitting in front of computers, you can also follow me on Twitter.