Testing future Apache Spark releases and changes on Google Kubernetes Engine and Cloud Dataproc
Holden Karau
Google Cloud Developer Advocate for Open Source Software and Big Data
Do you want to try out a new version of Apache Spark without waiting on the entire release process? Does testing bleeding-edge builds on production data sound fun to you? (Hint: it’s safer not to.) Then this is the blog post for you, my friend!
We’ll help you experiment with code that hasn't even been reviewed yet. If you’re a little cautious, following my advice might sound like a bad idea (and often it is), but if you need to make sure that a pull request (PR) really fixes your bug, or that your application will keep running after the release candidate (RC) process finishes, this post will help you try out new versions of Spark with a minimum of fuss. But please don't run this in production without a backup and a very fancy support contract for when things go sideways! (Ask me how I know.)
If you’ve ever had the pain of an Apache Spark update breaking (or simply slowing down) your pipeline, then being able to quickly run your pipelines against the release candidate can provide invaluable information—to both yourself and the development community, depending on where the problems are.
Before we dive into the details of running custom versions of Spark, it's important to note that if all you need is to run a “supported” version of Spark on Google Cloud Dataproc or Spark on Kubernetes, there are much easier options and guides out there for you. Also, as we alluded to above, it’s important to make sure your experiments don’t destroy your production data, so consider using a sub-account with more restrictive permissions.
If there is an off-the-shelf version of Spark you want to test, you can go ahead and download it from the Apache Spark downloads page, and if you’re interested in testing the new release candidate, the download instructions are posted to the dev mailing list. Or, if you’d rather try out a specific patch, you can check out the pull request to your local machine with git fetch origin pull/ID/head:BRANCHNAME, where ID is the PR number. You may then follow the directions to build Spark. Remember to include the -P profiles you want or need, including your cluster manager of choice (YARN or Kubernetes).
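For example, here's a sketch of checking out a PR and producing a runnable distribution with both the YARN and Kubernetes profiles enabled (the branch and distribution names are placeholders; keep ID as the PR number you care about):

```
# Fetch the pull request into a local branch (ID is the PR number).
git fetch origin pull/ID/head:my-test-branch
git checkout my-test-branch

# Build a distributable tarball with the YARN and Kubernetes profiles enabled.
./dev/make-distribution.sh --name custom-test --tgz -Pyarn -Pkubernetes
```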
Custom Spark on YARN with Cloud Dataproc
If you’re old school, or a hipster who believes that Apache Hadoop YARN has been around long enough to be cool again, Dataproc gives us a nice YARN cluster with all of the services we need to run many of the common Spark benchmark tools (like sql-perf and ml-perf).
My friend Joey, covered in YARN, with stuffed animals, from our Debugging Spark talk at Strata Singapore 2018
To transfer our packages onto our virtual machines, you’ll need to have ssh access. In my case I use a “jump” machine plus the sshuttle project from my old grand-boss (who has also since joined Google), but you can also set up a public IP and enable ssh access to your test cluster if you're comfortable with that.
Once we’ve decided on a version of Spark we want to test, one of the simplest complete options besides local mode is Cloud Dataproc. You can go ahead and launch your Dataproc cluster as you normally would, then you’re going to want to upload the desired Spark version onto the master machine (I tend to use scp) and ssh over.
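As a sketch, with a hypothetical cluster named my-cluster (Dataproc names its master VM my-cluster-m) and the tarball from the build step above:

```
# Copy the custom Spark distribution up to the Dataproc master node.
# (Add --zone if you haven't configured a default compute zone.)
gcloud compute scp spark-2.3.0-bin-custom-test.tgz my-cluster-m:~/
```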
And then ssh into the machine, extract the Spark installation, and attempt to run it against the YARN master that Dataproc has:
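Something like the following, reusing the same hypothetical names:

```
# SSH into the master node.
gcloud compute ssh my-cluster-m

# On the master: unpack the distribution and try the shell against YARN.
tar -xzf spark-2.3.0-bin-custom-test.tgz
cd spark-2.3.0-bin-custom-test
./bin/spark-shell --master yarn
```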
This, however, results in sadness:
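The complaint looks roughly like this (the exact wording varies a bit between Spark versions):

```
Error: When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.
```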
But it’s OK: this error message tells us that some environment variables need to be set for Spark to find the YARN and Hadoop configurations. You can add them to your ~/.bashrc or set them manually with:
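On a Dataproc node the Hadoop configuration normally lives under /etc/hadoop/conf, so something along these lines does the trick:

```
export HADOOP_CONF_DIR=/etc/hadoop/conf
export YARN_CONF_DIR=/etc/hadoop/conf
```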
Now you’re ready to run the shell and try out the new cool functionality. In this case I wanted to try the vectorized UDFs, but it turns out that, sadly, some of the dependencies required for vectorized UDFs aren’t included in the base image. For now I can use the experimental nteract coffee-boat package, but we’ll come back to that in more detail in the next installment of this series of posts.
That helps when I want to test Spark quickly on a YARN cluster without having to set one up myself. But one of the coolest new features in Spark 2.3 (besides vectorized UDFs) is that the Kubernetes integration has finally made it into the mainstream release (sorry, hipsters).
Two tips:
- Want to access data stored in buckets? Use spark-packages (e.g. --packages com.google.cloud.bigdataoss:gcs-connector-parent:1.7.0) to bring in the required connectors; see the sketch after this list.
- You aren’t limited to the shell, of course: you can use spark-submit to try out your existing Spark jobs.
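Here's a sketch of that first tip, reusing the connector coordinates from above (the artifact and version you actually need may differ):

```
# Pull the Cloud Storage connector onto the classpath at shell startup.
./bin/spark-shell --master yarn \
  --packages com.google.cloud.bigdataoss:gcs-connector-parent:1.7.0
```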
Custom Spark on Kubernetes Engine
Kubernetes expert Kris Nova appears here as a hipster
Now, instead of scp-ing tarballs around like it’s the mid-90s, we’ll build a container image and upload it to Google’s built-in registry. If you remember the early 90s, you can pretend this is almost like shipping a PXE boot image (bear with me, I miss the 90s!). So, go back to your favourite download or build of Spark and prepare the magic incantation:
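With a reasonably recent Cloud SDK, one way to do this is to register gcloud as a Docker credential helper (older SDK versions used slightly different commands):

```
# Let Docker authenticate to gcr.io registries using your gcloud credentials.
gcloud auth configure-docker
```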
This step lets us talk to the Google Cloud registry; then you can use Spark's built-in docker-image-tool as normal.
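In my case, with a placeholder project ID (my-project) and tag, it ends up being roughly:

```
# From the root of the unpacked Spark distribution: build the Spark container
# image and push it to Google Container Registry.
./bin/docker-image-tool.sh -r gcr.io/my-project -t v2.3.0-custom build
./bin/docker-image-tool.sh -r gcr.io/my-project -t v2.3.0-custom push
```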
Now you are ready to kick off your super-fancy K8s Spark cluster & fetch credentials:
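A minimal sketch, where the cluster name, zone, and size are all just examples:

```
# Create a small test cluster and point kubectl at it.
gcloud container clusters create spark-test-cluster --zone us-central1-a --num-nodes 3
gcloud container clusters get-credentials spark-test-cluster --zone us-central1-a
```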
The output of this will tell us where we can submit our job, but unlike YARN mode, the Kubernetes (K8s, for short) mode does not currently add local dependencies to the distributed cache. Since everyone loves a word count example (and it happens to be built into the Spark examples), we can upload the example JAR to Google Cloud Storage and make it accessible to our K8s cluster:
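For instance, with a hypothetical bucket (the examples JAR version depends on your build):

```
# The examples JAR ships inside the distribution we built; copy it to a bucket.
gsutil cp examples/jars/spark-examples_2.11-2.3.0.jar gs://my-test-bucket/
```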
For now, though, we don’t have the JARs installed to access our Google Cloud Storage bucket, and Spark on K8s doesn’t currently support spark-packages, so we need a way to make it accessible. The simplest (and least secure) way is to make our JAR public with the following command:
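Something like this, with the same hypothetical bucket and JAR as above:

```
# Make the JAR world-readable so it can be fetched over plain HTTPS. OSS code only!
gsutil acl ch -u AllUsers:R gs://my-test-bucket/spark-examples_2.11-2.3.0.jar
```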
This allows anyone to download our JARs over HTTP so we don’t need the Google Cloud Storage JARs to bootstrap.
Note: please only do this with OSS code, and remember to remove it after!
If you want a more secure distribution of your JARs, you can create signed URLs:
Get the key:
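One way to get a key that gsutil can sign with is to create a service account key; the account name and output file here are placeholders:

```
# Create (and download) a private key for a service account that can read the bucket.
gcloud iam service-accounts keys create key.json \
    --iam-account=spark-test@my-project.iam.gserviceaccount.com
```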
Sign the URL:
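Then, as a sketch:

```
# Generate a signed URL that's valid for ten minutes.
gsutil signurl -d 10m key.json gs://my-test-bucket/spark-examples_2.11-2.3.0.jar
```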
Before we kick off our Spark job we need to make a service account for Spark that will have permission to edit the cluster:
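Following the standard Spark-on-Kubernetes setup, that looks roughly like:

```
# Create a service account for the Spark driver and give it edit rights on the cluster.
kubectl create serviceaccount spark
kubectl create clusterrolebinding spark-role --clusterrole=edit \
    --serviceaccount=default:spark --namespace=default
```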
And now we can finally run our default word count example:
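A sketch of the submission, where the API server address comes from kubectl cluster-info, the image is the one we pushed earlier, the JAR URL is the public (or signed) link from above, and the input path is just a placeholder:

```
# Find the Kubernetes API server address to use as the k8s:// master URL.
kubectl cluster-info

# Submit the JavaWordCount example from the publicly readable JAR.
./bin/spark-submit \
    --master k8s://https://<api-server-ip> \
    --deploy-mode cluster \
    --name word-count \
    --class org.apache.spark.examples.JavaWordCount \
    --conf spark.executor.instances=2 \
    --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
    --conf spark.kubernetes.container.image=gcr.io/my-project/spark:v2.3.0-custom \
    https://storage.googleapis.com/my-test-bucket/spark-examples_2.11-2.3.0.jar \
    <some-input-path-the-executors-can-read>
```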
And we can verify the output with:
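For example, by looking at the driver pod's logs:

```
# The driver pod name shows up in the spark-submit output and in:
kubectl get pods
# Then check the word count output in the driver's logs.
kubectl logs <driver-pod-name>
```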
Handling dependencies in Spark on K8s (or accessing your data and JARs)
This leaves something to be desired, though: we had to either make our data and code public, or set up signed links for our JARs. Thankfully, there is a Spark-on-Google-Kubernetes-Engine solutions article for non-custom versions, and we can follow the same approach its Dockerfile takes to add the necessary JARs, then push the resulting image to our registry:
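A sketch of that approach; the image names, tag, and connector JAR URL are assumptions modeled on the solutions article:

```
# Layer the Cloud Storage connector on top of the custom Spark image we built earlier.
cat > Dockerfile.gcs <<'EOF'
FROM gcr.io/my-project/spark:v2.3.0-custom
ADD https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-latest-hadoop2.jar /opt/spark/jars/
EOF

docker build -t gcr.io/my-project/spark-gcs:v2.3.0-custom -f Dockerfile.gcs .
docker push gcr.io/my-project/spark-gcs:v2.3.0-custom
```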
And launch using our Google Cloud Storage JAR rather than HTTP:
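Roughly like this, assuming the JAR (and some input) live in our bucket and the cluster's nodes have read access to it:

```
./bin/spark-submit \
    --master k8s://https://<api-server-ip> \
    --deploy-mode cluster \
    --name word-count \
    --class org.apache.spark.examples.JavaWordCount \
    --conf spark.executor.instances=2 \
    --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
    --conf spark.kubernetes.container.image=gcr.io/my-project/spark-gcs:v2.3.0-custom \
    gs://my-test-bucket/spark-examples_2.11-2.3.0.jar \
    gs://my-test-bucket/some-input.txt
```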
Wrapping up
Now that it’s all done, remember to shut down any processes (e.g. kill `cat proxy.pid`) and clean up any artifacts you don’t need anymore (especially if you made them public).
I hope this helps you feel more comfortable upgrading Spark releases frequently, or maybe even helping the Spark development team out by testing for possible performance regressions during code reviews & release candidates. If learning more about the Spark code review process is something that excites you, I encourage you to watch some of my past streamed code reviews and look out for more livestreams in the future. If you don’t have enough pictures of stuffed animals sitting in front of computers, you can also follow me on Twitter.