Data Analytics

Google Cloud Dataproc - the fast, easy and safe way to try Spark 2.0-preview

If you like to stay on the cutting edge of the Apache Spark and Apache Hadoop ecosystem, we have some good news — you can test new versions of these tools in a fast, easy and cost-effective way. Google Cloud Dataproc, Google Cloud Platform’s managed Spark and Hadoop service, offers a preview image version that often includes pre-release builds of popular Spark and Hadoop components. Starting today, that preview image includes a build of the next major milestone for the Spark project — the Spark 2.0-preview release.

Please note that the Spark project states this release of Spark 2.0 “...is not a stable release in terms of either API or functionality” and has a number of known issues. As such, you should expect to encounter incompatibilities and rough edges when testing Spark 2.0. Additionally, Spark 2.0 will not be officially supported on Cloud Dataproc until a stable release of Spark 2.0 is available. The preview release is offered for testing purposes only.

The Spark 2.0 preview includes a number of new features, such as:

  • Unification of the DataFrame and Dataset API
  • ANSI SQL parser and subquery support for Spark SQL
  • SparkSession, which replaces SQLContext and HiveContext
  • A machine learning API based on DataFrames
  • A second-generation Tungsten engine for increased performance
  • Support for saving and re-loading machine learning models and pipelines
  • A new Structured Streaming API for handling more complex streaming needs
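
Of these features, the DataFrame and Dataset unification is the easiest to see in a few lines: in Spark 2.0, DataFrame is simply a type alias for Dataset[Row], so an untyped DataFrame can be converted to a typed Dataset with as[T]. Here is a brief sketch (the case class and sample data are made up for illustration, not taken from this post):

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Hypothetical record type used only to illustrate the typed API.
case class Person(name: String, age: Long)

val spark = SparkSession.builder()
  .appName("unification-sketch")  // placeholder application name
  .master("local[*]")             // local mode for a quick test
  .getOrCreate()
import spark.implicits._

// In Spark 2.0, DataFrame is Dataset[Row]...
val df: DataFrame = Seq(Person("Ada", 36), Person("Grace", 45)).toDF()

// ...and can be converted to a typed Dataset with as[T],
// so both APIs share a single set of operations.
val ds: Dataset[Person] = df.as[Person]
val seniors = ds.filter(_.age > 40)
seniors.show()
```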
For example, here is the output from running spark-shell on Cloud Dataproc and inspecting the new `spark` variable, which holds a SparkSession:

  Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.0.0-preview
      /_/
         
Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_91)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark
res0: org.apache.spark.sql.SparkSession = 
org.apache.spark.sql.SparkSession@68f9e807
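
In the spark-shell, this `spark` variable is created for you. In a standalone application you would build the SparkSession yourself; here is a minimal sketch (the application name is a placeholder):

```scala
import org.apache.spark.sql.SparkSession

// SparkSession is the single entry point in Spark 2.0, replacing
// the separate SQLContext and HiveContext of Spark 1.x.
val spark = SparkSession.builder()
  .appName("dataproc-2.0-test")  // placeholder application name
  .master("local[*]")            // local mode for a quick test
  .getOrCreate()

println(spark.version)
```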

As an example of how you might use the SparkSession, you can easily read data from Google Cloud Storage into a DataFrame without having to create a SQLContext first:

val csvData = spark.read.csv("gs://my-bucket/project-data/csv")
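
From there, a common pattern is to register the DataFrame as a temporary view and query it with Spark SQL. A sketch (a small in-memory DataFrame stands in for the Cloud Storage data here; note that spark.read.csv assigns default column names _c0, _c1, … when the file has no header, and the view name is hypothetical):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("sql-sketch")  // placeholder application name
  .master("local[*]")     // local mode for a quick test
  .getOrCreate()
import spark.implicits._

// Stand-in for the CSV read above, mimicking the default
// header-less column names _c0, _c1.
val csvData = Seq(("a", "1"), ("a", "2"), ("b", "3")).toDF("_c0", "_c1")

// Register a temporary view so the data can be queried with SQL.
csvData.createOrReplaceTempView("project_data")
val counts = spark.sql(
  "SELECT _c0, COUNT(*) AS n FROM project_data GROUP BY _c0")
counts.show()
```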

Let’s assume you have a set of Spark SQL queries you run daily on data stored in Cloud Storage. You can easily test those scripts against the Spark 2.0 preview by creating a new cluster with the preview image. Within approximately 90 seconds you'll have a cluster you can use for testing without impacting your existing users or pipelines; when you're done testing, simply delete the cluster. This testing can surface breaking changes, reveal performance differences, and give you early access to new features in the Spark 2.0 preview.

You can get started with the Spark 2.0 preview today on Cloud Dataproc by using the “preview” image version when you create clusters through the Google Developers Console, the Google Cloud SDK, or the Cloud Dataproc API. For more information about selecting Cloud Dataproc versions, please see the Cloud Dataproc documentation.