Learn how to run Serverless Spark workloads without provisioning and managing clusters
Brad Miro
Developer Advocate
Serverless Spark is a fully managed, serverless product on Google Cloud that lets you run Apache Spark batch workloads, including PySpark, SparkR, and Spark SQL, without provisioning or managing a cluster. It can process your data in BigQuery through the Apache Spark SQL connector for Google BigQuery, all from within a serverless environment. As part of the Dataproc product portfolio, Serverless Spark also supports reading from and writing to your Dataproc Metastore, and provides access to the Spark History Server when configured with a Dataproc Persistent History Server.
We’re pleased to announce a new interactive tutorial directly in the Google Cloud console that walks you through several ways to start processing your data with Serverless Spark on Google Cloud.
Below we’ll cover at a high level what you’ll learn in the tutorial, which goes much deeper than this blog.
This tutorial will take you approximately 30 minutes. A basic understanding of Apache Spark will help you understand the concepts in this tutorial. Learn more about Apache Spark in the project documentation.
What is Apache Spark?
Apache Spark is an open-source distributed data processing engine for large-scale Python, Java, Scala, R, or SQL workloads. Its core library includes tools for use cases such as machine learning, graph processing, structured streaming, and a pandas integration for pandas-based workloads. In addition, numerous third-party libraries extend Spark’s functionality, including sparknlp and database connectors such as the Apache Spark SQL connector for Google BigQuery. Apache Spark supports multiple table and file formats, including Apache Iceberg, Apache Hudi, Parquet, and Avro.
Run a PySpark job with Serverless Spark on BigQuery data
This tutorial teaches you how to read and write BigQuery data using PySpark and Serverless Spark. The Apache Spark SQL connector for Google BigQuery is now included in the latest Serverless Spark 2.1 runtime. You can also submit jobs from the command line, as sketched below.
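As a minimal sketch, a PySpark script can be submitted as a Serverless Spark batch with the gcloud CLI; the script path, bucket, and region below are placeholders, and the runtime version is pinned to 2.1 to pick up the connector mentioned above:

```bash
# Submit a PySpark script as a Serverless Spark batch.
# The script path, region, and staging bucket are placeholders for your own resources.
gcloud dataproc batches submit pyspark gs://your-bucket/your_script.py \
    --region=us-central1 \
    --version=2.1 \
    --deps-bucket=gs://your-bucket
```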
View logs, console output, and Spark logs
Service-level events, such as Serverless Spark requesting extra executors when scaling up, are captured in Cloud Logging and can be viewed in real time or later.
The console output will be visible via the command line as the job is running but is also logged to the Dataproc Batches console.
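As a quick sketch, you can also wait on a batch and print its driver output from the command line; the batch ID and region below are placeholders:

```bash
# Wait for a batch to finish and print its driver output to the terminal.
gcloud dataproc batches wait your-batch-id --region=us-central1
```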
You can also view Spark logs via a Persistent History Server set up as a Dataproc single-node cluster. Create one below.
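A sketch of creating a single-node cluster to act as a Persistent History Server follows; the cluster name, region, and bucket are placeholders, and the history property shown is one common configuration rather than the only option:

```bash
# Create a single-node Dataproc cluster to serve as a Persistent History Server.
# The cluster name, region, and log bucket are placeholders.
gcloud dataproc clusters create phs-cluster \
    --region=us-central1 \
    --single-node \
    --enable-component-gateway \
    --properties=spark:spark.history.fs.logDirectory=gs://your-bucket/phs/*/spark-job-history
```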
You can reference this cluster when submitting Serverless Spark jobs so that their Spark logs are available in the history server, as shown below.
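For example, the batch can be pointed at the history server by passing the cluster’s resource name at submission time; the project, region, and cluster names are placeholders:

```bash
# Point a Serverless Spark batch at the Persistent History Server created above.
gcloud dataproc batches submit pyspark gs://your-bucket/your_script.py \
    --region=us-central1 \
    --history-server-cluster=projects/your-project/regions/us-central1/clusters/phs-cluster
```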
The Persistent History Server is available in the Batches console by clicking on the Batch ID of the job and then View Spark History Server.
Use Dataproc templates for simple data processing jobs
Dataproc templates provide functionality for simple ETL (extract, transform, load) and ELT (extract, load, transform) jobs. Using this command-line tool, you can move and process your data for simple and common use cases. These templates utilize Serverless Spark but do not require the user to write any Spark code. Some of these templates include:

- GCStoGCS
- GCStoBigQuery
- GCStoBigtable
- GCStoJDBC and JDBCtoGCS
- HivetoBigQuery
- MongotoGCS and GCStoMongo
Check out the full list of templates.
The following example uses the GCStoGCS template to convert a GCS file from CSV to Parquet.
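A sketch of running this template with the launcher script from the dataproc-templates repository is shown below; the environment variables, bucket paths, and property names are illustrative and should be verified against the repository’s documentation for the GCStoGCS template:

```bash
# Set the environment the dataproc-templates launcher script expects.
# Project, region, and bucket values are placeholders.
export GCP_PROJECT=your-project
export REGION=us-central1
export GCS_STAGING_LOCATION=gs://your-bucket/staging

# Launch the GCStoGCS template to read CSV files and write Parquet output.
# Property names below are assumptions; check the templates repository for the exact options.
./bin/start.sh \
    -- --template=GCSTOGCS \
    --gcs.to.gcs.input.location="gs://your-bucket/input/*.csv" \
    --gcs.to.gcs.input.format=csv \
    --gcs.to.gcs.output.format=parquet \
    --gcs.to.gcs.output.location="gs://your-bucket/output/"
```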
Get started
Check out the interactive tutorial for a more in-depth and comprehensive view of the information covered here. New customers also get $300 in Google Cloud credits.
Learn more: