Introducing Serverless Spark for interactive development
Sr. Product Manager, Google Cloud
At Google Cloud, we’re committed to helping you build an open and integrated data platform that meets your specific business needs. We believe in order to build the world you want, you need to be able to use the tools you want, on a powerful and unified platform.
To that end, we’re making Apache Spark even more powerful on Google Cloud, and creating a simplified experience across the platform. Data engineers, data analysts, and data scientists like Spark for its high-speed data querying, analysis, and transformation with large data sets, but they want to use it from their interface of choice, without the need for custom integrations. And if you’ve used Spark, you know someone has to manage tuning, provisioning, and security, and make sure you’re using what you pay for. Our customers have made it clear: they want that someone to be us. They want auto-tuned capabilities, easy provisioning, and an efficient cost basis without having to manually reconfigure every time load fluctuates.
Earlier this year, we made Serverless Spark generally available, creating the industry’s first autoscaling serverless Spark. Today at Data Cloud Summit 2022, we announced that we’re taking it a step further and making Serverless Spark even more powerful by enabling serverless interactive development through Jupyter notebooks, natively integrated with Vertex AI Workbench. Data scientists no longer need to provision any infrastructure in advance; they can start developing on demand.
Now you can use Spark to conduct data processing and machine learning without any custom integrations: it is available natively in BigQuery, Vertex AI, Dataplex, and Dataproc. We have integrated a PySpark editor into the BigQuery console, powered by the Serverless Spark backend. Users can write and submit PySpark code from BigQuery, see the results in the BigQuery console, and get the same serverless, autoscaling experience they are used to with SQL.
From the Vertex AI notebook, you can choose the kernel that represents Serverless Spark. Without any resources provisioned beforehand, selecting the kernel will start a Serverless Spark session, and you can start development right away. Once finished, submit the notebook as a Dataproc job for production or publish it for live inference in Vertex AI.
You can also use a Jupyter notebook inside the Serverless Spark session if you choose to work with it through the gcloud CLI or the Dataproc API.
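For readers who prefer the API route, here is a minimal Python sketch of what a session-creation request might look like. It only builds the URL and JSON body and does not send anything. The `sessions` resource path, the `sessionId` query parameter, and the `jupyterSession` and `kernel` fields reflect our reading of the Dataproc v1 REST API and should be checked against the current API reference; the helper name `build_session_request` and the project, region, and session names are hypothetical.

```python
import json
from urllib.parse import quote, urlencode

# Base URL of the Dataproc REST API (assumed v1; verify against the API reference).
DATAPROC_ENDPOINT = "https://dataproc.googleapis.com/v1"


def build_session_request(project, region, session_id, kernel="PYTHON"):
    """Return (url, body) for a create-session call. Hypothetical helper.

    Nothing is sent over the network; an authenticated HTTP client
    would POST `body` as JSON to `url`.
    """
    parent = "projects/{}/locations/{}".format(
        quote(project, safe=""), quote(region, safe="")
    )
    url = "{}/{}/sessions?{}".format(
        DATAPROC_ENDPOINT, parent, urlencode({"sessionId": session_id})
    )
    # Field names are assumptions based on the v1 Session resource.
    body = {"jupyterSession": {"kernel": kernel, "displayName": session_id}}
    return url, body


if __name__ == "__main__":
    # Placeholder identifiers for illustration only.
    url, body = build_session_request("my-project", "us-central1", "explore-sales")
    print("POST", url)
    print(json.dumps(body, indent=2))
```

Sending the request also requires an OAuth 2.0 access token in the `Authorization` header, which this sketch leaves out. The gcloud CLI wraps the same API for those who prefer the command line.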
Once you have explored your data and developed your code through a notebook, you’ll want to take your work to production. To do so, you typically need to define ML pipelines that can be scheduled or triggered when specific events happen. We’re making this easier with Dataproc components for Vertex AI Pipelines. This will simplify MLOps for Spark, Spark SQL, and PySpark jobs.
Just as in Serverless Spark for ETL pipelines, you won’t be charged for starting up or shutting down the underlying infrastructure. You only pay for the time the Spark application is live.
We also know that some people want a bit more control. We’ve got good news for them too, with the general availability of Dataproc on GKE for Spark. Now you can deploy a GKE cluster, leverage the scaling capabilities of GKE, and use Spark with all security and compliance features of Dataproc. This adds the power of Spark to the advanced compute management and resource sharing capabilities you already rely on from your Kubernetes environment.
Now you can effortlessly power ETL, data science, data analytics, and AI pipelines with Spark, with exactly the amount of control you want: use serverless for no-ops automation, take control where you need it with compute clusters, and standardize on Kubernetes with GKE when you run workloads across multiple environments.
We all know how much of a difference it makes when you can use the exact right tool for a job. But when all those tools play nice together too? That’s the power you need to tackle your biggest challenges.
Get started with Serverless Spark today by connecting to it with your tool of choice, and request allowlist access to the private preview of BigQuery and Vertex AI integrations here.