Unify analytics with Spark procedures in BigQuery, now generally available

March 19, 2024
Lu He

Sr. Engineer

Sandeep Karmarkar

Sr. Product Manager


BigQuery is powered by a highly scalable and capable SQL engine that can handle large data volumes with standard SQL, and that offers advanced capabilities such as BigQuery ML, remote functions, vector search, and more. However, there are cases where you may need to leverage open-source Apache Spark expertise or existing Spark-based business logic to expand BigQuery data processing beyond SQL. For example, you may want to use community packages for complex JSON processing or graph data processing, or use legacy code that was written in Spark prior to migration to BigQuery. Historically, this required you to leave BigQuery, enable a separate API, use an alternative user interface (UI), manage disparate permissions, and pay for non-BigQuery SKUs. 

To address these challenges, we developed an integrated experience to extend BigQuery’s data processing to Apache Spark, and today, we are announcing the general availability (GA) of Apache Spark stored procedures in BigQuery. BigQuery users looking to extend their queries with Spark-based data processing can now use BigQuery APIs to create and execute Spark stored procedures. This brings Spark and BigQuery together under a single experience spanning management, security, and billing. Spark procedures can be written in PySpark, Scala, or Java.
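Creating a procedure is a single piece of DDL with the Spark code inlined. Here is a minimal PySpark sketch; the project, dataset, and connection names are illustrative placeholders:

CREATE OR REPLACE PROCEDURE my_dataset.word_count()
WITH CONNECTION `my-project.us.my-spark-connection`
OPTIONS (engine="SPARK", runtime_version="1.1")
LANGUAGE PYTHON AS R"""
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("word-count").getOrCreate()

# Read a public BigQuery table through the Spark BigQuery connector.
words = (spark.read.format("bigquery")
         .option("table", "bigquery-public-data.samples.shakespeare")
         .load())

# Aggregate with Spark SQL and print the result to the job log.
words.createOrReplaceTempView("words")
spark.sql("SELECT word, SUM(word_count) AS total FROM words GROUP BY word").show()
"""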

Here’s what DeNA, a provider of internet and AI technologies and a BigQuery customer, had to say:

“BigQuery Spark stored procedures deliver a frictionless experience with unified API, governance and billing across Spark and BigQuery. We can now seamlessly use our Spark expertise and community packages for advanced data processing in BigQuery.” - Yusuke Kamo, Division Director, Data Management Division, DeNA Co., Ltd.

Let’s look into some key aspects of this unified experience.

Develop, test, and deploy PySpark code in BigQuery Studio 

BigQuery Studio, a single, unified interface for all data practitioners, includes a Python editor to develop, test, and deploy your PySpark code. Procedures can be configured with IN/OUT parameters along with other options. After you create a Spark connection, you can iteratively test the code within the UI. For debugging and troubleshooting, the BigQuery console incorporates log messages from the underlying Spark jobs and surfaces them in the same context. Spark experts can also tune Spark execution by passing Spark parameters to the procedure.
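As a sketch of what this looks like, the procedure below declares an input parameter and passes Spark properties for tuning. Following the documented convention, an input parameter's value reaches the Spark code as a JSON-encoded environment variable named BIGQUERY_PROC_PARAM.<parameter name>; the resource and table names, and the tuning values, are illustrative:

CREATE OR REPLACE PROCEDURE my_dataset.filter_events(filter_date STRING)
WITH CONNECTION `my-project.us.my-spark-connection`
OPTIONS (
  engine="SPARK",
  runtime_version="1.1",
  -- Illustrative tuning knobs handed straight to Spark.
  properties=[("spark.executor.instances", "4"),
              ("spark.executor.memory", "8g")]
)
LANGUAGE PYTHON AS R"""
import json
import os

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("filter-events").getOrCreate()

# IN parameter values arrive as JSON-encoded environment variables.
filter_date = json.loads(os.environ["BIGQUERY_PROC_PARAM.filter_date"])

# Filter an illustrative events table by the passed-in date.
df = (spark.read.format("bigquery")
      .option("table", "my_dataset.events")
      .load()
      .where(f"event_date = '{filter_date}'"))
df.show()
"""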

[Image: Author a PySpark procedure with the Python editor in BigQuery Studio]

Once tested, the procedure is stored within a BigQuery dataset, and access to it can be managed just like your SQL procedures.
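Invoking it then looks like any other stored procedure call (names illustrative):

CALL my_dataset.filter_events("2024-03-01");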

Extend for advanced use cases 

One of the great benefits of Apache Spark is being able to take advantage of a wide range of community or third-party packages. You can configure Spark stored procedures in BigQuery to install packages that you need for your code execution. 
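For instance, you can pull a community package into the job through Spark's standard package mechanism. The sketch below resolves GraphFrames for graph processing via the spark.jars.packages property; the package coordinates, repository, and resource names are illustrative, and the right artifact depends on your Spark runtime version (Python-only dependencies are typically handled with a custom container image, covered next):

CREATE OR REPLACE PROCEDURE my_dataset.graph_demo()
WITH CONNECTION `my-project.us.my-spark-connection`
OPTIONS (
  engine="SPARK",
  runtime_version="1.1",
  properties=[
    -- Illustrative coordinates; resolved when the job starts.
    ("spark.jars.packages", "graphframes:graphframes:0.8.3-spark3.5-s_2.12"),
    ("spark.jars.repositories", "https://repos.spark-packages.org")
  ]
)
LANGUAGE PYTHON AS R"""
from pyspark.sql import SparkSession
from graphframes import GraphFrame  # importable once the package is resolved

spark = SparkSession.builder.appName("graph-demo").getOrCreate()

# A toy graph: three vertices, two edges.
vertices = spark.createDataFrame([("a",), ("b",), ("c",)], ["id"])
edges = spark.createDataFrame([("a", "b"), ("b", "c")], ["src", "dst"])

# Run PageRank and print the scores to the job log.
GraphFrame(vertices, edges).pageRank(resetProbability=0.15, maxIter=5) \
    .vertices.show()
"""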

For advanced use cases, you can also import code stored in Google Cloud Storage buckets, or run your procedure on a custom container image hosted in Container Registry or Artifact Registry.

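Here is a sketch of both options: the entry point and a helper module live in a bucket, and the job optionally runs on a custom image. All bucket, repository, and image names are illustrative:

CREATE OR REPLACE PROCEDURE my_dataset.gcs_proc()
WITH CONNECTION `my-project.us.my-spark-connection`
OPTIONS (
  engine="SPARK",
  runtime_version="1.1",
  -- Entry point and helper modules read from Cloud Storage.
  main_file_uri="gs://my-bucket/spark/main.py",
  py_file_uris=["gs://my-bucket/spark/helpers.py"],
  -- Optional: run on a custom image from Artifact Registry.
  container_image="us-docker.pkg.dev/my-project/my-repo/my-spark:latest"
)
LANGUAGE PYTHON;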

Advanced security and authentication options, such as customer-managed encryption keys (CMEK) and pre-existing service accounts, are also supported.

Serverless execution with BigQuery billing 

With this release, you enjoy the benefits of Spark within the BigQuery APIs and only see BigQuery charges. Behind the scenes, this is made possible by our industry-leading Serverless Spark engine, which provides serverless, autoscaling Spark execution; you don’t need to enable the Dataproc APIs, and you are never charged for Dataproc when you use this capability. The feature is available in all BigQuery editions as well as the on-demand model, and Spark procedure usage is always billed at the Enterprise edition (EE) pay-as-you-go (PAYG) SKU, regardless of your edition. See BigQuery pricing for more details.

Next steps

Learn more about Apache Spark stored procedures in the BigQuery documentation.
