Connecting Databricks to BigQuery

This tutorial shows you how to connect to a BigQuery table or view to read and write data from a Databricks notebook. The steps are described using the Google Cloud console and Databricks Workspaces. You can also perform these steps using the gcloud and databricks command-line tools, although that guidance is outside the scope of this tutorial.

Databricks on Google Cloud is a Databricks environment hosted on Google Cloud, running on Google Kubernetes Engine (GKE) and providing built-in integration with BigQuery and other Google Cloud technologies. If you are new to Databricks, watch the Introduction to Databricks Unified Data Platform video for an overview of the Databricks lakehouse platform.

Deploy Databricks on Google Cloud

Complete the following steps to prepare to deploy Databricks on Google Cloud.

  1. To set up your Databricks account, follow the instructions in the Databricks documentation, Set up your Databricks on Google Cloud account.
  2. After you register, learn more about how to Manage your Databricks account.

Create a Databricks workspace, cluster, and notebook

The following steps describe how to create a Databricks workspace, a cluster, and a Python notebook to write code to access BigQuery.

  1. Confirm the Databricks prerequisites.

  2. Create your first workspace. On the Databricks account console, click Create Workspace.

  3. Specify gcp-bq for the Workspace name and select your Region.

    Create Workspace screen with Workspace name, Region, and Google Cloud project ID

  4. To determine your Google Cloud project ID, visit the Google Cloud console, and then copy the value to the Google Cloud project ID field.

    Go to Google Cloud console

  5. Click Save to create your Databricks workspace.

  6. To create a Databricks cluster with Databricks Runtime 7.6 or later, select Clusters in the left menu bar, and then click Create Cluster at the top.

  7. Specify the name and size of your cluster, click Advanced Options, and then specify the email address of your Google Cloud service account.

    New Cluster surface with Google Service Account details

  8. Click Create Cluster.

  9. To create a Python notebook for Databricks, follow instructions in Create a notebook.

Querying BigQuery from Databricks

With the configuration above, you can securely connect Databricks to BigQuery. Databricks uses a fork of the open source Google Spark Adapter to access BigQuery.

Databricks reduces data transfer and accelerates queries by automatically pushing down certain query predicates to BigQuery, for example, filtering on nested columns. In addition, the ability to first run a SQL query in BigQuery with the query() API reduces the transfer size of the resulting data set.
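For example, a filter applied to a Spark DataFrame that reads from BigQuery can be pushed down by the connector, so BigQuery returns only the matching rows. The following minimal sketch uses the public Shakespeare table that this tutorial reads in the next section; the word_count threshold is arbitrary:

    # The connector can push this filter down to BigQuery, so only rows
    # with word_count > 100 are transferred to the cluster.
    table = "bigquery-public-data.samples.shakespeare"
    df = (spark.read.format("bigquery")
          .option("table", table)
          .load()
          .filter("word_count > 100"))
    df.show(5)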

The following steps describe how to access a dataset in BigQuery and write your own data to BigQuery.

Access a public dataset on BigQuery

BigQuery provides a list of available public datasets. To query the BigQuery Shakespeare dataset that is part of the public datasets, follow these steps:

  1. To read the BigQuery table, use the following code snippet in your Databricks notebook.

    table = "bigquery-public-data.samples.shakespeare"
    df = spark.read.format("bigquery").option("table",table).load()
    df.createOrReplaceTempView("shakespeare")
    

    Execute the code by pressing Shift+Return.

    You can now query your BigQuery table through the Spark DataFrame (df). For example, use the following to show the first three rows of the dataframe:

    df.show(3)
    

    To query another table, update the table variable.

  2. A key feature of Databricks notebooks is that you can mix cells of different languages, such as Scala, Python, and SQL, in a single notebook.

    After you run the previous cell that creates the shakespeare temporary view, the following SQL query lets you visualize the word counts in Shakespeare's works.

    %sql
    SELECT word, SUM(word_count) AS word_count FROM shakespeare GROUP BY word ORDER BY word_count DESC LIMIT 12
    
    

    Bar graph of word counts in Shakespeare

    The preceding cell runs a Spark SQL query against the DataFrame in your Databricks cluster, not in BigQuery. The benefit of this approach is that the analysis happens at the Spark level, no further BigQuery API calls are issued, and you incur no additional BigQuery costs.

  3. As an alternative, you can delegate the execution of a SQL query to BigQuery with the query() API and reduce the transfer size of the resulting DataFrame. Unlike in the previous example, where the processing was done in Spark, this approach runs the query in BigQuery, so BigQuery pricing and query optimizations apply.

    The example below uses Scala, the query() API, and the public Shakespeare dataset in BigQuery to calculate the five most common words in Shakespeare's works. Before you run the code, you must first create an empty dataset in BigQuery called mdataset that the code can reference. For more information, see Writing data to BigQuery.

    %scala
    // public dataset
    val table = "bigquery-public-data.samples.shakespeare"
    
    // existing dataset where the Google Cloud user has table creation permission
    val tempLocation = "mdataset"
    // query string
    val q = s"""SELECT word, SUM(word_count) AS word_count FROM ${table}
        GROUP BY word ORDER BY word_count DESC LIMIT 10 """
    
    // read the result of a GoogleSQL query into a DataFrame
    val df2 =
      spark.read.format("bigquery")
      .option("query", q)
      .option("materializationDataset", tempLocation)
      .load()
    
    // show the top 5 common words in Shakespeare
    df2.show(5)
    

    For more code examples, see the Databricks BigQuery sample notebook.
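    If you work in a Python cell rather than Scala, the same query() pattern can be written as the following sketch. It assumes the same empty mdataset dataset already exists:

    # Run the aggregation in BigQuery and read only the small result into Spark.
    # Assumes an existing BigQuery dataset named "mdataset" for materialized results.
    table = "bigquery-public-data.samples.shakespeare"
    q = f"""SELECT word, SUM(word_count) AS word_count FROM `{table}`
        GROUP BY word ORDER BY word_count DESC LIMIT 10"""

    df2 = (spark.read.format("bigquery")
           .option("query", q)
           .option("materializationDataset", "mdataset")
           .load())

    df2.show(5)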

Writing data to BigQuery

BigQuery tables exist in datasets. Before you can write data to a BigQuery table, you must create a new dataset in BigQuery. To create a dataset for a Databricks Python notebook, follow these steps:

  1. Go to the BigQuery page in the Google Cloud console.

    Go to BigQuery

  2. Expand the Actions option, click Create dataset, and then name it together.

  3. In the Databricks Python notebook, create a simple Spark dataframe from a Python list with three string entries using the following code snippet:

    from pyspark.sql.types import StringType
    mylist = ["Google", "Databricks", "better together"]
    
    df = spark.createDataFrame(mylist, StringType())
    
  4. Add another cell to your notebook that writes the Spark DataFrame from the previous step to the BigQuery table myTable in the dataset together. The table is either created or overwritten; a sketch that appends rows instead follows these steps. Use the bucket name that you specified earlier.

    # Replace YOUR_BUCKET_NAME with the name of your Cloud Storage bucket.
    bucket = "YOUR_BUCKET_NAME"
    table = "together.myTable"

    (df.write
      .format("bigquery")
      .option("temporaryGcsBucket", bucket)
      .option("table", table)
      .mode("overwrite")
      .save())
    
  5. To verify that you have successfully written the data, query and display your BigQuery table through a new Spark DataFrame:

    display(spark.read.format("bigquery").option("table", table).load())
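
The write in step 4 overwrites the table on each run. If you want to add rows to the existing table instead, the following minimal sketch reuses the bucket and table variables from the previous steps and switches the save mode to append:

    # Append additional rows to together.myTable instead of replacing it.
    from pyspark.sql.types import StringType

    df_more = spark.createDataFrame(["BigQuery"], StringType())

    (df_more.write
      .format("bigquery")
      .option("temporaryGcsBucket", bucket)
      .option("table", table)
      .mode("append")
      .save())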