Using Spark on Kubernetes Engine to Process Data in BigQuery

This tutorial shows how to create and execute a data pipeline that uses BigQuery to store data and uses Spark on Google Kubernetes Engine (GKE) to process that data. This pipeline is useful for teams that have standardized their compute infrastructure on GKE and are looking for ways to port their existing workflows. For most teams, running Spark on Cloud Dataproc is the easiest and most scalable way to run their Spark applications.

The tutorial assesses a public BigQuery dataset, GitHub data, to find projects that would benefit most from a contribution. This tutorial assumes that you are familiar with GKE and Apache Spark.

The following high-level architecture diagram shows the technologies you'll use.

architecture diagram

Many projects on GitHub are written in Go, but few indicators tell contributors that a project needs help or where the codebase needs attention most.

In this tutorial, you use the following indicators to tell if a project needs contributions:

  • Number of open issues.
  • Number of contributors.
  • Number of times the packages of a project are imported by other projects.
  • Frequency of FIXME or TODO comments.
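
As a rough local illustration of the last indicator, the following sketch counts the lines that mention TODO or FIXME in the Go files under a directory. The function name count_todo_fixme is hypothetical and is not part of the tutorial's sample application, which computes its indicators in Spark against BigQuery tables.

```shell
# Hypothetical helper, not taken from the sample application:
# count the lines that contain TODO or FIXME in all .go files under a directory.
count_todo_fixme() {
  # find selects only regular files ending in .go; grep -h prints matching
  # lines without file names; wc -l counts them; tr strips padding spaces.
  find "$1" -type f -name '*.go' -exec grep -hE 'TODO|FIXME' {} + | wc -l | tr -d ' '
}

# Example call against a local checkout (the path is an assumption):
#   count_todo_fixme ./my-go-project
```

A project with a high count relative to its size is a candidate for cleanup contributions, which is the intuition behind this indicator.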

The following diagram shows the pipeline of the Spark application:

Spark application pipeline

Objectives

  • Create a Kubernetes Engine cluster to run your Spark application.
  • Deploy a Spark application on Kubernetes Engine.
  • Query and write BigQuery tables in the Spark application.
  • Analyze the results by using BigQuery.


Costs

This tutorial uses billable components of Google Cloud, including Kubernetes Engine, BigQuery, and Cloud Storage.

Use the Pricing Calculator to generate a cost estimate based on your projected usage. New Google Cloud users might be eligible for a free trial.

Before you begin

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud Console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Make sure that billing is enabled for your Cloud project. Learn how to confirm that billing is enabled for your project.

  4. Enable the Kubernetes Engine and BigQuery APIs.

    Enable the APIs

Setting up your environment

In this section, you configure the project settings that you need in order to complete the tutorial.

Start a Cloud Shell instance

Open Cloud Shell

You work through the rest of the tutorial in Cloud Shell.

Running the pipeline manually

In the following steps, you start your pipeline by having BigQuery extract all files with extension .go from the sample_files table, which is a subset of [bigquery-public-data:github_repos.files]. Using the subset of data allows for more cost-effective experimentation.

  1. In Cloud Shell, run the following commands to create a new dataset and a new table in BigQuery to store intermediate query results:

    export PROJECT=$(gcloud info --format='value(config.project)')
    bq mk --project_id $PROJECT spark_on_k8s_manual
    bq mk --project_id $PROJECT spark_on_k8s_manual.go_files
  2. View a sample of the Go files from the GitHub repository dataset, and then store the files in an intermediate table with the --destination_table option:

    export PROJECT=$(gcloud info --format='value(config.project)')
    bq query --project_id $PROJECT --replace \
             --destination_table spark_on_k8s_manual.go_files \
        'SELECT id, repo_name, path FROM
         [bigquery-public-data:github_repos.sample_files]
         WHERE RIGHT(path, 3) = ".go"'

    You should see file paths listed along with the repository that they came from. For example:

    Waiting on bqjob_r311c807f17003279_0000015fb8007c47_1 ... (0s) Current status: DONE
    |                    id                    |    repo_name     |          path           |
    | 31a4559c1e636e | mandelsoft/spiff | spiff++/spiff.go        |
    | 15f7611dd21a89 | bep/gr           | examples/router/main.go |
    | 15cbb0b0f096a2 | knq/xo           | internal/fkmode.go      |

    The list of all identified Go files is now stored in your spark_on_k8s_manual.go_files table.

  3. Run the following query to display the first 10 characters of each file:

    export PROJECT=$(gcloud info --format='value(config.project)')
    bq query --project_id $PROJECT 'SELECT sample_repo_name as
    repo_name, SUBSTR(content, 0, 10) FROM
    [bigquery-public-data:github_repos.sample_contents] WHERE id IN
    (SELECT id FROM spark_on_k8s_manual.go_files)'
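
To see what the RIGHT(path, 3) = ".go" condition in step 2 keeps and drops, the same suffix test can be sketched locally with awk. The function name filter_go_rows and the tab-separated input layout (id, repo_name, path) are assumptions made for this illustration; BigQuery performs the real filtering.

```shell
# Hypothetical local mirror of the WHERE clause in step 2.
# Reads tab-separated rows (id, repo_name, path) on stdin and prints only
# the rows whose path ends in ".go".
filter_go_rows() {
  # substr(path, length - 2) returns the last three characters of the path,
  # which is what RIGHT(path, 3) computes in the query.
  awk -F '\t' 'substr($3, length($3) - 2) == ".go"'
}

# Example with one row from the tutorial output and one made-up non-Go row:
#   printf '15f7611dd21a89\tbep/gr\texamples/router/main.go\nx\texample/repo\tREADME.md\n' | filter_go_rows
```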

Running the pipeline with Spark on Kubernetes

Next, you automate a similar procedure with a Spark application that uses the spark-bigquery connector to run SQL queries directly against BigQuery. The application then manipulates the results and saves them to BigQuery by using the Spark SQL and DataFrames APIs.

Create a Kubernetes Engine cluster

To deploy Spark and the sample application, create a Kubernetes Engine cluster by running the following commands:

gcloud config set compute/zone us-central1-f
gcloud container clusters create spark-on-gke --machine-type n1-standard-2
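
Before you continue, you might want to confirm that the cluster's API server answers requests. The following is a minimal sketch of a generic retry helper; the name retry_until_ok is hypothetical, and the example call assumes that kubectl is already configured for the new cluster.

```shell
# retry_until_ok ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS tries have been made.
retry_until_ok() {
  _retry_attempts=$1
  _retry_delay=$2
  shift 2
  _retry_n=1
  while [ "$_retry_n" -le "$_retry_attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Sleep only when another attempt is still to come.
    if [ "$_retry_n" -lt "$_retry_attempts" ]; then
      sleep "$_retry_delay"
    fi
    _retry_n=$((_retry_n + 1))
  done
  return 1
}

# Example call (an assumption, not a tutorial step):
#   retry_until_ok 10 15 kubectl get nodes
```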

Download sample code

Clone the sample application repository:

git clone
cd spark-on-k8s-gcp-examples/github-insights

Configure identity and access management

You must create an Identity and Access Management (IAM) service account to grant Spark access to BigQuery.

  1. Create the service account:

    gcloud iam service-accounts create spark-bq --display-name spark-bq
  2. Store the service account email address and your current project ID in environment variables to be used in later commands:

    export SA_EMAIL=$(gcloud iam service-accounts list --filter="displayName:spark-bq" --format='value(email)')
    export PROJECT=$(gcloud info --format='value(config.project)')
  3. The sample application must create and manipulate BigQuery datasets and tables and remove artifacts from Cloud Storage. Bind the bigquery.dataOwner, bigquery.jobUser, and storage.admin roles to the service account:

    gcloud projects add-iam-policy-binding $PROJECT --member serviceAccount:$SA_EMAIL --role roles/storage.admin
    gcloud projects add-iam-policy-binding $PROJECT --member serviceAccount:$SA_EMAIL --role roles/bigquery.dataOwner
    gcloud projects add-iam-policy-binding $PROJECT --member serviceAccount:$SA_EMAIL --role roles/bigquery.jobUser
  4. Download the service account JSON key and store it in a Kubernetes secret. Your Spark drivers and executors use this secret to authenticate with BigQuery:

    gcloud iam service-accounts keys create spark-sa.json --iam-account $SA_EMAIL
    kubectl create secret generic spark-sa --from-file=spark-sa.json
  5. Add permissions for Spark to be able to launch jobs in the Kubernetes cluster.

    kubectl create clusterrolebinding user-admin-binding --clusterrole=cluster-admin --user=$(gcloud config get-value account)
    kubectl create clusterrolebinding --clusterrole=cluster-admin --serviceaccount=default:default spark-admin
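
If the secret from step 4 is created from an empty or truncated key file, the Spark driver fails only later, when it tries to authenticate. A quick sanity check on spark-sa.json before you run kubectl create secret can catch that early. The function name looks_like_sa_key is hypothetical, and the check is a heuristic that looks for the "type": "service_account" field that service account key files contain; it does not validate the JSON or the key itself.

```shell
# Hypothetical pre-flight check for the key file downloaded in step 4.
# Succeeds when the file is non-empty and declares type "service_account".
looks_like_sa_key() {
  [ -s "$1" ] && grep -Eq '"type"[[:space:]]*:[[:space:]]*"service_account"' "$1"
}

# Example call (an assumption, not a tutorial step):
#   looks_like_sa_key spark-sa.json || echo "spark-sa.json does not look like a service account key"
```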

Configure and run the Spark application

You now download, install, and configure Spark to execute the sample Spark application in your Kubernetes Engine cluster.

  1. Install Maven, which you use to manage the build process for the sample application:

    sudo apt-get install -y maven
  2. Build the sample application jar:

    mvn clean package
  3. Create a Cloud Storage bucket to store the application jar and the results of your Spark pipeline:

    export PROJECT=$(gcloud info --format='value(config.project)')
    gsutil mb gs://$PROJECT-spark-on-k8s
  4. Upload the application jar to the Cloud Storage bucket:

    gsutil cp target/github-insights-1.0-SNAPSHOT-jar-with-dependencies.jar \
  5. Create a new BigQuery dataset:

    bq mk --project_id $PROJECT spark_on_k8s
  6. Download the official Spark 2.3 distribution and unarchive it:

    tar xvf spark-2.3.0-bin-hadoop2.7.tgz
    cd spark-2.3.0-bin-hadoop2.7
  7. Configure your Spark application by creating a properties file that contains your project-specific information:

    cat > properties << EOF
    spark.app.name  github-insights
    spark.kubernetes.namespace default
    spark.kubernetes.driverEnv.GCS_PROJECT_ID $PROJECT
    spark.kubernetes.driverEnv.GOOGLE_APPLICATION_CREDENTIALS /mnt/secrets/spark-sa.json
    spark.kubernetes.driver.secrets.spark-sa  /mnt/secrets
    spark.kubernetes.executor.secrets.spark-sa /mnt/secrets
    spark.driver.cores 0.1
    spark.executor.instances 3
    spark.executor.cores 1
    spark.executor.memory 512m
    spark.executorEnv.GCS_PROJECT_ID    $PROJECT
    spark.executorEnv.GOOGLE_APPLICATION_CREDENTIALS /mnt/secrets/spark-sa.json
    spark.hadoop.google.cloud.auth.service.account.enable true
    spark.hadoop.google.cloud.auth.service.account.json.keyfile /mnt/secrets/spark-sa.json
    spark.hadoop.fs.gs.project.id $PROJECT
    EOF
  8. Run the Spark application on the sample GitHub dataset by using the following commands:

    export KUBERNETES_MASTER_IP=$(gcloud container clusters list --filter name=spark-on-gke --format='value(MASTER_IP)')
    bin/spark-submit \
    --properties-file properties \
    --deploy-mode cluster \
    --class spark.bigquery.example.github.NeedingHelpGoPackageFinder \
    --master k8s://https://$KUBERNETES_MASTER_IP:443 \
    --jars \
    gs://$PROJECT-spark-on-k8s/jars/github-insights-1.0-SNAPSHOT-jar-with-dependencies.jar \
    $PROJECT spark_on_k8s $PROJECT-spark-on-k8s --usesample
  9. Open a new Cloud Shell session by clicking the Add Cloud Shell session button:

  10. In the new Cloud Shell session, view the logs of the driver pod by using the following command to track how the application progresses. The application takes about five minutes to execute.

    kubectl logs -l spark-role=driver
  11. When the application finishes executing, check the 10 most popular packages by running the following command:

    bq query "SELECT * FROM spark_on_k8s.popular_go_packages
    ORDER BY popularity DESC LIMIT 10"
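
A malformed properties file from step 7 is an easy mistake to miss, because a property line that has lost its key or its value does not stop the file from being written. The following sketch flags such lines before you call spark-submit. The function name check_properties is hypothetical, and the rules it applies (every setting has a key that starts with spark. and a non-empty value, separated by whitespace or =) are a simplification assumed for this tutorial's file, not a full parser for the Spark properties format.

```shell
# Hypothetical lint for the properties file created in step 7.
# Prints one message per suspicious line and returns non-zero if any is found.
check_properties() {
  awk '
    { sub(/^[ \t]+/, "") }        # ignore leading indentation
    $0 == "" || /^#/ { next }     # skip blank lines and comments
    {
      n = split($0, parts, /[ \t=]+/)
      if (n < 2 || parts[2] == "") {
        printf "line %d: missing value: %s\n", NR, $0
        bad = 1
      } else if (parts[1] !~ /^spark\./) {
        printf "line %d: key does not start with spark.: %s\n", NR, $0
        bad = 1
      }
    }
    END { exit (bad ? 1 : 0) }
  ' "$1"
}

# Example call (an assumption, not a tutorial step):
#   check_properties properties && echo "properties file looks well formed"
```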

You can run the same pipeline on the full set of tables in the GitHub dataset by removing the --usesample option in step 8. Note that the size of the full dataset is much larger than that of the sample dataset, so you will likely need a larger cluster to run the pipeline to completion in a reasonable amount of time.

Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources. Cleaning up after you finish the tutorial also frees the quota that those resources consume.

Deleting the project

The easiest way to eliminate billing is to delete the project that you created for the tutorial.

To delete the project:

  1. In the Cloud Console, go to the Manage resources page.

    Go to Manage resources

  2. In the project list, select the project that you want to delete, and then click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.
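
If you would rather keep the project, you can delete the individual resources instead. The following sketch prints the commands that remove the cluster, bucket, datasets, and service account created in this tutorial, and runs them only when RUN=1 is set. The function names are hypothetical, and the sketch assumes that the resource names match the ones used above and that your default compute zone is still configured.

```shell
# Print a command, or execute it when RUN=1 is set in the environment.
run_or_print() {
  if [ "${RUN:-0}" = "1" ]; then
    "$@"
  else
    printf '%s\n' "$*"
  fi
}

# cleanup_tutorial_resources PROJECT_ID SERVICE_ACCOUNT_EMAIL
# Hypothetical helper; resource names are the ones used in this tutorial.
cleanup_tutorial_resources() {
  run_or_print gcloud container clusters delete spark-on-gke --quiet
  run_or_print gsutil -m rm -r "gs://$1-spark-on-k8s"
  run_or_print bq --project_id "$1" rm -r -f -d spark_on_k8s
  run_or_print bq --project_id "$1" rm -r -f -d spark_on_k8s_manual
  run_or_print gcloud iam service-accounts delete "$2" --quiet
}

# Dry run first, then execute (example calls, not tutorial steps):
#   cleanup_tutorial_resources "$PROJECT" "$SA_EMAIL"
#   RUN=1 cleanup_tutorial_resources "$PROJECT" "$SA_EMAIL"
```

Printing the commands by default makes it possible to review what will be deleted before anything is removed.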

What's next

  • Check out another example of using Spark with BigQuery and Dataproc.
  • Check out this tutorial that uses Cloud Dataproc, BigQuery, and Apache Spark ML for machine learning.

  • Explore reference architectures, diagrams, tutorials, and best practices about Google Cloud. Take a look at our Cloud Architecture Center.