Using Spark on Kubernetes Engine to Process Data in BigQuery

This tutorial shows how to create and run a data pipeline that uses BigQuery to store data and Spark on Kubernetes Engine to process that data. The tutorial analyzes a public BigQuery dataset of GitHub data to find the projects that would benefit most from a contribution. This tutorial assumes that you are familiar with Kubernetes Engine and Apache Spark. The following high-level architecture diagram shows the technologies you'll use.

architecture diagram

Many projects on GitHub are written in Go, but few indicators tell contributors that a project needs help or where the codebase needs attention most.

In this tutorial, you use the following indicators to tell if a project needs contributions:

  • Number of open issues.
  • Number of contributors.
  • Number of times the packages of a project are imported by other projects.
  • Frequency of FIXME or TODO comments (a sample query that illustrates this indicator follows this list).
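
The last of these indicators can be computed directly in BigQuery. The following query is only an illustration, not part of the tutorial's pipeline; it uses standard SQL to count FIXME and TODO occurrences per repository in the sample_contents table. Note that scanning the content column incurs query costs.

   # Illustration only: count FIXME/TODO occurrences per repository.
   bq query --nouse_legacy_sql \
       'SELECT sample_repo_name AS repo_name,
               SUM(ARRAY_LENGTH(REGEXP_EXTRACT_ALL(content, r"FIXME|TODO"))) AS todo_count
        FROM `bigquery-public-data.github_repos.sample_contents`
        GROUP BY repo_name
        ORDER BY todo_count DESC
        LIMIT 10'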

The following diagram shows the pipeline of the Spark application:

Spark application pipeline

Objectives

  • Create a Kubernetes Engine cluster to run your Spark application.
  • Deploy a Spark application on Kubernetes Engine.
  • Query and write BigQuery tables in the Spark application.
  • Analyze the results by using BigQuery.

Costs

This tutorial uses billable components of Cloud Platform, including:

  • Kubernetes Engine
  • BigQuery
  • Cloud Storage

Use the Pricing Calculator to generate a cost estimate based on your projected usage. New Cloud Platform users might be eligible for a free trial.

Before you begin

  1. Sign in to your Google Account.

    If you don't already have one, sign up for a new account.

  2. Select or create a GCP project.

    Go to the Manage resources page

  3. Make sure that billing is enabled for your project.

    Learn how to enable billing

  4. Enable the Kubernetes Engine and BigQuery APIs.

    Enable the APIs

Setting up your environment

In this section, you configure the project settings that you need in order to complete the tutorial.

Start a Cloud Shell instance

Open Cloud Shell

You work through the rest of the tutorial in Cloud Shell.
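
Optionally, confirm that Cloud Shell is pointing at the project you intend to use for this tutorial. This quick check isn't part of the original steps:

   # Print the project that subsequent gcloud and bq commands will use.
   gcloud config get-value project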

Running the pipeline manually

In the following steps, you start your pipeline by having BigQuery extract all files with the .go extension from the sample_files table, which is a subset of [bigquery-public-data:github_repos.files]. Using this subset of the data allows for more cost-effective experimentation.

  1. In Cloud Shell, run the following commands to create a new dataset and a new table in BigQuery to store intermediate query results:

    export PROJECT=$(gcloud info --format='value(config.project)')
    bq mk --project_id $PROJECT spark_on_k8s_manual
    bq mk --project_id $PROJECT spark_on_k8s_manual.go_files
    

  2. View a sample of the Go files from the GitHub repository dataset, and then store the list of matching files in an intermediate table by using the --destination_table option (an equivalent standard SQL version of this query appears after these steps):

    export PROJECT=$(gcloud info --format='value(config.project)')
    bq query --project_id $PROJECT --replace \
             --destination_table spark_on_k8s_manual.go_files \
        'SELECT id, repo_name, path
         FROM [bigquery-public-data:github_repos.sample_files]
         WHERE RIGHT(path, 3) = ".go"'
    

    You should see file paths listed along with the repository that they came from. For example:

    Waiting on bqjob_r311c807f17003279_0000015fb8007c47_1 ... (0s) Current status: DONE
    +------------------------------------------+------------------+-------------------------+
    |                    id                    |    repo_name     |          path           |
    +------------------------------------------+------------------+-------------------------+
    | 31a4559c1e636e | mandelsoft/spiff | spiff++/spiff.go        |
    | 15f7611dd21a89 | bep/gr           | examples/router/main.go |
    | 15cbb0b0f096a2 | knq/xo           | internal/fkmode.go      |
    +------------------------------------------+------------------+-------------------------+
    

    The list of all identified Go files is now stored in your spark_on_k8s_manual.go_files table.

  3. Run the following query to display the first 10 characters of each file:

    export PROJECT=$(gcloud info --format='value(config.project)')
    bq query --project_id $PROJECT \
        'SELECT sample_repo_name AS repo_name, SUBSTR(content, 0, 10)
         FROM [bigquery-public-data:github_repos.sample_contents]
         WHERE id IN (SELECT id FROM spark_on_k8s_manual.go_files)'
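
The queries in this section use BigQuery legacy SQL. If you prefer standard SQL, the following sketch shows an equivalent form of the extraction query from step 2. It is an alternative to that step, not an additional one:

   # Equivalent of the step 2 query, written in standard SQL.
   bq query --project_id $PROJECT --replace --nouse_legacy_sql \
            --destination_table spark_on_k8s_manual.go_files \
       'SELECT id, repo_name, path
        FROM `bigquery-public-data.github_repos.sample_files`
        WHERE ENDS_WITH(path, ".go")'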
    

Running the pipeline with Spark on Kubernetes

Next, you automate a similar procedure with a Spark application that uses the spark-bigquery connector to run SQL queries directly against BigQuery. The application then manipulates the results and saves them to BigQuery by using the Spark SQL and DataFrames APIs.

Create a Kubernetes Engine cluster

To deploy Spark and the sample application, create a Kubernetes Engine cluster by running the following commands:

   gcloud config set compute/zone us-central1-f
   gcloud container clusters create spark-on-gke --machine-type n1-standard-2
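
Cluster creation takes a few minutes. When it completes, you can optionally verify that kubectl is connected to the new cluster. This check isn't part of the original steps:

   # Fetch cluster credentials for kubectl (usually done automatically) and list the nodes.
   gcloud container clusters get-credentials spark-on-gke
   kubectl get nodes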
   

Download sample code

Clone the sample application repository:

   git clone https://github.com/GoogleCloudPlatform/spark-on-k8s-gcp-examples.git
   cd spark-on-k8s-gcp-examples/github-insights
   

Configure identity and access management

You must create a Cloud IAM service account to grant Spark access to BigQuery.

  1. Create the service account:

    gcloud iam service-accounts create spark-bq --display-name spark-bq
    

  2. Store the service account email address and your current project ID in environment variables to be used in later commands:

    export SA_EMAIL=$(gcloud iam service-accounts list --filter="displayName:spark-bq" --format='value(email)')
    export PROJECT=$(gcloud info --format='value(config.project)')
    

  3. The sample application must create and manipulate BigQuery datasets and tables and remove artifacts from Cloud Storage. Bind the bigquery.dataOwner, bigquery.jobUser, and storage.admin roles to the service account (you can verify these bindings with the optional check shown after these steps):

    gcloud projects add-iam-policy-binding $PROJECT --member serviceAccount:$SA_EMAIL --role roles/storage.admin
    gcloud projects add-iam-policy-binding $PROJECT --member serviceAccount:$SA_EMAIL --role roles/bigquery.dataOwner
    gcloud projects add-iam-policy-binding $PROJECT --member serviceAccount:$SA_EMAIL --role roles/bigquery.jobUser
    

  4. Download the service account JSON key and store it in a Kubernetes secret. Your Spark drivers and executors use this secret to authenticate with BigQuery:

    gcloud iam service-accounts keys create spark-sa.json --iam-account $SA_EMAIL
    kubectl create secret generic spark-sa --from-file=spark-sa.json
    

  5. Grant Spark the permissions that it needs to launch jobs in the Kubernetes Engine cluster:

    kubectl create clusterrolebinding user-admin-binding --clusterrole=cluster-admin --user=$(gcloud config get-value account)
    kubectl create clusterrolebinding spark-admin --clusterrole=cluster-admin --serviceaccount=default:default
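
If you want to confirm the setup before continuing, the following optional commands list the roles bound to the service account and show that the spark-sa secret exists. The IAM filter syntax is a common pattern rather than part of the original steps:

    # List the roles granted to the Spark service account in this project.
    gcloud projects get-iam-policy $PROJECT \
        --flatten="bindings[].members" \
        --filter="bindings.members:$SA_EMAIL" \
        --format="table(bindings.role)"

    # Confirm that the Kubernetes secret holding the JSON key exists.
    kubectl get secret spark-sa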
    

Configure and run the Spark application

You now download, install, and configure Spark to execute the sample Spark application in your Kubernetes Engine cluster.

  1. Install Maven, which you use to manage the build process for the sample application:

    sudo apt-get install -y maven

  2. Build the sample application jar:

    mvn clean package

  3. Create a Cloud Storage bucket to store the application jar and the results of your Spark pipeline:

    export PROJECT=$(gcloud info --format='value(config.project)')
    gsutil mb gs://$PROJECT-spark-on-k8s
    

  4. Upload the application jar to the Cloud Storage bucket:

    gsutil cp target/github-insights-1.0-SNAPSHOT-jar-with-dependencies.jar \
                   gs://$PROJECT-spark-on-k8s/jars/
    

  5. Create a new BigQuery dataset:

    bq mk --project_id $PROJECT spark_on_k8s

  6. Download the official Spark 2.3 distribution and unarchive it:

    wget https://dist.apache.org/repos/dist/release/spark/spark-2.3.0/spark-2.3.0-bin-hadoop2.7.tgz
    tar xvf spark-2.3.0-bin-hadoop2.7.tgz
    cd spark-2.3.0-bin-hadoop2.7
    

  7. Configure your Spark application by creating a properties file that contains your project-specific information:

    cat > properties << EOF
    spark.app.name  github-insights
    spark.kubernetes.namespace default
    spark.kubernetes.driverEnv.GCS_PROJECT_ID $PROJECT
    spark.kubernetes.driverEnv.GOOGLE_APPLICATION_CREDENTIALS /mnt/secrets/spark-sa.json
    spark.kubernetes.container.image gcr.io/cloud-solutions-images/spark:v2.3.0-gcs
    spark.kubernetes.driver.secrets.spark-sa  /mnt/secrets
    spark.kubernetes.executor.secrets.spark-sa /mnt/secrets
    spark.driver.cores 0.1
    spark.executor.instances 3
    spark.executor.cores 1
    spark.executor.memory 512m
    spark.executorEnv.GCS_PROJECT_ID    $PROJECT
    spark.executorEnv.GOOGLE_APPLICATION_CREDENTIALS /mnt/secrets/spark-sa.json
    spark.hadoop.google.cloud.auth.service.account.enable true
    spark.hadoop.google.cloud.auth.service.account.json.keyfile /mnt/secrets/spark-sa.json
    spark.hadoop.fs.gs.project.id $PROJECT
    spark.hadoop.fs.gs.system.bucket $PROJECT-spark-on-k8s
    EOF
    

  8. Run the Spark application on the sample GitHub dataset by using the following commands:

    export KUBERNETES_MASTER_IP=$(gcloud container clusters list --filter name=spark-on-gke --format='value(MASTER_IP)')
    bin/spark-submit \
    --properties-file properties \
    --deploy-mode cluster \
    --class spark.bigquery.example.github.NeedingHelpGoPackageFinder \
    --master k8s://https://$KUBERNETES_MASTER_IP:443 \
    --jars http://central.maven.org/maven2/com/databricks/spark-avro_2.11/4.0.0/spark-avro_2.11-4.0.0.jar \
    gs://$PROJECT-spark-on-k8s/jars/github-insights-1.0-SNAPSHOT-jar-with-dependencies.jar \
    $PROJECT spark_on_k8s $PROJECT-spark-on-k8s --usesample
    

  9. Open a new Cloud Shell session by clicking the Add Cloud Shell session button:

    Add Cloud Shell session button

  10. In the new Cloud Shell session, run the following command to view the logs of the driver pod and track the application's progress (an optional command for listing all of the Spark pods appears after these steps). The application takes about five minutes to run.

    kubectl logs -l spark-role=driver

  11. When the application finishes executing, check the 10 most popular packages by running the following command:

    bq query "SELECT * FROM spark_on_k8s.popular_go_packages
    ORDER BY popularity DESC LIMIT 10"
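
While the application runs, you can also list the Spark pods that spark-submit created in the cluster. The spark-role=driver label is the one used in step 10; the spark-role=executor label is assumed to follow the same convention used by Spark's Kubernetes backend:

    # List the driver pod and the executor pods.
    kubectl get pods -l spark-role=driver
    kubectl get pods -l spark-role=executor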
    

You can run the same pipeline on the full set of tables in the GitHub dataset by removing the --usesample option in step 8. Note that the full dataset is much larger than the sample dataset, so you will likely need a larger cluster to run the pipeline to completion in a reasonable amount of time.
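
For example, one way to add capacity is to resize the cluster's default node pool before rerunning the pipeline. The node count below is only an illustrative value, and depending on your gcloud version the flag might be --size instead of --num-nodes. You might also want to raise spark.executor.instances in the properties file:

   # Example only: grow the cluster to five nodes.
   gcloud container clusters resize spark-on-gke --num-nodes 5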

Cleaning up

To avoid incurring charges to your Google Cloud Platform account for the resources used in this tutorial, clean up the resources that you created after you've finished. The following sections describe how to delete or turn off these resources.

Deleting the project

The easiest way to eliminate billing is to delete the project you created for the tutorial.

To delete the project:

  1. In the GCP Console, go to the Projects page.

    Go to the Projects page

  2. In the project list, select the project that you want to delete, and then click Delete project.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.

What's next

  • Check out another example of using Spark with BigQuery and Cloud Dataproc.
  • Check out this tutorial that uses Cloud Dataproc, BigQuery, and Apache Spark ML for machine learning.
  • Try out other Google Cloud Platform features for yourself. Have a look at our tutorials.
