This tutorial shows how to create and run a data pipeline that uses BigQuery to store data and Spark on Google Kubernetes Engine (GKE) to process that data. The pipeline is useful for teams that have standardized their compute infrastructure on GKE and want to port their existing workflows. For most teams, running Spark on Cloud Dataproc is the easiest and most scalable way to run Spark applications. The tutorial analyzes a public BigQuery dataset of GitHub data to find the projects that would benefit most from a contribution. This tutorial assumes that you are familiar with GKE and Apache Spark.
Many projects on GitHub are written in Go, but few indicators tell contributors that a project needs help or where the codebase needs attention most.
In this tutorial, you use the following indicators to tell whether a project needs contributions (a sample query that estimates one of these indicators follows the list):
- Number of open issues.
- Number of contributors.
- Number of times the packages of a project are imported by other projects.
- Frequency of commits.
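For example, once your environment is set up (later in this tutorial), you can estimate the contributor count for repositories in the public sample data directly from BigQuery. The following query is a quick sketch; it is not part of the pipeline, and the sample_commits table covers only a small sample of repositories:

export PROJECT=$(gcloud info --format='value(config.project)')
bq query --project_id $PROJECT \
    'SELECT repo_name, COUNT(DISTINCT author.email) AS contributors
     FROM [bigquery-public-data:github_repos.sample_commits]
     GROUP BY repo_name
     ORDER BY contributors DESC
     LIMIT 10'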
The Spark application's pipeline queries Go files from BigQuery, computes package popularity, and writes the results back to BigQuery. In this tutorial, you complete the following objectives:
- Create a Kubernetes Engine cluster to run your Spark application.
- Deploy a Spark application on Kubernetes Engine.
- Query and write BigQuery tables in the Spark application.
- Analyze the results by using BigQuery.
This tutorial uses the following billable components of Google Cloud:
- BigQuery
- Google Kubernetes Engine
- Cloud Storage
Before you begin
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
- In the Google Cloud Console, on the project selector page, select or create a Google Cloud project.
- Make sure that billing is enabled for your Cloud project. Learn how to confirm that billing is enabled for your project.
- Enable the Kubernetes Engine and BigQuery APIs.
Setting up your environment
In this section, you configure the project settings that you need in order to complete the tutorial.
Start a Cloud Shell instance
You work through the rest of the tutorial in Cloud Shell.
Running the pipeline manually
In the following steps, you start your pipeline by having BigQuery extract all files with the extension .go from the sample_files table, which is a subset of [bigquery-public-data:github_repos.files]. Using this subset of the data allows for more cost-effective experimentation.
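Before you run the queries, you can optionally compare the sizes of the sample table and the full table by inspecting their metadata (look at the numRows and numBytes fields in the output). This is a quick check, not a required step:

bq show --format=prettyjson bigquery-public-data:github_repos.sample_files
bq show --format=prettyjson bigquery-public-data:github_repos.files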
In Cloud Shell, run the following commands to create a new dataset and a new table in BigQuery to store intermediate query results:
export PROJECT=$(gcloud info --format='value(config.project)')
bq mk --project_id $PROJECT spark_on_k8s_manual
bq mk --project_id $PROJECT spark_on_k8s_manual.go_files
View a sample of the Go files from the GitHub repository dataset, and then store the files in an intermediate table by using the --destination_table flag:
export PROJECT=$(gcloud info --format='value(config.project)')
bq query --project_id $PROJECT --replace \
    --destination_table spark_on_k8s_manual.go_files \
    'SELECT id, repo_name, path
     FROM [bigquery-public-data:github_repos.sample_files]
     WHERE RIGHT(path, 3) = ".go"'
You should see file paths listed along with the repository that they came from. For example:
Waiting on bqjob_r311c807f17003279_0000015fb8007c47_1 ... (0s) Current status: DONE
+----------------+------------------+-------------------------+
|       id       |    repo_name     |          path           |
+----------------+------------------+-------------------------+
| 31a4559c1e636e | mandelsoft/spiff | spiff++/spiff.go        |
| 15f7611dd21a89 | bep/gr           | examples/router/main.go |
| 15cbb0b0f096a2 | knq/xo           | internal/fkmode.go      |
+----------------+------------------+-------------------------+
The list of all identified Go files is now stored in your spark_on_k8s_manual.go_files table.
Run the following query to display the first 10 characters of each file:
export PROJECT=$(gcloud info --format='value(config.project)')
bq query --project_id $PROJECT \
    'SELECT sample_repo_name as repo_name, SUBSTR(content, 0, 10)
     FROM [bigquery-public-data:github_repos.sample_contents]
     WHERE id IN (SELECT id FROM spark_on_k8s_manual.go_files)'
Running the pipeline with Spark on Kubernetes
Next, you automate a similar procedure with a Spark application that uses the spark-bigquery connector to run SQL queries directly against BigQuery. The application then manipulates the results and saves them to BigQuery by using the Spark SQL and DataFrames APIs.
Create a Kubernetes Engine cluster
To deploy Spark and the sample application, create a Kubernetes Engine cluster by running the following commands:
gcloud config set compute/zone us-central1-f
gcloud container clusters create spark-on-gke --machine-type n1-standard-2
Download sample code
Clone the sample application repository:
git clone https://github.com/GoogleCloudPlatform/spark-on-k8s-gcp-examples.git
cd spark-on-k8s-gcp-examples/github-insights
Configure identity and access management
You must create an Identity and Access Management (IAM) service account to grant Spark access to BigQuery.
Create the service account:
gcloud iam service-accounts create spark-bq --display-name spark-bq
Store the service account email address and your current project ID in environment variables to be used in later commands:
export SA_EMAIL=$(gcloud iam service-accounts list \
    --filter="displayName:spark-bq" --format='value(email)')
export PROJECT=$(gcloud info --format='value(config.project)')
The sample application must create and manipulate BigQuery datasets and tables and remove artifacts from Cloud Storage. Bind the storage.admin, bigquery.dataOwner, and bigquery.jobUser roles to the service account:
gcloud projects add-iam-policy-binding $PROJECT \
    --member serviceAccount:$SA_EMAIL --role roles/storage.admin
gcloud projects add-iam-policy-binding $PROJECT \
    --member serviceAccount:$SA_EMAIL --role roles/bigquery.dataOwner
gcloud projects add-iam-policy-binding $PROJECT \
    --member serviceAccount:$SA_EMAIL --role roles/bigquery.jobUser
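If you want to confirm that the bindings took effect, you can list the roles granted to the service account. This optional check uses standard gcloud filtering:

gcloud projects get-iam-policy $PROJECT \
    --flatten="bindings[].members" \
    --filter="bindings.members:serviceAccount:$SA_EMAIL" \
    --format="value(bindings.role)"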
Download the service account JSON key and store it in a Kubernetes secret. Your Spark drivers and executors use this secret to authenticate with BigQuery:
gcloud iam service-accounts keys create spark-sa.json --iam-account $SA_EMAIL
kubectl create secret generic spark-sa --from-file=spark-sa.json
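Optionally, verify that the secret exists and contains the key file without printing its contents:

kubectl describe secret spark-sa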
Add permissions that allow Spark to launch jobs in the Kubernetes cluster:
kubectl create clusterrolebinding user-admin-binding \
    --clusterrole=cluster-admin --user=$(gcloud config get-value account)
kubectl create clusterrolebinding spark-admin \
    --clusterrole=cluster-admin --serviceaccount=default:default
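You can optionally confirm that the default Kubernetes service account, which the Spark driver runs as, is now allowed to manage pods:

kubectl auth can-i create pods \
    --as=system:serviceaccount:default:default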
Configure and run the Spark application
You now download, install, and configure Spark to execute the sample Spark application in your Kubernetes Engine cluster.
Install Maven, which you use to manage the build process for the sample application:
sudo apt-get install -y maven
Build the sample application jar:
mvn clean package
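If the build succeeds, the assembled jar appears under target/. A quick sanity check:

ls -lh target/github-insights-1.0-SNAPSHOT-jar-with-dependencies.jar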
Create a Cloud Storage bucket to store the application jar and the results of your Spark pipeline:
export PROJECT=$(gcloud info --format='value(config.project)')
gsutil mb gs://$PROJECT-spark-on-k8s
Upload the application jar to the Cloud Storage bucket:
gsutil cp target/github-insights-1.0-SNAPSHOT-jar-with-dependencies.jar \
    gs://$PROJECT-spark-on-k8s/jars/
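Optionally confirm the upload:

gsutil ls -l gs://$PROJECT-spark-on-k8s/jars/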
Create a new BigQuery dataset:
bq mk --project_id $PROJECT spark_on_k8s
Download the official Spark 2.3 distribution and unarchive it:
wget https://archive.apache.org/dist/spark/spark-2.3.0/spark-2.3.0-bin-hadoop2.7.tgz
tar xvf spark-2.3.0-bin-hadoop2.7.tgz
cd spark-2.3.0-bin-hadoop2.7
Configure your Spark application by creating a properties file that contains your project-specific information:
cat > properties << EOF
spark.app.name  github-insights
spark.kubernetes.namespace  default
spark.kubernetes.driverEnv.GCS_PROJECT_ID  $PROJECT
spark.kubernetes.driverEnv.GOOGLE_APPLICATION_CREDENTIALS  /mnt/secrets/spark-sa.json
spark.kubernetes.container.image  gcr.io/cloud-solutions-images/spark:v2.3.0-gcs
spark.kubernetes.driver.secrets.spark-sa  /mnt/secrets
spark.kubernetes.executor.secrets.spark-sa  /mnt/secrets
spark.driver.cores  0.1
spark.executor.instances  3
spark.executor.cores  1
spark.executor.memory  512m
spark.executorEnv.GCS_PROJECT_ID  $PROJECT
spark.executorEnv.GOOGLE_APPLICATION_CREDENTIALS  /mnt/secrets/spark-sa.json
spark.hadoop.google.cloud.auth.service.account.enable  true
spark.hadoop.google.cloud.auth.service.account.json.keyfile  /mnt/secrets/spark-sa.json
spark.hadoop.fs.gs.project.id  $PROJECT
EOF
Run the Spark application on the sample GitHub dataset by using the following commands:
export KUBERNETES_MASTER_IP=$(gcloud container clusters list \
    --filter name=spark-on-gke --format='value(MASTER_IP)')
bin/spark-submit \
    --properties-file properties \
    --deploy-mode cluster \
    --class spark.bigquery.example.github.NeedingHelpGoPackageFinder \
    --master k8s://https://$KUBERNETES_MASTER_IP:443 \
    --jars http://central.maven.org/maven2/com/databricks/spark-avro_2.11/4.0.0/spark-avro_2.11-4.0.0.jar \
    gs://$PROJECT-spark-on-k8s/jars/github-insights-1.0-SNAPSHOT-jar-with-dependencies.jar \
    $PROJECT spark_on_k8s $PROJECT-spark-on-k8s --usesample
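While the application runs, you can watch the driver and executor pods start and terminate. This optional check uses the same spark-role labels that Spark on Kubernetes applies to its pods, which the next step also relies on to read the driver logs:

kubectl get pods -l spark-role=driver
kubectl get pods -l spark-role=executor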
Open a new Cloud Shell session by clicking the Add Cloud Shell session button.
In the new Cloud Shell session, view the logs of the driver pod by using the following command to track how the application progresses. The application takes about five minutes to execute.
kubectl logs -l spark-role=driver
When the application finishes executing, check the 10 most popular packages by running the following command:
bq query "SELECT * FROM spark_on_k8s.popular_go_packages ORDER BY popularity DESC LIMIT 10"
You can run the same pipeline on the full set of tables in the GitHub dataset by removing the --usesample option in step 8. Note that the size of the full dataset is much larger than that of the sample dataset, so you will likely need a larger cluster to run the pipeline to completion in a reasonable amount of time.
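For example, you could resize the cluster and give Spark more executors before rerunning the job. The node and executor counts below are illustrative, not tuned recommendations:

gcloud container clusters resize spark-on-gke --num-nodes 5
# Then raise the executor count in the properties file, for example:
# spark.executor.instances  12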
Cleaning up
To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.
After you finish the tutorial, you can clean up the resources that you created so that they stop using quota and incurring charges. The following sections describe how to delete or turn off these resources.
Deleting the project
The easiest way to eliminate billing is to delete the project that you created for the tutorial.
To delete the project:
- In the Cloud Console, go to the Manage resources page.
- In the project list, select the project that you want to delete, and then click Delete.
- In the dialog, type the project ID, and then click Shut down to delete the project.
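Alternatively, you can delete the project from Cloud Shell. Double-check the project ID before you run this command, because deletion removes everything in the project:

gcloud projects delete $PROJECT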
What's next
- Check out another example of using Spark with BigQuery and Cloud Dataproc.
- Check out this tutorial that uses Cloud Dataproc, BigQuery, and Apache Spark ML for machine learning.
- Explore reference architectures, diagrams, tutorials, and best practices about Google Cloud in our Cloud Architecture Center.