Google Cloud Big Data and Machine Learning Blog

Innovation in data processing and machine learning technology

Analyzing BigQuery datasets using BigQuery Interpreter for Apache Zeppelin

Thursday, September 1, 2016

Posted by Babu Prasad Elumalai, Technical Program Manager, Enterprise Partner Engineering and Alexander Bezzubov, Contributor, PMC Member of Apache Zeppelin

Today, data is being generated at exponential rates. There's an ever growing need to analyze this data and gain insights to make improvements and enhancements to existing systems. Google BigQuery, our hosted data warehousing and analytics engine, and Apache Zeppelin, an open source data analytics and visualization platform, can go a long way toward meeting that goal. And accessing those services via Google Cloud Dataproc, a managed Apache Hadoop and Spark service, brings new levels of ease-of-use to what can otherwise be a complex process. Read on to learn more about:

  1. Features of BigQuery and Apache Zeppelin
  2. How to install Apache Zeppelin on Google Cloud Dataproc
  3. Using Apache Zeppelin to derive insights, by running SQL to query the data stored in Google BigQuery from within a Zeppelin notebook. We'll use sample data sets hosted in the BigQuery Public Datasets to demonstrate querying and visualization.

Beginning with BigQuery and Zeppelin

Before we go any further, here's a brief introduction about BigQuery and Apache Zeppelin to help you get started.

BigQuery is Google's fully managed, petabyte scale, cost efficient analytics data warehouse. BigQuery is serverless — there's no infrastructure to manage and you don't need a database administrator, so you can focus on analyzing data to find meaningful insights, using familiar SQL and take advantage of our pay-as-you-go model. BigQuery is a powerful Big Data analytics platform used by all types of organizations, from startups to Fortune 500 companies.

Apache Zeppelin is an open-source web-based “notebook” that enables interactive data analytics. It's integrated with distributed general-purpose data processing systems such as Apache Spark, Apache Flink and many others and allows you to make beautiful, data-driven, interactive documents with SQL, Scala, R or Python right in the browser.

Here are some of the salient features of Apache Zeppelin that make it a popular option for data analysis:

  • An interactive interface that allows you to see the results of your analytics right away and have an immediate connection with your creation
  • Notebooks that run in your browser (either on your machine or remotely)
  • Support for experimentation with different charts for data exploration
  • Integration with a variety of open source tools in the current Big Data ecosystem such as Apache Spark, Apache Flink and so on
  • Open source and distributed under a business-friendly Apache 2.0 license
  • A diverse and vibrant developer community and interpreters for new data sources that are being added constantly

Installing Apache Zeppelin on Google Cloud Dataproc

Google Cloud Dataproc allows you to create clusters with Apache Hadoop, Apache Spark, Apache Hive and Apache Pig installed out of the box so you can use the cluster for processing your data sets. Cloud Dataproc can also be used as a deployment vehicle to install any software through means of initialization actions. Initialization actions are scripts that Cloud Dataproc can execute when it brings up a new cluster. You can find a list of initialization actions in our repository.

Let's now take a look at how we can use the initialization action for Apache Zeppelin to install the software on Google Cloud Dataproc.

gcloud dataproc clusters create my-cluster  \
--initialization-actions gs://dataproc-initialization-actions/zeppelin/zeppelin.sh

**You can also create a cluster from within the Google cloud console. In this case, you have to specify the Google Cloud Storage path (gs://dataproc-initialization-actions/zeppelin/zeppelin.sh) to the initialization action parameter in the console.

Once the cluster has been created, Zeppelin is configured to run on port 8080 on the master node in the cluster.

To connect to the Apache Zeppelin web interface with local port forwarding, run the following command in your shell:

gcloud compute ssh  --zone=<master-host-zone> --ssh-flag="-N" --ssh-flag="-L 
1080:<master-host-external-ip>:8080" <my-cluster-m>

Then, open a browser and go to the following link: http://localhost:1080/

If you need to use dynamic port forwarding and use a SOCKS 5 proxy please, follow the instructions described in the Cloud Dataproc web interfaces documentation.

Querying BigQuery using Apache Spark on Zeppelin

Let's first cover the method of using Spark to process the datasets available in BigQuery. This can be done using the BigQuery connector for Spark. To demonstrate this, let’s do some analysis on the github public dataset available in BigQuery.

We're interested in test repositories (repository names with ‘test’ in the string). We'll analyze how many commits are being done to these repositories and whether we can spot anything in the data.

In order to optimize costs for this test analysis, let’s copy relevant data into our own dataset in our project, so that we have less data to query. Once we're done with that, we'll run Scala code that will get the data from BigQuery using the connector and create a temporary table that resides in memory. We'll then use Spark SQL to query this data using Zeppelin.

Copy relevant data

We’ll use the BigQuery command line tool bq () to create our tables and load it with data from the query on the github public dataset.

Enabling billing

To enable billing and increase your quota, we'll guide you through the Google Developers Console to do a few things first.

Continue




Here’s the code snippet to load and register the table:

 

import com.google.cloud.hadoop.io.bigquery.BigQueryOutputFormat
import com.google.cloud.hadoop.io.bigquery.GsonBigQueryInputFormat
import com.google.gson.JsonObject
import java.sql.Timestamp

import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.LongWritable
import org.apache.spark.sql.types._
import org.apache.spark.sql._

// Marked as transient since configuration is not Serializable. This should
// only be necessary in spark-shell REPL.
@transient
val conf = sc.hadoopConfiguration

// Input parameters
val fullyQualifiedInputTableId = "wordcount_dataset.git_commit"

val bucket = conf.get("fs.gs.system.bucket")

// Output parameters
val projectId = conf.get("fs.gs.project.id")

// Input configuration
conf.set(BigQueryConfiguration.PROJECT_ID_KEY, projectId)
conf.set(BigQueryConfiguration.GCS_BUCKET_KEY, bucket)
BigQueryConfiguration.configureBigQueryInput(conf, fullyQualifiedInputTableId)

val inputTmpDir = s"gs://${bucket}/hadoop/tmp/bigquery/gittestdata"conf.set(BigQueryConfiguration.TEMP_GCS_PATH_KEY, inputTmpDir)

// Helper to convert JsonObjects to (word, count) tuples.
def convertToTuple(record: JsonObject) : (String, String, String) = {  
  val repo_name = record.get("repo_name").getAsString.toLowerCase  
  val committer_date = record.get("committer_date").getAsString  
  val committer_email =  record.get("committer_email").getAsString  
  return (repo_name, committer_date, committer_email)
}

// Load data from BigQuery.
val tableData = sc.newAPIHadoopRDD(    
    conf,    
    classOf[GsonBigQueryInputFormat],    
    classOf[LongWritable],    classOf[JsonObject]).cache

// Perform word count.
val commitData = (tableData    
    .map(entry => convertToTuple(entry._2)))    

val schema =  
  StructType(  
  StructField("repo_name", StringType, false) ::  
  StructField("committer_date", StringType, false) ::  StructField("committer_email", StringType, false) :: Nil)  

val dTable = commitData.map(p => Row(p._1, p._2, p._3))
val dTabledf = sqlContext.createDataFrame(dTable, schema)
dTabledf.registerTempTable("commitTableTestData")

Here's a SQL analysis of the table data


The analysis on the data reveals that there was an anomaly on the number of commits for repositories that have the string “test” in their name. This seems to have happened on a particular date, when there were 20 times more commits on a test repository. Drilling down on the data in the same notebook in another paragraph reveals that most of these commits came from two repositories that belonged to the same person.

More to the point, if you observe this example carefully, you’ll notice that we were able to complete the analysis in a single notebook and page. We were also able to sequentially define and execute our work in the notebook. This efficiency is one of the biggest advantages of using Zeppelin.

New: Native BigQuery Interpreter for Apache Zeppelin

You now know how to use the BigQuery Spark connector to process the data stored in BigQuery and analyze it using Zeppelin. However, this approach requires you to write code and then optionally run SQL to perform analysis on Zeppelin.

Starting with Zeppelin version 0.6.1, the native BigQuery Interpreter allows you to process and analyze datasets stored in Google BigQuery by directly running SQL against it from within an Apache Zeppelin notebook — eliminating the need to write code. Configuring BigQuery Interpreter is simple.

Enter your Google project Id and click save to create the interpreter. The wait_time property allows you to control the server side timeout setting for your queries, while max_no_of_rows allows you to control result set size. If you're running Apache Zeppelin outside of Google Compute Engine, you’ll also need to configure access credentials at startup through the GOOGLE_APPLICATION_CREDENTIALS environment variable in conf/zeppelin-env.sh. Do so by following these instructions.

Running queries

Once you've configured the interpreter, you should be able to query all the datasets you have access to in BigQuery and analyze them using Apache Zeppelin. As an example, let’s redo the same analysis we did using Spark on BigQuery in Zeppelin earlier. Only this time, we'll run plain SQL to do the analysis using the BigQuery Interpreter.

Visualizing the BigQuery dataset via the native BigQuery Interpreter

Here are some more examples:

Example 1: Let’s examine the NYC Taxi and Limousine Commission (TLC) trip records from all trips completed in yellow and green taxis in NYC from 2009 to 2015. This is available as a public dataset today in BigQuery. We'll analyze the average speed per hour of day per year.


From the analysis above, it's clear that, the average speed is highest(~22 MPH) at 5 AM in the morning from 2009 to 2015.

Here's the SQL that we used:

SELECT  
  year(pickup_datetime) year,  
  HOUR(pickup_datetime) hour,  

INTEGER(100*AVG(trip_distance/((dropoff_datetime-pickup_datetime)/3600000000)))/100 speed
FROM  
  [nyc-tlc:yellow.trips]
WHERE  
  fare_amount/trip_distance BETWEEN 2  
  AND 10
GROUP BY  
  1,2
ORDER BY  
  1,2 

Example 2: Let’s use the earlier Github dataset we used earlier to discover new patterns using the interpreter. We'll analyze the top programming languages by size of their repositories in github.


From the pie chart above, it looks like the repositories written in C are largest in terms of size in bytes

Here's the SQL:

select language.name as language_name,sum(language.bytes) as bytes from 
[bigquery-public-data:github_repos.languages] group by language_name having 
sum(language.bytes) > 100000000 order by 2 desc

Conclusion

In this blog, we talked about using BigQuery for data warehousing, leveraging Apache Zeppelin for data analysis and using Apache Spark BigQuery connector to query BigQuery from Zeppelin. Finally, we introduced you to the native BigQuery Interpreter for Zeppelin that allows you to run SQL on your BigQuery datasets. We also showed you a couple of visualizations using the public datasets hosted in BigQuery.

Now, that you've seen how to use the BigQuery Interpreter on Apache Zeppelin, you can play around with the interpreter and build cool visualizations and analyze your datasets stored in BigQuery.

  • Big Data Solutions

  • Product deep dives, technical comparisons, how-to's and tips and tricks for using the latest data processing and machine learning technologies.

  • Learn More

12 Months FREE TRIAL

Try BigQuery, Machine Learning and other cloud products and get $300 free credit to spend over 12 months.

TRY IT FREE