
Codelab: Creating a BigQuery Dashboard

Author: Szabolcs Feczak, Date: 06/05/2014, Tags: Codelab, BigQuery, App Engine, Google Charts

Contents

  1. Objective
  2. Requirements
  3. Introduction
  4. Setting Up
    1. Create a New App Engine Project: hellodashboard
    2. Download and Install Libraries
    3. Enable BigQuery API
    4. Create and Record Client Secret
    5. Add OAuth2 Support
  5. Create a Class for BigQuery
    1. A Small Optimization
    2. Understanding the Service Object
    3. [Optional Reading] Under the Covers with the OAuth2 Decorator
  6. BigQueryClient Constructor
    1. Which Project ID Should You Use?
  7. Make Our First Authenticated Calls
  8. Building the Dashboard
    1. Picking the Right Question for a Query
    2. Viewing the Data
  9. Conclusion

Objective

This codelab describes how to build a dashboard application on top of Google BigQuery. Our specific implementation uses Google App Engine using Python and Google Charts using Javascript. The general strategy and techniques should be applicable to any platform and visualization package.

This is an intermediate lab, and we assume you have already completed Part 1 of the Python Codelab or the Python App Engine Getting Started guide. That means you already have your basic App Engine development environment configured and that you've successfully run a basic app on your system using the development server.

Requirements

The requirements from the Python Codelab also apply to this lab:

In addition, this BigQuery Dashboard Codelab requires:

Introduction

Like other Google APIs, the BigQuery service requires strong authentication, and OAuth2 is the best practice. We will take that existing tutorial and add calls to BigQuery that aggregate across a large public data set, then display the results in a visual fashion suitable for a dashboard.

Setting Up

In the Python Codelab for App Engine, you focused on using the services available within App Engine itself, like the Datastore, email, and XMPP. If you've already gotten things running by using the Google App Engine Getting Started guide, that is fine as an alternative to the Python Codelab.

BigQuery is a service that runs outside of App Engine, so using it involves additional libraries for authentication and access. We add the authentication and access in the following steps:

  • Create a new App Engine project
  • Download and install libraries
  • Enable the BigQuery API
  • Create and record client secrets in client_secrets.json
  • Add OAuth2 support to main.py

After the setup, we will create a basic class for BigQuery and make our first authenticated query. Once we have the query results, we will reformat them to fit Google Charts and display them as a dashboard.

  1. Create a New App Engine Project: hellodashboard

    As you did in the Python Codelab, create a new App Engine application containing app.yaml and main.py. Remember that if you are going to deploy your application to the internet, you must first pick a unique name and register a new project at http://appengine.google.com. I'll call mine 'hellodashboard'. Using the Launcher to create the base application files and directory is encouraged. Your application must exist in its own directory so that we can install the supporting libraries. Let's assume this project is in 'source/hellodashboard'; we'll refer to that later when we install the libraries.

    Inside the hellodashboard directory, create a file named main.py, and give it the following contents:
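
    A minimal webapp2 "Hello World" along these lines will do; the handler name and greeting text are illustrative, not taken from the original listing:

      import webapp2

      class MainHandler(webapp2.RequestHandler):
          def get(self):
              self.response.out.write('Hello World.')

      app = webapp2.WSGIApplication([('/', MainHandler)], debug=True)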

    At this point you could try running the sample in your development App Engine server and access it through your browser. You should see a friendly 'Hello World.' That's nice, but what we really wanted was that directory, 'source/hellodashboard', so we have a place to install our libraries.

  2. Download and Install Libraries

    1. Refer to the Google APIs Client Library for Python documentation on the different methods of installing the library.

      The simplest way is to get the library and all of its dependencies (httplib2, uritemplate, and oauth2client) as one package. Download the file named google-api-python-client-gae-N.M.zip from the list of downloads, where N.M is the version number of the latest release. For example:

      export LC_CTYPE=C   # avoid sed "illegal byte sequence" errors on some systems
      export LANG=C
      cd hellodashboard
      curl -O https://google-api-python-client.googlecode.com/files/google-api-python-client-gae-1.2.zip
      unzip google-api-python-client-gae-1.2.zip
      # Rename the package to googleapiclient and rewrite its internal imports to match.
      mv apiclient googleapiclient
      sed -ie "s/from apiclient/from googleapiclient/g" googleapiclient/*
      curl -O https://pypi.python.org/packages/source/g/gviz_data_table/gviz_data_table-1.0.1.tar.gz
      tar -xzf gviz_data_table-1.0.1.tar.gz
      mv gviz_data_table-1.0.1/gviz_data_table .
      # Remove the downloaded archives and unpacked leftovers.
      rm -rf gviz_data_table-1.0.1* google-api-python-client-gae-1.2.zip
      
    2. Alternatively, you can use the following method with pip version 1.3 and above.

      If you have a lower version of pip installed you can upgrade it with sudo pip install --upgrade pip first.

      pip install --upgrade --user googleapiclient gviz-data-table
      # Copy an installed package (and, recursively, its dependencies)
      # from pip's install location into the application directory.
      function copylib() {
        # Set $Location and $Requires from pip's metadata for package $1.
        $(pip show $1  | egrep "^Location|^Requires" | awk -F ": " '{print "export "$1"="$2}' | sed -e "s/, /,/g")
        if [ -e $Location/$1 ]; then
          cp -R $Location/$1 $2
        else
          # Fall back to the underscore form of the package name.
          cp -R $Location/$(echo $1 | sed -e "s/-/_/g") $2
        fi;
        if [ $Requires ]; then
          echo $Requires | tr , "\n" | while read r; do
            copylib $r $2;
          done;
        fi;
      }
      copylib googleapiclient hellodashboard/
      copylib gviz-data-table hellodashboard/
      

      The API and authentication libraries are now ready for use.

  3. Enable the BigQuery API

    Before we can call the BigQuery API, we must make sure that the API is enabled for the project. In short: go to the Google Developers Console, https://console.developers.google.com/, select APIs & auth in the left-hand navigation, scroll to the BigQuery API, and check that its status in the last column is On.

    Refer to the instructions on the BigQuery signup page for more details.
  4. Create and Record Client Secrets

    Why do we need secrets? A shared secret establishes a trust relationship between Google, who authenticates the user, and your application, which the user authorizes to access data. This is where authentication and authorization via OAuth2 come into play. You have already installed the OAuth2 libraries (step 2, above) and these libraries, combined with the secret and auth services at Google, will enforce the security and trust between Google, your application, and the user. If you are interested in the authentication process then there is a very informative presentation on the Google Code site: How Authentication Works. For now, we'll stick to the practical matters of setting up your application and let the libraries do all the hard work.

    The first step is to generate some secrets and to tell the API Console the internet location where your application is running. The secrets are for OAuth, and the location is so that the Google servers can redirect the user back to your application after they have signed in.

    1. Go to the Google Developers Console, https://console.developers.google.com/ and open your 'Hello Dashboard Project'.
    2. Click APIs & Auth.
    3. Click Credentials.
    4. Click Create new Client ID.
    5. Choose application type Web Application
    6. Fill Authorized Javascript Origins with http://localhost:8080 (assuming your App Engine development server runs on port 8080; change it if necessary). The Authorized Redirect URI should be http://localhost:8080/oauth2callback. If you wish to run this in production, also add your appspot.com name on a new line in each box, e.g. origin: http://dashdemo.appspot.com and callback: http://dashdemo.appspot.com/oauth2callback.
    7. Click Create Client ID.
    8. You should see a box titled Client ID for web applications.
    9. Click Download JSON.

    There! You've told Google where your development server will be running and you've generated the client id and secret. Now it is time to give this information to hellodashboard.

    Let's rename the file you just downloaded to: hellodashboard/client_secrets.json

    1. Open the client_secrets.json file for editing.
    2. Copy the values for Client ID, Client secret, and Redirect URIs from the Client ID for web applications box in the Google Developers Console into client_secrets.json.
      The result should look like the sketch shown after this list (after formatting):
      The file works without formatting as well, in the same form as you downloaded it without line breaks; the formatting is only for readability. Keep your numeric Client ID nearby because you will need it later on as the value of PROJECT_ID.
    3. Save client_secrets.json.
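
    Here is a sketch of the expected shape of client_secrets.json; the values shown are placeholders, not real credentials:

      {
        "web": {
          "client_id": "123456789012.apps.googleusercontent.com",
          "client_secret": "REPLACE_WITH_YOUR_CLIENT_SECRET",
          "redirect_uris": ["http://localhost:8080/oauth2callback"],
          "auth_uri": "https://accounts.google.com/o/oauth2/auth",
          "token_uri": "https://accounts.google.com/o/oauth2/token"
        }
      }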
  5. Add OAuth2 Support to main.py

    Let's take a look at main.py again and consider the code we need to add to make BigQuery API calls. This file includes the handler for the get() request which is usually the first interaction a user has with your dashboard. Since the user wants to see information from BigQuery and all BigQuery calls require authorization, that first get() must be authenticated and authorized. That means we must add OAuth2 support.

    There are several ways to add OAuth2 to an application, but for App Engine apps the simplest approach is to use a special App Engine-specific OAuth2 decorator. First we need to import the right libraries:
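
    With the library installed under the name googleapiclient, the import likely looks like this:

      import os

      from oauth2client.appengine import oauth2decorator_from_clientsecrets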

    That's a long name for a constructor, but it tells us exactly what it does: create an OAuth2 Python decorator from the client_secrets.json file we created earlier. Here is how we instantiate the decorator in main.py (partial file):
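
    A sketch of the instantiation, assuming client_secrets.json sits next to main.py:

      CLIENT_SECRETS = os.path.join(os.path.dirname(__file__), 'client_secrets.json')

      decorator = oauth2decorator_from_clientsecrets(
          CLIENT_SECRETS,
          scope='https://www.googleapis.com/auth/bigquery')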

    The decorator needs to know the information for the OAuth2 flow we stored in client_secrets.json as well as the scope of permissions to request. In this codelab we are just interested in BigQuery, so we use the BigQuery scope, https://www.googleapis.com/auth/bigquery, listed under "Authentication" in the reference manual. If we want to add other authentication scopes for other APIs, we include them in the same string, separated by spaces. With all the scopes in the same string, each service can use the same credentials, and the user only authorizes the application once.

    What is this OAuth2 decorator going to do for us? It will handle the OAuth2 flow, checking that the user is logged in and that we have valid credentials for the requested scopes. If any of these things is missing, the decorator automatically interacts with the httplib2 request object and redirects the user to the proper pages to log in and/or give permission to the application to acquire credentials. You can do all of these steps manually, but the decorator is a great way to simplify your App Engine code, so for clarity and brevity we will use it in this codelab.

    Now let's "decorate" the get() method in main.py so that any call to get() prepares the decorator for the OAuth2 flow (partial file):

    To make use of the authentication by the decorator, we define a _get_given_name() function to retrieve the given name of the logged-in user from the OAuth2 service:
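
    One plausible sketch, querying the OAuth2 userinfo API (the call details here are an assumption, not the original listing):

      from googleapiclient.discovery import build

      def _get_given_name(decorated_http):
          # Ask the OAuth2 userinfo service about the signed-in user, using
          # the credential-aware httplib2 object from the decorator.
          # Note: this also requires the userinfo.profile scope.
          user_info_service = build('oauth2', 'v2', http=decorated_http)
          user_info = user_info_service.userinfo().get().execute()
          return user_info.get('given_name', '')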

    You could try running the code now, but you won't see any difference in behavior because we're not making an authenticated API call yet. You won't see any user prompts to log in or authorize APIs. To get ready for those prompts, we must add an OAuth callback handler. Many people call the OAuth authorization and authentication flow a "dance" because the user moves back and forth between the application and the OAuth servers on the internet. During this dance, your application must handle a callback from the authentication servers. Therefore we need to map decorator.callback_path to the library code in the webapp2.WSGIApplication parameter list. We can do that at the bottom of the main.py file:
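
    A sketch of that mapping:

      app = webapp2.WSGIApplication([
          ('/', MainHandler),
          (decorator.callback_path, decorator.callback_handler()),
      ], debug=True)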

    When you construct the OAuth2 decorator using the oauth2decorator_from_clientsecrets constructor, you should use the default callback address for your application, /oauth2callback. This is the same callback we configured when we created the API credentials in "Hello Dashboard Project" above and specified in the redirect_uris of client_secrets.json. Now map the OAuth2 callback URL, decorator.callback_path, in main.py to decorator.callback_handler(), which does all the tedious work of obtaining the user credentials and storing them for future use. If you're curious about those credentials, you can find them stored in the App Engine datastore as a CredentialsModel.

    You've now completed the changes necessary to add OAuth2 to your main get() handler and we can think about what BigQuery API calls we would like to make, confident that the user will be prompted for any necessary permissions. Your main.py should look like this (complete file):
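
    A sketch assembling the snippets above; the userinfo.profile scope is added so the given-name lookup works, and the greeting text is illustrative:

      import os
      import webapp2

      from googleapiclient.discovery import build
      from oauth2client.appengine import oauth2decorator_from_clientsecrets

      CLIENT_SECRETS = os.path.join(os.path.dirname(__file__), 'client_secrets.json')

      decorator = oauth2decorator_from_clientsecrets(
          CLIENT_SECRETS,
          # Scopes are space-separated in one string; userinfo.profile
          # lets _get_given_name() read the user's name.
          scope='https://www.googleapis.com/auth/bigquery '
                'https://www.googleapis.com/auth/userinfo.profile')

      def _get_given_name(decorated_http):
          user_info_service = build('oauth2', 'v2', http=decorated_http)
          user_info = user_info_service.userinfo().get().execute()
          return user_info.get('given_name', '')

      class MainHandler(webapp2.RequestHandler):
          @decorator.oauth_required
          def get(self):
              self.response.out.write('Hello %s!' % _get_given_name(decorator.http()))

      app = webapp2.WSGIApplication([
          ('/', MainHandler),
          (decorator.callback_path, decorator.callback_handler()),
      ], debug=True)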

Create a Class for BigQuery

Since we are building a dashboard, we're going to be interacting with BigQuery quite a bit, making queries and reformatting the results for display. We should wrap BigQuery into a class so that we can encapsulate some default values and present a clean interface. To keep the main.py code clean, let's create this class in a new file, bqclient.py.

The BigQuery API is part of a set of Google APIs available through the API Discovery Service. For Python, that means you don't need to download an API library specifically for BigQuery. Instead, you download one client library and use it to construct an interface to any API hosted by the API Discovery Service.

Let's take a look at the function we need to construct those interfaces to BigQuery API calls, our first line of bqclient.py:

from googleapiclient.discovery import build

The build() function requires two parameters, serviceName and version, which, in our case, have the values bigquery and v2, specifying the BigQuery service with version 2 of the BigQuery API (the current version in December 2012). You can find these parameters as part of the REST URI in the reference manual. We could now construct the interface like this:

service = build('bigquery', 'v2')
but that would miss an opportunity for optimization.

A Small Optimization

The build() method accepts an httplib2 object; by passing one with caching enabled, we take advantage of the HTTP caching and ETag support in the Google APIs. In fact, to get the maximum advantage, we should share this httplib2 object with any other Google APIs we might call, so that we have a single object that can reuse connections and cache efficiently.

We could hide the oauth2client usage within bqclient.py, but then we could not reuse the decorator object for other API communication. Instead, let's include oauth2client directly in main.py, instantiate the decorator as a global variable, and make memcache available as the HTTP cache for the shared httplib2 object. Let's add this to main.py (partial file):
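
A sketch of those additions:

import httplib2

from google.appengine.api import memcache

# One shared httplib2 object for all Google API calls, with App Engine's
# memcache module serving as its HTTP cache.
http = httplib2.Http(memcache)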

So that leads us to this first version of a constructor in bqclient.py:
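
Assuming the shared httplib2 object is passed in, it might read:

from googleapiclient.discovery import build

class BigQueryClient(object):
    def __init__(self, http):
        # Build the BigQuery interface from the shared, cache-enabled
        # httplib2 object created in main.py.
        self.service = build('bigquery', 'v2', http=http)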

After we learn more about the service object we've constructed, we'll add a bit more to the BigQueryClient constructor to help it complete API calls successfully.

Understanding the Service Object

What is this BigQueryClient.service object that we built in the constructor? This is our primary interface to the Google BigQuery REST API, constructed dynamically from the API Discovery Service. All of the functionality of a service is available in a set of collections. In the case of BigQuery, you can see the collections listed in the REST API documentation: datasets, jobs, projects, tabledata, and tables. Each collection contains one or more methods, and as you may have guessed from the name, most of the work gets done in the jobs collection. The other collections help read and sometimes write metadata for the project, dataset, and tables. In contrast, the jobs collection gives us methods for making queries, importing data, and controlling the jobs themselves, all the longer-running tasks.

Here is an example of building up a BigQueryClient request for a table's metadata:
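
For example, inside a BigQueryClient method (the variable names are illustrative):

request = self.service.tables().get(projectId=project,
                                    datasetId=dataset,
                                    tableId=table)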

How did I know that tables().get() would require those three parameters? You can discover the parameters either by looking at the REST API documentation:

GET https://www.googleapis.com/bigquery/v2/projects/{projectId}/datasets/{datasetId}/tables/{tableId}

or by using Python's interactive mode to generate documentation with help():

$ python
Python 2.7.3 (default, Mar 13 2014, 11:03:55)
[GCC 4.7.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from googleapiclient.discovery import build
>>> service = build("bigquery", "v2")
>>> help(service.tables())
Help on Resource in module googleapiclient.discovery object:

class Resource(__builtin__.object)
 |  A class for interacting with a resource.
 |
 |  Methods defined here:
...
 |  get = method(self, **kwargs)
 |      Gets the specified table resource by table ID. This
 |      method does not return the data in the table, it only
 |      returns the table resource, which describes the
 |      structure of this table.
 |
 |      Args:
 |        projectId: string, Project ID of the requested table
 |                   (required)
 |        tableId: string, Table ID of the requested table
 |                   (required)
 |        datasetId: string, Dataset ID of the requested table
 |                   (required)
 |
 |      Returns:
 |        An object of the form
 |          {
 |            "kind": "bigquery#table",
...

The request builds up an HttpRequest based on the collection and method parameters, but it does not actually contact the BigQuery REST API until you call execute(). Since BigQuery requires authorization for all API requests, we must provide access to the credentials. We can do this by using the same OAuth2 decorator we used to wrap the get() request handler in main.py. Without this decorator, execute() would fail because it would try to use the httplib2 object we passed in to the BigQueryClient constructor. That original httplib2 object, by itself, has no idea how to get the credentials needed for OAuth2 authentication. It needs help from the same decorator that is wrapping the get() handler in main.py. All you need to do is call the decorator's http() method within whatever function it is wrapping (in this case, the get() request handler). If you try to call the decorator's http() method before the call to the decorated handler, the decorator won't have access to the right HttpRequest object and authorization will fail.
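
In code, that looks roughly like this inside the decorated get() handler:

decorated = decorator.http()           # httplib2 object carrying the credentials
result = request.execute(decorated)    # the REST call happens here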

[Optional Reading] Under the Covers with the OAuth2 Decorator

The OAuth2 decorator wrapping the get() request handler in main.py wraps the whole get() method and thereby gets access to the HttpRequest object passed in. Since it has access to the request object, it can intercept the call flow and redirect the user immediately to log in or to authorize the application for access to BigQuery. This happens at the moment you call execute(decorated_http). If the user is logged in and the application has the authorization credentials it needs, then execute() makes the REST API call you've built up. If the user is not logged in or the application lacks valid credentials, then the decorator immediately redirects the user to the proper login and authorization pages, and then resumes the original request. For more details, please see the comments in oauth2client/appengine.py.

BigQueryClient Constructor

Now that we know how important the OAuth2 decorator is, let's include that in the BigQueryClient constructor and use it in a method (complete bqclient.py file):
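
A sketch of the whole file; getTableData() is named in the text below, while getLastModTime() and Query() are plausible helpers for the calls made later in this codelab, not names confirmed by the original:

from googleapiclient.discovery import build

class BigQueryClient(object):
    def __init__(self, http, decorator):
        # Keep the OAuth2 decorator so every method can fetch a
        # credential-aware httplib2 object on demand.
        self.service = build('bigquery', 'v2', http=http)
        self.decorator = decorator

    def getTableData(self, project, dataset, table):
        decorated = self.decorator.http()
        return self.service.tables().get(
            projectId=project, datasetId=dataset,
            tableId=table).execute(decorated)

    def getLastModTime(self, project, dataset, table):
        data = self.getTableData(project, dataset, table)
        if data is not None and 'lastModifiedTime' in data:
            return data['lastModifiedTime']
        return None

    def Query(self, query, project, timeout_ms=10000):
        # Queries run as jobs, so they are billed to the given project.
        query_config = {'query': query, 'timeoutMs': timeout_ms}
        decorated = self.decorator.http()
        return self.service.jobs().query(
            projectId=project, body=query_config).execute(decorated)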

You may have noticed that getTableData() calls service.tables().get(...).execute(decorated) all in one line. This is the easy way to do it when you want the REST request to be issued immediately. Most of the metadata requests return quickly, so it makes sense to issue these immediately. For requests that may take a few seconds, like jobs(), sometimes it makes more sense to create the request and then execute() it from a longer-lived App Engine task. This codelab focuses on synchronous calls for simplicity, even for the longer-lived query job.

Which Project ID Should You Use?

It is important to know that the project ID used in the query is different from the project ID used in getting the last modification time. Your query lives in a "job", and jobs live in billable projects. When BigQuery is launched for general availability, we will charge for jobs like queries and ingestion, so the project for your job must be a project where you are authorized as a viewer, so that you can bill to that project. Let's call that project the BILLING_PROJECT_ID; you can use the same numeric value from the client ID that you added to client_secrets.json.

The other kind of project ID is a location where some data lives. When you ask for simple information about a table, like its last modification date, that request does not run in a job. Since there is no job, your API call is a request directly to that table resource, so the project ID for these calls is the project ID of the data. Let's call this the DATA_PROJECT_ID. For this sample, we will use the "publicdata" project.

The Python API library unfortunately hides the distinction between "jobs" and basic "get" requests, but they are pretty easy to see in the REST API documentation. If you find yourself getting confused about which project ID to send to a function, take a look at the original REST documentation to see whether or not it is a "job".

The Project ID and Project Number are interchangeable in most cases.

We've imported our libraries and constructed everything we need for OAuth2 authentication on App Engine. We've even built our interface to the BigQuery service. It is time to make our first BigQuery call.

Make Our First Authenticated Calls

Here is the complete code to the updated main.py with the changes needed to construct and call the BigQueryClient:
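
A sketch of the updated file, reusing the earlier snippets; BILLING_PROJECT_ID is a placeholder for your own numeric ID:

import httplib2
import os
import webapp2

from google.appengine.api import memcache
from oauth2client.appengine import oauth2decorator_from_clientsecrets

from bqclient import BigQueryClient

# Placeholder: the numeric value from your client ID in client_secrets.json.
BILLING_PROJECT_ID = '123456789012'
DATA_PROJECT_ID = 'publicdata'
DATASET = 'samples'
TABLE = 'natality'

CLIENT_SECRETS = os.path.join(os.path.dirname(__file__), 'client_secrets.json')

http = httplib2.Http(memcache)
decorator = oauth2decorator_from_clientsecrets(
    CLIENT_SECRETS,
    scope='https://www.googleapis.com/auth/bigquery')

bq = BigQueryClient(http, decorator)

class MainHandler(webapp2.RequestHandler):
    @decorator.oauth_required
    def get(self):
        modTime = bq.getLastModTime(DATA_PROJECT_ID, DATASET, TABLE)
        self.response.out.write('Hello Dashboard! Last mod time = %s' % modTime)

app = webapp2.WSGIApplication([
    ('/', MainHandler),
    (decorator.callback_path, decorator.callback_handler()),
], debug=True)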

Go ahead and run it in the development App Engine server!

When you visit localhost:8080 with your browser, the application tries to perform an HTTP GET. Unlike the first time you ran this application, get() is now decorated with @decorator.oauth_required and so this time you have to log in. Make sure you log in with your real Google email address as your user name, the one with access to BigQuery! Although you are running the development App Engine server, your application is talking to the real BigQuery service, so you need to provide a real account for authentication and authorization.

When you complete the OAuth dance, you should see:

Hello Dashboard! Last mod time = 1318622687171
That timestamp is the number of milliseconds since January 1, 1970, also known as "epoch time" but measured in milliseconds instead of the traditional seconds. It converts to 10/14/2011 8:04pm GMT, when we last updated this public dataset on U.S. natality.

We've now successfully made an authorized call to the BigQuery API. We're ready to make a query, but what should we ask BigQuery about the natality data set, and how should we display it?

Building the Dashboard

Google's BigQuery is an ideal back end for strategic dashboards because it excels at aggregation across massive data sets. The natality dataset we are accessing in this codelab is pretty big, but not massive, containing 137,826,763 rows, just about one for each live birth between 1969 and 2008. It isn't quite one row per birth because between 1969 and 1985 many records are only a 50% sample and should be counted twice. What kind of questions can we ask, and how would we like to display the answers?

Picking the Right Question for a Query

The description included in the publicdata:samples.natality dataset includes a nice pointer to some additional information about the data and some query examples. You can see this in the data returned from bigquery.tables().get(), or by clicking on the table in the Web UI:

One of the example queries asks this question: "Do gestation periods vary by state?"

SELECT state, SUM(gestation_weeks) / COUNT(gestation_weeks) AS weeks
FROM publicdata:samples.natality
WHERE year > 1990 AND year < 2005
  AND IS_EXPLICITLY_DEFINED(gestation_weeks)
GROUP BY state
ORDER BY weeks

Grouping by state is nice because it limits the size of the return set to 51 rows (the 50 states plus Washington, DC) and it includes geographic information, which always makes a compelling dashboard.

Viewing the Data

How can we present this data in Google Charts? There are some very nice geocharts to choose from, and options to allow us to focus on just the United States. This is going to require our output to be more sophisticated than the simple text strings we've been creating with response.out.write(). We're going to need to include both Javascript and HTML, so let's take the same strategy as the basic Python Codelab and make use of Web Templates.

Let's create the HTML template in a new file, index.html (complete file):
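
A sketch modeled on the Google Charts geochart examples; the chart options are illustrative, while the {{ data }} and {{ query }} template variables are discussed below:

<html>
  <head>
    <script type="text/javascript" src="https://www.google.com/jsapi"></script>
    <script type="text/javascript">
      google.load('visualization', '1', {packages: ['geochart']});
      google.setOnLoadCallback(drawRegionsMap);

      function drawRegionsMap() {
        var geodata = {{ data }};
        var data = new google.visualization.DataTable(geodata);
        var options = {region: 'US', resolution: 'provinces'};
        var chart = new google.visualization.GeoChart(
            document.getElementById('chart_div'));
        chart.draw(data, options);
      }
    </script>
  </head>
  <body>
    <h2>{{ query }}</h2>
    <div id="chart_div" style="width: 900px; height: 500px;"></div>
  </body>
</html>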

This should look almost exactly like the examples given in Google Charts, with the exception of geodata = {{ data }}. As you recall from the basic Python Codelab, a Web Template gets processed before it is sent to the requesting web browser, and template variables are enclosed in double curly-braces, like {{ data }} and {{ query }} above. We will be able to substitute our geographic data as a JSON object and replace {{ data }}, so we don't need to construct the data row by row as they do in the Google Chart examples.

Using the Web Template requires adding new imports to main.py, but the template module is included with the App Engine SDK, so there is no new library to install or copy:
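
The likely imports, using the Django-style template module bundled with the App Engine SDK:

import os

from google.appengine.ext.webapp import template

We also need two imports from the gviz_data_table library we installed earlier: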

from gviz_data_table import encode
from gviz_data_table import Table

We include the Table class from the gviz_data_table library to generate a data structure for the Google Charts JS library, and we need the encode function to transform the data and return it as a JSON string.

Let's create a utility method called _bq2geo to convert from the BigQuery query response format to what Google geocharts expect. A BigQuery query response comes back as JSON, but the Python API library interprets it into a Python data structure for you. We need to loop through it and convert it into the two columns of data that the Google Charts geochart expects: region and value. One change we have to make is converting the two-letter state abbreviations used in the natality table into the 'US-' + abbreviation form expected by the Google Charts data format:
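
A sketch of the conversion, assuming the gviz_data_table API installed earlier and the row layout of a BigQuery query response ('rows', then per-row 'f' fields with 'v' values):

def _bq2geo(bqdata):
    # Build a Google Charts data table with one 'US-XX' region column
    # and one numeric column, then encode it as a JSON string.
    geodata = Table()
    geodata.add_column('state', unicode, 'State')
    geodata.add_column('weeks', float, 'Weeks')
    for row in bqdata['rows']:
        geodata.append(['US-' + row['f'][0]['v'],
                        float(row['f'][1]['v'])])
    return encode(geodata)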

This method assumes that the query is structured to put the geographic location in the first column and the numeric value in the second column. Any other values in the row are ignored. At the end, we convert geodata to a JSON string so that the template renderer can substitute the JSON string directly into the template.

Let's add a global definition for the query using the SQL we decided on above, and make a few minor changes to the get() request handler so that it renders the template in response.out.write() (main.py, complete file):
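
A sketch of the changed parts, reusing the names from the earlier snippets:

QUERY = ('SELECT state, SUM(gestation_weeks) / COUNT(gestation_weeks) AS weeks '
         'FROM publicdata:samples.natality '
         'WHERE year > 1990 AND year < 2005 '
         'AND IS_EXPLICITLY_DEFINED(gestation_weeks) '
         'GROUP BY state ORDER BY weeks')

class MainHandler(webapp2.RequestHandler):
    @decorator.oauth_required
    def get(self):
        # Run the query as a job billed to our project, convert the
        # response to geochart form, and render it into the template.
        bqdata = bq.Query(QUERY, BILLING_PROJECT_ID)
        template_values = {'data': _bq2geo(bqdata), 'query': QUERY}
        path = os.path.join(os.path.dirname(__file__), 'index.html')
        self.response.out.write(template.render(path, template_values))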

And we're done! You should be able to run this on your development App Engine server and see the following:

Congratulations! You've just written your first basic dashboard using BigQuery, Google Charts, and App Engine.

Conclusion

Now that you've finished your first dashboard, you've learned the basics of using OAuth2 in Python on App Engine, creating credentials, and wrapping functions that require authorization with decorators. You should be able to add your own queries over your own data using the same pattern of code.

If you'd like the complete source code for this sample, please see the Python Hello Dashboard sample.