Visualizing BigQuery Data Using Google Cloud Datalab

BigQuery is a petabyte-scale analytics data warehouse that you can use to run SQL queries over vast amounts of data in near real time.

Data visualization tools can help you make sense of your BigQuery data and help you analyze the data interactively. You can use visualization tools to help you identify trends, respond to them, and make predictions using your data. In this tutorial, you use Google Cloud Datalab to visualize data in the BigQuery natality sample table.

Objectives

In this tutorial you:

  • Set up an environment to run Cloud Datalab notebooks on Google Cloud Platform
  • Query and visualize BigQuery data using Google Cloud Datalab

Costs

Running Cloud Datalab on a Compute Engine VM will incur charges. In addition, BigQuery is a paid product and you incur BigQuery usage costs when accessing BigQuery through Cloud Datalab. BigQuery query pricing provides the first 1 TB per month free of charge. For more information, see BigQuery pricing.

Before you begin

Before you begin this tutorial, use the Google Cloud Platform Console to create or select a project and enable billing.

  1. Sign in to your Google account.

    If you don't already have one, sign up for a new account.

  2. Select or create a Cloud Platform project.

    Go to the Projects page

  3. Enable billing for your project.

    Enable billing

  4. BigQuery is automatically enabled in new projects. To activate BigQuery in a pre-existing project, enable the BigQuery API.

    Enable the API

After creating your project, download, install, and initialize the Google Cloud SDK using the appropriate quick start for your platform.

Set up Google Cloud Datalab on a Compute Engine VM

Cloud Datalab functionality is packaged into a Docker container. This container provides a ready-to-use environment that includes the Python runtime, a set of libraries chosen for data analysis and visualization scenarios, Google Cloud Platform integration functionality, and a front-end server that provides access to this environment.

Each deployed or running Cloud Datalab environment represents a Cloud Datalab workspace. Within this workspace, the Cloud Datalab frontend manages notebooks, notebook sessions, and the corresponding instances of IPython and Python runtime. For more information, see managing Cloud Datalab notebooks.

You use the gcloud compute command-line tool to create the VM that hosts the Datalab container and to connect to the instance over SSH. Before you set up the cloud-based Datalab environment, verify that you followed the instructions for running gcloud init in the appropriate Google Cloud SDK quick start for your platform. The gcloud init command sets your default project and obtains credentials for your account. Alternatively, you can log in to your account and obtain credentials from the command line using gcloud auth login, and you can set your default project using gcloud config set project.

To set up Cloud Datalab on a Compute Engine VM:

  1. Open a terminal or command window and type the following command to download a YAML file that defines how to set up the Cloud Datalab VM.

    gsutil cp gs://cloud-datalab/server.yaml ./datalab-server.yaml
    
  2. Type the following command to create the Compute Engine VM used to run the Datalab container. Replace [ZONE] with the appropriate Compute Engine zone for your VM, keeping the surrounding quotes.

    If you configured a default Compute Engine region and zone, you can remove --zone from the command. For instructions on setting the Compute Engine region/zone, see set default zone and region in your local client.

    gcloud compute instances create "dltutorial" \
        --zone "[ZONE]" \
        --image-family "container-vm" \
        --image-project "google-containers" \
        --metadata "google-container-manifest=$(cat datalab-server.yaml)" \
        --machine-type "n1-highmem-2" \
        --scopes "cloud-platform"

    This command creates a VM named “dltutorial” in the default network for your GCP project. The default network contains a firewall rule allowing SSH access to your instance. The “cloud-platform” scope gives your VM full access to all GCP resources and services.

  3. When the instance is created, the command-line output shows the status of the VM as “RUNNING.” Type the following command to connect to the VM. Replace [ZONE] with the appropriate Compute Engine zone for your VM, keeping the surrounding quotes. If you configured a default Compute Engine region and zone, you can remove --zone from the command.

    gcloud compute ssh --quiet \
        --zone "[ZONE]" \
        --ssh-flag="-N" \
        --ssh-flag="-L" \
        --ssh-flag="localhost:8081:localhost:8080" \
        "${USER}@dltutorial"

    This command maps the remote Datalab web interface to localhost port 8081.

  4. Once you see the message Warning: Permanently added [host] to the list of known hosts, you are connected to the instance. Proceed to the next section.

    The command-line interface may no longer produce output once the SSH connection is established.

Create queries and visualizations using Google Cloud Datalab

A notebook provides an environment to author and execute code. A notebook is essentially a source artifact, saved as a .ipynb file — it can contain descriptive text content, executable code blocks, and associated results (rendered as interactive HTML). Structurally, a notebook is a sequence of cells.

A cell is a block of input text that is evaluated to produce results. Cells can be of two types:

  • Code cells - contain code to evaluate. Any outputs or results from executing the code are rendered inline after the input code.

  • Markdown cells - contain markdown text that is converted to HTML to produce headers, lists, and formatted text.

Each opened notebook is associated with a running session. In IPython, this is also referred to as a kernel. This session executes all the code entered within the notebook, and manages the state (variables, their values, functions and classes, and any existing Python modules you load).
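
Because a notebook is saved as a plain .ipynb JSON file, its cell structure can also be inspected outside of a running session. The following sketch uses the nbformat library, which is not part of this tutorial's workflow and is shown only as an illustration; the file name is hypothetical.

    # Illustrative only: read a saved notebook and list its cells.
    # 'BigQuery tutorial.ipynb' is a hypothetical file name.
    import nbformat

    nb = nbformat.read('BigQuery tutorial.ipynb', as_version=4)
    for cell in nb.cells:
        # Each cell records its type ('code' or 'markdown') and its source text.
        print('%s: %r' % (cell.cell_type, cell.source[:40]))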

In this section of the tutorial, you create a Google Cloud Datalab notebook used to query and visualize data in BigQuery. You create visualizations using the data in the natality sample table.

To query and visualize BigQuery data using Cloud Datalab:

  1. Type the following URL in your browser to open the Cloud Datalab web interface:

    http://localhost:8081
    
  2. When prompted, click Accept to agree to the license agreement.

  3. On the Cloud Datalab home page, click Notebook. This creates a new notebook with a single, empty code block.

  4. In the title bar, click Untitled notebook.

  5. In the Rename notebook dialog, type a new name such as BigQuery tutorial, and then click Ok.

  6. Type the following Python code into the code block.

    # Import the BigQuery module
    import datalab.bigquery as bq

    The Cloud Datalab APIs are provided in the datalab Python library, and the BigQuery functionality is contained within the datalab.bigquery module.
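
    Besides Query, the datalab.bigquery module exposes other classes, such as Table. The following is a hedged sketch of looking up the natality table's schema; the Table usage and the field attribute names are assumptions about the pydatalab API, not something this tutorial covers.

    # Assumed pydatalab API: look up table metadata with bq.Table.
    natality = bq.Table('publicdata:samples.natality')
    for field in natality.schema:
        # 'name' and 'data_type' are assumed attribute names.
        print('%s: %s' % (field.name, field.data_type))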

  7. Click Add code.

  8. In the new code block, type the following query. Because the source_year field is stored as an integer, the query uses the CAST function to convert the field from integer to string, so that the years are treated as labels rather than numeric values when the results are charted later in the tutorial.

    # SQL query to return total births by year
    bq.Query('SELECT CAST(source_year AS string) AS year, '
             'COUNT(is_male) AS birth_count '
             'FROM [publicdata:samples.natality] '
             'GROUP BY year ORDER BY year DESC LIMIT 15').results()

    Alternatively, this query can be written using the %%sql escape sequence as in the following example. Using %%sql does not require you to import the BigQuery module.

    %%sql
    SELECT CAST(source_year AS string) AS year,
           COUNT(is_male) AS birth_count
    FROM [publicdata:samples.natality]
    GROUP BY year
    ORDER BY year DESC
    LIMIT 15

    Cloud Datalab also lets you author SQL, JavaScript, or even shell commands, so you can issue command-like instructions or use alternate languages in the same notebook. These code cells start with an escape sequence, %%, which instructs the session to treat the content of the cell as something other than Python.
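
    You can also keep a query in a Python variable and convert its result to a pandas DataFrame for programmatic inspection. The following is a minimal sketch that reuses the to_dataframe() method from later steps; the variable names are illustrative.

    # Build the query once, then execute it and inspect the result.
    births_query = bq.Query('SELECT CAST(source_year AS string) AS year, '
                            'COUNT(is_male) AS birth_count '
                            'FROM [publicdata:samples.natality] '
                            'GROUP BY year ORDER BY year DESC LIMIT 15')

    births_df = births_query.to_dataframe()  # runs the query in BigQuery
    print(births_df.head())                  # first five rows of the result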

  9. Click Run > Run all cells. The results of the query are displayed below the code block.


    Query results can also be displayed using Pandas DataFrames. For more information on using DataFrames, see the SQL and Pandas DataFrames notebook: /datalab/docs/tutorials/BigQuery/SQL and Pandas DataFrames.ipynb.

  10. Click Add code.

  11. To chart the query results using a DataFrame, type the following code in the code block.

    (bq.Query('SELECT CAST(source_year AS string) AS year, '
              'COUNT(is_male) AS birth_count '
              'FROM [publicdata:samples.natality] '
              'GROUP BY year ORDER BY year DESC LIMIT 15')
     .to_dataframe()
     .set_index('year')
     .plot(kind='bar'))

    You can use a number of Python data analysis, data wrangling, and visualization libraries, such as numpy, pandas, matplotlib, and many others. Several of these libraries build on top of the pandas DataFrame object.
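
    For example, the plot() call returns a matplotlib Axes object, so you can refine the chart with standard matplotlib calls. The following sketch stores the query result in a DataFrame first; the axis labels and title are illustrative choices, not part of the tutorial's output.

    # Store the result, then customize the chart through the Axes object.
    df = bq.Query('SELECT CAST(source_year AS string) AS year, '
                  'COUNT(is_male) AS birth_count '
                  'FROM [publicdata:samples.natality] '
                  'GROUP BY year ORDER BY year DESC LIMIT 15').to_dataframe()

    ax = df.set_index('year').plot(kind='bar')
    ax.set_xlabel('Year')           # label the category axis
    ax.set_ylabel('Total births')   # label the value axis
    ax.set_title('Births by year')  # illustrative chart title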

  12. Click Run > Run from this cell. The chart appears below the code block.


  13. Click Add code.

  14. Type the following query to chart the number of female and male babies born each year. This query uses a CASE statement to evaluate the is_male Boolean field to determine how many births were female and how many were male for a given year.

    (bq.Query('SELECT CAST(source_year AS string) AS year, '
              'SUM(CASE WHEN is_male THEN 1 ELSE 0 END) AS male_births, '
              'SUM(CASE WHEN is_male THEN 0 ELSE 1 END) AS female_births '
              'FROM [publicdata:samples.natality] '
              'GROUP BY year ORDER BY year DESC LIMIT 15')
     .to_dataframe()
     .set_index('year')
     .plot(kind='bar'))

  15. Click Run > Run from this cell. The chart appears below the code block.


  16. Click Notebook > Save and checkpoint. Creating a checkpoint allows you to roll the notebook back to a previous state.

  17. Click Add code.

  18. Type the following query to chart the number of births by week day.

    (bq.Query('SELECT CAST(wday AS string) AS weekday, '
              'SUM(CASE WHEN is_male THEN 1 ELSE 0 END) AS male_births, '
              'SUM(CASE WHEN is_male THEN 0 ELSE 1 END) AS female_births '
              'FROM [publicdata:samples.natality] '
              'WHERE wday IS NOT NULL '
              'GROUP BY weekday ORDER BY weekday ASC')
     .to_dataframe()
     .set_index('weekday')
     .plot(kind='line'))

    Because the wday (weekday) field allows null values, the query excludes records where wday is null.

  19. Click Run > Run from this cell. The line chart displays below the code block.


    Notice that the number of births decreases dramatically on Saturday (7) and Sunday (1).
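
    To quantify the weekend drop, you can aggregate total births per weekday and compare averages in pandas. This is a sketch rather than a tutorial step; the query and variable names are illustrative.

    # Count total births per weekday, then compare weekend and weekday averages.
    wday_df = bq.Query('SELECT CAST(wday AS string) AS weekday, '
                       'COUNT(*) AS births '
                       'FROM [publicdata:samples.natality] '
                       'WHERE wday IS NOT NULL '
                       'GROUP BY weekday ORDER BY weekday ASC').to_dataframe()

    is_weekend = wday_df['weekday'].isin(['1', '7'])  # 1 = Sunday, 7 = Saturday
    print('average weekend births: %.0f' % wday_df[is_weekend]['births'].mean())
    print('average weekday births: %.0f' % wday_df[~is_weekend]['births'].mean())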

Cleaning up

To avoid incurring charges to your Google Cloud Platform account for the resources used in this tutorial:

  1. In the Cloud Platform Console, go to the Projects page.

    Go to the Projects page

  2. In the project list, select the checkbox next to the project you want to delete, and then click Delete project.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.

Deleting your project removes the Cloud Datalab VM. If you do not want to delete the Cloud Platform project, you can delete the Cloud Datalab VM.

To delete the Cloud Datalab VM:

  1. Open the Compute Engine VM Instances page.

    Go to the VM instances page

  2. Check the dltutorial instance and then click Delete.

  3. When prompted, click Delete.

What's next

  • Learn more about writing queries for BigQuery: Querying Data in the BigQuery documentation explains how to run synchronous and asynchronous queries, create user-defined functions (UDFs), and more.

  • Explore BigQuery syntax — BigQuery's legacy SQL-like syntax is described in the Query Reference (legacy SQL). The preferred dialect for SQL queries in BigQuery is standard SQL, which is described in the SQL Reference.
