Visualizing BigQuery Data Using Google Cloud Datalab

BigQuery is a petabyte-scale analytics data warehouse that you can use to run SQL queries over vast amounts of data in near real time.

Data visualization tools can help you make sense of your BigQuery data and help you analyze the data interactively. You can use visualization tools to help you identify trends, respond to them, and make predictions using your data. In this tutorial, you use Google Cloud Datalab to visualize data in the BigQuery natality sample table.

Objectives

In this tutorial you:

  • Set up an environment to run Cloud Datalab notebooks on Google Cloud Platform
  • Query and visualize BigQuery data using Google Cloud Datalab

Costs

Running Cloud Datalab on a Compute Engine VM will incur charges. In addition, BigQuery is a paid product and you incur BigQuery usage costs when accessing BigQuery through Cloud Datalab. BigQuery query pricing provides the first 1 TB per month free of charge. For more information, see BigQuery pricing.

Before you begin

Before you begin this tutorial, use the Google Cloud Platform Console to create or select a project and enable billing.

  1. Sign in to your Google account.

    If you don't already have one, sign up for a new account.

  2. Select or create a Cloud Platform project.

    Go to the Manage resources page

  3. Enable billing for your project.

    Enable billing

  4. BigQuery is automatically enabled in new projects. To activate BigQuery in a pre-existing project, enable the BigQuery API.

    Enable the API

  5. After creating your project, enable the Google Compute Engine API.

Overview of Google Cloud Datalab

Cloud Datalab is packaged as a container that runs in a VM (virtual machine) instance. The Cloud Datalab quickstart explains how to create the VM, run the container in it, and establish a connection from your browser to the Cloud Datalab container, which allows you to open existing Cloud Datalab notebooks and create new ones. You can find additional overview documentation in the Cloud Datalab datalab/docs/notebooks/intro notebook directory.

Cloud Datalab uses notebooks instead of text files containing code. Notebooks bring together code, documentation written as markdown, and the results of code execution—whether as text, image, or HTML/JavaScript. Like a code editor or IDE, notebooks help you write code: they allow you to execute code in an interactive and iterative manner, rendering the results alongside the code. Further, when you share a notebook with team members, you can include code, markdown-formatted documentation, and results that include interactive charts, providing context that goes beyond what Python or SQL code files alone can provide.

When you open a notebook, a backend "kernel" process is launched to manage the variables defined during the session and execute your notebook code. When the executed code accesses Google Cloud services such as BigQuery or Cloud Machine Learning Engine, it uses the service account available in the VM. Hence, the service account must be authorized to access the data or request the service. To display the cloud project and service account names, click the user icon in the top-right corner of the Cloud Datalab notebook or notebook listing page in your browser. The VM used for running Cloud Datalab is a shared resource accessible to all members of the associated cloud project. Therefore, using an individual's personal cloud credentials to access data is strongly discouraged.

As you execute code in the notebook, the state of the process executing the code changes. If you assign or reassign a variable, its new value is used for subsequent computations. Each running notebook is shown as a session in Cloud Datalab. You can click the sessions icon on the Cloud Datalab notebook listing page to list and stop sessions. While a session is running, the underlying process consumes memory resources. If you stop a session, the underlying process goes away along with its in-memory state, and the memory used by the session is freed. Results saved in the notebook persist on disk.

Setting up Google Cloud Datalab on a Compute Engine VM

In this tutorial, you set up Cloud Datalab using Google Cloud Shell.

To set up Cloud Datalab on a Compute Engine VM:

  1. Go to the Cloud Platform Console.

    Cloud Platform Console

  2. Click Activate Google Cloud Shell.

    Activate Cloud Shell icon

  3. Choose a default zone for the Cloud SDK. This zone will contain your Cloud Datalab instance. Enter the following command in the Cloud Shell window to configure the gcloud tool to use your selected zone. Replace [ZONE] with your zone name.

    gcloud config set compute/zone [ZONE]
    
  4. Enter the following command to create a Compute Engine VM (named dltutorial) used to run the Cloud Datalab container. This command also creates an SSH connection to your VM and maps the remote Cloud Datalab web interface to localhost port 8081. If prompted, enter the number corresponding to your default Compute Engine zone.

    datalab create dltutorial
    
  5. In the Cloud Shell window, click Web preview > Change port > Port 8081.

    This opens Cloud Datalab in a new tab.

    If the SSH connection to the VM is closed, you can re-create the tunnel manually. Replace [ZONE] with your zone name.

    gcloud compute ssh --quiet --zone "[ZONE]" --ssh-flag="-N" --ssh-flag="-L" --ssh-flag="localhost:8081:localhost:8080" "${USER}@dltutorial"
    

Google Cloud Datalab notebooks

A notebook provides an environment to author and execute code. A notebook is essentially a source artifact, saved as a .ipynb file — it can contain descriptive text content, executable code blocks, and associated results (rendered as interactive HTML). Structurally, a notebook is a sequence of cells.

A cell is a block of input text that is evaluated to produce results. Cells can be of two types:

  • Code cells — contain code to evaluate. Any outputs or results from executing the code are rendered inline after the input code.

  • Markdown cells — contain markdown text that is converted to HTML to produce headers, lists, and formatted text.

Each opened notebook is associated with a running session. In IPython, this is also referred to as a kernel. This session executes all the code entered within the notebook, and manages the state (variables, their values, functions and classes, and any existing Python modules you load).
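
Under the hood, the .ipynb format described above is plain JSON. The following hand-written sketch (constructed here for illustration, not generated by Cloud Datalab) shows how markdown cells, code cells, and saved outputs sit side by side in one file:

```python
import json

# A minimal, hand-written notebook document (nbformat 4) illustrating
# the structure described above: a notebook is a JSON file whose body
# is a sequence of cells, each either markdown or code.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 2,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {},
         "source": ["# Births analysis\n", "Documentation written as markdown."]},
        {"cell_type": "code", "metadata": {}, "execution_count": 1,
         "source": ["print('hello')"],
         # Results of execution are stored alongside the code.
         "outputs": [{"output_type": "stream", "name": "stdout",
                      "text": ["hello\n"]}]},
    ],
}

# Serialize and inspect the cell types, as a notebook viewer would.
text = json.dumps(notebook)
cell_types = [c["cell_type"] for c in json.loads(text)["cells"]]
print(cell_types)  # ['markdown', 'code']
```

Because the file is ordinary JSON, saved results travel with the notebook and can be rendered without re-running the code.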

Querying and visualizing BigQuery data

In this section of the tutorial, you create a Google Cloud Datalab notebook used to query and visualize data in BigQuery. You create visualizations using the data in the natality sample table. All queries in this tutorial are in standard SQL syntax.

To query and visualize BigQuery data using Cloud Datalab:

  1. Click the Google Cloud Datalab tab in your browser.

  2. On the Cloud Datalab home page, at the top of the window, click Notebook.

    create notebook icon

    This creates a new notebook with a single, empty code block.

  3. At the top of the page, click Untitled Notebook.

  4. In the Rename notebook dialog, type a new name such as BigQuery tutorial, and then click OK.

  5. BigQuery commands are invoked using the escape sequence %%bq, which instructs the session to treat the content of the cell as something other than Python. Enter the following query in the code block.

    %%bq query
    #SQL query to return total births by year
    SELECT CAST(source_year AS string) AS year, COUNT(is_male) AS birth_count
    FROM `publicdata.samples.natality`
    GROUP BY year
    ORDER BY year DESC
    LIMIT 15

    Because the source_year field is stored as an integer, the query uses the CAST function to convert the field from integer to string. This allows you to properly format the axes in your visualizations.

  6. Click Run > Run from this cell. The query results appear below the code block.

    Total births by year

    Notice the query job details appear below the query results. The details include the query processing time, the amount of data processed, and the job ID.

  7. Click Append a new code cell.

    Append a new code cell icon

  8. In the new code block, enter the following SQL query. In this query, the --name flag is added. This query object can then be passed to a chart by referencing the name, and the chart will execute the contained query.

    %%bq query --name total_births
    #SQL query to return total births by year
    SELECT CAST(source_year AS string) AS year, COUNT(is_male) AS birth_count
    FROM `publicdata.samples.natality`
    GROUP BY year
    ORDER BY year DESC
    LIMIT 15

  9. Click Append a new code cell.

  10. In the new code block, enter the following code to visualize the query results using a bar chart. The --data parameter uses the name of the previous query.

    %chart columns --data total_births --fields year,birth_count
    

  11. Click inside the query in the cell above the chart code block and then click Run > Run from this cell. The chart is displayed below the chart code block.

    chart of total births by year

  12. Click Append a new code cell.

  13. Enter the following query that retrieves the number of births by weekday.

    %%bq query --name births_by_weekday
    SELECT CAST(wday AS string) AS weekday, SUM(CASE WHEN is_male THEN 1 ELSE 0 END) AS male_births, SUM(CASE WHEN is_male THEN 0 ELSE 1 END) AS female_births
    FROM `publicdata.samples.natality`
    WHERE wday IS NOT NULL
    GROUP BY weekday
    ORDER BY weekday ASC

    Because the wday (weekday) field allows null values, the query excludes records where wday is null.

  14. Click Append a new code cell.

  15. In the new code block, enter the following code to visualize the query results using a line chart. The --data parameter uses the name of the previous query.

    %chart line --data births_by_weekday --fields weekday,male_births,female_births
    

  16. Click inside the query in the cell above the chart code block and then click Run > Run from this cell. The chart is displayed below the chart code block.

    Births by weekday line chart

    Notice the number of births dramatically decreases on Saturday (7) and Sunday (1). If you hover over a point on the chart, you can view the total male or female births for that day.

    Births by weekday drilldown

  17. Click Notebook > Save and checkpoint. Creating a checkpoint allows you to roll the notebook back to a previous state.
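
The conditional counts in the births-by-weekday query (the two SUM(CASE WHEN ...) expressions) can also be reproduced in plain pandas. The following is only an illustrative sketch over a tiny hand-made dataset, not the natality table:

```python
import pandas as pd

# Tiny hand-made stand-in for the natality table
# (wday: 1 = Sunday ... 7 = Saturday, matching the tutorial's chart).
rows = pd.DataFrame({
    "wday":    [1, 1, 2, 2, 2, 7],
    "is_male": [True, False, True, True, False, False],
})

# Equivalent of SUM(CASE WHEN is_male THEN 1 ELSE 0 END) and its
# inverse: count True and False values per weekday.
g = rows.groupby("wday")["is_male"]
births = pd.DataFrame({
    "male_births": g.sum(),              # True counts as 1
    "female_births": g.count() - g.sum(),
}).reset_index()

print(births)  # wday 1: 1 male / 1 female; wday 2: 2 / 1; wday 7: 0 / 1
```

Seeing the SQL and the pandas versions side by side can help when you later move query results into DataFrames, as the next sections do.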

Pandas DataFrames

Google Cloud Datalab allows you to use SQL to interact with Google BigQuery. However, SQL is just the starting point. Datalab enables you to combine the power of declarative SQL with imperative code (Python) to perform interesting data analysis, visualization, and transformation tasks.

For more information on using DataFrames, see the SQL and Pandas DataFrames notebook: /datalab/docs/tutorials/BigQuery/SQL and Pandas DataFrames.ipynb.

Querying and visualizing BigQuery data using pandas DataFrames

In this section of the tutorial, you query and visualize data in BigQuery using pandas DataFrames. You use the google.datalab.bigquery.Query class in the BigQuery API to query BigQuery data. You use the pandas Python library to analyze data using DataFrames.

  1. Click Append a new code cell.

  2. Type the following Python code into the code block to import the BigQuery module and the pandas library.

    import google.datalab.bigquery as bq
    import pandas as pd

  3. Click Run > Run from this cell.

  4. Click Append a new code cell.

  5. To run a query, create a new Query object with the desired SQL string. You can also use an object that has already been defined by the %%bq query --name command (as in the previous examples). In the new code block, enter the following query to retrieve the total number of births by year.

    total_births = bq.Query('SELECT CAST(source_year AS string) AS year, COUNT(is_male) AS birth_count FROM `publicdata.samples.natality` GROUP BY year ORDER BY year DESC LIMIT 15')
    df = total_births.execute(output_options=bq.QueryOutput.dataframe()).result()
    df.head(10)

    Because the source_year field is stored as an integer, the query uses the CAST function to convert the field from integer to string.

  6. Click Run > Run from this cell. The results of the query are displayed below the query code block in a DataFrame.

    Total births by year DataFrame

  7. Click Append a new code cell.

  8. To chart the query results in your DataFrame, enter the following code in the code block.

    ax = df.plot(kind='bar',x='year',title='Total births by year')
    ax.set_xlabel('Year')
    ax.set_ylabel('Birth count')

    This code plots the data as a bar chart and sets a chart title, x-axis label, and y-axis label.

  9. Click Run > Run from this cell. The chart appears below the code block.

    Total births by year bar chart

  10. Click Append a new code cell.

  11. Enter the following query that retrieves the number of births by weekday.

    births_by_weekday = bq.Query('SELECT CAST(wday AS string) AS weekday, SUM(CASE WHEN is_male THEN 1 ELSE 0 END) AS male_births, SUM(CASE WHEN is_male THEN 0 ELSE 1 END) AS female_births FROM `publicdata.samples.natality` WHERE wday IS NOT NULL GROUP BY weekday ORDER BY weekday ASC')
    df2 = births_by_weekday.execute(output_options=bq.QueryOutput.dataframe()).result()

    Because the wday (weekday) field allows null values, the query excludes records where wday is null.

  12. Click Run > Run from this cell.

  13. Click Append a new code cell.

  14. To chart the query results in your DataFrame using matplotlib, type the following code in the code block.

    df2 = births_by_weekday.execute(output_options=bq.QueryOutput.dataframe()).result()
    ax = df2.plot(kind='line',x='weekday',title='Births by weekday')
    ax.set_xlabel('Weekday')
    ax.set_ylabel('Total')
    

  15. Click Run > Run from this cell. The line chart appears below the code block.

    Births by weekday line chart
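
Once the results are in a DataFrame, you can go beyond plotting and apply ordinary pandas operations. The sketch below uses a small synthetic frame shaped like df2 (the numbers are invented, since the real values require BigQuery access) to quantify the weekend dip noted earlier:

```python
import pandas as pd

# Synthetic stand-in for df2 (weekday 1 = Sunday, 7 = Saturday).
df2 = pd.DataFrame({
    "weekday":       ["1", "2", "3", "4", "5", "6", "7"],
    "male_births":   [400, 600, 620, 610, 605, 590, 420],
    "female_births": [380, 570, 590, 585, 580, 560, 400],
})

# Total births per weekday, and the weekend share of all births.
df2["total"] = df2["male_births"] + df2["female_births"]
weekend = df2[df2["weekday"].isin(["1", "7"])]["total"].sum()
share = weekend / df2["total"].sum()
print(round(share, 3))  # prints 0.213
```

This kind of follow-on analysis is where combining SQL (for the heavy lifting in BigQuery) with pandas (for local transformation) pays off.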

Cleaning up

To avoid incurring charges to your Google Cloud Platform account for the resources used in this tutorial:

  1. In the Cloud Platform Console, go to the Projects page.

    Go to the Projects page

  2. In the project list, select the checkbox next to the project you want to delete, and then click Delete project.

  3. In the dialog, type the project ID, and then click Shut down to delete the project.

Deleting your project removes the Cloud Datalab VM. If you do not want to delete the Cloud Platform project, you can delete the Cloud Datalab VM.

To delete the Cloud Datalab VM:

  1. Open the Compute Engine VM Instances page.

    Go to the VM instances page

  2. Check the dltutorial instance and then click Delete.

  3. When prompted, click Delete.

What's next

  • Learn more about writing queries for BigQuery — Querying Data in the BigQuery documentation explains how to run queries, create user-defined functions (UDFs), and more.

  • Explore BigQuery syntax — The preferred dialect for SQL queries in BigQuery is standard SQL, which is described in the SQL Reference. BigQuery's legacy SQL-like syntax is described in the Query Reference (legacy SQL).
