Downloading BigQuery data to pandas using the BigQuery Storage API

The BigQuery Storage API provides fast access to data stored in BigQuery. Use the BigQuery Storage API to download data stored in BigQuery for use in analytics tools such as the pandas library for Python.

Objectives

In this tutorial, you:

  • Download query results to a pandas DataFrame by using the BigQuery Storage API from the IPython magics for BigQuery in a Jupyter notebook.
  • Download query results to a pandas DataFrame by using the BigQuery client library for Python.
  • Download BigQuery table data to a pandas DataFrame by using the BigQuery client library for Python.
  • Download BigQuery table data to a pandas DataFrame by using the BigQuery Storage API client library for Python.

Costs

BigQuery is a paid product, and you will incur BigQuery usage costs for the queries you run. The first 1 TB of query data processed per month is free. For more information, see the BigQuery Pricing page.

The BigQuery Storage API is also a paid product, and you will incur usage costs for the table data you scan when downloading a DataFrame. For more information, see the BigQuery Pricing page.

Before you begin

Before you begin this tutorial, use the Google Cloud Console to create or select a project and enable billing.

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud Console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Make sure that billing is enabled for your Cloud project. Learn how to confirm that billing is enabled for your project.

  4. BigQuery is automatically enabled in new projects. To activate BigQuery in a preexisting project, enable the BigQuery and BigQuery Storage APIs.

    Enable the APIs

  5. Set up a Python development environment.
    Set up Python
  6. Set up authentication for your development environment (see the sketch after this list).
    Set up authentication
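
If you use a service account, one common pattern is to build the client from a downloaded key file. A minimal sketch; the key path is a placeholder, and this is an assumption about your setup rather than a required step:

from google.cloud import bigquery
from google.oauth2 import service_account

# Placeholder path to a downloaded service account key file.
key_path = "path/to/service_account.json"

credentials = service_account.Credentials.from_service_account_file(
    key_path,
    scopes=["https://www.googleapis.com/auth/cloud-platform"],
)

# Pass explicit credentials instead of relying on application
# default credentials (GOOGLE_APPLICATION_CREDENTIALS).
bqclient = bigquery.Client(
    credentials=credentials,
    project=credentials.project_id,
)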

You should also be familiar with the IPython magics for BigQuery, the BigQuery client library, and how to use the client library with pandas before completing this tutorial.

Install the client libraries

Install the BigQuery Python client library version 1.9.0 or higher and the BigQuery Storage API Python client library.

PIP

Install the google-cloud-bigquery and google-cloud-bigquery-storage packages.

pip install --upgrade 'google-cloud-bigquery[bqstorage,pandas]'

Conda

Install the BigQuery and the BigQuery Storage API Conda packages from the community-run conda-forge channel.

conda install -c conda-forge google-cloud-bigquery \
  google-cloud-bigquery-storage \
  pandas \
  pyarrow
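
To confirm the installed versions, a quick check using importlib.metadata (available in Python 3.8 and later):

from importlib import metadata

# google-cloud-bigquery must be version 1.9.0 or higher.
for package in (
    "google-cloud-bigquery",
    "google-cloud-bigquery-storage",
    "pandas",
    "pyarrow",
):
    print(package, metadata.version(package))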

Download query results using the IPython magics for BigQuery

Start the Jupyter notebook server and create a new Jupyter notebook. Load the IPython magics for BigQuery using the %load_ext magic.

%load_ext google.cloud.bigquery

As of version 1.26.0 of the google-cloud-bigquery Python package, the BigQuery Storage API is used by default to download results from the %%bigquery magics. For example, the following cell downloads a public IRS Form 990 table into a DataFrame named tax_forms:

%%bigquery tax_forms
SELECT * FROM `bigquery-public-data.irs_990.irs_990_2012`
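
Work with the tax_forms DataFrame in later cells as you would with any pandas DataFrame:

tax_forms.head()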

Use the Python client libraries

Download query results using the BigQuery client library

Run a query by using the query method. Call the result method to wait for the query to finish, and then call the to_dataframe method to download the results by using the BigQuery Storage API.

from google.cloud import bigquery

bqclient = bigquery.Client()

# Download query results.
query_string = """
SELECT
CONCAT(
    'https://stackoverflow.com/questions/',
    CAST(id as STRING)) as url,
view_count
FROM `bigquery-public-data.stackoverflow.posts_questions`
WHERE tags like '%google-bigquery%'
ORDER BY view_count DESC
"""

dataframe = (
    bqclient.query(query_string)
    .result()
    .to_dataframe(
        # Optionally, explicitly request to use the BigQuery Storage API. As of
        # google-cloud-bigquery version 1.26.0 and above, the BigQuery Storage
        # API is used by default.
        create_bqstorage_client=True,
    )
)
print(dataframe.head())
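
To force the slower BigQuery REST API download path instead (for example, if the BigQuery Storage API isn't enabled for your project), pass create_bqstorage_client=False. A minimal variant, assuming the bqclient and query_string defined above:

# Fall back to the BigQuery REST API (tabledata.list) download path.
dataframe = (
    bqclient.query(query_string)
    .result()
    .to_dataframe(create_bqstorage_client=False)
)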

Download table data using the BigQuery client library

Download all rows in a table by using the list_rows method, which returns a RowIterator object. Download the rows by using the BigQuery Storage API by calling the to_dataframe method with the create_bqstorage_client argument.

from google.cloud import bigquery

bqclient = bigquery.Client()

# Download a table.
table = bigquery.TableReference.from_string(
    "bigquery-public-data.utility_us.country_code_iso"
)
rows = bqclient.list_rows(
    table,
    selected_fields=[
        bigquery.SchemaField("country_name", "STRING"),
        bigquery.SchemaField("fips_code", "STRING"),
    ],
)
dataframe = rows.to_dataframe(
    # Optionally, explicitly request to use the BigQuery Storage API. As of
    # google-cloud-bigquery version 1.26.0 and above, the BigQuery Storage
    # API is used by default.
    create_bqstorage_client=True,
)
print(dataframe.head())
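
To sample a table rather than download it in full, list_rows also accepts a max_results argument. A small illustrative variant, assuming the bqclient and table objects above; note that in recent library versions a max_results limit causes the download to use the REST API rather than the BigQuery Storage API:

# Download only the first 100 rows of the table.
rows = bqclient.list_rows(table, max_results=100)
dataframe = rows.to_dataframe()
print(dataframe.head())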

Download table data using the BigQuery Storage API client library

Use the BigQuery Storage API client library directly for fine-grained control over filters and parallelism. When only simple row filters are needed, a BigQuery Storage API read session may be used in place of a query.

Create a TableReadOptions object to select columns or filter rows, as sketched below. Then create a read session by using the create_read_session method.
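
Row filters use the row_restriction field of TableReadOptions, which takes a SQL WHERE-style predicate. A hypothetical filter on the same columns used in the example below; the predicate value is illustrative:

from google.cloud.bigquery_storage import types

read_options = types.ReadSession.TableReadOptions(
    selected_fields=["species_common_name", "fall_color"],
    # Illustrative predicate; only rows matching it are returned.
    row_restriction='fall_color = "red"',
)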

If the session contains any streams, begin reading rows from the first one by using the read_rows method. Loop through the pages property, and call the to_dataframe method on each page to convert it into a pandas DataFrame. For better performance, read from multiple streams in parallel; this code example reads from only a single stream for simplicity, and a parallel sketch follows it.

from google.cloud import bigquery_storage
from google.cloud.bigquery_storage import types
import pandas

# The project that is billed for the read session.
your_project_id = "project-for-read-session"

bqstorageclient = bigquery_storage.BigQueryReadClient()

project_id = "bigquery-public-data"
dataset_id = "new_york_trees"
table_id = "tree_species"
table = f"projects/{project_id}/datasets/{dataset_id}/tables/{table_id}"

# Select columns to read with read options. If no read options are
# specified, the whole table is read.
read_options = types.ReadSession.TableReadOptions(
    selected_fields=["species_common_name", "fall_color"]
)

parent = "projects/{}".format(your_project_id)

requested_session = types.ReadSession(
    table=table,
    # Avro is also supported, but the Arrow data format is optimized to
    # work well with column-oriented data structures such as pandas
    # DataFrames.
    data_format=types.DataFormat.ARROW,
    read_options=read_options,
)
read_session = bqstorageclient.create_read_session(
    parent=parent, read_session=requested_session, max_stream_count=1,
)

# This example reads from only a single stream. Read from multiple streams
# to fetch data faster. Note that the session may not contain any streams
# if there are no rows to read.
stream = read_session.streams[0]
reader = bqstorageclient.read_rows(stream.name)

# Parse all Arrow pages and create a DataFrame.
frames = []
for page in reader.rows().pages:
    frames.append(page.to_dataframe())
dataframe = pandas.concat(frames)
print(dataframe.head())
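
As noted above, reading from several streams in parallel is typically faster. A minimal sketch, assuming the bqstorageclient, parent, and requested_session objects from the example above; the thread count is an arbitrary choice, not a recommendation:

from concurrent import futures

import pandas

# Pass max_stream_count=0 to let the server choose the number of streams.
read_session = bqstorageclient.create_read_session(
    parent=parent, read_session=requested_session, max_stream_count=0,
)

def download_stream(stream_name):
    # Each worker reads one stream from start to finish.
    reader = bqstorageclient.read_rows(stream_name)
    return reader.to_dataframe()

# The worker count here is an arbitrary example value.
with futures.ThreadPoolExecutor(max_workers=4) as executor:
    frames = list(
        executor.map(
            download_stream, [stream.name for stream in read_session.streams]
        )
    )

dataframe = pandas.concat(frames) if frames else pandas.DataFrame()
print(dataframe.head())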

Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.

Delete your project. You didn't create any BigQuery resources in this tutorial, but deleting your project removes any other resources that you created.

  1. In the Cloud Console, go to the Manage resources page.

    Go to Manage resources

  2. In the project list, select the project that you want to delete, and then click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.

What's next