Downloading BigQuery data to pandas using the BigQuery Storage API

The BigQuery Storage API provides fast access to data stored in BigQuery. Use the BigQuery Storage API to download data stored in BigQuery for use in analytics tools such as the pandas library for Python.

Objectives

In this tutorial you:

  • Download query results to a pandas DataFrame by using the BigQuery Storage API from the IPython magics for BigQuery in a Jupyter notebook.
  • Download query results to a pandas DataFrame by using the BigQuery client library for Python.
  • Download BigQuery table data to a pandas DataFrame by using the BigQuery client library for Python.
  • Download BigQuery table data to a pandas DataFrame by using the BigQuery Storage API client library for Python.

Costs

BigQuery is a paid product and you will incur BigQuery usage costs for the queries you run. The first 1 TB of query data processed per month is free. For more information, see the BigQuery Pricing page.

BigQuery Storage API is a paid product and you will incur usage costs for the table data you scan when downloading a DataFrame. For more information, see the BigQuery Pricing page.

Before you begin

Before you begin this tutorial, use the Google Cloud Platform Console to create or select a project and enable billing.

  1. Sign in to your Google Account.

    If you don't already have one, sign up for a new account.

  2. Select or create a Google Cloud Platform project.

    Go to the Manage resources page

  3. Make sure that billing is enabled for your project.

    Learn how to enable billing

  4. BigQuery is automatically enabled in new projects. To activate BigQuery in a pre-existing project, enable the BigQuery and BigQuery Storage APIs.

    Enable the APIs

  5. Set up a Python development environment.
    Set up Python
  6. Set up authentication for your development environment.
    Set up authentication

You should also be familiar with the IPython magics for BigQuery, the BigQuery client library, and how to use the client library with pandas before completing this tutorial.

Install the client libraries

Install the BigQuery Python client library version 1.9.0 or higher and the BigQuery Storage API Python client library.

PIP

Install the google-cloud-bigquery package with the bqstorage and pandas extras, which pull in the google-cloud-bigquery-storage and pandas packages.

pip install --upgrade google-cloud-bigquery[bqstorage,pandas]

Conda

Install the BigQuery and the BigQuery Storage API Conda packages from the community-run conda-forge channel.

conda install -c conda-forge google-cloud-bigquery \
  google-cloud-bigquery-storage \
  pandas \
  pyarrow
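
To confirm that the installed google-cloud-bigquery package meets the 1.9.0 minimum, you can print its version from Python. This quick check is not part of the original tutorial:

# Print the installed BigQuery client library version; it should be 1.9.0
# or higher for the BigQuery Storage API features used in this tutorial.
import google.cloud.bigquery

print(google.cloud.bigquery.__version__)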

Download query results using the IPython magics for BigQuery

Start the Jupyter notebook server and create a new Jupyter notebook. Load the IPython magics for BigQuery using the %load_ext magic.

%load_ext google.cloud.bigquery

Download large query results with the BigQuery Storage API by adding the --use_bqstorage_api argument to the %%bigquery magics.

%%bigquery nodejs_deps --use_bqstorage_api
SELECT
    dependency_name,
    dependency_platform,
    project_name,
    project_id,
    version_number,
    version_id,
    dependency_kind,
    optional_dependency,
    dependency_requirements,
    dependency_project_id
FROM
    `bigquery-public-data.libraries_io.dependencies`
WHERE
    LOWER(dependency_platform) = 'npm'
LIMIT 2500000

When this argument is used with small query results, the magics use the BigQuery API to download the results.

%%bigquery stackoverflow --use_bqstorage_api
SELECT
  CONCAT(
    'https://stackoverflow.com/questions/',
    CAST(id as STRING)) as url,
  view_count
FROM `bigquery-public-data.stackoverflow.posts_questions`
WHERE tags like '%google-bigquery%'
ORDER BY view_count DESC
LIMIT 10

Set the context.use_bqstorage_api property to True to use the BigQuery Storage API by default.

import google.cloud.bigquery.magics

google.cloud.bigquery.magics.context.use_bqstorage_api = True

After you set the context.use_bqstorage_api property, run the %%bigquery magics without additional arguments to use the BigQuery Storage API to download large results.

%%bigquery java_deps
SELECT
    dependency_name,
    dependency_platform,
    project_name,
    project_id,
    version_number,
    version_id,
    dependency_kind,
    optional_dependency,
    dependency_requirements,
    dependency_project_id
FROM
    `bigquery-public-data.libraries_io.dependencies`
WHERE
    LOWER(dependency_platform) = 'maven'
LIMIT 2500000
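
The %%bigquery magics save the query results to a pandas DataFrame bound to the variable named on the magic line (java_deps in the preceding cell), so later notebook cells can work with the results directly. For example:

# In a later notebook cell, the results are an ordinary pandas DataFrame.
java_deps.head()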

Use the Python client libraries

Create Python clients

Use the following code to construct a BigQuery Client object and a BigQueryStorageClient.

import google.auth
from google.cloud import bigquery
from google.cloud import bigquery_storage_v1beta1

# Explicitly create a credentials object. This allows you to use the same
# credentials for both the BigQuery and BigQuery Storage clients, avoiding
# unnecessary API calls to fetch duplicate authentication tokens.
credentials, your_project_id = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"]
)

# Make clients.
bqclient = bigquery.Client(
    credentials=credentials,
    project=your_project_id,
)
bqstorageclient = bigquery_storage_v1beta1.BigQueryStorageClient(
    credentials=credentials
)

Use the google-auth Python library to create credentials that are sufficiently scoped for both APIs. Pass in a credentials object to each constructor to avoid authenticating twice.

Download query results using the BigQuery client library

Run a query by using the query method. Call the to_dataframe method to wait for the query to finish and download the results by using the BigQuery Storage API.

# Download query results.
query_string = """
SELECT
CONCAT(
    'https://stackoverflow.com/questions/',
    CAST(id as STRING)) as url,
view_count
FROM `bigquery-public-data.stackoverflow.posts_questions`
WHERE tags like '%google-bigquery%'
ORDER BY view_count DESC
"""

dataframe = (
    bqclient.query(query_string)
    .result()
    .to_dataframe(bqstorage_client=bqstorageclient)
)
print(dataframe.head())
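
As a point of comparison (not shown in the original tutorial), omitting the bqstorage_client argument makes to_dataframe download the results through the BigQuery API instead, which is slower for large result sets.

# Without bqstorage_client, to_dataframe falls back to the BigQuery API,
# which is slower for large results.
dataframe = bqclient.query(query_string).result().to_dataframe()
print(dataframe.head())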

Download table data using the BigQuery client library

Download all rows in a table by using the list_rows method, which returns a RowIterator object. To download the rows by using the BigQuery Storage API, call the to_dataframe method with the bqstorage_client argument.

# Download a table.
table = bigquery.TableReference.from_string(
    "bigquery-public-data.utility_us.country_code_iso"
)
rows = bqclient.list_rows(
    table,
    selected_fields=[
        bigquery.SchemaField("country_name", "STRING"),
        bigquery.SchemaField("fips_code", "STRING"),
    ],
)
dataframe = rows.to_dataframe(bqstorage_client=bqstorageclient)
print(dataframe.head())

Download table data using the BigQuery Storage API client library

Use the BigQuery Storage API client library directly for fine-grained control over filters and parallelism. When only simple row filters are needed, a BigQuery Storage API read session may be used in place of a query.

Create a TableReference object with the desired table to read. Create a TableReadOptions object to select columns or filter rows. Create a read session using the create_read_session method.

If the session has any streams, begin reading rows from one of them by using the read_rows method. Call the to_dataframe method on the reader to write the entire stream to a pandas DataFrame. For better performance, read from multiple streams in parallel; for simplicity, this code example reads from only a single stream. A parallel-read sketch follows the example.

table = bigquery_storage_v1beta1.types.TableReference()
table.project_id = "bigquery-public-data"
table.dataset_id = "new_york_trees"
table.table_id = "tree_species"

# Select columns to read with read options. If no read options are
# specified, the whole table is read.
read_options = bigquery_storage_v1beta1.types.TableReadOptions()
read_options.selected_fields.append("species_common_name")
read_options.selected_fields.append("fall_color")
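# (Added illustration, not part of the original sample.) Read options also
# support server-side row filtering with a SQL-like predicate, assuming the
# row_restriction option; uncomment to try it:
# read_options.row_restriction = 'fall_color = "Red"'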

parent = "projects/{}".format(your_project_id)
session = bqstorageclient.create_read_session(
    table,
    parent,
    read_options=read_options,
    # This API can also deliver data serialized in Apache Avro format.
    # This example leverages Apache Arrow.
    format_=bigquery_storage_v1beta1.enums.DataFormat.ARROW,
    # We use a LIQUID strategy in this example because we only read from a
    # single stream. Consider BALANCED if you're consuming multiple streams
    # concurrently and want more consistent stream sizes.
    sharding_strategy=(
        bigquery_storage_v1beta1.enums.ShardingStrategy.LIQUID
    ),
)

# This example reads from only a single stream. Read from multiple streams
# to fetch data faster. Note that the session may not contain any streams
# if there are no rows to read.
stream = session.streams[0]
position = bigquery_storage_v1beta1.types.StreamPosition(stream=stream)
reader = bqstorageclient.read_rows(position)

# Parse all Arrow blocks and create a dataframe. This call requires a
# session, because the session contains the schema for the row blocks.
dataframe = reader.to_dataframe(session)
print(dataframe.head())
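
For better performance with large tables, every stream in the session can be read concurrently and the per-stream results combined. The following sketch is not part of the original tutorial; it assumes the bqstorageclient and session objects created above and uses a thread pool to download each stream before concatenating the resulting DataFrames.

from concurrent.futures import ThreadPoolExecutor

import pandas

def download_stream(stream):
    # Each stream gets its own reader; the session is still required because
    # it carries the schema for the row blocks.
    position = bigquery_storage_v1beta1.types.StreamPosition(stream=stream)
    reader = bqstorageclient.read_rows(position)
    return reader.to_dataframe(session)

# Read all streams in parallel. A session may contain zero streams if there
# are no rows to read.
if session.streams:
    with ThreadPoolExecutor(max_workers=len(session.streams)) as executor:
        frames = list(executor.map(download_stream, session.streams))
    dataframe = pandas.concat(frames, ignore_index=True)
    print(dataframe.head())

When reading multiple streams concurrently, the BALANCED sharding strategy mentioned in the comments above may produce more consistently sized streams than LIQUID.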

Source code for all examples

View the complete source code for all client library examples.

import google.auth
from google.cloud import bigquery
from google.cloud import bigquery_storage_v1beta1

# Explicitly create a credentials object. This allows you to use the same
# credentials for both the BigQuery and BigQuery Storage clients, avoiding
# unnecessary API calls to fetch duplicate authentication tokens.
credentials, your_project_id = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"]
)

# Make clients.
bqclient = bigquery.Client(
    credentials=credentials,
    project=your_project_id,
)
bqstorageclient = bigquery_storage_v1beta1.BigQueryStorageClient(
    credentials=credentials
)
# Download a table.
table = bigquery.TableReference.from_string(
    "bigquery-public-data.utility_us.country_code_iso"
)
rows = bqclient.list_rows(
    table,
    selected_fields=[
        bigquery.SchemaField("country_name", "STRING"),
        bigquery.SchemaField("fips_code", "STRING"),
    ],
)
dataframe = rows.to_dataframe(bqstorage_client=bqstorageclient)
print(dataframe.head())
# Download query results.
query_string = """
SELECT
CONCAT(
    'https://stackoverflow.com/questions/',
    CAST(id as STRING)) as url,
view_count
FROM `bigquery-public-data.stackoverflow.posts_questions`
WHERE tags like '%google-bigquery%'
ORDER BY view_count DESC
"""

dataframe = (
    bqclient.query(query_string)
    .result()
    .to_dataframe(bqstorage_client=bqstorageclient)
)
print(dataframe.head())
table = bigquery_storage_v1beta1.types.TableReference()
table.project_id = "bigquery-public-data"
table.dataset_id = "new_york_trees"
table.table_id = "tree_species"

# Select columns to read with read options. If no read options are
# specified, the whole table is read.
read_options = bigquery_storage_v1beta1.types.TableReadOptions()
read_options.selected_fields.append("species_common_name")
read_options.selected_fields.append("fall_color")

parent = "projects/{}".format(your_project_id)
session = bqstorageclient.create_read_session(
    table,
    parent,
    read_options=read_options,
    # This API can also deliver data serialized in Apache Avro format.
    # This example leverages Apache Arrow.
    format_=bigquery_storage_v1beta1.enums.DataFormat.ARROW,
    # We use a LIQUID strategy in this example because we only read from a
    # single stream. Consider BALANCED if you're consuming multiple streams
    # concurrently and want more consistent stream sizes.
    sharding_strategy=(
        bigquery_storage_v1beta1.enums.ShardingStrategy.LIQUID
    ),
)

# This example reads from only a single stream. Read from multiple streams
# to fetch data faster. Note that the session may not contain any streams
# if there are no rows to read.
stream = session.streams[0]
position = bigquery_storage_v1beta1.types.StreamPosition(stream=stream)
reader = bqstorageclient.read_rows(position)

# Parse all Arrow blocks and create a dataframe. This call requires a
# session, because the session contains the schema for the row blocks.
dataframe = reader.to_dataframe(session)
print(dataframe.head())

Cleaning up

To avoid incurring charges to your Google Cloud Platform account for the resources used in this tutorial:

Delete your project. You didn't create any BigQuery resources in this tutorial, but deleting your project removes any other resources that you created.

  1. In the GCP Console, go to the Projects page.

    Go to the Projects page

  2. In the project list, select the project you want to delete and click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.
