Optimize BigQuery DataFrames performance
BigQuery DataFrames helps you analyze and transform data in BigQuery using a pandas-compatible API. To make your data processing faster and more cost-effective, you can use several techniques to improve performance.
This document describes the following ways to optimize performance:
- Use partial ordering mode.
- Cache results after expensive operations.
- Preview data by using the
peek()
method. - Defer the
repr()
data retrieval.
Use partial ordering mode
BigQuery DataFrames has an ordering mode feature, which enforces a
specific row order for operations like window functions and joins. You can
specify the ordering mode by setting the ordering_mode
property to either
strict
(known as strict ordering mode, which is the default) or
partial
(known as partial ordering mode). Using the partial
setting can make your queries more efficient.
Partial ordering mode is different from strict ordering mode. Strict ordering
mode arranges all rows in a specific order. This total ordering makes
BigQuery DataFrames work better with pandas, letting you access rows by
their order using the DataFrame.iloc
property. However, total ordering and
its default sequential index prevent filters on columns or rows from reducing
the amount of data scanned. This prevention occurs unless you apply those
filters as parameters to the read_gbq
and read_gbq_table
functions. To order
all the rows in the DataFrame, BigQuery DataFrames creates a hash of all
the rows. This operation can cause a full data scan that ignores row and column
filters.
Partial ordering mode stops BigQuery DataFrames from creating a total
order for all rows and turns off features that need a total order, like
the DataFrame.iloc
property. Partial ordering mode also sets the
DefaultIndexKind
class
to a null index, instead of to a sequential index.
When you filter a DataFrame
object using partial ordering mode,
BigQuery DataFrames doesn't calculate which rows are missing in the
sequential index. Partial ordering mode also doesn't automatically combine data
based on index. These approaches can increase the efficiency of your queries.
However, whether you use the default strict ordering mode or partial ordering
mode, the BigQuery DataFrames API works like the familiar pandas API.
With both partial and strict ordering modes, you pay for the BigQuery resources you use. However, using partial ordering mode can lower costs when working with large clustered and partitioned tables. This cost reduction occurs because row filters on cluster and partition columns reduce the amount of data processed.
Enable partial ordering mode
To use partial ordering, set the ordering_mode
property to partial
before
performing any other operation with BigQuery DataFrames, as
shown in the following code sample:
Partial ordering mode prevents implicit joins of unrelated
BigQuery DataFrames objects because it lacks a sequential index.
Instead, you must explicitly call the DataFrame.merge
method to join
two BigQuery DataFrames objects that derive from different table
expressions.
The Series.unique()
and Series.drop_duplicates()
features don't work with
partial ordering mode. Instead, use the groupby
method to find unique
values, as shown in the following example:
With partial ordering mode, the output of the DataFrame.head(n)
and
Series.head(n)
functions might not be the same every time you run them. To
download a small, random sample of the data, use the
DataFrame.peek()
or Series.peek()
methods.
For a detailed tutorial in which you use the ordering_mode = "partial"
property, see
Analyzing package downloads from PyPI with BigQuery DataFrames.
Troubleshooting
Because BigQuery DataFrames in partial ordering mode sometimes lacks an ordering or index, you might encounter the following issues when using some pandas-compatible methods.
Order required error
Some features, like the DataFrame.head()
and DataFrame.iloc
functions, need
an ordering. For a list of features that require ordering, see the Requires
ordering column in
Supported pandas APIs.
When an object has no ordering, the operation fails with an
OrderRequiredError
message like the following: OrderRequiredError: Op iloc
requires an ordering. Use .sort_values or .sort_index to provide an ordering.
As the error message states, you can provide an ordering using the
DataFrame.sort_values()
method
to sort by one or more columns. Other methods, such as
DataFrame.groupby()
,
implicitly provide a total ordering based on the group by keys.
If the ordering isn't a completely stable total ordering for all rows, later
operations might show an AmbiguousWindowWarning
message like the
following: AmbiguousWindowWarning: Window ordering may be ambiguous, this can
cause unstable results.
If your work can handle results that aren't always the same, or if you can
manually check that your ordering is a total ordering, you can filter out the
AmbiguousWindowWarning
message in this way:
Null index error
Some features, like the DataFrame.unstack()
and Series.interpolate()
properties, need an index. For a list of features that require an index, see
the Requires index column in
Supported pandas APIs.
When you use an operation that requires an index with partial ordering mode,
the operation raises a NullIndexError
message like the following:
NullIndexError: DataFrame cannot perform interpolate as it has no index.
Set an index using set_index.
As the error message states, you can provide an index using the
DataFrame.set_index()
method
to sort by one or more columns. Other methods, such as
DataFrame.groupby()
,
implicitly provide an index based on the group by keys, unless the
as_index=False
parameter is set.
Cache results after expensive operations
BigQuery DataFrames stores operations locally and defers running queries until certain conditions are met. This can cause the same operations to run multiple times across different queries.
To avoid repeating costly operations, save intermediate results with the
cache()
method, as shown in the following example:
This method creates a temporary BigQuery table to store your results. You are charged for the storage of this temporary table in BigQuery.
Preview your data with the peek()
method
BigQuery DataFrames offers two API methods to preview data:
peek(n)
returnsn
rows of data, wheren
is the number of rows.head(n)
returns the firstn
rows of data, depending on the context, wheren
is the number of rows.
Use the head()
method only when the order of data is important, for example,
when you want the five largest values in a column. In other cases, use the
peek()
method for more efficient data retrieval, as shown in the following
code sample:
You can also use the peek()
method to download a small, random sample of data
while using partial ordering mode.
Defer the repr()
data retrieval
You can call the repr()
method in BigQuery DataFrames with
notebooks or your IDE debugger. This
call triggers the head()
call that retrieves the actual data. This retrieval
can slow down your iterative coding and debugging process and also incur costs.
To prevent the repr()
method from retrieving data, set the repr_mode
attribute to "deferred"
, as shown in the following example:
In the deferred mode, you can only preview your data with explicit
peek()
and head()
calls.
What's next
- Learn how to use BigQuery DataFrames.
- Learn how to visualize BigQuery DataFrames.
- Explore the BigQuery DataFrames API reference.
- View BigQuery DataFrames source code, sample notebooks, and samples on GitHub.