Comparison with pandas-gbq
The pandas-gbq library provides a simple interface for running queries and uploading pandas DataFrames to BigQuery. It is a thin wrapper around the BigQuery client library, google-cloud-bigquery. Both of these libraries focus on helping you perform data analysis using SQL. This topic provides code samples comparing google-cloud-bigquery and pandas-gbq.
Key differences in the level of functionality and support between the two libraries include:
| | pandas-gbq | google-cloud-bigquery |
| --- | --- | --- |
| Support | Open source library maintained by PyData and volunteer contributors | Open source library maintained by Google |
| BigQuery API functionality covered | Run queries and save data from pandas DataFrames to tables | Full BigQuery API functionality, with added support for reading/writing pandas DataFrames and a Jupyter magic for running queries |
| | docs / source | docs / source |
Install the libraries
To use the code samples in this guide, install the pandas-gbq package and the BigQuery Python client libraries.
pip
Install the pandas-gbq and google-cloud-bigquery packages.

```shell
pip install --upgrade pandas-gbq 'google-cloud-bigquery[bqstorage,pandas]'
```
Conda
Install the pandas-gbq and google-cloud-bigquery Conda packages from the community-run conda-forge channel.

```shell
conda install -c conda-forge pandas-gbq google-cloud-bigquery
```
Running queries
Both libraries support querying data stored in BigQuery. Key differences between the libraries include:
| | pandas-gbq | google-cloud-bigquery |
| --- | --- | --- |
| Default SQL syntax | GoogleSQL (configurable with pandas_gbq.context.dialect) | GoogleSQL |
| Query configurations | Sent as a dictionary in the format specified in the BigQuery REST reference. | Use the QueryJobConfig class, which contains properties for the various API configuration options. |
Querying data with the GoogleSQL syntax
The following sample shows how to run a GoogleSQL query with and without explicitly specifying a project. For both libraries, if a project is not specified, the project will be determined from the default credentials.
pandas-gbq:
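A minimal sketch, assuming the public bigquery-public-data.usa_names dataset and configured default credentials; the call is wrapped in a function so nothing runs without credentials:

```python
# Illustrative GoogleSQL query against a BigQuery public dataset.
SQL = """
SELECT name, SUM(number) AS total
FROM `bigquery-public-data.usa_names.usa_1910_current`
GROUP BY name
ORDER BY total DESC
LIMIT 10
"""

def query_with_pandas_gbq(project_id=None):
    import pandas_gbq  # pip install pandas-gbq

    # When project_id is None, pandas-gbq resolves the project from the
    # default credentials.
    return pandas_gbq.read_gbq(SQL, project_id=project_id)
```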
google-cloud-bigquery:
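The equivalent sketch with the client library, under the same assumptions (public dataset, default credentials):

```python
# Illustrative GoogleSQL query against a BigQuery public dataset.
SQL = """
SELECT name, SUM(number) AS total
FROM `bigquery-public-data.usa_names.usa_1910_current`
GROUP BY name
ORDER BY total DESC
LIMIT 10
"""

def query_with_client(project=None):
    from google.cloud import bigquery  # pip install google-cloud-bigquery

    # When project is None, the client falls back to the project of the
    # default credentials.
    client = bigquery.Client(project=project)
    return client.query(SQL).to_dataframe()
```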
Querying data with the legacy SQL syntax
The following sample shows how to run a query using legacy SQL syntax. See the GoogleSQL migration guide for guidance on updating your queries to GoogleSQL.
pandas-gbq:
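A sketch of overriding the dialect per call; the legacy [project:dataset.table] table syntax below is illustrative:

```python
# Legacy SQL references tables as [project:dataset.table] instead of
# using backticks.
LEGACY_SQL = """
SELECT name
FROM [bigquery-public-data:usa_names.usa_1910_current]
LIMIT 10
"""

def query_legacy_with_pandas_gbq():
    import pandas_gbq  # pip install pandas-gbq

    # The default dialect comes from pandas_gbq.context.dialect;
    # dialect="legacy" overrides it for this call only.
    return pandas_gbq.read_gbq(LEGACY_SQL, dialect="legacy")
```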
google-cloud-bigquery:
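With the client library, the dialect is selected through a job configuration rather than a keyword on the query call:

```python
LEGACY_SQL = """
SELECT name
FROM [bigquery-public-data:usa_names.usa_1910_current]
LIMIT 10
"""

def query_legacy_with_client():
    from google.cloud import bigquery  # pip install google-cloud-bigquery

    client = bigquery.Client()
    # use_legacy_sql=True switches this job from GoogleSQL to legacy SQL.
    job_config = bigquery.QueryJobConfig(use_legacy_sql=True)
    return client.query(LEGACY_SQL, job_config=job_config).to_dataframe()
```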
Using the BigQuery Storage API to download large results
Use the BigQuery Storage API to speed up downloads of large results by 15 to 31 times.
pandas-gbq:
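A sketch assuming a pandas-gbq version that accepts the use_bqstorage_api keyword, and that the google-cloud-bigquery-storage package is installed:

```python
# Selecting a whole table produces a result large enough to benefit from
# the BigQuery Storage API.
SQL = "SELECT * FROM `bigquery-public-data.usa_names.usa_1910_current`"

def download_with_pandas_gbq():
    import pandas_gbq  # pip install pandas-gbq google-cloud-bigquery-storage

    # use_bqstorage_api routes the result download through the
    # BigQuery Storage API instead of paging through the REST API.
    return pandas_gbq.read_gbq(SQL, use_bqstorage_api=True)
```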
google-cloud-bigquery:
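With the bqstorage extra installed, recent versions of the client library use the Storage API from to_dataframe() automatically; the keyword below makes the intent explicit:

```python
SQL = "SELECT * FROM `bigquery-public-data.usa_names.usa_1910_current`"

def download_with_client():
    # pip install 'google-cloud-bigquery[bqstorage,pandas]'
    from google.cloud import bigquery

    client = bigquery.Client()
    # create_bqstorage_client=True asks to_dataframe() to download the
    # result rows over the BigQuery Storage API.
    return client.query(SQL).to_dataframe(create_bqstorage_client=True)
```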
Running a query with a configuration
Sending a configuration with a BigQuery API request is required to perform certain complex operations, such as running a parameterized query or specifying a destination table to store the query results. In pandas-gbq, the configuration must be sent as a dictionary in the format specified in the BigQuery REST reference. In google-cloud-bigquery, job configuration classes are provided, such as QueryJobConfig, which contain the necessary properties to configure complex jobs.
The following sample shows how to run a query with named parameters.
pandas-gbq:
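A sketch of the dictionary form; the shape follows the BigQuery REST reference for the jobs.query request body, and the query itself is illustrative:

```python
SQL = """
SELECT name
FROM `bigquery-public-data.usa_names.usa_1910_current`
WHERE state = @state
LIMIT 10
"""

# Query configuration as a plain dictionary, per the BigQuery REST
# reference for query requests.
CONFIG = {
    "query": {
        "parameterMode": "NAMED",
        "queryParameters": [
            {
                "name": "state",
                "parameterType": {"type": "STRING"},
                "parameterValue": {"value": "TX"},
            }
        ],
    }
}

def query_with_params_pandas_gbq():
    import pandas_gbq  # pip install pandas-gbq

    return pandas_gbq.read_gbq(SQL, configuration=CONFIG)
```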
google-cloud-bigquery:
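The same parameterized query sketched with QueryJobConfig, which replaces the raw dictionary with typed properties:

```python
SQL = """
SELECT name
FROM `bigquery-public-data.usa_names.usa_1910_current`
WHERE state = @state
LIMIT 10
"""

def query_with_params_client():
    from google.cloud import bigquery  # pip install google-cloud-bigquery

    # ScalarQueryParameter binds the @state placeholder in the SQL above.
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("state", "STRING", "TX")
        ]
    )
    client = bigquery.Client()
    return client.query(SQL, job_config=job_config).to_dataframe()
```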
Loading a pandas DataFrame to a BigQuery table
Both libraries support uploading data from a pandas DataFrame to a new table in BigQuery. Key differences include:
| | pandas-gbq | google-cloud-bigquery |
| --- | --- | --- |
| Type support | Converts the DataFrame to CSV format before sending to the API, which does not support nested or array values. | Converts the DataFrame to Parquet or CSV format before sending to the API, which supports nested and array values. Choose Parquet for struct and array values and CSV for date and time serialization flexibility. Parquet is the default. Note that pyarrow, the Parquet engine used to send the DataFrame data to the BigQuery API, must be installed to load the DataFrame to a table. |
| Load configurations | Sent as a dictionary in the format specified in the BigQuery REST reference. | Use the LoadJobConfig class, which contains properties for the various API configuration options. |
pandas-gbq:
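A sketch using pandas_gbq.to_gbq; the destination my_dataset.my_table is a placeholder, and writing to it requires credentials with access to that dataset:

```python
# Placeholder destination in "dataset.table" form.
TABLE = "my_dataset.my_table"

def load_with_pandas_gbq():
    import pandas as pd
    import pandas_gbq  # pip install pandas-gbq

    df = pd.DataFrame({"name": ["alpha", "beta"], "value": [1, 2]})
    # if_exists controls behavior when the table already exists:
    # "fail" (default), "replace", or "append".
    pandas_gbq.to_gbq(df, TABLE, if_exists="replace")
```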
google-cloud-bigquery:
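The equivalent sketch with the client library; the same placeholder destination is assumed, and pyarrow must be installed for the default Parquet serialization:

```python
TABLE = "my_dataset.my_table"

def load_with_client():
    import pandas as pd
    # pip install 'google-cloud-bigquery[pandas]' pyarrow
    from google.cloud import bigquery

    df = pd.DataFrame({"name": ["alpha", "beta"], "value": [1, 2]})
    client = bigquery.Client()
    # Serializes the DataFrame to Parquet (via pyarrow) by default before
    # sending it to the API.
    job = client.load_table_from_dataframe(df, TABLE)
    job.result()  # wait for the load job to finish
```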
The google-cloud-bigquery package requires the pyarrow library to serialize a pandas DataFrame to a Parquet file. Install the pyarrow package:

```shell
conda install -c conda-forge pyarrow
```

or

```shell
pip install pyarrow
```
Features not supported by pandas-gbq
While the pandas-gbq library provides a useful interface for querying data and writing data to tables, it does not cover many of the BigQuery API features, including but not limited to:
- Managing datasets, including creating new datasets, updating dataset properties, and deleting datasets
- Loading data into BigQuery from formats other than pandas DataFrames
- Managing tables, including listing tables in a dataset, copying table data, and deleting tables
- Exporting BigQuery data directly to Cloud Storage