Package pandas (0.3.0)

API documentation for pandas package.



NamedAgg(column, aggfunc)

Packages Functions


Concatenate BigQuery DataFrames objects along a particular axis.

Allows optional set logic along the other axes.

Can also add a layer of hierarchical indexing on the concatenation axis, which may be useful if the labels are the same (or overlapping) on the passed axis number.


The axis to concatenate along.


How to handle indexes on other axis (or axes).


If True, do not use the index values along the concatenation axis. The resulting axis will be labeled 0, ..., n - 1. This is useful if you are concatenating objects where the concatenation axis does not have meaningful indexing information. Note the index values on the other axes are still respected in the join.


    x: bigframes.series.Series, bins: int, *, labels: typing.Optional[bool] = None
) -> bigframes.series.Series

Bin values into discrete intervals.

Use cut when you need to segment and sort data values into bins. This function is also useful for going from a continuous variable to a categorical variable. For example, cut could convert ages to groups of age ranges. Supports binning into an equal number of bins, or a pre-specified array of bins.

labels=False implies you just want the bins back.


import bigframes.pandas as pd

pd.options.display.progress_bar = None
s = pd.Series([0, 1, 1, 2])
pd.cut(s, bins=4, labels=False)

0    0
1    1
2    1
3    3
dtype: Int64

The input Series to be binned. Must be 1-dimensional.


The criteria to bin by. int : Defines the number of equal-width bins in the range of x. The range of x is extended by .1% on each side to include the minimum and maximum values of x.


Specifies the labels for the returned bins. Must be the same length as the resulting bins. If False, returns only integer indicators of the bins. This affects the type of the output container (see below). If True, raises an error. When ordered=False, labels must be provided.


    left: bigframes.dataframe.DataFrame,
    right: bigframes.dataframe.DataFrame,
    how: typing.Literal["inner", "left", "outer", "right"] = "inner",
    on: typing.Optional[str] = None,
    left_on: typing.Optional[str] = None,
    right_on: typing.Optional[str] = None,
    sort: bool = False,
    suffixes: tuple[str, str] = ("_x", "_y")
) -> bigframes.dataframe.DataFrame

Merge DataFrame objects with a database-style join.

The join is done on columns or indexes. If joining columns on columns, the DataFrame indexes will be ignored. Otherwise if joining indexes on indexes or indexes on a column or columns, the index will be passed on. When performing a cross merge, no column specifications to merge on are allowed.


    filepath_or_buffer: str | IO["bytes"],
    sep: Optional[str] = ",",
    header: Optional[int] = 0,
    names: Optional[
        Union[MutableSequence[Any], numpy.ndarray[Any, Any], Tuple[Any, ...], range]
    ] = None,
    index_col: Optional[
        Union[int, str, Sequence[Union[str, int]], Literal[False]]
    ] = None,
    usecols: Optional[
            Tuple[str, ...],
            numpy.ndarray[Any, Any],
            Callable[[Any], bool],
    ] = None,
    dtype: Optional[Dict] = None,
    engine: Optional[
        Literal["c", "python", "pyarrow", "python-fwf", "bigquery"]
    ] = None,
    encoding: Optional[str] = None,
) -> bigframes.dataframe.DataFrame

Loads DataFrame from comma-separated values (csv) file locally or from Cloud Storage.

The CSV file data will be persisted as a temporary BigQuery table, which can be automatically recycled after the Session is closed.


A local or Google Cloud Storage (gs://) path with engine="bigquery" otherwise passed to pandas.read_csv.


the separator for fields in a CSV file. For the BigQuery engine, the separator can be any ISO-8859-1 single-byte character. To use a character in the range 128-255, you must encode the character as UTF-8. Both engines support sep=" " to specify tab character as separator. Default engine supports having any number of spaces as separator by specifying sep="\s+". Separators longer than 1 character are interpreted as regular expressions by the default engine. BigQuery engine only supports single character separators.


row number to use as the column names. - None: Instructs autodetect that there are no headers and data should be read starting from the first row. - 0: If using engine="bigquery", Autodetect tries to detect headers in the first row. If they are not detected, the row is read as data. Otherwise data is read starting from the second row. When using default engine, pandas assumes the first row contains column names unless the names argument is specified. If names is provided, then the first row is ignored, second row is read as data, and column names are inferred from names. - N > 0: If using engine="bigquery", Autodetect skips N rows and tries to detect headers in row N+1. If headers are not detected, row N+1 is just skipped. Otherwise row N+1 is used to extract column names for the detected schema. When using default engine, pandas will skip N rows and assumes row N+1 contains column names unless the names argument is specified. If names is provided, row N+1 will be ignored, row N+2 will be read as data, and column names are inferred from names.


a list of column names to use. If the file contains a header row and you want to pass this parameter, then header=0 should be passed as well so the first (header) row is ignored. Only to be used with default engine.


column(s) to use as the row labels of the DataFrame, either given as string name or column index. index_col=False can be used with the default engine only to enforce that the first column is not used as the index. Using column index instead of column name is only supported with the default engine. The BigQuery engine only supports having a single column name as the index_col. Neither engine supports having a multi-column index.


List of column names to use): The BigQuery engine only supports having a list of string column names. Column indices and callable functions are only supported with the default engine. Using the default engine, the column names in usecols can be defined to correspond to column names provided with the names parameter (ignoring the document's header row of column names). The order of the column indices/names in usecols is ignored with the default engine. The order of the column names provided with the BigQuery engine will be consistent in the resulting dataframe. If using a callable function with the default engine, only column names that evaluate to True by the callable function will be in the resulting dataframe.


Data type for data or columns. Only to be used with default engine.


Type of engine to use. If engine="bigquery" is specified, then BigQuery's load API will be used. Otherwise, the engine will be passed to pandas.read_csv.


encoding the character encoding of the data. The default encoding is UTF-8 for both engines. The default engine acceps a wide range of encodings. Refer to Python documentation for a comprehensive list, The BigQuery engine only supports UTF-8 and ISO-8859-1.


    query: str,
    index_col: Iterable[str] | str = (),
    col_order: Iterable[str] = (),
    max_results: Optional[int] = None
) -> bigframes.dataframe.DataFrame

Loads a DataFrame from BigQuery.

BigQuery tables are an unordered, unindexed data source. By default, the DataFrame will have an arbitrary index and ordering.

Set the index_col argument to one or more columns to choose an index. The resulting DataFrame is sorted by the index columns. For the best performance, ensure the index columns don't contain duplicate values.

If your query doesn't have an ordering, select GENERATE_UUID() AS rowindex in your SQL and set index_col='rowindex' for the best performance.


>>> import bigframes.pandas as bpd
>>> bpd.options.display.progress_bar = None

Preserve ordering in a query input.

>>> bpd.read_gbq('''
...    SELECT
...       -- Instead of an ORDER BY clause on the query, use
...       -- ROW_NUMBER() to create an ordered DataFrame.
...       ROW_NUMBER() OVER (ORDER BY AVG(pitchSpeed) DESC)
...         AS rowindex,
...       pitcherFirstName,
...       pitcherLastName,
...       AVG(pitchSpeed) AS averagePitchSpeed
...     FROM ``
...     WHERE year = 2016
...     GROUP BY pitcherFirstName, pitcherLastName
... ''', index_col="rowindex").head(n=5)
         pitcherFirstName pitcherLastName  averagePitchSpeed
1                Albertin         Chapman          96.514113
2                 Zachary         Britton          94.591039
3                  Trevor       Rosenthal          94.213953
4                    Jose          Torres          94.103448
5                  Tayron        Guerrero          93.863636
[5 rows x 3 columns]

A SQL string to be executed or a BigQuery table to be read. The table must be specified in the format of project.dataset.tablename or dataset.tablename.


Name of result column(s) to use for index in results DataFrame.


List of BigQuery column names in the desired order for results DataFrame.


If set, limit the maximum number of rows to fetch from the query results.


read_gbq_function(function_name: str)

Loads a BigQuery function from BigQuery.

Then it can be applied to a DataFrame or Series.


the function's name in BigQuery in the format project_id.dataset_id.function_name, or dataset_id.function_name to load from the default project, or function_name to load from the default project and the dataset associated with the current session.


read_gbq_model(model_name: str)

Loads a BigQuery ML model from BigQuery.


the model's name in BigQuery in the format project_id.dataset_id.model_id, or just dataset_id.model_id to load from the default project.


    query: str,
    index_col: Iterable[str] | str = (),
    col_order: Iterable[str] = (),
    max_results: Optional[int] = None
) -> bigframes.dataframe.DataFrame

Turn a SQL query into a DataFrame.

Note: Because the results are written to a temporary table, ordering by ORDER BY is not preserved. A unique index_col is recommended. Use row_number() over () if there is no natural unique index or you want to preserve ordering.

See also: Session.read_gbq.


    query: str,
    index_col: Iterable[str] | str = (),
    col_order: Iterable[str] = (),
    max_results: Optional[int] = None
) -> bigframes.dataframe.DataFrame

Turn a BigQuery table into a DataFrame.

See also: Session.read_gbq.


    path_or_buf: str | IO["bytes"],
    orient: Literal[
        "split", "records", "index", "columns", "values", "table"
    ] = "columns",
    dtype: Optional[Dict] = None,
    encoding: Optional[str] = None,
    lines: bool = False,
    engine: Literal["ujson", "pyarrow", "bigquery"] = "ujson",
) -> bigframes.dataframe.DataFrame

Convert a JSON string to DataFrame object.


A local or Google Cloud Storage (gs://) path with engine="bigquery" otherwise passed to pandas.read_json.


If engine="bigquery" orient only supports "records". Indication of expected JSON string format. Compatible JSON strings can be produced by to_json() with a corresponding orient value. The set of possible orients is: - 'split' : dict like {{index -> [index], columns -> [columns], data -> [values]}} - 'records' : list like [{{column -> value}}, ... , {{column -> value}}] - 'index' : dict like {{index -> {{column -> value}}}} - 'columns' : dict like {{column -> {{index -> value}}}} - 'values' : just the values array


If True, infer dtypes; if a dict of column to dtype, then use those; if False, then don't infer dtypes at all, applies only to the data. For all orient values except 'table', default is True.


The encoding to use to decode py3 bytes.


Read the file as a json object per line. If using engine="bigquery" lines only supports True.


Type of engine to use. If engine="bigquery" is specified, then BigQuery's load API will be used. Otherwise, the engine will be passed to pandas.read_json.


    pandas_dataframe: pandas.core.frame.DataFrame,
) -> bigframes.dataframe.DataFrame

Loads DataFrame from a pandas DataFrame.

The pandas DataFrame will be persisted as a temporary BigQuery table, which can be automatically recycled after the Session is closed.


a pandas DataFrame object to be loaded.


read_parquet(path: str | IO["bytes"]) -> bigframes.dataframe.DataFrame

Load a Parquet object from the file path (local or Cloud Storage), returning a DataFrame.


Local or Cloud Storage path to Parquet file.


    filepath_or_buffer: FilePath | ReadPickleBuffer,
    compression: CompressionOptions = "infer",
    storage_options: StorageOptions = None,

Load pickled BigFrames object (or any object) from file.


String, path object (implementing os.PathLike[str]), or file-like object implementing a binary readlines() function. Also accepts URL. URL is not limited to S3 and GCS.


For on-the-fly decompression of on-disk data. If 'infer' and 'filepath_or_buffer' is path-like, then detect compression from the following extensions: '.gz', '.bz2', '.zip', '.xz', '.zst', '.tar', '.tar.gz', '.tar.xz' or '.tar.bz2' (otherwise no compression). If using 'zip' or 'tar', the ZIP file must contain only one data file to be read in. Set to None for no decompression. Can also be a dict with key 'method' set to one of {'zip', 'gzip', 'bz2', 'zstd', 'tar'} and other key-value pairs are forwarded to zipfile.ZipFile, gzip.GzipFile, bz2.BZ2File, zstandard.ZstdDecompressor or tarfile.TarFile, respectively. As an example, the following could be passed for Zstandard decompression using a custom compression dictionary compression={'method': 'zstd', 'dict_data': my_compression_dict}.


Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib.request.Request as header options. For other URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are forwarded to Please see fsspec and urllib for more details, and for more examples on storage options refer here.


    input_types: typing.List[type],
    output_type: type,
    dataset: typing.Optional[str] = None,
    bigquery_connection: typing.Optional[str] = None,
    reuse: bool = True,

Decorator to turn a user defined function into a BigQuery remote function.

  1. Have the below APIs enabled for your project:

    • BigQuery Connection API
    • Cloud Functions API
    • Cloud Run API
    • Cloud Build API
    • Artifact Registry API
    • Cloud Resource Manager API

    This can be done from the cloud console (change PROJECT_ID to yours):,,,,,

    Or from the gcloud CLI:

    $ gcloud services enable

  2. Have following IAM roles enabled for you:

    • BigQuery Data Editor (roles/bigquery.dataEditor)
    • BigQuery Connection Admin (roles/bigquery.connectionAdmin)
    • Cloud Functions Developer (roles/cloudfunctions.developer)
    • Service Account User (roles/iam.serviceAccountUser)
    • Storage Object Viewer (roles/storage.objectViewer)
    • Project IAM Admin (roles/resourcemanager.projectIamAdmin) (Only required if the bigquery connection being used is not pre-created and is created dynamically with user credentials.)
  3. Either the user has setIamPolicy privilege on the project, or a BigQuery connection is pre-created with necessary IAM role set:

    1. To create a connection, follow
    2. To set up IAM, follow

      Alternatively, the IAM could also be setup via the gcloud CLI:

      $ gcloud projects add-iam-policy-binding PROJECT_ID --member="serviceAccount:CONNECTION_SERVICE_ACCOUNT_ID" --role="roles/run.invoker".


List of input data types in the user defined function.


Data type of the output in the user defined function.


Dataset in which to create a BigQuery remote function. It should be in <project_id>.<dataset_name> or <dataset_name> format. If this parameter is not provided then session dataset id is used.


Name of the BigQuery connection. You should either have the connection already created in the location you have chosen, or you should have the Project IAM Admin role to enable the service to create the connection for you if you need it.If this parameter is not provided then the BigQuery connection from the session is used.


Reuse the remote function if already exists. True by default, which will result in reusing an existing remote function (if any) that was previously created for the same udf. Setting it to false would force creating a unique remote function. If the required remote function does not exist then it would be created irrespective of this param.