Class Session (2.23.0)

Session(
    context: typing.Optional[bigframes._config.bigquery_options.BigQueryOptions] = None,
    clients_provider: typing.Optional[bigframes.session.clients.ClientsProvider] = None,
)

Establishes a BigQuery connection to capture a group of job activities related to DataFrames.

Parameters
Name	Description
`context`	`bigframes._config.bigquery_options.BigQueryOptions` Configuration adjusting how to connect to BigQuery and related APIs. Note that some options are ignored if `clients_provider` is set.
`clients_provider`	`bigframes.session.clients.ClientsProvider` An object providing client library objects.

Properties

bqclient

API documentation for bqclient property.

bqconnectionclient

API documentation for bqconnectionclient property.

bqconnectionmanager

API documentation for bqconnectionmanager property.

bqstoragereadclient

API documentation for bqstoragereadclient property.

bytes_processed_sum

The sum of all bytes processed by BigQuery jobs using this session.

cloudfunctionsclient

API documentation for cloudfunctionsclient property.

objects

API documentation for objects property.

resourcemanagerclient

API documentation for resourcemanagerclient property.

session_id

API documentation for session_id property.

slot_millis_sum

The sum of all slot time used by BigQuery jobs in this session.

Methods

del

__del__()

Automatic cleanup of internal resources.

enter

__enter__()

Enter the runtime context of the Session object.

See With Statement Context Managers for more details.

exit

__exit__(*_)

Exit the runtime context of the Session object.

See With Statement Context Managers for more details.

close

close()

Delete resources that were created with this session's session_id. This includes BigQuery tables, remote functions and cloud functions serving the remote functions.
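The context-manager methods and close() fit together so that a with block releases session resources on exit. A minimal sketch of that protocol, using a hypothetical stand-in class (`DemoSession` is not part of bigframes, and the assumption that `__exit__` performs the same cleanup as `close()` is made for illustration only):

```python
class DemoSession:
    """Hypothetical stand-in that mimics the Session context-manager protocol."""

    def __init__(self):
        self.closed = False

    def __enter__(self):
        # Entering the runtime context returns the session itself.
        return self

    def __exit__(self, *_):
        # Assumption: leaving the context triggers the same cleanup as close().
        self.close()

    def close(self):
        # A real session would delete temporary tables and functions here.
        self.closed = True


with DemoSession() as session:
    inside = session.closed  # still open inside the block

after = session.closed  # cleaned up once the block exits
```

Using a with block this way avoids relying on `__del__`, whose timing depends on garbage collection.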

deploy_remote_function

deploy_remote_function(func, **kwargs)

Orchestrates the creation of a BigQuery remote function that deploys immediately.

This method ensures that the remote function is created and available for use in BigQuery as soon as this call is made.

deploy_udf

deploy_udf(func, **kwargs)

Orchestrates the creation of a BigQuery UDF that deploys immediately.

This method ensures that the UDF is created and available for use in BigQuery as soon as this call is made.

from_glob_path

from_glob_path(
    path: str, *, connection: Optional[str] = None, name: Optional[str] = None
) -> dataframe.DataFrame

Create a BigFrames DataFrame that contains a BigFrames Blob column from a global wildcard path. This operation creates a temporary BQ Object Table under the hood and requires bigquery.connections.delegate permission or BigQuery Connection Admin role. If you have an existing BQ Object Table, use read_gbq_object_table().

Parameters
Name	Description
`path`	`str` The wildcard global path, such as "gs://
`connection`	`str or None, default None` Connection to connect with remote service. str of the format <PROJECT_NUMBER/PROJECT_ID>.
`name`	`str` The column name of the Blob column.

Returns
Type	Description
`bigframes.pandas.DataFrame`	Result BigFrames DataFrame.

read_arrow

read_arrow(pa_table: pyarrow.lib.Table) -> bigframes.dataframe.DataFrame

Load a PyArrow Table to a BigQuery DataFrames DataFrame.

Parameter
Name	Description
`pa_table`	`pyarrow.Table` PyArrow table to load data from.

Returns
Type	Description
`bigframes.dataframe.DataFrame`	A new DataFrame representing the data from the PyArrow table.

read_csv

read_csv(
    filepath_or_buffer: str | IO["bytes"],
    *,
    sep: Optional[str] = ",",
    header: Optional[int] = 0,
    names: Optional[
        Union[MutableSequence[Any], np.ndarray[Any, Any], Tuple[Any, ...], range]
    ] = None,
    index_col: Optional[
        Union[
            int,
            str,
            Sequence[Union[str, int]],
            bigframes.enums.DefaultIndexKind,
            Literal[False],
        ]
    ] = None,
    usecols: Optional[
        Union[
            MutableSequence[str],
            Tuple[str, ...],
            Sequence[int],
            pandas.Series,
            pandas.Index,
            np.ndarray[Any, Any],
            Callable[[Any], bool],
        ]
    ] = None,
    dtype: Optional[Dict] = None,
    engine: Optional[
        Literal["c", "python", "pyarrow", "python-fwf", "bigquery"]
    ] = None,
    encoding: Optional[str] = None,
    write_engine: constants.WriteEngineType = "default",
    **kwargs
) -> dataframe.DataFrame

Loads data from a comma-separated values (csv) file into a DataFrame.

The CSV file data will be persisted as a temporary BigQuery table, which can be automatically recycled after the Session is closed.

Note: using engine="bigquery" will not guarantee the same ordering as the file. Instead, set a serialized index column as the index and sort by that in the resulting DataFrame. Only files stored on your local machine or in Google Cloud Storage are supported.

Examples:

>>> import bigframes.pandas as bpd
>>> bpd.options.display.progress_bar = None

>>> gcs_path = "gs://cloud-samples-data/bigquery/us-states/us-states.csv"
>>> df = bpd.read_csv(filepath_or_buffer=gcs_path)
>>> df.head(2)
      name post_abbr
0  Alabama        AL
1   Alaska        AK
<BLANKLINE>
[2 rows x 2 columns]

Parameters
Name	Description
`filepath_or_buffer`	`str` A local or Google Cloud Storage (`gs://`) path with `engine="bigquery"` otherwise passed to pandas.read_csv.
`sep`	`Optional[str], default ","` The separator for fields in a CSV file. For the BigQuery engine, the separator can be any ISO-8859-1 single-byte character. To use a character in the range 128-255, you must encode the character as UTF-8. Both engines support `sep="\t"` to specify the tab character as separator. The default engine supports any number of spaces as separator by specifying `sep="\s+"`. Separators longer than 1 character are interpreted as regular expressions by the default engine. The BigQuery engine only supports single-character separators.
`header`	`Optional[int], default 0` row number to use as the column names. - `None`: Instructs autodetect that there are no headers and data should be read starting from the first row. - `0`: If using `engine="bigquery"`, Autodetect tries to detect headers in the first row. If they are not detected, the row is read as data. Otherwise data is read starting from the second row. When using default engine, pandas assumes the first row contains column names unless the `names` argument is specified. If `names` is provided, then the first row is ignored, second row is read as data, and column names are inferred from `names`. - `N > 0`: If using `engine="bigquery"`, Autodetect skips N rows and tries to detect headers in row N+1. If headers are not detected, row N+1 is just skipped. Otherwise row N+1 is used to extract column names for the detected schema. When using default engine, pandas will skip N rows and assumes row N+1 contains column names unless the `names` argument is specified. If `names` is provided, row N+1 will be ignored, row N+2 will be read as data, and column names are inferred from `names`.
`names`	`default None` a list of column names to use. If the file contains a header row and you want to pass this parameter, then `header=0` should be passed as well so the first (header) row is ignored.
`index_col`	`default None` column(s) to use as the row labels of the DataFrame, either given as string name or column index. `index_col=False` can be used with the default engine only to enforce that the first column is not used as the index. Using column index instead of column name is only supported with the default engine. The BigQuery engine only supports having a single column name as the `index_col`. Neither engine supports having a multi-column index.
`usecols`	`default None` List of column names to use. The BigQuery engine only supports a list of string column names. Column indices and callable functions are only supported with the default engine. Using the default engine, the column names in `usecols` can be defined to correspond to column names provided with the `names` parameter (ignoring the document's header row of column names). The order of the column indices/names in `usecols` is ignored with the default engine. The order of the column names provided with the BigQuery engine will be consistent in the resulting dataframe. If using a callable function with the default engine, only column names for which the callable returns True will be in the resulting dataframe.
`dtype`	`Optional[Dict], default None` Data type for data or columns. Only to be used with default engine.
`engine`	`Optional[Literal["c", "python", "pyarrow", "python-fwf", "bigquery"]], default None` Type of engine to use. If `engine="bigquery"` is specified, then BigQuery's load API will be used. Otherwise, the engine will be passed to `pandas.read_csv`.
`encoding`	`Optional[str], default None` The character encoding of the data. The default encoding is `UTF-8` for both engines. The default engine accepts a wide range of encodings. Refer to the Python documentation for a comprehensive list (https://docs.python.org/3/library/codecs.html#standard-encodings). The BigQuery engine only supports `UTF-8` and `ISO-8859-1`.
`write_engine`	`str` How data should be written to BigQuery (if at all). See `bigframes.pandas.read_pandas` for a full description of supported values.
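The `sep` rules distinguish a single-character separator from a regular expression. A rough illustration using only the Python standard library rather than bigframes (the sample lines are made up):

```python
import csv
import io
import re

# A single-character separator such as a tab works with both engines.
tab_data = "name\tpost_abbr\nAlabama\tAL\n"
tab_rows = list(csv.reader(io.StringIO(tab_data), delimiter="\t"))

# A multi-character separator such as r"\s+" is treated as a regular
# expression, which only the default (pandas) engine supports.
space_line = "Alaska     AK"
space_fields = re.split(r"\s+", space_line)
```

If a file needs a regular-expression separator, the description above implies it has to go through the default engine rather than `engine="bigquery"`.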

Exceptions
Type	Description
`bigframes.exceptions.DefaultIndexWarning`	Using the default index is discouraged, such as with clustered or partitioned tables without primary keys.

Returns
Type	Description
`bigframes.pandas.DataFrame`	A BigQuery DataFrames DataFrame.

read_gbq

Loads a DataFrame from BigQuery.

BigQuery tables are an unordered, unindexed data source. To support pandas compatibility, the following indexing options are available via the index_col parameter:

- (Empty iterable, default) A default index. Behavior may change. Explicitly set index_col if your application makes use of specific index values. If a table has primary key(s), those are used as the index, otherwise a sequential index is generated.
- (bigframes.enums.DefaultIndexKind.SEQUENTIAL_INT64) Add an arbitrary sequential index and ordering. Warning: This uses an analytic windowed operation that prevents filtering push down. Avoid using on large clustered or partitioned tables.
- (Recommended) Set the index_col argument to one or more columns. Unique values for the row labels are recommended. Duplicate labels are possible, but note that joins on a non-unique index can duplicate rows via pandas-compatible outer join behavior.

Note: By default, even SQL query inputs with an ORDER BY clause create a DataFrame with an arbitrary ordering. Use `row_number() OVER (ORDER BY ...) AS rowindex` in your SQL query and set index_col='rowindex' to preserve the desired ordering.

If your query doesn't have an ordering, select `GENERATE_UUID() AS rowindex` in your SQL and set index_col='rowindex' for the best performance.
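To see why a rowindex column preserves ordering, here is a small sketch that runs the same ROW_NUMBER() pattern against an in-memory SQLite database instead of BigQuery. The `pitches` table and its values are made up, and SQLite 3.25 or later is assumed for window-function support:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pitches (pitcher TEXT, speed REAL)")
conn.executemany(
    "INSERT INTO pitches VALUES (?, ?)",
    [("a", 90.0), ("b", 96.5), ("c", 94.6)],
)

# Number the rows by descending speed instead of relying on ORDER BY
# for the result set.
rows = conn.execute(
    "SELECT ROW_NUMBER() OVER (ORDER BY speed DESC) AS rowindex, pitcher "
    "FROM pitches"
).fetchall()
conn.close()

# Even if the rows arrive in arbitrary order, sorting on rowindex
# recovers the intended ordering.
ordered = [pitcher for _, pitcher in sorted(rows)]
```

The ordering lives in the data itself, so it survives being written to an unordered table and read back.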

Examples:

>>> import bigframes.pandas as bpd
>>> bpd.options.display.progress_bar = None

If the input is a table ID:

>>> df = bpd.read_gbq("bigquery-public-data.ml_datasets.penguins")

Read table path with wildcard suffix and filters:

>>> df = bpd.read_gbq_table("bigquery-public-data.noaa_gsod.gsod19*", filters=[("_table_suffix", ">=", "30"), ("_table_suffix", "<=", "39")])

Preserve ordering in a query input.

>>> df = bpd.read_gbq('''
...    SELECT
...       -- Instead of an ORDER BY clause on the query, use
...       -- ROW_NUMBER() to create an ordered DataFrame.
...       ROW_NUMBER() OVER (ORDER BY AVG(pitchSpeed) DESC)
...         AS rowindex,
...
...       pitcherFirstName,
...       pitcherLastName,
...       AVG(pitchSpeed) AS averagePitchSpeed
...     FROM `bigquery-public-data.baseball.games_wide`
...     WHERE year = 2016
...     GROUP BY pitcherFirstName, pitcherLastName
... ''', index_col="rowindex")
>>> df.head(2)
         pitcherFirstName pitcherLastName  averagePitchSpeed
rowindex
1                Albertin         Chapman          96.514113
2                 Zachary         Britton          94.591039
<BLANKLINE>
[2 rows x 3 columns]

Reading data with columns and filters parameters:

>>> columns = ['pitcherFirstName', 'pitcherLastName', 'year', 'pitchSpeed']
>>> filters = [('year', '==', 2016), ('pitcherFirstName', 'in', ['John', 'Doe']), ('pitcherLastName', 'in', ['Gant']), ('pitchSpeed', '>', 94)]
>>> df = bpd.read_gbq(
...             "bigquery-public-data.baseball.games_wide",
...             columns=columns,
...             filters=filters,
...         )
>>> df.head(1)
  pitcherFirstName pitcherLastName  year  pitchSpeed
0             John            Gant  2016          95
<BLANKLINE>
[1 rows x 4 columns]

Parameters
Name	Description
`query_or_table`	`str` A SQL string to be executed or a BigQuery table to be read. The table must be specified in the format of `project.dataset.tablename` or `dataset.tablename`. Can also take a wildcard table name, such as `project.dataset.table_prefix*`. In that case, all matched tables are read as one DataFrame.
`index_col`	`Iterable[str], str, bigframes.enums.DefaultIndexKind` Name of result column(s) to use for index in results DataFrame. If an empty iterable, such as `()`, a default index is generated. Do not depend on specific index values in this case. New in bigframes version 1.3.0: If `index_col` is not set, the primary key(s) of the table are used as the index. New in bigframes version 1.4.0: Support `bigframes.enums.DefaultIndexKind` to override default index behavior.
`columns`	`Iterable[str]` List of BigQuery column names in the desired order for results DataFrame.
`configuration`	`dict, optional` Query config parameters for job processing. For example: configuration = {'query': {'useQueryCache': False}}. For more information, see the BigQuery REST API Reference: https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs#configuration.query
`max_results`	`Optional[int], default None` If set, limit the maximum number of rows to fetch from the query results.
`filters`	`Union[Iterable[FilterType], Iterable[Iterable[FilterType]]], default ()` To filter out data. Filter syntax: [[(column, op, val), …],…] where op is [==, >, >=, <, <=, !=, in, not in, LIKE]. The innermost tuples are transposed into a set of filters applied through an AND operation. The outer Iterable combines these sets of filters through an OR operation. A single Iterable of tuples can also be used, meaning that no OR operation between sets of filters is to be conducted. If using a wildcard table suffix in query_or_table, you can specify the '_table_suffix' pseudo column to filter the tables to be read into the DataFrame.
`use_cache`	`Optional[bool], default None` Caches query results if set to `True`. When `None`, it behaves as `True`, but should not be combined with `useQueryCache` in `configuration` to avoid conflicts.
`col_order`	`Iterable[str]` Alias for columns, retained for backwards compatibility.
`allow_large_results`	`bool, optional` Whether to allow large query results. If `True`, the query results can be larger than the maximum response size. This option is only applicable when `query_or_table` is a query. Defaults to `bpd.options.compute.allow_large_results`.
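The AND/OR semantics of `filters` can be sketched in plain Python. The helper names `matches` and `row_passes` and the sample rows are hypothetical, `LIKE` is left out for brevity, and this only illustrates how the groups combine, not how bigframes evaluates filters:

```python
import operator

# Operators from the filter syntax; LIKE is omitted in this sketch.
OPS = {
    "==": operator.eq,
    "!=": operator.ne,
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "in": lambda value, options: value in options,
    "not in": lambda value, options: value not in options,
}


def matches(row, conjunction):
    """Return True if the row satisfies every (column, op, val) tuple (AND)."""
    return all(OPS[op](row[column], val) for column, op, val in conjunction)


def row_passes(row, filters):
    """Return True if the row satisfies any inner group of filters (OR)."""
    filters = list(filters)
    if not filters:
        # The default, an empty iterable, filters nothing out.
        return True
    # A flat list of tuples (column name first) is a single AND group.
    if isinstance(filters[0][0], str):
        filters = [filters]
    return any(matches(row, group) for group in filters)


rows = [
    {"year": 2016, "pitchSpeed": 95},
    {"year": 2016, "pitchSpeed": 90},
    {"year": 2015, "pitchSpeed": 99},
]

# Single group: year == 2016 AND pitchSpeed > 94.
flat = [("year", "==", 2016), ("pitchSpeed", ">", 94)]
# Two groups: (year == 2015) OR (pitchSpeed < 92).
nested = [[("year", "==", 2015)], [("pitchSpeed", "<", 92)]]

flat_result = [row_passes(r, flat) for r in rows]
nested_result = [row_passes(r, nested) for r in rows]
```

Here `flat` keeps only the first row, while `nested` keeps the second and third.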

Exceptions
Type	Description
`bigframes.exceptions.DefaultIndexWarning`	Using the default index is discouraged, such as with clustered or partitioned tables without primary keys.
`ValueError`	When both `columns` and `col_order` are specified.
`ValueError`	If `configuration` is specified when directly reading from a table.

Returns
Type	Description
`bigframes.pandas.DataFrame`	A DataFrame representing results of the query or table.

read_gbq_function

read_gbq_function(function_name: str, is_row_processor: bool = False)

Loads a BigQuery function from BigQuery.

Then it can be applied to a DataFrame or Series.

BigQuery Utils provides many public functions under the bqutil project on Google Cloud Platform (see: https://github.com/GoogleCloudPlatform/bigquery-utils/tree/master/udfs#using-the-udfs). You can check out Community UDFs to use community-contributed functions (see: https://github.com/GoogleCloudPlatform/bigquery-utils/tree/master/udfs/community#community-udfs).

Examples:

>>> import bigframes.pandas as bpd
>>> bpd.options.display.progress_bar = None

Use the cw_lower_case_ascii_only function from Community UDFs.

>>> func = bpd.read_gbq_function("bqutil.fn.cw_lower_case_ascii_only")

You can run it on scalar input. Usually you would do so to verify that it works as expected before applying to all values in a Series.

>>> func('AURÉLIE')
'aurÉlie'

You can apply it to a BigQuery DataFrames Series.

>>> df = bpd.DataFrame({'id': [1, 2, 3], 'name': ['AURÉLIE', 'CÉLESTINE', 'DAPHNÉ']})
>>> df
   id       name
0   1    AURÉLIE
1   2  CÉLESTINE
2   3     DAPHNÉ
<BLANKLINE>
[3 rows x 2 columns]

>>> df1 = df.assign(new_name=df['name'].apply(func))
>>> df1
   id       name   new_name
0   1    AURÉLIE    aurÉlie
1   2  CÉLESTINE  cÉlestine
2   3     DAPHNÉ     daphnÉ
<BLANKLINE>
[3 rows x 3 columns]

You can even use a function with multiple inputs. For example, cw_regexp_replace_5 from Community UDFs.

>>> func = bpd.read_gbq_function("bqutil.fn.cw_regexp_replace_5")
>>> func('TestStr123456', 'Str', 'Cad$', 1, 1)
'TestCad$123456'

>>> df = bpd.DataFrame({
...     "haystack" : ["TestStr123456", "TestStr123456Str", "TestStr123456Str"],
...     "regexp" : ["Str", "Str", "Str"],
...     "replacement" : ["Cad$", "Cad$", "Cad$"],
...     "offset" : [1, 1, 1],
...     "occurrence" : [1, 2, 1]
... })
>>> df
           haystack regexp replacement  offset  occurrence
0     TestStr123456    Str        Cad$       1           1
1  TestStr123456Str    Str        Cad$       1           2
2  TestStr123456Str    Str        Cad$       1           1
<BLANKLINE>
[3 rows x 5 columns]
>>> df.apply(func, axis=1)
0       TestCad$123456
1    TestStr123456Cad$
2    TestCad$123456Str
dtype: string

Another use case is to define your own remote function and use it later. For example, define the remote function:

>>> @bpd.remote_function(cloud_function_service_account="default")
... def tenfold(num: int) -> float:
...     return num * 10

Then, read back the deployed BQ remote function:

>>> tenfold_ref = bpd.read_gbq_function(
...     tenfold.bigframes_remote_function,
... )

>>> df = bpd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})
>>> df
    a   b   c
0   1   3   5
1   2   4   6
<BLANKLINE>
[2 rows x 3 columns]

>>> df['a'].apply(tenfold_ref)
0    10.0
1    20.0
Name: a, dtype: Float64

It also supports row processing by using is_row_processor=True. Please note, row processor implies that the function has only one input parameter.

>>> @bpd.remote_function(cloud_function_service_account="default")
... def row_sum(s: bpd.Series) -> float:
...     return s['a'] + s['b'] + s['c']

>>> row_sum_ref = bpd.read_gbq_function(
...     row_sum.bigframes_remote_function,
...     is_row_processor=True,
... )

>>> df = bpd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})
>>> df
    a   b   c
0   1   3   5
1   2   4   6
<BLANKLINE>
[2 rows x 3 columns]

>>> df.apply(row_sum_ref, axis=1)
0     9.0
1    12.0
dtype: Float64

Parameters
Name	Description
`function_name`	`str` The function's name in BigQuery in the format `project_id.dataset_id.function_name`, or `dataset_id.function_name` to load from the default project, or `function_name` to load from the default project and the dataset associated with the current session.
`is_row_processor`	`bool, default False` Whether the function is a row processor. This is set to True for a function which receives an entire row of a DataFrame as a pandas Series.
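The three accepted forms of `function_name` can be illustrated with a small resolver. This is a sketch only: `resolve_function_name` and the default project and dataset values are hypothetical and not part of bigframes.

```python
def resolve_function_name(name, default_project, default_dataset):
    """Expand a function name to (project_id, dataset_id, function_name)."""
    parts = name.split(".")
    if len(parts) == 3:
        # Fully qualified: project_id.dataset_id.function_name.
        return tuple(parts)
    if len(parts) == 2:
        # dataset_id.function_name: fall back to the default project.
        return (default_project, parts[0], parts[1])
    if len(parts) == 1:
        # function_name only: fall back to both defaults.
        return (default_project, default_dataset, parts[0])
    raise ValueError(f"Unexpected function name: {name!r}")


# "my-project" and "my_dataset" are placeholder defaults for this sketch.
full = resolve_function_name(
    "bqutil.fn.cw_lower_case_ascii_only", "my-project", "my_dataset"
)
two_part = resolve_function_name("fn.tenfold", "my-project", "my_dataset")
bare = resolve_function_name("tenfold", "my-project", "my_dataset")
```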

Returns
Type	Description
`collections.abc.Callable`	A function object pointing to the BigQuery function read from BigQuery. The object is similar to the one created by the `remote_function` decorator, including the `bigframes_remote_function` property, but not including the `bigframes_cloud_function` property.

read_gbq_model

read_gbq_model(model_name: str)

Loads a BigQuery ML model from BigQuery.

Examples:

>>> import bigframes.pandas as bpd
>>> bpd.options.display.progress_bar = None

Read an existing BigQuery ML model.

>>> model_name = "bigframes-dev.bqml_tutorial.penguins_model"
>>> model = bpd.read_gbq_model(model_name)

Parameter
Name	Description
`model_name`	`str` the model's name in BigQuery in the format `project_id.dataset_id.model_id`, or just `dataset_id.model_id` to load from the default project.

read_gbq_object_table

read_gbq_object_table(
    object_table: str, *, name: Optional[str] = None
) -> dataframe.DataFrame

Read an existing object table to create a BigFrames Blob DataFrame. The connection of the object table is used as the connection of the blob. This function doesn't retrieve the object table data. If you want to read the data, use read_gbq() instead.

Parameters
Name	Description
`object_table`	`str` name of the object table of form <PROJECT_ID>.<DATASET_ID>.<TABLE_ID>.
`name`	`str or None` the returned blob column name.

Returns
Type	Description
`bigframes.pandas.DataFrame`	Result BigFrames DataFrame.

read_gbq_query

Turn a SQL query into a DataFrame.

Note: Because the results are written to a temporary table, ordering by ORDER BY is not preserved. A unique index_col is recommended. Use row_number() over () if there is no natural unique index or you want to preserve ordering.

Examples:

>>> import bigframes.pandas as bpd
>>> bpd.options.display.progress_bar = None

Simple query input:

>>> df = bpd.read_gbq_query('''
...    SELECT
...       pitcherFirstName,
...       pitcherLastName,
...       pitchSpeed,
...    FROM `bigquery-public-data.baseball.games_wide`
... ''')

Preserve ordering in a query input.

>>> df = bpd.read_gbq_query('''
...    SELECT
...       -- Instead of an ORDER BY clause on the query, use
...       -- ROW_NUMBER() to create an ordered DataFrame.
...       ROW_NUMBER() OVER (ORDER BY AVG(pitchSpeed) DESC)
...         AS rowindex,
...
...       pitcherFirstName,
...       pitcherLastName,
...       AVG(pitchSpeed) AS averagePitchSpeed
...     FROM `bigquery-public-data.baseball.games_wide`
...     WHERE year = 2016
...     GROUP BY pitcherFirstName, pitcherLastName
... ''', index_col="rowindex")
>>> df.head(2)
         pitcherFirstName pitcherLastName  averagePitchSpeed
rowindex
1                Albertin         Chapman          96.514113
2                 Zachary         Britton          94.591039
<BLANKLINE>
[2 rows x 3 columns]

read_gbq_table

Turn a BigQuery table into a DataFrame.

Examples:

>>> import bigframes.pandas as bpd
>>> bpd.options.display.progress_bar = None

Read a whole table, with arbitrary ordering or ordering corresponding to the primary key(s).

>>> df = bpd.read_gbq_table("bigquery-public-data.ml_datasets.penguins")

read_gbq_table_streaming

read_gbq_table_streaming(table: str) -> streaming_dataframe.StreamingDataFrame

Turn a BigQuery table into a StreamingDataFrame.

Examples:

>>> import bigframes.streaming as bst
>>> import bigframes.pandas as bpd
>>> bpd.options.display.progress_bar = None

>>> sdf = bst.read_gbq_table("bigquery-public-data.ml_datasets.penguins")

Returns
Type	Description
`bigframes.streaming.dataframe.StreamingDataFrame`	A StreamingDataFrame representing results of the table.

read_json

read_json(
    path_or_buf: str | IO["bytes"],
    *,
    orient: Literal[
        "split", "records", "index", "columns", "values", "table"
    ] = "columns",
    dtype: Optional[Dict] = None,
    encoding: Optional[str] = None,
    lines: bool = False,
    engine: Literal["ujson", "pyarrow", "bigquery"] = "ujson",
    write_engine: constants.WriteEngineType = "default",
    **kwargs
) -> dataframe.DataFrame

Convert a JSON string to DataFrame object.

Note: using engine="bigquery" will not guarantee the same ordering as the file. Instead, set a serialized index column as the index and sort by that in the resulting DataFrame.

Examples:

>>> import bigframes.pandas as bpd
>>> bpd.options.display.progress_bar = None

>>> gcs_path = "gs://bigframes-dev-testing/sample1.json"
>>> df = bpd.read_json(path_or_buf=gcs_path, lines=True, orient="records")
>>> df.head(2)
   id   name
0   1  Alice
1   2    Bob
<BLANKLINE>
[2 rows x 2 columns]

Parameters
Name	Description
`path_or_buf`	`a valid JSON str, path object or file-like object` A local or Google Cloud Storage (`gs://`) path with `engine="bigquery"` otherwise passed to pandas.read_json.
`orient`	`str, optional` If `engine="bigquery"`, orient only supports "records". Indication of expected JSON string format. Compatible JSON strings can be produced by `to_json()` with a corresponding orient value. The set of possible orients is: - `'split'` : dict like `{index -> [index], columns -> [columns], data -> [values]}` - `'records'` : list like `[{column -> value}, ... , {column -> value}]` - `'index'` : dict like `{index -> {column -> value}}` - `'columns'` : dict like `{column -> {index -> value}}` - `'values'` : just the values array
`dtype`	`bool or dict, default None` If True, infer dtypes; if a dict of column to dtype, then use those; if False, then don't infer dtypes at all, applies only to the data. For all `orient` values except `'table'`, default is True.
`encoding`	`str, default is 'utf-8'` The encoding to use to decode py3 bytes.
`lines`	`bool, default False` Read the file as a json object per line. If using `engine="bigquery"` lines only supports True.
`engine`	`{"ujson", "pyarrow", "bigquery"}, default "ujson"` Type of engine to use. If `engine="bigquery"` is specified, then BigQuery's load API will be used. Otherwise, the engine will be passed to `pandas.read_json`.
`write_engine`	`str` How data should be written to BigQuery (if at all). See `bigframes.pandas.read_pandas` for a full description of supported values.
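With `engine="bigquery"`, the parameters above restrict the input to one JSON object per line (`lines=True`, `orient="records"`). A sketch of what that layout looks like, parsed with the standard `json` module rather than bigframes (the sample data mirrors the example output earlier in this section):

```python
import json

# One JSON object per line: the "records" orient with lines=True.
json_lines = '{"id": 1, "name": "Alice"}\n{"id": 2, "name": "Bob"}\n'
records = [json.loads(line) for line in json_lines.splitlines() if line]

# The same data in the default "columns" orient: {column -> {index -> value}}.
columns = {
    key: {str(i): rec[key] for i, rec in enumerate(records)}
    for key in records[0]
}
```

Data stored in the `columns` layout would need to be rewritten as one record per line before it could be read with the BigQuery engine.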

Exceptions
Type	Description
`bigframes.exceptions.DefaultIndexWarning`	Using the default index is discouraged, such as with clustered or partitioned tables without primary keys.
`ValueError`	`lines` is only valid when `orient` is `records`.

Returns
Type	Description
`bigframes.pandas.DataFrame`	The DataFrame representing JSON contents.

read_pandas

Loads DataFrame from a pandas DataFrame.

The pandas DataFrame will be persisted as a temporary BigQuery table, which can be automatically recycled after the Session is closed.

Examples:

>>> import bigframes.pandas as bpd
>>> import pandas as pd
>>> bpd.options.display.progress_bar = None

>>> d = {'col1': [1, 2], 'col2': [3, 4]}
>>> pandas_df = pd.DataFrame(data=d)
>>> df = bpd.read_pandas(pandas_df)
>>> df
   col1  col2
0     1     3
1     2     4
<BLANKLINE>
[2 rows x 2 columns]

Parameters
Name	Description
`pandas_dataframe`	`pandas.DataFrame, pandas.Series, or pandas.Index` a pandas DataFrame/Series/Index object to be loaded.
`write_engine`	`str` How data should be written to BigQuery (if at all). Supported values: * "default": (Recommended) Select an appropriate mechanism to write data to BigQuery. Depends on data size and supported data types. * "bigquery_inline": Inline data in BigQuery SQL. Use this when you know the data is small enough to fit within BigQuery's 1 MB query text size limit. * "bigquery_load": Use a BigQuery load job. Use this for larger data sizes. * "bigquery_streaming": Use the BigQuery streaming JSON API. Use this if your workload is such that you exhaust the BigQuery load job quota and your data cannot be embedded in SQL due to size or data type limitations. * "bigquery_write": [Preview] Use the BigQuery Storage Write API. This feature is in public preview.

Exceptions
Type	Description
`ValueError`	When the object is not a pandas DataFrame.

read_parquet

read_parquet(
    path: str | IO["bytes"],
    *,
    engine: str = "auto",
    write_engine: constants.WriteEngineType = "default"
) -> dataframe.DataFrame

Load a Parquet object from the file path (local or Cloud Storage), returning a DataFrame.

Examples:

>>> import bigframes.pandas as bpd
>>> bpd.options.display.progress_bar = None

>>> gcs_path = "gs://cloud-samples-data/bigquery/us-states/us-states.parquet"
>>> df = bpd.read_parquet(path=gcs_path, engine="bigquery")

Parameters
Name	Description
`path`	`str` Local or Cloud Storage path to Parquet file.
`engine`	`str` One of `'auto', 'pyarrow', 'fastparquet'`, or `'bigquery'`. Parquet library to parse the file. If set to `'bigquery'`, order is not preserved. Default, `'auto'`.

Returns
Type	Description
`bigframes.pandas.DataFrame`	A BigQuery DataFrames DataFrame.

read_pickle

read_pickle(
    filepath_or_buffer: FilePath | ReadPickleBuffer,
    compression: CompressionOptions = "infer",
    storage_options: StorageOptions = None,
    *,
    write_engine: constants.WriteEngineType = "default"
)

Load pickled BigFrames object (or any object) from file.

Examples:

>>> import bigframes.pandas as bpd
>>> bpd.options.display.progress_bar = None

>>> gcs_path = "gs://bigframes-dev-testing/test_pickle.pkl"
>>> df = bpd.read_pickle(filepath_or_buffer=gcs_path)

Parameters
Name	Description
`filepath_or_buffer`	`str, path object, or file-like object` String, path object (implementing os.PathLike[str]), or file-like object implementing a binary readlines() function. Also accepts URL. URL is not limited to S3 and GCS.
`compression`	`str or dict, default 'infer'` For on-the-fly decompression of on-disk data. If 'infer' and 'filepath_or_buffer' is path-like, then detect compression from the following extensions: '.gz', '.bz2', '.zip', '.xz', '.zst', '.tar', '.tar.gz', '.tar.xz' or '.tar.bz2' (otherwise no compression). If using 'zip' or 'tar', the ZIP file must contain only one data file to be read in. Set to None for no decompression. Can also be a dict with key 'method' set to one of {'zip', 'gzip', 'bz2', 'zstd', 'tar'} and other key-value pairs are forwarded to zipfile.ZipFile, gzip.GzipFile, bz2.BZ2File, zstandard.ZstdDecompressor or tarfile.TarFile, respectively. As an example, the following could be passed for Zstandard decompression using a custom compression dictionary compression={'method': 'zstd', 'dict_data': my_compression_dict}.
`storage_options`	`dict, default None` Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib.request.Request as header options. For other URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are forwarded to fsspec.open. Please see fsspec and urllib for more details, and for more examples on storage options refer here.
`write_engine`	`str` How data should be written to BigQuery (if at all). See `bigframes.pandas.read_pandas` for a full description of supported values.
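The `compression='infer'` rule amounts to a suffix lookup on the path. A minimal sketch of that lookup follows; `infer_compression` and the example paths are hypothetical, and the method names assigned to each extension follow pandas conventions as an assumption:

```python
# Extension-to-method table based on the `compression` description.
# Longer ".tar.*" suffixes come first so they win over ".gz", ".bz2", ".xz".
EXTENSION_TO_COMPRESSION = {
    ".tar.gz": "tar",
    ".tar.bz2": "tar",
    ".tar.xz": "tar",
    ".tar": "tar",
    ".gz": "gzip",
    ".bz2": "bz2",
    ".zip": "zip",
    ".xz": "xz",
    ".zst": "zstd",
}


def infer_compression(path):
    """Return the compression method implied by the path suffix, or None."""
    lowered = path.lower()
    for extension, method in EXTENSION_TO_COMPRESSION.items():
        if lowered.endswith(extension):
            return method
    # No recognized extension means no decompression.
    return None


gz = infer_compression("gs://example-bucket/frame.pkl.gz")
tar = infer_compression("frame.tar.gz")
plain = infer_compression("frame.pkl")
```

Checking the `.tar.*` suffixes first matters because a path ending in `.tar.gz` also ends in `.gz`.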

Returns
Type	Description
`bigframes.pandas.DataFrame or bigframes.pandas.Series`	same type as object stored in file.

remote_function

remote_function(
    input_types: typing.Union[None, type, typing.Sequence[type]] = None,
    output_type: typing.Optional[type] = None,
    dataset: typing.Optional[str] = None,
    *,
    bigquery_connection: typing.Optional[str] = None,
    reuse: bool = True,
    name: typing.Optional[str] = None,
    packages: typing.Optional[typing.Sequence[str]] = None,
    cloud_function_service_account: str,
    cloud_function_kms_key_name: typing.Optional[str] = None,
    cloud_function_docker_repository: typing.Optional[str] = None,
    max_batching_rows: typing.Optional[int] = 1000,
    cloud_function_timeout: typing.Optional[int] = 600,
    cloud_function_max_instances: typing.Optional[int] = None,
    cloud_function_vpc_connector: typing.Optional[str] = None,
    cloud_function_vpc_connector_egress_settings: typing.Optional[
        typing.Literal["all", "private-ranges-only", "unspecified"]
    ] = None,
    cloud_function_memory_mib: typing.Optional[int] = 1024,
    cloud_function_ingress_settings: typing.Literal[
        "all", "internal-only", "internal-and-gclb"
    ] = "internal-only",
    cloud_build_service_account: typing.Optional[str] = None
)

Decorator to turn a user defined function into a BigQuery remote function. Check out the code samples at: https://cloud.google.com/bigquery/docs/remote-functions#bigquery-dataframes.

Note: The input_types=Series scenario is in preview. It currently supports only DataFrames with column types Int64/Float64/boolean/string/binary[pyarrow].

Warning: To use remote functions with BigQuery DataFrames 2.0 and onwards, either set cloud_function_service_account to an explicit user-managed service account (preferred), or set it to "default" to use the Compute Engine service account (discouraged).

See https://cloud.google.com/functions/docs/securing/function-identity.

Enable the following APIs for your project:
- BigQuery Connection API
- Cloud Functions API
- Cloud Run API
- Cloud Build API
- Artifact Registry API
- Cloud Resource Manager API
This can be done from the cloud console (change PROJECT_ID to yours): https://console.cloud.google.com/apis/enableflow?apiid=bigqueryconnection.googleapis.com,cloudfunctions.googleapis.com,run.googleapis.com,cloudbuild.googleapis.com,artifactregistry.googleapis.com,cloudresourcemanager.googleapis.com&project=PROJECT_ID

Or from the gcloud CLI:

$ gcloud services enable bigqueryconnection.googleapis.com cloudfunctions.googleapis.com run.googleapis.com cloudbuild.googleapis.com artifactregistry.googleapis.com cloudresourcemanager.googleapis.com
Make sure you have the following IAM roles:
- BigQuery Data Editor (roles/bigquery.dataEditor)
- BigQuery Connection Admin (roles/bigquery.connectionAdmin)
- Cloud Functions Developer (roles/cloudfunctions.developer)
- Service Account User (roles/iam.serviceAccountUser) on the service account PROJECT_NUMBER-compute@developer.gserviceaccount.com
- Storage Object Viewer (roles/storage.objectViewer)
- Project IAM Admin (roles/resourcemanager.projectIamAdmin) (Only required if the bigquery connection being used is not pre-created and is created dynamically with user credentials.)
Either the user must have the setIamPolicy privilege on the project, or a BigQuery connection must be pre-created with the necessary IAM role set:
1. To create a connection, follow https://cloud.google.com/bigquery/docs/reference/standard-sql/remote-functions#create_a_connection
2. To set up IAM, follow https://cloud.google.com/bigquery/docs/reference/standard-sql/remote-functions#grant_permission_on_function
  
  Alternatively, IAM can be set up via the gcloud CLI:
  
  $ gcloud projects add-iam-policy-binding PROJECT_ID --member="serviceAccount:CONNECTION_SERVICE_ACCOUNT_ID" --role="roles/run.invoker"
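The console link and the gcloud command above both derive from the same list of API service names. As a minimal sketch, the helpers below rebuild them for a given project; `REQUIRED_APIS`, `enable_apis_url` and `enable_apis_command` are hypothetical names introduced here for illustration and are not part of BigQuery DataFrames.

```python
from urllib.parse import urlencode

# Service names copied from the list of required APIs above.
REQUIRED_APIS = [
    "bigqueryconnection.googleapis.com",
    "cloudfunctions.googleapis.com",
    "run.googleapis.com",
    "cloudbuild.googleapis.com",
    "artifactregistry.googleapis.com",
    "cloudresourcemanager.googleapis.com",
]


def enable_apis_url(project_id: str) -> str:
    """Build the Cloud console 'enable APIs' link for the given project."""
    # Keep commas unescaped so the link matches the form shown above.
    query = urlencode(
        {"apiid": ",".join(REQUIRED_APIS), "project": project_id}, safe=","
    )
    return f"https://console.cloud.google.com/apis/enableflow?{query}"


def enable_apis_command() -> str:
    """Build the equivalent gcloud CLI command."""
    return "gcloud services enable " + " ".join(REQUIRED_APIS)


print(enable_apis_url("my-project"))
print(enable_apis_command())
```

Here `"my-project"` is a placeholder; substitute your own project ID.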

Parameters
Name	Description
`input_types`	`type or sequence(type), Optional` For a scalar user defined function, this should be the input type or sequence of input types. The supported scalar input types are `bool`, `bytes`, `float`, `int`, `str`. For a row processing user defined function (i.e. a function that receives a single input representing a row in the form of a Series), type `Series` should be specified.
`output_type`	`type, Optional` Data type of the output in the user defined function. If the user defined function returns an array, then `list[type]` should be specified. The supported output types are `bool`, `bytes`, `float`, `int`, `str`, `list[bool]`, `list[float]`, `list[int]` and `list[str]`.
`dataset`	`str, Optional` Dataset in which to create a BigQuery remote function. It should be in `<project_id>.<dataset_name>` or `<dataset_name>` format. If this parameter is not provided, the session dataset id is used.
`bigquery_connection`	`str, Optional` Name of the BigQuery connection. You should either have the connection already created in the `location` you have chosen, or you should have the Project IAM Admin role to enable the service to create the connection for you if you need it. If this parameter is not provided then the BigQuery connection from the session is used.
`reuse`	`bool, Optional` Reuse the remote function if it already exists. `True` by default, which results in reusing an existing remote function and corresponding cloud function that was previously created (if any) for the same udf. Please note that for an unnamed (i.e. created without an explicit `name` argument) remote function, the BigQuery DataFrames session id is attached to the cloud artifact names. So for effective reuse across sessions, it is recommended to create the remote function with an explicit `name`. Setting it to `False` forces creation of a unique remote function. If the required remote function does not exist, it is created irrespective of this parameter.
`name`	`str, Optional` Explicit name of the persisted BigQuery remote function. Use it with caution, because multiple users working in the same project and dataset could overwrite each other's remote functions if they use the same persistent name. When an explicit name is provided, any session-specific cleanup ( `bigframes.session.Session.close`/ `bigframes.pandas.close_session`/ `bigframes.pandas.reset_session`/ `bigframes.pandas.clean_up_by_session_id`) does not clean up the function, and leaves it for the user to manage the function and the associated cloud function directly.
`packages`	`str[], Optional` Explicit name of the external package dependencies. Each dependency is added to the `requirements.txt` as is, and can be of the form supported in https://pip.pypa.io/en/stable/reference/requirements-file-format/.
`cloud_function_service_account`	`str` Service account to use for the cloud functions. If "default" is provided, the default service account is used. See https://cloud.google.com/functions/docs/securing/function-identity for more details. Please make sure the service account has the necessary IAM permissions configured as described in https://cloud.google.com/functions/docs/reference/iam/roles#additional-configuration.
`cloud_function_kms_key_name`	`str, Optional` Customer managed encryption key to protect cloud functions and related data at rest. This is of the format projects/PROJECT_ID/locations/LOCATION/keyRings/KEYRING/cryptoKeys/KEY. Read https://cloud.google.com/functions/docs/securing/cmek for more details including granting necessary service accounts access to the key.
`cloud_function_docker_repository`	`str, Optional` Docker repository created with the same encryption key as `cloud_function_kms_key_name` to store encrypted artifacts created to support the cloud function. This is of the format projects/PROJECT_ID/locations/LOCATION/repositories/REPOSITORY_NAME. For more details see https://cloud.google.com/functions/docs/securing/cmek#before_you_begin.
`max_batching_rows`	`int, Optional` The maximum number of rows to be batched for processing in the BQ remote function. Default value is 1000. A lower number can be passed to avoid timeouts in case the user code is too complex to process a large number of rows fast enough. A higher number can be used to increase throughput in case the user code is fast enough. `None` can be passed to let the BQ remote functions service apply default batching. For more details, see https://cloud.google.com/bigquery/docs/remote-functions#limiting_number_of_rows_in_a_batch_request.
`cloud_function_timeout`	`int, Optional` The maximum amount of time (in seconds) BigQuery should wait for the cloud function to return a response. For more details, see https://cloud.google.com/functions/docs/configuring/timeout. Please note that even though the cloud function (2nd gen) itself allows setting a timeout of up to 60 minutes, a BigQuery remote function can wait only up to 20 minutes; for more details, see https://cloud.google.com/bigquery/quotas#remote_function_limits. By default BigQuery DataFrames uses a 10 minute timeout. `None` can be passed to let the cloud functions default timeout take effect.
`cloud_function_max_instances`	`int, Optional` The maximum instance count for the cloud function created. This can be used to control how many cloud function instances can be active at any given point in time. A lower setting can help control spikes in billing. A higher setting can help support processing larger scale data. When not specified, the cloud function's default setting applies. For more details, see https://cloud.google.com/functions/docs/configuring/max-instances.
`cloud_function_vpc_connector`	`str, Optional` The VPC connector you would like to configure for your cloud function. This is useful if your code needs access to data or service(s) that are on a VPC network. For more details, see https://cloud.google.com/functions/docs/networking/connecting-vpc.
`cloud_function_vpc_connector_egress_settings`	`str, Optional` Egress settings for the VPC connector, controlling what outbound traffic is routed through the VPC connector. Options are: `all`, `private-ranges-only`, or `unspecified`. If not specified, `private-ranges-only` is used by default. For more details, see https://cloud.google.com/run/docs/configuring/vpc-connectors#egress-job.
`cloud_function_memory_mib`	`int, Optional` The amount of memory (in mebibytes) to allocate for the cloud function (2nd gen) created. This also dictates a corresponding amount of allocated CPU for the function. By default, 1024 MiB of memory is set for the cloud functions created to support BigQuery DataFrames remote functions. If you want to let the default memory of cloud functions be allocated, pass `None`. For more details, see https://cloud.google.com/functions/docs/configuring/memory.
`cloud_function_ingress_settings`	`str, Optional` Ingress settings dictating what traffic can reach the function. Options are: `all`, `internal-only`, or `internal-and-gclb`. If no setting is provided, `internal-only` is used by default. For more details, see https://cloud.google.com/functions/docs/networking/network-settings#ingress_settings.
`cloud_build_service_account`	`str, Optional` Service account in the fully qualified format `projects/PROJECT_ID/serviceAccounts/SERVICE_ACCOUNT_EMAIL`, or just the SERVICE_ACCOUNT_EMAIL. The latter would be interpreted as belonging to the BigQuery DataFrames session project. This is to be used by Cloud Build to build the function source code into a deployable artifact. If not provided, the default Cloud Build service account is used. See https://cloud.google.com/build/docs/cloud-build-service-account for more details.

Returns
Type	Description
`collections.abc.Callable`	A remote function object pointing to the cloud assets created in the background to support the remote execution. The cloud assets can be located through the following properties set in the object: `bigframes_cloud_function` - The google cloud function deployed for the user defined code. `bigframes_remote_function` - The bigquery remote function capable of calling into `bigframes_cloud_function`.
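Several of the parameters above accept only a fixed set of values or a bounded range. The sketch below is a local pre-check that collects a few of them into a keyword dictionary before the decorator is applied. `build_remote_function_kwargs` is a hypothetical helper written for illustration; it is not part of the library, and the library may perform its own validation.

```python
from typing import Any, Dict, Optional

# Allowed values copied from the remote_function signature above.
INGRESS_SETTINGS = {"all", "internal-only", "internal-and-gclb"}
EGRESS_SETTINGS = {"all", "private-ranges-only", "unspecified"}
# A BigQuery remote function waits at most 20 minutes (see cloud_function_timeout).
MAX_TIMEOUT_SECONDS = 20 * 60


def build_remote_function_kwargs(
    cloud_function_service_account: str,
    cloud_function_timeout: Optional[int] = 600,
    cloud_function_ingress_settings: str = "internal-only",
    cloud_function_vpc_connector_egress_settings: Optional[str] = None,
) -> Dict[str, Any]:
    """Check a few remote_function arguments locally and return them as kwargs."""
    if not cloud_function_service_account:
        raise ValueError("cloud_function_service_account must be set explicitly")
    if cloud_function_timeout is not None and not (
        0 < cloud_function_timeout <= MAX_TIMEOUT_SECONDS
    ):
        raise ValueError("cloud_function_timeout must be between 1 and 1200 seconds")
    if cloud_function_ingress_settings not in INGRESS_SETTINGS:
        raise ValueError(f"unknown ingress setting: {cloud_function_ingress_settings!r}")
    egress = cloud_function_vpc_connector_egress_settings
    if egress is not None and egress not in EGRESS_SETTINGS:
        raise ValueError(f"unknown egress setting: {egress!r}")
    return {
        "cloud_function_service_account": cloud_function_service_account,
        "cloud_function_timeout": cloud_function_timeout,
        "cloud_function_ingress_settings": cloud_function_ingress_settings,
        "cloud_function_vpc_connector_egress_settings": egress,
    }


# The service account email below is a placeholder.
kwargs = build_remote_function_kwargs("my-sa@my-project.iam.gserviceaccount.com")
print(kwargs["cloud_function_timeout"])
```

Assuming a `session` object already exists, the resulting dictionary could then be passed along as `session.remote_function(input_types=[int], output_type=float, **kwargs)`.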

udf

udf(
    *,
    input_types: typing.Union[None, type, typing.Sequence[type]] = None,
    output_type: typing.Optional[type] = None,
    dataset: str,
    bigquery_connection: typing.Optional[str] = None,
    name: str,
    packages: typing.Optional[typing.Sequence[str]] = None,
    max_batching_rows: typing.Optional[int] = None,
    container_cpu: typing.Optional[float] = None,
    container_memory: typing.Optional[str] = None
)

Decorator to turn a Python user defined function (udf) into a BigQuery managed user-defined function.

Examples:

>>> import bigframes.pandas as bpd
>>> import datetime
>>> bpd.options.display.progress_bar = None

Turning an arbitrary python function into a BigQuery managed python udf:

>>> bq_name = datetime.datetime.now().strftime("bigframes_%Y%m%d%H%M%S%f")
>>> @bpd.udf(dataset="bigfranes_testing", name=bq_name)
... def minutes_to_hours(x: int) -> float:
...     return x/60

>>> minutes = bpd.Series([0, 30, 60, 90, 120])
>>> minutes
0      0
1     30
2     60
3     90
4    120
dtype: Int64

>>> hours = minutes.apply(minutes_to_hours)
>>> hours
0    0.0
1    0.5
2    1.0
3    1.5
4    2.0
dtype: Float64

To turn a user defined function with external package dependencies into a BigQuery managed python udf, provide the names of the packages (optionally with the package version) via the packages param.

>>> bq_name = datetime.datetime.now().strftime("bigframes_%Y%m%d%H%M%S%f")
>>> @bpd.udf(
...     dataset="bigfranes_testing",
...     name=bq_name,
...     packages=["cryptography"]
... )
... def get_hash(input: str) -> str:
...     from cryptography.fernet import Fernet
...
...     # handle missing value
...     if input is None:
...         input = ""
...
...     key = Fernet.generate_key()
...     f = Fernet(key)
...     return f.encrypt(input.encode()).decode()

>>> names = bpd.Series(["Alice", "Bob"])
>>> hashes = names.apply(get_hash)

You can clean up the BigQuery functions created above using the BigQuery client from the BigQuery DataFrames session:

>>> session = bpd.get_global_session()
>>> session.bqclient.delete_routine(minutes_to_hours.bigframes_bigquery_function)
>>> session.bqclient.delete_routine(get_hash.bigframes_bigquery_function)

Parameters
Name	Description
`input_types`	`type or sequence(type), Optional` For scalar user defined function it should be the input type or sequence of input types. The supported scalar input types are `bool`, `bytes`, `float`, `int`, `str`.
`output_type`	`type, Optional` Data type of the output in the user defined function. If the user defined function returns an array, then `list[type]` should be specified. The supported output types are `bool`, `bytes`, `float`, `int`, `str`, `list[bool]`, `list[float]`, `list[int]` and `list[str]`.
`dataset`	`str` Dataset in which to create a BigQuery managed function. It should be in `<project_id>.<dataset_name>` or `<dataset_name>` format.
`bigquery_connection`	`str, Optional` Name of the BigQuery connection. It is used to provide an identity to the serverless instances running the user code. It helps BigQuery manage and track the resources used by the udf. This connection is required for internet access and for interacting with other GCP services. To access GCP services, the appropriate IAM permissions must also be granted to the connection's Service Account. When it defaults to None, the udf will be created without any connection. A udf without a connection has no internet access and no access to other GCP services.
`name`	`str` Explicit name of the persisted BigQuery managed function. Use it with caution, because multiple users working in the same project and dataset could overwrite each other's managed functions if they use the same persistent name. Please note that any session-specific cleanup ( `bigframes.session.Session.close`/ `bigframes.pandas.close_session`/ `bigframes.pandas.reset_session`/ `bigframes.pandas.clean_up_by_session_id`) does not clean up this function, and leaves it for the user to manage the function directly.
`packages`	`str[], Optional` Explicit name of the external package dependencies. Each dependency is added to the `requirements.txt` as is, and can be of the form supported in https://pip.pypa.io/en/stable/reference/requirements-file-format/.
`max_batching_rows`	`int, Optional` The maximum number of rows in each batch. If you specify max_batching_rows, BigQuery determines the number of rows in a batch, up to the max_batching_rows limit. If max_batching_rows is not specified, the number of rows to batch is determined automatically.
`container_cpu`	`float, Optional` The CPU limits for containers that run Python UDFs. By default, the CPU allocated is 0.33 vCPU. See details at https://cloud.google.com/bigquery/docs/user-defined-functions-python#configure-container-limits.
`container_memory`	`str, Optional` The memory limits for containers that run Python UDFs. By default, the memory allocated to each container instance is 512 MiB. See details at https://cloud.google.com/bigquery/docs/user-defined-functions-python#configure-container-limits.

Returns
Type	Description
`collections.abc.Callable`	A managed function object pointing to the cloud assets created in the background to support the remote execution. The cloud assets can be located through the following properties set in the object: `bigframes_bigquery_function` - The bigquery managed function deployed for the user defined code.
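Because a managed function is created from an ordinary Python function, it can be useful to exercise the undecorated logic locally before deploying it. The sketch below does this for the `minutes_to_hours` function from the examples; `apply_locally` is a hypothetical helper introduced here for illustration, and it only approximates what `Series.apply` does with the deployed function.

```python
from typing import Callable, Iterable, List


def minutes_to_hours(x: int) -> float:
    # Same body as the example above, without the decorator.
    return x / 60


def apply_locally(func: Callable[[int], float], values: Iterable[int]) -> List[float]:
    """Call func on each value in plain Python, with no BigQuery involved."""
    return [func(v) for v in values]


print(apply_locally(minutes_to_hours, [0, 30, 60, 90, 120]))
# prints [0.0, 0.5, 1.0, 1.5, 2.0]
```

A local check like this does not cover missing values. The `get_hash` example handles `None` explicitly, and a function that may receive missing values would need similar handling.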

Parameters
Name	Description
`query`	`str` A SQL query to execute.
`index_col`	`Iterable[str] or str, optional` The column(s) to use as the index for the DataFrame. This can be a single column name or a list of column names. If not provided, a default index will be used.
`columns`	`Iterable[str], optional` The columns to read from the query result. If not specified, all columns will be read.
`configuration`	`dict, optional` A dictionary of query job configuration options. See the BigQuery REST API documentation for a list of available options: https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs#configuration.query
`max_results`	`int, optional` The maximum number of rows to retrieve from the query result. If not specified, all rows will be loaded.
`use_cache`	`bool, optional` Whether to use cached results for the query. Defaults to `True`. Setting this to `False` will force a re-execution of the query.
`col_order`	`Iterable[str], optional` The desired order of columns in the resulting DataFrame. This parameter is deprecated and will be removed in a future version. Use `columns` instead.
`filters`	`list[tuple], optional` A list of filters to apply to the data. Filters are specified as a list of tuples, where each tuple contains a column name, an operator (e.g., '==', '!='), and a value.
`dry_run`	`bool, optional` If `True`, the function will not actually execute the query but will instead return statistics about the query. Defaults to `False`.
`allow_large_results`	`bool, optional` Whether to allow large query results. If `True`, the query results can be larger than the maximum response size. Defaults to `bpd.options.compute.allow_large_results`.

Parameters
Name	Description
`table_id`	`str` The identifier of the BigQuery table to read.
`index_col`	`Iterable[str] or str, optional` The column(s) to use as the index for the DataFrame. This can be a single column name or a list of column names. If not provided, a default index will be used.
`columns`	`Iterable[str], optional` The columns to read from the table. If not specified, all columns will be read.
`max_results`	`int, optional` The maximum number of rows to retrieve from the table. If not specified, all rows will be loaded.
`filters`	`list[tuple], optional` A list of filters to apply to the data. Filters are specified as a list of tuples, where each tuple contains a column name, an operator (e.g., '==', '!='), and a value.
`use_cache`	`bool, optional` Whether to use cached results for the query. Defaults to `True`. Setting this to `False` will force a re-execution of the query.
`col_order`	`Iterable[str], optional` The desired order of columns in the resulting DataFrame. This parameter is deprecated and will be removed in a future version. Use `columns` instead.
`dry_run`	`bool, optional` If `True`, the function will not actually execute the query but will instead return statistics about the table. Defaults to `False`.

Returns
Type	Description
`bigframes.pandas.DataFrame or pandas.Series`	A DataFrame representing the result of the query. If `dry_run` is `True`, a `pandas.Series` containing query statistics is returned.
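The `filters` parameter described above takes tuples of a column name, an operator, and a value. To illustrate the shape of that argument, the sketch below applies such tuples to in-memory rows. It assumes that the tuples in a flat list are combined with AND, and it handles only the two operators named in the text. The `matches` helper and the sample rows are hypothetical and say nothing about how BigQuery evaluates the filters.

```python
import operator
from typing import Any, Callable, Dict, List, Tuple

Filter = Tuple[str, str, Any]

# Only the operators named in the parameter description are handled here.
_OPS: Dict[str, Callable[[Any, Any], bool]] = {
    "==": operator.eq,
    "!=": operator.ne,
}


def matches(row: Dict[str, Any], filters: List[Filter]) -> bool:
    """Return True if the row satisfies every (column, operator, value) tuple."""
    return all(_OPS[op](row[column], value) for column, op, value in filters)


# Made-up sample rows for illustration.
rows = [
    {"species": "Adelie", "island": "Torgersen"},
    {"species": "Adelie", "island": "Dream"},
    {"species": "Gentoo", "island": "Biscoe"},
]
filters = [("species", "==", "Adelie"), ("island", "!=", "Dream")]

selected = [row for row in rows if matches(row, filters)]
print(selected)
```

The same `filters` list is what would be passed as the `filters` argument when reading a table or query.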
