Package bigquery (1.17.0)

API documentation for the bigquery package.

Functions

array_agg

array_agg(
    obj: groupby.SeriesGroupBy | groupby.DataFrameGroupBy,
) -> series.Series | dataframe.DataFrame

Group data and create arrays from selected columns, omitting NULLs to avoid BigQuery errors (BigQuery does not allow NULLs in arrays).

Examples:

>>> import bigframes.pandas as bpd
>>> import bigframes.bigquery as bbq
>>> import numpy as np
>>> bpd.options.display.progress_bar = None

For a SeriesGroupBy object:

>>> lst = ['a', 'a', 'b', 'b', 'a']
>>> s = bpd.Series([1, 2, 3, 4, np.nan], index=lst)
>>> bbq.array_agg(s.groupby(level=0))
a    [1. 2.]
b    [3. 4.]
dtype: list<item: double>[pyarrow]

For a DataFrameGroupBy object:

>>> l = [[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]]
>>> df = bpd.DataFrame(l, columns=["a", "b", "c"])
>>> bbq.array_agg(df.groupby(by=["b"]))
         a      c
b
1.0    [2]    [3]
2.0  [1 1]  [3 2]
<BLANKLINE>
[2 rows x 2 columns]
Parameter
Name Description
obj

A GroupBy object to which the function is applied.
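
Because array_agg returns a Series of arrays, its result can be fed to the other array helpers on this page. A minimal sketch reusing the Series s from the example above (results noted in comments are expected, not verified output):

>>> grouped = bbq.array_agg(s.groupby(level=0))
>>> lengths = bbq.array_length(grouped)  # expected: 2 for group "a" (the NULL is dropped) and 2 for group "b"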

array_length

array_length(series: series.Series) -> series.Series

Compute the length of each array element in the Series.

Examples:

>>> import bigframes.pandas as bpd
>>> import bigframes.bigquery as bbq
>>> bpd.options.display.progress_bar = None

>>> s = bpd.Series([[1, 2, 8, 3], [], [3, 4]])
>>> bbq.array_length(s)
0    4
1    0
2    2
dtype: Int64

You can also use this function with Series.apply:

>>> s.apply(bbq.array_length, by_row=False)
0    4
1    0
2    2
dtype: Int64
Parameter
Name Description
series

A Series with array columns.
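
array_length also composes with json_extract_array, documented below. A hedged sketch counting the elements of JSON arrays (results noted in comments are expected, not verified output):

>>> j = bpd.Series(['[1, 2, 3]', '[4, 5]'])
>>> counts = bbq.array_length(bbq.json_extract_array(j))  # expected: 3 and 2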

array_to_string

array_to_string(series: series.Series, delimiter: str) -> series.Series

Converts array elements within a Series into delimited strings.

Examples:

>>> import bigframes.pandas as bpd
>>> import bigframes.bigquery as bbq
>>> import numpy as np
>>> bpd.options.display.progress_bar = None

>>> s = bpd.Series([["H", "i", "!"], ["Hello", "World"], np.nan, [], ["Hi"]])
>>> bbq.array_to_string(s, delimiter=", ")
0         H, i, !
1    Hello, World
2
3
4              Hi
dtype: string
Parameters
Name Description
series

A Series containing arrays.

delimiter

The string used to separate array elements.
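
Arrays produced by array_agg can likewise be flattened back into delimited strings. An illustrative sketch with string-valued arrays (output omitted):

>>> lst = ['a', 'a', 'b', 'b']
>>> vals = bpd.Series(['x', 'y', 'z', 'w'], index=lst)
>>> arrays = bbq.array_agg(vals.groupby(level=0))        # one array of strings per group
>>> joined = bbq.array_to_string(arrays, delimiter="|")  # e.g. "x|y" for group "a" and "z|w" for group "b"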

json_extract

json_extract(series: series.Series, json_path: str) -> series.Series

Extracts a JSON value and converts it to a SQL JSON-formatted STRING or JSON value. This function uses single quotes and brackets to escape invalid JSONPath characters in JSON keys.

Examples:

>>> import bigframes.pandas as bpd
>>> import bigframes.bigquery as bbq
>>> bpd.options.display.progress_bar = None

>>> s = bpd.Series(['{"class": {"students": [{"id": 5}, {"id": 12}]}}'])
>>> bbq.json_extract(s, json_path="$.class")
0    {"students":[{"id":5},{"id":12}]}
dtype: string
Parameters
Name Description
series

The Series containing JSON data (as native JSON objects or JSON-formatted strings).

json_path

The JSON path identifying the data that you want to obtain from the input.
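
The single-quote-and-bracket escaping mentioned above is used when a JSON key contains characters that are not valid in a bare JSONPath, such as a dot. A hedged sketch (the key name is illustrative):

>>> s2 = bpd.Series(['{"a.b": {"c": 1}}'])
>>> nested = bbq.json_extract(s2, json_path="$['a.b']")  # expected to yield the JSON-formatted string '{"c":1}'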

json_extract_array

json_extract_array(series: series.Series, json_path: str = "$") -> series.Series

Extracts a JSON array and converts it to a SQL array of JSON-formatted STRING or JSON values. This function uses single quotes and brackets to escape invalid JSONPath characters in JSON keys.

Examples:

>>> import bigframes.pandas as bpd
>>> import bigframes.bigquery as bbq
>>> bpd.options.display.progress_bar = None

>>> s = bpd.Series(['[1, 2, 3]', '[4, 5]'])
>>> bbq.json_extract_array(s)
0    ['1' '2' '3']
1        ['4' '5']
dtype: list<item: string>[pyarrow]
Parameters
Name Description
series

The Series containing JSON data (as native JSON objects or JSON-formatted strings).

json_path

The JSON path identifying the data that you want to obtain from the input.
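
The json_path argument selects arrays nested inside a JSON document rather than the top-level value. A minimal sketch (output omitted; exact rendering may differ):

>>> s2 = bpd.Series(['{"fruits": ["apple", "banana"]}', '{"fruits": ["pear"]}'])
>>> fruits = bbq.json_extract_array(s2, json_path="$.fruits")  # one array of JSON-formatted strings per row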

json_set

json_set(
    series: series.Series,
    json_path_value_pairs: typing.Sequence[typing.Tuple[str, typing.Any]],
) -> series.Series

Produces a new JSON value within a Series by inserting or replacing values at specified paths.

Examples:

>>> import bigframes.pandas as bpd
>>> import bigframes.bigquery as bbq
>>> import numpy as np
>>> bpd.options.display.progress_bar = None

>>> s = bpd.read_gbq("SELECT JSON '{\"a\": 1}' AS data")["data"]
>>> bbq.json_set(s, json_path_value_pairs=[("$.a", 100), ("$.b", "hi")])
0    {"a":100,"b":"hi"}
Name: data, dtype: string
Parameters
Name Description
series

The Series containing JSON data (as native JSON objects or JSON-formatted strings).

json_path_value_pairs

Pairs of JSON path and the new value to insert or replace at that path.

struct

struct(value: dataframe.DataFrame) -> series.Series

Takes a DataFrame and converts it into a Series of structs, with each struct entry corresponding to a DataFrame row and each struct field corresponding to a DataFrame column.

Examples:

>>> import bigframes.pandas as bpd
>>> import bigframes.bigquery as bbq
>>> import bigframes.series as series
>>> bpd.options.display.progress_bar = None

>>> srs = series.Series([{"version": 1, "project": "pandas"}, {"version": 2, "project": "numpy"},])
>>> df = srs.struct.explode()
>>> bbq.struct(df)
0    {'project': 'pandas', 'version': 1}
1     {'project': 'numpy', 'version': 2}
dtype: struct<project: string, version: int64>[pyarrow]

Parameter
Name Description
value

The DataFrame (bigframes.dataframe.DataFrame) to be converted to a Series of structs.

Returns
bigframes.series.Series: A new Series with struct entries representing rows of the original DataFrame.
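
The same conversion can be sketched from a DataFrame built directly, without going through a struct Series first (illustrative; struct field order follows the DataFrame's column order):

>>> df2 = bpd.DataFrame({"project": ["pandas", "numpy"], "version": [1, 2]})
>>> packages = bbq.struct(df2)  # one struct per row, e.g. {'project': 'pandas', 'version': 1}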

vector_search

vector_search(
    base_table: str,
    column_to_search: str,
    query: Union[dataframe.DataFrame, series.Series],
    *,
    query_column_to_search: Optional[str] = None,
    top_k: Optional[int] = 10,
    distance_type: Literal["euclidean", "cosine"] = "euclidean",
    fraction_lists_to_search: Optional[float] = None,
    use_brute_force: bool = False
) -> dataframe.DataFrame

Conduct vector search, which searches embeddings to find semantically similar entities.

Examples:

>>> import bigframes.pandas as bpd
>>> import bigframes.bigquery as bbq
>>> bpd.options.display.progress_bar = None

DataFrame embeddings for which to find nearest neighbors. The ARRAY<FLOAT64> column is used as the search query:

>>> search_query = bpd.DataFrame({"query_id": ["dog", "cat"],
...                               "embedding": [[1.0, 2.0], [3.0, 5.2]]})
>>> bbq.vector_search(
...             base_table="bigframes-dev.bigframes_tests_sys.base_table",
...             column_to_search="my_embedding",
...             query=search_query,
...             top_k=2)
  query_id  embedding  id my_embedding  distance
1      cat  [3.  5.2]   5    [5.  5.4]  2.009975
0      dog    [1. 2.]   1      [1. 2.]       0.0
0      dog    [1. 2.]   4    [1.  3.2]       1.2
1      cat  [3.  5.2]   2      [2. 4.]   1.56205
<BLANKLINE>
[4 rows x 5 columns]

Series embeddings for which to find nearest neighbors:

>>> search_query = bpd.Series([[1.0, 2.0], [3.0, 5.2]],
...                            index=["dog", "cat"],
...                            name="embedding")
>>> bbq.vector_search(
...             base_table="bigframes-dev.bigframes_tests_sys.base_table",
...             column_to_search="my_embedding",
...             query=search_query,
...             top_k=2)
     embedding  id my_embedding  distance
dog    [1. 2.]   1      [1. 2.]       0.0
cat  [3.  5.2]   5    [5.  5.4]  2.009975
dog    [1. 2.]   4    [1.  3.2]       1.2
cat  [3.  5.2]   2      [2. 4.]   1.56205
<BLANKLINE>
[4 rows x 4 columns]

You can specify the name of the column in the query DataFrame that contains the embeddings, as well as the distance type. If you specify query_column_to_search, the provided column supplies the embeddings for which to find nearest neighbors. Otherwise, the column_to_search value is used.

>>> search_query = bpd.DataFrame({"query_id": ["dog", "cat"],
...                               "embedding": [[1.0, 2.0], [3.0, 5.2]],
...                               "another_embedding": [[0.7, 2.2], [3.3, 5.2]]})
>>> bbq.vector_search(
...             base_table="bigframes-dev.bigframes_tests_sys.base_table",
...             column_to_search="my_embedding",
...             query=search_query,
...             distance_type="cosine",
...             query_column_to_search="another_embedding",
...             top_k=2)
  query_id  embedding another_embedding  id my_embedding  distance
1      cat  [3.  5.2]         [3.3 5.2]   2      [2. 4.]  0.005181
0      dog    [1. 2.]         [0.7 2.2]   4    [1.  3.2]  0.000013
1      cat  [3.  5.2]         [3.3 5.2]   1      [1. 2.]  0.005181
0      dog    [1. 2.]         [0.7 2.2]   3    [1.5 7. ]  0.004697
<BLANKLINE>
[4 rows x 6 columns]
Parameters
Name Description
base_table

The table to search for nearest neighbor embeddings.

column_to_search

The name of the base table column to search for nearest neighbor embeddings. The column must have a type of ARRAY<FLOAT64>. All elements in the array must be non-NULL.

query

A Series or DataFrame that provides the embeddings for which to find nearest neighbors.

query_column_to_search

Specifies the name of the column in the query that contains the embeddings for which to find nearest neighbors. The column must have a type of ARRAY<FLOAT64>. All elements in the array must be non-NULL, and all values in the column must have the same array dimensions as the values in the column_to_search column. Can only be set when query is a DataFrame.

top_k

Specifies the number of nearest neighbors to return. Defaults to 10.

distance_type

Specifies the type of metric to use to compute the distance between two vectors. Possible values are "euclidean" and "cosine". Defaults to "euclidean".

fraction_lists_to_search

Specifies the percentage of lists to search. Specifying a higher percentage leads to higher recall and slower performance, and the converse is true when specifying a lower percentage. It is only used when a vector index is also used. You can only specify fraction_lists_to_search when use_brute_force is set to False.

use_brute_force

Determines whether to use brute force search by skipping the vector index if one is available. Defaults to False. See the sketch below for how these options are combined.
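
As a hedged sketch, assuming a vector index has been created on the my_embedding column of the base table used in the examples above, the index-tuning options can be combined as follows, reusing search_query from the earlier examples (results depend on the index and data, so output is omitted):

>>> tuned = bbq.vector_search(
...             base_table="bigframes-dev.bigframes_tests_sys.base_table",
...             column_to_search="my_embedding",
...             query=search_query,
...             top_k=2,
...             fraction_lists_to_search=0.1)  # scan roughly 10% of the index lists; requires use_brute_force=False (the default)
>>> exact = bbq.vector_search(
...             base_table="bigframes-dev.bigframes_tests_sys.base_table",
...             column_to_search="my_embedding",
...             query=search_query,
...             top_k=2,
...             use_brute_force=True)          # exact search that skips the vector index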