Deploy and apply a remote function using BigQuery DataFrames

Use the BigQuery DataFrames API to deploy a Python function as a Cloud Function and use it as a remote function.

Code sample

Python

Before trying this sample, follow the Python setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Python API reference documentation.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

import bigframes.pandas as bpd

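# Set BigQuery DataFrames options.
# `your_gcp_project_id` is a placeholder: assign your Google Cloud project ID
# (a string) to it before running this sample.
bpd.options.bigquery.project = your_gcp_project_id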
bpd.options.bigquery.location = "us"

# BigQuery DataFrames lets you turn your custom scalar functions into
# BigQuery remote functions. This requires the GCP project to be set up
# appropriately and the user to have sufficient privileges to use remote
# functions. You can find more details about usage and requirements via the
# `help` command.
help(bpd.remote_function)

# Read a table and inspect the column of interest.
df = bpd.read_gbq("bigquery-public-data.ml_datasets.penguins")
df["body_mass_g"].head(10)

# Define a custom function, and specify the intent to turn it into a remote
# function. It requires a BigQuery connection. If the connection has not
# already been created, BigQuery DataFrames will attempt to create one,
# assuming the necessary APIs and IAM permissions are set up in the project.
# In our examples we let the default connection
# `bigframes-default-connection` be used. We also set `reuse=False` to make
# sure we don't interfere with someone else creating a remote function in the
# same project from the exact same source code at the same time. Let's try a
# `pandas`-like use case in which we want to apply a user-defined scalar
# function to every value in a `Series`: more specifically, bucketize the
# `body_mass_g` value of the penguins, which is a real number, into a
# category, which is a string.
@bpd.remote_function(
    float,
    str,
    reuse=False,
)
def get_bucket(num: float) -> str:
    if not num:
        return "NA"
    boundary = 4000
    return "at_or_above_4000" if num >= boundary else "below_4000"

# Then we can apply the remote function to the `Series` of interest via the
# `apply` API and store the result in a new column in the DataFrame.
df = df.assign(body_mass_bucket=df["body_mass_g"].apply(get_bucket))

# This will add a new column `body_mass_bucket` in the DataFrame. You can
# preview the original value and the bucketized value side by side.
df[["body_mass_g", "body_mass_bucket"]].head(10)

# The above operation was made possible by doing all the computation in the
# cloud. To that end, a Google Cloud function is deployed by serializing the
# user code, and a BigQuery remote function is created to call that cloud
# function via its HTTP endpoint on the data in the DataFrame.

# The BigQuery remote function created to support the BigQuery DataFrames
# remote function can be located via a property `bigframes_remote_function`
# set in the remote function object.
print(f"Created BQ remote function: {get_bucket.bigframes_remote_function}")

# The cloud function can be located via another property
# `bigframes_cloud_function` set in the remote function object.
print(f"Created cloud function: {get_bucket.bigframes_cloud_function}")

# Warning: The deployed cloud function may be visible to other users with
# sufficient privilege in the project, so the user should be careful about
# having any sensitive data in the code that will be deployed as a remote
# function.

# Let's continue with other potential use cases of remote functions. Say we
# consider the `species`, `island` and `sex` of the penguins to be sensitive
# information and want to redact them by replacing each value with an
# opaque, hash-like token. Let's define another scalar custom function and
# decorate it as a remote function. The custom function in this example has
# an external package dependency, which can be specified via the `packages`
# parameter.
@bpd.remote_function(
    str,
    str,
    reuse=False,
    packages=["cryptography"],
)
def get_hash(input: str) -> str:
    from cryptography.fernet import Fernet

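    # Handle a missing value.
    if input is None:
        input = ""

    # A fresh key is generated on every call and then discarded, so the
    # output is an irreversible token rather than a deterministic hash.
    key = Fernet.generate_key()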
    f = Fernet(key)
    return f.encrypt(input.encode()).decode()

# We can use this remote function with another `pandas`-like API, `map`,
# which applies it to every element of a DataFrame.
df_redacted = df[["species", "island", "sex"]].map(get_hash)
df_redacted.head(10)
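
Because every remote function deployment creates cloud resources, it can help to check the scalar logic locally first. The following is a minimal sketch that runs without BigQuery or any third-party package. The names `get_bucket_local` and `get_sha256_local` are hypothetical and not part of the sample above. `get_bucket_local` copies the bucketing logic as-is, while `get_sha256_local` swaps the Fernet-based redaction for a plain SHA-256 digest from the standard library, assuming a deterministic token is what you want.

```python
import hashlib
from typing import Optional


def get_bucket_local(num: Optional[float]) -> str:
    """Plain-Python copy of the bucketing logic used in the sample."""
    # Like the sample, treat a missing value (None) or 0 as "NA".
    if not num:
        return "NA"
    boundary = 4000
    return "at_or_above_4000" if num >= boundary else "below_4000"


def get_sha256_local(value: Optional[str]) -> str:
    """Hypothetical deterministic alternative to the sample's get_hash."""
    # Handle a missing value the same way the sample does.
    if value is None:
        value = ""
    # Equal inputs always produce equal digests, unlike the sample's
    # per-call Fernet key.
    return hashlib.sha256(value.encode()).hexdigest()


# Exercise the logic on a few hand-written values; no BigQuery required.
masses = [3750.0, 4000.0, 5200.0, None]
buckets = [get_bucket_local(m) for m in masses]
print(buckets)

digests = [get_sha256_local(s) for s in ["Adelie", "Adelie", None]]
print(digests[0] == digests[1])
```

Once the local results look right, the same function bodies can be placed under the `@bpd.remote_function` decorator as shown in the sample. Note that an unsalted digest of a low-cardinality column such as `sex` is easy to reverse by guessing, so treat this only as an illustration of the determinism trade-off.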

What's next

To search and filter code samples for other Google Cloud products, see the Google Cloud sample browser.