使用 BigQuery DataFrames 直观呈现图表

本文档演示了如何使用 BigQuery DataFrames 可视化库绘制各种类型的图表。

bigframes.pandas API 为 Python 提供了一整套工具生态系统。此 API 支持高级统计操作,您可以直观呈现从 BigQuery DataFrames 生成的聚合。您还可以使用内置的采样操作从 BigQuery DataFrames 切换到 pandas DataFrame。

直方图

以下示例从 bigquery-public-data.ml_datasets.penguins 表中读取数据,以绘制企鹅鸟嘴深度的分布直方图:

import bigframes.pandas as bpd

penguins = bpd.read_gbq("bigquery-public-data.ml_datasets.penguins")
penguins["culmen_depth_mm"].plot.hist(bins=40)

BigQuery DataFrames 中的直方图示例。

折线图

以下示例使用 bigquery-public-data.noaa_gsod.gsod2021 表中的数据绘制一年中平均温度变化的折线图:

import bigframes.pandas as bpd

noaa_surface = bpd.read_gbq("bigquery-public-data.noaa_gsod.gsod2021")

# Calculate median temperature for each day
noaa_surface_median_temps = noaa_surface[["date", "temp"]].groupby("date").median()

noaa_surface_median_temps.plot.line()

BigQuery DataFrames 中的折线图示例。

面积图

以下示例使用 bigquery-public-data.usa_names.usa_1910_2013 表来跟踪美国历史上的名字受欢迎程度,重点关注 MaryEmilyLisa 这三个名字:

import bigframes.pandas as bpd

usa_names = bpd.read_gbq("bigquery-public-data.usa_names.usa_1910_2013")

# Count the occurences of the target names each year. The result is a dataframe with a multi-index.
name_counts = (
    usa_names[usa_names["name"].isin(("Mary", "Emily", "Lisa"))]
    .groupby(("year", "name"))["number"]
    .sum()
)

# Flatten the index of the dataframe so that the counts for each name has their own columns.
name_counts = name_counts.unstack(level=1).fillna(0)

name_counts.plot.area(stacked=False, alpha=0.5)

BigQuery DataFrames 中的面积图示例。

条形图

以下示例使用 bigquery-public-data.ml_datasets.penguins 表直观呈现企鹅性别的分布情况:

import bigframes.pandas as bpd

penguins = bpd.read_gbq("bigquery-public-data.ml_datasets.penguins")

penguin_count_by_sex = (
    penguins[penguins["sex"].isin(("MALE", "FEMALE"))]
    .groupby("sex")["species"]
    .count()
)
penguin_count_by_sex.plot.bar()

BigQuery DataFrames 中的条形图示例。

散点图

以下示例使用 bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2021 表来探索出租车车费金额与行程距离之间的关系:

import bigframes.pandas as bpd

taxi_trips = bpd.read_gbq(
    "bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2021"
).dropna()

# Data Cleaning
taxi_trips = taxi_trips[
    taxi_trips["trip_distance"].between(0, 10, inclusive="right")
]
taxi_trips = taxi_trips[taxi_trips["fare_amount"].between(0, 50, inclusive="right")]

# If you are using partial ordering mode, you will also need to assign an order to your dataset.
# Otherwise, the next line can be skipped.
taxi_trips = taxi_trips.sort_values("pickup_datetime")

taxi_trips.plot.scatter(x="trip_distance", y="fare_amount", alpha=0.5)

BigQuery DataFrames 中的散点图示例。

直观呈现大型数据集

BigQuery DataFrames 会将数据下载到本地计算机以进行可视化。默认情况下,要下载的数据点数量上限为 1,000。如果数据点数量超过上限,BigQuery DataFrames 会随机抽样上限数量的数据点。

您可以在绘制图表时设置 sampling_n 参数来替换此上限,如以下示例所示:

import bigframes.pandas as bpd

noaa_surface = bpd.read_gbq("bigquery-public-data.noaa_gsod.gsod2021")

# Calculate median temperature for each day
noaa_surface_median_temps = noaa_surface[["date", "temp"]].groupby("date").median()

noaa_surface_median_temps.plot.line(sampling_n=40)

一个折线图示例,用于直观呈现 BigQuery DataFrames 中的大型数据集。

使用 pandas 和 Matplotlib 参数进行高级绘图

您可以传入更多参数来微调图表,就像使用 Pandas 一样,因为 BigQuery DataFrames 的绘图库由 Pandas 和 Matplotlib 提供支持。以下部分介绍了示例。

带有子图的姓名热门程度趋势

使用面积图示例中的名称历史记录数据,以下示例通过在 plot.area() 函数调用中设置 subplots=True,为每个名称创建单独的图表:

import bigframes.pandas as bpd

usa_names = bpd.read_gbq("bigquery-public-data.usa_names.usa_1910_2013")

# Count the occurences of the target names each year. The result is a dataframe with a multi-index.
name_counts = (
    usa_names[usa_names["name"].isin(("Mary", "Emily", "Lisa"))]
    .groupby(("year", "name"))["number"]
    .sum()
)

# Flatten the index of the dataframe so that the counts for each name has their own columns.
name_counts = name_counts.unstack(level=1).fillna(0)

name_counts.plot.area(subplots=True, alpha=0.5)

BigQuery DataFrames 中包含子图的各个图表的示例。

具有多个维度的出租车行程散点图

以下示例使用散点图示例中的数据,重命名了 x 轴和 y 轴的标签,使用 passenger_count 参数设置点大小,使用 tip_amount 参数设置彩色点,并调整了图表大小:

import bigframes.pandas as bpd

taxi_trips = bpd.read_gbq(
    "bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2021"
).dropna()

# Data Cleaning
taxi_trips = taxi_trips[
    taxi_trips["trip_distance"].between(0, 10, inclusive="right")
]
taxi_trips = taxi_trips[taxi_trips["fare_amount"].between(0, 50, inclusive="right")]

# If you are using partial ordering mode, you also need to assign an order to your dataset.
# Otherwise, the next line can be skipped.
taxi_trips = taxi_trips.sort_values("pickup_datetime")

taxi_trips["passenger_count_scaled"] = taxi_trips["passenger_count"] * 30

taxi_trips.plot.scatter(
    x="trip_distance",
    xlabel="trip distance (miles)",
    y="fare_amount",
    ylabel="fare amount (usd)",
    alpha=0.5,
    s="passenger_count_scaled",
    label="passenger_count",
    c="tip_amount",
    cmap="jet",
    colorbar=True,
    legend=True,
    figsize=(15, 7),
    sampling_n=1000,
)

BigQuery DataFrames 中具有多个维度的散点图示例。

后续步骤