Last updated: 2025-09-04 (UTC).

# Create a Vertex AI tabular dataset

The model you create later in this tutorial requires a *dataset* to train it. The data that this tutorial uses is a publicly available dataset that contains details about three species of penguins. The following data are used to predict which of the three species a penguin is.

- `island` - The island where a species of penguin is found.
- `culmen_length_mm` - The length of the ridge along the top of a penguin's bill.
- `culmen_depth_mm` - The height of a penguin's bill.
- `flipper_length_mm` - The length of a penguin's flipper-like wing.
- `body_mass_g` - The mass of a penguin's body.
- `sex` - The sex of the penguin.

Download, preprocess, and split the data
----------------------------------------

In this section, you download the publicly available BigQuery dataset and prepare its data. To prepare the data, you do the following:

- Convert categorical features (features described with a string instead of a
  number) to numeric data. For example, you convert the names of the three types
  of penguins to the numerical values `0`, `1`, and `2`.

- Remove any columns in the dataset that aren't used.

- Remove any rows that can't be used.

- Split the data into two distinct sets of data.
  Each set of data is stored in a
  [pandas `DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html)
  object.

  - The `df_train` `DataFrame` contains data used to train your model.

  - The `df_for_prediction` `DataFrame` contains data used to generate predictions.

After processing the data, the code maps the three categorical columns'
numerical values to their string values, then prints them so that you can see
what the data looks like.

To download and process your data, run the following code in your notebook:

    import numpy as np
    import pandas as pd

    LABEL_COLUMN = "species"

    # Define the BigQuery source dataset
    BQ_SOURCE = "bigquery-public-data.ml_datasets.penguins"

    # Define NA values
    NA_VALUES = ["NA", "."]

    # Download a table
    table = bq_client.get_table(BQ_SOURCE)
    df = bq_client.list_rows(table).to_dataframe()

    # Drop unusable rows
    df = df.replace(to_replace=NA_VALUES, value=np.nan).dropna()

    # Convert categorical columns to numeric
    df["island"], island_values = pd.factorize(df["island"])
    df["species"], species_values = pd.factorize(df["species"])
    df["sex"], sex_values = pd.factorize(df["sex"])

    # Split into a training and holdout dataset
    df_train = df.sample(frac=0.8, random_state=100)
    df_for_prediction = df[~df.index.isin(df_train.index)]

    # Map numeric values to string values
    index_to_island = dict(enumerate(island_values))
    index_to_species = dict(enumerate(species_values))
    index_to_sex = dict(enumerate(sex_values))

    # View the mapped island, species, and sex data
    print(index_to_island)
    print(index_to_species)
    print(index_to_sex)

The following are the printed mapped values for characteristics that are not
numeric:

    {0: 'Dream', 1: 'Biscoe', 2: 'Torgersen'}
    {0: 'Adelie Penguin (Pygoscelis adeliae)', 1: 'Chinstrap penguin (Pygoscelis antarctica)', 2: 'Gentoo penguin (Pygoscelis papua)'}
    {0: 'FEMALE', 1: 'MALE'}

The first three
values are the islands a penguin might inhabit. The next three values are
important because they map to the predictions you receive at the end of this
tutorial. The third row shows that the `FEMALE` sex characteristic maps to `0`
and the `MALE` sex characteristic maps to `1`.

Create a tabular dataset for training your model
------------------------------------------------

In the previous step, you downloaded and processed your data. In this step, you
load the data stored in your `df_train` `DataFrame` into a BigQuery
dataset. Then, you use the BigQuery dataset to create a
Vertex AI tabular dataset. This tabular dataset is used to train your
model. For more information, see [Use managed
datasets](/vertex-ai/docs/training/using-managed-datasets).

### Create a BigQuery dataset

To create the BigQuery dataset that's used to create a
Vertex AI dataset, run the following code. The [`create_dataset`](https://cloud.google.com/python/docs/reference/bigquery/latest/google.cloud.bigquery.client.Client#google_cloud_bigquery_client_Client_create_dataset) command
returns a new BigQuery [`Dataset`](https://cloud.google.com/python/docs/reference/bigquery/latest/google.cloud.bigquery.dataset.Dataset).

    # Create a BigQuery dataset
    bq_dataset_id = f"{project_id}.dataset_id_unique"
    bq_dataset = bigquery.Dataset(bq_dataset_id)
    bq_client.create_dataset(bq_dataset, exists_ok=True)

### Create a Vertex AI tabular dataset

To convert your BigQuery dataset to a Vertex AI tabular
dataset, run the following code. You can ignore the warning about the number of
rows required to train on tabular data. Because the purpose of this tutorial
is to quickly show you how to get predictions, a relatively small set of data is
used. In a real-world scenario, you want at least 1,000 rows in a tabular dataset.
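Before you create the managed dataset, it can help to confirm that the 80/20 split behaved as expected. The following is a minimal, self-contained sketch: the small synthetic `df` stands in for the penguins data (in the notebook, `df` comes from BigQuery), while the split logic matches the preprocessing code above.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the penguins DataFrame downloaded earlier
df = pd.DataFrame({
    "species": np.arange(10) % 3,
    "body_mass_g": np.arange(10) * 100.0,
})

# Same 80/20 split used in the preprocessing step
df_train = df.sample(frac=0.8, random_state=100)
df_for_prediction = df[~df.index.isin(df_train.index)]

# Sanity checks: the two splits are disjoint and together cover every row
print(len(df_train), len(df_for_prediction))
assert len(df_train) + len(df_for_prediction) == len(df)
assert df_train.index.intersection(df_for_prediction.index).empty
```

Because `df_for_prediction` is built from the rows whose index is *not* in `df_train`, no row can appear in both sets, regardless of the random state.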
The
[`create_from_dataframe`](https://cloud.google.com/python/docs/reference/aiplatform/latest/google.cloud.aiplatform.TabularDataset#google_cloud_aiplatform_TabularDataset_create_from_dataframe)
command returns a Vertex AI
[`TabularDataset`](https://cloud.google.com/python/docs/reference/aiplatform/latest/google.cloud.aiplatform.TabularDataset#google_cloud_aiplatform_TabularDataset).

    # Create a Vertex AI tabular dataset
    dataset = aiplatform.TabularDataset.create_from_dataframe(
        df_source=df_train,
        staging_path=f"bq://{bq_dataset_id}.table-unique",
        display_name="sample-penguins",
    )

You now have the Vertex AI tabular dataset used to train your model.

(Optional) View the public dataset in BigQuery
----------------------------------------------

If you want to view the public data used in this tutorial, you can open it in
BigQuery.

1. In the **Search** field in the Google Cloud console, enter BigQuery, then
   press Enter.

2. In the search results, click **BigQuery**.

3. In the **Explorer** pane, expand **bigquery-public-data**.

4. Under **bigquery-public-data**, expand **ml_datasets**, then click **penguins**.

5. Click any of the names under **Field name** to view that field's data.
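The console shows the raw string values, while your notebook works with the numeric encodings; the `index_to_*` mappings built earlier translate between the two. The following is a minimal sketch of that round trip, using a short hand-written species list in place of the real column. The `pd.factorize` call and the `index_to_species` mapping mirror the preprocessing code; the sample strings are illustrative, not the full dataset labels.

```python
import pandas as pd

# Stand-in for the species column; in the notebook, these strings
# come from the BigQuery penguins table.
species = pd.Series(["Adelie", "Gentoo", "Adelie", "Chinstrap"])

# Same encoding step as in the preprocessing code:
# codes are integers, species_values holds the unique strings in
# order of first appearance.
codes, species_values = pd.factorize(species)
index_to_species = dict(enumerate(species_values))

# A prediction arrives as a numeric class index; the mapping
# converts it back to a readable species name.
predicted_class = codes[1]
print(index_to_species[predicted_class])  # -> Gentoo
```

This is the same lookup you use at the end of the tutorial to turn a predicted class index into a penguin species name.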