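Multimodal datasets

Multimodal datasets on Vertex AI let you create, manage, share, and use multimodal datasets for generative AI tasks. Multimodal datasets provide the following key features:

You can load datasets from BigQuery, DataFrames, or JSONL files in Cloud Storage.
You create a dataset once and use it across different job types, such as supervised fine-tuning and batch prediction, which prevents data duplication and formatting issues.
You keep all your generative AI datasets in a single, managed location.
You can validate the schema and structure and quantify the resources needed for downstream tasks, which helps you catch errors and estimate costs before you start a task.

You can use multimodal datasets through the Vertex AI SDK for Python or the REST API.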

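Multimodal datasets are a type of managed dataset on Vertex AI. They differ from other types of managed datasets in the following ways:

Multimodal datasets can contain data of any modality (text, image, audio, video). Other types of managed datasets hold data of a single modality only.
Multimodal datasets can be used only with generative AI services on Vertex AI, such as tuning and batch prediction with generative models. Other types of managed datasets can be used only with Vertex AI predictive models.
Multimodal datasets support additional methods, such as assemble and assess, that you can use to preview data, validate requests, and estimate costs.
Multimodal datasets are stored in BigQuery, which is optimized for large datasets.

Before you begin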

Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

Roles required to select or create a project

Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.

Verify that billing is enabled for your Google Cloud project.

Enable the Vertex AI, BigQuery, and Cloud Storage APIs.

Roles required to enable APIs

To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

Install and initialize the Vertex AI SDK for Python

Import the following libraries:

from google.cloud.aiplatform.preview import datasets

# Depending on the features you use, you may also need some of the following imports:
from vertexai.preview.tuning import sft
from vertexai.batch_prediction import BatchPredictionJob

from vertexai.generative_models import Content, Part, Tool, ToolConfig, SafetySetting, GenerationConfig, FunctionDeclaration

Create a dataset

You can create a multimodal dataset from various sources:

From a pandas DataFrame:

my_dataset = datasets.MultimodalDataset.from_pandas(
    dataframe=my_dataframe,
    target_table_id=table_id    # optional
)
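
The sample above assumes a `my_dataframe` variable that isn't defined on this page. As a rough sketch, assuming the `image_uris` and `labels` columns that the templates later on this page refer to, the underlying rows might look like the following. The `rows` list and the `check_columns` helper are hypothetical names used only for illustration, and the second image path is a placeholder rather than a real object.

```python
# Hypothetical sample rows; the column names follow the templates on this page.
rows = [
    {
        "image_uris": "gs://cloud-samples-data/ai-platform/flowers/daisy/1396526833_fb867165be_n.jpg",
        "labels": "daisy",
    },
    {
        # Placeholder path, not a real object.
        "image_uris": "gs://your-bucket/flowers/tulip_001.jpg",
        "labels": "tulips",
    },
]


def check_columns(rows, required=("image_uris", "labels")):
    """Return the rows' shared column names, or raise if any row differs."""
    columns = set(rows[0])
    for index, row in enumerate(rows):
        if set(row) != columns:
            raise ValueError(
                f"row {index} has columns {sorted(row)}, expected {sorted(columns)}"
            )
    missing = set(required) - columns
    if missing:
        raise ValueError(f"missing required columns: {sorted(missing)}")
    return sorted(columns)


print(check_columns(rows))
```

Passing these rows to `pandas.DataFrame(rows)` yields a DataFrame with one column per key, which could then serve as the `dataframe` argument.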

From a BigQuery DataFrame:

my_dataset = datasets.MultimodalDataset.from_bigframes(
    dataframe=my_dataframe,
    target_table_id=table_id    # optional
)

From a BigQuery table:

my_dataset_from_bigquery = datasets.MultimodalDataset.from_bigquery(
    bigquery_uri="bq://projectId.datasetId.tableId"
)
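
A malformed `bigquery_uri` is an easy mistake to make. The following is a minimal sketch of a local check, assuming the `bq://project.dataset.table` shape used in the sample above. The `split_bigquery_uri` helper is hypothetical and not part of the SDK, and it doesn't handle project IDs that themselves contain dots.

```python
import re

# Assumed shape: bq://<project>.<dataset>.<table>, as in the sample above.
_BQ_URI = re.compile(r"^bq://([^.]+)\.([^.]+)\.([^.]+)$")


def split_bigquery_uri(uri):
    """Split a bq:// URI into (project, dataset, table), or raise ValueError."""
    match = _BQ_URI.match(uri)
    if match is None:
        raise ValueError(f"not a bq://project.dataset.table URI: {uri!r}")
    return match.groups()


print(split_bigquery_uri("bq://projectId.datasetId.tableId"))
```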

From a BigQuery table (using the REST API):

curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
"https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION/datasets" \
-d '{
  "display_name": "TestDataset",
  "metadataSchemaUri": "gs://google-cloud-aiplatform/schema/dataset/metadata/multimodal_1.0.0.yaml",
  "metadata": {
    "inputConfig": {
      "bigquery_source": {
        "uri": "bq://projectId.datasetId.tableId"
      }
    }
  }
}'

From a JSONL file in Cloud Storage. In the following example, the requests in the JSONL file are already formatted for Gemini, so no assembly is required.
```
my_dataset = datasets.MultimodalDataset.from_gemini_request_jsonl(
    gcs_uri=gcs_uri_of_jsonl_file,
)
```
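
To make the JSONL input more concrete, here is a small sketch that writes such a file locally. It assumes that each line holds one Gemini request body shaped like the assembled example shown later on this page; the exact schema your job expects may differ. The `to_gemini_request` and `write_jsonl` helpers are hypothetical names, not SDK functions.

```python
import json
import os
import tempfile


def to_gemini_request(image_uri, label, mime_type="image/jpeg"):
    """Build one request in the shape of the assembled example on this page."""
    return {
        "contents": [
            {
                "role": "user",
                "parts": [
                    {"text": "This is the image: "},
                    {"fileData": {"fileUri": image_uri, "mimeType": mime_type}},
                ],
            },
            {"role": "model", "parts": [{"text": label}]},
        ]
    }


def write_jsonl(request_rows, path):
    """Write one compact JSON object per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for request in request_rows:
            handle.write(json.dumps(request) + "\n")


request_rows = [
    to_gemini_request(
        "gs://cloud-samples-data/ai-platform/flowers/daisy/1396526833_fb867165be_n.jpg",
        "daisy",
    )
]
path = os.path.join(tempfile.mkdtemp(), "requests.jsonl")
write_jsonl(request_rows, path)

# Read the file back to confirm that every line parses as JSON.
with open(path, encoding="utf-8") as handle:
    lines = handle.read().splitlines()
parsed = [json.loads(line) for line in lines]
print(len(parsed))
```

The file would still need to be uploaded to Cloud Storage, and its `gs://` URI passed as `gcs_uri`.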

From an existing multimodal dataset:

# Get the most recently created dataset
first_dataset = datasets.MultimodalDataset.list()[0]

# Load dataset based on its name
same_dataset = datasets.MultimodalDataset(first_dataset.name)

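Construct and attach a template

Templates define how to transform the multimodal dataset into a format that can be passed to the model. A template is required to run a tuning job or a batch prediction job.

Vertex AI SDK for Python

Construct a template. You can construct a template in two ways:

Use the construct_single_turn_template helper method: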

template_config = datasets.construct_single_turn_template(
        prompt="This is the image: {image_uris}",
        response="{labels}",
        system_instruction='You are a botanical image classifier. Analyze the provided image '
                'and determine the most accurate classification of the flower.'
                'These are the only flower categories: [\'daisy\', \'dandelion\', \'roses\', \'sunflowers\', \'tulips\'].'
                'Return only one category per image.'
)

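Manually construct a template from a GeminiExample, which allows finer granularity, such as multi-turn conversations. The following code sample also includes optional commented-out code for specifying a field_mapping, which lets you use placeholder names that differ from the dataset's column names. For example: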

# Define a GeminiExample
gemini_example = datasets.GeminiExample(
    contents=[
        Content(role="user", parts=[Part.from_text("This is the image: {image_uris}")]),
        Content(role="model", parts=[Part.from_text("This is the flower class: {labels}.")]),
        Content(role="user", parts=[Part.from_text("Your response should only contain the class label.")]),
        Content(role="model", parts=[Part.from_text("{labels}")]),

        # Optional: If you specify a field_mapping, you can use different placeholder values. For example:
        # Content(role="user", parts=[Part.from_text("This is the image: {uri_placeholder}")]),
        # Content(role="model", parts=[Part.from_text("This is the flower class: {flower_placeholder}.")]),
        # Content(role="user", parts=[Part.from_text("Your response should only contain the class label.")]),
        # Content(role="model", parts=[Part.from_text("{flower_placeholder}")]),
    ],
    system_instruction=Content(
        parts=[
            Part.from_text(
                'You are a botanical image classifier. Analyze the provided image '
                'and determine the most accurate classification of the flower.'
                'These are the only flower categories: [\'daisy\', \'dandelion\', \'roses\', \'sunflowers\', \'tulips\'].'
                'Return only one category per image.'
            )
        ]
    ),
)

# Construct the template, optionally specifying a map for the placeholders
template_config = datasets.GeminiTemplateConfig(
    gemini_example=gemini_example,

    # Optional: Map the template placeholders to the column names of your dataset.
    # Not required if the template placeholders are column names of the dataset.
    # field_mapping={"uri_placeholder": "image_uris", "flower_placeholder": "labels"},
)

Attach the template to the dataset:

my_dataset.attach_template_config(template_config=template_config)
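
Because a template placeholder must resolve to a dataset column, either directly or through `field_mapping`, it can be useful to check this locally before attaching a template. The following sketch is illustrative only and is not how the SDK or the service performs the substitution. The `placeholders` and `unresolved_placeholders` helpers are hypothetical, and the column set is assumed from the examples on this page.

```python
from string import Formatter


def placeholders(text):
    """Return the {name} placeholders that appear in a template string."""
    return [name for _, name, _, _ in Formatter().parse(text) if name]


def unresolved_placeholders(template_texts, columns, field_mapping=None):
    """List placeholders that don't resolve to a dataset column."""
    field_mapping = field_mapping or {}
    missing = []
    for text in template_texts:
        for name in placeholders(text):
            # A mapped placeholder resolves to its column; others resolve to themselves.
            column = field_mapping.get(name, name)
            if column not in columns and name not in missing:
                missing.append(name)
    return missing


columns = {"image_uris", "labels"}  # assumed dataset columns
texts = ["This is the image: {uri_placeholder}", "{flower_placeholder}"]
mapping = {"uri_placeholder": "image_uris", "flower_placeholder": "labels"}

print(unresolved_placeholders(texts, columns))           # without a mapping
print(unresolved_placeholders(texts, columns, mapping))  # with the mapping
```

Without the mapping, both placeholders are reported as unresolved; with it, the list is empty.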

REST

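Call the patch method and update the metadata field with the following:

The URI of the BigQuery table. For datasets created from a BigQuery table, this is your source bigquery_uri. For datasets created from other sources, such as JSONL or DataFrame, this is the BigQuery table that your data was copied to.
A gemini_template_config.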

curl -X PATCH \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
-d '{
  "metadata": {
    "input_config": {
      "bigquery_source": {
        "uri": "bq://projectId.datasetId.tableId"
      }
    },
    "gemini_template_config_source": {
      "gemini_template_config": {
        "gemini_example": {
          "contents": [
            {
              "role": "user",
              "parts": [
                {
                  "text": "This is the image: {image_uris}"
                }
              ]
            },
            {
              "role": "model",
              "parts": [
                {
                  "text": "response"
                }
              ]
            }
          ],
          "systemInstruction": {
            "parts": [
              {
                "text": "You are a botanical image classifier."
              }
            ]
          }
        }
      }
    }
  }
}' \
"https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION/datasets/DATASET_ID?updateMask=metadata"
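
Deeply nested JSON like this request body is easy to get wrong by hand, since a single missing comma or bracket makes it unparseable. One option, sketched below, is to build the body as a Python dictionary and serialize it with `json.dumps`. The field names mirror the request above; the `text_content` and `build_template_patch` helpers are hypothetical names, not part of any SDK.

```python
import json


def text_content(role, text):
    """One single-part text turn."""
    return {"role": role, "parts": [{"text": text}]}


def build_template_patch(bigquery_uri, turns, system_instruction):
    """Return the JSON body for the metadata patch shown above.

    `turns` is a list of (role, text) pairs.
    """
    body = {
        "metadata": {
            "input_config": {"bigquery_source": {"uri": bigquery_uri}},
            "gemini_template_config_source": {
                "gemini_template_config": {
                    "gemini_example": {
                        "contents": [text_content(role, text) for role, text in turns],
                        "systemInstruction": {"parts": [{"text": system_instruction}]},
                    }
                }
            },
        }
    }
    return json.dumps(body, indent=2)


patch_body = build_template_patch(
    "bq://projectId.datasetId.tableId",
    [("user", "This is the image: {image_uris}"), ("model", "response")],
    "You are a botanical image classifier.",
)
print(patch_body)
```

The resulting string could be saved to a file and passed to curl with `-d @FILE` instead of an inline payload.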

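(Optional) Assemble the dataset

The assemble method applies the template to transform your dataset and stores the output in a new BigQuery table. This lets you preview the data before it's passed to the model.

By default, the template_config attached to the dataset is used, but you can specify a template to override the default behavior.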

Vertex AI SDK for Python

table_id, assembly = my_dataset.assemble(template_config=template_config)

# Inspect the results
assembly.head()

REST

curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
"https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION/datasets/DATASET_ID:assemble" \
-d '{}'

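For example, assume that your multimodal dataset contains the following data:

Row	image_uris	labels
1	gs://cloud-samples-data/ai-platform/flowers/daisy/1396526833_fb867165be_n.jpg	daisy

The assemble method then creates a new BigQuery table, identified by table_id, in which each row contains a request body. For example: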

{
  "contents": [
    {
      "parts": [
        {
          "text": "This is the image: "
        },
        {
          "fileData": {
            "fileUri": "gs://cloud-samples-data/ai-platform/flowers/daisy/1396526833_fb867165be_n.jpg",
            "mimeType": "image/jpeg"
          }
        }
      ],
      "role": "user"
    },
    {
      "parts": [
        {
          "text": "daisy"
        }
      ],
      "role": "model"
    }
  ],
  "systemInstruction": {
    "parts": [
      {
        "text": "You are a botanical image classifier. Analyze the provided image and determine the most accurate classification of the flower.These are the only flower categories: ['daisy', 'dandelion', 'roses', 'sunflowers', 'tulips'].Return only one category per image."
      }
    ]
  }
}
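
To build intuition for how a row and a template turn into a request like the one above, here is a rough local sketch. It is not the service's implementation: the rule that a `gs://` value becomes a `fileData` part, the small MIME-type table, and the `render_parts` and `assemble_row` helpers are all assumptions made for illustration.

```python
import posixpath
import re

# Assumed extension-to-MIME table; extend it for the media types you use.
_MIME_TYPES = {".jpg": "image/jpeg", ".jpeg": "image/jpeg", ".png": "image/png"}
_PLACEHOLDER = re.compile(r"\{(\w+)\}")


def render_parts(template, row):
    """Fill placeholders, emitting a fileData part for each gs:// value."""
    parts, text, position = [], "", 0
    for match in _PLACEHOLDER.finditer(template):
        text += template[position:match.start()]
        position = match.end()
        value = str(row[match.group(1)])
        if value.startswith("gs://"):
            # Flush any pending text, then add the file as its own part.
            if text:
                parts.append({"text": text})
                text = ""
            extension = posixpath.splitext(value)[1].lower()
            parts.append(
                {"fileData": {"fileUri": value, "mimeType": _MIME_TYPES[extension]}}
            )
        else:
            text += value
    text += template[position:]
    if text:
        parts.append({"text": text})
    return parts


def assemble_row(row, prompt, response):
    """Build a two-turn request body from one dataset row."""
    return {
        "contents": [
            {"role": "user", "parts": render_parts(prompt, row)},
            {"role": "model", "parts": render_parts(response, row)},
        ]
    }


row = {
    "image_uris": "gs://cloud-samples-data/ai-platform/flowers/daisy/1396526833_fb867165be_n.jpg",
    "labels": "daisy",
}
assembled = assemble_row(row, "This is the image: {image_uris}", "{labels}")
print(assembled)
```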

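Tune a model

You can use a multimodal dataset to tune a Gemini model.

(Optional) Validate the dataset

Assess the dataset to check whether it contains errors, such as dataset formatting errors or model errors.

Vertex AI SDK for Python

Call assess_tuning_validity(). By default, the template_config attached to the dataset is used, but you can specify a template to override the default behavior.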

# Attach template
my_dataset.attach_template_config(template_config=template_config)

# Validation for tuning
validation = my_dataset.assess_tuning_validity(
    model_name="gemini-2.0-flash-001",
    dataset_usage="SFT_TRAINING"
)

# Inspect validation result
validation.errors
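
In a script, you might want to stop before submitting a job when validation reports problems. The following is a minimal sketch of such a gate. It assumes only that the result exposes an `errors` attribute that can be iterated, as in the sample above; the `ensure_valid` helper is hypothetical, and `SimpleNamespace` objects with a made-up error message stand in for real assessment results.

```python
from types import SimpleNamespace


def ensure_valid(validation, max_shown=5):
    """Raise ValueError if the assessment reported any errors."""
    errors = list(validation.errors)
    if errors:
        shown = "\n".join(f"- {error}" for error in errors[:max_shown])
        raise ValueError(f"{len(errors)} validation error(s):\n{shown}")


# Stand-ins for assessment results; the error message is invented.
ensure_valid(SimpleNamespace(errors=[]))
try:
    ensure_valid(SimpleNamespace(errors=["row 3: missing model turn"]))
except ValueError as exc:
    print(exc)
```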

REST

Call the assess method and provide a TuningValidationAssessmentConfig.

curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
"https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION/datasets/DATASET_ID:assess" \
-d '{
  "tuningValidationAssessmentConfig": {
    "modelName": "projects/PROJECT_ID/locations/LOCATION/models/gemini-2.0-flash-001",
    "datasetUsage": "SFT_TRAINING"
  }
}'

(Optional) Estimate resource usage

Assess the dataset to estimate the token count and billable character count for your tuning job.

Vertex AI SDK for Python

Call assess_tuning_resources().

# Resource estimation for tuning.
tuning_resources = my_dataset.assess_tuning_resources(
    model_name="gemini-2.0-flash-001"
)

print(tuning_resources)
# For example, TuningResourceUsageAssessmentResult(token_count=362688, billable_character_count=122000)
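
A token count can be turned into a rough cost figure with simple arithmetic. The sketch below assumes that tuning is billed per training token and that each epoch processes the full dataset once; check the current pricing page before relying on either assumption. The `estimate_tuning_cost` helper, the epoch count, and the price are placeholders, not published rates.

```python
def estimate_tuning_cost(token_count, epochs, usd_per_million_tokens):
    """Estimate cost as tokens x epochs x price per million tokens."""
    return token_count * epochs * usd_per_million_tokens / 1_000_000


# token_count comes from the sample output above; the epoch count and the
# price are placeholders, not published rates.
cost = estimate_tuning_cost(
    token_count=362_688, epochs=3, usd_per_million_tokens=1.00
)
print(f"${cost:.2f}")
```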

REST

Call the assess method and provide a TuningResourceUsageAssessmentConfig.

curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
"https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION/datasets/DATASET_ID:assess" \
-d '{
  "tuningResourceUsageAssessmentConfig": {
    "modelName": "projects/PROJECT_ID/locations/LOCATION/models/gemini-2.0-flash-001"
  }
}'

Run the tuning job

Vertex AI SDK for Python

from vertexai.tuning import sft

sft_tuning_job = sft.train(
  source_model="gemini-2.0-flash-001",
  # Pass the Vertex Multimodal Datasets directly
  train_dataset=my_multimodal_dataset,
  validation_dataset=my_multimodal_validation_dataset,
)

Google Gen AI SDK

from google import genai
from google.genai.types import HttpOptions, CreateTuningJobConfig

client = genai.Client(http_options=HttpOptions(api_version="v1"))

tuning_job = client.tunings.tune(
  base_model="gemini-2.0-flash-001",
  # Pass the resource name of the Vertex Multimodal Dataset, not the dataset object
  training_dataset={
      "vertex_dataset_resource": my_multimodal_dataset.resource_name
  },
  # Optional
  config=CreateTuningJobConfig(
      tuned_model_display_name="Example tuning job"),
)

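For more information, see Create a tuning job.

Batch prediction

You can use a multimodal dataset to get batch predictions.

(Optional) Validate the dataset

Assess the dataset to check whether it contains errors, such as dataset formatting errors or model errors.

Vertex AI SDK for Python

Call assess_batch_prediction_validity(). By default, the template_config attached to the dataset is used, but you can specify a template to override the default behavior.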

# Attach template
my_dataset.attach_template_config(template_config=template_config)

# Validation for batch prediction
validation = my_dataset.assess_batch_prediction_validity(
    model_name="gemini-2.0-flash-001"
)

# Inspect validation result
validation.errors

REST

Call the assess method and provide a batchPredictionValidationAssessmentConfig.

curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
"https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION/datasets/DATASET_ID:assess" \
-d '{
  "batchPredictionValidationAssessmentConfig": {
    "modelName": "projects/PROJECT_ID/locations/LOCATION/models/gemini-2.0-flash-001"
  }
}'

(Optional) Estimate resource usage

Assess the dataset to estimate the token count for your job.

Vertex AI SDK for Python

Call assess_batch_prediction_resources().

batch_prediction_resources = my_dataset.assess_batch_prediction_resources(
    model_name="gemini-2.0-flash-001"
)

print(batch_prediction_resources)
# For example, BatchPredictionResourceUsageAssessmentResult(token_count=362688, audio_token_count=122000)

REST

Call the assess method and provide a batchPredictionResourceUsageAssessmentConfig.

curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
"https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION/datasets/DATASET_ID:assess" \
-d '{
  "batchPredictionResourceUsageAssessmentConfig": {
    "modelName": "projects/PROJECT_ID/locations/LOCATION/models/gemini-2.0-flash-001"
  }
}'

Run the batch prediction job

You can use a multimodal dataset for batch prediction by passing the BigQuery table_id of the assembled output:

Vertex AI SDK for Python

from vertexai.batch_prediction import BatchPredictionJob

# The dataset needs an attached template_config to run batch prediction
my_dataset.attach_template_config(template_config=template_config)

# Assemble the dataset to get the assembly table ID
assembly_table_id, _ = my_dataset.assemble()

batch_prediction_job = BatchPredictionJob.submit(
    source_model="gemini-2.0-flash-001",
    input_dataset=assembly_table_id,
)

Google Gen AI SDK

from google import genai
from google.genai.types import HttpOptions

client = genai.Client(http_options=HttpOptions(api_version="v1"))

# Attach template_config and assemble dataset
my_dataset.attach_template_config(template_config=template_config)
assembly_table_id, _ = my_dataset.assemble()

job = client.batches.create(
    model="gemini-2.0-flash-001",
    src=assembly_table_id,
)

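For more information, see Request a batch prediction job.

Limitations

You can use multimodal datasets only with generative AI features. They can't be used with non-generative AI features such as AutoML training and custom training.
You can use multimodal datasets only with Google models such as Gemini. They can't be used with third-party models.

Pricing

When you tune a model or run a batch prediction job, you're billed for generative AI usage and for querying the dataset in BigQuery.

When you create, assemble, or assess a multimodal dataset, you're billed for storing and querying the multimodal dataset in BigQuery. Specifically, the following operations use these underlying services: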

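Create dataset
- Datasets created from an existing BigQuery table or DataFrame incur no additional storage costs, because they use a logical view instead of storing another copy of the data.
- Datasets created from other sources copy the data into a new BigQuery table, which incurs storage costs in BigQuery. For example, active logical storage costs $0.02 per GiB per month.
Assemble dataset
- This method creates a new BigQuery table that contains the complete dataset in model request format, which incurs storage costs in BigQuery. For example, active logical storage costs $0.02 per GiB per month.
- This method also reads the dataset once, which incurs query costs in BigQuery. For example, on-demand compute pricing is $6.25 per TiB.
Dataset validation and resource estimation
- Assess reads the dataset once, which incurs query costs in BigQuery. For example, on-demand compute pricing is $6.25 per TiB.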
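
As a worked example of the BigQuery side of these charges, the sketch below applies the example rates quoted above to a hypothetical 50 GiB dataset. It ignores free tiers and minimum billing increments, and it assumes that the assembled table is about the same size as the source, which may not hold in practice. The function names are illustrative only.

```python
STORAGE_USD_PER_GIB_MONTH = 0.02  # active logical storage, example rate from this page
QUERY_USD_PER_TIB = 6.25          # on-demand compute, example rate from this page
GIB_PER_TIB = 1024


def monthly_storage_cost(size_gib):
    """Cost of keeping a table of the given size for one month."""
    return size_gib * STORAGE_USD_PER_GIB_MONTH


def scan_cost(size_gib, scans=1):
    """Cost of reading the given amount of data the given number of times."""
    return size_gib / GIB_PER_TIB * QUERY_USD_PER_TIB * scans


size_gib = 50  # hypothetical dataset size

# Assemble stores one new table and reads the dataset once.
print(f"storage per month: ${monthly_storage_cost(size_gib):.2f}")
print(f"one full read: ${scan_cost(size_gib):.4f}")
```

At these rates the stored table costs $1.00 per month and each full read costs about $0.31, so repeated assess calls on a dataset this size add little to the total.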

Use the pricing calculator to generate a cost estimate based on your projected usage.