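Configure container settings for custom training

When you perform custom training, you must specify the machine learning (ML) code that you want Vertex AI to run. To do this, configure training container settings for either a custom container or a Python training application that runs on a prebuilt container.

To decide whether to use a custom container or a prebuilt container, read Training code requirements.

This document describes the Vertex AI API fields that you must specify in each of these two cases.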

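Where to specify container settings

Specify the configuration details in a WorkerPoolSpec. Depending on how you perform custom training, put this WorkerPoolSpec in one of the following API fields: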

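If you are creating a CustomJob resource, specify the WorkerPoolSpec in CustomJob.jobSpec.workerPoolSpecs.

If you are using the Google Cloud CLI, you can use the --worker-pool-spec flag or the --config flag on the gcloud ai custom-jobs create command to specify worker pool options.

Learn more about creating a CustomJob.
If you are creating a HyperparameterTuningJob resource, specify the WorkerPoolSpec in HyperparameterTuningJob.trialJobSpec.workerPoolSpecs.

If you are using the gcloud CLI, you can use the --config flag on the gcloud ai hp-tuning-jobs create command to specify worker pool options.

Learn more about creating a HyperparameterTuningJob.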
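If you are creating a TrainingPipeline resource without hyperparameter tuning, specify the WorkerPoolSpec in TrainingPipeline.trainingTaskInputs.workerPoolSpecs.

Learn more about creating a custom TrainingPipeline.
If you are creating a TrainingPipeline with hyperparameter tuning, specify the WorkerPoolSpec in TrainingPipeline.trainingTaskInputs.trialJobSpec.workerPoolSpecs.

If you are performing distributed training, you can use different settings for each worker pool.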
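The four placements above can be sketched as plain request bodies. This is only an illustration of where the WorkerPoolSpec nests: the helper function names and all placeholder values are assumptions made for this example, and the bodies omit other fields that a real request needs (for example, the study configuration of a HyperparameterTuningJob or the training task definition of a TrainingPipeline).

```python
import json

# A placeholder WorkerPoolSpec. The machine type and image URI are
# illustrative values, not recommendations.
worker_pool_spec = {
    "machineSpec": {"machineType": "n1-standard-4"},
    "replicaCount": 1,
    "containerSpec": {"imageUri": "CUSTOM_CONTAINER_IMAGE_URI"},
}


def custom_job_body(spec):
    # CustomJob.jobSpec.workerPoolSpecs
    return {"displayName": "JOB_NAME", "jobSpec": {"workerPoolSpecs": [spec]}}


def hyperparameter_tuning_job_body(spec):
    # HyperparameterTuningJob.trialJobSpec.workerPoolSpecs
    return {"displayName": "JOB_NAME", "trialJobSpec": {"workerPoolSpecs": [spec]}}


def training_pipeline_body(spec, with_tuning=False):
    # Without tuning: TrainingPipeline.trainingTaskInputs.workerPoolSpecs
    # With tuning:    TrainingPipeline.trainingTaskInputs.trialJobSpec.workerPoolSpecs
    job_spec = {"workerPoolSpecs": [spec]}
    inputs = {"trialJobSpec": job_spec} if with_tuning else job_spec
    return {"displayName": "PIPELINE_NAME", "trainingTaskInputs": inputs}


if __name__ == "__main__":
    print(json.dumps(training_pipeline_body(worker_pool_spec, with_tuning=True), indent=2))
```

For distributed training, the workerPoolSpecs list would hold one such dictionary per worker pool, each with its own settings.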

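Configure container settings

Depending on whether you are using a prebuilt container or a custom container, you must specify different fields in the WorkerPoolSpec. See the section below that matches your scenario: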

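Prebuilt container

Select a prebuilt container that supports the ML framework you plan to use for training. Specify one of the container image's URIs in the pythonPackageSpec.executorImageUri field.
Specify the Cloud Storage URIs of your Python training application in the pythonPackageSpec.packageUris field.
Specify your training application's entry point module in the pythonPackageSpec.pythonModule field.
Optionally, specify a list of command-line arguments to pass to your training application's entry point module in the pythonPackageSpec.args field.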
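As a minimal sketch of how these four fields fit together, the following builds a WorkerPoolSpec dictionary for a prebuilt container. The helper name prebuilt_worker_pool_spec, the bucket path, the module name trainer.task, and the --epochs argument are hypothetical values chosen for illustration; only the field names come from the steps above.

```python
import json


def prebuilt_worker_pool_spec(executor_image_uri, package_uris, python_module,
                              args=None, machine_type="n1-standard-4",
                              replica_count=1):
    """Build a WorkerPoolSpec dict that runs a Python package on a prebuilt container."""
    python_package_spec = {
        "executorImageUri": executor_image_uri,  # prebuilt container image URI
        "packageUris": list(package_uris),       # Cloud Storage URIs of the package
        "pythonModule": python_module,           # entry point module
    }
    if args:
        # Optional command-line arguments for the entry point module.
        python_package_spec["args"] = list(args)
    return {
        "machineSpec": {"machineType": machine_type},
        "replicaCount": replica_count,
        "pythonPackageSpec": python_package_spec,
    }


# Hypothetical example values.
spec = prebuilt_worker_pool_spec(
    executor_image_uri="PYTHON_PACKAGE_EXECUTOR_IMAGE_URI",
    package_uris=["gs://BUCKET/trainer-0.1.tar.gz"],
    python_module="trainer.task",
    args=["--epochs=10"],
)
print(json.dumps(spec, indent=2))
```

The args key is left out entirely when no arguments are given, which mirrors the field being optional.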

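The following examples highlight where you specify these container settings when you create a CustomJob:

Console

In the Google Cloud console, you can't create a CustomJob directly. However, you can create a TrainingPipeline that creates a CustomJob. When you create a TrainingPipeline in the Google Cloud console, you can specify prebuilt container settings in certain fields on the Training container step:

pythonPackageSpec.executorImageUri: Use the Model framework and Model framework version drop-down lists.
pythonPackageSpec.packageUris: Use the Package location field.
pythonPackageSpec.pythonModule: Use the Python module field.
pythonPackageSpec.args: Use the Arguments field.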

gcloud

gcloud ai custom-jobs create \
  --region=LOCATION \
  --display-name=JOB_NAME \
  --python-package-uris=PYTHON_PACKAGE_URIS \
  --worker-pool-spec=machine-type=MACHINE_TYPE,replica-count=REPLICA_COUNT,executor-image-uri=PYTHON_PACKAGE_EXECUTOR_IMAGE_URI,python-module=PYTHON_MODULE

For more context, read the guide to creating a CustomJob.

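Custom container

Specify the Artifact Registry or Docker Hub URI of your custom container in the containerSpec.imageUri field.
Optionally, if you want to override the ENTRYPOINT or CMD instructions in your container, specify the containerSpec.command or containerSpec.args fields. These fields affect how your container runs according to the following rules:
- If you specify neither field: Your container runs according to its ENTRYPOINT instruction and CMD instruction (if it exists). Refer to the Docker documentation about how CMD and ENTRYPOINT interact.
- If you specify only containerSpec.command: Your container runs with the value of containerSpec.command replacing its ENTRYPOINT instruction. If the container has a CMD instruction, it is ignored.
- If you specify only containerSpec.args: Your container runs according to its ENTRYPOINT instruction, with the value of containerSpec.args replacing its CMD instruction.
- If you specify both fields: Your container runs with containerSpec.command replacing its ENTRYPOINT instruction and containerSpec.args replacing its CMD instruction.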
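The four override rules can be modeled with a small function. This is a sketch for reasoning about the rules, not Vertex AI or Docker code: the function name effective_argv and the example ENTRYPOINT and CMD values are hypothetical, and the sketch assumes that an empty list counts as "not specified".

```python
def effective_argv(entrypoint, cmd, command=None, args=None):
    """Return the argument vector a container would run with.

    entrypoint, cmd: the image's ENTRYPOINT and CMD instructions (lists).
    command, args:   containerSpec.command and containerSpec.args (lists or None).
    Assumption: None and an empty list both mean "not specified".
    """
    entrypoint = list(entrypoint or [])
    cmd = list(cmd or [])
    if command and args:
        # Both fields: command replaces ENTRYPOINT, args replaces CMD.
        return list(command) + list(args)
    if command:
        # Only command: ENTRYPOINT is replaced and the image's CMD is ignored.
        return list(command)
    if args:
        # Only args: ENTRYPOINT is kept and args replaces CMD.
        return entrypoint + list(args)
    # Neither field: the image's own ENTRYPOINT and CMD apply.
    return entrypoint + cmd


# Hypothetical image with ENTRYPOINT ["python", "train.py"] and CMD ["--epochs=1"].
image_entrypoint = ["python", "train.py"]
image_cmd = ["--epochs=1"]

print(effective_argv(image_entrypoint, image_cmd))
# ['python', 'train.py', '--epochs=1']
print(effective_argv(image_entrypoint, image_cmd, command=["bash", "run.sh"]))
# ['bash', 'run.sh']
print(effective_argv(image_entrypoint, image_cmd, args=["--epochs=20"]))
# ['python', 'train.py', '--epochs=20']
print(effective_argv(image_entrypoint, image_cmd, command=["bash", "run.sh"], args=["--fast"]))
# ['bash', 'run.sh', '--fast']
```

The second case is the one most likely to surprise: setting only the command drops the image's CMD arguments, so any defaults they carried must be passed again through containerSpec.args.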

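The following examples highlight where you can specify some of these container settings when you create a CustomJob:

Console

In the Google Cloud console, you can't create a CustomJob directly. However, you can create a TrainingPipeline that creates a CustomJob. When you create a TrainingPipeline in the Google Cloud console, you can specify custom container settings in certain fields on the Training container step:

containerSpec.imageUri: Use the Container image field.
containerSpec.command: This API field is not configurable in the Google Cloud console.
containerSpec.args: Use the Arguments field.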

gcloud

gcloud ai custom-jobs create \
  --region=LOCATION \
  --display-name=JOB_NAME \
  --worker-pool-spec=machine-type=MACHINE_TYPE,replica-count=REPLICA_COUNT,container-image-uri=CUSTOM_CONTAINER_IMAGE_URI

Java

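Before trying this sample, follow the Java setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Java API reference documentation.

To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.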


import com.google.cloud.aiplatform.v1.AcceleratorType;
import com.google.cloud.aiplatform.v1.ContainerSpec;
import com.google.cloud.aiplatform.v1.CustomJob;
import com.google.cloud.aiplatform.v1.CustomJobSpec;
import com.google.cloud.aiplatform.v1.JobServiceClient;
import com.google.cloud.aiplatform.v1.JobServiceSettings;
import com.google.cloud.aiplatform.v1.LocationName;
import com.google.cloud.aiplatform.v1.MachineSpec;
import com.google.cloud.aiplatform.v1.WorkerPoolSpec;
import java.io.IOException;

// Create a custom job to run machine learning training code in Vertex AI
public class CreateCustomJobSample {

  public static void main(String[] args) throws IOException {
    // TODO(developer): Replace these variables before running the sample.
    String project = "PROJECT";
    String displayName = "DISPLAY_NAME";

    // Vertex AI runs your training application in a Docker container image. A Docker container
    // image is a self-contained software package that includes code and all dependencies. Learn
    // more about preparing your training application at
    // https://cloud.google.com/vertex-ai/docs/training/overview#prepare_your_training_application
    String containerImageUri = "CONTAINER_IMAGE_URI";
    createCustomJobSample(project, displayName, containerImageUri);
  }

  static void createCustomJobSample(String project, String displayName, String containerImageUri)
      throws IOException {
    JobServiceSettings settings =
        JobServiceSettings.newBuilder()
            .setEndpoint("us-central1-aiplatform.googleapis.com:443")
            .build();
    String location = "us-central1";

    // Initialize client that will be used to send requests. This client only needs to be created
    // once, and can be reused for multiple requests.
    try (JobServiceClient client = JobServiceClient.create(settings)) {
      MachineSpec machineSpec =
          MachineSpec.newBuilder()
              .setMachineType("n1-standard-4")
              .setAcceleratorType(AcceleratorType.NVIDIA_TESLA_T4)
              .setAcceleratorCount(1)
              .build();

      ContainerSpec containerSpec =
          ContainerSpec.newBuilder().setImageUri(containerImageUri).build();

      WorkerPoolSpec workerPoolSpec =
          WorkerPoolSpec.newBuilder()
              .setMachineSpec(machineSpec)
              .setReplicaCount(1)
              .setContainerSpec(containerSpec)
              .build();

      CustomJobSpec customJobSpecJobSpec =
          CustomJobSpec.newBuilder().addWorkerPoolSpecs(workerPoolSpec).build();

      CustomJob customJob =
          CustomJob.newBuilder()
              .setDisplayName(displayName)
              .setJobSpec(customJobSpecJobSpec)
              .build();
      LocationName parent = LocationName.of(project, location);
      CustomJob response = client.createCustomJob(parent, customJob);
      System.out.format("response: %s\n", response);
      System.out.format("Name: %s\n", response.getName());
    }
  }
}

Node.js

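Before trying this sample, follow the Node.js setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Node.js API reference documentation.

To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.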

/**
 * TODO(developer): Uncomment these variables before running the sample.
 * (Not necessary if passing values as arguments)
 */

// const customJobDisplayName = 'YOUR_CUSTOM_JOB_DISPLAY_NAME';
// const containerImageUri = 'YOUR_CONTAINER_IMAGE_URI';
// const project = 'YOUR_PROJECT_ID';
// const location = 'YOUR_PROJECT_LOCATION';

// Imports the Google Cloud Job Service Client library
const {JobServiceClient} = require('@google-cloud/aiplatform');

// Specifies the location of the api endpoint
const clientOptions = {
  apiEndpoint: 'us-central1-aiplatform.googleapis.com',
};

// Instantiates a client
const jobServiceClient = new JobServiceClient(clientOptions);

async function createCustomJob() {
  // Configure the parent resource
  const parent = `projects/${project}/locations/${location}`;
  const customJob = {
    displayName: customJobDisplayName,
    jobSpec: {
      workerPoolSpecs: [
        {
          machineSpec: {
            machineType: 'n1-standard-4',
            acceleratorType: 'NVIDIA_TESLA_K80',
            acceleratorCount: 1,
          },
          replicaCount: 1,
          containerSpec: {
            imageUri: containerImageUri,
            command: [],
            args: [],
          },
        },
      ],
    },
  };
  const request = {parent, customJob};

  // Create custom job request
  const [response] = await jobServiceClient.createCustomJob(request);

  console.log('Create custom job response:\n', JSON.stringify(response));
}
createCustomJob();

Python

To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Python API reference documentation.

from google.cloud import aiplatform


def create_custom_job_sample(
    project: str,
    display_name: str,
    container_image_uri: str,
    location: str = "us-central1",
    api_endpoint: str = "us-central1-aiplatform.googleapis.com",
):
    # The AI Platform services require regional API endpoints.
    client_options = {"api_endpoint": api_endpoint}
    # Initialize client that will be used to create and send requests.
    # This client only needs to be created once, and can be reused for multiple requests.
    client = aiplatform.gapic.JobServiceClient(client_options=client_options)
    custom_job = {
        "display_name": display_name,
        "job_spec": {
            "worker_pool_specs": [
                {
                    "machine_spec": {
                        "machine_type": "n1-standard-4",
                        "accelerator_type": aiplatform.gapic.AcceleratorType.NVIDIA_TESLA_K80,
                        "accelerator_count": 1,
                    },
                    "replica_count": 1,
                    "container_spec": {
                        "image_uri": container_image_uri,
                        "command": [],
                        "args": [],
                    },
                }
            ]
        },
    }
    parent = f"projects/{project}/locations/{location}"
    response = client.create_custom_job(parent=parent, custom_job=custom_job)
    print("response:", response)

For more context, read the guide to creating a CustomJob.

What's next

Learn how to perform custom training by creating a CustomJob.