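Create and run a job that uses GPUs

This document explains how to create and run a job that uses graphics processing units (GPUs). To learn more about the features and restrictions for GPUs, see About GPUs in the Compute Engine documentation.

When you create a Batch job, you can optionally use GPUs to accelerate specific workloads. Common use cases for jobs that use GPUs include intensive data processing and artificial intelligence (AI) workloads such as machine learning (ML).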

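Before you begin

If you haven't used Batch before, review Get started with Batch and enable Batch by completing the prerequisites for projects and users.
To get the permissions that you need to create a job, ask your administrator to grant you the following IAM roles:
- Batch Job Editor (roles/batch.jobsEditor) on the project
- Service Account User (roles/iam.serviceAccountUser) on the job's service account, which by default is the default Compute Engine service account
For more information about granting roles, see Manage access to projects, folders, and organizations.

You might also be able to get the required permissions through custom roles or other predefined roles.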

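Create a job that uses GPUs

To create a job that uses GPUs, do the following:

Plan the requirements for a job that uses GPUs.
Create a job with the requirements and methods that you identified. For examples of how to create a job using the recommended options, see Create an example job that uses GPUs in this document.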

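Plan the requirements for a job that uses GPUs

Before you create a job that uses GPUs, plan the job's requirements as explained in the following sections:

Select the GPU machine type and provisioning method
Install the GPU drivers
Define compatible VM resources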

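Step 1: Select the GPU machine type and provisioning method

A job's requirements vary based on your preferred GPU machine type and provisioning method, and the options for each can depend on the other. Based on your requirements and priorities, you can select either the GPU machine type first or the provisioning method first. Generally, the GPU machine type primarily affects performance and base pricing, and the provisioning method primarily affects resource availability and additional costs or discounts.

Select the GPU machine type

The available GPU machine types (the valid combinations of GPU type, number of GPUs, and machine type (vCPUs and memory)) and their use cases are listed on the GPU machine types page in the Compute Engine documentation.

The fields that a job requires to specify a GPU machine type vary based on the following categories: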

GPU machine types and their job requirements

GPUs for accelerator-optimized VMs: VMs with a machine type from the accelerator-optimized machine family have a specific type and number of these GPUs automatically attached.

To use GPUs for accelerator-optimized VMs, we recommend that you specify the machine type. Each accelerator-optimized machine type supports only a specific type and number of GPUs, so the job behaves the same whether or not you also specify those values.

Batch also supports specifying only the type and number of GPUs for accelerator-optimized VMs, but the resulting vCPU and memory options are often very limited. As a result, we recommend that you verify that the available vCPU and memory options are compatible with the job's task requirements.

GPUs for N1 VMs: These GPUs require you to specify the type and number to attach to each VM, and they must be attached to VMs with a machine type from the N1 machine series.

To use GPUs for N1 VMs, we recommend that you specify at least the type of GPUs and the number of GPUs. Make sure that the combination of values matches one of the valid GPU options for N1 machine types. The vCPU and memory options for N1 VMs that use any specific type and number of GPUs are quite flexible. Unless you create the job using the Google Cloud console, you can let Batch automatically select a machine type that meets the job's task requirements.

Note: Batch does not use GPUs for a job that specifies an N1 machine type but does not specify the type of GPUs and the number of GPUs.
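
The two categories above lead to different `policy` fragments in a job configuration. The following Python sketch illustrates the difference, assuming the field names `machineType`, `accelerators`, `type`, and `count` that appear in the JSON examples in this document. The helper name `build_instance_policy` and the example values are hypothetical and are not part of Batch.

```python
from typing import Optional


def build_instance_policy(
    machine_type: Optional[str] = None,
    gpu_type: Optional[str] = None,
    gpu_count: Optional[int] = None,
) -> dict:
    """Return a `policy` fragment for one allocationPolicy.instances[] entry."""
    # The GPU type and GPU count only make sense as a pair.
    if (gpu_type is None) != (gpu_count is None):
        raise ValueError("Specify the GPU type and the GPU count together.")

    policy: dict = {}
    if machine_type is not None:
        # Recommended for accelerator-optimized VMs: the machine type
        # already determines the type and number of GPUs.
        policy["machineType"] = machine_type
    if gpu_type is not None:
        # Recommended for N1 VMs: specify at least the GPU type and count.
        policy["accelerators"] = [{"type": gpu_type, "count": gpu_count}]
    if not policy:
        raise ValueError("Specify a machine type, a GPU type and count, or both.")
    return policy


# Example values only; check the Compute Engine documentation for valid options.
accelerator_optimized = build_instance_policy(machine_type="g2-standard-4")
n1_with_gpus = build_instance_policy(gpu_type="nvidia-tesla-t4", gpu_count=1)
print(accelerator_optimized)
print(n1_with_gpus)
```

The sketch only shapes the configuration; it does not check whether a given combination is a valid GPU machine type.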

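Select the provisioning method

Batch uses different methods to provision the VM resources for jobs that use GPUs, based on the type of resources that your job requests. The available provisioning methods and their requirements are described below, listed by use case from highest to lowest resource availability.

In summary, we recommend that most users do the following:

If you intend to use A3 GPU machine types without a reservation, use Dynamic Workload Scheduler for Batch (Preview).

Note: If you want to use Dynamic Workload Scheduler for Batch with other GPU machine types, contact the Google Cloud sales team or your account team.
For all other GPU machine types, use the default provisioning method. The default provisioning method is usually on-demand. The exception is when your project has unused reservations that the job can automatically consume.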

Provisioning methods and their job requirements

Reservations

Use case: We recommend reservations for jobs if you want a very high level of assurance of resource availability or if you already have existing reservations that might otherwise go unused.
Details: A reservation incurs the costs for the specified VMs, at the same price as running the VMs, until you delete the reservation. VMs that consume a reservation don't incur separate costs, but the reservation incurs costs regardless of whether it is consumed.

Batch uses reservations for jobs that can consume unused reservations. For more information about reservations and their requirements, see the Ensure resource availability using VM reservations page.

Dynamic Workload Scheduler for Batch (Preview)

Use case: We recommend Dynamic Workload Scheduler if you want to use GPUs for VMs from the A3 machine series without consuming a reservation.
Details: Dynamic Workload Scheduler can make it easier to access many resources at once, which can accelerate AI and ML workloads. For example, Dynamic Workload Scheduler can help with job scheduling by mitigating delays or issues caused by unavailable resources.

Important: Unlike other jobs, Batch jobs that use GPUs through Dynamic Workload Scheduler use Compute Engine managed instance group (MIG) resize requests, which behave slightly differently. Specifically, jobs that use GPUs through Dynamic Workload Scheduler might require preemptible quota, which is recommended to mitigate quota friction for Dynamic Workload Scheduler GPUs. For more information, see GPU VMs and preemptible allocation quotas.

Batch uses Dynamic Workload Scheduler for jobs that do all of the following:

Specify an A3 GPU machine type.
Block reservations. Specifically, the job must set the reservation field to NO_RESERVATION. For more information, see Create and run a job that can't consume reserved VMs.
Don't use Spot VMs. Specifically, the job can omit the provisioningModel field or set the provisioningModel field to STANDARD.

Tip: Although you can run the job in any location that offers A3 VMs, we recommend the location us-central1 because it provides dedicated capacity for Dynamic Workload Scheduler.

On-demand

Use case: We recommend on-demand for all other jobs.
Details: On-demand is usually the default way to access Compute Engine VMs. With on-demand, you request VMs one at a time and access them immediately if the resources are available.

Batch uses on-demand for all other jobs.

Spot VMs

Use case: We recommend that you try Spot VMs to reduce costs for fault-tolerant workloads.

Note: Spot VMs might not always be available. You can follow the best practices for Spot VMs to improve resource availability. If issues persist, you might need to switch to a different provisioning method.
Details: Spot VMs provide significant discounts, but they might not always be available and can be preempted at any time. For more information, see Spot VMs in the Compute Engine documentation.

Batch uses Spot VMs for jobs that set the provisioningModel field to SPOT.
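
The selection rules above can be summarized as a small decision function. The following Python sketch is only an illustration of those rules and does not show how Batch implements them. The function name, the return labels, the `has_unused_reservation` flag, and the assumption that A3 machine type names start with `a3-` are all hypothetical.

```python
from typing import Optional


def provisioning_method(
    machine_type: str,
    reservation: Optional[str] = None,
    provisioning_model: Optional[str] = None,
    has_unused_reservation: bool = False,
) -> str:
    """Predict the provisioning method for a job, following the rules above."""
    # Jobs that set provisioningModel to SPOT use Spot VMs.
    if provisioning_model == "SPOT":
        return "spot"

    blocks_reservations = reservation == "NO_RESERVATION"

    # Assumption: A3 machine type names start with "a3-".
    # An A3 job that blocks reservations and does not use Spot VMs
    # goes through Dynamic Workload Scheduler.
    if machine_type.startswith("a3-") and blocks_reservations:
        return "dynamic-workload-scheduler"

    # Jobs that can consume an unused reservation use that reservation.
    if has_unused_reservation and not blocks_reservations:
        return "reservation"

    # All other jobs use on-demand.
    return "on-demand"


print(provisioning_method("a3-highgpu-8g", reservation="NO_RESERVATION"))
print(provisioning_method("g2-standard-4", provisioning_model="SPOT"))
print(provisioning_method("g2-standard-4"))
```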

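Step 2: Install the GPU drivers

To use GPUs for a job, you must install GPU drivers. To install GPU drivers, select one of the following methods:

Install GPU drivers automatically (recommended if possible): As shown in the examples, to let Batch fetch the required GPU drivers from a third-party location and install them on your behalf, set the job's installGpuDrivers field to true. This method is recommended if your job does not require you to install drivers manually.

Optionally, if you need to specify which version of the GPU driver Batch installs, also set the driverVersion field.
Install GPU drivers manually: This method is required if any of the following are true:

- A job uses both script and container runnables and does not have internet access. For more information about the access that a job has, see Batch networking overview.
- A job uses custom VM images. To learn more about VM OS images and which VM OS images you can use, see VM OS environment overview.
Important: Due to a known issue, you might also need to install drivers manually for jobs that specify certain Compute Engine images. For more information, see Jobs that use GPUs and VM OS images with outdated kernels fail only when automatically installing drivers.
To manually install the required GPU drivers, the following method is recommended:
1. Create a custom VM image that includes the GPU drivers.
  1. To install the GPU drivers, run an installation script based on the OS that you want to use:
    - GPU drivers for Container-Optimized OS
    - GPU drivers for other operating systems
  2. If your job has any container runnables and does not use Container-Optimized OS, you must also install the NVIDIA Container Toolkit.
2. When you create and submit a job that uses GPUs, specify the custom VM image that includes the GPU drivers and set the job's installGpuDrivers field to false (default).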
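
As a rough illustration of the conditions above, the following Python sketch decides whether a job needs manually installed drivers and derives a value for the `installGpuDrivers` field. The function and parameter names are hypothetical, and the sketch deliberately ignores the known issue with certain Compute Engine images, which you would need to check separately.

```python
def must_install_drivers_manually(
    uses_script_runnables: bool,
    uses_container_runnables: bool,
    has_internet_access: bool,
    uses_custom_image: bool,
) -> bool:
    """Return True if either condition for manual installation applies."""
    # Condition 1: both script and container runnables, and no internet access.
    offline_mixed_runnables = (
        uses_script_runnables
        and uses_container_runnables
        and not has_internet_access
    )
    # Condition 2: the job uses a custom VM image.
    return offline_mixed_runnables or uses_custom_image


def install_gpu_drivers_value(**job_traits: bool) -> bool:
    """Suggest a value for installGpuDrivers: automatic unless manual is required."""
    return not must_install_drivers_manually(**job_traits)


print(
    install_gpu_drivers_value(
        uses_script_runnables=True,
        uses_container_runnables=False,
        has_internet_access=True,
        uses_custom_image=False,
    )
)
```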

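Step 3: Define compatible VM resources

To learn about the requirements and options for defining the VM resources for a job, see Job resources.

In summary, you must do all of the following when you define the VM resources for a job that uses GPUs:

Make sure that the GPU machine type is available in the location of your job's VMs.

To learn where GPU machine types are available, see GPU availability by regions and zones in the Compute Engine documentation.
If you specify the job's machine type, make sure that the machine type has enough vCPUs and memory for the job's task requirements. Specifying the job's machine type is required whenever you create a job using the Google Cloud console, and it is recommended whenever you create a job that uses GPUs for accelerator-optimized VMs.
Make sure that you define the VM resources for the job using a valid method:
- Define VM resources directly by using the instances[].policy field (recommended if possible). This method is shown in the examples.
- Define VM resources through a template by using the instances[].instanceTemplate field. This method is required to manually install GPU drivers through a custom image. For more information, see Define job resources using a VM instance template.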
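
One way to sanity-check that a machine type has enough vCPUs and memory is to estimate how many tasks fit on one VM. The following Python sketch assumes that you look up the machine type's vCPU count and memory yourself and that each task requests CPU in milliCPU and memory in MiB, as in the Python sample later in this document. The function name is hypothetical, and the estimate ignores any system overhead on the VM.

```python
def tasks_per_vm(
    vm_vcpus: int,
    vm_memory_mib: int,
    task_cpu_milli: int,
    task_memory_mib: int,
) -> int:
    """Estimate how many tasks fit on one VM; 0 means a single task doesn't fit."""
    # 1 vCPU corresponds to 1000 milliCPU.
    by_cpu = (vm_vcpus * 1000) // task_cpu_milli
    by_memory = vm_memory_mib // task_memory_mib
    # The scarcer resource limits how many tasks fit.
    return min(by_cpu, by_memory)


# Example values only: a VM with 4 vCPUs and 16 GiB of memory,
# and tasks that each request 2 vCPUs and 16 MiB of memory.
fit = tasks_per_vm(vm_vcpus=4, vm_memory_mib=16 * 1024,
                   task_cpu_milli=2000, task_memory_mib=16)
if fit == 0:
    print("The machine type is too small for a single task.")
else:
    print(f"Roughly {fit} task(s) fit on each VM.")
```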

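Create an example job that uses GPUs

The following sections explain how to create an example job for each GPU machine type using the recommended options. Specifically, the example jobs all install GPU drivers automatically, all define VM resources directly, and either specify the provisioning method or use the default provisioning method.

Use GPUs for A3 VMs through Dynamic Workload Scheduler for Batch (Preview)
Use GPUs for accelerator-optimized VMs
Use GPUs for N1 VMs

Use GPUs for A3 VMs through Dynamic Workload Scheduler for Batch (Preview)

You can create a job that uses GPUs for A3 VMs through Dynamic Workload Scheduler by using the gcloud CLI or the Batch API.

gcloud

Create a JSON file that installs the GPU drivers, specifies a machine type from the A3 machine series, blocks reservations, and runs in a location that has the GPU machine type.

For example, to create a basic script job that uses GPUs for A3 VMs through Dynamic Workload Scheduler, create a JSON file with the following contents: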

{
    "taskGroups": [
        {
            "taskSpec": {
                "runnables": [
                    {
                        "script": {
                            "text": "echo Hello world from task ${BATCH_TASK_INDEX}."
                        }
                    }
                ]
            },
            "taskCount": 3,
            "parallelism": 1
        }
    ],
    "allocationPolicy": {
        "instances": [
            {
                "installGpuDrivers": INSTALL_GPU_DRIVERS,
                "policy": {
                    "machineType": "MACHINE_TYPE",
                    "reservation": "NO_RESERVATION"
                }
            }
        ],
        "location": {
            "allowedLocations": [
                "ALLOWED_LOCATIONS"
            ]
        }
    },
    "logsPolicy": {
        "destination": "CLOUD_LOGGING"
    }
}

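Replace the following:

INSTALL_GPU_DRIVERS: When set to true, Batch fetches the drivers required for the GPU type that you specify in the policy field from a third-party location, and Batch installs them on your behalf. If you set this field to false (default), you need to install GPU drivers manually to use any GPUs for this job.
MACHINE_TYPE: A machine type from the A3 machine series.
ALLOWED_LOCATIONS: You can optionally use the allowedLocations[] field to specify a region or specific zones in a region where the VMs for your job are allowed to run. For example, regions/us-central1 allows all zones in the region us-central1. Make sure that you specify locations that offer the GPU machine type that you want for this job. Otherwise, if you omit this field, make sure that the job's location offers the GPU machine type.

To create and run the job, use the gcloud batch jobs submit command: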
```
gcloud batch jobs submit JOB_NAME \
    --location LOCATION \
    --config JSON_CONFIGURATION_FILE
```
Replace the following:
- JOB_NAME: The name of the job.
- LOCATION: The location of the job.
- JSON_CONFIGURATION_FILE: The path to a JSON file that contains the job's configuration details.
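
If you prefer to generate the configuration file rather than write it by hand, the following Python sketch builds the same structure as the JSON example for an A3 job and writes it to a temporary file. The helper name `build_dws_job_config`, the file name, and the example values `a3-highgpu-8g` and `regions/us-central1` are assumptions for illustration; the field names come from the JSON example in this section.

```python
import json
import tempfile
from pathlib import Path
from typing import List


def build_dws_job_config(
    machine_type: str,
    allowed_locations: List[str],
    install_gpu_drivers: bool = True,
) -> dict:
    """Build a job configuration that blocks reservations for an A3 machine type."""
    return {
        "taskGroups": [
            {
                "taskSpec": {
                    "runnables": [
                        {"script": {"text": "echo Hello world from task ${BATCH_TASK_INDEX}."}}
                    ]
                },
                "taskCount": 3,
                "parallelism": 1,
            }
        ],
        "allocationPolicy": {
            "instances": [
                {
                    "installGpuDrivers": install_gpu_drivers,
                    "policy": {
                        "machineType": machine_type,
                        # Blocking reservations is one of the listed conditions
                        # for Dynamic Workload Scheduler.
                        "reservation": "NO_RESERVATION",
                    },
                }
            ],
            "location": {"allowedLocations": allowed_locations},
        },
        "logsPolicy": {"destination": "CLOUD_LOGGING"},
    }


# Example values only; replace them with your own.
config = build_dws_job_config("a3-highgpu-8g", ["regions/us-central1"])
config_path = Path(tempfile.gettempdir()) / "a3-dws-job.json"
config_path.write_text(json.dumps(config, indent=4))
print(f"gcloud batch jobs submit JOB_NAME --location LOCATION --config {config_path}")
```

The printed command is the same `gcloud batch jobs submit` command shown above, with the generated file as the configuration.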

API

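Make a POST request to the jobs.create method that installs the GPU drivers, specifies a machine type from the A3 machine series, blocks reservations, and runs in a location that has the GPU machine type.

For example, to create a basic script job that uses GPUs for A3 VMs through Dynamic Workload Scheduler, make the following request: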

POST https://batch.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/jobs?job_id=JOB_NAME

{
    "taskGroups": [
        {
            "taskSpec": {
                "runnables": [
                    {
                        "script": {
                            "text": "echo Hello world from task ${BATCH_TASK_INDEX}."
                        }
                    }
                ]
            },
            "taskCount": 3,
            "parallelism": 1
        }
    ],
    "allocationPolicy": {
        "instances": [
            {
                "installGpuDrivers": INSTALL_GPU_DRIVERS,
                "policy": {
                    "machineType": "MACHINE_TYPE",
                    "reservation": "NO_RESERVATION"
                }
            }
        ],
        "location": {
            "allowedLocations": [
                "ALLOWED_LOCATIONS"
            ]
        }
    },
    "logsPolicy": {
        "destination": "CLOUD_LOGGING"
    }
}

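Replace the following:

PROJECT_ID: The project ID of your project.
LOCATION: The location of the job.
JOB_NAME: The name of the job.
INSTALL_GPU_DRIVERS: When set to true, Batch fetches the drivers required for the GPU type that you specify in the policy field from a third-party location, and Batch installs them on your behalf. If you set this field to false (default), you need to install GPU drivers manually to use any GPUs for this job.
MACHINE_TYPE: A machine type from the A3 machine series.
ALLOWED_LOCATIONS: You can optionally use the allowedLocations[] field to specify a region or specific zones in a region where the VMs for your job are allowed to run. For example, regions/us-central1 allows all zones in the region us-central1. Make sure that you specify locations that offer the GPU machine type that you want for this job. Otherwise, if you omit this field, make sure that the job's location offers the GPU machine type.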
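
The request above can also be assembled programmatically. The following Python sketch builds the POST request with the standard library but does not send it. The helper name, the example project and job names, and the use of an OAuth 2.0 bearer token in the `Authorization` header are assumptions for illustration; the URL pattern is the one shown in this section.

```python
import json
import urllib.request


def build_create_job_request(
    project_id: str,
    location: str,
    job_name: str,
    job_body: dict,
    access_token: str,
) -> urllib.request.Request:
    """Build, but do not send, a jobs.create request."""
    url = (
        "https://batch.googleapis.com/v1/"
        f"projects/{project_id}/locations/{location}/jobs?job_id={job_name}"
    )
    return urllib.request.Request(
        url,
        data=json.dumps(job_body).encode("utf-8"),
        headers={
            # Assumption: you obtain a valid access token separately.
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder values only; the body here is intentionally minimal.
request = build_create_job_request(
    "example-project", "us-central1", "example-gpu-job",
    {"taskGroups": []}, "ACCESS_TOKEN",
)
print(request.get_method(), request.full_url)
```

To actually create the job, you would pass the complete job body and a real access token, then send the request, for example with `urllib.request.urlopen(request)`.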

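Use GPUs for accelerator-optimized VMs

You can create a job that uses GPUs for accelerator-optimized VMs by using the Google Cloud console, gcloud CLI, Batch API, Java, Node.js, or Python.

Console

To create a job that uses GPUs using the Google Cloud console, do the following:

In the Google Cloud console, go to the Job list page.

Go to Job list
Click Create. The Create batch job page opens. In the left pane, the Job details page is selected.
Configure the Job details page:
1. Optional: In the Job name field, customize the job name.

  For example, enter example-gpu-job.
2. Configure the Task details section:
  1. In the New runnable window, add at least one script or container for this job to run.

    For example, to create a basic script job, do the following:
    1. Select the Script checkbox. A field appears.
    2. In the field, enter the following script:
      echo Hello world from task ${BATCH_TASK_INDEX}.
    3. Click Done.
  2. In the Task count field, enter the number of tasks for this job.

    For example, enter 3.
  3. Optional: In the Parallelism field, enter the number of tasks to run concurrently.

    For example, enter 1 (default).
Configure the Resource specifications page:
1. In the left pane, click Resource specifications. The Resource specifications page opens.
2. Optional: In the VM provisioning model section, select one of the following provisioning model options for this job's VMs:
  - If your job can withstand preemption and you want discounted VMs, select Spot.
  - Otherwise, select Standard (default).
3. Select the location for this job.
  1. In the Region field, select a region.
  2. In the Zone field, do one of the following:
    - If you want to restrict this job to run in a specific zone only, select a zone.
    - Otherwise, select any (default).
  Important: Make sure to only specify locations that offer the GPU machine type that you want for this job.
4. Select the GPU machine type for this job's VMs:
  1. In the machine family options, click GPUs.
  2. In the GPU type field, select the type of GPUs. Then, in the Number of GPUs field, select the number of GPUs for each VM.

    If you selected one of the GPU types for accelerator-optimized VMs, the Machine type field allows only one machine type, based on the type and number of GPUs that you selected.
  3. To automatically install GPU drivers, select GPU driver installation (default).
5. Configure the amount of VM resources required for each task:

  Important: Make sure that the GPU machine type has enough VM resources to support the job's task requirements.
  1. In the Cores field, enter the number of vCPUs per task.

    For example, enter 1 (default).
  2. In the Memory field, enter the amount of RAM in GB per task.

    For example, enter 0.5 (default).
6. Click Done.
Optional: Configure the other fields for this job.
Optional: To review the job configuration, in the left pane, click Preview.
Click Create.

The Job details page displays the job that you created.

gcloud

Create a JSON file that installs the GPU drivers, specifies a machine type from the accelerator-optimized machine family, and runs in a location that has the GPU machine type.

For example, to create a basic script job that uses GPUs for accelerator-optimized VMs, create a JSON file with the following contents: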

{
    "taskGroups": [
        {
            "taskSpec": {
                "runnables": [
                    {
                        "script": {
                            "text": "echo Hello world from task ${BATCH_TASK_INDEX}."
                        }
                    }
                ]
            },
            "taskCount": 3,
            "parallelism": 1
        }
    ],
    "allocationPolicy": {
        "instances": [
            {
                "installGpuDrivers": INSTALL_GPU_DRIVERS,
                "policy": {
                    "machineType": "MACHINE_TYPE"
                }
            }
        ],
        "location": {
            "allowedLocations": [
                "ALLOWED_LOCATIONS"
            ]
        }
    },
    "logsPolicy": {
        "destination": "CLOUD_LOGGING"
    }
}

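Replace the following:

INSTALL_GPU_DRIVERS: When set to true, Batch fetches the drivers required for the GPU type that you specify in the policy field from a third-party location, and Batch installs them on your behalf. If you set this field to false (default), you need to install GPU drivers manually to use any GPUs for this job.
MACHINE_TYPE: A machine type from the accelerator-optimized machine family.
ALLOWED_LOCATIONS: You can optionally use the allowedLocations[] field to specify a region or specific zones in a region where the VMs for your job are allowed to run. For example, regions/us-central1 allows all zones in the region us-central1. Make sure that you specify locations that offer the GPU machine type that you want for this job. Otherwise, if you omit this field, make sure that the job's location offers the GPU machine type.

To create and run the job, use the gcloud batch jobs submit command: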
```
gcloud batch jobs submit JOB_NAME \
    --location LOCATION \
    --config JSON_CONFIGURATION_FILE
```
Replace the following:
- JOB_NAME: The name of the job.
- LOCATION: The location of the job.
- JSON_CONFIGURATION_FILE: The path to a JSON file that contains the job's configuration details.

API

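Make a POST request to the jobs.create method that installs the GPU drivers, specifies a machine type from the accelerator-optimized machine family, and runs in a location that has the GPU machine type.

For example, to create a basic script job that uses GPUs for accelerator-optimized VMs, make the following request: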

POST https://batch.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/jobs?job_id=JOB_NAME

{
    "taskGroups": [
        {
            "taskSpec": {
                "runnables": [
                    {
                        "script": {
                            "text": "echo Hello world from task ${BATCH_TASK_INDEX}."
                        }
                    }
                ]
            },
            "taskCount": 3,
            "parallelism": 1
        }
    ],
    "allocationPolicy": {
        "instances": [
            {
                "installGpuDrivers": INSTALL_GPU_DRIVERS,
                "policy": {
                    "machineType": "MACHINE_TYPE"
                }
            }
        ],
        "location": {
            "allowedLocations": [
                "ALLOWED_LOCATIONS"
            ]
        }
    },
    "logsPolicy": {
        "destination": "CLOUD_LOGGING"
    }
}

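Replace the following:

PROJECT_ID: The project ID of your project.
LOCATION: The location of the job.
JOB_NAME: The name of the job.
INSTALL_GPU_DRIVERS: When set to true, Batch fetches the drivers required for the GPU type that you specify in the policy field from a third-party location, and Batch installs them on your behalf. If you set this field to false (default), you need to install GPU drivers manually to use any GPUs for this job.
MACHINE_TYPE: A machine type from the accelerator-optimized machine family.
ALLOWED_LOCATIONS: You can optionally use the allowedLocations[] field to specify a region or specific zones in a region where the VMs for your job are allowed to run. For example, regions/us-central1 allows all zones in the region us-central1. Make sure that you specify locations that offer the GPU machine type that you want for this job. Otherwise, if you omit this field, make sure that the job's location offers the GPU machine type.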

Java


import com.google.cloud.batch.v1.AllocationPolicy;
import com.google.cloud.batch.v1.AllocationPolicy.Accelerator;
import com.google.cloud.batch.v1.AllocationPolicy.InstancePolicy;
import com.google.cloud.batch.v1.AllocationPolicy.InstancePolicyOrTemplate;
import com.google.cloud.batch.v1.BatchServiceClient;
import com.google.cloud.batch.v1.CreateJobRequest;
import com.google.cloud.batch.v1.Job;
import com.google.cloud.batch.v1.LogsPolicy;
import com.google.cloud.batch.v1.Runnable;
import com.google.cloud.batch.v1.Runnable.Script;
import com.google.cloud.batch.v1.TaskGroup;
import com.google.cloud.batch.v1.TaskSpec;
import com.google.protobuf.Duration;
import java.io.IOException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class CreateGpuJob {

  public static void main(String[] args)
      throws IOException, ExecutionException, InterruptedException, TimeoutException {
    // TODO(developer): Replace these variables before running the sample.
    // Project ID or project number of the Google Cloud project you want to use.
    String projectId = "YOUR_PROJECT_ID";
    // Name of the region you want to use to run the job. Regions that are
    // available for Batch are listed on: https://cloud.google.com/batch/docs/get-started#locations
    String region = "europe-central2";
    // The name of the job that will be created.
    // It needs to be unique for each project and region pair.
    String jobName = "JOB_NAME";
    // Optional. When set to true, Batch fetches the drivers required for the GPU type
    // that you specify in the policy field from a third-party location,
    // and Batch installs them on your behalf. If you set this field to false (default),
    // you need to install GPU drivers manually to use any GPUs for this job.
    boolean installGpuDrivers = false;
    // Accelerator-optimized machine types are available to Batch jobs. See the list
    // of available types on: https://cloud.google.com/compute/docs/accelerator-optimized-machines
    String machineType = "g2-standard-4";

    createGpuJob(projectId, region, jobName, installGpuDrivers, machineType);
  }

  // Create a job that uses GPUs
  public static Job createGpuJob(String projectId, String region, String jobName,
                                  boolean installGpuDrivers, String machineType)
      throws IOException, ExecutionException, InterruptedException, TimeoutException {
    // Initialize client that will be used to send requests. This client only needs to be created
    // once, and can be reused for multiple requests.
    try (BatchServiceClient batchServiceClient = BatchServiceClient.create()) {
      // Define what will be done as part of the job.
      Runnable runnable =
          Runnable.newBuilder()
              .setScript(
                  Script.newBuilder()
                      .setText(
                          "echo Hello world! This is task ${BATCH_TASK_INDEX}. "
                                  + "This job has a total of ${BATCH_TASK_COUNT} tasks.")
                      // You can also run a script from a file. Just remember, that needs to be a
                      // script that's already on the VM that will be running the job.
                      // Using setText() and setPath() is mutually exclusive.
                      // .setPath("/tmp/test.sh")
                      .build())
              .build();

      TaskSpec task = TaskSpec.newBuilder()
                  // Jobs can be divided into tasks. In this case, we have only one task.
                  .addRunnables(runnable)
                  .setMaxRetryCount(2)
                  .setMaxRunDuration(Duration.newBuilder().setSeconds(3600).build())
                  .build();

      // Tasks are grouped inside a job using TaskGroups.
      // Currently, it's possible to have only one task group.
      TaskGroup taskGroup = TaskGroup.newBuilder()
          .setTaskCount(3)
          .setParallelism(1)
          .setTaskSpec(task)
          .build();

      // Policies are used to define on what kind of virtual machines the tasks will run.
      // Read more about machine types here: https://cloud.google.com/compute/docs/machine-types
      InstancePolicy instancePolicy =
          InstancePolicy.newBuilder().setMachineType(machineType).build();  

      // Policies are used to define on what kind of virtual machines the tasks will run on.
      AllocationPolicy allocationPolicy =
          AllocationPolicy.newBuilder()
              .addInstances(
                  InstancePolicyOrTemplate.newBuilder()
                      .setInstallGpuDrivers(installGpuDrivers)
                      .setPolicy(instancePolicy)
                      .build())
              .build();

      Job job =
          Job.newBuilder()
              .addTaskGroups(taskGroup)
              .setAllocationPolicy(allocationPolicy)
              .putLabels("env", "testing")
              .putLabels("type", "script")
              // We use Cloud Logging as it's an out of the box available option.
              .setLogsPolicy(
                  LogsPolicy.newBuilder().setDestination(LogsPolicy.Destination.CLOUD_LOGGING))
              .build();

      CreateJobRequest createJobRequest =
          CreateJobRequest.newBuilder()
              // The job's parent is the region in which the job will run.
              .setParent(String.format("projects/%s/locations/%s", projectId, region))
              .setJob(job)
              .setJobId(jobName)
              .build();

      Job result =
          batchServiceClient
              .createJobCallable()
              .futureCall(createJobRequest)
              .get(5, TimeUnit.MINUTES);

      System.out.printf("Successfully created the job: %s", result.getName());

      return result;
    }
  }
}

Node.js

// Imports the Batch library
const batchLib = require('@google-cloud/batch');
const batch = batchLib.protos.google.cloud.batch.v1;

// The sample runs inside an async function because CommonJS modules
// (which use require) do not support top-level await.
async function main() {
  // Instantiates a client
  const batchClient = new batchLib.v1.BatchServiceClient();

  /**
   * TODO(developer): Update these variables before running the sample.
   */
  // Project ID or project number of the Google Cloud project you want to use.
  const projectId = await batchClient.getProjectId();
  // Name of the region you want to use to run the job. Regions that are
  // available for Batch are listed on: https://cloud.google.com/batch/docs/get-started#locations
  const region = 'europe-central2';
  // The name of the job that will be created.
  // It needs to be unique for each project and region pair.
  const jobName = 'batch-gpu-job';
  // The GPU type. You can view a list of the available GPU types
  // by using the `gcloud compute accelerator-types list` command.
  const gpuType = 'nvidia-l4';
  // The number of GPUs of the specified type.
  const gpuCount = 1;
  // Optional. When set to true, Batch fetches the drivers required for the GPU type
  // that you specify in the policy field from a third-party location,
  // and Batch installs them on your behalf. If you set this field to false (default),
  // you need to install GPU drivers manually to use any GPUs for this job.
  const installGpuDrivers = false;
  // Accelerator-optimized machine types are available to Batch jobs. See the list
  // of available types on: https://cloud.google.com/compute/docs/accelerator-optimized-machines
  const machineType = 'g2-standard-4';

  // Define what will be done as part of the job.
  // A script runnable takes its inline script in the `text` field.
  const runnable = new batch.Runnable({
    script: new batch.Runnable.Script({
      text: 'echo Hello world! This is task ${BATCH_TASK_INDEX}.',
    }),
  });

  const task = new batch.TaskSpec({
    runnables: [runnable],
    maxRetryCount: 2,
    maxRunDuration: {seconds: 3600},
  });

  // Tasks are grouped inside a job using TaskGroups.
  const group = new batch.TaskGroup({
    taskCount: 3,
    taskSpec: task,
  });

  // Policies are used to define on what kind of virtual machines the tasks will run on.
  // In this case, we tell the system to use "g2-standard-4" machine type.
  // Read more about machine types here: https://cloud.google.com/compute/docs/machine-types
  const instancePolicy = new batch.AllocationPolicy.InstancePolicy({
    machineType,
    // Accelerator describes Compute Engine accelerators to be attached to the VM
    accelerators: [
      new batch.AllocationPolicy.Accelerator({
        type: gpuType,
        count: gpuCount,
      }),
    ],
  });

  // Driver installation is configured on the instances[] entry.
  const allocationPolicy = new batch.AllocationPolicy({
    instances: [{installGpuDrivers, policy: instancePolicy}],
  });

  const job = new batch.Job({
    name: jobName,
    taskGroups: [group],
    labels: {env: 'testing', type: 'script'},
    allocationPolicy,
    // We use Cloud Logging as it's an option available out of the box
    logsPolicy: new batch.LogsPolicy({
      destination: batch.LogsPolicy.Destination.CLOUD_LOGGING,
    }),
  });
  // The job's parent is the project and region in which the job will run
  const parent = `projects/${projectId}/locations/${region}`;

  async function callCreateBatchGPUJob() {
    // Construct request
    const request = {
      parent,
      jobId: jobName,
      job,
    };

    // Run request
    const [response] = await batchClient.createJob(request);
    console.log(JSON.stringify(response));
  }

  await callCreateBatchGPUJob();
}

main().catch(err => {
  console.error(err.message);
  process.exitCode = 1;
});

Python

from google.cloud import batch_v1


def create_gpu_job(project_id: str, region: str, job_name: str) -> batch_v1.Job:
    """
    This method shows how to create a sample Batch Job that will run
    a simple command on Cloud Compute instances on GPU machines.

    Args:
        project_id: project ID or project number of the Cloud project you want to use.
        region: name of the region you want to use to run the job. Regions that are
            available for Batch are listed on: https://cloud.google.com/batch/docs/get-started#locations
        job_name: the name of the job that will be created.
            It needs to be unique for each project and region pair.

    Returns:
        A job object representing the job created.
    """
    client = batch_v1.BatchServiceClient()

    # Define what will be done as part of the job.
    task = batch_v1.TaskSpec()
    runnable = batch_v1.Runnable()
    runnable.script = batch_v1.Runnable.Script()
    runnable.script.text = "echo Hello world! This is task ${BATCH_TASK_INDEX}. This job has a total of ${BATCH_TASK_COUNT} tasks."
    # You can also run a script from a file. Just remember, that needs to be a script that's
    # already on the VM that will be running the job. Using runnable.script.text and runnable.script.path is mutually
    # exclusive.
    # runnable.script.path = '/tmp/test.sh'
    task.runnables = [runnable]

    # We can specify what resources are requested by each task.
    resources = batch_v1.ComputeResource()
    resources.cpu_milli = 2000  # in milliseconds per cpu-second. This means the task requires 2 whole CPUs.
    resources.memory_mib = 16  # in MiB
    task.compute_resource = resources

    task.max_retry_count = 2
    task.max_run_duration = "3600s"

    # Tasks are grouped inside a job using TaskGroups.
    # Currently, it's possible to have only one task group.
    group = batch_v1.TaskGroup()
    group.task_count = 4
    group.task_spec = task

    # Policies are used to define on what kind of virtual machines the tasks will run on.
    # In this case, we tell the system to use "g2-standard-4" machine type.
    # Read more about machine types here: https://cloud.google.com/compute/docs/machine-types
    policy = batch_v1.AllocationPolicy.InstancePolicy()
    policy.machine_type = "g2-standard-4"

    instances = batch_v1.AllocationPolicy.InstancePolicyOrTemplate()
    instances.policy = policy
    instances.install_gpu_drivers = True
    allocation_policy = batch_v1.AllocationPolicy()
    allocation_policy.instances = [instances]

    job = batch_v1.Job()
    job.task_groups = [group]
    job.allocation_policy = allocation_policy
    job.labels = {"env": "testing", "type": "script"}
    # We use Cloud Logging as it's an out of the box available option
    job.logs_policy = batch_v1.LogsPolicy()
    job.logs_policy.destination = batch_v1.LogsPolicy.Destination.CLOUD_LOGGING

    create_request = batch_v1.CreateJobRequest()
    create_request.job = job
    create_request.job_id = job_name
    # The job's parent is the region in which the job will run
    create_request.parent = f"projects/{project_id}/locations/{region}"

    return client.create_job(create_request)

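Use GPUs for N1 VMs

You can create a job that uses GPUs for N1 VMs by using the Google Cloud console, gcloud CLI, Batch API, Java, Node.js, or Python.

Console

To create a job that uses GPUs using the Google Cloud console, do the following:

In the Google Cloud console, go to the Job list page.

Go to Job list
Click Create. The Create batch job page opens. In the left pane, the Job details page is selected.
Configure the Job details page:
1. Optional: In the Job name field, customize the job name.

  For example, enter example-gpu-job.
2. Configure the Task details section:
  1. In the New runnable window, add at least one script or container for this job to run.

    For example, to create a basic script job, do the following:
    1. Select the Script checkbox. A field appears.
    2. In the field, enter the following script:
      echo Hello world from task ${BATCH_TASK_INDEX}.
    3. Click Done.
  2. In the Task count field, enter the number of tasks for this job.

    For example, enter 3.
  3. Optional: In the Parallelism field, enter the number of tasks to run concurrently.

    For example, enter 1 (default).
Configure the Resource specifications page:
1. In the left pane, click Resource specifications. The Resource specifications page opens.
2. Optional: In the VM provisioning model section, select one of the following provisioning model options for this job's VMs:
  - If your job can withstand preemption and you want discounted VMs, select Spot.
  - Otherwise, select Standard (default).
3. Select the location for this job.
  1. In the Region field, select a region.
  2. In the Zone field, do one of the following:
    - If you want to restrict this job to run in a specific zone only, select a zone.
    - Otherwise, select any (default).
  Important: Make sure to only specify locations that offer the GPU machine type that you want for this job.
4. Select the GPU machine type for this job's VMs:
  1. In the machine family options, click GPUs.
  2. In the GPU type field, select the type of GPUs.

    If you selected one of the GPU types for N1 VMs, the Series field is set to N1.
  3. In the Number of GPUs field, select the number of GPUs for each VM.
  4. In the Machine type field, select the machine type.
  5. To automatically install GPU drivers, select GPU driver installation (default).
5. Configure the amount of VM resources required for each task:

  Important: Make sure that the GPU machine type has enough VM resources to support the job's task requirements.
  1. In the Cores field, enter the number of vCPUs per task.

    For example, enter 1 (default).
  2. In the Memory field, enter the amount of RAM in GB per task.

    For example, enter 0.5 (default).
6. Click Done.
Optional: Configure the other fields for this job.
Optional: To review the job configuration, in the left pane, click Preview.
Click Create.

The Job details page displays the job that you created.

gcloud

Create a JSON file that installs the GPU drivers, defines the type and count subfields of the accelerators[] field, and runs in a location that has the GPU machine type.

For example, to create a basic script job that uses GPUs for N1 VMs and lets Batch select the exact N1 machine type, create a JSON file with the following contents: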

{
    "taskGroups": [
        {
            "taskSpec": {
                "runnables": [
                    {
                        "script": {
                            "text": "echo Hello world from task ${BATCH_TASK_INDEX}."
                        }
                    }
                ]
            },
            "taskCount": 3,
            "parallelism": 1
        }
    ],
    "allocationPolicy": {
        "instances": [
            {
                "installGpuDrivers": INSTALL_GPU_DRIVERS,
                "policy": {
                    "accelerators": [
                        {
                            "type": "GPU_TYPE",
                            "count": GPU_COUNT
                        }
                    ]
                }
            }
        ],
        "location": {
            "allowedLocations": [
                "ALLOWED_LOCATIONS"
            ]
        }
    },
    "logsPolicy": {
        "destination": "CLOUD_LOGGING"
    }
}

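Replace the following:

INSTALL_GPU_DRIVERS: When set to true, Batch fetches the drivers required for the GPU type that you specify in the policy field from a third-party location, and Batch installs them on your behalf. If you set this field to false (default), you need to install GPU drivers manually to use any GPUs for this job.
GPU_TYPE: The GPU type. You can view a list of the available GPU types by using the gcloud compute accelerator-types list command. Only use this field for GPUs for N1 VMs.
GPU_COUNT: The number of GPUs of the specified type. For more information about the valid options, see GPU machine types for the N1 machine series. Only use this field for GPUs for N1 VMs.
ALLOWED_LOCATIONS: You can optionally use the allowedLocations[] field to specify a region or specific zones in a region where the VMs for your job are allowed to run. For example, regions/us-central1 allows all zones in the region us-central1. Make sure that you specify locations that offer the GPU machine type that you want for this job. Otherwise, if you omit this field, make sure that the job's location offers the GPU machine type.

To create and run the job, use the gcloud batch jobs submit command: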
```
gcloud batch jobs submit JOB_NAME \
    --location LOCATION \
    --config JSON_CONFIGURATION_FILE
```
Replace the following:
- JOB_NAME: The name of the job.
- LOCATION: The location of the job.
- JSON_CONFIGURATION_FILE: The path to a JSON file that contains the job's configuration details.
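
Because the GPU type and count must match a valid option for N1 machine types, it can help to validate the combination before you write the configuration file. The following Python sketch builds the instances[] entry from the JSON example for N1 VMs and checks the count against a small lookup table. The helper name and the table are assumptions for illustration: the table is a partial subset that you should verify against the GPU machine types page in the Compute Engine documentation.

```python
# Assumption: an illustrative, partial table of valid GPU counts per VM.
# Verify these values against the Compute Engine documentation.
VALID_N1_GPU_COUNTS = {
    "nvidia-tesla-t4": (1, 2, 4),
    "nvidia-tesla-p4": (1, 2, 4),
    "nvidia-tesla-v100": (1, 2, 4, 8),
}


def build_n1_instances_entry(
    gpu_type: str, gpu_count: int, install_gpu_drivers: bool = True
) -> dict:
    """Build one allocationPolicy.instances[] entry for GPUs on N1 VMs."""
    valid_counts = VALID_N1_GPU_COUNTS.get(gpu_type)
    if valid_counts is None:
        raise ValueError(f"GPU type not in the illustrative table: {gpu_type}")
    if gpu_count not in valid_counts:
        raise ValueError(
            f"{gpu_count} is not a listed count for {gpu_type}: {valid_counts}"
        )
    return {
        "installGpuDrivers": install_gpu_drivers,
        # No machineType is set, so Batch can select the exact N1 machine type.
        "policy": {"accelerators": [{"type": gpu_type, "count": gpu_count}]},
    }


entry = build_n1_instances_entry("nvidia-tesla-t4", 2)
print(entry)
```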

API

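Make a POST request to the jobs.create method that installs the GPU drivers, defines the type and count subfields of the accelerators[] field, and uses a location that has the GPU machine type.

For example, to create a basic script job that uses GPUs for N1 VMs and lets Batch select the exact N1 machine type, make the following request: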

POST https://batch.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/jobs?job_id=JOB_NAME

{
    "taskGroups": [
        {
            "taskSpec": {
                "runnables": [
                    {
                        "script": {
                            "text": "echo Hello world from task ${BATCH_TASK_INDEX}."
                        }
                    }
                ]
            },
            "taskCount": 3,
            "parallelism": 1
        }
    ],
    "allocationPolicy": {
        "instances": [
            {
                "installGpuDrivers": INSTALL_GPU_DRIVERS,
                "policy": {
                    "accelerators": [
                        {
                            "type": "GPU_TYPE",
                            "count": GPU_COUNT
                        }
                    ]
                }
            }
        ],
        "location": {
            "allowedLocations": [
                "ALLOWED_LOCATIONS"
            ]
        }
    },
    "logsPolicy": {
        "destination": "CLOUD_LOGGING"
    }
}

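Replace the following:

PROJECT_ID: The project ID of your project.
LOCATION: The location of the job.
JOB_NAME: The name of the job.
INSTALL_GPU_DRIVERS: When set to true, Batch fetches the drivers required for the GPU type that you specify in the policy field from a third-party location, and Batch installs them on your behalf. If you set this field to false (default), you need to install GPU drivers manually to use any GPUs for this job.
GPU_TYPE: The GPU type. You can view a list of the available GPU types by using the gcloud compute accelerator-types list command. Only use this field for GPUs for N1 VMs.
GPU_COUNT: The number of GPUs of the specified type. For more information about the valid options, see GPU machine types for the N1 machine series. Only use this field for GPUs for N1 VMs.
ALLOWED_LOCATIONS: You can optionally use the allowedLocations[] field to specify a region or specific zones in a region where the VMs for your job are allowed to run. For example, regions/us-central1 allows all zones in the region us-central1. Make sure that you specify locations that offer the GPU machine type that you want for this job. Otherwise, if you omit this field, make sure that the job's location offers the GPU machine type.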

Java


import com.google.cloud.batch.v1.AllocationPolicy;
import com.google.cloud.batch.v1.AllocationPolicy.Accelerator;
import com.google.cloud.batch.v1.AllocationPolicy.InstancePolicy;
import com.google.cloud.batch.v1.AllocationPolicy.InstancePolicyOrTemplate;
import com.google.cloud.batch.v1.BatchServiceClient;
import com.google.cloud.batch.v1.CreateJobRequest;
import com.google.cloud.batch.v1.Job;
import com.google.cloud.batch.v1.LogsPolicy;
import com.google.cloud.batch.v1.Runnable;
import com.google.cloud.batch.v1.Runnable.Script;
import com.google.cloud.batch.v1.TaskGroup;
import com.google.cloud.batch.v1.TaskSpec;
import com.google.protobuf.Duration;
import java.io.IOException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class CreateGpuJobN1 {

  public static void main(String[] args)
      throws IOException, ExecutionException, InterruptedException, TimeoutException {
    // TODO(developer): Replace these variables before running the sample.
    // Project ID or project number of the Google Cloud project you want to use.
    String projectId = "YOUR_PROJECT_ID";
    // Name of the region you want to use to run the job. Regions that are
    // available for Batch are listed on: https://cloud.google.com/batch/docs/get-started#locations
    String region = "europe-central2";
    // The name of the job that will be created.
    // It needs to be unique for each project and region pair.
    String jobName = "JOB_NAME";
    // Optional. When set to true, Batch fetches the drivers required for the GPU type
    // that you specify in the policy field from a third-party location,
    // and Batch installs them on your behalf. If you set this field to false (default),
    // you need to install GPU drivers manually to use any GPUs for this job.
    boolean installGpuDrivers = false;
    // The GPU type. You can view a list of the available GPU types
    // by using the `gcloud compute accelerator-types list` command.
    String gpuType = "nvidia-tesla-t4";
    // The number of GPUs of the specified type.
    int gpuCount = 2;

    createGpuJob(projectId, region, jobName, installGpuDrivers, gpuType, gpuCount);
  }

  // Create a job that uses GPUs
  public static Job createGpuJob(String projectId, String region, String jobName,
                                  boolean installGpuDrivers, String gpuType, int gpuCount)
      throws IOException, ExecutionException, InterruptedException, TimeoutException {
    // Initialize client that will be used to send requests. This client only needs to be created
    // once, and can be reused for multiple requests.
    try (BatchServiceClient batchServiceClient = BatchServiceClient.create()) {
      // Define what will be done as part of the job.
      Runnable runnable =
          Runnable.newBuilder()
              .setScript(
                  Script.newBuilder()
                      .setText(
                          "echo Hello world! This is task ${BATCH_TASK_INDEX}. "
                                  + "This job has a total of ${BATCH_TASK_COUNT} tasks.")
                      // You can also run a script from a file. Just remember, that needs to be a
                      // script that's already on the VM that will be running the job.
                      // Using setText() and setPath() is mutually exclusive.
                      // .setPath("/tmp/test.sh")
                      .build())
              .build();

      TaskSpec task = TaskSpec.newBuilder()
                  // Jobs can be divided into tasks. Every task in this job runs the same runnable.
                  .addRunnables(runnable)
                  .setMaxRetryCount(2)
                  .setMaxRunDuration(Duration.newBuilder().setSeconds(3600).build())
                  .build();

      // Tasks are grouped inside a job using TaskGroups.
      // Currently, it's possible to have only one task group.
      TaskGroup taskGroup = TaskGroup.newBuilder()
          .setTaskCount(3)
          .setParallelism(1)
          .setTaskSpec(task)
          .build();

      // Accelerator describes Compute Engine accelerators to be attached to the VM.
      Accelerator accelerator = Accelerator.newBuilder()
          .setType(gpuType)
          .setCount(gpuCount)
          .build();

      // Policies are used to define on what kind of virtual machines the tasks will run on.
      AllocationPolicy allocationPolicy =
          AllocationPolicy.newBuilder()
              .addInstances(
                  InstancePolicyOrTemplate.newBuilder()
                      .setInstallGpuDrivers(installGpuDrivers)
                      .setPolicy(InstancePolicy.newBuilder().addAccelerators(accelerator))
                      .build())
              .build();

      Job job =
          Job.newBuilder()
              .addTaskGroups(taskGroup)
              .setAllocationPolicy(allocationPolicy)
              .putLabels("env", "testing")
              .putLabels("type", "script")
              // We use Cloud Logging as it's an out of the box available option.
              .setLogsPolicy(
                  LogsPolicy.newBuilder().setDestination(LogsPolicy.Destination.CLOUD_LOGGING))
              .build();

      CreateJobRequest createJobRequest =
          CreateJobRequest.newBuilder()
              // The job's parent is the region in which the job will run.
              .setParent(String.format("projects/%s/locations/%s", projectId, region))
              .setJob(job)
              .setJobId(jobName)
              .build();

      Job result =
          batchServiceClient
              .createJobCallable()
              .futureCall(createJobRequest)
              .get(5, TimeUnit.MINUTES);

      System.out.printf("Successfully created the job: %s", result.getName());

      return result;
    }
  }
}

Node.js

// Imports the Batch library
const batchLib = require('@google-cloud/batch');
const batch = batchLib.protos.google.cloud.batch.v1;

// Instantiates a client
const batchClient = new batchLib.v1.BatchServiceClient();

/**
 * TODO(developer): Update these variables before running the sample.
 */
// Project ID or project number of the Google Cloud project you want to use.
const projectId = await batchClient.getProjectId();
// Name of the region you want to use to run the job. Regions that are
// available for Batch are listed on: https://cloud.google.com/batch/docs/get-started#locations
const region = 'europe-central2';
// The name of the job that will be created.
// It needs to be unique for each project and region pair.
const jobName = 'batch-gpu-job-n1';
// The GPU type. You can view a list of the available GPU types
// by using the `gcloud compute accelerator-types list` command.
const gpuType = 'nvidia-tesla-t4';
// The number of GPUs of the specified type.
const gpuCount = 1;
// Optional. When set to true, Batch fetches the drivers required for the GPU type
// that you specify in the policy field from a third-party location,
// and Batch installs them on your behalf. If you set this field to false (default),
// you need to install GPU drivers manually to use any GPUs for this job.
const installGpuDrivers = false;
// The N1 machine type to use. GPUs that you specify by type and count must be
// attached to VMs from the N1 machine family. See: https://cloud.google.com/compute/docs/gpus
const machineType = 'n1-standard-16';

// Define what will be done as part of the job.
const runnable = new batch.Runnable({
  script: new batch.Runnable.Script({
    text: 'echo Hello world! This is task ${BATCH_TASK_INDEX}.',
  }),
});

const task = new batch.TaskSpec({
  runnables: [runnable],
  maxRetryCount: 2,
  maxRunDuration: {seconds: 3600},
});

// Tasks are grouped inside a job using TaskGroups.
const group = new batch.TaskGroup({
  taskCount: 3,
  taskSpec: task,
});

// Policies are used to define on what kind of virtual machines the tasks will run on.
// In this case, we tell the system to use the "n1-standard-16" machine type.
// Read more about machine types here: https://cloud.google.com/compute/docs/machine-types
const instancePolicy = new batch.AllocationPolicy.InstancePolicy({
  machineType,
  // Accelerator describes Compute Engine accelerators to be attached to the VM
  accelerators: [
    new batch.AllocationPolicy.Accelerator({
      type: gpuType,
      count: gpuCount,
    }),
  ],
});

const allocationPolicy = new batch.AllocationPolicy({
  instances: [{installGpuDrivers, policy: instancePolicy}],
});

const job = new batch.Job({
  name: jobName,
  taskGroups: [group],
  labels: {env: 'testing', type: 'script'},
  allocationPolicy,
  // We use Cloud Logging as it's an option available out of the box
  logsPolicy: new batch.LogsPolicy({
    destination: batch.LogsPolicy.Destination.CLOUD_LOGGING,
  }),
});
// The job's parent is the project and region in which the job will run
const parent = `projects/${projectId}/locations/${region}`;

async function callCreateBatchGPUJobN1() {
  // Construct request
  const request = {
    parent,
    jobId: jobName,
    job,
  };

  // Run request
  const [response] = await batchClient.createJob(request);
  console.log(JSON.stringify(response));
}

await callCreateBatchGPUJobN1();

Python

from google.cloud import batch_v1


def create_gpu_job(
    project_id: str, region: str, zone: str, job_name: str
) -> batch_v1.Job:
    """
    This method shows how to create a sample Batch Job that will run
    a simple command on Compute Engine VMs that have GPUs attached.

    Args:
        project_id: project ID or project number of the Cloud project you want to use.
        region: name of the region you want to use to run the job. Regions that are
            available for Batch are listed on: https://cloud.google.com/batch/docs/get-started#locations
        zone: name of the zone you want to use to run the job. Important in regard to GPUs availability.
            GPUs availability can be found here: https://cloud.google.com/compute/docs/gpus/gpu-regions-zones
        job_name: the name of the job that will be created.
            It needs to be unique for each project and region pair.

    Returns:
        A job object representing the job created.
    """
    client = batch_v1.BatchServiceClient()

    # Define what will be done as part of the job.
    task = batch_v1.TaskSpec()
    runnable = batch_v1.Runnable()
    runnable.script = batch_v1.Runnable.Script()
    runnable.script.text = "echo Hello world! This is task ${BATCH_TASK_INDEX}. This job has a total of ${BATCH_TASK_COUNT} tasks."
    # You can also run a script from a file. Just remember, that needs to be a script that's
    # already on the VM that will be running the job. Using runnable.script.text and runnable.script.path is mutually
    # exclusive.
    # runnable.script.path = '/tmp/test.sh'
    task.runnables = [runnable]

    # We can specify what resources are requested by each task.
    resources = batch_v1.ComputeResource()
    resources.cpu_milli = 2000  # in milli-CPU units (1000 = 1 vCPU), so each task requires 2 whole vCPUs.
    resources.memory_mib = 16  # in MiB
    task.compute_resource = resources

    task.max_retry_count = 2
    task.max_run_duration = "3600s"

    # Tasks are grouped inside a job using TaskGroups.
    # Currently, it's possible to have only one task group.
    group = batch_v1.TaskGroup()
    group.task_count = 4
    group.task_spec = task

    # Policies are used to define on what kind of virtual machines the tasks will run on.
    # Read more about machine types here: https://cloud.google.com/compute/docs/machine-types
    policy = batch_v1.AllocationPolicy.InstancePolicy()
    policy.machine_type = "n1-standard-16"

    accelerator = batch_v1.AllocationPolicy.Accelerator()
    # Note: not every accelerator is compatible with instance type
    # Read more here: https://cloud.google.com/compute/docs/gpus#t4-gpus
    accelerator.type_ = "nvidia-tesla-t4"
    accelerator.count = 1

    policy.accelerators = [accelerator]
    instances = batch_v1.AllocationPolicy.InstancePolicyOrTemplate()
    instances.policy = policy
    instances.install_gpu_drivers = True
    allocation_policy = batch_v1.AllocationPolicy()
    allocation_policy.instances = [instances]

    location = batch_v1.AllocationPolicy.LocationPolicy()
    location.allowed_locations = [f"zones/{zone}"]
    allocation_policy.location = location

    job = batch_v1.Job()
    job.task_groups = [group]
    job.allocation_policy = allocation_policy
    job.labels = {"env": "testing", "type": "script"}
    # We use Cloud Logging as it's an out of the box available option
    job.logs_policy = batch_v1.LogsPolicy()
    job.logs_policy.destination = batch_v1.LogsPolicy.Destination.CLOUD_LOGGING

    create_request = batch_v1.CreateJobRequest()
    create_request.job = job
    create_request.job_id = job_name
    # The job's parent is the region in which the job will run
    create_request.parent = f"projects/{project_id}/locations/{region}"

    return client.create_job(create_request)
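Two values in the sample are easy to get wrong: cpu_milli is expressed in thousandths of a vCPU, and each allowed location needs a regions/ or zones/ prefix. The following sketch shows one way to derive both. The helper names vcpus_to_cpu_milli and allowed_location are hypothetical, and the pattern used to tell a zone from a region (a zone name ends with a dash and a single letter) is an assumption about the usual naming format, not an official rule.

```python
import re


def vcpus_to_cpu_milli(vcpus: float) -> int:
    """Convert a vCPU count to milli-CPU units (1000 = 1 vCPU)."""
    if vcpus <= 0:
        raise ValueError("vcpus must be positive")
    return int(round(vcpus * 1000))


def allowed_location(name: str) -> str:
    """Prefix a bare region or zone name for use in allowed_locations.

    Assumption: zones look like "us-central1-b" and regions look like
    "us-central1". Names that match neither pattern are rejected.
    """
    if re.fullmatch(r"[a-z]+-[a-z]+\d+-[a-z]", name):
        return f"zones/{name}"
    if re.fullmatch(r"[a-z]+-[a-z]+\d+", name):
        return f"regions/{name}"
    raise ValueError(f"not a recognized region or zone name: {name!r}")


if __name__ == "__main__":
    print(vcpus_to_cpu_milli(2))
    print(allowed_location("us-central1-b"))
    print(allowed_location("europe-central2"))
```

In the sample above, these results could be assigned to resources.cpu_milli and location.allowed_locations respectively.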

What's next

If you have issues creating or running a job, see Troubleshooting.
View jobs and tasks.
Learn about more job creation options.
