创建和运行使用 GPU 的作业

本文档介绍了如何创建和运行使用 图形处理器 (GPU)

创建批量作业时,您可以选择添加 将一个或多个 GPU 添加到运行它的虚拟机。作业的常见用例 包含密集型数据处理和 机器学习 (ML) 工作负载。

准备工作

  1. 如果您以前没有使用过 Batch,请参阅 Batch 使用入门 并通过填写 针对项目和用户的前提条件
  2. 如需获得创建作业所需的权限,请让管理员向您授予以下 IAM 角色:

    如需详细了解如何授予角色,请参阅管理对项目、文件夹和组织的访问权限

    您也可以通过自定义角色或其他预定义角色来获取所需的权限。

创建使用 GPU 的作业

如需创建使用 GPU 的作业,请执行以下操作:

  1. 查看使用 GPU 的作业需满足的要求 部分,以确定可用于创建作业的方法。
  2. 使用您选择的方法创建作业。如需查看有关如何创建 如何使用推荐的方法创建作业,请参阅 创建使用 GPU 的示例作业部分。

作业使用 GPU 的要求

如需使用 GPU,作业必须执行以下所有操作:

确定如何满足作业的要求后,您还需要定义作业的 GPU 和位置。一个作业的虚拟机各自可以使用 一个或多个指定类型的 GPU。对于 作业的虚拟机(或者如果未定义,则作业的位置)必须具有指定的 GPU 类型。如需详细了解如何定义 GPU 类型、GPU 数量和 有效的工作地点,请参阅示例

安装 GPU 驱动程序

如需安装所需的 GPU 驱动程序,请选择以下方法之一:

定义兼容的虚拟机资源

如果您的作业定义了任何虚拟机资源 instances[] 个子字段) 您必须以兼容的方式定义这些虚拟机资源。

如需为作业的虚拟机(包括任何 GPU)定义资源,您只能使用以下方法:

  • 直接定义资源(推荐):如示例所示,如需直接为作业的虚拟机定义资源,请使用 policy 字段
  • 在模板中定义资源:为作业的虚拟机定义资源 方法是指定 Compute Engine 实例模板

此外,您定义的所有资源都必须与 作业的 GPU 的类型和数量。如需详细了解 您可以与 GPU 搭配使用的虚拟机资源,请参阅 GPU 平台

创建使用 GPU 的示例作业

您可以使用 gcloud CLI 创建使用 GPU 的作业, 批处理 API、Java 或 Python。

gcloud

  1. 创建一个 JSON 文件,用于指定作业的配置详细信息、 typecount 子字段 accelerators[] 字段的 1000 个字符,以及包含这些 GPU 类型。

    例如,如需创建使用 GPU 的基本脚本作业, 自动安装所需的 GPU 驱动程序 并为作业的虚拟机指定了允许的位置 创建一个包含以下内容的 JSON 文件:

    {
        "taskGroups": [
            {
                "taskSpec": {
                    "runnables": [
                        {
                            "script": {
                                "text": "echo Hello world from task ${BATCH_TASK_INDEX}."
                            }
                        }
                    ]
                },
                "taskCount": 3,
                "parallelism": 1
            }
        ],
        "allocationPolicy": {
            "instances": [
                {
                    "installGpuDrivers": INSTALL_GPU_DRIVERS,
                    "policy": {
                        "accelerators": [
                            {
                                "type": "GPU_TYPE",
                                "count": GPU_COUNT
                            }
                        ]
                    }
                }
            ],
            "location": {
                "allowedLocations": [
                    "ALLOWED_LOCATIONS"
                ]
            }
        }
    }
    

    替换以下内容:

    • INSTALL_GPU_DRIVERS:可选。设置为 true,批量提取 您在 policy 字段中指定的 GPU 类型 而 Batch 则会安装这些命令 。如果您将此字段设置为 false(默认值),则需要手动安装 GPU 驱动程序,才能将任何 GPU 用于此作业。

    • GPU_TYPEGPU 类型。您可以使用 gcloud compute accelerator-types list 命令

    • GPU_COUNT:指定类型的 GPU 数量。

    • ALLOWED_LOCATIONS:可选。营业地点 运行作业的虚拟机实例的位置,例如 regions/us-central1, zones/us-central1-a 允许可用区 us-central1-a。如果您指定了允许的位置,则必须选择相应区域,还可以选择一个或多个可用区(可选)。您要指定 选择必须具有 GPU 类型 所需的资源否则,如果您省略此字段,作业的所在位置必须具有相应的 GPU 类型。有关详情,请参阅 allowedLocations[] 字段

  2. 如需创建和运行作业,请使用 gcloud batch jobs submit 命令

    gcloud batch jobs submit JOB_NAME \
        --location LOCATION \
        --config JSON_CONFIGURATION_FILE
    

    替换以下内容:

    • JOB_NAME:作业的名称。

    • LOCATION位置 作业的组成部分。

    • JSON_CONFIGURATION_FILE: 包含作业配置详情的 JSON 文件。

API

请发送 POST 请求至 jobs.create 方法 指定了作业的配置详情, typecount 子字段 accelerators[]字段的相应位置,以及具有 GPU 类型。

例如,如需创建使用 GPU 的基本脚本作业, 自动安装所需的 GPU 驱动程序 并为作业的虚拟机指定了允许的位置 创建一个包含以下内容的 JSON 文件:

POST https://batch.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/jobs?job_id=JOB_NAME

{
    "taskGroups": [
        {
            "taskSpec": {
                "runnables": [
                    {
                        "script": {
                            "text": "echo Hello world from task ${BATCH_TASK_INDEX}."
                        }
                    }
                ]
            },
            "taskCount": 3,
            "parallelism": 1
        }
    ],
    "allocationPolicy": {
        "instances": [
            {
                "installGpuDrivers": INSTALL_GPU_DRIVERS,
                "policy": {
                    "accelerators": [
                        {
                            "type": "GPU_TYPE",
                            "count": GPU_COUNT
                        }
                    ]
                }
            }
        ],
        "location": {
            "allowedLocations": [
                "ALLOWED_LOCATIONS"
            ]
        }
    }
}

替换以下内容:

  • PROJECT_ID项目 ID 项目名称

  • LOCATION:作业的位置

  • JOB_NAME:作业的名称。

  • INSTALL_GPU_DRIVERS:可选。设置为 true,批量提取 您在 policy 字段中指定的 GPU 类型 而 Batch 则会安装这些命令 。如果将此字段设置为 false(默认值),则需要 手动安装 GPU 驱动程序,以便为此作业使用任何 GPU。

  • GPU_TYPEGPU 类型。您可以使用 gcloud compute accelerator-types list 命令

  • GPU_COUNT: 指定类型。

  • ALLOWED_LOCATIONS:可选。营业地点 运行作业的虚拟机实例的位置,例如 regions/us-central1, zones/us-central1-a 允许可用区 us-central1-a。如果您指定了允许的位置,则必须选择 区域以及(可选)一个或多个可用区。您要指定 选择必须具有 GPU 类型 所需的资源否则,如果您省略此字段, 作业的位置必须具有 GPU 类型。有关详情,请参阅 allowedLocations[] 字段

Java

如需使用 Java 创建具有 GPU 的作业,请选择以下选项之一 基于 GPU 模型的机器类型

创建将 GPU 与加速器优化的虚拟机搭配使用的作业

如需将 GPU 与加速器优化的虚拟机搭配使用,只需指定机器类型 您希望用于该作业的虚拟机:


import com.google.cloud.batch.v1.AllocationPolicy;
import com.google.cloud.batch.v1.AllocationPolicy.Accelerator;
import com.google.cloud.batch.v1.AllocationPolicy.InstancePolicy;
import com.google.cloud.batch.v1.AllocationPolicy.InstancePolicyOrTemplate;
import com.google.cloud.batch.v1.BatchServiceClient;
import com.google.cloud.batch.v1.CreateJobRequest;
import com.google.cloud.batch.v1.Job;
import com.google.cloud.batch.v1.LogsPolicy;
import com.google.cloud.batch.v1.Runnable;
import com.google.cloud.batch.v1.Runnable.Script;
import com.google.cloud.batch.v1.TaskGroup;
import com.google.cloud.batch.v1.TaskSpec;
import com.google.protobuf.Duration;
import java.io.IOException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class CreateGpuJob {

  public static void main(String[] args)
      throws IOException, ExecutionException, InterruptedException, TimeoutException {
    // TODO(developer): Replace these variables before running the sample.
    // Project ID or project number of the Google Cloud project you want to use.
    String projectId = "YOUR_PROJECT_ID";
    // Name of the region you want to use to run the job. Regions that are
    // available for Batch are listed on: https://cloud.google.com/batch/docs/get-started#locations
    String region = "europe-central2";
    // The name of the job that will be created.
    // It needs to be unique for each project and region pair.
    String jobName = "JOB_NAME";
    // Optional. When set to true, Batch fetches the drivers required for the GPU type
    // that you specify in the policy field from a third-party location,
    // and Batch installs them on your behalf. If you set this field to false (default),
    // you need to install GPU drivers manually to use any GPUs for this job.
    boolean installGpuDrivers = false;
    // Accelerator-optimized machine types are available to Batch jobs. See the list
    // of available types on: https://cloud.google.com/compute/docs/accelerator-optimized-machines
    String machineType = "g2-standard-4";

    createGpuJob(projectId, region, jobName, installGpuDrivers, machineType);
  }

  // Create a job that uses GPUs
  public static Job createGpuJob(String projectId, String region, String jobName,
                                  boolean installGpuDrivers, String machineType)
      throws IOException, ExecutionException, InterruptedException, TimeoutException {
    // Initialize client that will be used to send requests. This client only needs to be created
    // once, and can be reused for multiple requests.
    try (BatchServiceClient batchServiceClient = BatchServiceClient.create()) {
      // Define what will be done as part of the job.
      Runnable runnable =
          Runnable.newBuilder()
              .setScript(
                  Script.newBuilder()
                      .setText(
                          "echo Hello world! This is task ${BATCH_TASK_INDEX}. "
                                  + "This job has a total of ${BATCH_TASK_COUNT} tasks.")
                      // You can also run a script from a file. Just remember, that needs to be a
                      // script that's already on the VM that will be running the job.
                      // Using setText() and setPath() is mutually exclusive.
                      // .setPath("/tmp/test.sh")
                      .build())
              .build();

      TaskSpec task = TaskSpec.newBuilder()
                  // Jobs can be divided into tasks. In this case, we have only one task.
                  .addRunnables(runnable)
                  .setMaxRetryCount(2)
                  .setMaxRunDuration(Duration.newBuilder().setSeconds(3600).build())
                  .build();

      // Tasks are grouped inside a job using TaskGroups.
      // Currently, it's possible to have only one task group.
      TaskGroup taskGroup = TaskGroup.newBuilder()
          .setTaskCount(3)
          .setParallelism(1)
          .setTaskSpec(task)
          .build();

      // Policies are used to define on what kind of virtual machines the tasks will run.
      // Read more about machine types here: https://cloud.google.com/compute/docs/machine-types
      InstancePolicy instancePolicy =
          InstancePolicy.newBuilder().setMachineType(machineType).build();  

      // Policies are used to define on what kind of virtual machines the tasks will run on.
      AllocationPolicy allocationPolicy =
          AllocationPolicy.newBuilder()
              .addInstances(
                  InstancePolicyOrTemplate.newBuilder()
                      .setInstallGpuDrivers(installGpuDrivers)
                      .setPolicy(instancePolicy)
                      .build())
              .build();

      Job job =
          Job.newBuilder()
              .addTaskGroups(taskGroup)
              .setAllocationPolicy(allocationPolicy)
              .putLabels("env", "testing")
              .putLabels("type", "script")
              // We use Cloud Logging as it's an out of the box available option.
              .setLogsPolicy(
                  LogsPolicy.newBuilder().setDestination(LogsPolicy.Destination.CLOUD_LOGGING))
              .build();

      CreateJobRequest createJobRequest =
          CreateJobRequest.newBuilder()
              // The job's parent is the region in which the job will run.
              .setParent(String.format("projects/%s/locations/%s", projectId, region))
              .setJob(job)
              .setJobId(jobName)
              .build();

      Job result =
          batchServiceClient
              .createJobCallable()
              .futureCall(createJobRequest)
              .get(5, TimeUnit.MINUTES);

      System.out.printf("Successfully created the job: %s", result.getName());

      return result;
    }
  }
}

创建将 GPU 与 N1 虚拟机搭配使用的作业

要将 GPU 与 N1 虚拟机搭配使用,您需要指定 您希望为作业的每个虚拟机使用的 GPU:


import com.google.cloud.batch.v1.AllocationPolicy;
import com.google.cloud.batch.v1.AllocationPolicy.Accelerator;
import com.google.cloud.batch.v1.AllocationPolicy.InstancePolicy;
import com.google.cloud.batch.v1.AllocationPolicy.InstancePolicyOrTemplate;
import com.google.cloud.batch.v1.BatchServiceClient;
import com.google.cloud.batch.v1.CreateJobRequest;
import com.google.cloud.batch.v1.Job;
import com.google.cloud.batch.v1.LogsPolicy;
import com.google.cloud.batch.v1.Runnable;
import com.google.cloud.batch.v1.Runnable.Script;
import com.google.cloud.batch.v1.TaskGroup;
import com.google.cloud.batch.v1.TaskSpec;
import com.google.protobuf.Duration;
import java.io.IOException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class CreateGpuJobN1 {

  public static void main(String[] args)
      throws IOException, ExecutionException, InterruptedException, TimeoutException {
    // TODO(developer): Replace these variables before running the sample.
    // Project ID or project number of the Google Cloud project you want to use.
    String projectId = "YOUR_PROJECT_ID";
    // Name of the region you want to use to run the job. Regions that are
    // available for Batch are listed on: https://cloud.google.com/batch/docs/get-started#locations
    String region = "europe-central2";
    // The name of the job that will be created.
    // It needs to be unique for each project and region pair.
    String jobName = "JOB_NAME";
    // Optional. When set to true, Batch fetches the drivers required for the GPU type
    // that you specify in the policy field from a third-party location,
    // and Batch installs them on your behalf. If you set this field to false (default),
    // you need to install GPU drivers manually to use any GPUs for this job.
    boolean installGpuDrivers = false;
    // The GPU type. You can view a list of the available GPU types
    // by using the `gcloud compute accelerator-types list` command.
    String gpuType = "nvidia-tesla-t4";
    // The number of GPUs of the specified type.
    int gpuCount = 2;

    createGpuJob(projectId, region, jobName, installGpuDrivers, gpuType, gpuCount);
  }

  // Create a job that uses GPUs
  public static Job createGpuJob(String projectId, String region, String jobName,
                                  boolean installGpuDrivers, String gpuType, int gpuCount)
      throws IOException, ExecutionException, InterruptedException, TimeoutException {
    // Initialize client that will be used to send requests. This client only needs to be created
    // once, and can be reused for multiple requests.
    try (BatchServiceClient batchServiceClient = BatchServiceClient.create()) {
      // Define what will be done as part of the job.
      Runnable runnable =
          Runnable.newBuilder()
              .setScript(
                  Script.newBuilder()
                      .setText(
                          "echo Hello world! This is task ${BATCH_TASK_INDEX}. "
                                  + "This job has a total of ${BATCH_TASK_COUNT} tasks.")
                      // You can also run a script from a file. Just remember, that needs to be a
                      // script that's already on the VM that will be running the job.
                      // Using setText() and setPath() is mutually exclusive.
                      // .setPath("/tmp/test.sh")
                      .build())
              .build();

      TaskSpec task = TaskSpec.newBuilder()
                  // Jobs can be divided into tasks. In this case, we have only one task.
                  .addRunnables(runnable)
                  .setMaxRetryCount(2)
                  .setMaxRunDuration(Duration.newBuilder().setSeconds(3600).build())
                  .build();

      // Tasks are grouped inside a job using TaskGroups.
      // Currently, it's possible to have only one task group.
      TaskGroup taskGroup = TaskGroup.newBuilder()
          .setTaskCount(3)
          .setParallelism(1)
          .setTaskSpec(task)
          .build();

      // Accelerator describes Compute Engine accelerators to be attached to the VM.
      Accelerator accelerator = Accelerator.newBuilder()
          .setType(gpuType)
          .setCount(gpuCount)
          .build();

      // Policies are used to define on what kind of virtual machines the tasks will run on.
      AllocationPolicy allocationPolicy =
          AllocationPolicy.newBuilder()
              .addInstances(
                  InstancePolicyOrTemplate.newBuilder()
                      .setInstallGpuDrivers(installGpuDrivers)
                      .setPolicy(InstancePolicy.newBuilder().addAccelerators(accelerator))
                      .build())
              .build();

      Job job =
          Job.newBuilder()
              .addTaskGroups(taskGroup)
              .setAllocationPolicy(allocationPolicy)
              .putLabels("env", "testing")
              .putLabels("type", "script")
              // We use Cloud Logging as it's an out of the box available option.
              .setLogsPolicy(
                  LogsPolicy.newBuilder().setDestination(LogsPolicy.Destination.CLOUD_LOGGING))
              .build();

      CreateJobRequest createJobRequest =
          CreateJobRequest.newBuilder()
              // The job's parent is the region in which the job will run.
              .setParent(String.format("projects/%s/locations/%s", projectId, region))
              .setJob(job)
              .setJobId(jobName)
              .build();

      Job result =
          batchServiceClient
              .createJobCallable()
              .futureCall(createJobRequest)
              .get(5, TimeUnit.MINUTES);

      System.out.printf("Successfully created the job: %s", result.getName());

      return result;
    }
  }
}

Python

如需使用 Python 创建使用 GPU 的作业,请根据 GPU 模型的机器类型选择以下选项之一:

创建使用加速器优化型虚拟机的 GPU 作业

如需将 GPU 与加速器优化的虚拟机搭配使用,只需指定机器类型 您希望用于该作业的虚拟机:

from google.cloud import batch_v1


def create_gpu_job(project_id: str, region: str, job_name: str) -> batch_v1.Job:
    """
    This method shows how to create a sample Batch Job that will run
    a simple command on Cloud Compute instances on GPU machines.

    Args:
        project_id: project ID or project number of the Cloud project you want to use.
        region: name of the region you want to use to run the job. Regions that are
            available for Batch are listed on: https://cloud.google.com/batch/docs/get-started#locations
        job_name: the name of the job that will be created.
            It needs to be unique for each project and region pair.

    Returns:
        A job object representing the job created.
    """
    client = batch_v1.BatchServiceClient()

    # Define what will be done as part of the job.
    task = batch_v1.TaskSpec()
    runnable = batch_v1.Runnable()
    runnable.script = batch_v1.Runnable.Script()
    runnable.script.text = "echo Hello world! This is task ${BATCH_TASK_INDEX}. This job has a total of ${BATCH_TASK_COUNT} tasks."
    # You can also run a script from a file. Just remember, that needs to be a script that's
    # already on the VM that will be running the job. Using runnable.script.text and runnable.script.path is mutually
    # exclusive.
    # runnable.script.path = '/tmp/test.sh'
    task.runnables = [runnable]

    # We can specify what resources are requested by each task.
    resources = batch_v1.ComputeResource()
    resources.cpu_milli = 2000  # in milliseconds per cpu-second. This means the task requires 2 whole CPUs.
    resources.memory_mib = 16  # in MiB
    task.compute_resource = resources

    task.max_retry_count = 2
    task.max_run_duration = "3600s"

    # Tasks are grouped inside a job using TaskGroups.
    # Currently, it's possible to have only one task group.
    group = batch_v1.TaskGroup()
    group.task_count = 4
    group.task_spec = task

    # Policies are used to define on what kind of virtual machines the tasks will run on.
    # In this case, we tell the system to use "g2-standard-4" machine type.
    # Read more about machine types here: https://cloud.google.com/compute/docs/machine-types
    policy = batch_v1.AllocationPolicy.InstancePolicy()
    policy.machine_type = "g2-standard-4"

    instances = batch_v1.AllocationPolicy.InstancePolicyOrTemplate()
    instances.policy = policy
    instances.install_gpu_drivers = True
    allocation_policy = batch_v1.AllocationPolicy()
    allocation_policy.instances = [instances]

    job = batch_v1.Job()
    job.task_groups = [group]
    job.allocation_policy = allocation_policy
    job.labels = {"env": "testing", "type": "container"}
    # We use Cloud Logging as it's an out of the box available option
    job.logs_policy = batch_v1.LogsPolicy()
    job.logs_policy.destination = batch_v1.LogsPolicy.Destination.CLOUD_LOGGING

    create_request = batch_v1.CreateJobRequest()
    create_request.job = job
    create_request.job_id = job_name
    # The job's parent is the region in which the job will run
    create_request.parent = f"projects/{project_id}/locations/{region}"

    return client.create_job(create_request)

创建将 GPU 与 N1 虚拟机搭配使用的作业

要将 GPU 与 N1 虚拟机搭配使用,您需要指定 您希望为作业的每个虚拟机使用的 GPU:

from google.cloud import batch_v1


def create_gpu_job(
    project_id: str, region: str, zone: str, job_name: str
) -> batch_v1.Job:
    """
    This method shows how to create a sample Batch Job that will run
    a simple command on Cloud Compute instances on GPU machines.

    Args:
        project_id: project ID or project number of the Cloud project you want to use.
        region: name of the region you want to use to run the job. Regions that are
            available for Batch are listed on: https://cloud.google.com/batch/docs/get-started#locations
        zone: name of the zone you want to use to run the job. Important in regard to GPUs availability.
            GPUs availability can be found here: https://cloud.google.com/compute/docs/gpus/gpu-regions-zones
        job_name: the name of the job that will be created.
            It needs to be unique for each project and region pair.

    Returns:
        A job object representing the job created.
    """
    client = batch_v1.BatchServiceClient()

    # Define what will be done as part of the job.
    task = batch_v1.TaskSpec()
    runnable = batch_v1.Runnable()
    runnable.script = batch_v1.Runnable.Script()
    runnable.script.text = "echo Hello world! This is task ${BATCH_TASK_INDEX}. This job has a total of ${BATCH_TASK_COUNT} tasks."
    # You can also run a script from a file. Just remember, that needs to be a script that's
    # already on the VM that will be running the job. Using runnable.script.text and runnable.script.path is mutually
    # exclusive.
    # runnable.script.path = '/tmp/test.sh'
    task.runnables = [runnable]

    # We can specify what resources are requested by each task.
    resources = batch_v1.ComputeResource()
    resources.cpu_milli = 2000  # in milliseconds per cpu-second. This means the task requires 2 whole CPUs.
    resources.memory_mib = 16  # in MiB
    task.compute_resource = resources

    task.max_retry_count = 2
    task.max_run_duration = "3600s"

    # Tasks are grouped inside a job using TaskGroups.
    # Currently, it's possible to have only one task group.
    group = batch_v1.TaskGroup()
    group.task_count = 4
    group.task_spec = task

    # Policies are used to define on what kind of virtual machines the tasks will run on.
    # Read more about machine types here: https://cloud.google.com/compute/docs/machine-types
    policy = batch_v1.AllocationPolicy.InstancePolicy()
    policy.machine_type = "n1-standard-16"

    accelerator = batch_v1.AllocationPolicy.Accelerator()
    # Note: not every accelerator is compatible with instance type
    # Read more here: https://cloud.google.com/compute/docs/gpus#t4-gpus
    accelerator.type_ = "nvidia-tesla-t4"
    accelerator.count = 1

    policy.accelerators = [accelerator]
    instances = batch_v1.AllocationPolicy.InstancePolicyOrTemplate()
    instances.policy = policy
    instances.install_gpu_drivers = True
    allocation_policy = batch_v1.AllocationPolicy()
    allocation_policy.instances = [instances]

    location = batch_v1.AllocationPolicy.LocationPolicy()
    location.allowed_locations = ["zones/us-central1-b"]
    allocation_policy.location = location

    job = batch_v1.Job()
    job.task_groups = [group]
    job.allocation_policy = allocation_policy
    job.labels = {"env": "testing", "type": "container"}
    # We use Cloud Logging as it's an out of the box available option
    job.logs_policy = batch_v1.LogsPolicy()
    job.logs_policy.destination = batch_v1.LogsPolicy.Destination.CLOUD_LOGGING

    create_request = batch_v1.CreateJobRequest()
    create_request.job = job
    create_request.job_id = job_name
    # The job's parent is the region in which the job will run
    create_request.parent = f"projects/{project_id}/locations/{region}"

    return client.create_job(create_request)

后续步骤