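Configure compute resources for custom training

When you perform custom training, your training code runs on one or more virtual machine (VM) instances. You can configure which types of VM to use for training. VMs with more compute resources can speed up training and let you work with larger datasets, but they can also increase training costs.

You can also use GPUs to accelerate training. GPUs incur additional costs.

Optionally, you can customize the type and size of your training VMs' boot disks.

This document describes the compute resources that you can use for custom training and how to configure them.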

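Manage cost and availability

To help you manage costs or ensure the availability of VM resources, Vertex AI provides the following features:

- To make sure that VM resources are available when your training jobs need them, use Compute Engine reservations. Reservations provide a high level of assurance in obtaining capacity for Compute Engine resources. For more information, see Use reservations with training.
- To reduce the cost of running your training jobs, use Spot VMs. Spot VMs are virtual machine (VM) instances that use excess Compute Engine capacity. Spot VMs are heavily discounted, but Compute Engine might preemptively stop or delete them at any time to reclaim the capacity. For more information, see Use Spot VMs with training.
- For custom training jobs that request GPU resources, Dynamic Workload Scheduler lets you schedule the jobs based on when the requested GPU resources become available. For more information, see Schedule training jobs based on resource availability.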

Where to specify compute resources

Specify configuration details within a WorkerPoolSpec. Depending on how you perform custom training, put this WorkerPoolSpec in one of the following API fields:

- If you are creating a CustomJob resource, specify the WorkerPoolSpec in CustomJob.jobSpec.workerPoolSpecs.

  If you are using the Google Cloud CLI, you can use the --worker-pool-spec flag or the --config flag on the gcloud ai custom-jobs create command to specify worker pool options.

  Learn more about creating a CustomJob.

- If you are creating a HyperparameterTuningJob resource, specify the WorkerPoolSpec in HyperparameterTuningJob.trialJobSpec.workerPoolSpecs.

  If you are using the gcloud CLI, you can use the --config flag on the gcloud ai hp-tuning-jobs create command to specify worker pool options.

  Learn more about creating a HyperparameterTuningJob.

- If you are creating a TrainingPipeline resource without hyperparameter tuning, specify the WorkerPoolSpec in TrainingPipeline.trainingTaskInputs.workerPoolSpecs.

  Learn more about creating a custom TrainingPipeline.

- If you are creating a TrainingPipeline with hyperparameter tuning, specify the WorkerPoolSpec in TrainingPipeline.trainingTaskInputs.trialJobSpec.workerPoolSpecs.

If you are performing distributed training, you can use different settings for each worker pool.
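The field paths above can be easier to compare side by side as plain data. The following is a minimal sketch that only builds nested dictionaries mirroring those paths; it does not call any API. The helper names `make_worker_pool_spec` and `place_worker_pool_specs` are hypothetical and exist only for this illustration.

```python
from typing import Any


def make_worker_pool_spec(machine_type: str, replica_count: int, image_uri: str) -> dict[str, Any]:
    """Build a WorkerPoolSpec-shaped dictionary (illustrative helper, not an SDK call)."""
    return {
        "machineSpec": {"machineType": machine_type},
        "replicaCount": replica_count,
        "containerSpec": {"imageUri": image_uri},
    }


def place_worker_pool_specs(resource: str, specs: list[dict[str, Any]], tuning: bool = False) -> dict[str, Any]:
    """Nest the worker pool specs under the field path used by each resource type."""
    if resource == "CustomJob":
        return {"jobSpec": {"workerPoolSpecs": specs}}
    if resource == "HyperparameterTuningJob":
        return {"trialJobSpec": {"workerPoolSpecs": specs}}
    if resource == "TrainingPipeline":
        if tuning:
            return {"trainingTaskInputs": {"trialJobSpec": {"workerPoolSpecs": specs}}}
        return {"trainingTaskInputs": {"workerPoolSpecs": specs}}
    raise ValueError(f"Unknown resource type: {resource}")


# Two pools with different settings, as allowed for distributed training.
pools = [
    make_worker_pool_spec("n1-standard-8", 1, "CUSTOM_CONTAINER_IMAGE_URI"),
    make_worker_pool_spec("n1-standard-4", 2, "CUSTOM_CONTAINER_IMAGE_URI"),
]
custom_job_body = place_worker_pool_specs("CustomJob", pools)
```

Each worker pool is one entry in the list, which is why distributed training can give every pool its own machine type and replica count.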

Machine types

In your WorkerPoolSpec, you must specify one of the following machine types in the machineSpec.machineType field. Each replica in the worker pool runs on a separate VM that has the specified machine type.

a2-ultragpu-1g*
a2-ultragpu-2g*
a2-ultragpu-4g*
a2-ultragpu-8g*
a2-highgpu-1g*
a2-highgpu-2g*
a2-highgpu-4g*
a2-highgpu-8g*
a2-megagpu-16g*
a3-highgpu-8g*
e2-standard-4
e2-standard-8
e2-standard-16
e2-standard-32
e2-highmem-2
e2-highmem-4
e2-highmem-8
e2-highmem-16
e2-highcpu-16
e2-highcpu-32
n2-standard-4
n2-standard-8
n2-standard-16
n2-standard-32
n2-standard-48
n2-standard-64
n2-standard-80
n2-highmem-2
n2-highmem-4
n2-highmem-8
n2-highmem-16
n2-highmem-32
n2-highmem-48
n2-highmem-64
n2-highmem-80
n2-highcpu-16
n2-highcpu-32
n2-highcpu-48
n2-highcpu-64
n2-highcpu-80
n1-standard-4
n1-standard-8
n1-standard-16
n1-standard-32
n1-standard-64
n1-standard-96
n1-highmem-2
n1-highmem-4
n1-highmem-8
n1-highmem-16
n1-highmem-32
n1-highmem-64
n1-highmem-96
n1-highcpu-16
n1-highcpu-32
n1-highcpu-64
n1-highcpu-96
c2-standard-4
c2-standard-8
c2-standard-16
c2-standard-30
c2-standard-60
ct5lp-hightpu-1t*
ct5lp-hightpu-4t*
ct5lp-hightpu-8t*
m1-ultramem-40
m1-ultramem-80
m1-ultramem-160
m1-megamem-96
g2-standard-4*
g2-standard-8*
g2-standard-12*
g2-standard-16*
g2-standard-24*
g2-standard-32*
g2-standard-48*
g2-standard-96*
cloud-tpu*

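* Machine types marked with an asterisk in the preceding list must be used with certain GPUs or TPUs. See the following sections of this guide.

To learn about the technical specifications of each machine type, read the Compute Engine documentation about machine types. To learn about the cost of using each machine type for custom training, read Pricing.

The following examples highlight where you specify the machine type when you create a CustomJob:

Console

In the Google Cloud console, you can't create a CustomJob directly. However, you can create a TrainingPipeline that creates a CustomJob. When you create a TrainingPipeline in the Google Cloud console, specify a machine type for each worker pool in the Machine type field on the Compute and pricing step.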

gcloud

gcloud ai custom-jobs create \
  --region=LOCATION \
  --display-name=JOB_NAME \
  --worker-pool-spec=machine-type=MACHINE_TYPE,replica-count=REPLICA_COUNT,container-image-uri=CUSTOM_CONTAINER_IMAGE_URI
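
When the command is generated from a script, it can help to assemble the --worker-pool-spec value from its parts. The following is a minimal sketch that only builds the argument list shown above and does not run gcloud. The function name `build_create_command` is hypothetical.

```python
def build_create_command(
    region: str,
    display_name: str,
    machine_type: str,
    replica_count: int,
    image_uri: str,
) -> list[str]:
    """Return the gcloud argument list for creating a CustomJob with one worker pool."""
    # The flag value is a comma-separated list of key=value pairs.
    worker_pool_spec = ",".join(
        [
            f"machine-type={machine_type}",
            f"replica-count={replica_count}",
            f"container-image-uri={image_uri}",
        ]
    )
    return [
        "gcloud", "ai", "custom-jobs", "create",
        f"--region={region}",
        f"--display-name={display_name}",
        f"--worker-pool-spec={worker_pool_spec}",
    ]


command = build_create_command(
    "us-central1", "JOB_NAME", "n1-standard-4", 1, "CUSTOM_CONTAINER_IMAGE_URI"
)
```

Passing the result to `subprocess.run(command, check=True)` would execute it, assuming the gcloud CLI is installed and authenticated.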

Java

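Before trying this sample, follow the Java setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Java API reference documentation.

To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.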


import com.google.cloud.aiplatform.v1.AcceleratorType;
import com.google.cloud.aiplatform.v1.ContainerSpec;
import com.google.cloud.aiplatform.v1.CustomJob;
import com.google.cloud.aiplatform.v1.CustomJobSpec;
import com.google.cloud.aiplatform.v1.JobServiceClient;
import com.google.cloud.aiplatform.v1.JobServiceSettings;
import com.google.cloud.aiplatform.v1.LocationName;
import com.google.cloud.aiplatform.v1.MachineSpec;
import com.google.cloud.aiplatform.v1.WorkerPoolSpec;
import java.io.IOException;

// Create a custom job to run machine learning training code in Vertex AI
public class CreateCustomJobSample {

  public static void main(String[] args) throws IOException {
    // TODO(developer): Replace these variables before running the sample.
    String project = "PROJECT";
    String displayName = "DISPLAY_NAME";

    // Vertex AI runs your training application in a Docker container image. A Docker container
    // image is a self-contained software package that includes code and all dependencies. Learn
    // more about preparing your training application at
    // https://cloud.google.com/vertex-ai/docs/training/overview#prepare_your_training_application
    String containerImageUri = "CONTAINER_IMAGE_URI";
    createCustomJobSample(project, displayName, containerImageUri);
  }

  static void createCustomJobSample(String project, String displayName, String containerImageUri)
      throws IOException {
    JobServiceSettings settings =
        JobServiceSettings.newBuilder()
            .setEndpoint("us-central1-aiplatform.googleapis.com:443")
            .build();
    String location = "us-central1";

    // Initialize client that will be used to send requests. This client only needs to be created
    // once, and can be reused for multiple requests.
    try (JobServiceClient client = JobServiceClient.create(settings)) {
      MachineSpec machineSpec =
          MachineSpec.newBuilder()
              .setMachineType("n1-standard-4")
              .setAcceleratorType(AcceleratorType.NVIDIA_TESLA_T4)
              .setAcceleratorCount(1)
              .build();

      ContainerSpec containerSpec =
          ContainerSpec.newBuilder().setImageUri(containerImageUri).build();

      WorkerPoolSpec workerPoolSpec =
          WorkerPoolSpec.newBuilder()
              .setMachineSpec(machineSpec)
              .setReplicaCount(1)
              .setContainerSpec(containerSpec)
              .build();

      CustomJobSpec customJobSpecJobSpec =
          CustomJobSpec.newBuilder().addWorkerPoolSpecs(workerPoolSpec).build();

      CustomJob customJob =
          CustomJob.newBuilder()
              .setDisplayName(displayName)
              .setJobSpec(customJobSpecJobSpec)
              .build();
      LocationName parent = LocationName.of(project, location);
      CustomJob response = client.createCustomJob(parent, customJob);
      System.out.format("response: %s\n", response);
      System.out.format("Name: %s\n", response.getName());
    }
  }
}

Node.js

Before trying this sample, follow the Node.js setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Node.js API reference documentation.

/**
 * TODO(developer): Uncomment these variables before running the sample.
 * (Not necessary if passing values as arguments)
 */

// const customJobDisplayName = 'YOUR_CUSTOM_JOB_DISPLAY_NAME';
// const containerImageUri = 'YOUR_CONTAINER_IMAGE_URI';
// const project = 'YOUR_PROJECT_ID';
// const location = 'YOUR_PROJECT_LOCATION';

// Imports the Google Cloud Job Service Client library
const {JobServiceClient} = require('@google-cloud/aiplatform');

// Specifies the location of the api endpoint
const clientOptions = {
  apiEndpoint: 'us-central1-aiplatform.googleapis.com',
};

// Instantiates a client
const jobServiceClient = new JobServiceClient(clientOptions);

async function createCustomJob() {
  // Configure the parent resource
  const parent = `projects/${project}/locations/${location}`;
  const customJob = {
    displayName: customJobDisplayName,
    jobSpec: {
      workerPoolSpecs: [
        {
          machineSpec: {
            machineType: 'n1-standard-4',
            acceleratorType: 'NVIDIA_TESLA_T4',
            acceleratorCount: 1,
          },
          replicaCount: 1,
          containerSpec: {
            imageUri: containerImageUri,
            command: [],
            args: [],
          },
        },
      ],
    },
  };
  const request = {parent, customJob};

  // Create custom job request
  const [response] = await jobServiceClient.createCustomJob(request);

  console.log('Create custom job response:\n', JSON.stringify(response));
}
createCustomJob();

Python

To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Python API reference documentation.

from google.cloud import aiplatform


def create_custom_job_sample(
    project: str,
    display_name: str,
    container_image_uri: str,
    location: str = "us-central1",
    api_endpoint: str = "us-central1-aiplatform.googleapis.com",
):
    # The AI Platform services require regional API endpoints.
    client_options = {"api_endpoint": api_endpoint}
    # Initialize client that will be used to create and send requests.
    # This client only needs to be created once, and can be reused for multiple requests.
    client = aiplatform.gapic.JobServiceClient(client_options=client_options)
    custom_job = {
        "display_name": display_name,
        "job_spec": {
            "worker_pool_specs": [
                {
                    "machine_spec": {
                        "machine_type": "n1-standard-4",
                        "accelerator_type": aiplatform.gapic.AcceleratorType.NVIDIA_TESLA_T4,
                        "accelerator_count": 1,
                    },
                    "replica_count": 1,
                    "container_spec": {
                        "image_uri": container_image_uri,
                        "command": [],
                        "args": [],
                    },
                }
            ]
        },
    }
    parent = f"projects/{project}/locations/{location}"
    response = client.create_custom_job(parent=parent, custom_job=custom_job)
    print("response:", response)

For more context, read the guide to creating a CustomJob.

GPU

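If you have written your training code to use GPUs, you can configure your worker pool to use one or more GPUs on each VM. To use GPUs, you must use an A2, A3, N1, or G2 machine type. Additionally, using a smaller machine type such as n1-highmem-2 with GPUs might cause logging to fail for some workloads because of CPU constraints. If your training job stops returning logs, consider selecting a larger machine type.

Vertex AI supports the following types of GPU for custom training: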

NVIDIA_H100_80GB
NVIDIA_A100_80GB
NVIDIA_TESLA_A100 (NVIDIA A100 40GB)
NVIDIA_TESLA_P4
NVIDIA_TESLA_P100
NVIDIA_TESLA_T4
NVIDIA_TESLA_V100
NVIDIA_L4

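To learn more about the technical specifications of each type of GPU, read the Compute Engine documentation about GPUs for compute workloads. To learn about the cost of using each machine type for custom training, read Pricing.

In your WorkerPoolSpec, specify the type of GPU that you want to use in the machineSpec.acceleratorType field, and specify the number of GPUs that each VM in the worker pool uses in the machineSpec.acceleratorCount field. Your choices for these fields must meet the following restrictions:

- The type of GPU that you choose must be available in the location where you are performing custom training. Not all types of GPU are available in all regions. Learn about regional availability.
- You can only use certain numbers of GPUs in your configuration. For example, you can use 2 or 4 NVIDIA_TESLA_T4 GPUs on a VM, but not 3. To see which acceleratorCount values are valid for each type of GPU, see the compatibility table.
- Your GPU configuration must provide enough virtual CPUs and memory for the machine type that you use it with. For example, if you use the n1-standard-32 machine type in your worker pool, each VM has 32 virtual CPUs and 120 GB of memory. Because each NVIDIA_TESLA_V100 GPU can provide up to 12 virtual CPUs and 76 GB of memory, you must use at least 4 GPUs for each n1-standard-32 VM to support its requirements. (2 GPUs provide insufficient resources, and you can't specify 3 GPUs.)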
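
The n1-standard-32 arithmetic above can be worked through in a few lines. The following is a minimal sketch, assuming the per-GPU limits quoted for NVIDIA_TESLA_V100 (12 virtual CPUs and 76 GB of memory) and the allowed counts 1, 2, 4, and 8. The function name `min_gpu_count` is hypothetical, and the function is not an official sizing tool.

```python
from typing import Iterable, Optional


def min_gpu_count(
    vm_vcpus: int,
    vm_memory_gb: float,
    vcpus_per_gpu: int,
    memory_gb_per_gpu: float,
    allowed_counts: Iterable[int],
) -> Optional[int]:
    """Return the smallest allowed GPU count that covers the VM's vCPUs and memory."""
    for count in sorted(allowed_counts):
        enough_vcpus = count * vcpus_per_gpu >= vm_vcpus
        enough_memory = count * memory_gb_per_gpu >= vm_memory_gb
        if enough_vcpus and enough_memory:
            return count
    return None  # No allowed count is large enough.


# n1-standard-32: 32 vCPUs and 120 GB of memory.
# 2 GPUs cover only 24 vCPUs, 3 is not an allowed count, so the answer is 4.
v100_count = min_gpu_count(32, 120, 12, 76, allowed_counts=(1, 2, 4, 8))
```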

The following compatibility table accounts for this requirement.

Note the following additional limitation on using GPUs for custom training, which differs from using GPUs with Compute Engine:
- A configuration with 4 NVIDIA_TESLA_P100 GPUs only provides up to 64 virtual CPUs and up to 208 GB of memory in all regions and zones.

The following compatibility table lists the valid values for machineSpec.acceleratorCount depending on your choices for machineSpec.machineType and machineSpec.acceleratorType:

Valid numbers of GPUs for each machine type
Machine type	`NVIDIA_H100_80GB`	`NVIDIA_A100_80GB`	`NVIDIA_TESLA_A100`	`NVIDIA_TESLA_P4`	`NVIDIA_TESLA_P100`	`NVIDIA_TESLA_T4`	`NVIDIA_TESLA_V100`	`NVIDIA_L4`
`a3-highgpu-8g`	8
`a2-ultragpu-1g`		1
`a2-ultragpu-2g`		2
`a2-ultragpu-4g`		4
`a2-ultragpu-8g`		8
`a2-highgpu-1g`			1
`a2-highgpu-2g`			2
`a2-highgpu-4g`			4
`a2-highgpu-8g`			8
`a2-megagpu-16g`			16
`n1-standard-4`				1, 2, 4	1, 2, 4	1, 2, 4	1, 2, 4, 8
`n1-standard-8`				1, 2, 4	1, 2, 4	1, 2, 4	1, 2, 4, 8
`n1-standard-16`				1, 2, 4	1, 2, 4	1, 2, 4	2, 4, 8
`n1-standard-32`				2, 4	2, 4	2, 4	4, 8
`n1-standard-64`				4		4	8
`n1-standard-96`				4		4	8
`n1-highmem-2`				1, 2, 4	1, 2, 4	1, 2, 4	1, 2, 4, 8
`n1-highmem-4`				1, 2, 4	1, 2, 4	1, 2, 4	1, 2, 4, 8
`n1-highmem-8`				1, 2, 4	1, 2, 4	1, 2, 4	1, 2, 4, 8
`n1-highmem-16`				1, 2, 4	1, 2, 4	1, 2, 4	2, 4, 8
`n1-highmem-32`				2, 4	2, 4	2, 4	4, 8
`n1-highmem-64`				4		4	8
`n1-highmem-96`				4		4	8
`n1-highcpu-16`				1, 2, 4	1, 2, 4	1, 2, 4	2, 4, 8
`n1-highcpu-32`				2, 4	2, 4	2, 4	4, 8
`n1-highcpu-64`				4	4	4	8
`n1-highcpu-96`				4		4	8
`g2-standard-4`								1
`g2-standard-8`								1
`g2-standard-12`								1
`g2-standard-16`								1
`g2-standard-24`								2
`g2-standard-32`								1
`g2-standard-48`								4
`g2-standard-96`								8
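
A table like this can be encoded as a lookup so that a configuration is checked before a job is submitted. The following is a minimal sketch that copies only a handful of rows from the table above. The names `VALID_ACCELERATOR_COUNTS` and `is_valid_gpu_config` are hypothetical, and any combination not copied into the dictionary is reported as invalid, even if the full table allows it.

```python
# Partial excerpt of the compatibility table: (machine type, GPU type) -> valid counts.
VALID_ACCELERATOR_COUNTS: dict[tuple[str, str], tuple[int, ...]] = {
    ("a3-highgpu-8g", "NVIDIA_H100_80GB"): (8,),
    ("a2-ultragpu-1g", "NVIDIA_A100_80GB"): (1,),
    ("a2-highgpu-4g", "NVIDIA_TESLA_A100"): (4,),
    ("n1-standard-4", "NVIDIA_TESLA_T4"): (1, 2, 4),
    ("n1-standard-32", "NVIDIA_TESLA_T4"): (2, 4),
    ("n1-standard-32", "NVIDIA_TESLA_V100"): (4, 8),
    ("n1-standard-64", "NVIDIA_TESLA_P4"): (4,),
    ("g2-standard-24", "NVIDIA_L4"): (2,),
}


def is_valid_gpu_config(machine_type: str, accelerator_type: str, accelerator_count: int) -> bool:
    """Check a machineSpec against the excerpted compatibility rows."""
    valid_counts = VALID_ACCELERATOR_COUNTS.get((machine_type, accelerator_type), ())
    return accelerator_count in valid_counts
```

For example, `is_valid_gpu_config("n1-standard-32", "NVIDIA_TESLA_V100", 2)` returns `False` because the table lists only 4 and 8 for that pairing.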

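The following examples highlight where you can specify GPUs when you create a CustomJob:

Console

In the Google Cloud console, you can't create a CustomJob directly. However, you can create a TrainingPipeline that creates a CustomJob. When you create a TrainingPipeline in the Google Cloud console, you can specify GPUs for each worker pool on the Compute and pricing step. First specify a machine type. Then specify GPU details in the Accelerator type and Accelerator count fields.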

gcloud

To specify GPUs using the Google Cloud CLI tool, you must use a config.yaml file. For example:

`config.yaml`

workerPoolSpecs:
  machineSpec:
    machineType: MACHINE_TYPE
    acceleratorType: ACCELERATOR_TYPE
    acceleratorCount: ACCELERATOR_COUNT
  replicaCount: REPLICA_COUNT
  containerSpec:
    imageUri: CUSTOM_CONTAINER_IMAGE_URI

Then run a command like the following:

gcloud ai custom-jobs create \
  --region=LOCATION \
  --display-name=JOB_NAME \
  --config=config.yaml

Node.js

/**
 * TODO(developer): Uncomment these variables before running the sample.
 * (Not necessary if passing values as arguments)
 */

// const customJobDisplayName = 'YOUR_CUSTOM_JOB_DISPLAY_NAME';
// const containerImageUri = 'YOUR_CONTAINER_IMAGE_URI';
// const project = 'YOUR_PROJECT_ID';
// const location = 'YOUR_PROJECT_LOCATION';

// Imports the Google Cloud Job Service Client library
const {JobServiceClient} = require('@google-cloud/aiplatform');

// Specifies the location of the api endpoint
const clientOptions = {
  apiEndpoint: 'us-central1-aiplatform.googleapis.com',
};

// Instantiates a client
const jobServiceClient = new JobServiceClient(clientOptions);

async function createCustomJob() {
  // Configure the parent resource
  const parent = `projects/${project}/locations/${location}`;
  const customJob = {
    displayName: customJobDisplayName,
    jobSpec: {
      workerPoolSpecs: [
        {
          machineSpec: {
            machineType: 'n1-standard-4',
            acceleratorType: 'NVIDIA_TESLA_T4',
            acceleratorCount: 1,
          },
          replicaCount: 1,
          containerSpec: {
            imageUri: containerImageUri,
            command: [],
            args: [],
          },
        },
      ],
    },
  };
  const request = {parent, customJob};

  // Create custom job request
  const [response] = await jobServiceClient.createCustomJob(request);

  console.log('Create custom job response:\n', JSON.stringify(response));
}
createCustomJob();

Python

from google.cloud import aiplatform


def create_custom_job_sample(
    project: str,
    display_name: str,
    container_image_uri: str,
    location: str = "us-central1",
    api_endpoint: str = "us-central1-aiplatform.googleapis.com",
):
    # The AI Platform services require regional API endpoints.
    client_options = {"api_endpoint": api_endpoint}
    # Initialize client that will be used to create and send requests.
    # This client only needs to be created once, and can be reused for multiple requests.
    client = aiplatform.gapic.JobServiceClient(client_options=client_options)
    custom_job = {
        "display_name": display_name,
        "job_spec": {
            "worker_pool_specs": [
                {
                    "machine_spec": {
                        "machine_type": "n1-standard-4",
                        "accelerator_type": aiplatform.gapic.AcceleratorType.NVIDIA_TESLA_T4,
                        "accelerator_count": 1,
                    },
                    "replica_count": 1,
                    "container_spec": {
                        "image_uri": container_image_uri,
                        "command": [],
                        "args": [],
                    },
                }
            ]
        },
    }
    parent = f"projects/{project}/locations/{location}"
    response = client.create_custom_job(parent=parent, custom_job=custom_job)
    print("response:", response)

For more context, read the guide to creating a CustomJob.

TPU

To use Tensor Processing Units (TPUs) for custom training on Vertex AI, configure a worker pool to use a TPU VM.

When you use a TPU VM in Vertex AI, you must use only a single worker pool for custom training, and you must configure this worker pool to use only one replica.

TPU v2 and v3

To use TPU v2 or v3 VMs in your worker pool, you must use one of the following configurations:

To configure a TPU VM with TPU v2, specify the following fields in the WorkerPoolSpec:
- Set machineSpec.machineType to cloud-tpu.
- Set machineSpec.acceleratorType to TPU_V2.
- Set machineSpec.acceleratorCount to 8 for a single TPU, or to 32 or a multiple of 32 for TPU Pods.
- Set replicaCount to 1.

To configure a TPU VM with TPU v3, specify the following fields in the WorkerPoolSpec:
- Set machineSpec.machineType to cloud-tpu.
- Set machineSpec.acceleratorType to TPU_V3.
- Set machineSpec.acceleratorCount to 8 for a single TPU, or to 32 or more for TPU Pods.
- Set replicaCount to 1.
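
The two field lists above can be folded into one small builder that rejects counts the text does not allow. The following is a minimal sketch that only produces a WorkerPoolSpec-shaped dictionary. The function name `tpu_v2_v3_worker_pool_spec` is hypothetical, and the TPU v3 Pod rule is read literally as "32 or more", which is an assumption about how strictly that rule applies.

```python
from typing import Any


def tpu_v2_v3_worker_pool_spec(version: int, accelerator_count: int, image_uri: str) -> dict[str, Any]:
    """Build a single-replica worker pool spec for a TPU v2 or v3 VM."""
    if version not in (2, 3):
        raise ValueError("version must be 2 or 3")

    if version == 2:
        # TPU v2 Pods: 32 or a multiple of 32.
        is_pod_count = accelerator_count >= 32 and accelerator_count % 32 == 0
    else:
        # TPU v3 Pods: 32 or more (literal reading; assumption).
        is_pod_count = accelerator_count >= 32

    if accelerator_count != 8 and not is_pod_count:
        raise ValueError(f"Unsupported acceleratorCount for TPU v{version}: {accelerator_count}")

    return {
        "machineSpec": {
            "machineType": "cloud-tpu",
            "acceleratorType": f"TPU_V{version}",
            "acceleratorCount": accelerator_count,
        },
        "replicaCount": 1,  # TPU VMs require exactly one replica.
        "containerSpec": {"imageUri": image_uri},
    }


single_tpu_spec = tpu_v2_v3_worker_pool_spec(2, 8, "CUSTOM_CONTAINER_IMAGE_URI")
```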

TPU v5e

TPU v5e requires JAX 0.4.6 or later, TensorFlow 2.15 or later, or PyTorch 2.1 or later. To configure a TPU VM with TPU v5e, specify the following fields in the WorkerPoolSpec:

- Set machineSpec.machineType to ct5lp-hightpu-1t, ct5lp-hightpu-4t, or ct5lp-hightpu-8t.
- Set machineSpec.tpuTopology to a topology that the machine type supports. For details, see the following table.
- Set replicaCount to 1.

The following table shows the TPU v5e machine types and topologies that are supported for custom training:

Machine type	Topology	Number of TPU chips	Number of VMs	Recommended use case
`ct5lp-hightpu-1t`	1x1	1	1	Small to medium scale training
`ct5lp-hightpu-4t`	2x2	4	1	Small to medium scale training
`ct5lp-hightpu-8t`	2x4	8	1	Small to medium scale training
`ct5lp-hightpu-4t`	2x4	8	2	Small to medium scale training
`ct5lp-hightpu-4t`	4x4	16	4	Large scale training
`ct5lp-hightpu-4t`	4x8	32	8	Large scale training
`ct5lp-hightpu-4t`	8x8	64	16	Large scale training
`ct5lp-hightpu-4t`	8x16	128	32	Large scale training
`ct5lp-hightpu-4t`	16x16	256	64	Large scale training
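
The chip and VM counts in this table follow from the topology: the chip count is the product of the two dimensions, and the VM count is the chip count divided by the chips on each VM (1, 4, or 8, as the machine type names suggest). The following is a minimal sketch of that arithmetic. The names `CHIPS_PER_VM`, `SUPPORTED_TOPOLOGIES`, and `v5e_footprint` are hypothetical.

```python
# Chips hosted by one VM of each machine type.
CHIPS_PER_VM = {
    "ct5lp-hightpu-1t": 1,
    "ct5lp-hightpu-4t": 4,
    "ct5lp-hightpu-8t": 8,
}

# Topologies listed for each machine type in the table above.
SUPPORTED_TOPOLOGIES = {
    "ct5lp-hightpu-1t": {"1x1"},
    "ct5lp-hightpu-4t": {"2x2", "2x4", "4x4", "4x8", "8x8", "8x16", "16x16"},
    "ct5lp-hightpu-8t": {"2x4"},
}


def v5e_footprint(machine_type: str, topology: str) -> tuple[int, int]:
    """Return (number of TPU chips, number of VMs) for a supported pairing."""
    if topology not in SUPPORTED_TOPOLOGIES.get(machine_type, set()):
        raise ValueError(f"{topology} is not listed for {machine_type}")
    rows, cols = (int(part) for part in topology.split("x"))
    chips = rows * cols
    return chips, chips // CHIPS_PER_VM[machine_type]


chips, vms = v5e_footprint("ct5lp-hightpu-4t", "4x4")  # 16 chips on 4 VMs
```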

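Custom training jobs that run on TPU v5e VMs are optimized for throughput and availability. For more information, see v5e Training accelerator types.

TPU v5e machines are available in us-west1 and us-west4 for Vertex AI custom training. For more information about TPU v5e, see Cloud TPU v5e training.

Machine type comparison:

Machine type	ct5lp-hightpu-1t	ct5lp-hightpu-4t	ct5lp-hightpu-8t
Number of v5e chips	1	4	8
Number of vCPUs	24	112	224
RAM (GB)	48	192	384
Number of NUMA nodes	1	1	2
Likelihood of preemption	High	Medium	Low

Example `CustomJob` specifying a TPU VM

The following example highlights how to specify a TPU VM when you create a CustomJob:

gcloud

To specify a TPU VM using the gcloud CLI tool, you must use a config.yaml file. The following examples show one configuration for TPU v2 / v3 and one for TPU v5e: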

TPU v2 / v3

workerPoolSpecs:
  machineSpec:
    machineType: cloud-tpu
    acceleratorType: TPU_V2
    acceleratorCount: 8
  replicaCount: 1
  containerSpec:
    imageUri: CUSTOM_CONTAINER_IMAGE_URI

TPU v5e

workerPoolSpecs:
  machineSpec:
    machineType: ct5lp-hightpu-4t
    tpuTopology: 4x4
  replicaCount: 1
  containerSpec:
    imageUri: CUSTOM_CONTAINER_IMAGE_URI

Then run a command like the following:

gcloud ai custom-jobs create \
  --region=LOCATION \
  --display-name=JOB_NAME \
  --config=config.yaml

Python

Before trying this sample, follow the Python setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Python API reference documentation.

To specify a TPU VM using the Vertex AI SDK for Python, see the following example:

from google.cloud import aiplatform

job = aiplatform.CustomContainerTrainingJob(
    display_name='DISPLAY_NAME',
    location='us-west1',
    project='PROJECT_ID',
    staging_bucket="gs://CLOUD_STORAGE_URI",
    container_uri='CONTAINER_URI')

job.run(machine_type='ct5lp-hightpu-4t', tpu_topology='2x2')

For more information about creating a custom training job, see Create custom training jobs.

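Boot disk options

You can optionally customize the boot disks for your training VMs. All VMs in a worker pool use the same type and size of boot disk.

- To customize the type of boot disk that each training VM uses, specify the diskSpec.bootDiskType field in your WorkerPoolSpec.

  Set this field to pd-standard to use a standard persistent disk backed by a standard hard drive, or set it to pd-ssd to use an SSD persistent disk backed by a solid-state drive. The default value is pd-ssd.

  Using pd-ssd might improve performance if your training code reads from and writes to disk. Learn about disk types.

- To customize the size (in GB) of the boot disk that each training VM uses, specify the diskSpec.bootDiskSizeGb field in your WorkerPoolSpec.

  You can set this field to an integer between 100 and 64,000, inclusive. The default value is 100.

  Consider increasing the boot disk size if your training code writes a lot of temporary data to disk. Note that any data you write to the boot disk is temporary, and you can't retrieve it after training completes.

Changing the type and size of your boot disks affects custom training pricing.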
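
The allowed values and defaults described above can be captured in a short validation helper. The following is a minimal sketch that returns a diskSpec-shaped dictionary using the defaults stated in this section (pd-ssd and 100 GB). The function name `make_disk_spec` is hypothetical and is not part of any SDK.

```python
from typing import Any

VALID_BOOT_DISK_TYPES = ("pd-standard", "pd-ssd")
MIN_BOOT_DISK_SIZE_GB = 100
MAX_BOOT_DISK_SIZE_GB = 64_000


def make_disk_spec(boot_disk_type: str = "pd-ssd", boot_disk_size_gb: int = 100) -> dict[str, Any]:
    """Validate boot disk options and return a diskSpec-shaped dictionary."""
    if boot_disk_type not in VALID_BOOT_DISK_TYPES:
        raise ValueError(f"bootDiskType must be one of {VALID_BOOT_DISK_TYPES}")
    # Reject bool explicitly because bool is a subclass of int in Python.
    if isinstance(boot_disk_size_gb, bool) or not isinstance(boot_disk_size_gb, int):
        raise ValueError("bootDiskSizeGb must be an integer")
    if not MIN_BOOT_DISK_SIZE_GB <= boot_disk_size_gb <= MAX_BOOT_DISK_SIZE_GB:
        raise ValueError("bootDiskSizeGb must be between 100 and 64,000")
    return {"bootDiskType": boot_disk_type, "bootDiskSizeGb": boot_disk_size_gb}


default_disk_spec = make_disk_spec()
large_scratch_disk_spec = make_disk_spec("pd-standard", 500)
```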

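The following examples highlight where you can specify boot disk options when you create a CustomJob:

Console

In the Google Cloud console, you can't create a CustomJob directly. However, you can create a TrainingPipeline that creates a CustomJob. When you create a TrainingPipeline in the Google Cloud console, you can specify boot disk options for each worker pool on the Compute and pricing step, in the Disk type drop-down list and the Disk size (GB) field.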

gcloud

To specify boot disk options using the Google Cloud CLI tool, you must use a config.yaml file. For example:

`config.yaml`

workerPoolSpecs:
  machineSpec:
    machineType: MACHINE_TYPE
  diskSpec:
    bootDiskType: DISK_TYPE
    bootDiskSizeGb: DISK_SIZE
  replicaCount: REPLICA_COUNT
  containerSpec:
    imageUri: CUSTOM_CONTAINER_IMAGE_URI

Then run a command like the following:

gcloud ai custom-jobs create \
  --region=LOCATION \
  --display-name=JOB_NAME \
  --config=config.yaml

For more context, read the guide to creating a CustomJob.

What's next

- Learn how to create a persistent resource to run custom training jobs.
- Learn how to perform custom training by creating a CustomJob.
