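Configure compute resources for custom training

When you perform custom training, your training code runs on one or more virtual machine (VM) instances. You can configure which types of VM to use for training. VMs with more compute resources can speed up training and let you work with larger datasets, but they can also increase training costs.

In some cases, you can also use GPUs to accelerate training. GPUs incur additional costs.

You can also optionally customize the type and size of your training VMs' boot disks.

This document describes the compute resources that you can use for custom training and how to configure them.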

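Manage cost and availability

To help you manage costs or ensure the availability of VM resources, Vertex AI provides the following:

- To ensure that VM resources are available when your training jobs need them, you can use Compute Engine reservations. Reservations provide a high level of assurance in obtaining capacity for Compute Engine resources. For more information, see Use reservations with training.
- To reduce the cost of running your training jobs, you can use Spot VMs. Spot VMs are VM instances that run on excess Compute Engine capacity. Spot VMs are significantly discounted, but Compute Engine might preemptively stop or delete them at any time to reclaim the capacity. For more information, see Use Spot VMs with training.
- For custom training jobs that request GPU resources, Dynamic Workload Scheduler lets you schedule the jobs based on when the requested GPU resources become available. For more information, see Schedule training jobs based on resource availability.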

Where to specify compute resources

Specify configuration details within a WorkerPoolSpec. Depending on how you perform custom training, put this WorkerPoolSpec in one of the following API fields:

- If you are creating a CustomJob resource, specify the WorkerPoolSpec in CustomJob.jobSpec.workerPoolSpecs.

  If you are using the Google Cloud CLI, you can use the --worker-pool-spec flag or the --config flag on the gcloud ai custom-jobs create command to specify worker pool options.

  Learn more about creating a CustomJob.
- If you are creating a HyperparameterTuningJob resource, specify the WorkerPoolSpec in HyperparameterTuningJob.trialJobSpec.workerPoolSpecs.

  If you are using the gcloud CLI, you can use the --config flag on the gcloud ai hp-tuning-jobs create command to specify worker pool options.

  Learn more about creating a HyperparameterTuningJob.
- If you are creating a TrainingPipeline resource without hyperparameter tuning, specify the WorkerPoolSpec in TrainingPipeline.trainingTaskInputs.workerPoolSpecs.

  Learn more about creating a custom TrainingPipeline.
- If you are creating a TrainingPipeline with hyperparameter tuning, specify the WorkerPoolSpec in TrainingPipeline.trainingTaskInputs.trialJobSpec.workerPoolSpecs.

If you are performing distributed training, you can use different settings for each worker pool.
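
The field paths listed above differ only in how deeply the worker pool specs are nested. The following Python sketch illustrates that nesting with plain dictionaries. It does not call any Vertex AI API; the function name `place_worker_pool_specs` and the resource-kind labels are hypothetical names chosen for this illustration.

```python
# Field paths copied from the list above. The dictionary keys are
# illustrative labels, not API identifiers.
WORKER_POOL_SPEC_PATHS = {
    "CustomJob": ("jobSpec", "workerPoolSpecs"),
    "HyperparameterTuningJob": ("trialJobSpec", "workerPoolSpecs"),
    "TrainingPipeline": ("trainingTaskInputs", "workerPoolSpecs"),
    "TrainingPipeline (tuning)": (
        "trainingTaskInputs",
        "trialJobSpec",
        "workerPoolSpecs",
    ),
}


def place_worker_pool_specs(resource_kind, worker_pool_specs):
    """Return a request-body fragment with the specs at the documented path."""
    path = WORKER_POOL_SPEC_PATHS[resource_kind]
    body = {}
    node = body
    # Create one nested dictionary per intermediate path element.
    for key in path[:-1]:
        node[key] = {}
        node = node[key]
    node[path[-1]] = list(worker_pool_specs)
    return body


spec = {"machineSpec": {"machineType": "n1-standard-4"}, "replicaCount": 1}
print(place_worker_pool_specs("CustomJob", [spec]))
print(place_worker_pool_specs("TrainingPipeline (tuning)", [spec]))
```

Each worker pool is one element of the list, so distributed training with different settings per pool corresponds to passing several spec dictionaries.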

Machine types

In your WorkerPoolSpec, you must specify one of the following machine types in the machineSpec.machineType field. Each replica in the worker pool runs on a separate VM that has the specified machine type.

a3-megagpu-8g^*
a3-highgpu-1g^*
a3-highgpu-2g^*
a3-highgpu-4g^*
a3-highgpu-8g^*
a2-ultragpu-1g^*
a2-ultragpu-2g^*
a2-ultragpu-4g^*
a2-ultragpu-8g^*
a2-highgpu-1g^*
a2-highgpu-2g^*
a2-highgpu-4g^*
a2-highgpu-8g^*
a2-megagpu-16g^*
e2-standard-4
e2-standard-8
e2-standard-16
e2-standard-32
e2-highmem-2
e2-highmem-4
e2-highmem-8
e2-highmem-16
e2-highcpu-16
e2-highcpu-32
n2-standard-4
n2-standard-8
n2-standard-16
n2-standard-32
n2-standard-48
n2-standard-64
n2-standard-80
n2-highmem-2
n2-highmem-4
n2-highmem-8
n2-highmem-16
n2-highmem-32
n2-highmem-48
n2-highmem-64
n2-highmem-80
n2-highcpu-16
n2-highcpu-32
n2-highcpu-48
n2-highcpu-64
n2-highcpu-80
n1-standard-4
n1-standard-8
n1-standard-16
n1-standard-32
n1-standard-64
n1-standard-96
n1-highmem-2
n1-highmem-4
n1-highmem-8
n1-highmem-16
n1-highmem-32
n1-highmem-64
n1-highmem-96
n1-highcpu-16
n1-highcpu-32
n1-highcpu-64
n1-highcpu-96
c2-standard-4
c2-standard-8
c2-standard-16
c2-standard-30
c2-standard-60
ct5lp-hightpu-1t^*
ct5lp-hightpu-4t^*
ct5lp-hightpu-8t^*
m1-ultramem-40
m1-ultramem-80
m1-ultramem-160
m1-megamem-96
g2-standard-4^*
g2-standard-8^*
g2-standard-12^*
g2-standard-16^*
g2-standard-24^*
g2-standard-32^*
g2-standard-48^*
g2-standard-96^*
cloud-tpu^*

^* Machine types marked with an asterisk in the preceding list must be used with certain GPUs or TPUs. See the following sections of this guide.
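
The asterisk rule above can be approximated by checking the machine family prefix. This is a minimal sketch: the prefix tuple is derived from the starred entries in the preceding list, and the function name `needs_accelerator` is hypothetical, not part of any Vertex AI API.

```python
# Machine families whose entries are starred in the preceding list.
# Derived by hand from that list; update it if the list changes.
STARRED_PREFIXES = ("a3-", "a2-", "g2-", "ct5lp-")


def needs_accelerator(machine_type):
    """Return True if the machine type must be paired with a GPU or TPU."""
    return machine_type == "cloud-tpu" or machine_type.startswith(STARRED_PREFIXES)


for name in ("a2-highgpu-1g", "n1-standard-4", "cloud-tpu"):
    print(name, needs_accelerator(name))
```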

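To learn about the technical specifications of each machine type, read the Compute Engine documentation about machine types. To learn about the cost of using each machine type for custom training, read Pricing.

The following examples highlight where you specify a machine type when you create a CustomJob:

Console

In the Google Cloud console, you can't create a CustomJob directly. However, you can create a TrainingPipeline that creates a CustomJob. When you create a TrainingPipeline in the Google Cloud console, specify a machine type for each worker pool on the Compute and pricing step, in the Machine type field.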

gcloud

gcloud ai custom-jobs create \
  --region=LOCATION \
  --display-name=JOB_NAME \
  --worker-pool-spec=machine-type=MACHINE_TYPE,replica-count=REPLICA_COUNT,container-image-uri=CUSTOM_CONTAINER_IMAGE_URI

Java

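Before trying this sample, follow the Java setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Java API reference documentation.

To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.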


import com.google.cloud.aiplatform.v1.AcceleratorType;
import com.google.cloud.aiplatform.v1.ContainerSpec;
import com.google.cloud.aiplatform.v1.CustomJob;
import com.google.cloud.aiplatform.v1.CustomJobSpec;
import com.google.cloud.aiplatform.v1.JobServiceClient;
import com.google.cloud.aiplatform.v1.JobServiceSettings;
import com.google.cloud.aiplatform.v1.LocationName;
import com.google.cloud.aiplatform.v1.MachineSpec;
import com.google.cloud.aiplatform.v1.WorkerPoolSpec;
import java.io.IOException;

// Create a custom job to run machine learning training code in Vertex AI
public class CreateCustomJobSample {

  public static void main(String[] args) throws IOException {
    // TODO(developer): Replace these variables before running the sample.
    String project = "PROJECT";
    String displayName = "DISPLAY_NAME";

    // Vertex AI runs your training application in a Docker container image. A Docker container
    // image is a self-contained software package that includes code and all dependencies. Learn
    // more about preparing your training application at
    // https://cloud.google.com/vertex-ai/docs/training/overview#prepare_your_training_application
    String containerImageUri = "CONTAINER_IMAGE_URI";
    createCustomJobSample(project, displayName, containerImageUri);
  }

  static void createCustomJobSample(String project, String displayName, String containerImageUri)
      throws IOException {
    JobServiceSettings settings =
        JobServiceSettings.newBuilder()
            .setEndpoint("us-central1-aiplatform.googleapis.com:443")
            .build();
    String location = "us-central1";

    // Initialize client that will be used to send requests. This client only needs to be created
    // once, and can be reused for multiple requests.
    try (JobServiceClient client = JobServiceClient.create(settings)) {
      MachineSpec machineSpec =
          MachineSpec.newBuilder()
              .setMachineType("n1-standard-4")
              .setAcceleratorType(AcceleratorType.NVIDIA_TESLA_T4)
              .setAcceleratorCount(1)
              .build();

      ContainerSpec containerSpec =
          ContainerSpec.newBuilder().setImageUri(containerImageUri).build();

      WorkerPoolSpec workerPoolSpec =
          WorkerPoolSpec.newBuilder()
              .setMachineSpec(machineSpec)
              .setReplicaCount(1)
              .setContainerSpec(containerSpec)
              .build();

      CustomJobSpec customJobSpecJobSpec =
          CustomJobSpec.newBuilder().addWorkerPoolSpecs(workerPoolSpec).build();

      CustomJob customJob =
          CustomJob.newBuilder()
              .setDisplayName(displayName)
              .setJobSpec(customJobSpecJobSpec)
              .build();
      LocationName parent = LocationName.of(project, location);
      CustomJob response = client.createCustomJob(parent, customJob);
      System.out.format("response: %s\n", response);
      System.out.format("Name: %s\n", response.getName());
    }
  }
}

Node.js

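Before trying this sample, follow the Node.js setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Node.js API reference documentation.

To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.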

/**
 * TODO(developer): Uncomment these variables before running the sample.
 * (Not necessary if passing values as arguments)
 */

// const customJobDisplayName = 'YOUR_CUSTOM_JOB_DISPLAY_NAME';
// const containerImageUri = 'YOUR_CONTAINER_IMAGE_URI';
// const project = 'YOUR_PROJECT_ID';
// const location = 'YOUR_PROJECT_LOCATION';

// Imports the Google Cloud Job Service Client library
const {JobServiceClient} = require('@google-cloud/aiplatform');

// Specifies the location of the api endpoint
const clientOptions = {
  apiEndpoint: 'us-central1-aiplatform.googleapis.com',
};

// Instantiates a client
const jobServiceClient = new JobServiceClient(clientOptions);

async function createCustomJob() {
  // Configure the parent resource
  const parent = `projects/${project}/locations/${location}`;
  const customJob = {
    displayName: customJobDisplayName,
    jobSpec: {
      workerPoolSpecs: [
        {
          machineSpec: {
            machineType: 'n1-standard-4',
            acceleratorType: 'NVIDIA_TESLA_T4',
            acceleratorCount: 1,
          },
          replicaCount: 1,
          containerSpec: {
            imageUri: containerImageUri,
            command: [],
            args: [],
          },
        },
      ],
    },
  };
  const request = {parent, customJob};

  // Create custom job request
  const [response] = await jobServiceClient.createCustomJob(request);

  console.log('Create custom job response:\n', JSON.stringify(response));
}
createCustomJob();

Vertex AI SDK for Python

To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Vertex AI SDK for Python API reference documentation.

from google.cloud import aiplatform


def create_custom_job_sample(
    project: str,
    display_name: str,
    container_image_uri: str,
    location: str = "us-central1",
    api_endpoint: str = "us-central1-aiplatform.googleapis.com",
):
    # The AI Platform services require regional API endpoints.
    client_options = {"api_endpoint": api_endpoint}
    # Initialize client that will be used to create and send requests.
    # This client only needs to be created once, and can be reused for multiple requests.
    client = aiplatform.gapic.JobServiceClient(client_options=client_options)
    custom_job = {
        "display_name": display_name,
        "job_spec": {
            "worker_pool_specs": [
                {
                    "machine_spec": {
                        "machine_type": "n1-standard-4",
                        "accelerator_type": aiplatform.gapic.AcceleratorType.NVIDIA_TESLA_T4,
                        "accelerator_count": 1,
                    },
                    "replica_count": 1,
                    "container_spec": {
                        "image_uri": container_image_uri,
                        "command": [],
                        "args": [],
                    },
                }
            ]
        },
    }
    parent = f"projects/{project}/locations/{location}"
    response = client.create_custom_job(parent=parent, custom_job=custom_job)
    print("response:", response)

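For more context, read the guide to creating a CustomJob.

GPUs

If you have written your training code to use GPUs, you can configure your worker pool to use one or more GPUs on each VM. To use GPUs, you must use an A3, A2, N1, or G2 machine type. Additionally, using a smaller machine type such as n1-highmem-2 with GPUs might cause logging to fail for some workloads because of CPU constraints. If your training job stops returning logs, consider selecting a larger machine type.

Vertex AI supports the following types of GPU for custom training: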

NVIDIA_H100_MEGA_80GB^* (includes GPUDirect-TCPXO)
NVIDIA_H100_80GB
NVIDIA_A100_80GB
NVIDIA_TESLA_A100(NVIDIA A100 40GB)
NVIDIA_TESLA_P4
NVIDIA_TESLA_P100
NVIDIA_TESLA_T4
NVIDIA_TESLA_V100
NVIDIA_L4

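^* We recommend obtaining capacity by using shared reservations or Spot VMs.

To learn about the technical specifications of each type of GPU, read the Compute Engine documentation about GPUs for compute workloads. To learn about the cost of using each machine type for custom training, read Pricing.

In your WorkerPoolSpec, specify the type of GPU that you want to use in the machineSpec.acceleratorType field, and specify the number of GPUs that each VM in the worker pool uses in the machineSpec.acceleratorCount field. Your choices for these fields must meet the following restrictions:

- The type of GPU that you choose must be available in the location where you are performing custom training. Not all types of GPU are available in all regions. Learn about regional availability.
- You can use only certain numbers of GPUs in your configuration. For example, you can use 2 or 4 NVIDIA_TESLA_T4 GPUs on a VM, but not 3. To see which acceleratorCount values are valid for each type of GPU, see the following compatibility table.
- You must make sure that your GPU configuration provides sufficient virtual CPUs and memory for the machine type that you use it with. For example, if you use the n1-standard-32 machine type in your worker pool, then each VM has 32 virtual CPUs and 120 GB of memory. Because each NVIDIA_TESLA_V100 GPU can provide up to 12 virtual CPUs and 76 GB of memory, you must use at least 4 GPUs on each n1-standard-32 VM to support its requirements. (2 GPUs provide at most 24 virtual CPUs, which is insufficient, and you can't specify 3 GPUs.)

  The following compatibility table accounts for this requirement.
- The following additional limitation applies to using GPUs for custom training and differs from using GPUs on Compute Engine:
  - A configuration with 4 NVIDIA_TESLA_P100 GPUs provides only up to 64 virtual CPUs and up to 208 GB of memory in all regions and zones.
- For jobs that use Dynamic Workload Scheduler or Spot VMs, update the scheduling.strategy field of the CustomJob to the chosen strategy.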
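
The n1-standard-32 example above is a small arithmetic check, which the following sketch works through. It is an illustration only: the function name `min_gpu_count` is hypothetical, and the set of candidate counts (1, 2, 4, 8) is an assumption taken from the NVIDIA_TESLA_V100 column of the compatibility table, not a value returned by any API.

```python
def min_gpu_count(vm_vcpus, vm_memory_gb, vcpus_per_gpu, memory_gb_per_gpu,
                  allowed_counts):
    """Return the smallest allowed GPU count that covers the VM's vCPUs and memory."""
    for count in sorted(allowed_counts):
        enough_cpu = count * vcpus_per_gpu >= vm_vcpus
        enough_memory = count * memory_gb_per_gpu >= vm_memory_gb
        if enough_cpu and enough_memory:
            return count
    return None  # No allowed count is large enough.


# n1-standard-32: 32 vCPUs and 120 GB of memory.
# NVIDIA_TESLA_V100: up to 12 vCPUs and 76 GB of memory per GPU.
count = min_gpu_count(32, 120, 12, 76, allowed_counts=(1, 2, 4, 8))
print(count)  # 4, because 2 GPUs cover only 24 vCPUs and 3 is not an allowed count
```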

The following compatibility table lists the valid values for machineSpec.acceleratorCount, depending on your choices for machineSpec.machineType and machineSpec.acceleratorType:

Valid numbers of GPUs for each machine type
Machine type	`NVIDIA_H100_MEGA_80GB`	`NVIDIA_H100_80GB`	`NVIDIA_A100_80GB`	`NVIDIA_TESLA_A100`	`NVIDIA_TESLA_P4`	`NVIDIA_TESLA_P100`	`NVIDIA_TESLA_T4`	`NVIDIA_TESLA_V100`	`NVIDIA_L4`
`a3-megagpu-8g`	8
`a3-highgpu-1g`		1^*
`a3-highgpu-2g`		2^*
`a3-highgpu-4g`		4^*
`a3-highgpu-8g`		8
`a2-ultragpu-1g`			1
`a2-ultragpu-2g`			2
`a2-ultragpu-4g`			4
`a2-ultragpu-8g`			8
`a2-highgpu-1g`				1
`a2-highgpu-2g`				2
`a2-highgpu-4g`				4
`a2-highgpu-8g`				8
`a2-megagpu-16g`				16
`n1-standard-4`					1, 2, 4	1, 2, 4	1, 2, 4	1, 2, 4, 8
`n1-standard-8`					1, 2, 4	1, 2, 4	1, 2, 4	1, 2, 4, 8
`n1-standard-16`					1, 2, 4	1, 2, 4	1, 2, 4	2, 4, 8
`n1-standard-32`					2, 4	2, 4	2, 4	4, 8
`n1-standard-64`					4		4	8
`n1-standard-96`					4		4	8
`n1-highmem-2`					1, 2, 4	1, 2, 4	1, 2, 4	1, 2, 4, 8
`n1-highmem-4`					1, 2, 4	1, 2, 4	1, 2, 4	1, 2, 4, 8
`n1-highmem-8`					1, 2, 4	1, 2, 4	1, 2, 4	1, 2, 4, 8
`n1-highmem-16`					1, 2, 4	1, 2, 4	1, 2, 4	2, 4, 8
`n1-highmem-32`					2, 4	2, 4	2, 4	4, 8
`n1-highmem-64`					4		4	8
`n1-highmem-96`					4		4	8
`n1-highcpu-16`					1, 2, 4	1, 2, 4	1, 2, 4	2, 4, 8
`n1-highcpu-32`					2, 4	2, 4	2, 4	4, 8
`n1-highcpu-64`					4	4	4	8
`n1-highcpu-96`					4		4	8
`g2-standard-4`									1
`g2-standard-8`									1
`g2-standard-12`									1
`g2-standard-16`									1
`g2-standard-24`									2
`g2-standard-32`									1
`g2-standard-48`									4
`g2-standard-96`									8

^* The specified machine type is available only when you use Dynamic Workload Scheduler or Spot VMs.
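
One way to use the compatibility table is as a lookup keyed by machine type and accelerator type. The following sketch copies a handful of rows from the table by hand. The names `VALID_GPU_COUNTS` and `is_valid_gpu_config` are hypothetical, and the dictionary is deliberately incomplete, so a result of False can also mean that the pair was not copied into this sample.

```python
# A hand-copied subset of the compatibility table:
# (machine type, accelerator type) -> valid acceleratorCount values.
VALID_GPU_COUNTS = {
    ("a3-megagpu-8g", "NVIDIA_H100_MEGA_80GB"): (8,),
    ("a2-ultragpu-2g", "NVIDIA_A100_80GB"): (2,),
    ("a2-highgpu-4g", "NVIDIA_TESLA_A100"): (4,),
    ("n1-standard-4", "NVIDIA_TESLA_T4"): (1, 2, 4),
    ("n1-standard-32", "NVIDIA_TESLA_V100"): (4, 8),
    ("g2-standard-24", "NVIDIA_L4"): (2,),
}


def is_valid_gpu_config(machine_type, accelerator_type, accelerator_count):
    """Check a machineSpec combination against the copied table rows."""
    valid_counts = VALID_GPU_COUNTS.get((machine_type, accelerator_type), ())
    return accelerator_count in valid_counts


print(is_valid_gpu_config("n1-standard-4", "NVIDIA_TESLA_T4", 2))
print(is_valid_gpu_config("n1-standard-4", "NVIDIA_TESLA_T4", 3))
print(is_valid_gpu_config("n1-standard-32", "NVIDIA_TESLA_V100", 2))
```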

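The following examples highlight where you can specify GPUs when you create a CustomJob:

Console

In the Google Cloud console, you can't create a CustomJob directly. However, you can create a TrainingPipeline that creates a CustomJob. When you create a TrainingPipeline in the Google Cloud console, you can specify GPUs for each worker pool on the Compute and pricing step. First, specify a machine type. Then specify GPU details in the Accelerator type and Accelerator count fields.

gcloud

To specify GPUs by using the Google Cloud CLI tool, you must use a config.yaml file. For example: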

`config.yaml`

workerPoolSpecs:
  machineSpec:
    machineType: MACHINE_TYPE
    acceleratorType: ACCELERATOR_TYPE
    acceleratorCount: ACCELERATOR_COUNT
  replicaCount: REPLICA_COUNT
  containerSpec:
    imageUri: CUSTOM_CONTAINER_IMAGE_URI

Then run a command like the following:

gcloud ai custom-jobs create \
  --region=LOCATION \
  --display-name=JOB_NAME \
  --config=config.yaml

Node.js

To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

/**
 * TODO(developer): Uncomment these variables before running the sample.
 * (Not necessary if passing values as arguments)
 */

// const customJobDisplayName = 'YOUR_CUSTOM_JOB_DISPLAY_NAME';
// const containerImageUri = 'YOUR_CONTAINER_IMAGE_URI';
// const project = 'YOUR_PROJECT_ID';
// const location = 'YOUR_PROJECT_LOCATION';

// Imports the Google Cloud Job Service Client library
const {JobServiceClient} = require('@google-cloud/aiplatform');

// Specifies the location of the api endpoint
const clientOptions = {
  apiEndpoint: 'us-central1-aiplatform.googleapis.com',
};

// Instantiates a client
const jobServiceClient = new JobServiceClient(clientOptions);

async function createCustomJob() {
  // Configure the parent resource
  const parent = `projects/${project}/locations/${location}`;
  const customJob = {
    displayName: customJobDisplayName,
    jobSpec: {
      workerPoolSpecs: [
        {
          machineSpec: {
            machineType: 'n1-standard-4',
            acceleratorType: 'NVIDIA_TESLA_T4',
            acceleratorCount: 1,
          },
          replicaCount: 1,
          containerSpec: {
            imageUri: containerImageUri,
            command: [],
            args: [],
          },
        },
      ],
    },
  };
  const request = {parent, customJob};

  // Create custom job request
  const [response] = await jobServiceClient.createCustomJob(request);

  console.log('Create custom job response:\n', JSON.stringify(response));
}
createCustomJob();

Vertex AI SDK for Python

from google.cloud import aiplatform


def create_custom_job_sample(
    project: str,
    display_name: str,
    container_image_uri: str,
    location: str = "us-central1",
    api_endpoint: str = "us-central1-aiplatform.googleapis.com",
):
    # The AI Platform services require regional API endpoints.
    client_options = {"api_endpoint": api_endpoint}
    # Initialize client that will be used to create and send requests.
    # This client only needs to be created once, and can be reused for multiple requests.
    client = aiplatform.gapic.JobServiceClient(client_options=client_options)
    custom_job = {
        "display_name": display_name,
        "job_spec": {
            "worker_pool_specs": [
                {
                    "machine_spec": {
                        "machine_type": "n1-standard-4",
                        "accelerator_type": aiplatform.gapic.AcceleratorType.NVIDIA_TESLA_T4,
                        "accelerator_count": 1,
                    },
                    "replica_count": 1,
                    "container_spec": {
                        "image_uri": container_image_uri,
                        "command": [],
                        "args": [],
                    },
                }
            ]
        },
    }
    parent = f"projects/{project}/locations/{location}"
    response = client.create_custom_job(parent=parent, custom_job=custom_job)
    print("response:", response)

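For more context, read the guide to creating a CustomJob.

GPUDirect networking

In Vertex Training, some H100 series machines come preconfigured with the GPUDirect networking stack. GPUDirect can increase GPU-to-GPU networking speed by up to 2x compared to GPUs without GPUDirect.

GPUDirect achieves this by reducing the overhead required to transfer packet payloads between GPUs, which significantly improves throughput at scale.

The a3-megagpu-8g machine type has the following:

- Eight NVIDIA H100 GPUs per machine
- Up to 200 Gbps of bandwidth on the primary NIC
- Eight secondary NICs, each supporting up to 200 Gbps for GPU data transfer
- GPUDirect-TCPXO, which further improves GPU-to-VM communication

GPUs with GPUDirect are especially well suited to distributed training of large models.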

TPUs

To use Tensor Processing Units (TPUs) for custom training on Vertex AI, you can configure a worker pool to use a TPU VM.

When you use a TPU VM in Vertex AI, you must use only a single worker pool for custom training, and you must configure this worker pool to use only one replica.

TPU v2 and v3

To use TPU v2 or v3 VMs in your worker pool, you must use one of the following configurations:

- To configure a TPU VM with TPU v2, specify the following fields in the WorkerPoolSpec:
  - Set machineSpec.machineType to cloud-tpu.
  - Set machineSpec.acceleratorType to TPU_V2.
  - Set machineSpec.acceleratorCount to 8 for a single TPU, or to 32 or a multiple of 32 for TPU Pods.
  - Set replicaCount to 1.
- To configure a TPU VM with TPU v3, specify the following fields in the WorkerPoolSpec:
  - Set machineSpec.machineType to cloud-tpu.
  - Set machineSpec.acceleratorType to TPU_V3.
  - Set machineSpec.acceleratorCount to 8 for a single TPU, or to 32 or more for TPU Pods.
  - Set replicaCount to 1.
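
The TPU v2 and v3 rules above can be sketched as a small helper that builds the WorkerPoolSpec fields as a dictionary and rejects counts that the rules exclude. The function name `tpu_worker_pool_spec` is hypothetical. The sketch also assumes that "32 or more" for TPU v3 means any count of at least 32; check the valid Pod sizes before relying on that reading.

```python
def tpu_worker_pool_spec(accelerator_type, accelerator_count, image_uri):
    """Build a worker pool spec dictionary for a TPU v2 or v3 VM."""
    if accelerator_type == "TPU_V2":
        # 8 for a single TPU, or 32 or a multiple of 32 for TPU Pods.
        is_pod_size = accelerator_count >= 32 and accelerator_count % 32 == 0
        valid = accelerator_count == 8 or is_pod_size
    elif accelerator_type == "TPU_V3":
        # 8 for a single TPU, or 32 or more for TPU Pods (assumed reading).
        valid = accelerator_count == 8 or accelerator_count >= 32
    else:
        raise ValueError(f"Unsupported accelerator type: {accelerator_type}")
    if not valid:
        raise ValueError(
            f"Invalid acceleratorCount {accelerator_count} for {accelerator_type}"
        )
    return {
        "machineSpec": {
            "machineType": "cloud-tpu",
            "acceleratorType": accelerator_type,
            "acceleratorCount": accelerator_count,
        },
        "replicaCount": 1,  # TPU VM worker pools use exactly one replica.
        "containerSpec": {"imageUri": image_uri},
    }


print(tpu_worker_pool_spec("TPU_V2", 8, "CUSTOM_CONTAINER_IMAGE_URI"))
```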

TPU v5e

TPU v5e requires JAX 0.4.6 or later, TensorFlow 2.15 or later, or PyTorch 2.1 or later. To configure a TPU VM with TPU v5e, specify the following fields in the WorkerPoolSpec:

- Set machineSpec.machineType to ct5lp-hightpu-1t, ct5lp-hightpu-4t, or ct5lp-hightpu-8t.
- Set machineSpec.tpuTopology to a topology that the machine type supports. For details, see the following table.
- Set replicaCount to 1.

The following table shows the TPU v5e machine types and topologies that are supported for custom training:

Machine type	Topology	Number of TPU chips	Number of VMs	Recommended use case
`ct5lp-hightpu-1t`	1x1	1	1	Small to medium training
`ct5lp-hightpu-4t`	2x2	4	1	Small to medium training
`ct5lp-hightpu-8t`	2x4	8	1	Small to medium training
`ct5lp-hightpu-4t`	2x4	8	2	Small to medium training
`ct5lp-hightpu-4t`	4x4	16	4	Large-scale training
`ct5lp-hightpu-4t`	4x8	32	8	Large-scale training
`ct5lp-hightpu-4t`	8x8	64	16	Large-scale training
`ct5lp-hightpu-4t`	8x16	128	32	Large-scale training
`ct5lp-hightpu-4t`	16x16	256	64	Large-scale training
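
The rows of the table above follow a simple pattern: the chip count is the product of the topology dimensions, and the VM count is the chip count divided by the number of chips per VM in the machine type suffix (1t, 4t, or 8t). The following sketch reproduces that arithmetic. The pattern is inferred from the table rows rather than taken from a documented formula, and the function name `topology_summary` is hypothetical.

```python
def topology_summary(machine_type, topology):
    """Return (chip_count, vm_count) for a machine type and topology string."""
    # "ct5lp-hightpu-4t" -> "4t" -> 4 chips per VM.
    chips_per_vm = int(machine_type.rsplit("-", 1)[1].rstrip("t"))
    rows, cols = (int(part) for part in topology.split("x"))
    chip_count = rows * cols
    if chip_count % chips_per_vm != 0:
        raise ValueError(f"{topology} does not divide evenly across {machine_type}")
    return chip_count, chip_count // chips_per_vm


print(topology_summary("ct5lp-hightpu-4t", "4x4"))    # (16, 4)
print(topology_summary("ct5lp-hightpu-8t", "2x4"))    # (8, 1)
print(topology_summary("ct5lp-hightpu-4t", "16x16"))  # (256, 64)
```

The helper checks arithmetic only. A combination that passes this check is not necessarily supported, so treat the table as the source of valid combinations.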

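Custom training jobs that run on TPU v5e VMs are optimized for throughput and availability. For more information, see v5e Training accelerator types.

TPU v5e machines are available in us-west1 and us-west4 for Vertex AI custom training. For more information about TPU v5e, see Cloud TPU v5e training.

Machine type comparison:

Machine type	ct5lp-hightpu-1t	ct5lp-hightpu-4t	ct5lp-hightpu-8t
Number of v5e chips	1	4	8
Number of vCPUs	24	112	224
RAM (GB)	48	192	384
Number of NUMA nodes	1	1	2
Likelihood of preemption	High	Medium	Low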

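TPU v6e

TPU v6e requires Python 3.10 or later, JAX 0.4.37 or later, PyTorch 2.1 or later using PJRT as the default runtime, or TensorFlow using only tf-nightly runtime version 2.18 or later. To configure a TPU VM with TPU v6e, specify the following fields in the WorkerPoolSpec:

- Set machineSpec.machineType to a ct6e machine type listed in the following table, such as ct6e-standard-4t.
- Set machineSpec.tpuTopology to a topology that the machine type supports. For details, see the following table.
- Set replicaCount to 1.

The following table shows the TPU v6e machine types and topologies that are supported for custom training:

Machine type	Topology	Number of TPU chips	Number of VMs	Recommended use case
`ct6e-standard-1t`	1x1	1	1	Small to medium training
`ct6e-standard-8t`	2x4	8	1	Small to medium training
`ct6e-standard-4t`	2x2	4	1	Small to medium training
`ct6e-standard-4t`	2x4	8	2	Small to medium training
`ct6e-standard-4t`	4x4	16	4	Large-scale training
`ct6e-standard-4t`	4x8	32	8	Large-scale training
`ct6e-standard-4t`	8x8	64	16	Large-scale training
`ct6e-standard-4t`	8x16	128	32	Large-scale training
`ct6e-standard-4t`	16x16	256	64	Large-scale training

TPU v6e machines are available in asia-northeast1, europe-west4, us-east1, us-east5, and us-south1 for Vertex AI custom training. For more information about TPU v6e, see Cloud TPU v6e training.

Machine type comparison:

Machine type	ct6e-standard-1t	ct6e-standard-4t	ct6e-standard-8t
Number of v6e chips	1	4	8
Number of vCPUs	44	180	180
RAM (GB)	48	720	1440
Number of NUMA nodes	2	1	2
Likelihood of preemption	High	Medium	Low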

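Example `CustomJob` specifying a TPU VM

The following examples highlight how to specify a TPU VM when you create a CustomJob:

gcloud

To specify a TPU VM by using the gcloud CLI tool, you must use a config.yaml file. The following examples show the file for each TPU generation: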

TPU v2/v3

workerPoolSpecs:
  machineSpec:
    machineType: cloud-tpu
    acceleratorType: TPU_V2
    acceleratorCount: 8
  replicaCount: 1
  containerSpec:
    imageUri: CUSTOM_CONTAINER_IMAGE_URI

TPU v5e

workerPoolSpecs:
  machineSpec:
    machineType: ct5lp-hightpu-4t
    tpuTopology: 4x4
  replicaCount: 1
  containerSpec:
    imageUri: CUSTOM_CONTAINER_IMAGE_URI

Then run a command like the following:

gcloud ai custom-jobs create \
  --region=LOCATION \
  --display-name=JOB_NAME \
  --config=config.yaml

Python

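Before trying this sample, follow the Python setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Python API reference documentation.

To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

To specify a TPU VM by using the Vertex AI SDK for Python, see the following example: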

from google.cloud import aiplatform

job = aiplatform.CustomContainerTrainingJob(
    display_name='DISPLAY_NAME',
    location='us-west1',
    project='PROJECT_ID',
    staging_bucket="gs://CLOUD_STORAGE_URI",
    container_uri='CONTAINER_URI')

job.run(machine_type='ct5lp-hightpu-4t', tpu_topology='2x2')

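For more information about creating a custom training job, see Create custom training jobs.

Boot disk options

You can optionally customize the boot disks for your training VMs. All VMs in a worker pool use the same type and size of boot disk.

- To customize the type of boot disk that each training VM uses, specify the diskSpec.bootDiskType field in your WorkerPoolSpec.

  You can set this field to pd-standard to use a standard persistent disk backed by a standard hard drive, or to pd-ssd to use an SSD persistent disk backed by a solid-state drive. The default value is pd-ssd.

  Using pd-ssd might improve performance if your training code reads from and writes to disk. Learn about disk types.
- To customize the size (in GB) of the boot disk that each training VM uses, specify the diskSpec.bootDiskSizeGb field in your WorkerPoolSpec.

  You can set this field to an integer from 100 to 64,000, inclusive. The default value is 100.

  You might need to increase the boot disk size if your training code writes a lot of temporary data to disk. Any data that you write to the boot disk is temporary, and you can't retrieve it after training completes.

Changing the type and size of your boot disks affects custom training pricing.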
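
The boot disk constraints above (two disk types, a size from 100 to 64,000 GB, and the stated defaults) can be checked before a job is submitted. The following sketch builds the diskSpec fields as a dictionary. The function name `disk_spec` is hypothetical, and the validation is local only, so the service might apply further checks.

```python
VALID_BOOT_DISK_TYPES = ("pd-standard", "pd-ssd")


def disk_spec(boot_disk_type="pd-ssd", boot_disk_size_gb=100):
    """Build a diskSpec dictionary, using the documented defaults and limits."""
    if boot_disk_type not in VALID_BOOT_DISK_TYPES:
        raise ValueError(f"bootDiskType must be one of {VALID_BOOT_DISK_TYPES}")
    if not isinstance(boot_disk_size_gb, int) or not 100 <= boot_disk_size_gb <= 64000:
        raise ValueError("bootDiskSizeGb must be an integer from 100 to 64000")
    return {"bootDiskType": boot_disk_type, "bootDiskSizeGb": boot_disk_size_gb}


print(disk_spec())                    # Documented defaults: pd-ssd, 100 GB.
print(disk_spec("pd-standard", 500))  # Larger standard disk for temporary data.
```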

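The following examples highlight where you can specify boot disk options when you create a CustomJob:

Console

In the Google Cloud console, you can't create a CustomJob directly. However, you can create a TrainingPipeline that creates a CustomJob. When you create a TrainingPipeline in the Google Cloud console, you can specify boot disk options for each worker pool on the Compute and pricing step, in the Disk type drop-down list and the Disk size (GB) field.

gcloud

To specify boot disk options by using the Google Cloud CLI tool, you must use a config.yaml file. For example: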

`config.yaml`

workerPoolSpecs:
  machineSpec:
    machineType: MACHINE_TYPE
  diskSpec:
    bootDiskType: DISK_TYPE
    bootDiskSizeGb: DISK_SIZE
  replicaCount: REPLICA_COUNT
  containerSpec:
    imageUri: CUSTOM_CONTAINER_IMAGE_URI

Then run a command like the following:

gcloud ai custom-jobs create \
  --region=LOCATION \
  --display-name=JOB_NAME \
  --config=config.yaml

For more context, read the guide to creating a CustomJob.

What's next

- Learn how to create a persistent resource to run custom training jobs.
- Learn how to perform custom training by creating a CustomJob.
