Train a model on GPUs with PyTorch, Ray, and Google Kubernetes Engine (GKE)

This guide shows how to train a model on Google Kubernetes Engine (GKE) using Ray, PyTorch, and the Ray Operator add-on.

About Ray

Ray is an open source, scalable compute framework for AI/ML applications. Ray Train is the Ray component designed for distributed model training and fine-tuning. With the Ray Train API, you can scale training across multiple machines and integrate with machine learning libraries such as PyTorch.
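As a rough illustration of how the Ray Train API wraps a PyTorch training loop, the following sketch trains a tiny linear classifier on random data. It is not the train.py from the sample repository: the model, the synthetic data, the hyperparameters, and the helper name build_scaling_kwargs are placeholders chosen for illustration, and it assumes that ray[train] and torch are installed.

```python
"""Minimal Ray Train + PyTorch sketch (illustrative; not the tutorial's train.py)."""


def build_scaling_kwargs(num_workers: int, cpus_per_worker: int) -> dict:
    """Build keyword arguments for ray.train.ScalingConfig (CPU-only workers)."""
    if num_workers < 1 or cpus_per_worker < 1:
        raise ValueError("num_workers and cpus_per_worker must be >= 1")
    return {
        "num_workers": num_workers,
        "use_gpu": False,
        "resources_per_worker": {"CPU": cpus_per_worker},
    }


def train_loop_per_worker(config: dict) -> None:
    """Training loop that Ray Train runs on every worker."""
    import torch
    from torch import nn
    from torch.utils.data import DataLoader, TensorDataset
    import ray.train
    import ray.train.torch

    # Placeholder data: random 28x28 "images" with 10 classes.
    features = torch.randn(256, 1, 28, 28)
    labels = torch.randint(0, 10, (256,))
    loader = DataLoader(TensorDataset(features, labels), batch_size=config["batch_size"])
    # Shard the data across workers and move batches to the right device.
    loader = ray.train.torch.prepare_data_loader(loader)

    model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
    # Wrap the model for distributed data-parallel training.
    model = ray.train.torch.prepare_model(model)
    optimizer = torch.optim.SGD(model.parameters(), lr=config["lr"])
    loss_fn = nn.CrossEntropyLoss()

    for epoch in range(config["epochs"]):
        for x, y in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            optimizer.step()
        # Report per-epoch metrics back to the trainer.
        ray.train.report({"epoch": epoch, "loss": loss.item()})


def main() -> None:
    """Launch the training loop on two CPU workers."""
    from ray.train import ScalingConfig
    from ray.train.torch import TorchTrainer

    trainer = TorchTrainer(
        train_loop_per_worker,
        train_loop_config={"batch_size": 8, "epochs": 2, "lr": 0.001},
        scaling_config=ScalingConfig(**build_scaling_kwargs(2, 1)),
    )
    result = trainer.fit()
    print(result.metrics)


# To try it locally, install "ray[train]" and torch, then call main().
```

The same train_loop_per_worker function runs unchanged whether the workers are local processes or Pods in a Ray cluster on GKE; only the scaling configuration changes.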

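To deploy Ray training jobs, use either a RayCluster resource or a RayJob resource. When you deploy Ray jobs in production, you should use a RayJob resource for the following reasons:

- A RayJob resource creates an ephemeral Ray cluster that can be deleted automatically when the job completes.
- A RayJob resource supports retry policies for resilient job execution.
- You can manage Ray jobs using familiar Kubernetes API patterns.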

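Objectives

This guide is intended for generative AI customers, new or existing GKE users, ML engineers, MLOps (DevOps) engineers, and platform administrators who are interested in using Kubernetes container orchestration capabilities to serve models with Ray.

- Create a GKE cluster.
- Create a Ray cluster using the RayCluster custom resource.
- Train a model using a Ray job.
- Deploy a Ray job using the RayJob custom resource.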

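Costs

This document uses billable components of Google Cloud, including the GKE cluster that you create.

To generate a cost estimate based on your projected usage, use the pricing calculator.

New Google Cloud users might be eligible for a free trial.

When you finish the tasks that are described in this document, you can avoid continued billing by deleting the resources that you created. For more information, see Clean up.

Before you begin

Cloud Shell comes preinstalled with the software that you need for this tutorial, including kubectl and the gcloud CLI. If you don't use Cloud Shell, you must install the gcloud CLI.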

Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

Install the Google Cloud CLI.

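Note: If you installed the gcloud CLI previously, make sure that you have the latest version by running gcloud components update.

If you're using an external identity provider (IdP), you must first sign in to the gcloud CLI with your federated identity.

To initialize the gcloud CLI, run the following command: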

gcloud init

Create or select a Google Cloud project.

Roles required to select or create a project

Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.

Create a Google Cloud project:
```
gcloud projects create PROJECT_ID
```
Replace PROJECT_ID with a name for the Google Cloud project you are creating.
Select the Google Cloud project that you created:
```
gcloud config set project PROJECT_ID
```
Replace PROJECT_ID with your Google Cloud project name.

Verify that billing is enabled for your Google Cloud project.

Enable the GKE API:

Roles required to enable APIs

To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

gcloud services enable container.googleapis.com

Grant roles to your user account. Run the following command once for each of the following IAM roles: roles/container.clusterAdmin, roles/container.admin
```
gcloud projects add-iam-policy-binding PROJECT_ID --member="user:USER_IDENTIFIER" --role=ROLE
```
Replace the following:
- PROJECT_ID: Your project ID.
- USER_IDENTIFIER: The identifier for your user account. For example, myemail@example.com.
- ROLE: The IAM role that you grant to your user account.

Prepare your environment

To prepare your environment, follow these steps:

Launch a Cloud Shell session from the Google Cloud console by clicking Activate Cloud Shell. The session opens in the bottom pane of the Google Cloud console.
Set environment variables:
```
export PROJECT_ID=PROJECT_ID
export CLUSTER_NAME=ray-cluster
export COMPUTE_REGION=us-central1
export COMPUTE_ZONE=us-central1-c
export CLUSTER_VERSION=CLUSTER_VERSION
export TUTORIAL_HOME=`pwd`
```
Replace the following:
- PROJECT_ID: your Google Cloud project ID.
- CLUSTER_VERSION: the GKE version to use. Must be 1.30.1 or later.
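Version strings don't compare correctly as plain text (for example, "1.9" sorts after "1.30"), so a numeric comparison is safer when you validate CLUSTER_VERSION in a script. The following is a minimal sketch of such a check. The function names are hypothetical, and the sample version strings in the usage note are illustrative rather than a list of available GKE releases.

```python
import re

# Minimum GKE version stated in this guide.
MIN_GKE_VERSION = (1, 30, 1)


def parse_gke_version(version: str) -> tuple:
    """Extract (major, minor, patch) from strings such as '1.30.1-gke.1234000'."""
    match = re.match(r"^(\d+)\.(\d+)\.(\d+)", version.strip())
    if match is None:
        raise ValueError(f"Unrecognized GKE version: {version!r}")
    return tuple(int(part) for part in match.groups())


def meets_minimum(version: str, minimum: tuple = MIN_GKE_VERSION) -> bool:
    """Return True if the version is at least the minimum, compared numerically."""
    return parse_gke_version(version) >= minimum
```

For example, meets_minimum("1.30.1-gke.1234000") returns True, and meets_minimum("1.29.9") returns False.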

Clone the GitHub repository:

git clone https://github.com/GoogleCloudPlatform/kubernetes-engine-samples

Change to the working directory:

cd kubernetes-engine-samples/ai-ml/gke-ray/raytrain/pytorch-mnist

Create a Python virtual environment:

python -m venv myenv && \
source myenv/bin/activate

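Install Ray. The manifests in this guide use Ray 2.37.0, so installing a matching client version with the default extras (for example, pip install "ray[default]==2.37.0") keeps the ray CLI compatible with the cluster.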

Create a GKE cluster

Create an Autopilot or Standard GKE cluster:

Autopilot

Create an Autopilot cluster:

gcloud container clusters create-auto ${CLUSTER_NAME}  \
    --enable-ray-operator \
    --cluster-version=${CLUSTER_VERSION} \
    --location=${COMPUTE_REGION}

Standard

Create a Standard cluster:

gcloud container clusters create ${CLUSTER_NAME} \
    --addons=RayOperator \
    --cluster-version=${CLUSTER_VERSION}  \
    --machine-type=e2-standard-8 \
    --location=${COMPUTE_ZONE} \
    --num-nodes=4

Deploy a RayCluster resource

Deploy a RayCluster resource to your cluster:

Review the following manifest:

apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: pytorch-mnist-cluster
spec:
  rayVersion: '2.37.0'
  headGroupSpec:
    rayStartParams:
      dashboard-host: '0.0.0.0'
    template:
      metadata:
      spec:
        containers:
        - name: ray-head
          image: rayproject/ray:2.37.0
          ports:
          - containerPort: 6379
            name: gcs
          - containerPort: 8265
            name: dashboard
          - containerPort: 10001
            name: client
          resources:
            limits:
              cpu: "2"
              ephemeral-storage: "9Gi"
              memory: "4Gi"
            requests:
              cpu: "2"
              ephemeral-storage: "9Gi"
              memory: "4Gi"
  workerGroupSpecs:
  - replicas: 4
    minReplicas: 1
    maxReplicas: 5
    groupName: worker-group
    rayStartParams: {}
    template:
      spec:
        containers:
        - name: ray-worker
          image: rayproject/ray:2.37.0
          resources:
            limits:
              cpu: "4"
              ephemeral-storage: "9Gi"
              memory: "8Gi"
            requests:
              cpu: "4"
              ephemeral-storage: "9Gi"
              memory: "8Gi"

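This manifest describes a RayCluster custom resource with one head Pod (2 CPUs and 4 GiB of memory) and a worker group of four replicas (4 CPUs and 8 GiB of memory each).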
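Before applying the manifest, it can help to add up what it asks the cluster for. The following sketch sums the CPU and memory requests declared in the manifest and compares them with the raw capacity of the Standard cluster created earlier. The node figures (8 vCPUs and 32 GB per e2-standard-8 node) are an assumption stated in the code, and the comparison is coarse: allocatable capacity is lower than raw capacity, and Pods must also fit on individual nodes.

```python
"""Sum the CPU and memory requests declared in the RayCluster manifest."""

HEAD = {"cpu": 2, "memory_gi": 4}    # ray-head container requests
WORKER = {"cpu": 4, "memory_gi": 8}  # ray-worker container requests
WORKER_REPLICAS = 4                  # workerGroupSpecs[0].replicas

# Assumption: the Standard cluster from this guide, 4 x e2-standard-8,
# each with 8 vCPUs and 32 GB of memory (raw, not allocatable, capacity).
NODES = 4
NODE_CAPACITY = {"cpu": 8, "memory_gi": 32}


def total_requests(head: dict, worker: dict, replicas: int) -> dict:
    """Total requests for one head Pod plus `replicas` worker Pods."""
    return {key: head[key] + worker[key] * replicas for key in head}


def fits(requests: dict, nodes: int, node_capacity: dict) -> bool:
    """Coarse check: total requests against total raw node capacity."""
    return all(requests[key] <= node_capacity[key] * nodes for key in requests)


requested = total_requests(HEAD, WORKER, WORKER_REPLICAS)
print(requested)                              # {'cpu': 18, 'memory_gi': 36}
print(fits(requested, NODES, NODE_CAPACITY))  # True
```

On Autopilot clusters, GKE provisions nodes to match the Pod requests, so this kind of check matters mainly for Standard clusters with a fixed node count.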

Apply the manifest to your GKE cluster:
```
kubectl apply -f ray-cluster.yaml
```
Verify that the RayCluster resource is ready:
```
kubectl get raycluster
```
The output is similar to the following:
```
NAME                    DESIRED WORKERS   AVAILABLE WORKERS   CPUS   MEMORY   GPUS   STATUS   AGE
pytorch-mnist-cluster   2                 2                   6      20Gi     0      ready    63s
```
In this output, ready in the STATUS column indicates that the RayCluster resource is ready.

Connect to the RayCluster resource

Connect to the RayCluster resource to submit a Ray job.

Verify that GKE created the RayCluster Service:

kubectl get svc pytorch-mnist-cluster-head-svc

The output is similar to the following:

NAME                             TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                                AGE
pytorch-mnist-cluster-head-svc   ClusterIP   34.118.238.247   <none>        10001/TCP,8265/TCP,6379/TCP,8080/TCP   109s

Establish a port-forwarding session to the Ray head:

kubectl port-forward svc/pytorch-mnist-cluster-head-svc 8265:8265 2>&1 >/dev/null &

Verify that the Ray client can connect to the Ray cluster through localhost:

ray list nodes --address http://localhost:8265

The output is similar to the following:

Stats:
------------------------------
Total: 3

Table:
------------------------------
    NODE_ID                                                   NODE_IP     IS_HEAD_NODE    STATE    NODE_NAME    RESOURCES_TOTAL                 LABELS
0  1d07447d7d124db641052a3443ed882f913510dbe866719ac36667d2  10.28.1.21  False           ALIVE    10.28.1.21   CPU: 2.0                        ray.io/node_id: 1d07447d7d124db641052a3443ed882f913510dbe866719ac36667d2
# Several lines of output omitted
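Because the port-forwarding session runs in the background, a script that submits jobs may start before the tunnel is ready. The following sketch checks whether the forwarded port accepts TCP connections. The function name is hypothetical, and a successful connection shows only that something is listening on the port, not that the Ray dashboard is healthy.

```python
import socket


def is_port_open(host: str = "localhost", port: int = 8265, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

A script could call is_port_open() in a short retry loop before running ray list nodes or submitting a job.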

Train a model

Train a PyTorch model using the Fashion MNIST dataset:

Submit a Ray job and wait for the job to complete:

ray job submit --submission-id pytorch-mnist-job --working-dir . --runtime-env-json='{"pip": ["torch", "torchvision"], "excludes": ["myenv"]}' --address http://localhost:8265 -- python train.py

The output is similar to the following:

Job submission server address: http://localhost:8265

--------------------------------------------
Job 'pytorch-mnist-job' submitted successfully
--------------------------------------------

Next steps
  Query the logs of the job:
    ray job logs pytorch-mnist-job
  Query the status of the job:
    ray job status pytorch-mnist-job
  Request the job to be stopped:
    ray job stop pytorch-mnist-job

Handling connection for 8265
Tailing logs until the job exits (disable with --no-wait):
...
...
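The same submission can be made from Python with the Ray Jobs SDK instead of the CLI. The following is a minimal sketch that mirrors the flags of the ray job submit command above. The helper names are hypothetical, and it assumes that the port-forwarding session is active and that Ray is installed with the default extras.

```python
import json


def build_runtime_env() -> dict:
    """Runtime environment equivalent to the --runtime-env-json flag above."""
    return {"pip": ["torch", "torchvision"], "excludes": ["myenv"]}


def submit_training_job(address: str = "http://localhost:8265") -> str:
    """Submit train.py from the current directory and return the submission ID."""
    # Imported lazily so that build_runtime_env() works without Ray installed.
    from ray.job_submission import JobSubmissionClient

    client = JobSubmissionClient(address)
    # working_dir="." uploads the current directory, like --working-dir .
    runtime_env = dict(build_runtime_env(), working_dir=".")
    return client.submit_job(
        entrypoint="python train.py",
        submission_id="pytorch-mnist-job",
        runtime_env=runtime_env,
    )


# The JSON string passed to --runtime-env-json can be generated the same way.
print(json.dumps(build_runtime_env()))
```

Unlike the CLI, submit_job returns immediately after the job is accepted; it does not tail the logs.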

Verify the job status:
```
ray job status pytorch-mnist-job
```
The output is similar to the following:
```
Job submission server address: http://localhost:8265
Status for job 'pytorch-mnist-job': RUNNING
Status message: Job is currently running.
```
Wait until the status shown after Status for job is SUCCEEDED, which is how Ray reports a job that completed successfully. This can take 15 minutes or longer.
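Waiting by hand can be replaced with a small polling loop. Ray reports terminal job states as SUCCEEDED, FAILED, or STOPPED, so the sketch below polls a status callback until one of those appears or a timeout elapses. The function name, the polling interval, and the 30-minute default timeout are illustrative choices.

```python
import time

# Terminal states of a Ray job; PENDING and RUNNING are non-terminal.
TERMINAL_STATES = frozenset({"SUCCEEDED", "FAILED", "STOPPED"})


def wait_for_job(get_status, poll_seconds=10.0, timeout_seconds=1800.0, sleep=time.sleep):
    """Poll get_status() until the job reaches a terminal state, then return it.

    get_status is any zero-argument callable that returns the current status
    as a string or as an enum whose .value is the status string.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        raw = get_status()
        status = str(getattr(raw, "value", raw))
        if status in TERMINAL_STATES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Job still {status} after {timeout_seconds} seconds")
        sleep(poll_seconds)
```

With the Ray Jobs SDK, the callback could be lambda: client.get_job_status("pytorch-mnist-job"), where client is a JobSubmissionClient connected to http://localhost:8265.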

View the Ray job logs:

ray job logs pytorch-mnist-job

The output is similar to the following:

Training started with configuration:
╭─────────────────────────────────────────────────╮
│ Training config                                  │
├──────────────────────────────────────────────────┤
│ train_loop_config/batch_size_per_worker       8  │
│ train_loop_config/epochs                     10  │
│ train_loop_config/lr                      0.001  │
╰─────────────────────────────────────────────────╯

# Several lines omitted

Training finished iteration 10 at 2024-06-19 08:29:36. Total running time: 9min 18s
╭───────────────────────────────╮
│ Training result                │
├────────────────────────────────┤
│ checkpoint_dir_name            │
│ time_this_iter_s      25.7394  │
│ time_total_s          351.233  │
│ training_iteration         10  │
│ accuracy               0.8656  │
│ loss                  0.37827  │
╰───────────────────────────────╯

# Several lines omitted
-------------------------------
Job 'pytorch-mnist' succeeded
-------------------------------

Deploy a RayJob

The RayJob custom resource manages the lifecycle of a RayCluster resource during the execution of a single Ray job.

Review the following manifest:

apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: pytorch-mnist-job
spec:
  shutdownAfterJobFinishes: true
  entrypoint: python ai-ml/gke-ray/raytrain/pytorch-mnist/train.py
  runtimeEnvYAML: |
    pip:
      - torch
      - torchvision
    working_dir: "https://github.com/GoogleCloudPlatform/kubernetes-engine-samples/archive/main.zip"
    env_vars:
      NUM_WORKERS: "4"
      CPUS_PER_WORKER: "2"
  rayClusterSpec:
    rayVersion: '2.37.0'
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.37.0
              ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8265
                  name: dashboard
                - containerPort: 10001
                  name: client
              resources:
                limits:
                  cpu: "2"
                  ephemeral-storage: "9Gi"
                  memory: "4Gi"
                requests:
                  cpu: "2"
                  ephemeral-storage: "9Gi"
                  memory: "4Gi"
    workerGroupSpecs:
      - replicas: 4
        minReplicas: 1
        maxReplicas: 5
        groupName: small-group
        rayStartParams: {}
        template:
          spec:
            containers:
              - name: ray-worker
                image: rayproject/ray:2.37.0
                resources:
                  limits:
                    cpu: "4"
                    ephemeral-storage: "9Gi"
                    memory: "8Gi"
                  requests:
                    cpu: "4"
                    ephemeral-storage: "9Gi"
                    memory: "8Gi"

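This manifest describes a RayJob custom resource. It runs train.py from the sample repository with NUM_WORKERS set to 4 and CPUS_PER_WORKER set to 2, and because shutdownAfterJobFinishes is true, the Ray cluster is deleted when the job finishes.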
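The env_vars entries in the manifest are only useful if the training script reads them. The following sketch shows one way a script could turn NUM_WORKERS and CPUS_PER_WORKER into a scaling configuration. It is an assumption about how such a script might be written, not a copy of train.py; the function name and the fallback defaults are hypothetical.

```python
import os


def scaling_from_env(environ=None) -> dict:
    """Read worker count and CPUs per worker from environment variables."""
    env = os.environ if environ is None else environ
    # Fallback defaults are illustrative; the manifest sets both values explicitly.
    num_workers = int(env.get("NUM_WORKERS", "4"))
    cpus_per_worker = int(env.get("CPUS_PER_WORKER", "2"))
    if num_workers < 1 or cpus_per_worker < 1:
        raise ValueError("NUM_WORKERS and CPUS_PER_WORKER must be positive integers")
    return {
        "num_workers": num_workers,
        "cpus_per_worker": cpus_per_worker,
        "total_cpus": num_workers * cpus_per_worker,
    }


# Values from the manifest: 4 workers x 2 CPUs = 8 CPUs for training.
print(scaling_from_env({"NUM_WORKERS": "4", "CPUS_PER_WORKER": "2"}))
```

With the manifest's values, training needs 8 CPUs in total, which fits within the 16 CPUs that the four 4-CPU worker Pods provide.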

Apply the manifest to your GKE cluster:
```
kubectl apply -f ray-job.yaml
```
Verify that the RayJob resource is running:
```
kubectl get rayjob
```
The output is similar to the following:
```
NAME                JOB STATUS   DEPLOYMENT STATUS   START TIME             END TIME   AGE
pytorch-mnist-job   RUNNING      Running             2024-06-19T15:43:32Z              2m29s
```
In this output, the DEPLOYMENT STATUS column indicates that the RayJob resource is Running.

View the RayJob logs to follow the training progress:

kubectl logs -f -l job-name=pytorch-mnist-job

The output is similar to the following:

Training started with configuration:
╭─────────────────────────────────────────────────╮
│ Training config                                  │
├──────────────────────────────────────────────────┤
│ train_loop_config/batch_size_per_worker       8  │
│ train_loop_config/epochs                     10  │
│ train_loop_config/lr                      0.001  │
╰─────────────────────────────────────────────────╯

# Several lines omitted

Training finished iteration 10 at 2024-06-19 08:29:36. Total running time: 9min 18s
╭───────────────────────────────╮
│ Training result                │
├────────────────────────────────┤
│ checkpoint_dir_name            │
│ time_this_iter_s      25.7394  │
│ time_total_s          351.233  │
│ training_iteration         10  │
│ accuracy               0.8656  │
│ loss                  0.37827  │
╰───────────────────────────────╯

# Several lines omitted
-------------------------------
Job 'pytorch-mnist' succeeded
-------------------------------

Verify that the Ray job completed:

kubectl get rayjob

The output is similar to the following:

NAME                JOB STATUS   DEPLOYMENT STATUS   START TIME             END TIME               AGE
pytorch-mnist-job   SUCCEEDED    Complete            2024-06-19T15:43:32Z   2024-06-19T15:51:12Z   9m6s

In this output, the DEPLOYMENT STATUS column indicates that the RayJob resource is Complete.
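The two status columns can also be read programmatically, for example in a CI pipeline that waits for training to finish. The sketch below parses the JSON form of the RayJob resource. It assumes that the JOB STATUS and DEPLOYMENT STATUS columns come from the status.jobStatus and status.jobDeploymentStatus fields and that Complete and Failed are the terminal deployment states; verify both against the KubeRay version in your cluster. The function names are hypothetical.

```python
import json
import subprocess


def rayjob_status(rayjob: dict) -> tuple:
    """Return (job status, deployment status) from a RayJob resource dict."""
    status = rayjob.get("status", {})
    return status.get("jobStatus", ""), status.get("jobDeploymentStatus", "")


def is_finished(rayjob: dict) -> bool:
    """True once the deployment status is terminal (assumed: Complete or Failed)."""
    _, deployment_status = rayjob_status(rayjob)
    return deployment_status in {"Complete", "Failed"}


def fetch_rayjob(name: str = "pytorch-mnist-job") -> dict:
    """Fetch the RayJob as JSON through kubectl (requires cluster access)."""
    completed = subprocess.run(
        ["kubectl", "get", "rayjob", name, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(completed.stdout)
```

A waiting loop would call is_finished(fetch_rayjob()) at intervals and then inspect the job status to distinguish SUCCEEDED from FAILED.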

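Clean up

Delete the project

Caution: Deleting a project has the following effects:

- Everything in the project is deleted. If you used an existing project for the tasks in this document, deleting it also deletes any other work that you've done in the project.
- Custom project IDs are lost. When you created this project, you might have created a custom project ID that you want to use in the future. To preserve the URLs that use the project ID, such as an appspot.com URL, delete selected resources inside the project instead of deleting the whole project.

If you plan to explore multiple architectures, tutorials, or quickstarts, reusing projects can help you avoid exceeding project quota limits.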

Delete a Google Cloud project:

gcloud projects delete PROJECT_ID

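Delete individual resources

If you used an existing project and you don't want to delete it, you can delete the individual resources. To delete the cluster, run the following command: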

gcloud container clusters delete ${CLUSTER_NAME}

What's next

Explore reference architectures, diagrams, and best practices about Google Cloud. Take a look at the Cloud Architecture Center.