Train a model with Ray and PyTorch on Google Kubernetes Engine (GKE)

Applies to: Autopilot, Standard

This guide demonstrates how to train a model on Google Kubernetes Engine (GKE) using Ray, PyTorch, and the Ray Operator add-on.

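About Ray

Ray is an open-source, scalable compute framework for AI/ML applications. Ray Train is a component within Ray designed for distributed model training and fine-tuning. You can use the Ray Train API to scale training across multiple machines and to integrate with machine learning libraries such as PyTorch.

You can deploy Ray training jobs using the RayCluster or RayJob resource. When you deploy Ray jobs in production, use a RayJob resource for the following reasons:

- The RayJob resource creates an ephemeral Ray cluster that can be deleted automatically when the job completes.
- The RayJob resource supports retry policies for resilient job execution.
- You can manage Ray jobs using familiar Kubernetes API patterns.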

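Objectives

This guide is intended for generative AI customers, new or existing GKE users, ML engineers, MLOps (DevOps) engineers, and platform administrators who are interested in using Kubernetes container orchestration capabilities to serve models with Ray.

- Create a GKE cluster.
- Create a Ray cluster using the RayCluster custom resource.
- Train a model using a Ray job.
- Deploy a Ray job using the RayJob custom resource.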

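Costs

In this document, you use billable components of Google Cloud, such as the GKE cluster that you create.

To generate a cost estimate based on your projected usage, use the Pricing Calculator.

New Google Cloud users might be eligible for a free trial.

When you finish the tasks described in this document, you can avoid continued billing by deleting the resources that you created. For more information, see Clean up.

Before you begin

Cloud Shell is preinstalled with the software you need for this tutorial, including kubectl and the gcloud CLI. If you don't use Cloud Shell, you must install the gcloud CLI.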

Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

Install the Google Cloud CLI.

Note: If you installed the gcloud CLI previously, make sure you have the latest version by running gcloud components update.

If you're using an external identity provider (IdP), you must first sign in to the gcloud CLI with your federated identity.

To initialize the gcloud CLI, run the following command:

gcloud init

Create or select a Google Cloud project.

Create a Google Cloud project:
```
gcloud projects create PROJECT_ID
```
Replace PROJECT_ID with a name for the Google Cloud project you are creating.
Select the Google Cloud project that you created:
```
gcloud config set project PROJECT_ID
```
Replace PROJECT_ID with your Google Cloud project name.

Make sure that billing is enabled for your Google Cloud project.

Enable the GKE API:

gcloud services enable container.googleapis.com

Grant roles to your user account. Run the following command once for each of the following IAM roles: roles/container.clusterAdmin, roles/container.admin
```
gcloud projects add-iam-policy-binding PROJECT_ID --member="user:USER_IDENTIFIER" --role=ROLE
```
- Replace PROJECT_ID with your project ID.
- Replace USER_IDENTIFIER with the identifier for your user account. For example, myemail@example.com. The command already supplies the user: prefix.
- Replace ROLE with each individual role.

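Prepare your environment

To prepare your environment, follow these steps:

In the Google Cloud console, click Activate Cloud Shell to launch a Cloud Shell session. The session opens in the bottom pane of the Google Cloud console.

Set environment variables: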

export PROJECT_ID=PROJECT_ID
export CLUSTER_NAME=ray-cluster
export COMPUTE_REGION=us-central1
export COMPUTE_ZONE=us-central1-c
export CLUSTER_VERSION=CLUSTER_VERSION
export TUTORIAL_HOME=`pwd`

Replace the following:

- PROJECT_ID: your Google Cloud project ID.
- CLUSTER_VERSION: the GKE version to use. Must be 1.30.1 or later.
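The version requirement can be checked before you create the cluster. The following is a minimal sketch and not part of the tutorial: the helper names parse_gke_version and meets_minimum are hypothetical, and the sketch assumes that CLUSTER_VERSION holds a full version string of the form MAJOR.MINOR.PATCH, optionally followed by a suffix such as -gke.N, rather than an alias.

```python
import re

# Minimum GKE version stated in this guide.
MIN_VERSION = (1, 30, 1)


def parse_gke_version(version: str) -> tuple:
    """Return (major, minor, patch) from a string such as '1.30.1-gke.100'."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"not a full GKE version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def meets_minimum(version: str) -> bool:
    """True if the version is MIN_VERSION or later."""
    return parse_gke_version(version) >= MIN_VERSION
```

For example, calling meets_minimum(os.environ["CLUSTER_VERSION"]) after importing os checks the value exported in the previous step.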

Clone the GitHub repository:

git clone https://github.com/GoogleCloudPlatform/kubernetes-engine-samples

Change to the working directory:

cd kubernetes-engine-samples/ai-ml/gke-ray/raytrain/pytorch-mnist

Create a Python virtual environment:

python -m venv myenv && \
source myenv/bin/activate

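Install Ray in the virtual environment, for example with pip install "ray[default]==2.37.0" so that the local Ray CLI matches the Ray version used by the cluster manifests in this guide.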

Create a GKE cluster

Create an Autopilot or Standard GKE cluster:

Autopilot

Create an Autopilot cluster:

gcloud container clusters create-auto ${CLUSTER_NAME}  \
    --enable-ray-operator \
    --cluster-version=${CLUSTER_VERSION} \
    --location=${COMPUTE_REGION}

Standard

Create a Standard cluster:

gcloud container clusters create ${CLUSTER_NAME} \
    --addons=RayOperator \
    --cluster-version=${CLUSTER_VERSION}  \
    --machine-type=e2-standard-8 \
    --location=${COMPUTE_ZONE} \
    --num-nodes=4

Deploy a RayCluster resource

Deploy a RayCluster resource to your cluster:

Review the following manifest:

apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: pytorch-mnist-cluster
spec:
  rayVersion: '2.37.0'
  headGroupSpec:
    rayStartParams:
      dashboard-host: '0.0.0.0'
    template:
      metadata:
      spec:
        containers:
        - name: ray-head
          image: rayproject/ray:2.37.0
          ports:
          - containerPort: 6379
            name: gcs
          - containerPort: 8265
            name: dashboard
          - containerPort: 10001
            name: client
          resources:
            limits:
              cpu: "2"
              ephemeral-storage: "9Gi"
              memory: "4Gi"
            requests:
              cpu: "2"
              ephemeral-storage: "9Gi"
              memory: "4Gi"
  workerGroupSpecs:
  - replicas: 4
    minReplicas: 1
    maxReplicas: 5
    groupName: worker-group
    rayStartParams: {}
    template:
      spec:
        containers:
        - name: ray-worker
          image: rayproject/ray:2.37.0
          resources:
            limits:
              cpu: "4"
              ephemeral-storage: "9Gi"
              memory: "8Gi"
            requests:
              cpu: "4"
              ephemeral-storage: "9Gi"
              memory: "8Gi"

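This manifest describes a RayCluster custom resource with one head Pod (2 CPUs, 4 GiB of memory) and a worker group of four replicas (4 CPUs, 8 GiB of memory each).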
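To see how much capacity the manifest asks for, the requests can be summed per Pod group. The following sketch is illustrative only: the PodGroup class and total_requests function are hypothetical names, the per-Pod values are copied from the manifest, and the node figures assume the Standard cluster from this guide (four e2-standard-8 nodes). Allocatable capacity on each node is lower than the raw machine size, so treat the comparison as a rough check.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PodGroup:
    """Resource requests for one group of identical Ray Pods."""
    name: str
    replicas: int
    cpu: int         # CPU cores requested per Pod
    memory_gib: int  # memory requested per Pod, in GiB


def total_requests(groups):
    """Sum CPU and memory requests across all Pod groups."""
    cpu = sum(g.replicas * g.cpu for g in groups)
    memory = sum(g.replicas * g.memory_gib for g in groups)
    return cpu, memory


# Values copied from the RayCluster manifest.
RAY_CLUSTER = [
    PodGroup("head", replicas=1, cpu=2, memory_gib=4),
    PodGroup("worker-group", replicas=4, cpu=4, memory_gib=8),
]

# Raw capacity of the Standard cluster in this guide: 4 x e2-standard-8,
# each with 8 vCPUs. Allocatable capacity is lower than this.
NODE_COUNT, NODE_VCPUS = 4, 8

cpu, memory = total_requests(RAY_CLUSTER)
print(f"requested: {cpu} CPUs, {memory} GiB")
print(f"raw node capacity: {NODE_COUNT * NODE_VCPUS} vCPUs")
```

With the values above, the manifest requests 18 CPUs and 36 GiB of memory in total, against 32 raw vCPUs on the Standard cluster.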

Apply the manifest to your GKE cluster:
```
kubectl apply -f ray-cluster.yaml
```

Verify that the RayCluster resource is ready:

kubectl get raycluster

The output is similar to the following:

NAME                    DESIRED WORKERS   AVAILABLE WORKERS   CPUS   MEMORY   GPUS   STATUS   AGE
pytorch-mnist-cluster   2                 2                   6      20Gi     0      ready    63s

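In this output, ready in the STATUS column indicates that the RayCluster resource is ready.

Connect to the RayCluster resource

Connect to the RayCluster resource to submit a Ray job.

Verify that GKE created the RayCluster Service:

kubectl get svc pytorch-mnist-cluster-head-svc

The output is similar to the following: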

NAME                             TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                                AGE
pytorch-mnist-cluster-head-svc   ClusterIP   34.118.238.247   <none>        10001/TCP,8265/TCP,6379/TCP,8080/TCP   109s

Establish a port-forwarding session to the Ray head:

kubectl port-forward svc/pytorch-mnist-cluster-head-svc 8265:8265 2>&1 >/dev/null &

Verify that the Ray client can connect to the Ray cluster through localhost:

ray list nodes --address http://localhost:8265

The output is similar to the following:

Stats:
------------------------------
Total: 3

Table:
------------------------------
    NODE_ID                                                   NODE_IP     IS_HEAD_NODE    STATE    NODE_NAME    RESOURCES_TOTAL                 LABELS
0  1d07447d7d124db641052a3443ed882f913510dbe866719ac36667d2  10.28.1.21  False           ALIVE    10.28.1.21   CPU: 2.0                        ray.io/node_id: 1d07447d7d124db641052a3443ed882f913510dbe866719ac36667d2
# Several lines of output omitted

Train a model

Train a PyTorch model using the Fashion MNIST dataset:

Submit a Ray job and wait for the job to complete:

ray job submit --submission-id pytorch-mnist-job --working-dir . --runtime-env-json='{"pip": ["torch", "torchvision"], "excludes": ["myenv"]}' --address http://localhost:8265 -- python train.py

The output is similar to the following:

Job submission server address: http://localhost:8265

--------------------------------------------
Job 'pytorch-mnist-job' submitted successfully
--------------------------------------------

Next steps
  Query the logs of the job:
    ray job logs pytorch-mnist-job
  Query the status of the job:
    ray job status pytorch-mnist-job
  Request the job to be stopped:
    ray job stop pytorch-mnist-job

Handling connection for 8265
Tailing logs until the job exits (disable with --no-wait):
...
...
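The same submission can also be expressed with Ray's Python job submission SDK instead of the CLI. The following is a sketch under assumptions rather than the tutorial's method: it assumes that Ray is installed locally with its job submission dependencies, that the port-forwarding session to port 8265 is active, and that the code runs from the directory containing train.py. The helper names build_runtime_env and submit_training_job are hypothetical.

```python
def build_runtime_env() -> dict:
    """Runtime environment matching the --working-dir and --runtime-env-json flags."""
    return {
        "working_dir": ".",               # upload the current directory
        "pip": ["torch", "torchvision"],  # packages installed for the job
        "excludes": ["myenv"],            # skip the local virtual environment
    }


def submit_training_job(address: str = "http://localhost:8265") -> str:
    """Submit train.py to the Ray cluster and return the submission ID."""
    # Imported lazily so build_runtime_env stays usable without Ray installed.
    from ray.job_submission import JobSubmissionClient

    client = JobSubmissionClient(address)
    return client.submit_job(
        entrypoint="python train.py",
        submission_id="pytorch-mnist-job",
        runtime_env=build_runtime_env(),
    )
```

Unlike the CLI command, submit_job returns as soon as the job is accepted and does not tail the logs.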

Verify the job status:

ray job status pytorch-mnist-job

The output is similar to the following:

Job submission server address: http://localhost:8265
Status for job 'pytorch-mnist-job': RUNNING
Status message: Job is currently running.

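Wait for Status for job to report SUCCEEDED, which is how Ray labels a job that completed successfully. This operation might take 15 minutes or longer.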
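Waiting can be automated with a small polling loop. The sketch below is an illustration, not part of the tutorial: wait_for_job is a hypothetical helper that takes any callable returning the current status as a string, and the timeout and polling interval are arbitrary defaults. It treats SUCCEEDED, FAILED, and STOPPED as the terminal states of a Ray job.

```python
import time

# Terminal states of a Ray job; any other value means pending or running.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED"}


def wait_for_job(get_status, timeout_s=1800.0, poll_interval_s=10.0, sleep=time.sleep):
    """Poll get_status() until it returns a terminal state or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still {status} after {timeout_s} seconds")
        sleep(poll_interval_s)
```

With the Ray job submission SDK, the callable could be lambda: client.get_job_status("pytorch-mnist-job").value, where client is a JobSubmissionClient connected to http://localhost:8265.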

View the Ray job logs:

ray job logs pytorch-mnist-job

The output is similar to the following:

Training started with configuration:
╭─────────────────────────────────────────────────╮
│ Training config                                  │
├──────────────────────────────────────────────────┤
│ train_loop_config/batch_size_per_worker       8  │
│ train_loop_config/epochs                     10  │
│ train_loop_config/lr                      0.001  │
╰─────────────────────────────────────────────────╯

# Several lines omitted

Training finished iteration 10 at 2024-06-19 08:29:36. Total running time: 9min 18s
╭───────────────────────────────╮
│ Training result                │
├────────────────────────────────┤
│ checkpoint_dir_name            │
│ time_this_iter_s      25.7394  │
│ time_total_s          351.233  │
│ training_iteration         10  │
│ accuracy               0.8656  │
│ loss                  0.37827  │
╰───────────────────────────────╯

# Several lines omitted
-------------------------------
Job 'pytorch-mnist' succeeded
-------------------------------
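The training configuration in the log can be turned into a rough count of batches per epoch. This is a back-of-the-envelope sketch under stated assumptions: the log does not show the number of workers, so the value 4 is an assumption. The sketch also assumes the standard Fashion MNIST training split of 60,000 images, sharded evenly across workers. The function name batches_per_epoch is hypothetical.

```python
import math

FASHION_MNIST_TRAIN_IMAGES = 60_000  # size of the standard training split
BATCH_SIZE_PER_WORKER = 8            # train_loop_config/batch_size_per_worker in the log
NUM_WORKERS = 4                      # assumption: not shown in the log output


def batches_per_epoch(dataset_size, batch_size_per_worker, num_workers):
    """Return (global batch size, batches each worker processes per epoch)."""
    global_batch_size = batch_size_per_worker * num_workers
    samples_per_worker = math.ceil(dataset_size / num_workers)
    return global_batch_size, math.ceil(samples_per_worker / batch_size_per_worker)


global_batch, steps = batches_per_epoch(
    FASHION_MNIST_TRAIN_IMAGES, BATCH_SIZE_PER_WORKER, NUM_WORKERS
)
print(f"global batch size: {global_batch}, batches per worker per epoch: {steps}")
```

Under these assumptions the global batch size is 32 and each worker processes 1,875 batches per epoch.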

Deploy a RayJob

The RayJob custom resource manages the lifecycle of a RayCluster resource during the execution of a single Ray job.

Review the following manifest:

apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: pytorch-mnist-job
spec:
  shutdownAfterJobFinishes: true
  entrypoint: python ai-ml/gke-ray/raytrain/pytorch-mnist/train.py
  runtimeEnvYAML: |
    pip:
      - torch
      - torchvision
    working_dir: "https://github.com/GoogleCloudPlatform/kubernetes-engine-samples/archive/main.zip"
    env_vars:
      NUM_WORKERS: "4"
      CPUS_PER_WORKER: "2"
  rayClusterSpec:
    rayVersion: '2.37.0'
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.37.0
              ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8265
                  name: dashboard
                - containerPort: 10001
                  name: client
              resources:
                limits:
                  cpu: "2"
                  ephemeral-storage: "9Gi"
                  memory: "4Gi"
                requests:
                  cpu: "2"
                  ephemeral-storage: "9Gi"
                  memory: "4Gi"
    workerGroupSpecs:
      - replicas: 4
        minReplicas: 1
        maxReplicas: 5
        groupName: small-group
        rayStartParams: {}
        template:
          spec:
            containers:
              - name: ray-worker
                image: rayproject/ray:2.37.0
                resources:
                  limits:
                    cpu: "4"
                    ephemeral-storage: "9Gi"
                    memory: "8Gi"
                  requests:
                    cpu: "4"
                    ephemeral-storage: "9Gi"
                    memory: "8Gi"

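This manifest describes a RayJob custom resource. Because shutdownAfterJobFinishes is set to true, the Ray cluster defined in rayClusterSpec is deleted after the job finishes.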
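The env_vars entries under runtimeEnvYAML are exposed to the entrypoint as environment variables. How train.py consumes NUM_WORKERS and CPUS_PER_WORKER is not shown in this guide, so the following is only a sketch of one plausible approach using Ray Train's ScalingConfig. The function names and the fallback defaults are hypothetical.

```python
import os


def scaling_from_env(environ=os.environ) -> dict:
    """Read worker settings from the environment, with illustrative defaults."""
    return {
        "num_workers": int(environ.get("NUM_WORKERS", "4")),
        "cpus_per_worker": int(environ.get("CPUS_PER_WORKER", "2")),
    }


def build_scaling_config(environ=os.environ):
    """Translate the settings into a Ray Train ScalingConfig."""
    # Imported lazily so scaling_from_env works without Ray installed.
    from ray.train import ScalingConfig

    settings = scaling_from_env(environ)
    return ScalingConfig(
        num_workers=settings["num_workers"],
        use_gpu=False,
        resources_per_worker={"CPU": settings["cpus_per_worker"]},
    )
```

With the manifest's values, four training workers at 2 CPUs each would reserve 8 of the 16 CPUs that the worker group requests.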

Apply the manifest to your GKE cluster:
```
kubectl apply -f ray-job.yaml
```

Verify that the RayJob resource is running:

kubectl get rayjob

The output is similar to the following:

NAME                JOB STATUS   DEPLOYMENT STATUS   START TIME             END TIME   AGE
pytorch-mnist-job   RUNNING      Running             2024-06-19T15:43:32Z              2m29s

In this output, the DEPLOYMENT STATUS column indicates that the RayJob resource is Running.

View the RayJob resource logs:

kubectl logs -f -l job-name=pytorch-mnist-job

The output is similar to the following:

Training started with configuration:
╭─────────────────────────────────────────────────╮
│ Training config                                  │
├──────────────────────────────────────────────────┤
│ train_loop_config/batch_size_per_worker       8  │
│ train_loop_config/epochs                     10  │
│ train_loop_config/lr                      0.001  │
╰─────────────────────────────────────────────────╯

# Several lines omitted

Training finished iteration 10 at 2024-06-19 08:29:36. Total running time: 9min 18s
╭───────────────────────────────╮
│ Training result                │
├────────────────────────────────┤
│ checkpoint_dir_name            │
│ time_this_iter_s      25.7394  │
│ time_total_s          351.233  │
│ training_iteration         10  │
│ accuracy               0.8656  │
│ loss                  0.37827  │
╰───────────────────────────────╯

# Several lines omitted
-------------------------------
Job 'pytorch-mnist' succeeded
-------------------------------

Verify that the Ray job completed:

kubectl get rayjob

The output is similar to the following:

NAME                JOB STATUS   DEPLOYMENT STATUS   START TIME             END TIME               AGE
pytorch-mnist-job   SUCCEEDED    Complete            2024-06-19T15:43:32Z   2024-06-19T15:51:12Z   9m6s

In this output, the DEPLOYMENT STATUS column indicates that the RayJob resource is Complete.
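The same check can be scripted by reading the RayJob status as JSON instead of the table. The following is a sketch, assuming that kubectl is configured for the cluster and that the RayJob status exposes the jobStatus and jobDeploymentStatus fields, which correspond to the JOB STATUS and DEPLOYMENT STATUS columns. The function names summarize_rayjob and fetch_rayjob are hypothetical.

```python
import json
import subprocess


def summarize_rayjob(raw_json: str) -> dict:
    """Extract the job and deployment status from a RayJob JSON document."""
    status = json.loads(raw_json).get("status", {})
    return {
        "job_status": status.get("jobStatus", ""),
        "deployment_status": status.get("jobDeploymentStatus", ""),
    }


def fetch_rayjob(name: str = "pytorch-mnist-job") -> dict:
    """Run kubectl and summarize the named RayJob resource."""
    result = subprocess.run(
        ["kubectl", "get", "rayjob", name, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return summarize_rayjob(result.stdout)
```

A script could call fetch_rayjob() in a loop and stop once job_status is SUCCEEDED or FAILED.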

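Clean up

Delete the project

Delete a Google Cloud project:

gcloud projects delete PROJECT_ID

Delete individual resources

If you used an existing project and you don't want to delete it, delete the individual resources instead. To delete the cluster, run the following command. If the gcloud CLI has no default location configured, add --location with the region or zone that you used when you created the cluster: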

gcloud container clusters delete ${CLUSTER_NAME}

What's next

Explore reference architectures, diagrams, and best practices about Google Cloud. Take a look at our Cloud Architecture Center.
