在 Google Kubernetes Engine (GKE) 中透過 Ray 和 PyTorch 訓練模型


本指南示範如何使用 RayPyTorchRay Operator 外掛程式,在 Google Kubernetes Engine (GKE) 上訓練模型。

關於 Ray

Ray 是開放原始碼的可擴充運算架構,適用於 AI/ML 應用程式。Ray Train 是 Ray 中的元件,專為分散式模型訓練和微調而設計。您可以使用 Ray Train API,在多部機器上擴充訓練作業,並與 PyTorch 等機器學習程式庫整合。

您可以使用 RayClusterRayJob 資源部署 Ray 訓練工作。在實際工作環境中部署 Ray 工作時,基於下列原因,您應使用 RayJob 資源

  • RayJob 資源會建立臨時 Ray 叢集,工作完成後即可自動刪除。
  • RayJob 資源支援重試政策,可確保工作執行作業的韌性。
  • 您可以使用熟悉的 Kubernetes API 模式管理 Ray 工作。

目標

本指南適用於生成式 AI 客戶、GKE 新手或現有使用者、機器學習工程師、MLOps (DevOps) 工程師,或是有意使用 Kubernetes 容器協調功能,透過 Ray 服務模型的平台管理員。

  • 建立 GKE 叢集。
  • 使用 RayCluster 自訂資源建立 Ray 叢集。
  • 使用 Ray 工作訓練模型。
  • 使用 RayJob 自訂資源部署 Ray 工作。

費用

在本文件中,您會使用 Google Cloud的下列計費元件:

如要根據預測用量估算費用,請使用 Pricing Calculator

初次使用 Google Cloud 的使用者可能符合免費試用資格。

完成本文所述工作後,您可以刪除已建立的資源,避免繼續計費。詳情請參閱清除所用資源一節。

事前準備

Cloud Shell 已預先安裝本教學課程所需的軟體,包括 kubectlgcloud CLI。如果您未使用 Cloud Shell,則必須安裝 gcloud CLI。

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. Install the Google Cloud CLI.

  3. If you're using an external identity provider (IdP), you must first sign in to the gcloud CLI with your federated identity.

  4. To initialize the gcloud CLI, run the following command:

    gcloud init
  5. Create or select a Google Cloud project.

    • Create a Google Cloud project:

      gcloud projects create PROJECT_ID

      Replace PROJECT_ID with a name for the Google Cloud project you are creating.

    • Select the Google Cloud project that you created:

      gcloud config set project PROJECT_ID

      Replace PROJECT_ID with your Google Cloud project name.

  6. Make sure that billing is enabled for your Google Cloud project.

  7. Enable the GKE API:

    gcloud services enable container.googleapis.com
  8. Install the Google Cloud CLI.

  9. If you're using an external identity provider (IdP), you must first sign in to the gcloud CLI with your federated identity.

  10. To initialize the gcloud CLI, run the following command:

    gcloud init
  11. Create or select a Google Cloud project.

    • Create a Google Cloud project:

      gcloud projects create PROJECT_ID

      Replace PROJECT_ID with a name for the Google Cloud project you are creating.

    • Select the Google Cloud project that you created:

      gcloud config set project PROJECT_ID

      Replace PROJECT_ID with your Google Cloud project name.

  12. Make sure that billing is enabled for your Google Cloud project.

  13. Enable the GKE API:

    gcloud services enable container.googleapis.com
  14. Grant roles to your user account. Run the following command once for each of the following IAM roles: roles/container.clusterAdmin, roles/container.admin

    gcloud projects add-iam-policy-binding PROJECT_ID --member="user:USER_IDENTIFIER" --role=ROLE
    • Replace PROJECT_ID with your project ID.
    • Replace USER_IDENTIFIER with the identifier for your user account. For example, user:myemail@example.com.

    • Replace ROLE with each individual role.

準備環境

如要準備環境,請按照下列步驟操作:

  1. 在 Google Cloud 控制台中,按一下Cloud Shell 啟用圖示Google Cloud 控制台中的「啟用 Cloud Shell」,即可啟動 Cloud Shell 工作階段。系統會在 Google Cloud 控制台的底部窗格啟動工作階段。

  2. 設定環境變數:

    export PROJECT_ID=PROJECT_ID
    export CLUSTER_NAME=ray-cluster
    export COMPUTE_REGION=us-central1
    export COMPUTE_ZONE=us-central1-c
    export CLUSTER_VERSION=CLUSTER_VERSION
    export TUTORIAL_HOME=`pwd`
    

    更改下列內容:

    • PROJECT_ID:您的 Google Cloud 專案 ID
    • CLUSTER_VERSION:要使用的 GKE 版本。必須為 1.30.1 或之後。
  3. 複製 GitHub 存放區:

    git clone https://github.com/GoogleCloudPlatform/kubernetes-engine-samples
    
  4. 變更為工作目錄:

    cd kubernetes-engine-samples/ai-ml/gke-ray/raytrain/pytorch-mnist
    
  5. 建立 Python 虛擬環境:

    python -m venv myenv && \
    source myenv/bin/activate
    
  6. 安裝 Ray

建立 GKE 叢集

建立 Autopilot 或 Standard GKE 叢集:

Autopilot

建立 Autopilot 叢集:

gcloud container clusters create-auto ${CLUSTER_NAME}  \
    --enable-ray-operator \
    --cluster-version=${CLUSTER_VERSION} \
    --location=${COMPUTE_REGION}

標準

建立標準叢集:

gcloud container clusters create ${CLUSTER_NAME} \
    --addons=RayOperator \
    --cluster-version=${CLUSTER_VERSION}  \
    --machine-type=e2-standard-8 \
    --location=${COMPUTE_ZONE} \
    --num-nodes=4

部署 RayCluster 資源

將 RayCluster 資源部署至叢集:

  1. 請查看下列資訊清單:

    apiVersion: ray.io/v1
    kind: RayCluster
    metadata:
      name: pytorch-mnist-cluster
    spec:
      rayVersion: '2.37.0'
      headGroupSpec:
        rayStartParams:
          dashboard-host: '0.0.0.0'
        template:
          metadata:
          spec:
            containers:
            - name: ray-head
              image: rayproject/ray:2.37.0
              ports:
              - containerPort: 6379
                name: gcs
              - containerPort: 8265
                name: dashboard
              - containerPort: 10001
                name: client
              resources:
                limits:
                  cpu: "2"
                  ephemeral-storage: "9Gi"
                  memory: "4Gi"
                requests:
                  cpu: "2"
                  ephemeral-storage: "9Gi"
                  memory: "4Gi"
      workerGroupSpecs:
      - replicas: 4
        minReplicas: 1
        maxReplicas: 5
        groupName: worker-group
        rayStartParams: {}
        template:
          spec:
            containers:
            - name: ray-worker
              image: rayproject/ray:2.37.0
              resources:
                limits:
                  cpu: "4"
                  ephemeral-storage: "9Gi"
                  memory: "8Gi"
                requests:
                  cpu: "4"
                  ephemeral-storage: "9Gi"
                  memory: "8Gi"

    這個資訊清單說明 RayCluster 自訂資源。

  2. 將資訊清單套用至 GKE 叢集:

    kubectl apply -f ray-cluster.yaml
    
  3. 確認 RayCluster 資源已準備就緒:

    kubectl get raycluster
    

    輸出結果會與下列內容相似:

    NAME                    DESIRED WORKERS   AVAILABLE WORKERS   CPUS   MEMORY   GPUS   STATUS   AGE
    pytorch-mnist-cluster   2                 2                   6      20Gi     0      ready    63s
    

    在這個輸出內容中,STATUS 資料欄中的 ready 表示 RayCluster 資源已準備就緒。

連線至 RayCluster 資源

連線至 RayCluster 資源,即可提交 Ray 工作。

  1. 確認 GKE 是否已建立 RayCluster 服務:

    kubectl get svc pytorch-mnist-cluster-head-svc
    

    輸出結果會與下列內容相似:

    NAME                             TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                                AGE
    pytorch-mnist-cluster-head-svc   ClusterIP   34.118.238.247   <none>        10001/TCP,8265/TCP,6379/TCP,8080/TCP   109s
    
  2. 建立通訊埠轉送工作階段至 Ray 標頭:

    kubectl port-forward svc/pytorch-mnist-cluster-head-svc 8265:8265 2>&1 >/dev/null &
    
  3. 確認 Ray 用戶端可以使用 localhost 連線至 Ray 叢集:

    ray list nodes --address http://localhost:8265
    

    輸出結果會與下列內容相似:

    Stats:
    ------------------------------
    Total: 3
    
    Table:
    ------------------------------
        NODE_ID                                                   NODE_IP     IS_HEAD_NODE    STATE    NODE_NAME    RESOURCES_TOTAL                 LABELS
    0  1d07447d7d124db641052a3443ed882f913510dbe866719ac36667d2  10.28.1.21  False           ALIVE    10.28.1.21   CPU: 2.0                        ray.io/node_id: 1d07447d7d124db641052a3443ed882f913510dbe866719ac36667d2
    # Several lines of output omitted
    

訓練模型

使用 Fashion MNIST 資料集訓練 PyTorch 模型:

  1. 提交 Ray 工作,並等待工作完成:

    ray job submit --submission-id pytorch-mnist-job --working-dir . --runtime-env-json='{"pip": ["torch", "torchvision"], "excludes": ["myenv"]}' --address http://localhost:8265 -- python train.py
    

    輸出結果會與下列內容相似:

    Job submission server address: http://localhost:8265
    
    --------------------------------------------
    Job 'pytorch-mnist-job' submitted successfully
    --------------------------------------------
    
    Next steps
      Query the logs of the job:
        ray job logs pytorch-mnist-job
      Query the status of the job:
        ray job status pytorch-mnist-job
      Request the job to be stopped:
        ray job stop pytorch-mnist-job
    
    Handling connection for 8265
    Tailing logs until the job exits (disable with --no-wait):
    ...
    ...
    
  2. 確認工作狀態:

    ray job status pytorch-mnist
    

    輸出結果會與下列內容相似:

    Job submission server address: http://localhost:8265
    Status for job 'pytorch-mnist-job': RUNNING
    Status message: Job is currently running.
    

    等待 Status for job 變成 COMPLETE。這項作業可能需要 15 分鐘以上才能完成。

  3. 查看 Ray 工作記錄:

    ray job logs pytorch-mnist
    

    輸出結果會與下列內容相似:

    Training started with configuration:
    ╭─────────────────────────────────────────────────╮
    │ Training config                                  │
    ├──────────────────────────────────────────────────┤
    │ train_loop_config/batch_size_per_worker       8  │
    │ train_loop_config/epochs                     10  │
    │ train_loop_config/lr                      0.001  │
    ╰─────────────────────────────────────────────────╯
    
    # Several lines omitted
    
    Training finished iteration 10 at 2024-06-19 08:29:36. Total running time: 9min 18s
    ╭───────────────────────────────╮
    │ Training result                │
    ├────────────────────────────────┤
    │ checkpoint_dir_name            │
    │ time_this_iter_s      25.7394  │
    │ time_total_s          351.233  │
    │ training_iteration         10  │
    │ accuracy               0.8656  │
    │ loss                  0.37827  │
    ╰───────────────────────────────╯
    
    # Several lines omitted
    -------------------------------
    Job 'pytorch-mnist' succeeded
    -------------------------------
    

部署 RayJob

RayJob 自訂資源會在執行單一 Ray 工作期間,管理 RayCluster 資源的生命週期。

  1. 請查看下列資訊清單:

    apiVersion: ray.io/v1
    kind: RayJob
    metadata:
      name: pytorch-mnist-job
    spec:
      shutdownAfterJobFinishes: true
      entrypoint: python ai-ml/gke-ray/raytrain/pytorch-mnist/train.py
      runtimeEnvYAML: |
        pip:
          - torch
          - torchvision
        working_dir: "https://github.com/GoogleCloudPlatform/kubernetes-engine-samples/archive/main.zip"
        env_vars:
          NUM_WORKERS: "4"
          CPUS_PER_WORKER: "2"
      rayClusterSpec:
        rayVersion: '2.37.0'
        headGroupSpec:
          rayStartParams: {}
          template:
            spec:
              containers:
                - name: ray-head
                  image: rayproject/ray:2.37.0
                  ports:
                    - containerPort: 6379
                      name: gcs-server
                    - containerPort: 8265
                      name: dashboard
                    - containerPort: 10001
                      name: client
                  resources:
                    limits:
                      cpu: "2"
                      ephemeral-storage: "9Gi"
                      memory: "4Gi"
                    requests:
                      cpu: "2"
                      ephemeral-storage: "9Gi"
                      memory: "4Gi"
        workerGroupSpecs:
          - replicas: 4
            minReplicas: 1
            maxReplicas: 5
            groupName: small-group
            rayStartParams: {}
            template:
              spec:
                containers:
                  - name: ray-worker
                    image: rayproject/ray:2.37.0
                    resources:
                      limits:
                        cpu: "4"
                        ephemeral-storage: "9Gi"
                        memory: "8Gi"
                      requests:
                        cpu: "4"
                        ephemeral-storage: "9Gi"
                        memory: "8Gi"

    這個資訊清單說明 RayJob 自訂資源。

  2. 將資訊清單套用至 GKE 叢集:

    kubectl apply -f ray-job.yaml
    
  3. 確認 RayJob 資源是否正在執行:

    kubectl get rayjob
    

    輸出結果會與下列內容相似:

    NAME                JOB STATUS   DEPLOYMENT STATUS   START TIME             END TIME   AGE
    pytorch-mnist-job   RUNNING      Running             2024-06-19T15:43:32Z              2m29s
    

    在這個輸出內容中,DEPLOYMENT STATUS 欄表示 RayJob 資源為 Running

  4. 查看 RayJob 資源狀態:

    kubectl logs -f -l job-name=pytorch-mnist-job
    

    輸出結果會與下列內容相似:

    Training started with configuration:
    ╭─────────────────────────────────────────────────╮
    │ Training config                                  │
    ├──────────────────────────────────────────────────┤
    │ train_loop_config/batch_size_per_worker       8  │
    │ train_loop_config/epochs                     10  │
    │ train_loop_config/lr                      0.001  │
    ╰─────────────────────────────────────────────────╯
    
    # Several lines omitted
    
    Training finished iteration 10 at 2024-06-19 08:29:36. Total running time: 9min 18s
    ╭───────────────────────────────╮
    │ Training result                │
    ├────────────────────────────────┤
    │ checkpoint_dir_name            │
    │ time_this_iter_s      25.7394  │
    │ time_total_s          351.233  │
    │ training_iteration         10  │
    │ accuracy               0.8656  │
    │ loss                  0.37827  │
    ╰───────────────────────────────╯
    
    # Several lines omitted
    -------------------------------
    Job 'pytorch-mnist' succeeded
    -------------------------------
    
  5. 確認 Ray 工作已完成:

    kubectl get rayjob
    

    輸出結果會與下列內容相似:

    NAME                JOB STATUS   DEPLOYMENT STATUS   START TIME             END TIME               AGE
    pytorch-mnist-job   SUCCEEDED    Complete            2024-06-19T15:43:32Z   2024-06-19T15:51:12Z   9m6s
    

    在這個輸出內容中,DEPLOYMENT STATUS 欄表示 RayJob 資源為 Complete

清除所用資源

刪除專案

    Delete a Google Cloud project:

    gcloud projects delete PROJECT_ID

刪除個別資源

如果您使用現有專案,但不想刪除專案,可以刪除個別資源。如要刪除叢集,請輸入:

gcloud container clusters delete ${CLUSTER_NAME}

後續步驟

  • 探索 Google Cloud 的參考架構、圖表和最佳做法。 歡迎瀏覽我們的雲端架構中心