Train a model with Ray and PyTorch on Google Kubernetes Engine (GKE)

Autopilot Standard

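This guide demonstrates how to train a model on Google Kubernetes Engine (GKE) using Ray, PyTorch, and the Ray Operator add-on.

About Ray

Ray is an open-source, scalable compute framework for AI/ML applications. Ray Train is a component within Ray designed for distributed model training and fine-tuning. You can use the Ray Train API to scale training across multiple machines and to integrate with machine learning libraries such as PyTorch.

You can deploy Ray training jobs using either the RayCluster or the RayJob resource. Use the RayJob resource when you deploy Ray jobs in production, for the following reasons:

- The RayJob resource creates an ephemeral Ray cluster that can be automatically deleted when the job completes.
- The RayJob resource supports retry policies for resilient job execution.
- You can manage Ray jobs using familiar Kubernetes API patterns.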

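Objectives

This guide is intended for generative AI customers, new or existing GKE users, ML engineers, MLOps (DevOps) engineers, and platform administrators who are interested in using Kubernetes container orchestration capabilities to run model workloads with Ray.

- Create a GKE cluster.
- Create a Ray cluster using the RayCluster custom resource.
- Train a model using a Ray job.
- Deploy a Ray job using the RayJob custom resource.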

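Costs

In this document, you use billable components of Google Cloud.

To generate a cost estimate based on your projected usage, use the pricing calculator. New Google Cloud users might be eligible for a free trial.

When you finish the tasks described in this document, you can avoid continued billing by deleting the resources that you created. For more information, see Clean up.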

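Before you begin

Cloud Shell comes preinstalled with the software you need for this tutorial, including kubectl and the gcloud CLI. If you don't use Cloud Shell, you must install the gcloud CLI.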

Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

Install the Google Cloud CLI.

To initialize the gcloud CLI, run the following command:

gcloud init

Note: If you installed the gcloud CLI previously, make sure you have the latest version by running the following command:

gcloud components update

Create or select a Google Cloud project.

Create a Google Cloud project:
```
gcloud projects create PROJECT_ID
```
Replace PROJECT_ID with a name for the Google Cloud project you are creating.
Select the Google Cloud project that you created:
```
gcloud config set project PROJECT_ID
```
Replace PROJECT_ID with your Google Cloud project name.

Make sure that billing is enabled for your Google Cloud project.

Enable the GKE API:

gcloud services enable container.googleapis.com

Grant roles to your user account. Run the following command once for each of the following IAM roles: roles/container.clusterAdmin, roles/container.admin
```
gcloud projects add-iam-policy-binding PROJECT_ID --member="user:USER_IDENTIFIER" --role=ROLE
```
- Replace PROJECT_ID with your project ID.
- Replace USER_IDENTIFIER with the identifier for your user account. For example, myemail@example.com. The user: prefix is already part of the command.
- Replace ROLE with each individual role.
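
Because the command has to be run once per role, it can help to generate the invocations rather than retype them. The following is a minimal sketch in Python that only builds and prints the command lines; it does not call gcloud. The helper name `iam_binding_commands` and the placeholder values are illustrative assumptions, not part of the gcloud CLI.

```python
import shlex

# The two roles named in the step above.
ROLES = ["roles/container.clusterAdmin", "roles/container.admin"]


def iam_binding_commands(project_id, user_email, roles=ROLES):
    """Return one gcloud command line per IAM role (hypothetical helper)."""
    commands = []
    for role in roles:
        args = [
            "gcloud", "projects", "add-iam-policy-binding", project_id,
            f"--member=user:{user_email}",
            f"--role={role}",
        ]
        # shlex.join quotes arguments where needed so each line can be pasted into a shell.
        commands.append(shlex.join(args))
    return commands


if __name__ == "__main__":
    for line in iam_binding_commands("PROJECT_ID", "myemail@example.com"):
        print(line)
```

Printing the commands first, instead of executing them, leaves room to review the project and account before any IAM policy changes.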

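Prepare your environment

To prepare your environment, follow these steps:

Launch a Cloud Shell session from the Google Cloud console by clicking Activate Cloud Shell. This opens a session in the bottom pane of the Google Cloud console.

Set environment variables: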

export PROJECT_ID=PROJECT_ID
export CLUSTER_NAME=ray-cluster
export COMPUTE_REGION=us-central1
export COMPUTE_ZONE=us-central1-c
export CLUSTER_VERSION=CLUSTER_VERSION
export TUTORIAL_HOME=`pwd`

Replace the following:

- PROJECT_ID: your Google Cloud project ID.
- CLUSTER_VERSION: the GKE version to use. Must be 1.30.1 or later.
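
GKE version strings usually carry a suffix such as `-gke.N`, so comparing them as plain strings can mislead. The following is a minimal sketch of a numeric check against the 1.30.1 minimum, assuming version strings of the form `MAJOR.MINOR.PATCH[-suffix]`. The function name is hypothetical and the sample version strings are made up for illustration.

```python
import re

MINIMUM = (1, 30, 1)  # minimum GKE version required by this tutorial


def meets_minimum(version, minimum=MINIMUM):
    """Return True if `version` (for example '1.30.1-gke.100') is at least `minimum`."""
    match = re.match(r"^(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    # Compare as integer tuples so that 1.30.10 sorts after 1.30.9.
    return tuple(int(part) for part in match.groups()) >= minimum


if __name__ == "__main__":
    for candidate in ("1.29.6-gke.100", "1.30.1-gke.100", "1.31.0"):
        print(candidate, meets_minimum(candidate))
```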

Clone the GitHub repository:

git clone https://github.com/GoogleCloudPlatform/kubernetes-engine-samples

Change to the working directory:

cd kubernetes-engine-samples/ai-ml/gke-ray/raytrain/pytorch-mnist

Create a Python virtual environment:

python -m venv myenv && \
source myenv/bin/activate

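Install Ray in the virtual environment, for example with pip install "ray[default]". The ray CLI commands used later in this guide depend on it.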

Create a GKE cluster

Create an Autopilot or Standard GKE cluster:

Autopilot

Create an Autopilot cluster:

gcloud container clusters create-auto ${CLUSTER_NAME}  \
    --enable-ray-operator \
    --cluster-version=${CLUSTER_VERSION} \
    --location=${COMPUTE_REGION}

Standard

Create a Standard cluster:

gcloud container clusters create ${CLUSTER_NAME} \
    --addons=RayOperator \
    --cluster-version=${CLUSTER_VERSION}  \
    --machine-type=e2-standard-8 \
    --location=${COMPUTE_ZONE} \
    --num-nodes=4

Deploy a RayCluster resource

Deploy a RayCluster resource to your cluster:

Review the following manifest:

apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: pytorch-mnist-cluster
spec:
  rayVersion: '2.37.0'
  headGroupSpec:
    rayStartParams:
      dashboard-host: '0.0.0.0'
    template:
      metadata:
      spec:
        containers:
        - name: ray-head
          image: rayproject/ray:2.37.0
          ports:
          - containerPort: 6379
            name: gcs
          - containerPort: 8265
            name: dashboard
          - containerPort: 10001
            name: client
          resources:
            limits:
              cpu: "2"
              ephemeral-storage: "9Gi"
              memory: "4Gi"
            requests:
              cpu: "2"
              ephemeral-storage: "9Gi"
              memory: "4Gi"
  workerGroupSpecs:
  - replicas: 4
    minReplicas: 1
    maxReplicas: 5
    groupName: worker-group
    rayStartParams: {}
    template:
      spec:
        containers:
        - name: ray-worker
          image: rayproject/ray:2.37.0
          resources:
            limits:
              cpu: "4"
              ephemeral-storage: "9Gi"
              memory: "8Gi"
            requests:
              cpu: "4"
              ephemeral-storage: "9Gi"
              memory: "8Gi"

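This manifest describes a RayCluster custom resource with one head Pod and a worker group of four replicas (minReplicas 1, maxReplicas 5).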
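
To see what this manifest asks of the cluster, it helps to add up the resource requests. The sketch below hard-codes the values from the manifest above rather than parsing the YAML; the variable and function names are illustrative.

```python
# Resource requests copied from the RayCluster manifest above.
HEAD = {"replicas": 1, "cpu": 2, "memory_gi": 4}
WORKERS = {"replicas": 4, "cpu": 4, "memory_gi": 8}


def total_requests(*groups):
    """Sum CPU and memory requests across Pod groups."""
    cpu = sum(group["replicas"] * group["cpu"] for group in groups)
    memory_gi = sum(group["replicas"] * group["memory_gi"] for group in groups)
    return cpu, memory_gi


if __name__ == "__main__":
    cpu, memory_gi = total_requests(HEAD, WORKERS)
    print(f"Requested: {cpu} CPUs, {memory_gi}Gi memory")
```

With four worker replicas the requests come to 18 CPUs and 36Gi of memory. The four e2-standard-8 nodes of the Standard cluster provide 32 vCPUs in total, although part of each node is reserved for system components.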

Apply the manifest to your GKE cluster:
```
kubectl apply -f ray-cluster.yaml
```

Verify that the RayCluster resource is ready:

kubectl get raycluster

The output is similar to the following:

NAME                    DESIRED WORKERS   AVAILABLE WORKERS   CPUS   MEMORY   GPUS   STATUS   AGE
pytorch-mnist-cluster   2                 2                   6      20Gi     0      ready    63s

In this output, ready in the STATUS column indicates that the RayCluster resource is ready.
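
If you want to check readiness from a script, the column-aligned output can be parsed directly. The following is a minimal sketch that works on the sample output shown above; it assumes columns are separated by two or more spaces and that no cell is empty. The function names are hypothetical.

```python
import re

SAMPLE = """\
NAME                    DESIRED WORKERS   AVAILABLE WORKERS   CPUS   MEMORY   GPUS   STATUS   AGE
pytorch-mnist-cluster   2                 2                   6      20Gi     0      ready    63s
"""


def split_columns(line):
    """Split on runs of two or more spaces; header names may contain a single space."""
    return re.split(r"\s{2,}", line.strip())


def parse_table(text):
    """Parse column-aligned kubectl output into a list of dicts keyed by header."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = split_columns(lines[0])
    return [dict(zip(header, split_columns(line))) for line in lines[1:]]


def ready_clusters(text):
    """Return the names of RayClusters whose STATUS column reads 'ready'."""
    return [row["NAME"] for row in parse_table(text) if row.get("STATUS") == "ready"]


if __name__ == "__main__":
    print(ready_clusters(SAMPLE))
```

Text parsing like this misaligns when a column is empty, so for anything beyond a quick check the structured output of kubectl get raycluster -o json is a more robust input.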

Connect to the RayCluster resource

Connect to the RayCluster resource to submit a Ray job.

Verify that GKE created the RayCluster Service:

kubectl get svc pytorch-mnist-cluster-head-svc

The output is similar to the following:

NAME                             TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                                AGE
pytorch-mnist-cluster-head-svc   ClusterIP   34.118.238.247   <none>        10001/TCP,8265/TCP,6379/TCP,8080/TCP   109s

Establish a port-forwarding session to the Ray head:

kubectl port-forward svc/pytorch-mnist-cluster-head-svc 8265:8265 2>&1 >/dev/null &
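
Because the port-forward runs in the background, a failure can go unnoticed. The following is a small sketch of a reachability check for the forwarded port, using only the Python standard library; the function name is hypothetical. A successful TCP connection shows only that something is listening on the local port, not that the Ray dashboard behind it is healthy.

```python
import socket


def port_is_open(host="localhost", port=8265, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    print("dashboard port reachable:", port_is_open())
```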

Verify that the Ray client can connect to the Ray cluster using localhost:

ray list nodes --address http://localhost:8265

The output is similar to the following:

Stats:
------------------------------
Total: 3

Table:
------------------------------
    NODE_ID                                                   NODE_IP     IS_HEAD_NODE    STATE    NODE_NAME    RESOURCES_TOTAL                 LABELS
0  1d07447d7d124db641052a3443ed882f913510dbe866719ac36667d2  10.28.1.21  False           ALIVE    10.28.1.21   CPU: 2.0                        ray.io/node_id: 1d07447d7d124db641052a3443ed882f913510dbe866719ac36667d2
# Several lines of output omitted

Train a model

Train a PyTorch model using the Fashion MNIST dataset:

Submit a Ray job and wait for the job to complete:

ray job submit --submission-id pytorch-mnist-job --working-dir . --runtime-env-json='{"pip": ["torch", "torchvision"], "excludes": ["myenv"]}' --address http://localhost:8265 -- python train.py

The output is similar to the following:

Job submission server address: http://localhost:8265

--------------------------------------------
Job 'pytorch-mnist-job' submitted successfully
--------------------------------------------

Next steps
  Query the logs of the job:
    ray job logs pytorch-mnist-job
  Query the status of the job:
    ray job status pytorch-mnist-job
  Request the job to be stopped:
    ray job stop pytorch-mnist-job

Handling connection for 8265
Tailing logs until the job exits (disable with --no-wait):
...
...

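Verify the job status, using the submission ID that you passed to ray job submit:

ray job status pytorch-mnist-job

The output is similar to the following: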

Job submission server address: http://localhost:8265
Status for job 'pytorch-mnist-job': RUNNING
Status message: Job is currently running.

Wait for the Status for job to become SUCCEEDED. This process might take 15 minutes or longer.
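
Instead of rerunning the status command by hand, you can poll until the job reaches a terminal state. The following is a hedged sketch: `wait_for_job` is a hypothetical helper that accepts any callable returning the status as a string, and `tutorial_job_status` shows one way to back it with Ray's `JobSubmissionClient`, assuming Ray is installed locally and the port-forward from the earlier step is still running.

```python
import time

# Terminal states reported for a Ray job.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED"}


def wait_for_job(get_status, poll_seconds=10.0, timeout_seconds=1800.0, sleep=time.sleep):
    """Poll `get_status()` until it returns a terminal state, then return that state."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = str(get_status())
        if status in TERMINAL_STATES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still {status} after {timeout_seconds}s")
        sleep(poll_seconds)


def tutorial_job_status(address="http://localhost:8265", job_id="pytorch-mnist-job"):
    """Return a status getter backed by Ray's job submission SDK (needs Ray installed)."""
    # Imported here so that the polling helper above works without Ray.
    from ray.job_submission import JobSubmissionClient

    client = JobSubmissionClient(address)
    return lambda: client.get_job_status(job_id).value
```

Calling `wait_for_job(tutorial_job_status())` blocks until the job finishes and returns its final state. The 30-minute default timeout is an arbitrary choice for this sketch.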

View the Ray job logs:

ray job logs pytorch-mnist-job

The output is similar to the following:

Training started with configuration:
╭─────────────────────────────────────────────────╮
│ Training config                                  │
├──────────────────────────────────────────────────┤
│ train_loop_config/batch_size_per_worker       8  │
│ train_loop_config/epochs                     10  │
│ train_loop_config/lr                      0.001  │
╰─────────────────────────────────────────────────╯

# Several lines omitted

Training finished iteration 10 at 2024-06-19 08:29:36. Total running time: 9min 18s
╭───────────────────────────────╮
│ Training result                │
├────────────────────────────────┤
│ checkpoint_dir_name            │
│ time_this_iter_s      25.7394  │
│ time_total_s          351.233  │
│ training_iteration         10  │
│ accuracy               0.8656  │
│ loss                  0.37827  │
╰───────────────────────────────╯

# Several lines omitted
-------------------------------
Job 'pytorch-mnist' succeeded
-------------------------------

Deploy a RayJob

The RayJob custom resource manages the lifecycle of a RayCluster resource during the execution of a single Ray job.

Review the following manifest:

apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: pytorch-mnist-job
spec:
  shutdownAfterJobFinishes: true
  entrypoint: python ai-ml/gke-ray/raytrain/pytorch-mnist/train.py
  runtimeEnvYAML: |
    pip:
      - torch
      - torchvision
    working_dir: "https://github.com/GoogleCloudPlatform/kubernetes-engine-samples/archive/main.zip"
    env_vars:
      NUM_WORKERS: "4"
      CPUS_PER_WORKER: "2"
  rayClusterSpec:
    rayVersion: '2.37.0'
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.37.0
              ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8265
                  name: dashboard
                - containerPort: 10001
                  name: client
              resources:
                limits:
                  cpu: "2"
                  ephemeral-storage: "9Gi"
                  memory: "4Gi"
                requests:
                  cpu: "2"
                  ephemeral-storage: "9Gi"
                  memory: "4Gi"
    workerGroupSpecs:
      - replicas: 4
        minReplicas: 1
        maxReplicas: 5
        groupName: small-group
        rayStartParams: {}
        template:
          spec:
            containers:
              - name: ray-worker
                image: rayproject/ray:2.37.0
                resources:
                  limits:
                    cpu: "4"
                    ephemeral-storage: "9Gi"
                    memory: "8Gi"
                  requests:
                    cpu: "4"
                    ephemeral-storage: "9Gi"
                    memory: "8Gi"

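This manifest describes a RayJob custom resource. Because shutdownAfterJobFinishes is set to true, the Ray cluster that the RayJob creates is deleted after the job finishes.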
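
The manifest passes NUM_WORKERS and CPUS_PER_WORKER to the job through env_vars. How train.py consumes them is not shown in this guide. The following sketch assumes the script reads them from the environment and turns them into a Ray Train `ScalingConfig`; the function names are hypothetical and the defaults mirror the manifest values.

```python
import os


def scaling_from_env(environ=os.environ):
    """Read worker count and per-worker CPUs from environment variables."""
    num_workers = int(environ.get("NUM_WORKERS", "4"))
    cpus_per_worker = int(environ.get("CPUS_PER_WORKER", "2"))
    return {
        "num_workers": num_workers,
        "use_gpu": False,  # the manifests in this guide request CPUs only
        "resources_per_worker": {"CPU": cpus_per_worker},
    }


def build_scaling_config(environ=os.environ):
    """Build a Ray Train ScalingConfig; the import is deferred because it needs Ray."""
    from ray.train import ScalingConfig

    return ScalingConfig(**scaling_from_env(environ))


if __name__ == "__main__":
    settings = scaling_from_env()
    total_cpus = settings["num_workers"] * settings["resources_per_worker"]["CPU"]
    print(f"{settings['num_workers']} workers, {total_cpus} CPUs for training")
```

With the manifest values, training would use 4 workers at 2 CPUs each, 8 CPUs in total, which fits within the 16 CPUs requested by the four worker Pods.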

Apply the manifest to your GKE cluster:
```
kubectl apply -f ray-job.yaml
```

Verify that the RayJob resource is running:

kubectl get rayjob

The output is similar to the following:

NAME                JOB STATUS   DEPLOYMENT STATUS   START TIME             END TIME   AGE
pytorch-mnist-job   RUNNING      Running             2024-06-19T15:43:32Z              2m29s

In this output, the DEPLOYMENT STATUS column indicates that the RayJob resource is Running.

View the status of the RayJob resource by following its logs:

kubectl logs -f -l job-name=pytorch-mnist-job

The output is similar to the following:

Training started with configuration:
╭─────────────────────────────────────────────────╮
│ Training config                                  │
├──────────────────────────────────────────────────┤
│ train_loop_config/batch_size_per_worker       8  │
│ train_loop_config/epochs                     10  │
│ train_loop_config/lr                      0.001  │
╰─────────────────────────────────────────────────╯

# Several lines omitted

Training finished iteration 10 at 2024-06-19 08:29:36. Total running time: 9min 18s
╭───────────────────────────────╮
│ Training result                │
├────────────────────────────────┤
│ checkpoint_dir_name            │
│ time_this_iter_s      25.7394  │
│ time_total_s          351.233  │
│ training_iteration         10  │
│ accuracy               0.8656  │
│ loss                  0.37827  │
╰───────────────────────────────╯

# Several lines omitted
-------------------------------
Job 'pytorch-mnist' succeeded
-------------------------------

Verify that the Ray job has completed:

kubectl get rayjob

The output is similar to the following:

NAME                JOB STATUS   DEPLOYMENT STATUS   START TIME             END TIME               AGE
pytorch-mnist-job   SUCCEEDED    Complete            2024-06-19T15:43:32Z   2024-06-19T15:51:12Z   9m6s

In this output, the DEPLOYMENT STATUS column indicates that the RayJob resource is Complete.
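
For automation, the same check can be made against structured output, for example from kubectl get rayjob pytorch-mnist-job -o json. The sketch below assumes the JOB STATUS and DEPLOYMENT STATUS columns are backed by the `status.jobStatus` and `status.jobDeploymentStatus` fields of the RayJob resource. The sample JSON is hand-written for illustration, not captured output, and the function names are hypothetical.

```python
import json

# Hand-written stand-in for `kubectl get rayjob pytorch-mnist-job -o json`.
SAMPLE = """
{
  "metadata": {"name": "pytorch-mnist-job"},
  "status": {"jobStatus": "SUCCEEDED", "jobDeploymentStatus": "Complete"}
}
"""


def summarize_rayjob(raw_json):
    """Extract the name and the two status fields from a RayJob JSON document."""
    obj = json.loads(raw_json)
    status = obj.get("status", {})
    return {
        "name": obj["metadata"]["name"],
        "job_status": status.get("jobStatus", ""),
        "deployment_status": status.get("jobDeploymentStatus", ""),
    }


def succeeded(summary):
    """True when both fields match the completed output shown above."""
    return summary["job_status"] == "SUCCEEDED" and summary["deployment_status"] == "Complete"


if __name__ == "__main__":
    summary = summarize_rayjob(SAMPLE)
    print(summary["name"], "succeeded:", succeeded(summary))
```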

Clean up

Delete the project

Delete a Google Cloud project:

gcloud projects delete PROJECT_ID

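Delete individual resources

If you used an existing project and you don't want to delete it, you can delete the individual resources that you no longer need. To delete the cluster, run the following command: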

gcloud container clusters delete ${CLUSTER_NAME}

What's next

Explore reference architectures, diagrams, and best practices about Google Cloud. Take a look at the Cloud Architecture Center.
