Train a model with GPUs in GKE Autopilot mode


This quickstart shows you how to deploy a training model with GPUs in Google Kubernetes Engine (GKE) and store the predictions in Cloud Storage. This document is intended for GKE administrators who have an existing Autopilot mode cluster and want to run GPU workloads for the first time.

You can also run these workloads on a Standard cluster if you create separate GPU node pools in the cluster. For instructions, see Train a model with GPUs in GKE Standard mode.

Before you begin

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Make sure that billing is enabled for your Google Cloud project.

  4. Enable the GKE and Cloud Storage APIs.

    Enable the APIs

  5. Install the Google Cloud CLI.
  6. To initialize the gcloud CLI, run the following command:

    gcloud init
  7. In the Google Cloud console, activate Cloud Shell.

    Activate Cloud Shell

    At the bottom of the Google Cloud console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Google Cloud CLI already installed and with values already set for your current project. It can take a few seconds for the session to initialize.

Clone the sample repository

In Cloud Shell, run the following command:

git clone https://github.com/GoogleCloudPlatform/ai-on-gke && \
cd ai-on-gke/tutorials-and-examples/gpu-examples/training-single-gpu

Create a cluster

  1. In the Google Cloud console, go to the Create an Autopilot cluster page:

    Go to Create an Autopilot cluster

  2. In the Name field, enter gke-gpu-cluster.

  3. In the Region list, select us-central1.

  4. Click Create.

Create a Cloud Storage bucket

  1. In the Google Cloud console, go to the Create a bucket page:

    Go to Create a bucket

  2. In the Name your bucket field, enter the following name:

    PROJECT_ID-gke-gpu-bucket
    

    Replace PROJECT_ID with your Google Cloud project ID.

  3. Click Continue.

  4. For Location type, select Region.

  5. In the Region list, select us-central1 (Iowa), and then click Continue.

  6. In the Choose a storage class for your data section, click Continue.

  7. In the Choose how to control access to objects section, for Access control, select Uniform.

  8. Click Create.

  9. In the Public access will be prevented dialog, ensure that the Prevent public access of this bucket checkbox is selected, and then click Confirm.

Configure your cluster to access the bucket using Workload Identity Federation for GKE

To let your cluster access the Cloud Storage bucket, you do the following:

  1. Create a Kubernetes ServiceAccount in the cluster.
  2. Create an IAM allow policy that lets the ServiceAccount access the bucket.
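The --member value in the IAM policy binding later in this section follows a fixed pattern that combines the project number, the cluster's Workload Identity pool, the Kubernetes namespace, and the ServiceAccount name. As a sketch of how the pieces fit together (all values here are placeholders, not real project identifiers):

```python
# Assembles the Workload Identity Federation principal identifier that the
# IAM policy binding expects. All argument values below are placeholders.
def wif_principal(project_number, project_id, namespace, ksa):
    return (
        f"principal://iam.googleapis.com/projects/{project_number}"
        f"/locations/global/workloadIdentityPools/{project_id}.svc.id.goog"
        f"/subject/ns/{namespace}/sa/{ksa}"
    )

print(wif_principal("123456789012", "my-project", "gke-gpu-namespace", "gpu-k8s-sa"))
```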

Create a Kubernetes ServiceAccount in the cluster

In Cloud Shell, do the following:

  1. Connect to your cluster:

    gcloud container clusters get-credentials gke-gpu-cluster \
        --location=us-central1
    
  2. Create a Kubernetes namespace:

    kubectl create namespace gke-gpu-namespace
    
  3. Create a Kubernetes ServiceAccount in the namespace:

    kubectl create serviceaccount gpu-k8s-sa --namespace=gke-gpu-namespace
    

Create an IAM allow policy for the bucket

Grant the Storage Object Admin (roles/storage.objectAdmin) role on the bucket to the Kubernetes ServiceAccount:

gcloud storage buckets add-iam-policy-binding gs://PROJECT_ID-gke-gpu-bucket \
    --member=principal://iam.googleapis.com/projects/PROJECT_NUMBER/locations/global/workloadIdentityPools/PROJECT_ID.svc.id.goog/subject/ns/gke-gpu-namespace/sa/gpu-k8s-sa \
    --role=roles/storage.objectAdmin \
    --condition=None

Replace PROJECT_ID with your Google Cloud project ID and PROJECT_NUMBER with your Google Cloud project number.

Verify that Pods can access the Cloud Storage bucket

  1. In Cloud Shell, create the following environment variables:

    export K8S_SA_NAME=gpu-k8s-sa
    export BUCKET_NAME=PROJECT_ID-gke-gpu-bucket
    

    Replace PROJECT_ID with your Google Cloud project ID.

  2. Create a Pod that has a TensorFlow container:

    envsubst < src/gke-config/standard-tensorflow-bash.yaml | kubectl --namespace=gke-gpu-namespace apply -f -
    

    This command substitutes the environment variables that you created into the corresponding references in the manifest. You can also open the manifest in a text editor and replace $K8S_SA_NAME and $BUCKET_NAME with the corresponding values.

  3. Create a sample file in the bucket:

    touch sample-file
    gsutil cp sample-file gs://PROJECT_ID-gke-gpu-bucket
    
  4. Wait for the Pod to be ready:

    kubectl wait --for=condition=Ready pod/test-tensorflow-pod -n=gke-gpu-namespace --timeout=180s
    

    When the Pod is ready, the output is the following:

    pod/test-tensorflow-pod condition met
    

    If the command times out, GKE might still be creating a new node to run the Pod. Run the command again and wait for the Pod to be ready.

  5. Open a shell in the TensorFlow container:

    kubectl -n gke-gpu-namespace exec --stdin --tty test-tensorflow-pod --container tensorflow -- /bin/bash
    
  6. Try to read the sample file that you created:

    ls /data
    

    The output shows the sample file.

  7. Identify the GPUs that are attached to the Pod:

    python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
    

    The output shows the GPUs attached to the Pod, similar to the following:

    ...
    PhysicalDevice(name='/physical_device:GPU:0',device_type='GPU')
    
  8. Exit the container:

    exit
    
  9. Delete the sample Pod:

    kubectl delete -f src/gke-config/standard-tensorflow-bash.yaml \
        --namespace=gke-gpu-namespace
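The envsubst substitution used above can be mimicked in plain Python with string.Template, which understands the same $VAR placeholder syntax. The manifest fragment below is a stand-in for illustration, not a file from the repository:

```python
from string import Template

# Stand-in manifest fragment; envsubst replaces each $VAR reference with the
# value of the identically named environment variable.
manifest = Template(
    "serviceAccountName: $K8S_SA_NAME\n"
    "bucketName: $BUCKET_NAME\n"
)

values = {"K8S_SA_NAME": "gpu-k8s-sa", "BUCKET_NAME": "my-project-gke-gpu-bucket"}
print(manifest.substitute(values))
```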
    

Train and predict with the MNIST dataset

In this section, you run a training workload on the MNIST example dataset.

  1. Copy the example data to the Cloud Storage bucket:

    gsutil -m cp -R src/tensorflow-mnist-example gs://PROJECT_ID-gke-gpu-bucket/
    
  2. Create the following environment variables:

    export K8S_SA_NAME=gpu-k8s-sa
    export BUCKET_NAME=PROJECT_ID-gke-gpu-bucket
    
  3. Review the training job:

    # Copyright 2023 Google LLC
    #
    # Licensed under the Apache License, Version 2.0 (the "License");
    # you may not use this file except in compliance with the License.
    # You may obtain a copy of the License at
    #
    #      http://www.apache.org/licenses/LICENSE-2.0
    #
    # Unless required by applicable law or agreed to in writing, software
    # distributed under the License is distributed on an "AS IS" BASIS,
    # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    # See the License for the specific language governing permissions and
    # limitations under the License.
    
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: mnist-training-job
    spec:
      template:
        metadata:
          name: mnist
          annotations:
            gke-gcsfuse/volumes: "true"
        spec:
          nodeSelector:
            cloud.google.com/gke-accelerator: nvidia-tesla-t4
          tolerations:
          - key: "nvidia.com/gpu"
            operator: "Exists"
            effect: "NoSchedule"
          containers:
          - name: tensorflow
            image: tensorflow/tensorflow:latest-gpu 
            command: ["/bin/bash", "-c", "--"]
            args: ["cd /data/tensorflow-mnist-example; pip install -r requirements.txt; python tensorflow_mnist_train_distributed.py"]
            resources:
              limits:
                nvidia.com/gpu: 1
                cpu: 1
                memory: 3Gi
            volumeMounts:
            - name: gcs-fuse-csi-vol
              mountPath: /data
              readOnly: false
          serviceAccountName: $K8S_SA_NAME
          volumes:
          - name: gcs-fuse-csi-vol
            csi:
              driver: gcsfuse.csi.storage.gke.io
              readOnly: false
              volumeAttributes:
                bucketName: $BUCKET_NAME
                mountOptions: "implicit-dirs"
          restartPolicy: "Never"
  4. Deploy the training job:

    envsubst < src/gke-config/standard-tf-mnist-train.yaml | kubectl -n gke-gpu-namespace apply -f -
    

    This command substitutes the environment variables that you created into the corresponding references in the manifest. You can also open the manifest in a text editor and replace $K8S_SA_NAME and $BUCKET_NAME with the corresponding values.

  5. Wait until the job has the Completed status:

    kubectl wait -n gke-gpu-namespace --for=condition=Complete job/mnist-training-job --timeout=180s
    

    When the job is ready, the output is similar to the following:

    job.batch/mnist-training-job condition met
    

    If the command times out, GKE might still be creating a new node to run the Pods. Run the command again and wait for the job to be ready.

  6. Check the logs from the TensorFlow container:

    kubectl logs -f jobs/mnist-training-job -c tensorflow -n gke-gpu-namespace
    

    The output shows that the following events occur:

    • Install the required Python packages
    • Download the MNIST dataset
    • Train the model using a GPU
    • Save the model
    • Evaluate the model
    ...
    Epoch 12/12
    927/938 [============================>.] - ETA: 0s - loss: 0.0188 - accuracy: 0.9954
    Learning rate for epoch 12 is 9.999999747378752e-06
    938/938 [==============================] - 5s 6ms/step - loss: 0.0187 - accuracy: 0.9954 - lr: 1.0000e-05
    157/157 [==============================] - 1s 4ms/step - loss: 0.0424 - accuracy: 0.9861
    Eval loss: 0.04236088693141937, Eval accuracy: 0.9861000180244446
    Training finished. Model saved
    
  7. Delete the training workload:

    kubectl -n gke-gpu-namespace delete -f src/gke-config/standard-tf-mnist-train.yaml
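If you want to act on those training logs programmatically (for example, to fail a pipeline when accuracy drops below a threshold), the final evaluation line can be parsed with a small script. The log line below is copied from the sample output shown earlier:

```python
import re

# Final evaluation line from the training log shown above.
line = "Eval loss: 0.04236088693141937, Eval accuracy: 0.9861000180244446"

# Extract the numeric accuracy value from the log line.
match = re.search(r"Eval accuracy: ([0-9.]+)", line)
accuracy = float(match.group(1))
print(f"eval accuracy: {accuracy:.4f}")  # eval accuracy: 0.9861
```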
    

Deploy an inference workload

In this section, you deploy an inference workload that takes a sample dataset as input and returns predictions.

  1. Copy the images for prediction to the bucket:

    gsutil -m cp -R data/mnist_predict gs://PROJECT_ID-gke-gpu-bucket/
    
  2. Review the inference workload:

    # Copyright 2023 Google LLC
    #
    # Licensed under the Apache License, Version 2.0 (the "License");
    # you may not use this file except in compliance with the License.
    # You may obtain a copy of the License at
    #
    #      http://www.apache.org/licenses/LICENSE-2.0
    #
    # Unless required by applicable law or agreed to in writing, software
    # distributed under the License is distributed on an "AS IS" BASIS,
    # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    # See the License for the specific language governing permissions and
    # limitations under the License.
    
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: mnist-batch-prediction-job
    spec:
      template:
        metadata:
          name: mnist
          annotations:
            gke-gcsfuse/volumes: "true"
        spec:
          nodeSelector:
            cloud.google.com/gke-accelerator: nvidia-tesla-t4
          tolerations:
          - key: "nvidia.com/gpu"
            operator: "Exists"
            effect: "NoSchedule"
          containers:
          - name: tensorflow
            image: tensorflow/tensorflow:latest-gpu 
            command: ["/bin/bash", "-c", "--"]
            args: ["cd /data/tensorflow-mnist-example; pip install -r requirements.txt; python tensorflow_mnist_batch_predict.py"]
            resources:
              limits:
                nvidia.com/gpu: 1
                cpu: 1
                memory: 3Gi
            volumeMounts:
            - name: gcs-fuse-csi-vol
              mountPath: /data
              readOnly: false
          serviceAccountName: $K8S_SA_NAME
          volumes:
          - name: gcs-fuse-csi-vol
            csi:
              driver: gcsfuse.csi.storage.gke.io
              readOnly: false
              volumeAttributes:
                bucketName: $BUCKET_NAME
                mountOptions: "implicit-dirs"
          restartPolicy: "Never"
  3. Deploy the inference workload:

    envsubst < src/gke-config/standard-tf-mnist-batch-predict.yaml | kubectl -n gke-gpu-namespace apply -f -
    

    This command substitutes the environment variables that you created into the corresponding references in the manifest. You can also open the manifest in a text editor and replace $K8S_SA_NAME and $BUCKET_NAME with the corresponding values.

  4. Wait until the job has the Completed status:

    kubectl wait -n gke-gpu-namespace --for=condition=Complete job/mnist-batch-prediction-job --timeout=180s
    

    The output is similar to the following:

    job.batch/mnist-batch-prediction-job condition met
    
  5. Check the logs from the TensorFlow container:

    kubectl logs -f jobs/mnist-batch-prediction-job -c tensorflow -n gke-gpu-namespace
    

    The output shows the prediction for each image and the model's confidence in the prediction, similar to the following:

    Found 10 files belonging to 1 classes.
    1/1 [==============================] - 2s 2s/step
    The image /data/mnist_predict/0.png is the number 0 with a 100.00 percent confidence.
    The image /data/mnist_predict/1.png is the number 1 with a 99.99 percent confidence.
    The image /data/mnist_predict/2.png is the number 2 with a 100.00 percent confidence.
    The image /data/mnist_predict/3.png is the number 3 with a 99.95 percent confidence.
    The image /data/mnist_predict/4.png is the number 4 with a 100.00 percent confidence.
    The image /data/mnist_predict/5.png is the number 5 with a 100.00 percent confidence.
    The image /data/mnist_predict/6.png is the number 6 with a 99.97 percent confidence.
    The image /data/mnist_predict/7.png is the number 7 with a 100.00 percent confidence.
    The image /data/mnist_predict/8.png is the number 8 with a 100.00 percent confidence.
    The image /data/mnist_predict/9.png is the number 9 with a 99.65 percent confidence.
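The per-image confidence reported in these logs is presumably the largest softmax probability across the ten digit classes. A minimal sketch of that computation, with made-up logits rather than real model output:

```python
import math

def softmax(logits):
    # Numerically stable softmax: subtract the max before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits for one image; index 2 dominates, so the "prediction" is 2.
logits = [0.1, 0.2, 9.5, 0.0, 0.3, 0.1, 0.2, 0.0, 0.4, 0.1]
probs = softmax(logits)
digit = probs.index(max(probs))
print(f"the number {digit} with a {max(probs) * 100:.2f} percent confidence")
```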
    

Clean up

To avoid incurring charges to your Google Cloud account for the resources that you created in this guide, do one of the following:

  • Keep the GKE cluster: delete the Kubernetes resources in the cluster and the Google Cloud resources
  • Keep the Google Cloud project: delete the GKE cluster and the Google Cloud resources
  • Delete the project

Delete the Kubernetes resources in the cluster and the Google Cloud resources

  1. Delete the Kubernetes namespace and the workloads that you deployed:

    kubectl -n gke-gpu-namespace delete -f src/gke-config/standard-tf-mnist-batch-predict.yaml
    kubectl delete namespace gke-gpu-namespace
    
  2. Delete the Cloud Storage bucket:

    1. Go to the Buckets page:

      Go to Buckets

    2. Select the checkbox for PROJECT_ID-gke-gpu-bucket.

    3. Click Delete.

    4. To confirm the deletion, type DELETE, and then click Delete.

  3. Delete the Google Cloud service account:

    1. Go to the Service accounts page:

      Go to Service accounts

    2. Select your project.

    3. Select the checkbox for gke-gpu-sa@PROJECT_ID.iam.gserviceaccount.com.

    4. Click Delete.

    5. To confirm the deletion, click Delete.

Delete the GKE cluster and the Google Cloud resources

  1. Delete the GKE cluster:

    1. Go to the Clusters page:

      Go to Clusters

    2. Select the checkbox for gke-gpu-cluster.

    3. Click Delete.

    4. To confirm the deletion, type gke-gpu-cluster, and then click Delete.

  2. Delete the Cloud Storage bucket:

    1. Go to the Buckets page:

      Go to Buckets

    2. Select the checkbox for PROJECT_ID-gke-gpu-bucket.

    3. Click Delete.

    4. To confirm the deletion, type DELETE, and then click Delete.

  3. Delete the Google Cloud service account:

    1. Go to the Service accounts page:

      Go to Service accounts

    2. Select your project.

    3. Select the checkbox for gke-gpu-sa@PROJECT_ID.iam.gserviceaccount.com.

    4. Click Delete.

    5. To confirm the deletion, click Delete.

Delete the project

  1. In the Google Cloud console, go to the Manage resources page.

    Go to Manage resources

  2. In the project list, select the project that you want to delete, and then click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.

What's next