Train a model with GPUs on GKE Standard mode

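This quickstart tutorial shows you how to deploy a training model that uses GPUs on Google Kubernetes Engine (GKE) and store the predictions in Cloud Storage. The tutorial uses a TensorFlow model and a GKE Standard cluster. You can also run these workloads on an Autopilot cluster with fewer setup steps. For details, see Train a model with GPUs on GKE Autopilot mode.

This document is intended for GKE administrators who have existing Standard clusters and want to run GPU workloads for the first time.

Before you begin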

Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

Verify that billing is enabled for your Google Cloud project.

Enable the Kubernetes Engine and Cloud Storage APIs.

In the Google Cloud console, activate Cloud Shell.

At the bottom of the Google Cloud console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Google Cloud CLI already installed and with values already set for your current project. It can take a few seconds for the session to initialize.

Clone the sample repository

In Cloud Shell, run the following commands:

git clone https://github.com/GoogleCloudPlatform/ai-on-gke/ ai-on-gke
cd ai-on-gke/tutorials-and-examples/gpu-examples/training-single-gpu

Create a Standard mode cluster and a GPU node pool

Use Cloud Shell to do the following:

Create a Standard cluster that uses Workload Identity Federation for GKE and installs the Cloud Storage FUSE driver:
```
gcloud container clusters create gke-gpu-cluster \
    --addons GcsFuseCsiDriver \
    --location=us-central1 \
    --num-nodes=1 \
    --workload-pool=PROJECT_ID.svc.id.goog
```
Replace PROJECT_ID with your Google Cloud project ID.

Cluster creation can take several minutes.

Create a GPU node pool:

gcloud container node-pools create gke-gpu-pool-1 \
    --accelerator=type=nvidia-tesla-t4,count=1,gpu-driver-version=default \
    --machine-type=n1-standard-16 --num-nodes=1 \
    --location=us-central1 \
    --cluster=gke-gpu-cluster

Create a Cloud Storage bucket

In the Google Cloud console, go to the Create a bucket page.

In the Name your bucket field, enter the following name:
```
PROJECT_ID-gke-gpu-bucket
```
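Click Continue.
For Location type, select Region.
In the Region list, select us-central1 (Iowa), and then click Continue.
In the Choose a storage class for your data section, click Continue.
In the Choose how to control access to objects section, for Access control, select Uniform.
Click Create.
In the Public access will be prevented dialog, if the Enforce public access prevention on this bucket checkbox is selected, click Confirm.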
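
The console steps above can also be approximated from Cloud Shell. The following is a minimal sketch, not part of the original tutorial. It assumes a hypothetical project ID, `my-project-id`, and wraps the command in a small `run` helper (introduced here only for illustration) that prints the command unless `DRY_RUN=0` is set. The flags mirror the console choices: the `us-central1` region, uniform access control, and public access prevention.

```shell
# Hypothetical placeholder; replace with your own project ID.
PROJECT_ID="my-project-id"
BUCKET_NAME="${PROJECT_ID}-gke-gpu-bucket"

# Illustrative helper (not part of gcloud): prints the command unless DRY_RUN=0.
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# Regional bucket in us-central1 with uniform bucket-level access and
# public access prevention enforced, matching the console choices.
run gcloud storage buckets create "gs://${BUCKET_NAME}" \
    --location=us-central1 \
    --uniform-bucket-level-access \
    --public-access-prevention
```

Review the printed command first, then rerun with `DRY_RUN=0` to create the bucket.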

Configure your cluster to access the bucket by using Workload Identity Federation for GKE

To let your cluster access the Cloud Storage bucket, do the following:

Create a Google Cloud service account.
Create a Kubernetes ServiceAccount in your cluster.
Bind the Kubernetes ServiceAccount to the Google Cloud service account.

Create a Google Cloud service account

In the Google Cloud console, go to the Create service account page.

In the Service account ID field, enter gke-ai-sa.
Click Create and continue.
In the Role list, select the Cloud Storage > Storage Insights Collector Service role.
Click Add another role.
In the Select a role list, select the Cloud Storage > Storage Object Admin role.
Click Continue, and then click Done.
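
For a scripted setup, the same service account and role grants can be sketched with the gcloud CLI. This is a hedged example rather than the tutorial's own method. The project ID `my-project-id` is a hypothetical placeholder, and the `run` helper is illustrative and only prints each command unless `DRY_RUN=0` is set. The role IDs `roles/storage.insightsCollectorService` and `roles/storage.objectAdmin` are assumed to correspond to the two console role names above, so confirm them in the IAM roles reference before running.

```shell
# Hypothetical placeholder; replace with your own project ID.
PROJECT_ID="my-project-id"
GSA_NAME="gke-ai-sa"
GSA_EMAIL="${GSA_NAME}@${PROJECT_ID}.iam.gserviceaccount.com"

# Illustrative helper (not part of gcloud): prints the command unless DRY_RUN=0.
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# Create the Google Cloud service account.
run gcloud iam service-accounts create "${GSA_NAME}" --project="${PROJECT_ID}"

# Grant both Cloud Storage roles on the project (role IDs are assumptions).
for role in roles/storage.insightsCollectorService roles/storage.objectAdmin; do
  run gcloud projects add-iam-policy-binding "${PROJECT_ID}" \
      --member="serviceAccount:${GSA_EMAIL}" \
      --role="${role}"
done
```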

Create a Kubernetes ServiceAccount in your cluster

In Cloud Shell, do the following:

Create a Kubernetes namespace:

kubectl create namespace gke-ai-namespace

Create a Kubernetes ServiceAccount in the namespace:

kubectl create serviceaccount gpu-k8s-sa --namespace=gke-ai-namespace

Bind the Kubernetes ServiceAccount to the Google Cloud service account

In Cloud Shell, run the following commands:

Add an IAM binding to the Google Cloud service account:

gcloud iam service-accounts add-iam-policy-binding gke-ai-sa@PROJECT_ID.iam.gserviceaccount.com \
    --role roles/iam.workloadIdentityUser \
    --member "serviceAccount:PROJECT_ID.svc.id.goog[gke-ai-namespace/gpu-k8s-sa]"

The --member flag provides the full identity of the Kubernetes ServiceAccount in Google Cloud.

Annotate the Kubernetes ServiceAccount:

kubectl annotate serviceaccount gpu-k8s-sa \
    --namespace gke-ai-namespace \
    iam.gke.io/gcp-service-account=gke-ai-sa@PROJECT_ID.iam.gserviceaccount.com
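
The two identifiers used in these commands follow fixed patterns, so they can be derived from a few variables instead of being typed by hand. The following is a small sketch that assumes the hypothetical project ID `my-project-id`; the variable names are illustrative and do not appear in the tutorial's manifests.

```shell
PROJECT_ID="my-project-id"   # hypothetical placeholder
NAMESPACE="gke-ai-namespace"
KSA_NAME="gpu-k8s-sa"
GSA_NAME="gke-ai-sa"

# Identity of the Kubernetes ServiceAccount as seen by IAM:
# serviceAccount:<workload pool>[<namespace>/<Kubernetes ServiceAccount>]
MEMBER="serviceAccount:${PROJECT_ID}.svc.id.goog[${NAMESPACE}/${KSA_NAME}]"

# Email of the Google Cloud service account, used in the annotation value.
GSA_EMAIL="${GSA_NAME}@${PROJECT_ID}.iam.gserviceaccount.com"

echo "--member value:   ${MEMBER}"
echo "annotation value: iam.gke.io/gcp-service-account=${GSA_EMAIL}"
```

Keep the `--member` value quoted when you pass it to gcloud, because the square brackets are glob characters in most shells.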

Verify that Pods can access the Cloud Storage bucket

In Cloud Shell, create the following variables:
```
export K8S_SA_NAME=gpu-k8s-sa
export BUCKET_NAME=PROJECT_ID-gke-gpu-bucket
```
Replace PROJECT_ID with your Google Cloud project ID.
Create a Pod that has a TensorFlow container:
```
envsubst < src/gke-config/standard-tensorflow-bash.yaml | kubectl --namespace=gke-ai-namespace apply -f -
```
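This command substitutes the environment variables that you created into the corresponding references in the manifest. You can also open the manifest in a text editor and replace $K8S_SA_NAME and $BUCKET_NAME with the corresponding values.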
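
The substitution that `envsubst` performs can be tried on a small fragment without touching the cluster. This is a sketch under assumptions: the two-line fragment is illustrative and is not the actual content of `standard-tensorflow-bash.yaml`, the bucket name uses a hypothetical project ID, and the `sed` fallback is included only for shells where `envsubst` (part of GNU gettext) is not installed.

```shell
export K8S_SA_NAME=gpu-k8s-sa
export BUCKET_NAME=my-project-id-gke-gpu-bucket   # hypothetical project ID

# Illustrative fragment; single quotes keep the $ references literal.
fragment='serviceAccountName: $K8S_SA_NAME
bucketName: $BUCKET_NAME'

if command -v envsubst >/dev/null 2>&1; then
  # Listing the variables limits substitution to these two names, so any
  # other $ references in a manifest are left untouched.
  rendered=$(printf '%s\n' "$fragment" | envsubst '$K8S_SA_NAME $BUCKET_NAME')
else
  # Fallback with the same effect for this fragment.
  rendered=$(printf '%s\n' "$fragment" |
    sed -e "s|[$]K8S_SA_NAME|${K8S_SA_NAME}|g" -e "s|[$]BUCKET_NAME|${BUCKET_NAME}|g")
fi

printf '%s\n' "$rendered"
```

Passing the variable list to `envsubst` is a design choice rather than a requirement. Without it, every exported variable that the manifest references is substituted.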

Create a sample file in the bucket:

touch sample-file
gcloud storage cp sample-file gs://PROJECT_ID-gke-gpu-bucket

Wait for the Pod to become ready:

kubectl wait --for=condition=Ready pod/test-tensorflow-pod -n=gke-ai-namespace --timeout=180s

When the Pod is ready, the output is the following:

pod/test-tensorflow-pod condition met

Open a shell in the TensorFlow container:

kubectl -n gke-ai-namespace exec --stdin --tty test-tensorflow-pod --container tensorflow -- /bin/bash

Try to read the sample file that you created:
```
ls /data
```
The output shows the sample file.

Check the logs to identify the GPU attached to the Pod:

python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

The output shows the GPU attached to the Pod, similar to the following:

...
PhysicalDevice(name='/physical_device:GPU:0',device_type='GPU')

Exit the container:
```
exit
```

Delete the sample Pod:

kubectl delete -f src/gke-config/standard-tensorflow-bash.yaml \
    --namespace=gke-ai-namespace

Train and predict using the MNIST dataset

In this section, you run a training workload on the MNIST example dataset.

Copy the example data to the Cloud Storage bucket:

gcloud storage cp src/tensorflow-mnist-example gs://PROJECT_ID-gke-gpu-bucket/ --recursive

Create the following environment variables:

export K8S_SA_NAME=gpu-k8s-sa
export BUCKET_NAME=PROJECT_ID-gke-gpu-bucket

Review the training Job:

# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: batch/v1
kind: Job
metadata:
  name: mnist-training-job
spec:
  template:
    metadata:
      name: mnist
      annotations:
        gke-gcsfuse/volumes: "true"
    spec:
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-tesla-t4
      tolerations:
      - key: "nvidia.com/gpu"
        operator: "Exists"
        effect: "NoSchedule"
      containers:
      - name: tensorflow
        image: tensorflow/tensorflow:latest-gpu 
        command: ["/bin/bash", "-c", "--"]
        args: ["cd /data/tensorflow-mnist-example; pip install -r requirements.txt; python tensorflow_mnist_train_distributed.py"]
        resources:
          limits:
            nvidia.com/gpu: 1
            cpu: 1
            memory: 3Gi
        volumeMounts:
        - name: gcs-fuse-csi-vol
          mountPath: /data
          readOnly: false
      serviceAccountName: $K8S_SA_NAME
      volumes:
      - name: gcs-fuse-csi-vol
        csi:
          driver: gcsfuse.csi.storage.gke.io
          readOnly: false
          volumeAttributes:
            bucketName: $BUCKET_NAME
            mountOptions: "implicit-dirs"
      restartPolicy: "Never"

Deploy the training Job:
```
envsubst < src/gke-config/standard-tf-mnist-train.yaml | kubectl -n gke-ai-namespace apply -f -
```
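This command substitutes the environment variables that you created into the corresponding references in the manifest. You can also open the manifest in a text editor and replace $K8S_SA_NAME and $BUCKET_NAME with the corresponding values.

Wait until the Job has the Completed status: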

kubectl wait -n gke-ai-namespace --for=condition=Complete job/mnist-training-job --timeout=180s

The output is similar to the following:

job.batch/mnist-training-job condition met

Check the logs from the TensorFlow container:

kubectl logs -f jobs/mnist-training-job -c tensorflow -n gke-ai-namespace

The output shows that the following events occurred:

Install required Python packages
Download the MNIST dataset
Train the model using a GPU
Save the model
Evaluate the model

...
Epoch 12/12
927/938 [============================>.] - ETA: 0s - loss: 0.0188 - accuracy: 0.9954
Learning rate for epoch 12 is 9.999999747378752e-06
938/938 [==============================] - 5s 6ms/step - loss: 0.0187 - accuracy: 0.9954 - lr: 1.0000e-05
157/157 [==============================] - 1s 4ms/step - loss: 0.0424 - accuracy: 0.9861
Eval loss: 0.04236088693141937, Eval accuracy: 0.9861000180244446
Training finished. Model saved
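
When the training Job runs as part of a script, the evaluation accuracy can be pulled out of the log output and compared against a threshold. The following sketch assumes the log format shown in the sample output. The 0.98 threshold is an arbitrary value chosen for illustration, and the log line is hard-coded so that the snippet runs without a cluster.

```shell
# Sample line copied from the training log output; in practice, read it from
#   kubectl logs jobs/mnist-training-job -c tensorflow -n gke-ai-namespace
log_line='Eval loss: 0.04236088693141937, Eval accuracy: 0.9861000180244446'

# Hypothetical acceptance threshold, chosen only for illustration.
MIN_ACCURACY=0.98

# Extract the number that follows "Eval accuracy: ".
accuracy=$(printf '%s\n' "$log_line" |
  sed -n 's/.*Eval accuracy: \([0-9.][0-9.]*\).*/\1/p')

# awk handles the floating-point comparison, which POSIX test cannot do.
if awk -v a="$accuracy" -v min="$MIN_ACCURACY" 'BEGIN { exit !(a + 0 >= min + 0) }'; then
  result="pass"
else
  result="fail"
fi

echo "accuracy=${accuracy} result=${result}"
```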

Delete the training workload:

kubectl -n gke-ai-namespace delete -f src/gke-config/standard-tf-mnist-train.yaml

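Deploy an inference workload

In this section, you deploy an inference workload that takes a sample dataset as input and returns predictions.

Copy the images for prediction to the bucket: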

gcloud storage cp data/mnist_predict gs://PROJECT_ID-gke-gpu-bucket/ --recursive

Review the inference workload:

# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: batch/v1
kind: Job
metadata:
  name: mnist-batch-prediction-job
spec:
  template:
    metadata:
      name: mnist
      annotations:
        gke-gcsfuse/volumes: "true"
    spec:
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-tesla-t4
      tolerations:
      - key: "nvidia.com/gpu"
        operator: "Exists"
        effect: "NoSchedule"
      containers:
      - name: tensorflow
        image: tensorflow/tensorflow:latest-gpu 
        command: ["/bin/bash", "-c", "--"]
        args: ["cd /data/tensorflow-mnist-example; pip install -r requirements.txt; python tensorflow_mnist_batch_predict.py"]
        resources:
          limits:
            nvidia.com/gpu: 1
            cpu: 1
            memory: 3Gi
        volumeMounts:
        - name: gcs-fuse-csi-vol
          mountPath: /data
          readOnly: false
      serviceAccountName: $K8S_SA_NAME
      volumes:
      - name: gcs-fuse-csi-vol
        csi:
          driver: gcsfuse.csi.storage.gke.io
          readOnly: false
          volumeAttributes:
            bucketName: $BUCKET_NAME
            mountOptions: "implicit-dirs"
      restartPolicy: "Never"

Deploy the inference workload:
```
envsubst < src/gke-config/standard-tf-mnist-batch-predict.yaml | kubectl -n gke-ai-namespace apply -f -
```
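This command substitutes the environment variables that you created into the corresponding references in the manifest. You can also open the manifest in a text editor and replace $K8S_SA_NAME and $BUCKET_NAME with the corresponding values.

Wait until the Job has the Completed status: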

kubectl wait -n gke-ai-namespace --for=condition=Complete job/mnist-batch-prediction-job --timeout=180s

The output is similar to the following:

job.batch/mnist-batch-prediction-job condition met

Check the logs from the TensorFlow container:

kubectl logs -f jobs/mnist-batch-prediction-job -c tensorflow -n gke-ai-namespace

The output is the prediction for each image and the model's confidence in that prediction, similar to the following:

Found 10 files belonging to 1 classes.
1/1 [==============================] - 2s 2s/step
The image /data/mnist_predict/0.png is the number 0 with a 100.00 percent confidence.
The image /data/mnist_predict/1.png is the number 1 with a 99.99 percent confidence.
The image /data/mnist_predict/2.png is the number 2 with a 100.00 percent confidence.
The image /data/mnist_predict/3.png is the number 3 with a 99.95 percent confidence.
The image /data/mnist_predict/4.png is the number 4 with a 100.00 percent confidence.
The image /data/mnist_predict/5.png is the number 5 with a 100.00 percent confidence.
The image /data/mnist_predict/6.png is the number 6 with a 99.97 percent confidence.
The image /data/mnist_predict/7.png is the number 7 with a 100.00 percent confidence.
The image /data/mnist_predict/8.png is the number 8 with a 100.00 percent confidence.
The image /data/mnist_predict/9.png is the number 9 with a 99.65 percent confidence.
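
To spot the least certain prediction in a longer log, the output lines can be summarized with awk. This is an illustrative sketch that assumes the exact sentence format shown above. It uses two lines copied from the sample output so that it runs without a cluster.

```shell
# Two lines copied from the sample output; in practice, pipe in
#   kubectl logs jobs/mnist-batch-prediction-job -c tensorflow -n gke-ai-namespace
sample_log='The image /data/mnist_predict/3.png is the number 3 with a 99.95 percent confidence.
The image /data/mnist_predict/9.png is the number 9 with a 99.65 percent confidence.'

# Fields: $3 = image path, $7 = predicted digit, $10 = confidence in percent.
lowest=$(printf '%s\n' "$sample_log" | awk '
  $11 == "percent" {
    if (!seen || $10 + 0 < min) { min = $10 + 0; img = $3; digit = $7; seen = 1 }
  }
  END { if (seen) printf "%s digit=%s confidence=%.2f\n", img, digit, min }
')

echo "$lowest"
```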

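Clean up

To avoid incurring charges to your Google Cloud account for the resources that you created in this guide, do one of the following:

Keep the GKE cluster: Delete the Kubernetes resources in the cluster and the Google Cloud resources.
Keep the Google Cloud project: Delete the GKE cluster and the Google Cloud resources.
Delete the project

Delete the Kubernetes resources in the cluster and the Google Cloud resources

Delete the Kubernetes namespace and the workloads that you deployed: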

kubectl -n gke-ai-namespace delete -f src/gke-config/standard-tf-mnist-batch-predict.yaml
kubectl delete namespace gke-ai-namespace

Delete your Cloud Storage bucket:
1. Go to the Buckets page.
2. Select the checkbox for PROJECT_ID-gke-gpu-bucket.
3. Click Delete.
4. To confirm deletion, type DELETE and click Delete.
Delete your Google Cloud service account:
1. Go to the Service accounts page.
2. Select your project.
3. Select the checkbox for gke-ai-sa@PROJECT_ID.iam.gserviceaccount.com.
4. Click Delete.
5. To confirm deletion, click Delete.

Delete the GKE cluster and the Google Cloud resources

Delete the GKE cluster:
1. Go to the Clusters page.
2. Select the checkbox for gke-gpu-cluster.
3. Click Delete.
4. To confirm deletion, type gke-gpu-cluster and click Delete.
Delete your Cloud Storage bucket:
1. Go to the Buckets page.
2. Select the checkbox for PROJECT_ID-gke-gpu-bucket.
3. Click Delete.
4. To confirm deletion, type DELETE and click Delete.
Delete your Google Cloud service account:
1. Go to the Service accounts page.
2. Select your project.
3. Select the checkbox for gke-ai-sa@PROJECT_ID.iam.gserviceaccount.com.
4. Click Delete.
5. To confirm deletion, click Delete.
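
The same cleanup can be sketched from Cloud Shell instead of the console. This example is an assumption-laden sketch, not the tutorial's documented procedure. It uses a hypothetical project ID, `my-project-id`, and an illustrative `run` helper that prints each command unless `DRY_RUN=0` is set. The `--quiet` flag skips gcloud's confirmation prompts, so check the printed commands before running them for real.

```shell
# Hypothetical placeholder; replace with your own project ID.
PROJECT_ID="my-project-id"
BUCKET_NAME="${PROJECT_ID}-gke-gpu-bucket"
GSA_EMAIL="gke-ai-sa@${PROJECT_ID}.iam.gserviceaccount.com"

# Illustrative helper (not part of gcloud): prints the command unless DRY_RUN=0.
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# Delete the cluster, including its GPU node pool.
run gcloud container clusters delete gke-gpu-cluster --location=us-central1 --quiet

# Delete the bucket and every object in it.
run gcloud storage rm --recursive "gs://${BUCKET_NAME}"

# Delete the Google Cloud service account.
run gcloud iam service-accounts delete "${GSA_EMAIL}" --quiet
```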

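Delete the project

Caution: Deleting a project has the following effects:

Everything in the project is deleted. If you used an existing project for the tasks in this document, deleting it also deletes any other work that you've done in the project.
Custom project IDs are lost. When you created this project, you might have created a custom project ID that you want to use in the future. To preserve the URLs that use the project ID, such as an appspot.com URL, delete selected resources inside the project instead of deleting the whole project.

If you plan to explore multiple architectures, tutorials, or quickstarts, reusing projects can help you avoid exceeding project quota limits.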

In the Google Cloud console, go to the Manage resources page.
In the project list, select the project that you want to delete, and then click Delete.
In the dialog, type the project ID, and then click Shut down to delete the project.

What's next

Learn more about using GPUs in GKE
