Train a TensorFlow model with Keras on Google Kubernetes Engine

This section gives an example of fine-tuning a BERT model for sequence classification using the Hugging Face transformers library with TensorFlow. The dataset is downloaded to a mounted, Parallelstore-backed volume, so model training can read the data directly from that volume.

Prerequisites

Save the following YAML manifest (parallelstore-csi-job-example.yaml) for your model training Job.

  apiVersion: batch/v1
  kind: Job
  metadata:
    name: parallelstore-csi-job-example
  spec:
    template:
      metadata:
        annotations:
          gke-parallelstore/cpu-limit: "0"
          gke-parallelstore/memory-limit: "0"
      spec:
        securityContext:
          runAsUser: 1000
          runAsGroup: 100
          fsGroup: 100
        containers:
        - name: tensorflow
          image: jupyter/tensorflow-notebook@sha256:173f124f638efe870bb2b535e01a76a80a95217e66ed00751058c51c09d6d85d
          command: ["bash", "-c"]
          args:
          - |
            pip install transformers datasets
            python - <<EOF
            import numpy as np
            from datasets import load_dataset
            from tensorflow.keras.optimizers import Adam
            from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

            # Download the GLUE CoLA training split into the Parallelstore-backed volume.
            dataset = load_dataset("glue", "cola", cache_dir='/data')
            dataset = dataset["train"]

            # Tokenize every sentence and pad to the longest one so the arrays are rectangular.
            tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
            tokenized_data = tokenizer(dataset["sentence"], return_tensors="np", padding=True)
            # Keras expects a plain dict of arrays rather than a BatchEncoding object.
            tokenized_data = dict(tokenized_data)
            labels = np.array(dataset["label"])

            # Compile without an explicit loss so the model uses its built-in task loss.
            model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased")
            model.compile(optimizer=Adam(3e-5))
            # Train for one epoch (the Keras default).
            model.fit(tokenized_data, labels)
            EOF
          volumeMounts:
          - name: parallelstore-volume
            mountPath: /data
        volumes:
        - name: parallelstore-volume
          persistentVolumeClaim:
            claimName: parallelstore-pvc
        restartPolicy: Never
    backoffLimit: 1
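
The tokenizer call in the manifest passes padding=True, which pads every sentence to the length of the longest one so that the result can be returned as rectangular NumPy arrays. The following is a minimal, standard-library sketch of that idea, not the transformers implementation. The function name pad_batch and the token IDs are hypothetical, and the pad ID of 0 and right-side padding are assumptions chosen to mirror typical BERT tokenizer behavior.

```python
from typing import Dict, List


def pad_batch(sequences: List[List[int]], pad_id: int = 0) -> Dict[str, List[List[int]]]:
    """Right-pad token-ID lists to the longest one and build attention masks.

    pad_batch is a hypothetical helper for illustration only; pad_id=0 is an
    assumption, not a value read from a real tokenizer.
    """
    max_len = max((len(seq) for seq in sequences), default=0)
    input_ids = []
    attention_mask = []
    for seq in sequences:
        n_pad = max_len - len(seq)
        # Real tokens keep their IDs; padding positions get pad_id.
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


if __name__ == "__main__":
    # Hypothetical token IDs, not real vocabulary entries.
    batch = pad_batch([[101, 7, 8, 102], [101, 9, 102]])
    print(batch["input_ids"])
    print(batch["attention_mask"])
```

Because every row ends up the same length, the whole training split can be handed to model.fit as a single dict of arrays, at the cost of padding short sentences up to the longest one in the dataset.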

Apply the YAML manifest to the cluster.

kubectl apply -f parallelstore-csi-job-example.yaml

Check the progress of data loading and model training with the following commands:

POD_NAME=$(kubectl get pod | grep 'parallelstore-csi-job-example' | awk '{print $1}')
kubectl logs -f $POD_NAME -c tensorflow
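
Besides watching the logs, it can be useful to confirm that the dataset cache actually landed on the Parallelstore volume. The sketch below counts the files and bytes under a directory using only the Python standard library. The helper name summarize_dir is hypothetical, and the /data path assumes the mountPath from the manifest above. It could be appended to the Job's Python script or run from any Pod that mounts the same PersistentVolumeClaim.

```python
import os
from typing import Tuple


def summarize_dir(root: str) -> Tuple[int, int]:
    """Return (file_count, total_bytes) for all files found under root.

    summarize_dir is a hypothetical helper; a missing directory yields (0, 0)
    because os.walk produces no entries for it.
    """
    file_count = 0
    total_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path):
                file_count += 1
                total_bytes += os.path.getsize(path)
    return file_count, total_bytes


if __name__ == "__main__":
    # /data is the mount path assumed from the Job manifest.
    count, size = summarize_dir("/data")
    print(f"{count} files, {size / 1e6:.1f} MB under /data")
```

A result of zero files after the download step has finished would suggest that the cache went somewhere other than the mounted volume.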