Deploy a batch machine learning workload


This tutorial demonstrates how to use Google Kubernetes Engine (GKE) to manage fault-tolerant batch workloads while keeping your costs low. Follow this tutorial to learn how to use Jobs and cost-optimized Spot VMs, as well as how to configure an in-cluster Redis job queue on GKE.

Background

A batch workload is a process that typically has a starting point and a completion point. You should consider batch workloads on GKE if your architecture involves ingesting, processing, and outputting data instead of using raw data. Areas like machine learning, artificial intelligence, and high performance computing (HPC) feature different kinds of batch workloads, such as offline model training, batch prediction, data analytics, simulation of physical systems, and video processing.

By designing containerized batch workloads, you can leverage the following GKE benefits:

  • An open standard, broad community, and managed service.
  • Cost efficiency from effective workload and infrastructure orchestration and specialized compute resources.
  • Isolation and portability of containerization, allowing the use of cloud as overflow capacity while maintaining data security.
  • Availability of burst capacity, followed by rapid scale down of GKE clusters.

Objectives

This tutorial is for machine learning engineers or data scientists who want to learn how to run batch machine learning workloads on GKE, on the following cost-efficient and scalable architecture:

Architecture diagram

This tutorial covers the following steps:

  1. Create a GKE cluster with Standard and Spot VM node pools.
  2. Create a Filestore NFS volume.
  3. Create an in-cluster Redis job queue.
  4. Transfer datasets to the NFS volume and queue them to be consumed by the workload.
  5. Run a sample batch machine learning workload on the GKE cluster.

Costs

This tutorial uses the following billable components of Google Cloud:

  • Google Kubernetes Engine
  • Compute Engine
  • Filestore

To generate a cost estimate based on your projected usage, use the pricing calculator. New Google Cloud users might be eligible for a free trial.


For step-by-step guidance for this task directly in the Google Cloud console, click Guide me:

Guide me


The following sections take you through the same steps as clicking Guide me.

Before you begin

Set up your project

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, click Create project to begin creating a new Google Cloud project.

    Go to project selector

  3. Make sure that billing is enabled for your Cloud project. Learn how to check if billing is enabled on a project.

  4. Enable the Compute Engine, GKE, and Filestore APIs.

    Enable the APIs

Set up your environment

In this tutorial, you use Cloud Shell to manage resources hosted on Google Cloud. Cloud Shell comes preinstalled with Docker, the kubectl command-line tool, and the gcloud CLI.

To use Cloud Shell to set up your environment:

  1. In the Google Cloud console, start a Cloud Shell instance.
    Open Cloud Shell

  2. Download the source code for this sample app.

    git clone https://github.com/GoogleCloudPlatform/kubernetes-engine-samples
    
    cd kubernetes-engine-samples/batch/aiml-workloads
    
  3. Update the scripts/variables.sh file with your project information:

    sed -i "\
      s/<YOUR_PROJECT_ID>/PROJECT_ID/g; \
      s/<YOUR_REGION>/REGION/g; \
      s/<YOUR_ZONE>/ZONE/g" \
      scripts/variables.sh
    

    Replace the following:

    • PROJECT_ID: your Google Cloud project ID.
    • REGION: the Google Cloud region for your resources (for example, us-central1).
    • ZONE: the Google Cloud zone for your resources (for example, us-central1-b).

  4. Set the necessary environment variables and gcloud defaults.

    source scripts/variables.sh
    gcloud config set project ${PROJECT_ID}
    gcloud config set compute/region ${REGION}
    gcloud config set compute/zone ${ZONE}
    gcloud config set filestore/zone ${ZONE}
    
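The exact contents of scripts/variables.sh live in the sample repository. As a rough sketch (the variable names beyond PROJECT_ID, REGION, and ZONE, such as CLUSTER_NAME and FILESTORE_ID, are inferred from the commands and output later in this tutorial, so treat the exact values as assumptions), the file exports values similar to the following:

    # Illustrative sketch of scripts/variables.sh after substitution.
    # CLUSTER_NAME and FILESTORE_ID match the resource names that appear in
    # later command output (batch-aiml, batch-aiml-filestore); confirm the
    # real contents against the repository.
    export PROJECT_ID="my-project-id"           # your Google Cloud project ID
    export REGION="us-central1"                 # region for regional resources
    export ZONE="us-central1-b"                 # zone for the cluster and Filestore
    export CLUSTER_NAME="batch-aiml"            # GKE cluster name
    export FILESTORE_ID="batch-aiml-filestore"  # Filestore instance name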

Create a GKE cluster

Create a GKE cluster with two node pools. The cluster mounts the persistent volume and runs the workload together with the Redis job queue.

  1. Create a Standard GKE cluster. This cluster initializes a node pool with default configurations.

    gcloud container clusters create ${CLUSTER_NAME} \
      --machine-type=e2-standard-2 --num-nodes=3
    
  2. Verify the GKE cluster is running:

    gcloud container clusters list
    

    The STATUS value is RUNNING for the batch-aiml cluster.

  3. Create a second node pool in the GKE cluster that uses Spot VMs:

    gcloud beta container node-pools create batch-node-pool \
      --cluster=${CLUSTER_NAME} --spot
    

    You will deploy the batch workload to this Spot VM node pool later in the tutorial.
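
To confirm that both node pools exist and that the new pool provisions Spot VMs, you can list the node pools and the nodes that carry the Spot label (cloud.google.com/gke-spot, the same label the workload's nodeSelector targets later in this tutorial). This is an optional verification sketch:

    # List the node pools in the cluster; batch-node-pool should appear
    # alongside the default pool.
    gcloud container node-pools list --cluster=${CLUSTER_NAME}

    # Fetch cluster credentials for kubectl, then list nodes labeled as Spot VMs.
    gcloud container clusters get-credentials ${CLUSTER_NAME}
    kubectl get nodes -l cloud.google.com/gke-spot=true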

Set up dataset storage with a Network File System (NFS)

The machine learning workload requires a storage solution for the datasets and output files. In this tutorial, you create a Filestore instance that all nodes in the GKE cluster can access (for read and write operations) through a Persistent Volume (PV) and Persistent Volume Claim (PVC).

To learn more, see how to design an optimal storage strategy and how to access file shares from GKE clusters.

Create a Filestore instance

  1. Create a Filestore instance:

    gcloud filestore instances create ${FILESTORE_ID} \
      --tier=BASIC_HDD \
      --file-share=name="NFSVol",capacity=1TB \
      --network=name="default"
    

    where:

    • tier is the service tier for the Filestore instance. This sample uses the basic offering, but you can learn about the other options in Service tiers.

    • network=name is the name of the Compute Engine Virtual Private Cloud (VPC) network to which the instance is connected. The GKE cluster must be in the same VPC network as the Filestore instance.

    • capacity is the desired size of the volume. Specify the storage value in one of the supported units described in Resource quantities.

  2. Verify that the Filestore instance has been deployed:

    gcloud filestore instances list --project=${PROJECT_ID} --zone=${ZONE}
    

    The output is similar to the following, where the STATE value is READY for the batch-aiml-filestore instance:

    INSTANCE_NAME: batch-aiml-filestore
    LOCATION: us-central1-b
    TIER: BASIC_HDD
    CAPACITY_GB: 1024
    FILE_SHARE_NAME: NFSVol
    IP_ADDRESS: 10.152.101.10
    STATE: READY
    CREATE_TIME: 2022-03-15T18:23:51
    
  3. From the output, make note of the IP_ADDRESS value; it will be used in the next section.

Create a Persistent Volume

A Kubernetes Persistent Volume (PV) specification allows the GKE cluster to connect to the Filestore instance.

  1. Update the kubernetes-manifests/persistent-volume.yaml file with the Filestore instance IP address:

    sed -i "\
      s/<FILESTORE_IP_ADDRESS>/IP_ADDRESS/g" \
      kubernetes-manifests/persistent-volume.yaml
    

    Replace IP_ADDRESS with the value you noted when creating the Filestore instance.

  2. Deploy the PV specification:

    kubectl apply -f kubernetes-manifests/persistent-volume.yaml
    

Create a Persistent Volume Claim

A Kubernetes Persistent Volume Claim (PVC) specification allows Kubernetes Pods and Jobs to access the storage resources of a Persistent Volume.

Deploy the PVC specification:

kubectl apply -f kubernetes-manifests/persistent-volume-claim.yaml
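
After you apply both manifests, you can check that the claim bound to the volume. The PVC name, fileserver-claim, comes from the manifest excerpts later in this tutorial; the PV name depends on the manifest in the repository, so the unfiltered listing below is a safe check:

    # The PVC should report STATUS "Bound" once it is matched to the
    # Filestore-backed PersistentVolume.
    kubectl get pv
    kubectl get pvc fileserver-claim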

Consume the Persistent Volume Claim

With the Persistent Volume and Persistent Volume Claim set up on the GKE cluster, you can configure the Redis server and the batch jobs to consume the PVC. The fileshare then appears as a mounted storage volume in the Pods that run these Kubernetes resources.

Inspect the kubernetes-manifests/redis-pod.yaml and kubernetes-manifests/workload.yaml files. The manifest configurations are similar to the following:

spec:
  …
  containers:
  - name: workload
    image: "us-central1-docker.pkg.dev/gke-batch-aiml/batch-aiml-docker-repo/workload"
    volumeMounts:
    - mountPath: /mnt/fileserver
      name: workload-pvc
  volumes:
  - name: workload-pvc
    persistentVolumeClaim:
      claimName: fileserver-claim
      readOnly: false

where:

  • spec.volumes specifies the Persistent Volume Claim to consume.

  • spec.containers.volumeMounts specifies the file path in the container at which the Filestore fileshare can be accessed.
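
Once a Pod that consumes this claim is running (for example, the redis-leader Pod that you deploy in the next section), you can confirm that the fileshare is mounted at the expected path. This is an optional check; the exact output depends on the container's base image:

    # Show the filesystem mounted at /mnt/fileserver inside the Pod; it should
    # be the Filestore NFS export rather than local disk.
    kubectl exec redis-leader -- df -h /mnt/fileserver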

Set up a Redis job queue

The workload processes data in batches to iteratively train the fraud detection model. To track the datasets that are currently being processed or are still in the queue, you deploy a Redis server into the GKE cluster.

For simplicity, you start a single instance of Redis in this sample. For an example of deploying Redis scalably and redundantly, see how to create a guestbook with Redis and PHP.

  1. Deploy the Redis server specification.

    kubectl apply -f kubernetes-manifests/redis-pod.yaml
    
  2. Verify that the Pod is running before proceeding. In the output of the following command, the STATUS for the redis-leader Pod should be Running.

    kubectl get pods
    

    The output is similar to the following:

    NAME           READY   STATUS    RESTARTS   AGE
    redis-leader   1/1     Running   0          118s
    
  3. Transfer the files containing the training and testing datasets to the NFS volume.

    sh scripts/transfer-datasets.sh
    

    This script copies the files from the sample code repository to the /mnt/fileserver/datasets/ directory on the redis-leader Pod.

  4. Populate the Redis queue.

    sh scripts/queue-jobs.sh
    

    This script pushes the file paths for the training datasets to a list named datasets in the Redis database. The workload uses this queue to locate the next dataset to process (you can inspect the queue with the redis-cli sketch after this list).

  5. Deploy the Service to make the Redis server discoverable within the GKE cluster.

    kubectl apply -f ./kubernetes-manifests/redis-service.yaml
    
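If you want to confirm the queue contents before running the workload, you can query the datasets list directly with redis-cli inside the redis-leader Pod. The list name comes from the queue-jobs step above; the exact file paths depend on the datasets in the repository:

    # Number of dataset paths waiting in the queue.
    kubectl exec redis-leader -- redis-cli LLEN datasets

    # Show every queued path without removing anything from the list.
    kubectl exec redis-leader -- redis-cli LRANGE datasets 0 -1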

Run the batch workload

At this point, you have prepared the GKE cluster, the Redis job queue, and the fileshare. Now you can run your batch workload.

In this section, you use a container image of a sample workload that trains a fraud detection model on batches of financial transaction data. The training process can be summarized as follows:

  1. A Redis client claims jobs (file paths to datasets) in the Redis queue, and removes them from the queue once completed.

  2. A model training manager class, FraudDetectionModelTrainer, loads a new batch of data and, optionally, a saved state of a machine learning model. The dataset is used to refine the model (a process known as "warm-start" training).

  3. The new state of the model and a report of the batch details and performance scores are saved in the Filestore NFS volume, which can be accessed in the GKE cluster through a Persistent Volume Claim.

To learn more, explore the source code.
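
The sample implements its own Redis client for step 1 (see the source code for the details). Purely as an illustration of that claim-and-remove pattern, and not necessarily the exact commands or keys the sample uses, a worker could move an item from the datasets list to a hypothetical processing list and delete it once the batch has been trained on:

    # Illustrative only: atomically claim the next dataset path by moving it
    # from the "datasets" list to a "processing" list (the "processing" key is
    # a hypothetical name for this sketch), then remove the entry after the
    # batch has been processed successfully.
    kubectl exec redis-leader -- redis-cli RPOPLPUSH datasets processing
    kubectl exec redis-leader -- redis-cli LREM processing 1 "datasets/training/2018-04-01.pkl"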

Define the job

The following manifest describes the batch Job specification for the workload image. This manifest uses the Kubernetes Job resource.

apiVersion: batch/v1
kind: Job
metadata:
  name: workload
spec:
  parallelism: 1
  template:
    metadata:
      name: workload
    spec:
      nodeSelector:
        cloud.google.com/gke-spot: "true"
      containers:
      - name: workload
        image: "us-docker.pkg.dev/google-samples/containers/gke/batch-ml-workload"
        volumeMounts:
        - mountPath: /mnt/fileserver
          name: workload-pvc
      volumes:
      - name: workload-pvc
        persistentVolumeClaim:
          claimName: fileserver-claim
          readOnly: false
      restartPolicy: OnFailure
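
In this manifest, the cloud.google.com/gke-spot: "true" nodeSelector keeps the workload on the Spot VM node pool, and parallelism: 1 runs a single worker Pod against the shared queue. After you deploy the Job in the next section, you can optionally confirm where its Pod was scheduled; the job-name label is added to the Pod automatically by Kubernetes:

    # Show the node that the workload Pod was scheduled on.
    kubectl get pods -l job-name=workload -o wide

    # Confirm that the node carries the Spot label. Replace NODE_NAME with the
    # node shown by the previous command.
    kubectl get node NODE_NAME -L cloud.google.com/gke-spot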

Deploy the workload

  1. Apply the specification for the batch job.

    kubectl apply -f ./kubernetes-manifests/workload.yaml
    
  2. Check that the status of the workload-XXX Pod becomes Completed.

    watch kubectl get pods
    

    This may take a few seconds. You can return to the command line by pressing Ctrl+C.

    The output is similar to the following:

    NAME             READY   STATUS      RESTARTS   AGE
    redis-leader     1/1     Running     0          16m
    workload-4p55d   0/1     Completed   0          83s
    
  3. Check the logs from the workload Job:

    kubectl logs job/workload
    

    The output is similar to the following:

    Worker with sessionID: b50f9459-ce7f-4da8-9f84-0ab5c3233a72
    Initial queue state: empty=False
    Processing dataset: datasets/training/2018-04-04.pkl
    Processing dataset: datasets/training/2018-04-03.pkl
    Processing dataset: datasets/training/2018-04-02.pkl
    Processing dataset: datasets/training/2018-04-01.pkl
    Queue empty, exiting
    

    The .pkl files are serialized datasets, each containing a batch of credit card transactions labeled as valid or fraudulent. The workload Job iterates over these files, unpacking the datasets and using them to train the machine learning model before removing them from the Redis queue. The workload continues to process the data in batches until the Redis queue is empty, then exits successfully.

Explore the NFS volume

During its operation, the workload creates files in the mounted NFS volume, which are accessible across the cluster to other batch jobs or online applications. To explore the files on the NFS volume, follow these steps:

  1. List the files created by the workload:

    kubectl exec --stdin --tty redis-leader -- /bin/sh -c "ls -1 /mnt/fileserver/output"
    

    The output should be the following:

    model_cpt_2018-04-01.pkl
    model_cpt_2018-04-02.pkl
    model_cpt_2018-04-03.pkl
    model_cpt_2018-04-04.pkl
    report.txt
    

    Checkpoints for the trained model (filenames like model_cpt_XXX.pkl) and a report of the model performance (report.txt) were created in the /mnt/fileserver/output directory on the NFS volume.

  2. Inspect the model performance report:

    kubectl exec --stdin --tty redis-leader -- /bin/sh -c "cat /mnt/fileserver/output/report.txt"
    

    The following is a snippet from the output:

    Report generated on: 2022-02-09 14:19:42.303619
    Training dataset: 2018-04-04.pkl
    Model checkpoint: model_cpt_2018-04-04.pkl
    ---
    Accuracy on training data: 0.9981112277019937
    Accuracy on testing data: 0.9977204434773599
    

    The file contains entries detailing the time of training, dataset used, accuracy achieved, and the filename of the model checkpoint associated with the training.

To learn more about NFS volumes, check out the Filestore guides.

Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.

To remove the individual resources created for this tutorial:

sh scripts/cleanup.sh
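
The cleanup script removes the tutorial's resources for you. If you prefer to delete the main resources manually instead, commands along the following lines (reusing the variables from scripts/variables.sh) cover the cluster and the Filestore instance; check the script itself for the complete list of what it removes:

    # Delete the GKE cluster, which removes both node pools and all workloads.
    gcloud container clusters delete ${CLUSTER_NAME} --quiet

    # Delete the Filestore instance, including the datasets and model
    # checkpoints stored on it.
    gcloud filestore instances delete ${FILESTORE_ID} --zone=${ZONE} --quiet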

What's next