Managing Batch on GKE clusters

This page shows you how to create and manage Batch on GKE clusters.

Before you begin

Before you start, make sure you have performed the following tasks:

Set up default gcloud settings using one of the following methods:

  • Using gcloud init, if you want to be walked through setting defaults.
  • Using gcloud config, to individually set your project ID, zone, and region.

Using gcloud init

  1. Run gcloud init and follow the directions:

    gcloud init

    If you are using SSH on a remote server, use the --console-only flag to prevent the command from launching a browser:

    gcloud init --console-only
  2. Follow the instructions to authorize gcloud to use your Google Cloud account.
  3. Create a new configuration or select an existing one.
  4. Choose a Google Cloud project.
  5. Choose a default Compute Engine zone.

Using gcloud config

  • Set your default project ID:
    gcloud config set project project-id
  • If you are working with zonal clusters, set your default compute zone:
    gcloud config set compute/zone compute-zone
  • If you are working with regional clusters, set your default compute region:
    gcloud config set compute/region compute-region
  • Update gcloud to the latest version:
    gcloud components update

In Beta, Batch on GKE (Batch) supports only regional clusters. You must create a regional cluster and enable Workload Identity.

Run the following command to create a cluster that is compatible with Batch on GKE:

gcloud beta container clusters create cluster-name \
  --region compute-region \
  --node-locations compute-zone \
  --num-nodes 1 \
  --machine-type n1-standard-8 \
  --release-channel regular \
  --enable-stackdriver-kubernetes \
  --identity-namespace=project-id.svc.id.goog \
  --enable-ip-alias
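
Because the node locations of a regional cluster must be zones inside the chosen region, it can help to check the two placeholder values before running the command. The following is a minimal sketch and not part of gcloud or Batch: zone_in_region is a hypothetical helper, and it assumes the usual Compute Engine naming in which a zone is the region name plus a one-letter suffix (for example, us-central1-a in us-central1).

```shell
#!/bin/sh
# Hypothetical helper (not part of gcloud): succeeds when the zone name is
# the region name followed by "-" and a single lowercase letter.
zone_in_region() {
  zone="$1"
  region="$2"
  case "$zone" in
    "$region"-[a-z]) return 0 ;;
    *) return 1 ;;
  esac
}

# Example values; replace them with your own region and zone.
REGION="us-central1"
ZONE="us-central1-a"

if zone_in_region "$ZONE" "$REGION"; then
  echo "ok: $ZONE is in $REGION"
else
  echo "error: $ZONE is not in $REGION" >&2
fi
```

If the check fails, correct the values before passing them to --region and --node-locations.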

Configuring identity and access management

  1. Bind your account as the project owner:

    gcloud projects add-iam-policy-binding project-id \
      --member user:email --role=roles/owner
    

    where:

    • project-id is your Project ID.
    • email is the email address of your account.
  2. Create a custom role with read permissions on GKE clusters:

    gcloud iam roles create BatchUser --project project-id \
      --title GKEClusterReader --permissions container.clusters.get --stage BETA 2>&1
    

    where:

    • project-id is your Project ID.
    • GKEClusterReader is the title of the role.
  3. Create a ClusterRoleBinding in your cluster to allow Batch to create Kubernetes Roles:

    kubectl create clusterrolebinding cluster-admin-binding-email \
      --clusterrole=cluster-admin --user email
    

    where email is the email address of your account.

  4. Create a Google service account:

    gcloud iam service-accounts create kbatch-controllers-gcloud-sa \
      --display-name kbatch-controllers-gcloud-service-account
    
  5. Create a Kubernetes service account:

    kubectl create serviceaccount --namespace kube-system kbatch-controllers-k8s-sa
    
  6. Add the following IAM policy bindings, where project-id is your Project ID:

    gcloud projects add-iam-policy-binding project-id \
      --member serviceAccount:kbatch-controllers-gcloud-sa@project-id.iam.gserviceaccount.com \
      --role=roles/container.clusterAdmin
    
    gcloud projects add-iam-policy-binding project-id \
      --member serviceAccount:kbatch-controllers-gcloud-sa@project-id.iam.gserviceaccount.com \
      --role=roles/compute.admin
    
    gcloud projects add-iam-policy-binding project-id \
      --member serviceAccount:kbatch-controllers-gcloud-sa@project-id.iam.gserviceaccount.com \
      --role=roles/iam.serviceAccountUser
    
    gcloud iam service-accounts add-iam-policy-binding \
      --role roles/iam.workloadIdentityUser \
      --member "serviceAccount:project-id.svc.id.goog[kube-system/kbatch-controllers-k8s-sa]" \
      kbatch-controllers-gcloud-sa@project-id.iam.gserviceaccount.com
    
  7. Add the iam.gke.io/gcp-service-account annotation to the Kubernetes service account:

    kubectl annotate serviceaccount --namespace kube-system kbatch-controllers-k8s-sa \
       iam.gke.io/gcp-service-account=kbatch-controllers-gcloud-sa@project-id.iam.gserviceaccount.com
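
The account names in steps 4 through 7 follow two fixed patterns: the Google service account email is name@project-id.iam.gserviceaccount.com, and the Workload Identity member is serviceAccount:project-id.svc.id.goog[namespace/name]. The following minimal sketch builds both strings from variables so that the project ID is typed only once. The function names gsa_email and wi_member and the example project ID are illustrative assumptions, and the script only prints the commands instead of running them.

```shell
#!/bin/sh
# Hypothetical helpers; they only format strings and never call gcloud.

# Email address of a Google service account in a project.
gsa_email() {
  printf '%s@%s.iam.gserviceaccount.com\n' "$1" "$2"
}

# Workload Identity member for a Kubernetes service account.
wi_member() {
  printf 'serviceAccount:%s.svc.id.goog[%s/%s]\n' "$1" "$2" "$3"
}

PROJECT_ID="my-project"   # example value, replace with your project ID
GSA="$(gsa_email kbatch-controllers-gcloud-sa "$PROJECT_ID")"

# Print one project-level binding command per role from step 6.
for role in roles/container.clusterAdmin roles/compute.admin roles/iam.serviceAccountUser; do
  echo "gcloud projects add-iam-policy-binding $PROJECT_ID --member serviceAccount:$GSA --role=$role"
done

# Print the Workload Identity binding command from step 6.
echo "gcloud iam service-accounts add-iam-policy-binding --role roles/iam.workloadIdentityUser --member \"$(wi_member "$PROJECT_ID" kube-system kbatch-controllers-k8s-sa)\" $GSA"
```

Review the printed commands before running them, because the roles granted here are broad.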
    

Enabling GPUs

If you want to run GPU jobs, you need to install NVIDIA's device drivers on the nodes. Google provides a DaemonSet that automatically installs the drivers for you when nodes are created in the cluster.

Refer to the section Installing NVIDIA GPU device drivers for installation instructions.

Installing Batch on GKE

To install Batch, perform the following steps:

  1. Download the Batch release from GitHub.

  2. Extract the tar file:

    tar zxvf kbatch-version.tar.gz
    
  3. Change to the extracted kbatch_dist directory:

    cd kbatch_dist
    
  4. Add your cluster name, location, and project ID to the configuration file:

    vi config/kbatch-config.yaml
    
    ...
    ClusterName: cluster-name
    ClusterLocation: compute-region
    ProjectID: project-id
    Recommender:
      Locations:
      # Note: Only one zone is supported in the Locations list here.
      - compute-zone
    Actuator:
    ...
    
  5. Create the kbatch-config ConfigMap:

    kubectl create configmap --from-file config/kbatch-config.yaml -n kube-system kbatch-config
    
  6. Install the Batch custom resource definitions and components:

    kubectl apply -f install/
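
Before creating the ConfigMap in step 5, it can be worth confirming that none of the placeholder values from step 4 were left in the file. The following is a minimal sketch, assuming the placeholders are exactly the strings shown above (cluster-name, compute-region, project-id, compute-zone). has_placeholders is a hypothetical helper and is not part of the Batch release.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit 0) if the file still contains one of
# the documentation placeholders as a whole value after a ":" or "-".
has_placeholders() {
  grep -Eq '(:|-)[[:space:]]+(cluster-name|compute-region|project-id|compute-zone)[[:space:]]*$' "$1"
}

CONFIG="config/kbatch-config.yaml"
if [ -f "$CONFIG" ]; then
  if has_placeholders "$CONFIG"; then
    echo "error: $CONFIG still contains placeholder values" >&2
  else
    echo "ok: no placeholders found in $CONFIG"
  fi
fi
```

The pattern matches only whole values, so a real name that merely contains a placeholder string (for example, my-project-id) is not reported.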
    

Verifying the Batch installation

  1. Verify that kbatch-admission Pods are running:

    kubectl get pods -n kube-system --selector=app=kbatch-admission
    

    The output looks similar to the following:

    NAME                                READY   STATUS    RESTARTS   AGE
    kbatch-admission-799b776795-xxvmh   1/1     Running   0          1m
    
  2. Verify that kbatch-controllers Pods are running:

    kubectl get pods -n kube-system --selector=control-plane=kbatch-controllers
    

    The output looks similar to the following:

    NAME                   READY   STATUS    RESTARTS   AGE
    kbatch-controllers-0   1/1     Running   0          1m
    
  3. Once you've verified the Batch installation, run the sample jobs.
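
The two checks above can also be scripted. The following is a minimal sketch, assuming the default kubectl get pods table layout shown above (NAME, READY, STATUS, and so on). all_pods_ready is a hypothetical helper that reads that table on standard input and succeeds only when at least one Pod is listed and every Pod is Running with all of its containers ready.

```shell
#!/bin/sh
# Hypothetical helper: reads `kubectl get pods` table output on stdin.
# Exit 0 only if there is at least one Pod row and every row has
# STATUS "Running" and READY of the form n/n.
all_pods_ready() {
  awk '
    NR == 1 { next }                       # skip the header row
    NF == 0 { next }                       # skip blank lines
    {
      rows++
      split($2, ready, "/")
      if ($3 != "Running" || ready[1] != ready[2]) bad++
    }
    END { exit (rows > 0 && bad == 0) ? 0 : 1 }
  '
}

# Usage (requires kubectl access to the cluster):
#   kubectl get pods -n kube-system --selector=app=kbatch-admission | all_pods_ready \
#     && echo "kbatch-admission is ready"
```

An empty table counts as a failure, so a selector that matches no Pods is not mistaken for a healthy installation.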

Managing Batch on GKE versions

You can upgrade, downgrade, and uninstall Batch.

Upgrading Batch

To upgrade to a new minor or patch version, run the following commands:

  1. Delete the resources defined in the current admission and controllers .yaml files:

    kubectl delete -f kbatch-current-version/install/02-admission.yaml
    kubectl delete -f kbatch-current-version/install/03-controllers.yaml
    
  2. Apply the new admission and controllers .yaml files:

    kubectl apply -f kbatch-new-version/install/02-admission.yaml
    kubectl apply -f kbatch-new-version/install/03-controllers.yaml
    

To upgrade to a new major version, either install the new version on a new cluster, or follow the steps in Uninstalling Batch and then install the new major version.
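
Which procedure applies depends on whether the major version changes. The following is a minimal sketch of that decision, assuming Batch release versions follow a MAJOR.MINOR.PATCH pattern (this page does not state the exact scheme). upgrade_kind is a hypothetical helper, and the version numbers are made-up examples.

```shell
#!/bin/sh
# Hypothetical helper: compares the MAJOR component of two versions and
# prints which procedure applies.
upgrade_kind() {
  current_major="${1%%.*}"
  target_major="${2%%.*}"
  if [ "$current_major" = "$target_major" ]; then
    echo "minor-or-patch"   # delete and re-apply admission and controllers
  else
    echo "major"            # uninstall first, or use a new cluster
  fi
}

upgrade_kind 0.8.1 0.9.0   # prints "minor-or-patch"
upgrade_kind 0.9.0 1.0.0   # prints "major"
```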

Downgrading Batch

You can roll back only to the previous minor or patch version.

To roll back to a previous version, run the following commands:

  1. Delete the resources defined in the current admission and controllers .yaml files:

    kubectl delete -f kbatch-current-version/install/02-admission.yaml
    kubectl delete -f kbatch-current-version/install/03-controllers.yaml
    
  2. Apply the previous version's admission and controllers .yaml files:

    kubectl apply -f kbatch-old-version/install/02-admission.yaml
    kubectl apply -f kbatch-old-version/install/03-controllers.yaml
    

Uninstalling Batch

To uninstall Batch, perform the following steps:

  1. Verify which version of Batch you are running by checking the image tags:

    kubectl get deployment kbatch-admission -n kube-system -o jsonpath="{..image}"
    kubectl get statefulset kbatch-controllers -n kube-system -o jsonpath="{..image}"
    
  2. Delete the installation bundle from your cluster:

    kubectl delete -f kbatch-version/install/
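
The jsonpath queries in step 1 print full image references, and the version is the part after the last colon. The following is a minimal sketch for pulling out that tag, assuming the image is referenced by tag rather than by digest. image_tag is a hypothetical helper, and the example reference is the sidecar image that appears elsewhere on this page.

```shell
#!/bin/sh
# Hypothetical helper: prints the tag of an image reference such as
# registry/path/name:tag. Fails if the last path component has no ":".
image_tag() {
  name="${1##*/}"            # drop registry and path, keep "name:tag"
  case "$name" in
    *:*) echo "${name##*:}" ;;
    *)   return 1 ;;
  esac
}

image_tag gcr.io/kbatch-images/stackdriver-prometheus-sidecar:0.6.1   # prints "0.6.1"
```

Stripping the registry and path first keeps a registry port (as in host:5000/name) from being mistaken for a tag.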
    

Debugging Batch on GKE using Stackdriver

Batch uses Prometheus for monitoring. You can view your kbatch-controller-service metrics in Stackdriver Monitoring. Stackdriver treats the metrics that Batch services generate as external metrics.

Custom metrics are a chargeable feature of Stackdriver Monitoring, so these metrics might incur costs. For more information on pricing, see Stackdriver Pricing.

Before you begin

Configuring identity and access management

  1. Create a Google service account:

    gcloud iam service-accounts create kbatch-monitoring-gcloud-sa \
      --display-name kbatch-monitoring-gcloud-service-account
    
  2. Create a Kubernetes service account:

    kubectl create serviceaccount --namespace kube-system kbatch-monitoring-k8s-sa
    
  3. Add the following IAM policy bindings, where project-id is your Project ID:

    gcloud projects add-iam-policy-binding project-id \
      --member serviceAccount:kbatch-monitoring-gcloud-sa@project-id.iam.gserviceaccount.com \
      --role=roles/monitoring.metricWriter
    
    gcloud projects add-iam-policy-binding project-id \
      --member serviceAccount:kbatch-monitoring-gcloud-sa@project-id.iam.gserviceaccount.com \
      --role=roles/monitoring.viewer
    
    gcloud iam service-accounts add-iam-policy-binding \
      --role roles/iam.workloadIdentityUser \
      --member "serviceAccount:project-id.svc.id.goog[kube-system/kbatch-monitoring-k8s-sa]" \
      kbatch-monitoring-gcloud-sa@project-id.iam.gserviceaccount.com
    
  4. Add the iam.gke.io/gcp-service-account annotation to the Kubernetes service account:

    kubectl annotate serviceaccount --namespace kube-system kbatch-monitoring-k8s-sa \
       iam.gke.io/gcp-service-account=kbatch-monitoring-gcloud-sa@project-id.iam.gserviceaccount.com
    
  5. Get the admin tools:

    git clone https://github.com/GoogleCloudPlatform/Kbatch.git
    
  6. Go to the monitoring directory:

    cd admintools/monitoring
    

Deploy the Prometheus service

  1. To deploy the Prometheus service, run the following command:

    kubectl apply -f prometheus.yaml
    
  2. To validate the Prometheus deployment, run the following command:

    kubectl get pod -n kube-system | grep 'kbatch-prometheus'
    

    The output is similar to this:

    kbatch-prometheus-deployment-97bc6b97b-m4q9h       1/1     Running   0          9s
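
A freshly created Pod can take a short while to reach the Running state, so the check in step 2 may need to be repeated. The following is a minimal sketch of a retry loop. retry and prometheus_running are hypothetical helpers, and the attempt count and delay in the usage comment are arbitrary example values.

```shell
#!/bin/sh
# Hypothetical helper: retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs the command until it succeeds or the attempts are used up.
retry() {
  attempts="$1"
  delay="$2"
  shift 2
  n=1
  while [ "$n" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    n=$((n + 1))
    # Sleep only if another attempt follows.
    [ "$n" -le "$attempts" ] && sleep "$delay"
  done
  return 1
}

# Hypothetical check: succeeds when a kbatch-prometheus Pod is Running.
prometheus_running() {
  kubectl get pod -n kube-system | grep 'kbatch-prometheus' | grep -q 'Running'
}

# Usage (requires kubectl access to the cluster):
#   retry 10 6 prometheus_running && echo "Prometheus is running"
```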
    

Installing the Stackdriver collector

Next, deploy the sidecar container as the Stackdriver collector. The sidecar exports the Prometheus metrics to Stackdriver.

  1. To deploy the Stackdriver collector, run the following command:

    sh ./setup_metrics_export_to_sd.sh
    
  2. To validate the Stackdriver collector installation, run the following command:

    kubectl -n kube-system get deployment kbatch-prometheus-deployment -o=go-template='{{$output := "stackdriver-prometheus-sidecar does not exist."}}{{range .spec.template.spec.containers}}{{if eq .name "sidecar"}}{{$output = (print "sidecar exists. Image: " .image)}}{{end}}{{end}}{{printf $output}}{{"\n"}}'
    

    When the Prometheus sidecar is successfully installed, the output of the command lists the image used from the container registry:

    sidecar exists. Image: gcr.io/kbatch-images/stackdriver-prometheus-sidecar:0.6.1
    

    Otherwise, the output of the command shows:

    stackdriver-prometheus-sidecar does not exist.
    

Viewing metrics

  1. Go to Metrics Explorer.

  2. Go to Resources > Metrics Explorer.

  3. In the Find resource type and metric field, select a metric with the external/prometheus/ prefix.

    For example, you might select external/prometheus/kbatch_scheduling_dep.

    You can add multiple metrics in one Workspace.

Disable the Stackdriver collector

To disable the sidecar container, run the following command from the kbatch directory:

sh ./disable_metrics_export_to_sd.sh

Clean up

To stop running Batch services in a GKE cluster, run the following commands:

kubectl delete deployment kbatch-admission --namespace=kube-system
kubectl delete statefulset kbatch-controllers --namespace=kube-system

To delete the GKE cluster that has Batch installed, run the following command:

gcloud container clusters delete cluster-name --region compute-region

To delete the Filestore instance, run the following command:

gcloud beta filestore instances delete filestore-instance-id \
  --project=project-id --location=filestore-zone

where:

  • filestore-instance-id is your Filestore Instance ID.
  • project-id is your Project ID.
  • filestore-zone is your zone.

To delete the project that has Batch installed, run the following command:

gcloud projects delete project-id

What's next