Managing Batch on GKE clusters

This page shows you how to create and manage Batch on GKE clusters.

Before you begin

To prepare for this task, perform the following steps:

  • Ensure that you have enabled the Google Kubernetes Engine API.
  • Ensure that you have installed the Cloud SDK.
  • Set your default project ID:
    gcloud config set project [PROJECT_ID]
  • If you are working with zonal clusters, set your default compute zone:
    gcloud config set compute/zone [COMPUTE_ZONE]
  • If you are working with regional clusters, set your default compute region:
    gcloud config set compute/region [COMPUTE_REGION]
  • Update gcloud to the latest version:
    gcloud components update

In Beta, Batch on GKE (Batch) supports only regional clusters. You must create a regional cluster and enable Workload Identity.

Run the following command to create a regional cluster with Workload Identity enabled:

gcloud beta container clusters create [CLUSTER_NAME] \
  --region [COMPUTE_REGION] \
  --node-locations [COMPUTE_ZONE] \
  --num-nodes 1 \
  --machine-type n1-standard-8 \
  --release-channel regular \
  --enable-stackdriver-kubernetes \
  --identity-namespace=[PROJECT_ID].svc.id.goog \
  --enable-ip-alias
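If you script the setup, you can keep the placeholder values in shell variables so that the Workload Identity namespace is derived from the project ID instead of typed twice. The following is a minimal sketch, not part of Batch or gcloud: the function names `create_batch_cluster` and `identity_namespace`, the `run` wrapper, and the `DRY_RUN` switch are hypothetical conventions. It also passes the project explicitly with gcloud's global `--project` flag, which the original command leaves to your default configuration. With `DRY_RUN=1`, the function only prints the command.

```shell
# Hypothetical helpers wrapping the cluster-creation command above.
# Set DRY_RUN=1 to print commands instead of running them.

run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Workload Identity namespace for a project, as passed to --identity-namespace.
identity_namespace() {
  printf '%s.svc.id.goog\n' "$1"
}

# Usage: create_batch_cluster CLUSTER_NAME COMPUTE_REGION COMPUTE_ZONE PROJECT_ID
create_batch_cluster() {
  local cluster="$1" region="$2" zone="$3" project="$4"
  run gcloud beta container clusters create "$cluster" \
    --project "$project" \
    --region "$region" \
    --node-locations "$zone" \
    --num-nodes 1 \
    --machine-type n1-standard-8 \
    --release-channel regular \
    --enable-stackdriver-kubernetes \
    --identity-namespace="$(identity_namespace "$project")" \
    --enable-ip-alias
}
```

For example, `DRY_RUN=1 create_batch_cluster batch-cluster us-central1 us-central1-b my-project` prints the full command for review before you run it without `DRY_RUN`.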

Configuring identity and access management

  1. Grant your account the Owner role on the project:

    gcloud projects add-iam-policy-binding [PROJECT_ID] \
    --member user:[EMAIL] --role=roles/owner
    
  2. Create a custom role with read permissions on GKE clusters:

    gcloud iam roles create BatchUser --project [PROJECT_ID] \
    --title GKEClusterReader --permissions container.clusters.get --stage BETA 2>&1
    

    where:

    • [PROJECT_ID] is your Project ID.
    • BatchUser is the ID of the custom role.
    • GKEClusterReader is the title of the role.
  3. Create a ClusterRoleBinding that grants your account the cluster-admin role in your cluster, so that Batch can create Kubernetes Roles:

    kubectl create clusterrolebinding cluster-admin-binding-[EMAIL] \
    --clusterrole=cluster-admin --user [EMAIL]
    

    where [EMAIL] is your email address.

  4. Create a Google service account:

    gcloud iam service-accounts create kbatch-controllers-gcloud-sa --display-name \
    kbatch-controllers-gcloud-service-account
    
  5. Create a Kubernetes service account:

    kubectl create serviceaccount --namespace kube-system kbatch-controllers-k8s-sa
    
  6. Add the following IAM policy bindings:

    gcloud projects add-iam-policy-binding [PROJECT_ID] \
    --member serviceAccount:kbatch-controllers-gcloud-sa@[PROJECT_ID].iam.gserviceaccount.com \
    --role=roles/container.clusterAdmin
    
    gcloud projects add-iam-policy-binding [PROJECT_ID] \
    --member serviceAccount:kbatch-controllers-gcloud-sa@[PROJECT_ID].iam.gserviceaccount.com \
    --role=roles/compute.admin
    
    gcloud projects add-iam-policy-binding [PROJECT_ID] \
    --member serviceAccount:kbatch-controllers-gcloud-sa@[PROJECT_ID].iam.gserviceaccount.com \
    --role=roles/iam.serviceAccountUser
    
    gcloud iam service-accounts add-iam-policy-binding \
    --role roles/iam.workloadIdentityUser \
    --member "serviceAccount:[PROJECT_ID].svc.id.goog[kube-system/kbatch-controllers-k8s-sa]" kbatch-controllers-gcloud-sa@[PROJECT_ID].iam.gserviceaccount.com
    

    where [PROJECT_ID] is your Project ID.

  7. Add the iam.gke.io/gcp-service-account annotation to the Kubernetes service account:

    kubectl annotate serviceaccount --namespace kube-system kbatch-controllers-k8s-sa \
    iam.gke.io/gcp-service-account=kbatch-controllers-gcloud-sa@[PROJECT_ID].iam.gserviceaccount.com
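Steps 4 to 7 above differ only in the project ID, so they can be collected into one function. The following is a minimal sketch under that assumption: `setup_controller_identity`, the `run` wrapper, and `DRY_RUN` are hypothetical names, and the account names, roles, and annotation are copied from the commands above. It does not cover the custom role or the ClusterRoleBinding from steps 2 and 3.

```shell
# Hypothetical helper that repeats steps 4-7 above for one project.
# Set DRY_RUN=1 to print the commands instead of running them.

run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

GSA_NAME="kbatch-controllers-gcloud-sa"
KSA_NAME="kbatch-controllers-k8s-sa"
KSA_NAMESPACE="kube-system"

# Usage: setup_controller_identity PROJECT_ID
setup_controller_identity() {
  local project="$1" role
  local gsa_email="${GSA_NAME}@${project}.iam.gserviceaccount.com"

  run gcloud iam service-accounts create "$GSA_NAME" --project "$project" \
    --display-name kbatch-controllers-gcloud-service-account
  run kubectl create serviceaccount --namespace "$KSA_NAMESPACE" "$KSA_NAME"

  # Project-level roles for the Batch controllers' Google service account.
  for role in roles/container.clusterAdmin roles/compute.admin roles/iam.serviceAccountUser; do
    run gcloud projects add-iam-policy-binding "$project" \
      --member "serviceAccount:${gsa_email}" --role="$role"
  done

  # Let the Kubernetes service account impersonate the Google service account.
  run gcloud iam service-accounts add-iam-policy-binding \
    --role roles/iam.workloadIdentityUser \
    --member "serviceAccount:${project}.svc.id.goog[${KSA_NAMESPACE}/${KSA_NAME}]" \
    "$gsa_email"

  # Link the Kubernetes service account to the Google service account.
  run kubectl annotate serviceaccount --namespace "$KSA_NAMESPACE" "$KSA_NAME" \
    "iam.gke.io/gcp-service-account=${gsa_email}"
}
```

Running `DRY_RUN=1 setup_controller_identity my-project` first lets you confirm that every command refers to the same project before anything is created.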
    

Enabling GPUs

If you want to run GPU jobs, run the following command so that the NVIDIA GPU drivers are installed automatically when GPU nodes are created:

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/stable/nvidia-driver-installer/cos/daemonset-preloaded.yaml
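To confirm that the driver installer was applied, you can wait for its DaemonSet rollout. This sketch assumes that the manifest creates a DaemonSet named `nvidia-driver-installer` in the `kube-system` namespace; check the manifest if the name differs. The function name, the `run` wrapper, and `DRY_RUN` are hypothetical.

```shell
# Hypothetical helper. Assumes the driver installer manifest creates a
# DaemonSet named nvidia-driver-installer in the kube-system namespace.
# Set DRY_RUN=1 to print the command instead of running it.

run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Waits until the DaemonSet reports a completed rollout (default timeout 300s).
wait_for_gpu_driver_installer() {
  local timeout="${1:-300s}"
  run kubectl rollout status daemonset/nvidia-driver-installer \
    --namespace kube-system --timeout="$timeout"
}
```

Until GPU nodes exist, the DaemonSet is likely to schedule no Pods, so the check may return immediately; it is most informative after GPU nodes have been created.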

Installing Batch on GKE

To install Batch, perform the following steps:

  1. Download the Batch release from GitHub.

  2. Extract the tar file:

    tar zxvf kbatch-[VERSION].tar.gz
    
  3. Change to the kbatch directory:

    cd kbatch
    
  4. Add your cluster details to the configuration file:

    vi config/kbatch-config.yaml
    
    ...
    ClusterName: [CLUSTER_NAME]
    ClusterLocation: [COMPUTE_REGION]
    ProjectID: [PROJECT_ID]
    Recommender:
      Locations:
      # Note: Only one zone is supported in the Locations list here.
      - [COMPUTE_ZONE]
    Actuator:
    ...
    
  5. Create a ConfigMap from the configuration file:

    kubectl create configmap --from-file config/kbatch-config.yaml -n kube-system kbatch-config
    
  6. Create a ConfigMap that lists the machine types available to the autoscaler in your zone:

    gcloud compute machine-types list --filter="zone:[COMPUTE_ZONE]" --format json > ./machine_types.json
    
    kubectl create configmap --from-file ./machine_types.json -n kube-system kbatch-machine-types
    
  7. Install the Batch custom resource definitions and components:

    kubectl apply -f install/01-crds.yaml
    
    kubectl apply -f install/02-admission.yaml
    
    kubectl apply -f install/03-controller.yaml
    
  8. Enable ksub to use your user credentials for API access:

    gcloud auth application-default login
    
  9. Initialize ksub:

    ./ksub --config --create-default
    
  10. In ~/.ksubrc, add default values for projectID and clusterName. If you are not using the default namespace, also add a default value for namespace:

    vi ~/.ksubrc
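Before you create the ConfigMaps in step 5, a quick check can catch placeholders that were not replaced in the configuration file. The following is a minimal sketch, not a Batch tool: `check_kbatch_config` and `apply_kbatch_manifests` are hypothetical names, the check only looks for the three keys shown in step 4 and for bracketed upper-case placeholders such as `[CLUSTER_NAME]`, and the manifest order is taken from step 7.

```shell
# Hypothetical helpers for the installation steps above; run them from the
# kbatch directory. Set DRY_RUN=1 to print commands instead of running them.

run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Fails if ClusterName, ClusterLocation, or ProjectID has no value, or if a
# bracketed placeholder such as [CLUSTER_NAME] is still present in the file.
check_kbatch_config() {
  local file="${1:-config/kbatch-config.yaml}" key status=0
  for key in ClusterName ClusterLocation ProjectID; do
    if ! grep -Eq "^[[:space:]]*${key}[[:space:]]*:[[:space:]]*[^[:space:]]" "$file"; then
      echo "missing value for ${key} in ${file}" >&2
      status=1
    fi
  done
  if grep -Eq '\[[A-Z_]+\]' "$file"; then
    echo "unreplaced placeholder in ${file}" >&2
    status=1
  fi
  return "$status"
}

# Applies the Batch manifests in the order used by step 7.
apply_kbatch_manifests() {
  local dir="${1:-install}" manifest
  for manifest in 01-crds.yaml 02-admission.yaml 03-controller.yaml; do
    run kubectl apply -f "${dir}/${manifest}" || return 1
  done
}
```

Stopping at the first failed `kubectl apply` keeps the components from being installed without their custom resource definitions.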
    

Verifying the Batch installation

  1. Verify that the kbatch-admission Pods are Running:

    kubectl get pods -n kube-system --selector=app=kbatch-admission
    
  2. Verify that the kbatch-controllers Pods are Running:

    kubectl get pods -n kube-system --selector=control-plane=kbatch-controllers
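Instead of rerunning `kubectl get pods` until the status changes, you can block until the Pods are Ready. This sketch reuses the label selectors from the two commands above; the function name, the `run` wrapper, and `DRY_RUN` are hypothetical. `kubectl wait` fails immediately if no Pods match the selector yet, so run it after the manifests have been applied.

```shell
# Hypothetical helper using the label selectors from the verification steps.
# Set DRY_RUN=1 to print the commands instead of running them.

run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Usage: wait_for_kbatch [TIMEOUT]   (default 300s)
wait_for_kbatch() {
  local timeout="${1:-300s}"
  run kubectl wait pod --namespace kube-system \
    --selector=app=kbatch-admission --for=condition=Ready --timeout="$timeout" || return 1
  run kubectl wait pod --namespace kube-system \
    --selector=control-plane=kbatch-controllers --for=condition=Ready --timeout="$timeout"
}
```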
    

Once you've verified the Batch installation, you can run the sample jobs.

Managing Batch on GKE versions

You can upgrade, downgrade, and uninstall Batch.

Upgrading Batch

To upgrade to a new minor or patch version, run the following commands:

  1. Delete the admission and controller components of the current version:

    kubectl delete -f kbatch-[CURRENT_VERSION]/install/02-admission.yaml
    kubectl delete -f kbatch-[CURRENT_VERSION]/install/03-controllers.yaml
    
  2. Apply the admission and controller components of the new version:

    kubectl apply -f kbatch-[NEW_VERSION]/install/02-admission.yaml
    kubectl apply -f kbatch-[NEW_VERSION]/install/03-controllers.yaml
    

To upgrade to a new major version, either install the new version on a new cluster, or follow the steps in Uninstalling Batch then install the new major version.

Downgrading Batch

You can only roll back to the previous minor or patch version.

To roll back to a previous version, run the following commands:

  1. Delete the admission and controller components of the current version:

    kubectl delete -f kbatch-[CURRENT_VERSION]/install/02-admission.yaml
    kubectl delete -f kbatch-[CURRENT_VERSION]/install/03-controllers.yaml
    
  2. Apply the admission and controller components of the previous version:

    kubectl apply -f kbatch-[OLD_VERSION]/install/02-admission.yaml
    kubectl apply -f kbatch-[OLD_VERSION]/install/03-controllers.yaml
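The upgrade and rollback procedures differ only in the target version, so a script can wrap both in one function. This is a minimal sketch: `switch_kbatch_version`, the `run` wrapper, and `DRY_RUN` are hypothetical names, and the manifest file names are copied from the commands in this section. The installation steps refer to `03-controller.yaml` instead, so check which name your release bundle contains before running it.

```shell
# Hypothetical helper for minor or patch upgrades and rollbacks.
# Set DRY_RUN=1 to print the commands instead of running them.

run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Usage: switch_kbatch_version FROM_VERSION TO_VERSION
# Expects the extracted kbatch-FROM_VERSION and kbatch-TO_VERSION directories
# in the current working directory.
switch_kbatch_version() {
  local from="$1" to="$2" manifest
  # Remove the current admission and controller components first.
  for manifest in 02-admission.yaml 03-controllers.yaml; do
    run kubectl delete -f "kbatch-${from}/install/${manifest}" || return 1
  done
  # Then apply the components of the target version.
  for manifest in 02-admission.yaml 03-controllers.yaml; do
    run kubectl apply -f "kbatch-${to}/install/${manifest}" || return 1
  done
}
```

An upgrade and a rollback then use the same call with the arguments swapped, for example `switch_kbatch_version [CURRENT_VERSION] [NEW_VERSION]` and `switch_kbatch_version [CURRENT_VERSION] [OLD_VERSION]`.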
    

Uninstalling Batch

To uninstall Batch, perform the following steps:

  1. Check which version of Batch you are running by inspecting the image tags:

    kubectl get deployment kbatch-admission -n kube-system -o jsonpath="{..image}"
    kubectl get statefulset kbatch-controllers -n kube-system -o jsonpath="{..image}"
    
  2. Delete the installation bundle from your cluster:

    kubectl delete -f kbatch-[VERSION]/install/
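The `jsonpath` commands in step 1 print full image references, and the version is the tag after the last colon. The following sketch extracts that tag, assuming the Batch images are tagged with the release version; `image_tag` is a hypothetical helper, and the loop in the comment is only an example of how to combine it with the commands above.

```shell
# Hypothetical helper: prints the tag of a container image reference, or
# "latest" when the reference has no tag. A trailing @digest is ignored.
image_tag() {
  local last="${1##*/}"   # keep only the final path component, e.g. name:tag
  last="${last%%@*}"      # drop a digest such as @sha256:...
  case "$last" in
    *:*) printf '%s\n' "${last##*:}" ;;
    *)   printf '%s\n' latest ;;
  esac
}

# Example (requires cluster access): print the tag of every Batch image.
# for img in $(kubectl get deployment kbatch-admission -n kube-system -o jsonpath="{..image}") \
#            $(kubectl get statefulset kbatch-controllers -n kube-system -o jsonpath="{..image}"); do
#   image_tag "$img"
# done
```

Taking only the final path component first avoids mistaking a registry port, such as `localhost:5000/...`, for a tag.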
    

Debugging Batch on GKE using Stackdriver

Batch uses Prometheus for monitoring. You can view the kbatch-controller-service metrics in Stackdriver Monitoring, where the metrics generated by Batch services appear as external metrics.

Custom metrics are a chargeable feature of Stackdriver Monitoring, so exporting Batch metrics might incur costs. For more information on pricing, see Stackdriver Pricing.

Before you begin

Configuring identity and access management

  1. Create a Google service account:

    gcloud iam service-accounts create kbatch-monitoring-gcloud-sa \
    --display-name kbatch-monitoring-gcloud-service-account
    
  2. Create a Kubernetes service account:

    kubectl create serviceaccount --namespace kube-system kbatch-monitoring-k8s-sa
    
  3. Add the following IAM policy bindings:

    gcloud projects add-iam-policy-binding [PROJECT_ID] \
    --member serviceAccount:kbatch-monitoring-gcloud-sa@[PROJECT_ID].iam.gserviceaccount.com \
    --role=roles/monitoring.metricWriter
    
    gcloud projects add-iam-policy-binding [PROJECT_ID] \
    --member serviceAccount:kbatch-monitoring-gcloud-sa@[PROJECT_ID].iam.gserviceaccount.com \
    --role=roles/monitoring.viewer
    
    gcloud iam service-accounts add-iam-policy-binding \
    --role roles/iam.workloadIdentityUser \
    --member "serviceAccount:[PROJECT_ID].svc.id.goog[kube-system/kbatch-monitoring-k8s-sa]" kbatch-monitoring-gcloud-sa@[PROJECT_ID].iam.gserviceaccount.com
    

    where [PROJECT_ID] is your Project ID.

  4. Add the iam.gke.io/gcp-service-account annotation to the Kubernetes service account:

    kubectl annotate serviceaccount --namespace kube-system kbatch-monitoring-k8s-sa \
    iam.gke.io/gcp-service-account=kbatch-monitoring-gcloud-sa@[PROJECT_ID].iam.gserviceaccount.com
    
  5. Get the admin tools:

    git clone https://github.com/GoogleCloudPlatform/Kbatch.git
    
  6. Go to the monitoring directory:

    cd Kbatch/admintools/monitoring
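To confirm that the project-level bindings from step 3 took effect, you can list the roles granted to the monitoring service account. This sketch uses the `--flatten`, `--filter`, and `--format` options of `gcloud projects get-iam-policy`; the function name, the `run` wrapper, and `DRY_RUN` are hypothetical. If step 3 succeeded, the output should include roles/monitoring.metricWriter and roles/monitoring.viewer. The roles/iam.workloadIdentityUser binding is set on the service account rather than the project, so it does not appear here.

```shell
# Hypothetical helper. Set DRY_RUN=1 to print the command instead of running it.

run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Usage: list_monitoring_sa_roles PROJECT_ID
# Lists the project-level roles bound to the monitoring Google service account.
list_monitoring_sa_roles() {
  local project="$1"
  local gsa_email="kbatch-monitoring-gcloud-sa@${project}.iam.gserviceaccount.com"
  run gcloud projects get-iam-policy "$project" \
    --flatten="bindings[].members" \
    --filter="bindings.members:serviceAccount:${gsa_email}" \
    --format="value(bindings.role)"
}
```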
    

Deploying the Prometheus service

  1. To deploy the Prometheus service, run the following command:

    kubectl apply -f prometheus.yaml
    
  2. To validate the Prometheus deployment, run the following command:

    kubectl get pod -n kube-system | grep 'kbatch-prometheus'
    

    The output is similar to this:

    kbatch-prometheus-deployment-97bc6b97b-m4q9h       1/1     Running   0          9s
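Because the Pod name changes with every rollout, a script can wait on the Deployment instead of grepping the Pod list. The Deployment name `kbatch-prometheus-deployment` is the one used by the sidecar check later on this page; the function name, the `run` wrapper, and `DRY_RUN` are hypothetical.

```shell
# Hypothetical helper. Set DRY_RUN=1 to print the command instead of running it.

run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Usage: wait_for_prometheus [TIMEOUT]   (default 120s)
wait_for_prometheus() {
  local timeout="${1:-120s}"
  run kubectl rollout status deployment/kbatch-prometheus-deployment \
    --namespace kube-system --timeout="$timeout"
}
```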
    

Installing the Stackdriver collector

Next, deploy a sidecar container that acts as the Stackdriver collector. The sidecar exports the Prometheus metrics to Stackdriver.

  1. To deploy the Stackdriver collector, run the following command:

    sh ./setup_metrics_export_to_sd.sh
    
  2. To validate the Stackdriver collector installation, run the following command:

    kubectl -n kube-system get deployment kbatch-prometheus-deployment -o=go-template='{{$output := "stackdriver-prometheus-sidecar does not exist."}}{{range .spec.template.spec.containers}}{{if eq .name "sidecar"}}{{$output = (print "sidecar exists. Image: " .image)}}{{end}}{{end}}{{printf $output}}{{"\n"}}'
    

    When the Prometheus sidecar is installed successfully, the command prints the sidecar's container image:

    sidecar exists. Image: gcr.io/kbatch-images/stackdriver-prometheus-sidecar:0.6.1
    

    Otherwise, the command prints:

    stackdriver-prometheus-sidecar does not exist.
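The same check can be written with a kubectl JSONPath filter, which some readers find easier to maintain than a go-template. This is a sketch, not the page's official check: the container name `sidecar` is taken from the go-template above, and `sidecar_message` and `check_sidecar` are hypothetical helpers that print the same two messages.

```shell
# Hypothetical helpers. The container name "sidecar" is taken from the
# go-template check above.

# Prints the same messages as the go-template version.
sidecar_message() {
  if [ -n "$1" ]; then
    printf 'sidecar exists. Image: %s\n' "$1"
  else
    printf '%s\n' "stackdriver-prometheus-sidecar does not exist."
  fi
}

# Looks up the image of the container named "sidecar" (empty if there is none).
check_sidecar() {
  local image
  image="$(kubectl -n kube-system get deployment kbatch-prometheus-deployment \
    -o jsonpath='{.spec.template.spec.containers[?(@.name=="sidecar")].image}')" || return 1
  sidecar_message "$image"
}
```

Separating the lookup from the message keeps the output logic testable without cluster access.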
    

Viewing metrics

  1. Go to Metrics Explorer.


  2. In Metrics Explorer (Resources > Metrics Explorer), use the Find resource type and metric field to select a metric with the prefix external/prometheus/.

    For example, you might select external/prometheus/kbatch_scheduling_dep.

    You can add multiple metrics in one Workspace.

Disable the Stackdriver collector

To disable the sidecar container, run the following command from the kbatch directory:

sh ./disable_metrics_export_to_sd.sh

Clean up

To stop running Batch services in a GKE cluster, run the following commands:

kubectl delete deployment kbatch-admission --namespace=kube-system
kubectl delete statefulset kbatch-controllers --namespace=kube-system

To delete the GKE cluster that has Batch installed, run the following command:

gcloud container clusters delete [CLUSTER_NAME] --region [COMPUTE_REGION]

If you created a Filestore instance for use with Batch, run the following command to delete it:

gcloud beta filestore instances delete [FILESTORE_INSTANCE_ID] \
  --project=[PROJECT_ID] --location=[FILESTORE_ZONE]

where [FILESTORE_INSTANCE_ID] is your Filestore instance ID, [PROJECT_ID] is your Project ID, and [FILESTORE_ZONE] is the zone of the Filestore instance.

To delete the project that has Batch installed, run the following command:

gcloud projects delete [PROJECT_ID]

What's next
