Running jobs with Batch on GKE

This page shows you how to run batch jobs with Batch on GKE (Batch). There are two ways to submit jobs in Batch: ksub and kubectl. The ksub command submits shell scripts as jobs, and kubectl submits jobs defined in YAML files.

Configuring ksub

Ksub is a command-line tool for performing job-related actions on your Batch system. You can use keywords prefixed with #KB to specify job properties.

To configure Ksub, perform the following steps:

  1. Enable ksub to use your own user credentials for API access:

    gcloud auth application-default login
    
  2. Change to the ksub directory and, depending on your OS, choose the right ksub version from the ksub folder. For example, Linux users use this version:

    cd ksub/linux/amd64
    
  3. Set up a default configuration file:

    ./ksub --config --create-default
    

    This creates a configuration file at ~/.ksubrc.

  4. Add the values for project-id, cluster-name, and, if you are not operating in the default namespace, namespace-name using a ksub command like this:

    ./ksub --config --set-project-id project-id \
    --set-clustername cluster-name --set-namespace namespace-name
    
  5. Optional: Set up mount points. Skip this step if you are not using a private PersistentVolumeClaim created by your cluster admin.

    ./ksub --config --add-volume fs-volume --volume-source PersistentVolumeClaim \
     --params claimName:[PVC_NAME] --params readOnly:false
    

    where [PVC_NAME] is the PersistentVolumeClaim name created by your admin for you to save your private input/output files. If using a Filestore instance, this is the Filestore instance's name.

  6. Add the install directory of ksub to $PATH:

    export PATH=$PATH:/path/to/kbatch/
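
Taken together, the configuration steps above can be collected into one shell snippet. This is a sketch; the values in CAPS are placeholders for your own environment:

```shell
#!/bin/sh
# One-shot ksub setup; values in CAPS are placeholders for your environment.
gcloud auth application-default login
cd ksub/linux/amd64                        # pick the build matching your OS
./ksub --config --create-default           # writes ~/.ksubrc
./ksub --config --set-project-id MY_PROJECT \
  --set-clustername MY_CLUSTER --set-namespace MY_NAMESPACE
export PATH=$PATH:$(pwd)                   # make ksub available from anywhere
```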
    

Configuring kubectl

The default tool for Kubernetes is kubectl, which is already included in the Cloud SDK.

To ensure you have the current version of kubectl, run the following command:

gcloud components update
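
kubectl also needs credentials for the cluster running Batch. If you have not fetched them yet, a typical command is (the cluster name and zone below are placeholders):

```shell
# Fetch kubeconfig entries for the GKE cluster so kubectl can reach it.
gcloud container clusters get-credentials MY_CLUSTER --zone us-central1-a
```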

Running sample jobs

Batch on GKE includes several sample jobs.

Running a single-task job

  1. Get the samples:

    git clone https://github.com/GoogleCloudPlatform/Kbatch.git
    
  2. Create the default Batch admin resources in the "default" K8s namespace:

    ./samples/defaultresources/create.sh
    

You can submit the job with ksub or kubectl.

ksub

  1. Run the ComputePi Job under /samples/computepi:

    ksub run_pi_with_ksub.sh
    

    This command outputs the job name.

  2. Wait for the job to complete:

    ksub -Gw [JOB_NAME]
    
  3. Get the task name:

    ksub -Ga [JOB_NAME]
    

    This command outputs the task name.

  4. View the logs:

    ksub -L [TASK_NAME]
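
The four steps above can be chained in a small script. This sketch assumes ksub prints only the bare job name and task name on stdout, as the steps describe:

```shell
#!/bin/sh
# Submit the job, wait for completion, then print the task's logs.
JOB_NAME=$(ksub run_pi_with_ksub.sh)   # step 1: submit, capture the job name
ksub -Gw "$JOB_NAME"                   # step 2: block until the job completes
TASK_NAME=$(ksub -Ga "$JOB_NAME")      # step 3: capture the task name
ksub -L "$TASK_NAME"                   # step 4: view the logs
```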
    

kubectl

  1. Run the ComputePi Job in /samples/computepi:

    kubectl create -f pi-job.yaml
    

    The output is:

    batchjob.kbatch.k8s.io/[JOB_NAME] created
    
  2. Identify the Pod associated with the job:

    kubectl get pods | grep [JOB_NAME]
    

    The output is:

    [POD_NAME]   0/1     Completed   0          1m
    
  3. View the logs:

    kubectl logs pod/[POD_NAME]
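
The pod lookup in step 2 can also be scripted. A sketch, assuming the grep matches exactly one Pod:

```shell
#!/bin/sh
# Find the Pod created for a job and print its logs (assumes a single match).
JOB_NAME=pi-6rc7s   # placeholder: use the job name from step 1
POD_NAME=$(kubectl get pods | grep "$JOB_NAME" | awk '{print $1}')
kubectl logs "pod/$POD_NAME"
```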
    

Running a job that uses Preemptible VMs

  1. Get the samples. Skip this step if you have done it for other sample jobs.

    git clone https://github.com/GoogleCloudPlatform/Kbatch.git
    
  2. Create the default Batch admin resources in the "default" K8s namespace. Skip this step if you have done it for other sample jobs.

    ./samples/defaultresources/create.sh
    

You can submit the job with ksub or kubectl.

ksub

  1. Run the ComputePi Preemptible Job under /samples/computepi:

    ksub run_pi_preemptible_with_ksub.sh
    

    This command outputs the job name.

  2. Wait for the job to complete:

    ksub -Gw [JOB_NAME]
    
  3. Get the task name:

    ksub -Ga [JOB_NAME]
    

    This command outputs the task name.

  4. View the logs:

    ksub -L [TASK_NAME]
    

kubectl

  1. Run the ComputePi Preemptible Job in /samples/computepi:

    kubectl create -f pi-job-preemptible.yaml
    

    The output is:

    batchjob.kbatch.k8s.io/[JOB_NAME] created
    
  2. Identify the Pod associated with the job:

    kubectl get pods | grep [JOB_NAME]
    

    The output is:

    [POD_NAME]   0/1     Completed   0          1m
    
  3. View the logs:

    kubectl logs pod/[POD_NAME]
    

Running jobs with dependencies

With dependencies, you can run some jobs only when specific conditions related to previous jobs have occurred. The Beta version supports three dependency types:

Success
A job will run only if all the jobs it depends on have succeeded.
Failed
A job will run only if all the jobs it depends on have failed.
Finished
A job will run only once all the jobs it depends on have completed.

If the system decides not to run a job because a dependency cannot be met, Batch marks the job as Failed. For example, if job1 depends on job2 with the dependency type Success and job2 fails, then job1 never runs and is considered to have failed. Otherwise, job failure and success are determined by the success or failure of the Pod associated with the job as defined by the Kubernetes Pod lifecycle.
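
For example, all three dependency types can be expressed from the command line with ksub's --dependency flag (the script names here are hypothetical):

```shell
#!/bin/sh
# Chain jobs with each documented dependency type (step1.sh etc. are hypothetical).
job1=$(ksub step1.sh)
ksub --dependency Success:${job1} -- on_success.sh    # runs only if job1 succeeds
ksub --dependency Failed:${job1} -- on_failure.sh     # runs only if job1 fails
ksub --dependency Finished:${job1} -- cleanup.sh      # runs once job1 completes either way
```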

Before running this sample job, you must set up a Google Cloud Filestore instance, in the same zone as your GKE cluster's node location, for inputs and outputs.

  1. Get and extract the Batch samples, admintools and usertools. Skip this step if you have done it for other sample jobs.

    git clone https://github.com/GoogleCloudPlatform/Kbatch.git
    
    tar -xzvf kbatch-github.tar.gz
    
  2. Create the default Batch admin resources in the "default" K8s namespace. Skip this step if you have done it for other sample jobs.

    ./samples/defaultresources/create.sh
    
  3. Change to the imageprocess folder.

    cd ../imageprocess
    
  4. Run apply-extra-config.sh to create the PersistentVolume resources and permissions. Type 'y' when asked if you can "run as root in BatchTasks and access storage."

    pushd ../userstorage
    ./apply-extra-config.sh
    popd
    
  5. Update the ksub config to use the PersistentVolumeClaim created in the previous step:

    ./ksub --config --add-volume fs-volume --volume-source PersistentVolumeClaim \
     --params claimName:[PVC_NAME] --params readOnly:false
    

    where [PVC_NAME] is the name of the PVC created in step 4. In this example, a PersistentVolumeClaim named pvc was created in step 4, so replace [PVC_NAME] with pvc.

  6. Run copy-input.sh to copy the input image to Filestore.

You can submit the job with ksub or kubectl.

ksub

Submitting a job with dependencies using ksub

There are two ways to submit a job with dependencies: specify the dependency with a single command or manually edit the KB Dependency Success: field in the shell script.

Specify dependencies with ksub

  1. Run the following command from samples/imageprocess to submit the second job with a dependency on the first, where [JOB_NAME] is the first job's name:

    ksub --dependency Success:[JOB_NAME] -- run_grey_with_ksub.sh
    

    You can also use shell variables to connect the jobs.

    Create the job1:

    job1=`ksub run_checkerboard_with_ksub.sh`
    

    Submit job2:

    ksub --dependency Success:${job1} -- ./run_grey_with_ksub.sh
    
  2. Get the task name:

    ksub -Ga [JOB_NAME]
    

    This command outputs the task name.

  3. View the logs:

    ksub -L [TASK_NAME]
    
  4. Run copy-output.sh to copy the processed image to your local machine.

Specify dependencies by editing the script

Use your first job's name to create the dependency manually.

  1. Submit ImageProcess Jobs:

    ksub run_checkerboard_with_ksub.sh
    

    This outputs the [JOB_NAME], for example:

    checkerboard-64t5n
    

    The following run_grey_with_ksub.sh describes a sample script for job2 with a dependency on job1:

    #!/bin/sh
    
    #KB Jobname grey-
    #KB Namespace default
    #KB Image gcr.io/kbatch-images/greyimage/greyimage:latest
    #KB Queuename default
    #KB MaxWallTime 5m
    #KB MinCpu 1.0
    #KB MinMemory 2Gi
    #KB Mount fs-volume /mnt/pv
    #KB Dependency Success:[JOB_NAME]
    
    echo "Starting job grey"
    # greyimage is in /app directory.
    cd /app
    ./greyimage -in=/mnt/pv/checker.png -out=/mnt/pv/checkergrey.png
    echo "Completed job grey"
    
  2. Open run_grey_with_ksub.sh with the editor of your choice and replace [JOB_NAME] with your job name.

  3. Submit the second job:

    ksub run_grey_with_ksub.sh
    

    This outputs the job name.

  4. Get the task name:

    ksub -Ga [JOB_NAME]
    

    This command outputs the task name.

  5. View the logs:

    ksub -L [TASK_NAME]
    
  6. Run copy-output.sh to copy the processed image to your local machine.

kubectl

Submitting a job with dependencies using kubectl

  1. Submit ImageProcess Jobs:

    kubectl create -f imageprocess-job.yaml
    

    The output is similar to this:

    batchjob.kbatch.k8s.io/checkerboard created
    batchjob.kbatch.k8s.io/grey created
    
  2. Examine the first job:

    kubectl describe batchjob/checkerboard
    
  3. Examine the second job:

    kubectl describe batchjob/grey
    
  4. Run copy-output.sh to copy the processed image to your local machine.

Running a job that uses GPUs

  1. Get and extract the Batch samples, admintools and usertools. Skip this step if you have done it for other sample jobs.

    git clone https://github.com/GoogleCloudPlatform/Kbatch.git
    
    tar -xzvf kbatch-github.tar.gz
    
  2. Create the default Batch admin resources in the "default" K8s namespace. Skip this step if you have done it for other sample jobs.

    ./samples/defaultresources/create.sh
    

You can submit the job with ksub or kubectl.

ksub

  1. Verify that samples/GPUjob/run_gpu_with_ksub.sh indicates a GPU type that is available in your cluster.

  2. Submit the job:

    ksub samples/GPUjob/run_gpu_with_ksub.sh
    

    This outputs the job name.

  3. Wait for the job to complete:

    ksub -Gw [JOB_NAME]
    
  4. Get the task name:

    ksub -Ga [JOB_NAME]
    

    This command outputs the task name.

  5. View the logs:

    ksub -L [TASK_NAME]
    

kubectl

  1. Verify the GPU shown in the gpu-job.yaml file matches a GPU type that is available in your autoscaler zone.

  2. Submit the job:

    kubectl create -f samples/GPUjob/gpu-job.yaml
    

    The output is similar to:

    batchjob.kbatch.k8s.io/[JOB_NAME] created
    
  3. View the job status:

    kubectl describe batchjob/[JOB_NAME]
    

Running array jobs

An array job is a group of tasks that share the same container image and are differentiated by the array index. By specifying taskCount in IndexSpec, a job can generate up to one thousand tasks.
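
Each task in an array job typically uses its index to pick its own slice of the work. Below is a minimal sketch of such a task script; the TASK_INDEX environment variable is a hypothetical stand-in, since this page does not show how Batch exposes the index inside the container:

```shell
#!/bin/sh
# Hypothetical: TASK_INDEX stands in for however your setup exposes the array index.
: "${TASK_INDEX:=0}"    # default to 0 so the sketch runs standalone
IN="/mnt/pv/array-image-data/input-${TASK_INDEX}.png"
OUT="/mnt/pv/array-image-data/output-${TASK_INDEX}.png"
echo "task ${TASK_INDEX}: ${IN} -> ${OUT}"
```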

To run the array job example, you must set up a Google Cloud Filestore instance in the same zone as your GKE cluster's node location for inputs and outputs. Once your Filestore instance is created, perform the following steps:

  1. Get and extract the Batch samples, admintools and usertools. Skip this step if you have done it for other sample jobs.

    git clone https://github.com/GoogleCloudPlatform/Kbatch.git
    
    tar -xzvf kbatch-github.tar.gz
    
  2. Create the default Batch admin resources in the "default" K8s namespace. Skip this step if you have done it for other sample jobs.

    ./samples/defaultresources/create.sh
    
  3. Change to the arrayjob folder.

      cd ../arrayjob
      

  4. Run setup.sh to create the PersistentVolume resources and permissions. You need to input the Filestore instance IP, zone, and volume name.

    ./setup.sh
    

    Type 'y' when asked if you can "run as root in BatchTasks and access storage."

  5. Update the ksub config to use the PersistentVolumeClaim created in the previous step:

      ./ksub --config --add-volume fs-volume --volume-source PersistentVolumeClaim \
      --params claimName:[PVC_NAME] --params readOnly:false

    where [PVC_NAME] is the name of the PVC created in step 4. In this example, a PersistentVolumeClaim named pvc was created in step 4, so replace [PVC_NAME] with pvc.

  6. Run copy-array-input.sh to copy the input images to Filestore.

    ./copy-array-input.sh
    

    The input files are copied into the array-image-data directory under your PVC.

  7. Use ksub to submit your array job:

    ksub ./run_array_image_ksub.sh
    

    Alternatively, you can submit the array job from a YAML file: kubectl create -f array-image.yaml

  8. Check the array job status by the job name generated in the previous step:

    ksub -Ga [JOB_NAME] --output=wide
    
    or
    
    ksub -G [JOB_NAME] --output=describe
    

    Once the job has finished successfully, the output files are saved in the array-image-data directory under your PVC.

    To copy the output data into your local machine, run the following command:

    ./copy-array-output.sh
    

    To validate, go to the array-image-data directory. Each input file should have a corresponding output file.

Submitting jobs

You can submit jobs with ksub or kubectl.

Using ksub

Ksub allows you to submit scripts as jobs. You can use keywords prefixed with #KB to specify job properties.

The following run_pi_with_ksub.sh describes a sample ksub job:

#!/bin/sh

# Keywords to specify job parameters

#KB Jobname pi-
#KB Namespace default
#KB Image gcr.io/kbatch-images/generate-pi/generate-pi:latest
#KB Queuename default
#KB MaxWallTime 5m
#KB MinCpu 1.0
#KB MinMemory 2Gi

echo "Starting job pi"
# pi is in /app directory.
cd /app
./pi
echo "Completed job pi"

To submit the script, run the following command:

ksub run_pi_with_ksub.sh

Specifying ksub Keywords

Specify your job's parameters with keywords. These keywords are prefixed with #KB. Ksub expects keywords in a contiguous block of lines, without blank lines or leading spaces between them, and stops parsing #KB keywords at the first line that doesn't start with #KB.
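
You can see the effect of this rule with a small local experiment (demo.sh is a throwaway file created purely for illustration). The sed filter mimics ksub's rule: starting at line 2, print lines while they begin with #KB and stop at the first line that doesn't:

```shell
#!/bin/sh
# Create a sample script whose keyword block is interrupted by a blank line.
cat > demo.sh <<'EOF'
#!/bin/sh
#KB Jobname demo-
#KB Queuename default

#KB MinCpu 1.0
echo "payload"
EOF
# From line 2 onward: quit at the first line not starting with "#KB ", else print.
parsed=$(sed -n '2,${/^#KB /!q; p;}' demo.sh)
echo "$parsed"
```

Only Jobname and Queuename are parsed; MinCpu, coming after the blank line, is ignored.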

Ksub supports the following keywords:

Keyword      Comment                                                       Example
Jobname      Job name prefix used to generate the job name.                #KB Jobname pi-
Namespace    Namespace the job operates in.                                #KB Namespace default
Queuename    Queue the job is submitted to.                                #KB Queuename default
Image        Image that runs the job container.                            #KB Image ubuntu
Mount        PVC to mount and the location where it should be mounted.     #KB Mount fs-volume /tmp
MinCpu       Number of CPUs the job requires.                              #KB MinCpu 1.0
MinMemory    Amount of memory required by the container.                   #KB MinMemory 2Gi
Gpu          Number and type of GPUs required for the job. In the example, #KB GPU nvidia-tesla-k80 2
             nvidia-tesla-k80 is the GPU type and 2 is the number of GPUs.
Dependency   Dependencies of the job.                                      #KB Dependency Success:job-name1
MaxWallTime  Maximum run time of the job.                                  #KB MaxWallTime 5m
TaskCount    Task count for an array job (maximum: 1000).                  #KB TaskCount 10

Using kubectl

Kubectl connects to the Batch system using the Kubernetes configuration for the cluster.

The following pi-job.yaml describes a sample YAML job:

apiVersion: kbatch.k8s.io/v1beta1
kind: BatchJob
metadata:
  generateName: pi-  # generateName allows the system to generate a random name, using this prefix, for the BatchJob upon creation.
  namespace: default
spec:
  batchQueueName: default
  taskGroups:
  - name: main
    maxWallTime: 5m
    template:
      spec:
        containers:
        - name: pi
          # This image has been made public so it can be pulled from any project.
          image: gcr.io/kbatch-images/generate-pi/generate-pi:latest
          resources:
            requests:
              cpu: 1.0
              memory: 2Gi
            limits:
              cpu: 1.0
              memory: 2Gi
          imagePullPolicy: IfNotPresent
        restartPolicy: Never

To submit the job, run the following command:

kubectl create -f pi-job.yaml
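
After submission, you can check on the BatchJob with the same kubectl verbs used elsewhere on this page:

```shell
# List BatchJobs in the namespace, then inspect one in detail.
kubectl get batchjobs --namespace default
kubectl describe batchjob/[JOB_NAME] --namespace default
```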

Managing data

  1. Get the user tools:

    git clone https://github.com/GoogleCloudPlatform/Kbatch.git
    
  2. Change to the filestore directory:

    cd usertools/filestore
    

Batch on GKE provides a utility for you to copy files into/from a Kubernetes PersistentVolume.

The basic usage of the script is:

./datacopy.sh [-d|-u] -l [LOCAL_FILE] -r [REMOTE_FILE_PATH] -p [PVC_NAME]

Where:

  • -u copies data from your workstation to the Cloud.
  • -d copies data from the Cloud to your workstation.
  • -l specifies the local file path.
  • -r specifies the remote file path in the PersistentVolume.
  • -p specifies the name of the PersistentVolumeClaim.
  • -h prints a helpful usage message.

If you want to run a job that uses the input file input.dat from the current directory on your local machine, you can run the following command to copy the input file to your personal Batch directory:

./datacopy.sh -u -l input.dat -r problem-1-input.data -p [NAME]-team1
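
The download direction mirrors the upload. For example, to copy a result file back to your workstation (the remote file name here is hypothetical):

```shell
# Copy a result file from the PersistentVolume back to the local machine.
./datacopy.sh -d -l output.dat -r problem-1-output.data -p [NAME]-team1
```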

Viewing jobs

You can view jobs by using ksub or kubectl.

ksub

Viewing jobs by user in a queue

View the jobs by user by running the following command:

ksub -Q -n [NAMESPACE] [QUEUE_NAME]

Where [NAMESPACE] is your namespace and [QUEUE_NAME] is your Queue name.

The output is similar to:

Name: pi-s4dwl, Status: Succeeded

Viewing jobs in a queue

View the jobs in a queue by running the following command:

ksub -Qa -n [NAMESPACE] [QUEUE_NAME]

Where [NAMESPACE] is your namespace and [QUEUE_NAME] is your Queue name.

The output is similar to:

Name: pi-s4dwl, Creation Time Stamp: 2019-09-12 13:03:42 -0700 PDT, Status: Succeeded

kubectl

Viewing jobs in a queue

View the jobs in a queue by running the following command:

kubectl get batchjobs --selector=batchQueue=[QUEUE_NAME] --namespace [NAMESPACE]

Where [QUEUE_NAME] is your Queue name and [NAMESPACE] is your namespace.

The output is similar to:

NAME       AGE
pi-6rc7s   2m

Viewing jobs by user

View jobs by user by performing the following instructions:

  1. Retrieve your username:

    gcloud config get-value account
    

    The output is similar to:

    user@company.com
    
  2. Run the following command to view a list of jobs by a user in a Namespace:

    kubectl get batchjobs --selector=submittedBy=[userATexample.com] --namespace [NAMESPACE]
    

    Where [userATexample.com] is your email address with @ spelled out as AT (the @ character is not allowed in Kubernetes label values), and [NAMESPACE] is your namespace.

    The output is similar to:

    NAME       AGE
    pi-6rc7s   36m
    
  3. Run the following command to view a list of jobs by a user in a Queue:

    kubectl get batchjobs --selector=batchQueue=[QUEUE_NAME],submittedBy=[userATexample.com] --namespace [NAMESPACE]
    

    Where [QUEUE_NAME] is your Queue name, [userATexample.com] is your email address with @ spelled out as AT, and [NAMESPACE] is your namespace.

    The output is similar to:

    NAME       AGE
    pi-6rc7s   36m
    

Stopping jobs

You can stop a running job by using ksub or kubectl. The job is marked as "Failed" with the condition "JobTerminationByUser", while the historical data associated with the job is preserved.

ksub

Run the following command to terminate the job:

ksub -T [JOB_NAME] -n [NAMESPACE]

where [JOB_NAME] is your Job name and [NAMESPACE] is your namespace.

The output is similar to this:

Termination request for job [JOB_NAME] is sent to the server, please check the job status

You can also run the following command to watch the job until it completes:

ksub -Gw [JOB_NAME] -n [NAMESPACE]

kubectl

To terminate a running job, run the following command:

kubectl patch batchjob [JOB_NAME] --namespace [NAMESPACE] --type merge --patch '{"spec": {"userCommand": "Terminate"}}'

where [JOB_NAME] is your Job name and [NAMESPACE] is your namespace.

The output is similar to this:

batchjob.kbatch.k8s.io/[JOB_NAME] patched
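
You can then confirm that the termination took effect by describing the job and checking for the "JobTerminationByUser" condition mentioned above:

```shell
# Inspect the job's status; its conditions should include JobTerminationByUser.
kubectl describe batchjob/[JOB_NAME] --namespace [NAMESPACE]
```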

Viewing logs for jobs

Job logs can be viewed only after the job has started executing, that is, after the job has moved from the queued state to a running state. You must wait for the job to start before viewing its logs.

To get a job's log, run the following command:

ksub -L -n [NAMESPACE] [JOB_NAME]

Where [NAMESPACE] is your namespace, and [JOB_NAME] is your job name.

Troubleshooting

If a BatchTask stays in the "Ready" phase for a long time, you or your cluster admin can go to the "Workloads" tab in the Cloud Console to check the corresponding Pod.

There is a known issue in GKE 1.14 where a Pod can get stuck in the "ContainerCreating" stage due to network setup issues. In this case, you can terminate the job and resubmit it. This issue will be fixed in the GKE 1.15 release.

Due to new restrictions around labels in GKE, please use v0.9.1 or newer of Batch on GKE to ensure jobs run properly.

What's next