Running a job

This page explains how to run Jobs in Google Kubernetes Engine (GKE).

Overview

In GKE, a Job is a controller object that represents a finite task. Jobs differ from other controller objects in that Jobs manage the task as it runs to completion, rather than managing an ongoing desired state (such as the total number of running Pods).

Jobs are useful for large computation and batch-oriented tasks. Jobs can be used to support parallel execution of Pods. You can use a Job to run independent but related work items in parallel: sending emails, rendering frames, transcoding files, scanning database keys, etc. However, Jobs are not designed for closely-communicating parallel processes such as continuous streams of background processes.

In GKE, there are two types of Jobs:

  • Non-parallel Job: A Job which creates only one Pod (which is re-created if the Pod terminates unsuccessfully), and which is completed when the Pod terminates successfully.
  • Parallel jobs with a completion count: A Job that is completed when a certain number of Pods terminate successfully. You specify the desired number of completions using the completions field.

Jobs are represented by Kubernetes Job objects. When a Job is created, the Job controller creates one or more Pods and ensures that its Pods terminate successfully. As its Pods terminate, a Job tracks how many Pods completed their tasks successfully. Once the desired number of successful completions is reached, the Job is complete.

Similar to other controllers, a Job controller creates a new Pod if one of its Pods fails or is deleted.

Before you begin

Before you start, make sure you have performed the following tasks:

  • Ensure that you have enabled the Google Kubernetes Engine API.
  • Ensure that you have installed the Cloud SDK.
  • Set up default gcloud command-line tool settings for your project by using one of the following methods:
    • Use gcloud init, if you want to be walked through setting project defaults.
    • Use gcloud config, to individually set your project ID, zone, and region.

    gcloud init

    1. Run gcloud init and follow the directions:

      gcloud init

      If you are using SSH on a remote server, use the --console-only flag to prevent the command from launching a browser:

      gcloud init --console-only
    2. Follow the instructions to authorize the gcloud tool to use your Google Cloud account.
    3. Create a new configuration or select an existing one.
    4. Choose a Google Cloud project.
    5. Choose a default Compute Engine zone.
    6. Choose a default Compute Engine region.

    gcloud config

    1. Set your default project ID:
      gcloud config set project PROJECT_ID
    2. Set your default Compute Engine region (for example, us-central1):
      gcloud config set compute/region COMPUTE_REGION
    3. Set your default Compute Engine zone (for example, us-central1-c):
      gcloud config set compute/zone COMPUTE_ZONE
    4. Update gcloud to the latest version:
      gcloud components update

    By setting default locations, you can avoid errors in the gcloud tool like the following: One of [--zone, --region] must be supplied: Please specify location.

Creating a Job

You can create a Job using kubectl apply with a manifest file.

The following example shows a Job manifest:

apiVersion: batch/v1
kind: Job
metadata:
  # Unique key of the Job instance
  name: example-job
spec:
  template:
    metadata:
      name: example-job
    spec:
      containers:
      - name: pi
        image: perl
        command: ["perl"]
        args: ["-Mbignum=bpi", "-wle", "print bpi(2000)"]
      # Do not restart containers after they exit
      restartPolicy: Never
  # Number of retries before the Job is marked as failed
  backoffLimit: 4

Copy the manifest to a file named config.yaml, and create the Job:

kubectl apply -f config.yaml

This Job computes pi to 2000 places then prints it.
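To see the result, you can list the Job's Pods by their job-name label (added automatically by the Job controller) and read the logs of the completed Pod. POD_NAME below is a placeholder for the name returned by the first command:

```shell
# List the Pods created by the Job
kubectl get pods --selector=job-name=example-job

# Print the Pod's output (the computed digits of pi)
kubectl logs POD_NAME
```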

The only required field in a Job spec is the Pod template.

Job completion count

A Job is completed when a specific number of Pods terminate successfully. By default, a non-parallel Job with a single Pod completes as soon as the Pod terminates successfully.

If you have a parallel Job, you can set a completion count using the optional completions field. This field specifies how many Pods should terminate successfully before the Job is complete. The completions field accepts a non-zero, positive value.

Omitting the completions field causes the success of any Pod to signal the success of all Pods.

Copy config.yaml from the preceding example to a file named config-2.yaml. In config-2.yaml, change name to example-job-2, and add completions: 8 to the Job's spec field. This specifies that there should be eight successful completions:

apiVersion: batch/v1
kind: Job
metadata:
  name: example-job-2
spec:
  completions: 8
  template:
    metadata:
      name: example-job-2
    spec:
      ...

Create the Job:

kubectl apply -f config-2.yaml

The default value of completions is 1. The parallelism field also defaults to 1, so if neither field is set, the Job runs a single Pod to completion.
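While the Job runs, you can watch the completion count advance; for example:

```shell
# The COMPLETIONS column shows succeeded/desired counts, such as 3/8
kubectl get job example-job-2

# Or read the number of successful Pods directly from the Job's status
kubectl get job example-job-2 --output=jsonpath='{.status.succeeded}'
```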

Managing parallelism

By default, Job Pods do not run in parallel. The optional parallelism field specifies the maximum desired number of Pods a Job should run concurrently at any given time.

The actual number of Pods running in a steady state might be less than the parallelism value if the remaining work is less than the parallelism value. If you have also set completions, the actual number of Pods running in parallel does not exceed the number of remaining completions. A Job may throttle Pod creation in response to excessive Pod creation failure.

Copy config.yaml from the preceding example to a file named config-3.yaml. In config-3.yaml, change name to example-job-3, and add parallelism: 5 to the Job's spec field. This specifies that there should be five concurrent Pods running:

apiVersion: batch/v1
kind: Job
metadata:
  name: example-job-3
spec:
  parallelism: 5
  template:
    metadata:
      name: example-job-3
    spec:
      ...

Create the Job:

kubectl apply -f config-3.yaml

The default value of parallelism is 1 if the field is omitted. If the value is set to 0, the Job is paused until the value is increased.
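The completions and parallelism fields can be combined. The following manifest is an illustrative sketch based on the earlier pi example (the name example-job-4 is not used elsewhere on this page); it runs eight completions with at most two Pods at a time:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: example-job-4
spec:
  completions: 8    # eight Pods must terminate successfully
  parallelism: 2    # at most two Pods run concurrently
  template:
    spec:
      containers:
      - name: pi
        image: perl
        command: ["perl"]
        args: ["-Mbignum=bpi", "-wle", "print bpi(2000)"]
      restartPolicy: Never
```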

Specifying retries

By default, a Job runs uninterrupted unless a Pod fails, at which point the Job defers to the backoffLimit. The backoffLimit field specifies the number of retries before the Job is marked as failed; the default value is 6. The limit applies to the Job as a whole: failed Pods are counted across all of the Job's Pods, so when parallelism is greater than 1, failures from any Pod count toward the limit. Once the backoffLimit is reached, the Job is marked as failed and any running Pods are terminated.

For example, the Job shown earlier sets the number of retries to 4:

apiVersion: batch/v1
kind: Job
metadata:
  name: example-job
spec:
  template:
    metadata:
      name: example-job
    spec:
      containers:
      ...
  backoffLimit: 4
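To see backoffLimit in action, you can create a Job that always fails. This is an illustrative manifest, not part of the examples above: the container below exits immediately with a non-zero code, the Job retries with exponential back-off, and after two retries the Job is marked as Failed:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  # Illustrative name for this demonstration
  name: example-failing-job
spec:
  # Mark the Job as failed after two retries
  backoffLimit: 2
  template:
    spec:
      containers:
      - name: fail
        image: busybox
        command: ["sh", "-c", "exit 1"]   # always fails
      restartPolicy: Never
```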

Pod replacement

A Job recreates Pods, honoring the backoffLimit, when the current Pod is considered failed. This happens in scenarios such as:

  • A Pod container exits with a non-zero exit code.
  • When a node is rebooted, the kubelet may mark the Pod as Failed after the reboot.

Under certain scenarios, a Job that has not completed replaces the Pod without counting it against the backoffLimit, such as:

  • Manually deleting a Pod does not set the Pod phase to Failed. The replacement Pod may be created even before the current Pod's termination grace period has elapsed.
  • When a Node is drained (manually or during auto-upgrade), the Pod is terminated honoring a drain grace period and is replaced.
  • When a Node is deleted, the Pod is garbage collected (marked as deleted) and is replaced.

Specifying a deadline

By default, a Job creates new Pods forever if its Pods fail continuously. If you prefer not to have a Job retry forever, you can specify a deadline value using the optional .spec.activeDeadlineSeconds field of the Job.

A deadline grants a Job a specific amount of time, in seconds, to complete its tasks successfully before terminating. The activeDeadlineSeconds value is relative to the startTime of the Job, and applies to the duration of the Job, no matter how many Pods are created.

To specify a deadline, add the activeDeadlineSeconds value to the Job's spec field in the manifest file. For example, the following configuration specifies that the Job has 100 seconds to complete successfully:

apiVersion: batch/v1
kind: Job
metadata:
  name: example-job
spec:
  activeDeadlineSeconds: 100
  template:
    metadata:
      name: example-job
    spec:
      ...

If a Job does not complete successfully before the deadline, the Job ends with the status DeadlineExceeded. This causes the creation of Pods to stop and causes existing Pods to be deleted.
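You can confirm why a Job ended by reading its Failed condition; for example:

```shell
# Prints DeadlineExceeded if the Job missed its deadline
kubectl get job example-job \
  --output=jsonpath='{.status.conditions[?(@.type=="Failed")].reason}'
```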

Specifying a Pod selector

Manually specifying a selector is useful if you want to update a Job's Pod template, but you want the Job's current Pods to run under the updated Job.

A Job is instantiated with a selector field. When you create a Job, the Job controller generates a selector for the Job's Pods that is unique and does not overlap with any other Job. Generally, you do not set this field yourself: a selector value that overlaps with another Job can cause issues with Pods in the other Job. To set the field yourself, you must specify manualSelector: true in the Job's spec field.

For example, you can run kubectl get job my-job --output=yaml to see the Job's specification, which contains the selector generated for its Pods:

kind: Job
metadata:
  name: my-job
...
spec:
  completions: 1
  parallelism: 1
  selector:
    matchLabels:
      controller-uid: a8f3d00d-c6d2-11e5-9f87-42010af00002
...

When you create a new Job, you can set manualSelector to true, then set the selector field's controller-uid value like the following:

kind: Job
metadata:
  name: my-new-job
  ...
spec:
  manualSelector: true
  selector:
    matchLabels:
      controller-uid: a8f3d00d-c6d2-11e5-9f87-42010af00002
  ...

Pods created by my-new-job carry the existing controller-uid label, so the new Job selects the same set of Pods as the original Job.
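To copy an existing Job's generated selector into a new manifest, you can read it with jsonpath; for example:

```shell
# Print the controller-uid value generated for my-job's Pods
kubectl get job my-job \
  --output=jsonpath='{.spec.selector.matchLabels.controller-uid}'
```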

Inspecting a Job

kubectl

To check a Job's status, run the following command:

kubectl describe job my-job

To view all Pod resources in your cluster, including Pods created by the Job which have completed, run:

kubectl get pods -a

The -a (--show-all) flag specifies that all resources of the type specified (in this case, Pods) should be shown, including completed Pods. In newer versions of kubectl, this flag has been removed, and completed Pods are shown by default.
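You can also read a Job's logs without looking up individual Pod names, and block until the Job finishes; for example:

```shell
# Print logs from one of the Job's Pods
kubectl logs job/my-job

# Wait up to five minutes for the Job to complete
kubectl wait --for=condition=complete job/my-job --timeout=300s
```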

Console

After creating a Job using kubectl, you can inspect it by performing the following steps:

  1. Go to the Workloads page in Cloud Console.


  2. In the workloads list, click the name of the Job you want to inspect.

  3. On the Job details page, do any of the following:

    • Click the Revision History tab to see the Job's revision history.
    • Click the Events tab to see all events related to the Job.
    • Click the Logs tab to see all container logs related to the Job.
    • Click the YAML tab to see, copy, and download the Job's live configuration.

Deleting a Job

When a Job completes, the Job stops creating Pods. The Job API object is not removed when it completes, which allows you to view its status. Pods created by the Job are not deleted, but they are terminated. Retention of the Pods allows you to view their logs and to interact with them.
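If you prefer that finished Jobs and their Pods be cleaned up automatically instead, Kubernetes also supports the ttlSecondsAfterFinished field (stable in Kubernetes 1.23; availability depends on your cluster version). A sketch:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: example-job
spec:
  # Delete the Job and its Pods 120 seconds after the Job finishes
  ttlSecondsAfterFinished: 120
  template:
    spec:
      ...
```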

kubectl

To delete a Job, run the following command:

kubectl delete job my-job

When you delete a Job, all of its Pods are also deleted.

To delete a Job but retain its Pods, specify the --cascade=orphan flag (in older versions of kubectl, --cascade=false):

kubectl delete jobs my-job --cascade=orphan

Console

To delete a Job, perform the following steps:

  1. Go to the Workloads page in Cloud Console.


  2. In the workloads list, select one or more Jobs you want to delete.

  3. Click Delete.

  4. When prompted to confirm, click Delete.

What's next