Adjust job timeout values

If jobs are timing out and you believe the timeouts aren't caused by an underlying problem with your installation, you can increase the timeout interval. This document describes how to adjust the timeout intervals for machine jobs and batch jobs by adding annotations to the cluster config file.

GKE on Bare Metal commands and routines fall into two categories: machine jobs and batch jobs. Many factors affect how long a job takes to complete, such as hardware configuration, network configuration, and cluster configuration. The default timeouts in GKE on Bare Metal are intended to accommodate typical installations.

Here are a couple of examples of where you can find job timeout error messages:

  • Machine job timeout error message (wrapped for clarity) from a preflight log, for example, bmctl-workspace/cluster1/logs/preflight-20210501-000426/172.18.0.4:

    Pod:172.18.0.4-machine-preflf3a32c8a2f7a2449545c7e8ff954c961-652st
    Result:Failed Reason:DeadlineExceeded Time:Wed Feb  3 16:59:56 2021
    
  • Output from kubectl logs for a failed pod may show a similar DeadlineExceeded message (wrapped):

    cluster-cluster1
    172.18.0.4-machine-preflf3a32c8a2f7a2449545c7e8ff954c961-652st  ●  0/1
    0 DeadlineExceeded  192.168.122.180  bmctl-control-plane  7m12
    

Adjusting the machine job timeout interval

A machine job is a routine that runs on one machine only, such as a preflight check that is confined to a single machine. GKE on Bare Metal machine jobs have a default timeout of 900 seconds (15 minutes). To adjust the machine job timeout interval, add the baremetal.cluster.gke.io/machine-job-deadline-seconds annotation to the cluster config file.

The following example sets the machine job timeout interval to 1800 seconds (30 minutes):

   apiVersion: baremetal.cluster.gke.io/v1
   kind: Cluster
   metadata:
     name: cluster1
     namespace: cluster-cluster1
     annotations:
       baremetal.cluster.gke.io/machine-job-deadline-seconds: "1800"
   spec:
   ...

The timeout interval you set is applied when you create a new cluster with bmctl create cluster or upgrade an existing cluster with bmctl upgrade cluster. The new interval is used for all single-machine jobs, including bmctl check preflight, bmctl check -c <cluster-name>, and more.

Adjusting the batch job timeout interval

A batch job is a routine that runs across multiple machines, such as a network preflight check. The default timeout interval for GKE on Bare Metal batch jobs depends on the number of machines in the network: 900 seconds plus an additional 20 seconds for each machine. For example, if your batch job runs on 60 machines, the default timeout interval is 2100 seconds, or 35 minutes (900 + 20 * 60 = 2100). To adjust the batch job timeout interval, add the baremetal.cluster.gke.io/batch-job-deadline-seconds annotation to the cluster config file.

The following example sets the batch job timeout interval to 10800 seconds (3 hours):

   apiVersion: baremetal.cluster.gke.io/v1
   kind: Cluster
   metadata:
     name: cluster1
     namespace: cluster-cluster1
     annotations:
       baremetal.cluster.gke.io/batch-job-deadline-seconds: "10800"
   spec:
   ...

The timeout interval you set is applied when you create a new cluster with bmctl create cluster or upgrade an existing cluster with bmctl upgrade cluster.
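
If both machine jobs and batch jobs are timing out, it may be convenient to set both intervals in one place. The following is a minimal sketch that assumes the two annotations can be set together on the same Cluster resource, which this document does not state explicitly. The values reuse those from the preceding examples and are illustrative, not recommendations:

   apiVersion: baremetal.cluster.gke.io/v1
   kind: Cluster
   metadata:
     name: cluster1
     namespace: cluster-cluster1
     annotations:
       # Assumption: both annotations can coexist on one Cluster resource.
       # Machine jobs: 1800 seconds (30 minutes); the default is 900 seconds.
       baremetal.cluster.gke.io/machine-job-deadline-seconds: "1800"
       # Batch jobs: 10800 seconds (3 hours); the default is 900 seconds plus 20 seconds per machine.
       baremetal.cluster.gke.io/batch-job-deadline-seconds: "10800"
   spec:
   ...

As in the earlier examples, each value is a quoted string, because Kubernetes annotation values must be strings.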