Adjust job timeout values

If jobs in GKE on Bare Metal time out and you believe the behavior isn't due to an underlying problem with your installation, you can increase the timeout interval. This document describes how to adjust the timeout intervals for machine jobs and batch jobs by using annotations in the config spec.

If you need additional assistance, reach out to Cloud Customer Care.

Job types and errors

GKE on Bare Metal commands and routines run as one of two job types: machine jobs and batch jobs. Several factors affect how long a job takes to complete, such as hardware configuration, network configuration, and cluster configuration. The default timeouts in GKE on Bare Metal are intended to accommodate typical installations.

The following are example job timeout error messages:

  • A machine job timeout error message (wrapped for clarity) from a preflight log like bmctl-workspace/cluster1/logs/preflight-20210501-000426/172.18.0.4:

    Pod:172.18.0.4-machine-preflf3a32c8a2f7a2449545c7e8ff954c961-652st
    Result:Failed Reason:DeadlineExceeded Time:Wed Feb  3 16:59:56 2021
    
  • Output from kubectl logs for a failed Pod might show a similar DeadlineExceeded message (wrapped):

    cluster-cluster1
    172.18.0.4-machine-preflf3a32c8a2f7a2449545c7e8ff954c961-652st  ●  0/1
    0 DeadlineExceeded  192.168.122.180  bmctl-control-plane  7m12
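
To check a preflight log for timeouts like the ones above, the following is a minimal Python sketch. The function name find_timeouts and the regular expression are illustrative assumptions based only on the sample log lines shown in this document; the actual log format might differ between versions.

```python
import re

# Pattern derived from the sample preflight log line in this document.
# This is an assumption: real log lines may be formatted differently.
RESULT_RE = re.compile(r"Result:(?P<result>\S+)\s+Reason:(?P<reason>\S+)")


def find_timeouts(log_text):
    """Return the lines in log_text whose Reason field is DeadlineExceeded."""
    hits = []
    for line in log_text.splitlines():
        match = RESULT_RE.search(line)
        if match and match.group("reason") == "DeadlineExceeded":
            hits.append(line.strip())
    return hits
```

You could pass the contents of a log file, read with open(path).read(), to find_timeouts and treat any returned lines as candidates for a longer timeout interval.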
    

Adjust the machine job timeout interval

A machine job is a routine that runs on one machine only, like a preflight check that is confined to a single machine. GKE on Bare Metal machine jobs have a default timeout of 900 seconds, or 15 minutes. You can adjust the machine job timeout interval with the baremetal.cluster.gke.io/machine-job-deadline-seconds annotation in the cluster config file.

The following example sets the machine job timeout interval to 1,800 seconds, or 30 minutes:

   apiVersion: baremetal.cluster.gke.io/v1
   kind: Cluster
   metadata:
     name: cluster1
     namespace: cluster-cluster1
     annotations:
       baremetal.cluster.gke.io/machine-job-deadline-seconds: "1800"
   spec:
   ...

Your timeout interval value is applied when you create new clusters with bmctl create cluster or when you upgrade existing clusters with bmctl upgrade cluster. The new interval is used for all single-machine jobs, including those run by bmctl check preflight and bmctl check -c CLUSTER_NAME.

Adjust the batch job timeout interval

A batch job is a routine that runs across multiple machines, like a network preflight check. The default timeout interval for GKE on Bare Metal batch jobs depends on the number of machines in the network: 900 seconds, plus an additional 20 seconds for each machine.

For example, if your batch job runs on 60 machines, the default timeout interval is 2,100 seconds (900 + (20 * 60) = 2100), or 35 minutes.
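
The default calculation can be sketched in Python as follows. The function and constant names are hypothetical and used only for illustration; the values 900 and 20 come from the rule described above.

```python
BASE_TIMEOUT_SECONDS = 900   # fixed part of the default batch job timeout
PER_MACHINE_SECONDS = 20     # additional seconds for each machine


def default_batch_timeout(machine_count):
    """Return the default batch job timeout, in seconds, for machine_count machines."""
    if machine_count < 0:
        raise ValueError("machine_count must not be negative")
    return BASE_TIMEOUT_SECONDS + PER_MACHINE_SECONDS * machine_count


print(default_batch_timeout(60))  # prints 2100
```

Comparing this default with how long your batch jobs actually run can help you decide whether a larger interval is needed.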

You can adjust the batch job timeout interval with the baremetal.cluster.gke.io/batch-job-deadline-seconds annotation in the cluster config file.

The following example sets the batch job timeout interval to 10,800 seconds, or 3 hours:

   apiVersion: baremetal.cluster.gke.io/v1
   kind: Cluster
   metadata:
     name: cluster1
     namespace: cluster-cluster1
     annotations:
       baremetal.cluster.gke.io/batch-job-deadline-seconds: "10800"
   spec:
   ...

Your timeout interval value is applied when you create new clusters with bmctl create cluster or when you upgrade existing clusters with bmctl upgrade cluster.
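
Both annotations take the interval as a quoted string of seconds, because Kubernetes annotation values are strings. The following Python sketch converts a value in minutes into the matching annotation entry. The helper name deadline_annotation and its job_type labels are assumptions for illustration only; the annotation keys are the ones described in this document.

```python
ANNOTATION_KEYS = {
    "machine": "baremetal.cluster.gke.io/machine-job-deadline-seconds",
    "batch": "baremetal.cluster.gke.io/batch-job-deadline-seconds",
}


def deadline_annotation(job_type, minutes):
    """Return a one-entry dict mapping the annotation key to seconds as a string."""
    if job_type not in ANNOTATION_KEYS:
        raise ValueError("job_type must be 'machine' or 'batch'")
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    # Annotation values are strings, so the number of seconds is converted.
    return {ANNOTATION_KEYS[job_type]: str(minutes * 60)}


print(deadline_annotation("batch", 180))
```

For 180 minutes, the returned value is "10800", which matches the batch job example in this section.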
