Troubleshoot the Kubernetes scheduler

This page shows you how to resolve issues with the Kubernetes scheduler (kube-scheduler) for Google Distributed Cloud Virtual for Bare Metal.

If you need additional assistance, reach out to Cloud Customer Care.

Kubernetes always schedules Pods to the same set of nodes

This issue might be observed in a few different ways:

  • Unbalanced cluster utilization. You can inspect cluster utilization for each Node with the kubectl top nodes command. The following exaggerated example output shows pronounced utilization on certain Nodes:

    NAME                   CPU(cores)   CPU%      MEMORY(bytes)   MEMORY%
    XXX.gke.internal       222m         101%       3237Mi          61%
    YYY.gke.internal       91m          0%         2217Mi          0%
    ZZZ.gke.internal       512m         0%         8214Mi          0%
    
  • Too many requests. If you schedule a lot of Pods at once onto the same Node and those Pods make HTTP requests, it's possible for the Node to be rate limited. The common error returned by the server in this scenario is 429 Too Many Requests.

  • Service unavailable. For example, a web server hosted on a Node under high load might respond to all requests with 503 Service Unavailable errors until the load decreases.

To check if you have Pods that are always scheduled to the same nodes, use the following steps:

  1. Run the following kubectl command to view the status of the Pods:

    kubectl get pods -o wide -n default
    

    To see the distribution of Pods across Nodes, check the NODE column in the output. In the following example output, all of the Pods are scheduled on the same Node:

    NAME                               READY  STATUS   RESTARTS  AGE  IP             NODE
    nginx-deployment-84c6674589-cxp55  1/1    Running  0         55s  10.20.152.138  10.128.226.44
    nginx-deployment-84c6674589-hzmnn  1/1    Running  0         55s  10.20.155.70   10.128.226.44
    nginx-deployment-84c6674589-vq4l2  1/1    Running  0         55s  10.20.225.7    10.128.226.44
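
To summarize this distribution at a glance, you can count the Pods per Node with a shell one-liner. The following sketch assumes the default namespace from the previous step and parses the NODE column (the seventh column) of the kubectl get pods -o wide output:

kubectl get pods -o wide -n default --no-headers | awk '{print $7}' | sort | uniq -c | sort -rn

Each line of the output shows a count of Pods followed by the Node name; a single Node with a high count confirms the problem.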
    

Pods have a number of features that allow you to fine-tune their scheduling behavior. These features include topology spread constraints and anti-affinity rules. You can use one, or a combination, of these features. The requirements you define are ANDed together by kube-scheduler.

The scheduler logs aren't captured at the default logging verbosity level. If you need the scheduler logs for troubleshooting, complete the following steps to capture them:

  1. Increase the logging verbosity level:

    1. Edit the kube-scheduler Deployment:

      kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG edit deployment kube-scheduler \
        -n USER_CLUSTER_NAMESPACE
      
    2. Add the flag --v=5 under the spec.containers.command section:

      containers:
      - command:
      - kube-scheduler
      - --profiling=false
      - --kubeconfig=/etc/kubernetes/scheduler.conf
      - --leader-elect=true
      - --v=5
      
  2. When you are finished troubleshooting, reset the verbosity to the default level:

    1. Edit the kube-scheduler Deployment:

      kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG edit deployment kube-scheduler \
        -n USER_CLUSTER_NAMESPACE
      
    2. Remove the --v=5 flag to return to the default verbosity level:

      containers:
      - command:
      - kube-scheduler
      - --profiling=false
      - --kubeconfig=/etc/kubernetes/scheduler.conf
      - --leader-elect=true
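
While the increased verbosity is in effect (that is, between the two steps above), you can view the captured scheduler logs with kubectl logs. The following sketch assumes the same kubeconfig and namespace placeholders used in the preceding steps:

kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG logs deployment/kube-scheduler \
  -n USER_CLUSTER_NAMESPACE --follow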
      

Topology spread constraints

Topology spread constraints can be used to evenly distribute Pods among Nodes according to zone, region, node, or another custom-defined topology.

The following example manifest shows a Deployment that spreads replicas evenly among all schedulable Nodes using topology spread constraints:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: topology-spread-deployment
  labels:
    app: myapp
spec:
  replicas: 30
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      topologySpreadConstraints:
      - maxSkew: 1 # Default. Spreads evenly. Maximum difference in scheduled Pods per Node.
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: DoNotSchedule # Default. Alternatively can be ScheduleAnyway
        labelSelector:
          matchLabels:
            app: myapp
        matchLabelKeys: # beta in 1.27
        - pod-template-hash
      containers:
      # pause is a lightweight container that simply sleeps
      - name: pause
        image: registry.k8s.io/pause:3.2

The following considerations apply when using topology spread constraints:

  • The constraint's labelSelector matches Pods that carry the label app: myapp.
  • The topologyKey specifies kubernetes.io/hostname. This label is automatically attached to all Nodes and is populated with the Node's hostname.
  • The matchLabelKeys field prevents new rollouts of a Deployment from considering Pods of old revisions when calculating where to schedule a Pod. The pod-template-hash label is automatically populated by a Deployment.
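
If your Nodes carry zone labels, you can spread Pods across zones instead of across individual Nodes. The following fragment is a sketch that replaces the topologySpreadConstraints section of the earlier manifest; it assumes your Nodes carry the well-known topology.kubernetes.io/zone label, which you can verify with kubectl get nodes --show-labels:

      topologySpreadConstraints:
      - maxSkew: 1
        # Assumes Nodes are labeled with the well-known zone label.
        topologyKey: topology.kubernetes.io/zone
        # ScheduleAnyway is a soft constraint: the scheduler still
        # places the Pod if the skew can't be satisfied.
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app: myapp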

Pod anti-affinity

Pod anti-affinity lets you define constraints for which Pods can be co-located on the same Node.

The following example manifest shows a Deployment that uses anti-affinity to limit replicas to one Pod per Node:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: pod-affinity-deployment
  labels:
    app: myapp
spec:
  replicas: 30
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      name: with-pod-affinity
      labels:
        app: myapp
    spec:
      affinity:
        podAntiAffinity:
          # requiredDuringSchedulingIgnoredDuringExecution
          # prevents Pod from being scheduled on a Node if it
          # does not meet criteria.
          # Alternatively can use 'preferred' with a weight
          # rather than 'required'.
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - myapp
            # Your Nodes might be configured with other labels to use
            # as `topologyKey`. The well-known labels
            # `topology.kubernetes.io/region` and
            # `topology.kubernetes.io/zone` are common choices.
            topologyKey: kubernetes.io/hostname
      containers:
      # pause is a lightweight container that simply sleeps
      - name: pause
        image: registry.k8s.io/pause:3.2

This example Deployment specifies 30 replicas, but only expands to as many Nodes as are available in your cluster. Because the anti-affinity rule is required rather than preferred, any replicas that can't be placed remain Pending.
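
To list any replicas that couldn't be placed, you can filter Pods on the Pending phase. A minimal sketch, assuming the app: myapp label from the example:

kubectl get pods -l app=myapp --field-selector=status.phase=Pending -o wide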

The following considerations apply when using Pod anti-affinity:

  • The constraint's labelSelector matches Pods that carry the label app: myapp.
  • The topologyKey specifies kubernetes.io/hostname. This label is automatically attached to all Nodes and is populated with the Node's hostname. You can choose to use other labels if your cluster supports them, such as region or zone.
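
The comments in the manifest mention the 'preferred' alternative. The following fragment is a sketch of that softer form, which lets the scheduler co-locate Pods when no other Node is available instead of leaving replicas Pending:

      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100 # 1-100; higher weights are favored more strongly
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: myapp
              topologyKey: kubernetes.io/hostname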

Pre-pull container images

In the absence of any other constraints, kube-scheduler by default prefers to schedule Pods onto Nodes that already have the container image downloaded. This behavior can be useful in smaller clusters without other scheduling configuration, where it's feasible to download the images on every Node. However, rely on this behavior only as a last resort. A better solution is to use nodeSelector, topology spread constraints, or affinity / anti-affinity. For more information, see Assigning Pods to Nodes.
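
Of those options, nodeSelector is the simplest: it constrains Pods to Nodes that carry a given label. The following fragment is a minimal sketch; the pool: web-pool label is hypothetical and would first need to be applied to your Nodes, for example with kubectl label nodes NODE_NAME pool=web-pool:

    spec:
      # Schedule only onto Nodes that carry the hypothetical label.
      nodeSelector:
        pool: web-pool
      containers:
      - name: myapp
        image: IMAGE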

If you want to make sure container images are pre-pulled onto all Nodes, you can use a DaemonSet like the following example:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: prepulled-images
spec:
  selector:
    matchLabels:
      name: prepulled-images
  template:
    metadata:
      labels:
        name: prepulled-images
    spec:
      initContainers:
        - name: prepulled-image
          image: IMAGE
          # Use a command that terminates immediately
          command: ["sh", "-c", "'true'"]
      containers:
      # pause is a lightweight container that simply sleeps
      - name: pause
        image: registry.k8s.io/pause:3.2

After the DaemonSet's Pods are Running on all Nodes, redeploy your workload Pods to see whether the containers are now evenly distributed across Nodes.
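
A sketch of that redeploy step, assuming the prepulled-images DaemonSet above and the nginx-deployment Deployment from the earlier example output:

# Wait until the DaemonSet has a Running Pod on every Node.
kubectl rollout status daemonset/prepulled-images

# Restart the Deployment so that its Pods are rescheduled.
kubectl rollout restart deployment/nginx-deployment

# Check the NODE column again to see the new distribution.
kubectl get pods -o wide -n default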

What's next

If you need additional assistance, reach out to Cloud Customer Care.