Creating and managing node pools

Starting in Google Distributed Cloud version 1.3, you can create a group of nodes in your user cluster that all have the same configuration by defining a node pool in the configuration file of that cluster. You can then manage that pool of nodes separately without affecting any of the other nodes in the cluster. Learn more about node pools.

One or more node pools can be defined in the configuration file of any user cluster. Creating a node pool creates additional nodes in the user cluster. Node pool management, including creating, updating, and deleting node pools in a user cluster, is done by modifying the nodePools section of your configuration file and deploying those changes to your existing cluster with the gkectl update cluster command. Note that deleting a node pool causes the immediate removal of the related nodes, regardless of whether any of those nodes are running a workload.

Example node pool:

nodePools:
  - name: pool-1
    cpus: 4
    memoryMB: 8192
    replicas: 5

Tip for new installations: Create your first user cluster and define your node pools in that cluster. Then use that cluster's configuration file to create additional user clusters with the same node pool settings.

Before you begin

  • Support:

    • Only user clusters version 1.3.0 or later are supported.

    • Node pools in admin clusters are unsupported.

    • The gkectl update cluster command currently has full support for updating node pools and adding static IPs. It also supports enabling cloud audit logging and enabling or disabling auto repair. All other changes in the configuration file are ignored.

    • While the nodes in a node pool can be managed separately from other nodes, the nodes of any cluster cannot be separately upgraded. All nodes are upgraded when you upgrade your clusters.

  • Resources:

    • Only changes to a node pool's replicas can be deployed without interrupting the workloads running on the pool's nodes.

      Important: If you deploy any other node pool configuration change, the nodes in that pool are recreated. You must ensure that the pool is not running a workload that should not be disrupted.

    • When you deploy your node pool changes, unwanted nodes are deleted after the desired ones are created or updated. One implication of this policy is that even if the total number of nodes remains the same before and after an update, more resources (for example, IP addresses) may be required during the update.

      Suppose a node pool will have N nodes at the end of an update. Then you must have at least N + 1 IP addresses available for nodes in that pool. This means that if you are resizing a cluster by adding nodes to one or more pools, you must have at least one more IP address than the total number of nodes that will be in all of the cluster's node pools at the end of the resizing. For more information, see Verify that enough IP addresses are available.
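
      For example, if a user cluster uses static IP mode and its node pools will have a total of 3 worker nodes after an update, the IP block file referenced by network.ipMode.ipBlockFilePath needs at least 4 node addresses. The following is a minimal sketch of such a file, assuming the blocks layout used for static IP allocation; the netmask, gateway, addresses, and hostnames are illustrative placeholders only:

      blocks:
      - netmask: 255.255.252.0
        gateway: 172.16.23.1
        ips:
        - ip: 172.16.20.11
          hostname: user-node-1
        - ip: 172.16.20.12
          hostname: user-node-2
        - ip: 172.16.20.13
          hostname: user-node-3
        # One spare address so new nodes can be created before old ones are deleted
        - ip: 172.16.20.14
          hostname: user-node-4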

Creating and updating node pools

You manage a node pool by modifying and deploying your user cluster's configuration file. You can create and deploy one or more node pools in a user cluster.

To create or update node pools:

  1. In an editor, open the configuration file of the user cluster in which you want to create or update node pools.

  2. Define one or more node pools in the nodePools section of the user cluster configuration file:

    1. Configure the minimum required node pool attributes. You must specify the following attributes for each node pool:

      • nodePools.name: Specifies a unique name for the node pool. Updating this attribute recreates the nodes in the pool. Example: - name: pool-1

      • nodePools.cpus: Specifies how many CPUs are allocated to each worker node in the pool. Updating this attribute recreates the nodes in the pool. Example: cpus: 4

      • nodePools.memoryMB: Specifies how much memory, in megabytes, is allocated to each worker node in the pool. Updating this attribute recreates the nodes in the pool. Example: memoryMB: 8192

      • nodePools.replicas: Specifies the total number of worker nodes in the pool. The user cluster uses nodes across all of its pools to run workloads. You can update this attribute without affecting any nodes or running workloads. Example: replicas: 5

      Note that while some nodePools attributes are the same as the workernode (DHCP | Static IP) attributes in the old configuration file format, the workernode section is still required in every old-format user cluster configuration file. You can't remove the workernode section or replace it with nodePools. In the new user cluster configuration file format, there is no workernode section. You must define at least one node pool for a user cluster and ensure that there are enough un-tainted replicas across all node pools (at least three in total) to take over the role of the default workernode pool from the old format.

      Example:

      nodePools:
      - name: pool-1
        cpus: 4
        memoryMB: 8192
        replicas: 5
      

      See Examples for an example user cluster configuration file with multiple node pools.

    2. Configure optional node pool attributes. You can add labels and taints to your node pool configuration to steer workloads to specific nodes; a Pod-side sketch of how labels and taints interact with workloads follows this procedure. You can also define which vSphere Datastore is used by your node pool.

      • nodePools.labels: Specifies one or more key : value pairs to uniquely identify your node pools. The key and value must begin with a letter or number, and can contain letters, numbers, hyphens, dots, and underscores, up to 63 characters each.

        For detailed configuration information, see labels.

        Important: You cannot specify the following keys for a label because they are reserved for use by Google Distributed Cloud: kubernetes.io, k8s.io, and googleapis.com.

        Example:

        labels:
          key1: value1
          key2: value2
        
      • nodePools.taints: Specifies a key, value, and effect to define taints for your node pools. These taints correspond with the tolerations that you configure for your pods.

        The key is required and value is optional. Both must begin with a letter or number, and may contain letters, numbers, hyphens, dots, and underscores, up to 253 characters. Optionally, you can prefix a key with a DNS subdomain followed by a /. For example: example.com/my-app.

        Valid effect values are: NoSchedule, PreferNoSchedule, or NoExecute.

        For detailed configuration information, see taints.

        Example:

        taints:
          - key: key1
            value: value1
            effect: NoSchedule
        
      • nodePools.bootDiskSizeGB: Specifies the size of the boot disk, in gigabytes, allocated to each worker node in the pool. This configuration is available starting in Google Distributed Cloud version 1.5.0.

        Example:

        bootDiskSizeGB: 40
        
      • nodePools.vsphere.datastore: Specifies the vSphere Datastore on which each node in the pool is created. This overrides the default vSphere Datastore of the user cluster.

        Example:

        vsphere:
          datastore: datastore_name
        

    See Examples for a configuration example with multiple node pools.

  3. Use the gkectl update cluster command to deploy your changes to the user cluster.

    gkectl update cluster --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] --config [USER_CLUSTER_CONFIG_FILE] --dry-run --yes
    
    where:
    • [ADMIN_CLUSTER_KUBECONFIG]: Specifies the kubeconfig file of your admin cluster.
    • [USER_CLUSTER_CONFIG_FILE]: Specifies the configuration file of your user cluster.
    • --dry-run: Optional flag. Add this flag to view the change only. No changes are deployed to the user cluster.
    • --yes: Optional flag. Add this flag to run the command silently. The prompt to verify that you want to proceed is disabled.

    If the command is aborted prematurely, you can run the same command again to complete the operation and deploy your changes to the user cluster.

    If you need to revert your changes, you must undo them in the configuration file and then redeploy the updated configuration to your user cluster.

  4. Verify that the changes are successful by inspecting all the nodes. Run the following command to list all of the nodes in the user cluster:

    kubectl --kubeconfig [USER_CLUSTER_KUBECONFIG] get nodes -o wide
    

    where [USER_CLUSTER_KUBECONFIG] is the kubeconfig file of your user cluster.
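
As a sketch of how node pool labels and taints steer workloads, the following Pod manifest targets nodes labeled as in the pool-4 entry of the Examples section and tolerates that pool's example-key taint. The Pod name and image are hypothetical placeholders, not part of Google Distributed Cloud:

apiVersion: v1
kind: Pod
metadata:
  name: my-app  # hypothetical Pod name
spec:
  # Schedule only onto nodes carrying the pool's label
  nodeSelector:
    environment: production
  # Tolerate the pool's taint so the Pod may be scheduled on those nodes
  tolerations:
  - key: "example-key"
    operator: "Exists"
    effect: "NoSchedule"
  containers:
  - name: app
    image: nginx  # hypothetical image

To confirm which nodes carry a pool's labels, you can filter with a label selector, for example: kubectl --kubeconfig [USER_CLUSTER_KUBECONFIG] get nodes -l environment=production.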

Deleting a node pool

To delete a node pool from a user cluster:

  1. Remove its definition from the nodePools section of the user cluster configuration file.

  2. Ensure that there are no workloads running on the affected nodes (a sketch of how to check follows this procedure).

  3. Deploy your changes by running the gkectl update cluster command:

    gkectl update cluster --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] --config [USER_CLUSTER_CONFIG_FILE] --dry-run --yes
    
    where:
    • [ADMIN_CLUSTER_KUBECONFIG]: Specifies the kubeconfig file of your admin cluster.
    • [USER_CLUSTER_CONFIG_FILE]: Specifies the configuration file of your user cluster.
    • --dry-run: Optional flag. Add this flag to view the change only. No changes are deployed to the user cluster.
    • --yes: Optional flag. Add this flag to run the command silently. The prompt to verify that you want to proceed is disabled.
  4. Verify that the changes are successful by inspecting all the nodes. Run the following command to list all of the nodes in the user cluster:

    kubectl --kubeconfig [USER_CLUSTER_KUBECONFIG] get nodes -o wide
    

    where [USER_CLUSTER_KUBECONFIG] is the kubeconfig file of your user cluster.
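
To check for workloads before removing a pool, you can list the Pods scheduled on each affected node. This is a minimal sketch; replace [NODE_NAME] with a node name reported by kubectl get nodes:

kubectl --kubeconfig [USER_CLUSTER_KUBECONFIG] get pods --all-namespaces \
    --field-selector spec.nodeName=[NODE_NAME]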

Examples

In the following example configuration, there are five node pools, each with different attributes:

  • pool-1: only the minimum required attributes are specified
  • pool-2: includes vSphere Datastore
  • pool-3: includes bootDiskSizeGB
  • pool-4: includes taints and labels
  • pool-5: includes all attributes

apiVersion: v1
kind: UserCluster
...
# (Required) List of node pools. The total un-tainted replicas across all node pools
# must be greater than or equal to 3
nodePools:
- name: pool-1
  cpus: 4
  memoryMB: 8192
  replicas: 5
- name: pool-2
  cpus: 8
  memoryMB: 16384
  replicas: 3
  vsphere:
    datastore: my_datastore
- name: pool-3
  cpus: 8
  memoryMB: 8192
  replicas: 3
  bootDiskSizeGB: 40
- name: pool-4
  cpus: 4
  memoryMB: 8192
  replicas: 5
  taints:
    - key: "example-key"
      effect: NoSchedule
  labels:
    environment: production
    app: nginx
- name: pool-5
  cpus: 8
  memoryMB: 16384
  replicas: 3
  taints:
    - key: "my_key"
      value: my_value1
      effect: NoExecute
  labels:
    environment: test
  vsphere:
    datastore: my_datastore
  bootDiskSizeGB: 60
...

The following complete user cluster configuration file template shows where the nodePools section fits among the other cluster settings:

apiVersion: v1
kind: UserCluster
# (Required) A unique name for this cluster
name: ""
# (Required) Anthos clusters on VMware (GKE on-prem) version (example: 1.3.0-gke.16)
gkeOnPremVersion: ""
# # (Optional) vCenter configuration (default: inherit from the admin cluster)
# vCenter:
#   # Resource pool to use. Specify [VSPHERE_CLUSTER_NAME]/Resources to use the default
#   # resource pool
#   resourcePool: ""
#   datastore: ""
#   # Provide the path to vCenter CA certificate pub key for SSL verification
#   caCertPath: ""
#   # The credentials to connect to vCenter
#   credentials:
#     # reference to external credentials file
#     fileRef:
#       # read credentials from this file
#       path: ""
#       # entry in the credential file
#       entry: ""
# (Required) Network configuration; vCenter section is optional and inherits from
# the admin cluster if not specified
network:
  # # (Optional) This section overrides ipBlockFile values. Use with ipType "static" mode.
  # # Used for seesaw nodes as well
  # hostConfig:
  #   # List of DNS servers
  #   dnsServers:
  #   - ""
  #   # List of NTP servers
  #   ntpServers:
  #   - ""
  #   # # List of DNS search domains
  #   # searchDomainsForDNS:
  #   # - ""
  ipMode:
    # (Required) Define what IP mode to use ("dhcp" or "static")
    type: dhcp
    # # (Required when using "static" mode) The absolute or relative path to the yaml file
    # # to use for static IP allocation. Hostconfig part will be overwritten by network.hostconfig
    # # if specified
    # ipBlockFilePath: ""
  # (Required) The Kubernetes service CIDR range for the cluster. Must not overlap
  # with the pod CIDR range
  serviceCIDR: 10.96.0.0/12
  # (Required) The Kubernetes pod CIDR range for the cluster. Must not overlap with
  # the service CIDR range
  podCIDR: 192.168.0.0/16
  vCenter:
    # vSphere network name
    networkName: ""
# (Required) Load balancer configuration
loadBalancer:
  # (Required) The VIPs to use for load balancing
  vips:
    # Used to connect to the Kubernetes API
    controlPlaneVIP: ""
    # Shared by all services for ingress traffic
    ingressVIP: ""
  # (Required) Which load balancer to use "F5BigIP" "Seesaw" or "ManualLB". Uncomment
  # the corresponding field below to provide the detailed spec
  kind: Seesaw
  # # (Required when using "ManualLB" kind) Specify pre-defined nodeports
  # manualLB:
  #   # NodePort for ingress service's http (only needed for user cluster)
  #   ingressHTTPNodePort: 30243
  #   # NodePort for ingress service's https (only needed for user cluster)
  #   ingressHTTPSNodePort: 30879
  #   # NodePort for control plane service
  #   controlPlaneNodePort: 30562
  #   # NodePort for addon service (only needed for admin cluster)
  #   addonsNodePort: 0
  # # (Required when using "F5BigIP" kind) Specify the already-existing partition and
  # # credentials
  # f5BigIP:
  #   address: ""
  #   credentials:
  #     # reference to external credentials file
  #     fileRef:
  #       # read credentials from this file
  #       path: ""
  #       # entry in the credential file
  #       entry: ""
  #   partition: ""
  #   # # (Optional) Specify a pool name if using SNAT
  #   # snatPoolName: ""
  # (Required when using "Seesaw" kind) Specify the Seesaw configs
  seesaw:
    # (Required) The absolute or relative path to the yaml file to use for IP allocation
    # for LB VMs. Must contain one or two IPs. Hostconfig part will be overwritten
    # by network.hostconfig if specified.
    ipBlockFilePath: ""
    # (Required) The Virtual Router IDentifier of VRRP for the Seesaw group. Must
    # be between 1-255 and unique in a VLAN.
    vrid: 0
    # (Required) The IP announced by the master of Seesaw group
    masterIP: ""
    # (Required) The number of CPUs per machine
    cpus: 4
    # (Required) Memory size in MB per machine
    memoryMB: 3072
    # (Optional) Network that the LB interface of Seesaw runs in (default: cluster
    # network)
    vCenter:
      # vSphere network name
      networkName: ""
    # (Optional) Run two LB VMs to achieve high availability (default: false)
    enableHA: false
# # (Optional/Preview) Enable dataplane v2
# enableDataplaneV2: false
# # (Optional) Storage specification for the cluster
# storage:
#   # Whether to disable vSphere CSI components deployment. The feature is enabled by
#   # default.
#   vSphereCSIDisabled: false
# (Optional) User cluster master nodes must have either 1 or 3 replicas (default:
# 4 CPUs; 16384 MB memory; 1 replica)
masterNode:
  cpus: 4
  memoryMB: 8192
  # How many machines of this type to deploy
  replicas: 1
# (Required) List of node pools. The total un-tainted replicas across all node pools
# must be greater than or equal to 3
nodePools:
- name: pool-1
  cpus: 4
  memoryMB: 8192
  # How many machines of this type to deploy
  replicas: 3
  # # (Optional) boot disk size; must be at least 40 (default: 40)
  # bootDiskSizeGB: 40
  # # Labels to apply to Kubernetes Node objects
  # labels: {}
  # # Taints to apply to Kubernetes Node objects
  # taints:
  # - key: ""
  #   value: ""
  #   effect: ""
  # vsphere:
  #   # (Optional) vSphere datastore the node pool will be created on (default: vCenter.datastore)
  #   datastore: ""
# Spread nodes across at least three physical hosts (requires at least three hosts)
antiAffinityGroups:
  # Set to false to disable DRS rule creation
  enabled: true
# # (Optional): Configure additional authentication
# authentication:
#   # (Optional) Configure OIDC authentication
#   oidc:
#     # URL for OIDC Provider.
#     issuerURL: ""
#     # (Optional) Default is http://kubectl.redirect.invalid
#     kubectlRedirectURL: ""
#     # ID for OIDC client application.
#     clientID: ""
#     # (Optional) Secret for OIDC client application.
#     clientSecret: ""
#     username: ""
#     # (Optional) Prefix prepended to username claims.
#     usernamePrefix: ""
#     # (Optional) JWT claim to use as group name.
#     group: ""
#     # (Optional) Prefix prepended to group claims.
#     groupPrefix: ""
#     # (Optional) Additional scopes to send to OIDC provider as comma separated list.
#     # Default is "openid".
#     scopes: ""
#     # (Optional) Additional key-value parameters to send to OIDC provider as comma
#     # separated list.
#     extraParams: ""
#     # (Optional) Set value to string "true" or "false". Default is false.
#     deployCloudConsoleProxy: ""
#     # # (Optional) The absolute or relative path to the CA file
#     # caPath: ""
#   # (Optional) Provide an additional serving certificate for the API server
#   sni:
#     certPath: ""
#     keyPath: ""
#   # (Optional/Preview) Configure LDAP authentication
#   ldap:
#     # Name of LDAP provider.
#     name: ""
#     # Hostname or IP of the LDAP provider.
#     host: ""
#     # (Optional) Only support "insecure" for now
#     connectionType: insecure
#     # # (Optional) The absolute or relative path to the CA file
#     # caPath: ""
#     user:
#       # Location in LDAP directory where user entries exist.
#       baseDN: ""
#       # (Optional) Name of the attribute that precedes the username in a DN. Default
#       # is "CN".
#       userAttribute: ""
#       # (Optional) Name of the attribute that records a user's group membership. Default
#       # is "memberOf".
#       memberAttribute: ""
# (Optional) Specify which GCP project to connect your logs and metrics to
stackdriver:
  projectID: ""
  # A GCP region where you would like to store logs and metrics for this cluster.
  clusterLocation: ""
  enableVPC: false
  # The absolute or relative path to the key file for a GCP service account used to
  # send logs and metrics from the cluster
  serviceAccountKeyPath: ""
  # (Optional/Preview) Disable vsphere resource metrics collection from vcenter. True
  # by default
  disableVsphereResourceMetrics: true
# (Optional) Specify which GCP project to connect your GKE clusters to
gkeConnect:
  projectID: ""
  # The absolute or relative path to the key file for a GCP service account used to
  # register the cluster
  registerServiceAccountKeyPath: ""
  # The absolute or relative path to the key file for a GCP service account used by
  # the GKE connect agent
  agentServiceAccountKeyPath: ""
# (Optional) Specify Cloud Run configuration
cloudRun:
  enabled: false
# # (Optional/Alpha) Configure the GKE usage metering feature
# usageMetering:
#   bigQueryProjectID: ""
#   # The ID of the BigQuery Dataset in which the usage metering data will be stored
#   bigQueryDatasetID: ""
#   # The absolute or relative path to the key file for a GCP service account used by
#   # gke-usage-metering to report to BigQuery
#   bigQueryServiceAccountKeyPath: ""
#   # Whether or not to enable consumption-based metering
#   enableConsumptionMetering: false
# # (Optional/Alpha) Configure kubernetes apiserver audit logging
# cloudAuditLogging:
#   projectID: ""
#   # A GCP region where you would like to store audit logs for this cluster.
#   clusterLocation: ""
#   # The absolute or relative path to the key file for a GCP service account used to
#   # send audit logs from the cluster
#   serviceAccountKeyPath: ""
# # (Optional/Preview) Enable auto repair for the cluster
# autoRepair:
#   # Whether to enable auto repair feature. The feature is disabled by default.
#   enabled: false

Troubleshooting

  • In general, the gkectl update cluster command provides specifics when it fails. If the command succeeded and you don't see the nodes, you can troubleshoot with the Diagnosing cluster issues guide.

  • Node pool creation or update can fail because of insufficient cluster resources, such as a lack of available IP addresses. See the Resizing a user cluster topic for details about verifying that IP addresses are available.

  • You can also review the general Troubleshooting guide.

  • The update won't proceed past Creating node MachineDeployment(s) in user cluster….

    It can take a while to create or update the node pools in your user cluster. However, if the wait time is extremely long and you suspect that something might have failed, you can run the following commands:

    1. Run kubectl get nodes to obtain the state of your nodes.
    2. For any nodes that are not ready, run kubectl describe node [node_name] to obtain details.
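
    For example, using the kubeconfig placeholder defined earlier on this page:

    # List nodes and their status
    kubectl --kubeconfig [USER_CLUSTER_KUBECONFIG] get nodes

    # Show the conditions and events for a node that is not Ready
    kubectl --kubeconfig [USER_CLUSTER_KUBECONFIG] describe node [node_name]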