Reconfigure a running cluster

To make basic configuration changes to a running cluster, we recommend that you edit and redeploy the blueprint. Use this method of reconfiguring a live cluster only for the following cases:

  • To add or remove a partition from the cluster
  • To resize an existing partition

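As a concrete illustration, both supported changes are edits to the partition modules in the blueprint. The module IDs, sources, and settings below are assumptions for this sketch (modeled on the Slurm v6 modules), not values from your blueprint:

```yaml
# Illustrative excerpt of a blueprint deployment group; all IDs and
# values here are placeholders for this sketch.
- id: compute_nodeset
  source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
  use: [network]
  settings:
    node_count_static: 2        # resize the partition by editing these counts
    node_count_dynamic_max: 20
- id: compute_partition
  source: community/modules/compute/schedmd-slurm-gcp-v6-partition
  use: [compute_nodeset]
  settings:
    partition_name: compute     # add or remove whole partition blocks like this
```

Changing the node counts resizes an existing partition; adding or deleting a pair of blocks like these adds or removes a partition.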
Set up environment to allow for reconfiguration

To set up the environment to allow for reconfiguration, complete the following steps:

  1. Set top-level variables to allow for cluster reconfiguration. Placing these settings in the vars block ensures that they are applied to every module that accepts them as inputs.

    # Slurm v6
    vars:
      ...
      enable_cleanup_compute: true
    
    # Slurm v5
    vars:
      ...
      enable_reconfigure: true
      enable_cleanup_compute: true
      enable_cleanup_subscriptions: true
    
  2. To use the settings from the previous step, install the local Python dependencies. The Python dependencies must be installed on the deployment machine where you run the ghpc command. For installation instructions, see the Cloud HPC Toolkit documentation.
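
A minimal sketch of preparing those dependencies on the deployment machine, assuming a Python 3 interpreter with the venv module available; the requirements file path is a placeholder for wherever your Toolkit checkout keeps it:

```shell
# Create an isolated Python environment on the deployment machine
# (the machine from which you run ghpc).
python3 -m venv toolkit-env
. toolkit-env/bin/activate
python3 -m pip install --quiet --upgrade pip

# Placeholder path: install the Toolkit's Python requirements from your
# local clone of the repository (commented out because the path varies).
# python3 -m pip install -r ./hpc-toolkit/requirements.txt
```

Using a virtual environment keeps the Toolkit's dependencies separate from system packages, which matters when the same machine manages several deployments.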

Reconfigure partitions on running cluster

To reconfigure a running cluster, complete the following steps:

  1. To enable cluster reconfiguration, ensure that you have completed the steps in Set up environment to allow for reconfiguration.
  2. Ensure that you redeploy with the same version of ghpc that was used for the original deployment. Graceful redeployment across versions of the Cloud HPC Toolkit isn't guaranteed.

    You can check the version by running the ghpc --version command. ghpc also prints a warning if you redeploy with a different version.

  3. Redeploy the blueprint as follows:

    1. Edit the blueprint file. For example, you can increase the node_count_static on a node set.
    2. Recreate the deployment by running the following command. The -w flag is required to overwrite the previous deployment folder.

      ghpc create BLUEPRINT_NAME -w
    3. Redeploy the deployment by running the following command:

      ghpc deploy DEPLOYMENT_FOLDER_NAME

      Carefully evaluate the Terraform plan to make sure that no unexpected resources are replaced or deleted.
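
The plan review in the last step can be partly automated. The sketch below scans a saved plan summary for destructive actions; the file and its contents are stand-ins for the plan text that ghpc deploy displays before asking for approval:

```shell
# Stand-in for the plan summary shown during `ghpc deploy`; Terraform
# ends its plan output with a line in this "Plan: ..." format.
cat > plan-summary.txt <<'EOF'
Plan: 2 to add, 1 to change, 0 to destroy.
EOF

# Flag the plan for manual review if it would destroy or replace resources.
if grep -Eq '[1-9][0-9]* to destroy|must be replaced' plan-summary.txt; then
  echo "plan is destructive: review carefully before approving" >&2
else
  echo "no resources destroyed or replaced"
fi
```

A guard like this is a supplement, not a substitute: still read the full plan, because a "change" can also carry disruptive side effects such as instance restarts.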

What's next