Vertex AI Managed Training on reserved clusters uses Slurm (Simple Linux Utility for Resource Management) as the orchestrator for managing and scheduling jobs on your cluster.
Slurm is a widely used, open-source cluster management and job scheduling system known for its scalability and fault tolerance.
Key capabilities of Slurm
- Slurm allocates a set of compute nodes for the exclusive use of a specific job for a defined period. This ensures a job has dedicated access to the resources it needs to run without interference.
- Slurm provides a framework for managing the complete lifecycle of a job, from submission and execution to monitoring and completion. This system is specifically designed to handle parallel jobs that run across a set of allocated nodes (a minimal example follows this list).
- Slurm maintains a queue of pending jobs, using a sophisticated prioritization engine to arbitrate access to compute resources. By considering factors like job size, user priority, and wait time, this system ensures fair and efficient resource utilization across the cluster.
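For orientation, the following minimal sketch shows that lifecycle with standard Slurm commands. The batch script, its file name, and the node count are illustrative; the a4 partition ID matches the example partition configured later in this guide.

#!/bin/bash
# Minimal batch script (illustrative)
#SBATCH --job-name=example-train
#SBATCH --partition=a4
#SBATCH --nodes=2
#SBATCH --time=01:00:00

# Runs hostname once on each allocated node
srun hostname

Save the script as, for example, train.sbatch, then submit and monitor it from a login node:

sbatch train.sbatch   # submit the job; Slurm prints the assigned job ID
squeue -u $USER       # monitor your pending and running jobs
scancel JOB_ID        # cancel the job if needed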
Basic cluster configuration
Before you can run jobs, you must define the fundamental structure of your Slurm cluster. This section details the essential configuration settings, including how to organize compute nodes into partitions, specify a dedicated login node pool, and configure a shared home directory for your users.
Partitions
Partitions group nodes into logical sets, which can be useful for managing
different machine types or access tiers. They are defined as a list within the
partitions field of the slurm_spec.
Each partition object has the following required fields:
- id: A unique identifier for the partition.
- node_pool_ids: A list containing the IDs of one or more node pools that belong to this partition.
For example:
"partitions": [
{
"id": "a4",
"node_pool_ids": [ "a4" ]
}
]
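After the cluster is created, you can confirm the partition from a login node with standard Slurm commands (the a4 ID matches the example above):

sinfo -p a4                  # list the partition, its state, and its nodes
scontrol show partition a4   # print the full partition definition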
Login nodes
The login node pool provides dedicated nodes that serve as the primary entry
point for users to interact with the cluster. The login_node_pool_id field
specifies the unique identifier for this pool.
For example:
"login_node_pool_id": "login"
Home directory storage
The home_directory_storage field specifies the Filestore instance to be
mounted as the /home directory on all nodes in the cluster. This provides
a shared, persistent home directory for all users.
You must provide the full resource name of the Filestore instance for this value.
For example:
"home_directory_storage": "projects/PROJECT_ID/locations/REGION-ZONE/instances/FILESTORE_INSTANCE_NAME"
Advanced Slurm configuration
Managed Training lets you customize a select set of slurm.conf parameters,
but be aware that these settings can only be configured during initial cluster
creation and can't be changed afterward.
Accounting
Managed Training lets you use built-in accounting features to track resource usage within your cluster. For a complete guide on how to monitor metrics like job-specific CPU time and memory usage, review the official Slurm accounting documentation.
| Parameter | Value type | Example |
|---|---|---|
| ACCOUNTING_STORAGE_ENFORCE | Comma-separated strings | associations,limits,qos |
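For example, with accounting enabled you can query per-job usage with sacct from a login node; the job ID and start date are illustrative:

# Elapsed time, peak memory, and CPU time for a specific job
sacct -j 12345 --format=JobID,JobName,Partition,Elapsed,MaxRSS,TotalCPU,State

# All of your jobs since a given date
sacct -u $USER --starttime=2025-01-01 --format=JobID,Elapsed,MaxRSS,TotalCPU,State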
Preemption and priority
To manage how jobs are scheduled and prioritized, Managed Training lets you configure Slurm's job preemption. Preemption works with the multifactor priority plugin to determine whether running jobs should be suspended to make way for higher-priority work.
For a complete conceptual overview, review the official Slurm documentation on the multifactor priority plugin and preemption.
Preemption parameters
| Parameter | Value type | Example |
|---|---|---|
| PREEMPT_TYPE | String | preempt/partition_prio |
| PREEMPT_MODE | Comma-separated strings | SUSPEND,GANG |
| PREEMPT_EXEMPT_TIME | String | 00:00:00 |
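After the cluster is created, you can confirm the values that were applied by inspecting the running configuration from a login node:

scontrol show config | grep -i preempt    # print the effective preemption settings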
Priority parameters
| Parameter | Value type | Example |
|---|---|---|
| PRIORITY_TYPE | String | priority/multifactor |
| PRIORITY_WEIGHT_AGE | Integer | 0 |
| PRIORITY_WEIGHT_ASSOC | Integer | 0 |
| PRIORITY_WEIGHT_FAIRSHARE | Integer | 0 |
| PRIORITY_WEIGHT_JOB_SIZE | Integer | 0 |
| PRIORITY_WEIGHT_PARTITION | Integer | 0 |
| PRIORITY_WEIGHT_QOS | Integer | 0 |
| PRIORITY_WEIGHT_TRES | Comma-separated strings | cpu=100,mem=150 |
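These weights feed Slurm's multifactor plugin, which, per the Slurm documentation, computes each pending job's priority as a weighted sum of normalized factors. The sketch below summarizes that calculation and shows how to print the weights in effect on a running cluster:

# Job priority is (approximately) the weighted sum:
#   PRIORITY_WEIGHT_AGE       * age_factor
# + PRIORITY_WEIGHT_ASSOC     * association_factor
# + PRIORITY_WEIGHT_FAIRSHARE * fairshare_factor
# + PRIORITY_WEIGHT_JOB_SIZE  * job_size_factor
# + PRIORITY_WEIGHT_PARTITION * partition_factor
# + PRIORITY_WEIGHT_QOS       * qos_factor
# + the per-TRES weights from PRIORITY_WEIGHT_TRES (for example cpu and mem),
# where each factor is normalized to a value between 0.0 and 1.0.
sprio -w    # print the configured priority weights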
Prolog and epilog scripts
You can configure custom Bash scripts to run automatically at the start (prolog) and end (epilog) of each job using the following fields:
- prolog_bash_scripts: A list of strings, where each string contains the full content of a Bash script to be executed before the job begins.
- epilog_bash_scripts: A list of strings, where each string contains the full content of a Bash script to be executed after the job completes.
This is useful for setting up a unique job environment or performing automated cleanup tasks.
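The scripts run on the compute nodes with job context available through Slurm's standard prolog and epilog environment variables (for example SLURM_JOB_ID and SLURM_JOB_UID). The following is a minimal sketch of a matching pair; the scratch directory path is illustrative, and each script would be passed as one string in the corresponding field:

#!/bin/bash
# Prolog: create a per-job scratch directory before the job starts
mkdir -p "/scratch/${SLURM_JOB_ID}"
chown "${SLURM_JOB_UID}" "/scratch/${SLURM_JOB_ID}"

#!/bin/bash
# Epilog: remove the scratch directory after the job completes
rm -rf "/scratch/${SLURM_JOB_ID}"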
Example cluster specification
The following example shows a complete JSON configuration for creating a Managed Training cluster. You can adapt this specification for your own needs.
{
  // ... other cluster configurations ...
  "orchestratorSpec": {
    "slurmSpec": {
      "partitions": [
        {
          "id": "a4",
          "node_pool_ids": ["a4"]
        }
      ],
      "login_node_pool_id": "login",
      "home_directory_storage": "projects/PROJECT_ID/locations/REGION-ZONE/instances/FILESTORE_INSTANCE_ID",
      "accounting": {
        "accounting_storage_enforce": "ACCOUNTING_STORAGE_ENFORCE"
      },
      "scheduling": {
        "priority_type": "PRIORITY_TYPE",
        "priority_weight_age": PRIORITY_WEIGHT_AGE,
        "priority_weight_assoc": PRIORITY_WEIGHT_ASSOC,
        "priority_weight_fairshare": PRIORITY_WEIGHT_FAIRSHARE,
        "priority_weight_job_size": PRIORITY_WEIGHT_JOB_SIZE,
        "priority_weight_partition": PRIORITY_WEIGHT_PARTITION,
        "priority_weight_qos": PRIORITY_WEIGHT_QOS,
        "priority_weight_tres": "PRIORITY_WEIGHT_TRES",
        "preempt_type": "PREEMPT_TYPE",
        "preempt_mode": "PREEMPT_MODE",
        "preempt_exempt_time": "PREEMPT_EXEMPT_TIME"
      },
      "prolog_bash_scripts": [
        "#!/bin/bash\necho 'First prolog script running'",
        "#!/bin/bash\necho 'Second prolog script running'"
      ],
      "epilog_bash_scripts": [
        "#!/bin/bash\necho 'Epilog script running'"
      ]
      // ... other Slurm settings ...
    }
  }
}
Cluster management and operations
Managing a running cluster
Once your cluster is created with the chosen accounting and preemption settings, you can use Slurm's command-line tools to manage user accounts and monitor job scheduling.
Account management with sacctmgr
The sacctmgr command is the primary tool for managing user and account
information in the Slurm database. For example, to add a new user to an account
and grant them access to a partition, run the following command:
sudo sacctmgr add User Accounts=<account> Partition=<partition> <user>
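For instance, a typical sequence is to create an account first, attach a user to it, and then review the result; the account, user, and partition names below are illustrative:

sudo sacctmgr add account research Description="Research team"
sudo sacctmgr add user alice Accounts=research Partition=a4
sudo sacctmgr show associations format=Account,User,Partition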
For a comprehensive list of all sacctmgr options, review the official
Slurm accounting documentation.
Checking job priority
To check the priority components of each job in the queue, use the sprio
utility. This is useful for understanding why certain jobs are scheduled to
run before others.
See the sprio utility documentation
for detailed usage.
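For example (the job ID is illustrative):

sprio -l         # long format: one row per pending job, with each priority factor
sprio -j 12345   # priority breakdown for a single pending job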
Preemption examples
The official Slurm documentation provides several working examples of different preemption strategies. You can find these on the Slurm Preemption page.