H4D instances on Google Cloud are optimized for High Performance Computing (HPC) workloads and feature Cloud RDMA technology to deliver low-latency, high-bandwidth inter-node communication. This guide provides procedures for setting up and configuring Message Passing Interface (MPI) libraries to take advantage of Cloud RDMA on your H4D clusters. It also provides best practices for compiling and scaling your MPI applications on H4D instances.
Before you begin
Before you attempt any of the tasks in this document, make sure that you meet the following prerequisites:
- You have an HPC cluster of H4D instances with Cloud RDMA configured on the instances. If you need to create one, follow the quickstart to Create a Slurm cluster with H4D VMs.
- You have configured a Virtual Private Cloud (VPC) network and subnet enabled for Cloud RDMA.
- You have increased the memory lock and open file limits on each instance, as described later in this document.
Set up the MPI library
Google Cloud has validated Cloud RDMA on H4D with Intel MPI and Open MPI. Your application's requirements or recommendations should determine which MPI library you use. Many applications are built and tuned for a specific MPI implementation. Google recommends using Intel MPI version 2021.13.1.
Configure Intel MPI
Install the Intel MPI library on all nodes in the cluster. Refer to Intel's official documentation for the latest installation procedures. If you use Cluster Toolkit, then it runs scripts that typically handle the installation.
To ensure that Intel MPI uses the Cloud RDMA interface, set environment variables in your job scripts or user environment:
export FI_PROVIDER="verbs;ofi_rxm"
export FI_VERBS_INLINE_SIZE=39
export FI_OFI_RXM_BUFFER_SIZE=4096
export FI_UNIVERSE_SIZE=N_MPI_RANKS
Replace N_MPI_RANKS with a value that is based on the number of H4D instances you have provisioned. For example, if you have a 16-node cluster, you would use the value 3072, which is 192 ranks per node * 16 machines.
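For example, you can derive the value from the node count in your job script; the variable names below are placeholders, and the calculation assumes one rank per vCPU (192 per H4D node):
# Derive FI_UNIVERSE_SIZE from the node count (placeholder values; assumes 192 ranks per node)
NUM_NODES=16
RANKS_PER_NODE=192
export FI_UNIVERSE_SIZE=$((NUM_NODES * RANKS_PER_NODE))   # 3072 for a 16-node cluster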
Configure the guest OS environment
After you have created the H4D instances, configure the guest environment.
Set up user limits for MPI
In the guest OS of each H4D instance, increase the memory lock and open file limits, using the values shown in the following example:
cat << EOF | sudo tee -a /etc/security/limits.conf
* hard memlock unlimited
* soft memlock unlimited
* hard nofile 65535
* soft nofile 65535
EOF
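To confirm that the limits are applied, you can check them from a new login session; the expected outputs shown correspond to the values set in this example:
# Check the limits from a fresh session
ulimit -l   # expect: unlimited (max locked memory)
ulimit -n   # expect: 65535 (max open files)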
Set up SSH keys for MPI
Set up SSH keys for MPI applications that require them. The following example syntax assumes that there is a shared home directory for all nodes. Otherwise, the .ssh directory must be copied to each node.
ssh-keygen -f /home/$USER/.ssh/id_rsa -t rsa -N ''
cat << EOF > /home/$USER/.ssh/config
Host *
StrictHostKeyChecking no
EOF
cat /home/$USER/.ssh/id_rsa.pub >> /home/$USER/.ssh/authorized_keys
chmod 600 /home/$USER/.ssh/authorized_keys
chmod 644 /home/$USER/.ssh/config
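You can then verify passwordless SSH between nodes; the node name in this sketch is hypothetical:
# Confirm passwordless SSH to a peer node (replace h4dnode-1 with an actual node name)
ssh h4dnode-1 hostname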
Set environment variables for Intel MPI
You can optionally set the following environment variables:
To pin MPI processes within NUMA nodes, potentially improving locality, set the following:
export I_MPI_PIN_DOMAIN=numa
To increase output verbosity for troubleshooting and debugging, use the following:
export I_MPI_DEBUG=5
Optimize and scale MPI with Cloud RDMA
To achieve the best performance with MPI on H4D VMs, use the following configuration steps to optimize and scale the performance of your MPI applications.
Network tuning for RDMA
Use the following information when configuring your H4D instances to use Cloud RDMA:
- Interface selection: Explicitly configure your MPI library to use the IRDMA network interface for inter-node communication. This is often configured by using environment variables that select the fabric provider, for example FI_PROVIDER for MPIs based on libfabric. See Set up the MPI library.
- MTU: The RDMA network supports a large MTU size. Ensure that your interface and MPI configuration take advantage of a larger MTU size to reduce overhead. If you use Cluster Toolkit to deploy your cluster, then the MTU size is set to 8896. A verification sketch follows this list.
- Buffer sizes: Tuning MPI buffer sizes can sometimes improve performance, but default settings are often a good starting point.
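As a quick check of the points above, you can confirm that the verbs provider is visible to libfabric and inspect the MTU on the RDMA interface; this sketch assumes the libfabric utilities are installed, and the interface name varies by image:
# List libfabric providers; the verbs provider should be reported for the RDMA interface
fi_info -p verbs
# Inspect interfaces and their MTU values; look for the IRDMA interface
ip link show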
Process pinning
Binding MPI processes to specific CPU cores is crucial for performance, especially on NUMA systems like H4D instances. This minimizes remote memory access and helps to ensure consistent performance.
To control pinning, use your MPI library's mechanisms, such as the I_MPI_PIN_DOMAIN environment variable or command-line options with mpirun like -genv I_MPI_PIN_PROCESSOR_LIST.
Experiment with different pinning strategies (for example, per core or per NUMA
node) based on your application's characteristics. Simultaneous multi-threading
(SMT) is disabled on H4D instances, so each vCPU represents a physical core.
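For example, the following are hedged sketches of two pinning approaches with Intel MPI; the rank counts, core list, and application name are placeholders:
# Pin one MPI rank per NUMA domain (placeholder rank counts)
mpirun -genv I_MPI_PIN_DOMAIN numa -n TOTAL_PROCESSES -ppn PROCESSES_PER_NODE ./YOUR_APPLICATION
# Or pin ranks to an explicit core list (H4D exposes one vCPU per physical core)
mpirun -genv I_MPI_PIN_PROCESSOR_LIST 0-191 -n TOTAL_PROCESSES -ppn 192 ./YOUR_APPLICATION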
Optimize MPI collectives
Collective communication, such as MPI_Bcast or MPI_Allreduce, can significantly impact performance at scale. Intel MPI provides environment variables that let you select different algorithms for collectives, and it often has tuned algorithms for various message sizes and process counts.
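With Intel MPI, the I_MPI_ADJUST_* family of environment variables overrides the algorithm used for a specific collective; the value below is purely illustrative, and the best setting depends on your message sizes and scale:
# Force a specific MPI_Allreduce algorithm (illustrative value, not a recommendation)
export I_MPI_ADJUST_ALLREDUCE=2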
Compile applications
Compile your HPC applications with compiler flags optimized for the AMD EPYC Turin architecture.
- Use modern compilers: Use recent versions of GNU compiler collection (GCC), Intel compilers, or AMD Optimizing C/C++ Compiler (AOCC), if available within your environment.
- Architecture flags: Use flags like -march=znver5 for the AMD EPYC Turin (Zen 5) architecture if using GCC or AOCC.
- Optimization levels: Employ appropriate optimization levels, for example -O2, -O3, or -Ofast.
- Vectorization: Ensure vectorization is enabled, which is often the default when using optimization level -O2 or higher.
- Link time optimization (LTO): Consider using LTO with flags like -flto.
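For example, the following is a hedged build command using an MPI compiler wrapper; the source and binary names are placeholders, and -march=znver5 assumes a compiler recent enough to support Zen 5 (for example, GCC 14 or later):
# Illustrative build with optimization, Zen 5 architecture flag, and LTO
mpicc -O3 -march=znver5 -flto -o my_app my_app.c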
Application scaling optimizations
- Load balancing: Ensure work is evenly distributed across MPI processes.
- Communication patterns: Analyze and optimize communication patterns to reduce synchronization overhead and latency. Use point-to-point communication instead of collectives where possible, or use non-blocking operations.
- I/O: For large-scale jobs, parallel I/O solutions like Lustre or other parallel file systems accessible from your cluster are critical to avoid bottlenecks. H4D supports Hyperdisk Balanced disks, which have capped performance; for I/O-intensive needs, consider Local SSD or Parallelstore.
Keep the drivers up to date
If you disable automatic updates on your H4D instances, then you should
regularly run the dnf update
command on the instance to keep the Cloud RDMA
driver up to date.
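One hedged way to apply updates regularly is a cron entry; the schedule and file name below are assumptions, not a Google-provided configuration:
# Hypothetical weekly update job (Sundays at 03:00); align the schedule with your maintenance windows
echo '0 3 * * 0 root dnf -y update' | sudo tee /etc/cron.d/weekly-dnf-update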
Alternatively, if you used Cluster Toolkit to create your H4D
instances, then you can use the install_cloud_rdma_drivers
setting in the
startup-script
module to ensure that the latest Cloud RDMA drivers are
installed on instance startup.
Run MPI applications
To run your MPI application, use the mpirun
command from your chosen MPI
library.
Intel MPI
Create a host file that lists the network names of the H4D instances in the cluster. Then use the following command:
# Example for Intel MPI
mpirun -n TOTAL_PROCESSES -ppn PROCESSES_PER_NODE -hostfile HOST_FILE ./YOUR_APPLICATION
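For example, the following is a hedged end-to-end invocation for a hypothetical two-node run; the host names, process counts, and binary name are placeholders:
# Create a host file listing the nodes (hypothetical names)
cat << EOF > hostfile
h4dnode-1
h4dnode-2
EOF
# Launch 384 ranks total, 192 per node
mpirun -n 384 -ppn 192 -hostfile hostfile ./my_app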
Intel MPI within a Slurm script
To run your MPI application, use the mpirun
command from your chosen MPI
library within your Slurm job script.
#!/bin/bash
# Example for Intel MPI in a Slurm script
#SBATCH --nodes=NUMBER_OF_NODES
#SBATCH --ntasks-per-node=PROCESSES_PER_NODE
# Load Intel MPI module if necessary
module load intelmpi
# Set environment variables for RDMA
export FI_PROVIDER="verbs;ofi_rxm"
export FI_VERBS_INLINE_SIZE=39
export FI_OFI_RXM_BUFFER_SIZE=4096
export FI_UNIVERSE_SIZE=N_MPI_RANKS
# Run the application
mpirun ./your_application
Replace the following:
- NUMBER_OF_NODES: the number of instances in your cluster
- PROCESSES_PER_NODE: the number of processes per node
- N_MPI_RANKS: the number of MPI ranks. For example, if you have a 16-node cluster, you might use the value 3072, which is 192 * 16.
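To submit the script, use sbatch; the file name here is hypothetical. When the job runs inside a Slurm allocation, Intel MPI typically detects the allocated nodes, so no explicit host file is needed:
# Submit the job script to Slurm (hypothetical file name)
sbatch h4d_mpi_job.sh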
What's next
- Explore Best practices for running HPC workloads.
- Review the H4D machine series documentation.
- Consult the documentation for your specific MPI library and HPC application for detailed tuning and environment variable options.
- Explore the Cluster Toolkit GitHub repository for HPC examples.