Set up and scale MPI applications on H4D VMs with Cloud RDMA

H4D instances on Google Cloud are optimized for High Performance Computing (HPC) workloads and feature Cloud RDMA technology to deliver low-latency, high-bandwidth inter-node communication. This guide provides procedures for setting up and configuring Message Passing Interface (MPI) libraries to take advantage of Cloud RDMA on your H4D clusters. It also provides best practices for compiling and scaling your MPI applications on H4D instances.

Before you begin

Before you attempt any of the tasks in this document, you must meet the following prerequisites:

Set up the MPI library

Google Cloud has validated Cloud RDMA on H4D with Intel MPI and Open MPI. Your application's requirements or recommendations should determine which MPI library you use. Many applications are built and tuned for a specific MPI implementation. Google recommends using Intel MPI version 2021.13.1.

Configure Intel MPI

Install the Intel MPI library on all nodes in the cluster. Refer to Intel's official documentation for the latest installation procedures. If you use Cluster Toolkit, then it runs scripts that typically handle the installation.
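
If you install Intel MPI manually, a minimal sketch using Intel's oneAPI package repository might look like the following. The repository details, package names, and setvars.sh path are assumptions based on Intel's public package repositories, so defer to Intel's documentation for the authoritative steps:

# Sketch: add Intel's oneAPI DNF repository and install Intel MPI (verify details against Intel's docs).
sudo tee /etc/yum.repos.d/oneAPI.repo << 'EOF'
[oneAPI]
name=Intel oneAPI repository
baseurl=https://yum.repos.intel.com/oneapi
enabled=1
gpgcheck=1
repo_gpgcheck=1
gpgkey=https://yum.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB
EOF
sudo dnf install -y intel-oneapi-mpi intel-oneapi-mpi-devel
# Load the Intel MPI environment in each shell that runs MPI jobs.
source /opt/intel/oneapi/setvars.sh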

To ensure that Intel MPI uses the Cloud RDMA interface, set environment variables in your job scripts or user environment:

export FI_PROVIDER="verbs;ofi_rxm"
export FI_VERBS_INLINE_SIZE=39
export FI_OFI_RXM_BUFFER_SIZE=4096
export FI_UNIVERSE_SIZE=N_MPI_RANKS

Replace N_MPI_RANKS with the total number of MPI ranks, based on the number of H4D instances that you have provisioned. For example, for a 16-node cluster with 192 ranks per node, use the value 3072 (192 * 16).
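
In a job script, you can derive this value by multiplying the per-node rank count by the node count, as in the following sketch. The 192 ranks per node and the NUM_NODES variable are illustrative assumptions:

# Compute FI_UNIVERSE_SIZE from the node count, assuming 192 ranks per H4D instance.
NUM_NODES=16                                    # set to your cluster size
export FI_UNIVERSE_SIZE=$(( 192 * NUM_NODES ))  # 3072 for a 16-node cluster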

Configure the guest OS environment

After you have created the H4D instances, configure the guest environment.

Set up user limits for MPI

In the guest OS of each H4D instance, increase the memory lock and open file limits, using the values shown in the following example:

cat << EOF | sudo tee -a /etc/security/limits.conf
*  hard    memlock         unlimited
*  soft    memlock         unlimited
*  hard    nofile          65535
*  soft    nofile          65535
EOF
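
After the settings take effect (for example, after you log in again), you can verify them with ulimit. This is a quick sanity check rather than a required step:

# Verify the limits in a new login session.
ulimit -l   # expected: unlimited (max locked memory)
ulimit -n   # expected: 65535 (max open files)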

Set up SSH keys for MPI

Set up SSH keys for MPI applications that require them. The following example syntax assumes there is a shared home directory for all nodes. Otherwise, the .ssh directory must be copied to each node.

ssh-keygen -f /home/$USER/.ssh/id_rsa -t rsa -N ''
cat << EOF > /home/$USER/.ssh/config
Host *
    StrictHostKeyChecking no
EOF
cat /home/$USER/.ssh/id_rsa.pub >> /home/$USER/.ssh/authorized_keys
chmod 600 /home/$USER/.ssh/authorized_keys
chmod 644 /home/$USER/.ssh/config
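
To confirm that passwordless SSH works from the login node to every other node, you can loop over a host list. The hostfile.txt name here is a hypothetical file that contains one node name per line:

# Check passwordless SSH to each node listed in hostfile.txt (hypothetical file name).
while read -r host; do
  ssh -o BatchMode=yes "$host" hostname
done < hostfile.txt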

Set environment variables for Intel MPI

You can optionally set the following environment variables:

  • To pin MPI processes within NUMA nodes, potentially improving locality, set the following:

    export I_MPI_PIN_DOMAIN=numa
    
  • To increase output verbosity for troubleshooting and debugging, use the following:

    export I_MPI_DEBUG=5
    

Optimize and scale MPI with Cloud RDMA

To achieve the best performance with MPI on H4D VMs, use the following configuration steps to optimize and scale the performance of your MPI applications.

Network tuning for RDMA

Use the following information when configuring your H4D instances to use Cloud RDMA:

  • Interface selection: Explicitly configure your MPI library to use the IRDMA network interface for inter-node communication. This is often configured by using environment variables that select the fabric provider, for example FI_PROVIDER for MPIs based on libfabric. See Set up the MPI library.
  • MTU: The RDMA network supports a large MTU size. Ensure that your interface and MPI configuration take advantage of the larger MTU size to reduce overhead; you can inspect the interface MTU as shown in the example after this list. If you use Cluster Toolkit to deploy your cluster, then the MTU size is set to 8896.
  • Buffer sizes: Tuning MPI buffer sizes can sometimes improve performance, but the default settings are often a good starting point.
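
To check the MTU that the guest OS reports, you can use standard tooling. The commands below are a quick inspection sketch; the name of the RDMA interface varies by image:

# List interfaces with their MTU values; the RDMA interface name varies by image.
ip -brief link show
# If rdma-core is installed, you can also list RDMA link devices.
rdma link show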

Process pinning

Binding MPI processes to specific CPU cores is crucial for performance, especially on NUMA systems like H4D instances. This minimizes remote memory access and helps to ensure consistent performance.

Use your MPI library's pinning controls, such as the I_MPI_PIN_DOMAIN environment variable for Intel MPI, or command-line options with mpirun like -genv I_MPI_PIN_PROCESSOR_LIST. Experiment with different pinning strategies (for example, per core or per NUMA node) based on your application's characteristics. Simultaneous multi-threading (SMT) is disabled on H4D instances, so each vCPU represents a physical core.
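
For example, the following mpirun invocations show the two pinning controls mentioned above. The process counts, processor list, and binary name are illustrative only:

# Pin one MPI process domain per NUMA node (illustrative counts and binary name).
mpirun -n 3072 -ppn 192 -genv I_MPI_PIN_DOMAIN=numa ./my_app
# Or pin ranks to an explicit processor list.
mpirun -n 3072 -ppn 192 -genv I_MPI_PIN_PROCESSOR_LIST=0-191 ./my_app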

Optimize MPI collectives

Collective communication, such as MPI_Bcast or MPI_Allreduce, can significantly impact performance at scale.

Most MPI libraries provide tuning options that let you select different algorithms for collectives. Intel MPI often has tuned algorithms for various message sizes and process counts.
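
With Intel MPI, the I_MPI_ADJUST_* family of environment variables selects collective algorithms. The algorithm numbers below are placeholders; consult the Intel MPI reference for the meaning of each value:

# Select collective algorithms explicitly (algorithm numbers are placeholders).
export I_MPI_ADJUST_ALLREDUCE=2
export I_MPI_ADJUST_BCAST=1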

Compile applications

Compile your HPC applications with compiler flags optimized for the AMD EPYC Turin architecture.

  • Use modern compilers: Use recent versions of GNU compiler collection (GCC), Intel compilers, or AMD Optimizing C/C++ Compiler (AOCC), if available within your environment.
  • Architecture flags: Use flags like -march=znver5 for the AMD EPYC Turin architecture if using GCC or AOCC (see the example compile command after this list).
  • Optimization levels: Employ appropriate optimization levels, for example -O2, -O3, or -Ofast.
  • Vectorization: Ensure vectorization is enabled, which is often the default when using optimization level -O2 or higher.
  • Link time optimization (LTO): Consider using LTO with flags like -flto.
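
As a sketch, a compile command that combines these recommendations might look like the following. The compiler wrapper, source file, and exact flags are assumptions to adapt to your toolchain:

# Example compile with GCC through the Intel MPI compiler wrapper.
# -march=znver5 requires GCC 14 or later; older compilers can fall back to -march=znver4.
mpicc -O3 -march=znver5 -flto -o my_app my_app.c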

Application scaling optimizations

  • Load balancing: Ensure that work is evenly distributed across MPI processes.
  • Communication patterns: Analyze and optimize communication patterns to reduce synchronization overhead and latency. Use point-to-point communication instead of collectives where possible, or use non-blocking operations.
  • I/O: For large-scale jobs, parallel I/O solutions such as Lustre or other parallel file systems accessible from your cluster are critical to avoid bottlenecks. H4D supports Hyperdisk Balanced disks with capped performance; for I/O-intensive needs, consider Local SSD or Parallelstore.

Keep the drivers up to date

If you disable automatic updates on your H4D instances, then you should regularly run the dnf update command on the instance to keep the Cloud RDMA driver up to date.
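
For example, on RHEL-based images you can run the update manually. Whether a reboot is required depends on which packages change:

# Apply available package updates, including the Cloud RDMA driver packages.
sudo dnf update -y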

Alternatively, if you used Cluster Toolkit to create your H4D instances, then you can use the install_cloud_rdma_drivers setting in the startup-script module to ensure that the latest Cloud RDMA drivers are installed on instance startup.

Run MPI applications

To run your MPI application, use the mpirun command from your chosen MPI library.

Intel MPI

Create a host file that lists the network names of the H4D instances in the cluster. Then use the following command:

# Example for Intel MPI
mpirun -n TOTAL_PROCESSES -ppn PROCESSES_PER_NODE -hostfile HOST_FILE ./YOUR_APPLICATION

Replace the following:

  • TOTAL_PROCESSES: the total number of MPI processes across all nodes
  • PROCESSES_PER_NODE: the number of processes per node
  • HOST_FILE: the path to your host file
  • YOUR_APPLICATION: the path to your application binary
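
A concrete invocation for a hypothetical two-node test might look like the following; the node names, process counts, and binary name are placeholders:

# Create a host file for two hypothetical nodes and launch 192 processes per node.
printf 'h4d-node-0\nh4d-node-1\n' > hostfile.txt
mpirun -n 384 -ppn 192 -hostfile hostfile.txt ./my_app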

Intel MPI within a Slurm script

To run your MPI application, use the mpirun command from your chosen MPI library within your Slurm job script.

#!/bin/bash
# Example for Intel MPI in a Slurm script
#SBATCH --nodes=NUMBER_OF_NODES
#SBATCH --ntasks-per-node=PROCESSES_PER_NODE

# Load Intel MPI module if necessary
module load intelmpi

# Set environment variables for RDMA
export FI_PROVIDER="verbs;ofi_rxm"
export FI_VERBS_INLINE_SIZE=39
export FI_OFI_RXM_BUFFER_SIZE=4096
export FI_UNIVERSE_SIZE=N_MPI_RANKS

# Run the application
mpirun ./your_application

Replace the following:

  • NUMBER_OF_NODES: the number of instances in your cluster
  • PROCESSES_PER_NODE: the number of processes per node
  • N_MPI_RANKS: the total number of MPI ranks. For example, for a 16-node cluster with 192 ranks per node, use 192 * 16, or 3072.
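
After you save the job script, submit it with sbatch. The script name below is a placeholder:

sbatch mpi_job.sh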

What's next