Best practices for running tightly coupled HPC applications on Compute Engine

This document provides best practices for tuning Google Cloud resources for optimal Message Passing Interface (MPI) performance. Tightly coupled High Performance Computing (HPC) workloads often use MPI to communicate between processes and instances. Proper tuning of the underlying systems and network infrastructure is essential for optimal MPI performance. If you run MPI-based code in Google Cloud, use these practices to get the best possible performance.

Assumptions and requirements

Typically, workload schedulers such as Slurm or HTCondor are used to manage instances. The recommendations and best practices in this document apply to all schedulers and workflow managers.

Implementation of these best practices using the various schedulers or workflow tools is beyond the scope of this document. Other documents and tutorials describe those tools and how to apply these guidelines with them.

The guidelines in this document are general and might not benefit all applications. We recommend that you benchmark your applications to find the most efficient or cost-effective configuration.

Compute Engine configuration

This section includes best practices for getting the best compute performance for your application. Using the right machine type and system settings can have a significant impact on MPI performance.

Use compact placement policy

Placement policy gives you control over the placement of your virtual machines (VMs) in data centers. Compact placement policy provides lower-latency topologies for VM placement in a single availability zone. Current APIs let you create up to 22 compute-optimized (C2) VMs that are physically close to each other. If you need more than 22 C2 instances (more than 660 physical cores), divide your instances into multiple placement policies. We recommend using the minimum number of placement policies that accommodates your workload.

To use placement policies, first create a collocated placement policy with the required number of VMs in a given region:

gcloud compute resource-policies create group-placement \
    PLACEMENT_POLICY_NAME --collocation=collocated \
    --vm-count=NUMBER_OF_VMS

Then create instances using the policy in the required zone:

gcloud compute instances create instance-1 \
    instance-2...instance-n --zone=us-central1-a \
    --resource-policies=PLACEMENT_POLICY_NAME \
    --maintenance-policy=TERMINATE ...

In some cases, you might not have direct control over how VMs are created. For example, the VMs might be created through the use of some unintegrated third-party tools. To apply a placement policy to existing VMs, first stop the VMs, and then run the following command:

gcloud compute instances add-resource-policies \
    instance-1 instance-2...instance-n --zone=us-central1-a \
    --resource-policies=PLACEMENT_POLICY_NAME

Use compute-optimized instances

We recommend using C2 instances to run HPC applications. These instances have a fixed virtual-to-physical core mapping and expose the NUMA cell architecture to the guest OS, both of which are critical for the performance of tightly coupled HPC applications.

C2 instances have up to 60 vCPUs (30 physical cores) and 240 GB of RAM. They can also have up to 3 TB of local SSD storage and can support up to 32 Gbps of network throughput. C2 instances use 2nd Generation Intel Xeon Scalable Processors (Cascade Lake), which provide more memory bandwidth and a higher clock speed (up to 3.8 GHz) compared to other instance types. C2 instances typically provide up to a 40% improvement in performance compared to N1 instance types.

To reduce the communication overhead between machines, we recommend consolidating your workload into a smaller number of C2-standard-60 VMs (with the same total core count) instead of launching a larger number of smaller C2 VMs.
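
For example, a c2-standard-60 node that uses the placement policy from the previous section might be created as follows. This is a minimal sketch; the instance name and zone are illustrative, and image, disk, and network flags are omitted:

# mpi-node-1 and the zone are example values; add image and disk flags as needed.
gcloud compute instances create mpi-node-1 \
    --zone=us-central1-a \
    --machine-type=c2-standard-60 \
    --resource-policies=PLACEMENT_POLICY_NAME \
    --maintenance-policy=TERMINATE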

Disable Hyper-Threading

Some MPI applications get better performance by disabling Hyper-Threading in the guest OS. Hyper-Threading allocates two virtual cores (vCPU) per physical core on the node. For many general computing tasks or tasks that require lots of I/O, Hyper-Threading can increase application throughput significantly. For compute-bound jobs in which both virtual cores are compute bound, Hyper-Threading can hinder overall application performance and can add nondeterministic variance to jobs. Turning off Hyper-Threading allows more predictable performance and can decrease job times.

You can disable Hyper-Threading on C2, N1-Ultramem, and N2 instances, either online or by rebooting.

To disable Hyper-Threading online (without rebooting), use a script like Manage_Hyperthreading.sh, which takes the sibling hyper-thread CPUs offline so that each physical core keeps a single active thread.
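
The following is a minimal sketch of such a script. It assumes the C2 sibling layout shown later in this document, where each thread_siblings_list file reads like 0,30, and it takes the second sibling of each core offline:

#!/bin/bash
# Sketch only: assumes sibling lists use the comma-separated "N,M" format (as on C2 VMs).
for siblings_file in /sys/devices/system/cpu/cpu*/topology/thread_siblings_list; do
  sibling=$(cut -s -d, -f2 "$siblings_file" 2>/dev/null)
  if [ -n "$sibling" ]; then
    # Take the second hyper-thread of this core offline; requires root.
    echo 0 | sudo tee "/sys/devices/system/cpu/cpu${sibling}/online" > /dev/null
  fi
done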

To disable Hyper-Threading by rebooting, add the following to the GRUB_CMDLINE_LINUX string in /etc/default/grub, where NUM_CPU is the number of vCPUs in your instance divided by two. For example, for a C2-standard-60 instance, NUM_CPU would be 30.

noht nosmt nr_cpus=NUM_CPU

After you change the grub file, run the following to update the GRUB system configuration file, and then reboot the system:

sudo grub2-mkconfig -o /boot/efi/EFI/centos/grub.cfg

If your system has a legacy BIOS boot mode, run the following command instead:

sudo grub2-mkconfig -o /boot/grub2/grub.cfg

To verify that SMT (Hyper-Threading) is disabled in the system after reboot, use the following command:

lscpu | grep -e Socket -e Core -e Thread

The output is similar to the following, where Thread(s) per core is 1 when Hyper-Threading is disabled, and 2 when Hyper-Threading is enabled:

Thread(s) per core:    1
Core(s) per socket:    15
Socket(s):             2

Adjust user limits

Unix systems have default limits on system resources like open files and numbers of processes that any one user can use. These limits prevent one user from monopolizing the system resources and affecting other users' work. In the context of HPC, however, these limits are typically unnecessary because the compute nodes in the cluster aren't directly shared between users.

You can adjust user limits by editing the /etc/security/limits.conf file and logging in to the node again. For automation, you can bake these changes into a VM image, or adjust limits at the time of deployment by using tools like Deployment Manager, Terraform, or Ansible.

When you adjust user limits, change the values for the following limits:

  • nproc - maximum number of processes
  • memlock - maximum locked-in-memory address space (KB)
  • stack - maximum stack size (KB)
  • nofile - maximum number of open files
  • cpu - maximum CPU time (minutes)
  • rtprio - maximum real-time priority allowed for non-privileged processes (Linux 2.6.12 and higher)

These limits are configured in the /etc/security/limits.conf system configuration file for most Unix and Linux systems, including Debian, CentOS, and Red Hat.

To change user limits, use a text editor to change the following values:

  • In /etc/security/limits.conf:

    *            -     nproc     unlimited
    *            -     memlock   unlimited
    *            -     stack     unlimited
    *            -     nofile    1048576
    *            -     cpu       unlimited
    *            -     rtprio    unlimited
    
  • In /etc/security/limits.d/20-nproc.conf:

    *            -    nproc      unlimited
    

Set up SSH host keys

Intel MPI requires host keys for all of the cluster nodes in the ~/.ssh/known_hosts file of the node that executes mpirun. You must also add your SSH public key to the ~/.ssh/authorized_keys file on each node.

To add host keys, run the following:

ssh-keyscan -H $(cat HOSTFILE) >> ~/.ssh/known_hosts

Another way to do this is to disable strict host key checking by adding the following entry to the ~/.ssh/config file:

Host *
StrictHostKeyChecking no
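
To satisfy the authorized_keys requirement, you can generate a passphrase-less key and authorize it, as in the following sketch. Your scheduler, image pipeline, or OS Login setup might already handle this; the key type and paths are illustrative:

# Illustrative commands: create a key without a passphrase and authorize it locally.
ssh-keygen -t rsa -N '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys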

Storage

Performance of many HPC applications strongly depends on the performance of the underlying storage system. This is especially true for applications that read or write a lot of data or that create or access many files or objects. It's also true when a lot of ranks access the storage system simultaneously.

Choose an NFS file system or parallel file system

Following are the primary storage choices for tightly coupled applications. Each choice has its own cost, performance profile, APIs, and consistency semantics:

  • NFS-based solutions such as Filestore and NetApp Cloud Volumes are the easiest way to deploy shared storage. Both options are fully managed on Google Cloud, and they are best when the application doesn't have extreme I/O requirements to a single dataset and has limited or no data sharing between compute nodes during application execution and updates. For performance limits, see the Filestore and NetApp Cloud Volumes documentation.
  • POSIX-based parallel file systems are more commonly used by MPI applications. POSIX-based options include open source Lustre and the fully supported Lustre offering, DDN Storage EXAScaler Cloud. When compute nodes generate and share data, they frequently rely on the extreme performance provided by parallel file systems and support for full POSIX semantics. Parallel file systems like Lustre deliver data to the largest supercomputers and can support thousands of clients. Lustre also supports data and I/O libraries such as NetCDF and HDF5, along with MPI-IO, enabling parallel I/O for a wide set of application domains.

Choose a storage infrastructure

Your application's performance requirements should guide your choice of storage infrastructure or storage tier for the file system. For example, if you deploy SSDs for applications that don't need high I/O operations per second (IOPS), you might increase costs without much benefit.

The managed storage services Filestore and NetApp Cloud Volumes offer several performance tiers that scale based on capacity.
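
For example, you can create a Filestore instance with a tier that matches your performance needs, as in the following sketch. The instance name, tier, capacity, and network are illustrative; check the Filestore documentation for current tier names and minimum capacities:

# Example values only; verify tier names and minimum capacities in the Filestore docs.
gcloud filestore instances create hpc-share \
    --zone=us-central1-a \
    --tier=BASIC_SSD \
    --file-share=name=share1,capacity=2.5TB \
    --network=name=default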

To determine the correct infrastructure for open source Lustre or DDN Storage EXAScaler Cloud, you must first understand the vCPU and capacity that is required to achieve the needed performance with standard persistent disk, SSD persistent disk, or local SSD. For more information about how to determine the correct infrastructure, see Block storage performance information and Optimizing persistent disk performance. For example, if you use Lustre, you can deploy low-cost and high-bandwidth solutions by using SSD persistent disk for the metadata server (MDS) and standard persistent disk for the storage servers (OSSs).

Network settings

MPI networking performance is critical for many HPC applications. This is especially true for tightly coupled applications in which MPI processes on different nodes communicate frequently or with large data volume. This section includes best practices to tune your network settings for optimal MPI performance.

Increase tcp_*mem settings

C2 machines can support up to 32 Gbps of bandwidth, which requires larger TCP buffer limits than the default Linux settings provide. For better network performance, increase the tcp_rmem and tcp_wmem values.

To increase TCP memory limits, update the following values in /etc/sysctl.conf:

net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 16384 16777216

To load the new values in /etc/sysctl.conf, run sysctl -p.
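
To confirm that the new values are active, you can query them directly:

sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem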

Use the network-latency profile

You can improve the performance of some applications by enabling busy polling. Busy polling helps reduce latency in the network receive path by allowing socket layer code to poll the receive queue of a network device and by disabling network interrupts. Evaluate your application's latency to see if busy polling helps.

The network-latency profile persists across reboots. If your system has tuned-adm installed, you can enable the network-latency profile by running the following command in CentOS:

tuned-adm profile network-latency

If your system doesn't have tuned-adm installed, you can enable busy polling by adding the following to /etc/sysctl.conf:

net.core.busy_poll = 50
net.core.busy_read = 50

To load the new values in /etc/sysctl.conf, run sysctl -p.

MPI libraries and user applications

MPI library settings and HPC application configurations can affect application performance. To achieve the best performance for HPC applications, it's important to fine-tune those settings or configurations. This section includes best practices for tuning your MPI libraries and user applications for optimal MPI performance.

Use Intel MPI

For best performance, we recommend that you use Intel MPI.

For Intel MPI 2018 versions, specify the underlying communication fabrics so that shared memory is used within a node and TCP is used between nodes when running on Google Cloud:

mpirun -hostfile HOSTFILE -np NUM_PROCESSES \
    -ppn PROCESSES_PER_NODE -genv I_MPI_FABRICS "shm:tcp" APPLICATION

For Intel MPI 2019+ versions, specify the underlying communication fabrics to use TCP over the RxM OFI provider for better performance:

mpirun -hostfile HOSTFILE -np NUM_PROCESSES \
    -ppn PROCESSES_PER_NODE -genv I_MPI_FABRICS "ofi_rxm;tcp" APPLICATION

Use mpitune to do MPI Collective tuning

MPI implementations such as Intel MPI and OpenMPI have many internal configuration parameters that can affect communication performance. These parameters are especially relevant for MPI Collective communication, which lets you specify algorithms and configuration parameters that can perform very differently in the Google Cloud environment. We strongly recommend that you tune configuration parameters based on the characteristics of your applications.

To manually specify the algorithms and configuration parameters for MPI Collective communication, we recommend using mpitune. To run an mpitune command, you must have write access to the Intel MPI installation directory or run the command as root.

When you use mpitune, you tune for each combination of the number of VMs and the number of processes per VM. For example, if you're using Intel MPI 2018 versions, you can tune for 22 VMs and 30 processes per VM by running the following:

mpitune -hf HOSTFILE -fl 'shm:tcp' -pr 30:30 -hr 22:22

The preceding mpitune command generates a configuration file in the Intel MPI directory that you can use later to run applications.

To use the tuning configuration for an application, you add the -tune option to the mpirun command:

mpirun -tune -hostfile HOSTFILE -genv I_MPI_FABRICS 'shm:tcp' -np 660 -ppn 30 ./app

To enable tuning, you must explicitly supply the I_MPI_FABRICS environment variable, and the number of VMs and the number of processes per VM must match the values used during tuning.

You can find the generated tuning files in the Intel MPI installation directory at intel/mpi/2018.1/etc64.

The filenames encode the device, the number of nodes, and other information, like the following:

mpiexec_shm-tcp_nn_22_np_660_ppn_30.conf

You can reuse the file on a different set of VMs with similar configurations by copying the file into the corresponding directory and adding the -tune option to the mpirun command. You can also optionally add the config file as an argument after the -tune parameter.

Use MPI OpenMP hybrid mode

Many MPI applications support a hybrid mode that you can use to enable OpenMP in MPI applications. In hybrid mode, each MPI process can use a fixed number of threads to accelerate the execution of certain loop structures.

We recommend that you explore the hybrid mode option when you want to optimize application performance. Using hybrid mode can result in fewer MPI processes on each VM, leading to less inter-process communication and lower overall communication time.

Enabling hybrid mode or OpenMP is application dependent. In many cases, you can enable hybrid mode by setting the following environment variable:

export OMP_NUM_THREADS=NUM_THREADS

When you use this hybrid approach, we recommend that the total number of threads doesn't exceed the number of physical cores in the VM. The C2-standard-60 VMs have 2 NUMA sockets of 15 cores and 30 vCPUs each. We recommend that you don't have any MPI process with OpenMP threads spanning multiple NUMA nodes.
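
For example, on C2-standard-60 VMs with Hyper-Threading disabled, one common hybrid layout is two MPI ranks per VM (one per NUMA node) with 15 OpenMP threads each. The following Intel MPI invocation is a sketch of that layout; HOSTFILE, TOTAL_RANKS (2 × the number of VMs), and the application name are placeholders:

# Sketch: 2 ranks per VM, 15 OpenMP threads per rank, each rank confined to one NUMA node.
export OMP_NUM_THREADS=15
mpirun -hostfile HOSTFILE -np TOTAL_RANKS -ppn 2 \
    -genv OMP_NUM_THREADS 15 -genv I_MPI_PIN_DOMAIN numa ./app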

Compile applications using vector instructions and the Math Kernel Library

C2 VMs support the AVX2 and AVX512 vector instructions. You can improve the performance of many HPC applications by compiling them with AVX instructions. Some applications perform better with AVX2 than with AVX512, so we recommend benchmarking both instruction sets for your workload. For better performance of scientific computation, we also recommend using the Intel Math Kernel Library (Intel MKL).
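
For example, with GCC you might target the Cascade Lake microarchitecture or force a specific AVX level, and with the classic Intel compiler you might select AVX-512 code generation and link MKL. Treat the following flags as starting points to benchmark, not fixed recommendations; app.c is a placeholder source file, and -march=cascadelake requires a sufficiently recent GCC:

# GCC: target Cascade Lake, or force AVX2 only.
gcc -O3 -march=cascadelake -o app app.c
gcc -O3 -mavx2 -o app app.c

# Classic Intel compiler: AVX-512 code generation plus the MKL link convenience flag.
icc -O3 -xCORE-AVX512 -mkl -o app app.c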

Use appropriate CPU numbering

C2-standard-60 has two NUMA sockets, and the CPUs are numbered with respect to NUMA nodes as follows:

NUMA node0 CPU(s):     0-14,30-44
NUMA node1 CPU(s):     15-29,45-59

The following diagram illustrates the assignment of CPU numbers to each of the CPUs in a C2-standard-60 instance's NUMA nodes. Socket 0 corresponds to NUMA node0 with CPUs 0-14 and 30-44. Socket 1 corresponds to NUMA node1 with CPUs 15-29 and 45-59.

Virtual core numbering for C2-standard-60 nodes.

The hyper-thread siblings that map to a single core in the VM are (0,30)(1,31)..(29,59).

Intel MPI uses the NUMA CPU numbers for processor pinning of MPI jobs. To use a single hyper-thread per core on all nodes, consistently across runs, use CPU numbers 0-29.
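
One way to do this with Intel MPI is to set the processor list explicitly, as in the following sketch (placeholders as before; the default pinning might already behave the way you want):

# Pin ranks to the first hyper-thread of each physical core (CPUs 0-29 on C2-standard-60).
mpirun -hostfile HOSTFILE -np NUM_PROCESSES -ppn 30 \
    -genv I_MPI_PIN_PROCESSOR_LIST 0-29 ./app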

Open MPI uses logical CPU numbers as reported by Portable Hardware Locality (hwloc). If you use Open MPI, hyper-thread siblings are numbered consecutively as follows:

  • Socket 0: 0 (core 0 HT 0), 1 (core 0 HT 1), 2 (core 1 HT 0), ..., 28 (core 14 HT 0), 29 (core 14 HT 1)

  • Socket 1: 30 (core 0 HT 0), 31 (core 0 HT 1), 32 (core 1 HT 0), ..., 58 (core 14 HT 0), 59 (core 14 HT 1)

You can inspect this numbering by running the lstopo-no-graphics command. The output is similar to the following:

Machine (240GB total)
  NUMANode L#0 (P#0 120GB) + Package L#0 + L3 L#0 (25MB)
    L2 L#0 (1024KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
      PU L#0 (P#0)
      PU L#1 (P#30)
    L2 L#1 (1024KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
      PU L#2 (P#1)
      PU L#3 (P#31)

When you use Open MPI, you can use a single hyper-thread per core across all nodes, consistently across runs, by using CPU numbers 0,2,4,...,58. To force MPI to pin a process to a core, use the --bind-to core option when you run Open MPI, and then validate the correct binding by using the --report-bindings option.
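
For example, the following Open MPI invocation is one way to bind one rank to each physical core and print the resulting bindings for verification (HOSTFILE, NUM_PROCESSES, and the application name are placeholders):

# Bind each rank to a physical core and report the bindings so you can verify them.
mpirun --hostfile HOSTFILE -np NUM_PROCESSES --map-by core \
    --bind-to core --report-bindings ./app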

Security settings

You can improve MPI performance by disabling some built-in Linux security features. The performance benefit of disabling each of these features varies. If you are confident that your systems are well protected, you can evaluate disabling the following security features.

Disable Linux firewalls

For Google Cloud CentOS Linux images, the firewall is turned on by default. To disable the firewall, stop and disable the firewalld daemon by running the following commands:

sudo systemctl stop firewalld
sudo systemctl disable firewalld
sudo systemctl mask --now firewalld

Disable SELinux

SELinux in CentOS is turned on by default. To disable SELinux, edit the /etc/selinux/config file, and replace the line SELINUX=enforcing or SELINUX=permissive with SELINUX=disabled.
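
For example, one non-interactive way to make this edit is with sed (a simple sketch; consider backing up the file first):

# Replace the SELINUX mode line in place; requires root.
sudo sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/selinux/config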

You must reboot for this change to take effect.

Turn off Meltdown and Spectre mitigation

The following security patches are enabled by default on Linux systems:

  • Variant 1, Spectre: CVE-2017-5753
  • Variant 2, Spectre: CVE-2017-5715
  • Variant 3, Meltdown: CVE-2017-5754
  • Variant 4, Speculative Store Bypass: CVE-2018-3639

The security vulnerabilities described in these CVEs might be found in modern microprocessors, including the processors deployed in Google Cloud. You can disable one or more of these mitigations—and incur the associated security risks—by using the kernel command line at boot (persists across reboots), or by using debugfs at runtime (doesn't persist on reboot).

To permanently disable the preceding security mitigations, follow these steps:

  1. Modify the file /etc/default/grub:

    sudo sed -i 's/^GRUB_CMDLINE_LINUX=\"\(.*\)\"/GRUB_CMDLINE_LINUX=\"\1 spectre_v2=off nopti spec_store_bypass_disable=off\"/' /etc/default/grub
    
  2. After you change the grub file, run the following to update the GRUB system configuration file, and then reboot the system:

    sudo grub2-mkconfig -o /boot/efi/EFI/centos/grub.cfg
    

    If your system has a legacy BIOS boot mode, run the following command instead:

    sudo grub2-mkconfig -o /boot/grub2/grub.cfg
    
  3. Reboot.

If a system is already running, you can disable the preceding security mitigations by running the following commands. This doesn't persist across reboots.

echo 0 > /sys/kernel/debug/x86/pti_enabled
echo 0 > /sys/kernel/debug/x86/retp_enabled
echo 0 > /sys/kernel/debug/x86/ibrs_enabled
echo 0 > /sys/kernel/debug/x86/ssbd_enabled

For information about how different mitigations can impact your systems and how to control them, see the Red Hat documentation for Controlling the performance impact of microcode and security patches and Kernel side-channel attack using speculative store bypass.

To find the affected vulnerabilities of a CPU, run the following:

grep . /sys/devices/system/cpu/vulnerabilities/*

To find which mitigations are enabled, run the following:

grep . /sys/kernel/debug/x86/*_enabled

Checklist summary

The following table summarizes the best practices for using MPI on Compute Engine.

Area                                  Tasks
Compute Engine configuration          Use compact placement policy. Use compute-optimized (C2) instances. Disable Hyper-Threading. Adjust user limits. Set up SSH host keys.
Storage                               Choose an NFS file system or a parallel file system. Choose a storage infrastructure that matches your performance requirements.
Network settings                      Increase the tcp_rmem and tcp_wmem settings. Use the network-latency profile.
MPI libraries and user applications   Use Intel MPI. Use mpitune for MPI Collective tuning. Use MPI OpenMP hybrid mode. Compile applications using vector instructions and the Math Kernel Library. Use appropriate CPU numbering.
Security settings                     Disable Linux firewalls. Disable SELinux. Turn off Meltdown and Spectre mitigations where acceptable.

What's next