Parallel file systems for HPC workloads

This document introduces the storage options in Google Cloud for high-performance computing (HPC) workloads, and explains when to use parallel file systems like Lustre for these workloads. In a parallel file system, several clients use parallel I/O paths to access shared data that's stored across multiple networked storage nodes.

The information in this document is intended for architects and administrators who design, provision, and manage storage for data-intensive HPC workloads. The document assumes that you have a conceptual understanding of network file systems (NFS), parallel file systems, POSIX, and the storage requirements of HPC applications.

What is HPC?

HPC systems solve large computational problems quickly by aggregating multiple computing resources. HPC drives research and innovation across industries such as healthcare, life sciences, media, entertainment, financial services, and energy. Researchers, scientists, and analysts use HPC systems to perform experiments, run simulations, and evaluate prototypes. HPC workloads such as seismic processing, genomics sequencing, media rendering, and climate modeling generate and access large volumes of data at ever-increasing data rates and ever-decreasing latencies. High-performance storage and data management are critical building blocks of HPC infrastructure.

Storage options for HPC workloads in Google Cloud

Setting up and operating HPC infrastructure on-premises is expensive, and the infrastructure requires ongoing maintenance. In addition, on-premises infrastructure typically can't be scaled quickly to match changes in demand. Planning, procuring, deploying, and decommissioning hardware on-premises takes considerable time, which delays the addition of HPC resources or leaves capacity underutilized. In the cloud, you can efficiently provision HPC infrastructure that uses the latest technology, and you can scale your capacity on demand.

Google Cloud and our technology partners offer cost-efficient, flexible, and scalable storage options for deploying HPC infrastructure in the cloud and for augmenting your on-premises HPC infrastructure. Scientists, researchers, and analysts can quickly access additional HPC capacity for their projects when they need it.

To deploy an HPC workload in Google Cloud, you can choose from the following storage services and products, depending on the requirements of your workload:

Workload type | Recommended storage services and products
Workloads that need low-latency access to data but don't require extreme I/O to shared datasets, and that have limited data sharing between clients. | Use NFS storage. Choose from the following options:
Workloads that generate complex, interdependent, and large-scale I/O, such as tightly coupled HPC applications that use the Message-Passing Interface (MPI) for reliable inter-process communication. | Use a parallel file system. Choose from the following options:
For more information about the workload requirements that parallel file systems can support, see When to use parallel file systems.

When to use parallel file systems

In a parallel file system, several clients store and access shared data across multiple networked storage nodes by using parallel I/O paths. Parallel file systems are ideal for tightly coupled HPC workloads such as data-intensive artificial intelligence (AI) workloads and analytics workloads that use SAS applications. Consider using a parallel file system like Lustre for latency-sensitive HPC workloads that have any of the following requirements:

  • Tightly coupled data processing: HPC workloads like weather modeling and seismic exploration need to process data repetitively by using many interdependent jobs that run simultaneously on multiple servers. These processes typically use MPI to exchange data at regular intervals, and they use checkpointing to recover quickly from failures. Parallel file systems enable interdependent clients to store and access large volumes of shared data concurrently over a low-latency network.
  • Support for the POSIX I/O API and for POSIX semantics: Parallel file systems like Lustre are ideal for workloads that need both the POSIX API and POSIX semantics. A file system's API and its semantics are independent capabilities. For example, NFS supports the POSIX API, which is how applications read and write data by using functions like open(), read(), and write(). But the way NFS coordinates data access between different clients is not the same as POSIX semantics for coordinating data access between different threads on a machine. For example, NFS doesn't support POSIX read-after-write cache consistency between clients; it relies on weak consistency in NFSv3 and close-to-open consistency in NFSv4.
  • Petabytes of capacity: Parallel file systems can be scaled to multiple petabytes of capacity in a single file system namespace. NetApp Cloud Volumes Service and Filestore High Scale support up to 100 TiB per dataset. Cloud Storage offers low-cost and reliable capacity that scales automatically, but might not meet the data-sharing semantics and low-latency requirements of HPC workloads.
  • Low latency and high bandwidth: For HPC workloads that need high-speed access to very large files or to millions of small files, parallel file systems can outperform NFS and object storage. The latency offered by parallel file systems (0.5 ms to 10 ms) is significantly lower than that of object storage, and because each request completes sooner, lower latency can increase the maximum IOPS. In addition, the maximum bandwidth that's supported by parallel file systems can be orders of magnitude higher than in NFS-based systems. For example, DDN EXAScaler on Google Cloud has demonstrated 10+ Tbps read bandwidth, greater than 700 GBps write bandwidth, and 1.9 million file stat() calls per second using the IO500 benchmark.
  • Extreme client scaling: While NFS storage can support thousands of clients, parallel file systems can scale to support concurrent access to shared data from over 10,000 clients.
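
The difference between the POSIX API and POSIX semantics in the list above can be illustrated with a minimal Python sketch. It uses the os module's thin wrappers around open(), write(), and read() on a temporary local file; the helper name write_then_read is hypothetical. On a single machine, a read that follows a completed write returns the new data. The same calls also work on an NFS mount, but a second client might not observe the write until the file is closed and reopened.

```python
import os
import tempfile


def write_then_read(path: str, payload: bytes) -> bytes:
    """Write payload through one descriptor, then read it back through another."""
    # Writer: the POSIX open()/write() calls, exposed by Python's os module.
    wfd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        os.write(wfd, payload)
        # Reader: a second descriptor stands in for another thread or process
        # on the same machine. The writer is still open, yet the completed
        # write is already visible to the reader on a local file system.
        rfd = os.open(path, os.O_RDONLY)
        try:
            return os.read(rfd, len(payload))
        finally:
            os.close(rfd)
    finally:
        os.close(wfd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical file name, used only for illustration.
        data = write_then_read(os.path.join(tmp, "checkpoint.dat"), b"step-42")
        print(data)
```

The sketch only demonstrates the single-machine behavior; it doesn't reproduce the multi-client case, which requires two hosts that mount the same file system.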

Examples of tightly coupled HPC applications

This section describes examples of tightly coupled HPC applications that need the low-latency and high-throughput storage provided by parallel file systems.

AI-enabled molecular modeling

Pharmaceutical research is an expensive and data-intensive process. Modern drug research organizations rely on AI to reduce the cost of research and development, to scale operations efficiently, and to accelerate scientific research. For example, researchers use AI-enabled applications to simulate the interactions between the molecules in a drug and to predict the effect of changes to the compounds in the drug. These applications run on powerful, parallelized GPU processors that ingest, organize, and analyze an extreme amount of data to complete simulations quickly. Parallel file systems provide the storage IOPS and throughput that's necessary to maximize the performance of AI applications.

Credit risk analysis using SAS applications

Financial services institutions like mortgage lenders and investment banks need to constantly analyze and monitor the credit-worthiness of their clients and of their investment portfolios. For example, large mortgage lenders collect risk-related data about thousands of potential clients every day. Teams of credit analysts use analytics applications to collaboratively review different parts of the data for each client, such as income, credit history, and spending patterns. The insights from this analysis help the credit analysts make accurate and timely lending recommendations.

To accelerate and scale analytics for large datasets, financial services institutions use grid computing platforms such as SAS Grid Manager. Parallel file systems like DDN EXAScaler on Google Cloud support the high-throughput and low-latency storage requirements of multi-threaded SAS applications.

Weather forecasting

To predict weather patterns in a given geographic region, meteorologists divide the region into several cells, and deploy monitoring devices such as ground radars and weather balloons in every cell. These devices observe and measure atmospheric conditions at regular intervals. The devices stream data continuously to a weather-prediction application running in an HPC cluster.

The weather-prediction application processes the streamed data by using mathematical models that are based on known physical relationships between the measured weather parameters. A separate job processes the data from each cell in the region. As the application receives new measurements, every job iterates through the latest data for its assigned cell, and exchanges output with the jobs for the other cells in the region. To predict weather patterns reliably, the application needs to store and share terabytes of data that thousands of jobs running in parallel generate and access.
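
The per-cell jobs described above follow a pattern that's often called domain decomposition with neighbor exchange. The following is a toy sketch, not a real weather model: it assumes a one-dimensional row of cells, a single hypothetical value per cell, and a simple averaging rule in place of physical equations. In a real deployment, each cell would typically be handled by a separate job, and the shared state would live on the parallel file system.

```python
from typing import List


def step(cells: List[float], weight: float = 0.25) -> List[float]:
    """Advance every cell one iteration using its neighbors' previous values."""
    n = len(cells)
    updated = []
    for i, value in enumerate(cells):
        # "Exchange": each job reads the last output of the adjacent cells.
        # Edge cells reuse their own value in place of the missing neighbor.
        left = cells[i - 1] if i > 0 else value
        right = cells[i + 1] if i < n - 1 else value
        # Placeholder update rule (assumption): relax toward the neighbor mean.
        updated.append(value + weight * (left + right - 2 * value))
    return updated


def simulate(cells: List[float], iterations: int) -> List[float]:
    """Run the update repeatedly, as jobs would when new measurements arrive."""
    for _ in range(iterations):
        cells = step(cells)
    return cells


if __name__ == "__main__":
    # Hypothetical temperatures (degrees C) for four adjacent cells.
    print(simulate([10.0, 20.0, 30.0, 40.0], iterations=3))
```

Every iteration depends on the previous output of neighboring cells, which is why the jobs are tightly coupled and why shared storage latency directly affects the time per iteration.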

CFD for aircraft design

Computational fluid dynamics (CFD) involves the use of mathematical models, physical laws, and computational logic to simulate the behavior of a gas or liquid around a moving object. When aircraft engineers design the body of an airplane, one of the factors that they consider is aerodynamics. CFD enables designers to quickly simulate the effect of design changes on aerodynamics before investing time and money in building expensive prototypes. After analyzing the results of each simulation run, the designers optimize attributes such as the volume and shape of individual components of the airplane's body, and re-simulate the aerodynamics. CFD enables aircraft designers to collaboratively simulate the effect of hundreds of such design changes quickly.

To complete design simulations efficiently, CFD applications need submillisecond access to shared data and the ability to store large volumes of data at speeds of up to 100 GBps.
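
As a rough, back-of-the-envelope illustration of these figures, the following sketch estimates how long it takes to write a dataset at a given sustained bandwidth, and the upper bound on IOPS for a single stream of serialized requests at a given latency. The function names and the 50 TB dataset size are hypothetical, and decimal units (1 TB = 1,000 GB) are assumed. Real throughput depends on the workload and on the file system configuration.

```python
def write_time_seconds(dataset_gb: float, bandwidth_gbytes_per_s: float) -> float:
    """Time to write dataset_gb gigabytes at a sustained bandwidth in GB/s."""
    if bandwidth_gbytes_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return dataset_gb / bandwidth_gbytes_per_s


def max_serial_iops(latency_ms: float) -> float:
    """Upper bound on I/O operations per second when each request
    waits for the previous one to finish (queue depth of 1)."""
    if latency_ms <= 0:
        raise ValueError("latency must be positive")
    return 1000.0 / latency_ms


if __name__ == "__main__":
    # Hypothetical 50 TB simulation output written at 100 GBps.
    print(write_time_seconds(50_000, 100))  # 500.0 seconds
    # A 0.5 ms round trip caps one serialized stream at 2,000 IOPS.
    print(max_serial_iops(0.5))  # 2000.0
```

Applications usually issue many requests in parallel, so aggregate IOPS can be far higher than the single-stream bound; the sketch only shows why per-request latency matters.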

Overview of Lustre and EXAScaler Cloud

Lustre is an open source parallel file system that provides high-throughput and low-latency storage for tightly coupled HPC workloads. In addition to standard POSIX mount points in Linux, Lustre supports data and I/O libraries such as NetCDF, HDF5, and MPI-IO, enabling parallel I/O for a wide range of application domains. Lustre powers many of the largest HPC deployments globally. A Lustre file system has a scalable architecture that contains the following components:

  • A management server (MGS) stores and manages configuration information about one or more Lustre file systems, and provides this information to the other components.
  • Metadata servers (MDS) manage client access to a Lustre file system’s namespace, using metadata (for example, directory hierarchy, filenames, and access permissions).
  • Object storage servers (OSS) manage client access to the files stored in a Lustre file system.
  • Lustre client software allows clients to mount the Lustre file system.

Multiple instances of MDS and OSS can exist in a file system. You can add new MDS and OSS instances when required. For more information about the Lustre file system and how it works, see the Lustre documentation.
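
To make the benefit of multiple object storage servers concrete, the following toy sketch shows how a file's bytes can be spread across several storage objects so that clients can read and write them in parallel. It assumes a simple round-robin (RAID-0 style) striping layout, which the text above doesn't describe. The function name locate and the stripe parameters are hypothetical and aren't part of any Lustre API.

```python
from typing import Tuple


def locate(offset: int, stripe_size: int, stripe_count: int) -> Tuple[int, int]:
    """Map a byte offset in a file to (stripe index, offset within that object)."""
    if offset < 0 or stripe_size <= 0 or stripe_count <= 0:
        raise ValueError("offset must be >= 0; stripe parameters must be > 0")
    chunk = offset // stripe_size        # which fixed-size chunk of the file
    stripe_index = chunk % stripe_count  # chunks rotate across the objects
    row = chunk // stripe_count          # completed rotations before this chunk
    object_offset = row * stripe_size + offset % stripe_size
    return stripe_index, object_offset


if __name__ == "__main__":
    MIB = 1024 * 1024
    # Hypothetical layout: 1 MiB stripes over 4 objects.
    for off in (0, MIB, 4 * MIB, 5 * MIB + 10):
        print(off, locate(off, MIB, 4))
```

With this kind of layout, consecutive chunks of a large file land on different objects, so a sequential read or write can be served by several servers at once.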

EXAScaler Cloud is an enterprise version of Lustre that's offered by DDN, a Google partner. EXAScaler Cloud is a shared-file solution for high-performance data processing and for managing the large volumes of data required to support AI, HPC, and analytics workloads. EXAScaler Cloud is ideal for deep-learning and inference AI workloads in Google Cloud. You can deploy it in a hybrid-cloud architecture to augment your on-premises HPC capacity. EXAScaler Cloud can also serve as a repository for storing longer-term assets from an on-premises EXAScaler deployment.

What's next