Architecture: Lustre file system in Google Cloud using DDN EXAScaler

Last reviewed 2023-11-15 UTC

This document provides architectural guidance to help you design and size a Lustre file system for high performance computing (HPC) workloads. It also provides an overview of the process to deploy a Lustre file system in Google Cloud by using DDN EXAScaler.

Lustre is an open source, parallel file system that provides high-throughput and low-latency storage for tightly coupled HPC workloads. You can scale a Lustre file system to support tens of thousands of HPC clients and petabytes of storage. EXAScaler Cloud is an enterprise version of Lustre that's offered by DDN, a Google partner. You can deploy EXAScaler Cloud in a hybrid-cloud architecture to augment your on-premises HPC capacity. EXAScaler Cloud can also serve as a repository for storing longer-term assets from an on-premises EXAScaler deployment.

The guidance in this document is intended for enterprise architects and technical practitioners who design, provision, and manage storage for HPC workloads in the cloud. The document assumes that you have a conceptual understanding of parallel file systems. You should also have an understanding of the HPC use cases for which parallel file systems like Lustre are ideal. For more information, see Parallel file systems for HPC workloads.

Overview of the Lustre file system

The following diagram shows the architecture of a Lustre file system:

Architecture of a Lustre file system

As shown in the diagram, the architecture contains the following components:

  • Management server (MGS) and management targets (MGT): The MGS stores and manages configuration information about one or more Lustre file systems. The diagram shows the MGS managing a single Lustre file system. The MGS provides configuration information to the other Lustre components in all the file systems that it manages. The MGS records file-system configuration logs in storage devices that are called MGTs.

  • Metadata servers (MDS) and metadata targets (MDT): The MDS nodes manage client access to the namespace of a Lustre file system. This namespace includes all the metadata of the file system, such as the directory hierarchy, file creation time, and access permissions. The metadata is stored in storage devices that are called MDTs. A Lustre file system has at least one MDS and one associated MDT. To improve the performance for metadata‑intensive workloads, such as when thousands of clients create and access millions of small files, you can add more MDS nodes to the file system.

  • Object storage servers (OSS) and object storage targets (OST): The OSS nodes manage client access to the file data that's stored in a Lustre file system. Each file is stored as one or more Lustre objects. The objects are stored either in a single storage device (called an OST) or striped across multiple OSS nodes and OSTs. A Lustre file system has at least one OSS and one associated OST. You can scale the storage capacity and the performance of the file system by adding more OSS nodes and OSTs. The total storage capacity of the file system is the sum of the storage capacities of the OSTs that are attached to all the OSS nodes in the file system.

  • Clients: A Lustre client is a compute node, such as a virtual machine (VM), that accesses a Lustre file system through a mount point. The mount point provides a unified namespace for the entire file system. You can scale a Lustre file system to support concurrent access by over 10,000 clients. Lustre clients access all the MDS and OSS nodes in a Lustre file system in parallel. This parallel access helps to maximize the performance of the file system. The parallel access also helps reduce storage hotspots, which are storage locations that are accessed much more frequently than other locations. Hotspots are common in non-parallel file systems, and can cause a performance imbalance between clients.

    To access a Lustre file system, a client gets the required directory and file metadata from an MDS, and then reads or writes data by communicating with one or more OSS nodes. Lustre provides close compliance with POSIX semantics, and it allows all clients full and parallel access to the file system. A command-line sketch of client access follows this list.
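For illustration, the following commands show how a Lustre client typically mounts and uses a file system. The MGS address (10.0.0.2), file-system name (lustrefs), and mount point are placeholder values.

```bash
# Mount the Lustre file system that's served by the MGS at a placeholder
# address; "lustrefs" is a placeholder file-system name.
sudo mount -t lustre 10.0.0.2@tcp:/lustrefs /mnt/lustrefs

# List the MDTs and OSTs that back the file system, along with their usage.
lfs df -h /mnt/lustrefs

# Stripe new files in this directory across all available OSTs (-c -1) so
# that large-file I/O is spread over every OSS node.
lfs setstripe -c -1 /mnt/lustrefs/shared-dataset
```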

For more information about the Lustre file system and how it works, see the Lustre documentation.

Architecture of Lustre in Google Cloud

The following diagram shows an architecture for deploying a Lustre file system in Google Cloud:

Architecture of Lustre file system in Google Cloud

The architecture shown in this diagram contains the following resources. All the resources are deployed in a single Google Cloud project. The compute and storage resources are provisioned within a single zone.

  • Compute Engine VMs host the MGS, MDS, and OSS nodes and the Lustre clients. You can also choose to deploy the Lustre clients in a Google Kubernetes Engine cluster, and deploy the file system on Compute Engine VMs.

  • The architecture includes the following networking resources:

    • A single Virtual Private Cloud (VPC) subnet that's used for all the VMs.
    • An optional Cloud NAT gateway and an optional Cloud Router for outbound traffic from the private VMs to the internet.
    • Optional firewall rules to allow SSH ingress connections to all the VMs in the topology.
    • An optional firewall rule to allow HTTP access from the internet to the DDN EXAScaler web console on the MGS.
    • A firewall rule to allow TCP connections between all the VMs (see the sketch after this list).
  • Persistent Disks provide storage capacity for the MGS, MDS, and OSS nodes. If you don't need persistent storage, you can build a scratch file system by using local solid-state drives (SSDs), which are attached to the VMs.

    Google has submitted IO500 entries demonstrating the performance of both persistent and scratch Lustre file systems. Read about the Google Cloud submission that demonstrates a 10+ Tbps, Lustre-based scratch file system on the IO500 ranking of HPC storage systems.
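The following gcloud commands sketch the firewall rules in the preceding list. The network name (lustre-vpc), source ranges, and target tag are placeholder values; adjust them to your own VPC design.

```bash
# Allow TCP traffic between all VMs on the Lustre subnet (placeholder range).
gcloud compute firewall-rules create lustre-allow-internal \
    --network=lustre-vpc --allow=tcp --source-ranges=10.0.0.0/24

# Optional: allow SSH ingress to the VMs, for example from IAP's TCP
# forwarding range.
gcloud compute firewall-rules create lustre-allow-ssh \
    --network=lustre-vpc --allow=tcp:22 --source-ranges=35.235.240.0/20

# Optional: allow HTTP access to the EXAScaler web console, restricted to the
# MGS by a placeholder target tag and to a placeholder client range.
gcloud compute firewall-rules create lustre-allow-console \
    --network=lustre-vpc --allow=tcp:80 \
    --target-tags=exascaler-mgs --source-ranges=203.0.113.0/24
```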

Design guidelines

Use the following guidelines to design a Lustre file system that meets the requirements of your HPC workloads. The guidelines in this section are not exhaustive. They provide a framework to help you assess the storage requirements of your HPC workloads, evaluate the available storage options, and size your Lustre file system.

Workload requirements

Identify the storage requirements of your HPC workloads. Define the requirements as granularly as possible, and consider your future requirements. Use the following questions as a starting point to identify the requirements of your workload:

  • What are your requirements for throughput and I/O operations per second (IOPS)?
  • How much storage capacity do you need?
  • What is your most important design goal: throughput, IOPS, or storage capacity?
  • Do you need persistent storage, scratch storage, or both?

Storage options

For the most cost-effective storage options, you can choose from the following types of Persistent Disks:

  • Standard Persistent Disks (pd-standard) are the most cost-effective option when storage capacity is the main design goal. They provide efficient and reliable block storage by using hard-disk drives (HDD).
  • SSD Persistent Disks (pd-ssd) are the most cost-effective option when the goal is to maximize IOPS. They provide fast and reliable block storage by using SSDs.
  • Balanced Persistent Disks (pd-balanced) are the most cost-effective option for maximizing throughput. They provide cost-effective and reliable block storage by using SSDs.

Extreme Persistent Disks (pd-extreme) can provide higher performance than the other disk types, and you can choose the required IOPS. But pd-extreme costs more than the other disk types.

For more information about the performance capabilities of Persistent Disks, see Block storage performance.

For scratch storage, you can use ephemeral local SSDs. Local SSDs are physically attached to the server that hosts the VMs. So local SSDs provide higher throughput and lower latency than Persistent Disks. But the data that's stored on a local SSD persists only until the VM is stopped or deleted.
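To illustrate these options, the following commands create one disk of each Persistent Disk type and a VM with local SSDs attached. The resource names, sizes, zone, and machine type are placeholder values, and the allowed number of local SSDs depends on the machine type.

```bash
# Capacity-oriented: HDD-backed Standard Persistent Disk.
gcloud compute disks create ost-data-standard \
    --type=pd-standard --size=10TB --zone=us-central1-a

# Throughput-oriented: SSD-backed Balanced Persistent Disk.
gcloud compute disks create ost-data-balanced \
    --type=pd-balanced --size=4TB --zone=us-central1-a

# IOPS-oriented: SSD Persistent Disk.
gcloud compute disks create mdt-data-ssd \
    --type=pd-ssd --size=1TB --zone=us-central1-a

# Scratch-oriented: attach ephemeral local NVMe SSDs (375 GB each) when you
# create the VM. The data is lost when the VM is stopped or deleted.
gcloud compute instances create oss-scratch-1 \
    --zone=us-central1-a --machine-type=n2-standard-16 \
    --local-ssd=interface=NVME --local-ssd=interface=NVME
```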

Object storage servers

When you design the infrastructure for the OSS nodes, we recommend the following:

  • For storage, choose an appropriate Persistent Disk type based on your requirements for storage capacity, throughput, and IOPS.

    • Use pd-standard for workloads that have the following requirements:
      • The workload needs high storage capacity (for example, more than 10 TB), or it needs both high read throughput and high storage capacity.
      • I/O latency is not important.
      • Low write throughput is acceptable.
    • Use pd-balanced for workloads that have any of the following requirements:
      • High throughput at low capacity.
      • The low latency that's provided by SSD-based disks.
    • Use pd-ssd for workloads that require high IOPS (either small I/O requests or small files).
  • Provision enough storage capacity to achieve the required IOPS. Consider the read and write IOPS provided by each disk type.

  • For the VMs, use a machine type from the N2 or N2D machine family. These machine types provide predictable and cost-efficient performance.

  • Allocate enough vCPUs to achieve the required Persistent Disk throughput. The maximum Persistent Disk throughput per VM is 1.2 GBps, and this throughput can be achieved with 16 vCPUs. So start with a machine type that has 16 vCPUs. Monitor the performance, and allocate more vCPUs when you need to scale the IOPS.
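The following command sketches an OSS VM that follows these guidelines: an N2 machine type with 16 vCPUs and a pd-balanced data disk that serves as the OST. The names, zone, image, and disk size are placeholder values. In practice, the deployment methods described later in this document provision the OSS VMs and disks for you.

```bash
# OSS VM with 16 vCPUs (enough to reach the ~1.2-GBps per-VM Persistent Disk
# throughput limit) and a pd-balanced data disk to be used as an OST.
gcloud compute instances create lustre-oss-1 \
    --zone=us-central1-a --machine-type=n2-standard-16 \
    --image-family=rocky-linux-8 --image-project=rocky-linux-cloud \
    --create-disk=name=lustre-ost-1,type=pd-balanced,size=4300GB,auto-delete=no
```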

Metadata servers

The MDS nodes don't need high storage capacity for serving metadata, but they need storage that supports high IOPS. When designing the infrastructure for the MDS nodes, we recommend the following:

  • For storage, use pd-ssd because this type of Persistent Disk provides high IOPS (30 IOPS per GB) even at low storage capacity.
  • Provision enough storage capacity to achieve the required IOPS (see the sizing sketch after this list).
  • For the VMs, use a machine type from the N2 or N2D machine family. These machine types provide predictable and cost-efficient performance.
  • Allocate enough vCPUs to achieve the required IOPS:
    • For low-metadata workloads, use 16 vCPUs.
    • For medium-metadata workloads, use 32 vCPUs.
    • For metadata-intensive workloads, use 64 vCPUs. Monitor the performance, and allocate more vCPUs when necessary.
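As a sizing sketch, suppose the metadata workload needs about 45,000 IOPS (an assumed figure). At roughly 30 IOPS per GB, that maps to about 1.5 TB of pd-ssd capacity for the MDT. The disk name and zone below are placeholders.

```bash
# Assumed target: 45,000 metadata IOPS.
# pd-ssd provides about 30 IOPS per GB, so:
#   45,000 IOPS / 30 IOPS per GB = 1,500 GB (1.5 TB) of pd-ssd capacity.
gcloud compute disks create lustre-mdt-1 \
    --type=pd-ssd --size=1500GB --zone=us-central1-a
```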

Management server

The MGS needs minimal compute resources. It is not a storage-intensive service. Start with a small machine type for the VM (for example, n2-standard-2) and a 128-GB pd-ssd disk for storage, and monitor the performance. If the response is slow, allocate more vCPUs and increase the disk size.

Availability and durability

If you need persistent storage, the pd-standard and pd-balanced Persistent Disk types provide highly available and durable storage within a zone. For cross-zone or cross-region persistence, you can copy the data to low-cost Cloud Storage by using the Google Cloud CLI or Storage Transfer Service. To reduce the cost of storing data in Cloud Storage, you can store infrequently accessed data in a bucket of the Nearline storage or Coldline storage class.
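For example, you might periodically copy results from the Lustre mount point to a Nearline bucket. The bucket name, location, and paths below are placeholder values.

```bash
# Create a Nearline bucket for infrequently accessed data (placeholder name).
gcloud storage buckets create gs://example-lustre-archive \
    --default-storage-class=NEARLINE --location=us-central1

# From a Lustre client, copy results out of the file system into the bucket.
gcloud storage cp -r /mnt/lustrefs/results gs://example-lustre-archive/results
```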

If you need only ephemeral storage for a scratch deployment, use local SSDs as the data disks for the OSS and MDS nodes. This design delivers high performance with the fewest OSS VMs, and it helps you achieve an optimal cost-to-performance ratio compared with the other options.

Sizing example for OSS VMs

The recommended strategy for sizing and provisioning your Lustre file system is to provision enough OSS VMs to meet the overall throughput requirement. Then, increase the storage capacity of the OST disks until you reach the required storage capacity. The example workload that is used in this section shows you how to implement this strategy by using the following steps:

  1. Determine the workload requirements.
  2. Choose a Persistent Disk type.
  3. Calculate the number of OSS VMs.
  4. Calculate the disk size per VM.
  5. Determine the number of vCPUs per VM.

Determine the workload requirements

In this example, the workload requires 80 TB of persistent storage capacity with a read throughput of 30 GBps.

Choose a Persistent Disk type

As discussed in the Storage options section, pd-standard is the most cost-effective option when storage capacity is the main design goal, and pd-balanced is the most cost-effective option for maximizing throughput. The maximum throughput is different for each disk type, and the throughput scales with the disk size.

For each Persistent Disk type that can be used for this workload, calculate the storage capacity that's necessary to scale the read throughput to the target of 30 GBps.

| Disk type | Read throughput per TB | Storage capacity required to achieve the target throughput |
| --- | --- | --- |
| pd-standard | 0.12 GBps | 30 / 0.12 = 250 TB |
| pd-balanced | 0.28 GBps | 30 / 0.28 = 107 TB |

To achieve the target read throughput of 30 GBps by using pd-standard, you would need to provision 250 TB of storage capacity, which is more than three times the required capacity of 80 TB. So for the workload in this example, pd-balanced provides cost-efficient storage that meets the performance requirements.

Calculate the number of OSS VMs

The maximum read throughput per Compute Engine VM is 1.2 GBps. To achieve the target read throughput of 30 GBps, divide the target read throughput by the maximum throughput per VM as follows:

   30 GBps / 1.2 GBps = 25

You need 25 OSS VMs to achieve the target read throughput.

Calculate the disk size per VM

To calculate the disk size per VM, divide the capacity (107 TB) that's necessary to achieve the target throughput (30 GBps) by the number of VMs as follows:

   107 TB / 25 VMs = 4.3

You need 4.3 TB of pd-balanced capacity per VM.

Determine the number of vCPUs per VM

The read throughput of a VM scales with the number of vCPUs allocated to the VM. The read throughput peaks at 1.2 GBps for 16 (or more) vCPUs. You need a machine type that provides at least 16 vCPUs. So choose a machine type from the N2 or N2D machine family, such as n2-standard-32.
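The following shell sketch reproduces the arithmetic from the preceding steps. The per-TB and per-VM throughput figures are the approximate values used in this example; treat the script as an illustration rather than a sizing tool, and check the current Persistent Disk performance limits before you rely on the numbers.

```bash
# Inputs from the example workload.
TARGET_THROUGHPUT_GBPS=30     # target read throughput, GBps
REQUIRED_CAPACITY_TB=80       # required storage capacity, TB
PD_BALANCED_GBPS_PER_TB=0.28  # approximate pd-balanced read throughput per TB
PER_VM_GBPS=1.2               # maximum Persistent Disk read throughput per VM

awk -v t="$TARGET_THROUGHPUT_GBPS" -v c="$REQUIRED_CAPACITY_TB" \
    -v per_tb="$PD_BALANCED_GBPS_PER_TB" -v per_vm="$PER_VM_GBPS" 'BEGIN {
  capacity_for_throughput = t / per_tb                     # ~107 TB
  # The larger of the two capacities (throughput-driven or required) governs.
  capacity = (capacity_for_throughput > c) ? capacity_for_throughput : c
  oss_vms = t / per_vm                                     # 25 VMs
  disk_per_vm = capacity / oss_vms                         # ~4.3 TB per VM
  printf "Capacity needed for throughput: %.0f TB\n", capacity_for_throughput
  printf "Number of OSS VMs:              %.0f\n", oss_vms
  printf "pd-balanced disk per OSS VM:    %.1f TB\n", disk_per_vm
}'
```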

Configuration summary

The example workload has the following requirements:

  • 80-TB persistent storage capacity
  • 30-GBps read throughput

To meet the requirements of this workload, you need the following compute and storage resources:

  • Number of OSS VMs: 25
  • VM machine family: N2 or N2D
  • Number of vCPUs per VM: 16 or more
  • Persistent Disk type: pd-balanced
  • Disk size per VM: 4.3 TB

Configuration example for a scratch file system

The following configuration is based on a Google Cloud submission to IO500 that demonstrates the performance of an extreme-scale, Lustre-based scratch file system deployed on Google Cloud:

| Configuration parameter | MDS | OSS | Clients |
| --- | --- | --- | --- |
| Number of VMs | 50 | 200 | 1,000 |
| Machine type | n2-standard-80 | n2-standard-64 | c2-standard-16 |
| Number of vCPUs per VM | 80 vCPUs | 64 vCPUs | 16 vCPUs |
| RAM per VM | 320 GB | 256 GB | 64 GB |
| OS | CentOS 8 | CentOS 8 | CentOS 8 |
| Network bandwidth | 100 Gbps | 75 Gbps | 32 Gbps |
| Local SSD storage | 9-TB NVMe (24 disks) | 9-TB NVMe (24 disks) | None |

The preceding configuration provided the following performance:

| Performance metric | Result |
| --- | --- |
| Write throughput | 700 GBps (5.6 Tbps) |
| Read throughput | 1,270 GBps (10.16 Tbps) |
| File stat() operations | 1.9 million per second |
| Small file reads (3,901 bytes) | 1.5 million per second |

Deployment options

This section provides an overview of the methods that you can use to deploy an EXAScaler Cloud file system in Google Cloud. This section also outlines the steps to follow when you deploy the client VMs.

EXAScaler Cloud file system deployment

You can choose from the following methods to deploy an EXAScaler Cloud file system in Google Cloud:

  • Use the Cloud HPC Toolkit to generate a Terraform configuration from a blueprint, and then apply the configuration.
  • Use the DDN EXAScaler Cloud solution that's available in Cloud Marketplace.

With either method, you can customize the file system when you deploy it. For example, you can specify the number of OSS VMs, the machine type for the VMs, the Persistent Disk types, and the storage capacity.

When choosing a method, consider the following differences:

  • Post-deployment modification: If you use the Cloud HPC Toolkit, you can efficiently modify the file system after deployment. For example, to add storage capacity, you can increase the number of OSS nodes by updating the Cloud HPC Toolkit blueprint and applying the generated Terraform configuration again (see the sketch after this list). For a list of the parameters that you can specify in the blueprint, see the Inputs section in the README for the Terraform module. To modify a file system that's deployed by using the Cloud Marketplace solution, you must change each compute and storage resource individually by using the Google Cloud console, the gcloud CLI, or the API.

  • VPC Service Controls support: The Cloud Marketplace solution uses Deployment Manager, which is not supported by VPC Service Controls. For more information about this limitation, see VPC Service Controls supported products and limitations.
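If you use the Cloud HPC Toolkit, the modification workflow mentioned in the first item might look like the following sketch. The ghpc invocation, the blueprint file name (lustre-blueprint.yaml), the deployment folder name, and the deployment group name (primary) are assumptions; follow the Cloud HPC Toolkit documentation for the exact commands that your Toolkit version supports.

```bash
# Assumed workflow: after increasing the number of OSS nodes in the blueprint,
# regenerate the deployment folder and re-apply the Terraform configuration.
ghpc create lustre-blueprint.yaml -w            # -w overwrites the existing folder
terraform -chdir=lustre-deployment/primary init
terraform -chdir=lustre-deployment/primary apply
```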

Client deployment

You can use either of the methods described in the EXAScaler Cloud file system deployment section to deploy the client VMs. However, Google recommends that you provision and manage your client VMs separately from the file system. The recommended way to deploy your clients is by using the Google-provided HPC VM image, which is optimized for HPC workloads.

The following is an overview of the process of using the HPC VM image to deploy Lustre clients (a command-line sketch follows the list):

  1. Create a VM by using the HPC VM image.
  2. Install the Lustre client packages on the VM.
  3. Customize the VM as necessary.
  4. Create a custom image from the VM.
  5. Provision the Lustre client VMs by using the custom image that you created. To automate the provisioning and management of the client VMs, you can use a Compute Engine managed instance group or a third-party tool like Slurm Workload Manager.
  6. Mount the Lustre file system on the client VMs.
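The following commands sketch steps 1, 4, and 6 of this process. The zone, machine type, and resource names are placeholders, and the HPC VM image family and project are assumed to be hpc-centos-7 and cloud-hpc-image-public; check the HPC VM image documentation for the current values. Installing the Lustre client packages (steps 2 and 3) is not shown.

```bash
# 1. Create a build VM from the Google-provided HPC VM image.
gcloud compute instances create lustre-client-build \
    --zone=us-central1-a --machine-type=c2-standard-16 \
    --image-family=hpc-centos-7 --image-project=cloud-hpc-image-public

# 2-3. SSH to the VM, install the Lustre client packages, customize the VM
#      (not shown), and then stop it before you create an image from it.
gcloud compute instances stop lustre-client-build --zone=us-central1-a

# 4. Create a custom image from the build VM's boot disk.
gcloud compute images create lustre-client-image \
    --source-disk=lustre-client-build --source-disk-zone=us-central1-a

# 5. Provision client VMs from the custom image, for example in a managed
#    instance group (not shown).

# 6. On each client VM, mount the file system (placeholder MGS address and
#    file-system name).
sudo mount -t lustre 10.0.0.2@tcp:/lustrefs /mnt/lustrefs
```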

Data transfer options

After you deploy a Lustre file system in Google Cloud, you need to move your data to the file system. The following table shows the methods that you can use to move data to your Lustre file system. Choose the method that corresponds to the volume of data that you need to move and the location of your source data (on-premises or in Cloud Storage).

| Data source | Data size | Recommended data-transfer method |
| --- | --- | --- |
| On-premises | Small (for example, less than 1 TB) | Stage the data in Cloud Storage by using the gsutil tool. Then, download the data to the Lustre file system by using Storage Transfer Service or the gcloud CLI. |
| On-premises | Large | Move the data to the Lustre file system by using Storage Transfer Service. Follow the instructions in Transfer data between POSIX file systems. This method involves an intermediary Cloud Storage bucket. After the transfer job completes, Storage Transfer Service deletes the data in the intermediary bucket. |
| Cloud Storage | Large or small | Download the data from Cloud Storage to the Lustre file system by using Storage Transfer Service or the gcloud CLI. |
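For the small on-premises case and the Cloud Storage case, the staging approach might look like the following sketch. The bucket name and paths are placeholders, and the newer gcloud storage commands can be used in place of gsutil.

```bash
# On premises: stage the data in a Cloud Storage bucket (placeholder names).
gsutil -m cp -r /data/project-inputs gs://example-staging-bucket/project-inputs

# On a Lustre client VM: download the staged data into the file system.
gcloud storage cp -r gs://example-staging-bucket/project-inputs /mnt/lustrefs/
```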

What's next

Contributors

Author: Kumar Dhanagopal | Cross-Product Solution Developer

Other contributors: