The need for speed: Using C2 machines for your HPC workloads
The cloud opens up many new possibilities for High Performance Computing (HPC). But while the cloud offers the latest technologies and a wide variety of machine types (VMs), not every VM is suited to the demands of HPC workloads.
Google Cloud’s Compute-optimized (C2) machines are specifically designed to meet the needs of the most compute-intensive workloads, such as HPC applications in fields like scientific computing, Computer-aided Engineering (CAE), biosciences, and Electronic Design Automation (EDA), among many others.
The C2 is based on the second generation Intel® Xeon® Scalable Processor and provides up to 60 virtual cores (vCPUs) and 240GB of system memory. C2s can run at a sustained frequency of 3.8GHz and offer more than 40% improvement compared to previous generation VMs for general applications. Compared to previous generation VMs, total memory bandwidth improves by 1.21X and memory bandwidth/vCPU improves by 1.94X.1 Here we take a deeper look at using C2 VMs for your HPC workloads on Google Cloud.
Tightly-coupled HPC workloads rely on resource isolation for predictable performance. C2 is built for isolation and consistent mapping of shared physical resources (e.g., CPU caches and memory bandwidth). The result is reduced variability and more consistent performance. C2 also exposes and enables explicit user control of CPU power states ("C-States") on larger VM sizes, enabling higher effective frequencies and performance.
In addition to hardware improvements, Google Cloud has enabled a number of HPC-specific optimizations on C2 instances. In many cases, tightly-coupled HPC applications require careful mapping of processes or threads to physical cores, along with care to ensure processes access memory that is closest to their physical cores. C2s provide explicit visibility and control of NUMA domains to the guest operating system (OS), enabling maximum performance.
Second generation Xeon processors support Intel Advanced Vector Extensions 512 (Intel AVX-512) for data parallelism. AVX-512 instructions are SIMD (Single Instruction, Multiple Data) instructions: each 512-bit register holds 16 single-precision (or 8 double-precision) floating point values, and with two fused multiply-add (FMA) units per core, a core can complete up to 64 single-precision (or 32 double-precision) floating point operations per clock cycle. This means that more can be done in every clock cycle, reducing overall execution time. The latest generation of AVX-512 instructions in the 2nd generation Xeon processor includes DL Boost instructions that significantly improve performance for AI inferencing by combining three INT8 instructions into one, thereby maximizing the use of compute resources, utilizing the cache better, and avoiding potential bandwidth bottlenecks.
HPC workloads often scale out to multiple nodes in order to accelerate time to completion. Google Cloud has enabled "Compact Placement Policy" on the C2, which allocates up to 1320 vCPUs placed in close physical proximity, minimizing cross-node latencies. Compact placements, in conjunction with the Intel MPI library, optimize the multi-node scalability of HPC applications. You can learn more about best practices for ensuring low latency on multi-node workloads here.
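As a sketch of how this looks in practice (resource names, region, and zone below are placeholders), you create a collocated placement policy and then attach it to the instances in your cluster:

```shell
# Create a compact placement policy in the target region.
gcloud compute resource-policies create group-placement hpc-placement \
    --collocation=collocated \
    --region=us-central1

# Launch C2 instances that share the policy, so they are allocated
# in close physical proximity to minimize cross-node latency.
gcloud compute instances create hpc-node-1 hpc-node-2 \
    --zone=us-central1-a \
    --machine-type=c2-standard-60 \
    --resource-policies=hpc-placement
```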
Along with the hardware optimizations, Intel offers a comprehensive suite of development tools (including performance libraries, Intel Compilers, and performance monitoring and tuning tools) to make it simpler to build and modernize code with the latest techniques in vectorization, multithreading, multi-node parallelization, and memory optimization. Learn more about Intel’s Parallel Studio XE here.
Bringing it all together
Combining the hardware improvements with the optimizations in the Google Cloud stack, C2 VMs perform up to 2.10X better than previous-generation N1 VMs on HPC workloads, for roughly the same size VM.2
In many cases HPC applications can scale up to the full node. A single C2 node (60 vCPUs and 240GB) offers up to 2.49X better performance/price compared to a single N1 node (96 vCPUs and 360GB).3
C2s are offered in predefined shapes intended to deliver the most appropriate vCPU and memory configurations for typical HPC workloads. In some cases, it is possible to further optimize performance or performance/price via a custom VM shape. For example, if a certain workload is known to require less than the 240GB of memory included with a c2-standard-60 VM, a custom N2 machine with less memory can deliver roughly the same performance at a lower cost. We were able to achieve up to 1.09X better performance/price by tuning the VM shape to the needs of several common HPC workloads.4
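For illustration, a custom N2 shape like the one used in footnote 4 can be requested directly via the machine type string, whose format is `n2-custom-VCPUS-MEMORY_MB` (the instance name and zone below are placeholders):

```shell
# 80 vCPUs with 160GB (163840MB) of memory, rather than a
# predefined N2 shape's default memory for that vCPU count.
gcloud compute instances create hpc-custom-node \
    --zone=us-central1-a \
    --machine-type=n2-custom-80-163840
```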
Get started today
As more HPC workloads start to benefit from the agility and flexibility of the cloud, Google Cloud and Intel are joining forces to create solutions optimized for the specific needs of these workloads. With the latest optimizations in Intel 2nd generation Xeon processors and Google Cloud, C2 VMs deliver the best solution for running HPC applications in Google Cloud, while giving you the freedom to build and evolve around your unique business needs. Many of our customers with demanding performance needs have moved their workloads to C2 VMs and confirmed our expectations.
To learn more about C2 and the second generation of Intel Xeon Scalable Processor, contact your sales representative or reach out to us here. And if you’re participating in SC20 this week, be sure to check out our virtual booth, where you can watch sessions, access resources, and chat with our HPC experts.
1. Based on internal analysis of our c2-standard-60 and n1-standard-96 machine types, using the STREAM Triad Best Rate benchmark.
2. Based on internal analysis of our c2-standard-60 and n1-standard-96 machine types, using our Weather Research Forecasting (WRF) benchmark.
3. Based on the High Performance Conjugate Gradients (HPCG) benchmark, analyzing Google Cloud VM instance pricing for c2-standard-60 ($3.1321/hour) and n1-standard-96 ($4.559976/hour) as of 10/15/2020.
4. Based on GROMACS and NAMD benchmarks, analyzing Google Cloud VM instance pricing for n2-custom-80 with 160GB of memory ($3.36528/hour) and c2-standard-60 ($3.1321/hour) as of 10/15/2020.