
How we’ll build sustainable, scalable, secure infrastructure for an AI-driven future

October 17, 2023
Parthasarathy Ranganathan

VP & Technical Fellow

Amin Vahdat

VP/GM, ML, Systems, and Cloud AI

Editor’s note: Today, we hear from Parthasarathy Ranganathan, Google VP and Technical Fellow, and Amin Vahdat, VP/GM. Partha delivered a keynote address today at the OCP Global Summit, an annual conference for leaders, researchers, and pioneers in the open hardware industry. Partha served on the OCP Board of Directors from 2020 to earlier this year, when he was succeeded by Amber Huffman as Google’s representative. Read on to hear about the macro trends driving systems design today, and for an overview of all of our activities in the community.


At Google, we build planet-scale computing for services that power billions of users, and these services have led to incredible opportunities for system designers to create hardware that delivers high performance, resilience, and efficiency, all at scale. In short, we have embraced open innovation for a new era of systems design.

Today, we are at a new fundamental inflection point in computing: the rise of AI. Google products have always had a strong AI component, but in the past year, we have seen a tectonic shift in the industry and have supercharged our core products with the power of generative AI.

These advances have shown up across our computing systems and workloads, from the original Transformer model in 2017, to PaLM in 2022, to Bard today. Large language models have grown from having hundreds of millions of parameters to trillions of parameters, growing by almost an order of magnitude every year. As model sizes increase, so does the computation needed to run these models. That, in essence, sets up the challenge and opportunity that the open innovation community needs to solve together.

AI isn’t just an enabler of new applications; it also represents a fundamental platform shift, one that requires innovation across hardware and software. Together, we need to build the hardware and software platforms that deliver powerful AI solutions across complex machine-learning supercomputers, all in a sustainable, secure, and scalable manner.

Towards sustainable systems

Sustainability is an imperative that we all share. Here are several efforts we are engaged in to help our industry achieve net-zero emissions:

  • Net Zero Innovation Hub: The industry answered our call from the OCP Regional Summit in April for a pan-European public and private collaboration to advance sustainability at a regional level. We launched the Net Zero Innovation Hub with co-founders Danfoss, Google, Microsoft, and Schneider Electric on September 28 with an ambitious agenda across all scopes, including waste-heat reuse and grid availability.
  • Greener concrete: In collaboration with iMasons Climate Accord, AWS, Google, Meta, and Microsoft, we delivered an ambitious technology roadmap to decarbonize concrete. We invite the community to partner with us to execute this roadmap together.
  • Sustainability metrics: Last year, we formed the OCP Data Center Facilities Sustainability Subproject, co-led by Google and Microsoft. The group is making important progress on establishing clear, consistent, and standardized metrics for emissions/carbon, energy, water, and beyond. This work will enable apples-to-apples, data-driven comparisons to identify the best approaches to achieving our shared goals.

Enhancing security across the systems stack

Security includes both trusted computing and reliable computing, and there are several exciting developments coming in this space, including:

  • Caliptra: Caliptra is a reusable IP block for root-of-trust management. Last year, with industry leaders AMD, Microsoft, and NVIDIA, we contributed the draft Caliptra specification to OCP. The Caliptra specification will be complete this year, with the IP block ready for integration into CPUs, GPUs, and other devices. Check out the code repository at https://github.com/chipsalliance/caliptra.
  • OCP S.A.F.E.: In partnership with OCP and Microsoft, we have developed the OCP Security Appraisal Framework and Enablement (S.A.F.E.) program. OCP S.A.F.E. provides a standardized approach for provenance, code quality, and software supply chain for firmware releases. Learn more at https://www.opencompute.org/projects/ocp-safe-program.
  • Reliable Computing: Last year, we formed a server-component resilience workstream at OCP along with AMD, ARM, Intel, Meta, Microsoft, and NVIDIA to take a systems approach to addressing silicon faults and silent data errors. The team has made great strides, including publishing the draft specification and open-sourcing Silent Data Corruption (SDC) frameworks (e.g., Intel and ARM collaborating on Open Datacenter Diagnostics, AMD’s Open Field Health Check, and NVIDIA’s Datacenter GPU Manager). To advance this important area faster, we are launching a new academic grant program — the first of its kind at OCP — with member companies supporting significant academic research in this area.

Scalability from silicon to the cloud

Scalable infrastructure is a primary area of focus for both Google and OCP, from silicon all the way to the cloud. At the OCP Summit this week, we will discuss a few advancements, specifically:

  • Accelerators: This year, we partnered with AMD, ARM, Intel, Meta, and NVIDIA to deliver the OCP 8-bit Floating Point specification to enable training on one accelerator and serving on another. We partnered with Microsoft and NVIDIA to deliver a set of firmware specifications for GPUs and accelerators covering reliability, manageability, and updates.
  • AI: During the AI Track, we are highlighting the progress we are making with partners in the OpenXLA ecosystem. We are also discussing the Architecture Gym, a new effort in collaboration with MLCommons to go beyond systems for AI, to AI for systems, looking at how AI can transform systems design.
  • Networking: To truly build large-scale AI infrastructure, you need world-class networking systems innovation. To help with this, we are opening Falcon, Google’s reliable low-latency hardware transport, and sharing some of the advances we have made over the past 10 years in performance, latency, traffic control, and more. This is part of our ongoing effort to advance Ethernet for the industry as a high-performance, low-latency fabric for hyperscaler environments. Learn more in the blog “Google opens Falcon, a reliable low-latency hardware transport, to the ecosystem”.
  • Storage: Google is joining the OCP Data Center NVM Express™ (NVMe) specification working group with Meta, Microsoft, Dell, and HPE to provide clear requirements for features in datacenter SSDs, including Flexible Data Placement, security, and telemetry. We are also kicking off a new open-source hardware effort to develop an NVMe Key Management block with partners Microsoft, Samsung, Kioxia, and Solidigm.
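
To make the 8-bit floating point item above more concrete, here is a minimal Python sketch of how a single FP8 byte could be decoded. The post itself does not describe the bit layouts; this sketch assumes the two encodings commonly documented for OCP FP8, namely E4M3 (4 exponent bits, 3 mantissa bits, bias 7, no infinities, one NaN pattern per sign) and E5M2 (5 exponent bits, 2 mantissa bits, bias 15, IEEE-style infinities and NaNs). The function names are hypothetical and are not taken from the specification or from any library.

```python
import math


def decode_fp8(byte: int, exp_bits: int, man_bits: int, bias: int,
               ieee_specials: bool) -> float:
    """Decode one 8-bit float (1 sign bit + exp_bits + man_bits) to a Python float.

    ieee_specials=True  : all-ones exponent encodes inf/NaN (assumed E5M2 behavior).
    ieee_specials=False : only all-ones exponent AND all-ones mantissa is NaN,
                          and there are no infinities (assumed E4M3 behavior).
    """
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    exp_max = (1 << exp_bits) - 1
    man_max = (1 << man_bits) - 1

    if exp == exp_max:
        if ieee_specials:
            return sign * math.inf if man == 0 else math.nan
        if man == man_max:
            return math.nan

    fraction = man / (1 << man_bits)
    if exp == 0:
        # Subnormal: no implicit leading 1, fixed exponent of (1 - bias).
        return sign * fraction * 2.0 ** (1 - bias)
    # Normal: implicit leading 1.
    return sign * (1.0 + fraction) * 2.0 ** (exp - bias)


def decode_e4m3(byte: int) -> float:
    return decode_fp8(byte, exp_bits=4, man_bits=3, bias=7, ieee_specials=False)


def decode_e5m2(byte: int) -> float:
    return decode_fp8(byte, exp_bits=5, man_bits=2, bias=15, ieee_specials=True)


if __name__ == "__main__":
    print(decode_e4m3(0x7E))  # largest finite E4M3 value under these assumptions
    print(decode_e5m2(0x7B))  # largest finite E5M2 value under these assumptions
```

Under these assumptions, E4M3 trades dynamic range for an extra mantissa bit, while E5M2 does the reverse. A shared, bit-exact definition of this kind is what lets a model trained on one vendor's accelerator be served on another's.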

There is tremendous opportunity for all of us in the industry to create even more open ecosystems for innovation. At Google, we have a legacy of embracing and fostering open ecosystems, whether it’s Android, Chromium, Kubernetes, Kaggle, TensorFlow, or JAX. We set industry standards, grow communities, and share our innovations broadly. Our contributions to the Open Compute Project Foundation go back several years, from our first 48V contribution to today, sitting on the OCP Board and being one of its largest contributors. We believe the best is yet to come, through codesign and collaboration across hardware and software, multiple layers of the stack, compute, network, storage, infrastructure, industry and academia, and of course, across companies.

It is exciting to be in an era where we are literally inventing the future with new AI advances every day. All these amazing AI advances in turn need a healthy innovation ecosystem around infrastructure, with all of us working to build the sustainable, secure, scalable societal infrastructure that we need for this AI-driven future. And all of this will be possible only through collaboration across all of us in the community. You can learn more from the OCP Global Summit agenda and the list of talks by Google. We are looking forward to the vibrant discussions this week.
