
Agile AI architectures: A fungible data center for the intelligent era

October 14, 2025
Parthasarathy Ranganathan

VP, Engineering Fellow

Amin Vahdat

VP/GM, AI & Infrastructure, Google Cloud


It’s not hyperbole to say that AI is transforming all aspects of our lives: human health, software engineering, education, productivity, creativity, entertainment… Consider just a few of the developments from Google this past year: Magic Cue on the Pixel 10 for more personal, proactive, and contextually relevant assistance; Nano Banana, our viral Gemini 2.5 Flash image generation model; Code Assist for developer productivity; and AlphaFold, which won its creators the Nobel Prize in Chemistry. We like to joke that the past year in AI has been an amazing decade!

Underpinning all these advances in AI are equally amazing advances in the computing infrastructure powering it. If AI researchers are like space explorers discovering new worlds, then systems and infrastructure designers are the ones building the rockets. But keeping up with the demands of AI services will require even more from us. At Google I/O earlier this year, we announced nearly 50X annual growth in the monthly tokens processed by Gemini models, hitting 480 trillion tokens per month. Since then, we have seen an additional 2X growth, hitting nearly a quadrillion monthly tokens. Other statistics paint a similar picture: AI accelerator consumption has grown by 15X in the last 24 months; our Hyperdisk ML data has grown 37X since general availability; and we’re seeing more than 5 billion AI-powered retail search queries per month.
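A quick arithmetic check of those token figures, using only the numbers quoted above:

```python
# Back-of-the-envelope check of the token-growth figures quoted above; only the
# 480-trillion and ~2X numbers come from the post, the rest is arithmetic.
monthly_tokens_at_io = 480e12      # ~480 trillion tokens/month announced at I/O
growth_since_io = 2                # additional ~2X growth since then

current_monthly_tokens = monthly_tokens_at_io * growth_since_io
print(f"{current_monthly_tokens / 1e15:.2f} quadrillion tokens/month")  # ~0.96, i.e. nearly a quadrillion
```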

With great AI comes great computing

This kind of growth brings with it new challenges. When planning for data centers and systems, we are accustomed to long lead times that parallel the long time it takes to build out hardware. However, AI demand projections are now changing dynamically and dramatically, creating a significant divergence between supply and demand. This mismatch requires new architectures and system design approaches that can respond to extreme volatility and growth.

Rapid technology innovations are essential, but they must be carefully managed across the stack. For example, each generation of AI hardware (like TPUs and GPUs) has introduced new features and functionality, but also new power, rack, networking, and cooling requirements. These new generations are also arriving faster, making it hard to build a coherent end-to-end system that can accommodate such a rapid rate of change. Further, changes in form factors, board densities, networking topologies, power architectures, liquid cooling solutions, and more each incrementally compound heterogeneity; taken together, they produce a combinatorial increase in the complexity of designing, deploying, and maintaining systems and data centers. In addition, we need to design for a spectrum of data center facilities, from traditional hyperscaler- or cloud-optimized offerings to “neoclouds” and industry-standard colocation providers, across multiple geographical regions. This adds yet another layer of diversity and dynamism, further constraining data center design for the new AI era.
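To make that combinatorial growth concrete, here is a toy sketch; the option counts per dimension are invented for illustration, not a real inventory:

```python
from math import prod

# Illustrative (made-up) option counts per design dimension; the point is the
# multiplication, not the specific numbers.
dimensions = {
    "accelerator_generation": 3,
    "rack_form_factor": 2,
    "board_density": 2,
    "network_topology": 3,
    "power_architecture": 2,
    "cooling_solution": 3,
    "facility_type": 3,   # hyperscaler, neocloud, colocation
}

total_configs = prod(dimensions.values())
print(f"{len(dimensions)} dimensions -> {total_configs} distinct end-to-end configurations")
# Just 7 dimensions with 2-3 options each already yields 648 combinations to
# design, qualify, deploy, and maintain.
```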

We can address these two challenges, dynamic growth and compounding heterogeneity, if we design data centers with fungibility and agility as first-class considerations. Architectures need to be modular, where components can be designed and deployed independently. They should be interoperable across different vendors and generations. Equally important, they should support the ability to late-bind the facility and systems to handle dynamically changing requirements (for example, reusing infrastructure designed for one generation for the next). Data centers should also be built on agreed-upon standard interfaces, so data center investments can be reused across multiple customer segments. And finally, these principles need to be applied holistically across all components of the data center: power delivery, cooling, server hall design, compute, storage, and networking.
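As a conceptual sketch of late binding against a standard interface (the class and field names below are hypothetical and deliberately simplified, not an actual Google or OCP schema), the facility commits to an agreed-upon envelope once, and the concrete hardware generation is chosen as late as possible:

```python
from dataclasses import dataclass

# Hypothetical, simplified model of "design to a standard interface, late-bind the hardware".
@dataclass(frozen=True)
class RackInterface:
    """Envelope a facility commits to, independent of any one hardware generation."""
    max_power_kw: float
    liquid_cooling: bool
    height_u: int

@dataclass(frozen=True)
class RackGeneration:
    name: str
    power_kw: float
    needs_liquid_cooling: bool
    height_u: int

def fits(rack: RackGeneration, iface: RackInterface) -> bool:
    """A rack generation is deployable if it stays within the standard envelope."""
    return (rack.power_kw <= iface.max_power_kw
            and (iface.liquid_cooling or not rack.needs_liquid_cooling)
            and rack.height_u <= iface.height_u)

# The facility is designed once against the interface...
hall = RackInterface(max_power_kw=100, liquid_cooling=True, height_u=48)

# ...and the hardware generation is late-bound at deployment time.
candidates = [RackGeneration("gen_n", 70, True, 42),
              RackGeneration("gen_n_plus_1", 95, True, 48)]
print([r.name for r in candidates if fits(r, hall)])  # both fit the same hall
```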

With great computing comes great power (and cooling and systems)

To achieve agility and fungibility in power, we must standardize power delivery and management to build a resilient end-to-end power ecosystem, including common interfaces at the rack power level. Partnering with other members of the Open Compute Project (OCP), we introduced new technologies around +/-400Vdc designs and an approach for transitioning from monolithic to disaggregated solutions using side-car power, a.k.a. Mt. Diablo. Promising new technologies, like low-voltage DC power combined with solid-state transformers, will enable these systems to transition to future, fully integrated data center solutions.
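One intuition for why higher-voltage DC distribution matters, sketched below with an assumed 1 MW rack (an illustrative figure, not a product specification): for a fixed power, conductor current scales as I = P / V, and distribution losses scale with the square of that current.

```python
# Illustrative only: conductor current needed to deliver a fixed rack power at two
# DC distribution voltages (I = P / V). The 1 MW rack is an assumed example.
rack_power_w = 1_000_000  # hypothetical 1 MW rack

for label, volts in [("48 V busbar", 48), ("+/-400 Vdc (800 V between rails)", 800)]:
    amps = rack_power_w / volts
    print(f"{label:32s} -> {amps:8,.0f} A")

# ~20,800 A at 48 V versus ~1,250 A at 800 V: roughly 17x less current, and since
# resistive losses scale with I**2 * R, far less copper and far lower distribution loss.
```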

We are also evaluating solutions that let data centers become suppliers to the grid, not just consumers from it, with corresponding standardization around battery-based energy storage and microgrids. We already use such solutions to manage the “spikiness” of AI training workloads, and we are applying them for additional savings in power efficiency and grid power usage.
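A minimal sketch of the idea, with an invented load trace and battery size (all numbers are illustrative assumptions, not measurements): the battery discharges during synchronized training spikes and recharges in the troughs, capping what the facility draws from the grid.

```python
# Toy simulation of smoothing a spiky AI-training power profile with on-site batteries.
# All numbers are illustrative assumptions.
load_mw = [20, 80, 25, 85, 22, 90, 24, 82]   # hypothetical demand per 15-min interval
grid_cap_mw = 55                              # target ceiling on grid draw
battery_mwh = 40.0                            # assumed usable battery energy
interval_h = 0.25                             # 15-minute intervals

soc_mwh = battery_mwh  # start fully charged
for demand in load_mw:
    if demand > grid_cap_mw:
        # Discharge to cover the spike above the cap.
        discharge = min((demand - grid_cap_mw) * interval_h, soc_mwh)
        soc_mwh -= discharge
        grid = demand - discharge / interval_h
    else:
        # Recharge using the headroom below the cap.
        charge = min((grid_cap_mw - demand) * interval_h, battery_mwh - soc_mwh)
        soc_mwh += charge
        grid = demand + charge / interval_h
    print(f"demand {demand:3d} MW -> grid {grid:5.1f} MW, battery {soc_mwh:5.1f} MWh")
```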

Data center cooling, meanwhile, is also being reimagined for the AI era. Earlier this year, we announced Project Deschutes, a state-of-the-art liquid cooling solution that we contributed to the Open Compute community, and we have since published the specification and design collateral. The community is responding enthusiastically, with liquid cooling suppliers like Boyd, CoolerMaster, Delta, Envicool, Nidec, nVent, and Vertiv showcasing demos at major events this year, including the OCP Global Summit and SuperComputing 2025. But there are more opportunities to collaborate on: industry-standard cooling interfaces, new components like rear-door heat exchangers, reliability, and more. One particularly important area is standardizing layouts and fit-out scopes across colocation and third-party data centers, so we as an industry can enable more fungibility.
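To give a sense of the physical scale involved, here is a rough flow calculation from the heat-balance relation Q = ṁ·c·ΔT; the 100 kW rack load and 10 °C coolant temperature rise are assumptions for illustration, not Project Deschutes specifications.

```python
# Required coolant flow for a given rack heat load, from Q = m_dot * c_p * dT.
# The rack power and allowed temperature rise are illustrative assumptions.
rack_heat_w = 100_000     # assumed 100 kW of heat to remove
delta_t_k = 10.0          # assumed coolant temperature rise across the rack (K)
c_p_water = 4186.0        # specific heat of water, J/(kg*K)
rho_water = 997.0         # density of water, kg/m^3

mass_flow_kg_s = rack_heat_w / (c_p_water * delta_t_k)
volume_flow_l_min = mass_flow_kg_s / rho_water * 1000 * 60
print(f"~{mass_flow_kg_s:.2f} kg/s, or ~{volume_flow_l_min:.0f} L/min of water per rack")
# Roughly 2.4 kg/s (~140 L/min) for a single 100 kW rack -- one reason manifolds,
# rear-door heat exchangers, and facility water loops need standard interfaces.
```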

Finally, we need to bring together compute, networking, and storage in the server hall. This includes physical attributes of the data center design such as rack height, width, and depth (and more recently, weight); aisle widths and layouts; and rack and network interfaces. We also need standards for telemetry and mechatronics to build and maintain these future data centers. With our fellow OCP partners, we are standardizing telemetry integration for third-party data centers, including establishing best practices, developing common naming and implementations, and creating standard security protocols.
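To illustrate what common naming could enable (the record shape and field names below are hypothetical, not the OCP specification), a shared telemetry schema lets tooling from different operators, vendors, and colocation providers interoperate:

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
import json

# Hypothetical illustration of a shared telemetry record shape; the actual OCP
# work defines its own naming, transport, and security requirements.
@dataclass
class TelemetrySample:
    facility_id: str        # common naming across operators and colo providers
    rack_id: str
    sensor: str             # e.g. "coolant_supply_temp_c", "rack_power_kw"
    value: float
    unit: str
    timestamp_utc: str

sample = TelemetrySample(
    facility_id="colo-example-01",
    rack_id="hall2.row7.rack14",
    sensor="coolant_supply_temp_c",
    value=24.5,
    unit="celsius",
    timestamp_utc=datetime.now(timezone.utc).isoformat(),
)
print(json.dumps(asdict(sample), indent=2))
```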

Beyond physical infrastructure, we are also collaborating with our partners to deliver open standards for more scalable and secure systems.

Sustainability is embedded in our work. To provide insight into the environmental impact of AI, we developed a new methodology for measuring the energy, emissions, and water impact of emerging AI workloads, demonstrating that the median Gemini Apps text prompt consumes less than five drops of water and has the energy impact of watching TV for under nine seconds. We are applying this same data-driven approach to other collaborations across the OCP community, including an embodied carbon disclosure specification, green concrete, clean backup power, and reduced manufacturing emissions.
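As a rough cross-check of those comparisons (the ~100 W television and ~0.05 mL per drop are assumed reference values, not part of the published methodology):

```python
# Rough sanity check of the comparisons above. Only "under nine seconds of TV" and
# "less than five drops" come from the post; the reference values are assumptions.
tv_power_w = 100.0          # assumed typical TV power draw
tv_seconds = 9.0
energy_wh = tv_power_w * tv_seconds / 3600.0
print(f"TV comparison implies roughly {energy_wh:.2f} Wh per median prompt (an upper bound)")

drop_ml = 0.05              # assumed volume of one drop of water
water_ml = 5 * drop_ml
print(f"Water comparison implies roughly {water_ml:.2f} mL per median prompt (an upper bound)")
```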

A call to action: community-driven innovation and AI-for-AI

Google has a long history of working with open ecosystems that demonstrate the compounding power of community collaboration, and we have the opportunity to repeat that success as we design agile and fungible data centers for the AI era. Join us in the new OCP Open Data Center for AI Strategic Initiative to drive common standards and optimizations for agile and fungible data centers.

As we look ahead to the next waves of growth in AI and the amazing advances they will unlock, we will need to apply these AI advances to our own work to amplify our productivity and innovation. An early example is DeepMind's AlphaChip, which uses AI to accelerate and optimize chip design. We are seeing more promising uses of AI for systems: across hardware, firmware, software, and testing; for performance, agility, reliability, and sustainability; and across design, deployment, maintenance, and security. These AI-enhanced optimizations and workflows are what will bring the next order-of-magnitude improvements to the data center. We look forward to the innovations ahead, and to your continued collaboration in driving them forward.
