Achieving cloud-native network automation at a global scale with Nephio
Ankur Jain
Vice President, Engineering, Google Distributed Cloud
Randolph Chung
Senior Staff Software Engineer, Telecom, Google Cloud
In 2007, to meet the ever-increasing traffic demands of YouTube, Google started building what is now the Google Global Cache program. Over the past 15 years, we have added thousands of edge caching locations around the world, with widely varying hosting conditions: some in customer data centers, some in remote locations with limited connectivity. Google manages the software and hardware lifecycles of all these systems remotely.
Although the fleet size and serving corpus have grown by several orders of magnitude during this time, the operations team overseeing it has remained relatively small and agile. How did we do it?
We started with a set of automation tools for software deployment (remotely executing commands), a set of tools for auditing/repairs (if this condition occurs, run that command), and a third set of tools for configuration management. As the fleet grew and was deployed in more varied environments, we discovered and fixed more edge cases in our automation tools. Soon, the system started reaching its scaling limits, and we built a new, more uniform and more scalable system in its place. We learned a few key lessons in the process:
- Intent-driven, continuously reconciling systems are more robust at scale than imperative, fire-and-forget tools (a minimal sketch of this pattern follows the list).
- Distributed actuation of intent is a must for large-scale edge deployments. Triggering all actions from a centralized location is neither reliable nor scalable, especially across a large edge fleet.
- Uniformity in systems is easier to maintain. Being able to manage deployment, repairs, and configuration using common components and common workflows (in other words, files checked into a repository with presubmit validation, review, version control, and rollback capability) reduces cognitive load for the operations team and allows more rapid response with fewer human errors.
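To make the first lesson concrete, here is a minimal sketch in Go of what an intent-driven, continuously reconciling loop looks like: on every cycle it compares declared intent against observed state and issues only the corrective actions needed to converge. The types, site names, and helper functions are illustrative placeholders, not the actual Google or Nephio implementation.

```go
package main

import (
	"fmt"
	"time"
)

// Intent is the declared, desired state for one edge site. In a KRM-based
// system this would be a versioned resource in a repository, not a
// hard-coded struct; the fields here are hypothetical.
type Intent struct {
	Site     string
	Replicas int
	Version  string
}

// Observed is what the site actually reports right now.
type Observed struct {
	Replicas int
	Version  string
}

// fetchObserved stands in for reading live state from a site-local agent.
func fetchObserved(site string) Observed {
	// Placeholder: a real controller would query the site's API server.
	return Observed{Replicas: 2, Version: "1.4.0"}
}

// actuate stands in for a local, idempotent action that moves the site
// toward the declared intent (for example, scaling or upgrading a workload).
func actuate(site, action string) {
	fmt.Printf("[%s] %s\n", site, action)
}

// reconcile compares intent with observation and issues corrective actions.
// It is safe to call repeatedly: if nothing has drifted, it does nothing.
func reconcile(in Intent) {
	obs := fetchObserved(in.Site)
	if obs.Version != in.Version {
		actuate(in.Site, fmt.Sprintf("upgrade %s -> %s", obs.Version, in.Version))
	}
	if obs.Replicas != in.Replicas {
		actuate(in.Site, fmt.Sprintf("scale %d -> %d replicas", obs.Replicas, in.Replicas))
	}
}

func main() {
	intent := Intent{Site: "edge-cache-example-17", Replicas: 3, Version: "1.5.2"}

	// Continuous reconciliation: the loop keeps converging on the declared
	// intent instead of assuming a one-shot command succeeded everywhere.
	ticker := time.NewTicker(2 * time.Second)
	defer ticker.Stop()
	for i := 0; i < 3; i++ {
		<-ticker.C
		reconcile(intent)
	}
}
```

Run repeatedly, a loop like this tolerates partial failures and drift: a site that misses one cycle simply converges on the next, which is what makes the pattern robust at the scale described above.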
This pattern repeats time and time again across many large distributed systems at Google, and we believe these tenets are key as network function vendors and communication service providers look to adopt cloud-based network technologies. For example, a 5G deployment can involve hundreds of locations (or many hundreds of thousands, in the case of RAN), each running containerized software components, and the industry needs better tools to handle deployment and operations at that scale. By working with the community to address these issues, we hope to drive a common Kubernetes-based, cloud-native network automation architecture, while also providing extension points for vendors to innovate and adapt to their specific requirements.
That’s why Google Cloud founded the Nephio project in April 2022. The Nephio community launched with 24 founding organizations and has since doubled in size. In addition to the founding members, new participating organizations include Vodafone, Verizon, Telefonica, Deutsche Telekom, KT, HPE, Red Hat, Wind River, Tech Mahindra, and others. Over 150 developers across the globe participated in the community kickoff meeting hosted by the Linux Foundation on May 17, 2022.
Google Cloud is collaborating with communication service providers, network function vendors, and cloud providers in Nephio by:
- Working with the community to refine the cloud-native automation architecture and define a common data model based on the Kubernetes Resource Model (KRM) and the Configuration as Data (CaD) approach. This new model needs to support cloud infrastructure, network function deployment, and network function management user journeys.
- Contributing to the development of an open, fully functional reference implementation of this architecture.
- Open sourcing several key building blocks, such as kpt, Porch, and ConfigSync. We are also planning to open source controllers, Google Cloud infrastructure CRDs, additional sample network function (NF) CRDs, and operators to jumpstart the Nephio project; a hypothetical sketch of what such an NF resource might look like follows this list.
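As a rough illustration of what a KRM-style network function resource might look like, the sketch below defines a simplified, hypothetical UPFDeployment type in Go and prints it as JSON. The group, version, kind, field names, and labels are invented for this example and do not represent the actual Nephio or Google Cloud CRDs.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Metadata mirrors the standard KRM object metadata fields.
type Metadata struct {
	Name      string            `json:"name"`
	Namespace string            `json:"namespace,omitempty"`
	Labels    map[string]string `json:"labels,omitempty"`
}

// UPFDeploymentSpec captures the declared intent for a hypothetical
// user-plane function: where it runs, how many instances, and what capacity.
type UPFDeploymentSpec struct {
	Sites       []string `json:"sites"`
	Replicas    int      `json:"replicas"`
	Capacity    string   `json:"capacity"`    // e.g. "10Gbps"
	UpgradePlan string   `json:"upgradePlan"` // e.g. "rolling"
}

// UPFDeployment is the full KRM-style resource: apiVersion, kind, metadata,
// and spec, suitable for storage and review in a git repository.
type UPFDeployment struct {
	APIVersion string            `json:"apiVersion"`
	Kind       string            `json:"kind"`
	Metadata   Metadata          `json:"metadata"`
	Spec       UPFDeploymentSpec `json:"spec"`
}

func main() {
	// Hypothetical group/version and values, purely for illustration.
	nf := UPFDeployment{
		APIVersion: "workloads.example.org/v1alpha1",
		Kind:       "UPFDeployment",
		Metadata: Metadata{
			Name:      "upf-west",
			Namespace: "ran-core",
			Labels:    map[string]string{"example.org/region": "us-west"},
		},
		Spec: UPFDeploymentSpec{
			Sites:       []string{"edge-site-101", "edge-site-102"},
			Replicas:    2,
			Capacity:    "10Gbps",
			UpgradePlan: "rolling",
		},
	}

	out, err := json.MarshalIndent(nf, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Because such a resource is just declarative data, it can be checked into a repository and put through the same review, versioning, and rollback workflow described earlier, which is the point of the Configuration as Data approach.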
Google Cloud will also integrate Nephio with our Google Distributed Cloud Edge platform, bringing our customers the advantages of a fully managed hardware platform combined with Nephio-powered deployment and management of network functions.
The Nephio community is complementary to many existing open source communities and standards. Nephio is working closely with adjacent communities in CNCF, LF Networking, and LF Edge to provide an end-to-end automation framework for telecommunication networks.
By working with the community in this open manner, we believe that, together, we can advance the state of the art in network automation and improve the deployment and management of network functions on cloud-native infrastructure.
We welcome the industry to join us in this effort. For more information, please visit the Nephio website at www.nephio.org. And please register to join us online or in person at the Nephio developer summit on June 22 and 23.