Cloud Blog: Speed, scale and reliability: 25 years of Google data-center networking evolution

Source URL: https://cloud.google.com/blog/products/networking/speed-scale-reliability-25-years-of-data-center-networking/
Source: Cloud Blog
Title: Speed, scale and reliability: 25 years of Google data-center networking evolution

Feedly Summary: Rome wasn’t built in a day, and neither was Google’s network. But 25 years in, we’ve built out network infrastructure with scale and technical sophistication that’s nothing short of remarkable.
It’s all the more impressive because in the beginning, Google’s network infrastructure was relatively simple. But as our user base and the demand for our services grew exponentially, we realized that we needed a network that could handle an unprecedented scale of data and traffic, and that could adapt to dynamic traffic patterns as our workloads changed over time. This ignited a 25-year journey marked by numerous engineering innovations and milestones, ultimately leading to our current fifth-generation Jupiter data center network architecture, which now scales to 13 Petabits/sec of bisectional bandwidth. To put this data rate in perspective, this network could support a video call (@1.5 Mb/s) for all 8 billion people on Earth! 
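As a rough sanity check on that claim, the arithmetic works out as follows (a minimal sketch in Python, using only the figures quoted above):

```python
# Back-of-the-envelope check: can ~13 Pb/s carry a 1.5 Mb/s video call
# for all 8 billion people at once? (Figures are those quoted above.)
bisection_bandwidth_bps = 13.1e15   # ~13 Pb/s
video_call_bps = 1.5e6              # 1.5 Mb/s per call
world_population = 8e9

concurrent_calls = bisection_bandwidth_bps / video_call_bps
print(f"Concurrent calls supported: {concurrent_calls:.2e}")                    # ~8.7e+09
print(f"Enough for everyone on Earth: {concurrent_calls >= world_population}")  # True
```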
Today, we have hundreds of Jupiter fabrics deployed around the world, simultaneously supporting hundreds of services, billions of active daily users, all of our Google Cloud customers, and some of the largest ML training and serving infrastructures in the world. I would like to share more about our journey as we look ahead to the next generation of data center network infrastructure.


Guiding principles
Our network evolution has been guided by a few key principles:

Anything, anywhere: Our data center networks support efficiency and simplicity by allowing large-scale jobs to be placed anywhere among 100k+ servers within the same network fabric, with high-speed access to needed storage and support services. This scale improves application performance for internal and external workloads and eliminates internal fragmentation. 

Predictable, low latency: We prioritize consistent performance and minimizing tail latency by provisioning bandwidth headroom, maintaining 99.999% network availability, and proactively managing congestion through end-host and fabric cooperation.

Software-defined and systems-centric: Leveraging software-defined networking (SDN) for flexibility and agility, we qualify and globally release dozens of new features every two weeks across our global network.

Incremental evolution and dynamic topology: Incremental evolution helps us to refresh the network granularly (rather than bringing it down wholesale), while dynamic topology helps us to continuously adapt to changing workload demands. The combination of optical circuit switching and SDN supports in-place physical upgrades and an ever-evolving, heterogeneous network that supports multiple hardware generations in a single fabric.

Traffic engineering and application-centric QoS: Optimizing traffic flows and ensuring Quality of Service helps us tailor the network to each application’s needs.

Integrating across these principles is the foundation of our work. The network underpins reliability for every other compute service, from storage to AI; as such, the network must fail last and fail least. To support this responsibility, we rigorously define and monitor every bad minute¹ across hundreds of clusters and millions of ports in our global network. Our in-house, software-defined Jupiter networks now deliver 50x the reliability of prior generations of our data center networks.
2015 – Jupiter, the first Petabit network 
In a seminal paper, we showed that Jupiter data center networks scaled to 1.3 Pb/s of aggregate bandwidth by leveraging merchant switch silicon, Clos topologies and Software Defined Networking (SDN). This generation of Jupiter was the culmination of five generations of data center networks developed in house by the Google networking team. At that time, this data rate — in one Google data center — was more than the estimated aggregate IP traffic data rate for the global internet. 
2022 – Enabling 6 Petabit per second
In 2022 we announced that our Jupiter networks scaled to over 6 Pb/s, with deep integration of optical circuit switching (OCS), wavelength division multiplexing (WDM), and a highly scalable Orion SDN controller. These technologies unlocked a range of advancements, including incremental network builds, enhanced performance, reduced costs, lower power consumption, dynamic traffic management, and seamless upgrades.
2023 – 13 Petabit per second network
We have further enhanced Jupiter to support native 400 Gb/s link speeds in the network core. The fundamental building block of Jupiter networks (called the aggregation block) now consists of 512 ports of 400 Gb/s of connectivity both to end hosts and to the rest of the data center, for an aggregate of 204.8 Tb/s of bidirectional non-blocking bandwidth per block. We support 64 such blocks for a total bisection bandwidth of 64*204.8 Tb/s = 13.1 Pb/s. This technology has been powering Google’s production data centers for over a year, fueling the rapid advancement of artificial intelligence, machine learning, web search, and other data-intensive applications.
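The bandwidth arithmetic in the paragraph above can be restated in a few lines (a sketch using the figures as given; the variable names are ours):

```python
# Aggregation-block and fabric-wide bandwidth from the figures above.
ports_per_block = 512            # 400 Gb/s ports per aggregation block
port_speed_tbps = 0.4            # 400 Gb/s expressed in Tb/s
blocks_per_fabric = 64

block_bandwidth_tbps = ports_per_block * port_speed_tbps                   # 204.8 Tb/s
bisection_bandwidth_pbps = blocks_per_fabric * block_bandwidth_tbps / 1000

print(f"Per-block bandwidth: {block_bandwidth_tbps:.1f} Tb/s")              # 204.8 Tb/s
print(f"Fabric bisection bandwidth: {bisection_bandwidth_pbps:.1f} Pb/s")   # 13.1 Pb/s
```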
2024 and beyond – Extreme networking in the age of AI
While celebrating over two decades of innovation in data center networking, we’re already charting the course for the next generation of network infrastructure to support the age of AI. For example, our teams are busy working on the networking infrastructure for our upcoming A3 Ultra VMs, which feature NVIDIA ConnectX-7 networking and support non-blocking 3.2 Tbps per server of GPU-to-GPU traffic over RoCE (RDMA over Converged Ethernet), as well as for our future offerings based on NVIDIA GB200 NVL72.
Over the next few years, we will deliver significant advances in network scale and bandwidth, both per-port and network-wide. We will continue to push the boundaries of end-host integration, including the transport and congestion control stack, and streamline network stages to achieve even lower latency with tighter tails. Real-time topology engineering, deeper integration with the compute and storage stacks, and continued refinements to host-based load balancing techniques will further enhance network reliability and latency. With these innovations, our network will remain a cornerstone for the transformative applications and services that enrich the lives of our users throughout the world while simultaneously supporting the groundbreaking AI capabilities that power both our internal services and Google Cloud products.
We are excited to take on these challenges and opportunities to see what the next 25 years hold for Google networking!
Further resources

Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Google’s Datacenter Network, SIGCOMM ’15 [paper]

Journey of the first Jupiter datacenter network, leveraging merchant switch silicon, Clos topologies, and software-defined networking (SDN).

First deployed in production in 2012.

Mission Apollo: Landing Optical Circuit Switching at Datacenter Scale, arxiv.org, 2022 [paper]

First deployed in production in 2013.

Orion: Google’s Software-Defined Networking Control Plane. NSDI ‘21 [paper]

Google’s high-performance, scalable, intent-based distributed SDN platform used in both datacenter and wide area networks.

First deployed in production in 2016.

Jupiter Evolving: Transforming Google’s Datacenter Network via Optical Circuit Switches and Software-Defined Networking, SIGCOMM ’22 [paper]

Enabling technologies: OCS (2013), Orion SDN (2016), 200 Gb/s networking (2020), direct-connect topology (2017), dynamic traffic engineering (2018), dynamic topology engineering (2021).

Swift: Delay is Simple and Effective for Congestion Control in the Datacenter, SIGCOMM ’20 [paper]

Swift, a congestion control protocol using hardware timestamps and AIMD control with a delay target, delivers excellent performance in Google datacenters with low flow completion times for short RPCs and high throughput for long RPCs.

First deployed in production in 2017.
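Swift’s core mechanism — an AIMD window update driven by measured delay against a target — can be sketched roughly as below. This is a simplified illustration of the idea rather than the production Swift algorithm, and the constants are placeholders.

```python
# Simplified sketch of a delay-target AIMD update in the spirit of Swift.
# Real Swift uses hardware timestamps, per-hop target scaling, and pacing
# that are not modeled here; the parameters below are illustrative.
def delay_aimd_update(cwnd, rtt_us, target_us, ai=1.0, beta=0.8, min_cwnd=0.1):
    """Return a new congestion window given the latest RTT sample."""
    if rtt_us < target_us:
        # Below the delay target: additive increase (roughly +ai per RTT).
        return cwnd + ai / max(cwnd, 1.0)
    # Above the target: multiplicative decrease, scaled by the overshoot.
    overshoot = (rtt_us - target_us) / rtt_us
    return max(min_cwnd, cwnd * (1.0 - beta * overshoot))

# Example: a 70 us RTT against a 50 us target shrinks a window of 20 to ~15.4.
print(delay_aimd_update(cwnd=20.0, rtt_us=70.0, target_us=50.0))
```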

PLB: Congestion Signals are Simple and Effective for Network Load Balancing, SIGCOMM ’22 [paper]

Protective Load Balancing (PLB) is a simple, effective host-based load balancing design that reduces network congestion and improves performance by randomly changing paths for congested connections, preferring to repath after idle periods to minimize packet reordering.

First deployed in production in 2020.
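PLB’s repathing idea is straightforward to sketch: track congestion signals per connection, and when congestion persists, pick a new path identifier, preferably while the connection is idle so in-flight packets are not reordered. The sketch below is our own illustration with made-up thresholds, not the production PLB code.

```python
# Simplified sketch of PLB-style host-based repathing (illustrative only).
# Real PLB reacts to congestion signals such as ECN marks and repaths by
# changing the IPv6 flow label; the thresholds here are placeholders.
import random

class PlbLikeConnection:
    def __init__(self, congested_rounds_to_repath=3):
        self.path_id = random.getrandbits(20)    # stand-in for a flow label
        self.congested_rounds = 0
        self.rounds_to_repath = congested_rounds_to_repath
        self.pending_repath = False

    def on_round(self, congestion_fraction, idle):
        """Call once per round trip with the fraction of congestion-marked packets."""
        if congestion_fraction > 0.5:
            self.congested_rounds += 1
        else:
            self.congested_rounds = 0
        # Schedule a repath after sustained congestion...
        if self.congested_rounds >= self.rounds_to_repath:
            self.pending_repath = True
        # ...and prefer to apply it during an idle period to avoid reordering.
        if self.pending_repath and idle:
            self.path_id = random.getrandbits(20)
            self.pending_repath = False
            self.congested_rounds = 0

conn = PlbLikeConnection()
for _ in range(3):
    conn.on_round(congestion_fraction=0.9, idle=False)  # sustained congestion
conn.on_round(congestion_fraction=0.0, idle=True)       # repath once idle
```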

1. Any minute where a statistically significant number of network flows in the data center network experience a total or partial outage above a defined threshold.
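A hypothetical illustration of how such a metric might be computed (the thresholds and statistics actually used are not described here):

```python
# Hypothetical "bad minute" check per the footnote above; the threshold
# below is a placeholder, not Google's actual criterion.
def is_bad_minute(impacted_flows: int, total_flows: int,
                  outage_fraction_threshold: float = 0.001) -> bool:
    """Flag a minute in which the share of flows seeing a total or partial
    outage exceeds a defined threshold."""
    return total_flows > 0 and impacted_flows / total_flows > outage_fraction_threshold
```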

AI Summary and Description: Yes

Summary: The text outlines the evolution of Google’s network infrastructure over 25 years, emphasizing engineering milestones and innovations that have advanced the scalability, reliability, and performance of their Jupiter data center network. It highlights the role of software-defined networking (SDN) and other technologies that position Google to meet the demands of AI and data-intensive applications in the future.

Detailed Description:
– **Historic Overview**: Google’s network infrastructure has dramatically evolved from simple beginnings to an advanced, sophisticated framework capable of handling immense data and user demands.
– **Current Capacity**: The Jupiter data center network supports 13 Petabits/sec of bisectional bandwidth—enough for a video call for all 8 billion people on Earth.
– **Global Deployment**: Hundreds of Jupiter fabrics are now deployed worldwide, managing billions of daily users and numerous services.

– **Guiding Principles**: Several key principles underpin Google’s continuous network evolution:
  – **Anything, Anywhere**: Large-scale jobs can be placed across a vast network of servers, enabling high performance and eliminating fragmentation.
  – **Predictable, Low Latency**: Focus on performance consistency with proactive congestion management.
  – **Software-defined and Systems-centric**: Utilization of SDN for flexibility with frequent global feature releases.
  – **Incremental Evolution and Dynamic Topology**: Allows for gradual upgrades without taking down the network.
  – **Traffic Engineering and Application-centric QoS**: Tailors network performance to specific application needs.

– **Reliability Focus**: Rigorously defining and monitoring network disruptions is crucial for high availability; current Jupiter networks deliver 50 times the reliability of earlier generations.

– **Recent Innovations**:
  – **2022 Enhancements**: Jupiter networks scaled to over 6 Petabits/sec through deep integration of OCS and WDM technologies.
  – **2023 Advancements**: Native 400 Gb/s link speeds in the network core, bringing total bisection bandwidth to 13.1 Petabits/sec.

– **Future Perspectives**: Plans are underway for advancements that align with the increased requirements of AI applications, including integrating NVIDIA technologies for further enhancements in networking capabilities.

In summary, Google’s dedication to rigorous network evolution through technological innovation aligns precisely with future needs in AI and cloud services, ensuring that their infrastructure remains a robust backbone for delivering critical applications and services globally. This information is especially relevant to professionals in AI, cloud, and infrastructure security as it highlights the need for scalable, reliable networking solutions to support modern applications and services.