Cloud Blog: Sustainable silicon to intelligent clouds: collaborating for the future of computing

Oct 15, 2024

—

Source URL: https://cloud.google.com/blog/topics/systems/2024-ocp-global-summit-keynote/
Source: Cloud Blog
Title: Sustainable silicon to intelligent clouds: collaborating for the future of computing

Feedly Summary: Editor’s note: Today, we hear from Parthasarathy Ranganathan, Google VP and Technical Fellow, and Amber Huffman, Principal Engineer. Partha delivered a keynote address today at the 2024 OCP Global Summit, an annual conference for leaders, researchers, and pioneers in the open hardware industry. Amber is on the board of directors at the Open Compute Project (OCP). Read on to hear about the past and future of hyperscale computing, and for an overview of all of our activities in the OCP community.

We are in an exciting era of hyperscale computing, one where a new wave of innovations is building the foundation for AI/ML computing in the cloud. Building on Google’s rich 25-year history in hyperscale computing, we look ahead to how co-design and collaboration — across the hardware-software stack, disciplines, and communities — will be key to this exciting new future.
From scrappy beginnings to societal infrastructure
When Google was founded in 1998, it was clear that successful web search would require enormous amounts of computing power and storage. This led to the design of the very first hyperscale computers specialized for search. These early makeshift systems included creative cost-reduction approaches like corkboard servers and off-the-shelf fans from Walmart, and they set the stage for the hardware-software co-design and workload-specific specialization principles that we follow to this day.
Building on these first systems, over the subsequent decade, Google laid the groundwork for modern hyperscale computing, pioneering custom servers, custom networking, and custom data centers, and expanding our services beyond search to include Gmail, YouTube, and Android. All of this presaged the modern multi-workload cloud. During this period, we also developed essential systems software like Borg, Colossus, MapReduce, and Bigtable. In the following years, we focused on scaling these systems, while also prioritizing security, reliability, and power efficiency. The formation of the Open Compute Project (OCP) in 2011 marked the transition of hyperscale computing from niche discipline to more mainstream offering. In the current decade, hyperscale computing is characterized by innovations to counter the slowing of Moore’s law: specialized hardware to support machine learning and video processing as well as software-defined servers to manage heterogeneity.
Today, hyperscale computing has truly come into its own, evolving into the crucial societal infrastructure that drives cloud and AI workloads.
Cross-disciplinary co-design: the heart of innovation
Across all these Google innovations over the past 25 years, one theme has remained constant: a strong commitment to cross-disciplinary systems innovation and co-design. Looking ahead to the AI era, we continue to take a holistic approach: from “mud to cloud,” starting at the very ground on which we build our data centers and extending up to broader cloud computing services, and from “chip to ship,” designing hardware that we then deploy and use in production. This philosophy has driven some incredible efficiency gains, delivering orders-of-magnitude improvements across multiple generations of systems.
Take our Tensor Processing Units (TPUs). Multiple generations of these purpose-built AI accelerators (including our latest Trillium TPU) have driven significant advances in machine learning, including large language models like Gemini and Nobel Prize-winning scientific breakthroughs like AlphaFold. However, we’ve gone beyond just chip design to consider the entire system that surrounds the chips. We’ve coupled TPUs with innovations like liquid cooling, advanced networking systems featuring cutting-edge optics and topology awareness, and a commitment to sustainable power, all in the service of creating a truly amazing AI platform. We’ve then layered open software frameworks like JAX, TensorFlow, OpenXLA, and Kubernetes on top of this hardware foundation, creating what we call the AI Hypercomputer. This hypercomputer is further enhanced by integrating with model gardens and applications, creating a vertically integrated ecosystem that’s optimized for AI workloads.

Cross-industry collaboration: from ideas to impact
But there’s also another aspect of holistic co-design that has served us well: cross-industry collaborations, i.e., building standards and ecosystems. Our partnership with OCP is an important example of this. Since formally joining OCP in 2016, we’ve continued to grow our contributions year after year. Looking ahead, we want to highlight progress and opportunities in four key areas.
Sustainability
Last year, Google, along with fellow hyperscalers, rallied the industry to reduce carbon emissions with an ambitious roadmap towards greener concrete. We have since made good progress, collaborating to develop new metrics and benchmarks, identifying streamlined data center designs that minimize concrete use, and even using AI to research new materials. At a recent event, we demonstrated proof-of-concept concrete mixtures that can reduce carbon emissions by 20% to 40%.
As we work towards net-zero emissions by 2030 across our operations and value chain, there’s a lot more we can do. At OCP this year, we are discussing how to develop product category rules (PCRs) to accurately measure hardware emissions across the lifecycle, make more high-quality carbon data available, and develop clean reliable power backup for our data centers. Further, we’re continuing to look holistically at all aspects of our energy consumption, carbon footprint, and water usage.
Trusted silicon
Trusted silicon is a foundational element of hyperscaler systems. Over the past three years, we have collaborated on Caliptra, a re-usable IP block for root-of-trust management, and delivered an open-source implementation of Caliptra 1.0 that is being integrated by companies across the ecosystem. Google’s future TPUs and ARM SoCs will also include Caliptra. Leveraging Caliptra, the OCP L.O.C.K. project will provide layered open-source cryptographic key management for storage devices, improving both trust and sustainability.
In the area of silicon reliability, we are continuing our industry-academia collaborations around a systems approach to addressing silicon faults and silent data errors, including funding six leading academic institutions for novel research. The Server Component Resilience (SDC) Specification discusses the opportunities ahead with standardized information exchange and test metrics and open frameworks for detecting and mitigating errors.
AI accelerators
AI represents a fundamental platform shift requiring us to innovate across hardware and software. Google has played an active role in driving standardization efforts for AI accelerators, particularly in areas like low-precision data formats (e.g., OCP FP8 and MX), software frameworks (e.g., OpenXLA, JAX, TensorFlow), and networking (Falcon, Ultra Ethernet, Ultra Accelerator Link). Working with other hyperscalers and GPU suppliers, we have also aligned on common specifications for firmware updates, management interfaces, and RAS (reliability, availability, serviceability).
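The MX formats mentioned above are built around a scale that is shared by a block of elements. As a rough illustration of that idea only, and not the encoding defined in the OCP specification, the Python sketch below quantizes a block of values to 8-bit integers with one shared power-of-two scale. The block size of 32, the function names, and the integer element type are assumptions chosen for the example.

```python
import math

BLOCK_SIZE = 32  # assumed block size for this illustration


def quantize_block(values, elem_bits=8):
    """Quantize one block to signed integers that share a power-of-two scale.

    Returns (exp, ints), where each original value is approximately
    ints[i] * 2**exp. Hypothetical helper, not part of any library.
    """
    qmax = 2 ** (elem_bits - 1) - 1  # 127 for 8-bit elements
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return 0, [0] * len(values)
    # Smallest exponent such that amax / 2**exp still fits in [-qmax, qmax].
    exp = math.ceil(math.log2(amax / qmax))
    scale = 2.0 ** exp
    # Clamp to guard against floating-point edge cases in log2.
    ints = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return exp, ints


def dequantize_block(exp, ints):
    """Reconstruct approximate values from the shared exponent and integers."""
    scale = 2.0 ** exp
    return [i * scale for i in ints]


if __name__ == "__main__":
    block = [0.1 * i for i in range(BLOCK_SIZE)]
    exp, ints = quantize_block(block)
    restored = dequantize_block(exp, ints)
    worst = max(abs(a - b) for a, b in zip(block, restored))
    print(f"shared exponent: {exp}, worst-case error: {worst:.6f}")
```

Storing one scale per block rather than per element is what keeps the per-element cost low: each element needs only its few low-precision bits, and the scale is amortized across the block.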
But as AI continues to drive exponential demands on computing, we can do more. As part of the OCP AI Strategic initiative, we are sharing learnings from deploying over 1 GW of liquid cooled infrastructure to help the industry scale this capability. We are also identifying new power-delivery solutions, from chips to racks to data centers. Notably, akin to how Google led the industry with 48V racks, at OCP Summit this year, we are proposing 400V DC distribution and rack solutions that can significantly improve data center density and efficiency.
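One way to see why a higher distribution voltage can help: for a fixed power draw, current falls in proportion to voltage, and resistive loss in the conductors falls with the square of the current. The Python sketch below works through that arithmetic. The 100 kW rack power and 1 milliohm conductor resistance are hypothetical round numbers, not figures from the keynote, and the model ignores conversion stages and voltage drop.

```python
def distribution_current_and_loss(power_w, voltage_v, resistance_ohm):
    """Return (current in amps, resistive loss in watts) for a simple DC feed.

    Simplified model: I = P / V and loss = I**2 * R.
    """
    current_a = power_w / voltage_v
    loss_w = current_a ** 2 * resistance_ohm
    return current_a, loss_w


RACK_POWER_W = 100_000         # hypothetical rack power
BUSBAR_RESISTANCE_OHM = 0.001  # hypothetical conductor resistance

if __name__ == "__main__":
    for voltage in (48, 400):
        current, loss = distribution_current_and_loss(
            RACK_POWER_W, voltage, BUSBAR_RESISTANCE_OHM
        )
        print(f"{voltage:>3} V: {current:8.1f} A, {loss:8.1f} W lost in conductors")
```

Under these assumptions, moving from 48 V to 400 V cuts current by a factor of 400/48 (about 8.3) and conductor loss by the square of that factor (about 69), which is the intuition behind smaller conductors and denser racks.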
Systems infrastructure
Finally, we continue to make great progress on foundational systems infrastructure. Google’s contributions this past year span NVM Express for the data center (e.g., security enhancements, open test repositories), servers (e.g., OpenTitan platform root of trust), and networking (Falcon, SONiC advancements in telemetry and simulation, advanced PCIe enclosure compatible form factor), as well as new efforts such as the open-source random shock and vibration testing. At the same time, we’ve gone beyond technical contributions to form and co-chair the OCP Advisory Board as well as guide the formation of the OCP AI Strategic Initiative.
Looking ahead, we will continue to keep innovating in this space, particularly to meet the next level of scale required by AI infrastructure. Notably, at the OCP Summit this year, we are discussing the adoption of robotics and automation for data centers. Across a range of activities (material movement, monitoring/inspection, servicing/repair, media management), robotics enable data center operations to scale safely and sustainably, and present a fundamental shift in how we build these facilities.
Innovating for the new intelligence revolution
We have a lot to be proud of over the past 25 years of hyperscale computing, but the best is yet to come. With AI, we are at an exciting inflection point in computing: the beginning of the new intelligence revolution. Akin to prior shifts — the industrial revolution for manufacturing or the information revolution with the mobile internet — this revolution will have a profound impact on both technology and society, and holistic system innovations will be key to enabling it. We look forward to collaborating with all of you on this exciting journey.

AI Summary and Description: Yes

Summary: The text discusses the evolution and future of hyperscale computing, emphasizing its critical role in AI/ML workloads and the importance of holistic co-design and collaboration across various technological domains. It highlights Google’s contributions to sustainable practices, trusted silicon, AI accelerator standardization, and foundational systems infrastructure, encapsulating a comprehensive vision for the next generation of cloud and AI computing.

Detailed Description:

The provided text outlines key insights into the evolution of hyperscale computing as articulated by Google leadership during the 2024 OCP Global Summit. Several major points are addressed, making it highly relevant to professionals in AI, cloud, infrastructure, and security:

– **Historical Context and Evolution**:
– Google’s journey began in 1998 with innovations in computing power for web search.
– Development of hyperscale computing led to custom servers, networks, and data centers supporting multiple workloads.
– Creation of essential systems software (Borg, Colossus, MapReduce, Bigtable) that laid a strong foundation for modern cloud computing.

– **Holistic Approach to AI/ML**:
– Emphasis on cross-disciplinary co-design across the full stack: “mud to cloud” (from the ground on which data centers are built up to cloud services) and “chip to ship” (designing hardware that is then deployed and used in production).
– Innovations such as Tensor Processing Units (TPUs) have redefined accelerators for machine learning, indicating a trend toward specialization in hardware to support AI workloads.

– **Sustainability Initiatives**:
– Collaboration with other hyperscalers on a roadmap toward greener concrete, with proof-of-concept mixtures shown to reduce carbon emissions by 20% to 40%.
– Development of metrics and benchmarks for hardware emissions is a priority, showcasing a commitment to environmental responsibility alongside technological advancements.

– **Trusted Silicon and Security Enhancements**:
– Caliptra, a reusable open-source IP block for root-of-trust management, is being integrated by companies across the ecosystem and will be included in Google’s future TPUs and ARM SoCs.
– Research collaborations to address silicon reliability, including the Server Component Resilience Specification, reflect a proactive stance on reducing risks of silent data errors.

– **AI Accelerator Standardization**:
– Google’s influence in standardizing aspects of AI accelerators—low precision data formats, software frameworks, and networking—is crucial for interoperability across systems and performance optimization.

– **Future Infrastructure Initiatives**:
– Continued contributions to systems infrastructure cover NVM Express for the data center (security enhancements, open test repositories), servers (OpenTitan platform root of trust), and networking (Falcon, SONiC), with the aim of meeting the scale required by AI infrastructure.
– Robotics and automation are introduced as transformative for safe, sustainable, and efficient data center operations.

– **Concluding Vision**:
– The call to action encourages collaboration across industries to harness the upcoming ‘intelligence revolution,’ which has the potential for a broad societal impact.

In summary, the text lays out the priorities shaping the future of hyperscale computing: AI/ML systems, sustainability, and security. These themes are directly relevant to professionals working in cloud and infrastructure, who can use them to anticipate upcoming challenges and opportunities.
