The Register: Oracle boasts zettascale ‘AI supercomputer,’ just don’t ask about precision

Source URL: https://www.theregister.com/2024/09/11/oracle_zettascale_supercluster/
Source: The Register
Title: Oracle boasts zettascale ‘AI supercomputer,’ just don’t ask about precision

Feedly Summary: Cluster of 131,072 Blackwell GPUs up for grabs starting H1 2025
Comment Oracle says it’s already taking orders on a 2.4 zettaFLOPS cluster with “three times as many GPUs as the Frontier supercomputer.”…

AI Summary and Description: Yes

Summary: The text discusses Oracle’s latest AI supercomputer offering and its headline zettaFLOPS performance claim. While Oracle touts a significant leap in computational power, the article stresses that such figures mean little unless the floating-point precision behind them is stated, because precision affects both model training and inference.

Detailed Description:
The content elaborates on Oracle’s announcement of its AI supercomputer, equipped with 131,072 Blackwell accelerators and billed as delivering 2.4 zettaFLOPS of compute performance. The discussion revolves around the validity of this claim and the complexities involved in measuring compute performance, especially in AI applications. Here are the significant points:

– **Performance Claims**:
– Oracle says it is already taking orders for a supercomputer with a headline figure of 2.4 zettaFLOPS.
– The reference to “AI zettaFLOPS” raises the question of which floating-point precision the figure assumes.

– **Floating-Point Precision**:
– Common performance metrics like FLOPS can mislead without proper context; they are often presented without specifying the precision used.
– The article notes that while Oracle cites zettaFLOPS, the cluster’s peak performance at FP16, the precision treated as standard for training, is much lower, around 459 exaFLOPS.

– **Implications for Model Accuracy**:
– Training AI models at lower precision (such as FP4 or FP8) can increase speed but may reduce model accuracy.
– Lower precision is commonly used for optimizing inference rather than training, which is a critical distinction for practitioners in AI.

– **Cluster Performance and Networking**:
– If Oracle can effectively network all 131,072 GPUs together using technologies like RoCEv2 or InfiniBand, the result would be a very large high-performance computing (HPC) cluster.
– The peak performance of the Blackwell Supercluster at FP64 could exceed that of prior systems, which points to its potential for double-precision HPC work as well as AI.

– **Comparative Analysis**:
– The text contrasts the performance of Oracle’s cluster with that of the AMD-powered Frontier supercomputer, illustrating the competitive landscape in AI supercomputing.

– **Future Availability**:
– Blackwell-based superclusters are expected to become available starting in the first half of 2025, which could influence market competition and drive advancements in AI computing capabilities.
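
The gap between the headline figure and the FP16 figure under Floating-Point Precision can be checked with simple arithmetic. The sketch below uses only the three numbers quoted in this summary (131,072 GPUs, 2.4 zettaFLOPS, roughly 459 exaFLOPS). The per-GPU values it prints are just the cluster totals divided by the GPU count, not vendor specifications, and the helper name `per_gpu_petaflops` is made up for illustration.

```python
# Figures quoted in the summary above; nothing here comes from a vendor datasheet.
GPU_COUNT = 131_072        # Blackwell accelerators in the cluster
HEADLINE_FLOPS = 2.4e21    # 2.4 zettaFLOPS headline "AI" figure
FP16_FLOPS = 459e18        # roughly 459 exaFLOPS at FP16


def per_gpu_petaflops(cluster_flops: float, gpus: int) -> float:
    """Return the per-GPU throughput implied by a cluster-wide total, in petaFLOPS."""
    return cluster_flops / gpus / 1e15


headline_per_gpu = per_gpu_petaflops(HEADLINE_FLOPS, GPU_COUNT)
fp16_per_gpu = per_gpu_petaflops(FP16_FLOPS, GPU_COUNT)
ratio = HEADLINE_FLOPS / FP16_FLOPS

print(f"Implied per-GPU throughput, headline figure: {headline_per_gpu:.1f} petaFLOPS")
print(f"Implied per-GPU throughput, FP16 figure:     {fp16_per_gpu:.1f} petaFLOPS")
print(f"Headline figure is about {ratio:.1f}x the FP16 figure")
```

On these numbers the headline claim is roughly five times the FP16 figure, which is why the precision behind a FLOPS number matters when comparing systems.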

Overall, the text serves as a critical reminder for security, compliance, and infrastructure professionals to scrutinize claims regarding AI performance metrics and consider the implications of floating-point precision on both model performance and business outcomes.
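
To make the precision trade-off concrete, here is a minimal sketch of how rounding error grows as the number of bits shrinks. Python's standard library has no FP8 or FP4 type, so this uses FP32 and FP16 as stand-ins (an assumption of the example, not something the article does). The helper names `round_to_fp16`, `round_to_fp32`, and `relative_error` are made up for illustration.

```python
import struct


def round_to_fp16(x: float) -> float:
    """Round a Python float (FP64) to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def round_to_fp32(x: float) -> float:
    """Round a Python float (FP64) to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def relative_error(approx: float, exact: float) -> float:
    """Relative difference between a rounded value and the original."""
    return abs(approx - exact) / abs(exact)


value = 0.1  # not exactly representable in any binary floating-point format
fp32_error = relative_error(round_to_fp32(value), value)
fp16_error = relative_error(round_to_fp16(value), value)

print(f"FP32: {round_to_fp32(value)!r}  relative error {fp32_error:.2e}")
print(f"FP16: {round_to_fp16(value)!r}  relative error {fp16_error:.2e}")
```

Each step down in bit width leaves fewer bits to represent a value, so the error per operation grows. Formats such as FP8 and FP4 push this further than the FP16 case shown here.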