Source URL: https://cloud.google.com/blog/products/compute/updates-to-ai-hypercomputer-software-stack/
Source: Cloud Blog
Title: AI Hypercomputer software updates: Faster training and inference, a new resource hub, and more
Feedly Summary: The potential of AI has never been greater, and infrastructure plays a foundational role in driving it forward. AI Hypercomputer is our supercomputing architecture based on performance-optimized hardware, open software, and flexible consumption models. Together, these offer exceptional performance and efficiency, resiliency at scale, and give you the flexibility to choose offerings at each layer to suit your needs.
Today, we’re announcing major updates to the AI Hypercomputer software layer for training and inference performance, improved resiliency at scale, as well as a centralized hub for AI Hypercomputer resources.
AI Hypercomputer resources on GitHub
AI Hypercomputer’s open software layer not only supports leading ML frameworks and orchestration options, but also provides workload optimizations and reference implementations to improve time-to-value for your specific use case. To make the innovations in our open software stack easily accessible to developers and practitioners, we are introducing the AI Hypercomputer GitHub organization, a central place where you can discover reference implementations such as MaxText and MaxDiffusion, orchestration tools such as xpk (the Accelerated Processing Kit for cluster creation and workload management), and performance recipes for GPUs on Google Cloud. We’ll continue to add to this list and adapt these resources to a rapidly evolving landscape, and we invite you to contribute with us.
MaxText now supports A3 Mega VMs
MaxText is a high-performance, highly scalable, open-source reference implementation for large language models (LLMs). You can now use performance-optimized LLM training examples for A3 Mega VMs, which are powered by NVIDIA H100 Tensor Core GPUs and offer a 2X improvement in GPU-to-GPU network bandwidth over A3 VMs. We worked closely with NVIDIA to optimize JAX and XLA, enabling the overlap of collective communication and computation on GPUs. Additionally, we added optimized model configurations and example scripts for GPUs with XLA flags enabled.
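For a sense of how those XLA flags are applied: they are passed to JAX through the `XLA_FLAGS` environment variable before the framework initializes. The snippet below is a minimal sketch showing the mechanism with one representative flag; the published A3 Mega recipes set a longer, version-specific list of flags.

```python
import os

# Minimal sketch: set XLA GPU flags before JAX initializes its backends.
# The flag below enables XLA's latency-hiding scheduler, which helps overlap
# collective communication with computation; the A3 Mega recipes configure
# additional, version-specific flags beyond this one.
os.environ["XLA_FLAGS"] = " ".join([
    "--xla_gpu_enable_latency_hiding_scheduler=true",
])

import jax  # imported after XLA_FLAGS is set so the flags take effect

print(jax.devices())
```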
MaxText with A3 Mega VMs can deliver near linear scaling of training performance as you scale the number of VMs in the cluster, as demonstrated below by Llama2-70b pre-training.
Figure-1a: Google internal data for Llama2-70b (MaxText) pretraining on A3 Mega. Relative performance (bf16 training) vs ideal scaling. A3 Mega training with bf16 also demonstrates close to linear scaling.
Furthermore, you can use FP8 mixed-precision training on A3 Mega VMs to achieve additional acceleration and hardware utilization. We added FP8 support in MaxText via Accurate Quantized Training (AQT), the same quantization library that also powers INT8 mixed-precision training on Cloud TPUs.
Our benchmarks on dense models indicate FP8 training with AQT can deliver up to 55% improvement in effective model flops utilization (EMFU) compared to bf16. You can find recipes and optimal training examples for A3 Mega here.
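As a rough conceptual illustration of what FP8 mixed precision does (this is not the AQT API; AQT adds calibrated scales, per-channel quantization, and training-time machinery on top), an FP8 matmul casts operands to an 8-bit floating-point format and accumulates in a wider type:

```python
import jax
import jax.numpy as jnp


def fp8_matmul(x: jax.Array, w: jax.Array) -> jax.Array:
    """Conceptual fp8 mixed-precision matmul (illustrative, not the AQT kernel)."""
    # Scale operands into the representable range of float8_e4m3fn
    # (448 is its largest finite value), cast to fp8, and accumulate in fp32.
    x_scale = jnp.max(jnp.abs(x)) / 448.0
    w_scale = jnp.max(jnp.abs(w)) / 448.0
    x8 = (x / x_scale).astype(jnp.float8_e4m3fn)
    w8 = (w / w_scale).astype(jnp.float8_e4m3fn)
    out = jnp.dot(x8, w8, preferred_element_type=jnp.float32)
    return out * (x_scale * w_scale)  # undo the scaling


x = jax.random.normal(jax.random.PRNGKey(0), (128, 256), dtype=jnp.bfloat16)
w = jax.random.normal(jax.random.PRNGKey(1), (256, 512), dtype=jnp.bfloat16)
print(fp8_matmul(x, w).shape)  # (128, 512)
```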
Figure-1b: Google internal data for Llama2-70b (MaxText) pretraining on A3 Mega. Effective model flops utilization (EMFU) is computed against bf16 peak flops for both bf16 and fp8 mixed-precision training. Sequence length is 4096 tokens.
Reference implementations and kernels for MoEs
For most mixture of experts (MoE) use cases, it’s useful to have consistent resource utilization of a limited number of experts. However, for certain applications, the ability to use more experts to develop richer responses is more important. To give you this flexibility, we’ve now expanded MaxText to include both "capped" and "no-cap" MoE implementations so you can choose the best implementation for your model architecture. Capped MoE models offer predictable performance, while no-cap models dynamically allocate resources for optimal efficiency.
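The toy router below illustrates the difference at the routing level; it is a conceptual top-1 sketch, not the MaxText implementation. With a capacity cap, tokens assigned to an expert beyond its per-batch capacity are dropped, giving predictable memory and compute; with no cap, every token is routed, at the cost of potentially unbalanced expert load.

```python
import jax
import jax.numpy as jnp


def route_top1(logits, capacity=None):
    """Toy top-1 MoE router illustrating 'capped' vs 'no-cap' behavior."""
    expert = jnp.argmax(logits, axis=-1)  # [tokens] chosen expert per token
    if capacity is None:
        # "No-cap": every token is kept, whatever the expert load.
        keep = jnp.ones_like(expert, dtype=bool)
    else:
        # "Capped": count each token's position within its expert's queue
        # and drop tokens that overflow the per-expert capacity.
        one_hot = jax.nn.one_hot(expert, logits.shape[-1], dtype=jnp.int32)
        position_in_expert = jnp.cumsum(one_hot, axis=0) * one_hot
        keep = position_in_expert.max(axis=-1) <= capacity
    return expert, keep


logits = jax.random.normal(jax.random.PRNGKey(0), (16, 4))  # 16 tokens, 4 experts
print(route_top1(logits, capacity=3))
```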
To further accelerate MoE training, we’ve open-sourced Pallas kernels, which are optimized for block-sparse matrix multiplication on Cloud TPUs (Pallas is an extension to JAX that allows fine-grained control over the code generated for XLA devices such as GPUs and TPUs; block-sparse matmul is currently available only for TPUs). These kernels can be used with both PyTorch and JAX, providing high-performance building blocks for training your MoE models.
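To get a feel for the Pallas programming model, here is a minimal dense tiled matmul written with pallas_call. The open-sourced MoE kernels build on the same primitives but add block-sparsity metadata and TPU-specific tuning, so treat this as an orientation sketch rather than the released kernels.

```python
import jax
import jax.numpy as jnp
from jax.experimental import pallas as pl


def matmul_kernel(x_ref, y_ref, o_ref):
    # Each grid step sees one (bm, K) tile of x and one (K, bn) tile of y.
    o_ref[...] = jnp.dot(x_ref[...], y_ref[...])


def matmul(x, y, bm=128, bn=128):
    m, k = x.shape
    _, n = y.shape
    return pl.pallas_call(
        matmul_kernel,
        grid=(m // bm, n // bn),
        in_specs=[
            pl.BlockSpec(block_shape=(bm, k), index_map=lambda i, j: (i, 0)),
            pl.BlockSpec(block_shape=(k, bn), index_map=lambda i, j: (0, j)),
        ],
        out_specs=pl.BlockSpec(block_shape=(bm, bn), index_map=lambda i, j: (i, j)),
        out_shape=jax.ShapeDtypeStruct((m, n), x.dtype),
    )(x, y)


x = jnp.ones((256, 512), jnp.bfloat16)
y = jnp.ones((512, 256), jnp.bfloat16)
print(matmul(x, y).shape)  # (256, 256)
```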
Figure-2: Google internal data for Mixtral-8x7b (MaxText) pretraining on Cloud TPU v5p. Sequence length is 4096 tokens. Weak scaling is measured with fixed per-device batch size.
Our benchmarks with a no-cap MoE model (Mixtral-8x7b) indicate near-linear scaling with a fixed per-device batch size (Figure-2). We also observed close to linear scaling when we increased the number of experts in the base configuration along with the number of accelerators (Figure-3), which is indicative of performance on models with higher sparsity as well.
Figure-3: Google internal data for Mixtral-8xNb (MaxText) pretraining on Cloud TPU v5p. Sequence length is 4096 tokens. Weak scaling is measured by increasing the number of experts (N) with the size of the cluster ranging from 64 to 512 v5p chips.
Monitoring large-scale training
Having large clusters of accelerators that are expected to work together as a unit on a training task can complicate MLOps. You may have questions such as “Did host transfer latencies spike for a reason?” or “Why did this one device hit a segfault?” Yet monitoring large-scale training jobs with the right metrics is imperative to maximizing your resource utilization and improving overall ML Goodput. To simplify this crucial part of your MLOps charter, we’ve introduced a reference monitoring recipe. This recipe helps you create a Cloud Monitoring dashboard within your Google Cloud project that shows useful statistical measures such as average or maximum CPU utilization, and helps identify outliers in the setup so you can take corrective action.
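The recipe creates the dashboard for you. If you also want to pull metrics programmatically, a sketch along these lines with the Cloud Monitoring Python client can surface per-VM outliers; the project ID is a placeholder, and the recipe’s dashboard covers richer accelerator and host metrics than the single CPU metric shown here.

```python
import time
from google.cloud import monitoring_v3

# Sketch: pull the last hour of per-VM CPU utilization so outliers stand out.
client = monitoring_v3.MetricServiceClient()
now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"start_time": {"seconds": now - 3600}, "end_time": {"seconds": now}}
)

series = client.list_time_series(
    request={
        "name": "projects/your-project-id",  # placeholder project ID
        "filter": 'metric.type = "compute.googleapis.com/instance/cpu/utilization"',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

for ts in series:
    values = [p.value.double_value for p in ts.points]
    instance = ts.resource.labels.get("instance_id", "unknown")
    print(instance, f"max={max(values):.2f}", f"avg={sum(values) / len(values):.2f}")
```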
SparseCore on Cloud TPU v5p is now GA
Recommender models, as well as other models that rely on embeddings, require high-performance random memory access to use those embeddings. SparseCore, the TPU’s hardware accelerator for embeddings, enables you to build more powerful and efficient recommendation systems. Each Cloud TPU v5p chip has four dedicated SparseCores, delivering up to a 2.5X performance improvement for DLRM-V2 over its predecessor.
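For intuition, the access pattern SparseCore accelerates looks like the following in JAX: a gather at effectively random rows of a large embedding table, followed by a per-example pooling step. This is an illustration of the workload shape only, not a SparseCore API.

```python
import jax
import jax.numpy as jnp

# Illustration of the embedding access pattern: random-row gathers from a
# large table, then a per-example reduction (mean-pooling over feature IDs).
vocab, dim = 100_000, 128
table = jax.random.normal(jax.random.PRNGKey(0), (vocab, dim), dtype=jnp.bfloat16)

# A batch of 8 examples, each with 16 randomly scattered categorical IDs.
ids = jax.random.randint(jax.random.PRNGKey(1), (8, 16), 0, vocab)

pooled = jnp.take(table, ids, axis=0).mean(axis=1)  # [8, 128] pooled embeddings
print(pooled.shape)
```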
Improving LLM inference performance
Finally, to improve LLM inference performance, we introduced KV cache quantization and ragged attention kernels in JetStream, an open-source throughput-and-memory-optimized engine for LLM inference. Together, these enhancements improve inference performance by up to 2X on Cloud TPU v5e.
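Conceptually, KV cache quantization stores keys and values in a narrower format plus a scale, roughly halving cache memory versus bf16 so the server can hold larger batches. The sketch below shows the basic idea with int8 and per-head scales; it is illustrative, not JetStream’s exact scheme.

```python
import jax
import jax.numpy as jnp


def quantize_kv(kv: jax.Array):
    """Conceptual int8 KV-cache quantization (not JetStream's exact scheme)."""
    # Per-head scale over the feature dimension; values are dequantized on the
    # fly when attention is computed.
    scale = jnp.max(jnp.abs(kv), axis=-1, keepdims=True) / 127.0
    q = jnp.round(kv / scale).astype(jnp.int8)
    return q, scale


def dequantize_kv(q, scale):
    return q.astype(jnp.bfloat16) * scale


kv = jax.random.normal(jax.random.PRNGKey(0), (2, 1024, 8, 128), jnp.bfloat16)
q, scale = quantize_kv(kv)
# The int8 cache is half the size of the bf16 cache (plus small per-head scales).
print(q.dtype, q.nbytes / kv.nbytes)
```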
JetStream throughput (output tokens / second). Google internal data. Baseline: “Accelerate AI Inference with Google Cloud TPUs and GPUs.” Current: measured using Gemma 7B (MaxText), Llama 2 7B (PyTorch/XLA), and Llama 2 70B on Cloud TPU v5e-8. Maximum input length: 1024, maximum output length: 1024.
Empowering your AI journey
From pushing the boundaries of model training and inference, to enhancing accessibility through a central resource repository, each component of the AI Hypercomputer is a building block for the next generation of AI. We envision a future where AI practitioners can seamlessly scale from concept to production, unburdened by infrastructure limitations.
Explore the latest AI Hypercomputer resources, including optimized recipes, the Accelerated Processing Kit (xpk), reference implementations, and more.
AI Summary and Description: Yes
Summary: The text discusses the latest advancements in the AI Hypercomputer infrastructure, highlighting updates in software for enhanced performance and flexibility in large language models (LLMs). It introduces new features for training and inference, optimizations for cluster management, and open-source resources for developers.
Detailed Description: The passage outlines critical improvements and offerings associated with the AI Hypercomputer framework, which significantly affects professionals in AI, cloud computing, and infrastructure security.
Key points include:
– **AI Hypercomputer Architecture**:
– Performance-optimized hardware combined with an open software architecture is designed to accelerate AI workloads efficiently.
– Major updates to the software layer aim for better training and inference performance.
– **MaxText Implementation**:
– MaxText, a reference implementation for LLMs, can now utilize A3 Mega VMs powered by NVIDIA H100 Tensor Core GPUs.
– A3 Mega VMs deliver a 2X improvement in GPU-to-GPU network bandwidth over A3 VMs, enhancing training efficiency and performance.
– **Training Performance**:
– The implementation supports FP8 mixed-precision training, offering up to 55% improvement in effective model flops utilization.
– Includes optimized configurations for both “capped” and “no-cap” mixture of experts (MoE) models that users can select based on application needs.
– **Monitoring and MLOps Enhancements**:
– Introduces monitoring tools to track large-scale training jobs within Google Cloud, facilitating better resource utilization.
– A Cloud Monitoring dashboard is included to provide insights on performance metrics such as CPU utilization.
– **SparseCore Availability**:
– The SparseCore technology enhances the performance of recommender and embedding-based models, delivering a performance improvement of up to 2.5X for DLRM-V2.
– **Inference Performance**:
– Improvements in inference tasks using KV cache quantization and ragged attention kernels are set to boost performance by up to 2X in LLM applications.
– **Resource Accessibility**:
– A GitHub organization has been launched to provide central access to various reference implementations and workload optimizations, promoting community collaboration and contribution.
This information is highly relevant to security and compliance professionals as it signifies the ongoing commitment to robustness in AI infrastructures while addressing requirements for scalable performance and stability, which are critical in maintaining secure and efficient cloud environments.