Cloud Blog: How to deploy and serve multi-host gen AI large open models over GKE

Source URL: https://cloud.google.com/blog/products/ai-machine-learning/deploy-and-serve-open-models-over-google-kubernetes-engine/
Source: Cloud Blog
Title: How to deploy and serve multi-host gen AI large open models over GKE

Feedly Summary: Context
As generative AI experiences explosive growth fueled by advancements in LLMs (Large Language Models), access to open models is more critical than ever for developers. Open models are publicly available pre-trained foundational LLMs. Platforms like Google Cloud’s Vertex AI, Kaggle and Hugging Face already provide easy access to open models to data scientists, ML engineers, and application developers. 
Some of these models require powerful infrastructure and deployment capabilities, which is why today we’re excited to announce the capability to deploy and serve open models such as the Llama 3.1 405B FP16 LLM over GKE (Google Kubernetes Engine). Published by Meta, Llama 3.1 with 405 billion parameters demonstrates significant improvements in general knowledge, reasoning abilities, and coding proficiency. When run at FP16 (16-bit floating point) precision, storing and processing its 405 billion parameters requires more than 750 GB of GPU memory for inference. The GKE solution described in this article makes deploying and serving such large models much easier to achieve.


Customer experience
As a Google Cloud customer, you can find the Llama 3.1 LLM by going to Vertex AI Model Garden and selecting the Llama 3.1 model tile.

After clicking the deploy button, you can select GKE and pick the Llama 3.1 405B FP16 model.

On this page, you can find the auto-generated Kubernetes YAML and detailed instructions for deploying and serving Llama 3.1 405B FP16.
Multi-host deployment and serving
The Llama 3.1 405B FP16 LLM requires more than 750 GB of GPU memory and presents considerable challenges for deployment and serving. In addition to the memory consumed by the model weights, factors such as KV (key-value) cache storage and support for longer sequence lengths also contribute to the overall memory requirement. The A3 virtual machine, currently the most powerful GPU offering on Google Cloud, is equipped with 8 NVIDIA H100 GPUs, each featuring 80 GB of HBM (High-Bandwidth Memory); a single A3 node therefore provides 8 x 80 GB = 640 GB of GPU memory, which is not enough on its own. For serving LLMs like the FP16 Llama 3.1 405B model, multi-host deployment and serving is the only viable solution. We use LeaderWorkerSet with Ray and vLLM to deploy over GKE.
LeaderWorkerSet
The LeaderWorkerSet (LWS) is a deployment API specifically developed to address the workload requirements of multi-host inference, facilitating the sharding and execution of the model across multiple devices on multiple nodes. Constructed as a Kubernetes deployment API, LWS is both cloud agnostic and accelerator agnostic, and can run on both GPUs and TPUs. LWS leverages the upstream StatefulSet API as its fundamental building block.

Within the LWS architecture, a group of pods is managed as a singular entity. Each pod within this group is assigned a unique index ranging from 0 to n-1, with the pod bearing the index 0 designated as the leader of the group. The creation of each pod within the group is executed concurrently, and they share an identical lifecycle. LWS facilitates rollout and rolling updates at the group level. Each group is regarded as a single unit for rolling updates, scaling, and mapping to an exclusive topology for placement. The upgrade process for each group is executed as a single atomic unit, ensuring that all pods within the group are updated simultaneously. Co-location of all pods within the same group in the same topology is permissible, with optional support for topology-aware placement. The group is treated as a single entity in the context of failure handling as well, with optional all-or-nothing restart support. When enabled, all pods within the group will be recreated if a single pod in the group fails or if a single container within any of the pods is restarted.
Within the LWS framework, the concept of a replica encompasses a group consisting of a single leader and a set of workers. LWS supports dual templates, one designated for the leader and the other for the workers. LWS provides a scale endpoint for HPA, enabling dynamic scaling of the number of replicas.
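To make the dual-template structure concrete, here is a minimal LeaderWorkerSet manifest sketch. It is illustrative only, not the auto-generated manifest from Vertex AI Model Garden: the name, image, and resource values are placeholder assumptions you would replace with your own.

```yaml
# Minimal LeaderWorkerSet sketch (illustrative; not the auto-generated
# Model Garden manifest). One group = 1 leader pod + 1 worker pod,
# each scheduled on an A3 node with 8 H100 GPUs.
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: vllm-llama3-405b          # hypothetical name
spec:
  replicas: 1                     # number of leader/worker groups (model replicas)
  leaderWorkerTemplate:
    size: 2                       # pods per group: leader + 1 worker (two A3 nodes)
    restartPolicy: RecreateGroupOnPodRestart   # all-or-nothing group restart
    leaderTemplate:
      spec:
        containers:
        - name: vllm-leader
          image: vllm/vllm-openai:latest       # example image; use the one from Model Garden
          resources:
            limits:
              nvidia.com/gpu: "8"              # 8 x H100, 80 GB HBM each
    workerTemplate:
      spec:
        containers:
        - name: vllm-worker
          image: vllm/vllm-openai:latest       # example image
          resources:
            limits:
              nvidia.com/gpu: "8"
```

Because each group is rolled out, scaled, and restarted as a unit, an HPA pointed at the LWS scale endpoint adds or removes whole leader-worker groups rather than individual pods.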
Multi-host deployment with vLLM and LWS
vLLM is a popular open source model server and supports multi-node multi-GPU inference by employing tensor parallelism and pipeline parallelism. vLLM supports distributed tensor parallelism with Megatron-LM’s tensor parallel algorithm. For pipeline parallelism, vLLM manages the distributed runtime with Ray for multi-node inferencing.
Tensor parallelism involves horizontally partitioning the model across multiple GPUs, resulting in the tensor parallel size being equivalent to the number of GPUs within each node. It is important to note that this approach necessitates fast network communication among the GPUs.
On the other hand, pipeline parallelism vertically partitions the model by layer and does not demand constant communication between GPUs. Typically, this corresponds to the number of nodes employed for multi-host serving.
The combination of these parallelism strategies is essential to accommodate the entirety of the Llama 3.1 405B FP16 model. Two A3 nodes, each equipped with 8 H100 GPUs, will provide an aggregate memory capacity of 1280 GB, sufficient to accommodate the model’s 750 GB memory requirement. This configuration will also provide the necessary buffer memory for the key-value (KV) cache and support long context lengths. For this LWS deployment, the tensor parallel size is set to 8, while the pipeline parallel size is set to 2. 
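As a rough sketch of how these settings come together, the leader container can start a Ray head and launch the vLLM server with tensor parallel size 8 and pipeline parallel size 2, while each worker joins the Ray cluster using the leader address that LWS injects. The fragment below extends the templates sketched earlier; the image, model ID, and startup commands are illustrative assumptions and will differ from the scripts in the auto-generated manifest.

```yaml
# Illustrative container commands for the leader and worker templates
# (assumes the image bundles vLLM and Ray; actual startup scripts differ).
leaderTemplate:
  spec:
    containers:
    - name: vllm-leader
      image: vllm/vllm-openai:latest   # example image
      command: ["/bin/sh", "-c"]
      args:
      - |
        # Start the Ray head, then serve the model:
        # tensor parallelism across the 8 GPUs in this node,
        # pipeline parallelism across the 2 nodes in the group.
        # (In practice, wait for all workers to join before serving.)
        ray start --head --port=6379 &&
        python3 -m vllm.entrypoints.openai.api_server \
          --model meta-llama/Llama-3.1-405B-Instruct \
          --tensor-parallel-size 8 \
          --pipeline-parallel-size 2
workerTemplate:
  spec:
    containers:
    - name: vllm-worker
      image: vllm/vllm-openai:latest   # example image
      command: ["/bin/sh", "-c"]
      args:
      - |
        # Join the Ray cluster; LWS injects LWS_LEADER_ADDRESS into
        # the group's pods so workers can reach the leader.
        ray start --address=${LWS_LEADER_ADDRESS}:6379 --block
```

With a group size of 2 and 8 GPUs per pod, the group spans 16 H100s, matching a tensor parallel size of 8 multiplied by a pipeline parallel size of 2.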
Summary
In this blog, we showed you how LWS provides the essential capabilities required for multi-host serving. The same technique can also serve models with smaller memory footprints, such as Llama 3.1 405B FP8, on more cost-effective machines, optimizing the price-to-performance ratio. To learn more, visit this blog post that shows how to pick a machine type that fits your model. LWS is open source and has strong community engagement; take a look at our GitHub repository to learn more and contribute directly.
As Google Cloud helps customers adopt gen AI workloads, you can come to Vertex AI Model Garden to deploy and serve open models on the managed Vertex AI backend or on GKE DIY (Do It Yourself) clusters. Our goal is to create a seamless customer experience, and multi-host deployment and serving is one example. We look forward to hearing your feedback.

AI Summary and Description: Yes

Summary: This text highlights the significance of deploying and serving advanced generative AI models, specifically the Llama 3.1 405B FP16 model, using Google Kubernetes Engine (GKE). It emphasizes the challenges associated with handling large language models and introduces innovative solutions like the LeaderWorkerSet architecture and vLLM, which facilitate efficient multi-host deployment. This content is particularly relevant for professionals in AI, cloud computing, and infrastructure security.

Detailed Description:
- **Generative AI Context**: The text discusses the burgeoning field of generative AI and its reliance on large language models (LLMs). It underscores the importance of open access to these models for developers, data scientists, and ML engineers.

- **Deployment Challenges & Solutions**:
  - **Model Specifications**: Llama 3.1 is noted for its 405 billion parameters, requiring over 750 GB of GPU memory for inference. This highlights significant resource demands.
  - **Google Kubernetes Engine (GKE)**: The text outlines how GKE simplifies the deployment of such a large model, addressing common challenges faced by developers.

- **LeaderWorkerSet (LWS)**:
  - A specialized deployment API designed for multi-host inference, capable of leveraging multiple nodes and devices.
  - Incorporates Kubernetes StatefulSet API for managing pods as cohesive units, ensuring efficient updates, scaling, and failure handling.
  - Allows for both cloud and accelerator agnosticity, making it versatile across platforms.

- **Multi-host Deployment**:
  - Introduces vLLM as an open-source model server that utilizes tensor and pipeline parallelism for effective model serving across multiple GPUs.
  - Clearly delineates the processes of tensor and pipeline parallelism, and describes how these strategies collectively accommodate the memory and computational demands of the Llama model.

- **Scalability and Performance**:
  - Discusses the scaling configuration suitable for multi-host deployment, including buffer memory for key-value caches and support for longer sequence lengths.

- **Community Engagement**:
  - Promotes the open-source nature of LWS, encouraging community contributions and interactions.

- **Customer Experience**:
  - Aims to enhance user experience in deploying AI models via user-friendly interfaces and resources on the Google Cloud Platform.

This content offers crucial insights into the deployment and operational challenges of running advanced AI models in cloud environments and presents effective solutions, making it relevant for security and compliance professionals engaged in AI and cloud infrastructure.