Cloud Blog: Efficiently serve optimized AI models with NVIDIA NIM microservices on GKE

Source URL: https://cloud.google.com/blog/products/containers-kubernetes/nvidia-nims-are-available-on-gke/
Source: Cloud Blog
Title: Efficiently serve optimized AI models with NVIDIA NIM microservices on GKE

Feedly Summary: In the rapidly evolving landscape of AI, efficiently serving AI models is critical to ensure the platform delivers value at optimal performance and cost. But the complexities of optimizing and operating an increasing variety of AI models prevent many organizations from fully realizing AI’s value. We’ve been partnering closely with NVIDIA to bring the power of the NVIDIA AI accelerated computing platform to Google Kubernetes Engine (GKE) to address these complexities. Today, we’re thrilled to announce the availability of NVIDIA NIM, part of the NVIDIA AI Enterprise software platform, on GKE, letting you deploy NIM microservices directly from the GKE console.
NVIDIA NIM containerized microservices for accelerated computing optimize the deployment of common AI models that can run across various environments, including Kubernetes clusters, with a single command, and provide standard APIs for seamless integration into generative AI applications and workflows.
The combination of NVIDIA NIM and GKE unlocks new potential for AI model inference, helping organizations deliver optimal latency and throughput with the scale and operational efficiency of GKE. And deploying these powerful capabilities on GKE is easier than ever. With NVIDIA NIM microservices available directly in the GKE console, you can deploy the latest NIM-optimized models, including meta/llama-3.1-70b-instruct, mistralai/mixtral-8x7b-instruct-v0.1 and nvidia/nv-embedqa-e5-v5, on GKE with just a few clicks. This deployment experience expands upon the previously available Helm-based deployment and lets customers seamlessly deploy the latest NIM models from the NVIDIA API catalog on NVIDIA GPUs orchestrated by GKE, with integrated high-performance storage for model weights.
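For orientation, the previously available Helm path looks roughly like the following. This is a minimal sketch assuming NVIDIA’s nim-deploy repository (https://github.com/NVIDIA/nim-deploy) and an NGC API key; the chart layout and value names (model.ngcAPIKey, image.repository) follow that repo’s nim-llm chart and may differ between versions, so verify against its README before use.

    # Hedged sketch of the Helm-based deployment path; chart layout and
    # value names assume NVIDIA's nim-deploy repo and may vary by version.
    git clone https://github.com/NVIDIA/nim-deploy.git
    cd nim-deploy/helm

    # NGC registry credentials so the cluster can pull NIM containers
    # (the username for nvcr.io is the literal string '$oauthtoken')
    kubectl create namespace nim
    kubectl create secret docker-registry registry-secret -n nim \
        --docker-server=nvcr.io \
        --docker-username='$oauthtoken' \
        --docker-password="$NGC_API_KEY"

    # Install the nim-llm chart, pointing it at the desired NIM image
    helm install my-nim nim-llm/ -n nim \
        --set model.ngcAPIKey="$NGC_API_KEY" \
        --set image.repository=nvcr.io/nim/meta/llama-3.1-8b-instruct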
Writer is transforming work with enterprise-grade AI models optimized for NVIDIA NIM and delivered on GKE:
“Writer is excited to expand our partnership with Google Cloud and NVIDIA to enable us to deliver Writer’s advanced AI models in a highly performant, scalable and efficient way. Together, NVIDIA NIMs and GKE provide outstanding inference performance, making it easy to integrate and scale across different applications. This collaboration improves our deployment abilities and uses advanced technology to ensure top performance and reliability." – Waseem Alshikh, CTO, Writer, Inc.
The ability to deploy NVIDIA NIM microservices directly to GKE marks an important milestone in Google Cloud’s partnership with NVIDIA.  
“With NVIDIA NIM microservices integrated as ready to deploy solution in Google Kubernetes Engine, organizations can bring AI to market faster with models optimized for NVIDIA GPUs that can be efficiently scaled and operated with GKE,” said Abhishek Sawarkar, product manager for NVIDIA AI Enterprise. “We’re seeing significant latency and throughput improvements on popular GenAI models, which can be deployed and scaled on GKE’s production-ready platform in minutes rather than hours.”
Get started with NVIDIA NIM on GKE
Navigate to Google Kubernetes Engine in the Google Cloud console and select NVIDIA NIM, then launch it to configure your deployment.

In the UI, specify the deployment name and service account information, and confirm that the required APIs are enabled. Specify a unique cluster name, along with the GPU type and shape for the cluster. Select your model from the drop-down and click Deploy. The deployment creates a new GKE cluster and deploys the specified NIM.
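Cluster creation can take several minutes. As an illustrative check (not part of the console flow), you can confirm the new cluster is up from Cloud Shell or any terminal with gcloud installed, where $PROJECT is your project ID:

    # Confirm the newly created cluster appears and is RUNNING
    gcloud container clusters list --project="$PROJECT"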

After deployment has successfully completed, connect to your NIM endpoint with the following commands, where $CLUSTER is the GKE cluster name, $DEPLOYMENT is the deployment name and $PROJECT is the GCP project in which it was deployed.

    # Fetch credentials for the new cluster so kubectl can connect to it;
    # the location flag takes the region or zone chosen at deployment
    gcloud container clusters get-credentials "$CLUSTER" \
        --location=<cluster-location> --project="$PROJECT"

    # Forward local port 8000 to the NIM service in the cluster
    kubectl -n nim port-forward service/my-nim-nim-llm 8000:8000
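
At this point you can optionally run a quick smoke test. NIM serves OpenAI-compatible APIs, so listing the served models (and, per NIM’s documented health checks, hitting the readiness endpoint) verifies the service is up; it’s worth confirming these paths against your NIM release:

    # Optional smoke test: readiness check and list of served models
    curl http://localhost:8000/v1/health/ready
    curl http://localhost:8000/v1/models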

Send a test inference request to your NIM endpoint with a curl command, specifying the model you selected earlier (this example shows how to query a llama-3.1-8b-instruct model):

    curl -X 'POST' \
      'http://localhost:8000/v1/chat/completions' \
      -H 'accept: application/json' \
      -H 'Content-Type: application/json' \
      -d '{
        "messages": [
          {
            "content": "You are a polite and respectful chatbot helping people plan a vacation.",
            "role": "system"
          },
          {
            "content": "What should I do for a 4 day vacation in Spain?",
            "role": "user"
          }
        ],
        "model": "meta/llama-3.1-8b-instruct",
        "max_tokens": 4096,
        "top_p": 1,
        "n": 1,
        "stream": false,
        "stop": "\n",
        "frequency_penalty": 0.0
      }'
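
Because the endpoint follows the OpenAI chat-completions schema, the same request can also stream tokens as server-sent events by setting "stream": true; a minimal variant of the request above:

    # Same endpoint, streaming the response as server-sent events
    curl -N -X 'POST' \
      'http://localhost:8000/v1/chat/completions' \
      -H 'accept: text/event-stream' \
      -H 'Content-Type: application/json' \
      -d '{
        "messages": [
          {"role": "user", "content": "What should I do for a 4 day vacation in Spain?"}
        ],
        "model": "meta/llama-3.1-8b-instruct",
        "max_tokens": 256,
        "stream": true
      }'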

And that’s it! Now you know how to deploy an NVIDIA NIM microservice to GKE directly from the console, making it easy to enjoy the power of NVIDIA GPUs and software on Google Cloud’s high-performance, reliable containerized infrastructure. Learn more about the Google Cloud and NVIDIA partnership at cloud.google.com/NVIDIA.

AI Summary and Description: Yes

Summary: The text discusses the recent integration of NVIDIA NIM microservices within Google Kubernetes Engine (GKE), enabling organizations to efficiently deploy and scale AI models. This partnership aims to optimize performance and reduce deployment complexities, which is crucial for leveraging AI capabilities effectively in an enterprise environment.

Detailed Description:
The announcement highlights a significant advancement in AI deployment capabilities through the collaboration between NVIDIA and Google Cloud. Key points include:

- **NVIDIA NIM Integration with GKE**:
  - NVIDIA NIM (NVIDIA Inference Microservices) is designed for accelerated computing and simplifies the deployment of AI models on GKE.
  - This integration allows one-command deployment of containerized microservices for various AI models, facilitating seamless integration with generative AI applications.

- **Operational Efficiency**:
  - The announcement emphasizes enhanced latency and throughput for AI model inference, which can now be achieved faster and more efficiently.
  - Organizations can rapidly deploy NIM-optimized models such as meta/llama-3.1-70b-instruct directly from the GKE console.

- **User Experience**:
  - The deployment process has been streamlined, allowing users to specify parameters easily through the GKE interface.
  - Detailed steps are provided for new users to deploy and test AI models, including commands for Kubernetes operations.

- **Statements from Partner Organizations**:
  - Notably, Waseem Alshikh, CTO of Writer, Inc., expressed excitement over the partnership, highlighting improved performance and reliability of AI model deployment.
  - Abhishek Sawarkar from NVIDIA emphasized that organizations can launch AI models in minutes, enhancing market readiness.

- **Market Implications**:
  - The integration signifies a crucial evolution in cloud computing and AI operations, especially for businesses looking to enhance their AI offerings through scalable infrastructure.

- **Practical Steps for Users**:
  - Users are guided through the process of launching NVIDIA NIM on GKE via the Google Cloud console, reinforcing accessibility and operational readiness.

The integration of NVIDIA NIM with GKE reflects a broader trend toward simplifying the operational complexities of deploying advanced AI applications, making it a notable development for technology and cloud computing professionals. It highlights what organizations can achieve with optimized solutions for AI model deployment and scaling.