Cloud Blog: Run your AI inference applications on Cloud Run with NVIDIA GPUs

Source URL: https://cloud.google.com/blog/products/application-development/run-your-ai-inference-applications-on-cloud-run-with-nvidia-gpus/
Source: Cloud Blog
Title: Run your AI inference applications on Cloud Run with NVIDIA GPUs

Feedly Summary: Developers love Cloud Run for its simplicity, fast autoscaling, scale-to-zero capabilities, and pay-per-use pricing. Those same benefits come into play for real-time inference apps serving open gen AI models. That’s why today, we’re adding support for NVIDIA L4 GPUs to Cloud Run, in preview.
This opens the door to many new use cases to Cloud Run developers:

Performing real-time inference with lightweight open models such as Google’s Gemma (2B/7B) or Meta’s Llama 3 (8B) to build custom chatbots or perform on-the-fly document summarization, while scaling to handle spiky user traffic.

Serving custom fine-tuned gen AI models, such as image generation tailored to your company’s brand, and scaling down to optimize costs when nobody’s using them.

Speeding up your compute-intensive Cloud Run services, such as on-demand image recognition, video transcoding and streaming, and 3D rendering.

As a fully managed platform, Cloud Run lets you run your code directly on top of Google’s scalable infrastructure, combining the flexibility of containers with the simplicity of serverless to help boost your productivity. With Cloud Run, you can run frontend and backend services, batch jobs, deploy websites and applications, and handle queue processing workloads — all without having to manage the underlying infrastructure.
At the same time, many workloads that perform AI inference, especially applications that demand real-time processing, require GPU acceleration to deliver responsive user experiences. With support for NVIDIA GPUs, you can perform on-demand online AI inference using the LLMs of your choice in seconds. With 24 GB of VRAM, you can expect fast token rates for models with up to 9 billion parameters, including Llama 3.1 (8B), Mistral (7B), and Gemma 2 (9B). When your app is not in use, the service automatically scales down to zero so that you are not charged for it.
“With the addition of NVIDIA L4 Tensor Core GPU and NVIDIA NIM support, Cloud Run provides users a real-time, fast-scaling AI inference platform to help customers accelerate their AI projects and get their solutions to market faster — with minimal infrastructure management overhead.” – Anne Hecht, Senior Director of Product Marketing, NVIDIA
Early customers are excited about the combination of Cloud Run and NVIDIA GPUs.
“Cloud Run’s GPU support has been a game-changer for our real-time inference applications. The low cold-start latency is impressive, allowing our models to serve predictions almost instantly, which is critical for time-sensitive customer experiences. Additionally, Cloud Run GPUs maintain consistently minimal serving latency under varying loads, ensuring our generative AI applications are always responsive and dependable — all while effortlessly scaling to zero during periods of inactivity. Overall, Cloud Run GPUs have significantly enhanced our ability to provide fast, accurate, and efficient results to our end users.” – Thomas MENARD, Head of AI – Global Beauty Tech, L’Oréal
“Cloud Run GPUs are hands-down the best way to consume GPU compute on Google Cloud. I love how it provides a high degree of control and customizability using open-source standards (Knative) as well as great observability tools out of the box, together with fully managed infrastructure that scales to zero. And since we can easily migrate to GKE using Knative primitives, there is always an option to get even more control at the cost of higher complexity and maintenance. GPU allocation and startup times were also faster for our use-case compared to most competing services.” – Alex Bielski, Director of Innovation, Chaptr
Using NVIDIA GPUs on Cloud Run
Today, we support attaching one NVIDIA L4 GPU per Cloud Run instance, and you do not need to reserve your GPUs in advance. To start, Cloud Run GPUs are available in us-central1 (Iowa), with availability in europe-west4 (Netherlands) and asia-southeast1 (Singapore) expected before the end of the year.
To deploy a Cloud Run service with NVIDIA GPUs, add the `--gpu=1` flag to specify the number of GPUs and the `--gpu-type=nvidia-l4` flag to specify the type of GPU on the command line (the full deployment command is shown in the example below). Or you can configure this from the Google Cloud console.

And with the recently announced Cloud Run functions, you can also attach a GPU to your functions to perform event-driven AI inference with simplicity.
“The newly released Cloud Run functions with GPU support enables Python developers to use Hugging Face models without having to worry about infrastructure, GPU drivers or containers. Cloud Run’s scale-to-zero and fast startup capabilities are a great match for developers looking to get started with AI using Hugging Face models with just a few lines of serverless code.” – Julien Chaumond, CTO, Hugging Face
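To make this concrete, here is a minimal sketch of what such a function might look like in Python, using the open-source Functions Framework and a Hugging Face transformers pipeline. The function name, model ID, and request format are illustrative assumptions, not from the original post:

```python
import functions_framework
from transformers import pipeline

# Load the model once per instance so warm invocations reuse it.
# The model ID is an illustrative assumption; any text-generation model
# from the Hugging Face Hub that fits in the L4's 24 GB of VRAM could work.
generator = pipeline("text-generation", model="google/gemma-2-9b-it", device=0)

@functions_framework.http
def generate(request):
    # Hypothetical request format: {"prompt": "..."}
    body = request.get_json(silent=True) or {}
    prompt = body.get("prompt", "Hello")
    result = generator(prompt, max_new_tokens=128)
    return {"response": result[0]["generated_text"]}
```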
Performance
Along with simple operations, Cloud Run with NVIDIA GPUs also offers strong performance. We keep our infrastructure latency to a minimum so that you can get the best performance when serving your models. 
Cloud Run instances with an attached L4 GPU and the driver pre-installed start in approximately 5 seconds, at which point the processes running in your container can start to use the GPU. You’ll then need another few seconds for the framework and model to load and initialize. The table below shows cold-start times for the Gemma 2B, Gemma 2 9B, Llama 2 7B/13B, and Llama 3.1 8B models with the Ollama framework, ranging from 11 to 35 seconds. Each measurement covers the time to start an instance from zero, load the model into the GPU, and have the LLM return its first word.

| Model | Model size | Cold-start time |
| --- | --- | --- |
| gemma:2b | 1.7 GB | 11-17 seconds |
| gemma2:9b | 5.1 GB | 25-30 seconds |
| llama2:7b | 3.8 GB | 14-21 seconds |
| llama2:13b | 7.4 GB | 23-35 seconds |
| llama3.1:8b | 4.7 GB | 15-21 seconds |

Cold start time: the time taken, from the first invocation of the service URL, for the Cloud Run instance to scale from zero to one and serve the first word of the response. Models: we used 4-bit quantized versions of each of the models above, deployed using the Ollama framework. Note that these numbers were observed in a controlled lab environment, and actual performance may vary depending on a variety of factors.
Deploy a sample app using Ollama

Below, you can see how to deploy Google’s Gemma 2 9B model with Ollama using Cloud Run with NVIDIA GPUs. Gemma is a family of lightweight, state-of-the-art open models built from the same research and technology used to create the Gemini models. Ollama is a framework that provides a simple API to manage large language models.
First, create a container image that bundles Ollama and the model weights. The Dockerfile below is a minimal sketch of what this can look like, assuming the public ollama/ollama base image and the gemma2:9b model tag used elsewhere in this post:

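```dockerfile
# Start from the official Ollama container image (assumed base image)
FROM ollama/ollama

# Listen on all interfaces, on the port the Cloud Run service will expose
ENV OLLAMA_HOST=0.0.0.0:11434

# Pull the model at build time so the weights are baked into the image
# and cold starts don't have to download them at runtime
RUN ollama serve & sleep 5 && ollama pull gemma2:9b

# Start the Ollama server when the container starts
ENTRYPOINT ["ollama", "serve"]
```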

Then deploy using the following command:

```
gcloud beta run deploy --source . \
  --port 11434 \
  --no-cpu-throttling \
  --cpu 8 \
  --memory 32Gi \
  --gpu 1 \
  --gpu-type=nvidia-l4
```

And that’s it! Once deployed, you can use the Ollama API to start chatting with Gemma 2!
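For example, here is a minimal sketch of calling the deployed service from Python with the requests library; the service URL is a placeholder, and this assumes the service allows unauthenticated invocations:

```python
import requests

# Placeholder: replace with your deployed Cloud Run service URL
SERVICE_URL = "https://ollama-gemma-xxxxxxxxxx-uc.a.run.app"

# Call Ollama's generate endpoint; "stream": False returns a single
# JSON object instead of a stream of partial responses.
resp = requests.post(
    f"{SERVICE_URL}/api/generate",
    json={"model": "gemma2:9b", "prompt": "Why is the sky blue?", "stream": False},
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["response"])
```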
“Deploying a Large Language Model using Ollama on Cloud Run is remarkably straightforward, thanks to the latest GPU support. With just a few commands, you can leverage Ollama’s seamless integration with your app and Cloud Run’s serverless infrastructure to deploy and manage your LLMs effortlessly. The fast cold starts and rapid scaling of Cloud Run let you scale your application reliably. No deep knowledge of infrastructure or machine learning is required — simply focus on your application and let the tools handle the rest.” – Jeffrey Morgan, Founder, Ollama
Additionally, you can leverage NVIDIA NIM inference microservices, part of the NVIDIA AI Enterprise software suite available in the Google Cloud Marketplace. NIM provides secure, reliable, high-performance AI model inferencing, simplifying AI inference deployments and maximizing performance on NVIDIA L4 GPUs on Cloud Run. Check out this NVIDIA blog to learn how to get started.
Get started today
Cloud Run makes it super easy to host your web applications. And now with GPU support, we are extending the simplicity and scalability of serverless to your AI inference applications too! To start using Cloud Run with NVIDIA GPUs, sign up at g.co/cloudrun/gpu to join our preview program today and watch for our welcome email.
To learn more about Cloud Run with GPUs, join this livestream on August 21, 2024 with NVIDIA and Ollama. We will discuss new features for Cloud Run and demo how to use Cloud Run in different scenarios.

AI Summary and Description: Yes

**Summary:**
The text details the introduction of NVIDIA L4 GPU support for Google Cloud Run, highlighting its benefits for developers focusing on real-time inference applications using generative AI models. This advancement allows for the deployment of lightweight open models, enhances performance for various compute-intensive tasks, and simplifies infrastructure management, thus streamlining AI project development and enhancing user experience.

**Detailed Description:**
The text provides comprehensive insights into the new capabilities of Google Cloud Run, particularly in relation to NVIDIA GPUs. Here are the significant points:

– **NVIDIA L4 GPU Support for Cloud Run:**
– Introduction of NVIDIA L4 GPUs in Google Cloud Run allows developers to leverage GPU acceleration for AI inference applications.

– **Key Benefits:**
– **Real-time Inference:** Developers can perform fast, real-time inference with open generative AI models, such as Google’s Gemma and Meta’s Llama series, which enhances the functionality for use cases like chatbots and document summarization.
– **Cost-Optimization:** The scale-to-zero capability ensures that costs are minimized by automatically scaling down when usage is low.
– **Performance Improvement:** Running compute-intensive services such as image recognition and video transcoding becomes significantly faster with GPU support.

– **User Experiences:**
– Prominent developers and companies express excitement regarding the GPU’s impact on their real-time inference applications. They noted:
– Low cold-start latency leading to instant model predictions.
– Consistent low serving latency under varied workloads, ensuring responsiveness.
– High control and customizability using open-source standards (Knative), facilitating easy migration to Google Kubernetes Engine (GKE) for more complex operations.

– **Technical Specifications:**
– Users can attach one NVIDIA L4 GPU per Cloud Run instance without prior reservation.
– Cold-start times for various models range from 11 to 35 seconds, reflecting the time needed for the instance to start and load the model.

– **Deployment and Usability:**
– Instructions are provided on deploying a model using Ollama, demonstrating ease of deployment with the command line and Google Cloud console.
– Developers can leverage the Ollama framework and utilize NVIDIA NIM for secure high-performance AI model inferencing.

– **Next Steps for Developers:**
– Developers are encouraged to join the preview program to gain early access to the new features and enhancements.
– An upcoming livestream is scheduled to demonstrate new features, offering further education and insight into leveraging Cloud Run for AI applications.

This text is particularly relevant for professionals in AI, cloud security, and infrastructure, as it outlines cutting-edge technologies and practices that enhance the scalability and performance of AI projects while promoting a serverless infrastructure approach.