Source URL: https://cloud.google.com/blog/products/ai-machine-learning/how-to-deploy-llama-3-2-1b-instruct-model-with-google-cloud-run/
Source: Cloud Blog
Title: How to deploy Llama 3.2-1B-Instruct model with Google Cloud Run GPU
Feedly Summary: As open-source large language models (LLMs) become increasingly popular, developers are looking for better ways to access new models and deploy them on Cloud Run GPU. That’s why Cloud Run now offers fully managed NVIDIA GPUs, which removes the complexity of driver installations and library configurations. This means you’ll benefit from the same on-demand availability and effortless scalability that you love with Cloud Run’s CPU and memory, with the added power of NVIDIA GPUs. When your application is idle, your GPU-equipped instances automatically scale down to zero, optimizing your costs.
In this blog post, we’ll guide you through deploying the Meta Llama 3.2 1B Instruction model on Cloud Run. We’ll also share best practices to streamline your development process using local model testing with Text Generation Inference (TGI) Docker image, making troubleshooting easy and boosting your productivity.
Why Cloud Run with GPU?
There are four critical reasons developers benefit from deploying open models on Cloud Run with GPU:
Fully managed: No need to worry about drivers, libraries, or infrastructure.
On-demand scaling: Scale up or down automatically based on demand.
Cost effective: Only pay for what you use, with automatic scaling down to zero when idle.
Performance: NVIDIA GPU-optimized for Meta Llama 3.2.
Initial Setup
First, create a Hugging Face token.
Second, check that your Hugging Face token has permission to access and download the Llama 3.2 model weights (request access on the model's Hugging Face page if needed). Keep your token handy for the next step.
Third, use Google Cloud's Secret Manager to store your Hugging Face token securely. In this example, we use Google user credentials. You may need to authenticate with the gcloud CLI, set a default project ID, enable the necessary APIs, and grant access to Secret Manager and Cloud Storage.
```
# Authenticate CLI
gcloud auth login

# Set default project
gcloud config set project <your_project_id>

# Create new secret key, remember to update <your_huggingface_token>
gcloud secrets create HF_TOKEN --replication-policy="automatic"
echo -n <your_huggingface_token> | gcloud secrets versions add HF_TOKEN --data-file=-

# Retrieve the key
HF_TOKEN=$(gcloud secrets versions access latest --secret="HF_TOKEN")
```
Local debugging
Install the huggingface_hub Python package (which provides the huggingface-cli command) in your virtual environment.
Run huggingface-cli login to set up a Hugging Face credential.
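For reference, the local setup typically looks like the following minimal sketch:

```
# Install the Hugging Face CLI into your virtual environment
pip install huggingface_hub

# Authenticate; paste the token created during the initial setup when prompted
huggingface-cli login
```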
Use the TGI Docker image to test your model locally. This lets you iterate and debug before deploying to Cloud Run.
```
export LOCAL_MODEL_DIR=~/.cache/huggingface/hub
export CONTAINER_MODEL_DIR=/root/.cache/huggingface/hub
export LOCAL_PORT=3002

docker run --gpus all -ti --shm-size 1g -p $LOCAL_PORT:8080 \
  -e MODEL_ID=meta-llama/Llama-3.2-1B-Instruct \
  -e NUM_SHARD=1 \
  -e HF_TOKEN=$(gcloud secrets versions access latest --secret="HF_TOKEN") \
  -e MAX_INPUT_LENGTH=500 \
  -e MAX_TOTAL_TOKENS=1000 \
  -e HUGGINGFACE_HUB_CACHE=$CONTAINER_MODEL_DIR \
  -v $LOCAL_MODEL_DIR:$CONTAINER_MODEL_DIR \
  us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu121.2-2.ubuntu2204.py310
```
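Once the container is running, you can open a second terminal and send a quick request to confirm the server responds before deploying. This is a minimal sketch that assumes TGI's OpenAI-compatible chat endpoint and the LOCAL_PORT set above:

```
# Query the locally running TGI server on the mapped port
curl http://localhost:$LOCAL_PORT/v1/chat/completions \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "tgi",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 32
  }'
```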
Deployment to Cloud Run
Deploy the model to Cloud Run with an NVIDIA L4 GPU (remember to update SERVICE_NAME):
```
export LOCATION=us-central1
export CONTAINER_URI=us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu121.2-2.ubuntu2204.py310
export SERVICE_NAME=<your-cloudrun-service-name>

gcloud beta run deploy $SERVICE_NAME \
  --image=$CONTAINER_URI \
  --args="--model-id=meta-llama/Llama-3.2-1B-Instruct,--max-concurrent-requests=1" \
  --port=8080 \
  --cpu=8 \
  --memory=32Gi \
  --no-cpu-throttling \
  --gpu=1 \
  --gpu-type=nvidia-l4 \
  --max-instances=3 \
  --concurrency=64 \
  --region=$LOCATION \
  --no-allow-unauthenticated \
  --set-secrets=HF_TOKEN=HF_TOKEN:latest
```
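Once the deployment finishes, you can look up the service URL that you'll need for endpoint testing, for example:

```
# Print the HTTPS URL assigned to the new Cloud Run service
gcloud run services describe $SERVICE_NAME \
  --region=$LOCATION \
  --format='value(status.url)'
```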
Endpoint testing
Test your deployed model using curl. The following sends a chat-completion request to your Cloud Run service, demonstrating how to interact with the deployed model.
```
URL=https://your-url.us-central1.run.app

curl $URL/v1/chat/completions \
  -X POST \
  -H "Authorization: Bearer $(gcloud auth print-identity-token)" \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "tgi",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "What is Cloud Run?"
      }
    ],
    "max_tokens": 128
  }'
```
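If you only want the generated text, you can pipe the response through jq (a sketch assuming jq is installed and the OpenAI-style response shape that TGI returns):

```
# Extract just the assistant's reply from the JSON response
curl -s $URL/v1/chat/completions \
  -X POST \
  -H "Authorization: Bearer $(gcloud auth print-identity-token)" \
  -H 'Content-Type: application/json' \
  -d '{"model": "tgi", "messages": [{"role": "user", "content": "What is Cloud Run?"}], "max_tokens": 128}' \
  | jq -r '.choices[0].message.content'
```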
Cold start improvements with Cloud Storage FUSE
You’ll notice that it takes more than a minute during a cold start for the response to return. Can we do better?
We can, by using Cloud Storage FUSE, an open-source tool that lets you mount Cloud Storage buckets as a file system.
First, you need to download the model files and upload them to the Cloud Storage bucket. (Remember to update GCS_BUCKET).
```
# 1. Download model
MODEL=meta-llama/Llama-3.2-1B-Instruct
LOCAL_DIR=/mnt/project/google-cloudrun-gpu/gcs_folder/hub/Llama-3.2-1B-Instruct
GCS_BUCKET=gs://<YOUR_BUCKET_WITH_MODEL_WEIGHT>

huggingface-cli download $MODEL --exclude "*.bin" "*.pth" "*.gguf" ".gitattributes" --local-dir $LOCAL_DIR

# 2. Copy to GCS
gsutil -o GSUtil:parallel_composite_upload_threshold=150M -m cp -e -r $LOCAL_DIR $GCS_BUCKET
```
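Optionally, you can verify that the weights landed in the bucket before mounting it (assuming the directory layout produced by the copy above):

```
# List the uploaded model files in the bucket
gsutil ls -r $GCS_BUCKET/Llama-3.2-1B-Instruct
```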
Now, create a new Cloud Run service using the following deployment script (remember to update BUCKET_NAME). You may also need to update the network and subnet names.
```
export LOCATION=us-central1
export CONTAINER_URI=us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu124.2-3.ubuntu2204.py311
export SERVICE_NAME=cloudrun-gpu-fuse-llama32-1b-instruct
export VOLUME_NAME=fuse
export BUCKET_NAME=<YOUR_BUCKET_WITH_MODEL_WEIGHT>
export MOUNT_PATH=/mnt/fuse

gcloud beta run deploy $SERVICE_NAME \
  --image=$CONTAINER_URI \
  --args="--model-id=$MOUNT_PATH/Llama-3.2-1B-Instruct,--max-concurrent-requests=1" \
  --port=8080 \
  --cpu=8 \
  --memory=32Gi \
  --no-cpu-throttling \
  --gpu=1 \
  --gpu-type=nvidia-l4 \
  --max-instances=3 \
  --concurrency=64 \
  --region=$LOCATION \
  --network=default \
  --subnet=default \
  --vpc-egress=all-traffic \
  --no-allow-unauthenticated \
  --update-env-vars=HF_HUB_OFFLINE=1 \
  --add-volume=name=$VOLUME_NAME,type=cloud-storage,bucket=$BUCKET_NAME \
  --add-volume-mount=volume=$VOLUME_NAME,mount-path=$MOUNT_PATH
```
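To compare cold-start latency with the earlier deployment, you can time a first request against the new FUSE-backed service after it has scaled to zero, reusing the same chat-completion call from the endpoint-testing step:

```
# Fetch the new service's URL, then time a cold request
FUSE_URL=$(gcloud run services describe $SERVICE_NAME --region=$LOCATION --format='value(status.url)')

time curl $FUSE_URL/v1/chat/completions \
  -X POST \
  -H "Authorization: Bearer $(gcloud auth print-identity-token)" \
  -H 'Content-Type: application/json' \
  -d '{"model": "tgi", "messages": [{"role": "user", "content": "What is Cloud Run?"}], "max_tokens": 128}'
```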
Next Steps
To learn more about Cloud Run with NVIDIA GPUs and to deploy your own open-source model from Hugging Face, check out these resources below:
Cloud Run GPU
Best practices
Meta Llama 3.2 1B Instruct
AI Summary and Description: Yes
Summary: The text outlines the process of deploying open-source large language models (LLMs) on Google Cloud Run with NVIDIA GPU support. This service simplifies the management of AI models by offering automatic scaling, cost efficiency, and enhanced performance, particularly in the context of developing applications that require heavy computation from models like Meta Llama 3.2.
Detailed Description:
The provided text is significant for professionals in AI, cloud computing, and infrastructure security due to the following points:
– **Managed Environment**: Developers benefit from a fully managed cloud service that abstracts away the complexities of driver installations and library configurations, allowing for a focus on application development rather than infrastructure management.
– **On-Demand Scalability**: The integration of NVIDIA GPUs in Cloud Run supports automatic scaling based on demand. This elasticity means resources are utilized only when necessary, which is crucial for optimizing overall operational costs, especially in environments with fluctuating workloads.
– **Performance Optimizations**: Specific benefits include:
– Utilizing NVIDIA GPUs optimized for LLM performance.
– Automatic scaling down to zero for idle instances, which saves costs and reduces waste.
– **Deployment Guidance**: The text provides a complete guide on deploying the Meta Llama 3.2 model using Cloud Run, detailing essential steps such as:
– **Creating and Securing Hugging Face Tokens**: Proper token handling enhances security, demonstrating how to store sensitive information with Google Cloud’s Secret Manager.
– **Local Development Setup**: Suggestions for using Docker images for model testing locally to streamline the development process.
– **Deployment Commands**: Command-line interfaces are provided for users to easily deploy models to Cloud Run, enhancing productivity and workflow efficiency.
– **Testing and Cold Start Improvements**: The text also discusses endpoint testing using curl, a practical way for developers to validate their deployment. It covers cold-start improvements through Cloud Storage FUSE, addressing the response latency that is a common challenge in serverless environments.
– **Resource Links**: Lastly, the text points to further resources for understanding Cloud Run with GPUs, encouraging continued learning and adaptation in deploying open-source models.
Overall, this straightforward guide empowers developers by providing necessary steps and best practices for leveraging powerful AI models in cloud environments safely and efficiently.