Source URL: https://cloud.google.com/blog/products/application-development/choosing-a-self-hosted-or-managed-solution-for-ai-app-development/
Source: Cloud Blog
Title: Choosing between self-hosted GKE and managed Vertex AI to host AI models
Feedly Summary: In today’s technology landscape, building or modernizing applications demands a clear understanding of your business goals and use cases. This insight is crucial for leveraging emerging tools effectively, especially generative AI foundation models such as large language models (LLMs).
LLMs offer significant competitive advantages, but implementing them successfully hinges on a thorough grasp of your project requirements. A key decision in this process is choosing between a managed LLM solution like Vertex AI and a self-hosted option on a platform such as Google Kubernetes Engine (GKE).
In this blog post, we equip developers, operations specialists, and IT decision-makers to answer the critical questions of "why" and "how" to deploy modern apps for LLM inference. We'll address the balance between ease of use and customization, helping you optimize your LLM deployment strategy. By the end, you'll understand how to:
deploy a Java app on Cloud Run for efficient LLM inference, showcasing the simplicity and scalability of a serverless architecture
use Google Kubernetes Engine (GKE) as a robust AI infrastructure platform that complements Cloud Run for more complex LLM deployments
Let’s get started!
Why Google Cloud for AI development
But first, what are some of the factors that you need to consider when looking to build, deploy and scale LLM-powered applications? Developing an AI application on Google Cloud can deliver the following benefits:
Choice: Decide between managed LLMs or bring your own open-source models to Vertex AI.
Flexibility: Deploy on Vertex AI or leverage GKE for a custom infrastructure tailored to your LLM needs.
Scalability: Scale your LLM infrastructure as needed to handle increased demand.
End-to-end support: Benefit from a comprehensive suite of tools and services that cover the entire LLM lifecycle.
Managed vs. self-hosted models
When weighing the choices for AI development in Google Cloud with your long-term strategic goals, consider factors such as team expertise, budget constraints and your customization requirements. Let’s compare the two options in brief.
Managed solution
Pros:
Ease of use with simplified deployment and management
Automatic scaling and resource optimization
Managed updates and security patches by the service provider
Tight integration with other Google Cloud services
Built-in compliance and security features
Cons:
Limited customization in fine-tuning the infrastructure and deployment environment
Potential vendor lock-in
Higher costs vs. self-hosted, especially at scale
Less control over the underlying infrastructure
Possible limitations on model selection
Self-hosted on GKE
Pros:
Full control over deployment environment
Potential for lower costs at scale
Freedom to choose and customize any open-source model
Greater portability across cloud providers
Fine-grained performance and resource optimization
Cons:
Significant DevOps expertise required for setup, maintenance and scaling
Responsibility for updates and security
Manual configuration for scaling and load balancing
Additional effort for compliance and security
Higher initial setup time and complexity
In short, managed solutions like Vertex AI are ideal for teams that want quick deployment with minimal operational overhead, while self-hosted solutions on GKE offer full control and potential cost savings for strong technical teams with specific customization needs. Let's look at a couple of examples.
Build a gen AI app in Java, deploy in Cloud Run
For this blog post, we wrote an application that allows users to retrieve quotes from famous books. The initial functionality was retrieving quotes from a database; gen AI capabilities, however, expand the feature set, allowing a user to retrieve quotes from a managed or self-hosted large language model.
The app, including its frontend, is deployed to Cloud Run, while the models are either self-hosted in GKE (leveraging vLLM for model serving) or managed in Vertex AI. The app can also retrieve pre-configured book quotes from a Cloud SQL database.
Why is Java a good choice for enterprises building generative AI applications?
A mature ecosystem and extensive libraries
Scalability and robustness, perfect for handling AI workloads
Easy integration with AI models via Spring AI
Strong security features
Vast Java expertise in many organizations
Cloud Run is the easiest and fastest way to get your gen AI apps running in production, allowing a team to:
Build fast-scaling, scale-to-zero, API endpoints to serve requests
Run your Java gen AI apps in portable containers that are interoperable with GKE
Pay only when your code is running
Write code that is idiomatic for developers, with high app deployment velocity
Before you start
The Spring Boot Java application supporting this blog post leverages the Spring AI Orchestration Framework. The app is built on Java 21 LTS and Spring Boot and includes build, test, deployment and runtime guidelines for Cloud Run.
Follow the instructions to clone the Git repository and validate that you have Java 21 and GraalVM set up.
The codebase is complemented by reference documentation for building and deploying the app to Cloud Run, as well as for deploying and configuring open models on GKE.
Deploy an open model to GKE
Let's start by deploying an open LLM to GKE. For this blog post, you will deploy the Meta-Llama-3.1-8B-Instruct open model.
Setting up Hugging Face access and API Token for LLM deployment
To download the LLM during runtime, follow these steps to set up your Hugging Face account and API token:
1. Prerequisites:
Ensure you have access to a Google Cloud project with available L4 GPUs and sufficient quota in the selected region.
Have a terminal with `kubectl` and the Google Cloud SDK installed. You can use Cloud Shell in the Google Cloud console, which already has the required tools installed.
2. Hugging Face account and API token:
Some models, such as the Llama 3.1 model used here, require a Hugging Face API token to download.
Visit Meta’s resource page for Llama models to request access: Meta Llama Downloads. You will need to register an email address to download the files.
Go to Hugging Face and create an account using the same email address registered in the Meta access request.
Locate the Llama 3.1 model and fill out the access request form: Meta Llama 3.1-8B Instruct. Then watch for the approval email.
Once approved, retrieve your Hugging Face access token from your account profile settings. This access token will be used to authenticate and download the model files during the deployment process.
3. To set up a GKE cluster with the appropriate node pool and GPU configuration for deploying a large language model (LLM) on Google Cloud, follow the steps in the repository. The main steps are sketched below.
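The repository contains the authoritative steps; as a rough, hedged sketch (the cluster name, namespace, node counts, and machine type below are illustrative assumptions, not values taken from the repository), cluster creation might look like this:

```bash
export PROJECT_ID=<your-project-id>
export REGION=us-central1
export CLUSTER_NAME=vllm-inference-cluster   # illustrative name

# Create a GKE cluster with a small default CPU node pool
gcloud container clusters create ${CLUSTER_NAME} \
  --project=${PROJECT_ID} \
  --region=${REGION} \
  --num-nodes=1

# Add a GPU node pool with NVIDIA L4 accelerators for model serving
gcloud container node-pools create gpu-pool \
  --cluster=${CLUSTER_NAME} \
  --project=${PROJECT_ID} \
  --region=${REGION} \
  --machine-type=g2-standard-8 \
  --accelerator=type=nvidia-l4,count=1,gpu-driver-version=latest \
  --num-nodes=1

# Fetch credentials and create a namespace for the model deployment
gcloud container clusters get-credentials ${CLUSTER_NAME} --region=${REGION}
export NAMESPACE=vllm
kubectl create namespace ${NAMESPACE}
```

Make sure your project has L4 GPU quota in the chosen region, and adjust node counts and machine types to your expected load.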
You can find reference documentation for the open-model LLM deployment and GCLB configuration here.
GKE configuration breakdown
Deployment:
Creates a single instance of a vllm-inference-server pod
Uses an NVIDIA L4 GPU and allocates specific resources (CPU, memory, ephemeral storage)
Mounts empty directories for cache and shared memory
Service:
Exposes the deployment internally using a ClusterIP
Configures the service to be accessible on port 8000
BackendConfig:
Specifies HTTP health checks for the load balancer, ensuring the service’s health
Ingress:
Configures an ingress resource to expose the service via Google Cloud Load Balancer (GCLB)
Routes external traffic to the vllm-inference-server service on port 8000 (a sketch of these manifests follows below)
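For orientation, here is a minimal, hedged sketch of what the Deployment and Service portions of this configuration might look like; resource names, the serving image, and resource sizes are illustrative assumptions rather than the exact manifests from the repository:

```bash
# Hedged sketch only: names, image, and sizes below are assumptions;
# see the repository for the actual manifests.
kubectl apply -n ${NAMESPACE} -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-inference-server
spec:
  replicas: 1
  selector:
    matchLabels: { app: vllm-inference-server }
  template:
    metadata:
      labels: { app: vllm-inference-server }
    spec:
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-l4
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest   # assumed serving image
        args: ["--model=meta-llama/Meta-Llama-3.1-8B-Instruct"]
        ports:
        - containerPort: 8000
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom: { secretKeyRef: { name: huggingface, key: HF_TOKEN } }
        - name: VLLM_API_KEY
          valueFrom: { secretKeyRef: { name: openapikey, key: key } }
        resources:
          limits:
            nvidia.com/gpu: "1"
        volumeMounts:
        - { name: cache, mountPath: /root/.cache }
        - { name: dshm, mountPath: /dev/shm }
      volumes:
      - { name: cache, emptyDir: {} }
      - { name: dshm, emptyDir: { medium: Memory } }
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-inference-server
  annotations:
    cloud.google.com/backend-config: '{"default": "vllm-backend-config"}'
spec:
  type: ClusterIP
  selector: { app: vllm-inference-server }
  ports:
  - port: 8000
    targetPort: 8000
EOF
```

A BackendConfig that defines the HTTP health check and an Ingress that routes Google Cloud Load Balancer traffic to port 8000 follow the same pattern in the repository.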
vLLM exposes an OpenAI-compatible API in addition to its native API. For this blog, we'll use the OpenAI-compatible API, as it provides consistency across managed and GKE-hosted open models.
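As an illustrative check (not taken from the repository), once the Ingress has an external IP and the vLLM API key from the next step is set, you can exercise the OpenAI-compatible endpoint with a plain curl call; the IP, port, and key below are placeholders:

```bash
# Placeholders: use your GCLB IP from the Ingress and the value you chose for VLLM_API_KEY.
# The GCLB front end typically listens on port 80 and forwards to the service's port 8000.
export GKE_LLM_IP=<your-gclb-ip>
export VLLM_API_KEY=<your-vllm-api-key>

curl "http://${GKE_LLM_IP}/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${VLLM_API_KEY}" \
  -d '{
        "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Share a short quote from a famous book."}]
      }'
```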
With the model and GCLB deployed, notice that the deployment's environment variables reference two secrets: the API key for the OpenAI-compatible endpoint and the Hugging Face token.
The API key for vLLM is defined by you and set using the VLLM_API_KEY environment variable. It can be any combination of alphanumeric and special characters. You can use Google Cloud Secret Manager to manage this secret.
The Hugging Face token is available in the Hugging Face account that you set up earlier.
```bash
export HF_TOKEN=<paste-your-own-token>
export OPENAPI_KEY=<paste-your-own-token>
kubectl create secret generic huggingface --from-literal="HF_TOKEN=$HF_TOKEN" -n $NAMESPACE
kubectl create secret generic openapikey --from-literal="key=$OPENAPI_KEY" -n $NAMESPACE
```
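If you also want to keep these values in Google Cloud Secret Manager, as suggested above, a minimal sketch (the secret names are illustrative) could look like this:

```bash
# Store the values in Secret Manager (illustrative secret names)
printf '%s' "$OPENAPI_KEY" | gcloud secrets create vllm-api-key --data-file=-
printf '%s' "$HF_TOKEN"    | gcloud secrets create huggingface-token --data-file=-

# Read a value back later, for example when rotating the Kubernetes secrets
gcloud secrets versions access latest --secret=vllm-api-key
```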
The alternative? Use a managed model in Vertex AI
You can also access a Meta Llama 3.1 open model by simply enabling it from the Vertex AI Model Garden as a fully managed Vertex AI service.
In the codebase for this blog, you'll be using the meta/llama3-405b-instruct-maas open model with 405B parameters, available at the following endpoint:
```bash
REGION=us-central1
ENDPOINT=${REGION}-aiplatform.googleapis.com
PROJECT_ID="YOUR_PROJECT_ID"

LLM_ENDPOINT=https://${ENDPOINT}/v1beta1/projects/${PROJECT_ID}/locations/${REGION}/endpoints/openapi/chat/completions
```
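To sanity-check the managed endpoint before wiring it into the app, a hedged sketch of a direct call might look like the following; it assumes the Llama 3.1 model is enabled in Model Garden and that your credentials have Vertex AI access:

```bash
# Uses the LLM_ENDPOINT defined above and an OpenAI-style chat completions payload
curl -X POST "${LLM_ENDPOINT}" \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta/llama3-405b-instruct-maas",
        "messages": [{"role": "user", "content": "Share a short quote from a famous book."}]
      }'
```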
Deploy and configure the gen AI Java application to Cloud Run
Validate that you have cloned the code repository and Java 21 is installed [read]:
```bash
# clone repository
git clone https://github.com/GoogleCloudPlatform/serverless-production-readiness-java-gcp.git

# app available here
cd serverless-production-readiness-java-gcp/ai-patterns/spring-ai-quotes-llm-in-gke

# install Java 21 if it is not installed in your env or cloudshell
curl -s "https://get.sdkman.io" | bash
source "$HOME/.sdkman/bin/sdkman-init.sh"
sdk install java 21.0.4-graal

# Select Y to set as default or use
sdk use java 21.0.4-graal

# validate our Java runtime
java -version

# observe the output
java version "21.0.4" 2024-07-16 LTS
Java(TM) SE Runtime Environment Oracle GraalVM 21.0.4+8.1 (build 21.0.4+8-LTS-jvmci-23.1-b41)
Java HotSpot(TM) 64-Bit Server VM Oracle GraalVM 21.0.4+8.1 (build 21.0.4+8-LTS-jvmci-23.1-b41, mixed mode, sharing)
```
Set the environment variables you’ll require to connect to your test models:
```bash
# LLM in Vertex AI env
export VERTEX_AI_PROJECT_ID=<your project id>
export VERTEX_AI_LOCATION=us-central1
export VERTEX_AI_MODEL=meta/llama3-405b-instruct-maas

# LLM in GKE env
export OPENAI_API_KEY=<your API key for the LLM in GKE>
export OPENAI_API_GKE_IP=<IP for deployed model>
export OPENAI_API_GKE_MODEL=meta-llama/Meta-Llama-3.1-8B-Instruct

# Gemini in Vertex AI env
export VERTEX_AI_GEMINI_PROJECT_ID=<your project id>
export VERTEX_AI_GEMINI_LOCATION=us-central1
export VERTEX_AI_GEMINI_MODEL=gemini-1.5-pro-001
```
Build the app and validate the app in your local environment [read]:
```bash
# Build app
./mvnw clean package -Pproduction -DskipTests

# Start the app locally
java -jar target/spring-ai-quotes-llm-in-gke-1.0.0.jar

# Access the app in a browser window
http://localhost:8083

# Test from a terminal
curl localhost:8083/random-quote
curl localhost:8083/random-quote-llm
curl localhost:8083/random-quote-llmgke
```
Build the container image and push the image to Artifact Registry [read]:
```bash
# build container image using cloud-native buildpacks
./mvnw spring-boot:build-image -DskipTests -Pproduction -Dspring-boot.build-image.imageName=quotes-llm

# tag the image
export PROJECT_ID=$(gcloud config list --format 'value(core.project)')
echo $PROJECT_ID

# The region for the image and deployments is assumed to be us-central1
# Change it if you wish to use a different region

# tag and push image to Artifact Registry
gcloud artifacts repositories create quotes-llm \
    --repository-format=docker \
    --location=us-central1 \
    --description="Quote app images accessing LLMs" \
    --immutable-tags \
    --async
gcloud auth configure-docker us-central1-docker.pkg.dev

docker tag quotes-llm us-central1-docker.pkg.dev/${PROJECT_ID}/quotes-llm/quotes-llm
docker push us-central1-docker.pkg.dev/${PROJECT_ID}/quotes-llm/quotes-llm
```
Deploy to Cloud Run and provision the configuration for JVM, Cloud Run and LLM access configuration [read]:
```bash
gcloud run deploy quotes-llm \
    --image us-central1-docker.pkg.dev/${PROJECT_ID}/quotes-llm/quotes-llm \
    --region us-central1 \
    --memory 2Gi --cpu=2 \
    --execution-environment gen1 \
    --set-env-vars=SERVER_PORT=8080 \
    --set-env-vars=JAVA_TOOL_OPTIONS='-XX:+UseG1GC -XX:MaxRAMPercentage=80 -XX:ActiveProcessorCount=2 -XX:+TieredCompilation -XX:TieredStopAtLevel=1 -Xss256k' \
    --set-env-vars=VERTEX_AI_GEMINI_PROJECT_ID=${PROJECT_ID} \
    --set-env-vars=VERTEX_AI_GEMINI_LOCATION=us-central1 \
    --set-env-vars=VERTEX_AI_GEMINI_MODEL=gemini-1.5-pro-001 \
    --set-env-vars=VERTEX_AI_PROJECT_ID=${PROJECT_ID} \
    --set-env-vars=VERTEX_AI_LOCATION=us-central1 \
    --set-env-vars=VERTEX_AI_MODEL=meta/llama3-405b-instruct-maas \
    --set-env-vars=OPENAI_API_KEY=${OPENAI_API_KEY} \
    --set-env-vars=OPENAI_API_GKE_IP=${OPENAI_API_GKE_IP} \
    --set-env-vars=OPENAI_API_GKE_MODEL=meta-llama/Meta-Llama-3.1-8B-Instruct \
    --cpu-boost \
    --allow-unauthenticated

# observe the URL, use it for UI or cURL access
# example:
# ...
# Service [quotes-llm] revision [quotes-llm-00008-wq5] has been deployed and is serving 100 percent of traffic.
# Service URL: https://quotes-llm-6hr…-uc.a.run.app
```
Test the application end-to-end
You can now test the application end-to-end as deployed in Cloud Run:
```bash
# Test from a terminal
curl https://quotes-llm-6hrfwttbsa-uc.a.run.app/random-quote
curl https://quotes-llm-6hrfwttbsa-uc.a.run.app/random-quote-llm
curl https://quotes-llm-6hrfwttbsa-uc.a.run.app/random-quote-llmgke

# Access the app in a browser window
https://quotes-llm-6hrfwttbsa-uc.a.run.app
```
Here are some model responses:
```
# Llama 3.1 model in GKE response:
{
  "id": 0,
  "quote": "You never really understand a person until you consider things from his point of view… Until you climb inside of his skin and walk around in it.",
  "author": "Harper Lee",
  "book": "To Kill a Mockingbird"
}

# Llama 3.1 model in Vertex AI response:
{
  "id": 0,
  "quote": "The only way to get rid of temptation is to yield to it.",
  "author": "Oscar Wilde",
  "book": "The Picture of Dorian Gray"
}
```
In the app UI, you can explore the quotes available in the database, then ask the LLMs to generate quotes for you, using either the GKE LLM deployment or Vertex AI.
Summary
This blog post offers an opinionated guide to Large Language Model (LLM) and Gen AI application deployment on Google Cloud Platform. It emphasizes aligning AI adoption with business needs, comparing managed Vertex AI solutions with self-hosted options on Google Kubernetes Engine (GKE).
The blog post presents:
Google Cloud’s advantages for AI development: flexibility, scalability, and comprehensive support
Pros and cons of managed vs. self-hosted LLM solutions
GKE’s ability to handle complex LLM deployments
A production-ready, practical example: a Spring AI Java app deployed to Cloud Run for efficient LLM inference
We hope this post provides you with valuable insights into how to balance ease of use with your customization needs, and gives you the knowledge you need to make informed decisions on LLM deployment strategies.
Let's not forget that generative AI adoption starts from business needs, not from the technology!
References
Serving Open Source LLMs on GKE using the vLLM framework, a blog post from our colleague Rick (Rugui) Chen
AI Summary and Description: Yes
**Summary:**
The text provides a comprehensive guide on deploying Large Language Models (LLMs) within Google Cloud, emphasizing the choice between managed solutions like Vertex AI and self-hosted platforms using Google Kubernetes Engine (GKE). It discusses the challenges and advantages of both deployment strategies while offering practical insights into using a Java application to leverage LLM inference through Cloud Run.
**Detailed Description:**
The article frames its discussion around the growing importance of LLMs in application development, especially within the business context. The key takeaways revolve around deployment strategies, the implications of technological choices, and a step-by-step guide on application deployment.
– **Emergence of LLMs:**
– LLMs provide competitive advantages in application functionalities but require clarity in business goals to implement effectively.
– **Managed vs. Self-hosted Solutions:**
– **Managed LLM Solutions (e.g., Vertex AI):**
– **Pros:**
– Easy deployment and management.
– Automatic scaling and integrated security features.
– Pre-configured compliance aspects.
– **Cons:**
– Limited customization and potential for higher costs compared to self-hosting.
– Risk of vendor lock-in with lower control over infrastructure.
– **Self-hosted Solutions (e.g., GKE):**
– **Pros:**
– Full control over the deployment environment.
– Potentially lower costs at scale and freedom to customize.
– Greater adaptability across cloud providers.
– **Cons:**
– Requires significant DevOps expertise for maintenance.
– Greater responsibility for security and compliance.
– Longer setup times and complexity in configuration.
– **Environmental Preparations:**
– Steps for cloud setup are enumerated, including prerequisites for using GPUs, setting up Hugging Face access for model deployment, and creating Kubernetes clusters for LLM operations.
– **Java Application Deployment:**
– Discussion of a practical Java application that retrieves quotes, illustrating deployment through Cloud Run and integrating both managed and self-hosted LLM capabilities.
– Highlights the advantages of using Java in AI workloads such as a mature development ecosystem and security features.
– **Testing and Validation:**
– Guidance on validating the application end-to-end post-deployment.
– Sample model responses provide real-world application context.
**Practical Implications:**
– Professionals in cloud computing, AI, and security will find relevance in the detailed breakdown of deployment options, providing a fair view of trade-offs in terms of control, customization, scalability, and costs.
– Strategic decision-making is emphasized; organizations must align their technological approaches not only with existing IT capabilities but also with long-term business goals.
This analysis serves to guide IT decision-makers as they strategize LLM adoption in their cloud initiatives, providing a clear roadmap coupled with practical use cases to reinforce understanding.