Source URL: https://cloud.google.com/blog/products/containers-kubernetes/improve-data-loading-times-for-ml-inference-apps-on-gke/
Source: Cloud Blog
Title: Data loading best practices for AI/ML inference on GKE
Feedly Summary: As AI models increase in sophistication, there’s increasingly large model data needed to serve them. Loading the models and weights along with necessary frameworks to serve them for inference can add seconds or even minutes of scaling delay, impacting both costs and the end-user’s experience.
For example, inference servers such as Triton, Text Generation Inference (TGI), or vLLM are packaged as containers that are often over 10GB in size; this can make them slow to download, and extend pod startup times in Kubernetes. Then, once the inference pod starts, it needs to load model weights, which can be hundreds of GBs in size, further adding to the data loading problem.
This blog explores techniques to accelerate data loading for both inference serving containers and downloading models + weights, so you can accelerate the overall time to load your AI/ML inference workload on Google Kubernetes Engine (GKE).
Specifically, this post covers two approaches:
1. Accelerating container load times using secondary boot disks to cache container images with your inference engine and applicable libraries directly on the GKE node.
2. Accelerating model + weight load times from Google Cloud Storage with Cloud Storage Fuse or Hyperdisk ML.
The image above shows a secondary boot disk (1) that stores the container image ahead of time, avoiding the image download process during pod/container startup. For AI/ML inference workloads with demanding speed and scale requirements, Cloud Storage Fuse (2) and Hyperdisk ML (3) are options for connecting the pod to model + weight data stored in Cloud Storage or on a network-attached disk. Let's look at each of these approaches in more detail below.
Accelerating container load times with secondary boot disks
GKE lets you pre-cache your container image into a secondary boot disk that is attached to your node at creation time. The benefit of loading your containers this way is that you skip the image download step and can begin launching your containers immediately, which drastically improves startup time. The diagram below shows container image download times grow linearly with container image size. Those times are then compared with using a cached version of the container image that is pre-loaded on the node.
Caching a 16GB container image ahead of time on a secondary boot disk has shown reductions in load time of up to 29x when compared with downloading the container image from a container registry. Additionally, this approach lets you benefit from the acceleration independent of container size, allowing for large container images to be loaded predictably fast!
To use secondary boot disks, first create a disk that contains all of your container images, create a disk image from that disk, and then specify the disk image as a secondary boot disk when creating your GKE node pools. For more, see the documentation.
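As a rough sketch of the node pool step (the cluster, node pool, location, and disk image names below are placeholders, and the flag syntax should be confirmed against the secondary boot disk documentation), attaching a pre-built disk image as a container image cache might look like this:

```bash
# Sketch only: names are placeholders; verify flags against the GKE docs.
# Step 1 (not shown): build a Compute Engine disk image pre-loaded with your
# inference container image (the GKE docs describe a disk image builder tool
# for this step).
# Step 2: create a node pool that attaches the disk image as a secondary boot
# disk used as a container image cache.
gcloud container node-pools create inference-pool \
    --cluster=my-inference-cluster \
    --location=us-central1 \
    --secondary-boot-disk=disk-image=global/images/inference-image-cache,mode=CONTAINER_IMAGE_CACHE
```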
Accelerating model weights load times
Many ML frameworks output their checkpoints (snapshots of model weights) to object storage such as Google Cloud Storage, a common choice for long-term storage. Using Cloud Storage as the source of truth, there are two main products to retrieve your data at the GKE-pod level: Cloud Storage Fuse and Hyperdisk ML (HdML).
When selecting one product or the other, there are two main considerations:
- Performance: how quickly the data can be loaded by the GKE node
- Operational simplicity: how easy it is to update the data
Cloud Storage Fuse provides a direct link to Cloud Storage for model weights that reside in object storage buckets. It also includes a caching mechanism for files that need to be read multiple times, preventing repeated downloads from the source bucket (which add latency). Cloud Storage Fuse is appealing because no pre-hydration operational activities are required for a pod to download new files in a given bucket. Note, however, that if you switch the bucket a pod is connected to, you will need to restart the pod with an updated Cloud Storage Fuse configuration. To further improve performance, you can enable parallel downloads, which spawn multiple workers to download a model, significantly improving model pull performance.
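For illustration, here is a minimal sketch of a pod that mounts a bucket through the GKE Cloud Storage Fuse CSI driver with the file cache and parallel downloads enabled. The bucket, service account, and container image names are placeholders, and the mount option names should be checked against the current CSI driver documentation:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: inference-server
  annotations:
    gke-gcsfuse/volumes: "true"        # inject the Cloud Storage Fuse sidecar
spec:
  serviceAccountName: inference-ksa    # placeholder KSA with read access to the bucket
  containers:
  - name: inference
    image: us-docker.pkg.dev/my-project/serving/vllm:latest   # placeholder image
    volumeMounts:
    - name: model-weights
      mountPath: /models
      readOnly: true
  volumes:
  - name: model-weights
    csi:
      driver: gcsfuse.csi.storage.gke.io
      readOnly: true
      volumeAttributes:
        bucketName: my-model-weights-bucket          # placeholder bucket
        # File cache plus parallel downloads to speed up large weight pulls
        mountOptions: "implicit-dirs,file-cache:max-size-mb:-1,file-cache:enable-parallel-downloads:true"
```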
Hyperdisk ML gives you better performance and scalability than downloading files directly to the pod from Cloud Storage or another online location. Additionally, you can attach up to 2,500 nodes to a single Hyperdisk ML instance, with aggregate bandwidth of up to 1.2 TiB/s. This makes it a strong choice for inference workloads that span many nodes and repeatedly download the same data in a read-only fashion. To use Hyperdisk ML, load your data onto the Hyperdisk ML disk before using it, and again upon each update; note that this adds operational overhead if your data changes frequently.
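As a hedged sketch of the consumption side (the names, parameters, and sizes below are assumptions; the full provisioning and data-loading workflow is covered in the Hyperdisk ML documentation), a Hyperdisk ML StorageClass and a read-only-many claim that inference pods can share might look like this:

```yaml
# Sketch only: verify parameters against the GKE Hyperdisk ML documentation.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: hyperdisk-ml
provisioner: pd.csi.storage.gke.io       # Compute Engine Persistent Disk CSI driver
parameters:
  type: hyperdisk-ml
volumeBindingMode: WaitForFirstConsumer
---
# After the volume has been pre-loaded with model weights, many pods across
# nodes can mount it read-only through a claim like this one.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-weights-hdml
spec:
  storageClassName: hyperdisk-ml
  accessModes:
  - ReadOnlyMany
  resources:
    requests:
      storage: 300Gi                     # placeholder size for the weights
```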
Which model+weight loading product you use depends on your use case. The table below provides a more detailed comparison of each:
| Storage Option | Ideal Use Case | Performance | Availability | Model update process |
| --- | --- | --- | --- | --- |
| Cloud Storage Fuse with GKE CSI | Frequent data updates | Fast | Regional. Pods can freely be deployed across zones and access data. | Redeploy pods and update persistent volume claims to point the Cloud Storage CSI to the bucket or file with the new data. |
| Hyperdisk ML with GKE CSI | Minimal data updates | Fastest | Zonal. Data can be made regional with an automated GKE clone feature to make data available across zones. | Create a new persistent volume, load the new data, and redeploy pods with a PVC that references the new volume. |
As you can see, there are considerations beyond raw throughput to take into account when architecting a performant model-loading strategy.
Conclusion
Loading large AI models, weights, and container images for GKE-based inference workloads can delay startup times. By combining the methods described above (secondary boot disks for container images, and Hyperdisk ML or Cloud Storage Fuse for models and weights), you can accelerate data load times for your AI/ML inference applications.
Next steps:
Learn more about secondary boot disk
Learn more about Cloud Storage Fuse
Learn more about Hyperdisk ML
AI Summary and Description: Yes
Summary: The text discusses techniques for improving the loading times of AI/ML inference workloads in Google Kubernetes Engine (GKE). By leveraging secondary boot disks and tools like Cloud Storage Fuse and Hyperdisk ML, organizations can significantly reduce startup delays, enhancing performance and user experience.
Detailed Description: The text provides a comprehensive overview of strategies to alleviate data loading issues for AI/ML inference workloads. Here are the major points and their significance:
– **Challenges with Model Loading**: As AI models grow in complexity, the data they require also expands, leading to increased startup times due to large model weights and container sizes. This poses a critical challenge for organizations leveraging AI in production.
– **Inference Servers**: The mention of inference servers like Triton, TGI, or vLLM emphasizes the need to efficiently handle container images that are often over 10GB and model weights that can run to hundreds of gigabytes.
– **Acceleration Techniques**:
1. **Secondary Boot Disks**:
– By caching container images on secondary boot disks, GKE can skip the download phase during pod startup.
– This technique has demonstrated a reduction in load times by as much as 29 times compared to downloading directly from a registry, allowing for rapid deployment of larger container images.
2. **Cloud Storage Solutions**:
– **Cloud Storage Fuse**: Provides a mechanism to directly connect GKE pods to stored model weights in Google Cloud Storage while offering caching to mitigate latency from repeated downloads.
– **Hyperdisk ML**: This solution stages data from Cloud Storage onto a network-attached disk that can serve many nodes at high aggregate bandwidth, making it suitable for read-heavy inference deployments where the same data is retrieved repeatedly and updated infrequently.
– **Use Case Comparison**:
– A detailed comparison between the two storage options (Cloud Storage Fuse and Hyperdisk ML) outlines their ideal use cases, performance metrics, and operational processes for updating data, allowing organizations to make informed decisions based on their specific requirements.
– **Conclusion & Next Steps**:
– The text concludes by summarizing the importance of these techniques in enhancing the deployment of AI workloads in cloud environments and encourages readers to explore further resources related to each solution.
This analysis underscores the growing importance of operational efficiency in AI and ML applications, providing security and compliance professionals with insights on managing infrastructure performance and optimizing resource utilization effectively.