Cloud Blog: Your infrastructure resources, your way, with new GKE custom compute class API

Source URL: https://cloud.google.com/blog/products/containers-kubernetes/introducing-new-gke-custom-compute-class-api/
Source: Cloud Blog
Title: Your infrastructure resources, your way, with new GKE custom compute class API

Feedly Summary: Picture this: You’re in the middle of your peak sales period and your e-commerce platform is humming along. Despite surging customer demand, your Kubernetes infrastructure is seamlessly adapting to handle the increased traffic. That’s because behind the scenes, your GKE cluster autoscaler is intelligently selecting the best compute resources from a range of options you’ve defined. No late-night scrambles for resources, no lost sales due to capacity issues — just a smooth, uninterrupted customer experience.
With the new custom compute class API in Google Kubernetes Engine (GKE), this scenario is not just a vision, it’s a reality. By giving you fine-grained control over your infrastructure choices, GKE can now prioritize and utilize a variety of compute and accelerator options based on your specific needs ensuring that your applications, including AI workloads, always have the resources they need to thrive. This results in improved cost efficiency, enhanced resource availability, and the flexibility to configure your infrastructure exactly how you want it.
What challenges does the compute classes API address?
1. GKE custom compute classes maximize obtainability and reliability by providing fall-back compute priorities as a list of candidates’ node characteristics or statically defined node pools. This increases the chances of successful autoscaling while giving you control over the resources that get spun up. If your first priority resource is unable to scale up, GKE will automatically try the second priority of node selection, and then continue to other lower priorities on the list.
Compute classes support accelerators and related attributes, making it ideal for AI use cases. For instance, you can specify GPU or TPU priorities, related storage needs, and node consolidation delays and thresholds.

2. Reconcile to preferential infrastructure. Without custom compute classes, during a scaling event, your top priority nodes may not be available, so your pods land on lower priority instances…  and remain there until you manually intervene.
Custom compute classes enable active migration of workloads to preferential node shapes, such as those based on cost or performance optimization. If workloads initially land on lower priority nodes, GKE automatically tries to migrate workloads to higher priority nodes over time, as availability allows.

3. Platform teams need ways to centralize default node attributes for related workloads, including characteristics such as machine families, minimum cores, spot/on-demand, GPU and TPU settings, and autoscaling consolidation parameters so that these settings can be easily consumed by application teams without complicating the developer experience.
Compute classes provide a consistent node configuration API that abstracts away node configuration complexity for developers. App teams can consume compute classes with a simple nodeSelector label or rely on namespace-scoped default compute classes.
Control, flexibility and a streamlined developer experience
Custom compute classes give platform teams increased control and flexibility, while shielding application developers from unnecessary overhead and configuration complexity. In addition to defining fallback compute priorities, the compute class can be used to tune autoscaling attributes like utilization thresholds and consolidation delays. Compute classes currently support both CPUs as well as accelerators with some limitations (see documentation).
Platform teams craft and deploy custom compute classes as Kubernetes Custom Resources (CRs). This Kubernete-native approach means they can be rolled into GitOps and Infrastructure-as-Code (IaC) playbooks. Here’s an example custom compute class:

code_block
)])]>

In this example, the GKE Cluster Autoscaler first tries to spin up a n2 node with a minimum of 16 vCPU cores. If, for any given reason, such nodes are unavailable, CA tries the e2-standard-16 resource. And if that is unavailable, CA tries to spin up nodes from two previously defined nodepools. To avoid extra node churn, the consolidation delay setting ensures nodes stay around for 20 minutes after they are no longer needed. Node auto-provisioning (NAP) is enabled in this example and creates new node pools automatically as needed. 
See the documentation for all configuration options.
Workloads can consume custom compute classes via a nodeSelector label, as follows:

code_block
<ListValue: [StructValue([(‘code’, ‘apiVersion: apps/v1\r\nkind: Deployment\r\nmetadata:\r\n name: workload\r\nspec:\r\n replicas: 1\r\n selector:\r\n matchLabels:\r\n app: workload\r\n template:\r\n metadata:\r\n labels:\r\n app: workload\r\n spec:\r\n nodeSelector:\r\n cloud.google.com/compute-class: my-class\r\n containers:\r\n – name: test\r\n image: gcr.io/google_containers/pause\r\n resources:\r\n requests:\r\n cpu: 1\r\n memory: “1Gi"’), (‘language’, ”), (‘caption’, <wagtail.rich_text.RichText object at 0x3e2fb3b63100>)])]>

To further streamline the developer experience, you can designate a custom compute class as the default for specific Kubernetes namespaces. When a compute class is set as the namespace default, any workloads deployed within that namespace automatically inherit the compute class configuration, eliminating the need for repetitive workload-level settings. To create a namespace-level default custom compute class, set a namespace label using the cloud.google.com/default-compute-class key, for example:

code_block
<ListValue: [StructValue([(‘code’, ‘kubectl label namespaces default cloud.google.com/default-compute-class=my-class’), (‘language’, ”), (‘caption’, <wagtail.rich_text.RichText object at 0x3e2fb3b63c70>)])]>

What customers are saying
Delivery Hero is an online food-ordering service based in Germany that operates in around 70 countries, and uses GKE to run their core service offering. Delivery Hero has been using the GKE custom compute class API extensively, and is pleased with the offering.

“The new custom compute class feature allows us to provide our internal platform customers with clusters that can accommodate their individual needs with ease, confidently scaling on demand. With the priority-based rules, we have a much more fine-grained control to serve our customers their preferred instance family for their workloads, without having to manage node pools ourselves thanks to its integration with node auto-provisioning. It also allows us to build cost-efficient systems, by utilizing spot instances automatically where it makes sense. We don’t need to worry about instance availability anymore, thanks to the fallback rules kicking in if some instance type is at low capacity. And if our preferences cannot be met at some point in time, the active migration feature has us covered – constantly working to reconcile the nodes in the cluster to match our needs as best as it can.” –  Heiko Rothe, (Staff Systems Engineer) & Kevin Nandu, Senior Systems Engineer, Delivery Hero
Watch custom compute classes in action
This video provides an overview of what custom classes are and how to use them in practice:

We’re just getting started
At Google Cloud, we strive to make it easy for you to configure and manage your GKE environment; the declarative compute classes node API is just the latest example, and complements other existing ways to configure your nodes. The features we’re launching today represent a first look at the custom compute class API; stay tuned for more features.

AI Summary and Description: Yes

Summary: The text discusses the introduction of a custom compute class API in Google Kubernetes Engine (GKE) that enhances resource management and autoscaling capabilities for e-commerce platforms during peak traffic. This functionality is particularly beneficial for AI workloads, offering improved cost efficiency and flexibility while simplifying the developer experience.

Detailed Description:
The text presents relevant advancements in Google Kubernetes Engine (GKE), specifically focusing on the custom compute class API. This development is significant for professionals in cloud computing and infrastructure security as it impacts how resources are managed and optimized in Kubernetes environments.

Key insights and implications include:

– **Autoscaling Enhancements**: The compute classes API allows for intelligent resource selection based on pre-defined parameters, addressing common scalability issues during high-demand periods.
– **Custom Node Management**: Organizations can prioritize various compute and accelerator options (such as CPUs, GPUs, and TPUs), ensuring that critical applications, including AI workloads, have the necessary resources readily available.
– **Fallback Mechanism**: If the highest priority resources are unavailable during scaling events, the system automatically attempts to select the next best option, thereby increasing reliability and minimizing service disruption.
– **Active Workload Migration**: Workloads initially placed on lower priority nodes can be actively migrated to more suitable nodes as availability changes, optimizing performance without the need for manual intervention.
– **Developer Experience Simplification**: By centralizing node attributes into a consistent compute class configuration, application teams can avoid complexity in managing different settings, making deployment processes smoother.
– **GitOps and IaC Integration**: The Kubernetes-native approach allows for the integration of compute classes into GitOps and Infrastructure-as-Code (IaC) workflows, enhancing automation and consistency in resource management.

Overall, these developments signify a move toward more automated, flexible, and developer-friendly cloud infrastructure solutions while maintaining robust operational performance during peak loads. The commentary from Delivery Hero highlights real-world application and efficiency gained through leveraging these new capabilities.