The Register: The troublesome economics of CPU-only AI

Source URL: https://www.theregister.com/2024/10/29/cpu_gen_ai_gpu/
Source: The Register
Title: The troublesome economics of CPU-only AI

Feedly Summary: At the end of the day, it all boils down to tokens per dollar
Analysis Today, most GenAI models are trained and run on GPUs or some other specialized accelerator, but that doesn’t mean they have to be. In fact, several chipmakers have suggested that CPUs are more than adequate for many enterprise AI use cases.…

AI Summary and Description: Yes

**Summary:** Google's exploration of CPU-based inference for generative AI, using the Advanced Matrix Extensions (AMX) in Intel's Xeon processors, shows that CPUs can deliver acceptable performance for some AI workloads. The analysis compares output latency and throughput across several models, finding a significant speed-up on newer architectures, and weighs the cost implications of running inference on CPUs versus GPUs.

**Detailed Description:**
This text discusses Google’s analysis of leveraging CPU-based architectures for large language model (LLM) workloads in generative AI. Specifically, it highlights the performance of Intel’s 4th-Gen Xeon processors with advanced matrix extensions (AMX) in running models like Llama 2.

– **Performance Metrics:**
  – Google achieved a time per output token (TPOT) of 55 milliseconds for the 7-billion-parameter Llama 2 model on a C3 VM.
  – At a batch size of six, throughput for the 7B model was roughly 220 to 230 tokens per second.
  – Fine-tuning a much smaller RoBERTa model completed in under 25 minutes using AMX.
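As a rough sanity check (not from the article itself), TPOT converts directly to a single-stream token rate; the batched throughput figure implies a different per-request rate, so the two metrics were presumably measured under different configurations:

```python
def tokens_per_second(tpot_ms: float) -> float:
    """Convert time-per-output-token (milliseconds) to a single-stream token rate."""
    return 1000.0 / tpot_ms

# Reported TPOT of 55 ms for Llama 2 7B on a C3 VM:
single_stream = tokens_per_second(55)  # ~18.2 tokens/s for one request

# Reported aggregate throughput at batch size 6 was ~220-230 tokens/s;
# dividing by the batch size gives the implied per-request rate.
per_request_at_batch_6 = 225 / 6       # ~37.5 tokens/s per concurrent request
```

The divergence between the two per-request figures is consistent with the latency and throughput numbers coming from separate test setups rather than one run.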

– **Cost Considerations:**
  – Despite the perception that CPUs are the cheaper option, CPU-based instances worked out costlier per token than GPUs.
  – For the C3 instance, the cost came to about $9 per million tokens, versus roughly $1.87 per million tokens on an Nvidia L40S GPU.
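The per-token economics reduce to a simple ratio of instance price to throughput. A minimal sketch of that calculation (the hourly price below is a hypothetical value back-calculated to match the article's ~$9 figure at the reported ~225 tokens/s; the actual C3 price is not given in the text):

```python
def cost_per_million_tokens(hourly_price_usd: float, tokens_per_sec: float) -> float:
    """Cost of generating one million tokens, given instance price and throughput."""
    tokens_per_hour = tokens_per_sec * 3600
    return hourly_price_usd / tokens_per_hour * 1_000_000

# Hypothetical $7.29/hr instance at ~225 tokens/s reproduces ~$9 per million tokens.
print(cost_per_million_tokens(7.29, 225))  # → 9.0
```

The same formula explains why a GPU can win on cost even at a higher hourly price: if throughput rises faster than price, cost per million tokens falls.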

– **Scalability and Use Cases:**
  – The text notes that CPUs struggle to scale to large AI workloads, where dedicated GPUs remain superior.
  – Serverless options that dynamically allocate resources, charging only for actual usage, can benefit organizations with fluctuating workloads.

– **Strategic Implications:**
  – While GPUs remain optimal for demanding AI tasks, the analysis identifies scenarios where CPUs can be used effectively, especially where hardware commitments already exist.
  – Allocating underused CPU capacity to AI tasks can be a cost-effective alternative to investing in expensive GPU systems.

– **Broader Market Context:**
  – GPU demand and price spikes make existing CPU infrastructure an attractive alternative, since CPUs are operationally flexible and useful beyond AI tasks.

Overall, the analysis matters to AI and cloud-infrastructure professionals because it illustrates the evolving landscape of AI operational capabilities, cost models, and deployment strategies when balancing performance against budget.