The Register: Cerebras gives waferscale chips inferencing twist, claims 1,800 token per sec generation rates

Source URL: https://www.theregister.com/2024/08/27/cerebras_ai_inference/
Source: The Register
Title: Cerebras gives waferscale chips inferencing twist, claims 1,800 token per sec generation rates

Feedly Summary: Faster than you can read? More like blink and you’ll miss the hallucination
Hot Chips Inference performance in many modern generative AI workloads is usually a function of memory bandwidth rather than compute. The faster you can shuttle bits in and out of a high-bandwidth memory (HBM) the faster the model can generate a response.…

AI Summary and Description: Yes

Summary: The text discusses the performance of Cerebras Systems’ new inference offering, built on its WSE-3 accelerator, which relies on on-chip SRAM rather than high-bandwidth memory (HBM) to serve generative AI models quickly. CEO Andrew Feldman claims this allows significantly faster token generation for large language models (LLMs), potentially transforming AI usage much as broadband transformed internet access.

Detailed Description:
– **Key Performance Insights**:
  – Cerebras’ WSE-3 accelerator can reach generation rates of up to 1,800 tokens per second on the Llama 3.1 8B model, a substantial improvement over H100-based instances.
  – When a 70 billion parameter model is distributed across its CS-3 systems, Cerebras claims 450 tokens per second, again well beyond what Nvidia’s H100 manages (a rough arithmetic check of what that rate implies follows this list).
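
As a rough sanity check (not from the article), the sketch below estimates the effective weight-read bandwidth a given decode rate implies. The function name `implied_weight_traffic_tbps`, the 16-bit weights, and the assumption that every parameter is read once per generated token are all illustrative simplifications.

```python
# Back-of-envelope check of the claimed decode rates. Assumes 16-bit weights
# and that every parameter is read once per generated token; both are
# simplifying assumptions, not figures from the article.

def implied_weight_traffic_tbps(params_billion: float, tokens_per_sec: float,
                                bytes_per_param: float = 2.0) -> float:
    """Weight bandwidth (TB/s) needed to sustain a given decode rate."""
    bytes_per_token = params_billion * 1e9 * bytes_per_param
    return bytes_per_token * tokens_per_sec / 1e12

# 450 tokens/sec on a 70B-parameter model implies ~63 TB/s of weight reads:
print(f"{implied_weight_traffic_tbps(70, 450):.0f} TB/s")
# An HBM stack at ~4.8 TB/s would cap out near 34 tokens/sec under the
# same assumptions, which is why the claim leans on on-chip SRAM:
print(f"{4.8e12 / (70e9 * 2):.0f} tokens/sec")
```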

– **Technological Shift**:
  – On-chip SRAM provides 21 PB/s of bandwidth versus HBM’s 4.8 TB/s, indicating a shift in how model weights are accessed and executed.
  – Cerebras has also modified its compiler to run multiple layers on a chip simultaneously, optimizing the architecture for inference tasks (a layer-placement sketch follows this list).
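
A minimal sketch of the idea of mapping a model’s layers onto several wafers so each slice of the weights fits in on-chip memory. The ~44 GB per-wafer SRAM figure, the 16-bit weights, and the helper names (`wafers_needed`, `assign_layers`) are assumptions for illustration; the real Cerebras compiler placement is far more involved.

```python
# Minimal sketch of splitting a model's layers across wafers so that each
# wafer's slice of the weights fits in its on-chip SRAM. The 44 GB per-wafer
# figure and 16-bit weights are illustrative assumptions, not from this summary.
import math

def wafers_needed(params_billion: float, bytes_per_param: float = 2.0,
                  sram_per_wafer_gb: float = 44.0) -> int:
    weight_gb = params_billion * bytes_per_param  # 1e9 params * bytes = GB of weights
    return math.ceil(weight_gb / sram_per_wafer_gb)

def assign_layers(num_layers: int, num_wafers: int) -> list[range]:
    """Even, contiguous split of layers across wafers (pipeline-style)."""
    per = math.ceil(num_layers / num_wafers)
    return [range(i, min(i + per, num_layers)) for i in range(0, num_layers, per)]

# A 70B model at 16-bit needs ~140 GB of weights, i.e. several wafers:
n = wafers_needed(70)            # -> 4
print(n, assign_layers(80, n))   # an 80-layer model split 4 ways
```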

– **Batch Processing and Latency**:
  – The architecture is designed to support larger batch sizes while keeping latency low for user interactions, which also suits iterative, multi-step model workflows.
  – By parallelizing models across multiple systems, Cerebras balances the need for capacity against performance, even though inter-system communication inherently adds some latency (a toy latency estimate follows this list).
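
A toy model of how inter-system hops add to per-token latency when a model is pipelined across several systems. Every figure here (per-stage compute time, per-hop interconnect latency) is a hypothetical placeholder, not a number from the article.

```python
# Illustrative estimate of per-token decode latency for a model pipelined
# across several systems. All inputs are hypothetical placeholders.

def per_token_latency_ms(num_systems: int, stage_time_ms: float,
                         hop_latency_ms: float) -> float:
    """Token latency = work on each stage in sequence + inter-system hops."""
    hops = max(num_systems - 1, 0)
    return num_systems * stage_time_ms + hops * hop_latency_ms

# e.g. four systems, 0.4 ms of work per stage, 0.1 ms per hop:
print(per_token_latency_ms(4, 0.4, 0.1))   # 1.9 ms/token, i.e. ~525 tokens/sec
```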

– **Comparison to Competitors**:
  – The text notes similarities and differences between Cerebras’ approach and that of its competitor Groq, particularly their shared reliance on SRAM.
  – While both companies chase high performance, Groq reportedly needs far more hardware (576 LPUs) to hit token generation rates below what Cerebras achieves with fewer accelerators.

– **Market Implications**:
  – The offering is expected to make generative AI more accessible, potentially reducing the delays currently seen in the industry and allowing for more complex agentic AI applications.
  – Cerebras aims to undercut the competition on price, charging 60 cents per million tokens versus the $2.90 cited for H100-based services, which could incentivize adoption among developers (a quick cost comparison follows this list).
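
Taking the quoted per-million-token rates at face value, a quick cost comparison for a hypothetical monthly token volume (the 5 billion tokens/month figure is a made-up example workload, not from the article):

```python
# Quick cost comparison at the quoted per-million-token rates.
CEREBRAS_USD_PER_MTOK = 0.60
H100_USD_PER_MTOK = 2.90

def monthly_cost(tokens_per_month: float, usd_per_million: float) -> float:
    return tokens_per_month / 1e6 * usd_per_million

tokens = 5e9  # hypothetical 5 billion tokens per month
print(monthly_cost(tokens, CEREBRAS_USD_PER_MTOK))  # $3,000
print(monthly_cost(tokens, H100_USD_PER_MTOK))      # $14,500
```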

– **Future Directions**:
  – The company plans to expand the supported models, including the recently released 405 billion parameter variant of Llama 3.1, indicating ongoing innovation and responsiveness to industry needs.
  – Cerebras will also offer on-premises systems for customers in regulated industries, signaling a commitment to security and tailored deployments.

This text is highly relevant for AI, hardware security, and cloud computing professionals, emphasizing the importance of evaluating performance enhancements and cost efficiencies as they integrate generative AI into their infrastructures.