Cloud Blog: We tested Intel’s AMX CPU accelerator for AI. Here’s what we learned

Source URL: https://cloud.google.com/blog/products/identity-security/we-tested-intels-amx-cpu-accelerator-for-ai-heres-what-we-learned/
Source: Cloud Blog
Title: We tested Intel’s AMX CPU accelerator for AI. Here’s what we learned

Feedly Summary: At Google Cloud, we believe that cloud computing will increasingly shift to private, encrypted services where users can be confident that their software and data are not being exposed to unauthorized actors. In support of this goal, our aim is to make confidential computing technology ubiquitous across our Cloud without compromising usability, performance, and scale.
The latest C3 Confidential VMs we’ve developed with Intel have a built-in, on by default accelerator: Intel AMX. Intel AMX improves the performance of deep-learning training and inference on the CPU, and is ideal for workloads including natural-language processing, recommendation systems, and image recognition.
These Confidential VMs leverage confidential computing technology called Intel Trust Domain Extensions (Intel TDX). Confidential VMs with Intel TDX can help safeguard against insider attacks and software vulnerabilities and ensure data and code confidentiality and integrity. Together, Intel’s AMX and TDX technologies can make it easier for customers to secure their AI inference, fine-tuning, and small-to-medium sized training jobs.
“We are excited to partner with Google, delivering Confidential Computing and AI acceleration on C3 instances with Intel TDX and Intel AMX and software acceleration available in the popular AI libraries, to democratize access to secured, accelerated computing,” said Andres Rodriguez, Intel Fellow and datacenter AI architect.
Performance testing AI/ML workloads on Confidential VMs
We know that sometimes performance concerns arise when security extensions are enabled, so we conducted ML training and inference experiments to evaluate the performance of Confidential VMs with Intel TDX and Intel AMX. 
For our experiments, we compared three VMs, each the largest machine type by vCPU count available in its series:

N2 VM: N2 machine series VMs that do not have Intel AMX or Intel TDX enabled. 
C3 VM: C3 machine series VMs that only have Intel AMX enabled. 
C3+TDX VM: C3 machine series VMs that have both Intel TDX and Intel AMX enabled. 

AI Training findings
We compared the training time (in minutes) on the aforementioned N2, C3, and C3+TDX VMs. Compared to the N2 VM, the C3 VM provided a 4.54x speedup on training time whereas the C3+TDX VM provided a 4.14x speedup on training time for one workload, with the same hyperparameters such as global batch size and on-par quality metrics.

Training time by virtual machine.

We found that workloads running in Confidential VMs with Intel TDX may incur some performance overhead. Increasing the batch size can speed up training and reduce the Intel TDX overhead: for the same amount of ingested data, larger batches require fewer I/O operations, and therefore less memory-encryption work. 
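The amortization effect described above can be sketched with a toy cost model. The constants and function names below are illustrative assumptions, not measured TDX figures from the benchmark:

```python
import math

# Toy model: assume a fixed per-I/O-operation encryption cost (hypothetical
# value, not a measured TDX number). Ingesting the same dataset in larger
# batches means fewer I/O operations, so less total overhead.
def io_overhead_ms(total_samples: int, batch_size: int,
                   per_io_cost_ms: float = 5.0) -> float:
    """Total illustrative encryption overhead for ingesting a dataset."""
    num_ios = math.ceil(total_samples / batch_size)
    return num_ios * per_io_cost_ms

# Same 10,000 samples; the larger batch size incurs less modeled overhead.
print(io_overhead_ms(10_000, 32))
print(io_overhead_ms(10_000, 256))
```

The absolute numbers are meaningless; the point is only that overhead scales with the number of I/O operations, which shrinks as batch size grows.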
AI Inference findings
We saw similar improvements on inference workloads using Intel AMX and observed some performance overhead from Intel TDX. 
We ran a text generation task on two LLMs, Llama-2-7B and Llama-2-13B. Using the aforementioned N2, C3, and C3+TDX VMs, we evaluated task performance based on two metrics: 

Time Per Output Token (TPOT): Time to generate an output token for a single request. This metric corresponds to how a user would perceive the “speed” of the model. For example, a TPOT of 100 milliseconds per token would be approximately 450 words per minute (WPM), which is faster than an average person’s reading speed.
Throughput (token per sec): The number of output tokens per second an inference server can generate for batch requests.
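Both metrics can be computed directly from raw generation timings. A minimal sketch (helper names are hypothetical, not taken from the benchmark harness):

```python
def tpot_ms(total_gen_time_s: float, output_tokens: int) -> float:
    """Time Per Output Token in milliseconds, for a single request."""
    return (total_gen_time_s / output_tokens) * 1000.0

def throughput_tokens_per_s(total_output_tokens: int, wall_time_s: float) -> float:
    """Aggregate output tokens per second across a batch of requests."""
    return total_output_tokens / wall_time_s

def approx_wpm(tpot_ms_value: float, words_per_token: float = 0.75) -> float:
    """Rough reading-speed equivalent; ~0.75 words per token is a common
    estimate for English text (an assumption, not from the article)."""
    tokens_per_min = 60_000.0 / tpot_ms_value
    return tokens_per_min * words_per_token

# 100 ms/token works out to the ~450 WPM figure cited above.
print(approx_wpm(100.0))  # → 450.0
```

Note that TPOT measures per-request latency while throughput is an aggregate over concurrent batch requests, so the two can move independently as batch size changes.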

We compared the N2 VM to the C3 and C3+TDX VMs and found that Intel AMX provided approximately a three-fold improvement in latency (TPOT) and approximately a seven-fold increase in throughput (with a batch size of six).

Time per output token by virtual machine.

Throughput by virtual machine.

Try Intel AMX on Confidential VMs today 
For a full report on our testing and results, read our whitepaper here. You can collect your own benchmarks for LLMs by following the Intel® Extension for PyTorch instructions. Learn more about Google Cloud’s Confidential Computing products here, and get started today with your own confidential VMs using Intel AMX and TDX.
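Before collecting benchmarks, it can be useful to confirm that the guest actually exposes AMX. On Linux, 4th Gen Xeon advertises the `amx_tile`, `amx_bf16`, and `amx_int8` flags in `/proc/cpuinfo`; the quick check below is a generic sketch, not part of Google's or Intel's tooling:

```python
# Check whether the CPU advertises AMX instruction-set flags (Linux only).
def amx_flags(cpuinfo_path: str = "/proc/cpuinfo") -> set:
    """Return the subset of AMX feature flags found in cpuinfo."""
    wanted = {"amx_tile", "amx_bf16", "amx_int8"}
    try:
        with open(cpuinfo_path) as f:
            for line in f:
                if line.startswith("flags"):
                    return wanted & set(line.split())
    except OSError:
        pass  # non-Linux host or unreadable file; report nothing found
    return set()

if __name__ == "__main__":
    found = amx_flags()
    print("AMX flags present:", sorted(found) if found else "none")
```

If the flags are present, the Intel Extension for PyTorch instructions linked above can then use AMX-accelerated bfloat16 and int8 paths during inference.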


AI Summary and Description: Yes

**Summary:** The text outlines Google Cloud’s efforts to enhance confidentiality in cloud computing through the integration of Intel’s confidential computing technology, particularly focusing on the performance improvements of AI workloads using these advanced Virtual Machines (VMs). The results suggest significant speed advantages in AI training and inference, while also addressing security concerns against insider threats.

**Detailed Description:** The text discusses a strategic collaboration between Google Cloud and Intel aimed at advancing confidential computing technologies within cloud infrastructure. Key highlights include:

– **Introduction of Confidential VMs:**
– Google Cloud is committed to promoting private, encrypted services to safeguard user data and software.
– The C3 Confidential VMs leverage Intel’s Trust Domain Extensions (TDX) for enhanced security against insider attacks and software vulnerabilities.

– **Performance Enhancements with Intel AMX:**
– The integration of Intel AMX accelerates deep learning tasks, particularly suitable for natural language processing, recommendation systems, and image recognition.
– Confidential VMs with AMX and TDX are designed to improve both performance and security in AI training and inference workloads.

– **Performance Testing Insights:**
– Testing involved three types of VMs: N2 (without AMX or TDX), C3 (AMX only), and C3+TDX (both AMX and TDX).
– Findings reported up to a 4.54x speedup in training times for C3 VMs and a 4.14x speedup for C3+TDX VMs in certain workloads compared to N2.
– Performance was evaluated through metrics like Time Per Output Token (TPOT) and throughput, showcasing a three-fold improvement in latency and a substantial increase in throughput when using Intel AMX.

– **Implications for Users:**
– While the enhancements in performance are significant, some overhead may occur when using TDX; increased batch sizes can mitigate this.
– Users are encouraged to test Intel AMX on Confidential VMs to leverage the security benefits along with performance gains.

– **Call to Action:**
– Availability of Intel AMX on Confidential VMs through Google Cloud for immediate use, with an invitation to access further reports and benchmarks.

**Key Insights:**
– The advancement of confidential computing presents a vital shift in how organizations can secure their AI-based workloads in the cloud.
– Performance metrics indicate the potential for enhanced efficiency without compromising on security, making it essential for organizations to consider these technologies for their AI and machine learning operations.

This communication is particularly relevant for professionals in cloud security, AI development, and infrastructure management who are looking to integrate security and performance in their operational models.