Simon Willison’s Weblog: lm.rs: run inference on Language Models locally on the CPU with Rust

Source URL: https://simonwillison.net/2024/Oct/11/lmrs/
Source: Simon Willison’s Weblog
Title: lm.rs: run inference on Language Models locally on the CPU with Rust

Feedly Summary: lm.rs: run inference on Language Models locally on the CPU with Rust
Impressive new LLM inference implementation in Rust by Samuel Vitorino. I tried it just now on an M2 Mac with 64GB of RAM and got very snappy performance for this Q8 Llama 3.2 1B, with Activity Monitor reporting 1000% CPU usage over 13 threads.
Here’s how I compiled the library and ran the model:
cd /tmp
git clone https://github.com/samuel-vitorino/lm.rs
cd lm.rs
RUSTFLAGS="-C target-cpu=native" cargo build --release --bin chat
curl -LO 'https://huggingface.co/samuel-vitorino/Llama-3.2-1B-Instruct-Q8_0-LMRS/resolve/main/tokenizer.bin?download=true'
curl -LO 'https://huggingface.co/samuel-vitorino/Llama-3.2-1B-Instruct-Q8_0-LMRS/resolve/main/llama3.2-1b-it-q80.lmrs?download=true'
./target/release/chat --model llama3.2-1b-it-q80.lmrs

It looks like the performance is helped by two key dependencies: wide, which provides data types optimized for SIMD operations, and rayon, for running parallel iterators across multiple cores (used for matrix multiplication).
(I used LLM and files-to-prompt to help figure this out.)
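To make the wide part concrete, here is a minimal, hypothetical sketch of a SIMD dot product in the same spirit. It is not lm.rs's actual code, and it assumes wide's f32x8 can be built from a [f32; 8] array and summed horizontally with reduce_add:

```rust
// Hypothetical sketch (not lm.rs's code): a dot product that processes
// eight f32 lanes at a time using the `wide` crate's f32x8 type.
use wide::f32x8;

fn dot(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    let mut acc = f32x8::from([0.0f32; 8]);
    let (chunks_a, chunks_b) = (a.chunks_exact(8), b.chunks_exact(8));
    let (tail_a, tail_b) = (chunks_a.remainder(), chunks_b.remainder());
    for (ca, cb) in chunks_a.zip(chunks_b) {
        let va = f32x8::from(<[f32; 8]>::try_from(ca).unwrap());
        let vb = f32x8::from(<[f32; 8]>::try_from(cb).unwrap());
        acc = acc + va * vb; // one SIMD multiply-accumulate covers 8 elements
    }
    // Horizontal sum of the 8 lanes (assumed `reduce_add`), plus the scalar tail.
    let mut total = acc.reduce_add();
    for (x, y) in tail_a.iter().zip(tail_b) {
        total += x * y;
    }
    total
}
```

Each loop iteration multiplies and accumulates eight values at once, which is where the SIMD speedup comes from.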
Via Hacker News
Tags: llm, rust, ai-assisted-programming, generative-ai, ai, llms

AI Summary and Description: Yes

Summary: The text discusses a new Rust implementation of large language model (LLM) inference, showcasing its performance on an M2 Mac. It highlights the library's efficiency and the specific dependencies that optimize CPU performance. This is relevant for professionals working on AI and LLM applications, particularly for efficient local inference.

Detailed Description: The provided content focuses on `lm.rs`, a language model inference library written in Rust by Samuel Vitorino and optimized for execution on CPUs. Simon Willison reports impressive performance when running a Q8-quantized Llama 3.2 1B model. Here are the major points derived from the text:

– **Local Inference Performance**:
– The implementation enables local inference on CPUs, demonstrating that developers can run capable LLMs on personal hardware rather than relying on cloud resources, which improves privacy and reduces costs.
– On an M2 Mac with 64GB of RAM, Activity Monitor reported roughly 1000% CPU usage across 13 threads, suggesting effective use of multiple cores.

– **Installation and Setup**:
– The instructions for compiling and running the library are straightforward, making it accessible for developers interested in experimenting with LLMs locally.
– The step-by-step guide includes commands to clone the repository, set flags for Rust compilation, and download necessary tokenizer and model files.

– **Key Dependencies**:
– The performance improvements are attributed to two critical dependencies:
– **wide**: provides data types optimized for Single Instruction, Multiple Data (SIMD) operations, improving computational efficiency in compute-heavy work such as matrix multiplication.
– **rayon**: provides data parallelism by running iterators across multiple cores; lm.rs uses it to parallelize matrix multiplication (a minimal sketch follows this list).
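As a rough illustration of how the two dependencies divide the work, the hypothetical matrix-vector multiply below (not lm.rs's actual routine) uses rayon's par_iter to fan the weight-matrix rows out across cores, while the per-row dot product is left scalar for clarity; that inner loop is where wide's SIMD types would apply.

```rust
use rayon::prelude::*;

// Hypothetical sketch: each output element is one row of the weight matrix
// dotted with the input vector; rayon distributes the rows across CPU cores.
fn matvec(rows: &[Vec<f32>], x: &[f32]) -> Vec<f32> {
    rows.par_iter()
        .map(|row| row.iter().zip(x).map(|(w, v)| w * v).sum::<f32>())
        .collect()
}
```

Because each row's dot product is independent, rayon can keep every available core busy, consistent with the roughly 1000% CPU usage reported above.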

– **Tags and Community Engagement**:
– The mention of platforms like Hacker News indicates community interest and engagement in advancements related to LLMs and AI programming.
– Related tags suggest that the implementation falls within broader topics in AI, generative AI, and machine learning, marking it as relevant for professionals exploring practical applications of these technologies.

Overall, this new implementation stands out as an innovative approach to running large language models locally, providing insights for AI professionals in optimizing performance, controlling costs, and enhancing security and privacy by keeping inference processes off the cloud.