Simon Willison’s Weblog: Everything I’ve learned so far about running local LLMs

Source URL: https://simonwillison.net/2024/Nov/10/running-llms/#atom-everything
Source: Simon Willison’s Weblog
Title: Everything I’ve learned so far about running local LLMs

Feedly Summary: Everything I’ve learned so far about running local LLMs
Chris Wellons shares detailed notes on his experience running local LLMs on Windows – though most of these tips apply to other operating systems as well.
This is great, there’s a ton of detail here and the root recommendations are very solid: Use llama-server from llama.cpp and try ~8B models first (Chris likes Llama 3.1 8B Instruct at Q4_K_M as a first model); anything over 10B probably won’t run well on a CPU, so you’ll need to consider your available GPU VRAM.
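
As a point of reference, llama-server exposes an OpenAI-compatible HTTP API once a model is loaded. Below is a minimal Go sketch (Go, since Chris’s own tooling is written in it) that sends a chat request to a locally running server; the default port (8080), the endpoint path, and the model filename are assumptions to verify against your own build:

```go
package main

// Minimal sketch: query a local llama-server (from llama.cpp) via its
// OpenAI-compatible chat endpoint. Assumes the server was started with
// something like `llama-server -m Llama-3.1-8B-Instruct-Q4_K_M.gguf`
// (the filename is a placeholder) and is listening on the default port 8080.

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

type message struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

// llama-server serves whichever model it was loaded with, so no "model"
// field is included here (an assumption worth checking on your version).
type chatRequest struct {
	Messages []message `json:"messages"`
}

type chatResponse struct {
	Choices []struct {
		Message message `json:"message"`
	} `json:"choices"`
}

func main() {
	body, err := json.Marshal(chatRequest{
		Messages: []message{{Role: "user", Content: "Say hello in one sentence."}},
	})
	if err != nil {
		panic(err)
	}

	// No API key is needed for a local server.
	resp, err := http.Post("http://localhost:8080/v1/chat/completions",
		"application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var out chatResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		panic(err)
	}
	if len(out.Choices) > 0 {
		fmt.Println(out.Choices[0].Message.Content)
	}
}
```
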
This is neat:

Just for fun, I ported llama.cpp to Windows XP and ran a 360M model on a 2008-era laptop. It was magical to load that old laptop with technology that, at the time it was new, would have been worth billions of dollars.

I need to spend more time with Chris’s favourite models, Mistral-Nemo-2407 (12B) and Qwen2.5-14B/72B.
Chris also built illume, a Go CLI tool for interacting with models that looks similar to my own LLM project.
Via lobste.rs
Tags: windows, generative-ai, go, ai, edge-llms, llms

AI Summary and Description: Yes

Summary: The text shares practical insights on running local large language models (LLMs) on Windows, including recommendations for selecting models and considerations around hardware limitations. This information is particularly pertinent for professionals involved in generative AI and LLM security.

Detailed Description: The text provides an overview of Chris Wellons’ experiences and recommendations for running local LLMs, especially on Windows systems, although many tips are also applicable to other operating systems. Key points include:

– **Model Recommendations**:
  – Users are encouraged to start with smaller models, particularly Llama 3.1 8B Instruct at Q4_K_M quantization, recommended as an efficient first model on local systems.
  – Attention is drawn to the performance limitations of larger models (over 10B parameters) when run on CPUs, highlighting the need for adequate hardware resources such as sufficient GPU VRAM; a rough sizing sketch follows this list.
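
To make the VRAM point concrete, a common rule of thumb is weight memory ≈ parameter count × bits-per-weight ÷ 8, plus overhead for the KV cache and runtime buffers. Q4_K_M lands around 4.85 bits per weight; both that figure and the flat overhead allowance in this Go sketch are rough approximations rather than exact values:

```go
package main

import "fmt"

// Back-of-envelope memory estimate for a quantized model. The bits-per-weight
// figure for Q4_K_M (~4.85) and the flat overhead allowance are rough
// approximations; actual usage also scales with context length (KV cache).
func estimateGB(params, bitsPerWeight float64) float64 {
	weights := params * bitsPerWeight / 8 // bytes for the weights themselves
	overhead := 1.5e9                     // rough allowance for KV cache and buffers
	return (weights + overhead) / 1e9
}

func main() {
	for _, m := range []struct {
		name   string
		params float64
	}{
		{"8B @ Q4_K_M", 8e9},
		{"14B @ Q4_K_M", 14e9},
		{"72B @ Q4_K_M", 72e9},
	} {
		fmt.Printf("%-14s ~%.0f GB\n", m.name, estimateGB(m.params, 4.85))
	}
}
```

By this estimate an ~8B model at Q4_K_M fits in well under 8 GB, while a 72B model calls for tens of gigabytes of VRAM, which matches the advice to start small and size up against available hardware.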

– **Creative Experimentation**:
  – Chris shares an anecdote about running a 360M model on a 2008-era laptop by porting llama.cpp to Windows XP, illustrating how far modern LLM tooling can be pushed on older hardware.

– **Additional Models**:
  – Simon expresses interest in exploring Chris’s favourite models, Mistral-Nemo-2407 (12B) and Qwen2.5 (14B/72B), pointing to a broader landscape of models suited to diverse applications.

– **Tool Development**:
  – Chris built illume, a Go CLI tool for interacting with models, which Simon notes looks similar to his own LLM project.

This text highlights practical considerations in generative AI that are relevant to AI security, LLM security, and general information security as they pertain to the infrastructure used to run such systems. The insights may guide professionals in configuring their environments for effective and secure AI model usage.

Overall, integrating LLMs into varied environments and making appropriate use of hardware both emphasize the need for security considerations during implementation and operation.