Running local LLMs with Ollama in 2025
A practical guide to running decent language models on your own machine, and which ones are actually worth pulling.
The state of local LLMs has changed enough in the last year that the advice I would have given in 2024 is now wrong. Here’s how I’d set it up today.
The tool
Ollama has become the default for running models locally. It’s a single binary that handles model download, quantization, serving, and an OpenAI-compatible API, all behind commands that are one word long.
# Install (macOS/Linux)
curl -fsSL https://ollama.com/install.sh | sh
# Pull a model
ollama pull llama3.3
# Chat
ollama run llama3.3
# Serve an OpenAI-compatible API on localhost:11434
ollama serve
That’s the entire getting-started story. No Python env, no CUDA wrangling, no converting GGUFs by hand.
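Once `ollama serve` is running, any OpenAI-style client can talk to it. Here’s a sketch of the request shape — the model name is just whichever one you’ve pulled, and the curl line is commented out so the snippet does nothing without a running server:

```shell
# Build the request body for Ollama's OpenAI-compatible chat endpoint.
# llama3.3 is an example; substitute any model you've pulled.
payload='{"model": "llama3.3", "messages": [{"role": "user", "content": "Say hello in one word."}]}'
echo "$payload"

# With `ollama serve` running, send it like any OpenAI-style request:
# curl http://localhost:11434/v1/chat/completions \
#      -H "Content-Type: application/json" \
#      -d "$payload"
```

Anything that already speaks the OpenAI API — SDKs, editor plugins — can be pointed at http://localhost:11434/v1 the same way.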
Which models are actually worth pulling
This is the part that changes every few months. Mid-2025 state of play:
If you have 8 GB of VRAM (or a decent Mac):
- llama3.2:3b — Meta’s small model. Surprisingly capable for its size. Fast.
- qwen2.5:7b — strong general-purpose model, especially for non-English text.
- gemma2:9b — Google’s contribution, good at following instructions.
If you have 16–24 GB of VRAM (or an M-series Mac with 32+ GB unified memory):
- llama3.3:70b (quantized) — the sweet spot for “feels like a real assistant.” Q4 quantization runs on 48 GB; Q2 runs on 32 GB with some quality loss.
- qwen2.5:32b — very strong, fits comfortably on 24 GB.
- qwen2.5-coder:32b — specifically tuned for code, currently the best local coding model I’ve used.
- deepseek-r1:32b — the distilled reasoning model from the R1 release. Shows its work, useful for math and debugging.
If you have a ridiculous rig (2x 4090 or a Mac Studio with 128 GB):
- deepseek-r1:70b — full-size reasoning model, genuinely competitive with frontier API models for a lot of tasks.
- llama3.3:70b at Q8 quantization.
What I actually use local models for
Three things, mostly:
- Privacy-sensitive tasks. Summarizing documents I wouldn’t paste into a hosted API. Processing logs that might contain credentials.
- Bulk workloads where API cost adds up. Tagging thousands of items, generating embeddings, extracting structured data from big corpora. A local model is free to run all weekend.
- Offline work. Planes, hotel wifi, conferences with bad reception.
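The bulk-workload case is usually just a loop piped through `ollama run`. A minimal sketch — the notes/ directory, the sample file, and the tagging prompt are all invented for illustration, and `DRY_RUN=echo` prints each command instead of invoking the model (clear it, with llama3.2 pulled, to do the real work):

```shell
# Tag every text file in a directory with a small local model.
# DRY_RUN=echo makes this a dry run; set DRY_RUN="" to actually call ollama.
DRY_RUN=echo

# Sample input so the sketch runs as-is (replace with your real files).
mkdir -p notes && printf 'meeting about Q3 roadmap\n' > notes/example.txt

for f in notes/*.txt; do
  $DRY_RUN ollama run llama3.2 "Give three comma-separated topic tags for: $(cat "$f")"
done
```

Left running overnight, a loop like this costs nothing but electricity — which is the whole point of the bulk use case.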
What I don’t use them for: anything where I’d notice the quality gap from frontier models. For coding that actually matters, I’m still using Claude or GPT. Local models are good enough for the small stuff, not good enough for the hard stuff — yet. The gap is narrowing about twice as fast as I expected a year ago.
The piece nobody mentions
Quantization quality matters more than which model you pick. A Q4 70B will usually beat a Q8 32B. A Q2 70B will sometimes lose to a Q8 32B, because too much has been thrown away. When you’re comparing models at the same memory budget, the rule of thumb is: a bigger model with more aggressive quantization beats a smaller model with lighter quantization — until you hit Q3 or below, at which point the damage becomes noticeable.
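The arithmetic behind those memory figures is just parameters × bits-per-weight ÷ 8. The bits-per-weight values below are rough figures I’m assuming for common GGUF quant levels (about 8.5 for Q8, 4.8 for a Q4 K-quant, 2.6 for Q2), and real usage adds KV cache and runtime overhead on top of the raw weights — which is why a ~42 GB Q4 70B wants 48 GB in practice:

```shell
# Back-of-envelope weight sizes for a 70B model:
# params (billions) x approximate bits-per-weight / 8 = GB of weights.
sizes=$(
  for q in "Q8 8.5" "Q4 4.8" "Q2 2.6"; do
    set -- $q  # split "name bpw" into $1 and $2
    awk -v n="$1" -v b="$2" 'BEGIN { printf "%s: ~%.0f GB of weights\n", n, 70 * b / 8 }'
  done
)
echo "$sizes"
```

Run the same arithmetic for a 32B model and you can see why it fits comfortably on a 24 GB card at Q4 while the 70B doesn’t.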
If you haven’t tried local models since 2023, try again. They’re different now.