Running LLMs Locally: Ollama, vLLM, and the VRAM Problem

This is Part 7 of the AI Agents series. Part 6 covered Groq — an API for open-source models hosted on someone else’s hardware. This post removes the API entirely: running models on your own machine, with full privacy and zero per-token cost.

1. Two reasons to run locally

Privacy. Every API call sends your prompt to a third-party server. For applications handling sensitive data — medical records, legal documents, internal company data — that’s a problem. Running locally means the data never leaves your machine.

Cost at scale. API pricing is pay-per-token. At low traffic it’s cheap; at high traffic it compounds. A model running on your own hardware costs the same per day whether you run 100 requests or 100,000.

Both are real advantages. The catch is that running models locally requires hardware, and the hardware requirements are higher than most people expect.

2. The two tools: Ollama and vLLM

	Ollama	vLLM
Setup	One command	GPU setup + configuration
Use case	Development, experimentation	Production serving
Throughput	Limited concurrency, tuned for one user	Continuous batching of many concurrent requests
Hardware	CPU or GPU	GPU (NVIDIA recommended)
Production-ready	No	Yes

Ollama — designed for developer ergonomics. Install it, type one command, and any model from the Ollama library downloads and runs locally on localhost:11434. Perfect for testing models and building prototypes.

vLLM — a high-throughput inference engine from UC Berkeley, now with 80k+ GitHub stars and 2,600+ contributors. It uses PagedAttention for efficient GPU memory management and handles batched requests, quantization, streaming, and production load. This is what you use when you’re actually serving users.

Note on model versions: The model names used in the code below (e.g. qwen3:4b) are accurate as of May 2026. New model versions are released constantly. Check ollama.com/library for the current best-performing models in each size range.

The pattern: validate your model choice with Ollama, then move to vLLM for production.

3. The VRAM problem

This is the wall most people hit first.

Commercial models like GPT-4 or Claude do not have published parameter counts; outside estimates put them at 600B–1T+ parameters. Even “small” open-source models at 70B are extremely difficult to run on a standard laptop. The limiting factor is VRAM: the memory your GPU uses to hold model weights during inference.

Formula for weight memory:

$$\text{VRAM (GB)} = \frac{\text{Parameters (B)} \times \text{Bits per weight}}{8}$$

For a 7B model at different precisions:

Precision	Bits	VRAM for 7B model
float32	32	28 GB
float16 / bfloat16	16	14 GB
int8	8	7 GB
int4	4	3.5 GB
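The formula is simple enough to check in code. A minimal sketch that reproduces the table above; the function name is illustrative, and it counts the weights only.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Memory needed for the weights alone, in GB (no KV cache, no runtime overhead)."""
    return params_billion * bits_per_weight / 8


# Reproduce the 7B column of the table.
for name, bits in [("float32", 32), ("float16", 16), ("int8", 8), ("int4", 4)]:
    print(f"{name:>8}: {weight_memory_gb(7, bits):.1f} GB")
```

Changing the first argument shows how quickly the numbers grow: a 70B model at float16 needs 140 GB for weights alone.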

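A typical laptop has 8–16 GB of RAM, and only part of that is available to the model. At 32-bit, even a 7B model is too large, and at 16-bit it leaves almost no headroom on a 16 GB machine. At 4-bit, it fits, but at a cost in accuracy.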

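And that’s just the weights. You also need memory for the KV cache: the attention keys and values the model stores for every token in the context, so it does not have to recompute them at each generation step. The cache grows linearly with context length. For long inputs, budget an additional ~4 GB on top of the model weights.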

Finally, your CPU RAM needs to be large enough to load the model before it transfers to the GPU: VRAM_required × 1.2 is a safe estimate.
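To see where a figure like ~4 GB comes from, the KV cache can be estimated from the model's shape: two tensors (keys and values) per layer, each holding one vector per KV head per token. The sketch below uses that accounting. The layer count, head count, and head size are assumed values for a hypothetical 7B-class model, not the specification of any model named in this post; models that use grouped-query attention have fewer KV heads and a smaller cache.

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_value: int = 2) -> float:
    """KV cache size in GB for one sequence: 2 tensors (K and V) per layer."""
    total_bytes = 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_value
    return total_bytes / 1e9


def host_ram_gb(vram_required_gb: float, safety_factor: float = 1.2) -> float:
    """CPU RAM estimate from the rule of thumb above: VRAM_required x 1.2."""
    return vram_required_gb * safety_factor


# Assumed shape for a hypothetical 7B-class model (illustrative numbers only).
LAYERS, KV_HEADS, HEAD_DIM = 32, 32, 128

weights_gb = 7 * 4 / 8  # 7B parameters at int4, from the weight formula
for context in (2_048, 8_192):
    cache = kv_cache_gb(LAYERS, KV_HEADS, HEAD_DIM, context)
    total = weights_gb + cache
    print(f"{context:>5} tokens: cache {cache:.2f} GB, "
          f"VRAM {total:.2f} GB, host RAM {host_ram_gb(total):.2f} GB")
```

Under these assumptions the cache is about 1 GB at 2,048 tokens and a little over 4 GB at 8,192 tokens, which is why long contexts can cost more memory than a 4-bit model's weights.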

4. Quantization: smaller model, lower accuracy

Quantization reduces the numerical precision of model weights to shrink memory usage. Think of it like compressing an image: the file gets smaller, but you lose some detail.

int8 (8-bit): 2× smaller than float16, small accuracy drop, usually acceptable
int4 (4-bit): 4× smaller than float16, noticeable accuracy drop on complex tasks
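The loss of detail can be made concrete with a toy experiment. The sketch below applies plain symmetric round-to-nearest quantization to synthetic numbers and measures how far the restored values drift from the originals. It is a deliberately simplified stand-in: real quantization formats use per-block scales and other refinements, and the random values here are not real model weights.

```python
import random


def quantize_roundtrip(weights: list[float], bits: int) -> list[float]:
    """Symmetric round-to-nearest quantization, followed by dequantization."""
    levels = 2 ** (bits - 1) - 1          # 127 for int8, 7 for int4
    scale = max(abs(w) for w in weights) / levels
    restored = []
    for w in weights:
        q = max(-levels, min(levels, round(w / scale)))  # integer code
        restored.append(q * scale)                       # back to a float
    return restored


def mean_abs_error(original: list[float], restored: list[float]) -> float:
    """Average absolute difference between original and restored values."""
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(10_000)]  # synthetic "weights"

for bits in (8, 4):
    err = mean_abs_error(weights, quantize_roundtrip(weights, bits))
    print(f"int{bits}: mean absolute error = {err:.6f}")
```

The 4-bit error comes out much larger than the 8-bit error because each value has only 15 levels to land on instead of 255. Whether that error matters is the task-dependent part.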

The right approach:

Start with the smallest model that fits unquantized (1.5B or 3B at float16)
Test quality on your actual task
If quality isn’t enough, move to a larger quantized model (7B int4)
Don’t assume quantized = broken — for simple, well-defined tasks, 4-bit is often fine

The accuracy tradeoff is task-dependent. A 4-bit model answering FAQ questions may be fine. A 4-bit model doing complex legal reasoning may not be.

5. Install Ollama and run your first model

# macOS/Linux
curl -fsSL https://ollama.com/install.sh | sh

Or download the installer from ollama.com/download.

Once installed, running a model is one command. Ollama downloads the weights automatically:

ollama run qwen3:4b

This starts an interactive terminal session. Type a message, get a response. To stop, type /bye.
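The same local server that backs the interactive session also accepts HTTP requests on localhost:11434. A minimal sketch using only the Python standard library, assuming the Ollama server is running and qwen3:4b has already been downloaded; the helper names are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # Ollama's default local address


def build_chat_payload(model: str, prompt: str) -> dict:
    """Request body for Ollama's /api/chat endpoint (single, non-streamed reply)."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # return one JSON object instead of a stream of chunks
    }


def chat(model: str, prompt: str, url: str = OLLAMA_URL) -> str:
    """Send one prompt to a running Ollama server and return the reply text."""
    body = json.dumps(build_chat_payload(model, prompt)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=300) as response:
        return json.loads(response.read())["message"]["content"]


# Usage (requires a running Ollama server):
# print(chat("qwen3:4b", "Say hello in five words."))
```

This is handy for checking that the server is reachable before wiring it into an application.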

6. Using Ollama from Python

pip install ollama

Non-streaming:

import ollama

response = ollama.chat(
    model="qwen3:4b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain what a transformer is in 3 sentences."}
    ]
)

print(response["message"]["content"])

Streaming:

import ollama

stream = ollama.chat(
    model="qwen3:4b",
    messages=[{"role": "user", "content": "Explain what a transformer is in 3 sentences."}],
    stream=True
)

for chunk in stream:
    print(chunk["message"]["content"], end="", flush=True)

Streaming prints each token as it generates instead of waiting for the full response. On a local CPU or low-end GPU, this is the difference between the app feeling responsive and feeling frozen. Always use streaming for user-facing output.

7. Model size vs. quality: what actually happens

Small models are not mini versions of GPT-4. They are genuinely less capable, and the gap is larger than most people expect.

A 1.5B parameter model has learned far less from training data than a 70B model. It will hallucinate more, follow instructions less reliably, and fail on tasks that require broader knowledge or multi-step reasoning.

How this plays out in practice:

1.5B models — fast, but answer quality is unreliable on anything open-ended. Use only for narrow, constrained tasks where you can validate the output.
7B models — noticeably better reasoning and knowledge. Still a large step below frontier models, but useful for many real tasks.
Reasoning models (e.g. DeepSeek-R1) — generate a chain-of-thought before answering, which improves accuracy but significantly increases response time. On a CPU, a 7B reasoning model can take minutes per response.

The practical path: pick the smallest model that gives acceptable quality for your specific task. Don’t start with the biggest one you can fit — start small and scale up only when quality requires it.
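One way to make "start small and scale up" repeatable is a small evaluation loop: run the same test prompts against each candidate, smallest first, and stop at the first one that clears a quality bar. The sketch below is an assumption about how you might structure this, not a standard tool. The model names, the 0.9 threshold, and the fake generator are all placeholders so the code runs without a model server.

```python
from typing import Callable, Optional

# A case is a prompt plus a check that decides whether the answer is acceptable.
Case = tuple[str, Callable[[str], bool]]
Generate = Callable[[str, str], str]  # (model, prompt) -> answer text


def pass_rate(generate: Generate, model: str, cases: list[Case]) -> float:
    """Fraction of cases whose answer passes its check."""
    passed = sum(1 for prompt, check in cases if check(generate(model, prompt)))
    return passed / len(cases)


def smallest_acceptable(generate: Generate, models_small_to_large: list[str],
                        cases: list[Case], threshold: float = 0.9) -> Optional[str]:
    """Return the first (smallest) model that meets the quality threshold."""
    for model in models_small_to_large:
        if pass_rate(generate, model, cases) >= threshold:
            return model
    return None  # nothing passed: try a larger model or a hosted API


# Stand-in for a real model call, so the sketch runs without a server.
def fake_generate(model: str, prompt: str) -> str:
    return "5" if model == "tiny-model" else "4"


cases = [("What is 2 + 2? Answer with a number only.", lambda a: a.strip() == "4")]
print(smallest_acceptable(fake_generate, ["tiny-model", "mid-model"], cases))
```

In real use, `generate` would wrap an actual model call and `cases` would come from your own task, with checks strict enough that a pass means the answer is usable.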

8. Token generation speed and hardware

Token generation speed is a usability constraint. A 200-word response is roughly 270 tokens of English text, so at 5 tokens per second it takes close to a minute, and longer if the tokenizer splits words more finely. Most users will not wait that long.
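The wait time follows directly from response length and generation speed. The uncertain input is how many tokens a word costs, so the sketch below takes it as a parameter. Both ratios are assumptions: about 1.3 tokens per word is a common rule of thumb for English prose, and 2.0 is a pessimistic case for text that tokenizes poorly.

```python
def response_seconds(words: int, tokens_per_second: float,
                     tokens_per_word: float) -> float:
    """Estimated time to generate a response of the given length."""
    return words * tokens_per_word / tokens_per_second


# A 200-word answer at 5 tokens/second under two tokenization assumptions.
for ratio in (1.3, 2.0):
    print(f"{ratio} tokens/word: {response_seconds(200, 5, ratio):.0f} s")
```

Either way the answer lands between roughly one and one and a half minutes, far beyond what feels interactive.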

What determines speed:

Factor	Impact
GPU vs CPU	GPU is dramatically faster — often 10–20×
Model size	Smaller model = faster generation
Quantization	Lower precision = less data to read per token, usually faster
VRAM	If model doesn’t fit, it spills to slower CPU RAM

Ollama automatically uses your GPU if one is available. On a MacBook with Apple Silicon (M-series), the unified memory acts as GPU memory — a 16 GB M2 can reasonably run a 7B model at 4-bit.

On a CPU-only machine, keep models at 3B or smaller. Even then, expect 5–15 tokens per second.
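To find out where your own machine lands, measure instead of guessing. Ollama attaches generation statistics to a finished response, including eval_count (tokens generated) and eval_duration (time spent generating, in nanoseconds). A minimal sketch that turns them into tokens per second; the sample dictionary is made-up data for illustration, not a real measurement.

```python
def tokens_per_second(response: dict) -> float:
    """Generation speed from the statistics attached to a finished Ollama response."""
    seconds = response["eval_duration"] / 1e9  # eval_duration is in nanoseconds
    return response["eval_count"] / seconds


# Made-up example values, shaped like the stats fields in an Ollama response.
sample = {"eval_count": 120, "eval_duration": 12_000_000_000}
print(f"{tokens_per_second(sample):.1f} tokens/s")
```

Running this against a few representative prompts gives a more honest number than any table, since speed also depends on context length and what else the machine is doing.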

9. When to move to vLLM

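Ollama is built around a single developer’s workload. It can queue requests and run a small number in parallel (controlled by the OLLAMA_NUM_PARALLEL setting), but it is not designed for high-concurrency serving. That’s fine for development. It’s not fine for serving many users simultaneously.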

vLLM solves this with batched inference — it processes multiple requests in parallel on the GPU, dramatically increasing throughput. It also exposes an OpenAI-compatible API endpoint, so you can point existing code at your local vLLM instance with minimal changes.

The migration path:

Build and test your application with Ollama locally
Once the model and prompt work correctly, move to vLLM for serving
Point your API calls at http://localhost:8000/v1 instead of Groq or OpenAI
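Because the endpoint follows the OpenAI chat-completions format, the last step can come down to changing a base URL. A minimal sketch with the standard library only, assuming a vLLM server is already listening on port 8000 (for example one started with `vllm serve` and a model name). The model name and helper names below are placeholders. Ollama also serves an OpenAI-compatible endpoint under http://localhost:11434/v1, so the same function can be exercised during development.

```python
import json
import urllib.request


def build_request(base_url: str, model: str, prompt: str,
                  api_key: str = "EMPTY") -> urllib.request.Request:
    """Build a POST to an OpenAI-compatible /chat/completions endpoint."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # local servers typically ignore this
        },
    )


def complete(base_url: str, model: str, prompt: str) -> str:
    """Send the request and return the assistant's reply text."""
    request = build_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=300) as response:
        return json.loads(response.read())["choices"][0]["message"]["content"]


# Only the base URL changes between providers.
# The model name must match whatever the server has loaded.
VLLM_BASE_URL = "http://localhost:8000/v1"
# print(complete(VLLM_BASE_URL, "your-model-name", "Say hello in five words."))
```

If your application already uses an OpenAI-style client library, the equivalent change is usually its base URL setting, and the rest of the code stays the same.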

vLLM expects a dedicated GPU, and NVIDIA with CUDA is the best-supported setup. It’s not for laptops; it’s for servers or cloud instances where you’re running a real workload.

What’s next

The series continues with RAG — Retrieval-Augmented Generation. Instead of relying on what the model learned during training, RAG gives it access to your own documents at inference time. That’s how you build a chatbot that knows your company’s internal data without fine-tuning anything.

Full video walkthrough is embedded above.