The Complete Guide to Running Local LLMs with Ollama in 2026

Everything you need to set up Ollama, pick the right model, and start running AI locally on your own hardware — no cloud, no API costs, full privacy.


Running AI locally used to mean weeks of setup, a server-class GPU, and deep knowledge of CUDA drivers. In 2026, you can have a capable LLM answering questions on your laptop in about five minutes. Ollama changed everything.

This guide covers everything from installation to picking the right model for your use case — written from hands-on experience running local LLMs daily.

Why local LLMs instead of the cloud?

Three things push people toward local inference: privacy, cost, and control.

With a cloud API, every prompt you send is processed on someone else’s infrastructure. For most content this is fine, but for internal business data, client information, or anything sensitive, running locally eliminates the question entirely.

Cost matters once you scale. At a few cents per thousand tokens, an API looks cheap until your automation pipeline sends 200,000 tokens a day. A local model costs nothing per query after setup.

Control means you decide which model version to run, what system prompt to use, and what temperature to set — and none of that changes without your input.

Installing Ollama

Ollama is available on macOS, Windows, and Linux. The install is intentionally simple.

Linux (one command):

curl -fsSL https://ollama.com/install.sh | sh

macOS / Windows: Download the installer from ollama.com and run it. Ollama installs as a background service that starts automatically.

After installation, verify it’s running:

ollama --version

Ollama also starts a local API server at http://localhost:11434. You’ll use this when connecting from n8n, Python scripts, or any other tool.

Picking a model

The model you choose depends on your hardware and use case. Here’s how I think about it:

| Use case | Recommended model | VRAM / RAM needed |
| --- | --- | --- |
| Quick tasks, summaries, classification | phi4-mini (3.8B) | 4 GB |
| Long-form writing, complex reasoning | qwen3:7b | 8 GB |
| Code generation | qwen2.5-coder:7b | 8 GB |
| High-quality writing (slower) | qwen3:14b | 16 GB |
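The table's memory figures follow from simple arithmetic on the parameter count. A back-of-the-envelope sketch (the 1.2 overhead factor for the KV cache and runtime buffers is my own rough allowance, not an official Ollama number):

```python
def approx_memory_gb(params_billion, bits=4, overhead=1.2):
    """Rough memory footprint of a quantized model's weights.

    bits=4 matches the 4-bit quantizations Ollama commonly ships;
    overhead is a loose allowance for the KV cache and runtime buffers.
    """
    bytes_per_param = bits / 8  # 4 bits = half a byte per weight
    return params_billion * bytes_per_param * overhead

print(approx_memory_gb(7))  # roughly 4.2 GB, comfortably inside the 8 GB row
```

That's why a 7B model fits an 8 GB card with room to spare, while a 14B model wants 16 GB.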

Pull a model with:

ollama pull qwen3:7b

The first pull downloads the weights — anywhere from 2 GB to 10 GB depending on the model. After that, the model loads from local disk in seconds.

Your first run

Once a model is pulled, start an interactive session:

ollama run qwen3:7b

Type your prompt and hit enter. You’ll see the model stream a response directly in your terminal.

To pass a one-shot prompt without opening an interactive session:

ollama run phi4-mini "Summarize this in one sentence: The quick brown fox jumps over the lazy dog."

Using the API

The real power of Ollama is the local HTTP API. Any tool that can make an HTTP request — n8n, a Python script, Zapier — can call it:

curl http://localhost:11434/api/generate \
  -d '{
    "model": "qwen3:7b",
    "prompt": "Write a 3-sentence intro for a blog about AI automation.",
    "stream": false
  }'

The response comes back as JSON with a response field containing the generated text.

For streaming responses (better UX in chat interfaces), set "stream": true and handle the newline-delimited JSON chunks.
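The chunk handling is straightforward from Python. A minimal sketch (the helper name is mine; in a real client you would feed it lines from an HTTP stream against /api/generate):

```python
import json

def collect_stream(chunks):
    """Join the "response" fields of Ollama's newline-delimited JSON chunks."""
    pieces = []
    for line in chunks:
        if not line:
            continue  # streaming readers can yield empty keep-alive lines
        part = json.loads(line)
        pieces.append(part.get("response", ""))
        if part.get("done"):  # final chunk carries done: true plus timing stats
            break
    return "".join(pieces)
```

With the `requests` package and a running server, you would call it as `collect_stream(requests.post("http://localhost:11434/api/generate", json={"model": "qwen3:7b", "prompt": "Hi", "stream": True}, stream=True).iter_lines())`.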

Giving the model a custom system prompt

System prompts let you give the model a persistent persona or set of instructions. Create a Modelfile:

FROM qwen3:7b

SYSTEM """
You are a concise technical writer. You write in short paragraphs.
You never use filler phrases like "certainly" or "of course".
When asked for code, you provide working examples with no extra commentary.
"""

Then build and run it:

ollama create mywriter -f Modelfile
ollama run mywriter

This is how you create a consistent voice for AI-generated content.
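If you'd rather not bake the persona into a custom model, the /api/chat endpoint also accepts a system message on each request. A sketch of the request body (the helper name is my own):

```python
def chat_payload(model, system, user, stream=False):
    """Request body for /api/chat with a per-request system prompt."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        "stream": stream,
    }

body = chat_payload("qwen3:7b",
                    "You are a concise technical writer.",
                    "Explain quantization in two sentences.")
# POST this as JSON to http://localhost:11434/api/chat; the reply's
# message.content field holds the generated text.
```

The Modelfile approach wins when several tools share one persona; the per-request approach wins when each workflow needs its own.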

Running models on CPU vs GPU

Ollama auto-detects your hardware. If you have a compatible GPU (NVIDIA, AMD, Apple Silicon), it offloads computation there. If not, it falls back to CPU.

CPU inference is slower but it works. A 7B model on a modern CPU generates roughly 5–15 tokens per second. A GPU bumps that to 30–80+ tokens per second. For automation workflows that run in the background, CPU speed is often acceptable.

Check which models are loaded, and how much of each is running on GPU versus CPU, with:

ollama ps

Performance tips

A few things that meaningfully affect speed:

  • Keep the model loaded. Ollama caches the model in memory for a few minutes after the last request. If your workflow sends prompts regularly, the model stays warm and responses start faster.
  • Use quantized models. Variants tagged with q4 (for example q4_K_M) use 4-bit quantization — smaller, faster, and nearly indistinguishable in quality for most writing tasks. Ollama’s default tags are typically already 4-bit.
  • Context window. Shorter prompts with less conversation history generate faster. Trim your context aggressively in automated pipelines.
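The context-trimming tip is easy to automate. One approach (a sketch; the helper and the turn count are my own choices) is to keep the system prompt and only the most recent exchanges:

```python
def trim_history(messages, max_turns=3):
    """Keep the system prompt plus the last few user/assistant messages.

    max_turns counts user/assistant pairs, so the tail keeps at most
    2 * max_turns non-system messages.
    """
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-2 * max_turns:]
```

For the first tip, Ollama's API also accepts a keep_alive field on requests (durations like "10m", or -1 to stay loaded), which lets a pipeline control how long the model stays warm.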

What to build with it

Once Ollama is running, the logical next step is automation. Connect it to n8n to build content pipelines that run on a schedule. Use the Python ollama library to embed it in scripts. Point Open WebUI at it for a ChatGPT-style interface.

The local API means the same workflow you’d build with an OpenAI key works with Ollama — just swap the base URL.
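The base-URL swap works because Ollama exposes an OpenAI-compatible endpoint under /v1. A stdlib sketch of the redirected request (the Bearer value is a placeholder: Ollama ignores the key, but OpenAI-style clients usually insist one is set):

```python
import json
import urllib.request

OLLAMA_BASE = "http://localhost:11434/v1"

def chat_completion_request(model, user_prompt):
    """Build an OpenAI-style chat-completions request aimed at local Ollama."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": user_prompt}],
    }).encode()
    return urllib.request.Request(
        f"{OLLAMA_BASE}/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            # Ollama ignores the key; clients just expect the header to exist.
            "Authorization": "Bearer ollama",
        },
    )

req = chat_completion_request("qwen3:7b", "Say hello.")
# With the server running, urllib.request.urlopen(req) returns the familiar
# OpenAI JSON shape, with the text in choices[0]["message"]["content"].
```

The same idea applies to SDKs: point an OpenAI client's base URL at http://localhost:11434/v1 and the model parameter at a pulled Ollama model.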


Written by

Admin Editor & Builder

Human editor behind Pipeline Monk. Building AI-powered workflows, reviewing pipeline output, and writing guides from hands-on experience.