How to Run AI Models Locally with Ollama (2026 Setup Guide)
The definitive guide to running LLMs on your own hardware. Install Ollama, pull models like Llama 4, DeepSeek R1, and Mistral, set up a ChatGPT-like interface, and integrate with your dev tools — all free, private, and offline.
Why Run AI Locally?
Four reasons developers are moving away from cloud AI APIs and running models on their own hardware.
Privacy
Your data never leaves your machine. No conversations logged to third-party servers. No training on your prompts. Complete data sovereignty.
Zero Cost
No per-token API fees. No monthly subscriptions. Download a model once and run it unlimited times. Your hardware, your rules.
Speed
Zero network latency. Responses start instantly. No waiting for server round-trips. Especially noticeable for short, frequent prompts.
Offline
Works without internet after the initial model download. Perfect for travel, restricted networks, or air-gapped environments.
The 2026 Reality: Open-source models have caught up dramatically. DeepSeek R1 beats GPT-4o on math reasoning. Qwen 2.5 Coder matches it on code generation. Llama 4 Scout delivers near-70B quality with only 12GB of VRAM. Running locally is no longer a compromise — it's a competitive advantage.
What is Ollama?
Ollama is a free, open-source tool for running large language models locally on your own hardware. It handles the complexity of downloading, configuring, and serving AI models — so you can go from zero to chatting with an LLM in under 5 minutes.
Think of it as Docker for AI models. You ollama pull a model, ollama run it, and start talking. Ollama automatically detects your hardware — NVIDIA GPUs (CUDA), Apple Silicon (Metal), AMD GPUs (ROCm on Linux) — and optimizes performance accordingly.
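As a quick sketch of that workflow (model name is just an example), the whole loop is two commands, and ollama ps, run in a second terminal while the model is loaded, reports whether it landed on the GPU or CPU:
# Download a model, then chat with it
ollama pull llama3.2
ollama run llama3.2
# In another terminal, while the model is loaded:
ollama ps
# The PROCESSOR column shows the GPU/CPU split (e.g. "100% GPU")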
- 100+ available models
- 3 OS platforms
- OpenAI-compatible API
- Free and open source
Platform Support: macOS 11+ (Intel & Apple Silicon), Linux (Ubuntu 18.04+), Windows (via WSL2 or native installer). Download from ollama.com.
Hardware Requirements
What you can run depends on your RAM and GPU. Here's a breakdown by hardware tier.
System RAM Requirements
8GB RAM — Small Models (3B parameters)
Requires a reasonably modern CPU. Enough for Llama 3.2 1B/3B or Phi 3 3.8B. Expect ~5-10 tokens/sec on CPU only.
16GB RAM — 7B Models
Recommended for Mistral 7B, Llama 3.2 8B. Good balance of quality and speed.
32GB RAM — 13B+ Models
Run Qwen 2.5 Coder 32B, Gemma 2 27B, and larger models comfortably.
GPU VRAM Guide
A GPU dramatically speeds up inference. Here's what you can run at each VRAM tier:
| GPU VRAM | Models You Can Run | Example GPUs |
|---|---|---|
| 4GB | Small models (3B params): Llama 3.2 1B/3B, Phi 3 3.8B | GTX 1650, RX 6500 XT |
| 8GB | 7B models: Mistral 7B, Llama 3.2 8B, Gemma 2 9B | RTX 3060/4060, M1/M2 (shared) |
| 12GB | Llama 4 Scout (109B MoE, 17B active) — near-70B quality | RTX 3060 12GB, RTX 4070 |
| 24GB | 13-32B models, DeepSeek R1 70B distilled (Q4 quantization) | RTX 3090/4090, M2 Pro/Max |
| 40GB+ | 70B+ models: Full Llama 3.2 70B, DeepSeek R1 full | A100, H100, M2 Ultra |
CPU-Only Performance
No GPU? You can still run models on CPU, but performance varies significantly:
| CPU | Speed (7B model, Q4) |
|---|---|
| Intel Core i7 | ~7.5 tok/s |
| AMD Ryzen 5 | ~12.3 tok/s |
A GPU is highly recommended for a decent interactive experience. CPU-only works for batch processing or when you can tolerate slower responses.
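Your numbers will vary with CPU, quantization, and prompt length. To measure throughput on your own machine, run a one-shot prompt with the --verbose flag; Ollama prints timing statistics after the response (the figure below is illustrative):
# Benchmark generation speed on your hardware
ollama run llama3.2 --verbose "Explain DNS in two sentences"
# Look for the "eval rate" line in the printed stats, e.g.:
# eval rate: 11.42 tokens/s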
Storage Tip: Models can be large (1-40GB+ each). A 512GB SSD is recommended if you plan to keep multiple models downloaded. Models are stored in ~/.ollama/models by default.
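To check how much space your models are using, or to keep them on a bigger drive, you can inspect the default directory and point Ollama elsewhere with the OLLAMA_MODELS environment variable (the path below is just an example):
# Check disk usage of downloaded models
du -sh ~/.ollama/models
# ollama list also shows the size of each downloaded model
ollama list
# Store models on another drive (example path) by setting OLLAMA_MODELS
# before starting the server:
OLLAMA_MODELS=/mnt/bigdisk/ollama-models ollama serve
# On Linux systemd installs, set this in the ollama.service environment instead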
Part 1: Installing Ollama
Four installation methods. Pick the one that matches your system and workflow.
macOS (Intel & Apple Silicon)
Two options — the official installer (easiest) or Homebrew (for package manager fans):
# Download from ollama.com and run the .dmg installer
# Supports macOS 11 (Big Sur) and later
# Auto-detects Apple Silicon (M1/M2/M3/M4) for Metal acceleration
# After install, verify:
ollama --version
# Install via Homebrew
brew install ollama
# Verify installation
ollama --version
Linux (Ubuntu 18.04+, Debian, Fedora, Arch)
One-liner install script that detects your distro and GPU automatically:
# Official install script (auto-detects GPU)
curl -fsSL https://ollama.com/install.sh | sh
# Verify installation
ollama --version
# The script automatically:
# - Detects NVIDIA GPUs and installs CUDA support
# - Detects AMD GPUs and configures ROCm
# - Creates a systemd service (ollama.service)
# - Starts the server on port 11434
Windows
Native Windows support or install via the Windows Package Manager:
# Download the Windows installer from ollama.com
# Runs natively — no WSL2 required (since late 2024)
# Supports NVIDIA CUDA GPUs out of the box
# Install via Windows Package Manager
winget install Ollama.Ollama
# Verify installation (open new terminal)
ollama --version
Docker (Any Platform)
Ideal for servers, CI/CD pipelines, or keeping your system clean:
# Run Ollama in Docker with NVIDIA GPU support
docker run -d \
--gpus all \
-v ollama:/root/.ollama \
-p 11434:11434 \
--name ollama \
ollama/ollama
# Without GPU (CPU only)
docker run -d \
-v ollama:/root/.ollama \
-p 11434:11434 \
--name ollama \
ollama/ollama
# Pull and run a model inside the container
docker exec -it ollama ollama pull llama3.2
docker exec -it ollama ollama run llama3.2
After Installation: Verify Everything Works
Check Ollama is installed
ollama --version
# Should output something like: ollama version 0.6.x
Start the server (if not auto-started)
ollama serve
# Server starts on http://localhost:11434
# On macOS/Windows, the app starts the server automatically
# On Linux, systemd manages the service
Test the API is responding
curl http://localhost:11434
# Should return: "Ollama is running"
Part 2: Your First Model
Let's download and run your first AI model. We'll start with Llama 3.2 — Meta's versatile open model.
Download (Pull) a Model
# Pull Llama 3.2 (8B parameters, 4.7GB download)
ollama pull llama3.2
# The download progress will show:
# pulling manifest...
# pulling abc123... 100% ████████████████ 4.7 GB
# verifying sha256 digest...
# writing manifest...
# success
First pull takes a few minutes depending on your internet speed. The model is cached locally — subsequent runs are instant.
Run the Model Interactively
# Start an interactive chat session
ollama run llama3.2
# You'll see a prompt:
# >>> Send a message (/? for help)
# Try these prompts:
>>> Explain how HTTP caching works in 3 sentences
>>> Write a Python function to merge two sorted lists
>>> What are the pros and cons of microservices?
# Type /bye to exit the session
Essential Commands
# Download a model
ollama pull llama3.2
# Run a model (auto-pulls if not downloaded)
ollama run llama3.2
# List all downloaded models
ollama list
# Show detailed model info (size, parameters, template)
ollama show llama3.2
# Remove a model (free disk space)
ollama rm llama3.2
# Start the Ollama server
ollama serve
# Show currently running/loaded models
ollama ps
# Copy a model (create an alias)
ollama cp llama3.2 my-custom-llama
Quick-Run Examples
# One-shot prompt (no interactive session)
ollama run llama3.2 "Summarize the benefits of TypeScript in 3 bullet points"
# Pipe input from a file
cat README.md | ollama run llama3.2 "Summarize this document"
# Use a specific model for coding
ollama run qwen2.5-coder:32b "Write a React hook for debouncing"
# Use DeepSeek for reasoning tasks
ollama run deepseek-r1 "Prove that the square root of 2 is irrational"
Best Models to Run Locally (2026)
Not all models are created equal. Here are the best options for different use cases, with exact download sizes and VRAM requirements.
Llama 3.2
Llama Community License
Best for: General-purpose tasks, conversation, writing, analysis
Llama 4 Scout
Llama Community License
Best for: Near-70B quality on consumer GPUs — best bang for buck
Llama 4 Maverick
Llama Community License
Best for: 10M token context window, complex long-document analysis
DeepSeek R1
MIT (fully commercial)
Best for: Reasoning, math, science, logic — best chain-of-thought
Qwen 2.5 Coder 32B
Apache 2.0
Best for: Code generation, completion, review — best coding model
Mistral 7B
Apache 2.0 (with conditions)
Best for: Great general-purpose model for consumer hardware
Gemma 2
Gemma License
Best for: Google's efficient open model, good for resource-constrained setups
Phi 3
MIT
Best for: Microsoft's small but surprisingly capable model
Pull Any of These Models
# Meta Llama
ollama pull llama3.2 # 8B (default)
ollama pull llama3.2:3b # 3B variant
ollama pull llama3.2:70b # 70B variant (needs 40GB+ VRAM)
# Llama 4
ollama pull llama4-scout # 109B MoE, 17B active
ollama pull llama4-maverick # 400B MoE, 17B active
# DeepSeek R1
ollama pull deepseek-r1 # Default size
ollama pull deepseek-r1:7b # 7B distilled
ollama pull deepseek-r1:70b # 70B distilled
# Qwen 2.5 Coder
ollama pull qwen2.5-coder:32b # Best coding model
ollama pull qwen2.5-coder:7b # Smaller coding model
# Others
ollama pull mistral # Mistral 7B
ollama pull gemma2:9b # Google Gemma 2
ollama pull phi3 # Microsoft Phi 3
Part 3: Open WebUI — ChatGPT-like Interface
The terminal is powerful, but sometimes you want a polished chat interface. Open WebUI gives you a ChatGPT-like experience for your local models — with chat history, model switching, and more.
Quick Setup with Docker
# Pull and run Open WebUI (connects to local Ollama)
docker run -d \
-p 3000:8080 \
--add-host=host.docker.internal:host-gateway \
-v open-webui:/app/backend/data \
--name open-webui \
--restart always \
ghcr.io/open-webui/open-webui:main
# Access the UI at:
# http://localhost:3000
# For NVIDIA GPU acceleration in Open WebUI
docker run -d \
-p 3000:8080 \
--gpus all \
--add-host=host.docker.internal:host-gateway \
-v open-webui:/app/backend/data \
--name open-webui \
--restart always \
ghcr.io/open-webui/open-webui:cuda
# For AMD GPU support, use the :rocm tag instead
Chat Features
- Full chat history with search
- Model selector dropdown (switch models mid-chat)
- File uploads and document analysis
- Markdown rendering with syntax highlighting
Admin Features
- Multi-user support with role-based access
- Dark mode / light mode toggle
- System prompt customization per model
- Import/export conversations
First Login: The first account you create becomes the admin. Open WebUI stores data locally in the Docker volume — nothing is sent to external servers. Access at http://localhost:3000.
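The Docker commands above assume Ollama is running on the same machine. If your Ollama server lives elsewhere (or listens on a non-default port), Open WebUI can be pointed at it via its OLLAMA_BASE_URL environment variable; the hostname below is just an example:
# Point Open WebUI at a remote Ollama server (example URL)
docker run -d \
-p 3000:8080 \
-e OLLAMA_BASE_URL=http://192.168.1.50:11434 \
-v open-webui:/app/backend/data \
--name open-webui \
--restart always \
ghcr.io/open-webui/open-webui:main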
Part 4: Developer Tool Integration
Ollama exposes an OpenAI-compatible API at http://localhost:11434/v1. This means any tool that supports the OpenAI API can connect to your local models.
🔧 Continue.dev (VS Code / JetBrains)
Open-source AI code assistant plugin for VS Code and JetBrains IDEs. Works natively with Ollama.
{
"models": [
{
"title": "Ollama - Qwen Coder",
"provider": "ollama",
"model": "qwen2.5-coder:32b",
"apiBase": "http://localhost:11434"
},
{
"title": "Ollama - DeepSeek R1",
"provider": "ollama",
"model": "deepseek-r1",
"apiBase": "http://localhost:11434"
}
],
"tabAutocompleteModel": {
"title": "Ollama Autocomplete",
"provider": "ollama",
"model": "qwen2.5-coder:7b",
"apiBase": "http://localhost:11434"
}
}
⚡ Cursor IDE
Cursor supports custom OpenAI-compatible endpoints. Point it to your local Ollama server:
# In Cursor Settings, add a custom model:
# API Base URL: http://localhost:11434/v1
# API Key: ollama (any string works, it's not validated)
# Model name: llama3.2 (or any model you have pulled)
# The OpenAI-compatible endpoint:
# http://localhost:11434/v1/chat/completions
Note: Local models may be slower than cloud APIs for large codebases. Best for smaller, focused tasks.
🤖 Claude Code & Other Tools
Any tool that speaks the OpenAI API protocol can connect to Ollama:
# Chat completion (OpenAI-compatible)
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.2",
"messages": [
{"role": "user", "content": "Hello, how are you?"}
]
}'
# This works with any OpenAI-compatible client:
# - Python: openai library with base_url="http://localhost:11434/v1"
# - Node.js: openai package with baseURL config
# - Any HTTP client
🐍 Python Integration
# pip install openai
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:11434/v1",
api_key="ollama" # any string works
)
response = client.chat.completions.create(
model="llama3.2",
messages=[
{"role": "system", "content": "You are a helpful coding assistant."},
{"role": "user", "content": "Write a Python function to find prime numbers"}
]
)
print(response.choices[0].message.content)
Performance Tips
1. Use Quantized Models
Quantization reduces model precision (e.g., from 16-bit to 4-bit) to dramatically reduce VRAM usage with minimal quality loss. Most Ollama models are already quantized to Q4_K_M by default.
# Check a model's quantization level
ollama show llama3.2 --modelfile
# Most models default to Q4_K_M (4-bit quantization)
# This gives ~95% of full quality at ~25% of the VRAM
# Quantization levels (more bits = better quality, more VRAM):
# Q4_0 — fastest, lowest quality
# Q4_K_M — best balance (default for most models)
# Q5_K_M — slightly better quality
# Q8_0 — near-original quality, 2x VRAM vs Q4
# F16 — full precision, maximum VRAM
2. GPU Layer Offloading
If your model doesn't fully fit in VRAM, Ollama automatically splits it between GPU and CPU. You can control how many layers run on the GPU:
# Set number of GPU layers (higher = more on GPU = faster)
OLLAMA_NUM_GPU=35 ollama run llama3.2
# Set to 0 for CPU-only mode
OLLAMA_NUM_GPU=0 ollama run llama3.2
# Let Ollama auto-detect (default behavior)
ollama run llama3.2
3. Optimize Context Window Size
Larger context windows use more VRAM. Reduce it if you don't need long conversations:
# Set context window size (default varies by model)
ollama run llama3.2 --num-ctx 2048 # Smaller = faster, less VRAM
ollama run llama3.2 --num-ctx 8192 # Larger = handles longer input
ollama run llama3.2 --num-ctx 32768 # Maximum for most models
4. Pick the Right Model for the Task
For coding:
Qwen 2.5 Coder 32B > Qwen 2.5 Coder 7B > DeepSeek R1 7B
For math/reasoning:
DeepSeek R1 > Llama 4 Scout > Qwen 2.5 32B
For general chat:
Llama 3.2 8B > Mistral 7B > Gemma 2 9B
For low VRAM (4-8GB):
Phi 3 3.8B > Llama 3.2 3B > Gemma 2 2B
Ollama API Reference
Ollama runs a REST API on http://localhost:11434. It also provides an OpenAI-compatible endpoint at /v1.
Native Ollama API
# Generate a completion (streaming)
curl http://localhost:11434/api/generate -d '{
"model": "llama3.2",
"prompt": "Why is the sky blue?"
}'
# Chat completion (multi-turn)
curl http://localhost:11434/api/chat -d '{
"model": "llama3.2",
"messages": [
{"role": "user", "content": "What is 2+2?"},
{"role": "assistant", "content": "4"},
{"role": "user", "content": "And 2+3?"}
]
}'
# List local models
curl http://localhost:11434/api/tags
# Show model info
curl http://localhost:11434/api/show -d '{"name": "llama3.2"}'
# Pull a model programmatically
curl http://localhost:11434/api/pull -d '{"name": "mistral"}'
# Generate embeddings
curl http://localhost:11434/api/embeddings -d '{
"model": "llama3.2",
"prompt": "Ollama is awesome"
}'
OpenAI-Compatible API
# Chat completions (drop-in replacement for OpenAI)
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.2",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Tell me a joke"}
],
"temperature": 0.7,
"max_tokens": 200
}'
# List available models
curl http://localhost:11434/v1/models
# This means you can use:
# - Python openai library
# - Node.js openai package
# - Any OpenAI-compatible SDK or tool
# Just change the base URL to http://localhost:11434/v1
Model Comparison Table
Benchmark Highlights (2026)
| Benchmark | Best Open Model | Score | GPT-4o |
|---|---|---|---|
| AIME Math Reasoning | DeepSeek R1 | 79.8% | 9.3% |
| MMLU Knowledge | Llama 4 Maverick | 88.2% | 88.7% |
| HumanEval Code | Qwen 2.5 Coder | 92% | 90.2% |
Open models are beating GPT-4o in specialized tasks. The gap for general knowledge (MMLU) is nearly closed.
Complete Model Comparison
| Model | Parameters | Download Size | Min VRAM | License | Best For |
|---|---|---|---|---|---|
| Llama 3.2 3B | 3B | 2.0GB | 4GB | Community | Light tasks, edge |
| Llama 3.2 8B | 8B | 4.7GB | 8GB | Community | General purpose |
| Llama 3.2 70B | 70B | 40GB | 40GB+ | Community | High-quality gen |
| Llama 4 Scout | 109B MoE | ~12GB | 12GB | Community | Best value |
| Llama 4 Maverick | 400B MoE | ~24GB | 24GB+ | Community | Long context |
| DeepSeek R1 7B | 7B | ~4.7GB | 8GB | MIT | Reasoning |
| DeepSeek R1 70B | 70B | ~40GB | 24GB (Q4) | MIT | Math, science |
| Qwen 2.5 Coder 32B | 32B | ~20GB | 24GB | Apache 2.0 | Code gen |
| Mistral 7B | 7B | 4.1GB | 8GB | Apache 2.0* | General purpose |
| Gemma 2 9B | 9B | ~5.4GB | 8GB | Gemma License | Efficient gen |
| Phi 3 3.8B | 3.8B | ~2.3GB | 4GB | MIT | Small & capable |
* Some Mistral models have commercial restrictions at scale. Check the specific model license before commercial use.
Licensing Quick Guide
Most Permissive (Commercial OK)
- DeepSeek R1: MIT — fully commercial, no restrictions
- Qwen 2.5 Coder: Apache 2.0 — commercial use allowed
- Phi 3: MIT — fully permissive
Commercial with Conditions
- Llama 4: Community License — commercial with conditions
- Mistral: Some restrictions at commercial scale
- Gemma 2: Gemma License — check specific terms
Troubleshooting
✕ "Error: model not found"
The model hasn't been downloaded yet. Pull it first:
# Pull the model before running it
ollama pull llama3.2
# Or use ollama run (auto-pulls if missing)
ollama run llama3.2
# Check what models you have downloaded:
ollama list
✕ "Error: could not connect to ollama server"
The Ollama server isn't running. Start it:
# Start the server manually
ollama serve
# On macOS: open the Ollama app (it starts the server)
# On Linux: check the systemd service
sudo systemctl status ollama
sudo systemctl start ollama
# Verify the server is running
curl http://localhost:11434
# Should return: "Ollama is running"
✕ Out of memory / model too large
Your system doesn't have enough RAM or VRAM for the model. Try a smaller model or reduce context:
# Switch to a smaller model
ollama run llama3.2:3b # Instead of the 8B default
# Reduce context window to use less memory
ollama run llama3.2 --num-ctx 2048
# Force CPU-only mode (uses system RAM instead of VRAM)
OLLAMA_NUM_GPU=0 ollama run llama3.2
# Check VRAM usage
nvidia-smi # NVIDIA GPUs
# For Apple Silicon, check Activity Monitor → Memory
✕ GPU not detected / running on CPU
Ollama isn't using your GPU. Check drivers and CUDA:
# Check if NVIDIA driver is installed
nvidia-smi
# Check CUDA version
nvcc --version
# For NVIDIA, you need:
# - NVIDIA driver 470+ (for CUDA 11.x support)
# - The ollama install script should handle CUDA setup
# Restart Ollama after driver updates
sudo systemctl restart ollama # Linux
# Or quit and reopen the Ollama app (macOS/Windows)
# For AMD on Linux, ensure ROCm is installed:
# https://rocm.docs.amd.com/
✕ Port 11434 already in use
Another process (or another Ollama instance) is using the default port:
# Find what's using the port
lsof -i :11434 # macOS/Linux
netstat -ano | findstr 11434 # Windows
# Kill the existing process, or use a different port:
OLLAMA_HOST=0.0.0.0:11435 ollama serve
# Update your clients to use the new port:
# http://localhost:11435
Frequently Asked Questions
Is Ollama free? What's the catch?
Ollama is completely free and open-source. There's no catch — no subscriptions, no usage limits, no data collection. The models are also free to download (though some have license restrictions for commercial use). You only "pay" with your hardware resources (RAM, GPU, disk space, electricity).
How do local models compare to ChatGPT / GPT-4o?
For specialized tasks, local models can match or beat GPT-4o. DeepSeek R1 scores 79.8% on AIME math (vs GPT-4o's 9.3%). Qwen 2.5 Coder gets 92% on HumanEval (vs GPT-4o's 90.2%). For broad general knowledge and instruction following, GPT-4o still has an edge with smaller models, but the gap closes rapidly with 70B+ models and Llama 4 Scout/Maverick.
Can I run Ollama on a Mac with 8GB RAM?
Yes! Apple Silicon Macs share memory between CPU and GPU, making them surprisingly good for local AI. An M1/M2 with 8GB can run 3B-7B models comfortably. For 8B models, expect usable but not blazing speed. 16GB is the sweet spot for Apple Silicon — it can handle most 7B-13B models with good performance.
Does Ollama send any data to the internet?
No. Once a model is downloaded, Ollama runs 100% locally. It makes no outbound network connections during inference. The only internet access is when you ollama pull a new model. You can even disconnect from the internet entirely after downloading your models.
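One way to spot-check this yourself on macOS or Linux is to look at Ollama's open network sockets while a chat is running; you would typically see only the local listener and loopback connections on port 11434 (output shown in the comments is an expectation, not guaranteed):
# Inspect Ollama's network activity during inference
lsof -i -P -n | grep -i ollama
# Expected: LISTEN/ESTABLISHED entries on 127.0.0.1:11434 (or your
# configured OLLAMA_HOST), with no connections to external hosts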
Can I fine-tune models with Ollama?
Ollama supports creating custom models with Modelfiles (similar to Dockerfiles). You can customize system prompts, parameters, and create model variants. For actual fine-tuning (training on your data), you'd need tools like Unsloth or Axolotl, then import the resulting model into Ollama.
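A minimal sketch of that Modelfile workflow, building a custom variant on top of llama3.2 (the model name, parameter value, and system prompt here are just illustrative):
# Create a Modelfile describing the custom model
cat > Modelfile <<'EOF'
FROM llama3.2
PARAMETER temperature 0.3
SYSTEM """You are a terse code reviewer. Reply with bullet points only."""
EOF
# Build the custom model and run it
ollama create code-reviewer -f Modelfile
ollama run code-reviewer "Review this function: def add(a, b): return a + b"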
What's the best model for coding?
Qwen 2.5 Coder 32B is the current champion with 92% on HumanEval. If you don't have 24GB VRAM, the 7B variant is still excellent. DeepSeek R1 is also great for code that requires complex reasoning (algorithms, system design). For quick autocomplete, Qwen 2.5 Coder 7B offers the best speed-to-quality ratio.
Can I use Ollama for commercial products?
Yes, but check the model license. Ollama itself is MIT-licensed. Models like DeepSeek R1 (MIT) and Qwen (Apache 2.0) are fully commercial. Llama models have a community license with some conditions for large-scale commercial use. Mistral has varying licenses per model. Always check the specific model's license before deploying commercially.
Related Articles
- Self-hosted AI assistant tutorial
- Cursor vs Claude Code 2026: AI coding assistants compared
- Best Tech Stack for SaaS 2026: Complete guide to building modern SaaS
- Build an AI App with Next.js: Step-by-step AI app tutorial
- Lovable vs Bolt vs v0 2026: AI app builders comparison
- Solo Founder Tech Stack 2026: Build your MVP for under $50/month