How to Run AI Models Locally with Ollama (2026 Setup Guide)
The definitive guide to running LLMs on your own hardware. Install Ollama, pull models like Llama 4, DeepSeek R1, and Mistral, set up a ChatGPT-like interface, and integrate with your dev tools — all free, private, and offline.
Why Run AI Locally?
Four reasons developers are moving away from cloud AI APIs and running models on their own hardware.
Privacy
Your data never leaves your machine. No conversations logged to third-party servers. No training on your prompts. Complete data sovereignty.
Zero Cost
No per-token API fees. No monthly subscriptions. Download a model once and run it unlimited times. Your hardware, your rules.
Speed
Zero network latency. Responses start instantly. No waiting for server round-trips. Especially noticeable for short, frequent prompts.
Offline
Works without internet after the initial model download. Perfect for travel, restricted networks, or air-gapped environments.
The 2026 Reality: Open-source models have caught up dramatically. DeepSeek R1 beats GPT-4o on math reasoning. Qwen 2.5 Coder matches it on code generation. Llama 4 Scout delivers near-70B quality with only 12GB of VRAM. Running locally is no longer a compromise — it's a competitive advantage.
What is Ollama?
Ollama is a free, open-source tool for running large language models locally on your own hardware. It handles the complexity of downloading, configuring, and serving AI models — so you can go from zero to chatting with an LLM in under 5 minutes.
Think of it as Docker for AI models. You ollama pull a model, ollama run it, and start talking. Ollama automatically detects your hardware — NVIDIA GPUs (CUDA), Apple Silicon (Metal), AMD GPUs (ROCm on Linux) — and optimizes performance accordingly.
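As a quick sketch of that workflow (model name is just an example), the whole loop is two commands, and ollama ps, run in a second terminal while the model is loaded, reports whether it landed on the GPU or CPU:
# Download a model, then chat with it
ollama pull llama3.2
ollama run llama3.2
# In another terminal, while the model is loaded:
ollama ps
# The PROCESSOR column shows the GPU/CPU split (e.g. "100% GPU")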
- 100+ available models
- 3 OS platforms
- OpenAI-compatible API
- Free and open source
Platform Support: macOS 11+ (Intel & Apple Silicon), Linux (Ubuntu 18.04+), Windows (via WSL2 or native installer). Download from ollama.com.
Hardware Requirements
What you can run depends on your RAM and GPU. Here's a breakdown by hardware tier.
System RAM Requirements
8GB RAM — Small Models (3B parameters)
Requires a reasonably modern CPU. Enough for Llama 3.2 1B/3B or Phi 3 3.8B. Expect ~5-10 tokens/sec on CPU only.
16GB RAM — 7B Models
Recommended for Mistral 7B, Llama 3.2 8B. Good balance of quality and speed.
32GB RAM — 13B+ Models
Run Qwen 2.5 Coder 32B, Gemma 2 27B, and larger models comfortably.
GPU VRAM Guide
A GPU dramatically speeds up inference. Here's what you can run at each VRAM tier:
| GPU VRAM | Models You Can Run | Example GPUs |
|---|---|---|
| 4GB | Small models (3B params): Llama 3.2 1B/3B, Phi 3 3.8B | GTX 1650, RX 6500 XT |
| 8GB | 7B models: Mistral 7B, Llama 3.2 8B, Gemma 2 9B | RTX 3060/4060, M1/M2 (shared) |
| 12GB | Llama 4 Scout (109B MoE, 17B active) — near-70B quality | RTX 3060 12GB, RTX 4070 |
| 24GB | 13-32B models, DeepSeek R1 70B distilled (Q4 quantization) | RTX 3090/4090, M2 Pro/Max |
| 40GB+ | 70B+ models: Full Llama 3.2 70B, DeepSeek R1 full | A100, H100, M2 Ultra |
CPU-Only Performance
No GPU? You can still run models on CPU, but performance varies significantly:
| CPU | Speed (7B model, Q4) |
|---|---|
| Intel Core i7 | ~7.5 tok/s |
| AMD Ryzen 5 | ~12.3 tok/s |
A GPU is highly recommended for a decent interactive experience. CPU-only works for batch processing or when you can tolerate slower responses.
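Your numbers will vary with CPU, quantization, and prompt length. To measure throughput on your own machine, run a one-shot prompt with the --verbose flag; Ollama prints timing statistics after the response (the figure below is illustrative):
# Benchmark generation speed on your hardware
ollama run llama3.2 --verbose "Explain DNS in two sentences"
# Look for the "eval rate" line in the printed stats, e.g.:
# eval rate: 11.42 tokens/s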
Storage Tip: Models can be large (1-40GB+ each). A 512GB SSD is recommended if you plan to keep multiple models downloaded. Models are stored in ~/.ollama/models by default.
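To check how much space your models are using, or to keep them on a bigger drive, you can inspect the default directory and point Ollama elsewhere with the OLLAMA_MODELS environment variable (the path below is just an example):
# Check disk usage of downloaded models
du -sh ~/.ollama/models
# ollama list also shows the size of each downloaded model
ollama list
# Store models on another drive (example path) by setting OLLAMA_MODELS
# before starting the server:
OLLAMA_MODELS=/mnt/bigdisk/ollama-models ollama serve
# On Linux systemd installs, set this in the ollama.service environment instead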
Part 1: Installing Ollama
Four installation methods. Pick the one that matches your system and workflow.
macOS (Intel & Apple Silicon)
Two options — the official installer (easiest) or Homebrew (for package manager fans):
# Download from ollama.com and run the .dmg installer
# Supports macOS 11 (Big Sur) and later
# Auto-detects Apple Silicon (M1/M2/M3/M4) for Metal acceleration
# After install, verify:
ollama --version
# Install via Homebrew
brew install ollama
# Verify installation
ollama --version
Linux (Ubuntu 18.04+, Debian, Fedora, Arch)
One-liner install script that detects your distro and GPU automatically:
# Official install script (auto-detects GPU)
curl -fsSL https://ollama.com/install.sh | sh
# Verify installation
ollama --version
# The script automatically:
# - Detects NVIDIA GPUs and installs CUDA support
# - Detects AMD GPUs and configures ROCm
# - Creates a systemd service (ollama.service)
# - Starts the server on port 11434
Windows
Native Windows support or install via the Windows Package Manager:
# Download the Windows installer from ollama.com
# Runs natively — no WSL2 required (since late 2024)
# Supports NVIDIA CUDA GPUs out of the box
# Install via Windows Package Manager
winget install Ollama.Ollama
# Verify installation (open new terminal)
ollama --version
Docker (Any Platform)
Ideal for servers, CI/CD pipelines, or keeping your system clean:
# Run Ollama in Docker with NVIDIA GPU support
docker run -d \
--gpus all \
-v ollama:/root/.ollama \
-p 11434:11434 \
--name ollama \
ollama/ollama
# Without GPU (CPU only)
docker run -d \
-v ollama:/root/.ollama \
-p 11434:11434 \
--name ollama \
ollama/ollama
# Pull and run a model inside the container
docker exec -it ollama ollama pull llama3.2
docker exec -it ollama ollama run llama3.2
After Installation: Verify Everything Works
Check Ollama is installed
ollama --version
# Should output something like: ollama version 0.6.x
Start the server (if not auto-started)
ollama serve
# Server starts on http://localhost:11434
# On macOS/Windows, the app starts the server automatically
# On Linux, systemd manages the service
Test the API is responding
curl http://localhost:11434
# Should return: "Ollama is running"
Part 2: Your First Model
Let's download and run your first AI model. We'll start with Llama 3.2 — Meta's versatile open model.
Download (Pull) a Model
# Pull Llama 3.2 (8B parameters, 4.7GB download)
ollama pull llama3.2
# The download progress will show:
# pulling manifest...
# pulling abc123... 100% ████████████████ 4.7 GB
# verifying sha256 digest...
# writing manifest...
# success
First pull takes a few minutes depending on your internet speed. The model is cached locally — subsequent runs are instant.
Run the Model Interactively
# Start an interactive chat session
ollama run llama3.2
# You'll see a prompt:
# >>> Send a message (/? for help)
# Try these prompts:
>>> Explain how HTTP caching works in 3 sentences
>>> Write a Python function to merge two sorted lists
>>> What are the pros and cons of microservices?
# Type /bye to exit the session
Essential Commands
# Download a model
ollama pull llama3.2
# Run a model (auto-pulls if not downloaded)
ollama run llama3.2
# List all downloaded models
ollama list
# Show detailed model info (size, parameters, template)
ollama show llama3.2
# Remove a model (free disk space)
ollama rm llama3.2
# Start the Ollama server
ollama serve
# Show currently running/loaded models
ollama ps
# Copy a model (create an alias)
ollama cp llama3.2 my-custom-llama
Quick-Run Examples
# One-shot prompt (no interactive session)
ollama run llama3.2 "Summarize the benefits of TypeScript in 3 bullet points"
# Pipe input from a file
cat README.md | ollama run llama3.2 "Summarize this document"
# Use a specific model for coding
ollama run qwen2.5-coder:32b "Write a React hook for debouncing"
# Use DeepSeek for reasoning tasks
ollama run deepseek-r1 "Prove that the square root of 2 is irrational"
Best Models to Run Locally (2026)
Not all models are created equal. Here are the best options for different use cases, with exact download sizes and VRAM requirements.
Llama 3.2
Llama Community License
Best for: General-purpose tasks, conversation, writing, analysis
Llama 4 Scout
Llama Community License
Best for: Near-70B quality on consumer GPUs — best bang for buck
Llama 4 Maverick
Llama Community License
Best for: 10M token context window, complex long-document analysis
DeepSeek R1
MIT (fully commercial)
Best for: Reasoning, math, science, logic — best chain-of-thought
Qwen 2.5 Coder 32B
Apache 2.0
Best for: Code generation, completion, review — best coding model
Mistral 7B
Apache 2.0 (with conditions)
Best for: Great general-purpose model for consumer hardware
Gemma 2
Gemma License
Best for: Google's efficient open model, good for resource-constrained setups
Phi 3
MIT
Best for: Microsoft's small but surprisingly capable model
Pull Any of These Models
# Meta Llama
ollama pull llama3.2 # 8B (default)
ollama pull llama3.2:3b # 3B variant
ollama pull llama3.2:70b # 70B variant (needs 40GB+ VRAM)
# Llama 4
ollama pull llama4-scout # 109B MoE, 17B active
ollama pull llama4-maverick # 400B MoE, 17B active
# DeepSeek R1
ollama pull deepseek-r1 # Default size
ollama pull deepseek-r1:7b # 7B distilled
ollama pull deepseek-r1:70b # 70B distilled
# Qwen 2.5 Coder
ollama pull qwen2.5-coder:32b # Best coding model
ollama pull qwen2.5-coder:7b # Smaller coding model
# Others
ollama pull mistral # Mistral 7B
ollama pull gemma2:9b # Google Gemma 2
ollama pull phi3 # Microsoft Phi 3
Part 3: Open WebUI — ChatGPT-like Interface
The terminal is powerful, but sometimes you want a polished chat interface. Open WebUI gives you a ChatGPT-like experience for your local models — with chat history, model switching, and more.
Quick Setup with Docker
# Pull and run Open WebUI (connects to local Ollama)
docker run -d \
-p 3000:8080 \
--add-host=host.docker.internal:host-gateway \
-v open-webui:/app/backend/data \
--name open-webui \
--restart always \
ghcr.io/open-webui/open-webui:main
# Access the UI at:
# http://localhost:3000
# For NVIDIA GPU acceleration in Open WebUI
docker run -d \
-p 3000:8080 \
--gpus all \
--add-host=host.docker.internal:host-gateway \
-v open-webui:/app/backend/data \
--name open-webui \
--restart always \
ghcr.io/open-webui/open-webui:cuda
# For AMD GPU support, use the :rocm tag instead
Chat Features
- Full chat history with search
- Model selector dropdown (switch models mid-chat)
- File uploads and document analysis
- Markdown rendering with syntax highlighting
Admin Features
- Multi-user support with role-based access
- Dark mode / light mode toggle
- System prompt customization per model
- Import/export conversations
First Login: The first account you create becomes the admin. Open WebUI stores data locally in the Docker volume — nothing is sent to external servers. Access at http://localhost:3000.
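The Docker commands above assume Ollama is running on the same machine. If your Ollama server lives elsewhere (or listens on a non-default port), Open WebUI can be pointed at it via its OLLAMA_BASE_URL environment variable; the hostname below is just an example:
# Point Open WebUI at a remote Ollama server (example URL)
docker run -d \
-p 3000:8080 \
-e OLLAMA_BASE_URL=http://192.168.1.50:11434 \
-v open-webui:/app/backend/data \
--name open-webui \
--restart always \
ghcr.io/open-webui/open-webui:main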
Part 4: Developer Tool Integration
Ollama exposes an OpenAI-compatible API at http://localhost:11434/v1. This means any tool that supports the OpenAI API can connect to your local models.
🔧 Continue.dev (VS Code / JetBrains)
Open-source AI code assistant plugin for VS Code and JetBrains IDEs. Works natively with Ollama.
{
"models": [
{
"title": "Ollama - Qwen Coder",
"provider": "ollama",
"model": "qwen2.5-coder:32b",
"apiBase": "http://localhost:11434"
},
{
"title": "Ollama - DeepSeek R1",
"provider": "ollama",
"model": "deepseek-r1",
"apiBase": "http://localhost:11434"
}
],
"tabAutocompleteModel": {
"title": "Ollama Autocomplete",
"provider": "ollama",
"model": "qwen2.5-coder:7b",
"apiBase": "http://localhost:11434"
}
}
⚡ Cursor IDE
Cursor supports custom OpenAI-compatible endpoints. Point it to your local Ollama server:
# In Cursor Settings, add a custom model:
# API Base URL: http://localhost:11434/v1
# API Key: ollama (any string works, it's not validated)
# Model name: llama3.2 (or any model you have pulled)
# The OpenAI-compatible endpoint:
# http://localhost:11434/v1/chat/completions
Note: Local models may be slower than cloud APIs for large codebases. Best for smaller, focused tasks.
🤖 Claude Code & Other Tools
Any tool that speaks the OpenAI API protocol can connect to Ollama:
# Chat completion (OpenAI-compatible)
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.2",
"messages": [
{"role": "user", "content": "Hello, how are you?"}
]
}'
# This works with any OpenAI-compatible client:
# - Python: openai library with base_url="http://localhost:11434/v1"
# - Node.js: openai package with baseURL config
# - Any HTTP client
🐍 Python Integration
# pip install openai
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:11434/v1",
api_key="ollama" # any string works
)
response = client.chat.completions.create(
model="llama3.2",
messages=[
{"role": "system", "content": "You are a helpful coding assistant."},
{"role": "user", "content": "Write a Python function to find prime numbers"}
]
)
print(response.choices[0].message.content)
Performance Tips
1. Use Quantized Models
Quantization reduces model precision (e.g., from 16-bit to 4-bit) to dramatically reduce VRAM usage with minimal quality loss. Most Ollama models are already quantized to Q4_K_M by default.
# Check a model's quantization level
ollama show llama3.2 --modelfile
# Most models default to Q4_K_M (4-bit quantization)
# This gives ~95% of full quality at ~25% of the VRAM
# Quantization levels (more bits = better quality, more VRAM):
# Q4_0 — fastest, lowest quality
# Q4_K_M — best balance (default for most models)
# Q5_K_M — slightly better quality
# Q8_0 — near-original quality, 2x VRAM vs Q4
# F16 — full precision, maximum VRAM
2. GPU Layer Offloading
If your model doesn't fully fit in VRAM, Ollama automatically splits it between GPU and CPU. You can control how many layers run on the GPU:
# Set number of GPU layers (higher = more on GPU = faster)
OLLAMA_NUM_GPU=35 ollama run llama3.2
# Set to 0 for CPU-only mode
OLLAMA_NUM_GPU=0 ollama run llama3.2
# Let Ollama auto-detect (default behavior)
ollama run llama3.2
3. Optimize Context Window Size
Larger context windows use more VRAM. Reduce it if you don't need long conversations:
# Set context window size (default varies by model)
ollama run llama3.2 --num-ctx 2048 # Smaller = faster, less VRAM
ollama run llama3.2 --num-ctx 8192 # Larger = handles longer input
ollama run llama3.2 --num-ctx 32768 # Maximum for most models
4. Pick the Right Model for the Task
For coding:
Qwen 2.5 Coder 32B > Qwen 2.5 Coder 7B > DeepSeek R1 7B
For math/reasoning:
DeepSeek R1 > Llama 4 Scout > Qwen 2.5 32B
For general chat:
Llama 3.2 8B > Mistral 7B > Gemma 2 9B
For low VRAM (4-8GB):
Phi 3 3.8B > Llama 3.2 3B > Gemma 2 2B
Ollama API Reference
Ollama runs a REST API on http://localhost:11434. It also provides an OpenAI-compatible endpoint at /v1.
Native Ollama API
# Generate a completion (streaming)
curl http://localhost:11434/api/generate -d '{
"model": "llama3.2",
"prompt": "Why is the sky blue?"
}'
# Chat completion (multi-turn)
curl http://localhost:11434/api/chat -d '{
"model": "llama3.2",
"messages": [
{"role": "user", "content": "What is 2+2?"},
{"role": "assistant", "content": "4"},
{"role": "user", "content": "And 2+3?"}
]
}'
# List local models
curl http://localhost:11434/api/tags
# Show model info
curl http://localhost:11434/api/show -d '{"name": "llama3.2"}'
# Pull a model programmatically
curl http://localhost:11434/api/pull -d '{"name": "mistral"}'
# Generate embeddings
curl http://localhost:11434/api/embeddings -d '{
"model": "llama3.2",
"prompt": "Ollama is awesome"
}'
OpenAI-Compatible API
# Chat completions (drop-in replacement for OpenAI)
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.2",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Tell me a joke"}
],
"temperature": 0.7,
"max_tokens": 200
}'
# List available models
curl http://localhost:11434/v1/models
# This means you can use:
# - Python openai library
# - Node.js openai package
# - Any OpenAI-compatible SDK or tool
# Just change the base URL to http://localhost:11434/v1
Model Comparison Table
Benchmark Highlights (2026)
| Benchmark | Best Open Model | Score | GPT-4o |
|---|---|---|---|
| AIME Math Reasoning | DeepSeek R1 | 79.8% | 9.3% |
| MMLU Knowledge | Llama 4 Maverick | 88.2% | 88.7% |
| HumanEval Code | Qwen 2.5 Coder | 92% | 90.2% |
Open models are beating GPT-4o in specialized tasks. The gap for general knowledge (MMLU) is nearly closed.
Complete Model Comparison
| Model | Parameters | Download Size | Min VRAM | License | Best For |
|---|---|---|---|---|---|
| Llama 3.2 3B | 3B | 2.0GB | 4GB | Community | Light tasks, edge |
| Llama 3.2 8B | 8B | 4.7GB | 8GB | Community | General purpose |
| Llama 3.2 70B | 70B | 40GB | 40GB+ | Community | High-quality gen |
| Llama 4 Scout | 109B MoE | ~12GB | 12GB | Community | Best value |
| Llama 4 Maverick | 400B MoE | ~24GB | 24GB+ | Community | Long context |
| DeepSeek R1 7B | 7B | ~4.7GB | 8GB | MIT | Reasoning |
| DeepSeek R1 70B | 70B | ~40GB | 24GB (Q4) | MIT | Math, science |
| Qwen 2.5 Coder 32B | 32B | ~20GB | 24GB | Apache 2.0 | Code gen |
| Mistral 7B | 7B | 4.1GB | 8GB | Apache 2.0* | General purpose |
| Gemma 2 9B | 9B | ~5.4GB | 8GB | Gemma License | Efficient gen |
| Phi 3 3.8B | 3.8B | ~2.3GB | 4GB | MIT | Small & capable |
* Some Mistral models have commercial restrictions at scale. Check the specific model license before commercial use.
Licensing Quick Guide
Most Permissive (Commercial OK)
- DeepSeek R1: MIT — fully commercial, no restrictions
- Qwen 2.5 Coder: Apache 2.0 — commercial use allowed
- Phi 3: MIT — fully permissive
Commercial with Conditions
- Llama 4: Community License — commercial with conditions
- Mistral: Some restrictions at commercial scale
- Gemma 2: Gemma License — check specific terms
Troubleshooting
✕ "Error: model not found"
The model hasn't been downloaded yet. Pull it first:
# Pull the model before running it
ollama pull llama3.2
# Or use ollama run (auto-pulls if missing)
ollama run llama3.2
# Check what models you have downloaded:
ollama list
✕ "Error: could not connect to ollama server"
The Ollama server isn't running. Start it:
# Start the server manually
ollama serve
# On macOS: open the Ollama app (it starts the server)
# On Linux: check the systemd service
sudo systemctl status ollama
sudo systemctl start ollama
# Verify the server is running
curl http://localhost:11434
# Should return: "Ollama is running"
✕ Out of memory / model too large
Your system doesn't have enough RAM or VRAM for the model. Try a smaller model or reduce context:
# Switch to a smaller model
ollama run llama3.2:3b # Instead of the 8B default
# Reduce context window to use less memory
ollama run llama3.2 --num-ctx 2048
# Force CPU-only mode (uses system RAM instead of VRAM)
OLLAMA_NUM_GPU=0 ollama run llama3.2
# Check VRAM usage
nvidia-smi # NVIDIA GPUs
# For Apple Silicon, check Activity Monitor → Memory
✕ GPU not detected / running on CPU
Ollama isn't using your GPU. Check drivers and CUDA:
# Check if NVIDIA driver is installed
nvidia-smi
# Check CUDA version
nvcc --version
# For NVIDIA, you need:
# - NVIDIA driver 470+ (for CUDA 11.x support)
# - The ollama install script should handle CUDA setup
# Restart Ollama after driver updates
sudo systemctl restart ollama # Linux
# Or quit and reopen the Ollama app (macOS/Windows)
# For AMD on Linux, ensure ROCm is installed:
# https://rocm.docs.amd.com/
✕ Port 11434 already in use
Another process (or another Ollama instance) is using the default port:
# Find what's using the port
lsof -i :11434 # macOS/Linux
netstat -ano | findstr 11434 # Windows
# Kill the existing process, or use a different port:
OLLAMA_HOST=0.0.0.0:11435 ollama serve
# Update your clients to use the new port:
# http://localhost:11435
Frequently Asked Questions
Is Ollama free? What's the catch?
Ollama is completely free and open-source. There's no catch — no subscriptions, no usage limits, no data collection. The models are also free to download (though some have license restrictions for commercial use). You only "pay" with your hardware resources (RAM, GPU, disk space, electricity).
How do local models compare to ChatGPT / GPT-4o?
For specialized tasks, local models can match or beat GPT-4o. DeepSeek R1 scores 79.8% on AIME math (vs GPT-4o's 9.3%). Qwen 2.5 Coder gets 92% on HumanEval (vs GPT-4o's 90.2%). For broad general knowledge and instruction following, GPT-4o still has an edge with smaller models, but the gap closes rapidly with 70B+ models and Llama 4 Scout/Maverick.
Can I run Ollama on a Mac with 8GB RAM?
Yes! Apple Silicon Macs share memory between CPU and GPU, making them surprisingly good for local AI. An M1/M2 with 8GB can run 3B-7B models comfortably. For 8B models, expect usable but not blazing speed. 16GB is the sweet spot for Apple Silicon — it can handle most 7B-13B models with good performance.
Does Ollama send any data to the internet?
No. Once a model is downloaded, Ollama runs 100% locally. It makes no outbound network connections during inference. The only internet access is when you ollama pull a new model. You can even disconnect from the internet entirely after downloading your models.
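One way to spot-check this yourself on macOS or Linux is to look at Ollama's open network sockets while a chat is running; you would typically see only the local listener and loopback connections on port 11434 (output shown in the comments is an expectation, not guaranteed):
# Inspect Ollama's network activity during inference
lsof -i -P -n | grep -i ollama
# Expected: LISTEN/ESTABLISHED entries on 127.0.0.1:11434 (or your
# configured OLLAMA_HOST), with no connections to external hosts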
Can I fine-tune models with Ollama?
Ollama supports creating custom models with Modelfiles (similar to Dockerfiles). You can customize system prompts, parameters, and create model variants. For actual fine-tuning (training on your data), you'd need tools like Unsloth or Axolotl, then import the resulting model into Ollama.
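A minimal sketch of that Modelfile workflow, building a custom variant on top of llama3.2 (the model name, parameter value, and system prompt here are just illustrative):
# Create a Modelfile describing the custom model
cat > Modelfile <<'EOF'
FROM llama3.2
PARAMETER temperature 0.3
SYSTEM """You are a terse code reviewer. Reply with bullet points only."""
EOF
# Build the custom model and run it
ollama create code-reviewer -f Modelfile
ollama run code-reviewer "Review this function: def add(a, b): return a + b"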
What's the best model for coding?
Qwen 2.5 Coder 32B is the current champion with 92% on HumanEval. If you don't have 24GB VRAM, the 7B variant is still excellent. DeepSeek R1 is also great for code that requires complex reasoning (algorithms, system design). For quick autocomplete, Qwen 2.5 Coder 7B offers the best speed-to-quality ratio.
Can I use Ollama for commercial products?
Yes, but check the model license. Ollama itself is MIT-licensed. Models like DeepSeek R1 (MIT) and Qwen (Apache 2.0) are fully commercial. Llama models have a community license with some conditions for large-scale commercial use. Mistral has varying licenses per model. Always check the specific model's license before deploying commercially.
Related Articles
- Self-hosted AI assistant tutorial
- Cursor vs Claude Code 2026: AI coding assistants compared
- Best Tech Stack for SaaS 2026: Complete guide to building modern SaaS
- Build an AI App with Next.js: Step-by-step AI app tutorial
- Lovable vs Bolt vs v0 2026: AI app builders comparison
- Solo Founder Tech Stack 2026: Build your MVP for under $50/month