AI Tools · Setup Guide

How to Run AI Models Locally with Ollama (2026 Setup Guide)

The definitive guide to running LLMs on your own hardware. Install Ollama, pull models like Llama 4, DeepSeek R1, and Mistral, set up a ChatGPT-like interface, and integrate with your dev tools — all free, private, and offline.

18 min read
Complete Tutorial
Published February 12, 2026

Why Run AI Locally?

Four reasons developers are moving away from cloud AI APIs and running models on their own hardware.

Privacy

Your data never leaves your machine. No conversations logged to third-party servers. No training on your prompts. Complete data sovereignty.

Zero Cost

No per-token API fees. No monthly subscriptions. Download a model once and run it unlimited times. Your hardware, your rules.

Speed

Zero network latency. Responses start instantly. No waiting for server round-trips. Especially noticeable for short, frequent prompts.

Offline

Works without internet after the initial model download. Perfect for travel, restricted networks, or air-gapped environments.

The 2026 Reality: Open-source models have caught up dramatically. DeepSeek R1 beats GPT-4o on math reasoning. Qwen 2.5 Coder matches it on code generation. Llama 4 Scout delivers near-70B quality with only 12GB of VRAM. Running locally is no longer a compromise — it's a competitive advantage.

What is Ollama?

Ollama is a free, open-source tool for running large language models locally on your own hardware. It handles the complexity of downloading, configuring, and serving AI models — so you can go from zero to chatting with an LLM in under 5 minutes.

Think of it as Docker for AI models. You ollama pull a model, ollama run it, and start talking. Ollama automatically detects your hardware — NVIDIA GPUs (CUDA), Apple Silicon (Metal), AMD GPUs (ROCm on Linux) — and optimizes performance accordingly.
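
The whole workflow fits in three commands (covered step by step in Part 2):

Terminal
# Download a model, start chatting, then see what's installed
ollama pull llama3.2
ollama run llama3.2
ollama list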

At a glance: 100+ available models · 3 OS platforms · OpenAI-compatible API · free and open source.

Platform Support: macOS 11+ (Intel & Apple Silicon), Linux (Ubuntu 18.04+), Windows (via WSL2 or native installer). Download from ollama.com.

Hardware Requirements

What you can run depends on your RAM and GPU. Here's a breakdown by hardware tier.

System RAM Requirements

Minimum

8GB RAM — Small Models (3B parameters)

Modern CPU required. Enough for Llama 3.2 1B/3B, Phi 3 3.8B. Expect ~5-10 tokens/sec on CPU only.

Good

16GB RAM — 7B Models

Recommended for Mistral 7B, Llama 3.2 8B. Good balance of quality and speed.

Ideal

32GB RAM — 13B+ Models

Run Qwen 2.5 Coder 32B, Gemma 2 27B, and larger models comfortably.

GPU VRAM Guide

A GPU dramatically speeds up inference. Here's what you can run at each VRAM tier:

GPU VRAM | Models You Can Run | Example GPUs
4GB | Small models (3B params): Llama 3.2 1B/3B, Phi 3 3.8B | GTX 1650, RX 6500 XT
8GB | 7B models: Mistral 7B, Llama 3.2 8B, Gemma 2 9B | RTX 3060/4060, M1/M2 (shared)
12GB | Llama 4 Scout (109B MoE, 17B active) — near-70B quality | RTX 3060 12GB, RTX 4070
24GB | 13-32B models, DeepSeek R1 70B distilled (Q4 quantization) | RTX 3090/4090, M2 Pro/Max
40GB+ | 70B+ models: Full Llama 3.2 70B, DeepSeek R1 full | A100, H100, M2 Ultra

CPU-Only Performance

No GPU? You can still run models on CPU, but performance varies significantly:

Intel Core i7: ~7.5 tok/s with a 7B model (Q4)
AMD Ryzen 5: ~12.3 tok/s with a 7B model (Q4)

A GPU is highly recommended for a decent interactive experience. CPU-only works for batch processing or when you can tolerate slower responses.
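
For example, here's a simple batch job that summarizes a folder of Markdown files with a local model. It's a sketch that assumes a docs/ directory and the llama3.2 model; adjust paths and model to taste:

Terminal
# Summarize every Markdown file in docs/ (slow on CPU, but hands-off)
for f in docs/*.md; do
  cat "$f" | ollama run llama3.2 "Summarize this document in 5 bullet points" > "${f%.md}.summary.txt"
done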

Storage Tip: Models can be large (1-40GB+ each). A 512GB SSD is recommended if you plan to keep multiple models downloaded. Models are stored in ~/.ollama/models by default.
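
A few housekeeping commands help keep that under control. The OLLAMA_MODELS variable for relocating storage is a documented Ollama setting; treat the paths below as placeholders:

Terminal
# See how much space your models are using
du -sh ~/.ollama/models

# Free space by removing models you no longer use
ollama list
ollama rm llama3.2:70b

# Store models on another drive (set before starting the server;
# on Linux, add the variable to the ollama systemd service)
OLLAMA_MODELS=/mnt/bigdisk/ollama-models ollama serve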

Part 1: Installing Ollama

Four installation methods. Pick the one that matches your system and workflow.

Recommended

macOS (Intel & Apple Silicon)

Two options — the official installer (easiest) or Homebrew (for package manager fans):

Option A: Official Installer
# Download from ollama.com and run the .dmg installer
# Supports macOS 11 (Big Sur) and later
# Auto-detects Apple Silicon (M1/M2/M3/M4) for Metal acceleration

# After install, verify:
ollama --version
Option B: Homebrew
# Install via Homebrew
brew install ollama

# Verify installation
ollama --version

Linux (Ubuntu 18.04+, Debian, Fedora, Arch)

One-liner install script that detects your distro and GPU automatically:

Terminal
# Official install script (auto-detects GPU)
curl -fsSL https://ollama.com/install.sh | sh

# Verify installation
ollama --version

# The script automatically:
# - Detects NVIDIA GPUs and installs CUDA support
# - Detects AMD GPUs and configures ROCm
# - Creates a systemd service (ollama.service)
# - Starts the server on port 11434

Windows

Use the native Windows installer, or install via the Windows Package Manager:

Option A: Official Installer
# Download the Windows installer from ollama.com
# Runs natively — no WSL2 required (since late 2024)
# Supports NVIDIA CUDA GPUs out of the box
Option B: winget
# Install via Windows Package Manager
winget install Ollama.Ollama

# Verify installation (open new terminal)
ollama --version

Docker (Any Platform)

Ideal for servers, CI/CD pipelines, or keeping your system clean:

Terminal — Docker with GPU support
# Run Ollama in Docker with NVIDIA GPU support
docker run -d \
  --gpus all \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama

# Without GPU (CPU only)
docker run -d \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama

# Pull and run a model inside the container
docker exec -it ollama ollama pull llama3.2
docker exec -it ollama ollama run llama3.2

After Installation: Verify Everything Works

1. Check Ollama is installed

ollama --version
# Should output something like: ollama version 0.6.x

2. Start the server (if not auto-started)

ollama serve
# Server starts on http://localhost:11434
# On macOS/Windows, the app starts the server automatically
# On Linux, systemd manages the service

3. Test the API is responding

curl http://localhost:11434
# Should return: "Ollama is running"

Part 2: Your First Model

Let's download and run your first AI model. We'll start with Llama 3.2 — Meta's versatile open model.

1. Download (Pull) a Model

Terminal
# Pull Llama 3.2 (8B parameters, 4.7GB download)
ollama pull llama3.2

# The download progress will show:
# pulling manifest...
# pulling abc123... 100% ████████████████ 4.7 GB
# verifying sha256 digest...
# writing manifest...
# success

First pull takes a few minutes depending on your internet speed. The model is cached locally — subsequent runs are instant.

2. Run the Model Interactively

Terminal
# Start an interactive chat session
ollama run llama3.2

# You'll see a prompt:
# >>> Send a message (/? for help)

# Try these prompts:
>>> Explain how HTTP caching works in 3 sentences
>>> Write a Python function to merge two sorted lists
>>> What are the pros and cons of microservices?

# Type /bye to exit the session
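
The interactive session also supports a handful of slash commands beyond /bye. The ones below exist in current Ollama builds, but type /? to see the full list for your version:

Terminal
>>> /show info                    # model details: parameters, quantization, context size
>>> /set parameter num_ctx 8192   # change a runtime parameter for this session
>>> /set system You are a concise assistant that answers in bullet points.
>>> /clear                        # clear the conversation context
>>> /bye                          # exit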

3. Essential Commands

Ollama CLI Reference
# Download a model
ollama pull llama3.2

# Run a model (auto-pulls if not downloaded)
ollama run llama3.2

# List all downloaded models
ollama list

# Show detailed model info (size, parameters, template)
ollama show llama3.2

# Remove a model (free disk space)
ollama rm llama3.2

# Start the Ollama server
ollama serve

# Show currently running/loaded models
ollama ps

# Copy a model (create an alias)
ollama cp llama3.2 my-custom-llama

Quick-Run Examples

Terminal
# One-shot prompt (no interactive session)
ollama run llama3.2 "Summarize the benefits of TypeScript in 3 bullet points"

# Pipe input from a file
cat README.md | ollama run llama3.2 "Summarize this document"

# Use a specific model for coding
ollama run qwen2.5-coder:32b "Write a React hook for debouncing"

# Use DeepSeek for reasoning tasks
ollama run deepseek-r1 "Prove that the square root of 2 is irrational"

Best Models to Run Locally (2026)

Not all models are created equal. Here are the best options for different use cases, with exact download sizes and VRAM requirements.

🦙

Llama 3.2

Llama Community License

Sizes: 1B (1.3GB), 3B (2.0GB), 8B (4.7GB), 70B (40GB)
VRAM Needed: 4GB (3B), 8GB (8B), 40GB+ (70B)

Best for: General-purpose tasks, conversation, writing, analysis

🔭

Llama 4 Scout

Llama Community License

Sizes: 109B MoE (17B active)
VRAM Needed: 12GB (only active params loaded)
Benchmark: Close to Llama 3.2 70B

Best for: Near-70B quality on consumer GPUs — best bang for buck

🚀

Llama 4 Maverick

Llama Community License

Sizes: 400B MoE (17B active)
VRAM Needed: 24GB+
Benchmark: 88.2% MMLU

Best for: 10M token context window, complex long-document analysis

🧠

DeepSeek R1

MIT (fully commercial)

Sizes: 1.5B, 7B, 8B, 14B, 32B, 70B, 671B
VRAM Needed: 8GB (7B), 24GB (70B Q4)
Benchmark: 79.8% AIME Math

Best for: Reasoning, math, science, logic — best chain-of-thought

💻

Qwen 2.5 Coder 32B

Apache 2.0

Sizes: 0.5B, 1.5B, 3B, 7B, 14B, 32B
VRAM Needed: 24GB (32B)
Benchmark: 92% HumanEval

Best for: Code generation, completion, review — best coding model

🌬️

Mistral 7B

Apache 2.0 (with conditions)

Sizes: 7B (4.1GB)
VRAM Needed: 8GB

Best for: Great general-purpose model for consumer hardware

💎

Gemma 2

Gemma License

Sizes: 2B, 9B, 27B
VRAM Needed: 4GB (2B), 8GB (9B), 24GB (27B)

Best for: Google's efficient open model, good for resource-constrained setups

⚛️

Phi 3

MIT

Sizes: 3.8B, 14B
VRAM Needed: 4GB (3.8B), 12GB (14B)

Best for: Microsoft's small but surprisingly capable model

Pull Any of These Models

Terminal
# Meta Llama
ollama pull llama3.2          # 8B (default)
ollama pull llama3.2:3b       # 3B variant
ollama pull llama3.2:70b      # 70B variant (needs 40GB+ VRAM)

# Llama 4
ollama pull llama4-scout      # 109B MoE, 17B active
ollama pull llama4-maverick   # 400B MoE, 17B active

# DeepSeek R1
ollama pull deepseek-r1       # Default size
ollama pull deepseek-r1:7b    # 7B distilled
ollama pull deepseek-r1:70b   # 70B distilled

# Qwen 2.5 Coder
ollama pull qwen2.5-coder:32b # Best coding model
ollama pull qwen2.5-coder:7b  # Smaller coding model

# Others
ollama pull mistral           # Mistral 7B
ollama pull gemma2:9b         # Google Gemma 2
ollama pull phi3              # Microsoft Phi 3

Part 3: Open WebUI — ChatGPT-like Interface

The terminal is powerful, but sometimes you want a polished chat interface. Open WebUI gives you a ChatGPT-like experience for your local models — with chat history, model switching, and more.

Quick Setup with Docker

Terminal — Basic Setup
# Pull and run Open WebUI (connects to local Ollama)
docker run -d \
  -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main

# Access the UI at:
# http://localhost:3000
Terminal — With NVIDIA GPU Support
# For NVIDIA GPU acceleration in Open WebUI
docker run -d \
  -p 3000:8080 \
  --gpus all \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:cuda

# For AMD GPU support, use the :rocm tag instead

Chat Features

  • Full chat history with search
  • Model selector dropdown (switch models mid-chat)
  • File uploads and document analysis
  • Markdown rendering with syntax highlighting

Admin Features

  • Multi-user support with role-based access
  • Dark mode / light mode toggle
  • System prompt customization per model
  • Import/export conversations

First Login: The first account you create becomes the admin. Open WebUI stores data locally in the Docker volume — nothing is sent to external servers. Access at http://localhost:3000.
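
Because everything lives in that Docker volume, upgrading is low-risk: pull the newer image, recreate the container, and your chats and users persist. A sketch using standard Docker commands:

Terminal
# Upgrade Open WebUI without losing data (the open-webui volume is preserved)
docker pull ghcr.io/open-webui/open-webui:main
docker stop open-webui && docker rm open-webui
# ...then re-run the same docker run command from the setup above

# Optional: back up the volume to a tarball
docker run --rm -v open-webui:/data -v "$(pwd)":/backup alpine \
  tar czf /backup/open-webui-backup.tar.gz -C /data .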

Part 4: Developer Tool Integration

Ollama exposes an OpenAI-compatible API at http://localhost:11434/v1. This means any tool that supports the OpenAI API can connect to your local models.

🔧 Continue.dev (VS Code / JetBrains)

Open-source AI code assistant plugin for VS Code and JetBrains IDEs. Works natively with Ollama.

~/.continue/config.json
{
  "models": [
    {
      "title": "Ollama - Qwen Coder",
      "provider": "ollama",
      "model": "qwen2.5-coder:32b",
      "apiBase": "http://localhost:11434"
    },
    {
      "title": "Ollama - DeepSeek R1",
      "provider": "ollama",
      "model": "deepseek-r1",
      "apiBase": "http://localhost:11434"
    }
  ],
  "tabAutocompleteModel": {
    "title": "Ollama Autocomplete",
    "provider": "ollama",
    "model": "qwen2.5-coder:7b",
    "apiBase": "http://localhost:11434"
  }
}
  • Install the Continue extension from the VS Code Marketplace or JetBrains Plugin Store
  • Supports code completion, chat, refactoring, and documentation generation
  • Completely free and open-source — no account required

Cursor IDE

Cursor supports custom OpenAI-compatible endpoints. Point it to your local Ollama server:

Cursor Settings → Models → OpenAI API
# In Cursor Settings, add a custom model:
# API Base URL: http://localhost:11434/v1
# API Key: ollama (any string works, it's not validated)
# Model name: llama3.2 (or any model you have pulled)

# The OpenAI-compatible endpoint:
# http://localhost:11434/v1/chat/completions

Note: Local models may be slower than cloud APIs for large codebases. Best for smaller, focused tasks.

🤖 Claude Code & Other Tools

Any tool that speaks the OpenAI API protocol can connect to Ollama:

Terminal — Using the API directly
# Chat completion (OpenAI-compatible)
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2",
    "messages": [
      {"role": "user", "content": "Hello, how are you?"}
    ]
  }'

# This works with any OpenAI-compatible client:
# - Python: openai library with base_url="http://localhost:11434/v1"
# - Node.js: openai package with baseURL config
# - Any HTTP client

🐍 Python Integration

python — using the openai library
# pip install openai
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # any string works
)

response = client.chat.completions.create(
    model="llama3.2",
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Write a Python function to find prime numbers"}
    ]
)

print(response.choices[0].message.content)

Performance Tips

1. Use Quantized Models

Quantization reduces model precision (e.g., from 16-bit to 4-bit) to dramatically reduce VRAM usage with minimal quality loss. Most Ollama models are already quantized to Q4_K_M by default.

Terminal
# Check a model's quantization level
ollama show llama3.2 --modelfile

# Most models default to Q4_K_M (4-bit quantization)
# This gives ~95% of full quality at ~25% of the VRAM

# Quantization levels (more bits = better quality, more VRAM):
# Q4_0  — fastest, lowest quality
# Q4_K_M — best balance (default for most models)
# Q5_K_M — slightly better quality
# Q8_0  — near-original quality, 2x VRAM vs Q4
# F16   — full precision, maximum VRAM
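
Many models on ollama.com also publish explicit quantization tags if you want a different trade-off than the default. Exact tag names vary by model, so the ones below are illustrative; check the Tags list on the model's library page before pulling:

Terminal
# Illustrative quantization tags (verify exact names on ollama.com/library)
ollama pull llama3.2:3b-instruct-q8_0     # higher fidelity, roughly 2x the VRAM of Q4
ollama pull llama3.2:3b-instruct-q4_K_M   # the usual 4-bit balance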

2. GPU Layer Offloading

If your model doesn't fully fit in VRAM, Ollama automatically splits it between GPU and CPU. You can control how many layers run on the GPU:

Terminal
# Set number of GPU layers (higher = more on GPU = faster)
OLLAMA_NUM_GPU=35 ollama run llama3.2

# Set to 0 for CPU-only mode
OLLAMA_NUM_GPU=0 ollama run llama3.2

# Let Ollama auto-detect (default behavior)
ollama run llama3.2
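
To see how a loaded model was actually split, load it and check ollama ps; the PROCESSOR column reports the CPU/GPU share (exact output varies by version):

Terminal
# Load the model, then inspect where its layers ended up
ollama run llama3.2 "hi" > /dev/null
ollama ps
# PROCESSOR shows e.g. "100% GPU" or a CPU/GPU percentage split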

3. Optimize Context Window Size

Larger context windows use more VRAM. Reduce it if you don't need long conversations:

Terminal
# Set context window size (default varies by model)
ollama run llama3.2 --num-ctx 2048    # Smaller = faster, less VRAM
ollama run llama3.2 --num-ctx 8192    # Larger = handles longer input
ollama run llama3.2 --num-ctx 32768   # Maximum for most models
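
The same knob is available per request through the native API's options field, which is handy when different clients need different context sizes:

Terminal
# Request a 4096-token context window for this call only
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Summarize the plot of Hamlet in two sentences.",
  "options": { "num_ctx": 4096 }
}'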

4. Pick the Right Model for the Task

For coding: Qwen 2.5 Coder 32B > Qwen 2.5 Coder 7B > DeepSeek R1 7B
For math/reasoning: DeepSeek R1 > Llama 4 Scout > Qwen 2.5 32B
For general chat: Llama 3.2 8B > Mistral 7B > Gemma 2 9B
For low VRAM (4-8GB): Phi 3 3.8B > Llama 3.2 3B > Gemma 2 2B

Ollama API Reference

Ollama runs a REST API on http://localhost:11434. It also provides an OpenAI-compatible endpoint at /v1.

Native Ollama API

Key Endpoints
# Generate a completion (streaming)
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Why is the sky blue?"
}'

# Chat completion (multi-turn)
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "4"},
    {"role": "user", "content": "And 2+3?"}
  ]
}'

# List local models
curl http://localhost:11434/api/tags

# Show model info
curl http://localhost:11434/api/show -d '{"name": "llama3.2"}'

# Pull a model programmatically
curl http://localhost:11434/api/pull -d '{"name": "mistral"}'

# Generate embeddings
curl http://localhost:11434/api/embeddings -d '{
  "model": "llama3.2",
  "prompt": "Ollama is awesome"
}'

OpenAI-Compatible API

OpenAI-format endpoints at /v1
# Chat completions (drop-in replacement for OpenAI)
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Tell me a joke"}
    ],
    "temperature": 0.7,
    "max_tokens": 200
  }'

# List available models
curl http://localhost:11434/v1/models

# This means you can use:
# - Python openai library
# - Node.js openai package
# - Any OpenAI-compatible SDK or tool
# Just change the base URL to http://localhost:11434/v1

Model Comparison Table

Benchmark Highlights (2026)

Benchmark | Best Open Model | Score | GPT-4o
AIME Math Reasoning | DeepSeek R1 | 79.8% | 9.3%
MMLU Knowledge | Llama 4 Maverick | 88.2% | 88.7%
HumanEval Code | Qwen 2.5 Coder | 92% | 90.2%

Open models are beating GPT-4o in specialized tasks. The gap for general knowledge (MMLU) is nearly closed.

Complete Model Comparison

Model | Parameters | Download Size | Min VRAM | License | Best For
Llama 3.2 3B | 3B | 2.0GB | 4GB | Community | Light tasks, edge
Llama 3.2 8B | 8B | 4.7GB | 8GB | Community | General purpose
Llama 3.2 70B | 70B | 40GB | 40GB+ | Community | High-quality gen
Llama 4 Scout | 109B MoE | ~12GB | 12GB | Community | Best value
Llama 4 Maverick | 400B MoE | ~24GB | 24GB+ | Community | Long context
DeepSeek R1 7B | 7B | ~4.7GB | 8GB | MIT | Reasoning
DeepSeek R1 70B | 70B | ~40GB | 24GB (Q4) | MIT | Math, science
Qwen 2.5 Coder 32B | 32B | ~20GB | 24GB | Apache 2.0 | Code gen
Mistral 7B | 7B | 4.1GB | 8GB | Apache 2.0* | General purpose
Gemma 2 9B | 9B | ~5.4GB | 8GB | Gemma License | Efficient gen
Phi 3 3.8B | 3.8B | ~2.3GB | 4GB | MIT | Small & capable

* Some Mistral models have commercial restrictions at scale. Check the specific model license before commercial use.

Licensing Quick Guide

Most Permissive (Commercial OK)

  • DeepSeek R1: MIT — fully commercial, no restrictions
  • Qwen 2.5 / Qwen 3: Apache 2.0 — commercial use allowed
  • Phi 3: MIT — fully permissive

Commercial with Conditions

  • Llama 4: Community License — commercial with conditions
  • Mistral: Some restrictions at commercial scale
  • Gemma 2: Gemma License — check specific terms

Troubleshooting

"Error: model not found"

The model hasn't been downloaded yet. Pull it first:

# Pull the model before running it
ollama pull llama3.2

# Or use ollama run (auto-pulls if missing)
ollama run llama3.2

# Check what models you have downloaded:
ollama list

"Error: could not connect to ollama server"

The Ollama server isn't running. Start it:

# Start the server manually
ollama serve

# On macOS: open the Ollama app (it starts the server)
# On Linux: check the systemd service
sudo systemctl status ollama
sudo systemctl start ollama

# Verify the server is running
curl http://localhost:11434
# Should return: "Ollama is running"

Out of memory / model too large

Your system doesn't have enough RAM or VRAM for the model. Try a smaller model or reduce context:

# Switch to a smaller model
ollama run llama3.2:3b    # Instead of the 8B default

# Reduce context window to use less memory
ollama run llama3.2 --num-ctx 2048

# Force CPU-only mode (uses system RAM instead of VRAM)
OLLAMA_NUM_GPU=0 ollama run llama3.2

# Check VRAM usage
nvidia-smi    # NVIDIA GPUs
# For Apple Silicon, check Activity Monitor → Memory

GPU not detected / running on CPU

Ollama isn't using your GPU. Check drivers and CUDA:

# Check if NVIDIA driver is installed
nvidia-smi

# Check CUDA version
nvcc --version

# For NVIDIA, you need:
# - NVIDIA driver 470+ (for CUDA 11.x support)
# - The ollama install script should handle CUDA setup

# Restart Ollama after driver updates
sudo systemctl restart ollama  # Linux
# Or quit and reopen the Ollama app (macOS/Windows)

# For AMD on Linux, ensure ROCm is installed:
# https://rocm.docs.amd.com/

Port 11434 already in use

Another process (or another Ollama instance) is using the default port:

# Find what's using the port
lsof -i :11434          # macOS/Linux
netstat -ano | findstr 11434  # Windows

# Kill the existing process, or use a different port:
OLLAMA_HOST=0.0.0.0:11435 ollama serve

# Update your clients to use the new port:
# http://localhost:11435

Frequently Asked Questions

Is Ollama free? What's the catch?

Ollama is completely free and open-source. There's no catch — no subscriptions, no usage limits, no data collection. The models are also free to download (though some have license restrictions for commercial use). You only "pay" with your hardware resources (RAM, GPU, disk space, electricity).

How do local models compare to ChatGPT / GPT-4o?

For specialized tasks, local models can match or beat GPT-4o. DeepSeek R1 scores 79.8% on AIME math (vs GPT-4o's 9.3%). Qwen 2.5 Coder gets 92% on HumanEval (vs GPT-4o's 90.2%). For broad general knowledge and instruction following, GPT-4o still has an edge with smaller models, but the gap closes rapidly with 70B+ models and Llama 4 Scout/Maverick.

Can I run Ollama on a Mac with 8GB RAM?

Yes! Apple Silicon Macs share memory between CPU and GPU, making them surprisingly good for local AI. An M1/M2 with 8GB can run 3B-7B models comfortably. For 8B models, expect usable but not blazing speed. 16GB is the sweet spot for Apple Silicon — it can handle most 7B-13B models with good performance.

Does Ollama send any data to the internet?

No. Once a model is downloaded, Ollama runs 100% locally. It makes no outbound network connections during inference. The only internet access is when you ollama pull a new model. You can even disconnect from the internet entirely after downloading your models.
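
If you want to verify that yourself on macOS or Linux, list the sockets held by the Ollama process while a chat is running; you should only see it listening locally on port 11434:

Terminal
# Show network sockets owned by Ollama (macOS/Linux)
lsof -i -P -n | grep -i ollama
# Expected: a LISTEN entry on port 11434 and no outbound connections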

Can I fine-tune models with Ollama?

Ollama supports creating custom models with Modelfiles (similar to Dockerfiles). You can customize system prompts, parameters, and create model variants. For actual fine-tuning (training on your data), you'd need tools like Unsloth or Axolotl, then import the resulting model into Ollama.
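
Here's a minimal sketch of that workflow: a Modelfile that bakes in a system prompt and a couple of parameters, registered with ollama create. The model name and prompt are just examples:

Terminal
# Write a Modelfile (FROM, SYSTEM, and PARAMETER are standard Modelfile directives)
cat > Modelfile <<'EOF'
FROM llama3.2
SYSTEM """You are a terse senior code reviewer. Point out bugs and risky patterns first."""
PARAMETER temperature 0.3
PARAMETER num_ctx 8192
EOF

# Build and run the custom variant
ollama create code-reviewer -f Modelfile
ollama run code-reviewer "Review this function: def add(a, b): return a - b"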

What's the best model for coding?

Qwen 2.5 Coder 32B is the current champion with 92% on HumanEval. If you don't have 24GB VRAM, the 7B variant is still excellent. DeepSeek R1 is also great for code that requires complex reasoning (algorithms, system design). For quick autocomplete, Qwen 2.5 Coder 7B offers the best speed-to-quality ratio.

Can I use Ollama for commercial products?

Yes, but check the model license. Ollama itself is MIT-licensed. Models like DeepSeek R1 (MIT) and Qwen (Apache 2.0) are fully commercial. Llama models have a community license with some conditions for large-scale commercial use. Mistral has varying licenses per model. Always check the specific model's license before deploying commercially.

Build Your AI Agent Stack

Now that you can run AI locally, build a complete stack around it. Use our AI-powered generator to pick the right hosting, database, auth, and deployment tools — optimized for AI agent workloads and your budget.