TL;DR

Ollama is a command-line tool that lets you run large language models like Llama, Mistral, and CodeLlama directly on your Linux machine without sending data to external APIs. Install it with a single command, pull models from the ollama.com library, and interact via REST API on port 11434 or through the CLI.

Install Ollama and run your first model:

curl -fsSL https://ollama.com/install.sh | sh
ollama pull llama3.2
ollama run llama3.2

The service starts automatically and listens on localhost:11434. Pull models once, then query them locally via HTTP or integrate with tools like Open WebUI, Continue.dev, or custom Python scripts.

Key Benefits

Run AI models without cloud dependencies. Your prompts, data, and responses stay on your hardware. No usage caps, no per-token billing, no internet required after model download. Ideal for privacy-sensitive workflows, air-gapped environments, or homelab experimentation.

Common Use Cases

Developers use Ollama for code completion in VS Code, generating Ansible playbooks, or explaining Terraform configurations. Homelab operators integrate it with Home Assistant for natural language automation or use it to analyze system logs. You can build chatbots, summarize documents, or prototype AI features before committing to cloud providers.

Important Limitations

Ollama does not expose Prometheus metrics endpoints, so monitoring requires external tools or API polling. Performance depends entirely on your hardware – consumer GPUs handle smaller models well, but large models may require significant VRAM or run slowly on CPU. Always validate AI-generated system commands before execution, especially for infrastructure automation or security-sensitive operations. Models can hallucinate package names, file paths, or configuration syntax that appears plausible but breaks systems.

Control your AI stack. Keep your data local. Run models on your terms.

Core Steps

Getting Ollama running locally involves three essential steps: installation, model selection, and initial interaction. Each step builds on the previous one to create a functional local AI environment.

Install Ollama using the official script on Linux systems:

curl -fsSL https://ollama.com/install.sh | sh

This installs the Ollama service and CLI tools. The service automatically starts and listens on port 11434. Verify the installation:

ollama --version

Pulling Your First Model

Download a model from the Ollama library. Start with a smaller model like Llama 3.2 for testing:

ollama pull llama3.2

Models are stored in GGUF format in the default location at /usr/share/ollama/.ollama/models. To use a custom storage path, set the OLLAMA_MODELS environment variable before starting the service:

export OLLAMA_MODELS=/mnt/storage/ollama-models

Running and Testing

Start an interactive session with your model:

ollama run llama3.2

This opens a chat interface where you can test prompts directly. For API access, use the REST endpoint. The API streams newline-delimited JSON by default, so set "stream": false if you want a single JSON response:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Explain Docker networking in one sentence",
  "stream": false
}'

For Python integration:

import requests

response = requests.post('http://localhost:11434/api/generate', json={
    'model': 'llama3.2',
    'prompt': 'List common Ansible modules for file management',
    'stream': False
})
print(response.json()['response'])
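When streaming is left enabled, each line of the response is a JSON chunk carrying a 'response' text fragment, with the final chunk setting 'done': true. A small helper, sketched here from that documented /api/generate stream format, can reassemble the full reply:

```python
import json

def collect_stream(lines):
    """Reassemble the full reply from Ollama's streamed NDJSON chunks.

    Each chunk carries a 'response' text fragment; the final chunk
    sets 'done': true.
    """
    parts = []
    for raw in lines:
        if not raw:
            continue  # requests.iter_lines() can yield keep-alive blanks
        chunk = json.loads(raw)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)

# With a live server:
# r = requests.post('http://localhost:11434/api/generate',
#                   json={'model': 'llama3.2', 'prompt': 'Hi'}, stream=True)
# print(collect_stream(r.iter_lines()))
```

Streaming is useful for showing tokens as they arrive; for scripts that only need the final text, 'stream': False as shown above is simpler.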

Caution: When using AI-generated system commands, always review output before execution. LLMs can hallucinate package names, file paths, or command flags that do not exist. Validate all infrastructure code, especially Terraform plans or Ansible playbooks, in a non-production environment first.

Implementation

Install Ollama on Linux with a single command:

curl -fsSL https://ollama.com/install.sh | sh

After installation, the service runs automatically on port 11434. Verify it’s working:

ollama list

Pull your first model from the ollama.com library:

ollama pull llama3.2

Basic Configuration

Configure Ollama through environment variables. Create a systemd override:

sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo nano /etc/systemd/system/ollama.service.d/override.conf

Add configuration:

[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_MODELS=/mnt/storage/ollama-models"
Environment="OLLAMA_NUM_GPU=1"
Environment="OLLAMA_ORIGINS=http://localhost:3000"

Reload and restart:

sudo systemctl daemon-reload
sudo systemctl restart ollama

API Integration

Query models via REST API from any language:

import requests

response = requests.post('http://localhost:11434/api/generate',
    json={
        'model': 'llama3.2',
        'prompt': 'Explain Docker networking in 50 words',
        'stream': False
    })

print(response.json()['response'])

For shell scripting:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Write a Terraform variable definition for AWS region",
  "stream": false
}'
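In scripts it is usually only the "response" field you want. The filter below uses python3 for JSON parsing to avoid a jq dependency; it is demonstrated on a sample payload here, but the same filter can be appended to the curl pipeline above:

```shell
# Extract the generated text from a non-streaming /api/generate response.
# Live usage:  curl -s http://localhost:11434/api/generate -d '{..., "stream": false}' | <filter below>
printf '%s' '{"model":"llama3.2","response":"variable \"region\" {}","done":true}' \
  | python3 -c 'import json,sys; print(json.load(sys.stdin)["response"])'
```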

Caution: When using LLMs to generate infrastructure code or system commands, always review output before execution. Models can hallucinate invalid syntax, deprecated flags, or insecure configurations. Test generated Ansible playbooks, Terraform modules, and shell scripts in isolated environments first. Never pipe AI-generated commands directly to bash without manual inspection.

Running Models

Start an interactive session:

ollama run llama3.2

Or run one-off queries:

ollama run llama3.2 "Explain Kubernetes pod security contexts"

Verification and Testing

After installing Ollama, verify the service is running and accessible before deploying models to production workloads.

Check that Ollama responds on its default port:

curl http://localhost:11434/api/tags

This returns a JSON list of installed models. If the connection fails, verify the service status:

systemctl status ollama

For custom host configurations using OLLAMA_HOST, adjust the curl command accordingly:

curl http://192.168.1.100:11434/api/tags
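The tags endpoint is also convenient for scripted health checks. A minimal sketch, with field names taken from the /api/tags response shape (a "models" array whose entries carry a "name"):

```python
def installed_models(tags_json):
    """Return installed model names from an /api/tags response dict."""
    return [m["name"] for m in tags_json.get("models", [])]

# With a live server:
# import requests
# tags = requests.get("http://localhost:11434/api/tags").json()
# print(installed_models(tags))

# Sample payload shape:
sample = {"models": [{"name": "llama3.2:latest", "size": 2019393189}]}
print(installed_models(sample))  # ['llama3.2:latest']
```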

Model Inference Testing

Pull a small model and run a test inference:

ollama pull llama3.2:1b
ollama run llama3.2:1b "Explain what Ollama does in one sentence"

Test the REST API directly with a generation request:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2:1b",
  "prompt": "What is the capital of France?",
  "stream": false
}'

GPU Acceleration Verification

If you configured OLLAMA_NUM_GPU, verify GPU utilization during inference:

nvidia-smi --query-gpu=utilization.gpu --format=csv --loop=1

Run this in a separate terminal while executing a model query. You should see GPU utilization spike during generation.

Integration Testing

Test Ollama with automation tools before production deployment. Example Python validation script:

import requests

response = requests.post('http://localhost:11434/api/generate', json={
    'model': 'llama3.2:1b',
    'prompt': 'Return only the word SUCCESS',
    'stream': False
})

assert 'SUCCESS' in response.json()['response']

Caution: When using AI models to generate system commands or infrastructure code, always review outputs before execution. LLMs can hallucinate package names, file paths, or configuration syntax that may not exist on your system. Validate all AI-generated Ansible playbooks, shell scripts, or Terraform configurations in a test environment first.

Best Practices

Choose models that match your hardware capabilities. Start with smaller parameter counts like llama3.2:3b for testing, then scale to llama3.1:70b only if you have sufficient VRAM. Monitor GPU memory usage during initial runs to establish baseline requirements before committing to larger models in production workflows.

Environment Configuration

Set environment variables in systemd service files rather than shell profiles for consistent behavior across reboots:

sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf << EOF
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_NUM_GPU=1"
Environment="OLLAMA_MODELS=/mnt/models"
EOF
sudo systemctl daemon-reload
sudo systemctl restart ollama

API Integration Safety

When integrating Ollama with automation tools, always validate AI-generated commands before execution. This Python example shows safe command handling:

import shlex
import subprocess

import ollama

response = ollama.chat(model='llama3.1:8b', messages=[
    {'role': 'user', 'content': 'Generate a systemctl restart command'}
])

suggested_cmd = response['message']['content'].strip()
print(f"AI suggested: {suggested_cmd}")
approval = input("Execute? (yes/no): ")

if approval.lower() == 'yes':
    # shlex.split with the default shell=False avoids shell injection
    # from model output; never pass LLM text through shell=True
    subprocess.run(shlex.split(suggested_cmd), check=False)

AI models can hallucinate invalid syntax or destructive operations. Never pipe LLM output directly to bash without human review, especially for system administration tasks.
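Beyond manual review, a lightweight guardrail is an allowlist check before anything runs. This sketch is illustrative, not part of Ollama: it rejects any suggested command whose leading tokens are not pre-approved:

```python
import shlex

# Illustrative allowlist: only these command prefixes may execute.
ALLOWED_PREFIXES = [
    ("systemctl", "restart"),
    ("systemctl", "status"),
]

def is_allowed(command):
    """True if the command's leading tokens match an approved prefix."""
    tokens = shlex.split(command)
    return any(
        tuple(tokens[:len(prefix)]) == prefix for prefix in ALLOWED_PREFIXES
    )

print(is_allowed("systemctl restart ollama"))   # True
print(is_allowed("rm -rf /"))                   # False
```

An allowlist catches destructive or hallucinated commands automatically, but it complements rather than replaces the human approval step shown above.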

Model Updates and Versioning

Pin specific model versions in production rather than using latest tags:

ollama pull llama3.1:8b-instruct-q4_K_M

This prevents unexpected behavior changes when model maintainers publish updates. Test new versions in staging environments before promoting to production systems.

FAQ

How much VRAM do I need to run models?

Requirements vary by model size. A 7B parameter model typically needs 8GB VRAM for reasonable performance, while 13B models work better with 16GB or more. You can run larger models with less VRAM by adjusting the OLLAMA_NUM_GPU environment variable to offload fewer layers to the GPU, though inference will be slower.
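As a rough rule of thumb (an approximation, not an Ollama formula), quantized weights take about params × bits/8 bytes, plus overhead for the KV cache and runtime buffers:

```python
def approx_vram_gb(params_billion, bits_per_weight=4, overhead=1.3):
    """Back-of-the-envelope VRAM estimate for a quantized model.

    overhead is a rough multiplier covering KV cache and runtime
    buffers; actual usage varies with context length and runtime.
    """
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb * overhead

print(round(approx_vram_gb(7), 1))   # roughly 4.6 GB for a 7B model at 4-bit
```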

Can I use Ollama with existing AI tools like LangChain?

Yes. Ollama's REST API on port 11434 works with LangChain, LlamaIndex, and similar frameworks, and Ollama also exposes an OpenAI-compatible endpoint at /v1 for tools that expect the OpenAI API format. Point the framework at http://localhost:11434 and specify your model name. Here's a Python example using the langchain_community integration (the older langchain.llms import path is deprecated):

from langchain_community.llms import Ollama

llm = Ollama(
    base_url="http://localhost:11434",
    model="llama3.1"
)
response = llm.invoke("Explain Docker networking")

Does Ollama support monitoring with Prometheus?

No. Ollama does not expose a /metrics endpoint for Prometheus scraping. For monitoring, you’ll need to track system-level metrics like GPU utilization through nvidia-smi or similar tools, or build custom metrics by querying the API and exporting results to your monitoring stack.
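If you need Ollama data in Prometheus anyway, one pattern is a small exporter that polls the API and rewrites the result in the Prometheus text exposition format. A minimal sketch (the metric name and polling approach are illustrative; it formats a /api/tags-style payload):

```python
def tags_to_prometheus(tags_json):
    """Render installed-model sizes in Prometheus text exposition format."""
    lines = [
        "# HELP ollama_model_size_bytes Size of each installed model.",
        "# TYPE ollama_model_size_bytes gauge",
    ]
    for m in tags_json.get("models", []):
        lines.append(f'ollama_model_size_bytes{{model="{m["name"]}"}} {m["size"]}')
    return "\n".join(lines) + "\n"

# A real exporter would poll http://localhost:11434/api/tags on a timer
# and serve this text on its own /metrics HTTP endpoint.
sample = {"models": [{"name": "llama3.2:latest", "size": 2019393189}]}
print(tags_to_prometheus(sample))
```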

How do I change where Ollama stores models?

Set the OLLAMA_MODELS environment variable before starting the service:

export OLLAMA_MODELS=/mnt/storage/ollama-models
ollama serve

For systemd services, add this to your service file override in /etc/systemd/system/ollama.service.d/override.conf.

Can I run Ollama in Docker?

Yes. The official image is available at ollama/ollama. Mount a volume for model storage and expose port 11434:

docker run -d -v ollama-data:/root/.ollama -p 11434:11434 --gpus=all ollama/ollama

The --gpus=all flag requires the NVIDIA Container Toolkit; omit it on CPU-only hosts.