TL;DR

This guide walks you through deploying a fully offline AI assistant using Ollama and Open WebUI on a Linux system. You’ll run models like Llama 3.1, Mistral, or Qwen locally without internet connectivity or cloud dependencies.

What you’ll accomplish: Install Ollama as a systemd service, download AI models for offline use, deploy Open WebUI as your chat interface, and configure everything to work without external network access. The entire stack runs on your hardware—a laptop with 16GB RAM handles 7B models, while 32GB+ systems can run 13B or larger models.

Key components: Ollama serves as your local inference engine (similar to running your own OpenAI API), Open WebUI provides a ChatGPT-like interface, and models are stored locally in ~/.ollama/models. You’ll use Docker to run Open WebUI and systemd to manage the Ollama service.

Time investment: 30-45 minutes for basic setup, plus model download time (5-20 minutes per model depending on size and connection speed).

Prerequisites: A Linux system (Ubuntu 22.04+, Debian 12+, or Fedora 38+ recommended), Docker and Docker Compose installed, and sufficient disk space (10GB minimum, 50GB+ recommended for multiple models). No GPU required, though it significantly improves response times.
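
A quick way to confirm the prerequisites before you start (output formats vary by distribution):

docker --version
docker compose version   # or docker-compose --version for the standalone binary
df -h /                  # confirm at least 10GB free, 50GB+ if you plan to keep several models
free -h                  # confirm available RAM for your target model size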

Why this matters: Complete data privacy—your conversations never leave your machine. Works on airplanes, in secure environments, or anywhere without internet. No API costs, no rate limits, no vendor lock-in. Perfect for sensitive work, personal projects, or learning AI infrastructure without cloud dependencies.

Caution: AI models can hallucinate commands and configurations. Always review generated bash scripts, systemd units, or infrastructure code before execution. Test in isolated environments first, especially when using AI to generate Ansible playbooks or Terraform configurations for production systems.

Core Steps

Start by installing Ollama on your Linux system. This provides the runtime for local LLM inference:

curl -fsSL https://ollama.com/install.sh | sh

Pull a capable model like Llama 3.1 or Mistral. For offline assistant work, 7B-8B parameter models offer the best balance of quality and speed:

ollama pull llama3.1:8b
ollama pull mistral:7b-instruct
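
Confirm the downloads completed; ollama list reports the tag, size, and last-modified time of every local model:

ollama list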

Configure Open WebUI for Chat Interface

Open WebUI provides a ChatGPT-like interface that connects to your local models:

docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main

Access the interface at http://localhost:3000 and configure it to use your Ollama endpoint at http://host.docker.internal:11434.
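
If the interface can’t find your models, first confirm Ollama is answering on the host; both endpoints below are part of Ollama’s standard API:

curl http://localhost:11434/            # should return "Ollama is running"
curl http://localhost:11434/api/tags    # lists the models Open WebUI should see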

Set Up System Prompts for Assistant Behavior

Create a custom system prompt that defines your assistant’s role. In Open WebUI, navigate to Settings → Models → System Prompt:

You are a helpful offline AI assistant running locally. Provide accurate, concise responses. When suggesting system commands, always explain what they do before the user runs them. Never execute commands directly.

⚠️ Caution: AI models can hallucinate incorrect commands. Always validate any suggested bash scripts, API calls, or system modifications before execution. Test in a non-production environment first.

Enable API Access for Automation

Ollama exposes a REST API for programmatic access:

import requests

response = requests.post('http://localhost:11434/api/generate',
    json={
        'model': 'llama3.1:8b',
        'prompt': 'Explain Docker networking',
        'stream': False
    })
print(response.json()['response'])

This enables integration with scripts, Ansible playbooks, or monitoring tools like Prometheus for AI-enhanced alerting.
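
As a sketch of that kind of integration, the snippet below sends recent error logs to the local API and prints a short summary. It assumes curl and jq are installed and uses the llama3.1:8b model pulled earlier; treat it as a starting point, not production code.

#!/usr/bin/env bash
# Summarize recent journal errors with the local model (sketch)
LOG_EXCERPT=$(journalctl -p err -n 20 --no-pager)
curl -s http://localhost:11434/api/generate \
  -d "$(jq -n --arg p "Summarize these errors in two sentences: $LOG_EXCERPT" \
        '{model: "llama3.1:8b", prompt: $p, stream: false}')" \
  | jq -r '.response'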

Implementation

If you haven’t already installed Ollama in the Core Steps above, run the installer now; the installation script handles dependencies automatically:

curl -fsSL https://ollama.com/install.sh | sh

Once installed, pull a capable model like Mistral 7B or Llama 3.1 8B. For offline assistants, choose models under 10GB for reasonable response times:

ollama pull mistral:7b-instruct
ollama pull llama3.1:8b

Verify the installation by running a test query:

ollama run mistral:7b-instruct "Explain what you can do as an offline assistant"

Configuring Open WebUI for Daily Use

Open WebUI connects a ChatGPT-like interface to your local models. If you already created the container in Core Steps, remove it first with docker rm -f open-webui, then redeploy with a restart policy so the interface survives reboots:

docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main

Access the interface at http://localhost:3000 and configure it to use your Ollama instance at http://host.docker.internal:11434.

Creating System Prompts for Specific Tasks

Define custom system prompts for different use cases and keep them somewhere reusable, for example in a file ~/.ollama/prompts/sysadmin.txt:

You are a Linux system administrator assistant. When suggesting commands:
1. Always explain what each command does
2. Warn about destructive operations
3. Suggest testing in non-production first
4. Never assume sudo access without asking

⚠️ Critical Warning: AI models can hallucinate commands that look plausible but are incorrect or dangerous. Always validate generated commands against official documentation before execution, especially for system administration tasks involving rm, dd, iptables, or package management operations.

Test your prompts with real scenarios before relying on them for production work.
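
Note that the prompt file above is just stored text; Ollama does not read ~/.ollama/prompts on its own. One way to apply the prompt automatically is to bake it into a custom model with a Modelfile, sketched below (the sysadmin model name is arbitrary):

# Create a Modelfile that embeds the system prompt
cat > Modelfile <<'EOF'
FROM mistral:7b-instruct
SYSTEM """
You are a Linux system administrator assistant. Always explain commands,
warn about destructive operations, and suggest testing in non-production first.
"""
EOF

# Build and test the custom model
ollama create sysadmin -f Modelfile
ollama run sysadmin "How do I safely check disk usage?"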

Verification and Testing

After installation, confirm your local AI assistant responds correctly and operates without internet connectivity.

Start with a simple query to verify Ollama is running. The verification examples use the small llama3.2:3b model; pull it first with ollama pull llama3.2:3b, or substitute a model you downloaded earlier:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2:3b",
  "prompt": "What is 2+2?",
  "stream": false
}'

You should receive a JSON response within seconds. If the connection fails, check that the Ollama service is active with systemctl status ollama.
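
If the service isn’t active, the logs usually point to the cause (a port conflict or missing GPU driver, for example):

journalctl -u ollama -n 50     # recent service logs
sudo systemctl restart ollama  # restart after fixing the configuration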

Offline Operation Verification

Disconnect your network interface to confirm true offline capability (replace enp0s3 with your interface name from ip link):

sudo ip link set enp0s3 down
ollama run llama3.2:3b "Explain Docker containers"

The model should respond normally. Reconnect afterward with sudo ip link set enp0s3 up.

Performance Benchmarking

Test response times and token generation speed:

import time
import requests

start = time.time()
response = requests.post('http://localhost:11434/api/generate', json={
    'model': 'mistral:7b-instruct',
    'prompt': 'Write a Python function to calculate Fibonacci numbers',
    'stream': False
})
elapsed = time.time() - start
print(f"Response time: {elapsed:.2f}s")
print(f"Tokens/sec: {response.json()['eval_count'] / response.json()['eval_duration'] * 1e9:.1f}")

Expect 15-30 tokens/second on CPU, 50-150 on GPU depending on hardware.
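
To confirm whether inference is actually running on the GPU rather than falling back to CPU, check the loaded model while a request is in flight (recent Ollama versions report the placement in ollama ps):

ollama ps       # the PROCESSOR column shows GPU vs CPU placement for loaded models
nvidia-smi      # GPU memory and utilization (NVIDIA; use rocm-smi for AMD)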

AI Command Validation

CAUTION: Always review AI-generated system commands before execution. Local models can hallucinate dangerous operations.

# WRONG - Never pipe AI output directly to shell
ollama run codellama "write rm command" | bash

# CORRECT - Review first
ollama run codellama "write backup script" > review.sh
cat review.sh  # Inspect thoroughly
bash review.sh  # Execute only after validation

Test with Open WebUI by accessing http://localhost:3000 and verifying chat history persists after browser restart, confirming local data storage.
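
To see where that chat history actually lives on disk, inspect the named volume:

docker volume inspect open-webui   # the Mountpoint field is the host directory backing /app/backend/data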

Best Practices

Resource Management

Allocate at least 8GB of RAM for 7B models and 16GB for 13B models. Monitor GPU memory with nvidia-smi (NVIDIA) or watch -n 1 rocm-smi (AMD). Set OLLAMA_NUM_PARALLEL=1 on systems with limited VRAM to prevent out-of-memory crashes.

# Set resource limits via a systemd override (systemctl edit opens an editor)
sudo systemctl edit ollama
# Add these lines to the override file, then save and exit:
[Service]
Environment="OLLAMA_MAX_LOADED_MODELS=2"
Environment="OLLAMA_NUM_PARALLEL=1"
MemoryMax=24G
# Apply the change
sudo systemctl restart ollama

Model Selection Strategy

Start with llama3.2:3b for testing, then upgrade to qwen2.5:7b or mistral:7b for production use. Quantized models (Q4_K_M) offer the best speed-to-quality ratio for local deployment. Test with your actual prompts before committing to larger models.
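
To check what quantization a local model uses, or to pull a specific quantization explicitly, the standard Ollama commands below work; exact tag names vary by model, so confirm them on the model’s library page first.

ollama show mistral:7b-instruct          # reports parameter count and quantization level
ollama pull qwen2.5:7b-instruct-q4_K_M   # explicit quantization tag (verify on ollama.com/library)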

Prompt Engineering for Offline Use

Design system prompts that acknowledge the model’s knowledge cutoff and local context. Store reusable templates in Open WebUI’s prompt library.

SYSTEM_PROMPT = """You are a local AI assistant running offline. 
Current date: 2026-01-15. You cannot access the internet.
When uncertain, clearly state your limitations."""

⚠️ Critical: AI Hallucination Risks

Never execute AI-generated system commands without validation. LLMs frequently hallucinate package names, file paths, and command flags.

# WRONG: Piping AI output directly
ollama run codellama "write ansible playbook" | ansible-playbook -

# RIGHT: Review first
ollama run codellama "write ansible playbook" > review.yml
ansible-playbook --syntax-check review.yml

Backup and Version Control

Ollama stores models in ~/.ollama/models by default (set the OLLAMA_MODELS environment variable to relocate them, for example to /opt/ollama/models), and back up your Open WebUI database weekly:

docker exec open-webui sqlite3 /app/backend/data/webui.db ".backup '/app/backend/data/webui-backup.db'"
docker cp open-webui:/app/backend/data/webui-backup.db ./webui-$(date +%F).db

Version control your Modelfiles and system prompts in Git for reproducibility across deployments.
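
A minimal sketch of that workflow, assuming the sysadmin prompt file from earlier lives at ~/.ollama/prompts/sysadmin.txt and that the directory layout below is just one convenient choice:

# Keep prompts and Modelfiles together under version control
mkdir -p ~/ollama-config/prompts && cd ~/ollama-config
cp ~/.ollama/prompts/sysadmin.txt prompts/
git init
git add .
git commit -m "Track assistant system prompts and Modelfiles"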

FAQ

Can I run multiple models at the same time?

Yes, Ollama supports concurrent model loading. Each model consumes VRAM independently: running llama3.2:3b (~2GB) and mistral:7b (~4GB) simultaneously requires 6GB+ of VRAM. Monitor usage with:

ollama ps
watch -n 1 nvidia-smi

What happens if my internet goes down?

Once models are downloaded, everything runs locally. Your AI assistant continues functioning without internet connectivity. Verify offline capability:

sudo systemctl stop NetworkManager
ollama run llama3.2:3b "Explain Docker containers"
sudo systemctl start NetworkManager   # restore connectivity afterward

How do I update models safely?

Ollama doesn’t auto-update models. Pull updates manually and test before removing old versions:

ollama pull llama3.2:latest
ollama run llama3.2:latest "Test prompt"
ollama rm llama3.2:previous-tag

Can AI assistants generate dangerous system commands?

Absolutely. LLMs hallucinate and generate plausible-looking but incorrect commands. Never pipe AI output directly to bash:

# DANGEROUS - DO NOT DO THIS
ollama run codellama "fix permissions" | bash

# SAFE - Review first
ollama run codellama "fix permissions" > review.sh
cat review.sh  # Inspect carefully
bash review.sh

For infrastructure automation, use AI to generate Ansible playbooks or Terraform configs, then review thoroughly:

# AI-generated, human-reviewed
- name: Configure firewall
  ufw:
    rule: allow
    port: '11434'
    proto: tcp

How much disk space do I need?

Plan for 10-50GB depending on model collection:

  • Small models (3B): 2-4GB each
  • Medium models (7B-13B): 4-8GB each
  • Large models (70B+): 40GB+ each

Check current usage:

du -sh ~/.ollama/models

Does this work on Raspberry Pi?

Yes, but only with quantized small models. Use llama3.2:3b-q4_0 on Pi 5 with 8GB RAM. Performance is slow but functional for basic tasks.