TL;DR
Ollama transforms your Debian system into a private AI inference server, letting you run models like Llama 3.1, Mistral, and Phi-3 locally without cloud dependencies. This guide walks you through installation, model deployment, API integration, and production hardening.
Quick Install:
curl -fsSL https://ollama.com/install.sh | sh
sudo systemctl enable ollama
ollama pull llama3.1:8b
ollama run llama3.1:8b
You’ll configure Ollama as a systemd service, expose its REST API on port 11434, and integrate it with Open WebUI for a ChatGPT-like interface. We cover GPU acceleration (NVIDIA/AMD), resource limits, and reverse proxy setup with Nginx for secure remote access.
Key integration examples include:
- Python API calls using the ollama library for RAG applications
- Ansible playbooks for multi-server deployment
- Prometheus metrics scraping for performance monitoring
- LangChain integration for building AI agents
What you’ll learn:
- Installing Ollama on Debian 12 (Bookworm) and Debian 11 (Bullseye)
- Pulling and managing models (quantization levels, storage optimization)
- Exposing the API securely with authentication
- Running multiple models simultaneously with resource allocation
- Integrating with development tools (VS Code Continue, Cursor)
- Troubleshooting common issues (CUDA errors, OOM kills, slow inference)
Caution: When using AI assistants like Claude or ChatGPT to generate system commands for Ollama configuration, always validate the output before execution. LLMs can hallucinate incorrect systemd unit files, firewall rules, or API endpoints that may break your setup. Test AI-generated commands in a VM or container first, especially for production deployments.
By the end, you’ll have a fully functional local AI stack that processes sensitive data on-premises, with no external API calls or usage tracking.
What is Ollama and Why Run It Locally on Debian
Ollama is an open-source framework that lets you run large language models like Llama 3.1, Mistral, Phi-3, and Gemma directly on your own hardware. Think of it as Docker for AI models—it handles model downloads, memory management, and provides a simple API that works with tools like Open WebUI, Continue.dev, and LangChain.
Running Ollama locally on Debian gives you complete control over your AI infrastructure. Your code reviews, internal documentation, and sensitive prompts never leave your network. This matters when you’re using AI to analyze proprietary codebases or generate infrastructure-as-code templates with Terraform or Ansible.
Performance is another key advantage. A local Llama 3.1 8B model on decent hardware (16GB RAM, modern CPU) responds in 2-3 seconds versus 5-10 seconds for API calls to cloud services. For development workflows—like using Cursor or Continue.dev for code completion—this latency difference is significant.
Cost savings add up quickly. Running Ollama 24/7 on a homelab server costs you electricity, while cloud API usage can hit $50-200/month for moderate use. A one-time hardware investment pays for itself in 6-12 months.
Real-World Use Cases
Developers use Ollama to power local coding assistants that understand their entire codebase. DevOps teams run it alongside Prometheus and Grafana to generate alert analysis and runbook suggestions. Technical writers use it for documentation drafts without exposing confidential product details.
Caution: AI models can hallucinate commands that look correct but contain dangerous flags. Always validate generated Ansible playbooks, Docker commands, or system scripts in a test environment before production deployment. A hallucinated rm -rf path or incorrect firewall rule can cause real damage.
Debian’s stability and long-term support make it ideal for production AI deployments that need to run reliably for years.
Prerequisites and System Requirements
Before installing Ollama on your Debian system, ensure you meet the following requirements for optimal performance and compatibility.
Ollama runs on modest hardware, but model performance scales with resources:
- CPU: 64-bit x86_64 (AMD64) processor; ARM64 is also supported
- RAM: Minimum 8GB for 7B models, 16GB+ recommended for 13B models, 64GB+ for 70B models
- Storage: 10GB free space minimum (quantized models range from under 1GB to 40GB+ each)
- GPU (optional): NVIDIA GPU with 6GB+ VRAM significantly improves inference speed
To check your system specifications:
# Check CPU architecture
uname -m
# Check available RAM
free -h
# Check disk space where models are stored (default: /usr/share/ollama)
df -h /usr/share
# Check for NVIDIA GPU
lspci | grep -i nvidia
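If you prefer a single readout, the following Python sketch gathers the same numbers with only the standard library. It reads /proc/meminfo, so it is Linux-specific, and the /usr/share path is simply where Ollama's default model directory lives.
# Hedged helper: summarize CPU, RAM, and free disk before choosing a model size
import os
import platform
import shutil

mem_kib = 0
with open("/proc/meminfo") as f:          # Linux-only memory info
    for line in f:
        if line.startswith("MemTotal:"):
            mem_kib = int(line.split()[1])
            break

disk = shutil.disk_usage("/usr/share")    # default Ollama model location is under /usr/share/ollama
print(f"arch: {platform.machine()}, cpus: {os.cpu_count()}")
print(f"RAM: {mem_kib / 1024 / 1024:.1f} GiB")
print(f"free disk: {disk.free / 1e9:.0f} GB")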
Software Requirements
Ollama supports Debian 11 (Bullseye) and Debian 12 (Bookworm). Verify your version:
cat /etc/debian_version
lsb_release -a
You’ll need:
- curl or wget for downloading the installer
- systemd for service management (standard on Debian)
- NVIDIA drivers (if using GPU acceleration)
Install prerequisites:
sudo apt update
sudo apt install -y curl
If you plan to run Ollama inside Docker with GPU passthrough, also install the NVIDIA Container Toolkit (a bare-metal install only needs the NVIDIA driver):
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update
sudo apt install nvidia-container-toolkit
Network Access
Ollama downloads models from its registry at registry.ollama.ai. Ensure your firewall allows outbound HTTPS (port 443) connections.
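Before the first pull, a plain TCP connection test is a quick way to confirm outbound HTTPS works. This is a minimal Python sketch using only the standard library; it assumes no outbound proxy sits in the way.
import socket

for host in ("registry.ollama.ai", "ollama.com"):
    try:
        # Open and immediately close a TCP connection to port 443
        with socket.create_connection((host, 443), timeout=5):
            print(f"OK: {host}:443 reachable")
    except OSError as exc:
        print(f"FAIL: {host}:443 unreachable ({exc})")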
Installing Ollama on Debian
Ollama provides an official installation script that handles dependencies and system configuration automatically. The installation process is straightforward but requires root privileges.
Run the official installer with a single command:
curl -fsSL https://ollama.com/install.sh | sh
This script detects your Debian version, installs necessary dependencies, sets up the systemd service, and configures Ollama to start on boot. The installation typically completes in under two minutes on modern hardware.
Manual Installation Alternative
For environments where piping curl to shell raises security concerns, download and inspect the script first:
curl -fsSL https://ollama.com/install.sh -o ollama-install.sh
less ollama-install.sh
sudo bash ollama-install.sh
Verifying the Installation
Confirm Ollama is running correctly:
systemctl status ollama
ollama --version
The service should show as “active (running)” and display the current version number. Test basic functionality by pulling a small model:
ollama pull qwen2.5:0.5b
ollama run qwen2.5:0.5b "What is 2+2?"
Post-Installation Configuration
By default, Ollama listens only on localhost (127.0.0.1:11434). To enable network access for tools like Open WebUI or API integrations, edit the systemd service:
sudo systemctl edit ollama
Add this configuration:
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Restart the service to apply changes:
sudo systemctl daemon-reload
sudo systemctl restart ollama
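From another machine on the network, a one-line check confirms the new binding took effect. SERVER_IP below is a placeholder for your Debian host's address, not something the installer configures.
import requests

SERVER_IP = "192.168.1.50"  # placeholder: replace with your Ollama host's LAN address
resp = requests.get(f"http://{SERVER_IP}:11434/", timeout=5)
print(resp.status_code, resp.text)  # expect 200 and "Ollama is running"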
Caution: Exposing Ollama to your network without authentication allows anyone with access to run models and consume GPU resources. Consider using a reverse proxy with authentication or firewall rules to restrict access to trusted clients only.
Downloading and Running Your First Model
With Ollama installed, you’re ready to pull and run your first model. The ollama run command handles both downloading and launching models in one step.
Start with Meta’s Llama 3.2 3B model, which offers excellent performance on consumer hardware:
ollama run llama3.2
Ollama downloads the model (approximately 2GB) and drops you into an interactive chat session. Type your prompts directly and press Enter. Exit with /bye or Ctrl+D.
Listing Available Models
Browse the model library at ollama.com/library or check what’s installed locally:
ollama list
Popular models for local deployment include:
- llama3.2 (3B/1B) - Fast, efficient for most tasks
- mistral (7B) - Strong reasoning capabilities
- codellama (7B/13B) - Code generation specialist
- phi3 (3.8B) - Microsoft’s compact model
Pulling Models Without Running
Download models in advance for offline use:
ollama pull mistral
ollama pull codellama:13b
The tag system (:13b, :7b) lets you specify model sizes. Without a tag, Ollama defaults to the recommended variant.
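Pre-pulling can also be scripted. Here is a hedged sketch using the official ollama Python package (pip install ollama), which blocks until each download finishes.
import ollama

# Pre-pull the models used in this guide; adjust the list to taste
for model in ("mistral", "codellama:13b"):
    ollama.pull(model)   # downloads (or verifies) the model via the local Ollama service
    print(f"pulled {model}")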
Testing API Access
Verify the REST API works for integration with Open WebUI or custom applications:
curl http://localhost:11434/api/generate -d '{
"model": "llama3.2",
"prompt": "Why run AI locally?",
"stream": false
}'
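The request above disables streaming for readability. With "stream": true (the API's default), Ollama returns newline-delimited JSON chunks, which the following Python sketch prints as they arrive.
import json
import requests

with requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.2", "prompt": "Why run AI locally?", "stream": True},
    stream=True,
    timeout=120,
) as resp:
    for line in resp.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)                      # one JSON object per line
        print(chunk.get("response", ""), end="", flush=True)
        if chunk.get("done"):
            print()                                   # final chunk carries timing stats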
⚠️ Caution: When using LLMs to generate system commands, always review output before execution. Models can hallucinate package names, paths, or flags that don’t exist. Never pipe AI responses directly to bash or sudo without manual verification—especially on production systems.
You now have a working local LLM environment. Next, we’ll optimize performance and integrate with web interfaces.
Configuration and Customization
Ollama respects several environment variables for customization. Create a systemd override to persist these settings:
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo nano /etc/systemd/system/ollama.service.d/override.conf
Add your configuration:
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_MODELS=/mnt/storage/ollama-models"
Environment="OLLAMA_NUM_PARALLEL=2"
Environment="OLLAMA_MAX_LOADED_MODELS=3"
Environment="OLLAMA_KEEP_ALIVE=10m"
Reload and restart:
sudo systemctl daemon-reload
sudo systemctl restart ollama
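To confirm that settings like OLLAMA_KEEP_ALIVE and OLLAMA_MAX_LOADED_MODELS are behaving, you can ask the API which models are currently resident (the same data ollama ps prints). A sketch, assuming at least one model has been run recently:
import requests

ps = requests.get("http://localhost:11434/api/ps", timeout=5).json()
for m in ps.get("models", []):
    # expires_at reflects the keep-alive window configured above
    print(m.get("name"), "expires at", m.get("expires_at"))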
Model Management
Download and manage models efficiently:
# Pull specific model versions
ollama pull llama3.2:3b-instruct-q4_K_M
# List installed models with sizes
ollama list
# Remove unused models to free space
ollama rm codellama:7b
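The same inventory is available over the API at /api/tags, which is handy when scripting cleanup across several hosts. A minimal sketch:
import requests

tags = requests.get("http://localhost:11434/api/tags", timeout=5).json()
for m in tags.get("models", []):
    print(f"{m['name']}: {m['size'] / 1e9:.1f} GB")   # size is reported in bytes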
API Integration
Ollama exposes its own REST API on port 11434, plus an OpenAI-compatible endpoint under /v1. Test the native API with curl:
curl http://localhost:11434/api/generate -d '{
"model": "llama3.2",
"prompt": "Explain Docker networking in 50 words",
"stream": false
}'
Python integration example:
import requests
response = requests.post('http://localhost:11434/api/generate', json={
'model': 'llama3.2',
'prompt': 'Generate a Prometheus alerting rule for high CPU',
'stream': False
})
print(response.json()['response'])
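Because Ollama also serves an OpenAI-compatible endpoint under /v1, existing OpenAI client code can be pointed at it. A sketch using the openai package (pip install openai); the api_key is required by the client but ignored by Ollama:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")  # key is unused locally
chat = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Explain Docker networking in 50 words"}],
)
print(chat.choices[0].message.content)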
⚠️ Caution: Always validate AI-generated system commands, Ansible playbooks, or infrastructure code before execution. LLMs can hallucinate incorrect syntax, deprecated flags, or dangerous operations. Test in isolated environments first.
Performance Tuning
Monitor resource usage with Prometheus node_exporter or check real-time stats:
# Watch GPU utilization (if available)
watch -n 1 nvidia-smi
# Monitor Ollama process
htop -p "$(pgrep -d, ollama)"
Adjust OLLAMA_NUM_PARALLEL based on your RAM: parallel requests share the loaded model weights, but each slot reserves its own context (KV cache) memory.
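A rough way to see the effect of OLLAMA_NUM_PARALLEL is to fire several requests at once and compare wall-clock time with the sequential case. The prompts and model below are placeholders; swap in whatever you have pulled.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

def generate(prompt):
    r = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama3.2", "prompt": prompt, "stream": False},
        timeout=300,
    )
    return r.json()["response"]

prompts = [f"Give one fact about the number {i}" for i in range(4)]
start = time.time()
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(generate, prompts))   # queued or parallel depending on OLLAMA_NUM_PARALLEL
print(f"4 concurrent requests took {time.time() - start:.1f}s")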
Verification and Testing
After installation, verify Ollama is running correctly before deploying models. Start by checking the service status:
systemctl status ollama
The output should show “active (running)” in green. If the service failed to start, check logs with journalctl -u ollama -n 50.
Pull a lightweight model to verify network connectivity and model download capabilities:
ollama pull qwen2.5:0.5b
This 397MB model downloads quickly and confirms your installation works. Test inference with a simple prompt:
ollama run qwen2.5:0.5b "Explain what Ollama does in one sentence"
You should receive a coherent response within seconds. Exit the interactive session by typing /bye.
API Endpoint Verification
Ollama exposes a REST API on port 11434. Test it with curl:
curl http://localhost:11434/api/generate -d '{
"model": "qwen2.5:0.5b",
"prompt": "Why is the sky blue?",
"stream": false
}'
A successful response returns JSON with the model’s answer in the response field.
Performance Baseline
Run a quick benchmark to establish baseline performance:
time ollama run qwen2.5:0.5b "Count from 1 to 10"
Note the execution time. On modern hardware (8-core CPU, 16GB RAM), expect responses under 3 seconds for this small model.
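For a less noisy number than wall-clock time, the final /api/generate response includes eval_count and eval_duration (in nanoseconds), from which you can compute tokens per second. A short sketch:
import requests

r = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "qwen2.5:0.5b", "prompt": "Count from 1 to 10", "stream": False},
    timeout=120,
).json()

seconds = r["eval_duration"] / 1e9                    # eval_duration is in nanoseconds
print(f"{r['eval_count']} tokens in {seconds:.2f}s ({r['eval_count'] / seconds:.1f} tok/s)")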
Integration Testing with Python
Verify programmatic access using the official Python library (install it with pip install ollama):
import ollama
response = ollama.chat(model='qwen2.5:0.5b', messages=[
{'role': 'user', 'content': 'What is 2+2?'}
])
print(response['message']['content'])
Caution: When using LLMs to generate system commands, always review output before execution. Models can hallucinate dangerous commands like rm -rf / or incorrect systemd configurations. Never pipe LLM output directly to bash without human verification, especially on production systems.