TL;DR
Running Llama 3 locally with Ollama on Linux takes about 5 minutes from start to finish. You’ll install Ollama, pull the model, and start chatting—all without sending data to external servers.
Quick Setup:
curl -fsSL https://ollama.com/install.sh | sh
# Pull Llama 3 (8B parameter version)
ollama pull llama3
# Start chatting
ollama run llama3
The 8B model requires ~5GB of disk space and about 8GB of RAM for the model itself (16GB of system RAM is a comfortable minimum). For the 70B version, plan on at least 40GB of disk space and 48GB of RAM. Ollama handles quantization automatically, so you don’t need to configure GGUF formats manually.
What You Get:
- Local API endpoint at http://localhost:11434, compatible with OpenAI’s format
- No internet required after the initial model download
- GPU acceleration automatically detected for NVIDIA and AMD cards
- Multi-model support - run multiple models simultaneously
Integration Example:
import requests

response = requests.post(
    'http://localhost:11434/api/generate',
    json={
        "model": "llama3",
        "prompt": "Explain Docker networking in 3 sentences",
        "stream": False
    }
)
print(response.json()['response'])
Common Use Cases:
- Code review and generation (integrate with VS Code via Continue.dev)
- Log analysis and troubleshooting
- Documentation summarization
- Infrastructure-as-code generation with Terraform/Ansible
⚠️ Caution: Always validate AI-generated system commands before execution. Llama 3 can hallucinate package names, file paths, or dangerous command combinations. Never pipe AI output directly to bash or sudo without manual review—especially on production systems.
This guide covers Ubuntu 22.04/24.04, but the process works identically on Fedora, Arch, and Debian-based distributions.
Why Run Llama 3 Locally with Ollama
Running Llama 3 locally with Ollama gives you complete control over your AI infrastructure without sending sensitive data to third-party APIs. Unlike cloud services that charge per token and log your prompts, local deployment means zero recurring costs after initial hardware investment and absolute privacy for your queries.
Ollama optimizes Llama 3 for CPU and GPU inference, delivering response times under 2 seconds on consumer hardware. At typical API pricing of $0.002 per 1K tokens, a workload of roughly 40 million tokens per month works out to about $1,000 per year, while your local setup has zero marginal cost beyond electricity, typically $15-30/month for a dedicated machine.
Privacy and Data Sovereignty
When you run ollama run llama3 locally, your prompts never leave your network. This matters for:
- Processing customer support tickets containing PII
- Analyzing proprietary codebases with tools like Continue.dev or Cody
- Running compliance-sensitive workloads in healthcare or finance
- Developing AI features without exposing trade secrets
Integration Flexibility
Ollama exposes an OpenAI-compatible API on localhost:11434, making it a drop-in replacement for cloud services:
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3",
"messages": [{"role": "user", "content": "Explain Docker networking"}]
}'
This works seamlessly with LangChain, Semantic Kernel, and custom Python scripts. You can integrate Llama 3 into Ansible playbooks for infrastructure documentation, Prometheus alert analysis, or CI/CD pipelines—all without external dependencies.
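Because the endpoint follows OpenAI’s format, the official openai Python package can talk to Ollama directly once you override its base URL. A minimal sketch (assumes pip install openai; the api_key value is a placeholder, since Ollama ignores it but the client requires one):
from openai import OpenAI

# Point the standard OpenAI client at the local Ollama endpoint
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")  # key is ignored, but required by the client

response = client.chat.completions.create(
    model="llama3",
    messages=[{"role": "user", "content": "Summarize what a reverse proxy does"}],
)
print(response.choices[0].message.content)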
Caution: When using AI to generate system commands, always review output before execution. LLMs can hallucinate dangerous commands like rm -rf / or incorrect iptables rules. Validate AI-generated Terraform configurations and Kubernetes manifests in staging environments first.
System Requirements and Hardware Considerations
Running Llama 3 locally requires careful hardware planning. The model comes in several sizes, and your system specs will determine which variant runs smoothly.
For Llama 3 8B (the smallest variant), you’ll need:
- RAM: 16GB minimum (8GB for model + 8GB for system overhead)
- Storage: 10GB free space for model files
- CPU: Modern multi-core processor (AMD Ryzen 5/Intel i5 or better)
- GPU: Optional but recommended - NVIDIA GPU with 8GB+ VRAM
For Llama 3 70B, requirements jump significantly:
- RAM: 64GB+ for CPU-only inference
- GPU: NVIDIA RTX 4090 (24GB VRAM) or multiple GPUs
- Storage: 50GB+ for model quantizations
Checking Your System
Verify your hardware before installation:
# Check available RAM
free -h
# Check GPU (NVIDIA)
nvidia-smi
# Check available disk space (Ollama stores models under /usr/share/ollama by default)
df -h /usr/share/ollama 2>/dev/null || df -h /
GPU Acceleration Considerations
Ollama automatically detects NVIDIA GPUs with CUDA support. For AMD GPUs, ROCm support varies by model. Check compatibility:
# Verify CUDA installation
nvcc --version
# Check GPU compute capability
nvidia-smi --query-gpu=compute_cap --format=csv
Quantization Trade-offs
Ollama uses quantized models to reduce memory requirements. The default 8B-Q4_K_M quantization runs well on 16GB systems, while 8B-Q8_0 provides slightly better output quality at roughly twice the memory footprint.
Performance tip: For production deployments, monitor resource usage with Prometheus and set up alerts when memory exceeds 80%. Use tools like htop or nvtop during initial testing to understand your baseline requirements.
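As a rough, local stand-in for that Prometheus alert while you’re still testing, a few lines of Python can apply the same 80% threshold (this sketch assumes the third-party psutil package is installed and is not a substitute for real monitoring):
import psutil  # pip install psutil

# Flag memory pressure using the same 80% threshold you'd alert on in Prometheus
mem = psutil.virtual_memory()
if mem.percent > 80:
    print(f"WARNING: memory at {mem.percent:.0f}% ({mem.used / 1e9:.1f} GB used)")
else:
    print(f"Memory OK: {mem.percent:.0f}% used")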
Caution: When using AI assistants to generate system monitoring scripts, always validate commands before execution—hallucinated rm or dd commands can destroy data.
Installing Ollama on Linux
Ollama provides a one-line installation script that works across most Linux distributions. The official installer handles dependencies and sets up systemd services automatically.
Download and run the installation script:
curl -fsSL https://ollama.com/install.sh | sh
This installs Ollama to /usr/local/bin/ollama and creates a systemd service that starts automatically. Verify the installation:
ollama --version
systemctl status ollama
Manual Installation for Air-Gapped Systems
For environments without internet access, download the binary on a connected machine and transfer it over (this example pins v0.1.29, which ships a standalone binary; newer releases package it as a tarball, ollama-linux-amd64.tgz, instead):
wget https://github.com/ollama/ollama/releases/download/v0.1.29/ollama-linux-amd64
sudo mv ollama-linux-amd64 /usr/local/bin/ollama
sudo chmod +x /usr/local/bin/ollama
Create the systemd service manually:
sudo tee /etc/systemd/system/ollama.service > /dev/null <<EOF
[Unit]
Description=Ollama Service
After=network-online.target
[Service]
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
[Install]
WantedBy=default.target
EOF
Create the service user and enable the service:
sudo useradd -r -s /bin/false -m -d /usr/share/ollama ollama
sudo systemctl daemon-reload
sudo systemctl enable --now ollama
Configuration Options
Ollama stores models in /usr/share/ollama/.ollama/models by default. To change the storage location, edit the systemd service:
sudo systemctl edit ollama
Add environment variables:
[Service]
Environment="OLLAMA_MODELS=/mnt/storage/ollama-models"
Environment="OLLAMA_HOST=0.0.0.0:11434"
Restart the service to apply changes:
sudo systemctl restart ollama
The API server now listens on port 11434 on all interfaces, ready to serve model requests across your network. Keep it restricted to your LAN or behind a firewall, since the API has no built-in authentication.
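A quick way to confirm remote clients can reach it is to query the tags endpoint, which lists installed models, from another machine on your LAN (192.168.1.50 below is a placeholder for your server’s address):
import requests

# Replace 192.168.1.50 with the address of your Ollama host
resp = requests.get('http://192.168.1.50:11434/api/tags', timeout=5)
for model in resp.json()['models']:
    print(model['name'])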
Downloading and Running Llama 3 Models
With Ollama installed, downloading Llama 3 models is straightforward. The ollama pull command fetches models from Ollama’s registry and stores them locally for offline use.
Ollama hosts several Llama 3 model sizes optimized for different hardware:
# Standard Llama 3 8B (requires ~8GB RAM)
ollama pull llama3
# Llama 3 70B (requires ~48GB RAM)
ollama pull llama3:70b
# Llama 3.1 with extended context (128K tokens)
ollama pull llama3.1
# Llama 3.2 Vision (multimodal capabilities)
ollama pull llama3.2-vision
The default llama3 tag pulls the 8B parameter model, suitable for most homelab setups with 16GB+ RAM.
Running Your First Inference
Once downloaded, start an interactive chat session:
ollama run llama3
For single-shot queries without entering chat mode:
ollama run llama3 "Explain Docker networking in 3 sentences"
API Integration
Ollama exposes an OpenAI-compatible API on localhost:11434. Test it with curl:
curl http://localhost:11434/api/generate -d '{
"model": "llama3",
"prompt": "Write a Prometheus alert rule for high CPU usage",
"stream": false
}'
⚠️ Caution: AI models can hallucinate incorrect system commands. Always validate generated Ansible playbooks, Terraform configurations, or shell scripts before execution. For production infrastructure, use AI as a drafting tool, then review with ansible-playbook --check or terraform plan.
Managing Models
List installed models:
ollama list
Remove unused models to free disk space:
ollama rm llama3:70b
Each model variant consumes 4-40GB of storage depending on parameter count. Monitor available space with df -h /usr/share/ollama before pulling large models.
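You can also script this check against the tags endpoint, which reports each installed model’s size in bytes; a small sketch:
import requests

# Summarize disk usage of installed models via the local Ollama API
models = requests.get('http://localhost:11434/api/tags').json()['models']
for m in models:
    print(f"{m['name']}: {m['size'] / 1e9:.1f} GB")
print(f"Total: {sum(m['size'] for m in models) / 1e9:.1f} GB")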
Configuration and Customization
Ollama allows fine-tuning model behavior through a Modelfile. Create one to adjust temperature, context window, and system prompts:
ollama show llama3 --modelfile > Modelfile
Edit the Modelfile to customize parameters:
FROM llama3
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
PARAMETER top_p 0.9
SYSTEM """You are a helpful Linux system administrator assistant. Provide accurate commands and explain potential risks."""
Apply your configuration:
ollama create llama3-custom -f Modelfile
Python Client
For Python applications, use the official client:
import ollama

response = ollama.chat(model='llama3', messages=[
    {'role': 'user', 'content': 'Explain Docker networking modes'}
])
print(response['message']['content'])
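The client also supports streaming, which is useful when you want tokens as they are generated rather than one final response; a minimal sketch:
import ollama

# stream=True yields response chunks as they arrive
stream = ollama.chat(
    model='llama3',
    messages=[{'role': 'user', 'content': 'Explain Docker networking modes'}],
    stream=True,
)
for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)
print()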
⚠️ Caution: Always validate AI-generated system commands before execution. LLMs can hallucinate dangerous operations like rm -rf / or incorrect iptables rules. Review output carefully, especially for privileged operations.
Persistent Configuration
Store environment variables in /etc/systemd/system/ollama.service.d/override.conf:
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_ORIGINS=http://192.168.1.0/24"
Environment="OLLAMA_NUM_PARALLEL=2"
Reload and restart:
sudo systemctl daemon-reload
sudo systemctl restart ollama
For infrastructure-as-code deployments, use Ansible to standardize configurations across multiple nodes. This ensures consistent model parameters and API settings in homelab clusters.
Verification and Testing
After installing Ollama and pulling Llama 3, verify your setup is working correctly before integrating it into your workflows.
Run a simple inference test to confirm Llama 3 responds:
ollama run llama3 "Explain what a Linux kernel module is in one sentence"
You should receive a coherent response within seconds. If the model hangs or returns errors, check journalctl -u ollama -f for service logs.
API Endpoint Verification
Test the REST API that your applications will use:
curl http://localhost:11434/api/generate -d '{
"model": "llama3",
"prompt": "Write a bash one-liner to find large files",
"stream": false
}'
The JSON response contains the generated text in the response field. This endpoint is what Open WebUI, custom scripts, and automation tools will call.
Performance Benchmarking
Measure token generation speed to establish baseline performance:
time ollama run llama3 "Generate a 500-word essay about container orchestration" > /dev/null
On a system with 32GB RAM and RTX 4070, expect 40-60 tokens/second for the 8B parameter model. Monitor GPU utilization with nvidia-smi during generation.
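For a number that doesn’t include model load time, you can read the counters Ollama returns with each non-streaming generate call; eval_count (tokens generated) divided by eval_duration (nanoseconds) gives tokens per second:
import requests

# Request a non-streaming response so the timing fields arrive in one JSON object
resp = requests.post('http://localhost:11434/api/generate', json={
    'model': 'llama3',
    'prompt': 'Generate a 500-word essay about container orchestration',
    'stream': False
}).json()

# eval_count = tokens generated; eval_duration = generation time in nanoseconds
tokens_per_second = resp['eval_count'] / (resp['eval_duration'] / 1e9)
print(f"{resp['eval_count']} tokens at {tokens_per_second:.1f} tokens/second")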
Integration Testing with Python
Create a simple test script to validate programmatic access:
import requests

response = requests.post('http://localhost:11434/api/generate', json={
    'model': 'llama3',
    'prompt': 'List three Linux security hardening steps',
    'stream': False
})
print(response.json()['response'])
⚠️ Caution: LLMs can hallucinate system commands. Always validate AI-generated bash scripts, Ansible playbooks, or infrastructure code before execution. Test in isolated environments first. Never pipe LLM output directly to bash or kubectl apply without human review—even locally-run models can produce plausible but incorrect or dangerous commands.