TL;DR
Running Llama 3 locally with Ollama on Linux takes about 5 minutes from start to finish. You’ll install Ollama, pull the model, and start chatting—all without sending data to external servers.
Quick Setup:
curl -fsSL https://ollama.com/install.sh | sh
# Pull Llama 3 (8B parameter version)
ollama pull llama3
# Start chatting
ollama run llama3
The 8B model requires ~5GB of disk space and about 8GB of RAM for the model itself (16GB of system RAM is a comfortable minimum). For the 70B version, plan on at least 40GB of disk space and 48GB of RAM. Ollama handles quantization automatically, so you don’t need to configure GGUF formats manually.
What You Get:
- Local API endpoint at http://localhost:11434, compatible with OpenAI’s format
- No internet required after the initial model download
- GPU acceleration automatically detected for NVIDIA and AMD cards
- Multi-model support - run multiple models simultaneously
Integration Example:
import requests

response = requests.post(
    'http://localhost:11434/api/generate',
    json={
        "model": "llama3",
        "prompt": "Explain Docker networking in 3 sentences",
        "stream": False
    }
)
print(response.json()['response'])
Common Use Cases:
- Code review and generation (integrate with VS Code via Continue.dev)
- Log analysis and troubleshooting
- Documentation summarization
- Infrastructure-as-code generation with Terraform/Ansible
⚠️ Caution: Always validate AI-generated system commands before execution. Llama 3 can hallucinate package names, file paths, or dangerous command combinations. Never pipe AI output directly to bash or sudo without manual review—especially on production systems.
This guide covers Ubuntu 22.04/24.04, but the process works identically on Fedora, Arch, and Debian-based distributions.
Why Run Llama 3 Locally with Ollama
Running Llama 3 locally with Ollama gives you complete control over your AI infrastructure without sending sensitive data to third-party APIs. Unlike cloud services that charge per token and log your prompts, local deployment means zero recurring costs after initial hardware investment and absolute privacy for your queries.
Ollama optimizes Llama 3 for CPU and GPU inference, delivering response times under 2 seconds on consumer hardware. At typical API pricing of $0.002 per 1K tokens, a workload of roughly 40 million tokens per month works out to about $1,000 per year, while your local setup has zero marginal cost beyond electricity, typically $15-30/month for a dedicated machine.
Privacy and Data Sovereignty
When you run ollama run llama3 locally, your prompts never leave your network. This matters for:
- Processing customer support tickets containing PII
- Analyzing proprietary codebases with tools like Continue.dev or Cody
- Running compliance-sensitive workloads in healthcare or finance
- Developing AI features without exposing trade secrets
Integration Flexibility
Ollama exposes an OpenAI-compatible API on localhost:11434, making it a drop-in replacement for cloud services:
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3",
"messages": [{"role": "user", "content": "Explain Docker networking"}]
}'
This works seamlessly with LangChain, Semantic Kernel, and custom Python scripts. You can integrate Llama 3 into Ansible playbooks for infrastructure documentation, Prometheus alert analysis, or CI/CD pipelines—all without external dependencies.
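Because the endpoint follows OpenAI’s format, the official openai Python package can talk to Ollama directly once you override its base URL. A minimal sketch (assumes pip install openai; the api_key value is a placeholder, since Ollama ignores it but the client requires one):
from openai import OpenAI

# Point the standard OpenAI client at the local Ollama endpoint
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")  # key is ignored, but required by the client

response = client.chat.completions.create(
    model="llama3",
    messages=[{"role": "user", "content": "Summarize what a reverse proxy does"}],
)
print(response.choices[0].message.content)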
Caution: When using AI to generate system commands, always review output before execution. LLMs can hallucinate dangerous commands like rm -rf / or incorrect iptables rules. Validate AI-generated Terraform configurations and Kubernetes manifests in staging environments first.
System Requirements and Hardware Considerations
Running Llama 3 locally requires careful hardware planning. The model comes in several sizes, and your system specs will determine which variant runs smoothly.
For Llama 3 8B (the smallest variant), you’ll need:
- RAM: 16GB minimum (8GB for model + 8GB for system overhead)
- Storage: 10GB free space for model files
- CPU: Modern multi-core processor (AMD Ryzen 5/Intel i5 or better)
- GPU: Optional but recommended - NVIDIA GPU with 8GB+ VRAM
For Llama 3 70B, requirements jump significantly:
- RAM: 64GB+ for CPU-only inference
- GPU: NVIDIA RTX 4090 (24GB VRAM) or multiple GPUs
- Storage: 50GB+ for model quantizations
Checking Your System
Verify your hardware before installation:
# Check available RAM
free -h
# Check GPU (NVIDIA)
nvidia-smi
# Check available disk space (Ollama stores models under /usr/share/ollama by default)
df -h /usr/share/ollama 2>/dev/null || df -h /
GPU Acceleration Considerations
Ollama automatically detects NVIDIA GPUs with CUDA support. For AMD GPUs, ROCm support varies by model. Check compatibility:
# Verify CUDA installation
nvcc --version
# Check GPU compute capability
nvidia-smi --query-gpu=compute_cap --format=csv
Quantization Trade-offs
Ollama uses quantized models to reduce memory requirements. The default 8B-Q4_K_M quantization runs well on 16GB systems, while 8B-Q8_0 provides slightly better output quality at roughly twice the memory footprint.
Performance tip: For production deployments, monitor resource usage with Prometheus and set up alerts when memory exceeds 80%. Use tools like htop or nvtop during initial testing to understand your baseline requirements.
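As a rough, local stand-in for that Prometheus alert while you’re still testing, a few lines of Python can apply the same 80% threshold (this sketch assumes the third-party psutil package is installed and is not a substitute for real monitoring):
import psutil  # pip install psutil

# Flag memory pressure using the same 80% threshold you'd alert on in Prometheus
mem = psutil.virtual_memory()
if mem.percent > 80:
    print(f"WARNING: memory at {mem.percent:.0f}% ({mem.used / 1e9:.1f} GB used)")
else:
    print(f"Memory OK: {mem.percent:.0f}% used")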
Caution: When using AI assistants to generate system monitoring scripts, always validate commands before execution—hallucinated rm or dd commands can destroy data.
Installing Ollama on Linux
Ollama provides a one-line installation script that works across most Linux distributions. The official installer handles dependencies and sets up systemd services automatically.
Download and run the installation script:
curl -fsSL https://ollama.com/install.sh | sh
This installs Ollama to /usr/local/bin/ollama and creates a systemd service that starts automatically. Verify the installation:
ollama --version
systemctl status ollama
Manual Installation for Air-Gapped Systems
For environments without internet access, download the binary on a connected machine and transfer it over (this example pins v0.1.29, which ships a standalone binary; newer releases package it as a tarball, ollama-linux-amd64.tgz, instead):
wget https://github.com/ollama/ollama/releases/download/v0.1.29/ollama-linux-amd64
sudo mv ollama-linux-amd64 /usr/local/bin/ollama
sudo chmod +x /usr/local/bin/ollama
Create the systemd service manually:
sudo tee /etc/systemd/system/ollama.service > /dev/null <<EOF
[Unit]
Description=Ollama Service
After=network-online.target
[Service]
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
[Install]
WantedBy=default.target
EOF
Create the service user and enable the service:
sudo useradd -r -s /bin/false -m -d /usr/share/ollama ollama
sudo systemctl daemon-reload
sudo systemctl enable --now ollama
Configuration Options
Ollama stores models in /usr/share/ollama/.ollama/models by default. To change the storage location, edit the systemd service:
sudo systemctl edit ollama
Add environment variables:
[Service]
Environment="OLLAMA_MODELS=/mnt/storage/ollama-models"
Environment="OLLAMA_HOST=0.0.0.0:11434"
Restart the service to apply changes:
sudo systemctl restart ollama
The API server now listens on port 11434 on all interfaces, ready to serve model requests across your network. Keep it restricted to your LAN or behind a firewall, since the API has no built-in authentication.
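A quick way to confirm remote clients can reach it is to query the tags endpoint, which lists installed models, from another machine on your LAN (192.168.1.50 below is a placeholder for your server’s address):
import requests

# Replace 192.168.1.50 with the address of your Ollama host
resp = requests.get('http://192.168.1.50:11434/api/tags', timeout=5)
for model in resp.json()['models']:
    print(model['name'])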
Downloading and Running Llama 3 Models
With Ollama installed, downloading Llama 3 models is straightforward. The ollama pull command fetches models from Ollama’s registry and stores them locally for offline use.
Ollama hosts several Llama 3 model sizes optimized for different hardware:
# Standard Llama 3 8B (requires ~8GB RAM)
ollama pull llama3
# Llama 3 70B (requires ~48GB RAM)
ollama pull llama3:70b
# Llama 3.1 with extended context (128K tokens)
ollama pull llama3.1
# Llama 3.2 Vision (multimodal capabilities)
ollama pull llama3.2-vision
The default llama3 tag pulls the 8B parameter model, suitable for most homelab setups with 16GB+ RAM.
Running Your First Inference
Once downloaded, start an interactive chat session:
ollama run llama3
For single-shot queries without entering chat mode:
ollama run llama3 "Explain Docker networking in 3 sentences"
API Integration
Ollama exposes an OpenAI-compatible API on localhost:11434. Test it with curl:
curl http://localhost:11434/api/generate -d '{
"model": "llama3",
"prompt": "Write a Prometheus alert rule for high CPU usage",
"stream": false
}'
⚠️ Caution: AI models can hallucinate incorrect system commands. Always validate generated Ansible playbooks, Terraform configurations, or shell scripts before execution. For production infrastructure, use AI as a drafting tool, then review with ansible-playbook --check or terraform plan.
Managing Models
List installed models:
ollama list
Remove unused models to free disk space:
ollama rm llama3:70b
Each model variant consumes 4-40GB of storage depending on parameter count. Monitor available space with df -h /usr/share/ollama before pulling large models.
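You can also script this check against the tags endpoint, which reports each installed model’s size in bytes; a small sketch:
import requests

# Summarize disk usage of installed models via the local Ollama API
models = requests.get('http://localhost:11434/api/tags').json()['models']
for m in models:
    print(f"{m['name']}: {m['size'] / 1e9:.1f} GB")
print(f"Total: {sum(m['size'] for m in models) / 1e9:.1f} GB")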
Configuration and Customization
Ollama allows fine-tuning model behavior through a Modelfile. Create one to adjust temperature, context window, and system prompts:
ollama show llama3 --modelfile > Modelfile
Edit the Modelfile to customize parameters:
FROM llama3
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
PARAMETER top_p 0.9
SYSTEM """You are a helpful Linux system administrator assistant. Provide accurate commands and explain potential risks."""
Apply your configuration:
ollama create llama3-custom -f Modelfile
Python Client
For Python applications, use the official client:
import ollama

response = ollama.chat(model='llama3', messages=[
    {'role': 'user', 'content': 'Explain Docker networking modes'}
])
print(response['message']['content'])
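The client also supports streaming, which is useful when you want tokens as they are generated rather than one final response; a minimal sketch:
import ollama

# stream=True yields response chunks as they arrive
stream = ollama.chat(
    model='llama3',
    messages=[{'role': 'user', 'content': 'Explain Docker networking modes'}],
    stream=True,
)
for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)
print()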
⚠️ Caution: Always validate AI-generated system commands before execution. LLMs can hallucinate dangerous operations like rm -rf / or incorrect iptables rules. Review output carefully, especially for privileged operations.
Persistent Configuration
Store environment variables in /etc/systemd/system/ollama.service.d/override.conf:
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_ORIGINS=http://192.168.1.0/24"
Environment="OLLAMA_NUM_PARALLEL=2"
Reload and restart:
sudo systemctl daemon-reload
sudo systemctl restart ollama
For infrastructure-as-code deployments, use Ansible to standardize configurations across multiple nodes. This ensures consistent model parameters and API settings in homelab clusters.
Verification and Testing
After installing Ollama and pulling Llama 3, verify your setup is working correctly before integrating it into your workflows.
Run a simple inference test to confirm Llama 3 responds:
ollama run llama3 "Explain what a Linux kernel module is in one sentence"
You should receive a coherent response within seconds. If the model hangs or returns errors, check journalctl -u ollama -f for service logs.
API Endpoint Verification
Test the REST API that your applications will use:
curl http://localhost:11434/api/generate -d '{
"model": "llama3",
"prompt": "Write a bash one-liner to find large files",
"stream": false
}'
The JSON response contains the generated text in the response field. This endpoint is what Open WebUI, custom scripts, and automation tools will call.
Performance Benchmarking
Measure token generation speed to establish baseline performance:
time ollama run llama3 "Generate a 500-word essay about container orchestration" > /dev/null
On a system with 32GB RAM and RTX 4070, expect 40-60 tokens/second for the 8B parameter model. Monitor GPU utilization with nvidia-smi during generation.
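For a number that doesn’t include model load time, you can read the counters Ollama returns with each non-streaming generate call; eval_count (tokens generated) divided by eval_duration (nanoseconds) gives tokens per second:
import requests

# Request a non-streaming response so the timing fields arrive in one JSON object
resp = requests.post('http://localhost:11434/api/generate', json={
    'model': 'llama3',
    'prompt': 'Generate a 500-word essay about container orchestration',
    'stream': False
}).json()

# eval_count = tokens generated; eval_duration = generation time in nanoseconds
tokens_per_second = resp['eval_count'] / (resp['eval_duration'] / 1e9)
print(f"{resp['eval_count']} tokens at {tokens_per_second:.1f} tokens/second")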
Integration Testing with Python
Create a simple test script to validate programmatic access:
import requests

response = requests.post('http://localhost:11434/api/generate', json={
    'model': 'llama3',
    'prompt': 'List three Linux security hardening steps',
    'stream': False
})
print(response.json()['response'])
⚠️ Caution: LLMs can hallucinate system commands. Always validate AI-generated bash scripts, Ansible playbooks, or infrastructure code before execution. Test in isolated environments first. Never pipe LLM output directly to bash or kubectl apply without human review—even locally-run models can produce plausible but incorrect or dangerous commands.