TL;DR

Running LLMs locally gives you privacy, control, and cost savings compared to cloud APIs. This guide covers deploying production-ready local AI infrastructure with Ollama and llama.cpp.

Both tools use GGUF format models with quantization to run efficiently on consumer hardware. Ollama provides a simple REST API and automatic model management, while llama.cpp offers fine-grained control and bleeding-edge features. You can run a 7B parameter model in 4-6GB RAM using Q4_K_M quantization, or larger models with GPU acceleration.

Key concepts: temperature controls randomness (0.1-0.9), context window sets memory length (2048-128000 tokens), quantization balances quality vs size (Q4_K_M recommended), and GPU layer offloading improves speed. Install Ollama in minutes, or build llama.cpp from source for custom hardware optimization.

Caution: When using AI assistants to generate commands or configurations, always validate against official documentation before production deployment. Test parameter combinations with your specific hardware and workload first.

Understanding Local LLM Deployment

Local LLM deployment means running language models on your own hardware instead of calling cloud APIs. This approach keeps data private, eliminates per-token costs, and lets you customize model behavior without vendor restrictions.

Two main tools dominate local deployment: Ollama provides a user-friendly REST API with automatic model management, while llama.cpp offers maximum control through a C++ inference engine. Both use GGUF format models – quantized weights optimized for CPU and consumer GPU inference.

Why Run Models Locally

Privacy is the primary driver for self-hosting. Medical records, customer data, and proprietary code never leave your network. You control logging, auditing, and access without trusting third-party privacy policies.

Cost savings matter for high-volume workloads. After initial hardware investment, inference costs nothing beyond electricity. A $2000 GPU running 24/7 costs roughly $50/month in power, far less than equivalent API usage for busy applications.
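
The power-cost claim above is simple arithmetic; a sketch with assumed figures (roughly a 350 W card at $0.20/kWh; substitute your own draw and local electricity price):

```python
# Back-of-envelope monthly power cost for a GPU at sustained load.
# Both figures below are assumptions; adjust for your card and region.
watts = 350
price_per_kwh = 0.20
kwh_per_month = watts / 1000 * 24 * 30        # ~252 kWh
monthly_cost = kwh_per_month * price_per_kwh
print(f"~${monthly_cost:.0f}/month")          # ~$50/month at these figures
```

Compare this against your projected API bill at current per-token prices to decide whether the hardware pays for itself.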

Customization enables fine-tuning and parameter adjustments impossible with managed services. You can modify system prompts, adjust temperature per request, or run custom model variants trained on your specific data.

Hardware Requirements

A 7B parameter model at Q4_K_M quantization needs 4-6GB RAM for CPU-only inference. Add 8GB+ VRAM for GPU acceleration. 13B models require 12-16GB RAM (CPU) or 10-12GB VRAM (GPU). 70B models need 48GB+ RAM or multiple GPUs.

Modern CPUs with AVX2 support run smaller models adequately. GPU acceleration becomes essential for 13B+ models or high-throughput scenarios. NVIDIA GPUs work best due to mature CUDA support, though AMD ROCm and Apple Metal backends exist.

Storage requirements depend on model count and quantization. A Q4_K_M 7B model consumes 4-5GB disk space. Q8_0 variants double this. Plan for 50-100GB if running multiple model variants.

Installing Ollama

Ollama wraps a llama.cpp-based inference engine with automatic model downloads, service management, and a REST API. Installation takes under a minute on most Linux systems.

Run the official installer:

curl -fsSL https://ollama.com/install.sh | sh

This installs the ollama binary, creates a systemd service, and starts the API server on port 11434. Verify installation:

ollama list
curl http://localhost:11434/api/tags

Pull your first model to test:

ollama pull llama3.2:3b
ollama run llama3.2:3b "Explain Docker networking"

Configuration

Customize behavior with a systemd drop-in (sudo systemctl edit ollama) rather than editing the installed unit file directly:

[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_MODELS=/mnt/storage/models"
Environment="OLLAMA_KEEP_ALIVE=30m"

Restart after changes:

sudo systemctl daemon-reload
sudo systemctl restart ollama

The OLLAMA_HOST variable controls the bind address; use 0.0.0.0 to accept network connections. OLLAMA_MODELS sets the model storage location. OLLAMA_KEEP_ALIVE controls how long a model stays loaded in memory after its last request.

Building llama.cpp from Source

llama.cpp gives you control over hardware acceleration, optimization flags, and experimental features. Pre-built binaries work for basic use, but compiling from source unlocks GPU support and architecture-specific optimizations.

Clone the repository:

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

For CPU-only builds:

cmake -B build
cmake --build build --config Release

For NVIDIA GPU support:

cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release

For AMD GPUs:

cmake -B build -DGGML_HIP=ON
cmake --build build --config Release

For Apple Silicon (Metal is enabled by default on macOS; the flag makes it explicit):

cmake -B build -DGGML_METAL=ON
cmake --build build --config Release

Verify the build:

./build/bin/llama-cli --version
./build/bin/llama-server --version

The build process takes several minutes depending on CPU. Compilation enables CPU instruction sets like AVX2 and AVX-512 automatically based on your hardware.

Understanding LLM Parameters

LLM parameters control inference behavior – randomness, output length, token selection, and context handling. Tuning these settings optimizes models for specific tasks.

Temperature

Temperature controls randomness in token selection. Range: 0.0-2.0.

  • 0.0-0.3: Deterministic, focused output (code generation, factual queries)
  • 0.5-0.7: Balanced creativity and coherence (general chat)
  • 0.8-1.2: Creative, varied output (brainstorming, creative writing)

Lower temperatures make the model choose the most likely tokens consistently. Higher values increase randomness, making output less predictable but more creative.

Example:

# Ollama API
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Write code to parse JSON",
  "options": {"temperature": 0.2}
}'

# llama-server
./llama-server -m model.gguf --temp 0.2

Top-p (Nucleus Sampling)

Top-p limits token selection to the smallest set whose cumulative probability exceeds the threshold. Range: 0.1-1.0.

A top-p of 0.9 considers only tokens representing 90% probability mass. This produces more coherent output than temperature alone by excluding unlikely tokens entirely.

Typical values: 0.85-0.95 for most tasks.
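
The cutoff logic is easy to see in code. A minimal illustration of nucleus sampling (the token probabilities are invented for the example):

```python
# Nucleus (top-p) sampling sketch: keep the smallest set of
# highest-probability tokens whose cumulative mass reaches p,
# then renormalize over that set.
def nucleus(probs, p=0.9):
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, mass = [], 0.0
    for token, prob in ranked:
        kept.append(token)
        mass += prob
        if mass >= p:
            break
    return {t: probs[t] / mass for t in kept}   # renormalized distribution

dist = nucleus({"the": 0.50, "a": 0.30, "dog": 0.15, "xylophone": 0.05}, p=0.9)
print(dist)   # "xylophone" is excluded: the top three tokens already cover 95% mass
```

In practice the runtime applies this over the full vocabulary at every step; low-probability junk tokens never get a chance to be sampled.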

Context Window

Context window defines how much conversation history the model remembers, measured in tokens. Most models support 2048-8192 by default. Some extend to 32k-128k tokens.

Longer contexts consume more VRAM and slow inference. Use the minimum context needed for your task:

# Ollama
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "options": {"num_ctx": 4096}
}'

# llama-server
./llama-server -m model.gguf --ctx-size 4096

Max Tokens (Output Length)

Controls maximum output length. Prevents runaway generation:

# Ollama
"options": {"num_predict": 512}

# llama-server
./llama-server -m model.gguf -n 512

Quantization Explained

Quantization reduces model size by representing weights with fewer bits. GGUF models use schemes like Q4_0, Q4_K_M, Q5_K_M, and Q8_0.

Quantization Levels

  • Q4_0: Smallest, fastest, lowest quality (legacy format)
  • Q4_K_M: Best balance for most use cases (recommended starting point)
  • Q5_K_M: Better quality, 20% larger than Q4_K_M
  • Q8_0: Near full precision, double the size of Q4_K_M

For a 7B model:

  • Q4_K_M: ~4GB RAM
  • Q5_K_M: ~5GB RAM
  • Q8_0: ~7-8GB RAM
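
These footprints follow from parameters × bits-per-weight; a rough estimator (the effective bits per scheme are approximations, since K-quants mix precisions across tensors):

```python
# Approximate GGUF size: parameter count times average bits per weight,
# divided by 8. The bits-per-weight figures are rough assumptions.
def approx_gb(params_billion, bits_per_weight):
    return params_billion * bits_per_weight / 8   # 1e9 params and 1e9 bytes cancel

for scheme, bits in [("Q4_K_M", 4.8), ("Q5_K_M", 5.7), ("Q8_0", 8.5)]:
    print(f"{scheme}: ~{approx_gb(7, bits):.1f} GB")
# Q4_K_M: ~4.2 GB, Q5_K_M: ~5.0 GB, Q8_0: ~7.4 GB
```

Add a gigabyte or two on top for KV cache and runtime buffers when budgeting RAM or VRAM.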

Choosing Quantization

Start with Q4_K_M. Upgrade to Q5_K_M or Q8_0 only if quality issues appear in your specific use case. Test with actual prompts from your application.

# Ollama pulls a default quantization (Q4_K_M for most current models)
ollama pull llama3.1:8b

# Specify quantization explicitly
ollama pull llama3.1:8b-instruct-q5_K_M

For llama.cpp, download GGUF files from Hugging Face with explicit quantization in filename:

wget https://huggingface.co/TheBloke/Llama-2-7B-GGUF/resolve/main/llama-2-7b.Q4_K_M.gguf

GPU Offloading and Memory Management

GPU offloading moves model layers from CPU to GPU memory, dramatically improving inference speed. Control how many layers run on GPU to balance speed and VRAM usage.

Ollama GPU Configuration

Ollama automatically calculates layer distribution based on available VRAM and handles splitting transparently. To force CPU-only inference, set the num_gpu option (the GPU layer count) to 0 per request or in a Modelfile:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Hello",
  "options": {"num_gpu": 0}
}'

llama.cpp GPU Layers

Use the -ngl (or --n-gpu-layers) flag to set GPU layer count explicitly:

./llama-server -m model.gguf -ngl 35 --ctx-size 4096

Start with -ngl 35 for 7B models on 8GB VRAM cards. Increase for larger VRAM, decrease if you hit OOM errors.

Monitor VRAM usage:

# NVIDIA
watch -n 1 nvidia-smi

# AMD
watch -n 1 rocm-smi

Memory Optimization

If you run out of memory:

  1. Reduce GPU layers (-ngl)
  2. Use lower quantization (Q4_K_M instead of Q5_K_M)
  3. Decrease context window (--ctx-size)
  4. Close other applications using VRAM

Test configurations incrementally. A model that runs with short prompts may exhaust memory with long context windows.
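
One way to pick a starting -ngl value is to split the model size evenly across layers and count how many fit in free VRAM. This is a rough heuristic under stated assumptions, not an official formula; real per-layer sizes vary:

```python
# Rough -ngl starting point: assume equally sized layers and reserve
# headroom for KV cache and runtime buffers. All figures are estimates.
def gpu_layers(model_gb, n_layers, vram_gb, headroom_gb=1.5):
    per_layer_gb = model_gb / n_layers
    return max(0, min(n_layers, int((vram_gb - headroom_gb) / per_layer_gb)))

# 7B Q4_K_M (~4.1 GB, 32 transformer layers):
print(gpu_layers(4.1, 32, vram_gb=8.0))   # everything fits on an 8 GB card
print(gpu_layers(4.1, 32, vram_gb=4.0))   # partial offload on a 4 GB card
```

Treat the result as a first guess, then nudge -ngl up or down while watching nvidia-smi.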

Running Models with Ollama

Ollama provides a simple CLI and REST API for model interaction. The service runs as a systemd daemon, handling model loading and inference requests.

Command Line Usage

Run a model interactively:

ollama run llama3.1:8b

Single prompt execution:

ollama run llama3.1:8b "Explain Docker networking"

List downloaded models:

ollama list

Remove a model:

ollama rm llama3.1:8b

REST API Usage

Generate completions:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Explain Docker networking",
  "options": {
    "temperature": 0.7,
    "top_p": 0.9,
    "num_predict": 512
  }
}'

Chat format (multi-turn conversations):

curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [
    {"role": "user", "content": "What is Docker?"},
    {"role": "assistant", "content": "Docker is a containerization platform..."},
    {"role": "user", "content": "How does networking work?"}
  ]
}'

Creating Custom Models with Modelfiles

Define persistent parameter defaults:

cat > Modelfile <<EOF
FROM llama3.2
PARAMETER temperature 0.3
PARAMETER top_p 0.85
PARAMETER num_predict 1024
SYSTEM You are a technical documentation assistant specializing in Linux.
EOF

ollama create llama3.2-docs -f Modelfile
ollama run llama3.2-docs

This approach works well for CI/CD integration requiring consistent model behavior.

Running Models with llama.cpp

llama-server provides an OpenAI-compatible HTTP API for inference. The server loads models into memory and serves requests on a specified port.

Starting the Server

Basic server startup:

./llama-server -m models/llama-2-7b.Q4_K_M.gguf --port 8080

With GPU acceleration and custom parameters:

./llama-server \
  -m models/mistral-7b-instruct.Q4_K_M.gguf \
  --ctx-size 4096 \
  --threads 8 \
  --n-gpu-layers 35 \
  --port 8080 \
  --host 0.0.0.0

Parameters:

  • -m: Model file path
  • --ctx-size: Context window size
  • --threads: CPU thread count
  • -ngl: GPU layers to offload
  • --port: Server port
  • --host: Bind address

API Requests

Chat completion endpoint (OpenAI-compatible):

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistral-7b-instruct",
    "messages": [{"role": "user", "content": "Explain Docker networking"}],
    "temperature": 0.7,
    "top_p": 0.9,
    "max_tokens": 512
  }'

The OpenAI-compatible API means you can point existing tools expecting OpenAI format to http://localhost:8080/v1 as the base URL.
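
A stdlib-only Python sketch of the same request (assumes llama-server is listening on localhost:8080, as started above; uncomment the last two lines to run it against a live server):

```python
import json
import urllib.request

# Build an OpenAI-format chat completion request for a local llama-server.
req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps({
        "messages": [{"role": "user", "content": "Explain Docker networking"}],
        "temperature": 0.7,
        "max_tokens": 512,
    }).encode(),
    headers={"Content-Type": "application/json"},
)
print(json.loads(req.data)["max_tokens"])        # request body is valid JSON → 512
# With a running server, send it and read the reply (OpenAI response schema):
# resp = json.load(urllib.request.urlopen(req))
# print(resp["choices"][0]["message"]["content"])
```

The same compatibility means official OpenAI client libraries work by overriding their base URL.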

Downloading Models from Hugging Face

Hugging Face hosts thousands of open-weight models. For local deployment, you need GGUF format files.

Finding Models

Browse huggingface.co/models and filter by the “gguf” tag. Popular model families:

  • Llama (Meta)
  • Mistral
  • Phi (Microsoft)
  • Qwen (Alibaba)

Check the Files tab for available quantization levels. Look for Q4_K_M or Q5_K_M variants.

Downloading with wget

Get the direct download URL from the Files tab:

wget https://huggingface.co/TheBloke/Llama-2-7B-GGUF/resolve/main/llama-2-7b.Q4_K_M.gguf

Using Git LFS

For full repository clones (note that this downloads every quantization variant in the repository, often tens of gigabytes):

sudo apt install git-lfs
git lfs install
git clone https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF

Importing into Ollama

Create a Modelfile referencing the downloaded GGUF:

cat > Modelfile <<EOF
FROM ./llama-2-7b.Q4_K_M.gguf
PARAMETER temperature 0.7
EOF

ollama create my-llama2 -f Modelfile
ollama run my-llama2

Production Deployment Considerations

Running local LLMs in production requires planning for reliability, security, and performance.

Resource Monitoring

Monitor CPU, RAM, and VRAM usage continuously:

# Install monitoring tools (nvidia-smi ships with the NVIDIA driver / nvidia-utils package)
sudo apt install htop

# Watch resources
htop
watch -n 1 nvidia-smi

Set up alerts for:

  • VRAM exhaustion
  • CPU throttling
  • Disk space (model storage)
  • API response latency

Security Hardening

Bind Ollama/llama-server to localhost only for single-machine use:

# Ollama
Environment="OLLAMA_HOST=127.0.0.1:11434"

# llama-server
./llama-server --host 127.0.0.1

For network access, use reverse proxy with authentication:

location /api/ {
  auth_basic "LLM API";
  auth_basic_user_file /etc/nginx/.htpasswd;
  proxy_pass http://localhost:11434/;
}

Performance Tuning

Start with conservative settings and tune based on actual workload:

  1. Temperature: 0.3 for technical tasks, 0.7 for general chat
  2. Context: Minimum needed (4096 for most tasks)
  3. GPU layers: Maximum your VRAM allows
  4. Threads: Match CPU core count

Test parameter combinations with representative prompts before production deployment.
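
The thread rule can be automated. A small sketch that derives a conservative launch command from the machine itself ("model.gguf" and the -ngl value are placeholders for your setup):

```python
import os
import shlex

# Derive conservative llama-server flags from this machine.
# "model.gguf" and "-ngl 35" are placeholder assumptions.
threads = os.cpu_count() or 4          # fall back if the count is unknown
cmd = ["./llama-server", "-m", "model.gguf",
       "--threads", str(threads),
       "--ctx-size", "4096",
       "-ngl", "35"]
print(shlex.join(cmd))
```

Hard-coding the detected values into a service file beats recomputing them on every start, but this makes the initial choice explicit and repeatable.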

Backup and Model Management

Store models on separate storage from system disk. Use version control for Modelfiles:

# Custom model storage
Environment="OLLAMA_MODELS=/mnt/storage/models"

Document which quantization levels you’re using and why. Keep checksums for model files to detect corruption.
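
Checksums can come from sha256sum, or from a few lines of Python. The snippet below streams the hash in chunks so multi-gigabyte GGUFs never load into RAM; it hashes a throwaway stand-in file so it runs anywhere (point it at your real model paths):

```python
import hashlib
import tempfile

# Stream-hash a file in 1 MiB chunks to keep memory use flat.
def sha256_of(path, chunk=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

# Demo on a temporary file standing in for a model:
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"hello")
print(sha256_of(f.name))
```

Store the digests alongside the Modelfiles in version control and re-verify after any storage migration.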

Troubleshooting Common Issues

Out of Memory Errors

Symptom: Model crashes or refuses to load.

Solutions:

  1. Reduce GPU layers: -ngl 20 instead of -ngl 35
  2. Use lower quantization: Q4_K_M instead of Q5_K_M
  3. Decrease context: --ctx-size 2048 instead of 4096
  4. Close other applications using VRAM

Slow Inference

Symptom: Tokens generate slowly (< 5 tokens/second).

Solutions:

  1. Increase GPU layers if VRAM available
  2. Reduce context window
  3. Use lower quantization (faster but lower quality)
  4. Check CPU isn’t throttling (monitor temperature)

Model Not Found

Symptom: Ollama can’t find downloaded model.

Solutions:

# Check model location
ollama list

# Verify storage path
echo $OLLAMA_MODELS

# Re-pull model
ollama pull llama3.1:8b

API Connection Refused

Symptom: Can’t connect to Ollama/llama-server API.

Solutions:

# Check service status
systemctl status ollama

# Verify port binding
sudo netstat -tlnp | grep 11434

# Test locally first
curl http://localhost:11434/api/tags

Further Reading

This guide covers the core concepts for running local LLMs. For specialized topics, consult the official Ollama and llama.cpp documentation.

Caution: When using AI assistants to generate deployment commands, always validate against official documentation. Test configurations in development environments before production deployment.