TL;DR
Running LLMs locally gives you privacy, control, and cost savings compared to cloud APIs. This guide covers deploying production-ready local AI infrastructure using Ollama and llama.cpp.
Both tools use GGUF format models with quantization to run efficiently on consumer hardware. Ollama provides a simple REST API and automatic model management, while llama.cpp offers fine-grained control and bleeding-edge features. You can run a 7B parameter model in 4-6GB RAM using Q4_K_M quantization, or larger models with GPU acceleration.
Key concepts: temperature controls randomness (0.0-2.0, typically 0.1-1.2 in practice), context window sets memory length (2048-128,000 tokens), quantization balances quality vs size (Q4_K_M recommended), and GPU layer offloading improves speed. Install Ollama in minutes, or build llama.cpp from source for custom hardware optimization.
Caution: When using AI assistants to generate commands or configurations, always validate against official documentation before production deployment. Test parameter combinations with your specific hardware and workload first.
Understanding Local LLM Deployment
Local LLM deployment means running language models on your own hardware instead of calling cloud APIs. This approach keeps data private, eliminates per-token costs, and lets you customize model behavior without vendor restrictions.
Two main tools dominate local deployment: Ollama provides a user-friendly REST API with automatic model management, while llama.cpp offers maximum control through a C++ inference engine. Both use GGUF format models – quantized weights optimized for CPU and consumer GPU inference.
Why Run Models Locally
Privacy is the primary driver for self-hosting. Medical records, customer data, and proprietary code never leave your network. You control logging, auditing, and access without trusting third-party privacy policies.
Cost savings matter for high-volume workloads. After initial hardware investment, inference costs nothing beyond electricity. A $2000 GPU running 24/7 costs roughly $50/month in power, far less than equivalent API usage for busy applications.
Customization enables fine-tuning and parameter adjustments impossible with managed services. You can modify system prompts, adjust temperature per request, or run custom model variants trained on your specific data.
Hardware Requirements
A 7B parameter model at Q4_K_M quantization needs 4-6GB RAM for CPU-only inference. Add 8GB+ VRAM for GPU acceleration. 13B models require 12-16GB RAM (CPU) or 10-12GB VRAM (GPU). 70B models need 48GB+ RAM or multiple GPUs.
Modern CPUs with AVX2 support run smaller models adequately. GPU acceleration becomes essential for 13B+ models or high-throughput scenarios. NVIDIA GPUs work best due to mature CUDA support, though AMD ROCm and Apple Metal backends exist.
Storage requirements depend on model count and quantization. A Q4_K_M 7B model consumes 4-5GB disk space. Q8_0 variants double this. Plan for 50-100GB if running multiple model variants.
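These figures follow from a back-of-envelope formula: parameter count times average bits per weight, plus runtime overhead. A minimal sketch in Python (the bits-per-weight and overhead values are rough assumptions, not measured constants):

```python
def model_footprint_gb(params_billions: float, bits_per_weight: float,
                       overhead: float = 1.2) -> float:
    """Rough size estimate: parameters x bits/8 bytes, plus ~20%
    assumed overhead for KV cache, activations, and metadata."""
    bytes_total = params_billions * 1e9 * bits_per_weight / 8 * overhead
    return bytes_total / 1e9

# Q4_K_M averages roughly 4.5 bits per weight (assumption)
print(round(model_footprint_gb(7, 4.5), 1))  # → 4.7, in line with the 4-6GB figure above
print(round(model_footprint_gb(7, 8.5), 1))  # → 8.9, matching the Q8_0 doubling
```

Treat the output as a lower bound: long context windows and concurrent requests add memory on top of the weights.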
Installing Ollama
Ollama packages the GGUF inference engine with automatic model downloads, service management, and a REST API. Installation takes under a minute on most Linux systems.
Run the official installer:
curl -fsSL https://ollama.com/install.sh | sh
This installs the ollama binary, creates a systemd service, and starts the API server on port 11434. Verify installation:
ollama list
curl http://localhost:11434/api/tags
Pull your first model to test:
ollama pull llama3.2:3b
ollama run llama3.2:3b "Explain Docker networking"
Configuration
Customize behavior with a systemd override (sudo systemctl edit ollama) or by editing /etc/systemd/system/ollama.service directly:
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_MODELS=/mnt/storage/models"
Environment="OLLAMA_NUM_GPU=1"
Restart after changes:
sudo systemctl daemon-reload
sudo systemctl restart ollama
The OLLAMA_HOST variable controls bind address. Use 0.0.0.0 to accept network connections. OLLAMA_MODELS sets storage location. OLLAMA_NUM_GPU controls GPU count for layer offloading.
Building llama.cpp from Source
llama.cpp gives you control over hardware acceleration, optimization flags, and experimental features. Pre-built binaries work for basic use, but compiling from source unlocks GPU support and architecture-specific optimizations.
Clone the repository:
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
For CPU-only builds:
cmake -B build
cmake --build build --config Release
For NVIDIA GPU support (current builds use the GGML_CUDA flag; older releases used -DLLAMA_CUDA=ON):
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release
For AMD GPUs (older releases used -DLLAMA_HIPBLAS=ON):
cmake -B build -DGGML_HIP=ON
cmake --build build --config Release
For Apple Silicon, Metal support is enabled by default; to set it explicitly:
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release
Verify the build:
./build/bin/llama-cli --version
./build/bin/llama-server --version
The build process takes several minutes depending on CPU. Compilation enables CPU instruction sets like AVX2 and AVX-512 automatically based on your hardware.
Understanding LLM Parameters
LLM parameters control inference behavior – randomness, output length, token selection, and context handling. Tuning these settings optimizes models for specific tasks.
Temperature
Temperature controls randomness in token selection. Range: 0.0-2.0.
- 0.0-0.3: Deterministic, focused output (code generation, factual queries)
- 0.5-0.7: Balanced creativity and coherence (general chat)
- 0.8-1.2: Creative, varied output (brainstorming, creative writing)
Lower temperatures make the model choose the most likely tokens consistently. Higher values increase randomness, making output less predictable but more creative.
Example:
# Ollama API
curl http://localhost:11434/api/generate -d '{
"model": "llama3.2",
"prompt": "Write code to parse JSON",
"options": {"temperature": 0.2}
}'
# llama-server
./llama-server -m model.gguf --temp 0.2
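Under the hood, temperature divides the model's logits before the softmax that turns them into probabilities. A toy sketch showing why low temperature makes the top token dominate (the logit values are made up for illustration):

```python
import math

def softmax_with_temperature(logits, temperature):
    """Scale logits by 1/temperature before softmax; lower values
    sharpen the distribution toward the most likely token."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]                 # toy scores, not from a real model
cold = softmax_with_temperature(logits, 0.2)
hot = softmax_with_temperature(logits, 1.2)
print(round(cold[0], 3), round(hot[0], 3))  # → 0.993 0.581
```

At temperature 0.2 the top token gets nearly all the probability mass; at 1.2 the alternatives stay live, which is where the extra "creativity" comes from.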
Top-p (Nucleus Sampling)
Top-p limits token selection to the smallest set whose cumulative probability exceeds the threshold. Range: 0.1-1.0.
A top-p of 0.9 considers only tokens representing 90% probability mass. This produces more coherent output than temperature alone by excluding unlikely tokens entirely.
Typical values: 0.85-0.95 for most tasks.
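The selection rule can be sketched in a few lines: sort tokens by probability, keep them until the cumulative mass crosses the threshold, then renormalize (the probabilities below are a toy distribution):

```python
def top_p_filter(probs, top_p):
    """Keep the smallest set of tokens whose cumulative probability
    reaches top_p; return (index, prob) pairs, renormalized."""
    ranked = sorted(enumerate(probs), key=lambda x: x[1], reverse=True)
    kept, cumulative = [], 0.0
    for idx, p in ranked:
        kept.append((idx, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return [(idx, p / total) for idx, p in kept]

# The three 0.02 tail tokens are excluded at top_p=0.9
probs = [0.5, 0.3, 0.14, 0.02, 0.02, 0.02]
print(top_p_filter(probs, 0.9))  # keeps 3 of 6 tokens
```

This is why top-p trims the unlikely tail outright, while temperature only reshapes the whole distribution.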
Context Window
Context window defines how much conversation history the model remembers, measured in tokens. Most models support 2048-8192 by default. Some extend to 32k-128k tokens.
Longer contexts consume more VRAM and slow inference. Use the minimum context needed for your task:
# Ollama
curl http://localhost:11434/api/generate -d '{
"model": "llama3.2",
"options": {"num_ctx": 4096}
}'
# llama-server
./llama-server -m model.gguf --ctx-size 4096
Max Tokens (Output Length)
Controls maximum output length. Prevents runaway generation:
# Ollama
"options": {"num_predict": 512}
# llama-server
./llama-server -m model.gguf -n 512
Quantization Explained
Quantization reduces model size by representing weights with fewer bits. GGUF models use schemes like Q4_0, Q4_K_M, Q5_K_M, and Q8_0.
Quantization Levels
- Q4_0: Smallest, fastest, lowest quality (legacy format)
- Q4_K_M: Best balance for most use cases (recommended starting point)
- Q5_K_M: Better quality, 20% larger than Q4_K_M
- Q8_0: Near full precision, double the size of Q4_K_M
For a 7B model:
- Q4_K_M: ~4GB RAM
- Q5_K_M: ~5GB RAM
- Q8_0: ~7-8GB RAM
Choosing Quantization
Start with Q4_K_M. Upgrade to Q5_K_M or Q8_0 only if quality issues appear in your specific use case. Test with actual prompts from your application.
# Ollama pulls a default quantization (typically Q4_K_M)
ollama pull llama3.1:8b
# Specify quantization via the tag
ollama pull llama3.1:8b-instruct-q5_K_M
For llama.cpp, download GGUF files from Hugging Face with explicit quantization in filename:
wget https://huggingface.co/TheBloke/Llama-2-7B-GGUF/resolve/main/llama-2-7b.Q4_K_M.gguf
GPU Offloading and Memory Management
GPU offloading moves model layers from CPU to GPU memory, dramatically improving inference speed. Control how many layers run on GPU to balance speed and VRAM usage.
Ollama GPU Configuration
Ollama automatically calculates optimal layer distribution based on available VRAM:
export OLLAMA_NUM_GPU=1
ollama run llama3.1:8b
Set to 0 to force CPU-only inference. Ollama handles layer splitting transparently.
llama.cpp GPU Layers
Use the -ngl (or --n-gpu-layers) flag to set GPU layer count explicitly:
./llama-server -m model.gguf -ngl 35 --ctx-size 4096
Start with -ngl 35 for 7B models on 8GB VRAM cards. Increase for larger VRAM, decrease if you hit OOM errors.
Monitor VRAM usage:
# NVIDIA
watch -n 1 nvidia-smi
# AMD
watch -n 1 rocm-smi
Memory Optimization
If you run out of memory:
- Reduce GPU layers (-ngl)
- Use lower quantization (Q4_K_M instead of Q5_K_M)
- Decrease context window (--ctx-size)
- Close other applications using VRAM
Test configurations incrementally. A model that runs with short prompts may exhaust memory with long context windows.
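A rough way to pick a starting -ngl value is to divide usable VRAM by the approximate per-layer size. This sketch assumes layers are roughly equal-sized and reserves fixed headroom for the KV cache and driver context; both assumptions are guesses to tune against nvidia-smi:

```python
def layers_that_fit(vram_gb: float, n_layers: int, model_gb: float,
                    reserve_gb: float = 1.5) -> int:
    """Estimate how many transformer layers fit in VRAM, assuming
    equal-sized layers and reserve_gb of headroom for the KV cache
    and CUDA context (both rough guesses)."""
    per_layer_gb = model_gb / n_layers
    usable = max(vram_gb - reserve_gb, 0)
    return min(n_layers, int(usable / per_layer_gb))

# A ~4 GB Q4_K_M 7B model with 32 layers on an 8 GB card
print(layers_that_fit(8, 32, 4.0))  # → 32 (all layers fit)
print(layers_that_fit(4, 32, 4.0))  # → 20 (partial offload)
```

Start at the estimate, then raise or lower -ngl based on actual VRAM readings and OOM behavior.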
Running Models with Ollama
Ollama provides a simple CLI and REST API for model interaction. The service runs as a systemd daemon, handling model loading and inference requests.
Command Line Usage
Run a model interactively:
ollama run llama3.1:8b
Single prompt execution:
ollama run llama3.1:8b "Explain Docker networking"
List downloaded models:
ollama list
Remove a model:
ollama rm llama3.1:8b
REST API Usage
Generate completions:
curl http://localhost:11434/api/generate -d '{
"model": "llama3.2",
"prompt": "Explain Docker networking",
"options": {
"temperature": 0.7,
"top_p": 0.9,
"num_predict": 512
}
}'
Chat format (multi-turn conversations):
curl http://localhost:11434/api/chat -d '{
"model": "llama3.2",
"messages": [
{"role": "user", "content": "What is Docker?"},
{"role": "assistant", "content": "Docker is a containerization platform..."},
{"role": "user", "content": "How does networking work?"}
]
}'
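The chat endpoint is stateless: every request must carry the full message history, as the example above does. A stdlib-only sketch of building such a request (model name and URL match the examples above; nothing is sent until you call urlopen against a running instance):

```python
import json
import urllib.request

def build_chat_request(history, user_message,
                       url="http://localhost:11434/api/chat",
                       model="llama3.2"):
    """Append the new user turn and build a POST request carrying
    the full history; the server keeps no memory between calls."""
    messages = history + [{"role": "user", "content": user_message}]
    body = json.dumps({"model": model, "messages": messages,
                       "stream": False}).encode()
    req = urllib.request.Request(url, data=body,
                                 headers={"Content-Type": "application/json"})
    return req, messages

history = [
    {"role": "user", "content": "What is Docker?"},
    {"role": "assistant", "content": "Docker is a containerization platform..."},
]
req, messages = build_chat_request(history, "How does networking work?")
# urllib.request.urlopen(req) would send it to a running Ollama instance
print(len(messages))  # → 3
```

Appending each assistant reply to the history before the next call is what turns single requests into a conversation.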
Creating Custom Models with Modelfiles
Define persistent parameter defaults:
cat > Modelfile <<EOF
FROM llama3.2
PARAMETER temperature 0.3
PARAMETER top_p 0.85
PARAMETER num_predict 1024
SYSTEM You are a technical documentation assistant specializing in Linux.
EOF
ollama create llama3.2-docs -f Modelfile
ollama run llama3.2-docs
This approach works well for CI/CD integration requiring consistent model behavior.
Running Models with llama.cpp
llama-server provides an OpenAI-compatible HTTP API for inference. The server loads models into memory and serves requests on a specified port.
Starting the Server
Basic server startup:
./llama-server -m models/llama-2-7b.Q4_K_M.gguf --port 8080
With GPU acceleration and custom parameters:
./llama-server \
-m models/mistral-7b-instruct.Q4_K_M.gguf \
--ctx-size 4096 \
--threads 8 \
--n-gpu-layers 35 \
--port 8080 \
--host 0.0.0.0
Parameters:
- -m: Model file path
- --ctx-size: Context window size
- --threads: CPU thread count
- -ngl / --n-gpu-layers: GPU layers to offload
- --port: Server port
- --host: Bind address
API Requests
Chat completion endpoint (OpenAI-compatible):
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mistral-7b-instruct",
"messages": [{"role": "user", "content": "Explain Docker networking"}],
"temperature": 0.7,
"top_p": 0.9,
"max_tokens": 512
}'
The OpenAI-compatible API means you can point existing tools expecting OpenAI format to http://localhost:8080/v1 as the base URL.
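A minimal stdlib client against that base URL might look like the following sketch. extract_reply depends only on the OpenAI response shape; chat requires a running server, and the "local" model name is a placeholder since llama-server serves whichever model it loaded:

```python
import json
import urllib.request

def extract_reply(response: dict) -> str:
    """Pull the assistant text out of an OpenAI-format chat response."""
    return response["choices"][0]["message"]["content"]

def chat(prompt, base="http://localhost:8080/v1", temperature=0.7):
    """One-shot chat completion against llama-server (server must be running)."""
    body = json.dumps({
        "model": "local",  # placeholder: llama-server serves its loaded model
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }).encode()
    req = urllib.request.Request(base + "/chat/completions", data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return extract_reply(json.loads(resp.read()))

# Response shape defined by the OpenAI chat schema:
canned = {"choices": [{"message": {"role": "assistant", "content": "ok"}}]}
print(extract_reply(canned))  # → ok
```

Any OpenAI-compatible SDK can replace this by pointing its base URL at the same endpoint.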
Downloading Models from Hugging Face
Hugging Face hosts thousands of open-weight models. For local deployment, you need GGUF format files.
Finding Models
Browse huggingface.co/models and filter for “gguf” tag. Popular model families:
- Llama (Meta)
- Mistral
- Phi (Microsoft)
- Qwen (Alibaba)
Check the Files tab for available quantization levels. Look for Q4_K_M or Q5_K_M variants.
Downloading with wget
Get the direct download URL from the Files tab:
wget https://huggingface.co/TheBloke/Llama-2-7B-GGUF/resolve/main/llama-2-7b.Q4_K_M.gguf
Using Git LFS
For full repository clones:
sudo apt install git-lfs
git lfs install
git clone https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF
Importing into Ollama
Create a Modelfile referencing the downloaded GGUF:
cat > Modelfile <<EOF
FROM ./llama-2-7b.Q4_K_M.gguf
PARAMETER temperature 0.7
EOF
ollama create my-llama2 -f Modelfile
ollama run my-llama2
Production Deployment Considerations
Running local LLMs in production requires planning for reliability, security, and performance.
Resource Monitoring
Monitor CPU, RAM, and VRAM usage continuously:
# Install monitoring tools (nvidia-smi ships with the NVIDIA driver, not apt)
sudo apt install htop
# Watch resources
htop
watch -n 1 nvidia-smi
Set up alerts for:
- VRAM exhaustion
- CPU throttling
- Disk space (model storage)
- API response latency
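Disk-space alerts in particular are easy to script with the standard library alone. A minimal check suitable for a cron job (the path and threshold are example values to adapt):

```python
import shutil

def disk_alert(path: str, min_free_gb: float = 20.0):
    """Return a warning string when free space on the filesystem
    holding `path` drops below the threshold, else None."""
    free_gb = shutil.disk_usage(path).free / 1e9
    if free_gb < min_free_gb:
        return f"LOW DISK: {free_gb:.1f} GB free on {path}"
    return None

if __name__ == "__main__":
    # Point this at OLLAMA_MODELS, e.g. /mnt/storage/models; "/" as a demo
    print(disk_alert("/") or "disk OK")
```

Wire the returned string into whatever alerting channel you already use (mail, webhook, syslog).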
Security Hardening
Bind Ollama/llama-server to localhost only for single-machine use:
# Ollama
Environment="OLLAMA_HOST=127.0.0.1:11434"
# llama-server
./llama-server --host 127.0.0.1
For network access, use reverse proxy with authentication:
location /api/ {
auth_basic "LLM API";
auth_basic_user_file /etc/nginx/.htpasswd;
proxy_pass http://localhost:11434/;
}
Performance Tuning
Start with conservative settings and tune based on actual workload:
- Temperature: 0.3 for technical tasks, 0.7 for general chat
- Context: Minimum needed (4096 for most tasks)
- GPU layers: Maximum your VRAM allows
- Threads: Match CPU core count
Test parameter combinations with representative prompts before production deployment.
Backup and Model Management
Store models on separate storage from system disk. Use version control for Modelfiles:
# Custom model storage
Environment="OLLAMA_MODELS=/mnt/storage/models"
Document which quantization levels you’re using and why. Keep checksums for model files to detect corruption.
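A streaming SHA-256 helper handles multi-gigabyte GGUF files without loading them into RAM (the filename in the comment is taken from the download example earlier):

```python
import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so large model files never
    need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Record the checksum once at download time, then compare on each deploy:
#   sha256_of("llama-2-7b.Q4_K_M.gguf") == stored_checksum
```

A mismatch means a truncated download or silent disk corruption; either way, re-download before debugging model behavior.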
Troubleshooting Common Issues
Out of Memory Errors
Symptom: Model crashes or refuses to load.
Solutions:
- Reduce GPU layers:
-ngl 20instead of-ngl 35 - Use lower quantization: Q4_K_M instead of Q5_K_M
- Decrease context:
--ctx-size 2048instead of 4096 - Close other applications using VRAM
Slow Inference
Symptom: Tokens generate slowly (< 5 tokens/second).
Solutions:
- Increase GPU layers if VRAM available
- Reduce context window
- Use lower quantization (faster but lower quality)
- Check CPU isn’t throttling (monitor temperature)
Model Not Found
Symptom: Ollama can’t find downloaded model.
Solutions:
# Check model location
ollama list
# Verify storage path
echo $OLLAMA_MODELS
# Re-pull model
ollama pull llama3.1:8b
API Connection Refused
Symptom: Can’t connect to Ollama/llama-server API.
Solutions:
# Check service status
systemctl status ollama
# Verify port binding (ss replaces the deprecated netstat)
sudo ss -tlnp | grep 11434
# Test locally first
curl http://localhost:11434/api/tags
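When the curl test fails, it helps to distinguish "nothing is listening" from an application-level error. A small stdlib port probe (host and port taken from the defaults above):

```python
import socket

def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """True if a TCP listener accepts connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    # 11434 is Ollama's default; use 8080 for llama-server
    print(port_open("127.0.0.1", 11434))
```

If the port is closed, check the service; if it's open but curl still fails, the problem is in the request or the server logs.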
Further Reading
This guide covers core concepts for running local LLMs. For specialized topics:
- Ollama-specific optimizations: See Ollama documentation
- llama.cpp compilation details: See Building llama.cpp from source
- Hugging Face integration: See Hugging Face skills for self-hosting
- Advanced parameter tuning: See Setting LLM parameters
Caution: When using AI assistants to generate deployment commands, always validate against official documentation. Test configurations in development environments before production deployment.