Run AI Locally. Own Your Data.

Practical guides for self-hosting AI models on your own hardware.

Ollama, Open WebUI, LM Studio, llama.cpp — set up local LLMs,
keep your data private, cut API costs, and run AI offline.

Also see: [AI Linux Admin](https://ailinuxadmin.com) for AI-powered sysadmin guides | [SecureStackOps](https://securestackops.com) for Linux security

Setting LLM Parameters in Ollama and llama.cpp for Local AI Models

TL;DR Both Ollama and llama.cpp let you control how your local LLMs behave through runtime parameters. Understanding these settings helps you balance response quality, speed, and resource usage without sending data to external APIs. Temperature controls randomness – lower values like 0.1 produce focused, deterministic outputs while higher values like 0.9 generate creative but less predictable text. Top-p (nucleus sampling) filters token choices by cumulative probability, typically set between 0.7 and 0.95. Context window size determines how much conversation history the model remembers, ranging from 2048 to 128000 tokens depending on your model and available VRAM. ...
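As a concrete sketch, these settings map onto the `options` object of Ollama's REST API (`temperature`, `top_p`, and `num_ctx` are real Ollama option keys; the model name and prompt below are placeholders):

```python
import json

# Ollama's /api/generate endpoint accepts sampling parameters in an
# "options" object. The values here illustrate a focused, deterministic setup.
payload = {
    "model": "llama3",          # placeholder model name
    "prompt": "Summarize the syslog entries below.",
    "stream": False,
    "options": {
        "temperature": 0.1,     # low = focused, deterministic output
        "top_p": 0.9,           # nucleus sampling cutoff
        "num_ctx": 8192,        # context window in tokens (VRAM permitting)
    },
}

# Send with: curl http://localhost:11434/api/generate -d '<this JSON>'
print(json.dumps(payload, indent=2))
```

llama.cpp exposes the same knobs as CLI flags (`--temp`, `--top-p`, `--ctx-size`), so the values transfer directly between the two runtimes.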

February 26, 2026 · 8 min · Local AI Ops

Essential Hugging Face Skills for Self-Hosting AI Models with Ollama and LM Studio

TL;DR Hugging Face serves as the primary model repository for self-hosted AI deployments, but navigating its ecosystem requires specific skills beyond basic model downloads. You need to understand model cards, quantization formats, and licensing before pulling multi-gigabyte files into your homelab. Start by learning to read model cards on Hugging Face – they contain critical information about context windows, training data, and recommended inference parameters. For Ollama deployments, look for GGUF format models or Modelfiles that reference Hugging Face repositories. LM Studio users should focus on models with clear quantization levels (Q4_K_M, Q5_K_S) that balance quality and VRAM usage. ...
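The quantization level is usually embedded in the GGUF filename itself, so you can check it before committing to a multi-gigabyte download. A small helper sketches the idea (illustrative only: the naming pattern is a community convention, not a spec, and some repositories deviate from it):

```python
import re

def quant_level(filename):
    """Extract the quantization tag (e.g. Q4_K_M) from a GGUF filename.

    Convention-based, not guaranteed: matches tags like Q4_K_M, Q5_K_S,
    or the unquantized F16/F32 markers.
    """
    m = re.search(r"(Q\d+_[A-Z0-9_]+|F16|F32)", filename, re.IGNORECASE)
    return m.group(1).upper() if m else None

print(quant_level("llama-3-8b-instruct.Q4_K_M.gguf"))  # Q4_K_M
print(quant_level("mistral-7b.Q5_K_S.gguf"))           # Q5_K_S
```

Knowing the tag up front lets you estimate VRAM needs from the model card before the file ever hits your disk.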

February 25, 2026 · 9 min · Local AI Ops

Complete Guide to Building llama.cpp from GitHub for Local AI Models

TL;DR Building llama.cpp from source gives you a high-performance C/C++ inference engine for running GGUF-format language models locally without cloud dependencies. The process involves cloning the GitHub repository, installing build dependencies like cmake and a C++ compiler, then compiling with hardware acceleration flags for your CPU or GPU. The main advantage of building from source rather than using pre-built binaries is control over optimization flags and hardware support. You can enable CUDA for NVIDIA GPUs, ROCm for AMD cards, or Metal for Apple Silicon. CPU-only builds work everywhere but run slower on large models. ...
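The configure-then-compile flow can be sketched as a small command builder. Note the CMake option names below (`GGML_CUDA`, `GGML_HIP`, `GGML_METAL`) match recent llama.cpp releases but have changed across versions, so verify them against the repository's README for the commit you build:

```python
# Per-backend CMake flags (assumed names; check the llama.cpp README
# for your checkout, as these options have been renamed over time).
BACKEND_FLAGS = {
    "cuda":  ["-DGGML_CUDA=ON"],    # NVIDIA GPUs
    "rocm":  ["-DGGML_HIP=ON"],     # AMD GPUs
    "metal": ["-DGGML_METAL=ON"],   # Apple Silicon
    "cpu":   [],                    # portable, slower on large models
}

def build_commands(backend):
    """Return the two-step CMake invocation for a given backend."""
    configure = ["cmake", "-B", "build"] + BACKEND_FLAGS[backend]
    compile_step = ["cmake", "--build", "build", "--config", "Release", "-j"]
    return [configure, compile_step]

for cmd in build_commands("cuda"):
    print(" ".join(cmd))
```

The CPU-only path is just the same two commands with no extra flag, which is why it works everywhere.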

February 24, 2026 · 9 min · Local AI Ops

Getting Started with OpenClaw Framework in LM Studio for Local AI

TL;DR OpenClaw Framework provides a structured approach to building AI-powered command-line tools that integrate with local LLMs running in LM Studio. Instead of sending your terminal commands and system data to cloud APIs, OpenClaw routes everything through your local inference server, keeping sensitive information on your machine. The framework handles the connection between your shell environment and LM Studio’s OpenAI-compatible API server, which runs on port 1234 by default. You write Python scripts that describe what you want the AI to do – generate shell commands, analyze log files, suggest configuration changes – and OpenClaw manages the prompt formatting, context injection, and response parsing. ...
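The pattern the framework wraps (prompt formatting plus context injection against an OpenAI-compatible endpoint) can be sketched with the standard library alone. The system prompt and helper below are illustrative, not OpenClaw's actual API; only the port-1234 endpoint comes from the article:

```python
import json

LM_STUDIO_URL = "http://localhost:1234/v1/chat/completions"

def format_request(task, context):
    """Build an OpenAI-style chat payload: system prompt plus injected context.

    Illustrative sketch only; OpenClaw's real prompt templates differ.
    """
    return {
        "model": "local-model",  # LM Studio serves whichever model is loaded
        "messages": [
            {"role": "system",
             "content": "You are a careful Linux assistant. Output shell "
                        "commands only, no commentary."},
            {"role": "user",
             "content": f"Context:\n{context}\n\nTask: {task}"},
        ],
        "temperature": 0.2,
    }

req = format_request("find files over 1GB", "cwd=/var/log, shell=bash")
print(json.dumps(req, indent=2))  # POST this body to LM_STUDIO_URL
```

Because the request shape is OpenAI-compatible, the same payload works unchanged against any local server that speaks that protocol.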

February 23, 2026 · 9 min · Local AI Ops

What is Ollama: Complete Guide to Running AI Models Locally

TL;DR Ollama is a command-line tool that lets you run large language models like Llama, Mistral, and CodeLlama directly on your Linux machine without sending data to external APIs. Install it with a single command, pull models from the ollama.com library, and interact via REST API on port 11434 or through the CLI. ...
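By default the REST API streams its reply as newline-delimited JSON, one fragment per line, with a final `"done": true` object. A minimal consumer (using a canned response body in place of a live HTTP stream) looks like:

```python
import json

# Canned stand-in for the body Ollama streams from /api/generate:
# one JSON object per line, each carrying a "response" fragment.
raw_stream = (
    b'{"response": "Hello", "done": false}\n'
    b'{"response": " world", "done": false}\n'
    b'{"response": "", "done": true}\n'
)

def collect(stream_bytes):
    """Concatenate the fragments of a streamed Ollama reply."""
    text = []
    for line in stream_bytes.splitlines():
        chunk = json.loads(line)
        text.append(chunk["response"])
        if chunk.get("done"):
            break
    return "".join(text)

print(collect(raw_stream))  # Hello world
```

Passing `"stream": false` in the request skips this entirely and returns one JSON object, which is simpler for scripts that don't need incremental output.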

February 23, 2026 · 7 min · Local AI Ops

Running Claude-Style Coding Models Locally with Ollama and Open WebUI

TL;DR You can run Claude-quality coding models on your own hardware using Ollama and Open WebUI, keeping your code and conversations completely private. This guide walks you through deploying models like DeepSeek Coder, Qwen2.5-Coder, and CodeLlama that rival proprietary services for code generation, debugging, and refactoring tasks. The setup requires a Linux machine with at least 16GB RAM for 7B models or 32GB+ for 34B models. You’ll install Ollama as the model runtime, pull coding-focused models, then connect Open WebUI as your chat interface. The entire stack runs locally—no API keys, no data leaving your network. ...

February 23, 2026 · 7 min · Local AI Ops

Fine-Tuning AI for Small Business: Real Examples and ROI

TL;DR Generic AI chatbots give generic answers. Fine-tuned AI models sound like your business, know your products, and follow your rules. For small businesses, this means 24/7 customer support that actually represents your company accurately. The business case:

- Cost to fine-tune: varies by model size and provider; expect a modest one-time investment
- Monthly hosting: depends on hardware or cloud choice
- What it replaces: hours of daily repetitive customer inquiries
- Typical ROI: many businesses recoup costs within a few months
- Who it works for: any business that answers the same types of questions repeatedly — service companies, professional firms, retail, healthcare, real estate. ...

February 22, 2026 · 8 min · Local AI Ops

RTX 3090 for AI: The Best Value GPU for Local LLM Hosting in 2026

TL;DR The NVIDIA RTX 3090 is the best price-to-performance GPU for local AI work in 2026. At $700-900 used, it delivers 24GB of VRAM — the same amount as GPUs costing 2-3x more. That 24GB is the critical spec: it determines which models you can run and how many customers you can serve. ...

February 22, 2026 · 6 min · Local AI Ops

Running a Private AI API for Your Business: Complete Guide

TL;DR You can run your own OpenAI-compatible API on a single machine with a GPU. Your data never leaves your hardware, costs are fixed instead of per-token, and you can serve custom fine-tuned models. What you get:

- A drop-in replacement for the OpenAI API (change one line of code to switch)
- Complete data privacy — nothing sent to external servers
- Fixed monthly cost instead of unpredictable per-token billing
- Custom models fine-tuned on your business data
- No per-seat licensing

Minimum setup: ...
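The "change one line of code" claim refers to the client's base URL: OpenAI-compatible servers accept the same request shape, so only the endpoint (and a dummy API key) changes. Sketched with the standard library, with the server address and model name as placeholders:

```python
import json
from urllib.request import Request

# The same chat payload works against OpenAI or a local server; only
# the base URL changes. The address below is a placeholder.
BASE_URL = "http://localhost:8000/v1"   # was: https://api.openai.com/v1

def chat_request(prompt):
    """Build an OpenAI-style chat completion request against BASE_URL."""
    body = json.dumps({
        "model": "my-finetuned-model",  # placeholder model name
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json",
                 "Authorization": "Bearer not-needed"},  # local servers ignore it
    )

req = chat_request("Draft a reply to this support ticket.")
print(req.full_url)  # http://localhost:8000/v1/chat/completions
```

Existing code written against the OpenAI SDK typically needs only its `base_url` setting repointed the same way.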

February 22, 2026 · 6 min · Local AI Ops

How to Fine-Tune Llama 3 on Your Business Data with QLoRA

TL;DR Fine-tuning takes a general-purpose AI model like Llama 3 and trains it further on your business data. The result is a model that responds in your company’s voice, knows your products, and follows your rules — not a generic chatbot. What you need:

- 200-500 question/answer pairs from your business
- A GPU with 24GB VRAM (RTX 3090, ~$800 used) or a MacBook with 32GB
- 2-6 hours of training time
- QLoRA + Hugging Face tools (all free and open source)

What you get: ...
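Those 200-500 question/answer pairs are typically stored as JSONL, one training example per line, before the QLoRA run. A minimal converter sketches one common chat-style schema (a convention, not the only format; check what your training script expects):

```python
import json

# Example Q/A pairs drawn from a hypothetical business FAQ.
pairs = [
    ("What are your support hours?", "We answer tickets 9am-6pm ET, Mon-Fri."),
    ("Do you offer refunds?", "Yes, within 30 days of purchase."),
]

def to_jsonl(qa_pairs):
    """Render Q/A pairs as JSONL in a chat-style fine-tuning schema."""
    lines = []
    for question, answer in qa_pairs:
        lines.append(json.dumps({"messages": [
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer},
        ]}))
    return "\n".join(lines)

print(to_jsonl(pairs))  # ready to write out as train.jsonl
```

Keeping the answers in your company's actual wording is what teaches the model your voice; the schema is just packaging.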

February 22, 2026 · 7 min · Local AI Ops