Odysseus: Complete Self-Hosted AI Workspace with Ollama

TL;DR Odysseus turns your self-hosted infrastructure into a unified AI workspace that coordinates chat interfaces, code completion, image generation, and document analysis through a single web interface. Unlike single-purpose tools, Odysseus provides workspace management features designed for teams and complex projects that span multiple AI modalities. ...

June 1, 2026 · 9 min · Local AI Ops

Llama Models on AMD ROCm: Complete Self-Hosting Setup Guide

TL;DR Running Llama models on AMD GPUs requires ROCm-specific optimizations that differ significantly from NVIDIA CUDA workflows. This guide covers the complete setup for self-hosting Llama 2, Llama 3, and Code Llama variants on AMD hardware using ROCm 6.0+, with a focus on memory management, compilation flags, and performance tuning that existing NVIDIA guides do not address. ...

May 25, 2026 · 9 min · Local AI Ops

llama.cpp Multi-GPU Support for Mixed Graphics Cards in 2026

TL;DR llama.cpp supports heterogeneous multi-GPU configurations, letting you mix NVIDIA, AMD, and even Intel Arc cards in the same system for local LLM inference. Unlike Ollama’s automatic GPU detection, llama.cpp requires explicit layer distribution using the -ngl flag combined with --split-mode and --tensor-split parameters. This gives you fine-grained control over which layers run on which card, essential when mixing a high-VRAM card with lower-capacity GPUs. ...
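As a rough sketch of the explicit layer distribution described above, the snippet below assembles a llama-server command line from per-card VRAM sizes. The helper names, the model path, and the 24 GiB / 8 GiB card sizes are illustrative assumptions; splitting in proportion to VRAM is only one possible starting point, not a tuned recommendation.

```python
def tensor_split_from_vram(vram_gib):
    """Return a --tensor-split value proportional to each GPU's VRAM.

    Hypothetical helper: llama.cpp treats the comma-separated numbers
    as relative proportions, so raw VRAM sizes can be passed as-is.
    """
    if not vram_gib or any(v <= 0 for v in vram_gib):
        raise ValueError("vram_gib must be a non-empty list of positive numbers")
    return ",".join(f"{v:g}" for v in vram_gib)


def build_llama_server_args(model_path, vram_gib, n_gpu_layers=99):
    """Assemble an argument list for llama-server (illustrative only)."""
    return [
        "llama-server",
        "-m", model_path,
        "-ngl", str(n_gpu_layers),      # number of layers to offload to GPUs
        "--split-mode", "layer",        # split whole layers across devices
        "--tensor-split", tensor_split_from_vram(vram_gib),
    ]


# Assumed example: one 24 GiB card paired with one 8 GiB card.
args = build_llama_server_args("model.gguf", [24, 8])
print(" ".join(args))
```

The resulting list can be passed to subprocess.run once the model path points at a real GGUF file.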

May 11, 2026 · 9 min · Local AI Ops

Fix Ollama Model Switching Causing 100% SSD Usage in 2026

TL;DR When you switch between models in Ollama, the service unloads the current GGUF file from memory and loads the new one from disk. Large models like llama3.1:70b or mixtral:8x7b can exceed 40GB, causing sustained disk reads that pin your SSD at maximum utilization. This becomes especially problematic when multiple users or applications trigger rapid model switches, creating a cascade of disk I/O that degrades system responsiveness. ...

May 4, 2026 · 9 min · Local AI Ops

DeepSeek v4 Local Setup Guide: Ollama and Open WebUI Install

TL;DR DeepSeek v4 runs locally through Ollama with Open WebUI providing a chat interface. This guide covers installation, model-specific configuration for DeepSeek’s extended context window, and performance tuning for the model’s unique reasoning architecture. Install Ollama first, then pull the DeepSeek v4 model:

    curl -fsSL https://ollama.com/install.sh | sh
    ollama pull deepseek-v4

DeepSeek v4 requires specific memory allocation due to its 128K token context window. Set OLLAMA_NUM_GPU to control GPU layer offloading – most systems benefit from full GPU utilization with this model’s architecture: ...

May 2, 2026 · 9 min · Local AI Ops

Open WebUI Desktop: Self-Host AI Models Locally in 2026

TL;DR Open WebUI Desktop brings self-hosted AI to your machine without Docker containers or browser tabs. Download the native application for Windows, macOS, or Linux, and you get a system tray icon, offline-first architecture, and direct file system access – no port mapping or container orchestration required. The desktop version connects to local Ollama instances or OpenAI-compatible APIs just like the web version, but runs as a standalone application with OS-level integration. Launch it from your applications menu, minimize to tray, and interact with models like llama3.2, mistral, or codellama without opening a browser. Updates arrive automatically through the built-in updater, eliminating manual Docker image pulls. ...

May 2, 2026 · 10 min · Local AI Ops

How Finetuning Exposes Copyright Issues in Self-Hosted LLMs

TL;DR Finetuning your local LLM on copyrighted material creates the same legal risks as training foundation models, but with direct personal liability. When you run ollama create mymodel -f Modelfile using a dataset scraped from Stack Overflow, GitHub repositories, or published books, you become the party responsible for any copyright infringement – not a distant corporation with legal teams. ...

May 1, 2026 · 9 min · Local AI Ops

Setting OLLAMA_NUM_GPU for Multi-GPU Local AI in 2026

TL;DR The OLLAMA_NUM_GPU environment variable controls how many GPUs Ollama uses for inference, but setting it correctly requires understanding your hardware topology and workload patterns. Unlike single-GPU setups where Ollama auto-detects your card, multi-GPU configurations demand explicit tuning to avoid memory fragmentation and PCIe bottlenecks. Set OLLAMA_NUM_GPU=2 to split model layers across two GPUs, or OLLAMA_NUM_GPU=4 for quad-GPU systems. Ollama distributes transformer layers sequentially – GPU 0 handles the first N layers, GPU 1 takes the next batch, and so on. This differs from data parallelism where each GPU processes different prompts simultaneously. ...
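To make the sequential layer distribution described above concrete, here is a minimal sketch that assigns contiguous blocks of layers to each GPU. The function name and the even split are assumptions for illustration; a real scheduler may weight the split by free VRAM on each card.

```python
def split_layers(n_layers, n_gpus):
    """Assign contiguous layer ranges to GPUs, as evenly as possible.

    Hypothetical illustration of a sequential (pipeline-style) split:
    GPU 0 gets the first block of layers, GPU 1 the next, and so on.
    """
    if n_layers < 0 or n_gpus < 1:
        raise ValueError("need n_layers >= 0 and n_gpus >= 1")
    base, extra = divmod(n_layers, n_gpus)
    ranges = []
    start = 0
    for gpu in range(n_gpus):
        # The first `extra` GPUs take one additional layer each.
        count = base + (1 if gpu < extra else 0)
        ranges.append(range(start, start + count))
        start += count
    return ranges


# Assumed example: an 80-layer model on two GPUs.
for gpu, layers in enumerate(split_layers(80, 2)):
    print(f"GPU {gpu}: layers {layers.start}..{layers.stop - 1}")
```

Each GPU holds a different slice of the same model, which is why this layout differs from data parallelism, where every GPU holds a full copy and serves different prompts.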

April 29, 2026 · 9 min · Local AI Ops

Running llama.cpp with Inverse Kinematics AI Models in 2026

TL;DR llama.cpp now handles inverse kinematics calculations through specialized GGUF models that generate joint angles and motion paths for robotic systems. You run llama-server with an IK-trained model, send it target positions as JSON prompts, and receive executable motion commands. This works entirely offline without cloud dependencies. The typical workflow involves loading a quantized IK model (Q4_K_M or Q5_K_M recommended for speed), sending coordinate targets through the OpenAI-compatible HTTP API, and parsing the structured output into robot control commands. Models like CodeLlama-IK and specialized Llama variants trained on robotics datasets handle 6-DOF arm calculations, path planning with obstacle avoidance, and real-time trajectory adjustments. ...

April 26, 2026 · 9 min · Local AI Ops

Running Local AI Models on Kubernetes with Ollama in 2026

TL;DR Deploying Ollama on Kubernetes transforms local AI inference into a production-grade service with horizontal scaling, persistent model storage, and service mesh integration. This guide covers container orchestration patterns specifically for LLM workloads running on self-hosted infrastructure. The core deployment uses StatefulSets rather than Deployments to maintain stable network identities and persistent volume claims for model storage. Each Ollama pod serves the REST API on port 11434 and requires GPU node affinity when using the NVIDIA runtime. Configure OLLAMA_HOST=0.0.0.0:11434 to bind the service to all interfaces within the pod network, and set OLLAMA_MODELS=/models to point to your PersistentVolume mount path. ...
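The container settings mentioned above could be sketched as follows. This only builds the container portion of a pod template as a Python dictionary and prints it as JSON; the container name, the volume name ollama-models, and the ollama/ollama image tag are assumptions, and a full StatefulSet would also need selectors, volume claim templates, and GPU scheduling rules.

```python
import json


def ollama_container_spec(models_path="/models", port=11434):
    """Build the container section of a pod template (illustrative only)."""
    return {
        "name": "ollama",                 # assumed container name
        "image": "ollama/ollama",         # assumed image reference
        "ports": [{"containerPort": port}],
        "env": [
            # Bind the API to all interfaces inside the pod network.
            {"name": "OLLAMA_HOST", "value": f"0.0.0.0:{port}"},
            # Store models on the mounted persistent volume.
            {"name": "OLLAMA_MODELS", "value": models_path},
        ],
        "volumeMounts": [
            {"name": "ollama-models", "mountPath": models_path},  # assumed volume name
        ],
    }


spec = ollama_container_spec()
print(json.dumps(spec, indent=2))
```

Keeping the mount path and OLLAMA_MODELS in one parameter avoids the two values drifting apart.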

April 25, 2026 · 8 min · Local AI Ops