Local AI Ops

Building a Local RAG Pipeline with Ollama and Open WebUI

TL;DR Retrieval-augmented generation (RAG) lets your local LLM answer questions using your own documents instead of relying on its training data. This guide walks through building a fully local RAG pipeline: document ingestion, embedding, vector storage, and retrieval through Open WebUI. ...
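To make the ingestion, embedding, storage, and retrieval stages concrete, here is a minimal sketch of the retrieval loop. It is not the guide's implementation: the `embed` function is a toy bag-of-words stand-in for a real embedding model, the "vector store" is a plain Python list, and every function name here is hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words 'embedding'; a stand-in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(chunks):
    """Ingestion step: store each chunk next to its vector."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(index, query, k=2):
    """Retrieval step: return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(query, context_chunks):
    """Augmentation step: put the retrieved chunks in front of the question."""
    context = "\n".join(f"- {c}" for c in context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

In a real pipeline the prompt returned by `build_prompt` would be sent to the local model, and `embed` would be replaced by calls to an embedding model.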

RTX 4090 vs RTX 3090 for Local AI: Which GPU Should You Buy?

TL;DR Both GPUs have 24GB VRAM, which is the most important spec for local AI. The RTX 4090 is 40-70% faster for inference but costs roughly twice as much as a used RTX 3090. For most people building a local AI server, the 3090 is the better buy. The 4090 makes sense only when you need maximum single-card speed or plan to do significant fine-tuning work. ...
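Since both cards share the same 24GB, a rough weights-only estimate shows which models fit on either one. This is a back-of-the-envelope sketch, not a benchmark: it counts only parameters times bits per weight, and the 2 GiB `headroom_gib` reserved for KV cache and runtime overhead is an assumed placeholder that varies with context length.

```python
def weight_footprint_gib(params_billions, bits_per_weight):
    """Memory needed for the weights alone, in GiB."""
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 2**30


def fits_in_vram(params_billions, bits_per_weight, vram_gib=24.0, headroom_gib=2.0):
    """True if the weights plus an assumed headroom fit in the given VRAM."""
    needed = weight_footprint_gib(params_billions, bits_per_weight) + headroom_gib
    return needed <= vram_gib


if __name__ == "__main__":
    for params, bits in [(8, 16), (13, 8), (70, 4)]:
        size = weight_footprint_gib(params, bits)
        print(f"{params}B @ {bits}-bit: {size:.1f} GiB, fits: {fits_in_vram(params, bits)}")
```

By this estimate a 13B model at 8 bits fits comfortably on either card, while a 70B model at 4 bits does not fit on a single 24GB GPU.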

HJB Equations in Local RL: Implementing with Ollama and

TL;DR Hamilton-Jacobi-Bellman equations provide the mathematical foundation for optimal control in reinforcement learning, but implementing them locally requires combining numerical solvers with LLM-assisted code generation. This guide shows you how to use Ollama running locally to generate HJB solver implementations, validate discretization schemes, and debug boundary conditions without sending your research code to cloud APIs. ...

Self-Hosting Qwen3 Coder with Ollama: Complete 2026 Guide

TL;DR Qwen3 Coder runs locally via Ollama with a single command after installing Ollama using curl -fsSL https://ollama.com/install.sh | sh. The model excels at code completion, refactoring, and multi-language support with context windows up to 32K tokens in the larger variants. Unlike general-purpose models, Qwen3 Coder is specifically trained on code repositories and technical documentation, making it competitive with DeepSeek Coder and CodeLlama for local development workflows. ...

Ollama Windows Installation Guide: Self-Host AI Models in

TL;DR Running Ollama on Windows requires different considerations than Linux deployments. You have two main paths: native Windows installation or WSL2. Native Windows offers simpler GPU access through NVIDIA CUDA or AMD ROCm drivers, while WSL2 provides a Linux-like environment but adds complexity for GPU passthrough. The native Windows installer downloads from ollama.com and runs as a background application. After installation, Ollama serves models on port 11434 and appears in your system tray. Ollama listens only on 127.0.0.1 by default, and Windows Defender Firewall blocks external connections by default, so to reach it from other machines on your network you must both set OLLAMA_HOST to a non-loopback address (for example 0.0.0.0) and create an inbound rule for port 11434. ...
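A quick way to confirm that port 11434 is actually reachable from another machine is a plain TCP connection test. The sketch below uses only the Python standard library; the function name `is_port_open` and the example address are hypothetical, and a successful connection shows only that something is listening, not that it is Ollama.

```python
import socket


def is_port_open(host, port=11434, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Run it from a second machine, for example `is_port_open("192.168.1.50")` with your server's address substituted. A `False` result points to the firewall rule or the listening address.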

TurboQuant Quantization in llama.cpp: Self-Hosted Setup

TL;DR TurboQuant is an experimental quantization method in llama.cpp that prioritizes inference speed over traditional GGUF quantization schemes. Unlike standard Q4_K_M or Q5_K_M formats that balance compression and quality, TurboQuant applies aggressive optimization to matrix operations, reducing memory bandwidth requirements while maintaining acceptable output quality for many use cases. The key difference: TurboQuant reorganizes weight tensors for cache-friendly access patterns and uses specialized SIMD instructions that standard GGUF quantization doesn’t exploit. This means faster token generation on modern CPUs with AVX2 or AVX-512 support, though quality degradation becomes noticeable on complex reasoning tasks. ...

Building a TypeScript Web Scraper with LLMs for Linux Server Monitoring

TL;DR This guide demonstrates building a TypeScript-based web scraper that uses LLMs to parse unstructured server monitoring data from vendor dashboards, legacy admin panels, and third-party SaaS platforms. You’ll integrate OpenAI’s API or local models like Llama 3 to extract metrics, interpret alert messages, and normalize data into Prometheus-compatible formats. ...

Running 397B Flash-MoE Model Locally with Ollama in 2026

TL;DR Flash-MoE represents a breakthrough in local LLM deployment – a 397 billion parameter mixture-of-experts model that activates only a fraction of its parameters per request. Unlike dense models where every parameter processes every token, Flash-MoE routes inputs through specialized expert networks, making it feasible to run on consumer hardware despite its massive size. ...
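The idea of activating only a fraction of the parameters can be illustrated with generic top-k expert routing. This is a textbook-style sketch and an assumption on our part: the excerpt does not describe Flash-MoE's actual router, and the experts here are plain Python functions on a single number rather than neural networks.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k=2):
    """Pick the k highest-scoring experts and normalize their weights."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    weights = softmax([router_logits[i] for i in top])
    return list(zip(top, weights))


def moe_forward(x, experts, router_logits, k=2):
    """Run only the selected experts and mix their outputs by router weight."""
    return sum(w * experts[i](x) for i, w in route_top_k(router_logits, k))
```

With four experts and `k=2`, only two expert functions are evaluated per input, which is why compute per token scales with the active experts and not with the total parameter count.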

Complete Guide to Running llama.cpp in Docker Containers

TL;DR Running llama.cpp in Docker containers solves the deployment complexity of local LLM inference while maintaining reproducibility across different host systems. This guide covers production-ready containerization patterns specifically for llama.cpp, focusing on aspects not typically addressed in basic setup tutorials. You’ll learn to build multi-stage Docker images that compile llama.cpp with optimal flags for your target hardware, then copy only the runtime binaries to a minimal production image. The approach reduces final image size while preserving GPU acceleration support through CUDA or ROCm layers. ...

LM Studio vs Google AI: Local Hosting Beats Cloud

TL;DR LM Studio running on your own hardware eliminates per-token billing, data transmission to Google’s infrastructure, and dependency on internet connectivity. For teams processing sensitive customer data, financial records, or proprietary code, keeping inference local supports GDPR data minimization (Article 5(1)(c)) and the security-of-processing requirements of Article 32 without complex data processing agreements. Google’s Vertex AI and Gemini API bill for usage. LM Studio downloads models once from Hugging Face, then runs them indefinitely on your hardware with no per-token or subscription costs. A mid-range workstation with 32GB RAM and an RTX 4070 handles most 7B-13B parameter models at acceptable speeds for internal tooling, documentation generation, and code review workflows. ...