Posts

GGUF Quantization Explained: Choosing the Right Format for Local AI

TL;DR

    # Check quantization of an Ollama model
    ollama show llama3.2:3b --modelfile | grep -i quant
    # Inspect a GGUF file directly
    python3 -c "from gguf import GGUFReader; r = GGUFReader('model.gguf'); print([kv for kv in r.fields])"
    # Or use llama.cpp's built-in info
    ./llama-quantize --help
    # Convert and quantize with llama.cpp
    ./llama-quantize input.gguf output-Q4_K_M.gguf Q4_K_M

GGUF is the standard file format for running quantized LLMs locally. Quantization reduces model size and VRAM usage by representing weights with fewer bits. The tradeoff is a small reduction in output quality. Choosing the right quantization level depends on your available VRAM, the model size, and your quality requirements. ...
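The size reduction from fewer bits per weight can be estimated with simple arithmetic. The following is a rough sketch, not a measurement: it assumes the Q8_0 and Q4_0 block layouts (32 weights sharing one 16-bit scale), ignores file metadata and any tensors kept at higher precision, and uses a round 3 billion parameters as an illustrative count. The function name `estimate_size_gb` is made up for this example.

```python
# Approximate bits per weight, derived from the assumed block layout:
# 32 quantized weights plus one 16-bit scale per block.
BITS_PER_WEIGHT = {
    "F16": 16.0,
    "Q8_0": (32 * 8 + 16) / 32,  # 8.5 bits per weight
    "Q4_0": (32 * 4 + 16) / 32,  # 4.5 bits per weight
}


def estimate_size_gb(n_params: float, quant: str) -> float:
    """Rough size of the weights in GB (10^9 bytes) for a given format."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1e9


if __name__ == "__main__":
    # Illustrative parameter count; real models rarely have a round number.
    for quant in BITS_PER_WEIGHT:
        print(f"{quant}: {estimate_size_gb(3e9, quant):.2f} GB")
```

Actual GGUF files will differ somewhat from these estimates, since mixed formats such as Q4_K_M store different tensors at different precisions.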

Ollama Model Management: Pull, Create, Copy, and Remove

TL;DR

    # Pull a model
    ollama pull llama3.2:3b
    # List all local models
    ollama list
    # Show model details (parameters, template, license)
    ollama show llama3.2:3b
    # Copy/rename a model
    ollama cp llama3.2:3b my-llama
    # Remove a model
    ollama rm llama3.2:3b
    # Check disk usage of model storage
    du -sh /usr/share/ollama/.ollama/models/

Ollama stores models as layered blobs, similar to Docker images. Understanding how models are stored, tagged, and shared lets you manage disk space effectively and avoid downloading duplicate data. ...

Ollama Modelfile Guide: Custom System Prompts and Parameters

TL;DR

    # Create a custom model from a Modelfile
    ollama create my-coder -f ./Modelfile
    # Run it
    ollama run my-coder
    # List custom models
    ollama list | grep my-
    # Remove a custom model
    ollama rm my-coder

A Modelfile is a plain text file that defines a custom Ollama model. It specifies the base model, system prompt, generation parameters, and template format. Think of it as a Dockerfile for LLMs: declarative, reproducible, and version-controllable. ...

Running Gemma 2 Locally with LM Studio CLI for Linux System Administration

TL;DR LM Studio provides a straightforward path to running Gemma 2 models locally on Linux servers, giving you an offline AI assistant for system administration tasks without sending sensitive infrastructure data to external APIs. The CLI interface integrates cleanly with shell scripts, allowing you to pipe system logs, configuration files, and command outputs directly to the model for analysis and recommendations. ...

LM Studio Plugin System: Extend Your Local AI Setup in 2026

TL;DR LM Studio’s plugin architecture transforms the desktop application from a simple model runner into an extensible AI platform. While the base application handles model loading and inference, plugins add custom workflows, integrate external tools, and automate complex tasks without writing server code from scratch. The plugin system uses a JavaScript-based API that hooks into LM Studio’s model lifecycle, request pipeline, and UI components. Developers can create plugins that preprocess prompts, post-process responses, connect to external databases, or trigger actions based on model outputs. Unlike building a separate application that calls LM Studio’s OpenAI-compatible API, plugins run inside the application context with direct access to model state and configuration. ...

Run AI Models Locally in Browsers: No-Code Automation Without API Keys

TL;DR Browser-based AI models let you run inference directly in the user’s browser using WebGPU and WebAssembly, eliminating API costs and privacy concerns. Tools like Transformers.js, ONNX Runtime Web, and MediaPipe enable you to deploy models for text generation, image classification, and audio transcription without sending data to external servers. ...

Running Gemma 4 Locally with Ollama: 2026 Setup Guide

TL;DR Gemma 4 represents Google’s latest iteration in efficient, on-device language models, optimized specifically for local deployment scenarios where resource constraints matter. Unlike larger models that demand high-end hardware, Gemma 4 delivers strong performance on consumer GPUs and even CPU-only systems, making it ideal for homelab setups and privacy-focused deployments. ...

Local LLM vs OpenAI API: Cost Calculator and Break-Even Analysis

TL;DR A local AI server (RTX 3090 + system, ~$1,400) pays for itself versus OpenAI API spending within 3-12 months depending on your usage volume. At 500 queries per day, local hardware breaks even in about 4 months against GPT-4o pricing. At 1,000 queries per day, break-even drops to under 2 months. ...
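The break-even arithmetic can be sketched as hardware cost divided by monthly savings. This is a minimal illustration under stated assumptions: the function name, the 30-day month, and the per-query API cost of $0.024 are placeholders chosen for the example, not figures from the article, and electricity is an optional input that defaults to zero.

```python
def break_even_months(
    hardware_cost: float,
    queries_per_day: float,
    api_cost_per_query: float,
    power_cost_per_month: float = 0.0,
) -> float:
    """Months until local hardware costs less than paying per API query."""
    # Assumes a 30-day month and a constant query volume.
    monthly_api_spend = queries_per_day * 30 * api_cost_per_query
    monthly_savings = monthly_api_spend - power_cost_per_month
    if monthly_savings <= 0:
        # Running locally never pays off at this volume.
        return float("inf")
    return hardware_cost / monthly_savings


if __name__ == "__main__":
    # Placeholder per-query cost; substitute your own measured average.
    for volume in (500, 1000):
        months = break_even_months(1400, volume, api_cost_per_query=0.024)
        print(f"{volume} queries/day: {months:.1f} months")
```

Doubling the query volume halves the break-even time in this model, which is why usage volume dominates the result.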

Multi-GPU Ollama Setup: Running 70B Models on Dual GPUs

TL;DR A single 24GB GPU cannot fit a 70B parameter LLM in VRAM. The model requires approximately 40GB of VRAM at Q4 quantization. Two GPUs solve this by splitting the model across both cards. This guide covers the hardware, configuration, and performance expectations for running 70B models on dual RTX 3090s with Ollama. ...
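The 40GB figure and the need for a second card can be checked with a quick calculation. This sketch assumes roughly 4.5 bits per weight for Q4 (the Q4_0 block layout) and treats the KV cache and runtime overhead as a single hypothetical per-GPU reserve. It does not model how Ollama actually assigns layers to devices; `weights_gb` and `fits` are illustrative names.

```python
def weights_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights in GB (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


def fits(required_gb: float, gpu_vram_gb: list[float],
         reserve_gb_per_gpu: float = 0.0) -> bool:
    """True if the weights fit in the combined usable VRAM of all GPUs."""
    # reserve_gb_per_gpu is a stand-in for KV cache and runtime overhead.
    usable = sum(max(vram - reserve_gb_per_gpu, 0.0) for vram in gpu_vram_gb)
    return required_gb <= usable


if __name__ == "__main__":
    needed = weights_gb(70e9, 4.5)  # assumed Q4 bits per weight
    print(f"Weights: {needed:.1f} GB")
    print("Single 24GB GPU:", fits(needed, [24]))
    print("Dual 24GB GPUs:", fits(needed, [24, 24]))
```

Raising `reserve_gb_per_gpu` shows how little headroom remains for context once the weights are loaded across two 24GB cards.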

Running Local LLMs on AMD GPUs with ROCm and Ollama

TL;DR AMD GPUs are a viable alternative to NVIDIA for local LLM inference, particularly the RX 7900 XTX with 24GB VRAM. ROCm 6.x on Linux provides the software stack needed to run Ollama and llama.cpp with GPU acceleration. Performance is 15-30% lower than equivalent NVIDIA hardware, but AMD cards often cost significantly less. ...