Docker Pull Issues in Spain: Self-Hosting AI with Ollama

TL;DR Docker Hub rate limits and regional connectivity issues in Spain can block container pulls, disrupting self-hosted AI deployments. The main workarounds are switching to a mirror registry or running Ollama natively without Docker. For immediate relief, configure Docker to use a registry mirror. Edit /etc/docker/daemon.json to add registry mirrors:

    {
      "registry-mirrors": [
        "https://mirror.gcr.io"
      ]
    }

Restart Docker with sudo systemctl restart docker and retry your pull. This routes pulls through Google’s mirror first; Docker still falls back to Docker Hub for images the mirror has not cached. ...
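
The mirror configuration described in this excerpt can be generated rather than typed by hand. The sketch below is a minimal illustration, not the post's own script: render_daemon_json is a hypothetical helper name, and it assumes daemon.json holds no other settings (if it does, merge the key by hand instead).

```shell
# Print a daemon.json body that lists the given registry mirrors.
# Assumption: the file carries no other settings; this sketch does not merge.
# render_daemon_json is a hypothetical helper name used for illustration.
render_daemon_json() {
  sep=''
  printf '{\n  "registry-mirrors": ['
  for mirror in "$@"; do
    # Emit a comma before every entry except the first.
    printf '%s\n    "%s"' "$sep" "$mirror"
    sep=','
  done
  printf '\n  ]\n}\n'
}
```

One possible use is to pipe the output of render_daemon_json https://mirror.gcr.io into sudo tee /etc/docker/daemon.json, restart Docker, and confirm the mirror appears under "Registry Mirrors" in docker info. Note that tee overwrites the existing file, so back it up first.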

April 13, 2026 · 8 min · Local AI Ops

RTX 3090 Used Market 2026: Best Bang for Buck Local AI Setup

TL;DR The RTX 3090 remains a compelling choice for local AI workloads in 2026, particularly on the used market where prices have stabilized considerably below launch MSRP. With 24GB of VRAM, this card handles most local LLM deployments that would otherwise require multiple newer cards or expensive cloud instances. On the secondary market, expect to find RTX 3090s from mining operations, workstation upgrades, and gamers moving to newer architectures. The key advantage is VRAM capacity – running a 70B parameter model quantized to 4-bit requires roughly 40GB, making dual RTX 3090s viable where a single RTX 4090 (24GB) falls short. For 13B to 34B models, a single card provides comfortable headroom. ...
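
The 40GB figure follows from simple arithmetic: 70 billion weights at 4 bits each is 35GB, plus room for the KV cache and runtime buffers. The sketch below works that out. The 15% overhead factor is an assumption chosen for illustration, not a number from the post, and both function names are hypothetical.

```shell
# Estimate VRAM in GB: parameters (billions) x bits per weight / 8, times overhead.
# $1 = parameters in billions, $2 = bits per weight,
# $3 = overhead factor (optional; 1.15 is an assumed allowance for KV cache and buffers)
estimate_vram_gb() {
  awk -v p="$1" -v b="$2" -v o="${3:-1.15}" \
    'BEGIN { printf "%.0f\n", p * b / 8 * o }'
}

# Number of cards needed, rounding up.
# $1 = required GB (integer), $2 = VRAM per card in GB (integer)
cards_needed() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

need="$(estimate_vram_gb 70 4)"   # 70B at 4-bit -> 40
cards_needed "$need" 24           # 24GB cards  -> 2
```

Real quantization formats often use slightly more than their nominal bit width, and splitting a model across cards adds its own overhead, so treat the result as a lower bound.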

April 11, 2026 · 10 min · Local AI Ops

Running Claude-Style Models in LM Studio: Complete 2026

TL;DR LM Studio provides a GUI-first approach to running Claude-style coding models locally without command-line complexity. Download the application from lmstudio.ai, install it on your Linux, macOS, or Windows system, and you gain immediate access to Hugging Face’s model repository through an integrated browser. The workflow centers on three steps: discover models through LM Studio’s search interface, download your chosen quantization format (Q4_K_M for balanced performance, Q8_0 for accuracy), and launch the built-in OpenAI-compatible API server. Models like DeepSeek Coder V2, Qwen2.5-Coder, and CodeLlama variants work particularly well for development tasks. ...

April 10, 2026 · 9 min · Local AI Ops

MegaTrain: Full Precision Training of 100B+ Models on

TL;DR MegaTrain represents a breakthrough in democratizing large language model training by enabling full-precision training of models exceeding 100 billion parameters on consumer-grade hardware without cloud dependencies. Traditional training approaches require expensive GPU clusters with hundreds of gigabytes of VRAM, but MegaTrain employs aggressive memory optimization techniques including gradient checkpointing, CPU offloading, and dynamic tensor swapping to fit massive models into systems with as little as 24GB of VRAM. The framework integrates seamlessly with local AI stacks like Ollama and LM Studio, allowing you to train custom models on your own hardware while maintaining complete data privacy. Unlike cloud-based training services that charge recurring fees and expose your training data to third parties, MegaTrain runs entirely on your infrastructure using standard PyTorch backends. The system achieves this through a combination of mixed-precision computation scheduling, intelligent layer freezing, and memory-mapped parameter storage that keeps most weights on NVMe drives while actively training only small subsets in GPU memory. For homelab operators and privacy-focused teams, this means you can fine-tune models like Llama 3 70B or Mixtral 8x22B using your existing hardware setup without compromising on training quality or sending proprietary data off-premises. The framework supports distributed training across multiple consumer GPUs using standard networking, so you can scale from a single RTX 4090 to a cluster of gaming cards as your needs grow. MegaTrain outputs standard safetensors and GGUF formats compatible with llama.cpp and Open WebUI, ensuring your trained models integrate directly into your existing local AI deployment pipeline without conversion headaches. ...

April 9, 2026 · 9 min · Local AI Ops

LLM Fine-Tuning with Ollama and llama.cpp in 2026

TL;DR Fine-tuning local LLMs in 2026 means adapting pre-trained models to your specific use case without cloud dependencies. Both Ollama and llama.cpp support running fine-tuned models, but the actual training happens with separate tools like Unsloth, Axolotl, or llama.cpp’s built-in training capabilities. The typical workflow: train or fine-tune using a framework that outputs GGUF format, then serve the resulting model through Ollama or llama-server. Ollama pulls base models from its library, but you can import custom GGUF files using ollama create with a Modelfile. For llama.cpp, point llama-server directly at your fine-tuned GGUF file. ...
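
The import step mentioned here needs only a short Modelfile that points at the GGUF file. The sketch below generates one. render_modelfile is a hypothetical helper, and the file name and temperature in the usage note are placeholders, not values from the post.

```shell
# Print a minimal Modelfile for a local GGUF file.
# $1 = path to the GGUF file, $2 = sampling temperature (optional)
# render_modelfile is a hypothetical helper name used for illustration.
render_modelfile() {
  printf 'FROM %s\n' "$1"
  if [ -n "${2:-}" ]; then
    printf 'PARAMETER temperature %s\n' "$2"
  fi
}
```

For example, render_modelfile ./my-finetune.gguf 0.7 > Modelfile followed by ollama create my-finetune -f Modelfile registers the model with Ollama, after which ollama run my-finetune starts it. With llama.cpp, no Modelfile is needed: llama-server -m ./my-finetune.gguf serves the same file directly.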

April 7, 2026 · 8 min · Local AI Ops

Running Ollama Serve: Complete Setup Guide for Local AI

TL;DR The ollama serve command launches the Ollama daemon that exposes a REST API on port 11434 for running local LLM inference. Unlike the simpler ollama run command for interactive chat, serve mode is designed for persistent server deployments where multiple applications need programmatic access to your models. After installing Ollama with curl -fsSL https://ollama.com/install.sh | sh, the service typically starts automatically via systemd on Linux. You can verify it’s running with systemctl status ollama or by checking if port 11434 responds to API requests. The daemon loads models on-demand when applications request them through the HTTP API. ...
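
As a rough illustration of programmatic access, the sketch below builds a request body for the /api/generate endpoint and wraps the curl call in a function. OLLAMA_URL and both function names are made up for this example, and the payload builder assumes the prompt contains no double quotes or backslashes, since it does no JSON escaping.

```shell
# Base URL of the local daemon. OLLAMA_URL is a variable name invented for
# this sketch; the default matches the port named above.
OLLAMA_URL="${OLLAMA_URL:-http://localhost:11434}"

# Print a JSON body for /api/generate with streaming disabled.
# $1 = model name, $2 = prompt (assumed free of quotes and backslashes)
generate_payload() {
  printf '{"model":"%s","prompt":"%s","stream":false}\n' "$1" "$2"
}

# Send the request. Defined here but not called, so sourcing this file
# does not touch the network.
ollama_generate() {
  curl -s "$OLLAMA_URL/api/generate" -d "$(generate_payload "$1" "$2")"
}
```

With the daemon running and a model already pulled, ollama_generate llama3.1:8b "Say hello" should return a single JSON object. The first call may be slow because the daemon loads the model on demand.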

April 6, 2026 · 9 min · Local AI Ops

Building Tiny LLMs Locally: A Beginner's Guide with Ollama

TL;DR Tiny LLMs (1-3 billion parameters) let you run capable AI models on modest hardware without cloud dependencies. Unlike larger models requiring expensive GPUs, tiny models run smoothly on consumer laptops, Raspberry Pi 5 devices, and older workstations with 8GB RAM. This guide shows you how to deploy them locally using Ollama. ...

April 6, 2026 · 9 min · Local AI Ops

Air-Gapped AI Deployment: Running Ollama Without Internet

TL;DR

    # On connected machine: download everything
    curl -fsSL https://ollama.com/install.sh -o ollama-install.sh
    ollama pull llama3.1:8b
    tar czf ollama-models.tar.gz -C /usr/share/ollama .ollama/

    # Transfer to air-gapped machine via USB

    # On air-gapped machine: install and restore
    bash ollama-install.sh   # works offline if binary is bundled
    tar xzf ollama-models.tar.gz -C /usr/share/ollama/
    sudo systemctl start ollama
    ollama list   # verify models are available

The full process involves downloading the Ollama binary, pulling models, packaging everything, transferring via approved media, and restoring on the isolated system. This guide covers each step in detail. ...

April 6, 2026 · 8 min · Local AI Ops

Troubleshooting Ollama: Common Errors and Fixes

TL;DR Quick diagnostic commands for the most common Ollama problems:

    # Check if Ollama is running
    systemctl status ollama
    curl http://localhost:11434/api/version

    # Check GPU detection
    ollama ps
    nvidia-smi   # NVIDIA
    rocm-smi     # AMD

    # Check disk space for model downloads
    df -h ~/.ollama

    # Check memory available
    free -h

    # View Ollama logs
    journalctl -u ollama -n 50 --no-pager

    # Force CPU-only mode if GPU is broken
    OLLAMA_NUM_GPU=0 ollama serve

If you are running into an issue not covered here, the Ollama logs are almost always the fastest path to a diagnosis. Start there. ...

April 6, 2026 · 9 min · Local AI Ops

Local AI on Apple Silicon: Optimizing Ollama for M-Series Macs

TL;DR

    # Install Ollama on macOS
    brew install ollama
    # Or download from https://ollama.com

    # Start the server
    ollama serve &

    # Pull and run a model
    ollama pull llama3.1:8b
    ollama run llama3.1:8b

    # Check Metal GPU utilization
    sudo powermetrics --samplers gpu_power -i 1000 -n 1

Apple Silicon’s unified memory means the GPU draws on the same RAM pool as the CPU, so most of your system memory can serve as VRAM (macOS holds part of it back for the system). An M1 with 16 GB can comfortably run 7B-13B models. An M3 Max with 96 GB can run 70B models at interactive speeds. Ollama uses Metal acceleration automatically – no configuration required. ...

April 6, 2026 · 9 min · Local AI Ops