Posts

LLM Fine-Tuning with Ollama and llama.cpp in 2026

TL;DR Fine-tuning local LLMs in 2026 means adapting pre-trained models to your specific use case without cloud dependencies. Both Ollama and llama.cpp support running fine-tuned models, but the actual training happens with separate tools like Unsloth, Axolotl, or llama.cpp’s built-in training capabilities. The typical workflow: train or fine-tune using a framework that outputs GGUF format, then serve the resulting model through Ollama or llama-server. Ollama pulls base models from its library, but you can import custom GGUF files using ollama create with a Modelfile. For llama.cpp, point llama-server directly at your fine-tuned GGUF file. ...
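The import step mentioned above can be sketched as follows. This is a minimal example, not the article's own procedure: GGUF_PATH and MODEL_NAME are hypothetical placeholders, and the temperature value is an arbitrary illustration. It writes a Modelfile to a scratch directory and only calls ollama create when both Ollama and the GGUF file are actually present.

```shell
#!/bin/sh
# Sketch: import a fine-tuned GGUF file into Ollama via a Modelfile.
# GGUF_PATH and MODEL_NAME are hypothetical placeholders; adjust to your setup.
GGUF_PATH="${GGUF_PATH:-$PWD/my-finetune.gguf}"
MODEL_NAME="${MODEL_NAME:-my-finetune}"
WORKDIR="$(mktemp -d)"

# A Modelfile needs a FROM line; PARAMETER lines are optional.
# An absolute path is used so the Modelfile location does not matter.
cat > "$WORKDIR/Modelfile" <<EOF
FROM $GGUF_PATH
PARAMETER temperature 0.7
EOF

cat "$WORKDIR/Modelfile"

# Register the model only if Ollama is installed and the GGUF file exists.
if command -v ollama >/dev/null 2>&1 && [ -f "$GGUF_PATH" ]; then
  ollama create "$MODEL_NAME" -f "$WORKDIR/Modelfile"
else
  echo "Skipping 'ollama create': ollama or $GGUF_PATH not found" >&2
fi
```

Once the model is created, ollama run with the chosen model name should start it like any library model.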

Running Ollama Serve: Complete Setup Guide for Local AI

TL;DR The ollama serve command launches the Ollama daemon that exposes a REST API on port 11434 for running local LLM inference. Unlike the simpler ollama run command for interactive chat, serve mode is designed for persistent server deployments where multiple applications need programmatic access to your models. After installing Ollama with curl -fsSL https://ollama.com/install.sh | sh, the service typically starts automatically via systemd on Linux. You can verify it’s running with systemctl status ollama or by checking if port 11434 responds to API requests. The daemon loads models on-demand when applications request them through the HTTP API. ...
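Checking whether port 11434 responds can be scripted. The sketch below assumes curl is installed and that the daemon listens on the default address; the wait_for_ollama helper and the OLLAMA_URL variable are names introduced here for illustration, not part of Ollama.

```shell
#!/bin/sh
# Sketch: wait until the Ollama REST API answers, then list installed models.
# OLLAMA_URL is a variable local to this script (default: the standard port).
OLLAMA_URL="${OLLAMA_URL:-http://localhost:11434}"

# Return 0 as soon as /api/version answers, 1 after $1 failed attempts.
wait_for_ollama() {
  attempts="${1:-10}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if curl -fsS --max-time 2 "$OLLAMA_URL/api/version" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  return 1
}

if wait_for_ollama 3; then
  echo "Ollama is answering at $OLLAMA_URL"
  # /api/tags lists the models available locally.
  curl -fsS "$OLLAMA_URL/api/tags"
else
  echo "No response from $OLLAMA_URL" >&2
fi
```

A loop like this is useful in startup scripts for applications that depend on the daemon, since systemd may report the unit as active slightly before the API accepts requests.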

Building Tiny LLMs Locally: A Beginner's Guide with Ollama

TL;DR Tiny LLMs (1-3 billion parameters) let you run capable AI models on modest hardware without cloud dependencies. Unlike larger models requiring expensive GPUs, tiny models run smoothly on consumer laptops, Raspberry Pi 5 devices, and older workstations with 8GB RAM. This guide shows you how to deploy them locally using Ollama. ...

Air-Gapped AI Deployment: Running Ollama Without Internet

TL;DR

    # On connected machine: download everything
    curl -fsSL https://ollama.com/install.sh -o ollama-install.sh
    ollama pull llama3.1:8b
    tar czf ollama-models.tar.gz -C /usr/share/ollama .ollama/

    # Transfer to air-gapped machine via USB

    # On air-gapped machine: install and restore
    bash ollama-install.sh    # works offline if binary is bundled
    tar xzf ollama-models.tar.gz -C /usr/share/ollama/
    sudo systemctl start ollama
    ollama list               # verify models are available

The full process involves downloading the Ollama binary, pulling models, packaging everything, transferring via approved media, and restoring on the isolated system. This guide covers each step in detail. ...

Troubleshooting Ollama: Common Errors and Fixes

TL;DR Quick diagnostic commands for the most common Ollama problems:

    # Check if Ollama is running
    systemctl status ollama
    curl http://localhost:11434/api/version

    # Check GPU detection
    ollama ps
    nvidia-smi    # NVIDIA
    rocm-smi      # AMD

    # Check disk space for model downloads
    df -h ~/.ollama

    # Check memory available
    free -h

    # View Ollama logs
    journalctl -u ollama -n 50 --no-pager

    # Force CPU-only mode if GPU is broken
    OLLAMA_NUM_GPU=0 ollama serve

If you are running into an issue not covered here, the Ollama logs are almost always the fastest path to a diagnosis. Start there. ...

Local AI on Apple Silicon: Optimizing Ollama for M-Series Macs

TL;DR

    # Install Ollama on macOS
    brew install ollama
    # Or download from https://ollama.com

    # Start the server
    ollama serve &

    # Pull and run a model
    ollama pull llama3.1:8b
    ollama run llama3.1:8b

    # Check Metal GPU utilization
    sudo powermetrics --samplers gpu_power -i 1000 -n 1

Apple Silicon’s unified memory means the GPU shares the system RAM pool, so most of it is usable as VRAM. An M1 with 16 GB can comfortably run 7B-13B models. An M3 Max with 96 GB can run 70B models at interactive speeds. Ollama uses Metal acceleration automatically, with no configuration required. ...

Ollama on Raspberry Pi: Running Local LLMs on ARM

TL;DR

    # Install Ollama on Raspberry Pi (ARM64)
    curl -fsSL https://ollama.com/install.sh | sh

    # Pull a model that actually works on Pi
    ollama pull qwen2.5:0.5b
    ollama pull phi3:mini

    # Test it
    ollama run qwen2.5:0.5b "Write a Python function to read a CSV file"

    # Check memory usage
    ollama ps
    free -h

Raspberry Pi 5 with 8 GB RAM can run models up to 3B parameters at usable speeds. Stick to 0.5B-1.5B models for interactive use. Anything above 7B is not practical. ...

Tabby: Self-Hosted Code Completion with Local Models

TL;DR

    # Run Tabby with NVIDIA GPU using Docker
    docker run -d --name tabby \
      --gpus all \
      -p 8080:8080 \
      -v $HOME/.tabby:/data \
      tabbyml/tabby \
      serve --model StarCoder-1B --device cuda

    # Verify it is running
    curl http://localhost:8080/v1/health

    # Test a completion
    curl -X POST http://localhost:8080/v1/completions \
      -H "Content-Type: application/json" \
      -d '{"prompt": "def fibonacci(n):\n ", "language": "python"}'

Install the Tabby plugin in your IDE, point it at http://localhost:8080, and get Copilot-style completions backed entirely by local hardware. ...

Continue.dev with Ollama: Local AI Coding in VS Code

TL;DR

    # Install Ollama and pull models
    curl -fsSL https://ollama.com/install.sh | sh
    ollama pull qwen2.5-coder:7b
    ollama pull codellama:7b

    # Verify Ollama is running
    curl http://localhost:11434/api/tags

Install the Continue extension from the VS Code marketplace, open ~/.continue/config.json, point it at your local Ollama instance, and start coding with zero cloud dependencies. ...
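Pointing Continue at a local Ollama instance might look like the sketch below. The field names (models, tabAutocompleteModel, provider, apiBase) follow the config.json layout as commonly documented for Continue, but treat them as an assumption and check the documentation for your installed version, since the configuration format has changed between releases. The titles are arbitrary labels, and the script writes to a scratch path (CONFIG_OUT, a name introduced here) so an existing config is not overwritten.

```shell
#!/bin/sh
# Sketch: generate an example Continue config that targets local Ollama.
# CONFIG_OUT is a scratch location; copy the result to ~/.continue/config.json
# only after reviewing it against your Continue version's documentation.
CONFIG_OUT="${CONFIG_OUT:-$(mktemp -d)/config.json}"

cat > "$CONFIG_OUT" <<'EOF'
{
  "models": [
    {
      "title": "Qwen2.5 Coder 7B (local)",
      "provider": "ollama",
      "model": "qwen2.5-coder:7b",
      "apiBase": "http://localhost:11434"
    }
  ],
  "tabAutocompleteModel": {
    "title": "Qwen2.5 Coder 7B (autocomplete)",
    "provider": "ollama",
    "model": "qwen2.5-coder:7b",
    "apiBase": "http://localhost:11434"
  }
}
EOF

echo "Wrote example config to $CONFIG_OUT"
```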

LocalAI Setup: OpenAI API-Compatible Local Inference

TL;DR

    # Docker (quickest start)
    docker run -d --name localai -p 8080:8080 \
      -v localai-models:/build/models \
      localai/localai:latest-gpu-nvidia-cuda-12

    # Install a model from the gallery
    curl http://localhost:8080/models/apply -d '{"id": "llama-3.1-8b-instruct"}'

    # Test chat completions (same as OpenAI API)
    curl http://localhost:8080/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"model": "llama-3.1-8b-instruct", "messages": [{"role": "user", "content": "Hello"}]}'

    # Generate embeddings
    curl http://localhost:8080/v1/embeddings \
      -H "Content-Type: application/json" \
      -d '{"model": "text-embedding-ada-002", "input": "The quick brown fox"}'

Caution: LocalAI has no built-in authentication. Any process that can reach port 8080 can use the API. Use firewall rules, bind to localhost only, or put a reverse proxy with auth in front before exposing to a network. ...
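One way to apply the "bind to localhost only" advice is Docker's host-address form of the publish flag (-p HOST_IP:HOST_PORT:CONTAINER_PORT). The sketch below is a dry run that only prints the command; BIND_ADDR, PORT, and the localai_cmd helper are names introduced here for illustration, and the image and volume are the ones from the excerpt above.

```shell
#!/bin/sh
# Sketch: publish the LocalAI port on the loopback interface only.
# BIND_ADDR and PORT are variables local to this script.
BIND_ADDR="${BIND_ADDR:-127.0.0.1}"
PORT="${PORT:-8080}"
PUBLISH="$BIND_ADDR:$PORT:8080"

# Print the docker command instead of running it (dry run).
localai_cmd() {
  echo "docker run -d --name localai -p $PUBLISH -v localai-models:/build/models localai/localai:latest-gpu-nvidia-cuda-12"
}

localai_cmd
# To actually launch the container: eval "$(localai_cmd)"
```

With this binding, other machines on the network cannot reach the API directly; remote access would then go through a reverse proxy that enforces authentication.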