Ollama on Raspberry Pi: Running Local LLMs on ARM

TL;DR

    # Install Ollama on Raspberry Pi (ARM64)
    curl -fsSL https://ollama.com/install.sh | sh

    # Pull a model that actually works on Pi
    ollama pull qwen2.5:0.5b
    ollama pull phi3:mini

    # Test it
    ollama run qwen2.5:0.5b "Write a Python function to read a CSV file"

    # Check memory usage
    ollama ps
    free -h

Raspberry Pi 5 with 8 GB RAM can run models up to 3B parameters at usable speeds. Stick to 0.5B-1.5B models for interactive use. Anything above 7B is not practical. ...

April 6, 2026 · 7 min · Local AI Ops

Tabby: Self-Hosted Code Completion with Local Models

TL;DR

    # Run Tabby with NVIDIA GPU using Docker
    docker run -d --name tabby \
      --gpus all \
      -p 8080:8080 \
      -v $HOME/.tabby:/data \
      tabbyml/tabby \
      serve --model StarCoder-1B --device cuda

    # Verify it is running
    curl http://localhost:8080/v1/health

    # Test a completion
    curl -X POST http://localhost:8080/v1/completions \
      -H "Content-Type: application/json" \
      -d '{"prompt": "def fibonacci(n):\n ", "language": "python"}'

Install the Tabby plugin in your IDE, point it at http://localhost:8080, and get Copilot-style completions backed entirely by local hardware. ...

April 6, 2026 · 7 min · Local AI Ops

Continue.dev with Ollama: Local AI Coding in VS Code

TL;DR

    # Install Ollama and pull models
    curl -fsSL https://ollama.com/install.sh | sh
    ollama pull qwen2.5-coder:7b
    ollama pull codellama:7b

    # Verify Ollama is running
    curl http://localhost:11434/api/tags

Install the Continue extension from the VS Code marketplace, open ~/.continue/config.json, point it at your local Ollama instance, and start coding with zero cloud dependencies. ...

April 6, 2026 · 7 min · Local AI Ops

LocalAI Setup: OpenAI API-Compatible Local Inference

TL;DR

    # Docker (quickest start)
    docker run -d --name localai -p 8080:8080 \
      -v localai-models:/build/models \
      localai/localai:latest-gpu-nvidia-cuda-12

    # Install a model from the gallery
    curl http://localhost:8080/models/apply -d '{"id": "llama-3.1-8b-instruct"}'

    # Test chat completions (same as OpenAI API)
    curl http://localhost:8080/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"model": "llama-3.1-8b-instruct", "messages": [{"role": "user", "content": "Hello"}]}'

    # Generate embeddings
    curl http://localhost:8080/v1/embeddings \
      -H "Content-Type: application/json" \
      -d '{"model": "text-embedding-ada-002", "input": "The quick brown fox"}'

Caution: LocalAI has no built-in authentication. Any process that can reach port 8080 can use the API. Use firewall rules, bind to localhost only, or put a reverse proxy with auth in front before exposing to a network. ...
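The chat completions call in the excerpt above can also be made from Python. The following is a minimal sketch using only the standard library, assuming the server is reachable at http://localhost:8080 as in the Docker command and that the response follows the OpenAI chat completions shape. The function names `build_chat_request` and `send_chat` are illustrative, not part of LocalAI.

```python
import json
import urllib.request

# Assumption: LocalAI is listening here, as in the Docker command in the excerpt.
BASE_URL = "http://localhost:8080"


def build_chat_request(model: str, prompt: str,
                       base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat(request: urllib.request.Request, timeout: float = 60.0) -> str:
    """Send the request and return the text of the first choice.

    Assumes an OpenAI-style response body: choices[0].message.content.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

Calling `send_chat(build_chat_request("llama-3.1-8b-instruct", "Hello"))` mirrors the curl command. Because the other servers on this page expose the same `/v1/chat/completions` path, passing a different `base_url` (for example port 8000 for vLLM or 5001 for KoboldCpp) should be the only change needed.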

April 6, 2026 · 8 min · Local AI Ops

KoboldCpp Quick Start: Run GGUF Models with One Binary

TL;DR

    # Download the latest release (Linux, CUDA)
    wget https://github.com/LostRuins/koboldcpp/releases/latest/download/koboldcpp-linux-x64-cuda1150
    chmod +x koboldcpp-linux-x64-cuda1150

    # Run a GGUF model
    ./koboldcpp-linux-x64-cuda1150 --model llama-3.1-8b-instruct.Q4_K_M.gguf \
      --gpulayers 99 --contextsize 4096 --port 5001

    # Web UI opens at http://localhost:5001
    # KoboldAI API at http://localhost:5001/api/
    # OpenAI-compatible API at http://localhost:5001/v1/chat/completions

Caution: KoboldCpp binds to localhost by default. If you use --host 0.0.0.0 to allow network access, there is no built-in authentication. Restrict access with firewall rules or a reverse proxy. ...

April 6, 2026 · 7 min · Local AI Ops

Text Generation WebUI Setup Guide for Local LLM Inference

TL;DR

    # Clone and run the one-click installer
    git clone https://github.com/oobabooga/text-generation-webui
    cd text-generation-webui
    bash start_linux.sh  # Installs conda env, dependencies, launches UI

    # Or use Docker
    docker compose up -d

    # Access the web interface
    # http://localhost:7860

    # Enable API mode
    python server.py --api --listen
    # API available at http://localhost:5000/v1/chat/completions

Caution: The web interface has no authentication by default. Do not use --listen (which binds to 0.0.0.0) on networks you do not control. Use --listen --api-key YOUR_SECRET if exposing the API, and put a reverse proxy with auth in front for production use. ...

April 6, 2026 · 7 min · Local AI Ops

vLLM Local Setup: High-Throughput LLM Serving Guide

TL;DR

    # Install vLLM (requires CUDA 12.1+ and Python 3.9+)
    pip install vllm

    # Serve a model with OpenAI-compatible API
    vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000

    # Test the endpoint
    curl http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "messages": [{"role": "user", "content": "Hello"}]}'

    # Docker deployment
    docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest \
      --model meta-llama/Llama-3.1-8B-Instruct

Caution: vLLM requires a Hugging Face account with accepted model licenses for gated models like Llama. Set HF_TOKEN in your environment before serving. Never expose the API port to untrusted networks without authentication – vLLM has no built-in auth layer. ...

April 6, 2026 · 8 min · Local AI Ops

Ollama Behind Nginx Reverse Proxy: SSL and Multi-User Setup

TL;DR

    # Install Nginx and Certbot
    sudo apt install nginx certbot python3-certbot-nginx

    # Get SSL certificate
    sudo certbot --nginx -d ollama.example.com

    # Test the proxy
    curl -s https://ollama.example.com/api/tags \
      -u admin:password | jq .

By default, Ollama listens on localhost:11434 with no authentication, no encryption, and no rate limiting. This is fine for single-user local development but inadequate for team use or any network-exposed deployment. Nginx solves all three problems as a reverse proxy layer in front of Ollama. ...
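The `curl -u admin:password` test above suggests the proxy is protected with HTTP Basic authentication. Assuming that is the case, a client has to send an `Authorization` header on every request. The sketch below builds that header with the standard library; the helper names are illustrative, and `admin`/`password` are the placeholder credentials from the excerpt.

```python
import base64
import urllib.request


def basic_auth_header(username: str, password: str) -> str:
    """Return the HTTP Basic Authorization header value for the credentials."""
    raw = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


def build_tags_request(base_url: str, username: str,
                       password: str) -> urllib.request.Request:
    """Build (but do not send) a GET request for /api/tags through the proxy."""
    return urllib.request.Request(
        url=f"{base_url}/api/tags",
        headers={"Authorization": basic_auth_header(username, password)},
    )
```

Passing the result of `build_tags_request("https://ollama.example.com", "admin", "password")` to `urllib.request.urlopen` is the Python counterpart of the curl test. Basic credentials are only base64-encoded, not encrypted, which is why the TLS certificate step matters.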

April 6, 2026 · 8 min · Local AI Ops

GGUF Quantization Explained: Choosing the Right Format for Local AI

TL;DR

    # Check quantization of an Ollama model
    ollama show llama3.2:3b --modelfile | grep -i quant

    # Inspect a GGUF file directly
    python3 -c "from gguf import GGUFReader; r = GGUFReader('model.gguf'); print([kv for kv in r.fields])"

    # Or use llama.cpp's built-in info
    ./llama-quantize --help

    # Convert and quantize with llama.cpp
    ./llama-quantize input.gguf output-Q4_K_M.gguf Q4_K_M

GGUF is the standard file format for running quantized LLMs locally. Quantization reduces model size and VRAM usage by representing weights with fewer bits. The tradeoff is a small reduction in output quality. Choosing the right quantization level depends on your available VRAM, the model size, and your quality requirements. ...
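The size reduction can be approximated with simple arithmetic: weight storage is roughly parameter count times bits per weight, divided by 8. The sketch below is a rough estimate under that assumption only. Real GGUF quantization types store extra per-block scale data, so their effective bits per weight is somewhat higher than the nominal number, and the `overhead_fraction` default is a hypothetical placeholder for context cache and runtime buffers, not a measured value.

```python
def estimate_weights_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate bytes needed to store the weights alone."""
    if n_params <= 0 or bits_per_weight <= 0:
        raise ValueError("n_params and bits_per_weight must be positive")
    return n_params * bits_per_weight / 8


def fits_in_memory(n_params: float, bits_per_weight: float,
                   available_bytes: float,
                   overhead_fraction: float = 0.2) -> bool:
    """Rough check: weights plus a fractional overhead must fit in memory.

    overhead_fraction is an assumed allowance (hypothetical default),
    meant to be tuned per model and context size.
    """
    needed = estimate_weights_bytes(n_params, bits_per_weight) * (1 + overhead_fraction)
    return needed <= available_bytes
```

For example, `estimate_weights_bytes(8e9, 4)` gives 4e9 bytes for an 8B-parameter model at a nominal 4 bits per weight, compared with 16e9 bytes at 16 bits. Treat the result as a lower bound when picking a quantization level for a given amount of VRAM.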

April 6, 2026 · 8 min · Local AI Ops

Ollama Model Management: Pull, Create, Copy, and Remove

TL;DR

    # Pull a model
    ollama pull llama3.2:3b

    # List all local models
    ollama list

    # Show model details (parameters, template, license)
    ollama show llama3.2:3b

    # Copy/rename a model
    ollama cp llama3.2:3b my-llama

    # Remove a model
    ollama rm llama3.2:3b

    # Check disk usage of model storage
    du -sh /usr/share/ollama/.ollama/models/

Ollama stores models as layered blobs, similar to Docker images. Understanding how models are stored, tagged, and shared lets you manage disk space effectively and avoid downloading duplicate data. ...
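The `du -sh` check above can be approximated from a script, for instance to log storage growth over time. This is a minimal sketch with a hypothetical helper name. It sums apparent file sizes rather than allocated disk blocks, so the figure can differ slightly from what `du` reports.

```python
import os


def dir_size_bytes(root: str) -> int:
    """Sum the sizes of all regular files under root, skipping symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks so a linked blob is not counted twice.
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total
```

Pointing it at the path from the excerpt, `dir_size_bytes("/usr/share/ollama/.ollama/models/")`, returns the total in bytes. That path applies to the Linux service install shown above, and reading it may require appropriate permissions.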

April 6, 2026 · 7 min · Local AI Ops