Local AI Ops

Troubleshooting Ollama: Common Errors and Fixes

TL;DR

Quick diagnostic commands for the most common Ollama problems:

    # Check if Ollama is running
    systemctl status ollama
    curl http://localhost:11434/api/version

    # Check GPU detection
    ollama ps
    nvidia-smi   # NVIDIA
    rocm-smi     # AMD

    # Check disk space for model downloads
    df -h ~/.ollama

    # Check memory available
    free -h

    # View Ollama logs
    journalctl -u ollama -n 50 --no-pager

    # Force CPU-only mode if GPU is broken
    OLLAMA_NUM_GPU=0 ollama serve

If you run into an issue not covered here, the Ollama logs are almost always the fastest path to a diagnosis. Start there. ...

Local AI on Apple Silicon: Optimizing Ollama for M-Series Macs

TL;DR

    # Install Ollama on macOS
    brew install ollama
    # Or download from https://ollama.com

    # Start the server
    ollama serve &

    # Pull and run a model
    ollama pull llama3.1:8b
    ollama run llama3.1:8b

    # Check Metal GPU utilization
    sudo powermetrics --samplers gpu_power -i 1000 -n 1

Apple Silicon’s unified memory means the GPU draws on the same RAM pool as the CPU, so most of your system memory can serve as VRAM (macOS keeps a share back for the system). An M1 with 16 GB can comfortably run 7B-13B models. An M3 Max with 96 GB can run 70B models at interactive speeds. Ollama uses Metal acceleration automatically, with no configuration required. ...
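Whether a model fits in unified memory comes down to simple arithmetic: the weights take roughly parameters × bits per weight ÷ 8 bytes, plus headroom for the KV cache and runtime buffers. The sketch below is a rough estimator, not something Ollama itself provides; the function name `estimate_model_gb` and the 20% overhead factor are assumptions chosen for illustration.

```shell
# Rough memory estimate for a quantized model, in GB.
# Usage: estimate_model_gb <params_in_billions> <bits_per_weight>
# The 1.2 overhead factor (KV cache, runtime buffers) is an assumed,
# illustrative value, not a figure published by Ollama.
estimate_model_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 * 1.2 }'
}

estimate_model_gb 8 4    # 8B model at 4-bit quantization  -> 4.8
estimate_model_gb 70 4   # 70B model at 4-bit quantization -> 42.0
```

By this estimate a 70B model at 4-bit quantization needs on the order of 42 GB, which is in line with it running on a 96 GB machine but not on a 16 GB one. Real usage varies with context length and the exact quantization format.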

Ollama on Raspberry Pi: Running Local LLMs on ARM

TL;DR

    # Install Ollama on Raspberry Pi (ARM64)
    curl -fsSL https://ollama.com/install.sh | sh

    # Pull a model that actually works on Pi
    ollama pull qwen2.5:0.5b
    ollama pull phi3:mini

    # Test it
    ollama run qwen2.5:0.5b "Write a Python function to read a CSV file"

    # Check memory usage
    ollama ps
    free -h

Raspberry Pi 5 with 8 GB RAM can run models up to 3B parameters at usable speeds. Stick to 0.5B-1.5B models for interactive use. Anything above 7B is not practical. ...

Tabby: Self-Hosted Code Completion with Local Models

TL;DR

    # Run Tabby with NVIDIA GPU using Docker
    docker run -d --name tabby \
      --gpus all \
      -p 8080:8080 \
      -v $HOME/.tabby:/data \
      tabbyml/tabby \
      serve --model StarCoder-1B --device cuda

    # Verify it is running
    curl http://localhost:8080/v1/health

    # Test a completion
    curl -X POST http://localhost:8080/v1/completions \
      -H "Content-Type: application/json" \
      -d '{"prompt": "def fibonacci(n):\n ", "language": "python"}'

Install the Tabby plugin in your IDE, point it at http://localhost:8080, and get Copilot-style completions backed entirely by local hardware. ...

Continue.dev with Ollama: Local AI Coding in VS Code

TL;DR

    # Install Ollama and pull models
    curl -fsSL https://ollama.com/install.sh | sh
    ollama pull qwen2.5-coder:7b
    ollama pull codellama:7b

    # Verify Ollama is running
    curl http://localhost:11434/api/tags

Install the Continue extension from the VS Code marketplace, open ~/.continue/config.json, point it at your local Ollama instance, and start coding with zero cloud dependencies. ...

LocalAI Setup: OpenAI API-Compatible Local Inference

TL;DR

    # Docker (quickest start)
    docker run -d --name localai -p 8080:8080 \
      -v localai-models:/build/models \
      localai/localai:latest-gpu-nvidia-cuda-12

    # Install a model from the gallery
    curl http://localhost:8080/models/apply -d '{"id": "llama-3.1-8b-instruct"}'

    # Test chat completions (same as OpenAI API)
    curl http://localhost:8080/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"model": "llama-3.1-8b-instruct", "messages": [{"role": "user", "content": "Hello"}]}'

    # Generate embeddings
    curl http://localhost:8080/v1/embeddings \
      -H "Content-Type: application/json" \
      -d '{"model": "text-embedding-ada-002", "input": "The quick brown fox"}'

Caution: LocalAI has no built-in authentication. Any process that can reach port 8080 can use the API. Use firewall rules, bind to localhost only, or put a reverse proxy with auth in front before exposing to a network. ...
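One way to apply the "bind to localhost only" advice with the Docker setup is to publish the container port on the loopback address, so that only processes on the same machine can reach it. This is a minimal sketch: the function name `start_localai_loopback` and the `DRY_RUN` switch are hypothetical helpers introduced here, while the image, volume, and port are the ones from the quick start.

```shell
# Start LocalAI with port 8080 published on 127.0.0.1 only.
# Set DRY_RUN=1 to print the docker command instead of running it
# (DRY_RUN is a convenience of this sketch, not a Docker or LocalAI feature).
start_localai_loopback() {
  set -- docker run -d --name localai \
    -p 127.0.0.1:8080:8080 \
    -v localai-models:/build/models \
    localai/localai:latest-gpu-nvidia-cuda-12
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"     # show what would be executed
  else
    "$@"          # run the command as built above
  fi
}
```

Run `DRY_RUN=1 start_localai_loopback` to inspect the command first. With the `127.0.0.1:` prefix on `-p`, other hosts on the network cannot connect to the API, though any local process still can.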

KoboldCpp Quick Start: Run GGUF Models with One Binary

TL;DR

    # Download the latest release (Linux, CUDA)
    wget https://github.com/LostRuins/koboldcpp/releases/latest/download/koboldcpp-linux-x64-cuda1150
    chmod +x koboldcpp-linux-x64-cuda1150

    # Run a GGUF model
    ./koboldcpp-linux-x64-cuda1150 --model llama-3.1-8b-instruct.Q4_K_M.gguf \
      --gpulayers 99 --contextsize 4096 --port 5001

    # Web UI opens at http://localhost:5001
    # KoboldAI API at http://localhost:5001/api/
    # OpenAI-compatible API at http://localhost:5001/v1/chat/completions

Caution: KoboldCpp binds to localhost by default. If you use --host 0.0.0.0 to allow network access, there is no built-in authentication. Restrict access with firewall rules or a reverse proxy. ...

Text Generation WebUI Setup Guide for Local LLM Inference

TL;DR

    # Clone and run the one-click installer
    git clone https://github.com/oobabooga/text-generation-webui
    cd text-generation-webui
    bash start_linux.sh   # Installs conda env, dependencies, launches UI

    # Or use Docker
    docker compose up -d

    # Access the web interface
    # http://localhost:7860

    # Enable API mode
    python server.py --api --listen
    # API available at http://localhost:5000/v1/chat/completions

Caution: The web interface has no authentication by default. Do not use --listen (which binds to 0.0.0.0) on networks you do not control. Use --listen --api-key YOUR_SECRET if exposing the API, and put a reverse proxy with auth in front for production use. ...

vLLM Local Setup: High-Throughput LLM Serving Guide

TL;DR

    # Install vLLM (requires CUDA 12.1+ and Python 3.9+)
    pip install vllm

    # Serve a model with OpenAI-compatible API
    vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000

    # Test the endpoint
    curl http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "messages": [{"role": "user", "content": "Hello"}]}'

    # Docker deployment
    docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest \
      --model meta-llama/Llama-3.1-8B-Instruct

Caution: vLLM requires a Hugging Face account with accepted model licenses for gated models like Llama. Set HF_TOKEN in your environment before serving. Never expose the API port to untrusted networks without authentication; vLLM has no built-in auth layer. ...

Ollama Behind Nginx Reverse Proxy: SSL and Multi-User Setup

TL;DR

    # Install Nginx and Certbot
    sudo apt install nginx certbot python3-certbot-nginx

    # Get SSL certificate
    sudo certbot --nginx -d ollama.example.com

    # Test the proxy
    curl -s https://ollama.example.com/api/tags \
      -u admin:password | jq .

By default, Ollama listens on localhost:11434 with no authentication, no encryption, and no rate limiting. This is fine for single-user local development but inadequate for team use or any network-exposed deployment. Nginx solves all three problems as a reverse proxy layer in front of Ollama. ...
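To make the three roles concrete, here is a minimal sketch of a shell helper that writes an Nginx site combining basic auth, rate limiting, and proxying to the local Ollama listener. It is not taken from the full article: the helper name `write_ollama_site`, the zone name `ollama`, the rate and burst values, the timeout, and the `/etc/nginx/.htpasswd` path are all assumptions for illustration. Encryption is left to Certbot, which adds the TLS directives to this server block when run with `--nginx`.

```shell
# Write a minimal Nginx site for Ollama.
# Usage: write_ollama_site <server_name> <output_file>
# Zone name, rate/burst values, timeout, and htpasswd path are
# illustrative assumptions; adjust them to your deployment.
write_ollama_site() {
  cat > "$2" <<EOF
# Rate limiting: track clients by IP, allow 10 requests/second each
limit_req_zone \$binary_remote_addr zone=ollama:10m rate=10r/s;

server {
    listen 80;
    server_name $1;

    location / {
        # Authentication: HTTP basic auth against an htpasswd file
        auth_basic "Ollama";
        auth_basic_user_file /etc/nginx/.htpasswd;

        # Apply the rate limit defined above, tolerating short bursts
        limit_req zone=ollama burst=20 nodelay;

        # Forward requests to the local Ollama listener
        proxy_pass http://127.0.0.1:11434;
        proxy_set_header Host localhost:11434;

        # Pass streamed tokens through without buffering
        proxy_buffering off;
        proxy_read_timeout 300s;
    }
}
EOF
}
```

A typical use would be `write_ollama_site ollama.example.com ./ollama.conf`, then copying the file into your Nginx site directory, creating the password file (for example with `htpasswd`), and validating with `sudo nginx -t` before reloading.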