Posts

how do i serve a 70b llm efficiently on a multi-gpu server?

TL;DR Serving a 70B parameter model across multiple GPUs requires careful orchestration of tensor parallelism, memory management, and framework selection. The key challenge is that a 70B model in FP16 precision needs approximately 140GB of memory just for weights, plus additional overhead for KV cache and activations – far exceeding single GPU capacity. ...

ollama mtp

TL;DR MTP (Model Transfer Protocol) is an experimental feature in Ollama that enables direct model transfers between Ollama instances without re-downloading from the registry. Instead of each server pulling a 7GB model from ollama.com, one instance can push it directly to another over your local network. This matters for air-gapped environments, bandwidth-constrained deployments, and multi-node setups where you want consistent model versions across machines. ...

companies migrating gpt-4 openai to llama mistral self-hosted production case study

TL;DR Major enterprises are moving production AI workloads from GPT-4 to self-hosted Llama and Mistral models, achieving substantial cost reductions while maintaining acceptable quality for most use cases. This migration requires careful planning around API compatibility, prompt engineering adjustments, and performance validation. The typical migration path involves running both systems in parallel during a transition period, using an API compatibility layer that translates OpenAI-formatted requests to local model endpoints. Tools like LiteLLM and OpenAI-compatible servers in Ollama handle this translation, letting teams test self-hosted models without rewriting application code. ...

can ollama models access the internet

TL;DR No, Ollama models cannot access the internet directly. Models running through Ollama are completely offline and operate only on the data they were trained on plus whatever context you provide in your prompts. When you run ollama run llama3.2 or send requests to the API on port 11434, the model generates responses based purely on its training data and your conversation history – it has no mechanism to fetch live web content, query APIs, or retrieve current information. ...

Essential llama.cpp Command Line Flags for Local AI in 2026

TL;DR llama.cpp remains the fastest way to run quantized LLMs locally in 2026, but choosing the right command-line flags makes the difference between a sluggish 2 tokens/second and a responsive 30+ tokens/second experience. This guide covers the essential flags you need for optimal performance on consumer hardware. The most impactful flags control resource allocation: --n-gpu-layers offloads model layers to your GPU (start with -ngl 35 for 8GB VRAM), --threads sets CPU cores for processing (use physical cores minus 2), and --ctx-size defines context window length (2048 for chat, 8192 for document analysis). Getting these three right solves most performance issues. ...

How to Move Ollama Models to Another Drive in 2026

TL;DR Moving Ollama models to another drive requires changing the OLLAMA_MODELS environment variable and relocating your existing model files. By default, Ollama stores models in ~/.ollama/models on Linux systems, but you can point it to any directory with sufficient space. The fastest approach: stop the Ollama service, set OLLAMA_MODELS to your new location, move the existing models directory, then restart. For systemd-managed installations, edit /etc/systemd/system/ollama.service to add Environment=“OLLAMA_MODELS=/mnt/storage/ollama-models” under the [Service] section. After running systemctl daemon-reload and systemctl restart ollama, verify the new path with ollama list. ...

Odysseus: Complete Self-Hosted AI Workspace with Ollama

TL;DR Odysseus transforms your self-hosted infrastructure into a unified AI workspace that coordinates multiple capabilities – chat interfaces, code completion, image generation, and document analysis – through a single web interface. Unlike single-purpose tools that handle one task, Odysseus provides workspace management features designed for teams and complex projects that span multiple AI modalities. ...

Llama Models on AMD ROCm: Complete Self-Hosting Setup Guide

TL;DR Running Llama models on AMD GPUs requires ROCm-specific optimizations that differ significantly from NVIDIA CUDA workflows. This guide covers the complete setup for self-hosting Llama 2, Llama 3, and Code Llama variants on AMD hardware using ROCm 6.0+, with focus on memory management, compilation flags, and performance tuning that existing NVIDIA guides do not address. ...

llama.cpp Multi-GPU Support for Mixed Graphics Cards in 2026

TL;DR llama.cpp supports heterogeneous multi-GPU configurations, letting you mix NVIDIA, AMD, and even Intel Arc cards in the same system for local LLM inference. Unlike Ollama’s automatic GPU detection, llama.cpp requires explicit layer distribution using the -ngl flag combined with --split-mode and --tensor-split parameters. This gives you fine-grained control over which layers run on which card, essential when mixing a high-VRAM card with lower-capacity GPUs. ...

Fix Ollama Model Switching Causing 100% SSD Usage in 2026

TL;DR When you switch between models in Ollama, the service unloads the current GGUF file from memory and loads the new one from disk. Large models like llama3.1:70b or mixtral:8x7b can exceed 40GB, causing sustained disk reads that pin your SSD at maximum utilization. This becomes especially problematic when multiple users or applications trigger rapid model switches, creating a cascade of disk I/O that degrades system responsiveness. ...