Nvidia Vera CPU: Self-Hosted AI with Ollama

TL;DR Nvidia’s Vera CPU architecture brings ARM-based processing designed specifically for AI workloads to self-hosted environments. Unlike traditional x86 chips, Vera integrates neural processing units directly into the CPU die, making it particularly effective for running multiple Ollama instances simultaneously without GPU bottlenecks. For homelab operators, this means you can run agent frameworks like AutoGen or LangChain with local LLMs while maintaining responsive system performance. A typical setup might run three Ollama instances – one for code generation with codellama:13b, another for general tasks with llama2:13b, and a third for function calling with mistral:7b – all on a single Vera-based system without thermal throttling. ...
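The three-instance layout described above can be sketched as a simple routing table. The port numbers and the helper name below are illustrative assumptions, not values from the article; Ollama's default port is 11434, and additional instances are typically bound to other ports via the `OLLAMA_HOST` environment variable.

```python
# Hypothetical routing table for three Ollama instances on one host.
# Ports beyond the default 11434 are an assumption: each extra instance
# would be started with e.g.  OLLAMA_HOST=127.0.0.1:11435 ollama serve
ROUTES = {
    "code":    {"model": "codellama:13b", "base_url": "http://127.0.0.1:11434"},
    "general": {"model": "llama2:13b",    "base_url": "http://127.0.0.1:11435"},
    "tools":   {"model": "mistral:7b",    "base_url": "http://127.0.0.1:11436"},
}

def route(task_type: str) -> tuple[str, str]:
    """Return (model, base_url) for a task type, falling back to 'general'."""
    entry = ROUTES.get(task_type, ROUTES["general"])
    return entry["model"], entry["base_url"]
```

An agent framework would call `route("code")` before issuing a request, so each workload lands on its dedicated instance.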

March 17, 2026 · 9 min · Local AI Ops

Running Qwen2.5 Locally with Ollama: Setup Guide

TL;DR Qwen2.5 models from Alibaba Cloud offer exceptional bilingual performance in Chinese and English, with particular strengths in coding, mathematics, and multilingual reasoning tasks. Unlike Llama models, Qwen2.5 variants excel at code generation across multiple programming languages and demonstrate superior performance on mathematical problem-solving benchmarks. The model family ranges from the compact 0.5B parameter version suitable for edge devices to the powerful 72B parameter variant for complex reasoning tasks. ...

March 13, 2026 · 9 min · Local AI Ops

Linux GPU Hotplug: Optimizing Detection for Ollama

TL;DR Linux hardware hotplug events let your system detect and configure GPUs automatically when they appear or change state. For local LLM deployments with Ollama and LM Studio, proper hotplug handling ensures your models can leverage GPU acceleration without manual intervention after driver updates, system reboots, or hardware changes. ...
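As a rough illustration of the mechanism, a udev rule can run a script whenever a PCI display device appears. The rule and the script path below are illustrative assumptions, not configuration from the article:

```
# /etc/udev/rules.d/99-gpu-hotplug.rules (illustrative sketch)
# 0x030000 is the PCI class code for VGA-compatible display controllers.
ACTION=="add", SUBSYSTEM=="pci", ATTR{class}=="0x030000", \
    RUN+="/usr/local/bin/on-gpu-added.sh"
```

The triggered script might then restart the Ollama service so it re-detects available accelerators.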

March 6, 2026 · 9 min · Local AI Ops

Running Local LLMs with Ollama and llama.cpp

TL;DR Running LLMs locally gives you privacy, control, and cost savings compared to cloud APIs. This comprehensive guide covers everything you need to deploy production-ready local AI infrastructure using Ollama and llama.cpp. Both tools use GGUF format models with quantization to run efficiently on consumer hardware. Ollama provides a simple REST API and automatic model management, while llama.cpp offers fine-grained control and bleeding-edge features. You can run a 7B parameter model in 4-6GB RAM using Q4_K_M quantization, or larger models with GPU acceleration. ...
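The memory figure quoted above can be sanity-checked with simple arithmetic: at roughly 4.5 bits per weight for Q4_K_M (an approximation, since GGUF files mix quantization types per tensor), a 7B-parameter model needs about 7e9 × 4.5 / 8 bytes ≈ 3.9 GB for the weights alone, before KV-cache and runtime overhead. A minimal sketch:

```python
def model_size_gb(n_params: float, bits_per_weight: float = 4.5) -> float:
    """Approximate in-memory size of quantized weights in GB (10^9 bytes).

    bits_per_weight ~4.5 is a rough figure for Q4_K_M; real GGUF files
    vary by tensor mix, and inference adds KV-cache and runtime overhead.
    """
    return n_params * bits_per_weight / 8 / 1e9

weights_7b = model_size_gb(7e9)  # about 3.9 GB of weights for a 7B model
```

This is why the 4-6 GB range holds: the remaining headroom goes to the KV-cache and the runtime itself.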

February 27, 2026 · 10 min · Local AI Ops

Advanced LLM Parameter Tuning for Production Workloads

TL;DR This guide covers advanced parameter tuning techniques beyond basic temperature and top-p settings. For foundational concepts, installation, and basic parameter explanations, see our Complete Guide to Running Local LLMs. Advanced topics covered: dynamic temperature scheduling based on task type, repeat penalty optimization for long-form content, mirostat sampling for consistent output quality, batch processing configuration, and A/B testing parameter combinations in production. ...
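The idea behind task-based temperature scheduling can be sketched as a lookup with a safe fallback. The task categories and values below are illustrative assumptions, not the article's tuned recommendations:

```python
# Illustrative temperature schedule keyed by task type; the values are
# assumptions for demonstration. Lower temperature -> more deterministic output.
TEMPERATURE_SCHEDULE = {
    "code": 0.2,        # syntax-sensitive, near-deterministic output
    "extraction": 0.0,  # greedy decoding for structured answers
    "chat": 0.7,        # conversational variety
    "creative": 1.0,    # maximum diversity
}

def temperature_for(task_type: str, default: float = 0.7) -> float:
    """Pick a sampling temperature for a task type, clamped to [0, 2]."""
    t = TEMPERATURE_SCHEDULE.get(task_type, default)
    return max(0.0, min(2.0, t))
```

In production the schedule itself would be the thing you A/B test, swapping value sets per experiment arm.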

February 26, 2026 · 7 min · Local AI Ops

Building llama.cpp from GitHub for Local AI Models

TL;DR Building llama.cpp from source gives you a high-performance C/C++ inference engine for running GGUF-format language models locally without cloud dependencies. The process involves cloning the GitHub repository, installing build dependencies like cmake and a C++ compiler, then compiling with hardware acceleration flags for your CPU or GPU. ...
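On a Debian/Ubuntu system, the steps summarized above look roughly like the following. Package names and CMake flags such as `-DGGML_CUDA=ON` depend on your distribution, hardware, and the current state of the repository, so treat this as a sketch rather than an authoritative recipe:

```shell
# Illustrative build on Debian/Ubuntu; adjust packages and flags as needed.
sudo apt-get install -y git cmake build-essential

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# CPU-only build:
cmake -B build
cmake --build build --config Release -j"$(nproc)"

# Or, with the CUDA toolkit installed, enable GPU acceleration:
# cmake -B build -DGGML_CUDA=ON
```

The resulting binaries land under `build/bin/`.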

February 24, 2026 · 9 min · Local AI Ops

What is Ollama: Complete Guide to Running AI Models Locally

TL;DR Ollama is a command-line tool that lets you run large language models like Llama, Mistral, and CodeLlama directly on your Linux machine without sending data to external APIs. Install it with a single command, pull models from the ollama.com library, and interact via REST API on port 11434 or through the CLI. ...
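The REST interface mentioned above listens on port 11434 by default. A minimal sketch of a non-streaming request to the `/api/generate` endpoint, using only the standard library (the helper name is mine, and actually sending the request requires a running Ollama server):

```python
import json
import urllib.request

def build_generate_request(model: str, prompt: str,
                           host: str = "http://localhost:11434"):
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = build_generate_request("mistral:7b", "Say hello in one word.")
# With a server running:
#   json.load(urllib.request.urlopen(req))["response"]
```

The same payload shape works from curl or any HTTP client, which is what makes Ollama easy to wire into other tools.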

February 23, 2026 · 7 min · Local AI Ops

RTX 3090 for AI: Best Value GPU for Local LLM Hosting

TL;DR The NVIDIA RTX 3090 is the best price-to-performance GPU for local AI work in 2026. At $700-900 used, it delivers 24GB of VRAM — the same amount as GPUs costing 2-3x more. That 24GB is the critical spec: it determines which models you can run and how many customers you can serve. ...

February 22, 2026 · 6 min · Local AI Ops

Jan AI: Guide to Self-Hosting LLMs on Your Machine

TL;DR Jan AI is an open-source desktop application that lets you run large language models entirely on your local machine—no cloud dependencies, no data leaving your network. Think of it as a polished alternative to Ollama with a ChatGPT-like interface built in. ...

February 21, 2026 · 9 min · Local AI Ops

GPU vs CPU Inference with Ollama: Performance Guide

TL;DR GPU inference with Ollama delivers dramatically faster token generation compared to CPU-only setups on consumer hardware. The exact speedup depends on your specific GPU, CPU, and model, but the difference is immediately noticeable. The performance gap widens with larger models. Key takeaways for your hardware decisions: ...

February 21, 2026 · 9 min · Local AI Ops