Nvidia Vera CPU: Self-Hosted AI with Ollama

TL;DR Nvidia’s Vera CPU architecture brings ARM-based processing designed specifically for AI workloads to self-hosted environments. Unlike traditional x86 chips, Vera integrates neural processing units directly into the CPU die, making it particularly effective for running multiple Ollama instances simultaneously without GPU bottlenecks. For homelab operators, this means you can run agent frameworks like AutoGen or LangChain with local LLMs while maintaining responsive system performance. A typical setup might run three Ollama instances – one for code generation with codellama:13b, another for general tasks with llama2:13b, and a third for function calling with mistral:7b – all on a single Vera-based system without thermal throttling. ...
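The multi-instance setup described above can be sketched as a small task router. The ports 11435 and 11436 and the task-to-model mapping are illustrative assumptions (11434 is Ollama's default port); the models are the ones named in the teaser, and the request shape follows Ollama's /api/generate endpoint.

```python
# Sketch: route tasks to separate Ollama instances on one host.
# Ports 11435/11436 and the task→model mapping are assumptions; each
# instance would be started separately with its own OLLAMA_HOST binding.
import json
import urllib.request

INSTANCES = {
    "code":     {"url": "http://127.0.0.1:11434", "model": "codellama:13b"},
    "general":  {"url": "http://127.0.0.1:11435", "model": "llama2:13b"},
    "function": {"url": "http://127.0.0.1:11436", "model": "mistral:7b"},
}

def build_request(task: str, prompt: str) -> urllib.request.Request:
    """Build a /api/generate request for the instance handling this task."""
    inst = INSTANCES[task]
    payload = {"model": inst["model"], "prompt": prompt, "stream": False}
    return urllib.request.Request(
        inst["url"] + "/api/generate",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = build_request("code", "Write a quicksort in Python.")
print(req.full_url)  # http://127.0.0.1:11434/api/generate
```

Sending `req` with `urllib.request.urlopen` would dispatch the prompt to the code-generation instance while the other two stay free for their own queues.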

March 17, 2026 · 9 min · Local AI Ops

LLM Architectures for Ollama and Local AI in 2026

TL;DR Modern LLMs running on Ollama use three primary architectures: decoder-only (GPT-style), encoder-decoder (T5-style), and encoder-only (BERT-style). For local deployment in 2026, decoder-only models dominate because they handle both understanding and generation with a single unified architecture, making them memory-efficient and straightforward to quantize. Decoder-only models like Llama, Mistral, and Qwen use causal attention – each token only sees previous tokens. This unidirectional flow means you can cache key-value pairs during generation, reducing compute for long conversations. When you run ollama run llama3.2:3b, you’re loading a decoder-only model optimized for streaming text generation with minimal VRAM overhead. ...
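The KV-caching benefit of causal attention can be shown with a toy single-head example: because each token attends only to earlier tokens, the keys and values computed at step t are reused unchanged at every later step. Dimensions and vector values below are illustrative, not from any real model.

```python
# Toy single-head causal attention demonstrating the KV cache: generating
# token t reuses the cached K,V of all tokens before t, so only the new
# query needs fresh attention compute.
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attend(q, keys, values):
    """Attention output for one query over cached keys/values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]

# Incremental decoding: append each step's K,V to the cache, then attend
# with only the newest query instead of rebuilding attention for all tokens.
k_cache, v_cache = [], []
steps = [([1.0, 0.0], [0.5, 0.5], [1.0, 2.0]),   # (q, k, v) per token
         ([0.0, 1.0], [0.2, 0.8], [3.0, 4.0])]
outputs = []
for q, k, v in steps:
    k_cache.append(k)
    v_cache.append(v)
    outputs.append(attend(q, k_cache, v_cache))

# Recomputing from scratch for the last token matches the cached result.
full = attend(steps[-1][0], [s[1] for s in steps], [s[2] for s in steps])
assert all(abs(a - b) < 1e-12 for a, b in zip(outputs[-1], full))
```

The cache trades memory for compute, which is why long conversations stay responsive on decoder-only models but inflate VRAM use as context grows.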

March 16, 2026 · 9 min · Local AI Ops

Advanced LLM Parameter Tuning for Production Workloads

TL;DR This guide covers advanced parameter tuning techniques beyond basic temperature and top-p settings. For foundational concepts, installation, and basic parameter explanations, see our Complete Guide to Running Local LLMs. Advanced topics covered: dynamic temperature scheduling based on task type, repeat penalty optimization for long-form content, mirostat sampling for consistent output quality, batch processing configuration, and A/B testing parameter combinations in production. ...
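The tuning techniques listed above map onto Ollama's per-request `options` field. The option names (mirostat, mirostat_tau, repeat_penalty, repeat_last_n) are Ollama's own; the specific values below are illustrative starting points, not tuned production numbers.

```python
# Sketch: per-task option presets for Ollama's /api/generate endpoint.
# Values are illustrative defaults to A/B test from, not tuned numbers.
import json

PRESETS = {
    # Long-form writing: stronger repeat penalty over a wider window.
    "long_form": {"temperature": 0.8, "repeat_penalty": 1.18,
                  "repeat_last_n": 256},
    # Consistent output quality via Mirostat 2 instead of fixed temperature.
    "mirostat": {"mirostat": 2, "mirostat_tau": 5.0, "mirostat_eta": 0.1},
    # Low-variance baseline for A/B comparisons.
    "baseline": {"temperature": 0.2, "top_p": 0.9, "seed": 42},
}

def generate_payload(model: str, prompt: str, preset: str) -> str:
    """Serialize a /api/generate request body using a named preset."""
    body = {"model": model, "prompt": prompt, "stream": False,
            "options": PRESETS[preset]}
    return json.dumps(body)

payload = generate_payload("llama3.2:3b", "Summarize KV caching.", "mirostat")
```

Keeping presets in one dict makes A/B tests a matter of swapping the preset name and logging which one produced each output.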

February 26, 2026 · 7 min · Local AI Ops

RTX 3090 for AI: Best Value GPU for Local LLM Hosting

TL;DR The NVIDIA RTX 3090 is the best price-to-performance GPU for local AI work in 2026. At $700-900 used, it delivers 24GB of VRAM — the same amount as GPUs costing 2-3x more. That 24GB is the critical spec: it determines which models you can run and how many customers you can serve. ...
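Why 24GB is the critical spec can be sketched with a rough rule of thumb: quantized weights take roughly parameters × bits / 8 bytes, plus overhead for KV cache and activations. The 20% overhead factor below is an assumption for illustration, not a measured figure.

```python
# Rough rule-of-thumb VRAM check for quantized models: weight memory is
# about params × bits/8 (so 1B params at 8-bit ≈ 1 GB), plus overhead
# for KV cache and activations. The 20% overhead is an assumption.

def fits_in_vram(params_b: float, bits: int, vram_gb: float = 24.0,
                 overhead: float = 0.20) -> bool:
    """True if a params_b-billion-parameter model quantized to `bits`
    plausibly fits in vram_gb of GPU memory."""
    weights_gb = params_b * bits / 8
    return weights_gb * (1 + overhead) <= vram_gb

# A 33B model at 4-bit (~16.5 GB weights) fits in 24 GB;
# a 70B model at 4-bit (~35 GB) does not.
print(fits_in_vram(33, 4), fits_in_vram(70, 4))  # True False
```

By this estimate, 24GB comfortably covers 13B models at 8-bit and 30B-class models at 4-bit, which is the tier a 12GB or 16GB card cannot reach.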

February 22, 2026 · 6 min · Local AI Ops

GPU vs CPU Inference with Ollama: Performance Guide

TL;DR GPU inference with Ollama delivers dramatically faster token generation compared to CPU-only setups on consumer hardware. The exact speedup depends on your specific GPU, CPU, and model, but the difference is immediately noticeable. The performance gap widens with larger models. Key takeaways for your hardware decisions: ...

February 21, 2026 · 9 min · Local AI Ops