GPU vs CPU Inference with Ollama: Performance Guide for Consumer Hardware

TL;DR: GPU inference with Ollama delivers roughly 5-15x faster token generation than CPU-only setups on consumer hardware. A mid-range NVIDIA RTX 4060 (8GB VRAM) generates ~40-60 tokens/second with Llama 3.1 8B, while a modern CPU (AMD Ryzen 7 5800X) manages only ~8-12 tokens/second. The performance gap widens dramatically with larger models. ...
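Throughput like this is easy to measure on your own machine, since Ollama's local REST API reports generation statistics with every response: the `/api/generate` endpoint returns `eval_count` (tokens generated) and `eval_duration` (time spent generating, in nanoseconds). Below is a minimal sketch, assuming a default Ollama install listening on `localhost:11434`, the `requests` package, and the `llama3.1:8b` model already pulled; the model tag and prompt are illustrative, not prescriptive.

```python
import requests

# Default endpoint for a local Ollama server.
OLLAMA_URL = "http://localhost:11434/api/generate"

def tokens_per_second(model: str, prompt: str) -> float:
    """Run one non-streamed generation and compute decode throughput.

    Ollama's /api/generate response includes eval_count (tokens generated)
    and eval_duration (nanoseconds spent generating), so throughput is
    eval_count / eval_duration * 1e9 tokens per second.
    """
    resp = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=600,  # large models on CPU can be slow to finish
    )
    resp.raise_for_status()
    data = resp.json()
    return data["eval_count"] / data["eval_duration"] * 1e9

if __name__ == "__main__":
    # Run the same model on GPU and CPU builds to compare your own hardware.
    tps = tokens_per_second("llama3.1:8b", "Explain GPU inference in one paragraph.")
    print(f"{tps:.1f} tokens/second")
```

Note that `eval_duration` covers only the decode phase, so this measures sustained generation speed rather than prompt processing; run the measurement a few times and discard the first call, which includes model load time.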

February 21, 2026 · 8 min · Local AI Ops