Complete Guide to LLM Architectures for Ollama and Local AI in 2026
TL;DR: Modern LLMs running on Ollama use three primary architectures: decoder-only (GPT-style), encoder-decoder (T5-style), and encoder-only (BERT-style). For local deployment in 2026, decoder-only models dominate: they handle both understanding and generation with a single unified architecture, which makes them memory-efficient and straightforward to quantize.

Decoder-only models like Llama, Mistral, and Qwen use causal attention: each token attends only to previous tokens. This unidirectional flow means you can cache key-value pairs during generation, so each new token only requires attention against the cached keys and values rather than reprocessing the entire context, which keeps long conversations cheap. When you run ollama run llama3.2:3b, you load a decoder-only model optimized for streaming text generation with minimal VRAM overhead. ...
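To make the KV-caching idea concrete, here is a minimal single-head sketch in NumPy. This is an illustrative toy, not Ollama's implementation: the class name, weight initialization, and step interface are all invented for the example. The key point is that each generation step computes Q, K, V only for the newest token, appends the new K and V to a cache, and attends over the cache; causality falls out for free because the cache contains only past and current tokens.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class CausalSelfAttention:
    """Toy single-head causal self-attention with a key-value cache."""

    def __init__(self, d_model, seed=0):
        rng = np.random.default_rng(seed)
        scale = 1.0 / np.sqrt(d_model)
        self.Wq = rng.standard_normal((d_model, d_model)) * scale
        self.Wk = rng.standard_normal((d_model, d_model)) * scale
        self.Wv = rng.standard_normal((d_model, d_model)) * scale
        self.k_cache = []  # one cached key vector per processed token
        self.v_cache = []  # one cached value vector per processed token

    def step(self, x):
        # x: (d_model,) embedding of the NEWEST token only.
        # Old keys/values are reused from the cache; we compute just the new ones.
        q = x @ self.Wq
        self.k_cache.append(x @ self.Wk)
        self.v_cache.append(x @ self.Wv)
        K = np.stack(self.k_cache)  # (t, d_model)
        V = np.stack(self.v_cache)  # (t, d_model)
        # Causal by construction: the cache holds only tokens up to the current one.
        attn = softmax(q @ K.T / np.sqrt(q.shape[0]))
        return attn @ V  # (d_model,) context vector for the new token

# Generate 5 tokens, one step at a time, reusing the cache each step.
d = 16
attn = CausalSelfAttention(d)
rng = np.random.default_rng(42)
outs = [attn.step(rng.standard_normal(d)) for _ in range(5)]
```

Note that each step does O(t·d) work against the cache instead of re-running attention over the full t-token prefix from scratch, which is why cached decoding stays fast as a conversation grows.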
