TL;DR
Modern LLMs running on Ollama use three primary architectures: decoder-only (GPT-style), encoder-decoder (T5-style), and encoder-only (BERT-style). For local deployment in 2026, decoder-only models dominate because they handle both understanding and generation with a single unified architecture, making them memory-efficient and straightforward to quantize.
Decoder-only models like Llama, Mistral, and Qwen use causal attention – each token only sees previous tokens. This unidirectional flow means you can cache key-value pairs during generation, reducing compute for long conversations. When you run ollama run llama3.2:3b, you’re loading a decoder-only model optimized for streaming text generation with minimal VRAM overhead.
Encoder-decoder models split the work: the encoder processes input bidirectionally (seeing full context), while the decoder generates output autoregressively. This architecture excels at translation and summarization but requires loading two separate transformer stacks into memory. Few encoder-decoder models exist in GGUF format because the dual-stack design complicates quantization and increases memory pressure on consumer hardware.
Architecture choice directly impacts your deployment constraints. A 7B decoder-only model typically needs 4-6GB VRAM at Q4_K_M quantization, while an equivalent encoder-decoder model requires 7-9GB because you’re running two transformers. Multi-head attention (MHA) in older architectures consumes more memory than grouped-query attention (GQA) or multi-query attention (MQA) found in newer models like Mistral and Llama 3.
For local deployment, prioritize decoder-only models with GQA or MQA. These architectures cache fewer key-value pairs during generation, letting you run larger context windows on limited hardware. Set the num_gpu option (per request, or via a PARAMETER num_gpu line in a Modelfile) to control how many layers are offloaded to the GPU when mixing CPU and GPU inference – critical for running 13B+ models on 8GB VRAM cards.
Understanding these architectural tradeoffs helps you select models that actually fit your hardware rather than discovering memory limits at runtime.
Understanding Transformer Architecture Fundamentals
Modern transformer architectures form the foundation of every LLM you’ll run through Ollama, but understanding their internal structure helps you make informed deployment decisions. At the core, transformers process text through self-attention mechanisms that weigh the relevance of each token against every other token in the context window.
The attention mechanism operates through four learned projection matrices: Query, Key, Value, and Output. When you load a 7B parameter model via Ollama, roughly a third of those parameters live in these attention layers; most of the rest sit in the feed-forward networks, with a smaller share in embeddings. This distribution directly impacts your RAM requirements – a model with more key-value heads needs a larger KV cache during inference, even at the same parameter count.
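The split can be sanity-checked with arithmetic. A back-of-envelope sketch using the published Llama-7B hyperparameters (hidden size 4096, 32 layers, SwiGLU intermediate 11008, 32K vocabulary); the breakdown ignores biases and normalization weights:

```python
# Back-of-envelope parameter budget for a Llama-7B-style decoder-only
# model, ignoring biases and normalization weights.
d_model, n_layers, d_ffn, vocab = 4096, 32, 11008, 32_000

attn_per_layer = 4 * d_model * d_model   # Q, K, V, O projections
ffn_per_layer = 3 * d_model * d_ffn      # SwiGLU: gate, up, down matrices

attn_total = n_layers * attn_per_layer
ffn_total = n_layers * ffn_per_layer
embed_total = 2 * vocab * d_model        # input + output embedding tables

total = attn_total + ffn_total + embed_total
print(f"total: {total / 1e9:.2f}B parameters")     # 6.74B
print(f"attention: {attn_total / total:.0%}")      # roughly one third
print(f"feed-forward: {ffn_total / total:.0%}")    # roughly two thirds
```

The totals land within a few percent of the advertised 7B, which is why parameter-count shorthand works for capacity planning.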
Multi-head attention splits the attention computation across parallel heads – typically 32 heads for 7-8B models and 64 for 70B-class variants. Each head learns different token relationships. When you run ollama run llama3.1:8b, you’re loading a model with 32 query heads (and, thanks to grouped-query attention, only 8 key-value heads) that process your prompt in parallel.
Position Encoding Methods
Transformers need explicit position information since attention is permutation-invariant. Older architectures like GPT-2 used learned positional embeddings with fixed context lengths. Modern models available through Ollama use Rotary Position Embedding (RoPE), which enables dynamic context extension. This matters for local deployment because RoPE-based models can handle longer contexts without retraining, though attention compute scales quadratically with context length while KV cache memory grows linearly.
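RoPE encodes position by rotating consecutive pairs of query and key dimensions by a position-dependent angle, so relative offsets fall out of the dot product. A minimal illustration of the arithmetic, not a production implementation:

```python
import math

def rope_rotate(x, pos, base=10000.0):
    """Rotate consecutive pairs of vector components by position-
    dependent angles - the core of RoPE, stripped to its arithmetic."""
    d = len(x)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)   # lower pairs rotate fastest
        c, s = math.cos(theta), math.sin(theta)
        out.append(x[i] * c - x[i + 1] * s)
        out.append(x[i] * s + x[i + 1] * c)
    return out

# Position 0 applies no rotation, and rotation preserves vector length,
# which is why RoPE composes cleanly with dot-product attention.
v = [1.0, 0.0, 0.5, 0.5]
assert rope_rotate(v, 0) == v
assert abs(sum(c * c for c in rope_rotate(v, 7)) - 1.5) < 1e-9
```

Context extension tricks work by rescaling pos or base in exactly this formula, which is why RoPE models can stretch their windows without retraining.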
The feed-forward networks between attention layers expand dimensionality before compressing back down. Classic transformers use a 4x expansion (4096-dimensional hidden states expanding to 16384), while SwiGLU variants like Llama expand roughly 2.7x (4096 to 11008) across three weight matrices. Understanding this expansion helps predict GPU memory pressure when running models locally with limited VRAM.
Decoder-Only Architectures: The Local Deployment Standard
Decoder-only architectures dominate local LLM deployment because they offer the best balance between capability and resource efficiency. Models like Llama, Mistral, and Qwen use this design exclusively, making them the de facto standard for Ollama deployments.
The architecture processes tokens autoregressively, predicting one token at a time based on all previous tokens. This unidirectional attention pattern requires less memory during inference compared to encoder-decoder designs. Each layer only maintains key-value caches for past tokens, not separate encoder states.
For local deployment, this translates to predictable memory scaling. A 7B parameter decoder-only model typically needs 14GB VRAM at fp16 precision, while an equivalent encoder-decoder model requires additional memory for cross-attention mechanisms between encoder and decoder stacks.
Parameter Distribution Patterns
Decoder-only models concentrate parameters in the attention and feed-forward layers. The Llama architecture dedicates roughly two-thirds of parameters to feed-forward networks, one-third to attention mechanisms. This distribution makes them highly compatible with quantization – feed-forward layers tolerate aggressive quantization better than attention layers.
When running quantized models through Ollama, decoder-only architectures maintain coherence even at Q4_K_M quantization levels:
ollama pull llama3.2:3b-instruct-q4_K_M
ollama run llama3.2:3b-instruct-q4_K_M
The Q4_K_M format applies mixed quantization strategies, preserving attention layer precision while aggressively compressing feed-forward weights.
Deployment Constraints
Decoder-only models scale linearly with context length during inference: doubling context from 4K to 8K tokens roughly doubles KV cache memory requirements. This predictable scaling helps capacity planning for local hardware.
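That scaling is easy to estimate. A sketch using a Llama-3-8B-style configuration (32 layers, 8 KV heads under GQA, head dimension 128, fp16 cache entries) – the figures are rough estimates, not measurements:

```python
# KV cache size = 2 (keys and values) x layers x KV heads x head_dim
# x context length x bytes per element.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, elem_bytes=2):
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * elem_bytes

gqa_4k = kv_cache_bytes(32, 8, 128, 4096)
gqa_8k = kv_cache_bytes(32, 8, 128, 8192)
mha_8k = kv_cache_bytes(32, 32, 128, 8192)  # same model without GQA

print(f"GQA, 4K context: {gqa_4k / 2**30:.1f} GiB")  # 0.5 GiB
print(f"GQA, 8K context: {gqa_8k / 2**30:.1f} GiB")  # 1.0 GiB - doubles
print(f"MHA, 8K context: {mha_8k / 2**30:.1f} GiB")  # 4.0 GiB - 4x the GQA cache
```

The MHA row shows why GQA matters on consumer hardware: at identical context length, the cache is four times smaller when 32 query heads share 8 KV heads.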
Most decoder-only models in Ollama’s library support context windows between 4K and 128K tokens. Longer contexts require proportionally more VRAM but maintain the same architectural simplicity that makes local deployment practical.
Encoder-Decoder and Hybrid Architectures
While decoder-only architectures dominate the local LLM landscape, encoder-decoder and hybrid models serve specialized use cases where bidirectional context understanding matters. These architectures excel at translation, summarization, and structured text transformation tasks that benefit from encoding the entire input before generating output.
Traditional encoder-decoder models like T5 and BART use separate transformer stacks for input processing and output generation. The encoder builds bidirectional representations of the input text, while the decoder generates output autoregressively using both the encoded input and previously generated tokens. This separation creates higher memory overhead compared to decoder-only models – a T5-3B model requires roughly 50% more VRAM than a comparable decoder-only architecture during inference.
For local deployment, encoder-decoder models present quantization challenges. The encoder and decoder components respond differently to aggressive quantization, often requiring separate quantization strategies. When running these models through Ollama, expect longer cold-start times as both components load into memory.
Practical Deployment Considerations
Few encoder-decoder models appear in Ollama’s library because the format optimizes for decoder-only architectures. If you need encoder-decoder capabilities locally, consider running specialized models through llama.cpp with custom GGUF conversions or using frameworks that support the original model format.
Instruction-tuned variants like Flan-T5 layer instruction tuning on top of the encoder-decoder structure, making them effective for task-specific deployments where you need reliable structured output. These models work well for document processing pipelines, API response generation, and data extraction tasks where the input-output relationship follows predictable patterns.
Caution: Encoder-decoder models consume significantly more memory than their parameter count suggests. A 3B encoder-decoder model may require resources comparable to a 7B decoder-only model. Test memory requirements thoroughly before production deployment, especially on systems with limited VRAM.
Architecture Selection Criteria for Self-Hosted AI
When selecting an LLM architecture for local deployment, your hardware constraints and use case requirements drive the decision more than benchmark scores. Decoder-only architectures like Llama and Mistral dominate self-hosted scenarios because they excel at text generation with predictable memory footprints – a 7B parameter decoder-only model typically requires 4-6GB VRAM at 4-bit quantization.
Decoder-only models load the entire parameter set into memory once, then generate tokens autoregressively. This makes memory requirements calculable: multiply parameter count by bits per weight, divide by eight to get bytes, then add KV cache overhead. For Ollama deployments, this predictability matters when sizing hardware. A system with 16GB RAM can comfortably run quantized 13B models, while 32GB opens 30B+ territory.
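That rule of thumb can be written down directly. A back-of-envelope estimator – the KV cache and runtime overhead allowances are assumptions for planning, not measured values:

```python
# Weights = params x bits-per-weight / 8, plus KV cache and a fixed
# runtime overhead allowance (both assumptions for planning purposes).
def estimate_vram_gb(params_billions, bits_per_weight,
                     kv_cache_gb=0.5, overhead_gb=1.0):
    weights_gb = params_billions * bits_per_weight / 8
    return weights_gb + kv_cache_gb + overhead_gb

# 7B model at ~4.5 bits/weight (Q4_K_M averages somewhat above 4 bits):
print(f"Q4_K_M: {estimate_vram_gb(7, 4.5):.1f} GB")  # 5.4 GB
# The same model unquantized at fp16:
print(f"fp16:   {estimate_vram_gb(7, 16):.1f} GB")   # 15.5 GB
```

Plug in your target context length's KV cache instead of the default to see whether a long-context workload still fits your card.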
Encoder-decoder architectures like T5 or BART require loading both encoder and decoder networks, effectively doubling memory overhead for equivalent parameter counts. These shine for translation or summarization tasks but rarely justify the resource cost for general-purpose local AI.
Quantization Compatibility
Architecture choice affects quantization tolerance. Models with grouped-query attention (GQA) like Llama 3 maintain quality better under aggressive quantization than older multi-head attention designs. When running Ollama with limited VRAM, GQA models at Q4_K_M quantization often outperform larger non-GQA models at Q3_K_S.
Test quantization impact before committing to an architecture:
ollama pull llama3:8b-instruct-q4_K_M
ollama pull llama3:8b-instruct-q8_0
Run identical prompts against both, comparing output quality versus memory usage. The Q4 variant uses roughly half the VRAM while maintaining coherent responses for most tasks.
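To make the comparison quantitative, Ollama's final non-streaming /api/generate response reports eval_count (tokens generated) and eval_duration (nanoseconds), from which you can compute throughput per quantization. A small helper – the field names match Ollama's API, while the sample payload values are illustrative:

```python
# Generation throughput from an Ollama /api/generate response
# (requested with "stream": false).
def tokens_per_second(resp: dict) -> float:
    return resp["eval_count"] / (resp["eval_duration"] / 1e9)

# Illustrative metrics in the shape Ollama returns:
sample = {"eval_count": 120, "eval_duration": 3_000_000_000}
print(f"{tokens_per_second(sample):.1f} tok/s")  # 40.0 tok/s
```

Run the same prompt against both quantizations and compare the two numbers alongside VRAM usage to pick your trade-off.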
Context Window Considerations
Architectures with rotary position embeddings (RoPE) like Llama support context extension through simple scaling factors, making them ideal for document analysis workloads. Absolute position embeddings limit context windows more rigidly, requiring retraining for extension.
Parameter Distribution and Layer Analysis
Understanding how parameters distribute across model layers helps you predict memory requirements and optimize quantization strategies for local deployment. Modern LLMs concentrate parameters differently depending on their architecture, directly impacting your hardware choices.
Decoder-only models like Llama and Mistral split parameters between attention and feed-forward layers. A 7B parameter model typically dedicates roughly a third of its parameters to attention mechanisms across all layers, with most of the remainder in feed-forward networks. Despite the smaller share, attention weights are more sensitive: aggressive quantization there degrades output quality faster than quantizing feed-forward layers.
When running models through Ollama, you can inspect a model’s basic shape by examining its metadata:
ollama show llama3.2:3b
The output reveals the architecture family, parameter count, context length, embedding dimension, and quantization level. Models with grouped-query attention (GQA) reduce attention parameters and KV cache size while maintaining quality, making them ideal for systems with limited VRAM.
Feed-Forward Network Sizing
Feed-forward networks in transformer layers consume most of the remaining parameter budget. Classic architectures use an expansion ratio where the intermediate dimension is 4x the model dimension (4096 hidden expanding to 16384); SwiGLU-based models like Llama use roughly 2.7x across three projection matrices (4096 expanding to 11008), repeated across all layers.
This expansion creates opportunities for aggressive quantization. Feed-forward weights tolerate 4-bit quantization better than attention weights, allowing mixed-precision strategies where you quantize FFN layers more aggressively while preserving attention precision.
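The payoff of a mixed strategy shows up in the arithmetic. A sketch with a Llama-7B-style parameter split and hypothetical bit widths, chosen only to show the shape of the trade-off (the exact per-tensor widths Q4_K_M uses differ):

```python
# How a mixed strategy lands between uniform quantization levels.
# Parameter split approximates a 7B Llama-style model; bit widths
# are hypothetical, for illustration only.
attn_params = 2.1e9    # ~one third of the model
ffn_params = 4.3e9     # ~two thirds
other_params = 0.3e9   # embeddings and norms

def model_size_gb(attn_bits, ffn_bits, other_bits=6):
    total_bits = (attn_params * attn_bits + ffn_params * ffn_bits
                  + other_params * other_bits)
    return total_bits / 8 / 1e9

print(f"uniform 6-bit: {model_size_gb(6, 6):.2f} GB")
print(f"mixed 6/4-bit: {model_size_gb(6, 4):.2f} GB")  # attention kept at 6
print(f"uniform 4-bit: {model_size_gb(4, 4, 4):.2f} GB")
```

Because feed-forward weights dominate the count, compressing them buys most of the uniform-4-bit savings while the quality-sensitive attention weights keep their precision.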
Embedding Layer Considerations
Vocabulary embeddings claim a share of parameters that depends on vocabulary size and model scale – a few percent for a 7B model with a 32K vocabulary, but substantially more for small models with large vocabularies. Models with 128K token vocabularies (like Llama 3) require significantly more embedding parameters than those with 32K vocabularies (like Llama 2). When memory-constrained, consider models with smaller vocabularies – they load faster and consume less VRAM while maintaining reasonable performance for English-dominant tasks.
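The difference is easy to quantify. Comparing Llama 2's 32K vocabulary with Llama 3's 128K vocabulary at a 4096 hidden dimension:

```python
# Embedding parameters = vocabulary size x hidden dimension, counted
# once each for the input and output tables when they are untied.
hidden = 4096
for name, vocab in [("32K vocab (Llama 2)", 32_000),
                    ("128K vocab (Llama 3)", 128_256)]:
    params = vocab * hidden
    print(f"{name}: {params / 1e6:.0f}M parameters per table")  # 131M, then 525M
```

A fourfold vocabulary means roughly 400M extra parameters per table at this hidden size, which is a meaningful slice of a 3B model but noise for a 70B one.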
Caution: Always validate model architecture details from official model cards before making deployment decisions based on parameter distribution assumptions.
Installation and Configuration Steps
Install Ollama on your Linux system with the official installer:
curl -fsSL https://ollama.com/install.sh | sh
Verify the installation by checking the service status:
systemctl status ollama
The service listens on port 11434 by default. Test connectivity:
curl http://localhost:11434/api/tags
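The same endpoint is convenient to consume from scripts. A small helper that parses the /api/tags JSON – the field names follow Ollama's API, while the sample payload and its size value are illustrative:

```python
import json

# List installed models and their on-disk sizes from an /api/tags
# response body (the sample payload below is illustrative).
def list_models(payload: dict):
    return [(m["name"], m["size"] / 1e9) for m in payload["models"]]

sample = json.loads(
    '{"models": [{"name": "llama3.2:3b", "size": 2019393189}]}'
)
for name, gb in list_models(sample):
    print(f"{name}: {gb:.1f} GB")  # llama3.2:3b: 2.0 GB
```

In practice you would feed this the body of `curl http://localhost:11434/api/tags` to audit disk usage across pulled models.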
Architecture-Specific Model Selection
Different architectures require different resource allocations. Decoder-only models like Llama and Mistral work well for most local deployments. For encoder-decoder architectures, verify your system has sufficient VRAM before pulling models.
Pull a decoder-only model optimized for local inference:
ollama pull llama3.2:3b
For systems with limited GPU memory, use quantized variants. The architecture determines quantization compatibility – decoder-only models generally quantize better than encoder-decoder variants.
GPU Configuration for Architecture Types
Configure GPU allocation based on your model’s architecture. The number of transformer layers offloaded to the GPU is set per model with the num_gpu option – for example, /set parameter num_gpu 20 inside an ollama run session, or a PARAMETER num_gpu line in a Modelfile. Service-level settings such as the bind address belong in a systemd environment override, since shell exports do not reach the ollama service:
systemctl edit ollama
Add under the [Service] section:
Environment="OLLAMA_HOST=0.0.0.0:11434"
Then restart the service:
systemctl restart ollama
Store models in a custom directory if your root partition has limited space by adding another override line:
Environment="OLLAMA_MODELS=/mnt/storage/ollama-models"
Testing Architecture Performance
Run inference tests to validate your architecture choice meets latency requirements:
time ollama run llama3.2:3b "Explain transformer attention mechanisms"
Monitor memory usage during inference to ensure your selected architecture fits within system constraints. Decoder-only architectures typically show more predictable memory patterns than encoder-decoder models.
Caution: Always validate model outputs before using them in production systems. Architecture selection affects output characteristics – decoder-only models excel at generation tasks while encoder-decoder architectures handle translation and summarization differently.
