# GPU Comparison Cheat Sheet for Local AI (2026)

Source: localaiops.com

## Consumer GPUs for Local LLM Inference

| GPU | VRAM | Used Price | $/GB VRAM | Llama 3.1 8B (tok/s) | Max Model (Q4) | Power Draw |
|-----|------|-----------|-----------|----------------------|-----------------|------------|
| RTX 3060 12GB | 12 GB | $200-250 | $17-21 | ~30 | 13B | 170W |
| RTX 3070 Ti | 8 GB | $250-350 | $31-44 | ~35 | 7B-8B | 290W |
| RTX 3080 10GB | 10 GB | $350-450 | $35-45 | ~40 | 7B-8B | 320W |
| **RTX 3090 24GB** | **24 GB** | **$700-900** | **$29-38** | **~50** | **Mixtral 8x7B** | **350W** |
| RTX 4070 Ti 12GB | 12 GB | $600-700 | $50-58 | ~45 | 13B | 285W |
| RTX 4080 16GB | 16 GB | $800-950 | $50-59 | ~55 | 13B | 320W |
| RTX 4090 24GB | 24 GB | $1,600-1,900 | $67-79 | ~75 | Mixtral 8x7B | 450W |
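
The $/GB column is the used-price range divided by VRAM, rounded to whole dollars. A minimal sketch that reproduces it for three rows of the table; the function name `price_per_gb` and the `GPUS` list are illustrative, not part of the source:

```python
def price_per_gb(low_usd: float, high_usd: float, vram_gb: float) -> tuple[int, int]:
    """Return the (low, high) used-price range divided by VRAM, in whole dollars."""
    return round(low_usd / vram_gb), round(high_usd / vram_gb)


# Rows copied from the table above: (name, VRAM in GB, low price, high price).
GPUS = [
    ("RTX 3060 12GB", 12, 200, 250),
    ("RTX 3090 24GB", 24, 700, 900),
    ("RTX 4090 24GB", 24, 1600, 1900),
]

for name, vram, low, high in GPUS:
    lo, hi = price_per_gb(low, high, vram)
    print(f"{name}: ${lo}-{hi} per GB")
```

Swap in current local prices to re-rank the cards, since used prices move faster than any table.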

## VRAM Requirements by Model Size (Q4_K_M Quantization)

| Model | Parameters | VRAM Needed | Fits on 8GB? | Fits on 12GB? | Fits on 24GB? |
|-------|-----------|-------------|-------------|---------------|---------------|
| Phi-3 Mini | 3.8B | ~2.5 GB | YES | YES | YES |
| Llama 3.1 | 8B | ~5 GB | YES | YES | YES |
| Mistral | 7B | ~5 GB | YES | YES | YES |
| CodeLlama | 13B | ~8 GB | Tight | YES | YES |
| Llama 3.1 | 70B | ~40 GB | NO | NO | NO (need 2x 24GB) |
| Mixtral | 8x7B | ~26 GB | NO | NO | Tight (exceeds 24 GB; needs partial CPU offload or a smaller quant) |
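
The dense-model rows above can be approximated from the parameter count alone. The sketch below assumes about 4.85 bits per weight for Q4_K_M and a flat 1 GB allowance for KV cache and runtime buffers; both figures are assumptions for illustration, not values from the source, and real usage grows with context length.

```python
def estimate_weights_gb(params_billion: float, bits_per_weight: float = 4.85) -> float:
    """Approximate size of the quantized weights in GB (1 GB = 1e9 bytes).

    bits_per_weight=4.85 is an assumed average for Q4_K_M.
    """
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, vram_gb: float, overhead_gb: float = 1.0) -> bool:
    """True if weights plus an assumed fixed overhead fit in the given VRAM."""
    return estimate_weights_gb(params_billion) + overhead_gb <= vram_gb


for params in (3.8, 8, 13, 70):
    size = estimate_weights_gb(params)
    print(f"{params}B: ~{size:.1f} GB weights, fits 12 GB: {fits(params, 12)}")
```

Under these assumptions a 13B model lands just under 8 GB of weights, which is why it is marked "Tight" on an 8 GB card.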

## Quick Decision Guide

- **Budget ($200-300):** RTX 3060 12GB -- runs 7B-13B models, best entry point
- **Best value ($700-900):** RTX 3090 24GB -- lowest $/GB among the 24GB cards, runs almost everything
- **Maximum speed ($1,600+):** RTX 4090 24GB -- fastest card in this comparison, same VRAM as 3090
- **70B models:** Need 2x RTX 3090 ($1,400-1,800 total) or 1x A100 80GB ($8,000+)
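
The guide above amounts to "find the cheapest card that has enough VRAM and fits the budget". A rough sketch of that lookup, using the low end of each used-price range from the first table; `cheapest_gpu` and the `GPUS` list are hypothetical names introduced for this example:

```python
from typing import Optional

# (name, VRAM in GB, low end of used price in USD), copied from the first table.
GPUS = [
    ("RTX 3060 12GB", 12, 200),
    ("RTX 3070 Ti", 8, 250),
    ("RTX 3080 10GB", 10, 350),
    ("RTX 3090 24GB", 24, 700),
    ("RTX 4070 Ti 12GB", 12, 600),
    ("RTX 4080 16GB", 16, 800),
    ("RTX 4090 24GB", 24, 1600),
]


def cheapest_gpu(min_vram_gb: float, budget_usd: float) -> Optional[str]:
    """Return the cheapest single card meeting the VRAM and budget limits, or None."""
    candidates = [
        (price, name)
        for name, vram, price in GPUS
        if vram >= min_vram_gb and price <= budget_usd
    ]
    return min(candidates)[1] if candidates else None


print(cheapest_gpu(min_vram_gb=8, budget_usd=300))
print(cheapest_gpu(min_vram_gb=24, budget_usd=1000))
```

This only considers single cards, so a 40 GB requirement returns `None`; multi-GPU setups such as 2x RTX 3090 would need a separate rule.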

## Monthly Electricity Cost Estimates (US Average $0.16/kWh)

| GPU | Idle | Light Use (4h/day) | Heavy Use (12h/day) |
|-----|------|-------------------|---------------------|
| RTX 3060 | ~$3 | ~$13 | ~$25 |
| RTX 3090 | ~$5 | ~$22 | ~$50 |
| RTX 4090 | ~$6 | ~$28 | ~$65 |
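
For a different rate or usage pattern, the underlying arithmetic is watts / 1000 × hours per day × days × price per kWh. The sketch below applies it to the GPU's rated board power only, with a 30-day month assumed. GPU-only results come out lower than the table's estimates, which presumably also cover idle hours and the rest of the system; that reading is an assumption, not something the source states.

```python
def monthly_cost_usd(
    watts: float,
    hours_per_day: float,
    usd_per_kwh: float = 0.16,
    days: int = 30,
) -> float:
    """Electricity cost for a constant load of `watts` run `hours_per_day` each day."""
    kwh = watts / 1000 * hours_per_day * days
    return kwh * usd_per_kwh


# Rated power draw taken from the first table; GPU only, no idle or system draw.
print(f"RTX 3060, 4h/day:  ${monthly_cost_usd(170, 4):.2f}")
print(f"RTX 3090, 12h/day: ${monthly_cost_usd(350, 12):.2f}")
```

Passing measured wall power for the whole machine instead of the card's rated draw gives a figure closer to an actual bill.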

## Key Takeaways

1. VRAM is the bottleneck -- not compute speed. Buy the most VRAM you can afford.
2. Used RTX 3090s offer the best price-per-GB among 24GB cards in 2026. The RTX 3060 12GB is cheaper per GB ($17-21) but has half the VRAM.
3. Q4_K_M quantization gives ~90% of full-precision quality at roughly 25-30% of the FP16 VRAM.
4. CPU inference works but is 10-50x slower than GPU. Fine for testing, not production.
