Multi-GPU Ollama Setup: Running 70B Models on Dual GPUs

TL;DR

A single 24GB GPU cannot run a 70B-parameter LLM entirely in VRAM: the model requires approximately 40GB at Q4 quantization. Two GPUs solve this by splitting the model across both cards. This guide covers the hardware, configuration, and performance expectations for running 70B models on dual RTX 3090s with Ollama.

Key numbers for dual RTX 3090 (48GB total):

Model            Quantization  VRAM Required  Fits?  Generation Speed
Llama 3.1 70B    Q4_K_M        ~40GB          Yes    ~12 tok/s
Llama 3.1 70B    Q5_K_M        ~48GB          Tight  ~10 tok/s
Llama 3.1 70B    Q6_K          ~55GB          No     Needs 3 GPUs
Mixtral 8x22B    Q4_K_M        ~44GB          Yes    ~10 tok/s
Command R+ 104B  Q3_K_M        ~45GB          Tight  ~7 tok/s

Hardware cost:

Component                         Price
2x RTX 3090 (used)                $1,600
CPU, motherboard, RAM, PSU, case  $800-1,000
Total                             $2,400-2,600

When You Need Multi-GPU

A single GPU limits you to models that fit in its VRAM. For a 24GB card:

  • Fits easily: 7B, 8B, 13B models at Q4-Q8 quantization
  • Fits tight: 34B models at Q4, Mixtral 8x7B at Q4
  • Does not fit: 70B models at any reasonable quantization

The math is straightforward. A model’s VRAM requirement is approximately:

VRAM (GB) ≈ Parameters (B) × Bits per weight / 8

70B at Q4 (4.5 bits avg): 70 × 4.5 / 8 ≈ 39.4 GB
70B at Q5 (5.5 bits avg): 70 × 5.5 / 8 ≈ 48.1 GB
70B at FP16 (16 bits):    70 × 16 / 8  = 140 GB

Add 1-2GB for KV cache and runtime overhead, and a 70B Q4 model needs ~41GB. Two 24GB GPUs provide 48GB total, which fits comfortably.
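
The estimate above is easy to script for other model sizes. The following Python sketch applies the same formula; the function names and the 2GB default overhead (the upper end of the 1-2GB range quoted above) are illustrative assumptions, not values reported by Ollama.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GB."""
    return params_billion * bits_per_weight / 8


def fits_in_vram(params_billion: float, bits_per_weight: float,
                 total_vram_gb: float, overhead_gb: float = 2.0) -> bool:
    """True if weights plus the assumed KV cache/runtime overhead fit in VRAM."""
    return weights_gb(params_billion, bits_per_weight) + overhead_gb <= total_vram_gb


if __name__ == "__main__":
    # Average bits per weight as used in the worked figures above.
    for label, bits in [("Q4", 4.5), ("Q5", 5.5), ("FP16", 16)]:
        size = weights_gb(70, bits)
        print(f"70B {label}: {size:.1f} GB, fits in 48GB: {fits_in_vram(70, bits, 48)}")
```

Because the overhead is a flat assumption, treat a result close to the limit as "tight" rather than as a guarantee.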

Why 70B Models Matter

70B parameter models represent a significant quality jump over 7B-13B models:

  • Stronger reasoning and multi-step problem solving
  • Better code generation and debugging
  • More coherent long-form writing
  • Closer to GPT-4 quality for many tasks
  • Better instruction following and fewer hallucinations

If your use case demands higher quality than 8B-13B models provide, multi-GPU 70B is the local path to get there.

Hardware Requirements

GPUs

Recommended: 2x RTX 3090 (24GB each)

The RTX 3090 is the best value for multi-GPU AI builds:

  • 24GB VRAM per card = 48GB total
  • ~$800 each used = $1,600 total
  • Same cost as one RTX 4090 but with 2x the VRAM
  • Proven reliable for 24/7 AI workloads

Alternative: 2x RTX 4090 (24GB each)

  • Faster per-card performance
  • $3,200-3,600 total
  • 48GB combined VRAM (same as dual 3090)
  • Only worthwhile if you need maximum speed

Budget option: RTX 3090 + RTX 3060 12GB

  • 36GB total VRAM
  • Can run 70B at Q3 quantization (tight)
  • Performance limited by the slower 3060
  • Not ideal but workable for experimentation

NVLink: The RTX 3090 has an NVLink connector that provides direct GPU-to-GPU communication at 112.5 GB/s (total bidirectional). However, NVLink bridges for the 3090 are scarce and expensive ($60-100 when available).

PCIe: Without NVLink, GPUs communicate through the PCIe bus via system memory. PCIe 4.0 x16 provides 32 GB/s per direction (64 GB/s bidirectional).

Does NVLink matter for inference? Minimally. During autoregressive token generation, inter-GPU communication is a small fraction of total compute time. Benchmarks show less than 5% speed difference between NVLink and PCIe for LLM inference. NVLink matters more for training where gradient synchronization requires heavy GPU-to-GPU data transfer.

Recommendation: Do not spend extra to get NVLink for an inference-only setup. PCIe is sufficient.

Motherboard

You need a motherboard with two physical PCIe x16 slots that both run at x8 or higher electrical speed. Common configurations:

Chipset          PCIe Slots    Configuration                 Recommendation
AMD B550         1x16 + 1x4    Second slot too slow          Avoid
AMD X570         2x16          x8/x8 with both slots filled  Good
AMD X670E        2x16          PCIe 5.0                      Overkill but works
Intel Z690/Z790  1x16 + 1x4    Check specific model          Varies
Intel X299       Multiple x16  Good PCIe lane count          Good, older platform

The critical factor is PCIe lane count. On X570 boards, the Ryzen CPU provides 24 PCIe 4.0 lanes, 16 of which go to the GPU slots and can be split x8/x8 across two cards. Running a GPU at PCIe 4.0 x8 instead of x16 has negligible impact on AI inference performance (under 2% difference).

Power Supply

Two RTX 3090s require serious power delivery:

2x RTX 3090:        700W peak (350W each)
CPU (5900X):         105W
RAM, storage, fans:  50W
Total peak:          ~855W

Minimum: 1000W PSU with two sets of dual 8-pin PCIe power cables.

Recommended: 1200W PSU for headroom and efficiency. PSUs operate most efficiently at 50-80% load, so a 1200W unit running at 700-850W is in the sweet spot.
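
The same budget can be rechecked for other component choices. This Python sketch uses the peak figures from the power budget above; the helper name and the dictionary layout are illustrative.

```python
def psu_load_fraction(component_watts: dict[str, float], psu_watts: float) -> float:
    """Fraction of PSU capacity used when every component is at peak draw."""
    return sum(component_watts.values()) / psu_watts


# Peak figures from the power budget above.
build = {
    "2x RTX 3090": 2 * 350,
    "CPU (5900X)": 105,
    "RAM, storage, fans": 50,
}

if __name__ == "__main__":
    for psu in (1000, 1200):
        load = psu_load_fraction(build, psu)
        print(f"{psu}W PSU: {sum(build.values())}W peak = {load:.0%} load")
```

A load fraction above roughly 0.8 at peak is the signal to move up a PSU size.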

Cable requirements per 3090:

  • 2x 8-pin PCIe power connectors (some models need 3x 8-pin)
  • Use separate cables from the PSU, not daisy-chained splitters
  • Daisy chains can cause voltage drops and instability under load

Cooling

Two GPUs in adjacent slots create significant heat. Solutions:

  • Leave one slot of space between the GPUs if your motherboard layout allows it.
  • Open-frame or open-case builds provide the best cooling for multi-GPU setups.
  • Blower-style coolers exhaust heat out the back of the case. Better for multi-GPU than open-air coolers that dump heat inside the case.
  • Additional case fans: At minimum, two 140mm intake fans and two 140mm exhaust fans.
  • Temperature target: Keep both GPUs under 85C under sustained load. Throttling begins at 83-90C depending on the card model.

Monitor temperatures during initial testing:

# NVIDIA
watch -n 1 nvidia-smi

# Check both GPUs are being used and temperature is reasonable
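
For unattended checks, the same readings can be pulled in machine-readable form. The sketch below parses the CSV output of nvidia-smi --query-gpu; the 85C limit mirrors the target above, and the function names are illustrative.

```python
import subprocess

QUERY = ["nvidia-smi",
         "--query-gpu=index,temperature.gpu,memory.used",
         "--format=csv,noheader,nounits"]


def parse_gpu_csv(text: str) -> list[dict]:
    """Parse 'index, temperature, memory used (MiB)' rows into dictionaries."""
    rows = []
    for line in text.strip().splitlines():
        index, temp, mem = (field.strip() for field in line.split(","))
        rows.append({"index": int(index), "temp_c": int(temp), "mem_mib": int(mem)})
    return rows


def hot_gpus(rows: list[dict], limit_c: int = 85) -> list[int]:
    """Return the indices of GPUs at or above the temperature limit."""
    return [row["index"] for row in rows if row["temp_c"] >= limit_c]


def read_gpus() -> list[dict]:
    """Query the local driver; requires nvidia-smi on PATH."""
    output = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_csv(output)
```

On the build machine, hot_gpus(read_gpus()) returns an empty list while both cards stay under the limit.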

RAM

System RAM matters for multi-GPU setups:

  • Minimum: 64GB DDR4 – enough for OS, Ollama, and KV cache overflow
  • Recommended: 128GB DDR4 – provides buffer for large context windows and concurrent operations
  • Model loading: Ollama loads the model file from disk into system RAM first, then transfers to GPU VRAM. Large models (40GB+) need enough free RAM for this transfer.

Configuring Ollama for Multi-GPU

Automatic GPU Detection

Ollama automatically detects all NVIDIA GPUs and distributes model layers across them. For most cases, no configuration is needed:

# Pull a 70B model
ollama pull llama3.1:70b

# Run it — Ollama auto-splits across available GPUs
ollama run llama3.1:70b

Verify both GPUs are being used:

nvidia-smi

Both GPUs should show VRAM usage. If only one GPU is used, check that both are visible:

nvidia-smi -L
# Should list:
# GPU 0: NVIDIA GeForce RTX 3090 (UUID: ...)
# GPU 1: NVIDIA GeForce RTX 3090 (UUID: ...)

Controlling GPU Assignment

To specify which GPUs Ollama should use:

# Use only GPU 0
CUDA_VISIBLE_DEVICES=0 ollama serve

# Use only GPU 1
CUDA_VISIBLE_DEVICES=1 ollama serve

# Use both (default behavior)
CUDA_VISIBLE_DEVICES=0,1 ollama serve

For systemd service configuration:

sudo systemctl edit ollama
# In the override file that opens, add:
[Service]
Environment="CUDA_VISIBLE_DEVICES=0,1"
# Save, then apply with: sudo systemctl restart ollama

Layer Distribution

Ollama distributes model layers proportionally based on available VRAM on each GPU. With two identical 3090s, layers split roughly 50/50.

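With mismatched GPUs (e.g., 3090 + 3060), Ollama assigns more layers to the GPU with more free VRAM. You can verify the distribution in the Ollama server's debug log. The variable must be set on the server process, not on the ollama run client, so stop any running instance first:

OLLAMA_DEBUG=1 ollama serve

Then load the model from a second terminal and look in the server output for lines showing layer assignment, similar to: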

llm_load_tensors: offloading 40 layers to GPU 0
llm_load_tensors: offloading 40 layers to GPU 1
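
Proportional assignment of this kind can be sketched in a few lines of Python. This illustrates the idea only and is not Ollama's actual scheduler; the function name and the largest-remainder rounding are assumptions made for the example.

```python
def split_layers(n_layers: int, free_vram_gb: list[float]) -> list[int]:
    """Split n_layers across GPUs in proportion to free VRAM.

    Uses largest-remainder rounding so the counts always sum to n_layers.
    """
    total = sum(free_vram_gb)
    exact = [n_layers * v / total for v in free_vram_gb]
    counts = [int(x) for x in exact]
    # Hand leftover layers to the GPUs with the largest fractional parts.
    leftover = n_layers - sum(counts)
    by_fraction = sorted(range(len(exact)),
                         key=lambda i: exact[i] - counts[i], reverse=True)
    for i in by_fraction[:leftover]:
        counts[i] += 1
    return counts


if __name__ == "__main__":
    print(split_layers(80, [24, 24]))  # two identical 3090s
    print(split_layers(80, [24, 12]))  # 3090 + 3060 12GB
```

In practice the free VRAM on each card differs slightly from its nominal size, so real splits will not match these counts exactly.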

Running Multiple Models Simultaneously

With 48GB total VRAM, you can run different configurations:

Option 1: One 70B model across both GPUs

ollama run llama3.1:70b
# Uses ~40GB across both GPUs
# ~8GB free for KV cache

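Option 2: Two separate models, one per GPU

# CUDA_VISIBLE_DEVICES only affects the server, so run one server per GPU, each in its own terminal
CUDA_VISIBLE_DEVICES=0 OLLAMA_HOST=127.0.0.1:11434 ollama serve   # GPU 0
CUDA_VISIBLE_DEVICES=1 OLLAMA_HOST=127.0.0.1:11435 ollama serve   # GPU 1
# Then point each client at its server
OLLAMA_HOST=127.0.0.1:11434 ollama run llama3.1:8b     # GPU 0, ~5GB
OLLAMA_HOST=127.0.0.1:11435 ollama run codellama:13b   # GPU 1, ~8GB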

Option 2 gives you better throughput for serving multiple users with different model requirements.

llama.cpp Layer Splitting

If you use llama.cpp directly (instead of through Ollama), you have finer control over layer distribution.

Building llama.cpp with Multi-GPU Support

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)

Running with Explicit Layer Splitting

The -ngl flag controls how many layers go to GPU. Use --tensor-split to control distribution between GPUs:

# Split evenly between two GPUs
./build/bin/llama-server \
  -m /path/to/llama-3.1-70b-Q4_K_M.gguf \
  -ngl 99 \
  --tensor-split 0.5,0.5 \
  --host 0.0.0.0 \
  --port 8080

# Give GPU 0 60% of layers, GPU 1 40%
./build/bin/llama-server \
  -m /path/to/llama-3.1-70b-Q4_K_M.gguf \
  -ngl 99 \
  --tensor-split 0.6,0.4 \
  --host 0.0.0.0 \
  --port 8080

The --tensor-split values are proportional. 0.5,0.5 splits evenly. 0.7,0.3 puts 70% of layers on GPU 0. Adjust based on available VRAM if your GPUs have different amounts.

Monitoring Layer Assignment

llama.cpp logs layer distribution at startup:

llm_load_tensors:   CUDA0 buffer size = 19832.81 MiB
llm_load_tensors:   CUDA1 buffer size = 19832.81 MiB
llm_load_tensors:    CPU buffer size   =   256.00 MiB

If you see a large CPU buffer size, some layers are running on CPU, which will significantly slow inference. Reduce the model size (use a smaller quantization) or add more GPU VRAM.
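
Checking these lines can be automated. The Python sketch below extracts the buffer sizes from log text in the format shown above and reports how much of the model sits on the CPU; the log format may differ between llama.cpp versions, and the function names are illustrative.

```python
import re

# Matches lines such as "llm_load_tensors:   CUDA0 buffer size = 19832.81 MiB".
BUFFER_LINE = re.compile(r"llm_load_tensors:\s+(\w+) buffer size\s*=\s*([\d.]+) MiB")


def buffer_sizes(log_text: str) -> dict[str, float]:
    """Map each backend name (CUDA0, CUDA1, CPU, ...) to its buffer size in MiB."""
    return {name: float(size) for name, size in BUFFER_LINE.findall(log_text)}


def cpu_share(sizes: dict[str, float]) -> float:
    """Fraction of all buffer memory that is held on the CPU."""
    total = sum(sizes.values())
    return sizes.get("CPU", 0.0) / total if total else 0.0
```

For the log excerpt above, cpu_share comes out below 1%, which is the small CPU buffer the fully offloaded example shows. A share of several percent or more suggests that whole layers have spilled to system RAM.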

Optimal Parameters for 70B on Dual 3090

./build/bin/llama-server \
  -m llama-3.1-70b-Q4_K_M.gguf \
  -ngl 99 \
  --tensor-split 0.5,0.5 \
  -c 4096 \
  --threads 8 \
  --host 0.0.0.0 \
  --port 8080

Parameter notes:

  • -ngl 99: Offload all layers to GPU (99 exceeds the actual layer count, so all layers go to GPU).
  • -c 4096: Context window size. Increase it for longer conversations, but the KV cache grows linearly with context, so each doubling of -c doubles the VRAM the cache needs.
  • --threads 8: CPU threads for non-GPU operations. Match to your physical core count.
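
To see how quickly the KV cache grows with -c, its size can be estimated from the model architecture. This Python sketch assumes Llama 3.1 70B's published configuration (80 layers, 8 KV heads, head dimension 128) and an FP16 cache; treat the results as approximations, since runtime buffers come on top.

```python
def kv_cache_gib(context_tokens: int, n_layers: int = 80, n_kv_heads: int = 8,
                 head_dim: int = 128, bytes_per_value: int = 2) -> float:
    """Approximate KV cache size in GiB for one sequence.

    Each token stores one key and one value vector per layer and KV head.
    """
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return context_tokens * per_token_bytes / 2**30


if __name__ == "__main__":
    for ctx in (4096, 8192, 16384, 32768):
        print(f"-c {ctx}: ~{kv_cache_gib(ctx):.2f} GiB")
```

Under these assumptions the cache costs 320 KiB per token, so the headroom left after loading the weights determines how far the context can be raised.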

Performance Scaling: Is 2x GPU = 2x Speed?

No. Multi-GPU inference introduces overhead that reduces scaling efficiency.

Single GPU vs Dual GPU (RTX 3090, models that fit on one card)

Model               1x 3090   2x 3090   Scaling
Llama 3.1 8B (Q4)   50 tok/s  55 tok/s  1.1x
CodeLlama 13B (Q4)  25 tok/s  28 tok/s  1.12x

For models that fit on a single GPU, adding a second GPU provides minimal benefit. The inter-GPU communication overhead nearly cancels out the additional compute. Keep small models on a single GPU.

Models That Require Two GPUs

Model                 2x 3090   Expected Max  Efficiency
Llama 3.1 70B (Q4)    12 tok/s  ~16 tok/s     ~75%
Mixtral 8x22B (Q4)    10 tok/s  ~14 tok/s     ~71%
Command R+ 104B (Q3)  7 tok/s   ~10 tok/s     ~70%

Scaling efficiency for models that require splitting is typically 70-80%. The overhead comes from:

  1. PCIe transfer latency: Each token generation requires data transfer between GPUs.
  2. Synchronization: GPUs must wait for each other at layer boundaries.
  3. Load imbalance: Even with 50/50 splits, some layers are computationally heavier.
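
The efficiency column is simply measured throughput divided by the expected maximum. A small Python sketch using the figures from the table above (the helper name is illustrative):

```python
def scaling_efficiency(measured_tok_s: float, expected_max_tok_s: float) -> float:
    """Measured throughput as a fraction of the expected maximum."""
    return measured_tok_s / expected_max_tok_s


# (measured, expected max) in tok/s, taken from the table above.
results = {
    "Llama 3.1 70B (Q4)": (12, 16),
    "Mixtral 8x22B (Q4)": (10, 14),
    "Command R+ 104B (Q3)": (7, 10),
}

if __name__ == "__main__":
    for model, (measured, expected) in results.items():
        print(f"{model}: {scaling_efficiency(measured, expected):.0%}")
```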

Prompt Processing Scales Better

Prompt evaluation (processing the input text) scales better than token generation because it is more compute-bound:

Llama 3.1 70B prompt eval:
  1x 3090 (if it fit): ~110-120 tok/s (theoretical)
  2x 3090:             ~200 tok/s (actual)
  Scaling:             ~1.7-1.8x

Prompt processing involves batch matrix multiplications that parallelize well across GPUs.

Step-by-Step: Dual RTX 3090 Build

1. Hardware Assembly

Install both GPUs with at least one slot of space between them if possible. Connect separate power cables to each GPU (do not daisy-chain).

2. Install the OS

Debian 12 or Ubuntu 22.04/24.04 are recommended. Install with minimal packages:

sudo apt update && sudo apt upgrade -y
sudo apt install -y build-essential curl wget git

3. Install NVIDIA Drivers

# Install the NVIDIA driver (Ubuntu package name shown; Debian names the package differently)
sudo apt install -y nvidia-driver-550

# Reboot
sudo reboot

After reboot, verify both GPUs are detected:

nvidia-smi

Expected output shows two GPUs with their VRAM:

+-------------------------+
| NVIDIA-SMI 550.x.x     |
+-------------------------+
| GPU  Name        ...    |
|  0   RTX 3090    24GB   |
|  1   RTX 3090    24GB   |
+-------------------------+

4. Install Ollama

curl -fsSL https://ollama.ai/install.sh | sh

5. Pull and Run a 70B Model

# Pull the model (this downloads ~40GB)
ollama pull llama3.1:70b

# Run it
ollama run llama3.1:70b "Explain the difference between TCP and UDP in detail."

While running, verify GPU utilization:

# In another terminal
nvidia-smi

Both GPUs should show VRAM usage (~20GB each) and non-zero GPU utilization.

6. Test Under Load

Run a longer generation to measure sustained performance:

ollama run llama3.1:70b "Write a comprehensive guide to setting up a PostgreSQL database cluster with replication."

Monitor temperatures and power draw during generation:

watch -n 1 nvidia-smi

Target: both GPUs under 85C, stable power draw, no throttling messages.
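
For a number rather than an impression, Ollama's HTTP API reports token counts and timings with each response. The Python sketch below assumes the default endpoint at http://localhost:11434 and uses the eval_count and eval_duration (nanoseconds) fields of a non-streaming /api/generate response; the function names are illustrative.

```python
import json
import urllib.request


def tokens_per_second(response: dict) -> float:
    """Generation speed from an /api/generate response (durations are in ns)."""
    return response["eval_count"] / response["eval_duration"] * 1e9


def generate(model: str, prompt: str,
             url: str = "http://localhost:11434/api/generate") -> dict:
    """Send one non-streaming generation request and return the parsed JSON."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request) as reply:
        return json.load(reply)
```

Calling tokens_per_second(generate("llama3.1:70b", "...")) a few times with a long prompt, while nvidia-smi runs in another terminal, gives a sustained figure to compare against the tables earlier in this guide.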

7. Configure for Production

Set up Ollama as a systemd service with optimal settings:

sudo systemctl edit ollama
# In the override file that opens, add:
[Service]
Environment="OLLAMA_HOST=0.0.0.0"
Environment="OLLAMA_NUM_PARALLEL=2"
Environment="CUDA_VISIBLE_DEVICES=0,1"

Restart:

sudo systemctl daemon-reload
sudo systemctl restart ollama

OLLAMA_NUM_PARALLEL=2 allows two concurrent requests. For a 70B model on 48GB, two parallel requests is the practical limit before VRAM becomes constrained by KV cache growth.

Troubleshooting

Only One GPU Is Being Used

Check CUDA_VISIBLE_DEVICES is not set to a single GPU:

env | grep CUDA

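Verify both GPUs are visible to Ollama. Debug logging is a server-side setting, so stop the running instance and start the server with it enabled:

OLLAMA_DEBUG=1 ollama serve

Load the model from a second terminal, then look for lines mentioning both GPU 0 and GPU 1 in the server's debug output.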

Model Loads But Runs Extremely Slowly

Some layers may be on CPU. Check with nvidia-smi: if total VRAM usage across both GPUs is significantly less than the model size, layers are spilling to CPU. The ollama ps command also reports the CPU/GPU split for each loaded model.

Fix: use a smaller quantization that fits entirely in GPU VRAM.

# Switch from Q5 to Q4 to save ~8GB
ollama run llama3.1:70b-instruct-q4_K_M

System Runs Out of RAM During Model Loading

Ollama needs system RAM to load the model file before transferring to GPU. A 40GB model needs approximately 40GB of free system RAM during loading.

Fix: ensure you have at least 64GB system RAM. Close other applications during model loading. Once loaded, the system RAM is freed.

GPU Temperatures Too High

With two GPUs each putting out 250W or more of heat, check temperatures regularly:

# Check temperatures
nvidia-smi --query-gpu=temperature.gpu --format=csv

If temperatures exceed 85C:

  1. Increase case fan speeds
  2. Remove the side panel temporarily to test if airflow is the issue
  3. Consider a GPU support bracket if the card is sagging (sag can reduce heatsink contact)
  4. In extreme cases, repaste the thermal compound

Bottom Line

Dual RTX 3090s are the most cost-effective way to run 70B parameter models locally. At approximately $2,500 for a complete build, you get access to near-GPT-4 quality models running entirely on your hardware with no API costs and no data leaving your network.

The setup is straightforward: Ollama handles multi-GPU automatically, and performance is practical for interactive use at roughly 10-12 tokens per second for a 70B model at Q4-Q5. Expect roughly 75% scaling efficiency compared to theoretical dual-GPU performance. For models that fit on a single GPU, keep them there: multi-GPU overhead is not worth it for small models.


FAQ

Do I need NVLink for multi-GPU inference?

No. Consumer GPUs communicate over PCIe for multi-GPU inference, and this works well for LLM workloads. NVLink provides higher bandwidth (112.5 GB/s vs 32-64 GB/s for PCIe), which matters for training but has minimal impact on inference. The RTX 3090 has an NVLink connector, but the performance difference for inference is under 5% compared to PCIe. The RTX 4090 dropped NVLink entirely.

Is dual GPU performance exactly 2x a single GPU?

No. Dual GPU inference typically reaches 70-80% of the theoretical maximum. Overhead comes from inter-GPU communication over PCIe, synchronization between GPUs, and the fact that some operations cannot be parallelized. For a 70B model at Q4 split across two RTX 3090s, expect roughly 12 tok/s against an expected maximum of about 16 tok/s.

Can I use two different GPU models together?

Yes, with caveats. Ollama and llama.cpp support splitting layers across different GPUs. The slower GPU becomes the bottleneck for overall performance. Mixing a 3090 (24GB) with a 3060 (12GB) gives you 36GB total VRAM but performance will be limited by the 3060’s speed. Ideally, use matching GPUs.

How much power does a dual GPU system draw?

Two RTX 3090s under AI inference load draw approximately 500-600W combined (250-300W each). Add 100-150W for the rest of the system (CPU, RAM, storage, fans), and total system draw is 600-750W. A 1200W PSU provides adequate headroom. Under full training load, both cards can draw 700W combined, pushing total system draw to 850W+.

What happens if one GPU runs out of VRAM during model loading?

Ollama and llama.cpp automatically spill remaining layers to CPU RAM when GPU VRAM is exhausted. Layers on CPU run dramatically slower (often 10x or more), creating a significant bottleneck. To avoid this, choose a model quantization that fits entirely across your GPUs. For two 24GB GPUs (48GB total), a model of up to roughly 45GB at the chosen quantization will generally fit, provided the context window stays small enough for the KV cache to fit in the remaining VRAM.